
feat: add codeflash compare CLI command #1922

Merged
KRRT7 merged 13 commits into main from feat/compare-command
Mar 28, 2026

@KRRT7
Collaborator

@KRRT7 KRRT7 commented Mar 27, 2026

Summary

Adds codeflash compare <base_ref> <head_ref> — a first-class CLI command that compares benchmark performance between two git refs.

Flow:

  1. Auto-detects changed functions from git diff (line-level overlap, not just file-level)
  2. Creates isolated git worktrees for each ref
  3. Instruments target functions with @codeflash_trace
  4. Runs benchmarks via trace_benchmarks_pytest
  5. Renders a Rich side-by-side comparison table (total time + per-function breakdown + speedup)
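
Step 1's line-level overlap check can be sketched as follows. This is a minimal illustration using the standard-library `ast` module, not the PR's actual code; the function name `find_changed_toplevel_functions` and its signature are assumptions made for the example.

```python
from __future__ import annotations

import ast


def find_changed_toplevel_functions(source: str, changed_lines: set[int]) -> list[str]:
    """Return names of top-level functions whose line span overlaps changed_lines."""
    hits = []
    # iter_child_nodes on the module yields only top-level statements,
    # so methods nested inside classes are never considered.
    for node in ast.iter_child_nodes(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        span = range(node.lineno, node.end_lineno + 1)
        if any(line in span for line in changed_lines):
            hits.append(node.name)
    return hits


SOURCE = '''\
def a():
    return 1


def b():
    x = 2
    return x


class C:
    def m(self):
        return 3
'''

# Line 6 falls inside b(); line 12 falls inside the method C.m, which is skipped.
changed = find_changed_toplevel_functions(SOURCE, {6, 12})
```

A file-level check would have flagged `a` as well; comparing changed line numbers against each function's span keeps the instrumented set small.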

Usage:

# Compare two branches
codeflash compare main feat/optimization

# Compare with a PR number (resolves to branch via gh CLI)
codeflash compare main --pr 1921

# Explicit function list (override auto-detect)
codeflash compare main feat/opt --functions "path/to/file.py::func1,func2"

# Export SVG
codeflash compare main feat/opt --svg /tmp/comparison.svg

Smoke test — ran against main vs origin/cf-cpu-context-extraction on the codeflash repo itself. Auto-detected 5 modified functions across 3 files, produced a full comparison showing 3.04x speedup on the context extraction benchmark.

Known Limitations

  • Only top-level functions are instrumented in auto-detect mode. Methods inside classes (i.e. functions with parents) are skipped. This is intentional — CST visitor methods like visit_Name can be called thousands of times per benchmark, and @codeflash_trace pickles self on every call. When self holds a full CST tree, the overhead is catastrophic (benchmark that normally takes ~23s times out at 120s+). The --functions flag can be used to explicitly target specific methods if needed.
  • Requires benchmarks-root to be configured in [tool.codeflash] in pyproject.toml.
  • Requires unidiff package for diff parsing (already a dependency).

New Files

  • codeflash/benchmarking/compare.py — core orchestration logic
  • codeflash/cli_cmds/cmd_compare.py — CLI handler (config loading, arg parsing, PR resolution)

Modified Files

  • codeflash/cli_cmds/cli.py — added compare subparser
  • codeflash/main.py — added dispatch for compare command

Test plan

  • Smoke tested: codeflash compare main origin/cf-cpu-context-extraction --timeout 120 completes successfully with correct output
  • Test with --pr flag against a real PR
  • Test with --functions explicit targeting
  • Test with --svg export
  • Test error paths: missing benchmarks-root, non-existent refs, no changed functions

…comparison

Adds `codeflash compare <base_ref> <head_ref>` — compares benchmark
performance between two git refs by auto-detecting changed functions
from the diff, creating worktrees, instrumenting with @codeflash_trace,
running benchmarks, and rendering a Rich side-by-side comparison table.

Supports --pr to resolve a PR number to its branch, --functions for
explicit function targeting, --svg for SVG export, and --timeout.
@claude
Contributor

claude bot commented Mar 27, 2026

Claude finished @KRRT7's task in 3m 35s


PR Review Summary

  • Triage PR size — LARGE (new files: compare.py 609 lines, cmd_compare.py 113 lines)
  • Run lint/typecheck checks
  • Resolve stale review threads
  • Code review
  • Duplicate detection
  • Test coverage analysis
  • No codeflash-ai[bot] optimization PRs open

Prek Checks

uv run prek run --from-ref origin/main — passed (ruff check, ruff format: clean)
uv run mypy — no issues in 4 changed source files


Code Review

🐛 Bug: --markdown flag is documented but not wired up

The PR description and the live demo comment both use --markdown:

codeflash compare main feat/opt --markdown

But compare_parser in cli.py has no --markdown argument, and cmd_compare.py never calls result.format_markdown(). The format_markdown() method exists and is well implemented, but it is unreachable via the CLI.

🐛 Bug: _parse_functions_arg creates incomplete FunctionToOptimize objects

cmd_compare.py:111: FunctionToOptimize is constructed without starting_line / ending_line:

FunctionToOptimize(function_name=name, file_path=file_path, parents=[])

When --functions is used, the instrumentation path (instrument_codeflash_trace_decorator) receives objects with None line numbers. The auto-detect path (via _find_changed_toplevel_functions) correctly populates these from AST. If instrumentation relies on line numbers this will silently fail for explicit --functions targets.
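
One possible shape for a fix is to resolve line numbers from the AST for explicitly named functions as well. The sketch below is illustrative only: `Target` is a hypothetical stand-in for `FunctionToOptimize`, the helper names are invented for the example, and the `file.py::func1,func2` syntax is taken from the usage example in the PR description.

```python
from __future__ import annotations

import ast
from dataclasses import dataclass
from typing import Optional


@dataclass
class Target:
    # Hypothetical stand-in for FunctionToOptimize; only the fields
    # named in the review are modeled here.
    function_name: str
    file_path: str
    starting_line: Optional[int] = None
    ending_line: Optional[int] = None


def parse_functions_arg(spec: str) -> list[tuple[str, str]]:
    """Split 'path/to/file.py::func1,func2' into (file_path, function_name) pairs."""
    file_path, sep, names = spec.partition("::")
    if not sep or not file_path:
        raise ValueError(f"expected 'file.py::func1,func2', got {spec!r}")
    return [(file_path, name.strip()) for name in names.split(",") if name.strip()]


def with_line_numbers(file_path: str, name: str, source: str) -> Target:
    """Fill starting_line / ending_line from the AST so they are never None."""
    for node in ast.iter_child_nodes(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return Target(name, file_path, node.lineno, node.end_lineno)
    raise LookupError(f"{name} is not a top-level function in {file_path}")


pairs = parse_functions_arg("path/to/file.py::func1,func2")
target = with_line_numbers("f.py", "g", "x = 1\n\ndef g():\n    return x\n")
```

Raising on an unknown name turns the silent failure described above into an explicit error at argument-parsing time.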

⚠️ Design: Path safety in _run_benchmark_on_worktree

compare.py:443-444:

wt_benchmarks = worktree_dir / benchmarks_root.relative_to(repo_root)
wt_tests     = worktree_dir / tests_root.relative_to(repo_root)

If benchmarks_root or tests_root is configured as an absolute path that falls outside repo_root, .relative_to() raises ValueError with no useful error message. A guard + logger.error before these lines would surface the misconfiguration clearly.
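
A minimal sketch of such a guard, assuming a hypothetical helper named `path_in_worktree` (the PR's real code computes these paths inline):

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def path_in_worktree(configured: Path, repo_root: Path, worktree_dir: Path, label: str) -> Path:
    """Map a configured path under repo_root onto the same relative location in worktree_dir."""
    try:
        relative = configured.relative_to(repo_root)
    except ValueError:
        # relative_to raises ValueError when configured is not under repo_root;
        # log the misconfiguration before re-raising.
        logger.error("%s (%s) must be inside the repository root %s", label, configured, repo_root)
        raise
    return worktree_dir / relative


wt_benchmarks = path_in_worktree(
    Path("/repo/tests/benchmarks"), Path("/repo"), Path("/tmp/wt"), "benchmarks-root"
)
```

`Path.relative_to` is purely lexical, so the check does not touch the filesystem.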

ℹ️ Note: _md_delta sign inconsistency (cosmetic)

compare.py:583-586:

if delta_ms < 0:
    return f"{delta_ms:+,.0f}ms ({pct:+.0f}%)"   # uses format spec for sign
return f"+{delta_ms:,.0f}ms ({pct:+.0f}%)"        # manually prepends "+"

The positive (regression) path manually adds + while the negative path uses the :+ format specifier. These produce the same output but the inconsistency makes the code slightly harder to reason about.
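
Because the `+` sign flag emits a sign for both positive and negative values, the two branches could collapse into one. A sketch, using the name `md_delta` for illustration:

```python
def md_delta(delta_ms: float, pct: float) -> str:
    """Format a signed delta; the '+' flag prints the sign in both directions."""
    return f"{delta_ms:+,.0f}ms ({pct:+.0f}%)"


improvement = md_delta(-6681, -46)
regression = md_delta(4, 9)
```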


Duplicate Detection

MEDIUM confidence: _discover_changed_functions in compare.py:304-354 duplicates the core diff-parsing logic from get_git_diff in code_utils/git_utils.py:21-72. Both functions:

  • Open a git.Repo, call repo.git.diff(...) with the same ignore_blank_lines/ignore_space_at_eol flags
  • Parse with PatchSet(StringIO(...))
  • Build a file_path → [line_nos] map
  • Use the same deletion-only fallback (hunk.target_start)

The difference is that _discover_changed_functions takes two explicit refs instead of HEAD, and then additionally does AST function discovery. The diff-parsing portion (roughly 25 lines) could be extracted into a shared helper in git_utils.py.


Test Coverage

No tests were added for the new compare command. The following testable units have zero coverage:

  • _discover_changed_functions — pure function, easy to unit-test with a real or in-memory git repo
  • _find_changed_toplevel_functions — pure function taking a file path + set of line numbers, straightforward to test
  • _parse_functions_arg — pure function, easy to unit-test
  • CompareResult.format_markdown() — pure method, easy to test against snapshot strings
  • _fmt_ms, _md_bar, _md_speedup, _md_delta, _pct_bar — pure formatting functions, trivial to test

Per CLAUDE.md: "Everything that can be tested should have tests." These utility functions in particular are entirely self-contained and have clear inputs/outputs.


- Move BenchmarkKey to TYPE_CHECKING block (TC001)
- Move argparse.Namespace to TYPE_CHECKING block (TC003)
- Fix negated equality check (SIM201)
- Fix loop variable capture in nested function (B023)

Co-authored-by: Kevin Turcios <undefined@users.noreply.github.com>
KRRT7 added 4 commits March 27, 2026 17:56
Replace find_all_functions_in_file (libcst + metadata resolution) with
lightweight ast.parse + iter_child_nodes for top-level function discovery.
Since compare only instruments top-level functions, the full CST parse
with PositionProvider/ParentNodeProvider was unnecessary overhead.

Also fix pre-existing lint issues: move BenchmarkKey to TYPE_CHECKING,
use != instead of not ==, bind loop var in sort_key closure.

The hook was silently swallowing all prek failures with || true,
including actual lint errors. Now it re-runs prek after the first
pass auto-fixes formatting, so only real lint errors block.

Use cairosvg to convert Rich SVG output to PNG at 144 DPI.
Renames the CLI flag from --svg to --png across all touchpoints.
@KRRT7
Collaborator Author

KRRT7 commented Mar 27, 2026

Benchmark Comparison: 2c4989c6 vs f169e862

Comparing performance before and after the context extraction optimizations (PRs #1920/#1921).

tests.benchmarks.test_benchmark_code_extract_code_context::test_benchmark_extract

| Function | 2c4989c6458c (ms) | f169e862ea6b (ms) | Delta | Speedup |
| --- | ---: | ---: | --- | --- |
| get_code_optimization_context | 14,664 | 7,983 | -6,681ms (-46%) | 🟢 1.84x |
| extract_all_contexts_from_files | 9,669 | 3,248 | -6,421ms (-66%) | 🟢 2.98x |
| collect_top_level_defs_with_dependencies | 3,518 | 453 | -3,065ms (-87%) | 🟢 7.77x |
| get_function_sources_from_jedi | 2,589 | 1,950 | -640ms (-25%) | 🟢 1.33x |
| add_needed_imports_from_module | 2,284 | 661 | -1,623ms (-71%) | 🟢 3.46x |
| gather_source_imports | 845 | 145 | -699ms (-83%) | 🟢 5.81x |
| collect_existing_class_names | 0.46 | 0.26 | -0ms (-45%) | 🟢 1.80x |
| TOTAL | 25,535 | 11,592 | -13,942ms (-55%) | 🟢 2.20x |

tests.benchmarks.test_benchmark_discover_unit_tests::test_benchmark_code_to_optimize_test_discovery

| Ref | Time (ms) | Delta | Speedup |
| --- | ---: | --- | --- |
| 2c4989c6458c | 2,906 | - | - |
| f169e862ea6b | 2,269 | -637ms (-22%) | 🟢 1.28x |

tests.benchmarks.test_benchmark_discover_unit_tests::test_benchmark_codeflash_test_discovery

| Ref | Time (ms) | Delta | Speedup |
| --- | ---: | --- | --- |
| 2c4989c6458c | 799 | - | - |
| f169e862ea6b | 694 | -105ms (-13%) | 🟢 1.15x |

tests.benchmarks.test_benchmark_merge_test_results::test_benchmark_merge_test_results

| Ref | Time (ms) | Delta | Speedup |
| --- | ---: | --- | --- |
| 2c4989c6458c | 98.3 | - | - |
| f169e862ea6b | 81.0 | -17ms (-18%) | 🟢 1.21x |

Generated by codeflash compare
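
The Delta and Speedup columns follow from the two timings. The sketch below shows the arithmetic that reproduces the reported figures for the first row; the helper name `compare_timings` is made up for illustration and is not taken from compare.py.

```python
from __future__ import annotations


def compare_timings(base_ms: float, head_ms: float) -> tuple[float, float, float]:
    """Return (delta_ms, pct_change, speedup) for a base/head timing pair."""
    delta_ms = head_ms - base_ms          # negative means head is faster
    pct_change = delta_ms / base_ms * 100  # change relative to base
    speedup = base_ms / head_ms            # above 1.0 means head is faster
    return delta_ms, pct_change, speedup


# get_code_optimization_context: 14,664 ms on base, 7,983 ms on head
delta_ms, pct_change, speedup = compare_timings(14664, 7983)
```

Rounded for display, these give -6,681ms, -46%, and 1.84x, matching the row above.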

Replace PNG export with GitHub-flavored markdown tables. Add unicode
progress bars for improvement and share-of-time columns, remove image
dependencies (pillow, cairosvg, svglib, reportlab).
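
A ten-cell unicode bar of the kind this commit describes could be rendered as below. This is a sketch only: the name `md_bar` matches a helper mentioned in the review, but the signature and the round-to-nearest-cell rule are assumptions inferred from the tables in this thread, not from the source.

```python
def md_bar(pct: float, width: int = 10) -> str:
    """Render pct (0-100) as a fixed-width bar of filled and empty cells."""
    clamped = max(0.0, min(100.0, pct))
    # Round to the nearest whole cell (half rounds up).
    filled = int(clamped / 100 * width + 0.5)
    return "█" * filled + "░" * (width - filled)


bar_high = md_bar(84.2)
bar_low = md_bar(4.9)
```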
@KRRT7
Collaborator Author

KRRT7 commented Mar 27, 2026

Benchmark: 2c4989c6458c vs f169e862ea6b

test_benchmark_extract

| Branch | Time (ms) | vs base | Speedup |
| --- | ---: | --- | --- |
| 2c4989c6458c (base) | 14,860 | - | - |
| f169e862ea6b (head) | 7,566 | -7,294ms (-49%) | 🟢 1.96x |

| Function | base (ms) | head (ms) | Improvement | Speedup |
| --- | ---: | ---: | --- | --- |
| get_code_optimization_context | 12,513 | 6,826 | █████░░░░░ +45% | 🟢 1.83x |
| extract_all_contexts_from_files | 8,256 | 2,748 | ███████░░░ +67% | 🟢 3.00x |
| collect_top_level_defs_with_dependencies | 2,802 | 458 | ████████░░ +84% | 🟢 6.12x |
| get_function_sources_from_jedi | 2,174 | 1,787 | ██░░░░░░░░ +18% | 🟢 1.22x |
| add_needed_imports_from_module | 2,086 | 497 | ████████░░ +76% | 🟢 4.20x |
| gather_source_imports | 725 | 120 | ████████░░ +83% | 🟢 6.03x |
| collect_existing_class_names | 0.25 | 0.19 | ██░░░░░░░░ +24% | 🟢 1.32x |
| TOTAL | 14,860 | 7,566 | █████░░░░░ +49% | 🟢 1.96x |

Share of Benchmark Time

| Function | base | head |
| --- | --- | --- |
| get_code_optimization_context | ████████░░ 84.2% | █████████░ 90.2% |
| extract_all_contexts_from_files | ██████░░░░ 55.6% | ████░░░░░░ 36.3% |
| collect_top_level_defs_with_dependencies | ██░░░░░░░░ 18.9% | █░░░░░░░░░ 6.0% |
| get_function_sources_from_jedi | █░░░░░░░░░ 14.6% | ██░░░░░░░░ 23.6% |
| add_needed_imports_from_module | █░░░░░░░░░ 14.0% | █░░░░░░░░░ 6.6% |
| gather_source_imports | ░░░░░░░░░░ 4.9% | ░░░░░░░░░░ 1.6% |
| collect_existing_class_names | ░░░░░░░░░░ 0.0% | ░░░░░░░░░░ 0.0% |

test_benchmark_code_to_optimize_test_discovery

| Branch | Time (ms) | vs base | Speedup |
| --- | ---: | --- | --- |
| 2c4989c6458c (base) | 2,406 | - | - |
| f169e862ea6b (head) | 1,633 | -773ms (-32%) | 🟢 1.47x |

test_benchmark_codeflash_test_discovery

| Branch | Time (ms) | vs base | Speedup |
| --- | ---: | --- | --- |
| 2c4989c6458c (base) | 613 | - | - |
| f169e862ea6b (head) | 570 | -43ms (-7%) | 🟢 1.08x |

test_benchmark_merge_test_results

| Branch | Time (ms) | vs base | Speedup |
| --- | ---: | --- | --- |
| 2c4989c6458c (base) | 41.9 | - | - |
| f169e862ea6b (head) | 45.7 | +4ms (+9%) | 🔴 0.92x |

Generated by codeflash optimization agent — you can reproduce this by running:

codeflash compare 2c4989c6458ccfb0100e5ad28e0163a7bc3ee901 f169e862ea6b15f433150a77545eefd9099f6ccb --markdown

KRRT7 added 2 commits March 27, 2026 19:03
Replace plain logger output with a Live panel showing a function tree
and step-by-step progress. Handle KeyboardInterrupt gracefully with
cleanup. Remove --markdown CLI flag (format_markdown remains for
programmatic use).
@KRRT7 KRRT7 force-pushed the feat/compare-command branch from 9ff4e7b to f91f3e0 March 28, 2026 00:10
github-actions bot and others added 2 commits March 28, 2026 00:13
@codeflash-ai
Contributor

codeflash-ai bot commented Mar 28, 2026

⚡️ Codeflash found optimizations for this PR

📄 33% (0.33x) speedup for _render_comparison in codeflash/benchmarking/compare.py

⏱️ Runtime: 35.6 milliseconds → 26.9 milliseconds (best of 164 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch feat/compare-command).


@KRRT7 KRRT7 merged commit ab7e1f4 into main Mar 28, 2026
24 of 27 checks passed
@KRRT7 KRRT7 deleted the feat/compare-command branch March 28, 2026 07:51