feat: add `codeflash compare` CLI command by KRRT7 · Pull Request #1922 · codeflash-ai/codeflash

KRRT7 · 2026-03-27T22:28:03Z

Summary

Adds codeflash compare <base_ref> <head_ref> — a first-class CLI command that compares benchmark performance between two git refs.

Flow:

1. Auto-detects changed functions from git diff (line-level overlap, not just file-level)
2. Creates isolated git worktrees for each ref
3. Instruments target functions with @codeflash_trace
4. Runs benchmarks via trace_benchmarks_pytest
5. Renders a Rich side-by-side comparison table (total time + per-function breakdown + speedup)
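
The line-level overlap check in the first step can be illustrated with a small sketch. This is not the code from compare.py; `find_changed_toplevel_functions` and `SOURCE` are hypothetical names, and the sketch assumes the changed line numbers of the head version are already known.

```python
import ast


def find_changed_toplevel_functions(source: str, changed_lines: set[int]) -> list[str]:
    """Return names of top-level functions whose line span overlaps changed_lines."""
    hits = []
    # iter_child_nodes on the module only yields top-level statements,
    # so methods nested inside classes are never visited.
    for node in ast.iter_child_nodes(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            span = range(node.lineno, node.end_lineno + 1)
            if any(line in changed_lines for line in span):
                hits.append(node.name)
    return hits


# Hypothetical file contents: line 5 is inside edited(), line 9 inside a method.
SOURCE = (
    "def untouched():\n"
    "    return 1\n"
    "\n"
    "def edited():\n"
    "    x = 2\n"
    "    return x\n"
    "\n"
    "class Visitor:\n"
    "    def visit_Name(self):\n"
    "        return None\n"
)

print(find_changed_toplevel_functions(SOURCE, {5, 9}))
```

Only `edited` is reported here: the change on line 9 sits inside a class method, which this sketch skips in the same way the Known Limitations section describes.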

Usage:

# Compare two branches
codeflash compare main feat/optimization

# Compare with a PR number (resolves to branch via gh CLI)
codeflash compare main --pr 1921

# Explicit function list (override auto-detect)
codeflash compare main feat/opt --functions "path/to/file.py::func1,func2"

# Export SVG
codeflash compare main feat/opt --svg /tmp/comparison.svg

Smoke test — ran against main vs origin/cf-cpu-context-extraction on the codeflash repo itself. Auto-detected 5 modified functions across 3 files, produced a full comparison showing 3.04x speedup on the context extraction benchmark.

Known Limitations

Only top-level functions are instrumented in auto-detect mode. Methods inside classes (i.e. functions with parents) are skipped. This is intentional — CST visitor methods like visit_Name can be called thousands of times per benchmark, and @codeflash_trace pickles self on every call. When self holds a full CST tree, the overhead is catastrophic (benchmark that normally takes ~23s times out at 120s+). The --functions flag can be used to explicitly target specific methods if needed.
Requires benchmarks-root to be configured in [tool.codeflash] in pyproject.toml.
Requires unidiff package for diff parsing (already a dependency).

New Files

codeflash/benchmarking/compare.py — core orchestration logic
codeflash/cli_cmds/cmd_compare.py — CLI handler (config loading, arg parsing, PR resolution)

Modified Files

codeflash/cli_cmds/cli.py — added compare subparser
codeflash/main.py — added dispatch for compare command

Test plan

Smoke tested: codeflash compare main origin/cf-cpu-context-extraction --timeout 120 completes successfully with correct output
Test with --pr flag against a real PR
Test with --functions explicit targeting
Test with --svg export
Test error paths: missing benchmarks-root, non-existent refs, no changed functions

…comparison Adds `codeflash compare <base_ref> <head_ref>` — compares benchmark performance between two git refs by auto-detecting changed functions from the diff, creating worktrees, instrumenting with @codeflash_trace, running benchmarks, and rendering a Rich side-by-side comparison table. Supports --pr to resolve a PR number to its branch, --functions for explicit function targeting, --svg for SVG export, and --timeout.

claude · 2026-03-27T22:28:35Z

Claude finished @KRRT7's task in 3m 35s

PR Review Summary

Triage PR size — LARGE (new files: compare.py 609 lines, cmd_compare.py 113 lines)
Run lint/typecheck checks
Resolve stale review threads
Code review
Duplicate detection
Test coverage analysis
No codeflash-ai[bot] optimization PRs open

Prek Checks

✅ uv run prek run --from-ref origin/main — passed (ruff check, ruff format: clean)
✅ uv run mypy — no issues in 4 changed source files

Code Review

🐛 Bug: `--markdown` flag is documented but not wired up

The PR description and the live demo comment both use --markdown:

codeflash compare main feat/opt --markdown

But compare_parser in cli.py has no --markdown argument, and cmd_compare.py never calls result.format_markdown(). The format_markdown() method exists and is well-implemented but is completely unreachable via the CLI.
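
One way the flag could be wired up is sketched below, assuming a plain argparse subparser. `build_parser` and `handle_compare` are hypothetical names; only `format_markdown()` comes from the review text, and the real parser in cli.py takes more arguments than shown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical, trimmed-down version of the compare subparser."""
    parser = argparse.ArgumentParser(prog="codeflash")
    subparsers = parser.add_subparsers(dest="command")
    compare = subparsers.add_parser("compare", help="Compare benchmarks between two git refs")
    compare.add_argument("base_ref")
    compare.add_argument("head_ref")
    compare.add_argument("--markdown", action="store_true", help="Emit markdown output")
    return parser


def handle_compare(args: argparse.Namespace, result):
    """Return markdown when requested; `result` is assumed to expose format_markdown()."""
    if args.markdown:
        return result.format_markdown()
    return None  # a real handler would render the Rich table here
```

With this in place, `build_parser().parse_args(["compare", "main", "feat/opt", "--markdown"])` yields a namespace whose `markdown` attribute is `True`, which the handler can branch on.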

🐛 Bug: `_parse_functions_arg` creates incomplete `FunctionToOptimize` objects

cmd_compare.py:111 — FunctionToOptimize is constructed without starting_line / ending_line:

FunctionToOptimize(function_name=name, file_path=file_path, parents=[])

When --functions is used, the instrumentation path (instrument_codeflash_trace_decorator) receives objects with None line numbers. The auto-detect path (via _find_changed_toplevel_functions) correctly populates these from AST. If instrumentation relies on line numbers this will silently fail for explicit --functions targets.
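
A possible fix is to resolve line numbers from the AST while parsing the `--functions` value. The sketch below is an assumption-laden illustration: `FunctionSpan` is a hypothetical stand-in for `FunctionToOptimize` (whose full constructor is not shown in this review), and `parse_functions_arg` / `resolve_spans` are illustrative names, not the PR's implementation.

```python
import ast
from dataclasses import dataclass


@dataclass
class FunctionSpan:
    """Hypothetical stand-in for FunctionToOptimize, with line numbers populated."""
    function_name: str
    file_path: str
    starting_line: int
    ending_line: int


def parse_functions_arg(spec: str) -> tuple[str, list[str]]:
    """Split 'path/to/file.py::func1,func2' into the path and the function names."""
    file_path, _, names = spec.partition("::")
    return file_path, [name.strip() for name in names.split(",") if name.strip()]


def resolve_spans(file_path: str, names: list[str], source: str) -> list[FunctionSpan]:
    """Look up each requested top-level function and record its line span."""
    wanted = set(names)
    spans = []
    for node in ast.iter_child_nodes(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name in wanted:
            spans.append(FunctionSpan(node.name, file_path, node.lineno, node.end_lineno))
    return spans
```

Names that do not resolve to a top-level function are simply absent from the result, so a caller can compare the returned names against the requested ones and report the difference instead of failing silently.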

⚠️ Design: Path safety in `_run_benchmark_on_worktree`

compare.py:443-444:

wt_benchmarks = worktree_dir / benchmarks_root.relative_to(repo_root)
wt_tests     = worktree_dir / tests_root.relative_to(repo_root)

If benchmarks_root or tests_root is configured as an absolute path that falls outside repo_root, .relative_to() raises ValueError with no useful error message. A guard + logger.error before these lines would surface the misconfiguration clearly.
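
The suggested guard could look roughly like the following. `map_into_worktree` is a hypothetical helper name and the logger name is arbitrary; the only behavior relied on is that `Path.relative_to()` raises `ValueError` when the path is not under the given root.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger("compare_sketch")


def map_into_worktree(path: Path, repo_root: Path, worktree_dir: Path) -> Optional[Path]:
    """Translate a configured path into the worktree, or return None if it is outside the repo."""
    try:
        relative = path.relative_to(repo_root)
    except ValueError:
        # Surface the misconfiguration instead of letting a bare ValueError escape.
        logger.error("%s is not inside the repository root %s", path, repo_root)
        return None
    return worktree_dir / relative
```

Returning `None` lets the caller abort the comparison with a clear message; raising a custom exception would work equally well.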

ℹ️ Note: `_md_delta` sign inconsistency (cosmetic)

compare.py:583-586:

if delta_ms < 0:
    return f"{delta_ms:+,.0f}ms ({pct:+.0f}%)"   # uses format spec for sign
return f"+{delta_ms:,.0f}ms ({pct:+.0f}%)"        # manually prepends "+"

The positive (regression) path manually adds + while the negative path uses the :+ format specifier. These produce the same output but the inconsistency makes the code slightly harder to reason about.
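
A single-branch version is sketched below under the assumption that the function only needs to format a delta and a percentage; `md_delta` mirrors the name of `_md_delta` but is not the PR's code.

```python
def md_delta(delta_ms: float, pct: float) -> str:
    """Format a time delta and percentage, letting the '+' format flag handle both signs."""
    return f"{delta_ms:+,.0f}ms ({pct:+.0f}%)"


print(md_delta(-6681, -46))
print(md_delta(4, 9))
```

The `+` flag emits an explicit sign for positive and negative values alike, so the separate regression branch is no longer needed.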

Duplicate Detection

MEDIUM confidence: _discover_changed_functions in compare.py:304-354 duplicates the core diff-parsing logic from get_git_diff in code_utils/git_utils.py:21-72. Both functions:

Open a git.Repo, call repo.git.diff(...) with the same ignore_blank_lines/ignore_space_at_eol flags
Parse with PatchSet(StringIO(...))
Build a file_path → [line_nos] map
Use the same deletion-only fallback (hunk.target_start)

The difference is that _discover_changed_functions takes two explicit refs instead of HEAD, and then additionally does AST function discovery. The diff-parsing portion (roughly 25 lines) could be extracted into a shared helper in git_utils.py.
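
To make the shape of such a shared helper concrete, here is a dependency-free sketch. It deliberately swaps in a small regex-based hunk parser in place of unidiff's PatchSet, so it is an approximation of the described logic rather than the code in git_utils.py; `changed_lines_by_file` and `SAMPLE_DIFF` are hypothetical names.

```python
import re

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def changed_lines_by_file(diff_text: str) -> dict[str, list[int]]:
    """Map each changed file to the added line numbers in its new version."""
    changed: dict[str, list[int]] = {}
    current = None
    lines = iter(diff_text.splitlines())
    for line in lines:
        if line.startswith("+++ "):
            path = line[4:].strip()
            current = None if path == "/dev/null" else path.removeprefix("b/")
            continue
        match = HUNK_RE.match(line)
        if match is None or current is None:
            continue
        old_left = int(match.group(2) or 1)   # remaining lines on the old side
        target_start = int(match.group(3))
        new_left = int(match.group(4) or 1)   # remaining lines on the new side
        target_line, added = target_start, []
        while old_left > 0 or new_left > 0:
            body = next(lines, None)
            if body is None:
                break
            if body.startswith("+"):
                added.append(target_line)
                target_line += 1
                new_left -= 1
            elif body.startswith("-"):
                old_left -= 1
            elif body.startswith("\\"):
                continue  # "\ No newline at end of file" marker
            else:  # context line, present on both sides
                target_line += 1
                old_left -= 1
                new_left -= 1
        # Deletion-only hunks fall back to the hunk's target start line.
        changed.setdefault(current, []).extend(added or [target_start])
    return changed


SAMPLE_DIFF = "\n".join([
    "diff --git a/pkg/mod.py b/pkg/mod.py",
    "--- a/pkg/mod.py",
    "+++ b/pkg/mod.py",
    "@@ -1,3 +1,4 @@",
    " def f():",
    "-    return 1",
    "+    x = 1",
    "+    return x",
    " ",
    "@@ -10,2 +11,1 @@",
    " def g():",
    "-    pass",
])
print(changed_lines_by_file(SAMPLE_DIFF))
```

Both callers could then share this mapping: one passes a diff against HEAD, the other a diff between two explicit refs, and only the AST function discovery stays specific to the compare command.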

Test Coverage

No tests were added for the new compare command. The following testable units have zero coverage:

_discover_changed_functions — pure function, easy to unit-test with a real or in-memory git repo
_find_changed_toplevel_functions — pure function taking a file path + set of line numbers, straightforward to test
_parse_functions_arg — pure function, easy to unit-test
CompareResult.format_markdown() — pure method, easy to test against snapshot strings
_fmt_ms, _md_bar, _md_speedup, _md_delta, _pct_bar — pure formatting functions, trivial to test

Per CLAUDE.md: "Everything that can be tested should have tests." These utility functions in particular are entirely self-contained and have clear inputs/outputs.

- Move BenchmarkKey to TYPE_CHECKING block (TC001)
- Move argparse.Namespace to TYPE_CHECKING block (TC003)
- Fix negated equality check (SIM201)
- Fix loop variable capture in nested function (B023)

Co-authored-by: Kevin Turcios <undefined@users.noreply.github.com>

codeflash/benchmarking/compare.py

Replace find_all_functions_in_file (libcst + metadata resolution) with lightweight ast.parse + iter_child_nodes for top-level function discovery. Since compare only instruments top-level functions, the full CST parse with PositionProvider/ParentNodeProvider was unnecessary overhead. Also fix pre-existing lint issues: move BenchmarkKey to TYPE_CHECKING, use != instead of not ==, bind loop var in sort_key closure.
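
The loop-variable capture issue (B023) mentioned in this commit is easy to reproduce in isolation. The sketch below is a generic illustration, not the `sort_key` closure from compare.py; both function names are hypothetical.

```python
def make_sort_keys_buggy(orders):
    """Each lambda looks up `order` when called, so all of them see the last value."""
    keys = []
    for order in orders:
        keys.append(lambda name: order.index(name))
    return keys


def make_sort_keys_fixed(orders):
    """A default argument binds the current value of `order` at definition time."""
    keys = []
    for order in orders:
        def sort_key(name, order=order):
            return order.index(name)
        keys.append(sort_key)
    return keys


orders = [["a", "b"], ["b", "a"]]
print([key("a") for key in make_sort_keys_buggy(orders)])
print([key("a") for key in make_sort_keys_fixed(orders)])
```

The buggy variant ranks "a" against the final list in both closures, while the fixed variant keeps one ordering per closure.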

KRRT7 · 2026-03-27T23:19:02Z

Benchmark Comparison: `2c4989c6` vs `f169e862`

Comparing performance before and after the context extraction optimizations (PRs #1920/#1921).

tests.benchmarks.test_benchmark_code_extract_code_context::test_benchmark_extract

Function	`2c4989c6458c` (ms)	`f169e862ea6b` (ms)	Delta	Speedup
`get_code_optimization_context`	14,664	7,983	-6,681ms (-46%)	🟢 1.84x
`extract_all_contexts_from_files`	9,669	3,248	-6,421ms (-66%)	🟢 2.98x
`collect_top_level_defs_with_dependencies`	3,518	453	-3,065ms (-87%)	🟢 7.77x
`get_function_sources_from_jedi`	2,589	1,950	-640ms (-25%)	🟢 1.33x
`add_needed_imports_from_module`	2,284	661	-1,623ms (-71%)	🟢 3.46x
`gather_source_imports`	845	145	-699ms (-83%)	🟢 5.81x
`collect_existing_class_names`	0.46	0.26	-0ms (-45%)	🟢 1.80x
TOTAL	25,535	11,592	-13,942ms (-55%)	🟢 2.20x

tests.benchmarks.test_benchmark_discover_unit_tests::test_benchmark_code_to_optimize_test_discovery

Ref	Time (ms)	Delta	Speedup
`2c4989c6458c`	2,906	-	-
`f169e862ea6b`	2,269	-637ms (-22%)	🟢 1.28x

tests.benchmarks.test_benchmark_discover_unit_tests::test_benchmark_codeflash_test_discovery

Ref	Time (ms)	Delta	Speedup
`2c4989c6458c`	799	-	-
`f169e862ea6b`	694	-105ms (-13%)	🟢 1.15x

tests.benchmarks.test_benchmark_merge_test_results::test_benchmark_merge_test_results

Ref	Time (ms)	Delta	Speedup
`2c4989c6458c`	98.3	-	-
`f169e862ea6b`	81.0	-17ms (-18%)	🟢 1.21x

Generated by codeflash compare
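
The Delta and Speedup columns above follow from the two timings. As a worked example, the sketch below recomputes the row for test_benchmark_code_to_optimize_test_discovery; `summarize` is a hypothetical helper, and the tool itself presumably works from unrounded timings, so other rows may differ by a rounding step.

```python
def summarize(base_ms: float, head_ms: float) -> tuple[float, float, float]:
    """Return (delta in ms, percent change relative to base, speedup factor)."""
    delta_ms = head_ms - base_ms
    pct = delta_ms / base_ms * 100
    speedup = base_ms / head_ms
    return delta_ms, pct, speedup


delta, pct, speedup = summarize(2906, 2269)
print(f"{delta:+,.0f}ms ({pct:+.0f}%) {speedup:.2f}x")  # -637ms (-22%) 1.28x
```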

KRRT7 · 2026-03-27T23:32:35Z

Benchmark: `2c4989c6458c` vs `f169e862ea6b`

test_benchmark_extract

Branch	Time (ms)	vs base	Speedup
`2c4989c6458c` (base)	14,860	-	-
`f169e862ea6b` (head)	7,566	-7,294ms (-49%)	🟢 1.96x

Function	base (ms)	head (ms)	Improvement	Speedup
`get_code_optimization_context`	12,513	6,826	`█████░░░░░` +45%	🟢 1.83x
`extract_all_contexts_from_files`	8,256	2,748	`███████░░░` +67%	🟢 3.00x
`collect_top_level_defs_with_dependencies`	2,802	458	`████████░░` +84%	🟢 6.12x
`get_function_sources_from_jedi`	2,174	1,787	`██░░░░░░░░` +18%	🟢 1.22x
`add_needed_imports_from_module`	2,086	497	`████████░░` +76%	🟢 4.20x
`gather_source_imports`	725	120	`████████░░` +83%	🟢 6.03x
`collect_existing_class_names`	0.25	0.19	`██░░░░░░░░` +24%	🟢 1.32x
TOTAL	14,860	7,566	`█████░░░░░` +49%	🟢 1.96x

Share of Benchmark Time

Function	base	head
`get_code_optimization_context`	`████████░░` 84.2%	`█████████░` 90.2%
`extract_all_contexts_from_files`	`██████░░░░` 55.6%	`████░░░░░░` 36.3%
`collect_top_level_defs_with_dependencies`	`██░░░░░░░░` 18.9%	`█░░░░░░░░░` 6.0%
`get_function_sources_from_jedi`	`█░░░░░░░░░` 14.6%	`██░░░░░░░░` 23.6%
`add_needed_imports_from_module`	`█░░░░░░░░░` 14.0%	`█░░░░░░░░░` 6.6%
`gather_source_imports`	`░░░░░░░░░░` 4.9%	`░░░░░░░░░░` 1.6%
`collect_existing_class_names`	`░░░░░░░░░░` 0.0%	`░░░░░░░░░░` 0.0%
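
The ten-cell bars in these tables are consistent with rounding the percentage to the nearest tenth of the bar width. The sketch below reproduces that behavior under this assumption; the actual `_md_bar` and `_pct_bar` implementations are not shown in this thread, and `pct_bar` is a hypothetical name.

```python
def pct_bar(pct: float, width: int = 10) -> str:
    """Render a percentage as a fixed-width bar of filled and empty cells."""
    filled = max(0, min(width, round(pct / 100 * width)))
    return "█" * filled + "░" * (width - filled)


for share in (84.2, 36.3, 4.9):
    print(f"{pct_bar(share)} {share}%")
```

Clamping to the range 0 to `width` keeps the bar well-formed for negative improvements or shares above 100%.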

test_benchmark_code_to_optimize_test_discovery

Branch	Time (ms)	vs base	Speedup
`2c4989c6458c` (base)	2,406	-	-
`f169e862ea6b` (head)	1,633	-773ms (-32%)	🟢 1.47x

test_benchmark_codeflash_test_discovery

Branch	Time (ms)	vs base	Speedup
`2c4989c6458c` (base)	613	-	-
`f169e862ea6b` (head)	570	-43ms (-7%)	🟢 1.08x

test_benchmark_merge_test_results

Branch	Time (ms)	vs base	Speedup
`2c4989c6458c` (base)	41.9	-	-
`f169e862ea6b` (head)	45.7	+4ms (+9%)	🔴 0.92x

Generated by codeflash optimization agent — you can reproduce this by running:

codeflash compare 2c4989c6458ccfb0100e5ad28e0163a7bc3ee901 f169e862ea6b15f433150a77545eefd9099f6ccb --markdown

codeflash-ai · 2026-03-28T07:45:56Z

⚡️ Codeflash found optimizations for this PR

📄 33% (0.33x) speedup for `_render_comparison` in `codeflash/benchmarking/compare.py`

⏱️ Runtime : 35.6 milliseconds → 26.9 milliseconds (best of 164 runs)

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up function _render_comparison by 33% in PR #1922 (feat/compare-command) #1923

If you approve, it will be merged into this PR (branch feat/compare-command).

codeflash-ai bot reviewed Mar 27, 2026

codeflash/benchmarking/compare.py (resolved)

KRRT7 added 4 commits March 27, 2026 17:56

style: fix ruff format in cli.py compare parser args

68e633a

fix: make post-edit lint hook surface real errors

14d81ec

The hook was silently swallowing all prek failures with || true, including actual lint errors. Now it re-runs prek after the first pass auto-fixes formatting — only real lint errors block.

feat: replace --svg with --png output for compare command

7e9b447

Use cairosvg to convert Rich SVG output to PNG at 144 DPI. Renames the CLI flag from --svg to --png across all touchpoints.

feat: add markdown output with progress bars for compare command

fb8c542

Replace PNG export with GitHub-flavored markdown tables. Add unicode progress bars for improvement and share-of-time columns, remove image dependencies (pillow, cairosvg, svglib, reportlab).

KRRT7 added 2 commits March 27, 2026 19:03

feat: add Rich Live panel with tree display for compare progress

0b1e581

Replace plain logger output with a Live panel showing a function tree and step-by-step progress. Handle KeyboardInterrupt gracefully with cleanup. Remove --markdown CLI flag (format_markdown remains for programmatic use).

fix: add type annotations and mypy fixes for compare command

f91f3e0

KRRT7 force-pushed the feat/compare-command branch from 9ff4e7b to f91f3e0 March 28, 2026 00:10

github-actions bot and others added 2 commits March 28, 2026 00:13

fix: resolve mypy type errors in compare.py

0753329

Co-authored-by: Kevin Turcios <undefined@users.noreply.github.com>

style: auto-size and center result tables in compare output

5c2956c

codeflash-ai bot reviewed Mar 28, 2026

codeflash/benchmarking/compare.py (outdated, resolved)

codeflash-ai bot reviewed Mar 28, 2026

codeflash/benchmarking/compare.py (resolved)

KRRT7 added 2 commits March 28, 2026 02:22

fix: address PR review — resolve refs to SHAs, deduplicate worktree/project_root utils

e552ded

Update compare.py

7c38d17

codeflash-ai bot mentioned this pull request Mar 28, 2026

⚡️ Speed up function _render_comparison by 33% in PR #1922 (feat/compare-command) #1923

Closed

KRRT7 merged commit ab7e1f4 into main Mar 28, 2026
24 of 27 checks passed

KRRT7 deleted the feat/compare-command branch March 28, 2026 07:51

KRRT7 merged 13 commits into main from feat/compare-command