…comparison Adds `codeflash compare <base_ref> <head_ref>` — compares benchmark performance between two git refs by auto-detecting changed functions from the diff, creating worktrees, instrumenting with @codeflash_trace, running benchmarks, and rendering a Rich side-by-side comparison table. Supports --pr to resolve a PR number to its branch, --functions for explicit function targeting, --svg for SVG export, and --timeout.
Claude finished @KRRT7's task in 3m 35s.
PR Review Summary: Prek Checks ✅ · Code Review 🐛 Bug:
- Move BenchmarkKey to TYPE_CHECKING block (TC001)
- Move argparse.Namespace to TYPE_CHECKING block (TC003)
- Fix negated equality check (SIM201)
- Fix loop variable capture in nested function (B023)

Co-authored-by: Kevin Turcios <undefined@users.noreply.github.com>
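For context on the B023 item: a function defined inside a loop looks up the loop variable when it is called, not when it is defined. The snippet below is a generic illustration of that pitfall and of the default-argument fix. It is not the code from this PR, and every name in it is hypothetical.

```python
# Generic illustration of the B023 pattern; all names are hypothetical.

def make_getters_buggy(names):
    getters = []
    for name in names:
        # The closure reads `name` at call time, so every getter sees the last value.
        getters.append(lambda: name)
    return getters


def make_getters_fixed(names):
    getters = []
    for name in names:
        # Binding the loop variable as a default argument freezes its current value.
        getters.append(lambda name=name: name)
    return getters


buggy = [g() for g in make_getters_buggy(["a", "b", "c"])]
fixed = [g() for g in make_getters_fixed(["a", "b", "c"])]
print(buggy)  # ['c', 'c', 'c']
print(fixed)  # ['a', 'b', 'c']
```

The same default-argument binding applies to a sort key defined inside a loop, which is presumably what the fix in this commit does.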
Replace find_all_functions_in_file (libcst + metadata resolution) with lightweight ast.parse + iter_child_nodes for top-level function discovery. Since compare only instruments top-level functions, the full CST parse with PositionProvider/ParentNodeProvider was unnecessary overhead. Also fix pre-existing lint issues: move BenchmarkKey to TYPE_CHECKING, use != instead of not ==, bind loop var in sort_key closure.
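A minimal sketch of the `ast.parse` + `iter_child_nodes` approach described in this commit, using only the standard library. The function name `top_level_function_names` and the sample source are assumptions made for illustration; the actual helper in the PR may differ.

```python
import ast


def top_level_function_names(source: str) -> list[str]:
    """Return the names of functions defined directly at module level."""
    tree = ast.parse(source)
    return [
        node.name
        # iter_child_nodes yields only the direct children of the module,
        # so nested functions and class methods are never visited.
        for node in ast.iter_child_nodes(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]


example = '''
def helper():
    def nested():
        pass

class Visitor:
    def visit_Name(self, node):
        pass

async def fetch():
    pass
'''

print(top_level_function_names(example))  # ['helper', 'fetch']
```

Because only direct children of the module are inspected, no parent or position metadata is needed, which is the overhead the commit message says it avoids.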
The hook was silently swallowing all prek failures with `|| true`, including actual lint errors. It now re-runs prek after the first pass auto-fixes formatting, so only real lint errors block.
Use cairosvg to convert Rich SVG output to PNG at 144 DPI. Renames the CLI flag from --svg to --png across all touchpoints.
Benchmark Comparison:

| Function | `2c4989c6458c` (ms) | `f169e862ea6b` (ms) | Delta | Speedup |
|---|---|---|---|---|
| `get_code_optimization_context` | 14,664 | 7,983 | -6,681ms (-46%) | 🟢 1.84x |
| `extract_all_contexts_from_files` | 9,669 | 3,248 | -6,421ms (-66%) | 🟢 2.98x |
| `collect_top_level_defs_with_dependencies` | 3,518 | 453 | -3,065ms (-87%) | 🟢 7.77x |
| `get_function_sources_from_jedi` | 2,589 | 1,950 | -640ms (-25%) | 🟢 1.33x |
| `add_needed_imports_from_module` | 2,284 | 661 | -1,623ms (-71%) | 🟢 3.46x |
| `gather_source_imports` | 845 | 145 | -699ms (-83%) | 🟢 5.81x |
| `collect_existing_class_names` | 0.46 | 0.26 | -0ms (-45%) | 🟢 1.80x |
| TOTAL | 25,535 | 11,592 | -13,942ms (-55%) | 🟢 2.20x |

tests.benchmarks.test_benchmark_discover_unit_tests::test_benchmark_code_to_optimize_test_discovery

| Ref | Time (ms) | Delta | Speedup |
|---|---|---|---|
| `2c4989c6458c` | 2,906 | - | - |
| `f169e862ea6b` | 2,269 | -637ms (-22%) | 🟢 1.28x |

tests.benchmarks.test_benchmark_discover_unit_tests::test_benchmark_codeflash_test_discovery

| Ref | Time (ms) | Delta | Speedup |
|---|---|---|---|
| `2c4989c6458c` | 799 | - | - |
| `f169e862ea6b` | 694 | -105ms (-13%) | 🟢 1.15x |

tests.benchmarks.test_benchmark_merge_test_results::test_benchmark_merge_test_results

| Ref | Time (ms) | Delta | Speedup |
|---|---|---|---|
| `2c4989c6458c` | 98.3 | - | - |
| `f169e862ea6b` | 81.0 | -17ms (-18%) | 🟢 1.21x |

Generated by codeflash compare
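
The Delta and Speedup columns in these tables are consistent with simple arithmetic on the two timings: delta is head minus base, and speedup is base divided by head. The sketch below reproduces that arithmetic; the function name `format_comparison` and the rule for choosing the 🟢/🔴 marker are assumptions, not the tool's actual code. Because the table inputs are already rounded, recomputing from them can differ slightly from the printed values in some rows.

```python
def format_comparison(base_ms: float, head_ms: float) -> tuple[str, str]:
    """Return (delta, speedup) strings in the style of the tables above."""
    delta = head_ms - base_ms          # negative means head is faster
    pct = delta / base_ms * 100        # change relative to base
    speedup = base_ms / head_ms        # values above 1.0 mean head is faster
    marker = "🟢" if speedup >= 1 else "🔴"  # assumed threshold
    return f"{delta:+,.0f}ms ({pct:+.0f}%)", f"{marker} {speedup:.2f}x"


print(format_comparison(14_664, 7_983))  # ('-6,681ms (-46%)', '🟢 1.84x')
print(format_comparison(98.3, 81.0))     # ('-17ms (-18%)', '🟢 1.21x')
```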
Replace PNG export with GitHub-flavored markdown tables. Add unicode progress bars for improvement and share-of-time columns, remove image dependencies (pillow, cairosvg, svglib, reportlab).
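
A minimal sketch of a ten-cell unicode progress bar like the ones in the Improvement and share-of-time columns. The function name `progress_bar` and the round-half-up rule are assumptions; the commit does not state how the tool rounds, although this rule matches the bars shown in this conversation.

```python
def progress_bar(percent: float, width: int = 10) -> str:
    """Render a percentage as a fixed-width bar of full and light blocks."""
    clamped = max(0.0, min(100.0, percent))
    # Round to the nearest whole cell, with halves rounding up (assumed rule).
    filled = int(clamped * width / 100 + 0.5)
    return "█" * filled + "░" * (width - filled)


print(progress_bar(67), "+67%")   # ███████░░░ +67%
print(progress_bar(4.9), "4.9%")  # ░░░░░░░░░░ 4.9%
```

Plain block characters render inside GitHub-flavored markdown tables without any image dependency, which fits the stated goal of dropping pillow, cairosvg, svglib, and reportlab.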
Benchmark:

| Branch | Time (ms) | vs base | Speedup |
|---|---|---|---|
| `2c4989c6458c` (base) | 14,860 | - | - |
| `f169e862ea6b` (head) | 7,566 | -7,294ms (-49%) | 🟢 1.96x |

| Function | base (ms) | head (ms) | Improvement | Speedup |
|---|---|---|---|---|
| `get_code_optimization_context` | 12,513 | 6,826 | █████░░░░░ +45% | 🟢 1.83x |
| `extract_all_contexts_from_files` | 8,256 | 2,748 | ███████░░░ +67% | 🟢 3.00x |
| `collect_top_level_defs_with_dependencies` | 2,802 | 458 | ████████░░ +84% | 🟢 6.12x |
| `get_function_sources_from_jedi` | 2,174 | 1,787 | ██░░░░░░░░ +18% | 🟢 1.22x |
| `add_needed_imports_from_module` | 2,086 | 497 | ████████░░ +76% | 🟢 4.20x |
| `gather_source_imports` | 725 | 120 | ████████░░ +83% | 🟢 6.03x |
| `collect_existing_class_names` | 0.25 | 0.19 | ██░░░░░░░░ +24% | 🟢 1.32x |
| TOTAL | 14,860 | 7,566 | █████░░░░░ +49% | 🟢 1.96x |

Share of Benchmark Time

| Function | base | head |
|---|---|---|
| `get_code_optimization_context` | ████████░░ 84.2% | █████████░ 90.2% |
| `extract_all_contexts_from_files` | ██████░░░░ 55.6% | ████░░░░░░ 36.3% |
| `collect_top_level_defs_with_dependencies` | ██░░░░░░░░ 18.9% | █░░░░░░░░░ 6.0% |
| `get_function_sources_from_jedi` | █░░░░░░░░░ 14.6% | ██░░░░░░░░ 23.6% |
| `add_needed_imports_from_module` | █░░░░░░░░░ 14.0% | █░░░░░░░░░ 6.6% |
| `gather_source_imports` | ░░░░░░░░░░ 4.9% | ░░░░░░░░░░ 1.6% |
| `collect_existing_class_names` | ░░░░░░░░░░ 0.0% | ░░░░░░░░░░ 0.0% |

test_benchmark_code_to_optimize_test_discovery

| Branch | Time (ms) | vs base | Speedup |
|---|---|---|---|
| `2c4989c6458c` (base) | 2,406 | - | - |
| `f169e862ea6b` (head) | 1,633 | -773ms (-32%) | 🟢 1.47x |

test_benchmark_codeflash_test_discovery

| Branch | Time (ms) | vs base | Speedup |
|---|---|---|---|
| `2c4989c6458c` (base) | 613 | - | - |
| `f169e862ea6b` (head) | 570 | -43ms (-7%) | 🟢 1.08x |

test_benchmark_merge_test_results

| Branch | Time (ms) | vs base | Speedup |
|---|---|---|---|
| `2c4989c6458c` (base) | 41.9 | - | - |
| `f169e862ea6b` (head) | 45.7 | +4ms (+9%) | 🔴 0.92x |

Generated by codeflash optimization agent — you can reproduce this by running:
codeflash compare 2c4989c6458ccfb0100e5ad28e0163a7bc3ee901 f169e862ea6b15f433150a77545eefd9099f6ccb --markdown
Replace plain logger output with a Live panel showing a function tree and step-by-step progress. Handle KeyboardInterrupt gracefully with cleanup. Remove --markdown CLI flag (format_markdown remains for programmatic use).
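
The interrupt handling mentioned in this commit can be sketched with a `try`/`except`/`finally` block. This is a generic standard-library illustration under stated assumptions, not the PR's code: a temporary directory stands in for the git worktrees, and `run_with_cleanup` and `simulated_ctrl_c` are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_cleanup(work) -> str:
    """Run work(scratch_dir) and always remove scratch_dir afterwards."""
    scratch = Path(tempfile.mkdtemp(prefix="compare-sketch-"))
    try:
        work(scratch)
        return "completed"
    except KeyboardInterrupt:
        # Report the interruption instead of surfacing a traceback.
        return "interrupted"
    finally:
        # Runs on success, on Ctrl-C, and on any other exception.
        shutil.rmtree(scratch, ignore_errors=True)


def simulated_ctrl_c(scratch: Path) -> None:
    # Leave a partial artifact behind, then simulate the user pressing Ctrl-C.
    (scratch / "partial.txt").write_text("half-written result")
    raise KeyboardInterrupt


print(run_with_cleanup(simulated_ctrl_c))  # interrupted
```

Putting the removal in `finally` rather than in the `except` branch means the scratch directory is also cleaned up when the run succeeds or fails for another reason.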
9ff4e7b to f91f3e0
Co-authored-by: Kevin Turcios <undefined@users.noreply.github.com>
⚡️ Codeflash found optimizations for this PR
📄 33% (0.33x) speedup for
Summary

Adds `codeflash compare <base_ref> <head_ref>`, a first-class CLI command that compares benchmark performance between two git refs.

Flow:

1. Auto-detects changed functions from `git diff` (line-level overlap, not just file-level)
2. Creates worktrees for both refs
3. Instruments the detected functions with `@codeflash_trace`
4. Runs the benchmarks with `trace_benchmarks_pytest`
5. Renders a Rich side-by-side comparison table

Usage: `codeflash compare <base_ref> <head_ref>`

Smoke test: ran against `main` vs `origin/cf-cpu-context-extraction` on the codeflash repo itself. Auto-detected 5 modified functions across 3 files and produced a full comparison showing a 3.04x speedup on the context extraction benchmark.

Known Limitations

- Methods (functions with `parents`) are skipped. This is intentional: CST visitor methods like `visit_Name` can be called thousands of times per benchmark, and `@codeflash_trace` pickles `self` on every call. When `self` holds a full CST tree, the overhead is catastrophic (a benchmark that normally takes ~23s times out at 120s+). The `--functions` flag can be used to explicitly target specific methods if needed.
- Requires `benchmarks-root` to be configured in `[tool.codeflash]` in pyproject.toml.
- Uses the `unidiff` package for diff parsing (already a dependency).

New Files

- `codeflash/benchmarking/compare.py`: core orchestration logic
- `codeflash/cli_cmds/cmd_compare.py`: CLI handler (config loading, arg parsing, PR resolution)

Modified Files

- `codeflash/cli_cmds/cli.py`: added `compare` subparser
- `codeflash/main.py`: added dispatch for `compare` command

Test plan

- `codeflash compare main origin/cf-cpu-context-extraction --timeout 120` completes successfully with correct output
- `--pr` flag against a real PR
- `--functions` explicit targeting
- `--svg` export
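
The "line-level overlap, not just file-level" detection described in the summary can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the changed line numbers are passed in directly (the PR obtains them by parsing the diff with `unidiff`), and `changed_top_level_functions` is a hypothetical name.

```python
import ast


def changed_top_level_functions(source: str, changed_lines: set[int]) -> list[str]:
    """Names of top-level functions whose line span contains a changed line."""
    hits = []
    for node in ast.iter_child_nodes(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # lineno and end_lineno are 1-based and inclusive.
        span = range(node.lineno, node.end_lineno + 1)
        if any(line in span for line in changed_lines):
            hits.append(node.name)
    return hits


source = """\
def untouched():
    return 1


def modified():
    value = 2
    return value
"""

# Suppose the diff reports that line 6 of the new file changed.
print(changed_top_level_functions(source, {6}))  # ['modified']
```

A file-level check would flag both functions here because they share a file; comparing line spans narrows the result to the function that actually changed.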