perf: optimize context extraction with caching and import pre-filtering#1921
perf: optimize context extraction with caching and import pre-filtering#1921
Conversation
|
Claude finished @KRRT7's task in 5m 21s —— View job PR Review Summary
Prek Checks
Code ReviewNo runtime bugs or breaking changes found. A few observations: Jedi
|
0e91155 to
85e56d7
Compare
…ng (~42% faster) Cache Jedi refs_by_parent across calls, reuse RO results for identical TESTGEN pruning, pre-filter imports by AST-collected referenced names, and skip RemoveImportsVisitor when dst has no pre-existing imports.
85e56d7 to
0593723
Compare
Benchmark Results: Context Extraction PipelineEnd-to-end benchmark comparison using codeflash's Three branches compared:
Key results
Profiled on Python 3.13.7, macOS, with |
| for node in ast.walk(tree): | ||
| if isinstance(node, ast.Name): | ||
| names.add(node.id) | ||
| elif isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name): | ||
| names.add(node.value.id) | ||
| elif isinstance(node, (ast.Import, ast.ImportFrom)): | ||
| has_imports = True | ||
| elif isinstance(node, ast.Constant) and isinstance(node.value, str): | ||
| try: | ||
| inner = ast.parse(node.value, mode="eval") | ||
| for inner_node in ast.walk(inner): | ||
| if isinstance(inner_node, ast.Name): | ||
| names.add(inner_node.id) | ||
| except SyntaxError: | ||
| pass |
There was a problem hiding this comment.
⚡️Codeflash found 34% (0.34x) speedup for _collect_dst_referenced_names in codeflash/languages/python/static_analysis/code_extractor.py
⏱️ Runtime : 47.4 milliseconds → 35.5 milliseconds (best of 5 runs)
⚡️ This change will improve the performance of the following benchmarks:
| Benchmark File :: Function | Original Runtime | Expected New Runtime | Speedup |
|---|---|---|---|
| tests.benchmarks.test_benchmark_code_extract_code_context::test_benchmark_extract | 8.34 seconds | 8.34 seconds | 0.01% |
🔻 This change will degrade the performance of the following benchmarks:
{benchmark_info_degraded}
📝 Explanation and details
The function replaced ast.walk (a generator that yields every node) with an explicit stack-based traversal that skips subtrees early when type checks fail and avoids redundant isinstance calls by pre-binding type constants (NameType, AttributeType, etc.). It also introduced a local cache for parsed string annotations (_str_parse_cache) to avoid reparsing identical type-hint strings—profiler data shows this eliminated repeated ast.parse overhead inside the constant-string handling branch. Manual field iteration via _fields and getattr replaces the generator's internal iteration, cutting per-node dispatch cost from ~2961 ns to ~129 ns in the hot loop (line profiler confirms the main traversal loop dropped from 56.5% to 2.8% of runtime). The 33% speedup justifies the slightly longer code.
✅ Correctness verification report:
| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 65 Passed |
| ⏪ Replay Tests | ✅ 6 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Click to see Generated Regression Tests
import pytest # used for our unit tests
from codeflash.languages.python.static_analysis.code_extractor import _collect_dst_referenced_names
def test_basic_empty_and_syntax_error():
# An empty module should parse and yield no names and no imports.
names, has_imports = _collect_dst_referenced_names("") # 27.2μs -> 18.9μs (43.9% faster)
assert names == set() # 28.8μs -> 29.3μs (1.55% slower)
assert has_imports is False # no imports present
# A syntactically invalid module should return (set(), False) per function design.
names, has_imports = _collect_dst_referenced_names("def incomplete(")
assert names == set() # function catches SyntaxError and returns empty set
assert has_imports is False # and indicates no imports
def test_collect_names_and_attributes():
# Basic Name nodes: variable references and assignments should be collected.
code = """
a = b + c
d = a + 1
"""
names, has_imports = _collect_dst_referenced_names(code) # 56.9μs -> 41.3μs (37.7% faster)
# 'a', 'b', 'c', 'd' appear as Name nodes in the AST and should be collected.
assert {"a", "b", "c", "d"}.issubset(names)
assert has_imports is False # 35.8μs -> 29.4μs (21.8% faster)
# Attribute access should capture only the base Name (the left-most Name value).
code2 = """
x.y
obj.attr
(p).q
"""
names2, _ = _collect_dst_referenced_names(code2)
# Base names 'x', 'obj', 'p' should be included.
assert {"x", "obj", "p"}.issubset(names2)
# Attribute names 'y' and 'attr' and 'q' should not be included as Name nodes.
assert "y" not in names2
assert "attr" not in names2
assert "q" not in names2
def test_strings_parsed_as_code_and_invalid_inner_are_ignored():
# String constants that contain valid Python expressions should be parsed and names extracted.
# Example: a constant containing "List[int]" will parse as a Subscript with Name nodes.
code = 'const1 = "List[int]"\nconst2 = "MyType"\n'
names, _ = _collect_dst_referenced_names(code) # 62.1μs -> 62.8μs (1.12% slower)
# 'List', 'int', and 'MyType' should be discovered from the inner parses.
assert "List" in names
assert "int" in names # 23.0μs -> 21.1μs (8.91% faster)
assert "MyType" in names
# If the string constant contains invalid Python for eval mode, it should be ignored (no exception).
code_with_invalid = 'bad = "a =" \n' # "a =" is not a valid expression -> inner SyntaxError and ignored
names_bad, _ = _collect_dst_referenced_names(code_with_invalid)
# Nothing valid should be collected from the invalid inner string.
assert "a" not in names_bad
def test_imports_flag_and_imported_names_not_collected_from_import_stmt():
# Import statements should flip the has_imports flag.
code = """
import os, json
from sys import path as sys_path
# but do not reference those module names anywhere else
"""
names, has_imports = _collect_dst_referenced_names(code) # 35.7μs -> 29.0μs (23.1% faster)
# has_imports must be True when imports are present.
assert has_imports is True
# The mere presence of import statements should not cause the module names to be added as Names
# unless they appear elsewhere as Name nodes. So 'os' and 'sys' shouldn't be in the result.
assert "os" not in names
assert "sys" not in names
assert "json" not in names
assert "path" not in names
def test_attribute_chains_add_base_name_even_when_nested():
# For chained attributes like a.b.c the AST contains nested Attribute nodes;
# the inner Attribute has value=Name('a') so 'a' will be collected.
code = "a.b.c\n"
names, _ = _collect_dst_referenced_names(code) # 28.9μs -> 21.4μs (34.9% faster)
assert "a" in names # base of the chain should be present
# The intermediate attribute names 'b' and final attribute 'c' are not Name nodes and should not be present.
assert "b" not in names # 23.8μs -> 25.5μs (6.60% slower)
assert "c" not in names
# An attribute where the value is a Constant (e.g., a literal) should not spuriously add names.
code2 = "'literal'.prop\n"
names2, _ = _collect_dst_referenced_names(code2)
# No Name nodes should be created from a constant string's attribute access.
assert "literal" not in names2
def test_none_input_raises_type_error():
# Passing None should raise a TypeError from ast.parse; the function does not catch this.
with pytest.raises(TypeError):
_collect_dst_referenced_names(None) # 6.14μs -> 6.25μs (1.75% slower)
def test_large_scale_performance_and_correctness():
# Generate a reasonably large module with many distinct Name occurrences and many string-constants
# that themselves contain simple names to be parsed.
N = 1000 # large scale as requested
parts = []
# Create many variable assignments to produce Name nodes for v0..vN-1
for i in range(N):
parts.append(f"v{i} = {i}\n") # assignment target is a Name node 'v{i}'
# Add many string constants that each contain a single simple name like "n{i}" which will be parsed
for i in range(N):
parts.append(f's{i} = "n{i}"\n') # constant string -> inner parse yields Name 'n{i}'
# Add an import to ensure has_imports becomes True
parts.append("import math\n")
# Combine into a single large source string
big_code = "".join(parts)
names, has_imports = _collect_dst_referenced_names(big_code) # 18.1ms -> 15.1ms (19.9% faster)
# The has_imports flag must be True because of the import statement.
assert has_imports is True
# Check a few representative items to ensure correctness and that the parsing actually collected names.
# Check first, middle, and last indices for v* and n*.
assert "v0" in names
assert f"v{N // 2}" in names
assert f"v{N - 1}" in names
assert "n0" in names
assert f"n{N // 2}" in names
assert f"n{N - 1}" in names
# Expect at least 2*N names (v* and n*) to be present.
# Use >= to be resilient to possible additional names from parsing (but ensure the bulk was collected).
assert len(names) >= 2 * N# imports
from codeflash.languages.python.static_analysis.code_extractor import _collect_dst_referenced_names
def test_empty_code():
"""Test that empty code returns empty set and no imports."""
names, has_imports = _collect_dst_referenced_names("") # 18.1μs -> 12.9μs (40.9% faster)
assert names == set()
assert has_imports is False
def test_single_name_reference():
"""Test collection of a single name reference."""
names, has_imports = _collect_dst_referenced_names("x") # 26.9μs -> 18.4μs (46.4% faster)
assert names == {"x"}
assert has_imports is False
def test_multiple_name_references():
"""Test collection of multiple distinct names."""
names, has_imports = _collect_dst_referenced_names("x + y + z") # 36.6μs -> 25.1μs (45.6% faster)
assert names == {"x", "y", "z"}
assert has_imports is False
def test_repeated_name_references():
"""Test that repeated name references appear only once in the set."""
names, has_imports = _collect_dst_referenced_names("x + x + y + y + y") # 41.4μs -> 26.8μs (54.2% faster)
assert names == {"x", "y"}
assert has_imports is False
def test_attribute_access_single_level():
"""Test collection of single-level attribute access (base name only)."""
names, has_imports = _collect_dst_referenced_names("obj.attr") # 28.4μs -> 20.7μs (37.4% faster)
assert names == {"obj"}
assert has_imports is False
def test_attribute_access_multi_level():
"""Test collection of multi-level attribute access (base name only)."""
names, has_imports = _collect_dst_referenced_names("obj.attr1.attr2.attr3") # 34.0μs -> 24.7μs (37.4% faster)
assert names == {"obj"}
assert has_imports is False
def test_simple_import_statement():
"""Test detection of simple import statements."""
names, has_imports = _collect_dst_referenced_names("import os") # 22.2μs -> 16.6μs (33.6% faster)
assert has_imports is True
# os is referenced as a Name node due to import
assert "os" in names
def test_import_from_statement():
"""Test detection of from-import statements."""
names, has_imports = _collect_dst_referenced_names("from os import path") # 25.0μs -> 18.6μs (34.0% faster)
assert has_imports is True
def test_function_call():
"""Test collection of names in function calls."""
names, has_imports = _collect_dst_referenced_names("func(arg1, arg2)") # 36.8μs -> 27.2μs (35.6% faster)
assert "func" in names
assert "arg1" in names
assert "arg2" in names
assert has_imports is False
def test_variable_assignment():
"""Test collection of names in variable assignments."""
names, has_imports = _collect_dst_referenced_names("result = func(x, y)") # 38.4μs -> 26.5μs (44.8% faster)
assert "result" in names
assert "func" in names
assert "x" in names
assert "y" in names
assert has_imports is False
def test_string_annotation_with_name():
"""Test collection of names inside string annotations."""
names, has_imports = _collect_dst_referenced_names('x: "MyType" = 5') # 41.1μs -> 41.8μs (1.77% slower)
assert "MyType" in names
assert "x" in names
assert has_imports is False
def test_string_annotation_with_multiple_names():
"""Test collection of multiple names in string annotations."""
names, has_imports = _collect_dst_referenced_names(
'def f(x: "List[int]") -> "Optional[str]": pass'
) # 68.6μs -> 68.3μs (0.573% faster)
assert "List" in names
assert "int" in names
assert "Optional" in names
assert "str" in names
assert "f" in names
assert "x" in names
assert has_imports is False
def test_syntax_error_recovery():
"""Test that syntax errors return empty set and no imports flag."""
names, has_imports = _collect_dst_referenced_names(
"if this is not valid syntax !!!"
) # 42.7μs -> 45.5μs (6.08% slower)
assert names == set()
assert has_imports is False
def test_incomplete_syntax():
"""Test incomplete syntax that cannot be parsed."""
names, has_imports = _collect_dst_referenced_names("def func(") # 21.2μs -> 19.2μs (10.1% faster)
assert names == set()
assert has_imports is False
def test_whitespace_only():
"""Test code containing only whitespace."""
names, has_imports = _collect_dst_referenced_names(" \n\n \t\t ") # 13.7μs -> 9.30μs (47.4% faster)
assert names == set()
assert has_imports is False
def test_comments_only():
"""Test code containing only comments."""
names, has_imports = _collect_dst_referenced_names(
"# This is a comment\n# Another comment"
) # 13.3μs -> 9.00μs (48.2% faster)
assert names == set()
assert has_imports is False
def test_single_character_names():
"""Test collection of single-character variable names."""
names, has_imports = _collect_dst_referenced_names("a + b + c") # 34.5μs -> 23.9μs (44.0% faster)
assert names == {"a", "b", "c"}
assert has_imports is False
def test_underscore_names():
"""Test collection of underscore-prefixed and underscore names."""
names, has_imports = _collect_dst_referenced_names(
"_ + _var + __private + __dunder__"
) # 39.5μs -> 27.1μs (45.7% faster)
assert "_" in names
assert "_var" in names
assert "__private" in names
assert "__dunder__" in names
assert has_imports is False
def test_numeric_suffixed_names():
"""Test collection of names with numeric suffixes."""
names, has_imports = _collect_dst_referenced_names(
"var1 + var2 + var123 + var_456"
) # 37.9μs -> 26.3μs (44.0% faster)
assert "var1" in names
assert "var2" in names
assert "var123" in names
assert "var_456" in names
assert has_imports is False
def test_keyword_like_but_not_keyword():
"""Test that non-keyword identifiers that look like keywords are collected."""
names, has_imports = _collect_dst_referenced_names("ifx = 5") # 27.3μs -> 20.1μs (35.9% faster)
assert "ifx" in names
assert has_imports is False
def test_nested_attribute_access():
"""Test nested attribute access on attributes."""
names, has_imports = _collect_dst_referenced_names("a.b.c.d.e.f") # 36.8μs -> 26.6μs (37.9% faster)
# Only the base name should be collected
assert names == {"a"}
assert has_imports is False
def test_method_call_chain():
"""Test method call chains."""
names, has_imports = _collect_dst_referenced_names(
"obj.method1().method2().method3()"
) # 43.3μs -> 35.9μs (20.7% faster)
assert "obj" in names
assert has_imports is False
def test_dict_with_names():
"""Test collection of names in dictionary literals."""
names, has_imports = _collect_dst_referenced_names(
"{key1: value1, key2: value2}"
) # 37.0μs -> 28.0μs (32.5% faster)
assert "key1" in names
assert "value1" in names
assert "key2" in names
assert "value2" in names
assert has_imports is False
def test_list_comprehension():
"""Test collection of names in list comprehensions."""
names, has_imports = _collect_dst_referenced_names(
"[x * 2 for x in items if x > threshold]"
) # 55.8μs -> 43.7μs (27.8% faster)
assert "x" in names
assert "items" in names
assert "threshold" in names
assert has_imports is False
def test_lambda_expression():
"""Test collection of names in lambda expressions."""
names, has_imports = _collect_dst_referenced_names("lambda x, y: x + y + z") # 48.0μs -> 34.4μs (39.7% faster)
assert "x" in names
assert "y" in names
assert "z" in names
assert has_imports is False
def test_string_literal_non_annotation():
"""Test that regular string literals are ignored."""
names, has_imports = _collect_dst_referenced_names('message = "hello world"') # 41.9μs -> 37.7μs (11.2% faster)
assert "message" in names
# The string content should not be parsed as code
assert "hello" not in names
assert "world" not in names
assert has_imports is False
def test_invalid_string_annotation():
"""Test that invalid expressions in string annotations are handled gracefully."""
names, has_imports = _collect_dst_referenced_names(
'x: "not valid python !!! @#$" = 5'
) # 39.6μs -> 34.1μs (16.3% faster)
assert "x" in names
# The string contains invalid Python, so no names are extracted from it
assert has_imports is False
def test_complex_expression():
"""Test a complex expression with multiple operators and names."""
names, has_imports = _collect_dst_referenced_names("(a + b) * (c - d) / (e + f)") # 53.7μs -> 38.3μs (40.1% faster)
assert "a" in names
assert "b" in names
assert "c" in names
assert "d" in names
assert "e" in names
assert "f" in names
assert has_imports is False
def test_multiple_imports():
"""Test detection with multiple import types."""
code = """
import os
from sys import argv
import json as j
"""
names, has_imports = _collect_dst_referenced_names(code) # 35.7μs -> 27.3μs (30.9% faster)
assert has_imports is True
def test_imports_and_names():
"""Test that both imports and other names are collected."""
code = """
import os
x = os.path.join(a, b)
"""
names, has_imports = _collect_dst_referenced_names(code) # 49.7μs -> 37.3μs (33.3% faster)
assert has_imports is True
assert "os" in names
assert "x" in names
assert "a" in names
assert "b" in names
def test_conditional_code():
"""Test collection from conditional statements."""
names, has_imports = _collect_dst_referenced_names("""
if condition:
result = func(x)
else:
result = other_func(y)
""") # 56.1μs -> 41.2μs (36.3% faster)
assert "condition" in names
assert "result" in names
assert "func" in names
assert "x" in names
assert "other_func" in names
assert "y" in names
assert has_imports is False
def test_loop_code():
"""Test collection from loop statements."""
names, has_imports = _collect_dst_referenced_names("""
for item in items:
process(item)
""") # 41.7μs -> 30.1μs (38.9% faster)
assert "item" in names
assert "items" in names
assert "process" in names
assert has_imports is False
def test_class_definition():
"""Test collection of names in class definitions."""
names, has_imports = _collect_dst_referenced_names("""
class MyClass(BaseClass):
def __init__(self, x):
self.x = x
""") # 59.3μs -> 47.8μs (23.9% faster)
assert "MyClass" in names
assert "BaseClass" in names
assert "self" in names
assert "x" in names
assert has_imports is False
def test_function_definition():
"""Test collection of names in function definitions."""
names, has_imports = _collect_dst_referenced_names("""
def my_func(arg1, arg2):
return arg1 + arg2 + external_var
""") # 52.2μs -> 40.5μs (28.8% faster)
assert "my_func" in names
assert "arg1" in names
assert "arg2" in names
assert "external_var" in names
assert has_imports is False
def test_nested_attribute_on_literal():
"""Test attribute access on non-Name objects (should not collect base)."""
names, has_imports = _collect_dst_referenced_names('"string".upper()') # 38.9μs -> 41.9μs (7.19% slower)
# The function only collects base names when attribute.value is ast.Name
assert has_imports is False
def test_numeric_literal():
"""Test that numeric literals don't cause issues."""
names, has_imports = _collect_dst_referenced_names("x = 42 + 3.14 + 1j") # 35.9μs -> 27.8μs (28.9% faster)
assert "x" in names
assert has_imports is False
def test_boolean_literals():
"""Test boolean and None literals."""
names, has_imports = _collect_dst_referenced_names(
"result = True and False and None"
) # 34.7μs -> 26.8μs (29.6% faster)
assert "result" in names
assert has_imports is False
def test_bytes_literal():
"""Test bytes string literal."""
names, has_imports = _collect_dst_referenced_names("data = b'bytes'") # 25.2μs -> 18.3μs (37.8% faster)
assert "data" in names
assert has_imports is False
def test_empty_attribute_chain():
"""Test that we handle edge cases in attribute chains."""
names, has_imports = _collect_dst_referenced_names("(x).y.z") # 30.1μs -> 24.0μs (25.6% faster)
assert "x" in names
assert has_imports is False
def test_starred_expression():
"""Test starred expressions in unpacking."""
names, has_imports = _collect_dst_referenced_names("a, *b, c = items") # 36.6μs -> 26.3μs (39.5% faster)
assert "a" in names
assert "b" in names
assert "c" in names
assert "items" in names
assert has_imports is False
def test_ternary_expression():
"""Test ternary conditional expressions."""
names, has_imports = _collect_dst_referenced_names(
"result = x if condition else y"
) # 33.9μs -> 24.2μs (40.2% faster)
assert "result" in names
assert "x" in names
assert "condition" in names
assert "y" in names
assert has_imports is False
def test_many_distinct_names():
"""Test collection with many distinct variable names (500 names)."""
# Build code with 500 distinct names
names_list = [f"var{i}" for i in range(500)]
code = " + ".join(names_list)
names, has_imports = _collect_dst_referenced_names(code) # 2.21ms -> 1.33ms (66.5% faster)
# Verify all names were collected
assert len(names) == 500
for i in range(500):
assert f"var{i}" in names
assert has_imports is False
def test_large_nested_expression():
"""Test deeply nested expressions (1000 levels of attribute access)."""
# Build a chain: a.b.c.d...
code = "a"
for i in range(100):
code += ".attr"
names, has_imports = _collect_dst_referenced_names(code) # 247μs -> 164μs (50.0% faster)
assert names == {"a"}
assert has_imports is False
def test_large_function_call_args():
"""Test function call with many arguments (500 args)."""
args = ", ".join([f"arg{i}" for i in range(500)])
code = f"result = func({args})"
names, has_imports = _collect_dst_referenced_names(code) # 1.40ms -> 895μs (56.9% faster)
assert "result" in names
assert "func" in names
assert len(names) == 502 # result + func + 500 args
assert has_imports is False
def test_large_dictionary_literal():
"""Test large dictionary with many key-value pairs (500 pairs)."""
pairs = ", ".join([f"key{i}: val{i}" for i in range(500)])
code = f"d = {{{pairs}}}"
names, has_imports = _collect_dst_referenced_names(code) # 2.73ms -> 1.67ms (63.7% faster)
assert "d" in names
# 1 for 'd' + 500 keys + 500 values = 1001
assert len(names) == 1001
assert has_imports is False
def test_many_list_comprehensions():
"""Test code with many list comprehensions (100 comprehensions)."""
code = "\n".join([f"list{i} = [x for x in items{i}]" for i in range(100)])
names, has_imports = _collect_dst_referenced_names(code) # 1.48ms -> 952μs (55.7% faster)
assert "x" in names
for i in range(100):
assert f"list{i}" in names
assert f"items{i}" in names
assert has_imports is False
def test_large_class_hierarchy():
"""Test a large class hierarchy (100 classes)."""
code = "class C0: pass\n"
for i in range(1, 100):
code += f"class C{i}(C{i - 1}): pass\n"
names, has_imports = _collect_dst_referenced_names(code) # 691μs -> 473μs (46.2% faster)
for i in range(100):
assert f"C{i}" in names
assert has_imports is False
def test_many_function_definitions():
"""Test many function definitions with parameters (100 functions)."""
code = "\n".join([f"def func{i}(p{i}_0, p{i}_1, p{i}_2): pass" for i in range(100)])
names, has_imports = _collect_dst_referenced_names(code) # 1.39ms -> 1.14ms (22.4% faster)
for i in range(100):
assert f"func{i}" in names
assert f"p{i}_0" in names
assert f"p{i}_1" in names
assert f"p{i}_2" in names
assert has_imports is False
def test_many_string_annotations():
"""Test many string annotations (200 annotations)."""
code = "\n".join([f'x{i}: "TypeA{i}[TypeB{i}]" = None' for i in range(200)])
names, has_imports = _collect_dst_referenced_names(code) # 3.30ms -> 3.02ms (9.41% faster)
# Verify some of the annotated names are present
assert "x0" in names
assert "TypeA0" in names
assert "TypeB0" in names
assert "x199" in names
assert "TypeA199" in names
assert "TypeB199" in names
assert has_imports is False
def test_very_long_single_line():
"""Test very long single-line expression (1000 names concatenated)."""
names_list = [f"v{i}" for i in range(1000)]
code = " + ".join(names_list)
names, has_imports = _collect_dst_referenced_names(code) # 4.21ms -> 2.44ms (72.6% faster)
assert len(names) == 1000
assert has_imports is False
def test_large_conditional_chain():
"""Test large chain of if-elif-else (100 conditions)."""
code = "if c0: x = 0\n"
for i in range(1, 100):
code += f"elif c{i}: x = {i}\n"
code += "else: x = -1\n"
names, has_imports = _collect_dst_referenced_names(code) # 1.03ms -> 644μs (60.1% faster)
assert "x" in names
for i in range(100):
assert f"c{i}" in names
assert has_imports is False
def test_deeply_nested_list_comprehensions():
"""Test deeply nested list comprehensions (50 levels)."""
code = "[x"
for i in range(50):
code += f" for x{i} in items{i}"
code += "]"
names, has_imports = _collect_dst_referenced_names(code) # 339μs -> 199μs (70.7% faster)
assert "x" in names
for i in range(50):
assert f"x{i}" in names
assert f"items{i}" in names
assert has_imports is False
def test_mixed_large_code():
"""Test large code combining imports, functions, classes, and expressions."""
code = """
import os
from sys import argv
import json
class MyClass(BaseClass):
def __init__(self):
self.value = 0
def process_data(data):
result = transform(data)
return result
x = MyClass()
data = [item for item in source if predicate(item)]
output = process_data(data)
final = final_transform(output)
"""
names, has_imports = _collect_dst_referenced_names(code) # 173μs -> 127μs (36.1% faster)
assert has_imports is True
assert "os" in names
assert "argv" in names
assert "json" in names
assert "MyClass" in names
assert "BaseClass" in names
assert "process_data" in names
assert "data" in names
assert "item" in names
assert "source" in names
assert "predicate" in names
assert len(names) > 20
def test_code_with_many_attributes():
"""Test code with many attribute accesses on different base objects (500)."""
code = "\n".join([f"v{i}.attr1.attr2.attr3" for i in range(500)])
names, has_imports = _collect_dst_referenced_names(code) # 5.60ms -> 3.85ms (45.3% faster)
assert len(names) == 500
for i in range(500):
assert f"v{i}" in names
assert has_imports is False⏪ Click to see Replay Tests
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
benchmarks/codeflash_replay_tests_pj5gk87v/test_tests_benchmarks_test_benchmark_code_extract_code_context__replay_test_0.py::test_codeflash_languages_python_static_analysis_code_extractor__collect_dst_referenced_names_test_benchmark_extract |
2.62ms | 1.94ms | 34.9%✅ |
To test or edit this optimization locally git merge codeflash/optimize-pr1921-2026-03-27T21.35.30
Click to see suggested changes
| for node in ast.walk(tree): | |
| if isinstance(node, ast.Name): | |
| names.add(node.id) | |
| elif isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name): | |
| names.add(node.value.id) | |
| elif isinstance(node, (ast.Import, ast.ImportFrom)): | |
| has_imports = True | |
| elif isinstance(node, ast.Constant) and isinstance(node.value, str): | |
| try: | |
| inner = ast.parse(node.value, mode="eval") | |
| for inner_node in ast.walk(inner): | |
| if isinstance(inner_node, ast.Name): | |
| names.add(inner_node.id) | |
| except SyntaxError: | |
| pass | |
| # Small cache for parsed string annotations to avoid reparsing identical strings. | |
| _str_parse_cache: dict[str, frozenset[str]] = {} | |
| # Manual stack-based traversal is typically faster than using the generator-based ast.walk. | |
| stack = [tree] | |
| NameType = ast.Name | |
| AttributeType = ast.Attribute | |
| ImportTypes = (ast.Import, ast.ImportFrom) | |
| ConstantType = ast.Constant | |
| while stack: | |
| node = stack.pop() | |
| if isinstance(node, NameType): | |
| names.add(node.id) | |
| continue | |
| if isinstance(node, AttributeType) and isinstance(node.value, NameType): | |
| names.add(node.value.id) | |
| # continue to still traverse any children if present (though attribute.value already handled) | |
| # we still push children below | |
| elif isinstance(node, ImportTypes): | |
| has_imports = True | |
| elif isinstance(node, ConstantType) and isinstance(node.value, str): | |
| s = node.value | |
| cached = _str_parse_cache.get(s) | |
| if cached is not None: | |
| if cached: | |
| names.update(cached) | |
| else: | |
| try: | |
| inner = ast.parse(s, mode="eval") | |
| except SyntaxError: | |
| _str_parse_cache[s] = frozenset() | |
| else: | |
| inner_names: set[str] = set() | |
| for inner_node in ast.walk(inner): | |
| if isinstance(inner_node, NameType): | |
| inner_names.add(inner_node.id) | |
| frozen = frozenset(inner_names) | |
| _str_parse_cache[s] = frozen | |
| if frozen: | |
| names.update(frozen) | |
| # Push child AST nodes for traversal. Using _fields avoids creating many intermediate lists. | |
| fields = getattr(node, "_fields", ()) | |
| for field in fields: | |
| value = getattr(node, field, None) | |
| if isinstance(value, list): | |
| # iterate in normal order; append to stack to preserve overall traversal behavior | |
| for item in value: | |
| if isinstance(item, ast.AST): | |
| stack.append(item) | |
| elif isinstance(value, ast.AST): | |
| stack.append(value) | |

Summary
Builds on top of #1918 with 4 additional optimizations to the context extraction pipeline, achieving ~42% further speedup (median 2.93s → 1.71s).
refs_by_parentacross calls — bothget_function_sources_from_jedicalls operate on the same file; caching the expensiveget_names()+ parent grouping eliminates redundant Jedi workadd_needed_imports_from_modulecallast.walkcollects names actually used in the destination code, so only relevant imports are added (reducing the module size RemoveImportsVisitor has to analyze)Test plan
tests/benchmarks/test_benchmark_code_extract_code_context.py::test_benchmark_extract