A static linter that catches silent Pandas performance killers before they ship to production.
pdperf scans your Python code for common Pandas anti-patterns that work correctly but are often 10–100× slower at scale than necessary. It's local-first, deterministic, and CI-friendly: no code execution required.
- Why pdperf?
- Quick Start
- CI-Friendly Guarantees
- Rules Reference
- Detailed Rule Examples
- CLI Reference
- How pdperf Works: Technical Deep-Dive
- Integrations
- License
## Why pdperf?

Pandas makes it easy to write code that works but scales poorly:
```python
# This works... but is painfully slow on large datasets
for idx, row in df.iterrows():
    total += row['price'] * row['quantity']

# pdperf catches this and suggests:
# 💡 Use vectorized: (df['price'] * df['quantity']).sum()
```

These issues often start in notebooks and quietly move into ETL pipelines. pdperf catches them before production.
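A minimal sketch (toy data, assumes pandas is installed) showing that the loop and the vectorized form agree on the result; only the speed differs:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"price": [1.0, 2.0, 3.0], "quantity": [10, 20, 30]})

# Slow: row-by-row in the Python interpreter
total_loop = 0.0
for idx, row in df.iterrows():
    total_loop += row["price"] * row["quantity"]

# Fast: one vectorized expression in NumPy's C backend
total_vec = (df["price"] * df["quantity"]).sum()

assert total_loop == total_vec  # 1*10 + 2*20 + 3*30 = 140.0
```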
## Quick Start

```bash
# PyPI (coming soon)
# pip install pdperf

# Install from source
git clone https://github.com/adwantg/pdperf.git
cd pdperf
pip install -e .

# Or with dev dependencies
pip install -e ".[dev]"
```

```bash
# Scan a file or directory
pdperf scan your_code.py
pdperf scan src/

# List all available rules
pdperf rules

# Get detailed explanation for a rule
pdperf explain PPO003
```

Example output:

```text
📁 etl/transform.py
  ⚠️ 45:12 [PPO001] Avoid df.iterrows() or df.itertuples() in loops; prefer vectorized operations.
     💡 Use vectorized column operations like df['a'] + df['b'], or np.where(), merge(), map(), groupby().agg().
  ❌ 67:8 [PPO003] Building DataFrame via append/concat in a loop is O(n²); accumulate in a list first.
     💡 Collect DataFrames in a list, then call pd.concat(frames, ignore_index=True) once after the loop.

📁 features/pipeline.py
  ⚠️ 23:15 [PPO002] Row-wise df.apply(axis=1) is slow; prefer vectorized operations.
     💡 Replace with df['x'] + df['y'], np.where(condition, a, b), Series.map(), or merge().
```
## CI-Friendly Guarantees

- No code execution: pdperf parses code using AST only; safe on any codebase
- Deterministic output: stable ordering by path → line → col → rule_id
- Schema-versioned JSON: `schema_version` field for tooling stability
- Pattern-based detection: doesn't require import resolution or `import pandas as pd`
Exit codes:

| Code | Meaning |
|---|---|
| `0` | No findings (or `--fail-on none`) |
| `1` | Findings at/above `--fail-on` threshold |
| `2` | Tool error (invalid args, parse error with `--fail-on-parse-error`) |
JSON output (`--format json`):

```json
{
  "schema_version": "1.0",
  "tool": "pdperf",
  "tool_version": "0.1.0",
  "total_findings": 3,
  "findings": [
    {
      "rule_id": "PPO001",
      "path": "src/etl.py",
      "line": 45,
      "col": 12,
      "severity": "warn",
      "message": "Avoid df.iterrows()...",
      "suggested_fix": "Use vectorized..."
    }
  ]
}
```

## Rules Reference

pdperf includes 8 rules targeting the most impactful Pandas performance anti-patterns:
| Rule | Name | Severity | Patchable | Confidence |
|---|---|---|---|---|
| PPO001 | iterrows/itertuples loop | ⚠️ WARN | ❌ | High |
| PPO002 | apply(axis=1) row-wise | ⚠️ WARN | ❌ | High |
| PPO003 | concat/append in loop | ❌ ERROR | ❌ | High |
| PPO004 | chained indexing | ❌ ERROR | 🔧 | High |
| PPO005 | index churn in loop | ⚠️ WARN | ❌ | High |
| PPO006 | .values → .to_numpy() | ⚠️ WARN | 🔧 | High |
| PPO007 | groupby().apply() | ⚠️ WARN | ❌ | Medium |
| PPO008 | string ops in loop | ⚠️ WARN | ❌ | Medium |
Legend:
- 🔧 = Auto-fixable with `--patch`
- ❌ = Not auto-fixable
- High confidence: Structural AST pattern match (precise)
- Medium confidence: Heuristic-based detection (see rule details for boundaries)

Note: pdperf is import-agnostic by design. In rare cases, non-pandas objects with similar method names (e.g., `.values`) may be flagged. Use `--ignore` or `--select` to control rules.

## Detailed Rule Examples

### PPO001: iterrows/itertuples loop
What it catches:
```python
# ❌ SLOW: Python loop with iterrows
for idx, row in df.iterrows():
    result.append(row['a'] * row['b'])

# ❌ SLOW: itertuples is faster but still not ideal
for row in df.itertuples():
    result.append(row.a * row.b)
```

Why it's slow:
- Each row iteration invokes the Python interpreter
- `iterrows()` creates a Series object per row (expensive!)
- No vectorization benefits from NumPy's C backend

The fix:

```python
# ✅ FAST: Vectorized operation
result = df['a'] * df['b']

# ✅ FAST: Use numpy for complex operations
result = np.where(df['a'] > 0, df['a'] * df['b'], 0)
```

### PPO002: Row-wise apply(axis=1)

What it catches:
```python
# ❌ SLOW: Row-wise apply with lambda
df['total'] = df.apply(lambda row: row['price'] * row['qty'], axis=1)

# ❌ SLOW: Row-wise apply with custom function
df['category'] = df.apply(categorize_row, axis=1)
```

Why it's slow:
- `axis=1` processes one row at a time
- Python function call overhead for each row

The fix:

```python
# ✅ FAST: Direct vectorized arithmetic
df['total'] = df['price'] * df['qty']

# ✅ FAST: Use np.where for conditionals
df['category'] = np.where(df['value'] > 100, 'high', 'low')

# ✅ FAST: Use np.select for multiple conditions
conditions = [df['value'] > 100, df['value'] > 50]
choices = ['high', 'medium']
df['category'] = np.select(conditions, choices, default='low')

# ✅ FAST: Use map for lookups
df['category'] = df['key'].map(category_mapping)
```

### PPO003: concat/append in loop

What it catches:
```python
# ❌ EXTREMELY SLOW: O(n²) complexity!
df = pd.DataFrame()
for file in files:
    chunk = pd.read_csv(file)
    df = pd.concat([df, chunk])  # Copies entire df each time!

# ❌ DEPRECATED AND SLOW: df.append (removed in pandas 2.0)
for item in items:
    df = df.append({'col': item}, ignore_index=True)
```

Why it's catastrophic: Each concat copies all existing data. After n iterations: 1 + 2 + 3 + ... + n = O(n²) copies.

⚠️ Note: `DataFrame.append()` was deprecated in pandas 1.4.0 and removed in 2.0. See pandas docs.

The fix:

```python
# ✅ FAST: Collect in list, concat once (O(n))
frames = []
for file in files:
    chunk = pd.read_csv(file)
    frames.append(chunk)
df = pd.concat(frames, ignore_index=True)

# ✅ EVEN FASTER: List comprehension
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
```

### PPO004: Chained indexing

What it catches:
```python
# ❌ DANGEROUS: May silently fail!
df[df['a'] > 0]['b'] = 10

# ❌ DANGEROUS: Same pattern with variable
mask = df['a'] > 0
df[mask]['b'] = 10
```

Why it's dangerous:
- `df[mask]` might return a copy (unpredictable!)
- `['b'] = 10` assigns to the copy, not the original
- Your data update is silently lost

Pandas warns with SettingWithCopyWarning, but warnings are often ignored. See Real Python's explanation.
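To see why `.loc` is the reliable alternative, a tiny sketch (toy data, assumes pandas is installed): a single `.loc` call always updates the original frame, never a hidden copy.

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({"a": [-1, 2, 3], "b": [0, 0, 0]})

# One .loc call: row selection and column assignment in a single step,
# so there is no intermediate object that might be a copy
df.loc[df["a"] > 0, "b"] = 10

assert df["b"].tolist() == [0, 10, 10]
```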
The fix:

```python
# ✅ CORRECT: Use .loc for safe assignment
df.loc[df['a'] > 0, 'b'] = 10

# ✅ CORRECT: With named mask
mask = df['a'] > 0
df.loc[mask, 'b'] = 10
```

### PPO005: Index churn in loop

What it catches:
```python
# ❌ WASTEFUL: Rebuilds index every iteration
for key in keys:
    df = df.reset_index()
    df = df.set_index('col')
    # ... process ...
```

Why it matters:
- `reset_index()` and `set_index()` create new DataFrame copies
- Index operations inside loops multiply the overhead

The fix:

```python
# ✅ BETTER: Set index once, outside loop
df = df.set_index('col')
for key in keys:
    # ... process without index changes ...
```

### PPO006: .values → .to_numpy()

What it catches:
```python
# ❌ DISCOURAGED: Inconsistent return type
arr = df.values
arr = df['col'].values
```

Why it matters:
- `.values` sometimes returns a NumPy array, sometimes an ExtensionArray
- Behavior depends on DataFrame dtypes
- `.to_numpy()` is explicit and always returns a NumPy array

📝 Note: Ruff rule PD011 (from pandas-vet) also flags this pattern.

The fix:

```python
# ✅ RECOMMENDED: Explicit conversion
arr = df.to_numpy()
arr = df['col'].to_numpy()

# With explicit dtype
arr = df.to_numpy(dtype='float64', copy=False)
```

### PPO007: groupby().apply()

What it catches:
```python
# ❌ SLOW: Custom function invoked per group
result = df.groupby('category').apply(lambda g: g['value'].sum())
```

Why it's slow:
- `apply()` invokes Python for each group
- Loses vectorization benefits

The fix:

```python
# ✅ FAST: Built-in aggregation
result = df.groupby('category')['value'].sum()

# ✅ FAST: Multiple aggregations with agg()
result = df.groupby('category').agg({
    'value': ['sum', 'mean'],
    'quantity': 'count'
})

# ✅ FAST: Named aggregations (pandas 0.25+)
result = df.groupby('category').agg(
    total=('value', 'sum'),
    average=('value', 'mean')
)
```

Detection boundary: PPO007 flags any `groupby(...).apply(...)` call. This is a heuristic; some `apply()` uses are unavoidable. Use `--ignore PPO007` if you have legitimate use cases.
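When replacing a `groupby().apply()`, it's worth checking that the built-in aggregation returns the same numbers. A minimal sketch (toy data, assumes pandas is installed):

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"category": ["a", "a", "b"], "value": [1, 2, 3]})

# Same result either way, but the built-in skips a Python call per group
via_apply = df.groupby("category").apply(lambda g: g["value"].sum())
via_builtin = df.groupby("category")["value"].sum()

assert via_apply.tolist() == via_builtin.tolist() == [3, 3]
```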
### PPO008: String ops in loop

What it catches:
```python
# ❌ SLOW: String processing in loop
for idx, row in df.iterrows():
    df.at[idx, 'name'] = row['name'].lower()
```

Why it's slow:
- Python string methods called one at a time
- Combined with iterrows overhead

The fix:

```python
# ✅ FAST: Vectorized string operations
df['name'] = df['name'].str.lower()
df['clean'] = df['text'].str.strip().str.replace('  ', ' ', regex=False)
```

Detection boundary: PPO008 only flags string methods (`.lower()`, `.strip()`, etc.) called on subscript expressions (e.g., `row['col']`) inside loops. It does not flag `.str` accessor usage.

## CLI Reference
```bash
pdperf scan <path>        # Scan files for anti-patterns
pdperf rules              # List all rules
pdperf explain <RULE_ID>  # Explain a specific rule in detail
```

| Option | Description | Default |
|---|---|---|
| `--format` | Output format: `text`, `json`, `sarif` | `text` |
| `--out` | Write output to file | stdout |
| `--select` | Only check these rules (comma-separated) | all |
| `--ignore` | Skip these rules (comma-separated) | none |
| `--severity-threshold` | Minimum severity: `warn`, `error` | `warn` |
| `--fail-on` | Exit 1 threshold: `warn`, `error`, `none` | `error` |
| `--fail-on-parse-error` | Exit 2 if any files have syntax errors | false |
| `--patch` | Generate unified diff for auto-fixable rules | off |
```bash
# Quick check of a single file
pdperf scan etl/transform.py

# Full project scan with JSON output for CI
pdperf scan src/ --format json --out reports/pdperf.json --fail-on error

# Generate SARIF for GitHub Security integration
pdperf scan . --format sarif --out results.sarif

# Focus on critical issues only
pdperf scan . --severity-threshold error --select PPO003,PPO004

# Generate auto-fix patch
pdperf scan . --patch out/fixes.diff
```

pdperf will support configuration via pyproject.toml:
```toml
[tool.pdperf]
select = ["PPO001", "PPO002", "PPO003", "PPO004", "PPO005"]
ignore = ["PPO006"]
severity_threshold = "warn"
fail_on = "error"
format = "json"
```

## How pdperf Works: Technical Deep-Dive

This section explains the internals of pdperf for curious developers. Whether you're a beginner or an expert, you'll understand exactly how we detect performance anti-patterns.
```text
┌─────────────┐      ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  Your Code  │ ───▶ │  AST Parser │ ───▶ │  Visitors   │ ───▶ │  Findings   │
│    (.py)    │      │  (Python)   │      │   (Rules)   │      │  (Report)   │
└─────────────┘      └─────────────┘      └─────────────┘      └─────────────┘
```
In simple terms: pdperf reads your Python code, converts it into a tree structure, walks through that tree looking for patterns that indicate slow code, and reports what it finds.
When Python reads your code, it doesn't see text; it sees a tree of instructions. This tree is called an Abstract Syntax Tree (AST).
Example code:
```python
for idx, row in df.iterrows():
    total += row['value']
```

What Python sees (simplified AST):

```text
For
├── target: Tuple(idx, row)
├── iter: Call
│   └── func: Attribute
│       ├── value: Name(df)
│       └── attr: "iterrows"
└── body: [AugAssign...]
```
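You can inspect this tree yourself with the standard library; a short, runnable sketch:

```python
import ast

# Parse the example above and grab the For node
tree = ast.parse("for idx, row in df.iterrows():\n    total += row['value']\n")
loop = tree.body[0]

assert isinstance(loop, ast.For)
assert isinstance(loop.iter, ast.Call)
assert loop.iter.func.attr == "iterrows"  # the attribute a rule can key on

# Pretty-print the iterator sub-tree (indent= requires Python 3.9+)
print(ast.dump(loop.iter, indent=2))
```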
| Approach | Pros | Cons |
|---|---|---|
| Regex on text | Simple | Breaks on formatting, comments, strings |
| Running code | Accurate | Dangerous, slow, needs dependencies |
| AST parsing ✅ | Safe, accurate, fast | Requires understanding tree structure |
pdperf uses Python's built-in ast module, the same parser Python itself uses. This means:
- ✅ 100% safe: we never execute your code
- ✅ Handles all Python syntax, even complex expressions
- ✅ Zero false positives from comments/strings: the AST ignores them

```python
import ast

# This is what pdperf does internally:
source_code = open("your_file.py").read()
tree = ast.parse(source_code)  # Convert text → tree
```

Instead of manually searching the tree, we use a Visitor: an object that automatically walks through every node in the tree and lets us react to specific node types.
Think of it like a security scanner at an airport:
- The scanner (visitor) checks every bag (node)
- It only alerts on specific items (patterns we care about)
- It doesn't modify anything β just observes
```python
class PandasPerfVisitor(ast.NodeVisitor):
    def visit_For(self, node):
        # Called for every 'for' loop in the code
        # Check if iterating over iterrows/itertuples
        ...

    def visit_Call(self, node):
        # Called for every function call
        # Check for concat(), apply(axis=1), etc.
        ...
```

Why this is elegant:
- Python automatically walks the entire tree
- We only write code for patterns we care about
- Adding new rules = adding new `visit_X` methods
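Putting the pieces together, a minimal self-contained visitor in the same spirit (a sketch, not pdperf's actual implementation):

```python
import ast

class IterrowsCounter(ast.NodeVisitor):
    """Record (line, col) of every for-loop over iterrows/itertuples."""
    def __init__(self):
        self.findings = []

    def visit_For(self, node):
        if (isinstance(node.iter, ast.Call)
                and isinstance(node.iter.func, ast.Attribute)
                and node.iter.func.attr in ("iterrows", "itertuples")):
            self.findings.append((node.lineno, node.col_offset))
        self.generic_visit(node)  # keep walking nested code

source = """
for idx, row in df.iterrows():
    total += row['value']

for x in range(10):      # ordinary loop: not flagged
    total += x
"""
counter = IterrowsCounter()
counter.visit(ast.parse(source))
assert counter.findings == [(2, 0)]
```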
Many anti-patterns are only problematic inside loops. For example:
- `pd.concat()` outside a loop → ✅ fine
- `pd.concat()` inside a loop → ❌ O(n²) performance

```python
class PandasPerfVisitor(ast.NodeVisitor):
    def __init__(self):
        self._loop_stack = []  # Track nested loops

    def visit_For(self, node):
        self._loop_stack.append(node)  # Enter loop
        self.generic_visit(node)       # Check children
        self._loop_stack.pop()         # Exit loop

    def _in_loop(self):
        return len(self._loop_stack) > 0
```

This enables context-sensitive rules like PPO003 (`concat` in a loop, only flagged when `_in_loop()` is true), and the same mechanism can support future loop-context rules such as `groupby` or `sort_values` inside a loop.
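A runnable sketch of the loop-stack idea (simplified; not pdperf's actual code):

```python
import ast

class LoopAwareVisitor(ast.NodeVisitor):
    """Flag .concat() calls, but only when they occur inside a loop."""
    def __init__(self):
        self._loop_stack = []
        self.flagged = []  # line numbers of concat-in-loop

    def visit_For(self, node):
        self._loop_stack.append(node)
        self.generic_visit(node)
        self._loop_stack.pop()

    def visit_Call(self, node):
        if (self._loop_stack
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "concat"):
            self.flagged.append(node.lineno)
        self.generic_visit(node)

source = """
df = pd.concat(frames)             # outside a loop: fine
for f in files:
    df = pd.concat([df, load(f)])  # inside a loop: flagged
"""
v = LoopAwareVisitor()
v.visit(ast.parse(source))
assert v.flagged == [4]
```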
Each rule looks for a specific AST pattern. Here's how the most important ones work:
Pattern: A For loop where the iterator is a call to .iterrows() or .itertuples()
```python
def visit_For(self, node):
    if isinstance(node.iter, ast.Call):
        if isinstance(node.iter.func, ast.Attribute):
            if node.iter.func.attr in ("iterrows", "itertuples"):
                self._add_finding("PPO001", node)
```

Visual breakdown:

```text
for idx, row in df.iterrows():
                │  └── Attribute(attr="iterrows")
                └───── For.iter = Call(func=Attribute...)
```
Pattern: A call to .concat() or pd.concat() while inside a loop
```python
def visit_Call(self, node):
    if self._in_loop():  # Only flag inside loops
        if isinstance(node.func, ast.Attribute):
            if node.func.attr == "concat":
                self._add_finding("PPO003", node)
```

Pattern: Assignment where the target is `df[x][y] = value`
This is tricky because we need to detect nested subscripts on the left side of an assignment:
```text
df[mask]["col"] = value
│   │      │
│   │      └── Subscript (inner)
│   └───────── Subscript (outer)
└───────────── This is the assignment target
```

```python
def visit_Assign(self, node):
    for target in node.targets:
        if isinstance(target, ast.Subscript):
            if isinstance(target.value, ast.Subscript):
                # Nested subscript = chained indexing!
                self._add_finding("PPO004", target)
```

Not all detections are equally reliable. pdperf includes a confidence score with each finding:
| Level | Meaning | Example |
|---|---|---|
| High | Structural match, very reliable | iterrows() in for loop |
| Medium | Heuristic, some false positives possible | groupby().apply() |
| Low | Suggestion only | (future rules) |
```python
@dataclass
class Finding:
    rule_id: str
    confidence: Confidence   # HIGH, MEDIUM, LOW
    confidence_reason: str   # Human-readable explanation
```

Why this matters:
- CI can filter: `--min-confidence high`
- Users understand the reliability of each finding
- Reduces "alert fatigue" from uncertain warnings
For CI/CD reliability, pdperf guarantees deterministic output:
```python
# Findings are always sorted by:
findings.sort(key=lambda f: (f.path, f.line, f.col, f.rule_id))
```

This means:
- Same code → same JSON output
- No flaky CI builds
- Diffs are meaningful
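The composite sort key is easy to verify with plain tuples (a sketch; pdperf's real findings are dataclass instances):

```python
# (path, line, col, rule_id) tuples standing in for Finding objects
findings = [
    ("src/b.py", 10, 4, "PPO003"),
    ("src/a.py", 45, 12, "PPO001"),
    ("src/a.py", 45, 12, "PPO004"),
]

# Same key as pdperf: path, then line, then col, then rule_id
findings.sort(key=lambda f: (f[0], f[1], f[2], f[3]))

# Ties on path/line/col fall back to rule_id, so order is always stable
assert [f[3] for f in findings] == ["PPO001", "PPO004", "PPO003"]
```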
```text
┌──────────────────────────────────────────────────────────────┐
│                           pdperf                             │
├──────────────────────────────────────────────────────────────┤
│  cli.py        │ Entry point, argument parsing, output       │
│  analyzer.py   │ AST parsing, visitor, finding creation      │
│  rules.py      │ Rule definitions, severity, messages        │
│  config.py     │ pyproject.toml loading, profiles            │
│  reporting.py  │ JSON, text, SARIF output formatting         │
└──────────────────────────────────────────────────────────────┘
```
| File | Responsibility | Key Classes/Functions |
|---|---|---|
| `analyzer.py` | Core detection engine | `PandasPerfVisitor`, `Finding`, `analyze_path` |
| `rules.py` | Rule registry | `Rule`, `Severity`, `Confidence`, `RULES` dict |
| `config.py` | Configuration | `Config`, `load_config`, `PROFILES` |
| `cli.py` | User interface | `build_parser`, `cmd_scan`, `cmd_explain` |
| `reporting.py` | Output formatting | `format_text`, `write_json`, `write_sarif` |
| Operation | Algorithm | Complexity |
|---|---|---|
| AST parsing | Python's built-in parser | O(n) where n = file size |
| Tree traversal | Depth-first visitor | O(nodes); visits each node once |
| Pattern matching | Direct attribute checks | O(1) per node |
| Finding sorting | Timsort | O(k log k) where k = findings |
Total complexity: O(n) for a single file; linear in code size.
Benchmark: pdperf scans ~10,000 lines/second on typical hardware.
| Design Choice | Benefit |
|---|---|
| AST, not regex | Handles all valid Python syntax correctly |
| Visitor pattern | Clean separation, easy to add rules |
| Loop stack | Context-aware detection (loop vs. not-loop) |
| No type inference | Fast, no dependencies, works on any code |
| Confidence levels | Users trust findings at appropriate level |
| Deterministic output | Reliable CI integration |
| Limitation | Why It Exists | Mitigation |
|---|---|---|
| No type inference | Would require running code | Use `--ignore` for false positives |
| Import-agnostic | Can flag non-pandas `.values` | Filter with `--select` |
| Syntax errors skip file | Can't parse invalid Python | Use --fail-on-parse-error |
| No cross-file analysis | Keeps tool simple and fast | May miss imported patterns |
Want to add a new rule? Here's the template:
```python
# 1. Define in rules.py
PPO011 = register_rule(Rule(
    rule_id="PPO011",
    name="your-rule-name",
    severity=Severity.WARN,
    message="...",
    suggested_fix="...",
    confidence=Confidence.HIGH,
))

# 2. Detect in analyzer.py
def visit_Call(self, node):
    if self._should_check("PPO011"):
        if your_detection_logic(node):
            self._add_finding("PPO011", node)
```

## Integrations

Run pdperf in CI:

```bash
pdperf scan . --format json --out pdperf.json --fail-on error
```

Add to .pre-commit-config.yaml:
```yaml
repos:
  - repo: local
    hooks:
      - id: pdperf
        name: pdperf (pandas performance linter)
        entry: pdperf scan --fail-on error
        language: python
        types: [python]
```

GitHub Actions workflow:

```yaml
name: Lint
on: [push, pull_request]

jobs:
  pdperf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -e .
      - run: pdperf scan src/ --format sarif --out results.sarif --fail-on error
      - uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: results.sarif
```

```bash
# Install dev dependencies
pip install -e ".[dev]"
pip install pytest

# Run all tests (33 tests)
python -m pytest tests/ -v
```

```bash
# Check version
pdperf --version
# → pdperf 0.1.0

# List rules (should show 8 rules)
pdperf rules

# Test on example files
pdperf scan examples/
```

Project layout:

```text
pandas-perf-optimizer/
├── src/pandas_perf_opt/
│   ├── __init__.py            # Package version
│   ├── analyzer.py            # AST-based detection engine
│   ├── cli.py                 # Command-line interface
│   ├── reporting.py           # JSON/text/SARIF output
│   └── rules.py               # Rule definitions & explanations
├── tests/
│   ├── test_rules.py          # 33 golden tests
│   └── test_smoke.py          # Version test
├── examples/
│   ├── slow_iterrows.py       # PPO001 example
│   ├── slow_apply_axis1.py    # PPO002 example
│   └── slow_concat_in_loop.py # PPO003 example
├── pyproject.toml             # Package configuration
├── Makefile                   # Dev commands
└── README.md                  # This file
```
| Dependency | Supported |
|---|---|
| Python | 3.10+ |
| Pandas | 1.5+, 2.x (detection is version-agnostic) |
- Pandas Performance Guide: official pandas performance tips
- SettingWithCopyWarning Explained: Real Python guide
- DataFrame.to_numpy(): why `.to_numpy()` over `.values`
- DataFrame.append() Deprecation: pandas 1.4+ deprecation notice
- Ruff PD011: Ruff's `.values` rule (similar to PPO006)