A research pipeline for cataloging Java deadlock patterns, extracting lock graphs from real-world repositories, injecting realistic deadlocks into target codebases, and benchmarking AI agents on finding and fixing them — all in a single automated workflow.
Deadlocks are one of the most feared classes of production bugs in multi-threaded systems — they cause complete service hangs, are non-deterministic, and notoriously difficult to diagnose. Can LLM-based coding agents reliably find and fix deadlocks in real codebases? To answer this, we need:
- A catalog of real deadlock patterns from known bugs
- A tool to analyze lock ordering in any Java codebase
- A method to inject realistic deadlocks at natural insertion points
- A benchmark harness to evaluate AI agents on fixing them
This repo implements each stage of that pipeline.
deadlock_catalog.md
deadlock_patterns.json
JaConTeBe Benchmark ──────────┐
(47 real concurrency bugs) │
▼
Pattern Catalog
(23 deadlocks + 1 synthetic,
6 pattern types)
│
▼
Target Repo ──► Lock Graph Pipeline ──► lock_graph.json
(e.g. Kafka) (tree-sitter + (lock orderings,
Facebook Infer) nested acquisitions,
wait/notify sites)
│
▼
Deadlock Injection
(LLM-guided, pattern-matched)
│
▼
Harbor Benchmark Tasks
(Docker + Fray verification)
│
▼
AI Agent Evaluation
(Claude Opus 4.6 via Harbor)
.
├── deadlock_catalog.md # Human-readable catalog of 23 deadlock bugs
├── deadlock_patterns.json # Machine-readable patterns with lock graphs
├── lock_graph_pipeline/ # Python tool: extract lock graphs from Java repos
├── harbor_tasks_clean/ # Harbor benchmark tasks (active, Kafka Streams)
│ ├── README.md # Task descriptions and running instructions
│ ├── synth001-graph/ # 3-node cycle, recognizable class names
│ ├── synth001-nograph/ # 3-node cycle, obfuscated class names
│ ├── dbcp270-graph/ # ABBA cycle, recognizable
│ ├── dbcp270-nograph/ # ABBA cycle, obfuscated
│ ├── derby5560-graph/ # Serialization graph cycle, recognizable
│ ├── derby5560-nograph/ # Serialization graph cycle, obfuscated
│ └── mixed/ # 4 bugs simultaneously (very_hard)
├── results2/ # Evaluation results (Claude Opus 4.6, March 2026)
│ ├── README.md # Full report with per-trial results
│ ├── synth001-graph/ # 1 trial (1 success)
│ ├── synth001-nograph/ # 9 trials (2 successes)
│ └── mixed/ # 5 trials (0 successes)
├── tasks/ # Injection targets (real codebases with injected bugs)
│ ├── README.md # Guide for creating new injection tasks
│ └── zookeeper-deadlock/ # ZooKeeper with injected ABBA deadlock
├── output/ # Generated lock graph artifacts
├── JaConTeBe_TSVD/ # Cloned JaConTeBe benchmark (47 bug kernels)
├── jobs/ # Raw Harbor job outputs (local Mac runs)
├── harbor_tasks/ # [DEPRECATED] Early Harbor tasks (SCTBench ports)
└── results/ # [DEPRECATED] Earlier results snapshot (use results2/)
Derived from the JaConTeBe benchmark (ASE 2015) — 47 concurrency bug kernels from 8 open-source projects (DBCP, Derby, Groovy, JDK, Log4j, Lucene, Commons Pool).
We cataloged 23 deadlock bugs into 5 recurring patterns, plus 1 synthetic injection pattern adapted from a Go concurrency bug observed in a separate project:
| Pattern | Count | Example |
|---|---|---|
| Two-Object ABBA Cycle | 9 | Thread 1 locks A→B, Thread 2 locks B→A |
| Callback-Induced Cycle | 4 | Lock held while calling back into locking code |
| All-Waiters / Missed Notify | 4 | notify() instead of notifyAll(), or signal before wait |
| Serialization Graph Cycle | 3 | Lock acquired during readObject()/writeObject() |
| Infrastructure-vs-Application Lock | 3 | Framework lock conflicts with application lock |
| 3-Node Conditional Callback Cycle | 1 | 3-lock cycle with conditional closing edge via listener callback (SYNTH-001) |
The synthetic pattern (SYNTH-001) was derived from observing a 3-node mutex+channel deadlock in a Go codebase and carefully adapting it to idiomatic Java concurrency primitives (synchronized, ReentrantReadWriteLock, ReentrantLock + Condition). The Go channel dependency was translated to a callback/listener pattern — the closest Java analog that preserves the key difficulty property of data-flow-hidden lock dependencies. See the catalog entry for full details on what translates faithfully and what was adapted.
deadlock_catalog.md— Detailed descriptions with lock graphs for each bugdeadlock_patterns.json— Structured data for programmatic use (lock graphs, abstract templates, injection metadata)
Extracts a machine-readable lock graph from any Java codebase. See lock_graph_pipeline/README.md for full documentation.
pip install -r lock_graph_pipeline/requirements.txt
# Fast AST-based extraction (no build required)
python -m lock_graph_pipeline /path/to/java/src -o output/lock_graph.json -m output/lock_graph.md
# Deep analysis using Facebook Infer
python -m lock_graph_pipeline /path/to/java/src --mode infer --infer-out /path/to/infer-out
# Both modes merged
python -m lock_graph_pipeline /path/to/java/src --mode both --infer-out /path/to/infer-outOutput includes: lock acquisitions per class, lock ordering edges (A→B), wait/notify sites, calls made under lock, field type declarations, and candidate pattern matches against the catalog.
Modes:
| Mode | Speed | Build Required | Coverage |
|---|---|---|---|
light |
~1s per 400 files | No | Intra-method lock orderings |
infer |
Minutes | Yes (Maven/Gradle) | Deep interprocedural edges |
both |
Minutes | Yes | Maximum (merged) |
Uses lock graph output + pattern catalog to inject realistic deadlocks into target repos at natural insertion points. See tasks/README.md for the general injection workflow and tasks/zookeeper-deadlock/INJECTION_NOTES.md for a worked example.
Example: DBCP-270 pattern (Two-Object ABBA Cycle) injected into ZooKeeper's leader election code:
Leader.sendObserverPacket()acquiresLeader.this→LearnerHandler.thisLearnerHandler.getLearnerHandlerInfo()acquiresLearnerHandler.this→Leader.this- Verified with
ThreadMXBean.findDeadlockedThreads()under concurrent load
Standalone Harbor tasks that package injected bugs into Apache Kafka Streams for AI agent evaluation using Fray for systematic concurrency testing.
# Run a task with Claude Code agent
ANTHROPIC_API_KEY=<key> harbor run \
--path harbor_tasks_clean/synth001-nograph \
--agent claude-code \
--model claude-opus-4-6 \
--n-concurrent 1 \
--no-deleteAvailable tasks (in harbor_tasks_clean/):
| Task | Difficulty | Bug Type | Variant |
|---|---|---|---|
synth001-graph |
hard | 3-node conditional callback cycle | Recognizable class names |
synth001-nograph |
hard | 3-node conditional callback cycle | Obfuscated class names |
dbcp270-graph |
hard | Two-object ABBA cycle | Recognizable |
dbcp270-nograph |
hard | Two-object ABBA cycle | Obfuscated |
derby5560-graph |
hard | Serialization graph cycle | Recognizable |
derby5560-nograph |
hard | Serialization graph cycle | Obfuscated |
mixed |
very_hard | 4 concurrent bugs (deadlocks + missed signal) | — |
Each task includes:
- Dockerfile: Builds Fray from source, pre-compiles Kafka Streams with injected bugs
- instruction.md: Bug description shown to the agent (no spoilers)
- test.sh: Recompiles agent's fix, runs Fray, writes reward (0 or 1)
- solve.sh: Gold patch for oracle baseline
Claude Opus 4.6 was evaluated on the synth001 and mixed tasks across two environments (local Mac with QEMU emulation, AWS VM with native x86_64). See results2/README.md for the full report.
Summary (15 valid trials, March 9–10, 2026):
| Task | Valid Runs | Successes | Pass Rate |
|---|---|---|---|
| synth001-graph | 1 | 1 | 100% (insufficient sample) |
| synth001-nograph | 9 | 2 | 22.2% |
| mixed | 5 | 0 | 0% |
Key findings:
- The most common failure mode is the agent identifying the wrong lock pattern — a real-looking pattern in existing Kafka code rather than the injected bug
- The
mixedtask (4 simultaneous bugs) appears too hard for a single agent session within the timeout - Obfuscated class names (nograph variant) significantly increase difficulty
- Python 3.10+ with
tree-sitter,tree-sitter-java - Docker (for Harbor tasks — images are x86_64, runs under QEMU on Apple Silicon)
- Harbor (
pip install harbor-bench) for agent evaluation - Facebook Infer (optional, for deep lock graph analysis)
- Lin et al., JaConTeBe: A Benchmark Suite of Real-World Java Concurrency Bugs, ASE 2015
- Fray — Systematic JVM concurrency testing (CMU PASTA Lab)
- Spaghetti Bench — Concurrency bug fixing benchmark (CMU PASTA Lab)
- Harbor — AI agent evaluation framework
Thank you!