GPU-accelerated evolution of Finite State Machines (FSMs) to solve the Tartarus problem, and a behavior discovery methodology to explain the resulting agent's policy.
A 128-state FSM is evolved using a genetic algorithm on CUDA, trained on all 74,760 valid board configurations, achieving a true score of 9.84 where 90% of boards solved with a perfect 10. The agent's policy is analyzed through spectral clustering of a tactic transition graph and Infomap community detection on the state transition graph.
Tartarus is a benchmark problem for evaluating artificial agents. An agent operates on a 6x6 grid containing 6 boxes. The goal is to push boxes to the walls (edges) of the grid within 80 moves.
Scoring:
- Box on edge (not corner): +1 point
- Box in corner: +2 points
- Maximum possible score: 10 points (4 boxes in corners + 2 on edges)
Agent capabilities:
- Sees 8 neighboring cells (each can be: empty, box, or wall)
- 3 actions: move forward, turn left, turn right
- Can push a box if the cell behind it is empty
Problem space:
- 383 valid sensory combinations (out of 3^8 = 6561 possible)
- 1,869 unique board layouts (6 boxes in 4x4 inner grid)
- 74,760 total configurations (1,869 layouts x 10 starting positions x 4 directions)
| File | Description |
|---|---|
kernel.cu |
Original kernel from the 2020 paper, developed for RTX 1050 |
kernel_2026.cu |
Updated kernel for modern GPUs (RTX 50 series) with sampled board training |
kernel_allboards.cu |
Trains on all 74,760 boards using the baseline D0 design |
kernel_D2_allboards.cu |
Main training kernel. Trains on all 74,760 boards using the D2 design (per-gene fitness crossover). Produced the 9.84 agent |
kernel_test_all.cu |
Evaluates a trained solution on all 74,760 boards. Outputs score distribution and per-state usage statistics |
kernel_analyze.cu |
Generates CSV files for state-combination heatmaps, transition matrices, and push statistics |
| File | Description |
|---|---|
analyze_agent.py |
Runs the FSM agent on all 74,760 boards in Python. Builds the state transition graph, combination transition graph, and tactic transition graph. Saves sequences and statistics as .json and .pkl files |
behavior_segmentation.py |
Spectral clustering on the tactic transition graph. Combines transition probabilities with Hamming similarity (controlled by lambda) to discover behavioral clusters |
lambda_sensitivity.py |
Sweeps lambda from 0.0 to 1.0 to validate that the number of clusters is stable |
state_clustering.py |
InfoMap state clustering |
print_cluster_tactics.py |
Extracts and prints the dominant tactic sequences within each behavioral cluster |
analyze_communities.py |
Infomap community detection on the state transition graph. Identifies state modules based on transition flow |
null_model.py |
Null model test for behavior-state alignment. Generates 10,000 random partitions and computes ARI/NMI distributions to test significance |
ablation_study.py |
Lesion study: removes state clusters and evaluates the agent on all boards. Transitions into ablated states are redirected to random non-ablated states |
| File | Description |
|---|---|
viewer_2026.py |
Step-by-step visual replay of the agent on a Tartarus board. Saves each step as a PNG image with state/action annotations |
visualize_graph.py |
Visualizes the state transition graph with node sizes proportional to visit frequency and edge thickness proportional to transition weight. Supports interactive HTML output via PyVis |
visualize_tactic_pattern.py |
Renders tactic sequences as 3x3 grids showing the agent's perception and action at each step |
visualize_analysis.py |
Generates heatmaps, network graphs, and behavioral profiles from kernel_analyze.cu CSV output |
| File | Description |
|---|---|
analyze_behaviors.py |
Earlier behavior analysis approach |
analyze_combo_behaviors.py |
Combination-level behavior analysis |
behavior_cluster_decoder.py |
Attempts to decode actions within behavior clusters |
segment_behaviors.py |
Alternative segmentation approach |
segment_by_independence.py |
Independence-based segmentation |
sequitur_analysis.py |
Sequitur grammar induction on action sequences |
find_patterns.py |
Pattern finding in state sequences |
combo_find_patterns.py |
Pattern finding in combination sequences |
tactic_find_patterns.py |
Pattern finding in tactic sequences |
query_state.py |
Interactive query tool for inspecting individual states |
visualize_brain_map.py |
Brain-map-style visualization of state modules |
visualize_combo_graph.py |
Visualizes the combination transition graph |
analyze_iidx.py |
Compares inverted index arrays between kernels |
viewer.py |
Original step-by-step viewer (2020, hardcoded Windows paths) |
viewer_comparison.py |
Side-by-side comparison of two agents on the same board |
testing.py |
Python-based fitness evaluation on random boards (2020) |
| File | Description |
|---|---|
realboard.txt |
All 1,869 unique board layouts |
boards.txt |
Board file from before the unique board count was established |
best/b-D2-4096-128-3000-1.txt |
Best agent chromosome (the 9.84 agent) |
best/r-D2-4096-128-3000-1.txt |
Training log (generation, best, average) |
best/sc_txt_b-D2-4096-128-3000-1.txt |
Score distribution from kernel_test_all |
best/st_txt_b-D2-4096-128-3000-1.txt |
Per-state usage statistics |
best/results_txt_b-D2-4096-128-3000-1.txt |
Per-board results |
images/ |
Agent and box sprites for the viewer |
CUDA (for training and testing):
- NVIDIA GPU with CUDA support
- CUDA Toolkit (nvcc compiler)
Python (for analysis and visualization):
pip install -r requirements.txt
Key libraries: numpy, matplotlib, networkx, infomap, scikit-learn, scipy, seaborn, pandas, pillow, pyvis
Compile and run the D2 all-boards kernel:
nvcc -arch=sm_120 -o TartarusD2.exe kernel_D2_allboards.cu -lcurand
./TartarusD2.exe <N> <S> <K> <L>| Parameter | Description | Paper value |
|---|---|---|
| N | Population size | 4096 |
| S | Number of FSM states | 128 |
| K | Number of generations | 3000 |
| L | Run ID (for file naming) | 1 |
./TartarusD2.exe 4096 128 3000 1Use -arch=sm_120 for RTX 50 series (Blackwell), -arch=sm_89 for RTX 40 series, -arch=sm_86 for RTX 30 series.
Output files are saved to the txt/ directory:
txt/r-D2-N-S-K-L.txt-- Training log (generation, best score, average score)txt/b-D2-N-S-K-L.txt-- Best agent chromosome
nvcc -arch=sm_120 -o TartarusTestAll.exe kernel_test_all.cu -lcurand
./TartarusTestAll.exe <solution_file> <S>./TartarusTestAll.exe best/b-D2-4096-128-3000-1.txt 128Outputs:
- Average true score across all 74,760 boards
sc_<solution_file>-- Score distribution (how many boards scored 10, 9, 8, ...)st_<solution_file>-- Per-state usage countsresults_<solution_file>-- Per-board scores
Run the agent in Python on all 74,760 boards to collect state, combination, and tactic transition graphs:
python analyze_agent.py best/b-D2-4096-128-3000-1.txt 128This produces analysis_*.json and analysis_*.pkl files used by subsequent scripts.
Sweep lambda values and find k.
python lambda_sensitivity.pyCluster the 64 tactics into behavioral groups:
python behavior_segmentation.py --lambda 0.25To see the dominant tactic sequences in each cluster:
python print_cluster_tactics.pyDetect communities in the state transition graph using k:
python state_clustering.py --k 7Test whether state modules and behavior clusters are aligned:
python null_model.pyMeasure the impact of removing state modules:
python ablation_study.py best/b-D2-4096-128-3000-1.txt 128Replay the agent step by step on a board:
python viewer_2026.py best/b-D2-4096-128-3000-1.txt 128Visualize the state transition graph:
python visualize_graph.py [--interactive]Visualize a tactic sequence:
python visualize_tactic_pattern.py --numeric 34-34-35Each individual in the population consists of:
Sstates (e.g., 128)383valid sensor combinationsG = S x 383 + 1genes total
Each gene is a tuple of (action, next_state, used, fitness):
action: 0 (forward), 1 (turn left), 2 (turn right)next_state: which state to transition to (0 to S-1)used: how many times this gene is activated during evaluationfitness: cumulative score contribution of this gene
The last gene (G-1) stores the initial state. During crossover, offspring inherit the gene with the higher fitness-to-usage ratio, ensuring beneficial genes are preserved.
| Configuration | States | Boards | Design | True Score |
|---|---|---|---|---|
| 2020 Paper | 12 | 256 (sampled) | D2 | 8.54 |
| This paper | 128 | 74,760 (all) | D2 | 9.84 |
Score distribution for the 9.84 agent:
| Score | Boards | Percentage |
|---|---|---|
| 10 | 67,499 | 90.2867% |
| 9 | 5,903 | 7.8959% |
| 8 | 520 | 0.6956% |
| 7 | 50 | 0.0669% |
| 6 | 1 | 0.0013% |
| 5 | 1 | 0.0013% |
| 4 | 666 | 0.8909% |
| 3 | 97 | 0.1297% |
| 2 | 23 | 0.0308% |
| 1 | 0 | 0.00% |
| 0 | 0 | 0.00% |
[1] K. Oguz, "Adaptive Evolution of Finite State Machines for the Tartarus Problem," 2019 Innovations in Intelligent Systems and Applications Conference (ASYU), 2019, pp. 1-5, doi: 10.1109/ASYU48272.2019.8946413
[2] K. Oguz, "True scores for tartarus with adaptive GAs that evolve FSMs on GPU," Information Sciences, 2020, pp. 1-15, doi: 10.1016/j.ins.2020.03.072
[3] K. Oguz, "Estimating the difficulty of Tartarus instances," Pamukkale Univ Muh Bilim Derg, 2021, pp. 114-121, doi: 10.5505/pajes.2020.00515