A semester-long project implementing a tiny compiler in Python for a restricted subset of C.
The compiler goes from source C all the way to optimized IR and Intel-style pseudo-x86 assembly with optional register allocation and stack frames.
-
Front end
- Hand-written lexer (
lexer.py) - Recursive-descent parser (
parser.py) - Typed AST (
abstract_syntax_tree.py) - Semantic analysis with nested scopes and symbol tables (
semantic.py)
- Hand-written lexer (
-
Intermediate representations
- Human-readable three-address code (TAC) (
tac.py) - Internal IR with
Instr/Block/Function(ir/)
- Human-readable three-address code (TAC) (
-
Optimizations (IR level)
- Constant propagation
- Constant folding
- Unreachable code elimination
- Dead store elimination
- Copy propagation
- Simple algebraic simplifications
-
Code generation
- Pseudo-x86 IR (
codegen/x86ir.py) - Lowering from IR to pseudo-x86 (
codegen/pseudo_x86.py) - Optional stack frames for locals
- Graph-coloring register allocation and spilling (
codegen/ra.py)
- Pseudo-x86 IR (
-
Debugging / inspection
- Dump tokens, AST, symbol tables, TAC, IR blocks, and CFG
- Run individual optimization passes and trace them one by one
- Only
inttype (for both variables and function return). - No function parameters and no function calls (yet).
- Statements:
- Variable declarations:
int x;,int x, y; if/if-elsewhilereturn expr;- Expression statements:
x = x + 1;
- Variable declarations:
- Expressions:
- Binary:
+,-,*,/,%,==,!=,<,<=,>,>=,&&,|| - Unary:
!, unary-, unary+ - Parentheses, variables, integer literals
- Binary:
Modulo % is accepted syntactically but not fully supported in codegen yet.
compiler.py– main CLI entry pointlexer.py– tokenizationparser.py– recursive-descent parserabstract_syntax_tree.py– AST node definitionssemantic.py– scopes, symbol tables, semantic checkserrors.py– customLexerError,ParserError,SemanticError
IR and optimization:
tac.py– TAC generation from ASTir/ir_types.py– internal IR value and instruction typesir/tac_adapter.py– TAC ↔ IR conversionir/builder.py– basic block & CFG builderir/pretty.py– IR / CFG printing utilitiesir/const_prop.py,ir/const_fold.py,ir/dce.py,ir/fuse.py,ir/copy_prop.py,ir/algebra.py– optimization passesir/pipeline.py–optimize_function(fn, opt_level)ir/passes.py– named pass mapping +run_passes(...)
Code generation:
codegen/x86ir.py– pseudo-x86 dataclasses + printercodegen/pseudo_x86.py– IR → pseudo-x86 lowering, frame layout, prologue/epiloguecodegen/ra.py– liveness, interference graph, greedy coloring, spilling
Tests:
tests/lexer_output_testing/tests/parser/{valid,invalid}/tests/semantic/{valid,invalid}/tests/tac_test/tests/optimization/*tests/codegen-x86/
From the project root:
python3 compiler.py [options] input.c# Just lex the file and print tokens
python3 compiler.py -l input.c
# Parse and pretty-print the AST
python3 compiler.py -p input.c
# Run semantic analysis (will exit on error)
python3 compiler.py -s input.c
# Print function symbol tables
python3 compiler.py --symtab input.c
# Generate TAC (no optimization)
python3 compiler.py --tac input.c# Use built-in optimization pipeline at level 1+
python3 compiler.py -O1 --tac input.c
python3 compiler.py -O2 --tac input.c
python3 compiler.py -O3 --tac input.c
# Quick alias: enable constant folding (treated like -O1+)
python3 compiler.py --constfold --tac input.c
# Manually choose passes (order matters)
python3 compiler.py --passes constprop,constfold,dse,copyprop,algebra --tac input.c
# Trace IR after each named pass
python3 compiler.py --passes constprop,constfold --trace-passes --tac input.cAvailable pass names (for --passes):
- constprop
- constfold
- drop_unreachable
- dse
- copyprop
- algebra
# Dump basic blocks after CFG construction
python3 compiler.py --tac --dump-blocks input.c
# Also show CFG successors along with each block
python3 compiler.py --tac --dump-blocks --dump-cfg input.c
# Show blocks after the optimization pipeline
python3 compiler.py --tac -O2 --dump-blocks-after input.cTo lower to Intel-style pseudo-x86L
# Simple pseudo-x86 (virtual registers, no stack frame)
python3 compiler.py --emit-pseudo-x86 input.c
# Pseudo-x86 with stack frame for locals (rbp/rsp based)
python3 compiler.py --emit-pseudo-x86 --frame stack input.c
# Pseudo-x86 + stack frame + register allocation
python3 compiler.py --emit-pseudo-x86 --frame stack --ra input.cNotes:
-
--emit-pseudo-x86prints a minimal function header followed by instructions:function mainpush rbp,mov rbp, rsp, etc. when--frame stackis used.
-
Temporaries are lowered into virtual registers R1, R2, ...
With--rathese are replaced by caller-saved x86 registers, with spills stored on the stack.
# Basic TAC with optimizations
python3 compiler.py -O2 --tac tests/optimization/ir_unreach/nested_if_const.c
# Visualize CFG before and after optimization
python3 compiler.py --tac --dump-blocks tests/optimization/ir_unreach/nested_if_const.c
python3 compiler.py -O2 --tac --dump-blocks-after tests/optimization/ir_unreach/nested_if_const.c
# Compare TAC vs pseudo-x86
python3 compiler.py -O2 --tac tests/codegen-x86/simple.c
python3 compiler.py -O2 --emit-pseudo-x86 --frame stack --ra tests/codegen-x86/simple.c- No function parameters or calls yet
- No arrays, pointers, or structs.
- % is not wired all the way through code generation.
- No real assembler or linker and output is pseudo-x86 and direct execution.