A single-layer, single-head transformer written in PDP-11 assembly language.
This project is the spiritual successor of Xortran, a neural network that learns XOR with backpropagation in Fortran IV on the IBM 1130 (1965) and PDP-11/20 (1970).
The natural next step was to see if those machines could successfully train a small transformer in an acceptable amount of time (a few hours).
Architecturally, a transformer is actually a fairly modest extension of a basic neural network. The building blocks such as matrix multiplies, backpropagation, SGD, and cross-entropy are already there.
The three new components are:
- Self-attention: dot-product score between projected queries and keys
- Positional encoding: learned position embeddings, added to the input
- Softmax: to turn scores into a probability distribution
The goal is to train the Transformer to reverse a sequence of digits. Despite its apparent simplicity, reversal is not a trivial task for a neural network: the model must learn to route each token to a position that depends only on its index, with no content-based shortcut. This is the kind of problem that self-attention is designed for, and is in fact one of the algorithmic benchmarks included in Tensor2Tensor, Google's reference implementation of the original transformer in 2017.
The data path is straightforward: tokens are embedded, passed through self-attention with a residual connection, then projected back to the vocabulary and softmaxed into a prediction:
Tokens -> Embedding -> Self-Attention -> Residual -> Projection -> Softmax
| Hyperparameter | Value |
|---|---|
| Layers | 1 |
| Heads | 1 |
| d_model | 16 |
| Sequence length | 8 |
| Vocabulary | 10 (digits 0–9) |
| Parameters | 1,216 |
The model is an encoder-only transformer: embedding, self-attention with residual connection, and output projection. It's a genuine Transformer with self-attention, but not BERT or a GPT either: it has no layer norm, no feed-forward network, and no decoder. The task requires no transformation of the token representations, so attention and the residual connection are sufficient. Layer normalization, useful in deeper networks to prevent activation drift, is unnecessary with a single layer.
The first implementation followed Xortran and was written in Fortran IV. With a uniform learning rate of 0.01, the model took 25mn for 100 steps and needed 1,500 training steps to reach 100% accuracy, which on real hardware would have translated to about 6.5 hours of training, and possibly a whole week on the IBM 1130.
This was unacceptably long even by 1970s standards, as those machines were time-shared and computing time was very valuable.
A first improvement was the switch to hand-tuned, per-layer learning rates:
| Layer | Learning Rate |
|---|---|
| Wq, Wk, Wv (attention) | 0.08 |
| Token & position embeddings | 0.01 |
| Wout (output projection) | 0.0025 |
Attention weights, which encode the reversal pattern, benefit from a high learning rate, while the output projection converges better with a smaller one. With this tuning, training dropped to 600 steps and an estimated 2.5 hours.
The optimizer is plain stochastic gradient descent (SGD). Adam would automatically adapt the step size per parameter, but at the cost of two extra state vectors per weight, tripling the memory devoted to parameters. It would also add a square root and division per update, both expensive on the PDP-11 even with the EIS. The per-layer learning rates achieve a similar effect at no additional cost, and the model is small enough for the three rates to be tuned by hand. It also allows the Transformer to fit in 32KB of core memory instead of 64KB, which was important in the 1970s.
Side note: since it's bare-metal assembly, ATTN/11 doesn't use more memory than Xortran, which pays the cost of RT-11 V3 and the Fortran runtime. The resulting binary is also fairly compact, at exactly 6,179 bytes.
The core arithmetic operations use NN11, a minimal fixed-point neural network stack designed for ATTN/11 and the PDP-11.
NN11 is organized in levels not unlike BLAS: scalar primitives (FXMATH), vector operations such as dot product and scaling (VECOP), then matrix–vector operations (MATOP), each level building on the one below.
Two additional modules extend the stack beyond linear algebra: activation functions and their lookup tables (ACTFN), and layer-level routines (LAYER) that compose the previous operations into embedding, projection, and attention.
The arithmetic is adapted to each pass:
| Pass | Format | Precision |
|---|---|---|
| Forward | Q8 | 8 fractional bits (1/256) |
| Backward | Q15 | 15 fractional bits (1/32768) |
| Weight accumulators | Q16 | 32-bit (16.16 fixed-point) |
The choice of Q8 forward and Q15 backward pairs well on the PDP-11: multiplying a Q8 value by a Q15 value yields Q23 in a 32-bit register pair, and a single ASHC #-8 brings it back to Q15. The backward pass multiply thus costs no more than the forward pass one, while giving gradients 128 times the precision of the activations.
The optimized model converges in 350 steps, bringing the total training time to only 5.5 minutes on my PDP-11/34A.
I don't have an actual paper tape reader, so the object code is directly deposited in memory through the console.
Here is the result after running the Transformer:
ATTN/11 - PAPER TAPE IS ALL YOU NEED
D=16 SEQ=8 V=10 Q8/Q15/Q16
TRAINING...
STEP 50 LOSS=1.6113 ACC=0.217
STEP 100 LOSS=2.1865 ACC=0.255
STEP 150 LOSS=2.1511 ACC=0.267
STEP 200 LOSS=1.3874 ACC=0.395
STEP 250 LOSS=0.0500 ACC=0.662
STEP 300 LOSS=0.0019 ACC=0.982
STEP 350 LOSS=0.0009 ACC=1.000
4 7 4 9 6 3 6 5 -> 5 6 3 6 9 4 7 4 OK
0 5 6 1 6 7 0 3 -> 3 0 7 6 1 6 5 0 OK
0 9 4 5 6 5 4 1 -> 1 4 5 6 5 4 9 0 OK
4 5 4 7 2 5 2 7 -> 7 2 5 2 7 4 5 4 OK
...
ACCURACY 10/10
Before committing to assembly, correctness had to be proven. The floating then fixed-point arithmetic were prototyped and validated in Sheaf, my functional ML framework with built-in observability.
Sheaf offers several advantages over Python for this kind of ML work: about a third less code, stronger correctness guarantees from its purely functional semantics, and built-in tracing of every intermediate tensor with its shape, range, and timing. This is invaluable when developing fixed-point arithmetic.
For instance, a range guard on vtmul immediately catches a missing >>8 shift:
sheaf ./tools/prototype.shf --guard vtmul:range:-5000:5000 --trace vtmul
Input: [3. 1. 4. 1. 5. 9. 2. 6.]
Running forward pass...
├─ [vtmul] f64[16x16] [min:-1.72e2 max:1.86e2] (2.0KB), f64[16] [min:-2.38e2 max:2.55e2] (128B)
└─ ← f64[16] [min:-3.10e4 max:3.28e4] (128B) (0.03ms)
/!\ Guard Breached: Range { lo: -5000.0, hi: 5000.0 }
Function: vtmul
Value range [-3.10e4, 3.28e4] outside allowed [-5000, 5000]
Backtrace (last 2 operations):
├─ [vtmul] f64[16x16] [min:-1.72e2 max:1.86e2] (2.0KB), f64[16] [min:-2.38e2 max:2.55e2] (128B)
└─ ← f64[16] [min:-3.10e4 max:3.28e4] (128B) (0.03ms)
--- End of Backtrace ---
Both the floating and fixed-point prototypes can be found in the proto folder, along with the initial Fortran versions.
Without a floating-point unit, transcendental functions (exp, log) are replaced by precomputed lookup tables. A table read costs a single MOV instruction on the PDP-11, making it far cheaper than polynomial approximations or CORDIC algorithms.
The softmax uses a 256-entry table (EXPTBL, Q8) mapping each index i to exp(−i/32). The computation proceeds in three steps:
- The maximum of the input vector is found and subtracted from each element for numerical stability.
- The difference (max − x_i), divided by 8, serves as the table index, clamped to [0, 255].
- Each resulting exp value is divided by their sum (via FXDIV) to produce a probability distribution.
The table covers roughly 8 units of input before reaching zero, which is more than sufficient for a 10-class vocabulary.
The loss is computed every 50 steps for reporting purposes. It relies on a second table (LOGTBL, 257 entries, Q12) mapping each value x ∈ [0, 256] to −ln(x/256) × 4096.
The computation follows the standard path: softmax of the logits, read the probability assigned to the target token, then look up −ln(p) in the table.
The eight contributions (one per sequence position) are accumulated in a 32-bit register pair (necessary because the sum of eight Q12 values can exceed 16 bits) then divided by 8 via ASHC #-3.
Q12 precision (1/4096 ~= 0.0002) provides four decimal places, which is sufficient to monitor convergence.
ATTN/11 uses two lookup tables. The first one maps each index to exp(−i/32) in Q8, and replaces the exponential in softmax with a single MOV instruction.
The other table is merely a convenience and maps each value to −ln(x/256) in Q12. It is used every 50 steps to compute the cross-entropy loss for monitoring convergence.
| Table | Entries | Format | Function | Location |
|---|---|---|---|---|
| EXPTBL | 256 | Q8 | exp(−i/32) | nn11/ACTFN.MAC |
| LOGTBL | 257 | Q12 | −ln(x/256) × 4096 | TRAIN.MAC |
Both tables are generated offline with a Sheaf script and stored as .WORD constants in the source.
The backward pass takes advantage of a well-known property of the softmax–cross-entropy pair: the gradient of the logits reduces to
dL[i] = softmax(logits[i]) - one_hot(target[i])
which eliminates any logarithm computation during training. The result, initially in Q8, is shifted left by 7 bits to Q15 - the format used throughout the backward pass. The same SFTMX routine serves both the forward and backward passes, with no separate backpropagation variant.
One last note: AI tools were used as assistants during development of those algorithms. All design decisions, scaling strategy and numerical validation were manually verified on hardware.
ATTN/11 uses 19.2 KB of memory. The following table shows the memory map, adapted from the MACRO-11 assembler listing:
Address Section Size
─────────────────────────────────────────────────
000000 Interrupt vectors 512 B
001000 Code (NN11 + model + I/O) 5.1 KB
incl. EXPTBL (256 entries)
013112 Strings 142 B
013330 LOGTBL (257 entries, Q12) 514 B
014332 Tokens + Target 32 B
014372 Q16 weight accumulators 4.8 KB
-> High words (1216) 2.4 KB
-> Low words (1216) 2.4 KB
025772 Q8 weight copies 2.4 KB
032572 Gradient accumulators (Q15) 2.4 KB
037372 Forward cache 1.5 KB
-> X, Y, logits, WORK (Q,K,V,S)
042432 Backward workspace 1.4 KB
-> dL, dY, dA, dQ, dK, dV, dX
045316 Stack 512 B
046316 Free 12.8 KB
077777 End of memory (32 KB)
The 1,216 parameters are replicated three times, by necessity: Q16 accumulators (4.8 KB), Q8 for the forward pass (2.4 KB), and Q15 for the gradients (2.4 KB). The model alone uses 9.6 KB, by far the largest share of the total memory usage.
The two requirements are the MACRO11 assembler and obj2bin to convert the object code into a loadable binary.
$ macro11 TRAIN.MAC -o ATTN-11.obj
$ obj2bin.pl ATTN-11.obj --binary --outfile=ATTN-11.bin
$ ls -l ATTN-11.bin
-rw-r--r-- 1 damien damien 6179 Mar 4 21:53 ATTN-11.bin
You will either need:
- A physical PDP/11 with a CPU supporting the EIS instructions, and 32K of core or MOS memory.
- ll-34: a circuit-level, microcycle-accurate PDP-11/34 emulator I designed as a digital replica of the real hardware.
ll-34 is probably as close to owning an actual 11/34 as can be. Start ATTN/11 with:
ll-34 --lda ATTN-11.binOr for a quick demo, use the WebAssembly version available here: https://dbrll.github.io/ll-34/.
SIMH also works, but it simulates the high level behavior (not the circuit) of a PDP-11 and runs at host CPU speed. It can be slowed down artificially, however the timing is not cycle-accurate, making it less suitable for performance tuning or for an authentic experience.

