Taking a swing at it. This is a working repo; all material should be considered incomplete and for demonstration purposes only.
Clone the dataset into this repo's directory from: https://github.com/arcprize/ARC-AGI-2
Here's the idea - we use transformers. Boom! Big reveal!
But we're not throwing a pre-trainer at it, we're training from scratch.
We might be able to get away with this by augmenting the dataset heavily with affine transforms and color permutations. Estimates suggest the 1,000-example ARC-AGI-2 dataset gets yeeted all the way up to something like 1.2B examples that way. It's a lot. Maybe enough.
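A minimal sketch of that augmentation in plain Python (helper names are ours, and note the same color permutation has to be applied to a pair's input and output together):

```python
import random

def dihedral_transforms(grid):
    """Yield the 8 symmetries of a grid (4 rotations x 2 reflections).

    `grid` is a list of rows, each a list of ints 0-9.
    """
    g = grid
    for _ in range(4):
        yield g
        yield [list(row) for row in reversed(g)]    # vertical flip
        g = [list(row) for row in zip(*g[::-1])]    # rotate 90 degrees clockwise

def permute_colors(grid, perm):
    """Relabel cell values through a permutation of 0-9."""
    return [[perm[c] for c in row] for row in grid]

def random_augmentation(grid, rng=random):
    """One random symmetry + color relabeling (hypothetical helper)."""
    g = rng.choice(list(dihedral_transforms(grid)))
    perm = list(range(10))
    rng.shuffle(perm)
    return permute_colors(g, perm)
```

Per task that's 8 symmetries times up to 10! color relabelings, which is where the huge multiplier comes from.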
This is also pretty much necessary, because we want the model to focus entirely on in-example patterns rather than learning anything special about the color blue (or the number 9, or whatever).
We're also using something a bit fancy for the positional encoding:
https://arxiv.org/pdf/2406.10322
Using this we can build a 4D positional encoding (y, x, input_output, example_index) for the example tokens and a 3D positional encoding (y, x, input_output) for the target, which will hopefully preserve all the info the model needs to learn to be generally smört.
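Roughly, LieRE learns one skew-symmetric generator per coordinate axis and maps a position to a rotation matrix via the matrix exponential. A PyTorch sketch of that idea (class and parameter names are ours, and the init scale is a guess):

```python
import torch
import torch.nn as nn

class LieRE(nn.Module):
    """LieRE-style learned rotations: R(p) = exp(sum_i p_i * A_i),
    with each A_i a learned skew-symmetric generator. Applied to
    queries and keys before attention (not shown here)."""

    def __init__(self, head_dim, num_coords):
        super().__init__()
        # Unconstrained parameters; skew-symmetry is enforced in forward().
        self.gen = nn.Parameter(torch.randn(num_coords, head_dim, head_dim) * 0.02)

    def forward(self, x, coords):
        # x: (..., seq, head_dim); coords: (..., seq, num_coords)
        skew = self.gen - self.gen.transpose(-1, -2)        # A_i = -A_i^T
        gsum = torch.einsum('...sc,cij->...sij', coords, skew)
        rot = torch.matrix_exp(gsum)                        # orthogonal rotations
        return torch.einsum('...sij,...sj->...si', rot, x)
```

Since the generators are learned, setting them all to zero recovers the identity rotation, which is the NoPE connection mentioned below. Target tokens (3 coordinates) could share the module by padding their example_index coordinate with a constant, though that's a design choice.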
Also an interesting finding - the NoPE architecture (used in some of Llama 4's layers) can be recovered by LieRE (I think), since the learned generators can collapse to zero, i.e. apply no rotation at all.
https://arxiv.org/pdf/2305.19466
The transformer is going to be a fully tricked out encoder-decoder architecture. The encoder will be non-causal and the decoder will be causal (kinda has to be for sequential prediction).
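A minimal wiring of that in PyTorch, with a causal mask on the decoder side (sizes are placeholders, not tuned; positional info would come from the LieRE rotations inside attention, which is conveniently consistent with `nn.Transformer` having no built-in positional encoding):

```python
import torch
import torch.nn as nn

VOCAB = "0123456789,<|>*"

class ArcTransformer(nn.Module):
    """Encoder-decoder sketch: non-causal encoder, causal decoder."""

    def __init__(self, d_model=128, nhead=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)
        self.core = nn.Transformer(d_model, nhead, layers, layers,
                                   batch_first=True)
        self.head = nn.Linear(d_model, len(VOCAB))

    def forward(self, src_ids, tgt_ids):
        # Causal mask so each decoder position only attends to its past.
        mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.core(self.embed(src_ids), self.embed(tgt_ids), tgt_mask=mask)
        return self.head(h)   # (batch, tgt_len, vocab) next-token logits
```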
Then we'll include a clever delimiting scheme for grid row breaks and example start/end markers:
# VOCAB = "0123456789,<|>*"
# "," = row separator
# "<" = example begins
# "|" = input/output separator
# ">" = example ends
# "*" = padding
The inputs to the encoder will look as follows:
"<{example_0_input}|{example_0_output}>...<{example_N_input}|{example_N_output}>"
Corresponding inputs to the decoder will look as follows:
"<{target_input}|{target_output}>"
A relatively small input / output example (training sample 5) would look like this:
encoder input: <010,101,010,101,010,101|080,808,080,808,080,808,080,808,080><010,011,010,010,011,010|080,088,080,080,088,080,080,088,080><010,011,010,110,010,011|080,088,080,880,080,088,080,880,080>
decoder input: <111,010,010,111,010,010|888,080,080,888,080,080,888,080,080>
These will be accompanied by tensors encoding the positional information for each token in Cartesian coordinates, which will be used to compute the learned rotation at that position.
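One way to build those coordinate lists for a single serialized pair (giving delimiter tokens (-1, -1) cell coordinates is our choice here, not something the scheme above pins down):

```python
def pair_coords(inp, out, example_index):
    """Per-token (y, x, io, ex) coordinates for one '<input|output>' pair.

    Output aligns one-to-one with the serialized string: delimiters
    ('<', ',', '|', '>') get sentinel cell coordinates (-1, -1).
    """
    coords = [(-1, -1, 0, example_index)]                    # '<'
    for io, grid in ((0, inp), (1, out)):
        for y, row in enumerate(grid):
            if y > 0:
                coords.append((-1, -1, io, example_index))   # ',' row break
            coords.extend((y, x, io, example_index) for x in range(len(row)))
        if io == 0:
            coords.append((-1, -1, 1, example_index))        # '|'
    coords.append((-1, -1, 1, example_index))                # '>'
    return coords
```

For the target pair the example_index column would be dropped (or held at a constant) to get the 3D version.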
At inference we will prompt the model with "<{target_input}|" and continue to generate until it returns the ">" symbol.
Hopefully it returns some sequence that can be sensibly reconstructed into a tensor/array.
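A defensive reconstruction of that generated sequence might look like this (the function name and the choice to return None on garbage are ours):

```python
def parse_output(generated):
    """Rebuild a grid from the text the model emits after '<target_input|'.

    Expects digit runs separated by ',' and terminated by '>'; returns
    None when the sequence doesn't form a rectangular grid.
    """
    body = generated.split(">", 1)[0]       # drop '>' and anything after it
    rows = body.split(",")
    if not all(row and row.isdigit() for row in rows):
        return None
    grid = [[int(c) for c in row] for row in rows]
    if len({len(row) for row in grid}) != 1:
        return None                         # ragged rows: not a valid grid
    return grid
```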