This repository contains an OCR (Optical Character Recognition) model for recognizing LaTeX code from images. This repository allows for custom dataset generation, training, and evaluation of the model. The implementation is written with PyTorch.
I have also written a web application to get fast predictions of LaTeX code from images. The app is built with a FastAPI backend to serve the model.
The model consists of an encoder-decoder architecture common to many modern OCR systems. TeXOCR is based on the TrOCR model [1], which utilises a Vision Transformer (ViT) [2] encoder and a Transformer [3] decoder. The model architecture is depicted in the figure below:
The vision encoder receives images of LaTeX equations and processes them into a sequence of patch embeddings; the Transformer decoder then attends over these embeddings to generate the corresponding LaTeX tokens autoregressively.
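As an illustrative sketch of this encoder-decoder wiring (not the actual TeXOCR code: the layer sizes, toy patch embedding, and tiny vocabulary are all placeholders), a ViT-style encoder feeding a Transformer decoder might look like:

```python
import torch
import torch.nn as nn

class TinyViTEncoder(nn.Module):
    """Toy ViT-style encoder: patchify with a strided conv, then self-attention."""
    def __init__(self, patch=16, dim=64, depth=2, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, images):              # images: (B, 1, H, W)
        x = self.patch_embed(images)        # (B, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        return self.encoder(x)

class TinyDecoder(nn.Module):
    """Toy Transformer decoder that cross-attends over the patch embeddings."""
    def __init__(self, vocab=100, dim=64, depth=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, memory):      # tokens: (B, T)
        x = self.embed(tokens)
        # Causal mask so each position only attends to earlier tokens
        T = tokens.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.head(self.decoder(x, memory, tgt_mask=mask))

encoder, decoder = TinyViTEncoder(), TinyDecoder()
imgs = torch.randn(2, 1, 32, 64)            # two grayscale 32x64 "equations"
logits = decoder(torch.zeros(2, 5, dtype=torch.long), encoder(imgs))
print(logits.shape)                          # torch.Size([2, 5, 100])
```

At inference time the decoder would be run token by token, feeding each prediction back in until an end-of-sequence token is produced.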
To clone the repository, run the following:
git clone https://github.com/olibridge01/TeXOCR.git
cd TeXOCR

For package management, set up a conda environment and install the required packages as follows:
conda create -n texocr python=3.11 anaconda
conda activate texocr
pip install -r requirements.txt

For dataset rendering, latex, dvipng, and imagemagick are required. To install these dependencies, follow the instructions in the data_wrangling/ directory.
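One detail worth noting before generating data: because a ViT encoder tiles its input into fixed-size patches, the dataset scripts below adjust each rendered image's dimensions to a multiple of the patch size. A minimal sketch of that rounding (assuming round-up padding and a patch size of 16; the real scripts may differ):

```python
def pad_to_patch_multiple(width: int, height: int, patch_size: int = 16) -> tuple[int, int]:
    """Round image dimensions up to the nearest multiple of the patch size,
    so the ViT encoder can tile the image into whole patches."""
    def round_up(x: int) -> int:
        return ((x + patch_size - 1) // patch_size) * patch_size
    return round_up(width), round_up(height)

print(pad_to_patch_multiple(250, 37))  # (256, 48)
```

An image that is already a multiple of the patch size is left unchanged by this rounding.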
The data used in this project is taken from the Im2LaTeX-230k dataset (equations only). For use with a model consisting of a ViT encoder, I created custom scripts to generate the full dataset of image-label pairs, where each image has its dimensions altered to the nearest multiple of the patch size. To generate the dataset, simply execute:
./generate_dataset.sh

This takes the original equation data in data/master_labels.txt, creates the data splits with split_data.py, and renders the images with render_images.py (both located in the data_wrangling directory). The rendered images are stored in the data/train, data/val, and data/test directories. To create the dataset pickle files used in the training/testing scripts, run:
./generate_pickles.sh

This repository contains an implementation of the Byte Pair Encoding (BPE) [4] algorithm for tokenizing LaTeX code. To train the tokenizer on the Im2LaTeX-230k equation data, run:
./train_tokenizer.sh

To train the tokenizer on any text data, you can play around with the tokenizer/tokenizer.py script:
python tokenizer/tokenizer.py -v [vocab_size] -t -d [data_path] -s [save_path] --special [special_tokens] --verbose

where vocab_size is the desired vocabulary size, data_path is the path to the training data, save_path is the path at which to save the tokenizer (.txt file), and special_tokens is the path to a .txt file containing special tokens (e.g. [BOS], [PAD], etc.). Additionally, one can tinker with the RegExTokenizer class in Python as follows:
from TeXOCR.tokenizer import RegExTokenizer
tokenizer = RegExTokenizer()
text = open('path/to/train.txt').read()
tokenizer.train(text)
tokenizer.save('path/to/tokenizer.txt')
# Tokenize a LaTeX string
tokens = tokenizer.encode(r'\int _ { 0 } ^ { 1 } x ^ 2 d x')
print(tokens)

where train.txt is some file containing tokenization training data. The tokenizer can be saved and loaded using the save() and load() methods.
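For intuition, BPE training repeatedly merges the most frequent adjacent pair of tokens into a new token, growing the vocabulary one merge at a time. A minimal byte-level illustration of this loop (not the repository's RegExTokenizer implementation, which also applies a regex pre-split):

```python
from collections import Counter

def bpe_train(ids: list[int], num_merges: int) -> dict[tuple[int, int], int]:
    """Minimal BPE: repeatedly replace the most frequent adjacent pair
    of token ids with a fresh id, recording each merge."""
    merges = {}
    next_id = 256  # byte-level tokens occupy ids 0..255
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]
        merges[pair] = next_id
        # Replace every occurrence of `pair` with the new id
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges

merges = bpe_train(list(b"x ^ 2 x ^ 2"), num_merges=2)
print(len(merges))  # 2
```

Encoding new text then amounts to applying the recorded merges in order; decoding reverses them back to bytes.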
[1] Li et al. - TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models (2021)
[2] Dosovitskiy et al. - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020)
[3] Vaswani et al. - Attention Is All You Need (2017)
[4] Sennrich et al. - Neural Machine Translation of Rare Words with Subword Units (2015)



