This repo is a collection of my own implementations of different attention/transformer architectures and algorithms for text, image, and video transformers.

It is also meant to serve as a sandbox environment for implementing different techniques (for example, DeepSeek sparse attention). Everything is designed to be trained on a single A100 GPU instance.
```
# Text
python main_script.py --train text

# Image classifier
python main_script.py --train image

# VideoGPT
python main_script.py --train video
```
- Learn in detail, by doing, how the following attention mechanisms work
    - Matrix-Form Text Attention
    - Image Attention
    - Temporal-Spatial Attention
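The matrix-form attention above boils down to `softmax(QK^T / sqrt(d_k)) V`. A minimal NumPy sketch (names are illustrative, not this repo's actual API):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention in matrix form: softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (T, T) similarity scores
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V                              # weighted mix of values

Q = np.random.randn(4, 8)  # (seq_len, d_k)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
out = attention(Q, K, V)   # (4, 8)
```

Image and temporal-spatial attention reuse the same core operation; only how the patch/frame tokens are laid out into the sequence changes.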
- Train three types of transformers on basic datasets, including handling the masking logic
    - Shakespeare text for the text model
    - MNIST generation for images
    - MNIST video for videos
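The masking logic for autoregressive training is a lower-triangular causal mask: token t may only attend to tokens at or before t. A minimal sketch of the idea:

```python
import numpy as np

def causal_mask(T):
    # True where attention is allowed: position t sees positions <= t
    return np.tril(np.ones((T, T), dtype=bool))

mask = causal_mask(4)
# Before the softmax, disallowed positions are set to -inf so they get zero weight
scores = np.random.randn(4, 4)
scores = np.where(mask, scores, -np.inf)
```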
- Implement KV Caching and the following attention improvements (text only)
    - GQA (Grouped-Query Attention)
    - MLA (Multi-head Latent Attention)
    - DSA (DeepSeek Sparse Attention), TBD
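The common thread of KV caching is appending each decode step's key/value to a growing store so past tokens are never re-encoded. A toy sketch of that bookkeeping (the class name and shapes are assumptions, not this repo's interface):

```python
import numpy as np

class KVCache:
    # Append-only store of past keys/values for autoregressive decoding
    def __init__(self):
        self.k, self.v = [], []

    def append(self, k_t, v_t):
        # Add this step's key/value, return the full history as arrays
        self.k.append(k_t)
        self.v.append(v_t)
        return np.stack(self.k), np.stack(self.v)

cache = KVCache()
for t in range(3):
    K, V = cache.append(np.random.randn(8), np.random.randn(8))
# After 3 steps, K and V each hold the whole history: shape (3, 8)
```

GQA shrinks this cache by sharing one K/V head across a group of query heads; MLA compresses it into a low-rank latent instead.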
- Implement basic top-K routing MoE (Mixture of Experts)
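Top-K routing means each token's gate scores pick its k best experts, and their outputs are mixed with the renormalized gate weights. A minimal NumPy sketch under assumed shapes (experts here are plain linear maps for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_route(x, W_gate, experts, k=2):
    # Route each token to its top-k experts, weighted by renormalized gate scores
    logits = x @ W_gate                         # (n_tokens, n_experts)
    idx = np.argsort(logits, axis=-1)[:, -k:]   # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gate = softmax(logits[t, idx[t]])       # renormalize over chosen experts
        for g, e in zip(gate, idx[t]):
            out[t] += g * experts[e](x[t])
    return out

rng = np.random.default_rng(0)
d, n_exp = 8, 4
# Each "expert" is just a random linear map in this toy version
experts = [(lambda W: (lambda v: v @ W))(rng.standard_normal((d, d)))
           for _ in range(n_exp)]
W_gate = rng.standard_normal((d, n_exp))
x = rng.standard_normal((5, d))
y = topk_route(x, W_gate, experts, k=2)  # (5, 8)
```

A real MoE layer would add a load-balancing loss and batch the expert dispatch; the per-token loop here is only for clarity.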
- Implement a VQ-VAE (Vector-Quantized VAE)
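The core of a VQ-VAE is the codebook lookup: each encoder latent is snapped to its nearest codebook vector. A minimal sketch of just that quantization step (the straight-through gradient and codebook losses are omitted):

```python
import numpy as np

def quantize(z, codebook):
    # Nearest-neighbor lookup: replace each latent with its closest code vector
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, K) distances
    idx = d.argmin(-1)                                         # chosen code per latent
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 4))  # K=16 codes of dimension 4
z = rng.standard_normal((10, 4))         # stand-in for encoder outputs
z_q, idx = quantize(z, codebook)         # quantized latents + code indices
```

The indices `idx` are what a downstream autoregressive model (e.g. the video transformer) is trained to predict.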