Msingi1 is a purely experimental research project. This repository contains experimental code and research findings for developing Swahili language models. No pre-trained models have been released yet, but we plan to release multiple variants soon.
Msingi ("Foundation" in Swahili) is our experimental attempt to build decent language models for Swahili, one of Africa's most widely spoken languages. We started small, but have scaled up to multiple experimental models that can generate grammatically correct Swahili text.
The project began with a simple question: Can we build useful language models for African languages without billions of parameters and massive compute? This README documents our experimental journey, what we've learned, and where we're headed.
# Clone the repository
git clone https://github.com/Msingi-AI/msingi1.git
cd msingi1
# Install dependencies
pip install -e .
# Or install from PyPI (when available)
# pip install msingi1

import torch
from src.model_v2 import Msingi2Config, Msingi2
from transformers import PreTrainedTokenizerFast
# Create a 12-layer model configuration (recommended for reproduction)
config = Msingi2Config(
    vocab_size=32000,
    block_size=1024,
    n_layer=12,  # Recommended: 12 layers
    n_head=12,
    n_embd=768,
    dropout=0.15,
    gradient_checkpointing=True,
)
# Initialize the model
model = Msingi2(config)
# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("tokenizer/swahili_unigram_32000/transformers")
# Generate text (after training)
prompt = "Habari ya leo, jina langu ni"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(
        input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )
print(tokenizer.decode(generated[0], skip_special_tokens=True))

# Generate text with default settings (after training)
python src/generate_text.py --prompt "Habari ya leo, jina langu ni"
# Train a new model (recommended: 12 layers)
python src/train_msingi2.py --config configs/msingi2_12l.json
# Test model performance
python src/test_model.py --model-path best_model/

We have reproduced and trained the following models, in this order:
- Msingi-Spinner: 12 layers, ~85M parameters, RoPE positional embeddings (our first successful reproduction)
- Msingi-Mzizi (mzizi = root/foundation): 12 layers, ~85M parameters, traditional (learned) positional embeddings
- Msingi-Kali (kali = sharp/fierce): 18 layers, ~153M parameters, traditional positional embeddings
- Msingi-Hodari (hodari = skilled/capable): 24 layers, ~336M parameters, traditional positional embeddings
- Msingi-Bingwa (bingwa = expert/master): 36 layers, ~504M parameters, traditional positional embeddings
All models use a vocabulary size of 32,000. The embedding dimensions scale with model size: 768 dimensions for 12 and 18 layers, and 1024 dimensions for 24 and 36 layers. Parameter counts are approximate and rounded for clarity.
| Model Name | Layers | Embedding Dimension | Positional Embeddings | Parameters (approx) |
|---|---|---|---|---|
| Msingi-Spinner | 12 | 768 | RoPE | 85M |
| Msingi-Mzizi | 12 | 768 | Learned | 110M |
| Msingi-Kali | 18 | 768 | Learned | 153M |
| Msingi-Hodari | 24 | 1024 | Learned | 336M |
| Msingi-Bingwa | 36 | 1024 | Learned | 504M |
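
As a rough cross-check of the parameter counts above, the sketch below estimates the size of a GPT-2-style decoder from its configuration. It assumes a 4x MLP expansion, bias terms, two LayerNorms per block, and tied input/output embeddings. None of these details are confirmed for Msingi, and `estimate_params` is a hypothetical helper, so the totals are only indicative.

```python
def estimate_params(n_layer, n_embd, vocab_size, block_size, learned_positions=True):
    """Estimate parameter counts for a GPT-2-style decoder (assumed layout)."""
    d = n_embd
    attention = (3 * d * d + 3 * d) + (d * d + d)   # QKV projection + output projection
    mlp = (4 * d * d + 4 * d) + (4 * d * d + d)     # up-projection + down-projection
    layer_norms = 2 * (2 * d)                       # two LayerNorms (weight + bias) per block
    blocks = n_layer * (attention + mlp + layer_norms)

    token_embeddings = vocab_size * d               # assumed tied with the output head
    position_embeddings = block_size * d if learned_positions else 0  # RoPE adds none
    final_norm = 2 * d

    total = blocks + token_embeddings + position_embeddings + final_norm
    return {"blocks": blocks, "embeddings": token_embeddings + position_embeddings, "total": total}


if __name__ == "__main__":
    for name, layers, dim, learned in [
        ("12 layers, RoPE", 12, 768, False),
        ("12 layers, learned", 12, 768, True),
        ("24 layers, learned", 24, 1024, True),
    ]:
        counts = estimate_params(layers, dim, vocab_size=32000, block_size=1024,
                                 learned_positions=learned)
        print(f"{name}: blocks={counts['blocks'] / 1e6:.1f}M total={counts['total'] / 1e6:.1f}M")
```

Under these assumptions, a 12-layer, 768-dimensional model has about 85M parameters in its transformer blocks and about 110M once the 32,000-token embedding table is included.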
We recommend the 12-layer Msingi-Mzizi configuration for most users reproducing this work, as it offers a good balance of performance and computational cost. Training of the larger models (Msingi-Kali, Msingi-Hodari, Msingi-Bingwa) is still in progress. We will release all variants once training is complete.
Our experimental training corpus combines multiple high-quality Swahili datasets:
- Swahili-SAFI (C4 Dataset): ~3.5GB of clean Swahili text from the Common Crawl
- Swahili Corpus: Academic and news content from Mendeley Data
- Helsinki Corpus: Linguistic research corpus
- Swahili Wikipedia: Encyclopedic content
- Community Content: News websites, forums, and contemporary content
# Download and prepare the Swahili-SAFI dataset
python src/download_mc4_swahili.py
# This will create:
# - data/train.txt (95% of data)
# - data/valid.txt (5% of data)

To handle large datasets efficiently, we use a sharding approach that processes data in manageable chunks:
# Create tokenized shards for training
python src/create_token_shards.py

Sharding Benefits:
- Memory Efficiency: Only loads necessary tokens into memory
- Training Speed: Reduces I/O bottlenecks through memory mapping
- Scalability: Enables training on larger datasets than would fit in RAM
- Flexibility: Allows for dynamic shard loading and epoch definitions
Shard Configuration:
- Shard Size: 10M tokens per shard (optimized for 13GB+ RAM)
- Validation Chunks: 3M tokens per chunk
- Buffer Size: 3M tokens for memory management
- Format: NumPy arrays with uint16 dtype for efficiency
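
The sharding idea can be sketched as follows. This is a minimal illustration, not the contents of `src/create_token_shards.py`: the helper names, file naming, and tiny shard size are assumptions chosen for the demo. It relies on token IDs below 32,000 fitting in `uint16` (maximum 65,535) and on NumPy memory mapping so that only the slices being read are paged into RAM.

```python
import tempfile
from pathlib import Path

import numpy as np


def write_shards(token_ids, shard_size, out_dir):
    """Split a token stream into fixed-size uint16 .npy shards (hypothetical layout)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    tokens = np.asarray(token_ids, dtype=np.uint16)
    paths = []
    for i, start in enumerate(range(0, len(tokens), shard_size)):
        path = out_dir / f"shard_{i:05d}.npy"
        np.save(path, tokens[start:start + shard_size])
        paths.append(path)
    return paths


def sample_batch(shard_path, block_size, batch_size, rng):
    """Draw (input, target) windows from one shard without loading it fully."""
    data = np.load(shard_path, mmap_mode="r")  # memory-mapped, read-only
    starts = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[s:s + block_size] for s in starts]).astype(np.int64)
    y = np.stack([data[s + 1:s + 1 + block_size] for s in starts]).astype(np.int64)
    return x, y  # y is x shifted by one token (next-token targets)


if __name__ == "__main__":
    fake_tokens = np.arange(2500) % 32000           # stand-in for real token IDs
    shard_paths = write_shards(fake_tokens, shard_size=1000, out_dir=tempfile.mkdtemp())
    x, y = sample_batch(shard_paths[0], block_size=16, batch_size=4,
                        rng=np.random.default_rng(0))
    print(len(shard_paths), x.shape, y.shape)
```

With the settings listed above, `shard_size` would be 10,000,000 rather than the toy value used here.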
Recommended Msingi2 Training (12 layers):
- Hardware: A100 GPU (recommended)
- Duration: 4-6 epochs
- Learning Rate: 3e-4 with cosine decay schedule
- Batch Size: 8 with gradient accumulation of 8 (effective batch size of 64)
- Optimization: Mixed precision (FP16), gradient checkpointing
- Monitoring: Weights & Biases integration
- Token-to-Parameter Ratio: ~4.6:1 (chosen with the aim of limiting overfitting)
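
The learning-rate schedule and effective batch size above can be sketched as below. The peak rate of 3e-4 and the 8 x 8 batch arithmetic come from the list. The linear warmup, the step counts, and the `min_lr` floor are assumptions made for illustration, and `lr_at` is a hypothetical helper rather than a function from the training script.

```python
import math


def lr_at(step, max_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5):
    """Linear warmup followed by cosine decay from max_lr down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + cosine * (max_lr - min_lr)


def effective_batch_size(micro_batch, accumulation_steps):
    """Sequences contributing to each optimizer update."""
    return micro_batch * accumulation_steps


if __name__ == "__main__":
    print(effective_batch_size(8, 8))
    for step in (0, 50, 100, 550, 1000):
        print(step, f"{lr_at(step):.2e}")
```

Gradient accumulation sums gradients over 8 micro-batches of 8 sequences before each optimizer step, which is how a batch of 64 fits on a single GPU.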
| Epoch | Loss | Learning Rate | Time |
|---|---|---|---|
| 1 | 10.0540 | 1.26e-5 | ~2h 20m |
| 2 | 8.8586 | 2.52e-5 | ~2h 20m |
| 3 | 7.7763 | 3.78e-5 | ~2h 20m |
| 4 | 6.2656 | 5.04e-5 | ~2h 20m |
Swahili is an agglutinative language - it builds complex words by combining smaller meaningful pieces. For example:
- "ninakupenda" = "ni" (I) + "na" (present tense) + "ku" (you) + "penda" (love)
After extensive experimentation with ByteLevelBPE, WordPiece, and Unigram tokenizers, we found that Unigram tokenization works best for Swahili:
- Type: Unigram (SentencePiece-style)
- Vocabulary Size: 32,000 tokens
- Special Tokens: <s>, </s>, <unk>, <pad>, <mask>, <sw>, <eot>
- Training Corpus: Full training dataset (383 MB, ~41.8M words)
- Implementation: Built using Hugging Face Tokenizers library
- Morphological Complexity: Better handles Swahili's agglutinative structure through statistical optimization
- Linguistic Meaning: Creates more linguistically meaningful subword units
- Rare Word Handling: Produces more natural word segmentations for rare words
- Token Efficiency: Typically represents text with fewer tokens than BPE
- Statistical Optimization: Uses likelihood-based training for optimal subword segmentation
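
To make the likelihood-based segmentation concrete, here is a small sketch of how a Unigram model picks the most probable split of a word using Viterbi search. The vocabulary and log-probabilities are invented for illustration and are not taken from the Msingi tokenizer. A real Unigram tokenizer also learns these probabilities with EM and prunes the vocabulary, which this sketch omits.

```python
import math

# Hypothetical piece log-probabilities (not from the trained tokenizer).
TOY_LOGP = {
    "ni": -2.0, "na": -2.0, "ku": -2.5, "penda": -3.0,
    "nina": -6.0, "kupenda": -7.0,
}
# Single characters act as a low-probability fallback.
TOY_LOGP.update({ch: -8.0 for ch in "abcdefghijklmnopqrstuvwxyz"})


def viterbi_segment(word, logp):
    """Return the segmentation of `word` with the highest total log-probability."""
    n = len(word)
    best = [-math.inf] * (n + 1)   # best[i]: best score for word[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]


if __name__ == "__main__":
    print(viterbi_segment("ninakupenda", TOY_LOGP))
```

With this toy vocabulary, the highest-scoring split of "ninakupenda" is ni + na + ku + penda, matching the morphological breakdown given earlier.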
| Tokenizer | Vocab Size | Avg Tokens/Sentence | Morphological Handling | Memory Usage |
|---|---|---|---|---|
| ByteLevelBPE | 32K | 15.2 | Good | Medium |
| Unigram | 32K | 13.8 | Excellent | Low |
| WordPiece | 32K | 16.1 | Fair | High |
from transformers import PreTrainedTokenizerFast
# Load tokenizers
bpe_tokenizer = PreTrainedTokenizerFast.from_pretrained("tokenizer/swahili_bpe_32000/transformers")
unigram_tokenizer = PreTrainedTokenizerFast.from_pretrained("tokenizer/swahili_unigram_32000/transformers")
# Compare tokenization
text = "Ninapenda kusoma vitabu vya Kiswahili na kusikiliza muziki."
bpe_tokens = bpe_tokenizer.tokenize(text)
unigram_tokens = unigram_tokenizer.tokenize(text)
print(f"BPE tokens: {bpe_tokens}")
print(f"BPE token count: {len(bpe_tokens)}")
print(f"Unigram tokens: {unigram_tokens}")
print(f"Unigram token count: {len(unigram_tokens)}")
# Using the <eot> token for text separation
texts = ["Habari ya leo.", "Habari nzuri sana."]
combined_text = unigram_tokenizer.eos_token.join(texts) # Joins with <eot>
print(f"Combined with <eot>: {combined_text}")
encoded = unigram_tokenizer.encode(combined_text)
print(f"Decoded back: {unigram_tokenizer.decode(encoded)}")

msingi1/
├── src/ # Source code
│ ├── model.py # Original model with RoPE embeddings
│ ├── model_v2.py # Current model with traditional embeddings
│ ├── train_msingi1.py # Training script for original model
│ ├── train_msingi2.py # Training script for current model
│ ├── generate_text.py # Text generation
│ ├── test_model.py # Model evaluation
│ ├── data_processor.py # Data preprocessing
│ ├── download_mc4_swahili.py # Download C4 dataset
│ ├── create_token_shards.py # Create training shards
│ └── train_tokenizer.py # Tokenizer training
├── tokenizer/ # Tokenizer files
│ ├── swahili_bpe_32000/ # BPE tokenizer
│ └── swahili_unigram_32000/ # Unigram tokenizer (recommended)
├── msingi_tokens/ # Tokenized dataset shards
├── best_model/ # Trained model checkpoints
├── data/ # Dataset files
├── configs/ # Training configurations
├── Dockerfile # Container setup
├── setup.py # Package configuration
└── requirements.txt # Dependencies
- Python 3.8+
- PyTorch 2.0+
- CUDA-compatible GPU (recommended)
- 16GB+ RAM (32GB+ for full dataset processing)
# 1. Clone repository
git clone https://github.com/Msingi-AI/msingi1.git
cd msingi1
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -e .
# 4. Download and prepare datasets
python src/download_mc4_swahili.py
# 5. Create tokenized shards
python src/create_token_shards.py
# 6. Train your model (12-layer recommended)
python src/train_msingi2.py --config configs/msingi2_12l.json

# 1. Prepare your dataset
python src/data_processor.py --input data/raw/ --output data/processed/
# 2. Train tokenizer (if needed)
python src/train_tokenizer.py --data data/processed/ --output tokenizer/custom/
# 3. Create token shards
python src/create_token_shards.py
# 4. Train model with 12 layers (RECOMMENDED)
python src/train_msingi2.py --config configs/custom_12l.json

Create a custom configuration file configs/custom_12l.json:
{
  "vocab_size": 32000,
  "block_size": 1024,
  "n_layer": 12,
  "n_head": 12,
  "n_embd": 768,
  "dropout": 0.15,
  "bias": true,
  "gradient_checkpointing": true
}

# Build Docker image
docker build -t msingi1 .
# Run with GPU support
docker run --gpus all -it msingi1
# Run text generation
docker run --gpus all msingi1 python src/generate_text.py --prompt "Habari ya leo"

import requests
# Generate text via API
response = requests.post("https://api.msingi.ai/generate", json={
    "prompt": "Habari ya leo, jina langu ni",
    "max_length": 100,
    "temperature": 0.8,
})
print(response.json()["generated_text"])

from msingi1 import MsingiGenerator
# Initialize generator
generator = MsingiGenerator.from_pretrained("msingi-mzizi")
# Generate text
text = generator.generate("Habari ya leo", max_length=100)
print(text)

Prompt: "Habari ya leo, jina langu ni" (Hello, my name is)
Experimental Msingi-Mzizi Output:
"Habari ya leo, jina langu ni Maria. Ninafurahi kukutana nawe leo. Mimi ni mwanafunzi wa chuo kikuu cha Nairobi, ninasomea sayansi ya kompyuta. Ninapenda kusoma, kusikiliza muziki, na kutembea na marafiki zangu wakati wa mapumziko. Je, wewe unaitwa nani? Unapenda kufanya nini wakati wa starehe?"
What's Improved:
- Better topic adherence - stays with personal introduction
- Natural conversational flow
- Grammatically correct Swahili
- Contextually appropriate responses
- Reduced news bias compared to earlier versions
- Perplexity: 2.17 (calculated as exp(0.7764))
- BLEU Score: 18.7 on test set completion tasks
- ROUGE-L: 32.4 on test set completion tasks
- Human Evaluation: 3.2/5 for grammaticality, 2.8/5 for coherence
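
The perplexity figure above follows directly from the mean cross-entropy loss. The sketch below shows that relationship. The function names are ours, and the loss is assumed to be an average negative log-likelihood per token in nats (natural logarithm).

```python
import math


def perplexity_from_loss(mean_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_loss)


def perplexity(token_log_probs):
    """Same quantity, computed from the log-probability assigned to each token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    print(round(perplexity_from_loss(0.7764), 2))  # 2.17
```

Because the mapping is exponential, small changes in loss translate into proportionally larger changes in perplexity at higher loss values.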
- Domain Bias: Model tends toward news-style content due to training data composition
- Context Length: Limited to 1024 tokens per sequence
- Repetition: Occasional repetitive patterns in longer generations
- Evaluation: Lack of standardized Swahili NLP benchmarks
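
Repetition of the kind noted above is often handled at decoding time. The sketch below shows one common way the `temperature`, `top_p`, and `repetition_penalty` settings from the generation example could be applied to raw logits. It is a pure-Python illustration with helper names of our own and is not taken from `Msingi2.generate`.

```python
import math


def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Make already-generated tokens less likely (positive logits shrink, negative grow)."""
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_id: prob} over the smallest set whose probability mass reaches top_p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]      # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}        # renormalized nucleus


if __name__ == "__main__":
    logits = [2.2, -1.0, 0.5, 0.1]
    penalized = apply_repetition_penalty(logits, generated_ids=[0, 1])
    print(penalized)
    print(top_p_filter(penalized))
```

The next token would then be sampled from the returned distribution, for example with `random.choices`.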
We are currently working on the post-training phase, which includes:
- Instruction Tuning: Adapting models for specific tasks and instructions
- Fine-tuning: Domain-specific adaptations (legal, medical, educational)
- Conversational Abilities: Improving topic coherence and dialogue skills
- Bias Reduction: Addressing domain biases in the training data
Note: This phase is progressing slowly due to limited manpower, as we are working on this project part-time. We welcome contributions from the community to accelerate this work.
- Instruction Tuning: Adapting models for specific tasks
- Multilingual Expansion: Extending to other East African languages
- Model Compression: Quantization and pruning for deployment
- Evaluation Benchmarks: Developing Swahili-specific metrics
We are working on a new project with a novel approach to training Swahili language models. This project will introduce innovative techniques specifically designed for African languages and their unique characteristics.
We are actively looking for collaborators! If you're interested in:
- Novel training methodologies
- African language technology
- Experimental NLP research
- Swahili language processing
Please create an issue or reach out to us. We'd love to collaborate with researchers, developers, and Swahili speakers who are passionate about advancing African language technology.
The Msingi1 language model was trained on a combined corpus from:
- Swahili-SAFI (C4 Dataset)
  - Flax Community. (2023). Swahili-SAFI: A clean Swahili dataset from Common Crawl. Hugging Face Datasets.
- Swahili Corpus
  - Masasi, Noel; Masua, Bernard (2024), "Swahili Corpus", Mendeley Data, V2, doi: 10.17632/d4yhn5b9n6.2
- Helsinki Corpus of Swahili (HCS-NA-v2)
  - Arvi Hurskainen (2004). Helsinki Corpus of Swahili. 2nd edition: Helsinki Corpus of Swahili, Version 2.0 (HCS 2.0) 2004-09-30. University of Helsinki, Institute for Asian and African Studies.
- Swahili Wikipedia 2021
  - Wikimedia Foundation. (2021). Swahili Wikipedia. Retrieved 2021 from https://sw.wikipedia.org/
- Swahili Community 2023
  - Various Swahili news and community websites. (2023). Collected from sources including Mwananchi.co.tz, BBC Swahili, VOA Swahili, and Vodacom Tanzania.
We welcome contributions! This project is purely experimental and particularly in need of help with the post-training phase. Here's how you can contribute:
- Instruction Tuning: Help develop instruction-following capabilities
- Fine-tuning: Create domain-specific model variants
- Evaluation: Develop Swahili-specific benchmarks and metrics
- Documentation: Improve tutorials and guides
- Code Optimization: Optimize training and inference code
- Community Building: Help grow the Swahili NLP community
- New Project Collaboration: Join our novel training methodology project
# 1. Fork the repository
# 2. Create a feature branch
git checkout -b feature/amazing-feature
# 3. Make your changes
# 4. Add tests
python -m pytest tests/
# 5. Commit your changes
git commit -m "Add amazing feature"
# 6. Push to the branch
git push origin feature/amazing-feature
# 7. Open a Pull Request

- Join our Discussions: Share ideas and ask questions
- Pick an Issue: Look for issues labeled "good first issue" or "help wanted"
- Start Small: Begin with documentation or small bug fixes
- Ask for Help: Don't hesitate to ask questions in issues or discussions
- Create Issues: Share your ideas, report bugs, or suggest improvements
- Code Style: Follow PEP 8 for Python code
- Documentation: Add docstrings and comments for new functions
- Testing: Add tests for new features
- Commit Messages: Use clear, descriptive commit messages
- Pull Requests: Provide clear descriptions of changes
- Model Card - Detailed model specifications
- Paper Draft - Research paper and methodology
- API Documentation - Complete API reference
- Training Guide - How to train your own models
- Contributing Guidelines - How to contribute to the project
This project is licensed under the MIT License - see the LICENSE file for details.
- Email: korirkiplangat22@gmail.com
- GitHub Issues: Report bugs, request features, or share ideas
- Discussions: Join our community
- Masakhane Community for valuable insights and collaboration
- MsingiAI for supporting this research
- Hugging Face for the transformers library
- PyTorch Team for the deep learning framework
- Flax Community for the Swahili-SAFI dataset
If you use Msingi1 in your research, please cite:
@software{msingi1_2025,
  author = {Msingi AI Team},
  title = {Msingi1: Scaling Language Modeling Through Small-Scale Pretraining},
  year = {2025},
  url = {https://github.com/Msingi-AI/msingi1},
  note = {Experimental research project}
}

We're actively working to improve our Msingi models:
- Model Releases: Soon releasing Msingi-Spinner, Msingi-Mzizi, Msingi-Kali, Msingi-Hodari, and Msingi-Bingwa variants
- Post-Training Phase: Instruction tuning and fine-tuning (needs community help!)
- Better Text Generation: Improved sampling strategies and bias reduction
- Evaluation Framework: Comprehensive Swahili-specific benchmarks
- Efficient Deployment: Model compression for resource-constrained environments
- New Project: Novel training methodology for African languages
The current model is just the beginning - we see it as a foundation (hence the name "Msingi") that we can build upon to create truly useful Swahili language AI.
We need your help to accelerate the post-training phase and to collaborate on our new project, AkiliX. Whether you're a researcher, developer, or Swahili speaker, your contributions can make a real difference in advancing Swahili language technology.
Create issues, share ideas, and join us in building the future of African language AI!
Made with dedication for Swahili