Skip to content
This repository was archived by the owner on May 26, 2025. It is now read-only.

Msingi-AI/MsingiAI-Tokenizers

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MsingiAI Tokenizers

Building Tokenization Tools for African Languages

Welcome to the MsingiAI Tokenizers Repository! This is a community-driven initiative to develop high-quality tokenizers for African languages, which are currently underserved in the world of natural language processing (NLP). By building tokenizers specifically tailored to the linguistic structure and diversity of African languages, we aim to lay a strong foundation for advancing NLP research and applications across the continent.

Why This Matters

Africa is one of the most linguistically diverse regions in the world, with over 2,000 languages. Unfortunately, many of these languages lack the basic computational tools necessary for natural language processing tasks.

Tokenization is one of the most fundamental tasks in NLP. However, tokenizers designed for widely spoken languages like English, French, or Chinese often fail to account for:

  • Complex Morphology: Many African languages are highly agglutinative, where words are formed by combining multiple morphemes.
  • Tonal Systems: Tone can alter the meaning of words in some languages, but it is often ignored by existing tools.
  • Lack of Standardized Orthography: Some African languages lack standardized writing systems, making tokenization more challenging.
  • Multilingual Contexts: Many African speakers mix multiple languages, often within the same sentence (code-switching).

By addressing these challenges, we aim to empower researchers, developers, and communities to build tools that reflect the true richness of African languages.

🎯 Project Objectives

  1. Develop Tokenizers: Build tokenizers for as many African languages as possible, tailored to their unique linguistic features.
  2. Provide Benchmarks: Establish evaluation frameworks for tokenizers to measure their accuracy and performance.
  3. Create Reusable Tools: Provide easy-to-use libraries and APIs for African language tokenization.
  4. Foster Collaboration: Build a community of contributors passionate about African NLP.

Key Features

  • Language-Specific Tokenizers: Tokenization tools optimized for individual African languages or language families.
  • Support for Code-Switching: Handling multilingual text and mixed-language sentences.
  • Preprocessing Utilities: Tools for text normalization, stemming, lemmatization, and stopword removal.
  • Open-Source and Extensible: Contributions are welcome, and the tools will remain free and open for all.

🛠️ Repository Structure

The repository is structured as follows:

msingi-tokenizers/  
├── src/  
│   ├── tokenizers/         # Directory for tokenizer implementations  
│   │   ├── swahili_tokenizer.py  
│   │   ├── yoruba_tokenizer.py  
│   │   └── ...  
│   ├── preprocessing/      # Text preprocessing utilities (e.g., normalization, stemming)  
│   └── utils/              # Helper functions and utilities  
├── data/                   # Datasets and sample texts for testing  
├── tests/                  # Unit tests for tokenizers and utilities  
├── docs/                   # Documentation and tutorials  
│   ├── language_guides/    # Guides for tokenization challenges in specific languages  
│   └── ...  
├── CONTRIBUTING.md         # Guidelines for contributors  
├── LICENSE                 # License information  
├── README.md               # Project overview  
└── setup.py                # Installation script  

🔍 Getting Started

Installation

To get started, clone this repository and install the required dependencies:

git clone https://github.com/msingi-ai/msingi-tokenizers.git  
cd msingi-tokenizers  
pip install -r requirements.txt  

Example Usage

Here’s an example of using a tokenizer:

from tokenizers.swahili_tokenizer import SwahiliTokenizer  

# Initialize the tokenizer  
tokenizer = SwahiliTokenizer()  

# Tokenize text  
text = "Jambo, karibu MsingiAI!"  
tokens = tokenizer.tokenize(text)  
print(tokens)  

🌟 Contribution Guidelines

We’re building this together!

How to Contribute

  1. Fork this Repository: Click on the fork button on GitHub to create your copy of the repository.
  2. Choose a Task: Select a language to work on, optimize an existing tokenizer, or contribute to documentation.
  3. Write Clean Code: Follow the structure and coding standards provided.
  4. Test Your Code: Add unit tests for your tokenizer in the tests/ directory.
  5. Submit a Pull Request: Create a pull request with a clear description of your contribution.

What to Contribute

  • Tokenizers for specific African languages.
  • Text preprocessing tools (e.g., stemming, normalization).
  • Benchmarks and evaluation frameworks.
  • Documentation, tutorials, and case studies.

📄 License

This project is licensed under the MIT License. By contributing to this repository, you agree to its terms.

📚 Resources

🤝 Join the Community

Let’s build something impactful together!

  • Issues: Use the Issues tab to suggest features or report bugs.
  • Discussions: Share your ideas and feedback in the Discussions.
  • Contributors: Check out the Contributors page.

For questions, feel free to reach out at info@msingi.ai or to me korirkiplangat22@gmail.com.

🌍 Why Contribute?

This project is more than just code. It’s about representation, collaboration, and building tools that make a difference. Join us in laying the foundation for African NLP!

About

Language tokenizers and preprocessing tools for low-resource African languages, developed by MsingiAI to empower NLP research and applications.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors