ForzaEmbed is a Python framework for systematically benchmarking text embedding models and processing strategies. It performs an exhaustive grid search across a configurable parameter space to help you find the optimal configuration for your document corpus.
📖 Documentation · 🚀 Live Demo · 📦 Releases
ForzaEmbed automates the process of evaluating text embedding configurations by following these steps:
- Data Loading: Ingests source documents from the `markdowns/` directory.
- Grid Search: Iterates through every combination of parameters defined in your configuration file (e.g., `configs/config.yml`).
- Processing Pipeline: For each combination, the framework:
  - Chunks the text using the specified strategy.
  - Generates embeddings using the selected model.
  - Computes similarity scores based on the defined themes.
- Evaluation: Calculates clustering metrics (silhouette score with intra-/inter-cluster decomposition) and tracks embedding computation time to assess the quality of the results.
- Persistence & Caching: Stores all results, metrics, and generated embeddings in a SQLite database. This caching accelerates subsequent runs by avoiding redundant computations.
- Report Generation: Produces detailed reports, including a standalone interactive web interface (single HTML file), to visualize and analyze the findings without needing a server.
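The grid-search loop at the heart of these steps can be sketched roughly as follows. This is an illustration only: the function names (`chunk_text`, `run_grid`) and the character-window chunker are ours, not ForzaEmbed's actual API.

```python
from itertools import product

def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into overlapping character windows (roughly the 'raw' strategy)."""
    step = max(chunk_size - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def run_grid(text: str, chunk_sizes, overlaps, strategies):
    """Yield one result per parameter combination, mirroring the exhaustive search."""
    for size, overlap, strategy in product(chunk_sizes, overlaps, strategies):
        chunks = chunk_text(text, size, overlap)
        yield {"chunk_size": size, "chunk_overlap": overlap,
               "strategy": strategy, "n_chunks": len(chunks)}

results = list(run_grid("some markdown text " * 20, [50, 100], [10, 25], ["raw"]))
print(len(results))  # 2 sizes x 2 overlaps x 1 strategy = 4 combinations
```

In the real pipeline, each yielded combination would then be embedded, scored against the themes, and persisted to SQLite.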
Understanding the directory layout is key to using ForzaEmbed effectively.
```text
ForzaEmbed/
├── configs/
│   └── config.yml             # Your analysis configuration files go here.
├── markdowns/
│   └── document.md            # Your source text files (.md) go here.
├── reports/
│   └── ForzaEmbed_config.db   # SQLite database for results.
├── src/
│   └── ...                    # Source code of the application.
└── main.py                    # The main script to run the tool.
```
- `configs/`: This directory holds your YAML configuration files. You can create multiple configurations for different experiments (e.g., `config_horaires.yml`, `config_topics.yml`).
- `markdowns/`: Place the text documents you want to analyze here. The tool will process all `.md` files in this folder.
- `reports/`: This is where all outputs are stored, including the SQLite database and the final interactive web report.
- `ForzaEmbed_<config_name>.db`: The central SQLite database. It stores all experiment results and metrics.
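Consuming the `markdowns/` directory amounts to reading every `.md` file as one input document. A minimal sketch (the `load_documents` helper is illustrative, not part of ForzaEmbed's API):

```python
from pathlib import Path

def load_documents(root: str = "markdowns") -> dict[str, str]:
    """Map each .md filename in the directory to its text content."""
    return {p.name: p.read_text(encoding="utf-8")
            for p in sorted(Path(root).glob("*.md"))}
```

Non-`.md` files in the directory are simply ignored.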
This project uses uv for fast and efficient package management.
```bash
# 1. Clone the repository
git clone https://github.com/berangerthomas/ForzaEmbed.git
cd ForzaEmbed

# 2. Install dependencies
pip install uv
uv sync
```

Put your markdown (`.md`) files into the `markdowns/` directory.
Open a configuration file (e.g., configs/config.yml) and define the parameters for your grid search. This includes:
- Chunking strategies, sizes, and overlaps.
- Embedding models to test (from Hugging Face, FastEmbed, etc.).
- Similarity metrics.
- Thematic keywords for the analysis.
Refer to the Configuration Guide below for detailed explanations.
Execute the main script from your terminal to start the process. See the Command-Line Usage section below for detailed commands.
ForzaEmbed is controlled via a command-line interface.
To start a new analysis from scratch, use the --run command and specify your configuration file.
```bash
python main.py --run --config-path configs/config.yml
```

This command will:

- Read the documents from the `markdowns/` directory.
- Execute the grid search based on `configs/config.yml`.
- Save all results and embeddings to `reports/ForzaEmbed_config.db`.
- Generate a detailed standalone interactive HTML report in the `reports/` directory (e.g., `reports/config_index.html`).
If a run is interrupted, simply execute the same command again. ForzaEmbed automatically detects completed work and resumes from where it left off.
```bash
python main.py --run --config-path configs/config.yml
```

If you want to regenerate the reports from existing data in the database without re-running the computations, use the `--generate-reports` command.
```bash
python main.py --generate-reports --config-path configs/config.yml
```

This is useful for changing the number of top results displayed (`--top-n`) or tweaking report settings.
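The CLI surface documented above can be approximated with `argparse`. This sketch covers only the flags this README mentions; the real `main.py` may accept more options or defaults.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Approximate ForzaEmbed's documented CLI flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--run", action="store_true",
                        help="Start or resume a grid search.")
    parser.add_argument("--generate-reports", action="store_true",
                        help="Rebuild reports from the existing database.")
    parser.add_argument("--config-path", default="configs/config.yml",
                        help="YAML configuration file to use.")
    parser.add_argument("--top-n", type=int, default=10,
                        help="Number of top results to display in reports.")
    return parser

args = build_parser().parse_args(["--run", "--config-path", "configs/config.yml"])
```

Note the assumed default of 10 for `--top-n`; check `python main.py --help` for the actual value.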
The config.yml file is the control center for your analysis. It's written in YAML and is divided into several sections. Here’s a breakdown based on a real-world example for analyzing text related to business hours:
```yaml
# Parameters for the grid search
grid_search_params:
  chunk_size: [50, 100, 250, 500]
  chunk_overlap: [10, 25, 50]
  chunking_strategy: ["langchain", "raw", "semchunk", "nltk", "spacy"]
  similarity_metrics: ["cosine", "euclidean", "dot_product"]
  themes:
    horaires_ouverture: ["horaires d'ouverture", "heures d'ouverture", "accueil du public"]
    jours_fermeture: ["jour de fermeture", "fermeture exceptionnelle", "fermeture annuelle", "jours fériés"]
    jours_semaine: ["lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche"]

# Models to be tested in the grid search
models_to_test:
  - type: "fastembed"
    name: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    dimensions: 384
  - type: "fastembed"
    name: "intfloat/multilingual-e5-large"
    dimensions: 1024
  - type: "huggingface"
    name: "Qwen/Qwen3-Embedding-0.6B"
    dimensions: 1024
  - type: "api"
    name: "nomic-embed-text"
    base_url: "https://api.nomic.ai/v1"  # Example, replace with your provider
    dimensions: 768
    timeout: 240

# General settings
similarity_threshold: 0.6
output_dir: "reports"

# Database settings
database:
  # Enable intelligent quantization to reduce storage size.
  # For example, embeddings are converted from float64 to float16.
  intelligent_quantization: true

# Multiprocessing settings
multiprocessing:
  max_workers_api: 16
  max_workers_local: null  # Set to a number to limit CPU cores for local models
  maxtasksperchild: 10
  embedding_batch_size_api: 100
  embedding_batch_size_local: 500
  file_batch_size: 50
  api_batch_sizes:
    mistral: 50
    default: 100
```

The `grid_search_params` section defines the parameter space for the grid search. The framework will test every possible combination of the values you provide.
- `chunk_size`: A list of integers representing the different chunk sizes (in tokens or characters, depending on the strategy) to test.
- `chunk_overlap`: A list of integers for the number of tokens/characters to overlap between chunks.
- `chunking_strategy`: A list of chunking algorithms to evaluate.
- `similarity_metrics`: A list of metrics for calculating similarity scores.
- `themes`: A dictionary where each key is a theme name (e.g., `Economics_and_Finance`) and the value is a list of keywords and phrases related to that theme. The analysis will be based on these themes.
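The size of the search space is simply the product of the list lengths. For the example config above, a naive count works out to 180 combinations per model (the real framework prunes `chunk_size`/`chunk_overlap` variants for strategies that ignore them, so the effective count is lower):

```python
from math import prod

# List lengths taken from the example grid_search_params above.
grid = {
    "chunk_size": [50, 100, 250, 500],
    "chunk_overlap": [10, 25, 50],
    "chunking_strategy": ["langchain", "raw", "semchunk", "nltk", "spacy"],
    "similarity_metrics": ["cosine", "euclidean", "dot_product"],
}
n_combinations = prod(len(v) for v in grid.values())
print(n_combinations)  # 4 * 3 * 5 * 3 = 180
```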
A list of embedding models to evaluate. Each model is an object with the following properties:
- `type`: The provider of the model. Can be `fastembed`, `huggingface`, `sentence_transformers`, or `api`.
- `name`: The official model name (e.g., `"intfloat/multilingual-e5-large"`).
- `dimensions`: The embedding dimension of the model.
- `base_url` (for `api` type): The base URL of the embedding API endpoint.
- `timeout` (for `api` type, optional): The request timeout in seconds.
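Internally, the `type` field decides how each model entry is loaded. A hypothetical dispatch over the fields above (the `describe_model` helper and its strings are illustrative, not ForzaEmbed's internals):

```python
def describe_model(spec: dict) -> str:
    """Classify a models_to_test entry as local or remote based on its type."""
    kind = spec["type"]
    if kind in ("fastembed", "huggingface", "sentence_transformers"):
        return f"local:{spec['name']} ({spec['dimensions']}d)"
    if kind == "api":
        return f"remote:{spec['name']} via {spec['base_url']}"
    raise ValueError(f"unknown model type: {kind}")

print(describe_model({"type": "fastembed",
                      "name": "intfloat/multilingual-e5-large",
                      "dimensions": 1024}))
```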
- `similarity_threshold`: A float between 0.0 and 1.0. In the t-SNE visualization, points with a similarity score above this threshold will be highlighted.
- `output_dir`: The directory where reports will be saved (default is `"reports"`).
- `database.intelligent_quantization`: If `true`, enables optimizations to reduce the database size by storing numerical data in more efficient formats.
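To see why the float64-to-float16 conversion mentioned in the config comments shrinks storage, compare the packed byte sizes. This uses the standard `struct` module for illustration; ForzaEmbed's exact quantization scheme is internal.

```python
import struct

# A fake 1024-dimension embedding; 0.125 is exactly representable in float16.
embedding = [0.125] * 1024
as_float64 = struct.pack(f"{len(embedding)}d", *embedding)  # 8 bytes/value
as_float16 = struct.pack(f"{len(embedding)}e", *embedding)  # 2 bytes/value

print(len(as_float64), len(as_float16))  # 8192 bytes vs 2048 bytes: 4x smaller
```

The trade-off is precision: float16 keeps roughly 3 significant decimal digits, which is usually enough for similarity ranking.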
Configure settings for parallel processing to speed up computations.
- `max_workers_api` / `max_workers_local`: The number of parallel workers for API-based and local models.
- `embedding_batch_size_api` / `embedding_batch_size_local`: The number of texts to process in a single batch for embedding generation.
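The batch-size settings above partition the chunk list before it is handed to the worker pool. A sketch of that partitioning (the `batched` helper is illustrative):

```python
def batched(items: list, batch_size: int) -> list[list]:
    """Split items into consecutive batches of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

texts = [f"chunk {i}" for i in range(1200)]
api_batches = batched(texts, 100)    # embedding_batch_size_api
local_batches = batched(texts, 500)  # embedding_batch_size_local
print(len(api_batches), len(local_batches))  # 12 batches vs 3 batches
```

Smaller API batches keep individual requests under provider limits; larger local batches amortize model overhead per call.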
- Smart Grid Search: Intelligently optimizes parameter combinations by avoiding redundant calculations for chunking strategies that don't use `chunk_size`/`chunk_overlap` parameters (like `nltk` and `spacy`). This can reduce grid search time by up to 40%.
- Broad Model Support: Interfaces with multiple embedding providers, including local models (Hugging Face, FastEmbed, SentenceTransformers) and API-based services.
- Versatile Chunking: Implements various chunking methods:
  - Parameter-sensitive: `langchain`, `raw`, `semchunk` (use `chunk_size` and `chunk_overlap`)
  - Parameter-insensitive: `nltk`, `spacy` (sentence-based, ignore chunk parameters)
- Multiple Similarity Metrics: Supports `cosine`, `euclidean`, `manhattan`, `dot_product`, and `chebyshev`.
- Focused Evaluation Metrics: Uses silhouette score with intra-/inter-cluster distance decomposition and embedding computation time tracking for efficient quality assessment.
- Resumable & Cached: Caches embeddings and t-SNE results in a SQLite database to accelerate subsequent runs and allows resuming interrupted workflows seamlessly.
- Robust Database Management: Uses SQLAlchemy ORM for reliable, efficient, and structured data storage in SQLite.
- Intelligent Database Quantization: Automatically reduces database size by storing numerical data (like embeddings and similarities) in more efficient formats (e.g., float16).
- Rich Reporting: Produces detailed comparison charts, CSV exports, and a standalone interactive web interface (single HTML file) with heatmaps and t-SNE visualizations. No external server or complex setup required to view results.
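The five metric names in the feature list map to their standard definitions. A plain-Python sketch for reference (ForzaEmbed's own implementation is likely vectorized):

```python
from math import dist, sqrt

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product of the vectors over their norms."""
    return dot_product(a, b) / (sqrt(dot_product(a, a)) * sqrt(dot_product(b, b)))

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return dist(a, b)

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest single coordinate difference (L-infinity)."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = [1.0, 0.0], [0.0, 1.0]
print(cosine(a, b), euclidean(a, b))  # orthogonal vectors: 0.0 and sqrt(2)
```

Note that cosine is a similarity (higher means closer) while the other four are distances (lower means closer), which matters when interpreting the `similarity_threshold`.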
This project is licensed under the MIT License. See the LICENSE file for details.