
ForzaEmbed: Benchmarking Framework for Text Embeddings

License: MIT Python 3.13+ Documentation Hugging Face Demo GitHub release

ForzaEmbed is a Python framework for systematically benchmarking text embedding models and processing strategies. It performs an exhaustive grid search across a configurable parameter space to help you find the optimal configuration for your document corpus.

📖 Documentation · 🚀 Live Demo · 📦 Releases

Demo video: demo_forzaembed.mp4

How It Works

ForzaEmbed automates the process of evaluating text embedding configurations by following these steps:

  1. Data Loading: Ingests source documents from the markdowns/ directory.
  2. Grid Search: Iterates through every combination of parameters defined in your configuration file (e.g., configs/config.yml).
  3. Processing Pipeline: For each combination, the framework:
    • Chunks the text using a specified strategy.
    • Generates embeddings using the selected model.
    • Computes similarity scores based on defined themes.
  4. Evaluation: Calculates quality metrics (the silhouette score with its intra-/inter-cluster decomposition) and records embedding computation time to assess each configuration.
  5. Persistence & Caching: Stores all results, metrics, and generated embeddings in a SQLite database. This caching accelerates subsequent runs by avoiding redundant computations.
  6. Report Generation: Produces detailed reports, including a standalone interactive web interface (single HTML file), to visualize and analyze the findings without needing a server.
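At its core, this workflow is a grid search with result caching. The sketch below is illustrative, not ForzaEmbed's actual API: `evaluate` stands in for the chunk/embed/score pipeline, and the `cache` dict stands in for the SQLite database.

```python
from itertools import product

# Hypothetical parameter grid, mirroring the shape of a config file.
grid = {
    "chunk_size": [100, 250],
    "chunk_overlap": [10, 25],
    "strategy": ["langchain", "raw"],
}

cache = {}  # stands in for the SQLite results database


def evaluate(chunk_size, chunk_overlap, strategy):
    # Placeholder for the real pipeline: chunk -> embed -> score -> metrics.
    return {"silhouette": 0.0}


results = {}
for combo in product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    key = tuple(sorted(params.items()))
    if key in cache:  # resuming: completed work is skipped
        results[key] = cache[key]
        continue
    cache[key] = results[key] = evaluate(**params)

# 2 * 2 * 2 = 8 combinations evaluated
```

Because results are keyed by their full parameter set, re-running the same grid skips everything already computed, which is what makes interrupted runs resumable.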

Project Structure

Understanding the directory layout is key to using ForzaEmbed effectively.

ForzaEmbed/
├── configs/
│   └── config.yml        # Your analysis configuration files go here.
├── markdowns/
│   └── document.md       # Your source text files (.md) go here.
├── reports/
│   └── ForzaEmbed_config.db # SQLite database for results.
├── src/
│   └── ...               # Source code of the application.
└── main.py               # The main script to run the tool.
  • configs/: This directory holds your YAML configuration files. You can create multiple configurations for different experiments (e.g., config_horaires.yml, config_topics.yml).
  • markdowns/: Place the text documents you want to analyze here. The tool will process all .md files in this folder.
  • reports/: This is where all outputs are stored, including the SQLite database and the final interactive web report.
    • ForzaEmbed_<config_name>.db: The central SQLite database. It stores all experiment results and metrics.

Getting Started

1. Installation

This project uses uv for fast and efficient package management.

# 1. Clone the repository
git clone https://github.com/berangerthomas/ForzaEmbed.git
cd ForzaEmbed

# 2. Install dependencies
pip install uv
uv sync

2. Place Your Documents

Put your markdown (.md) files into the markdowns/ directory.

3. Configure the Analysis

Open a configuration file (e.g., configs/config.yml) and define the parameters for your grid search. This includes:

  • Chunking strategies, sizes, and overlaps.
  • Embedding models to test (from Hugging Face, FastEmbed, etc.).
  • Similarity metrics.
  • Thematic keywords for the analysis.

Refer to the Configuration Guide below for detailed explanations.

4. Run the Pipeline

Execute the main script from your terminal to start the process. See the Command-Line Usage section below for detailed commands.


Command-Line Usage

ForzaEmbed is controlled via a command-line interface.

First Run

To start a new analysis from scratch, use the --run command and specify your configuration file.

python main.py --run --config-path configs/config.yml

This command will:

  1. Read the documents from the markdowns/ directory.
  2. Execute the grid search based on configs/config.yml.
  3. Save all results and embeddings to reports/ForzaEmbed_config.db.
  4. Generate a detailed standalone interactive HTML report in the reports/ directory (e.g., reports/config_index.html).

Resuming a Run

If a run is interrupted, simply execute the same command again. ForzaEmbed automatically detects completed work and resumes from where it left off.

python main.py --run --config-path configs/config.yml

Generating Reports Only

If you want to regenerate the reports from existing data in the database without re-running the computations, use the --generate-reports command.

python main.py --generate-reports --config-path configs/config.yml

This is useful for changing the number of top results displayed (--top-n) or tweaking report settings.


Configuration Guide

The config.yml file is the control center for your analysis. It's written in YAML and is divided into several sections. Here’s a breakdown based on a real-world example for analyzing text related to business hours:

# Parameters for the grid search
grid_search_params:
  chunk_size: [50, 100, 250, 500]
  chunk_overlap: [10, 25, 50]
  chunking_strategy: ["langchain", "raw", "semchunk", "nltk", "spacy"]
  similarity_metrics: ["cosine", "euclidean", "dot_product"]
  themes:
    horaires_ouverture: ["horaires d'ouverture", "heures d'ouverture", "accueil du public"]
    jours_fermeture: ["jour de fermeture", "fermeture exceptionnelle", "fermeture annuelle", "jours fériés"]
    jours_semaine: ["lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche"]

# Models to be tested in the grid search
models_to_test:
  - type: "fastembed"
    name: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    dimensions: 384
  - type: "fastembed"
    name: "intfloat/multilingual-e5-large"
    dimensions: 1024
  - type: "huggingface"
    name: "Qwen/Qwen3-Embedding-0.6B"
    dimensions: 1024
  - type: "api"
    name: "nomic-embed-text"
    base_url: "https://api.nomic.ai/v1" # Example, replace with your provider
    dimensions: 768
    timeout: 240

# General settings
similarity_threshold: 0.6
output_dir: "reports"

# Database settings
database:
  # Enable intelligent quantization to reduce storage size.
  # For example, embeddings are converted from float64 to float16.
  intelligent_quantization: true

# Multiprocessing settings
multiprocessing:
  max_workers_api: 16
  max_workers_local: null # Set to a number to limit CPU cores for local models
  maxtasksperchild: 10
  embedding_batch_size_api: 100
  embedding_batch_size_local: 500
  file_batch_size: 50
  api_batch_sizes:
    mistral: 50
    default: 100

grid_search_params

This section defines the parameter space for the grid search. The framework will test every possible combination of the values you provide.

  • chunk_size: A list of integers representing the different chunk sizes (in tokens or characters, depending on the strategy) to test.
  • chunk_overlap: A list of integers for the number of tokens/characters to overlap between chunks.
  • chunking_strategy: A list of chunking algorithms to evaluate.
  • similarity_metrics: A list of metrics for calculating similarity scores.
  • themes: A dictionary where each key is a theme name (e.g., horaires_ouverture) and the value is a list of keywords and phrases related to that theme. The analysis will be based on these themes.
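To make chunk_size and chunk_overlap concrete, here is a minimal character-based sliding window. It is a simplified stand-in for the "raw" strategy; the real strategies may count tokens rather than characters.

```python
def chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Character-based sliding window (illustrative, not ForzaEmbed's code).

    Consecutive chunks share `chunk_overlap` characters, so each new
    window starts `chunk_size - chunk_overlap` characters after the last.
    """
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "a" * 120
pieces = chunk(text, chunk_size=50, chunk_overlap=10)
# windows start at 0, 40, 80 -> chunk lengths 50, 50, 40
```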

models_to_test

A list of embedding models to evaluate. Each model is an object with the following properties:

  • type: The provider of the model. Can be fastembed, huggingface, sentence_transformers, or api.
  • name: The official model name (e.g., "intfloat/multilingual-e5-large").
  • dimensions: The embedding dimension of the model.
  • base_url (for api type): The base URL of the embedding API endpoint.
  • timeout (for api type, optional): The request timeout in seconds.
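A model entry can be checked against these fields before launching a long run. The validator below is illustrative and not part of ForzaEmbed; only the field names come from the config format above.

```python
# Required fields and valid provider types, per the config format above.
REQUIRED = {"type", "name", "dimensions"}
VALID_TYPES = {"fastembed", "huggingface", "sentence_transformers", "api"}


def validate_model(entry: dict) -> list[str]:
    """Return a list of problems with a models_to_test entry (illustrative)."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED - entry.keys())]
    if entry.get("type") not in VALID_TYPES:
        errors.append(f"unknown type: {entry.get('type')!r}")
    if entry.get("type") == "api" and "base_url" not in entry:
        errors.append("api models require base_url")
    return errors


ok = {"type": "fastembed",
      "name": "intfloat/multilingual-e5-large",
      "dimensions": 1024}
bad = {"type": "api", "name": "nomic-embed-text", "dimensions": 768}

assert validate_model(ok) == []
assert validate_model(bad) == ["api models require base_url"]
```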

General & Database Settings

  • similarity_threshold: A float between 0.0 and 1.0. In the t-SNE visualization, points with a similarity score above this threshold will be highlighted.
  • output_dir: The directory where reports will be saved (default is "reports").
  • database.intelligent_quantization: If true, enables optimizations to reduce the database size by storing numerical data in more efficient formats.
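The float64-to-float16 conversion mentioned in the config comments can be illustrated with NumPy: a 1024-dimensional embedding shrinks to a quarter of its original size. This is only a sketch of the idea, not ForzaEmbed's storage code.

```python
import numpy as np

# A 1024-dimensional float64 embedding, as a model might return it.
emb = np.random.default_rng(0).normal(size=1024)

# Roughly what intelligent_quantization does before storing the vector.
quantized = emb.astype(np.float16)

# 8192 bytes -> 2048 bytes: a 4x reduction per embedding.
```

The trade-off is precision: float16 keeps about 3 significant decimal digits, which is usually ample for similarity ranking.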

multiprocessing

Configure settings for parallel processing to speed up computations.

  • max_workers_api / max_workers_local: The number of parallel workers for API-based and local models.
  • embedding_batch_size_api / embedding_batch_size_local: The number of texts to process in a single batch for embedding generation.
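The batch-size settings control how many texts are grouped together before being dispatched to a worker or an API call. A minimal, illustrative batching helper:

```python
def batches(items: list, batch_size: int):
    """Yield successive fixed-size batches; the last one may be smaller.

    Illustrates how embedding_batch_size_* groups texts (not ForzaEmbed's code).
    """
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


texts = [f"chunk {i}" for i in range(1050)]
groups = list(batches(texts, batch_size=500))
# 3 batches of sizes 500, 500, 50
```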

Key Features

  • Smart Grid Search: Intelligently optimizes parameter combinations by avoiding redundant calculations for chunking strategies that don't use chunk_size/overlap parameters (like nltk and spacy). This can reduce grid search time by up to 40%.
  • Broad Model Support: Interfaces with multiple embedding providers, including local models (Hugging Face, FastEmbed, SentenceTransformers) and API-based services.
  • Versatile Chunking: Implements various chunking methods:
    • Parameter-sensitive: langchain, raw, semchunk (use chunk_size and chunk_overlap)
    • Parameter-insensitive: nltk, spacy (sentence-based, ignore chunk parameters)
  • Multiple Similarity Metrics: Supports cosine, euclidean, manhattan, dot_product, and chebyshev.
  • Focused Evaluation Metrics: Uses silhouette score with intra/inter-cluster distance decomposition and embedding computation time tracking for efficient quality assessment.
  • Resumable & Cached: Caches embeddings and t-SNE results in a SQLite database to accelerate subsequent runs and allows resuming interrupted workflows seamlessly.
  • Robust Database Management: Uses SQLAlchemy ORM for reliable, efficient, and structured data storage in SQLite.
  • Intelligent Database Quantization: Automatically reduces database size by storing numerical data (like embeddings and similarities) in more efficient formats (e.g., float16).
  • Rich Reporting: Produces detailed comparison charts, CSV exports, and a standalone interactive web interface (single HTML file) with heatmaps and t-SNE visualizations. No external server or complex setup required to view results.
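For reference, the five supported similarity/distance metrics can be computed with NumPy. The toy 3-dimensional vectors are only for illustration; real embeddings have 384 or more dimensions.

```python
import numpy as np

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])

dot = float(a @ b)                                        # 1.0
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))    # 0.5
euclidean = float(np.linalg.norm(a - b))                  # sqrt(2)
manhattan = float(np.abs(a - b).sum())                    # 2.0
chebyshev = float(np.abs(a - b).max())                    # 1.0
```

Note that cosine and dot product are similarities (higher is closer), while euclidean, manhattan, and chebyshev are distances (lower is closer).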

License

This project is licensed under the MIT License. See the LICENSE file for details.