
ForzaEmbed: Benchmarking Framework for Text Embeddings

License: MIT Python 3.13+ Documentation Hugging Face Demo GitHub release

ForzaEmbed is a Python framework for systematically benchmarking text embedding models and processing strategies. It performs an exhaustive grid search across a configurable parameter space to help you find the optimal configuration for your document corpus.

📖 Documentation · 🚀 Live Demo · 📦 Releases

Demo video: demo_forzaembed.mp4

How It Works

ForzaEmbed automates the process of evaluating text embedding configurations by following these steps:

  1. Data Loading: Ingests source documents from the markdowns/ directory.
  2. Grid Search: Iterates through every combination of parameters defined in your configuration file (e.g., configs/config.yml).
  3. Processing Pipeline: For each combination, the framework:
    • Chunks the text using a specified strategy.
    • Generates embeddings using the selected model.
    • Computes similarity scores based on defined themes.
  4. Evaluation: Calculates quality metrics (the silhouette score with its intra-/inter-cluster decomposition) and records embedding computation time to assess each configuration.
  5. Persistence & Caching: Stores all results, metrics, and generated embeddings in a SQLite database. This caching accelerates subsequent runs by avoiding redundant computations.
  6. Report Generation: Produces detailed reports, including a standalone interactive web interface (single HTML file), to visualize and analyze the findings without needing a server.
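At its core, this workflow is a grid search with result caching. The sketch below is illustrative, not ForzaEmbed's actual API: `evaluate` stands in for the chunk/embed/score pipeline, and the `cache` dict stands in for the SQLite database.

```python
from itertools import product

# Hypothetical parameter grid, mirroring the shape of a config file.
grid = {
    "chunk_size": [100, 250],
    "chunk_overlap": [10, 25],
    "strategy": ["langchain", "raw"],
}

cache = {}  # stands in for the SQLite results database


def evaluate(chunk_size, chunk_overlap, strategy):
    # Placeholder for the real pipeline: chunk -> embed -> score -> metrics.
    return {"silhouette": 0.0}


results = {}
for combo in product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    key = tuple(sorted(params.items()))
    if key in cache:  # resuming: completed work is skipped
        results[key] = cache[key]
        continue
    cache[key] = results[key] = evaluate(**params)

# 2 * 2 * 2 = 8 combinations evaluated
```

Because results are keyed by their full parameter set, re-running the same grid skips everything already computed, which is what makes interrupted runs resumable.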

Project Structure

Understanding the directory layout is key to using ForzaEmbed effectively.

ForzaEmbed/
├── configs/
│   └── config.yml        # Your analysis configuration files go here.
├── markdowns/
│   └── document.md       # Your source text files (.md) go here.
├── reports/
│   └── ForzaEmbed_config.db # SQLite database for results.
├── src/
│   └── ...               # Source code of the application.
└── main.py               # The main script to run the tool.
  • configs/: This directory holds your YAML configuration files. You can create multiple configurations for different experiments (e.g., config_horaires.yml, config_topics.yml).
  • markdowns/: Place the text documents you want to analyze here. The tool will process all .md files in this folder.
  • reports/: This is where all outputs are stored, including the SQLite database and the final interactive web report.
    • ForzaEmbed_<config_name>.db: The central SQLite database. It stores all experiment results and metrics.

Getting Started

1. Installation

This project uses uv for fast and efficient package management.

# 1. Clone the repository
git clone https://github.com/berangerthomas/ForzaEmbed.git
cd ForzaEmbed

# 2. Install dependencies
pip install uv
uv sync

2. Place Your Documents

Put your markdown (.md) files into the markdowns/ directory.

3. Configure the Analysis

Open a configuration file (e.g., configs/config.yml) and define the parameters for your grid search. This includes:

  • Chunking strategies, sizes, and overlaps.
  • Embedding models to test (from Hugging Face, FastEmbed, etc.).
  • Similarity metrics.
  • Thematic keywords for the analysis.

Refer to the Configuration Guide below for detailed explanations.

4. Run the Pipeline

Execute the main script from your terminal to start the process. See the Command-Line Usage section below for detailed commands.


Command-Line Usage

ForzaEmbed is controlled via a command-line interface.

First Run

To start a new analysis from scratch, use the --run command and specify your configuration file.

python main.py --run --config-path configs/config.yml

This command will:

  1. Read the documents from the markdowns/ directory.
  2. Execute the grid search based on configs/config.yml.
  3. Save all results and embeddings to reports/ForzaEmbed_config.db.
  4. Generate a detailed standalone interactive HTML report in the reports/ directory (e.g., reports/config_index.html).

Resuming a Run

If a run is interrupted, simply execute the same command again. ForzaEmbed automatically detects completed work and resumes from where it left off.

python main.py --run --config-path configs/config.yml

Generating Reports Only

If you want to regenerate the reports from existing data in the database without re-running the computations, use the --generate-reports command.

python main.py --generate-reports --config-path configs/config.yml

This is useful for changing the number of top results displayed (--top-n) or tweaking report settings.


Configuration Guide

The config.yml file is the control center for your analysis. It's written in YAML and is divided into several sections. Here’s a breakdown based on a real-world example for analyzing text related to business hours:

# Parameters for the grid search
grid_search_params:
  chunk_size: [50, 100, 250, 500]
  chunk_overlap: [10, 25, 50]
  chunking_strategy: ["langchain", "raw", "semchunk", "nltk", "spacy"]
  similarity_metrics: ["cosine", "euclidean", "dot_product"]
  themes:
    horaires_ouverture: ["horaires d'ouverture", "heures d'ouverture", "accueil du public"]
    jours_fermeture: ["jour de fermeture", "fermeture exceptionnelle", "fermeture annuelle", "jours fériés"]
    jours_semaine: ["lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi", "dimanche"]

# Models to be tested in the grid search
models_to_test:
  - type: "fastembed"
    name: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    dimensions: 384
  - type: "fastembed"
    name: "intfloat/multilingual-e5-large"
    dimensions: 1024
  - type: "huggingface"
    name: "Qwen/Qwen3-Embedding-0.6B"
    dimensions: 1024
  - type: "api"
    name: "nomic-embed-text"
    base_url: "https://api.nomic.ai/v1" # Example, replace with your provider
    dimensions: 768
    timeout: 240

# General settings
similarity_threshold: 0.6
output_dir: "reports"

# Database settings
database:
  # Enable intelligent quantization to reduce storage size.
  # For example, embeddings are converted from float64 to float16.
  intelligent_quantization: true

# Multiprocessing settings
multiprocessing:
  max_workers_api: 16
  max_workers_local: null # Set to a number to limit CPU cores for local models
  maxtasksperchild: 10
  embedding_batch_size_api: 100
  embedding_batch_size_local: 500
  file_batch_size: 50
  api_batch_sizes:
    mistral: 50
    default: 100

grid_search_params

This section defines the parameter space for the grid search. The framework will test every possible combination of the values you provide.

  • chunk_size: A list of integers representing the different chunk sizes (in tokens or characters, depending on the strategy) to test.
  • chunk_overlap: A list of integers for the number of tokens/characters to overlap between chunks.
  • chunking_strategy: A list of chunking algorithms to evaluate.
  • similarity_metrics: A list of metrics for calculating similarity scores.
  • themes: A dictionary where each key is a theme name (e.g., horaires_ouverture) and the value is a list of keywords and phrases related to that theme. The analysis will be based on these themes.
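To make chunk_size and chunk_overlap concrete, here is a minimal character-based sliding window. It is a simplified stand-in for the "raw" strategy; the real strategies may count tokens rather than characters.

```python
def chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Character-based sliding window (illustrative, not ForzaEmbed's code).

    Consecutive chunks share `chunk_overlap` characters, so each new
    window starts `chunk_size - chunk_overlap` characters after the last.
    """
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "a" * 120
pieces = chunk(text, chunk_size=50, chunk_overlap=10)
# windows start at 0, 40, 80 -> chunk lengths 50, 50, 40
```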

models_to_test

A list of embedding models to evaluate. Each model is an object with the following properties:

  • type: The provider of the model. Can be fastembed, huggingface, sentence_transformers, or api.
  • name: The official model name (e.g., "intfloat/multilingual-e5-large").
  • dimensions: The embedding dimension of the model.
  • base_url (for api type): The base URL of the embedding API endpoint.
  • timeout (for api type, optional): The request timeout in seconds.
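A model entry can be checked against these fields before launching a long run. The validator below is illustrative and not part of ForzaEmbed; only the field names come from the config format above.

```python
# Required fields and valid provider types, per the config format above.
REQUIRED = {"type", "name", "dimensions"}
VALID_TYPES = {"fastembed", "huggingface", "sentence_transformers", "api"}


def validate_model(entry: dict) -> list[str]:
    """Return a list of problems with a models_to_test entry (illustrative)."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED - entry.keys())]
    if entry.get("type") not in VALID_TYPES:
        errors.append(f"unknown type: {entry.get('type')!r}")
    if entry.get("type") == "api" and "base_url" not in entry:
        errors.append("api models require base_url")
    return errors


ok = {"type": "fastembed",
      "name": "intfloat/multilingual-e5-large",
      "dimensions": 1024}
bad = {"type": "api", "name": "nomic-embed-text", "dimensions": 768}

assert validate_model(ok) == []
assert validate_model(bad) == ["api models require base_url"]
```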

General & Database Settings

  • similarity_threshold: A float between 0.0 and 1.0. In the t-SNE visualization, points with a similarity score above this threshold will be highlighted.
  • output_dir: The directory where reports will be saved (default is "reports").
  • database.intelligent_quantization: If true, enables optimizations to reduce the database size by storing numerical data in more efficient formats.
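The float64-to-float16 conversion mentioned in the config comments can be illustrated with NumPy: a 1024-dimensional embedding shrinks to a quarter of its original size. This is only a sketch of the idea, not ForzaEmbed's storage code.

```python
import numpy as np

# A 1024-dimensional float64 embedding, as a model might return it.
emb = np.random.default_rng(0).normal(size=1024)

# Roughly what intelligent_quantization does before storing the vector.
quantized = emb.astype(np.float16)

# 8192 bytes -> 2048 bytes: a 4x reduction per embedding.
```

The trade-off is precision: float16 keeps about 3 significant decimal digits, which is usually ample for similarity ranking.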

multiprocessing

Configure settings for parallel processing to speed up computations.

  • max_workers_api / max_workers_local: The number of parallel workers for API-based and local models.
  • embedding_batch_size_api / embedding_batch_size_local: The number of texts to process in a single batch for embedding generation.
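The batch-size settings control how many texts are grouped together before being dispatched to a worker or an API call. A minimal, illustrative batching helper:

```python
def batches(items: list, batch_size: int):
    """Yield successive fixed-size batches; the last one may be smaller.

    Illustrates how embedding_batch_size_* groups texts (not ForzaEmbed's code).
    """
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


texts = [f"chunk {i}" for i in range(1050)]
groups = list(batches(texts, batch_size=500))
# 3 batches of sizes 500, 500, 50
```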

Key Features

  • Smart Grid Search: Intelligently optimizes parameter combinations by avoiding redundant calculations for chunking strategies that don't use chunk_size/overlap parameters (like nltk and spacy). This can reduce grid search time by up to 40%.
  • Broad Model Support: Interfaces with multiple embedding providers, including local models (Hugging Face, FastEmbed, SentenceTransformers) and API-based services.
  • Versatile Chunking: Implements various chunking methods:
    • Parameter-sensitive: langchain, raw, semchunk (use chunk_size and chunk_overlap)
    • Parameter-insensitive: nltk, spacy (sentence-based, ignore chunk parameters)
  • Multiple Similarity Metrics: Supports cosine, euclidean, manhattan, dot_product, and chebyshev.
  • Focused Evaluation Metrics: Uses silhouette score with intra/inter-cluster distance decomposition and embedding computation time tracking for efficient quality assessment.
  • Resumable & Cached: Caches embeddings and t-SNE results in a SQLite database to accelerate subsequent runs and allows resuming interrupted workflows seamlessly.
  • Robust Database Management: Uses SQLAlchemy ORM for reliable, efficient, and structured data storage in SQLite.
  • Intelligent Database Quantization: Automatically reduces database size by storing numerical data (like embeddings and similarities) in more efficient formats (e.g., float16).
  • Rich Reporting: Produces detailed comparison charts, CSV exports, and a standalone interactive web interface (single HTML file) with heatmaps and t-SNE visualizations. No external server or complex setup required to view results.
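For reference, the five supported similarity/distance metrics can be computed with NumPy. The toy 3-dimensional vectors are only for illustration; real embeddings have 384 or more dimensions.

```python
import numpy as np

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])

dot = float(a @ b)                                        # 1.0
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))    # 0.5
euclidean = float(np.linalg.norm(a - b))                  # sqrt(2)
manhattan = float(np.abs(a - b).sum())                    # 2.0
chebyshev = float(np.abs(a - b).max())                    # 1.0
```

Note that cosine and dot product are similarities (higher is closer), while euclidean, manhattan, and chebyshev are distances (lower is closer).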

License

This project is licensed under the MIT License. See the LICENSE file for details.