A Python-based tool designed to extract key information from PDF documents using AI-powered models. The system parses documents, retrieves relevant sections, and extracts the data points in a structured format.

The core components of the project are:
- read: Parses any PDF and identifies its sections (headers, text, images, etc.).
- retrieve: Retrieves the document chunks relevant to the specified data points.
- extract: Extracts key data points from the relevant chunks, along with bounding box coordinates and references.
- api: Serves the product through an API that takes PDFs as input and returns the extracted data points.
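
To make the flow between the read, retrieve, and extract components concrete, here is a minimal sketch of how the three stages could hand data to one another. Everything in it is hypothetical: the `Chunk` and `DataPoint` classes, the function signatures, and the fake bounding boxes are illustrative assumptions, not the project's actual code. Keyword matching and a regular expression stand in for the AI-powered retrieval and extraction, and already-extracted page text stands in for real PDF parsing.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical bounding box shape: (x0, y0, x1, y1) in page points.
BBox = Tuple[float, float, float, float]


@dataclass
class Chunk:
    """One parsed piece of a document (illustrative shape, not the real entity)."""
    page: int
    kind: str  # e.g. "header" or "text"
    text: str
    bbox: BBox


@dataclass
class DataPoint:
    """An extracted value plus a reference to where it was found."""
    name: str
    value: Optional[str]
    page: Optional[int]
    bbox: Optional[BBox]


def read(pages: List[str]) -> List[Chunk]:
    """Stand-in for PDF parsing: one chunk per non-empty line of page text."""
    chunks = []
    for page_no, page_text in enumerate(pages, start=1):
        for line_no, line in enumerate(page_text.splitlines()):
            if not line.strip():
                continue
            # Naive section detection: all-caps lines are treated as headers.
            kind = "header" if line.isupper() else "text"
            # Fake bounding box: full page width, one 12-point row per line.
            bbox = (0.0, 12.0 * line_no, 595.0, 12.0 * (line_no + 1))
            chunks.append(Chunk(page_no, kind, line.strip(), bbox))
    return chunks


def retrieve(chunks: List[Chunk], keywords: List[str]) -> List[Chunk]:
    """Keep the chunks that mention any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [c for c in chunks if any(k in c.text.lower() for k in lowered)]


def extract(chunks: List[Chunk], name: str, pattern: str) -> DataPoint:
    """Return the first regex match together with its page and bounding box."""
    for chunk in chunks:
        match = re.search(pattern, chunk.text)
        if match:
            return DataPoint(name, match.group(1), chunk.page, chunk.bbox)
    return DataPoint(name, None, None, None)


# Made-up page text, used only to exercise the three stages.
pages = ["ENERGY PERFORMANCE CERTIFICATE\nEnergy rating: C\nValid until: 2031-05-04"]
chunks = read(pages)
relevant = retrieve(chunks, ["energy rating"])
result = extract(relevant, "energy_rating", r"Energy rating:\s*([A-G])")
print(result)
```

The point of the sketch is the shape of the data: each stage narrows the previous stage's output, and the extracted value keeps the page and bounding box of the chunk it came from.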

The repository is laid out as follows:
doc-assistant/
│
├── api/ # API routes and entry point
├── doc_assistant/
│ ├── pipelines/ # folder for different document pipelines
│ │ ├── epc.py # Energy Performance Certificate pipeline
│ │ ├── invoice.py # Invoice pipeline (can be added)
│ │ └── ...
│ ├── extract/ # Extraction logic (e.g., extractor functions)
│ ├── read/ # Parsing/reading logic
│ ├── retrieve/ # Logic to retrieve relevant chunks of text
│ ├── shared/ # Shared entities and logic
│ └── config/ # Configuration files for various document types
├── .env.sample # Template for environment variables
├── Makefile # Automation through make commands
├── README.md # Documentation
├── requirements.txt # Python dependencies

Prerequisites:
- Python 3.10
- Virtual environment (venv)
- Libraries specified in requirements.in

Installation:
- Clone the repository:
    git clone https://github.com/.git
    cd gide_rag
- Set up a virtual environment. Ensure Python 3.10 is installed, then create the environment using the Makefile:
    make create-env
- Activate the virtual environment.
  On macOS/Linux:
    source venv/bin/activate
  On Windows:
    venv\Scripts\activate

To run the backend tests with pytest:
    make test

- Run the FastAPI backend server:
    make run-backend
- Test the API with curl:
    curl -X 'POST' \
      'http://127.0.0.1:8000/extract/' \
      -H 'accept: application/json' \
      -H 'Content-Type: multipart/form-data' \
      -F 'document_type=epc' \
      -F 'language=en' \
      -F 'file=@doc_assistant/test_data/energy_performance_certificate_example.pdf'
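
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call: a multipart POST to `/extract/` with the `document_type`, `language`, and `file` fields. The helper names `build_extract_request` and `extract_from_pdf` are made up for illustration, and the code assumes the server answers with a JSON body, as the `accept: application/json` header suggests. The response schema is not documented here, so the result is returned as-is.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_extract_request(url, pdf_bytes, filename, document_type="epc", language="en"):
    """Build a multipart/form-data POST with the same three fields as the curl call."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields: document_type and language.
    for name, value in (("document_type", document_type), ("language", language)):
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    # File field, named "file" as in the curl command.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    parts.append(file_header + pdf_bytes + b"\r\n")
    # Closing boundary marks the end of the multipart body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        url,
        data=b"".join(parts),
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def extract_from_pdf(pdf_path, url="http://127.0.0.1:8000/extract/"):
    """Send a PDF to a locally running server and return the decoded JSON response."""
    path = Path(pdf_path)
    request = build_extract_request(url, path.read_bytes(), path.name)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server started via make run-backend, calling `extract_from_pdf("doc_assistant/test_data/energy_performance_certificate_example.pdf")` sends the sample document used in the curl example. Building the request separately from sending it keeps the multipart encoding easy to inspect without a running server.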