GuiferrSouza/document-search-api

Document Search API

REST API for full-text indexing and search of PDF documents and web pages using SQLite FTS5.

Features

  • PDF and URL indexing
  • Full-text search with SQLite FTS5
  • Exact term position highlighting
  • Context snippet extraction
  • Complete document management (CRUD)
  • Automatic URL duplicate prevention

Run

dsa --host <host> --port <port>

By default:

  • host = localhost
  • port = 8000

Access interactive documentation at: http://<host>:<port>/docs

Endpoints

Indexing

Upload PDF

POST /documents/upload
Content-Type: multipart/form-data

file: document.pdf

Response:

{
  "id": "uuid",
  "message": "Document indexed successfully"
}
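
A minimal client-side sketch of the upload request, using only the Python standard library. The base URL, the helper name `build_upload_request`, and the form field name `file` (taken from the example above) are assumptions for illustration; the request is built but not sent.

```python
import uuid
import urllib.request


def build_upload_request(base_url: str, filename: str, pdf_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST for /documents/upload."""
    boundary = uuid.uuid4().hex  # random boundary that will not appear in the payload
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/documents/upload",
        data=head + pdf_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Hypothetical usage: placeholder bytes stand in for a real PDF file.
request = build_upload_request("http://localhost:8000", "document.pdf", b"%PDF-1.4 ...")
```

Passing `request` to `urllib.request.urlopen` would send it to a running server.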

Index URL

POST /documents/from-url?url=https://example.com

Response:

{
  "id": "uuid",
  "message": "URL indexed successfully",
  "action": "created"
}

If the URL is already indexed, its content is updated and the existing ID is kept (action: "updated").
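
When the target address itself contains a query string, the `url` parameter needs percent-encoding, otherwise its `&` and `=` characters would be read as parameters of the API call. A small sketch, where the helper name `build_index_url_request` and the base URL are illustrative assumptions:

```python
import urllib.request
from urllib.parse import urlencode


def build_index_url_request(base_url: str, page_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST for /documents/from-url with an encoded url parameter."""
    query = urlencode({"url": page_url})  # percent-encodes ':', '/', '?', '&', '='
    return urllib.request.Request(f"{base_url}/documents/from-url?{query}", method="POST")


# Hypothetical page URL that carries its own query string.
req = build_index_url_request("http://localhost:8000", "https://example.com/a?b=1&c=2")
```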

Search

Basic Search

GET /search?query=python

Returns a list of matching documents, each with its ID, title, and type.

Search with Positions

GET /search?query=python&include_matches=true

Adds exact positions for each occurrence:

{
  "id": "uuid",
  "title": "document.pdf",
  "type": "application/pdf",
  "match_count": 5,
  "matches": [
    {
      "term": "python",
      "start": 245,
      "end": 251,
      "matched_text": "Python"
    }
  ]
}
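
The response above can be loaded into typed objects on the client side. This is only a sketch: the class names `Match` and `SearchHit` are hypothetical, and reading `end` as an exclusive offset is an assumption inferred from the sample (251 - 245 equals the length of "Python").

```python
import json
from dataclasses import dataclass


@dataclass
class Match:
    term: str          # search term as given in the query
    start: int         # offset of the first matched character
    end: int           # assumed exclusive end offset
    matched_text: str  # text as it appears in the document


@dataclass
class SearchHit:
    id: str
    title: str
    type: str
    match_count: int
    matches: list[Match]


def parse_hit(raw: dict) -> SearchHit:
    """Convert one decoded search result into a SearchHit."""
    return SearchHit(
        id=raw["id"],
        title=raw["title"],
        type=raw["type"],
        match_count=raw.get("match_count", 0),
        matches=[Match(**m) for m in raw.get("matches", [])],
    )


sample = json.loads("""
{
  "id": "uuid",
  "title": "document.pdf",
  "type": "application/pdf",
  "match_count": 5,
  "matches": [
    {"term": "python", "start": 245, "end": 251, "matched_text": "Python"}
  ]
}
""")
hit = parse_hit(sample)
first = hit.matches[0]
```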

Complete Search

GET /search?query=python&include_matches=true&include_snippets=true

Adds context snippets around each match.

Optional parameters:

  • max_matches: Match limit per term (default: 200, max: 1000)
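
The three search variants differ only in their query parameters, so a client can build them with one helper. A sketch under stated assumptions: `build_search_url` and the base URL are hypothetical, and the upper-bound check simply mirrors the documented maximum of 1000 (how the server reacts to larger values is not covered here).

```python
from typing import Optional
from urllib.parse import urlencode


def build_search_url(
    base_url: str,
    query: str,
    include_matches: bool = False,
    include_snippets: bool = False,
    max_matches: Optional[int] = None,
) -> str:
    """Assemble a /search URL from the documented query parameters."""
    params = {"query": query}
    if include_matches:
        params["include_matches"] = "true"
    if include_snippets:
        params["include_snippets"] = "true"
    if max_matches is not None:
        if max_matches > 1000:  # documented maximum
            raise ValueError("max_matches must not exceed 1000")
        params["max_matches"] = str(max_matches)
    return f"{base_url}/search?{urlencode(params)}"


url = build_search_url(
    "http://localhost:8000",
    "python sqlite",
    include_matches=True,
    include_snippets=True,
    max_matches=50,
)
```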

List Documents

GET /documents

Returns the ID, title, and type of every document.

Get by ID

GET /documents/{id}

Returns the complete document, including its content.

Delete

DELETE /documents/{id}

Deletes a document by ID.

Data Structure

Document in Database

id: Unique UUID
title: File name or URL
content: Extracted text
type: application/pdf or web

FTS5 Index

An SQLite FTS5 (Full-Text Search 5) virtual table provides optimized search over id, title, content, and type.
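
To illustrate what such an index looks like, here is a minimal sketch with Python's built-in `sqlite3` module. The table name `documents_fts`, the sample rows, and the helper `search_ids` are assumptions for illustration, not the project's actual schema, and the snippet assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One FTS5 virtual table covering the four documented columns.
conn.execute("CREATE VIRTUAL TABLE documents_fts USING fts5(id, title, content, type)")
conn.executemany(
    "INSERT INTO documents_fts (id, title, content, type) VALUES (?, ?, ?, ?)",
    [
        ("doc-1", "guide.pdf", "Python makes scripting easy.", "application/pdf"),
        ("doc-2", "https://example.com", "SQLite ships with an FTS5 extension.", "web"),
    ],
)


def search_ids(connection: sqlite3.Connection, query: str) -> list[str]:
    """Return IDs of documents matching an FTS5 query, best match first."""
    rows = connection.execute(
        "SELECT id FROM documents_fts WHERE documents_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


python_hits = search_ids(conn, "python")  # default tokenizer folds case
sqlite_hits = search_ids(conn, "sqlite")
```

The MATCH operand is parsed as an FTS5 query expression, so user input containing quotes or operators may need escaping before it is passed through.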

Text Search

How It Works

  1. SQLite FTS5 identifies relevant documents using its inverted index.
  2. A regex pass finds the exact position of each term in the content.
  3. Snippet extraction retrieves the context around each match.
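
The snippet step can be sketched as a simple slice around a known match position. The function name `extract_snippet`, the `radius` parameter, and the ellipsis markers are assumptions; how the API itself sizes and marks its snippets is not specified here.

```python
def extract_snippet(content: str, start: int, end: int, radius: int = 40) -> str:
    """Return the text around content[start:end], padded by up to `radius` characters."""
    left = max(0, start - radius)
    right = min(len(content), end + radius)
    prefix = "..." if left > 0 else ""               # text was cut on the left
    suffix = "..." if right < len(content) else ""   # text was cut on the right
    return prefix + content[left:right] + suffix


text = "The quick brown fox jumps over the lazy dog"
# "fox" occupies offsets 16..19 (end exclusive).
snippet = extract_snippet(text, 16, 19, radius=6)
```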

Search Features

  • Case-insensitive
  • Whole word search (no partial matches)
  • Multiple terms separated by spaces
  • Matches ordered by position in the document
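
These behaviours can be reproduced with a short regex pass. The sketch below is not the project's implementation: `find_matches` is a hypothetical helper that uses `\b` word boundaries for whole-word matching, `re.IGNORECASE` for case-insensitivity, and a per-term limit that mirrors the `max_matches` default of 200.

```python
import re


def find_matches(content: str, query: str, max_matches: int = 200) -> list[dict]:
    """Locate whole-word, case-insensitive occurrences of each space-separated term."""
    matches = []
    for term in query.split():
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        for count, found in enumerate(pattern.finditer(content)):
            if count >= max_matches:  # per-term limit
                break
            matches.append({
                "term": term,
                "start": found.start(),
                "end": found.end(),  # exclusive offset
                "matched_text": found.group(),
            })
    # Order all matches by where they occur in the document.
    matches.sort(key=lambda m: m["start"])
    return matches


text = "Python and python3 love SQLite; PYTHON wins."
found = find_matches(text, "python sqlite")
```

In this sample, `python3` is skipped because the trailing digit means there is no word boundary after `python`.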

License

MIT License.

About

A lightweight web API for uploading and searching content from URLs and PDFs.
