DocManager is a Streamlit-based PDF document manager for locally stored learning content. It helps users upload PDFs, store document metadata in SQLite, search by tags, lecture date, and full text, read documents page by page inside the app, and monitor basic usage analytics and reading progress.
When learning material is stored only as files in local folders, it becomes hard to:
- remember where a document is saved
- organize documents consistently
- find a PDF by tags or lecture date
- search inside PDF content for a keyword
- work with scanned PDFs that do not already contain a text layer
- understand how documents are being used inside the app
The app currently supports:
- uploading PDF documents only
- saving uploaded files locally
- saving document metadata in SQLite
- generating thumbnails for uploaded PDFs
- converting PDF pages into images for in-app reading
- full-text search inside PDF content
- OCR fallback for scanned or image-based PDFs
- searching by tags
- searching by lecture date
- reader-mode page navigation
- reading-progress tracking by unique pages viewed
- analytics for key app actions such as upload, search, open document, next page, previous page, and close reader
- resetting analytics data
- admin reset controls protected by `ADMIN_PASSWORD` in `.env`
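The OCR-fallback behavior in the feature list can be sketched as a small decision function. The callables below stand in for PyMuPDF text extraction and Tesseract OCR; the names are illustrative, not the project's actual API:

```python
def page_text_with_fallback(extract_text, run_ocr, page) -> str:
    """Return a page's text, using OCR only when direct extraction
    finds no embedded text layer (typical for scanned PDFs)."""
    text = extract_text(page).strip()
    if text:
        return text
    return run_ocr(page).strip()

# Stand-in callables simulating a scanned, image-only page:
scanned = page_text_with_fallback(
    extract_text=lambda p: "",                  # no text layer found
    run_ocr=lambda p: "Lecture 3: Graphs",      # OCR result
    page=None,
)
```

The same shape keeps OCR off the hot path for digital PDFs, since Tesseract is only invoked when direct extraction comes back empty.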
Tech stack:
- Language: Python
- UI Framework: Streamlit
- Database: SQLite
- PDF Processing: PyMuPDF
- OCR Engine: Tesseract OCR
- Analytics Table Handling: Pandas
- Environment Configuration: python-dotenv
- Local Storage: filesystem-based storage under `storage/` and `data/`
The application follows a local-first design. The Streamlit app runs as the presentation layer, uploaded PDFs and generated assets are stored on disk, and structured data is stored in SQLite.
```mermaid
flowchart TD
user[User] --> ui[Streamlit UI<br/>app/main.py]
ui --> service[DocumentService<br/>core/services.py]
ui --> analytics[AnalyticsService<br/>core/analytics.py]
service --> filemgr[FileManager]
service --> thumb[ThumbnailGenerator]
service --> reader[PDFReader]
service --> indexer[DocumentIndexer]
service --> repo[DocumentRepository]
filemgr --> pdfs[(storage/pdfs)]
thumb --> thumbs[(storage/thumbnails)]
reader --> pages[(page image folders)]
indexer --> chunks[DocumentChunk objects]
chunks --> repo
repo --> docs[(documents)]
repo --> chunk_table[(document_chunks)]
repo --> fts[(document_chunks_fts)]
analytics --> page_visits[(page_visits)]
analytics --> app_visits[(app_visits)]
ui --> analytics_tab[Analytics Tab]
analytics_tab --> analytics
```
- `app/main.py`: Streamlit entry point, session-state management, upload flow, search flow, reader mode, analytics UI, and admin reset controls
- `core/services.py`: orchestration layer for document upload, indexing, search, and reindexing
- `core/analytics.py`: analytics event recording and progress retrieval
- `core/file_manager.py`: PDF file saving
- `core/thumbnail.py`: thumbnail generation and page counting
- `core/reader.py`: PDF-to-image conversion for reader mode
- `core/indexer.py`: text extraction, OCR fallback, text normalization, and chunk creation
- `core/models.py`: shared models for documents, chunks, and search results
- `core/paths.py`: central path and directory helpers
- `db/database.py`: SQLite schema initialization and FTS setup
- `db/repository.py`: persistence and search queries
- PDF files are stored in `storage/pdfs/`
- generated thumbnails are stored in `storage/thumbnails/`
- page images for reader mode are stored in a folder derived from the PDF filename
- the SQLite database is stored under `data/`
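Deriving the page-image folder from the PDF filename might look like the sketch below; the `storage/pages` location is an assumption for illustration, not necessarily the layout used by `core/paths.py`:

```python
from pathlib import Path

STORAGE = Path("storage")

def page_image_dir(pdf_path: str) -> Path:
    """Folder holding a PDF's rendered page images, derived from its
    filename stem. The storage/pages location is illustrative."""
    return STORAGE / "pages" / Path(pdf_path).stem
```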
The schema is separated by responsibility:
- `documents`: one row per uploaded PDF with metadata and file paths
- `document_chunks`: many rows per document containing extracted searchable text
- `document_chunks_fts`: SQLite FTS5 virtual table for full-text search
- `page_visits`: page-level reading activity used for progress tracking
- `app_visits`: app-level event tracking such as upload, search, open document, next page, previous page, and close reader
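A minimal sketch of this schema using Python's built-in `sqlite3`. Table names follow the description above; individual column names (`title`, `lecture_date`, `tags`, and so on) are assumptions, and the real definitions live in `db/database.py`:

```python
import sqlite3

# Illustrative schema sketch; the exact columns in db/database.py may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id            INTEGER PRIMARY KEY,
    title         TEXT NOT NULL,
    lecture_date  TEXT,
    tags          TEXT,
    pdf_path      TEXT NOT NULL,
    total_pages   INTEGER
);
CREATE TABLE IF NOT EXISTS document_chunks (
    id           INTEGER PRIMARY KEY,
    document_id  INTEGER NOT NULL REFERENCES documents(id),
    page         INTEGER NOT NULL,
    text         TEXT NOT NULL
);
-- FTS5 virtual table mirroring document_chunks for full-text search.
CREATE VIRTUAL TABLE IF NOT EXISTS document_chunks_fts
    USING fts5(text, content='document_chunks', content_rowid='id');
"""

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Using an external-content FTS5 table keeps the searchable text in `document_chunks` and avoids storing it twice.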
Upload and indexing flow:
- User uploads a PDF from the Streamlit UI.
- `DocumentService` saves the file through `FileManager`.
- The app generates a thumbnail and reads the total page count.
- The PDF is converted into page images for reader mode.
- `DocumentIndexer` extracts page text.
- If direct extraction returns no text, OCR fallback is attempted.
- Text is normalized and split into overlapping chunks.
- Document metadata is written to `documents`.
- Text chunks are written to `document_chunks`.
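The overlapping-chunk step above can be sketched as follows; the chunk size and overlap values are illustrative, not the project's actual parameters:

```python
def make_chunks(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Split normalized page text into fixed-size chunks that overlap,
    so a phrase straddling a chunk boundary still matches in search."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start, step = [], 0, size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks
```

Each chunk starts `size - overlap` characters after the previous one, so consecutive chunks share their last/first `overlap` characters.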
Search flow:
- User searches by tag, lecture date, or full text.
- Metadata-only search reads from `documents`.
- Full-text search reads from `document_chunks_fts`.
- If FTS is unavailable, the repository falls back to a `LIKE` search on `document_chunks`.
- Results return the document, the matched page, and a snippet.
- Reader mode opens at the matched page for content-based search results.
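The FTS-with-`LIKE`-fallback behavior can be sketched like this. Table and column names follow the schema described earlier; the snippet formatting is illustrative, and the real queries live in `db/repository.py`:

```python
import sqlite3

def search_chunks(conn: sqlite3.Connection, query: str) -> list[tuple]:
    """Full-text search via FTS5, degrading to a LIKE scan when the
    FTS table (or FTS5 itself) is unavailable."""
    try:
        return conn.execute(
            "SELECT c.document_id, c.page, "
            "snippet(document_chunks_fts, 0, '[', ']', '…', 8) "
            "FROM document_chunks_fts "
            "JOIN document_chunks c ON c.id = document_chunks_fts.rowid "
            "WHERE document_chunks_fts MATCH ?",
            (query,),
        ).fetchall()
    except sqlite3.OperationalError:
        # FTS5 missing or table absent: plain substring scan instead.
        return conn.execute(
            "SELECT document_id, page, substr(text, 1, 80) "
            "FROM document_chunks WHERE text LIKE ?",
            (f"%{query}%",),
        ).fetchall()
```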
Reading and analytics flow:
- User opens a document from search results.
- The app loads page images from local storage.
- Reader navigation updates Streamlit session state.
- Each viewed page is written to `page_visits`.
- App actions are written to `app_visits`.
- The Analytics tab aggregates this data into usage charts and per-document progress.
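Progress-by-unique-pages can be computed as below, assuming `page_visits` has at least `document_id` and `page` columns (a sketch; the real aggregation lives in `core/analytics.py`):

```python
import sqlite3

def reading_progress(conn: sqlite3.Connection, document_id: int,
                     total_pages: int) -> float:
    """Fraction of a document read, counting each page at most once
    no matter how many times it was revisited."""
    (viewed,) = conn.execute(
        "SELECT COUNT(DISTINCT page) FROM page_visits WHERE document_id = ?",
        (document_id,),
    ).fetchone()
    return viewed / total_pages if total_pages else 0.0
```

`COUNT(DISTINCT page)` is what makes re-reading a page not inflate progress.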
Admin reset flow:
- `.env` is loaded at app startup.
- If `ADMIN_PASSWORD` exists, admin reset controls are shown.
- On successful password confirmation, the app deletes the SQLite database and the generated storage folders.
- Required directories are recreated and the database schema is initialized again.
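The delete-and-recreate sequence might be sketched as below. Paths are passed in for illustration; the real app derives them via `core/paths.py`, and schema re-initialization (`db/database.py`) would run afterwards:

```python
import shutil
from pathlib import Path

def reset_app(db_path: Path, storage_dirs: list[Path]) -> None:
    """Delete the SQLite file and generated storage folders, then
    recreate the empty directories ready for re-initialization."""
    db_path.unlink(missing_ok=True)
    for d in storage_dirs:
        shutil.rmtree(d, ignore_errors=True)
        d.mkdir(parents=True, exist_ok=True)
```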
- Presentation layer: `app/main.py`
- Application layer: `core/services.py`, `core/analytics.py`
- Domain/model layer: `core/models.py`
- Infrastructure layer: `core/file_manager.py`, `core/thumbnail.py`, `core/reader.py`, `core/indexer.py`, `core/paths.py`
- Persistence layer: `db/database.py`, `db/repository.py`
Notable design choices:
- separation of concerns between UI, orchestration, infrastructure, and persistence
- local-first design with no external backend dependency
- repository pattern for database access
- chunk-based indexing so the same structure can later support semantic search
- session-state-driven UI flow in Streamlit
- environment-based protection for destructive admin actions
Current limitations:
- The app only supports PDFs.
- OCR quality depends on scan quality and handwriting clarity.
- Handwritten PDFs can still produce noisy text extraction results.
- Full-text search is implemented, but semantic search is not yet implemented.
- Analytics are basic product-usage and progress metrics, not AI analytics.
Planned next steps:
- semantic search over indexed document chunks
- better search ranking and result grouping
- improved OCR quality handling for difficult handwritten PDFs
- richer analytics on reading behavior and document usage
Before running the app, you will need:
- a Python virtual environment for the project
- project dependencies installed
- Tesseract OCR installed if you want searchable text from scanned or image-based PDFs
Install the project dependencies:

```
pip install -r requirements.txt
```

If you are using uv, you can also run:

```
uv sync
```

To enable the admin reset controls, create a `.env` file in the project root:

```
ADMIN_PASSWORD=changeme
```

You can use `.env.example` as the starting point.
Windows OCR Setup
If you want OCR support for scanned PDFs, install Tesseract OCR on Windows. A common install path is:

```
C:\Program Files\Tesseract-OCR
```

If the `tesseract` command is not available in your terminal, set:

```
$env:PATH = "C:\Program Files\Tesseract-OCR;" + $env:PATH
$env:TESSDATA_PREFIX = "C:\Program Files\Tesseract-OCR\tessdata"
```

On Windows, activate the virtual environment and start the app:

```
.venv\Scripts\activate
streamlit run app/main.py
```

- printed or digital PDFs usually work better for full-text search than handwritten PDFs
- if OCR is not available, scanned PDFs can still be uploaded, but searchable text may not be extracted
- the Analytics tab becomes more useful after you upload, search, open documents, and navigate pages in reader mode