A Retrieval-Augmented Generation (RAG) system for exploring the four Vedas — combining a local LLM, the Google Gemini API, and classical NLP techniques to deliver contextual answers and deep textual analysis from one of humanity's oldest bodies of knowledge.
| Feature | Description |
|---|---|
| Automated NLTK Setup | Checks for and downloads required NLTK data (punkt, wordnet, stopwords) on first run |
| Text Preprocessing | Lowercasing, tokenization, non-alphabetic removal, stop word filtering (including Vedic terms like thou, hymn, veda), and lemmatization |
| Text Statistics | Reports total word count, unique word count, and top frequent terms after preprocessing |
| Topic Modeling (LDA) | Identifies underlying themes across hymns using Latent Dirichlet Allocation |
| TF-IDF Keyword Extraction | Surfaces important, document-specific keywords for individual hymns |
| Collocation Analysis | Discovers significant bigrams and trigrams — frequently co-occurring word pairs and triplets |
| Contextual AI Explanations | Uses the Gemini API to generate cultural, religious, and ritualistic explanations for identified collocations |
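
The preprocessing and statistics rows above can be sketched with the standard library alone. This is a simplified illustration, not the script's actual code: the function names `preprocess` and `text_statistics` are hypothetical, the stop word list is a tiny stand-in for NLTK's English list, and lemmatization is omitted because it requires the WordNet data.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption); the script relies on NLTK's list.
BASE_STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "o"}
# Domain-specific additions named in the feature table.
CUSTOM_STOPWORDS = {"thou", "hymn", "veda"}


def preprocess(text, stopwords=BASE_STOPWORDS | CUSTOM_STOPWORDS):
    """Lowercase, keep runs of ASCII letters only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]


def text_statistics(tokens, top_n=2):
    """Return total word count, unique word count, and the most frequent terms."""
    counts = Counter(tokens)
    return {
        "total_words": len(tokens),
        "unique_words": len(counts),
        "top_terms": counts.most_common(top_n),
    }


sample = "O Agni, thou art the priest of the sacrifice. Agni, bring the gods to the sacrifice!"
tokens = preprocess(sample)
stats = text_statistics(tokens)
print(stats)
```

Because stop words are removed before counting, the reported totals describe the filtered text, not the raw file.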

- Python 3.x
- The Vedic text file `Four-Vedas-English-Translation.txt` placed in the same directory as the script
- A Google Gemini API key (obtain one from Google AI Studio)

`pip install nltk scikit-learn gensim google-generativeai pandas numpy`

The following parameters can be adjusted directly in the script:

| Parameter | Default | Description |
|---|---|---|
| `file_path` | `Four-Vedas-English-Translation.txt` | Path to the input text file |
| `num_topics` | `5` | Number of themes for LDA to discover |
| `custom_stopwords` | (set in script) | Words to exclude from analysis |
| Bigram frequency filter | `5` | Minimum occurrences for a bigram to be considered |
| Trigram frequency filter | `3` | Minimum occurrences for a trigram to be considered |
| Gemini model | `gemini-1.5-flash` | Gemini model used for contextual explanations |

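
To illustrate what the bigram and trigram frequency filters do, here is a minimal count-based sketch. It is a stand-in, not the script's implementation: the script presumably applies the thresholds through NLTK's collocation finders (for example `apply_freq_filter`), which is an assumption, and the helper name `frequent_ngrams` is hypothetical. The thresholds are lowered from the defaults of 5 and 3 because the sample text is tiny.

```python
from collections import Counter


def frequent_ngrams(tokens, n, min_freq):
    """Return {ngram: count} for n-grams that occur at least min_freq times."""
    # zip over n shifted views of the token list yields consecutive n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(ngrams)
    return {gram: c for gram, c in counts.items() if c >= min_freq}


tokens = "soma juice flows soma juice flows soma juice pressed".split()
bigrams = frequent_ngrams(tokens, n=2, min_freq=3)
trigrams = frequent_ngrams(tokens, n=3, min_freq=2)
print(bigrams)
print(trigrams)
```

Raising a threshold removes rare n-grams, which tend to be noise in a long corpus; lowering it surfaces more candidates at the cost of more spurious pairs.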
FileNotFoundError
Verify that `Four-Vedas-English-Translation.txt` exists in the same directory as the script, or that your custom `file_path` points to an existing file.
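
A small guard can make this failure easier to diagnose by reporting the fully resolved path. The sketch below is illustrative only: `resolve_corpus_path` and its `base_dir` argument are hypothetical names, not part of the script.

```python
from pathlib import Path


def resolve_corpus_path(file_path="Four-Vedas-English-Translation.txt", base_dir=None):
    """Return the absolute corpus path, or raise a descriptive FileNotFoundError.

    base_dir defaults to the current working directory (an assumption about
    how the script is launched). An absolute file_path ignores base_dir.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    candidate = (base / file_path).resolve()
    if not candidate.is_file():
        raise FileNotFoundError(
            f"Corpus not found at {candidate}; check file_path or the working directory"
        )
    return candidate
```

Calling `resolve_corpus_path()` before reading the file turns a bare traceback into a message that shows exactly where the script looked.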
MemoryError during LDA
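Reduce `num_topics`, or lower the `chunksize` parameter in `LdaModel` so that fewer documents are held in memory per training step. Increasing `passes` only lengthens training and does not reduce memory use.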
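
Shrinking the vocabulary before training is another way to ease memory pressure, since the topic-word matrix grows with the number of distinct tokens. The sketch below prunes tokens by document frequency using only the standard library. It mirrors the idea behind gensim's `Dictionary.filter_extremes` but is not that function, and `prune_vocabulary` is a hypothetical helper name.

```python
from collections import Counter


def prune_vocabulary(docs, no_below=2, no_above=0.5):
    """Drop tokens found in fewer than no_below documents
    or in more than a no_above fraction of all documents."""
    # Count each token once per document (document frequency).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    keep = {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }
    return [[token for token in doc if token in keep] for doc in docs]


docs = [
    ["agni", "fire", "priest"],
    ["indra", "thunder", "agni"],
    ["soma", "juice", "agni"],
    ["indra", "soma", "agni"],
]
pruned = prune_vocabulary(docs)
print(pruned)
```

In this sample, `agni` is dropped because it appears in every document, and the words seen in only one document are dropped as too rare.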
Gemini API errors (404, etc.)
Check that your API key is valid and has the necessary permissions. Also confirm that gemini-1.5-flash is still a supported model name in the current API version.