
(Figure: word cloud of the top 1,000 terms in all texts after stopword removal)

Ancient Texts in Translation

Purpose

  • To discover whether modern NLP tools and predictive algorithms can provide insights into ancient text corpora

Questions to answer

  • Will modern subjectivity and polarity evaluations align with ancient sensibilities?
  • Will word counts and term weights confirm known emphases or reveal unexpected patterns?
  • Will a supervised binary classification model be able to distinguish between genres of text, texts featuring different gods, or texts featuring different persons?
  • Will an unsupervised clustering model be able to make the same distinctions?

Deliverables

This project comprises six notebooks, two appendices, and two .py files.
Each notebook has an accompanying report.

Notebook Highlights

NLP section

  • Notebook Part I: Extensive EDA

Cleaning, wrangling, statistical analyses, and visualizations of the data set
An overview of the data set's components, with the rationale for the specific transformations applied (sketch below)
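
A minimal sketch of a first EDA pass, assuming the corpus loads from a CSV with text, genre, and god columns (the file name and all column names here are illustrative):

```python
import pandas as pd

# Load the corpus; file and column names are assumptions, not the repo's.
df = pd.read_csv("ancient_texts.csv")
print(df.shape)
print(df.isna().sum())  # missing values per column

# Distribution of texts across genres and featured gods.
print(df["genre"].value_counts())
print(df["god"].value_counts().head(10))

# Basic length statistics per text.
df["word_count"] = df["text"].str.split().str.len()
print(df["word_count"].describe())
```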

  • Notebook Part II: Sentiment

Employed regex extensively to prepare the text for analysis
Acquired scores with TextBlob and VADER
Aggregated results by genre and featured god; created visuals (sketch below)
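
A minimal sketch of the scoring step, assuming the DataFrame df from the EDA sketch with a cleaned_text column, and that the VADER lexicon has been downloaded:

```python
import pandas as pd
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

def score_text(text: str) -> pd.Series:
    """TextBlob polarity/subjectivity plus VADER's compound score."""
    blob = TextBlob(text)
    return pd.Series({
        "whole_polarity": blob.sentiment.polarity,          # [-1, 1]
        "whole_subjectivity": blob.sentiment.subjectivity,  # [0, 1]
        "vader_compound": sia.polarity_scores(text)["compound"],  # [-1, 1]
    })

df = df.join(df["cleaned_text"].apply(score_text))

# Aggregate by genre and by featured god.
print(df.groupby("genre")[["whole_polarity", "vader_compound"]].mean())
print(df.groupby("god")[["whole_polarity", "vader_compound"]].mean())
```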

  • Notebook Part III: Counts, Frequency, and Weights

Tokenized and lemmatized text for analysis using NLTK's sent_tokenize, WordNetLemmatizer, and regex
Calculated the word count for each text (word_count)
Compiled a dictionary of the top five words by frequency
Developed a dictionary of the top five weighted bi-grams
Created a dictionary of the top five weighted n-grams (3 to 4 words)
Aggregated data by genre, deity, and individual (sketch below)
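
A minimal sketch of the counting and weighting steps; TF-IDF stands in for the notebook's weighting scheme (which this summary does not specify), and the new column names are illustrative:

```python
import re
from collections import Counter

from nltk.stem import WordNetLemmatizer  # needs nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()

def top_words(text: str, n: int = 5) -> dict:
    """Top-n lemmatized word frequencies for one text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(lemmatizer.lemmatize(t) for t in tokens).most_common(n))

texts = df["cleaned_text"].tolist()

# TF-IDF-weighted bi-grams; swap in ngram_range=(3, 4) for the longer n-grams.
vec = TfidfVectorizer(ngram_range=(2, 2), stop_words="english")
tfidf = vec.fit_transform(texts)
terms = vec.get_feature_names_out()

def top_ngrams(row_idx: int, n: int = 5) -> dict:
    """Top-n weighted n-grams for one text, by TF-IDF score."""
    row = tfidf[row_idx].toarray().ravel()
    top = row.argsort()[::-1][:n]
    return {terms[i]: round(float(row[i]), 3) for i in top if row[i] > 0}

df["top5_words"] = [top_words(t) for t in texts]
df["top5_bigrams"] = [top_ngrams(i) for i in range(len(texts))]
```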

  • Notebook Part IV: Balancing Data Set

Applied NLTK's sent_tokenize to cleaned_text to generate sentence_count
Created the binary target variable is_hymn, derived from b_category
Balanced the data through minor recategorization
Employed regex findall() to assess the percentage of breaks in each text and eliminated problematic entries
Utilized Gensim to extract and summarize texts exceeding 1,200 words
Produced lemmatized_summaries for further analysis using WordNetLemmatizer (sketch below)
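
A minimal sketch of these steps. Note that summarize lives in gensim.summarization, which was removed in Gensim 4.0, so this assumes Gensim 3.x; the break marker, label value, and thresholds are all illustrative:

```python
import re

from gensim.summarization import summarize  # Gensim < 4.0 only
from nltk.tokenize import sent_tokenize     # needs nltk.download('punkt')

# Sentence counts and the binary target.
df["sentence_count"] = df["cleaned_text"].apply(lambda t: len(sent_tokenize(t)))
df["is_hymn"] = (df["b_category"] == "hymn").astype(int)  # assumed label value

# Share of break markers per text; drop heavily fragmentary entries.
def break_pct(text: str) -> float:
    breaks = re.findall(r"\.\.\.", text)  # assume '...' marks a lacuna
    return len(breaks) / max(len(text.split()), 1)

df = df[df["cleaned_text"].apply(break_pct) < 0.10]  # threshold is illustrative

# Summarize texts over 1,200 words; lemmatization of summaries happens afterwards.
long_mask = df["cleaned_text"].str.split().str.len() > 1200
df.loc[long_mask, "summary"] = df.loc[long_mask, "cleaned_text"].apply(
    lambda t: summarize(t, word_count=300)  # target length is illustrative
)
```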

Machine Learning Section with Tokenization

  • Notebook Part V: Random Forest and XGBoost models

Built an hstacked feature matrix that incorporates new features: top_word_freq, divine_power_count, and divine_power_weighted
Assessed collinearity: calculated VIF and generated a Pearson correlation coefficient matrix
Configured CountVectorizer to produce a feature matrix that captures 2-grams and 3-grams
Constructed a Random Forest model using the training and validation sets
Developed an XGBoost model using the training and validation sets
Assessed model performance: created confusion matrices, plotted ROC and precision-recall curves, and calculated AUC scores
Extracted feature-importance metrics for both models
Evaluated the better model on the held-out test set (sketch below)
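
A minimal sketch of the feature assembly and model fitting; a single train/test split stands in for the notebook's train/validation/test scheme, and the hyperparameters are illustrative:

```python
import scipy.sparse as sp
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# 2- and 3-gram counts, hstacked with the engineered numeric features.
vec = CountVectorizer(ngram_range=(2, 3), min_df=2)
X_text = vec.fit_transform(df["cleaned_text"])
X_num = df[["top_word_freq", "divine_power_count", "divine_power_weighted"]].values
X = sp.hstack([X_text, sp.csr_matrix(X_num)]).tocsr()
y = df["is_hymn"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

for model in (
    RandomForestClassifier(n_estimators=300, random_state=42),
    XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42),
):
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(type(model).__name__, "AUC:", round(roc_auc_score(y_test, proba), 3))
    print(confusion_matrix(y_test, model.predict(X_test)))
```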

  • Notebook Part VI: K-means clustering model

In progress

Appendices

  • Appendix A: Word Clouds

Generated multiple word clouds by genre, god, and person with WordCloud (sketch below)
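
A minimal sketch of one cloud per genre, assuming the wordcloud package and the columns used above:

```python
import matplotlib.pyplot as plt
from wordcloud import STOPWORDS, WordCloud

# One cloud per genre, built from the pooled cleaned text of that genre.
for genre, group in df.groupby("genre"):
    text = " ".join(group["cleaned_text"])
    cloud = WordCloud(
        width=800, height=400, stopwords=STOPWORDS, background_color="white"
    ).generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(genre)
    plt.show()
```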

  • Appendix B: Hypothesis Testing

Performed three two-sample t-tests with SciPy's stats.ttest_ind() to determine whether differences in means between the is_hymn groups are statistically significant or attributable to random sampling, for the features word_count, whole_polarity, and whole_subjectivity (sketch below)
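
A minimal sketch of one of the three tests, using the columns named above (whether the notebook pools variances is not stated; Welch's variant is shown):

```python
from scipy import stats

# Split one feature by the binary target and compare group means.
hymns = df.loc[df["is_hymn"] == 1, "word_count"]
others = df.loc[df["is_hymn"] == 0, "word_count"]

# equal_var=False gives Welch's t-test, which drops the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(hymns, others, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```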

What this is

An experiment using modern NLP tools and predictive algorithms on texts in translation

What this is not

Foolproof. Each of these notebooks was built and run on translations of ancient texts. Translators may not have been consistent in word choice, or, conversely, may have been too consistent when a different English word would have suited the context better. Additionally, some of the texts are extremely fragmentary: many are missing words, phrases, or large chunks. Many of the texts are also composites. In the data set, each observation is a single composition, but that composition may have been pieced together from several exemplars. These exemplars are not always exact copies, so where one scribe may have used one term (or sign) to represent, say, "adoration," another may have chosen a different term (or sign). Because the compositions are compilations, if both original terms are thought to mean roughly "adoration," that is the English translation selected. This also means that any nuance that differed between the exemplars has been removed.
