## Tokenizer

- Implemented from scratch using regexes
- Apostrophes are treated as separate tokens
- Currency of the form `Rs.` and `$` is handled
- Standard email ids, URLs, hashtags (`#`), and mentions (`@`) are also handled
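The rules above can be sketched as a single prioritized regex, where URLs, emails, hashtags, mentions, and currency amounts are tried before plain words, and apostrophes fall through to their own alternative. This is an illustrative assumption about the approach, not the repository's actual pattern:

```python
import re

# Hypothetical tokenizer sketch: alternatives are ordered so that the more
# specific token classes win over plain words and punctuation.
TOKEN_PATTERN = re.compile(r"""
    https?://\S+              # URLs
  | \S+@\S+\.\w+              # standard email ids
  | \#\w+                     # hashtags
  | @\w+                      # mentions
  | (?:Rs\.|\$)\s?\d[\d,.]*   # currency of the form Rs. and $
  | \w+                       # plain words
  | '                         # apostrophe as a separate token
  | [^\w\s]                   # any other punctuation mark
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

tokenize("Check @user's post at a@b.com #nlp for Rs. 500")
```

Because `re.findall` tries alternatives left to right at each position, `@user's` splits into `@user`, `'`, `s`, while `a@b.com` stays one email token.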
## Language Modelling

- Kneser-Ney smoothing
- Interpolation
- N-grams up to order 6 are considered
- `corpus_EN.txt` contains sentences in standard English
- `corpus_TW.txt` contains assorted tweets
- The language model is stored in a file named `LM`
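As a minimal sketch of the smoothing idea (bigram-only Kneser-Ney with absolute discounting; the repository itself interpolates orders up to 6, and all names here are assumptions):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    """Bigram Kneser-Ney: discount observed counts by d and redistribute
    the held-out mass according to continuation probabilities."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    continuations = defaultdict(set)   # contexts each word follows
    followers = defaultdict(set)       # words each context precedes
    for prev, w in bigram_counts:
        continuations[w].add(prev)
        followers[prev].add(w)
    n_bigram_types = len(bigram_counts)

    def prob(prev, w):
        # continuation probability: how many contexts does w complete?
        p_cont = len(continuations[w]) / n_bigram_types
        if context_counts[prev] == 0:
            return p_cont              # unseen context: back off entirely
        discounted = max(bigram_counts[(prev, w)] - d, 0) / context_counts[prev]
        lam = d * len(followers[prev]) / context_counts[prev]
        return discounted + lam * p_cont

    return prob
```

For a seen context, the probabilities over all continuation words sum to one, which is the invariant that makes the discount-plus-backoff construction a valid distribution.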
## Visualization of Word Frequency vs. Word Occurrence Rank

- The rank-frequency curve resembles a Zipf distribution for most analytic languages
- The graph can be constructed for any selected corpus
- In the present setting:
  - The first graph considers the top-1000 ranked tokens
  - The second graph considers the 10001st- to 11000th-ranked words in the corpus
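The data behind such a graph is just the token frequencies sorted in descending order and sliced by rank. A small sketch (function name and rank bounds are illustrative, not the repository's code):

```python
from collections import Counter

def rank_frequencies(tokens, start=1, end=1000):
    """Frequencies of the tokens ranked start..end (1-indexed),
    most frequent first."""
    counts = Counter(tokens)
    ranked = [freq for _, freq in counts.most_common()]
    return ranked[start - 1:end]
```

Plotting rank against `rank_frequencies(tokens)` on log-log axes (e.g. with matplotlib's `loglog`) yields the roughly straight line characteristic of a Zipf distribution.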
## Computation of Model Perplexity Scores

- Provide a test corpus to generate a perplexity score for each sentence
- To compare language models, the average perplexity score across all sentences in the test corpus is used
- The maximum N parameter of the N-gram models can be varied
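The per-sentence and averaged scores follow the standard definition, perplexity = exp of the negative mean log-probability. A sketch in a bigram setting, where `prob(prev, w)` is any conditional probability function such as the one a smoothed LM provides (names here are assumptions):

```python
import math

def sentence_perplexity(tokens, prob):
    """Perplexity of one sentence: exp(-(1/N) * sum of log P(w | prev))."""
    log_sum = sum(math.log(prob(prev, w))
                  for prev, w in zip(tokens, tokens[1:]))
    n_events = len(tokens) - 1
    return math.exp(-log_sum / n_events)

def corpus_perplexity(sentences, prob):
    """Average of per-sentence perplexities, for comparing language models."""
    scores = [sentence_perplexity(s, prob) for s in sentences]
    return sum(scores) / len(scores)
```

A uniform model over k choices gives perplexity k, which is a quick sanity check before comparing real models.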
## About

Language modelling for various corpora, natural language generation using LMs, and corpus statistics visualization.