## Tokenizer

- Implemented from scratch using regexes
- Apostrophes are treated as separate tokens
- Currency of the form `Rs.` and `$` is handled
- Standard email ids, URLs, hashtags (`#`), and mentions (`@`) are also handled
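The rules above can be sketched as a single prioritized regex, where URLs, emails, hashtags, mentions, and currency amounts are tried before plain words, and apostrophes fall through to their own alternative. This is an illustrative assumption about the approach, not the repository's actual pattern:

```python
import re

# Hypothetical tokenizer sketch: alternatives are ordered so that the more
# specific token classes win over plain words and punctuation.
TOKEN_PATTERN = re.compile(r"""
    https?://\S+              # URLs
  | \S+@\S+\.\w+              # standard email ids
  | \#\w+                     # hashtags
  | @\w+                      # mentions
  | (?:Rs\.|\$)\s?\d[\d,.]*   # currency of the form Rs. and $
  | \w+                       # plain words
  | '                         # apostrophe as a separate token
  | [^\w\s]                   # any other punctuation mark
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

tokenize("Check @user's post at a@b.com #nlp for Rs. 500")
```

Because `re.findall` tries alternatives left to right at each position, `@user's` splits into `@user`, `'`, `s`, while `a@b.com` stays one email token.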
## Language Modelling

- Kneser-Ney smoothing
- Interpolation
- N-grams up to order 6 are considered
- `corpus_EN.txt` contains sentences in standard English
- `corpus_TW.txt` contains assorted tweets
- The language model is stored in a file named `LM`
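As a minimal sketch of the smoothing idea (bigram-only Kneser-Ney with absolute discounting; the repository itself interpolates orders up to 6, and all names here are assumptions):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    """Bigram Kneser-Ney: discount observed counts by d and redistribute
    the held-out mass according to continuation probabilities."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    continuations = defaultdict(set)   # contexts each word follows
    followers = defaultdict(set)       # words each context precedes
    for prev, w in bigram_counts:
        continuations[w].add(prev)
        followers[prev].add(w)
    n_bigram_types = len(bigram_counts)

    def prob(prev, w):
        # continuation probability: how many contexts does w complete?
        p_cont = len(continuations[w]) / n_bigram_types
        if context_counts[prev] == 0:
            return p_cont              # unseen context: back off entirely
        discounted = max(bigram_counts[(prev, w)] - d, 0) / context_counts[prev]
        lam = d * len(followers[prev]) / context_counts[prev]
        return discounted + lam * p_cont

    return prob
```

For a seen context, the probabilities over all continuation words sum to one, which is the invariant that makes the discount-plus-backoff construction a valid distribution.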
## Visualization of Word Frequency vs. Word Occurrence Rank

- The rank-frequency curve resembles a Zipf distribution for most analytic languages
- The graph can be constructed for any selected corpus
- In the present setting:
  - The first graph considers the top-1000 ranked tokens
  - The second graph considers the 10001st- to 11000th-ranked words in the corpus
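The data behind such a graph is just the token frequencies sorted in descending order and sliced by rank. A small sketch (function name and rank bounds are illustrative, not the repository's code):

```python
from collections import Counter

def rank_frequencies(tokens, start=1, end=1000):
    """Frequencies of the tokens ranked start..end (1-indexed),
    most frequent first."""
    counts = Counter(tokens)
    ranked = [freq for _, freq in counts.most_common()]
    return ranked[start - 1:end]
```

Plotting rank against `rank_frequencies(tokens)` on log-log axes (e.g. with matplotlib's `loglog`) yields the roughly straight line characteristic of a Zipf distribution.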
## Computation of Model Perplexity Scores

- Provide a test corpus to generate a perplexity score for each sentence
- To compare language models, the average perplexity score across all sentences in the test corpus is used
- The maximum N parameter of the N-gram models can be varied
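The per-sentence and averaged scores follow the standard definition, perplexity = exp of the negative mean log-probability. A sketch in a bigram setting, where `prob(prev, w)` is any conditional probability function such as the one a smoothed LM provides (names here are assumptions):

```python
import math

def sentence_perplexity(tokens, prob):
    """Perplexity of one sentence: exp(-(1/N) * sum of log P(w | prev))."""
    log_sum = sum(math.log(prob(prev, w))
                  for prev, w in zip(tokens, tokens[1:]))
    n_events = len(tokens) - 1
    return math.exp(-log_sum / n_events)

def corpus_perplexity(sentences, prob):
    """Average of per-sentence perplexities, for comparing language models."""
    scores = [sentence_perplexity(s, prob) for s in sentences]
    return sum(scores) / len(scores)
```

A uniform model over k choices gives perplexity k, which is a quick sanity check before comparing real models.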
## About

Language modelling for various corpora, natural language generation using LMs, and corpus statistics visualization.