Task: Develop a chatbot using the retrieval-based approach. The bot should engage in dialogue as a specific character from a TV series, imitating their style and manner of speech. It is important to account for the character's speech patterns, the themes they raise, and their typical reactions.
As a foundation for the chatbot, I used scripts from the Star Trek series, which I downloaded from this repository, specifically the lines of Mr. Spock, a crew member and scientist from the planet Vulcan.
The data was processed as follows:
- script cleaning
- selecting the character’s lines as possible bot responses
- extracting the preceding line as the “question”; if the character's line opens the scene, this field is empty
- extracting the consecutive preceding dialogue lines as the context; again, if the character's line opens the scene, the context is empty
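The processing steps above can be sketched as follows. This is an illustrative reconstruction, not the author's actual code; the scene representation and function names are assumptions.

```python
# Illustrative sketch: turning a cleaned script, represented as a scene of
# (speaker, line) pairs, into (context, question, answer) records for
# the target character's lines.
CHARACTER = "SPOCK"

def extract_records(scene):
    """scene: list of (speaker, line) tuples in order of appearance."""
    records = []
    for i, (speaker, line) in enumerate(scene):
        if speaker != CHARACTER:
            continue
        # The immediately preceding line is the "question";
        # empty if the character's line opens the scene.
        question = scene[i - 1][1] if i > 0 else ""
        # All consecutive preceding lines form the context.
        context = " ".join(text for _, text in scene[:i])
        records.append({"context": context, "question": question, "answer": line})
    return records

scene = [
    ("KIRK", "Mr. Spock, report."),
    ("SPOCK", "Sensors detect no life forms, Captain."),
]
print(extract_records(scene))
```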
After processing, about 5,800 character lines were available for the bot to use in conversations. I then prepared datasets for training the candidate ranking and re-ranking models.
Code is in the notebook GNLP_HW1-data_prep.ipynb
Based on the processed data, I prepared training samples for a bi-encoder model. Since I used triplet loss, the data is structured into triples:
- ANCHOR – context + question
- ANSWER – correct response from the script
- WRONG_ANSWER – randomly selected line (from the House M.D. series)
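A triple can be assembled roughly as below. This is a hedged sketch under assumed field names; `foreign_lines` stands in for the lines borrowed from another show (House M.D. in this report).

```python
import random

# Illustrative sketch: building (anchor, answer, wrong_answer) triples
# for triplet-loss training. Field names are assumptions.
def build_triplets(records, foreign_lines, seed=42):
    rng = random.Random(seed)
    triplets = []
    for rec in records:
        anchor = (rec["context"] + " " + rec["question"]).strip()
        triplets.append({
            "anchor": anchor,                           # context + question
            "answer": rec["answer"],                    # correct script line
            "wrong_answer": rng.choice(foreign_lines),  # random negative
        })
    return triplets

records = [{"context": "", "question": "Report.", "answer": "All systems nominal, Captain."}]
foreign = ["Everybody lies.", "It's never lupus."]
print(build_triplets(records, foreign))
```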
These same datasets are used to train a reranker model that re-ranks response candidates. Correct responses are labeled 0 and are augmented with lines from other shows as negative examples labeled 1.
Each reranker sample is a context-question-answer sequence joined by a special [SEP] token. Around 10,000 samples were created, with a 50/50 class split.
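The sample format can be sketched like this (an illustrative reconstruction; the exact separator spacing is an assumption):

```python
# Illustrative sketch: formatting reranker samples as
# context [SEP] question [SEP] answer, labeled 0 (correct) or 1 (negative).
SEP = " [SEP] "

def make_sample(context, question, answer, label):
    return {"text": SEP.join([context, question, answer]), "label": label}

pos = make_sample("", "Report.", "All systems nominal, Captain.", 0)  # correct pair
neg = make_sample("", "Report.", "Everybody lies.", 1)                # foreign negative
print(pos["text"])
```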
The chatbot workflow is shown schematically below.
The reply database consists of the script lines vectorized with the trained encoder; each entry encodes both the context and the question.
Reply selection happens in two steps:
- Retrieval of similar context-question pairs based on cosine similarity from the vector database. The top candidates are selected based on similarity to the user’s input.
- A reranker model then classifies whether each retrieved answer is a valid continuation. Only responses labeled class 0 (logical continuation) are kept and ranked by the model's confidence. If every response is labeled class 1, the most cosine-similar response is returned.
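The two-step selection can be sketched as below. The vector database and the reranker are mocked here; in the real bot they would be the trained bi-encoder index and the BERT classifier.

```python
import numpy as np

# Illustrative sketch of the two-step reply selection.
def cosine_top_k(query_vec, db_vecs, k=3):
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = db @ q
    idx = np.argsort(sims)[::-1][:k]   # indices of the top-k candidates
    return idx, sims[idx]

def select_reply(query_vec, db_vecs, answers, rerank_fn, k=3):
    idx, sims = cosine_top_k(query_vec, db_vecs, k)
    candidates = [answers[i] for i in idx]
    # Keep class-0 ("logical continuation") candidates, ranked by confidence.
    scored = [(a, conf) for a, (label, conf) in
              zip(candidates, map(rerank_fn, candidates)) if label == 0]
    if scored:
        return max(scored, key=lambda x: x[1])[0]
    return candidates[0]  # fall back to the most cosine-similar answer

rng = np.random.default_rng(0)
db = rng.normal(size=(5, 4))
answers = [f"answer {i}" for i in range(5)]
mock_rerank = lambda a: (0, 0.9) if a == "answer 2" else (1, 0.8)
print(select_reply(db[2], db, answers, mock_rerank))
```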
The intent classifier is taken from the DialogTag library and is used to label both the training data and incoming user inputs. Intent is also used to filter candidates and is embedded as an additional feature in the bi-encoder.
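The intent-based candidate filtering might look like the sketch below. Intents come from DialogTag in the real bot; here they are precomputed strings, and the fallback behavior is an assumption.

```python
# Illustrative sketch of intent-based candidate filtering.
def filter_by_intent(candidates, user_intent):
    """candidates: list of dicts with an 'intent' field, tagged offline."""
    matching = [c for c in candidates if c["intent"] == user_intent]
    # If no candidate shares the user's intent, fall back to all candidates
    # rather than returning nothing.
    return matching or candidates

candidates = [
    {"answer": "Live long and prosper.", "intent": "Conventional-closing"},
    {"answer": "Fascinating.", "intent": "Statement-non-opinion"},
]
print(filter_by_intent(candidates, "Conventional-closing"))
```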
The bi-encoder model is based on distilroberta-base, fine-tuned on the triplet data using the sentence-transformers library. Training uses a Triplet Loss Function, minimizing the distance between anchor and correct answer, and maximizing it between anchor and incorrect answer.
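The objective being minimized is the standard triplet loss, L = max(d(anchor, answer) - d(anchor, wrong_answer) + margin, 0), which pushes the anchor closer to the correct answer than to the wrong one by at least the margin. A worked numeric sketch (vectors and margin are illustrative, not the trained model's values):

```python
import numpy as np

# Worked sketch of the triplet objective with Euclidean distance.
def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])   # close to anchor -> small d_pos
n = np.array([-1.0, 0.0])  # far from anchor -> large d_neg
print(triplet_loss(a, p, n))  # 0.0: the margin is already satisfied
```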
The model was evaluated by accuracy—specifically, whether the similarity between anchor and correct answer was higher than with the incorrect one. The untrained distilroberta-base achieved 58%, while the trained model reached 98%.
Training code is in this notebook. The model is hosted on Hugging Face (link) and used for inference.
The reranker model is based on bert-base-uncased, trained on the previously prepared labeled data. Classification performance was evaluated by accuracy. Training results are shown below.
The model achieved strong results, with final validation accuracy reaching 95%. The graphs suggest overfitting started after the second epoch.
Model is hosted here: gnlp_hw1_reranker
These results indicate high training and validation performance. A consistent 95% validation accuracy suggests two possibilities:
- Overfitting – the model fits the training data too closely, reducing generalization. However, stable high accuracy on validation may indicate that the data represents the target pattern well.
- Data Quality – the dataset may lack diversity or be too small, allowing the model to easily fit it. Increasing data variety and size is important.
I tend to believe the second explanation, as the final dataset was relatively small.
│ README.md - HW1 report
│ requirements.txt
│ __init__.py
│ retrieval_bot.py - main inference logic
│ utilities.py - helper functions
│ app.py - Flask UI app launcher
│
├───Notebooks - data prep and training notebooks
├───templates - HTML template for UI
│ chat.html
├───static - styling for UI
│ style.css
├───data
│ spock_dujour.pkl - fallback lines for low similarity
│ spock_lines_vectorized.pkl - vector DB of context-question pairs
│ spock_lines.pkl - raw data
│ spock_lines_reranker.pkl - reranker dataset

The chatbot is built with Flask and launched using app.py, which sets up the interface, loads files and models, and initializes the chatbot.
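A minimal skeleton of such a launcher is sketched below. The route names and the `get_bot_reply` helper are assumptions, not the repository's actual code; the real app.py also loads the models and the pickled data files.

```python
# Minimal sketch of a Flask launcher (route names are hypothetical).
from flask import Flask, render_template, request

app = Flask(__name__)

def get_bot_reply(user_message: str) -> str:
    # Placeholder for the retrieval + rerank pipeline in retrieval_bot.py.
    return "Fascinating."

@app.route("/")
def index():
    return render_template("chat.html")  # the template from templates/

@app.route("/get", methods=["POST"])
def chat():
    return get_bot_reply(request.form["msg"])

# app.run(host="127.0.0.1", port=5000) starts the dev server locally.
```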
To run locally:
- Clone the repo: https://github.com/greatakela/ChatBot.git
- Create a virtual environment
- Install dependencies: pip install -r requirements.txt
- Launch the app: python app.py
- The chatbot runs at http://127.0.0.1:5000
The bot should be evaluated based on how relevant its replies are to dialogue context. Thus, human evaluation is key.
I tested different encoders:
- sentence-transformers/all-mpnet-base-v2
- sentence-transformers/LaBSE
- greatakela/gnlp_hw1_encoder (trained specifically on Spock lines)
I tested with some sample user inputs to compare retrieval quality:
| Model \ Incoming | Greetings, Mr. Spock. | What is the logical course of action? | Explain your reasoning. | What do you think of Captain Kirk? |
|---|---|---|---|---|
| all-mpnet-base-v2 | Live long and prosper. | Logic is the beginning of wisdom, not the end. | Once you have eliminated the impossible, whatever remains, however improbable, must be the truth. | Captain, you almost make me believe in luck. |
| LaBSE | Greetings. How may I assist in your endeavors? | It would be illogical to assume that all conditions remain stable. | The universe is vast and full of wonders. It is logical to explore them. | Without followers, evil cannot spread. |
| gnlp_hw1_encoder | I assume this greeting is a social convention rather than a necessity? | The needs of the many outweigh the needs of the few. | Superior ability breeds superior ambition. | I fail to comprehend your indignation, sir. I have simply made the logical deduction that you are a liar. |
Interestingly, embeddings from the custom-trained Spock model produced responses closer to the character's voice than the general-purpose models, though it is hard to pick a definitive “winner” from the examples above. I retained the trained model and added intent filtering and a minimum similarity threshold before reranking to make the bot's behavior more deterministic.
The analysis shows that the developed chatbot model is effective for this task. However, further testing on larger and more diverse data is necessary to better understand its generalization capabilities and limits.
The service was containerized using Docker on a local machine. I deployed it to a virtual server on Kamatera, running the uploaded Docker container. The chatbot is accessible at:
👉 http://185.53.209.56:5000/
The Docker image ended up being 13 GB even without GPU-related packages. There's room for optimization.
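One likely optimization is a slimmer base image and CPU-only PyTorch wheels, which avoid pulling multi-gigabyte CUDA dependencies. A hypothetical Dockerfile along those lines (not the one actually used):

```dockerfile
# Hypothetical slimmed-down Dockerfile; the deployed image was ~13 GB.
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
# CPU-only PyTorch wheels skip the CUDA libraries entirely.
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu \
    && pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]
```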
VM Specs: 2 CPUs, 2 GB RAM, 80 GB storage



