This repository contains the code of my Master’s Degree Final Project. This project aims to provide a pipeline (from extracting tweets to designing an interactive app) to explore a group of tweets with visual widgets and machine learning techniques such as topic modeling and sentiment analysis.
The pipeline consists of:
- Extracting tweets from Twitter with snscrape
- Preprocessing tweets and their metadata with well-known libraries such as pandas and spaCy
- Computing tweet embeddings with sentence transformers. These contextual embeddings improve topic modeling compared to classical techniques like LDA, and they also make it possible to build a simple logistic regression for sentiment classification
- Training a sentiment analysis model on labeled datasets
- Clustering tweets and assigning them topics with contextualized topic modeling
- Building an interactive app with streamlit and plotly
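The embedding and sentiment steps can be illustrated with a minimal sketch. It is not the code in this repository: the function names, the model name `paraphrase-multilingual-MiniLM-L12-v2` and the synthetic stand-in vectors are all assumptions chosen for illustration. It only shows the general idea of fitting a logistic regression on top of sentence embeddings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def embed_tweets(texts, model_name="paraphrase-multilingual-MiniLM-L12-v2"):
    """Encode texts into dense vectors (model name is an assumption)."""
    # Imported lazily so the classifier part runs without this heavy dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(list(texts))  # array of shape (n_texts, embedding_dim)


def train_sentiment_classifier(embeddings, labels):
    """Fit a plain logistic regression on precomputed embeddings."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings, labels)
    return clf


# Synthetic stand-in for real embeddings: two well-separated clusters,
# labeled 1 (positive) and 0 (negative). Real usage would call embed_tweets().
rng = np.random.default_rng(0)
positive = rng.normal(loc=1.0, scale=0.3, size=(20, 8))
negative = rng.normal(loc=-1.0, scale=0.3, size=(20, 8))
X = np.vstack([positive, negative])
y = np.array([1] * 20 + [0] * 20)

clf = train_sentiment_classifier(X, y)
```

With real data, `X` would come from `embed_tweets(labeled_texts)` and the fitted classifier could then be applied to the embeddings of any new group of tweets.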
In this project I use two different groups of tweets: tweets from @IbaiLlanos and Spanish tweets containing the keyword 'netflix'.
The app (in Spanish) is deployed at https://tweets-visualizer.streamlit.app/
- app contains the scripts to deploy the streamlit app on Heroku
- data contains all the data needed throughout the process, from raw data extracted with snscrape to embeddings and datasets with sentiment labels
- dev contains scripts for local development: preprocessing, creating embeddings, training the sentiment model and topic modeling
The notebooks for sentiment classification in dev/sentiment_model are run once to train a model, which is then reused for every group of tweets.
A GPU is highly recommended for computing embeddings and creating topics.
Whenever we want to analyze a new group of tweets:
- First, from the data/raw_data folder, extract tweets with snscrape commands. Example used:
  snscrape --jsonl --progress twitter-search "from:IbaiLlanos -filter:replies AND -filter:quote" > IbaiLlanos.json
- Run main.py in dev/ for preprocessing, computing embeddings, inferring sentiment and saving results:
  python main.py --data_name IbaiLlanos
- Run main_topics.py in dev/ for topic creation. Results are saved in data/topic_data:
  python main_topics.py --data_name IbaiLlanos
- Finally, choose which data and script to use in app/app.py and run it:
  streamlit run app.py
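
To make the link between the `--data_name` flag and the raw file concrete, here is a hedged sketch of how a script could resolve and load the snscrape output. It is not the actual `main.py`: the helper names, the `--raw_dir` option and the assumption that the file sits at `data/raw_data/<data_name>.json` are hypothetical. It relies only on the fact that `snscrape --jsonl` writes one JSON object per line.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; --raw_dir is a hypothetical convenience option."""
    parser = argparse.ArgumentParser(description="Load tweets scraped with snscrape.")
    parser.add_argument("--data_name", required=True,
                        help="Name of the tweet group, e.g. IbaiLlanos")
    parser.add_argument("--raw_dir", default="data/raw_data",
                        help="Folder holding the raw snscrape output (assumed layout)")
    return parser.parse_args(argv)


def raw_path(args):
    """Build the path of the raw file for the requested tweet group."""
    return Path(args.raw_dir) / f"{args.data_name}.json"


def load_tweets(path):
    """Read a JSON Lines file: one tweet object per non-empty line."""
    tweets = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                tweets.append(json.loads(line))
    return tweets


# Example: resolve the path for the group used in this README.
args = parse_args(["--data_name", "IbaiLlanos"])
path = raw_path(args)
```

Calling `load_tweets(path)` then returns a list of dictionaries, one per tweet, ready to be turned into a pandas DataFrame for preprocessing.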
