A modular machine learning pipeline for classifying tweet sentiments as positive or negative, using robust text preprocessing, feature engineering, and model comparison. This project demonstrates practical NLP techniques and evaluates multiple classifiers to identify the most effective model for real-world sentiment prediction.
- Clean and preprocess raw Twitter data for sentiment classification.
- Extract meaningful features using Bag-of-Words (BoW) and TF-IDF.
- Train and evaluate multiple ML models to identify the best performer.
- Visualize insights from the data (hashtags, word clouds).
- Generate predictions on unseen test data using the top-performing model.
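The cleaning step in the first objective can be sketched roughly as follows. This is a minimal illustration, not the script's confirmed logic: the helper name `clean_tweet`, the regular expressions, and the minimum token length of 3 are all assumptions chosen because they are common in tweet preprocessing.

```python
import re

# Hypothetical patterns: typical choices for tweet cleaning, not taken from the script.
HANDLE_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_TEXT_RE = re.compile(r"[^a-z#\s]")


def clean_tweet(text: str, min_len: int = 3) -> str:
    """Lowercase a tweet, strip handles, URLs and punctuation, drop short tokens."""
    text = text.lower()
    text = HANDLE_RE.sub(" ", text)    # remove @mentions
    text = URL_RE.sub(" ", text)       # remove links
    text = NON_TEXT_RE.sub(" ", text)  # keep only letters, '#' and whitespace
    tokens = [tok for tok in text.split() if len(tok) >= min_len]
    return " ".join(tokens)


print(clean_tweet("@user Loving the new phone!!! https://t.co/abc #happy :)"))
# prints: loving the new phone #happy
```

Hashtags are kept here on the assumption that they carry sentiment and feed the hashtag plots; drop `#` from the character class if they should be removed.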
graph TD
A[Load Data] --> B[Preprocess Tweets]
B --> C[Feature Extraction: BoW & TF-IDF]
C --> D[Train Models: LogReg, XGBoost, Decision Tree]
D --> E[Evaluate via F1-Score]
E --> F[Visualize Results]
F --> G[Predict on Test Data]
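The evaluation step in the workflow relies on the F1-score, the harmonic mean of precision and recall. As a sketch of what that metric measures for binary labels, the hypothetical helper below computes it by hand; treating label `1` as the positive class is an assumption, and in practice a library routine such as scikit-learn's `f1_score` would normally be used instead.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Compute the F1-score for one class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(round(f1_score_binary(y_true, y_pred), 3))  # prints: 0.667
```

F1 is a reasonable choice over plain accuracy when one sentiment class is much rarer than the other, since it ignores true negatives.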
Install dependencies using:
pip install -r requirements.txt
Also download the NLTK tokenizer:
import nltk
nltk.download('punkt')
Required packages: pandas, numpy, nltk, scikit-learn, xgboost, matplotlib, seaborn, wordcloud
- Clone this repository:
git clone https://github.com/SatChittAnand/Twitter-Sentiment-Analysis.git
cd Twitter-Sentiment-Analysis
- Place the datasets `train_SentimentAnalysis.csv` and `test_SentimentAnalysis.csv` in the root directory.
- Run the script:
python sentimentanalysistwitter.py
The script will:
- Preprocess and vectorize the data
- Train and evaluate models
- Generate visualizations
- Save predictions to `predictions.csv`
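The steps listed above could look roughly like the sketch below, which vectorizes with TF-IDF, fits a logistic regression, and writes predictions to a CSV file. The column names `id`, `tweet`, and `label`, the helper name `predict_to_csv`, and the toy data are assumptions for illustration; the real datasets and script may differ.

```python
import os
import tempfile

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def predict_to_csv(train_df, test_df, out_path="predictions.csv"):
    """Fit TF-IDF + logistic regression on train_df and save test predictions."""
    # Assumed columns: 'tweet' (text), 'label' (0/1), 'id' (row identifier).
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_df["tweet"], train_df["label"])
    result = pd.DataFrame(
        {"id": test_df["id"], "label": model.predict(test_df["tweet"])}
    )
    result.to_csv(out_path, index=False)
    return result


# Toy data standing in for the real CSV files.
train_df = pd.DataFrame(
    {
        "tweet": ["love this so much", "great happy day", "awful terrible service",
                  "hate waiting in line", "happy and great mood", "terrible awful day"],
        "label": [1, 1, 0, 0, 1, 0],
    }
)
test_df = pd.DataFrame({"id": [101, 102], "tweet": ["what a great day", "awful service"]})

out_file = os.path.join(tempfile.mkdtemp(), "predictions.csv")
result = predict_to_csv(train_df, test_df, out_file)
print(result)
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary learned on the training data and reuses it unchanged on the test data.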
| Model | Feature Technique | Evaluation Metric |
|---|---|---|
| Logistic Regression | BoW, TF-IDF | F1-Score |
| XGBoost Classifier | BoW, TF-IDF | F1-Score |
| Decision Tree | BoW, TF-IDF | F1-Score |
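The comparison in the table above can be sketched as a loop over feature techniques and models, scoring each pair by F1 on a held-out split. This is an illustrative sketch under stated assumptions: the toy tweets, the helper name `compare_models`, the 75/25 split, and the hyperparameters are not from the project. XGBoost is left out to keep the example light; an `xgboost.XGBClassifier` could presumably be added to the `models` dictionary in the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data for illustration only (1 = positive, 0 = negative).
tweets = [
    "love this phone", "great happy day", "so happy and excited",
    "best service ever", "really love it", "wonderful great mood",
    "hate this phone", "awful terrible day", "so sad and angry",
    "worst service ever", "really hate it", "terrible awful mood",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


def compare_models(texts, labels, test_size=0.25, seed=42):
    """Return {(model, feature technique): F1} on a stratified validation split."""
    X_train, X_val, y_train, y_val = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    vectorizers = {"BoW": CountVectorizer, "TF-IDF": TfidfVectorizer}
    models = {
        "Logistic Regression": lambda: LogisticRegression(max_iter=1000),
        "Decision Tree": lambda: DecisionTreeClassifier(random_state=seed),
    }
    scores = {}
    for feat_name, make_vectorizer in vectorizers.items():
        vectorizer = make_vectorizer()
        Xtr = vectorizer.fit_transform(X_train)  # fit vocabulary on training data only
        Xva = vectorizer.transform(X_val)
        for model_name, make_model in models.items():
            model = make_model().fit(Xtr, y_train)
            scores[(model_name, feat_name)] = f1_score(
                y_val, model.predict(Xva), zero_division=0
            )
    return scores


for (model_name, feat_name), score in compare_models(tweets, labels).items():
    print(f"{model_name:20s} {feat_name:7s} F1={score:.2f}")
```

Scores on a dataset this small are not meaningful; the point is only the structure of the comparison.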
- Logistic Regression with TF-IDF achieved the highest F1-score.
- Visual comparisons via point plots highlight model-feature performance trade-offs.
- Word clouds and hashtag frequency plots offer intuitive insights into tweet content.
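The hashtag frequency plots mentioned above need a count of hashtags per sentiment class. A minimal sketch of that counting step, using only the standard library, is shown below; the helper name `top_hashtags` and the case-insensitive matching are assumptions, and the resulting pairs could then be passed to a plotting library such as seaborn.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")


def top_hashtags(tweets, n=10):
    """Return the n most frequent hashtags (lowercased) with their counts."""
    counts = Counter(
        tag.lower() for tweet in tweets for tag in HASHTAG_RE.findall(tweet)
    )
    return counts.most_common(n)


sample = ["Great day #happy #sun", "So #Happy right now", "Traffic again #monday"]
print(top_hashtags(sample, n=2))
```

Running this once on the positive tweets and once on the negative tweets gives the two frequency tables to compare.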
twitter-sentiment-analysis/
│
├── sentimentanalysistwitter.py # Main script
├── train_SentimentAnalysis.csv # Training dataset
├── test_SentimentAnalysis.csv # Test dataset
├── predictions.csv # Output predictions
├── requirements.txt # Dependencies
└── README.md # Project documentation
- ✅ Modular design for easy extension
- 📈 Visual insights into tweet content and trends
- 🔁 Reproducible and scalable for larger datasets
- 🧩 Easy to integrate into real-time sentiment monitoring systems
Pull requests are welcome! For major changes, please open an issue first to discuss what you’d like to modify.
This project is licensed under the MIT License.