A comprehensive machine learning system for detecting fake news articles using Natural Language Processing (NLP) and ensemble learning techniques. This project implements and compares multiple classification algorithms to achieve 99.74% accuracy in distinguishing between real and fabricated news content.
- Overview
- Features
- Installation
- Quick Start
- Project Structure
- Model Performance
- Dataset
- Methodology
- Usage
- Documentation
- Author
- License
The spread of misinformation poses a significant threat to public discourse and decision-making. This project addresses this challenge by developing a robust machine learning pipeline capable of automatically classifying news articles as real or fake based on their textual content.
- Multi-model comparison: Evaluates 7+ different ML algorithms
- High accuracy: Achieves 99.74% accuracy with Random Forest classifier
- Production-ready: Includes inference pipeline for real-world deployment
- Comprehensive analysis: Feature importance and model interpretability
- Well-documented: Full API reference and methodology documentation
- URL and HTML tag removal
- Stopword elimination
- Lemmatization
- TF-IDF vectorization with n-grams (unigrams + bigrams)
- Logistic Regression
- Random Forest Classifier
- Support Vector Machine (SVM)
- Naive Bayes
- Gradient Boosting
- AdaBoost
- Multi-layer Perceptron (MLP)
- Soft Voting Ensemble
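The soft voting ensemble averages the class probabilities predicted by its base learners rather than taking a majority vote. A minimal sketch with scikit-learn (the base models and toy data here are illustrative, not this project's exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Toy data standing in for TF-IDF features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Soft voting: average predicted class probabilities across base models
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
ensemble.fit(X, y)
proba = ensemble.predict_proba(X[:1])  # averaged probabilities, rows sum to 1
```

Note that every estimator in a soft-voting ensemble must implement `predict_proba`; an SVM, for example, would need `probability=True`.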
- Confusion matrices
- ROC curves
- Feature importance analysis
- Model comparison charts
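Each of these artifacts maps onto a standard scikit-learn or matplotlib call. As one example, the confusion matrices start from `sklearn.metrics.confusion_matrix` (a minimal sketch with toy labels):

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for real test-set output
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
# → [[2, 1],
#    [1, 2]]
```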
- Python 3.8 or higher
- pip package manager
1. Clone the repository

   ```bash
   git clone https://github.com/atahabilder1/FactCheck.git
   cd FactCheck
   ```

2. Create a virtual environment

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Download NLTK data

   ```python
   import nltk
   nltk.download('stopwords')
   nltk.download('wordnet')
   nltk.download('punkt')
   ```

5. Prepare the dataset

   Download the Fake and Real News Dataset and place `Fake.csv` and `True.csv` in the `dataset/` directory.
Train all models and generate visualizations:

```bash
python train.py
```

Train a specific model:

```bash
python train.py --model logistic_regression
python train.py --model random_forest
python train.py --model gradient_boosting
```

Classify a news article:

```bash
python predict.py "Your news article text here..."
```

Interactive mode:

```bash
python predict.py --interactive
```

From a file:

```bash
python predict.py --file article.txt
```

Example output:

```bash
# Fake news example
$ python predict.py "BREAKING: Scientists discover miracle cure! Share before they delete!"
Prediction: FAKE
Confidence: 92.9%

# Real news example
$ python predict.py "WASHINGTON (Reuters) - The Federal Reserve announced today..."
Prediction: REAL
Confidence: 83.5%
```

```
FactCheck/
├── dataset/              # Data files (not tracked in git)
│   ├── Fake.csv
│   └── True.csv
├── docs/                 # Documentation
│   ├── images/           # Generated visualizations
│   ├── index.md          # Documentation index
│   ├── methodology.md    # Technical methodology
│   ├── api.md            # API reference
│   └── results.md        # Results analysis
├── models/               # Saved model files (not tracked in git)
│   ├── best_model.pkl
│   ├── ensemble_model.pkl
│   ├── tfidf_vectorizer.pkl
│   └── metrics.json
├── notebooks/            # Jupyter notebooks
│   └── analysis.ipynb    # Exploratory analysis
├── src/                  # Source code modules
│   ├── __init__.py
│   ├── preprocessing.py  # Text preprocessing utilities
│   ├── models.py         # ML model implementations
│   ├── visualization.py  # Plotting functions
│   └── utils.py          # Helper utilities
├── train.py              # Main training script
├── predict.py            # Inference script
├── requirements.txt      # Python dependencies
├── LICENSE
└── README.md
```
Performance comparison on the test set (20% of data, 8,960 samples):
| Model | Accuracy | F1 Score | ROC-AUC |
|---|---|---|---|
| Random Forest | 99.74% | 99.75% | 0.9999 |
| Linear SVM | 99.61% | 99.63% | 0.9998 |
| Gradient Boosting | 99.58% | 99.59% | 0.9992 |
| AdaBoost | 99.52% | 99.54% | 0.9997 |
| MLP | 99.36% | 99.39% | 0.9996 |
| Logistic Regression | 98.95% | 98.99% | 0.9991 |
| Ensemble | 98.73% | 98.78% | 0.9993 |
| Naive Bayes | 95.40% | 95.59% | 0.9901 |
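The accuracy, F1, and ROC-AUC figures in the table are standard scikit-learn metrics. A hedged sketch of how such numbers are typically computed (the labels and probabilities below are toy values, not this project's results):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy predictions standing in for real test-set output
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.2]  # P(class = 1)

acc = accuracy_score(y_true, y_pred)   # fraction of correct labels
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision/recall
auc = roc_auc_score(y_true, y_prob)    # ranking quality of probabilities
```

Note that ROC-AUC is computed from predicted probabilities, not hard labels, which is why a model can pair a near-perfect AUC with a slightly lower accuracy.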
```
              precision    recall  f1-score   support

   Real News       1.00      1.00      1.00      4283
   Fake News       1.00      1.00      1.00      4677

    accuracy                           1.00      8960
```
This project uses the Fake and Real News Dataset from Kaggle.
| Category | Count | Percentage |
|---|---|---|
| Fake News | 23,481 | 52.3% |
| Real News | 21,417 | 47.7% |
| Total | 44,898 | 100% |
- Fake News: News, Politics, Left-news, Government News, US News, Middle-east
- Real News: Political News, World News
- Text Cleaning: Remove URLs, HTML tags, special characters
- Normalization: Convert to lowercase, handle whitespace
- Tokenization: Split text into individual tokens
- Stopword Removal: Filter common English stopwords
- Lemmatization: Reduce words to base form
- TF-IDF Vectorization with unigrams and bigrams
- Maximum 10,000 features
- Document frequency thresholds (min: 3, max: 95%)
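The feature-extraction settings above map directly onto scikit-learn's `TfidfVectorizer`; a minimal sketch with the listed parameter values (the corpus is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Parameters mirroring the settings listed above
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams + bigrams
    max_features=10000,   # cap the vocabulary size
    min_df=3,             # drop terms seen in fewer than 3 documents
    max_df=0.95,          # drop terms seen in more than 95% of documents
)

corpus = ["reuters reports the fed raised rates"] * 3 + \
         ["shocking miracle cure they hide from you"] * 3
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix
```

In the real pipeline, this vectorizer would be fit on the preprocessed training text and then persisted (the repository stores it as `tfidf_vectorizer.pkl`) so the same vocabulary is reused at inference time.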
The model identifies linguistic patterns that distinguish fake from real news:
- Top fake news indicators: sensational language, informal tone, vague attributions
- Top real news indicators: source citations (Reuters), formal language, specific data
```python
from predict import FakeNewsPredictor

# Initialize predictor
predictor = FakeNewsPredictor()

# Classify an article
result = predictor.predict("""
Scientists at MIT have developed a new renewable energy source
that could power entire cities. The research was published in
Nature journal after peer review.
""")

print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.1f}%")
```

```python
from src.preprocessing import TextPreprocessor, FeatureExtractor
from src.models import FakeNewsClassifier

# Preprocess data
preprocessor = TextPreprocessor()
extractor = FeatureExtractor(max_features=15000)

# Train custom model
classifier = FakeNewsClassifier(model_type='gradient_boosting')
classifier.fit(X_train, y_train)

# Evaluate
metrics = classifier.evaluate(X_test, y_test)
print(f"Accuracy: {metrics['accuracy']:.2%}")
```

Detailed documentation is available in the `docs/` directory:
- Documentation Index - Start here
- Methodology - Technical approach and algorithms
- API Reference - Function and class documentation
- Results Analysis - Detailed performance analysis
- Analysis Notebook - Interactive exploration
Anik Tahabilder, Ph.D. Student in Computer Science, Wayne State University
- GitHub: @atahabilder1
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset provided by Kaggle
- Built with scikit-learn, NLTK, and matplotlib
- Inspired by research in computational journalism and misinformation detection
Combating misinformation with machine learning



