A comprehensive machine learning system for detecting fake news articles using Natural Language Processing (NLP) and ensemble learning techniques. This project implements and compares multiple classification algorithms to achieve 99.74% accuracy in distinguishing between real and fabricated news content.
- Overview
- Features
- Installation
- Quick Start
- Project Structure
- Model Performance
- Dataset
- Methodology
- Usage
- Documentation
- Author
- License
The spread of misinformation poses a significant threat to public discourse and decision-making. This project addresses this challenge by developing a robust machine learning pipeline capable of automatically classifying news articles as real or fake based on their textual content.
- Multi-model comparison: Evaluates 7+ different ML algorithms
- High accuracy: Achieves 99.74% accuracy with Random Forest classifier
- Production-ready: Includes inference pipeline for real-world deployment
- Comprehensive analysis: Feature importance and model interpretability
- Well-documented: Full API reference and methodology documentation
- URL and HTML tag removal
- Stopword elimination
- Lemmatization
- TF-IDF vectorization with n-grams (unigrams + bigrams)
- Logistic Regression
- Random Forest Classifier
- Support Vector Machine (SVM)
- Naive Bayes
- Gradient Boosting
- AdaBoost
- Multi-layer Perceptron (MLP)
- Soft Voting Ensemble
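The soft voting ensemble averages the class probabilities predicted by its base learners rather than taking a majority vote. A minimal sketch with scikit-learn (the base models and toy data here are illustrative, not this project's exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Toy data standing in for TF-IDF features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Soft voting: average predicted class probabilities across base models
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
ensemble.fit(X, y)
proba = ensemble.predict_proba(X[:1])  # averaged probabilities, rows sum to 1
```

Note that every estimator in a soft-voting ensemble must implement `predict_proba`; an SVM, for example, would need `probability=True`.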
- Confusion matrices
- ROC curves
- Feature importance analysis
- Model comparison charts
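Each of these artifacts maps onto a standard scikit-learn or matplotlib call. As one example, the confusion matrices start from `sklearn.metrics.confusion_matrix` (a minimal sketch with toy labels):

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for real test-set output
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
# → [[2, 1],
#    [1, 2]]
```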
- Python 3.8 or higher
- pip package manager
1. Clone the repository

   ```bash
   git clone https://github.com/atahabilder1/FactCheck.git
   cd FactCheck
   ```

2. Create a virtual environment

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Download NLTK data

   ```python
   import nltk
   nltk.download('stopwords')
   nltk.download('wordnet')
   nltk.download('punkt')
   ```

5. Prepare the dataset

   Download the Fake and Real News Dataset and place `Fake.csv` and `True.csv` in the `dataset/` directory.
Train all models and generate visualizations:

```bash
python train.py
```

Train a specific model:

```bash
python train.py --model logistic_regression
python train.py --model random_forest
python train.py --model gradient_boosting
```

Classify a news article:

```bash
python predict.py "Your news article text here..."
```

Interactive mode:

```bash
python predict.py --interactive
```

From a file:

```bash
python predict.py --file article.txt
```

Example output:

```bash
# Fake news example
$ python predict.py "BREAKING: Scientists discover miracle cure! Share before they delete!"
Prediction: FAKE
Confidence: 92.9%

# Real news example
$ python predict.py "WASHINGTON (Reuters) - The Federal Reserve announced today..."
Prediction: REAL
Confidence: 83.5%
```

```
FactCheck/
├── dataset/              # Data files (not tracked in git)
│   ├── Fake.csv
│   └── True.csv
├── docs/                 # Documentation
│   ├── images/           # Generated visualizations
│   ├── index.md          # Documentation index
│   ├── methodology.md    # Technical methodology
│   ├── api.md            # API reference
│   └── results.md        # Results analysis
├── models/               # Saved model files (not tracked in git)
│   ├── best_model.pkl
│   ├── ensemble_model.pkl
│   ├── tfidf_vectorizer.pkl
│   └── metrics.json
├── notebooks/            # Jupyter notebooks
│   └── analysis.ipynb    # Exploratory analysis
├── src/                  # Source code modules
│   ├── __init__.py
│   ├── preprocessing.py  # Text preprocessing utilities
│   ├── models.py         # ML model implementations
│   ├── visualization.py  # Plotting functions
│   └── utils.py          # Helper utilities
├── train.py              # Main training script
├── predict.py            # Inference script
├── requirements.txt      # Python dependencies
├── LICENSE
└── README.md
```
Performance comparison on the test set (20% of data, 8,960 samples):
| Model | Accuracy | F1 Score | ROC-AUC |
|---|---|---|---|
| Random Forest | 99.74% | 99.75% | 0.9999 |
| Linear SVM | 99.61% | 99.63% | 0.9998 |
| Gradient Boosting | 99.58% | 99.59% | 0.9992 |
| AdaBoost | 99.52% | 99.54% | 0.9997 |
| MLP | 99.36% | 99.39% | 0.9996 |
| Logistic Regression | 98.95% | 98.99% | 0.9991 |
| Ensemble | 98.73% | 98.78% | 0.9993 |
| Naive Bayes | 95.40% | 95.59% | 0.9901 |
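The accuracy, F1, and ROC-AUC figures in the table are standard scikit-learn metrics. A hedged sketch of how such numbers are typically computed (the labels and probabilities below are toy values, not this project's results):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy predictions standing in for real test-set output
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.2]  # P(class = 1)

acc = accuracy_score(y_true, y_pred)   # fraction of correct labels
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision/recall
auc = roc_auc_score(y_true, y_prob)    # ranking quality of probabilities
```

Note that ROC-AUC is computed from predicted probabilities, not hard labels, which is why a model can pair a near-perfect AUC with a slightly lower accuracy.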
```
              precision    recall  f1-score   support

   Real News       1.00      1.00      1.00      4283
   Fake News       1.00      1.00      1.00      4677

    accuracy                           1.00      8960
```
This project uses the Fake and Real News Dataset from Kaggle.
| Category | Count | Percentage |
|---|---|---|
| Fake News | 23,481 | 52.3% |
| Real News | 21,417 | 47.7% |
| Total | 44,898 | 100% |
- Fake News: News, Politics, Left-news, Government News, US News, Middle-east
- Real News: Political News, World News
- Text Cleaning: Remove URLs, HTML tags, special characters
- Normalization: Convert to lowercase, handle whitespace
- Tokenization: Split text into individual tokens
- Stopword Removal: Filter common English stopwords
- Lemmatization: Reduce words to base form
- TF-IDF Vectorization with unigrams and bigrams
- Maximum 10,000 features
- Document frequency thresholds (min: 3, max: 95%)
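The feature-extraction settings above map directly onto scikit-learn's `TfidfVectorizer`; a minimal sketch with the listed parameter values (the corpus is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Parameters mirroring the settings listed above
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams + bigrams
    max_features=10000,   # cap the vocabulary size
    min_df=3,             # drop terms seen in fewer than 3 documents
    max_df=0.95,          # drop terms seen in more than 95% of documents
)

corpus = ["reuters reports the fed raised rates"] * 3 + \
         ["shocking miracle cure they hide from you"] * 3
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix
```

In the real pipeline, this vectorizer would be fit on the preprocessed training text and then persisted (the repository stores it as `tfidf_vectorizer.pkl`) so the same vocabulary is reused at inference time.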
The model identifies linguistic patterns that distinguish fake from real news:
- Top fake news indicators: sensational language, informal tone, vague attributions
- Top real news indicators: source citations (Reuters), formal language, specific data
```python
from predict import FakeNewsPredictor

# Initialize predictor
predictor = FakeNewsPredictor()

# Classify an article
result = predictor.predict("""
Scientists at MIT have developed a new renewable energy source
that could power entire cities. The research was published in
Nature journal after peer review.
""")

print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.1f}%")
```

```python
from src.preprocessing import TextPreprocessor, FeatureExtractor
from src.models import FakeNewsClassifier

# Preprocess data
preprocessor = TextPreprocessor()
extractor = FeatureExtractor(max_features=15000)

# Train custom model
classifier = FakeNewsClassifier(model_type='gradient_boosting')
classifier.fit(X_train, y_train)

# Evaluate
metrics = classifier.evaluate(X_test, y_test)
print(f"Accuracy: {metrics['accuracy']:.2%}")
```

Detailed documentation is available in the `docs/` directory:
- Documentation Index - Start here
- Methodology - Technical approach and algorithms
- API Reference - Function and class documentation
- Results Analysis - Detailed performance analysis
- Analysis Notebook - Interactive exploration
Anik Tahabilder, Ph.D. Student in Computer Science, Wayne State University
- GitHub: @atahabilder1
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset provided by Kaggle
- Built with scikit-learn, NLTK, and matplotlib
- Inspired by research in computational journalism and misinformation detection
Combating misinformation with machine learning



