FactCheck: Fake News Detection System

A comprehensive machine learning system for detecting fake news articles using Natural Language Processing (NLP) and ensemble learning techniques. This project implements and compares multiple classification algorithms to achieve 99.74% accuracy in distinguishing between real and fabricated news content.

Table of Contents

  • Overview
  • Features
  • Installation
  • Quick Start
  • Project Structure
  • Model Performance
  • Dataset
  • Methodology
  • Feature Analysis
  • Usage
  • Documentation
  • Author
  • License
  • Acknowledgments

Overview

The spread of misinformation poses a significant threat to public discourse and decision-making. This project addresses this challenge by developing a robust machine learning pipeline capable of automatically classifying news articles as real or fake based on their textual content.

Key Highlights

  • Multi-model comparison: Evaluates 7+ different ML algorithms
  • High accuracy: Achieves 99.74% accuracy with Random Forest classifier
  • Production-ready: Includes inference pipeline for real-world deployment
  • Comprehensive analysis: Feature importance and model interpretability
  • Well-documented: Full API reference and methodology documentation

Features

Text Preprocessing Pipeline

  • URL and HTML tag removal
  • Stopword elimination
  • Lemmatization
  • TF-IDF vectorization with n-grams (unigrams + bigrams)
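The cleaning steps above can be sketched in plain Python. This is an illustrative sketch, not the project's actual implementation: the stopword list is abbreviated (the pipeline uses NLTK's full English list), and the lemmatization step is omitted for brevity.

```python
import re

# Abbreviated stopword list for illustration; the project uses NLTK's full list.
STOPWORDS = {"the", "a", "an", "for", "of", "to", "and", "is"}

def clean_text(text: str) -> str:
    """Apply the cleaning steps above (lemmatization omitted for brevity)."""
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # drop special characters
    tokens = text.lower().split()               # normalize case, tokenize
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(clean_text("Visit <b>our</b> site https://example.com for the REAL story!"))
# → visit our site real story
```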

Machine Learning Models

  • Logistic Regression
  • Random Forest Classifier
  • Support Vector Machine (SVM)
  • Naive Bayes
  • Gradient Boosting
  • AdaBoost
  • Multi-layer Perceptron (MLP)
  • Soft Voting Ensemble
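A soft voting ensemble averages each member's predicted class probabilities rather than their hard votes. The sketch below shows the idea with scikit-learn's `VotingClassifier`; the toy numeric features stand in for the TF-IDF matrix, and `GaussianNB` substitutes for the text-oriented Naive Bayes the project would use on TF-IDF features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Toy numeric features standing in for the TF-IDF matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# voting="soft" averages the members' predicted class probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.predict_proba(X[:2]).shape)  # → (2, 2): one probability per class
```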

Evaluation & Visualization

  • Confusion matrices
  • ROC curves
  • Feature importance analysis
  • Model comparison charts
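The underlying metrics behind these visualizations come from scikit-learn. A minimal sketch with hypothetical labels and scores (1 = fake, 0 = real is an assumed convention):

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical predictions and probability scores for eight test articles.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.25, 0.1, 0.3, 0.7, 0.2]

print(confusion_matrix(y_true, y_pred))   # rows: true class, cols: predicted
# → [[4 0]
#    [1 3]]
print(roc_auc_score(y_true, y_score))     # → 0.9375
```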

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Setup

  1. Clone the repository

    git clone https://github.com/atahabilder1/FactCheck.git
    cd FactCheck
  2. Create virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Download NLTK data (run in a Python shell)

    import nltk
    nltk.download('stopwords')
    nltk.download('wordnet')
    nltk.download('punkt')
  5. Prepare the dataset

    Download the Fake and Real News Dataset and place Fake.csv and True.csv in the dataset/ directory.

Quick Start

Training

Train all models and generate visualizations:

python train.py

Train a specific model:

python train.py --model logistic_regression
python train.py --model random_forest
python train.py --model gradient_boosting

Prediction

Classify a news article:

python predict.py "Your news article text here..."

Interactive mode:

python predict.py --interactive

From a file:

python predict.py --file article.txt

Example Predictions

# Fake news example
$ python predict.py "BREAKING: Scientists discover miracle cure! Share before they delete!"
Prediction: FAKE
Confidence: 92.9%

# Real news example
$ python predict.py "WASHINGTON (Reuters) - The Federal Reserve announced today..."
Prediction: REAL
Confidence: 83.5%

Project Structure

FactCheck/
├── dataset/                 # Data files (not tracked in git)
│   ├── Fake.csv
│   └── True.csv
├── docs/                    # Documentation
│   ├── images/             # Generated visualizations
│   ├── index.md            # Documentation index
│   ├── methodology.md      # Technical methodology
│   ├── api.md              # API reference
│   └── results.md          # Results analysis
├── models/                  # Saved model files (not tracked in git)
│   ├── best_model.pkl
│   ├── ensemble_model.pkl
│   ├── tfidf_vectorizer.pkl
│   └── metrics.json
├── notebooks/               # Jupyter notebooks
│   └── analysis.ipynb      # Exploratory analysis
├── src/                     # Source code modules
│   ├── __init__.py
│   ├── preprocessing.py    # Text preprocessing utilities
│   ├── models.py           # ML model implementations
│   ├── visualization.py    # Plotting functions
│   └── utils.py            # Helper utilities
├── train.py                 # Main training script
├── predict.py               # Inference script
├── requirements.txt         # Python dependencies
├── LICENSE
└── README.md

Model Performance

Performance comparison on the test set (20% of data, 8,960 samples):

Model                  Accuracy   F1 Score   ROC-AUC
Random Forest          99.74%     99.75%     0.9999
Linear SVM             99.61%     99.63%     0.9998
Gradient Boosting      99.58%     99.59%     0.9992
AdaBoost               99.52%     99.54%     0.9997
MLP                    99.36%     99.39%     0.9996
Logistic Regression    98.95%     98.99%     0.9991
Ensemble               98.73%     98.78%     0.9993
Naive Bayes            95.40%     95.59%     0.9901

Model Comparison

(Model comparison chart omitted; train.py writes the generated visualizations to docs/images/.)

Best Model Classification Report

              precision    recall  f1-score   support

   Real News       1.00      1.00      1.00      4283
   Fake News       1.00      1.00      1.00      4677

    accuracy                           1.00      8960

Confusion Matrix

(Confusion matrix plot omitted; generated to docs/images/ during training.)

Dataset

This project uses the Fake and Real News Dataset from Kaggle.

Dataset Distribution

(Class distribution chart omitted; generated to docs/images/ during training.)

Dataset Statistics

Category     Count     Percentage
Fake News    23,481    52.3%
Real News    21,417    47.7%
Total        44,898    100%

Topics Covered

  • Fake News: News, Politics, Left-news, Government News, US News, Middle-east
  • Real News: Political News, World News

Methodology

Data Preprocessing

  1. Text Cleaning: Remove URLs, HTML tags, special characters
  2. Normalization: Convert to lowercase, handle whitespace
  3. Tokenization: Split text into individual tokens
  4. Stopword Removal: Filter common English stopwords
  5. Lemmatization: Reduce words to base form

Feature Extraction

  • TF-IDF Vectorization with unigrams and bigrams
  • Maximum 10,000 features
  • Document frequency thresholds (min: 3, max: 95%)

Feature Analysis

The model identifies linguistic patterns that distinguish fake from real news:

(Results summary chart omitted; see docs/results.md for the full analysis.)

Top Fake News Indicators: Sensational language, informal tone, vague attributions

Top Real News Indicators: Source citations (Reuters), formal language, specific data

Usage

Python API

from predict import FakeNewsPredictor

# Initialize predictor
predictor = FakeNewsPredictor()

# Classify an article
result = predictor.predict("""
    Scientists at MIT have developed a new renewable energy source
    that could power entire cities. The research was published in
    Nature journal after peer review.
""")

print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.1f}%")

Training Custom Models

from src.preprocessing import TextPreprocessor, FeatureExtractor
from src.models import FakeNewsClassifier

# Build the preprocessing components
preprocessor = TextPreprocessor()
extractor = FeatureExtractor(max_features=15000)

# X_train/y_train and X_test/y_test are the feature matrices and labels
# produced by the components above on your training and test splits
classifier = FakeNewsClassifier(model_type='gradient_boosting')
classifier.fit(X_train, y_train)

# Evaluate on held-out data
metrics = classifier.evaluate(X_test, y_test)
print(f"Accuracy: {metrics['accuracy']:.2%}")

Documentation

Detailed documentation is available in the docs/ directory:

  • docs/index.md: documentation index
  • docs/methodology.md: technical methodology
  • docs/api.md: API reference
  • docs/results.md: results analysis

Author

Anik Tahabilder
Ph.D. Student in Computer Science
Wayne State University

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Dataset provided by Kaggle
  • Built with scikit-learn, NLTK, and matplotlib
  • Inspired by research in computational journalism and misinformation detection

Combating misinformation with machine learning
