khnguyenn/FraudDetection

🔍 Real-Time Fraud Detection System

A production-ready fraud detection pipeline built on RabbitMQ and a trained machine learning model. Transactions are published to a queue, scored by the model, and reported as fraudulent or legitimate in real time.

📁 File Structure

FraudDetection/
├── src/
│   ├── producer.py           # Transaction data ingestion
│   ├── consumer.py           # ML processing engine
│   └── results_viewer.py     # Real-time results display
├── artifacts/
│   ├── model.joblib          # Trained ML model
│   └── preprocessor.joblib   # Data preprocessing pipeline
├── data/
│   └── new_applications.csv  # Sample transaction data
├── notebook_and_ppt/
│   └── models.ipynb          # Model training notebook
├── submissions/
│   └── *.csv                 # Model predictions
├── image/
│   └── system_architecture.jpg
├── requirements.txt          # Python dependencies
├── docker-compose.yml        # RabbitMQ infrastructure
└── README.md

System Architecture

The system has two main phases: an offline Training Phase and an online Prediction Phase.

(Architecture diagram: image/system_architecture.jpg)

Training Phase

  1. Data Loading & Merging: Loads raw data from multiple CSV files (train_transaction.csv, train_identity.csv) and merges them.
  2. Feature Engineering: Performs extensive feature engineering, including creating time-based features (Day, TransactionHours, DayofWeek), amount transformations (dollars, cents, log), email domain mapping, device categorization, and V-column selection. Feature selection parameters and domain mappings are calculated and saved.
  3. Data Splitting: Data is split using GroupKFold and temporal validation to prevent data leakage and mimic realistic fraud detection scenarios.
  4. Preprocessing: A preprocessing pipeline is defined to handle numerical features, categorical encoding, and feature scaling for the selected V-columns and engineered features.
  5. Model Training: An XGBoost model is trained on the preprocessed data, with class balancing to handle the imbalance between fraudulent and legitimate transactions.
  6. Artifact Saving: The trained preprocessor, model, and feature engineering parameters are saved to disk (model.joblib and preprocessor.joblib files) for use in the real-time prediction phase.
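
The time and amount transformations from step 2 can be sketched roughly as below. This is a minimal illustration rather than the notebook's code: the `TransactionDT` input field (assumed to be an offset in seconds), the `TransactionAmt` input field, the helper name `engineer_basic_features`, and the exact formulas are all assumptions. Only the output feature names come from the list above.

```python
import math

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def engineer_basic_features(tx: dict) -> dict:
    """Derive time- and amount-based features from one raw transaction.

    Assumes `TransactionDT` is an offset in seconds and `TransactionAmt`
    is a non-negative amount; both input field names are assumptions.
    """
    dt = int(tx["TransactionDT"])
    amt = float(tx["TransactionAmt"])

    day = dt // SECONDS_PER_DAY
    dollars = int(amt)                    # whole-dollar part
    cents = round((amt - dollars) * 100)  # fractional part, in cents

    return {
        "Day": day,
        "TransactionHours": (dt // SECONDS_PER_HOUR) % 24,
        "DayofWeek": day % 7,
        "dollars": dollars,
        "cents": cents,
        "TransactionAmt_log": math.log1p(amt),  # log(1 + amount) avoids log(0)
    }
```

Whatever formulas the training notebook actually uses, the consumer has to apply the same ones at prediction time so that the saved preprocessor sees features with the same meaning.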

Prediction Phase (Deployment Workflow)

1. Data Ingestion (Producer)

  • Reads transaction data from CSV (new_applications.csv)
  • Converts to JSON messages
  • Publishes to fraud_detection_queue
  • Rate-limited processing (1 tx/second)
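
The producer steps above could look roughly like the sketch below. It is not the code in producer.py: the message layout (`transaction_id` plus `data`), the function name `publish_transactions`, and the injected `publish` callable are assumptions made so the example runs without a broker.

```python
import csv
import io
import json
import time
from typing import Callable, TextIO


def publish_transactions(
    csv_file: TextIO,
    publish: Callable[[str], None],
    delay_seconds: float = 1.0,
) -> int:
    """Read CSV rows, send each one as a JSON message, and return the count sent."""
    sent = 0
    for index, row in enumerate(csv.DictReader(csv_file), start=1):
        # Hypothetical message layout: an ID plus the raw row as a dict.
        message = {"transaction_id": f"tx_{index}", "data": row}
        publish(json.dumps(message))
        sent += 1
        time.sleep(delay_seconds)  # rate limit: one message per delay interval
    return sent


if __name__ == "__main__":
    # print() stands in for the RabbitMQ publish call.
    sample = io.StringIO("TransactionAmt,ProductCD\n12.50,W\n99.99,C\n")
    publish_transactions(sample, publish=print, delay_seconds=0.0)
```

With pika, `publish` would typically wrap `channel.basic_publish(exchange="", routing_key="fraud_detection_queue", body=...)` on a channel where the queue has been declared.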

2. ML Processing (Consumer)

# Feature Engineering (139+ features)
├── Time Features: Day, TransactionHours, DayofWeek
├── V-columns: 100+ anonymized features (selected subset)
├── Amount Features: dollars, cents, TransactionAmt_log
├── Identity Features: email domains, device types
└── Unique IDs: card+email combinations

# ML Model: XGBoost/LightGBM (trained on 500K+ transactions)
├── Input: 139 engineered features
├── Output: Fraud probability [0-1]
└── Threshold: 0.5 (configurable)

  • Consumes from fraud_detection_queue
  • Feature Engineering:
    • V-column selection (drop the V-columns that were not selected during training)
    • Time-based features (Day, Hour, DayOfWeek)
    • Amount features (dollars, cents, log transform)
    • Email domain mapping
    • Device categorization
    • Unique identifier creation
  • Preprocessing: Transforms the engineered features using the loaded preprocessor.joblib
  • Prediction: Feeds the preprocessed data into the loaded model.joblib to predict the fraud status and fraud probability.
  • Publishing: Sends the results to fraud_results_queue
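
The scoring and thresholding step can be sketched as follows. This is an illustration under stated assumptions, not the code in consumer.py: `StubModel` stands in for the loaded model.joblib, the message fields are hypothetical, and feature engineering and preprocessing are reduced to a placeholder comment. The `predict_proba` shape follows the scikit-learn convention of one `[P(legitimate), P(fraud)]` pair per row.

```python
import json


class StubModel:
    """Placeholder for the loaded model; always returns a fixed fraud probability."""

    def __init__(self, fraud_probability: float) -> None:
        self.fraud_probability = fraud_probability

    def predict_proba(self, rows):
        # One [P(legitimate), P(fraud)] pair per input row.
        return [[1.0 - self.fraud_probability, self.fraud_probability] for _ in rows]


def score_message(body: str, model, threshold: float = 0.5) -> str:
    """Turn one queued transaction message into a JSON result message."""
    message = json.loads(body)
    # Feature engineering and preprocessor.transform(...) would happen here.
    features = [message["data"]]
    fraud_probability = model.predict_proba(features)[0][1]
    result = {
        "transaction_id": message["transaction_id"],
        "fraud_probability": fraud_probability,
        "is_fraud": fraud_probability >= threshold,
    }
    return json.dumps(result)
```

Keeping the threshold as a parameter reflects the "configurable" note above: lowering it flags more transactions as fraud at the cost of more false alarms.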

3. Results Display (Viewer)

  • Consumes from fraud_results_queue
  • Real-time fraud alerts
  • Transaction details and confidence scores
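
A viewer along these lines might format each result message as sketched below. The field names (`transaction_id`, `fraud_probability`, `is_fraud`) and the function name `format_result` are assumptions, not taken from results_viewer.py; the labels follow the sample output later in this README.

```python
import json

SEPARATOR = "=" * 60


def format_result(body: str) -> str:
    """Render one JSON result message as a multi-line alert block."""
    message = json.loads(body)
    percent = message["fraud_probability"] * 100
    if message["is_fraud"]:
        status = "🚨 STATUS: FRAUD DETECTED"
        score = f"🎯 Confidence: {percent:.1f}%"
    else:
        status = "✅ STATUS: LEGITIMATE TRANSACTION"
        score = f"🎯 Fraud Risk: {percent:.1f}%"
    lines = [
        SEPARATOR,
        "🔍 FRAUD DETECTION RESULT",
        SEPARATOR,
        f"Transaction ID: {message['transaction_id']}",
        status,
        score,
        SEPARATOR,
    ]
    return "\n".join(lines)
```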

📦 Installation

1. Clone Repository

git clone https://github.com/khnguyenn/FraudDetection
cd FraudDetection

2. Install Python Dependencies

pip install -r requirements.txt

3. Start RabbitMQ Infrastructure

docker run -d --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management

4. Verify RabbitMQ is Running

# Check containers
docker ps
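
Beyond `docker ps`, a quick way to confirm the broker is accepting connections is to probe the ports published by the `docker run` command above (5672 for AMQP, 15672 for the management UI). The helper below is a small sketch that is not part of the repository; the name `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for name, port in (("AMQP", 5672), ("Management UI", 15672)):
        state = "open" if is_port_open("localhost", port) else "closed"
        print(f"{name} port {port}: {state}")
```

An open port only shows that something is listening; RabbitMQ may still need a few seconds after container start before it accepts AMQP connections.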

🚀 Running the System

Quick Start (3 Terminal Setup)

Terminal 1: Start Results Viewer

cd src
python results_viewer.py

Output: 🎯 Listening for results on 'fraud_results_queue'

Terminal 2: Start ML Consumer

cd src
python consumer.py

Output: 🎯 Starting fraud detection consumer...

Terminal 3: Send Transaction Data

cd src
python producer.py

Output: Sent transaction tx_1, tx_2, tx_3...

Expected Results Flow

Raw data (split into single rows) → Producer → Queue → Consumer (feature engineering, preprocessing, ML model) → Results Queue → Viewer

tx_1: → Feature Engineering → Model Prediction → 85% fraud → 🚨 FRAUD DETECTED
tx_2: → Feature Engineering → Model Prediction → 12% fraud → ✅ LEGITIMATE

Use Real Fraud Cases

SAMPLE_DATA_CSV = "../data/new_applications.csv"
# Change this path in producer.py to point at the data file you want to send

Expected Output with Real Fraud

============================================================
🔍 FRAUD DETECTION RESULT
============================================================
Transaction ID: tx_6
✅ STATUS: LEGITIMATE TRANSACTION
🎯 Fraud Risk: 0.0%
============================================================

============================================================
🔍 FRAUD DETECTION RESULT
============================================================
Transaction ID: tx_4
🚨 STATUS: FRAUD DETECTED
🎯 Confidence: 92.6%
============================================================

Application Logs

# Consumer logs
tail -f consumer.log

# Producer logs
tail -f producer.log
