A production-ready fraud detection pipeline built on RabbitMQ, machine learning, and real-time processing. The system passes transaction data through a trained ML model to detect fraudulent activity in real time.
- Files Structure
- System Architecture
- Installation
- Running the System
- Expected Results Flow
- Use Real Fraud Cases
- Application Logs
FraudDetection/
├── src/
│   ├── producer.py            # Transaction data ingestion
│   ├── consumer.py            # ML processing engine
│   └── results_viewer.py      # Real-time results display
├── artifacts/
│   ├── model.joblib           # Trained ML model
│   └── preprocessor.joblib    # Data preprocessing pipeline
├── data/
│   └── new_applications.csv   # Sample transaction data
├── notebook_and_ppt/
│   └── models.ipynb           # Model training notebook
├── submissions/
│   └── *.csv                  # Model predictions
├── image/
│   └── system_architecture.jpg
├── requirements.txt           # Python dependencies
├── docker-compose.yml         # RabbitMQ infrastructure
└── README.md
The system is divided into two main phases: an offline Training Phase and an online Prediction Phase.
- Data Loading & Merging: Loads raw data from multiple CSV files (`train_transaction.csv`, `train_identity.csv`).
- Feature Engineering: Performs extensive feature engineering, including creating time-based features (Day, TransactionHours, DayofWeek), amount transformations (dollars, cents, log), email domain mapping, device categorization, and V-column selection. Feature selection parameters and domain mappings are calculated and saved.
- Data Splitting: Data is split using GroupKFold and temporal validation to prevent data leakage and mimic realistic fraud detection scenarios.
- Preprocessing: A preprocessing pipeline is defined to handle numerical features, categorical encoding, and feature scaling for the selected V-columns and engineered features.
- Model Training: An XGBoost model is trained on the preprocessed fraud detection data, with class balancing to handle the imbalanced nature of fraud cases.
- Artifact Saving: The trained preprocessor, model, and feature engineering parameters are saved to disk (`model.joblib` and `preprocessor.joblib`) for use in the real-time prediction phase.
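
The time-based and amount features named in the Feature Engineering step could be derived roughly as follows. This is a minimal sketch, not the notebook's actual code: it assumes a `TransactionDT` field holding a time offset in seconds and a `TransactionAmt` field in dollars, and both the function name `engineer_features` and the use of `log1p` for the log transform are illustrative assumptions.

```python
import math

SECONDS_PER_DAY = 86_400


def engineer_features(tx: dict) -> dict:
    """Return a copy of one transaction record with time and amount features added."""
    out = dict(tx)

    # Assumption: TransactionDT is an offset in seconds from some reference time.
    dt = int(tx["TransactionDT"])
    out["Day"] = dt // SECONDS_PER_DAY
    out["TransactionHours"] = (dt // 3600) % 24
    out["DayofWeek"] = out["Day"] % 7

    # Split the amount in integer cents to avoid floating-point surprises.
    amt = float(tx["TransactionAmt"])
    total_cents = round(amt * 100)
    out["dollars"], out["cents"] = divmod(total_cents, 100)

    # Assumption: log(1 + amount); the original transform may differ.
    out["TransactionAmt_log"] = math.log1p(amt)
    return out


features = engineer_features({"TransactionDT": 90_000, "TransactionAmt": 59.95})
```

Because the function works on a single record, the same logic could be reused at prediction time, where transactions arrive one message at a time.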
- Reads transaction data from CSV (`new_applications.csv`)
- Converts to JSON messages
- Publishes to `fraud_detection_queue`
- Rate-limited processing (1 tx/second)
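
The producer steps can be sketched as below. This is an illustration under stated assumptions rather than the contents of `producer.py`: the message layout (`transaction_id` plus a `data` field) and the `publish_transactions` helper are hypothetical, and the broker is replaced by an injected `publish` callable so the example runs without RabbitMQ.

```python
import csv
import io
import json
import time
from typing import Callable, Iterable


def publish_transactions(
    rows: Iterable[dict],
    publish: Callable[[str], None],
    delay_seconds: float = 1.0,
) -> int:
    """Serialize each row as JSON, hand it to `publish`, and pause between sends."""
    sent = 0
    for index, row in enumerate(rows, start=1):
        # Assumed message shape; the real producer may structure messages differently.
        message = {"transaction_id": f"tx_{index}", "data": row}
        publish(json.dumps(message))
        sent += 1
        time.sleep(delay_seconds)  # 1.0 s between sends gives about 1 tx/second
    return sent


# Demo: an in-memory CSV stands in for the data file, a list stands in for the queue.
sample_csv = io.StringIO("TransactionID,TransactionAmt\n1001,59.95\n1002,120.00\n")
outbox: list[str] = []
count = publish_transactions(csv.DictReader(sample_csv), outbox.append, delay_seconds=0.0)
```

With the `pika` client, `publish` could wrap a call such as `channel.basic_publish(exchange="", routing_key="fraud_detection_queue", body=body)`; that wiring is left out here to keep the sketch self-contained.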
# Feature Engineering (139+ features)
├── Time Features: Day, TransactionHours, DayofWeek
├── V-columns: 100+ anonymized features (selected subset)
├── Amount Features: dollars, cents, TransactionAmt_log
├── Identity Features: email domains, device types
└── Unique IDs: card+email combinations

# ML Model: XGBoost/LightGBM (trained on 500K+ transactions)
├── Input: 139 engineered features
├── Output: Fraud probability [0-1]
└── Threshold: 0.5 (configurable)

- Consumes from `fraud_detection_queue`
- Feature Engineering:
  - Remove V-cols
  - Time-based features (Day, Hour, DayOfWeek)
  - Amount features (dollars, cents, log transform)
  - Email domain mapping
  - Device categorization
  - Unique identifier creation
- Preprocessing: Transforms the engineered features using the loaded `preprocessor.joblib`
- Prediction: Feeds the preprocessed data into the loaded `model.joblib` to predict the fraud status and probability
- Publishing: Sends results to `fraud_results_queue`
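
The step between prediction and publishing, turning a fraud probability into a decision, might look like the sketch below. It assumes the 0.5 threshold is applied inclusively (`>=`) and that results are published as JSON with the fields shown; the `build_result` helper and those field names are hypothetical, not taken from `consumer.py`.

```python
import json

FRAUD_THRESHOLD = 0.5  # configurable decision threshold


def build_result(transaction_id: str, fraud_probability: float,
                 threshold: float = FRAUD_THRESHOLD) -> str:
    """Apply the threshold to a model probability and return a JSON result message."""
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be within [0, 1]")
    result = {
        "transaction_id": transaction_id,
        "fraud_probability": fraud_probability,
        # Assumption: a probability exactly at the threshold counts as fraud.
        "is_fraud": fraud_probability >= threshold,
    }
    return json.dumps(result)


flagged = json.loads(build_result("tx_4", 0.926))
cleared = json.loads(build_result("tx_6", 0.0))
```

Lowering the threshold trades more false alarms for fewer missed fraud cases, which is why keeping it configurable is useful on imbalanced data.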
- Consumes from `fraud_results_queue`
- Real-time fraud alerts
- Transaction details and confidence scores
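
As a rough illustration of how a viewer could render one result message, the sketch below formats a JSON body into a text banner. The input fields (`transaction_id`, `fraud_probability`, `is_fraud`) and the `format_result` function are assumptions made for this example; the real `results_viewer.py` may use a different message format and layout.

```python
import json


def format_result(body: str) -> str:
    """Render one JSON result message as a multi-line text banner."""
    result = json.loads(body)
    probability = float(result["fraud_probability"])
    status = "FRAUD DETECTED" if result["is_fraud"] else "LEGITIMATE TRANSACTION"
    border = "=" * 60
    return "\n".join([
        border,
        "FRAUD DETECTION RESULT",
        border,
        f"Transaction ID: {result['transaction_id']}",
        f"STATUS: {status}",
        f"Fraud probability: {probability:.1%}",
        border,
    ])


report = format_result(
    '{"transaction_id": "tx_4", "fraud_probability": 0.926, "is_fraud": true}'
)
```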
git clone https://github.com/khnguyenn/FraudDetection
cd FraudDetection

pip install -r requirements.txt

docker run -d --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management

# Check containers
docker ps

cd src
python results_viewer.py
Output: 🎯 Listening for results on 'fraud_results_queue'

cd src
python consumer.py
Output: 🎯 Starting fraud detection consumer...

cd src
python producer.py
Output: Sent transaction tx_1, tx_2, tx_3...
Raw data (split into single rows) → Producer → Queue → Consumer (feature engineering, preprocessing, machine learning model) → Results Queue → Viewer
tx_1: → Feature Engineering → Model Prediction → 85% fraud → 🚨 FRAUD DETECTED
tx_2: → Feature Engineering → Model Prediction → 12% fraud → ✅ LEGITIMATE
SAMPLE_DATA_CSV = "../data/new_applications.csv"
# Point this at the data file you want to use (set in producer.py)

============================================================
FRAUD DETECTION RESULT
============================================================
Transaction ID: tx_6
✅ STATUS: LEGITIMATE TRANSACTION
🎯 Fraud Risk: 0.0%
============================================================

============================================================
FRAUD DETECTION RESULT
============================================================
Transaction ID: tx_4
🚨 STATUS: FRAUD DETECTED
🎯 Confidence: 92.6%
============================================================
# Consumer logs
tail -f consumer.log
# Producer logs
tail -f producer.log