
Luka98122/Malware-Classification


Malware Detection & Classification with Convolutional Neural Networks

This repository contains the implementation of a deep learning-based malware classification system developed for the PFE Camp '26. The project explores the effectiveness of treating Windows executable binaries as 2D images, allowing a Convolutional Neural Network (CNN) to "see" and classify malware families based on visual byte-level textures.

Project Overview

Traditional malware detection relies on signatures or complex heuristics. This project implements a Static Analysis pipeline that transforms raw .bytes files from the Microsoft Malware Classification Challenge into grayscale images. By leveraging the spatial hierarchy of CNNs, the model identifies structural patterns unique to specific malware families.
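The byte-to-image step can be sketched as follows. In the BIG 2015 dataset, each line of a .bytes file is an address followed by hexadecimal byte values, with ?? marking bytes that are missing from the dump. The function name, the fixed row width, and the choice to map ?? to 0 are assumptions made for illustration; the notebook may handle these details differently.

```python
import numpy as np


def bytes_text_to_image(text: str, width: int = 256) -> np.ndarray:
    """Convert the text of a .bytes file into a 2D uint8 array (one byte per pixel)."""
    values = []
    for line in text.splitlines():
        tokens = line.split()
        # The first token is the address column; the rest are hex byte pairs.
        for tok in tokens[1:]:
            # "??" marks a missing byte; mapping it to 0 is an assumption.
            values.append(0 if tok == "??" else int(tok, 16))
    if not values:
        raise ValueError("no byte values found")

    flat = np.asarray(values, dtype=np.uint8)
    # Zero-pad so the byte stream fills a whole number of rows.
    pad = (-len(flat)) % width
    flat = np.pad(flat, (0, pad))
    return flat.reshape(-1, width)


sample = "00401000 4D 5A 90 00\n00401004 ?? FF 01 02\n"
image = bytes_text_to_image(sample, width=4)
print(image.shape)
```

The resulting array has a fixed width but a height that depends on file size, so it would still need to be resized to the 256x256 or 512x512 input resolution before training.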

Final Performance (Current Best)

  • Validation Accuracy: 98.3%
  • Kaggle Public Score (multi-class log loss, lower is better): 0.06005
  • Kaggle Private Score (multi-class log loss, lower is better): 0.06321
  • Leaderboard Rank: Top 175
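The Kaggle scores above are multi-class log loss values. A minimal sketch of that metric in plain Python is shown below; the function name and the clipping constant eps are assumptions, and the official scorer may differ in small details.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class indices (0-based).
    probs:  list of per-sample probability rows, one value per class.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0), then renormalise so each row sums to 1.
        clipped = [min(max(p, eps), 1 - eps) for p in row]
        norm = sum(clipped)
        total += -math.log(clipped[label] / norm)
    return total / len(y_true)


# A uniform guess over the 9 families scores log(9), roughly 2.197.
uniform = [[1 / 9] * 9]
print(multiclass_log_loss([0], uniform))
```

Because the metric penalises confident wrong answers heavily, a model with high accuracy can still score poorly if its probabilities are badly calibrated.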

Dataset

The project utilizes the Microsoft Malware Classification Challenge (BIG 2015) dataset:

  • Samples: 10,868 training files.
  • Classes: 9 distinct malware families:
    1. Ramnit | 2. Lollipop | 3. Kelihos_ver3 | 4. Vundo | 5. Simda | 6. Tracur | 7. Kelihos_ver1 | 8. Obfuscator.ACY | 9. Gatak.

Iterative Methodology & Evolution

The project followed an incremental improvement strategy to optimize the Log Loss score:

| Milestone    | Resolution | Key Features               | Accuracy |
|--------------|------------|----------------------------|----------|
| Baseline     | 256x256    | Simple 3-layer CNN         | ~80%     |
| Stage 2      | 256x256    | Increased depth & epochs   | ~85%     |
| Stage 3      | 256x256    | Deep 5-layer architecture  | ~90%     |
| Stage 4      | 256x256    | Added Batch Normalization  | ~95%     |
| Final Static | 512x512    | LR Scheduler & Dropout     | 98.3%    |
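The final milestone mentions a learning-rate scheduler but does not say which one. As a rough sketch, assuming torch.optim.lr_scheduler.StepLR, a toy model, and random data (all stand-ins chosen for illustration), the scheduler would be wired into a training loop like this:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for the real network; only the optimizer/scheduler wiring matters here.
model = nn.Sequential(nn.Flatten(), nn.Dropout(p=0.2), nn.Linear(16, 9))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate every 2 epochs (values are assumptions).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 1, 4, 4)       # random "images"
y = torch.randint(0, 9, (8,))     # random labels for the 9 families

lr_history = []
for epoch in range(4):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    lr_history.append(optimizer.param_groups[0]["lr"])
    scheduler.step()              # advance the schedule once per epoch

print(lr_history)
```

The scheduler is stepped after the optimizer, once per epoch, so the first two epochs run at the initial rate and the next two at half of it.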

Model Architecture

The final model is a custom deep CNN optimized for 512x512 inputs:

  • Feature Extractor: 6 Convolutional blocks with nn.BatchNorm2d and nn.ReLU.
  • Downsampling: 5 nn.MaxPool2d layers.
  • Global Pooling: nn.AdaptiveAvgPool2d((1, 1)) for resolution independence.
  • Classifier: Dense layer with Dropout(p=0.2) to reduce overfitting.
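The layer list above could be assembled roughly as follows. This is a sketch rather than the notebook's actual model: the class name, channel widths, 3x3 kernels, and the choice to pool after the first five blocks are all assumptions. Only the block count, the pooling count, the global pooling layer, and the dropout rate come from the description.

```python
import torch
from torch import nn


def conv_block(in_ch: int, out_ch: int, pool: bool) -> nn.Sequential:
    """Conv -> BatchNorm -> ReLU, optionally followed by 2x2 max pooling."""
    layers = [
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    ]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)


class MalwareCNN(nn.Module):
    def __init__(self, num_classes: int = 9, widths=(16, 32, 64, 128, 256, 256)):
        super().__init__()
        blocks = []
        in_ch = 1  # single-channel grayscale input
        for i, out_ch in enumerate(widths):
            # Pool after every block except the last: 6 blocks, 5 pools.
            blocks.append(conv_block(in_ch, out_ch, pool=i < len(widths) - 1))
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        # Collapses any spatial size to 1x1, so 256x256 and 512x512 both work.
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.2),
            nn.Linear(in_ch, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.pool(self.features(x)))
```

With five 2x2 pooling layers, a 512x512 input is reduced to a 16x16 feature map before global average pooling, and a 256x256 input to 8x8. The classifier sees the same number of features in both cases.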

Tech Stack

  • Framework: PyTorch (v2.x)
  • GPU: NVIDIA GeForce RTX 3080 Ti (CUDA acceleration)
  • Language: Python 3.12
  • Libraries: Pandas, NumPy, Pathlib, tqdm, Scikit-learn

Directory Structure

.
├── data
│   ├── processed_tensors_256
│   │   └── train
│   ├── processed_tensors_512
│   │   └── train
│   ├── test_raw
│   │   └── test
│   └── train_raw
│       └── train
├── models
│   └── all
└── submissions
    └── hidden

Repository Structure

  • Malware_Classification.ipynb: The main notebook containing all logic for data preprocessing, model architecture, training loops, and inference.
  • data/: Contains the raw Microsoft Malware dataset (.bytes and .asm files) in train_raw/ and test_raw/, plus the preprocessed inputs in processed_tensors_256/ and processed_tensors_512/.
  • models/: Stores trained model weights (.pth files), including the final high-performance model.
  • submissions/: Stores generated submission.csv files for Kaggle benchmarking.
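A submission file could be written along these lines. The header layout (an Id column followed by Prediction1 through Prediction9) is an assumption that should be checked against the competition's sample submission, and write_submission is a hypothetical helper, not a function from the notebook.

```python
import csv
import io

NUM_CLASSES = 9


def write_submission(file_ids, probs, out) -> None:
    """Write one row of class probabilities per file ID to a text stream."""
    writer = csv.writer(out, lineterminator="\n")
    header = ["Id"] + [f"Prediction{i}" for i in range(1, NUM_CLASSES + 1)]
    writer.writerow(header)
    for file_id, row in zip(file_ids, probs):
        if len(row) != NUM_CLASSES:
            raise ValueError(f"expected {NUM_CLASSES} probabilities for {file_id}")
        writer.writerow([file_id] + [f"{p:.6f}" for p in row])


# Example with a made-up file ID and a uniform prediction.
buffer = io.StringIO()
write_submission(["example_file_id"], [[1 / 9] * 9], buffer)
print(buffer.getvalue())
```

In practice the out argument would be an open file under submissions/, and the probability rows would come from a softmax over the model's outputs.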

Future Work (Phase 2 & 3)

  • RGB Representation: Implementing a 3-channel image mapping (Red: Raw Bytes, Green: Local Entropy, Blue: ASM Metadata).
  • Benign vs. Malware: Extending the system to distinguish between safe Windows executables and malicious files.
  • Dynamic Analysis: Integrating sandbox-based behavior monitoring (API calls, network activity) as a secondary classification layer.
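For the planned green channel, local entropy could be computed as Shannon entropy over fixed-size windows of bytes. The sketch below is one possible reading of that idea; the window size, the non-overlapping windows, and both function names are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window: bytes) -> float:
    """Shannon entropy of a byte window in bits (0.0 to 8.0)."""
    if not window:
        return 0.0
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def local_entropy(data: bytes, window: int = 256) -> list[float]:
    """Entropy of each consecutive, non-overlapping window of the byte stream."""
    return [shannon_entropy(data[i:i + window]) for i in range(0, len(data), window)]


# A constant region has zero entropy; a region with all 256 byte values has 8 bits.
data = bytes(256) + bytes(range(256))
print(local_entropy(data))
```

High-entropy regions often correspond to packed or encrypted sections, which is why entropy is a plausible second channel alongside the raw byte values.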

Author: Luka Marković
Project: PFE Camp '26 Proposal Implementation

About

My solution to the Microsoft Malware Classification Challenge (BIG 2015) Kaggle competition.
