
Text Classification Using LLM and ML/DL Techniques


Project Overview

This project presents a comparative study of text classification using Large Language Models (LLMs) and traditional Machine Learning (ML) techniques on a domain-specific dataset.

Sept 2024 – Oct 2024

  • Conducted a comparative study of text classification using LLMs (Llama-3 8B) and classical ML models (TF-IDF + Logistic Regression) on a domain-specific dataset.
  • Performed data preprocessing and class imbalance handling, and applied prompt engineering techniques (zero-shot and EHC) with LLM fine-tuning for domain adaptation.
  • Utilized Ollama, Hugging Face, and LangChain to structure LLM inference, fine-tuning, and evaluation workflows, enabling systematic benchmarking against traditional machine learning approaches.

Two independent approaches are implemented:

  1. LLM-Based Approach
     • Uses Llama-3 8B via Ollama
     • Performs domain-specific fine-tuning to adapt the model to the classification task
     • Applies prompt engineering strategies, including:
       • Zero-shot prompting
       • EHC prompting
     • Evaluates the effectiveness of LLM-based classification against traditional ML methods
  2. Classical Machine Learning Approach
     • Uses TF-IDF vectorization
     • Uses a Logistic Regression classifier
     • Provides a traditional NLP pipeline for benchmarking against LLMs

The project is fully containerized using Docker, ensuring reproducibility and consistent execution environments.


Project Highlights

  • Comparative evaluation between LLM-based and traditional ML classification
  • Prompt engineering experimentation with LLMs
  • Domain-specific dataset preprocessing
  • Class imbalance handling
  • Modular pipeline for data preprocessing, model execution, and result generation
  • Fully Dockerized workflow
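Class imbalance handling can take several forms; one simple option is random oversampling of minority classes up to the majority count. The sketch below assumes that strategy and uses invented sample data; the project may use a different technique (e.g. class weighting).

```python
import random
from collections import Counter, defaultdict

def oversample(samples: list[tuple[str, str]], seed: int = 42) -> list[tuple[str, str]]:
    """Randomly duplicate minority-class (text, label) pairs until every
    class reaches the majority class count."""
    rng = random.Random(seed)
    by_label: defaultdict[str, list[tuple[str, str]]] = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    target = max(len(rows) for rows in by_label.values())
    balanced = []
    for label, rows in by_label.items():
        balanced.extend(rows)
        # top up the minority classes with random duplicates
        balanced.extend(rng.choice(rows) for _ in range(target - len(rows)))
    return balanced

data = [("a", "pos"), ("b", "pos"), ("c", "pos"), ("d", "neg")]
counts = Counter(label for _, label in oversample(data))
# both classes now have 3 samples each
```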

Technologies Used

  • Python
  • Docker
  • Ollama
  • Hugging Face
  • LangChain
  • Scikit-learn
  • TF-IDF
  • Logistic Regression
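Several of these components combine into the classical baseline. A minimal scikit-learn sketch of a TF-IDF + Logistic Regression pipeline is shown below; the corpus and hyperparameters are illustrative, not the project's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus; the real project trains on the TSV files in data/.
train_texts = [
    "great match and a late goal",
    "stocks fell as rates rose",
    "the striker scored twice",
    "inflation pressured the markets",
]
train_labels = ["sports", "economy", "sports", "economy"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    # class_weight="balanced" is one common way to counter class imbalance
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
clf.fit(train_texts, train_labels)
preds = clf.predict(["the team won the final"])
```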

Project Structure

.
├── data
│   ├── cleaned_test_data.tsv
│   ├── cleaned_train_data.tsv
│   ├── cleaned_test_data_ml.tsv
│   ├── cleaned_train_data_ml.tsv
│   └── TWNERTC_TC_Fine_Grained...
│
├── models
│   ├── llm_model.py
│   └── ml_model.py
│
├── preprocessing
│   ├── dataset.py
│   └── dataset_ml.py
│
├── results
│   ├── llm
│   └── ml
│
├── Dockerfile
├── requirements.txt
└── README.md

Folder Descriptions

data/

Contains datasets used for training and evaluation.

  • cleaned_train_data.tsv – Training dataset for the LLM model
  • cleaned_test_data.tsv – Test dataset for the LLM model
  • cleaned_train_data_ml.tsv – Training dataset for the ML model
  • cleaned_test_data_ml.tsv – Test dataset for the ML model
  • TWNERTC_TC_Fine_Grained... – Original dataset used during preprocessing

models/

Contains the scripts that run the classification models.

  • llm_model.py – Executes the LLM-based text classification pipeline
  • ml_model.py – Executes the machine learning classification pipeline

Each script can be executed independently.


preprocessing/

Contains dataset preprocessing scripts.

  • dataset.py – Preprocessing pipeline for the LLM dataset
  • dataset_ml.py – Preprocessing pipeline for the ML dataset

results/

Stores the outputs generated by the models.

  • llm/ – Outputs generated by the LLM model
  • ml/ – Outputs generated by the ML model

Running the Project

Prerequisites

Make sure the following is installed:

  • Docker

Check installation:

docker --version

Build the Docker Image

From the root directory of the project, run:

docker build -t hepsiburadacasestudy .

This builds the Docker image and installs all dependencies from requirements.txt.
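A Dockerfile along these lines would produce that behavior (an illustrative sketch, not necessarily the project's actual file):

```dockerfile
# Illustrative Dockerfile sketch; base image and layout are assumptions
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# No default CMD: the model script to run is passed on `docker run`
```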


Running the Models

Running the LLM Model

To execute the LLM-based model inside Docker:

docker run -v $(pwd)/data:/app/data -v $(pwd)/results/llm:/app/results/llm hepsiburadacasestudy python models/llm_model.py

Running the ML Model

To execute the Machine Learning model:

docker run -v $(pwd)/data:/app/data -v $(pwd)/results/ml:/app/results/ml hepsiburadacasestudy python models/ml_model.py

Each docker run command above:

  • Mounts the dataset directory
  • Saves results to the local results folder

Docker Volume Mounts Explained

  • -v $(pwd)/data:/app/data – Makes datasets accessible inside the container
  • -v $(pwd)/results/llm:/app/results/llm – Saves LLM outputs to the local machine
  • -v $(pwd)/results/ml:/app/results/ml – Saves ML outputs to the local machine

Dependencies

All Python dependencies are listed in:

requirements.txt

They are automatically installed during the Docker build process.

Key libraries include:

  • scikit-learn
  • pandas
  • numpy
  • langchain
  • Hugging Face libraries
  • Ollama integration tools

Future Improvements

  • Add Deep Learning models (BERT / RoBERTa)
  • Expand evaluation metrics
  • Implement automated experiment tracking
  • Add visualization dashboards

License

This project is intended for research and educational purposes.
