This project presents a comparative study of text classification using Large Language Models (LLMs) and traditional Machine Learning (ML) techniques on a domain-specific dataset.
Sept 2024 – Oct 2024
Text Classification Using LLM and ML/DL Techniques
- Conducted a comparative study of text classification using LLMs (Llama-3 8B) and classical ML models (TF-IDF + Logistic Regression) on a domain-specific dataset.
- Performed data preprocessing and handled class imbalance; applied prompt engineering techniques (zero-shot and EHC) together with LLM fine-tuning for domain adaptation.
- Utilized Ollama, Hugging Face, and LangChain to structure LLM inference, fine-tuning, and evaluation workflows, enabling systematic benchmarking against traditional machine learning approaches.
Two independent approaches are implemented:
- LLM-Based Approach
- Uses Llama-3 8B via Ollama
- Performed domain-specific fine-tuning to adapt the model for the classification task
- Applied prompt engineering strategies, including:
- Zero-shot prompting
- EHC prompting
- Evaluated the effectiveness of LLM-based classification compared to traditional ML methods
- Classical Machine Learning Approach
- Uses TF-IDF vectorization
- Logistic Regression classifier
- Traditional NLP pipeline for benchmarking against LLMs
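The classical baseline above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the project's actual `ml_model.py`; the toy texts and labels are invented for the example.

```python
# Minimal sketch of the classical baseline: TF-IDF features + Logistic Regression.
# The texts and labels below are illustrative, not from the project's dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "great phone battery",
    "terrible screen quality",
    "battery lasts long",
    "screen cracked quickly",
]
labels = ["positive", "negative", "positive", "negative"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),               # raw text -> TF-IDF vectors
    ("lr", LogisticRegression(max_iter=1000)),  # linear classifier on top
])
clf.fit(texts, labels)
pred = clf.predict(["battery is great"])
```

Wrapping both steps in a `Pipeline` keeps vectorization and classification fitted together, which avoids leaking test data into the vocabulary.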
The project is fully containerized using Docker, ensuring reproducibility and consistent execution environments.
- Comparative evaluation between LLM-based and traditional ML classification
- Prompt engineering experimentation with LLMs
- Domain-specific dataset preprocessing
- Class imbalance handling
- Modular pipeline for data preprocessing, model execution, and result generation
- Fully Dockerized workflow
- Python
- Docker
- Ollama
- Hugging Face
- LangChain
- Scikit-learn
- TF-IDF
- Logistic Regression
```
.
├── data
│   ├── cleaned_test_data.tsv
│   ├── cleaned_train_data.tsv
│   ├── cleaned_test_data_ml.tsv
│   ├── cleaned_train_data_ml.tsv
│   └── TWNERTC_TC_Fine_Grained...
│
├── models
│   ├── llm_model.py
│   └── ml_model.py
│
├── preprocessing
│   ├── dataset.py
│   └── dataset_ml.py
│
├── results
│   ├── llm
│   └── ml
│
├── Dockerfile
├── requirements.txt
└── README.md
```
`data/` contains the datasets used for training and evaluation:

- `cleaned_train_data.tsv` – Training dataset for the LLM model
- `cleaned_test_data.tsv` – Test dataset for the LLM model
- `cleaned_train_data_ml.tsv` – Training dataset for the ML model
- `cleaned_test_data_ml.tsv` – Test dataset for the ML model
- `TWNERTC_TC_Fine_Grained...` – Original dataset used during preprocessing
`models/` contains the scripts that run the classification models:

- `llm_model.py` – Executes the LLM-based text classification pipeline
- `ml_model.py` – Executes the machine learning classification pipeline

Each script can be executed independently.
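As an illustration of the zero-shot prompting strategy used in the LLM pipeline, the sketch below shows how a classification prompt can be assembled before being sent to Llama-3 8B through Ollama/LangChain. The category names and the `build_zero_shot_prompt` helper are hypothetical, not the project's actual code.

```python
# Hedged sketch: assembling a zero-shot classification prompt. The helper and
# the category list are illustrative; the project's real prompt may differ.
CATEGORIES = ["sports", "health", "technology"]  # example label set

def build_zero_shot_prompt(text: str, categories: list) -> str:
    """Return a prompt asking the model to pick exactly one category."""
    options = ", ".join(categories)
    return (
        "You are a text classifier.\n"
        f"Categories: {options}\n"
        f"Text: {text}\n"
        "Answer with exactly one category name."
    )

prompt = build_zero_shot_prompt("The team won the championship.", CATEGORIES)
# The prompt would then be sent to the model, e.g. via langchain_ollama:
#   ChatOllama(model="llama3").invoke(prompt)
```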
`preprocessing/` contains the dataset preprocessing scripts:

- `dataset.py` – Preprocessing pipeline for the LLM dataset
- `dataset_ml.py` – Preprocessing pipeline for the ML dataset
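One common way to handle the class imbalance mentioned above is to compute per-class weights that the classifier can consume (e.g. via `LogisticRegression(class_weight=...)`). The sketch below uses scikit-learn's `compute_class_weight`; the toy 80/20 label distribution is illustrative only and is not taken from the project's dataset.

```python
# Hedged sketch of one class-imbalance strategy: "balanced" per-class weights.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array(["sports"] * 8 + ["health"] * 2)   # toy 80/20 imbalance
classes = np.unique(y)                          # sorted: ['health', 'sports']
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
# "balanced" weight = n_samples / (n_classes * class_count),
# so the minority class receives the larger weight.
weight_by_class = dict(zip(classes, weights))
```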
`results/` stores the outputs generated by the models:

```
results/
├── llm/
└── ml/
```

- `llm/` – Output generated by the LLM model
- `ml/` – Output generated by the ML model
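For the comparative evaluation, predictions from `results/llm` and `results/ml` can be scored with the same metrics so the two pipelines are directly comparable. The sketch below uses scikit-learn; the gold and predicted label lists are invented stand-ins for a real output file.

```python
# Hedged sketch: scoring both pipelines with identical metrics.
# The label lists below are illustrative, not real results.
from sklearn.metrics import accuracy_score, f1_score

gold        = ["sports", "health", "sports", "tech"]
predictions = ["sports", "health", "tech",   "tech"]

accuracy = accuracy_score(gold, predictions)
macro_f1 = f1_score(gold, predictions, average="macro")
```

Macro-averaged F1 weights every class equally, which matters when the dataset is imbalanced.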
Make sure the following is installed:
- Docker
Check installation:
```bash
docker --version
```

From the root directory of the project, run:

```bash
docker build -t hepsiburadacasestudy .
```

This builds the Docker image and installs all dependencies from `requirements.txt`.
To execute the LLM-based model inside Docker:
```bash
docker run -v $(pwd)/data:/app/data -v $(pwd)/results/llm:/app/results/llm hepsiburadacasestudy python models/llm_model.py
```

To execute the machine learning model:

```bash
docker run -v $(pwd)/data:/app/data -v $(pwd)/results/ml:/app/results/ml hepsiburadacasestudy python models/ml_model.py
```

Each command:

- Mounts the dataset directory into the container
- Saves results to the local `results/` folder
| Mount | Description |
|---|---|
| `-v $(pwd)/data:/app/data` | Makes datasets accessible inside the container |
| `-v $(pwd)/results/llm:/app/results/llm` | Saves LLM outputs to the local machine |
| `-v $(pwd)/results/ml:/app/results/ml` | Saves ML outputs to the local machine |
All Python dependencies are listed in `requirements.txt` and are installed automatically during the Docker build process.
Key libraries include:
- scikit-learn
- pandas
- numpy
- langchain
- Hugging Face libraries
- Ollama integration tools
- Add Deep Learning models (BERT / RoBERTa)
- Expand evaluation metrics
- Implement automated experiment tracking
- Add visualization dashboards
This project is intended for research and educational purposes.