Cyberbullying Detection — GPT Transformer from Scratch

A cyberbullying text classifier built by implementing a GPT-style transformer from scratch in PyTorch, then fine-tuning on a domain-specific dataset. The primary goal is demonstrating hands-on understanding of transformer internals: attention mechanisms, token/positional embeddings, layer normalization, and feed-forward blocks.

Based on Sebastian Raschka's Build a Large Language Model from Scratch, adapted for binary classification.


What Was Built from Scratch

The full transformer architecture lives in gpt/ and was implemented without using Hugging Face or any high-level transformer library:

| Component | File | Description |
| --- | --- | --- |
| Token + Positional Embeddings | `gpt/gpt.py` | Maps token IDs to dense vectors; adds learned positional encodings |
| Multi-Head Self-Attention | `gpt/transformer_block/multi_head_attention.py` | Scaled dot-product attention with causal masking, 12 heads |
| Feed-Forward Network | `gpt/transformer_block/feed_forward.py` | Two-layer MLP (768 → 3072 → 768) with GELU activation |
| Layer Normalization | `gpt/transformer_block/layer_norm.py` | Custom impl with learnable scale/shift parameters |
| Transformer Block | `gpt/transformer_block/transformer_block.py` | Pre-norm architecture with residual connections |
| Full GPT Model | `gpt/gpt.py` | Stacks 12 transformer blocks; classification head replaces language model head |

GPT-2 Small (124M) pre-trained weights are then loaded into this custom architecture for fine-tuning.
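To make the attention component concrete, here is a minimal sketch of causal multi-head self-attention at the 124M scale (emb_dim=768, n_heads=12, qkv_bias=True). This is an illustrative implementation of the standard mechanism, not the repo's exact class or API.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch of causal multi-head self-attention; names are illustrative."""

    def __init__(self, emb_dim, n_heads, context_length, drop_rate=0.1, qkv_bias=True):
        super().__init__()
        assert emb_dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = emb_dim // n_heads
        self.qkv = nn.Linear(emb_dim, 3 * emb_dim, bias=qkv_bias)
        self.proj = nn.Linear(emb_dim, emb_dim)
        self.dropout = nn.Dropout(drop_rate)
        # Upper-triangular mask: each token may only attend to itself and the past
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1).bool(),
        )

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim)
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention with causal masking
        att = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        att = att.masked_fill(self.mask[:t, :t], float("-inf"))
        att = self.dropout(torch.softmax(att, dim=-1))
        out = (att @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(out)

x = torch.randn(2, 16, 768)
attn = MultiHeadAttention(768, 12, context_length=1024)
print(attn(x).shape)  # torch.Size([2, 16, 768])
```

The causal mask is what makes this a GPT-style (decoder-only) block: position *i* never sees positions > *i*, so zeroing out a later token cannot change earlier outputs.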


Model Architecture

```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # GPT-2 BPE tokenizer
    "context_length": 1024,  # must match GPT-2's pretrained positional embedding table [1024, 768]
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": True,
    "num_classes": 2         # cyberbullying / not cyberbullying
}
```

Note: sequences are truncated to 512 tokens before being fed to the model (tweets are short; this halves memory use during training). The architecture is still initialized with context_length=1024 because the pretrained GPT-2 positional embedding table has shape [1024, 768] — loading weights into a smaller table causes a shape mismatch.
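A small sketch of why this works: the positional embedding table keeps its pretrained [1024, 768] shape, but truncated inputs simply index only its first 512 rows. The variable names here are illustrative, not the repo's.

```python
import torch

MAX_LEN = 512          # truncation length used at training time
CONTEXT_LENGTH = 1024  # must stay 1024 so pretrained weights load without a shape mismatch

pos_emb = torch.nn.Embedding(CONTEXT_LENGTH, 768)  # full GPT-2-sized table

token_ids = torch.randint(0, 50257, (1, 900))  # hypothetical over-length input
token_ids = token_ids[:, :MAX_LEN]             # truncate before the model sees it
positions = torch.arange(token_ids.shape[1])
x = pos_emb(positions)  # only the first 512 rows of the table are touched
print(x.shape)  # torch.Size([512, 768])
```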

For classification, the last token's hidden state is replaced with mean pooling over the sequence (masking out padding tokens), and the language model head is replaced with a linear layer mapping 768 → 2 classes.
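The masked mean pooling described above can be sketched as follows; the helper name and the attention-mask convention (1 = real token, 0 = padding) are assumptions, not the repo's exact code.

```python
import torch

def masked_mean_pool(hidden, attention_mask):
    """Mean-pool hidden states over real tokens only.
    hidden: (batch, seq, dim); attention_mask: (batch, seq) of 1s/0s."""
    mask = attention_mask.unsqueeze(-1).float()  # (batch, seq, 1)
    summed = (hidden * mask).sum(dim=1)          # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)      # avoid division by zero
    return summed / counts                       # (batch, dim)

hidden = torch.randn(2, 5, 768)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)          # (2, 768)
logits = torch.nn.Linear(768, 2)(pooled)         # classification head: 768 -> 2
```

Dividing by the per-example token count (rather than the padded length) keeps short tweets from being diluted by padding positions.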

Fine-tuning strategy: freeze all layers except the last transformer block and final layer norm.
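A freeze pattern of that shape might look like the sketch below; the attribute names (`trf_blocks`, `final_norm`, `out_head`) are assumptions about the GPT class, shown against a tiny stand-in model.

```python
import torch.nn as nn

def freeze_for_finetuning(model):
    """Freeze all weights, then unfreeze the last transformer block, the final
    layer norm, and the classification head. Attribute names are assumed."""
    for p in model.parameters():
        p.requires_grad = False
    for module in (model.trf_blocks[-1], model.final_norm, model.out_head):
        for p in module.parameters():
            p.requires_grad = True

# Tiny stand-in with the same attribute layout, for demonstration only
class ToyGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(100, 8)
        self.trf_blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])
        self.final_norm = nn.LayerNorm(8)
        self.out_head = nn.Linear(8, 2)

model = ToyGPT()
freeze_for_finetuning(model)
```

Freezing the embeddings and the first 11 blocks preserves the pretrained representations and shrinks the optimizer's working set, which is what keeps fine-tuning cheap.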


Results and Limitations

| Epoch | Val Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| 1 | 90.94% | 89.07% | 81.09% | 84.90% |
| 2 | 92.19% | 91.26% | 83.08% | 86.98% |
| 3 | 92.81% | 91.89% | 84.58% | 88.08% |
| 4 | 92.66% | 90.53% | 85.57% | 87.98% |
| 5 | 92.66% | 90.96% | 85.07% | 87.92% |

Best checkpoint: epoch 5 (lowest val loss: 0.1767). Total training time: ~165 minutes on an A100.

The model plateaus around epoch 3-5 — accuracy and F1 stop meaningfully improving despite val loss continuing to tick down. This is a known ceiling for this architecture and dataset combination. The primary failure mode is recall on subtle or context-dependent cyberbullying that isn't well-represented across the training sources. A larger dataset or unfreezing more transformer layers would be the main levers for improvement.


Training

```
Epoch 1: val loss 0.2154 | val acc 90.94% | F1 84.90%
Epoch 2: val loss 0.1928 | val acc 92.19% | F1 86.98%
Epoch 3: val loss 0.1822 | val acc 92.81% | F1 88.08%
Epoch 4: val loss 0.1778 | val acc 92.66% | F1 87.98%
Epoch 5: val loss 0.1767 | val acc 92.66% | F1 87.92%
```

Hyperparameters: AdamW, lr=1e-5, weight decay=0.01, batch size=64, cosine LR annealing, gradient clipping=1.0, early stopping patience=2.
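The optimizer setup above can be wired together as in this sketch; the stand-in model and the assumption that the cosine schedule steps once per epoch are illustrative, not taken from the repo.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(768, 2)  # stand-in for the GPT classifier
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=5)  # anneal over 5 epochs

for epoch in range(5):
    # In the real loop this runs per batch (batch size 64); one dummy batch here
    loss = model(torch.randn(64, 768)).pow(2).mean()
    loss.backward()
    # Clip gradients to max L2 norm 1.0 before the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # cosine decay of the learning rate
```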


Getting the Trained Model

final_trained_model.pt is not tracked in git. You can reproduce it by training from scratch:

1. Download the datasets

The model is trained on four Kaggle datasets combined (~568k examples). Place each in its own subfolder under ./data/:

Folder Dataset
data/group_1/ Cyberbullying Classification
data/group_2/ Cyberbullying Dataset
data/group_3/ Cyber Bullying Data for Multi-Label Classification
data/group_4/ Cyberbullying Tweets

2. Prepare the data

```shell
# place the Kaggle CSVs in ./data/, then:
python prepare_data.py
```

This combines all CSVs (~568k examples total), pulls the Text and oh_label columns from each, and produces a stratified 90/10 train/validation split under cyberbullying_detector/data/.
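A stratified 90/10 split of that shape can be sketched with scikit-learn; the Text/oh_label column names come from the description above, but the toy data, `random_state`, and use of `train_test_split` are assumptions about the script.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the combined Kaggle CSVs (real data is ~568k rows)
df = pd.DataFrame({
    "Text": ["you did great today", "nobody likes you, loser"] * 50,
    "oh_label": [0, 1] * 50,  # 0 = not cyberbullying, 1 = cyberbullying
})

# stratify= keeps the class ratio identical in both splits
train_df, val_df = train_test_split(
    df, test_size=0.10, stratify=df["oh_label"], random_state=42
)
```

Stratifying matters here because cyberbullying is the minority class; a plain random split could leave the validation set with a skewed positive rate and noisy recall numbers.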

3. Train

```python
# in a Python shell or script
from main import train
train()
```

Training takes ~165 minutes on an A100 (5 epochs, ~568k examples). The best checkpoint is saved automatically to final_trained_model.pt.


Running Locally

```shell
pip install -r requirements.txt
python main.py  # interactive inference loop
```

final_trained_model.pt must be present in the project root.


Deep Dive

See case_study.md for a detailed walkthrough of architecture decisions, the transfer learning strategy, data pipeline, and training analysis.


References

  • Raschka, S. — Build a Large Language Model from Scratch
  • OpenAI GPT-2 pre-trained weights (124M)
  • Kaggle cyberbullying datasets (four sources; see "Getting the Trained Model")

About

The Better Threads Project: a cyberbullying text classifier built by implementing a GPT-style transformer from scratch in PyTorch. Adapted from "Build a Large Language Model from Scratch" by Sebastian Raschka.
