A cyberbullying text classifier built by implementing a GPT-style transformer from scratch in PyTorch, then fine-tuning on a domain-specific dataset. The primary goal is demonstrating hands-on understanding of transformer internals: attention mechanisms, token/positional embeddings, layer normalization, and feed-forward blocks.
Based on Sebastian Raschka's *Build a Large Language Model from Scratch*, adapted for binary classification.
The full transformer architecture lives in `gpt/` and was implemented without using Hugging Face or any high-level transformer library:
| Component | File | Description |
|---|---|---|
| Token + Positional Embeddings | `gpt/gpt.py` | Maps token IDs to dense vectors; adds learned positional encodings |
| Multi-Head Self-Attention | `gpt/transformer_block/multi_head_attention.py` | Scaled dot-product attention with causal masking, 12 heads |
| Feed-Forward Network | `gpt/transformer_block/feed_forward.py` | Two-layer MLP (768 → 3072 → 768) with GELU activation |
| Layer Normalization | `gpt/transformer_block/layer_norm.py` | Custom implementation with learnable scale/shift parameters |
| Transformer Block | `gpt/transformer_block/transformer_block.py` | Pre-norm architecture with residual connections |
| Full GPT Model | `gpt/gpt.py` | Stacks 12 transformer blocks; classification head replaces the language model head |
GPT-2 Small (124M) pre-trained weights are then loaded into this custom architecture for fine-tuning.
```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # GPT-2 BPE tokenizer
    "context_length": 1024,  # must match GPT-2's pretrained positional embedding table [1024, 768]
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": True,
    "num_classes": 2         # cyberbullying / not cyberbullying
}
```
Note: sequences are truncated to 512 tokens before being fed to the model (tweets are short; this halves memory use during training). The architecture is still initialized with context_length=1024 because the pretrained GPT-2 positional embedding table has shape [1024, 768] — loading weights into a smaller table causes a shape mismatch.
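The truncation trick works because learned positional embeddings are just a lookup table: the forward pass only indexes the first `seq_len` rows, so the full [1024, 768] table can be loaded even though inputs never exceed 512 tokens. A minimal sketch (variable names illustrative):

```python
import torch
import torch.nn as nn

# Embedding setup mirroring GPT-2 small
tok_emb = nn.Embedding(50257, 768)
pos_emb = nn.Embedding(1024, 768)   # full pretrained table, shape [1024, 768]

token_ids = torch.randint(0, 50257, (4, 512))  # batch truncated to 512 tokens
seq_len = token_ids.size(1)
positions = torch.arange(seq_len)              # only rows 0..511 are ever indexed
x = tok_emb(token_ids) + pos_emb(positions)    # [4, 512, 768]
```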
For classification, the last token's hidden state is replaced with mean pooling over the sequence (masking out padding tokens), and the language model head is replaced with a linear layer mapping 768 → 2 classes.
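The masked mean pooling described above can be sketched like this; the function name and head wiring are illustrative, not necessarily the repo's:

```python
import torch
import torch.nn as nn

def masked_mean_pool(hidden, attention_mask):
    """Average hidden states over non-padding positions.
    hidden: [B, T, D]; attention_mask: [B, T] with 1 = real token, 0 = padding."""
    mask = attention_mask.unsqueeze(-1).float()   # [B, T, 1]
    summed = (hidden * mask).sum(dim=1)           # [B, D], padding zeroed out
    counts = mask.sum(dim=1).clamp(min=1.0)       # avoid division by zero
    return summed / counts

# linear head replacing the language model head: 768 -> 2 classes
head = nn.Linear(768, 2)
logits = head(masked_mean_pool(torch.randn(2, 5, 768), torch.ones(2, 5)))
```

Clamping the token counts guards against an all-padding row producing a divide-by-zero.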
Fine-tuning strategy: freeze all layers except the last transformer block and final layer norm.
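The freezing strategy can be sketched as follows. The attribute names (`trf_blocks`, `final_norm`, `out_head`) and the tiny stand-in model are assumptions for illustration; the newly initialized classification head must stay trainable as well:

```python
import torch.nn as nn

# Illustrative model skeleton -- attribute names are assumptions, not the repo's.
class TinyGPTClassifier(nn.Module):
    def __init__(self, emb_dim=768, n_layers=12, num_classes=2):
        super().__init__()
        self.trf_blocks = nn.ModuleList(
            nn.Linear(emb_dim, emb_dim) for _ in range(n_layers)  # stand-in blocks
        )
        self.final_norm = nn.LayerNorm(emb_dim)
        self.out_head = nn.Linear(emb_dim, num_classes)

model = TinyGPTClassifier()

# Freeze everything, then unfreeze the last block, final norm, and head.
for p in model.parameters():
    p.requires_grad = False
for module in (model.trf_blocks[-1], model.final_norm, model.out_head):
    for p in module.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Freezing the lower blocks keeps the pretrained GPT-2 representations intact and cuts the trainable parameter count (and memory for optimizer state) dramatically.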
| Epoch | Val Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | 90.94% | 89.07% | 81.09% | 84.90% |
| 2 | 92.19% | 91.26% | 83.08% | 86.98% |
| 3 | 92.81% | 91.89% | 84.58% | 88.08% |
| 4 | 92.66% | 90.53% | 85.57% | 87.98% |
| 5 | 92.66% | 90.96% | 85.07% | 87.92% |
Best checkpoint: epoch 5 (lowest val loss: 0.1767). Total training time: ~165 minutes on an A100.
The model plateaus around epoch 3-5 — accuracy and F1 stop meaningfully improving despite val loss continuing to tick down. This is a known ceiling for this architecture and dataset combination. The primary failure mode is recall on subtle or context-dependent cyberbullying that isn't well-represented across the training sources. A larger dataset or unfreezing more transformer layers would be the main levers for improvement.
```
Epoch 1: val loss 0.2154 | val acc 90.94% | F1 84.90%
Epoch 2: val loss 0.1928 | val acc 92.19% | F1 86.98%
Epoch 3: val loss 0.1822 | val acc 92.81% | F1 88.08%
Epoch 4: val loss 0.1778 | val acc 92.66% | F1 87.98%
Epoch 5: val loss 0.1767 | val acc 92.66% | F1 87.92%
```
Hyperparameters: AdamW, lr=1e-5, weight decay=0.01, batch size=64, cosine LR annealing, gradient clipping=1.0, early stopping patience=2.
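The optimizer and scheduler setup matching those hyperparameters can be sketched as follows; the stand-in `model` and the single dummy step are illustrative, not the repo's training loop:

```python
import torch

model = torch.nn.Linear(768, 2)  # stand-in for the classifier
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # only unfrozen params
    lr=1e-5, weight_decay=0.01,
)
# cosine annealing over the 5 training epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)

# one illustrative training step
x, y = torch.randn(4, 768), torch.randint(0, 2, (4,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
optimizer.step()
optimizer.zero_grad()
scheduler.step()  # advance the cosine schedule once per epoch
```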
`final_trained_model.pt` is not tracked in git. You can reproduce it by training from scratch:
1. Download the datasets
The model is trained on four Kaggle datasets combined (~568k examples). Place each in its own subfolder under ./data/:
| Folder | Dataset |
|---|---|
| `data/group_1/` | Cyberbullying Classification |
| `data/group_2/` | Cyberbullying Dataset |
| `data/group_3/` | Cyber Bullying Data for Multi-Label Classification |
| `data/group_4/` | Cyberbullying Tweets |
2. Prepare the data
```shell
# place the Kaggle CSVs in ./data/, then:
python prepare_data.py
```

This combines all CSVs (~568k examples total), pulls the `Text` and `oh_label` columns from each, and produces a stratified 90/10 train/validation split under `cyberbullying_detector/data/`.
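The prepare step can be sketched with pandas and scikit-learn; the function name and paths are illustrative, and the `Text`/`oh_label` column names come from the description above:

```python
import glob
import pandas as pd
from sklearn.model_selection import train_test_split

def combine_and_split(csv_paths, test_size=0.10, seed=42):
    """Concatenate the Kaggle CSVs, keep the Text/oh_label columns,
    and return a stratified train/validation split."""
    frames = [pd.read_csv(p)[["Text", "oh_label"]] for p in csv_paths]
    df = pd.concat(frames, ignore_index=True).dropna()
    # stratify preserves the class balance in both splits
    return train_test_split(df, test_size=test_size,
                            stratify=df["oh_label"], random_state=seed)

# usage (paths illustrative):
# train_df, val_df = combine_and_split(glob.glob("data/group_*/*.csv"))
```

Stratifying matters here because cyberbullying examples are the minority class; a random split could skew the validation label distribution.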
3. Train
```python
# in a Python shell or script
from main import train
train()
```

Training takes ~165 minutes on an A100 (5 epochs, ~568k examples). The best checkpoint is saved automatically to `final_trained_model.pt`.
```shell
pip install -r requirements.txt
python main.py   # interactive inference loop
```

`final_trained_model.pt` must be present in the project root.
See case_study.md for a detailed walkthrough of architecture decisions, the transfer learning strategy, data pipeline, and training analysis.
- Raschka, S. — Build a Large Language Model from Scratch
- OpenAI GPT-2 pre-trained weights (124M)
- Kaggle Cyberbullying Detection Dataset