A cyberbullying text classifier built by implementing a GPT-style transformer from scratch in PyTorch, then fine-tuning on a domain-specific dataset. The primary goal is demonstrating hands-on understanding of transformer internals: attention mechanisms, token/positional embeddings, layer normalization, and feed-forward blocks.
Based on Sebastian Raschka's *Build a Large Language Model from Scratch*, adapted for binary classification.
The full transformer architecture lives in `gpt/` and was implemented without using Hugging Face or any high-level transformer library:
| Component | File | Description |
|---|---|---|
| Token + Positional Embeddings | `gpt/gpt.py` | Maps token IDs to dense vectors; adds learned positional encodings |
| Multi-Head Self-Attention | `gpt/transformer_block/multi_head_attention.py` | Scaled dot-product attention with causal masking, 12 heads |
| Feed-Forward Network | `gpt/transformer_block/feed_forward.py` | Two-layer MLP (768 → 3072 → 768) with GELU activation |
| Layer Normalization | `gpt/transformer_block/layer_norm.py` | Custom implementation with learnable scale/shift parameters |
| Transformer Block | `gpt/transformer_block/transformer_block.py` | Pre-norm architecture with residual connections |
| Full GPT Model | `gpt/gpt.py` | Stacks 12 transformer blocks; classification head replaces the language model head |
GPT-2 Small (124M) pre-trained weights are then loaded into this custom architecture for fine-tuning.
```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # GPT-2 BPE tokenizer
    "context_length": 1024,  # must match GPT-2's pretrained positional embedding table [1024, 768]
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": True,
    "num_classes": 2         # cyberbullying / not cyberbullying
}
```
Note: sequences are truncated to 512 tokens before being fed to the model (tweets are short; this halves memory use during training). The architecture is still initialized with context_length=1024 because the pretrained GPT-2 positional embedding table has shape [1024, 768] — loading weights into a smaller table causes a shape mismatch.
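The truncation trick works because learned positional embeddings are just a lookup table: the forward pass only indexes the first `seq_len` rows, so the full [1024, 768] table can be loaded even though inputs never exceed 512 tokens. A minimal sketch (variable names illustrative):

```python
import torch
import torch.nn as nn

# Embedding setup mirroring GPT-2 small
tok_emb = nn.Embedding(50257, 768)
pos_emb = nn.Embedding(1024, 768)   # full pretrained table, shape [1024, 768]

token_ids = torch.randint(0, 50257, (4, 512))  # batch truncated to 512 tokens
seq_len = token_ids.size(1)
positions = torch.arange(seq_len)              # only rows 0..511 are ever indexed
x = tok_emb(token_ids) + pos_emb(positions)    # [4, 512, 768]
```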
For classification, the last token's hidden state is replaced with mean pooling over the sequence (masking out padding tokens), and the language model head is replaced with a linear layer mapping 768 → 2 classes.
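The masked mean pooling described above can be sketched like this; the function name and head wiring are illustrative, not necessarily the repo's:

```python
import torch
import torch.nn as nn

def masked_mean_pool(hidden, attention_mask):
    """Average hidden states over non-padding positions.
    hidden: [B, T, D]; attention_mask: [B, T] with 1 = real token, 0 = padding."""
    mask = attention_mask.unsqueeze(-1).float()   # [B, T, 1]
    summed = (hidden * mask).sum(dim=1)           # [B, D], padding zeroed out
    counts = mask.sum(dim=1).clamp(min=1.0)       # avoid division by zero
    return summed / counts

# linear head replacing the language model head: 768 -> 2 classes
head = nn.Linear(768, 2)
logits = head(masked_mean_pool(torch.randn(2, 5, 768), torch.ones(2, 5)))
```

Clamping the token counts guards against an all-padding row producing a divide-by-zero.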
Fine-tuning strategy: freeze all layers except the last transformer block and final layer norm.
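The freezing strategy can be sketched as follows. The attribute names (`trf_blocks`, `final_norm`, `out_head`) and the tiny stand-in model are assumptions for illustration; the newly initialized classification head must stay trainable as well:

```python
import torch.nn as nn

# Illustrative model skeleton -- attribute names are assumptions, not the repo's.
class TinyGPTClassifier(nn.Module):
    def __init__(self, emb_dim=768, n_layers=12, num_classes=2):
        super().__init__()
        self.trf_blocks = nn.ModuleList(
            nn.Linear(emb_dim, emb_dim) for _ in range(n_layers)  # stand-in blocks
        )
        self.final_norm = nn.LayerNorm(emb_dim)
        self.out_head = nn.Linear(emb_dim, num_classes)

model = TinyGPTClassifier()

# Freeze everything, then unfreeze the last block, final norm, and head.
for p in model.parameters():
    p.requires_grad = False
for module in (model.trf_blocks[-1], model.final_norm, model.out_head):
    for p in module.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Freezing the lower blocks keeps the pretrained GPT-2 representations intact and cuts the trainable parameter count (and memory for optimizer state) dramatically.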
| Epoch | Val Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | 90.94% | 89.07% | 81.09% | 84.90% |
| 2 | 92.19% | 91.26% | 83.08% | 86.98% |
| 3 | 92.81% | 91.89% | 84.58% | 88.08% |
| 4 | 92.66% | 90.53% | 85.57% | 87.98% |
| 5 | 92.66% | 90.96% | 85.07% | 87.92% |
Best checkpoint: epoch 5 (lowest val loss: 0.1767). Total training time: ~165 minutes on an A100.
The model plateaus around epoch 3-5 — accuracy and F1 stop meaningfully improving despite val loss continuing to tick down. This is a known ceiling for this architecture and dataset combination. The primary failure mode is recall on subtle or context-dependent cyberbullying that isn't well-represented across the training sources. A larger dataset or unfreezing more transformer layers would be the main levers for improvement.
```
Epoch 1: val loss 0.2154 | val acc 90.94% | F1 84.90%
Epoch 2: val loss 0.1928 | val acc 92.19% | F1 86.98%
Epoch 3: val loss 0.1822 | val acc 92.81% | F1 88.08%
Epoch 4: val loss 0.1778 | val acc 92.66% | F1 87.98%
Epoch 5: val loss 0.1767 | val acc 92.66% | F1 87.92%
```
Hyperparameters: AdamW, lr=1e-5, weight decay=0.01, batch size=64, cosine LR annealing, gradient clipping=1.0, early stopping patience=2.
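The optimizer and scheduler setup matching those hyperparameters can be sketched as follows; the stand-in `model` and the single dummy step are illustrative, not the repo's training loop:

```python
import torch

model = torch.nn.Linear(768, 2)  # stand-in for the classifier
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # only unfrozen params
    lr=1e-5, weight_decay=0.01,
)
# cosine annealing over the 5 training epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)

# one illustrative training step
x, y = torch.randn(4, 768), torch.randint(0, 2, (4,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
optimizer.step()
optimizer.zero_grad()
scheduler.step()  # advance the cosine schedule once per epoch
```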
`final_trained_model.pt` is not tracked in git. You can reproduce it by training from scratch:
1. Download the datasets
The model is trained on four Kaggle datasets combined (~568k examples). Place each in its own subfolder under ./data/:
| Folder | Dataset |
|---|---|
| `data/group_1/` | Cyberbullying Classification |
| `data/group_2/` | Cyberbullying Dataset |
| `data/group_3/` | Cyber Bullying Data for Multi-Label Classification |
| `data/group_4/` | Cyberbullying Tweets |
2. Prepare the data
```shell
# place the Kaggle CSVs in ./data/, then:
python prepare_data.py
```

This combines all CSVs (~568k examples total), pulls the `Text` and `oh_label` columns from each, and produces a stratified 90/10 train/validation split under `cyberbullying_detector/data/`.
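The prepare step can be sketched with pandas and scikit-learn; the function name and paths are illustrative, and the `Text`/`oh_label` column names come from the description above:

```python
import glob
import pandas as pd
from sklearn.model_selection import train_test_split

def combine_and_split(csv_paths, test_size=0.10, seed=42):
    """Concatenate the Kaggle CSVs, keep the Text/oh_label columns,
    and return a stratified train/validation split."""
    frames = [pd.read_csv(p)[["Text", "oh_label"]] for p in csv_paths]
    df = pd.concat(frames, ignore_index=True).dropna()
    # stratify preserves the class balance in both splits
    return train_test_split(df, test_size=test_size,
                            stratify=df["oh_label"], random_state=seed)

# usage (paths illustrative):
# train_df, val_df = combine_and_split(glob.glob("data/group_*/*.csv"))
```

Stratifying matters here because cyberbullying examples are the minority class; a random split could skew the validation label distribution.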
3. Train
```python
# in a Python shell or script
from main import train
train()
```

Training takes ~165 minutes on an A100 (5 epochs, ~568k examples). The best checkpoint is saved automatically to `final_trained_model.pt`.
```shell
pip install -r requirements.txt
python main.py   # interactive inference loop
```

`final_trained_model.pt` must be present in the project root.
See case_study.md for a detailed walkthrough of architecture decisions, the transfer learning strategy, data pipeline, and training analysis.
- Raschka, S. — Build a Large Language Model from Scratch
- OpenAI GPT-2 pre-trained weights (124M)
- Kaggle Cyberbullying Detection Dataset