Fine-tuned language models for product title classification using the Large-Scale Product Corpus (LSPC) V2020 dataset. The project focuses on 6-class category classification using LoRA fine-tuning of Qwen embedding models.
web-product-data/
|-- category_classification/ # Product title classification components
| |-- build_lspc_dataset.py # Dataset preparation script
| |-- build_lspc_dataset_full.py # Dataset with extra full split
| |-- train_qwen3_lora.py # Main training script
| |-- train_qwen3_lora_v2.py # Alternative training script
| |-- train_embedding_classifier.py # Baseline PyTorch classifier trainer
| |-- train_embedding_classifier_v2.py # Trainer w/ residual head + extra losses
| |-- embedding_classifier.py # Baseline embedding encoder
| |-- embedding_classifier_v2.py # Residual-head encoder variant
| |-- results_category.txt # Training results and metrics
| `-- EXPERIMENT_LOG.md # Chronological training notes
|-- docs/
| `-- embedding_classifier_architecture.md # Detailed model description
|-- webdata_discovery.py # Data exploration tool
`-- pyproject.toml # Dependencies
The task is to classify product titles into 6 main categories, based on web product data from Common Crawl:
- Automotive - Car parts, accessories, tools
- Baby - Baby products, toys, care items
- Books - All types of books and publications
- Clothing - Apparel, fashion items
- Jewelry - Jewelry, watches, accessories
- Shoes - Footwear of all types
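For illustration, a label mapping consistent with this category set might look like the sketch below. The alphabetical ordering is an assumption; the canonical id mapping is whatever the dataset builder writes out.

```python
# Hypothetical label <-> id mapping for the 6 LSPC categories.
# Order is an assumption, not taken from build_lspc_dataset.py.
CATEGORIES = ["Automotive", "Baby", "Books", "Clothing", "Jewelry", "Shoes"]
LABEL2ID = {name: i for i, name in enumerate(CATEGORIES)}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}
```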
- Prepare Dataset (train/val/test + full):
python category_classification/build_lspc_dataset.py --zip lspcV2020.zip --out lspc_dataset
# Or include a `full` split:
python category_classification/build_lspc_dataset_full.py --zip lspcV2020.zip --out lspc_dataset_full
# Append --add_language (with optional --language_num_proc and --read_workers) to annotate records with detected language codes and speed up scanning.
- Train Model:
python category_classification/train_qwen3_lora.py --data ./lspc_dataset --out ./qwen3_lora_prodcat --batch 128
- Train Pure PyTorch Embedding Classifier (optional: push batch size/grad_accum to fill VRAM):
python category_classification/train_embedding_classifier.py \
--data ./lspc_dataset \
--out ./embedding_classifier_prodcat
Current best (EmbeddingClassifier baseline, full dataset):
- Test macro F1: 0.9315
- Test micro F1 / accuracy: 0.9487
- Per-language (test macro F1): en 0.946, de 0.753, fr 0.809, es 0.783, ja 0.830
Notable config for the best run:
- Script: category_classification/train_embedding_classifier.py
- Checkpoint: embedding_classifier_prodcat/checkpoint-epoch5.pt
- Batch size 254, grad_accum 2, cosine LR (warmup 3000, min_scale 0.05), AMP enabled
Historical LoRA (Qwen) reference:
- Qwen LoRA (r=16, alpha=32, LR 5e-5, 1 epoch): test macro F1 0.8360, accuracy 0.8791.
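For reference, that historical LoRA setup corresponds roughly to a peft configuration like the following sketch; `target_modules` and the dropout value are assumptions (common choices for Qwen-style attention), not read from the training script.

```python
from peft import LoraConfig

# Sketch of the historical run's adapter config. target_modules and
# lora_dropout are assumptions, not taken from train_qwen3_lora.py.
lora_config = LoraConfig(
    r=16,                 # rank from the reference run
    lora_alpha=32,        # alpha from the reference run
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,    # assumed; not reported above
    task_type="SEQ_CLS",
)
```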
category_classification/embedding_classifier.py contains a compact, encoder-only
(bidirectional self-attention) architecture implemented directly with PyTorch,
so tokens can attend left/right (no causal mask). Instantiate it with the
tokenizer's vocab size and your target label count:
from category_classification.embedding_classifier import (
EmbeddingClassifierConfig,
EmbeddingClassifier,
)
config = EmbeddingClassifierConfig(
vocab_size=len(tokenizer), # always match the tokenizer vocabulary size
num_labels=6,
)
model = EmbeddingClassifier(config)
outputs = model(**batch, return_embeddings=True)
pooled = outputs["pooled_embeddings"]
tokens = outputs["token_embeddings"]

Always determine vocab_size from the tokenizer you plan to use (len(tokenizer)
for HuggingFace tokenizers) so the embedding matrix and token IDs stay aligned.
The forward pass accepts input_ids, attention_mask, and optional labels,
returning a dictionary with logits (and loss during training). Set
return_embeddings=True to access both CLS-style pooled embeddings and the
final per-token representations. You can plug the module into a custom PyTorch
training loop or wrap it with your preferred trainer/optimizer stack.
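A minimal custom loop following that contract might look like the sketch below. It uses a stand-in module with the same inputs/outputs (dict with "logits", plus "loss" when labels are passed), since the real EmbeddingClassifier lives in this repo's package; swap it in for the stand-in when running here.

```python
import torch

# Stand-in for EmbeddingClassifier: same forward contract
# (input_ids, attention_mask, optional labels -> dict with logits/loss).
class TinyClassifier(torch.nn.Module):
    def __init__(self, vocab_size=100, num_labels=6, dim=16):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        mask = attention_mask.unsqueeze(-1).float()
        # Mean-pool token embeddings over the attended positions
        pooled = (self.embed(input_ids) * mask).sum(1) / mask.sum(1).clamp(min=1)
        logits = self.head(pooled)
        out = {"logits": logits}
        if labels is not None:
            out["loss"] = torch.nn.functional.cross_entropy(logits, labels)
        return out

model = TinyClassifier()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
batch = {
    "input_ids": torch.randint(0, 100, (4, 8)),
    "attention_mask": torch.ones(4, 8, dtype=torch.long),
    "labels": torch.randint(0, 6, (4,)),
}
out = model(**batch)      # one training step
out["loss"].backward()
opt.step()
opt.zero_grad()
```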
category_classification/train_embedding_classifier.py reproduces the category
classification experiment end-to-end using the pure PyTorch model. It loads the
LSPC dataset from disk, tokenizes with the Qwen tokenizer, and optimizes the
EmbeddingClassifier using AdamW. The default configuration (10 layers, 384-dim
hidden size, 6/2 attention heads, 1536-dim feed-forward) contains roughly
80 million parameters (it prints the exact count on startup).
python category_classification/train_embedding_classifier.py \
--data ./lspc_dataset \
--out ./embedding_classifier_prodcat \
--epochs 3 --batch_size 64 --grad_accum 2 --save_every 1 --amp --memory_summary_freq 5000

Metrics for validation/test are written to metrics.json, and the trained
weights + tokenizer are saved under the output directory. Every --save_every
epochs (default: 1) a checkpoint-epochX.pt snapshot is written, so you can
resume later with --resume path/to/checkpoint-epochX.pt. Language metrics are
enabled by default (requires ftlangdetect or langdetect); use
--no_lang_metrics to disable them and --lang_min_samples / --lang_top_k to
control which languages are reported. Add --amp to enable CUDA mixed precision
(AMP) for larger batches and faster training. Use --grad_accum to keep an
effective batch size near 1-2k tokens without running out of memory, and
optionally set --memory_summary_freq for periodic allocator stats.
embedding_classifier_v2.py reuses the backbone but swaps the classifier head
for a residual bottleneck block with RMSNorm and configurable dropout. The
paired trainer, train_embedding_classifier_v2.py, layers in label smoothing
and focal loss to further improve macro-F1 without touching the baseline script.
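The label smoothing and focal down-weighting the v2 trainer layers in can be sketched roughly as follows; this is a minimal reimplementation for illustration, and the trainer's exact loss code may differ.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=1.5, label_smoothing=0.05):
    """Cross entropy with label smoothing, then focal down-weighting.

    Easy examples (high target probability) contribute less as gamma grows;
    gamma=0 with no smoothing recovers plain cross entropy.
    """
    ce = F.cross_entropy(logits, labels,
                         label_smoothing=label_smoothing, reduction="none")
    pt = torch.exp(-ce)  # model's probability of the (smoothed) target
    return ((1.0 - pt) ** gamma * ce).mean()
```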
python category_classification/train_embedding_classifier_v2.py \
--data ./lspc_dataset_full \
--out ./embedding_classifier_prodcat_v2 \
--epochs 5 --batch_size 128 --grad_accum 4 \
--lr_schedule cosine --lr_min_scale 0.05 \
--label_smoothing 0.05 --focal_gamma 1.5 --amp

Language metrics, checkpointing, and the cosine/exponential LR schedules behave
exactly like the baseline trainer. Refer to EXPERIMENT_LOG.md for the latest
run commands, thermal/cooling notes, and planned architectural tweaks.
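The cosine schedule with warmup and a minimum LR scale (the best run used warmup 3000, min_scale 0.05) can be sketched like this; linear warmup is an assumption, and the trainers' exact schedule may differ in details.

```python
import math

def cosine_lr(step, total_steps, base_lr=3e-4, warmup=3000, min_scale=0.05):
    # Linear warmup (assumed), then cosine decay from base_lr down to
    # min_scale * base_lr at the final step.
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return base_lr * (min_scale + (1 - min_scale) * cosine)
```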
If you want to train on all LSPC examples while still keeping clean
evaluation splits, build the dataset with
category_classification/build_lspc_dataset_full.py. It behaves like the
original builder but adds a "full" split (configurable via
--full_split_name). You can point your trainer at train/validation/test
for benchmarking and use the full split for large-scale pretraining or
distillation runs.
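One way to wire that into a script is a small split selector; this is a hypothetical helper that simply mirrors the builders' default split names.

```python
# Hypothetical helper: choose the split to train on by run purpose.
# "full" matches build_lspc_dataset_full.py's default --full_split_name.
def pick_train_split(purpose: str, full_split_name: str = "full") -> str:
    # Benchmarking keeps the clean train/validation/test protocol;
    # pretraining/distillation trains on every example via the extra split.
    return "train" if purpose == "benchmark" else full_split_name
```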
This project uses Poetry for dependency management. Poetry automatically handles virtual environments and ensures reproducible builds.
- Install Poetry (if not already installed):
# On Windows (PowerShell)
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -
# On macOS/Linux
curl -sSL https://install.python-poetry.org | python3 -
- Install dependencies:
poetry install
- Activate the virtual environment:
poetry shell

If you prefer using pip directly:
pip install datasets transformers peft scikit-learn torch torchvision torchaudio accelerate bitsandbytes tqdm numpy
- lspcV2020.zip: Download from Web Data Commons
- PDC2020_map.tsv: Auto-generated during dataset preparation
- GPU: 8GB+ VRAM recommended
- RAM: 16GB+ system memory
- Storage: 50GB+ free space