ivan-digital/web-product-data

Product Title Classification based on Web Product Common Crawl subset

Fine-tuned language models for product title classification using the Large-Scale Product Corpus (LSPC) V2020 dataset. The project focuses on 6-class category classification using LoRA fine-tuning of Qwen embedding models.

Project Structure

web-product-data/
|-- category_classification/             # Product title classification components
|   |-- build_lspc_dataset.py            # Dataset preparation script
|   |-- build_lspc_dataset_full.py       # Dataset with extra full split
|   |-- train_qwen3_lora.py              # Main training script
|   |-- train_qwen3_lora_v2.py           # Alternative training script
|   |-- train_embedding_classifier.py    # Baseline PyTorch classifier trainer
|   |-- train_embedding_classifier_v2.py # Trainer w/ residual head + extra losses
|   |-- embedding_classifier.py          # Baseline embedding encoder
|   |-- embedding_classifier_v2.py       # Residual-head encoder variant
|   |-- results_category.txt             # Training results and metrics
|   `-- EXPERIMENT_LOG.md                # Chronological training notes
|-- docs/
|   `-- embedding_classifier_architecture.md  # Detailed model description
|-- webdata_discovery.py                 # Data exploration tool
`-- pyproject.toml                       # Dependencies

Product Title Classification

Overview

The task is to classify product titles into 6 main categories, based on web product data from Common Crawl:

  1. Automotive - Car parts, accessories, tools
  2. Baby - Baby products, toys, care items
  3. Books - All types of books and publications
  4. Clothing - Apparel, fashion items
  5. Jewelry - Jewelry, watches, accessories
  6. Shoes - Footwear of all types
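For reference, the categories in the order listed above can be held in a plain label-to-id mapping. This is a sketch: the authoritative mapping is whatever build_lspc_dataset.py emits.

```python
# Hypothetical label-to-id mapping in the order listed above; the
# authoritative mapping is produced by the dataset builder.
LABELS = ["Automotive", "Baby", "Books", "Clothing", "Jewelry", "Shoes"]
LABEL2ID = {name: i for i, name in enumerate(LABELS)}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}
```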

Quick Start

  1. Prepare the dataset (train/val/test splits):
python category_classification/build_lspc_dataset.py --zip lspcV2020.zip --out lspc_dataset
# Or include a `full` split:
python category_classification/build_lspc_dataset_full.py --zip lspcV2020.zip --out lspc_dataset_full
# Append --add_language (with optional --language_num_proc and --read_workers) to annotate records with detected language codes and speed up scanning.
  2. Train the LoRA model:
python category_classification/train_qwen3_lora.py --data ./lspc_dataset --out ./qwen3_lora_prodcat --batch 128
  3. Train the pure PyTorch embedding classifier (optionally raise batch size/grad_accum to fill VRAM):
python category_classification/train_embedding_classifier.py \
    --data ./lspc_dataset \
    --out  ./embedding_classifier_prodcat

Best Results

Current best (EmbeddingClassifier baseline, full dataset):

  • Test macro F1: 0.9315
  • Test micro F1 / accuracy: 0.9487
  • Per-language (test macro F1): en 0.946, de 0.753, fr 0.809, es 0.783, ja 0.830

More details in the blog post.

Notable config for the best run:

  • Script: category_classification/train_embedding_classifier.py
  • Checkpoint: embedding_classifier_prodcat/checkpoint-epoch5.pt
  • Batch size 254, grad_accum 2, cosine LR (warmup 3000, min_scale 0.05), AMP enabled

Historical LoRA (Qwen) reference:

  • Qwen LoRA (r=16, alpha=32, LR 5e-5, 1 epoch): test macro F1 0.8360, accuracy 0.8791.
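The macro and micro F1 scores reported above can be computed with scikit-learn; this sketch uses toy predictions over the 6 categories (hypothetical data, not the real test set).

```python
from sklearn.metrics import f1_score

# Toy predictions over the 6 categories (hypothetical data, not the real test set).
y_true = [0, 1, 2, 3, 4, 5, 5, 2]
y_pred = [0, 1, 2, 3, 4, 5, 2, 2]
macro = f1_score(y_true, y_pred, average="macro")  # averages per-class F1 equally
micro = f1_score(y_true, y_pred, average="micro")  # equals accuracy for single-label tasks
```

Macro F1 weights every class equally, which is why it sits below micro F1/accuracy on class-imbalanced test sets like this one.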

Pure PyTorch Embedding Classifier

category_classification/embedding_classifier.py contains a compact, encoder-only (bidirectional self-attention) architecture implemented directly in PyTorch, so tokens can attend both left and right (no causal mask). Instantiate it with the tokenizer's vocab size and your target label count:

from category_classification.embedding_classifier import (
    EmbeddingClassifierConfig,
    EmbeddingClassifier,
)

config = EmbeddingClassifierConfig(
    vocab_size=len(tokenizer),  # always match the tokenizer vocabulary size
    num_labels=6,
)
model = EmbeddingClassifier(config)
outputs = model(**batch, return_embeddings=True)
pooled = outputs["pooled_embeddings"]
tokens = outputs["token_embeddings"]

Always determine vocab_size from the tokenizer you plan to use (len(tokenizer) for HuggingFace tokenizers) so the embedding matrix and token IDs stay aligned.

The forward pass accepts input_ids, attention_mask, and optional labels, returning a dictionary with logits (and loss during training). Set return_embeddings=True to access both CLS-style pooled embeddings and the final per-token representations. You can plug the module into a custom PyTorch training loop or wrap it with your preferred trainer/optimizer stack.
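A custom training step against this interface looks like the following sketch. It uses a tiny stand-in module that mimics the forward signature described above (hypothetical dimensions; the real model lives in category_classification/embedding_classifier.py).

```python
import torch
from torch import nn

# Stand-in module mimicking the EmbeddingClassifier forward signature
# (input_ids, attention_mask, optional labels -> dict with logits/loss).
class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=100, hidden=32, num_labels=6):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        h = self.emb(input_ids)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (h * mask).sum(1) / mask.sum(1).clamp(min=1)  # mean-pool over tokens
        logits = self.head(pooled)
        out = {"logits": logits}
        if labels is not None:
            out["loss"] = nn.functional.cross_entropy(logits, labels)
        return out

model = TinyClassifier()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

batch = {
    "input_ids": torch.randint(0, 100, (4, 8)),
    "attention_mask": torch.ones(4, 8, dtype=torch.long),
    "labels": torch.randint(0, 6, (4,)),
}
out = model(**batch)   # dict with "logits" and, since labels are given, "loss"
out["loss"].backward()
opt.step()
opt.zero_grad()
```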

Training the Embedding Classifier

category_classification/train_embedding_classifier.py reproduces the category classification experiment end-to-end using the pure PyTorch model. It loads the LSPC dataset from disk, tokenizes with the Qwen tokenizer, and optimizes the EmbeddingClassifier using AdamW. The default configuration (10 layers, 384-dim hidden size, 6/2 attention heads, 1536-dim feed-forward) contains roughly 80 million parameters (it prints the exact count on startup).

python category_classification/train_embedding_classifier.py \
    --data ./lspc_dataset \
    --out  ./embedding_classifier_prodcat \
    --epochs 3 --batch_size 64 --grad_accum 2 --save_every 1 --amp --memory_summary_freq 5000

Metrics for validation/test are written to metrics.json, and the trained weights + tokenizer are saved under the output directory. Every --save_every epochs (default: 1) a checkpoint-epochX.pt snapshot is written, so you can resume later with --resume path/to/checkpoint-epochX.pt. Language metrics are enabled by default (requires ftlangdetect or langdetect); use --no_lang_metrics to disable them and --lang_min_samples / --lang_top_k to control which languages are reported. Add --amp to enable CUDA mixed precision (AMP) for larger batches and faster training. Use --grad_accum to keep an effective batch size near 1–2k tokens without running out of memory, and optionally set --memory_summary_freq for periodic allocator stats.
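The --grad_accum flag spreads each optimizer step across several micro-batches, so the effective batch size is batch_size × grad_accum. A minimal sketch of the pattern (hypothetical numbers, toy loss):

```python
import torch

# Effective batch size = batch_size * grad_accum (e.g. 64 * 2 = 128).
batch_size, grad_accum = 64, 2
param = torch.nn.Parameter(torch.zeros(3))
opt = torch.optim.AdamW([param], lr=1e-2)

for step in range(4):  # 4 micro-batches -> 2 optimizer steps
    loss = (param - torch.ones(3)).pow(2).mean()
    (loss / grad_accum).backward()  # scale so accumulated gradients average correctly
    if (step + 1) % grad_accum == 0:
        opt.step()
        opt.zero_grad()
```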

Enhanced Classifier Head (v2)

embedding_classifier_v2.py reuses the backbone but swaps the classifier head for a residual bottleneck block with RMSNorm and configurable dropout. The paired trainer, train_embedding_classifier_v2.py, layers in label smoothing and focal loss to further improve macro-F1 without touching the baseline script.
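The residual-bottleneck idea can be sketched as below. This is a rough illustration under assumed dimensions, not the exact v2 head; the real layer sizes and wiring live in embedding_classifier_v2.py.

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: no mean subtraction, learned scale."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class ResidualBottleneckHead(nn.Module):
    """Hypothetical head: RMSNorm -> down-project -> GELU -> up-project,
    added back to the pooled input, then a linear classifier."""
    def __init__(self, hidden=384, bottleneck=96, num_labels=6, dropout=0.1):
        super().__init__()
        self.norm = RMSNorm(hidden)
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(hidden, num_labels)

    def forward(self, pooled):
        h = self.down(self.norm(pooled))
        h = pooled + self.drop(self.up(nn.functional.gelu(h)))  # residual add
        return self.out(h)

head = ResidualBottleneckHead()
logits = head(torch.randn(4, 384))  # shape: (batch, num_labels)
```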

python category_classification/train_embedding_classifier_v2.py \
    --data ./lspc_dataset_full \
    --out ./embedding_classifier_prodcat_v2 \
    --epochs 5 --batch_size 128 --grad_accum 4 \
    --lr_schedule cosine --lr_min_scale 0.05 \
    --label_smoothing 0.05 --focal_gamma 1.5 --amp

Language metrics, checkpointing, and the cosine/exponential LR schedules behave exactly like the baseline trainer. Refer to EXPERIMENT_LOG.md for the latest run commands, thermal/cooling notes, and planned architectural tweaks.

Full Corpus Split

If you want to train on all LSPC examples while still keeping clean evaluation splits, build the dataset with category_classification/build_lspc_dataset_full.py. It behaves like the original builder but adds a "full" split (configurable via --full_split_name). You can point your trainer at train/validation/test for benchmarking and use the full split for large-scale pretraining or distillation runs.

Installation

Using Poetry (Recommended)

This project uses Poetry for dependency management. Poetry automatically handles virtual environments and ensures reproducible builds.

  1. Install Poetry (if not already installed):
# On Windows (PowerShell)
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -

# On macOS/Linux
curl -sSL https://install.python-poetry.org | python3 -
  2. Install dependencies:
poetry install
  3. Activate the virtual environment:
poetry shell

Alternative: pip install

If you prefer using pip directly:

pip install datasets transformers peft scikit-learn torch torchvision torchaudio accelerate bitsandbytes tqdm numpy

Required Files

  • lspcV2020.zip: Download from Web Data Commons
  • PDC2020_map.tsv: Auto-generated during dataset preparation

Hardware Requirements

  • GPU: 8GB+ VRAM recommended
  • RAM: 16GB+ system memory
  • Storage: 50GB+ free space
