ivan-digital/web-product-data

Product Title Classification based on Web Product Common Crawl subset

Fine-tuned language models for product title classification using the Large-Scale Product Corpus (LSPC) V2020 dataset. The project focuses on 6-class category classification using LoRA fine-tuning of Qwen embedding models.

Project Structure

web-product-data/
|-- category_classification/             # Product title classification components
|   |-- build_lspc_dataset.py            # Dataset preparation script
|   |-- build_lspc_dataset_full.py       # Dataset with extra full split
|   |-- train_qwen3_lora.py              # Main training script
|   |-- train_qwen3_lora_v2.py           # Alternative training script
|   |-- train_embedding_classifier.py    # Baseline PyTorch classifier trainer
|   |-- train_embedding_classifier_v2.py # Trainer w/ residual head + extra losses
|   |-- embedding_classifier.py          # Baseline embedding encoder
|   |-- embedding_classifier_v2.py       # Residual-head encoder variant
|   |-- results_category.txt             # Training results and metrics
|   `-- EXPERIMENT_LOG.md                # Chronological training notes
|-- docs/
|   `-- embedding_classifier_architecture.md  # Detailed model description
|-- webdata_discovery.py                 # Data exploration tool
`-- pyproject.toml                       # Dependencies

Product Title Classification

Overview

The task is to classify product titles into 6 main categories, based on web product data from Common Crawl:

  1. Automotive - Car parts, accessories, tools
  2. Baby - Baby products, toys, care items
  3. Books - All types of books and publications
  4. Clothing - Apparel, fashion items
  5. Jewelry - Jewelry, watches, accessories
  6. Shoes - Footwear of all types
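For reference, the categories in the order listed above can be held in a plain label-to-id mapping. This is a sketch: the authoritative mapping is whatever build_lspc_dataset.py emits.

```python
# Hypothetical label-to-id mapping in the order listed above; the
# authoritative mapping is produced by the dataset builder.
LABELS = ["Automotive", "Baby", "Books", "Clothing", "Jewelry", "Shoes"]
LABEL2ID = {name: i for i, name in enumerate(LABELS)}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}
```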

Quick Start

  1. Prepare the dataset (train/val/test splits):
python category_classification/build_lspc_dataset.py --zip lspcV2020.zip --out lspc_dataset
# Or include a `full` split:
python category_classification/build_lspc_dataset_full.py --zip lspcV2020.zip --out lspc_dataset_full
# Append --add_language (with optional --language_num_proc and --read_workers) to annotate records with detected language codes and speed up scanning.
  2. Train the LoRA model:
python category_classification/train_qwen3_lora.py --data ./lspc_dataset --out ./qwen3_lora_prodcat --batch 128
  3. Train the pure PyTorch embedding classifier (optionally raise batch size/grad_accum to fill VRAM):
python category_classification/train_embedding_classifier.py \
    --data ./lspc_dataset \
    --out  ./embedding_classifier_prodcat

Best Results

Current best (EmbeddingClassifier baseline, full dataset):

  • Test macro F1: 0.9315
  • Test micro F1 / accuracy: 0.9487
  • Per-language (test macro F1): en 0.946, de 0.753, fr 0.809, es 0.783, ja 0.830

More details in the blog post.

Notable config for the best run:

  • Script: category_classification/train_embedding_classifier.py
  • Checkpoint: embedding_classifier_prodcat/checkpoint-epoch5.pt
  • Batch size 254, grad_accum 2, cosine LR (warmup 3000, min_scale 0.05), AMP enabled

Historical LoRA (Qwen) reference:

  • Qwen LoRA (r=16, alpha=32, LR 5e-5, 1 epoch): test macro F1 0.8360, accuracy 0.8791.
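The macro and micro F1 scores reported above can be computed with scikit-learn; this sketch uses toy predictions over the 6 categories (hypothetical data, not the real test set).

```python
from sklearn.metrics import f1_score

# Toy predictions over the 6 categories (hypothetical data, not the real test set).
y_true = [0, 1, 2, 3, 4, 5, 5, 2]
y_pred = [0, 1, 2, 3, 4, 5, 2, 2]
macro = f1_score(y_true, y_pred, average="macro")  # averages per-class F1 equally
micro = f1_score(y_true, y_pred, average="micro")  # equals accuracy for single-label tasks
```

Macro F1 weights every class equally, which is why it sits below micro F1/accuracy on class-imbalanced test sets like this one.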

Pure PyTorch Embedding Classifier

category_classification/embedding_classifier.py contains a compact, encoder-only (bidirectional self-attention) architecture implemented directly in PyTorch, so tokens can attend both left and right (no causal mask). Instantiate it with the tokenizer's vocab size and your target label count:

from category_classification.embedding_classifier import (
    EmbeddingClassifierConfig,
    EmbeddingClassifier,
)

config = EmbeddingClassifierConfig(
    vocab_size=len(tokenizer),  # always match the tokenizer vocabulary size
    num_labels=6,
)
model = EmbeddingClassifier(config)
outputs = model(**batch, return_embeddings=True)
pooled = outputs["pooled_embeddings"]
tokens = outputs["token_embeddings"]

Always determine vocab_size from the tokenizer you plan to use (len(tokenizer) for HuggingFace tokenizers) so the embedding matrix and token IDs stay aligned.

The forward pass accepts input_ids, attention_mask, and optional labels, returning a dictionary with logits (and loss during training). Set return_embeddings=True to access both CLS-style pooled embeddings and the final per-token representations. You can plug the module into a custom PyTorch training loop or wrap it with your preferred trainer/optimizer stack.
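A custom training step against this interface looks like the following sketch. It uses a tiny stand-in module that mimics the forward signature described above (hypothetical dimensions; the real model lives in category_classification/embedding_classifier.py).

```python
import torch
from torch import nn

# Stand-in module mimicking the EmbeddingClassifier forward signature
# (input_ids, attention_mask, optional labels -> dict with logits/loss).
class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=100, hidden=32, num_labels=6):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        h = self.emb(input_ids)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (h * mask).sum(1) / mask.sum(1).clamp(min=1)  # mean-pool over tokens
        logits = self.head(pooled)
        out = {"logits": logits}
        if labels is not None:
            out["loss"] = nn.functional.cross_entropy(logits, labels)
        return out

model = TinyClassifier()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

batch = {
    "input_ids": torch.randint(0, 100, (4, 8)),
    "attention_mask": torch.ones(4, 8, dtype=torch.long),
    "labels": torch.randint(0, 6, (4,)),
}
out = model(**batch)   # dict with "logits" and, since labels are given, "loss"
out["loss"].backward()
opt.step()
opt.zero_grad()
```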

Training the Embedding Classifier

category_classification/train_embedding_classifier.py reproduces the category classification experiment end-to-end using the pure PyTorch model. It loads the LSPC dataset from disk, tokenizes with the Qwen tokenizer, and optimizes the EmbeddingClassifier using AdamW. The default configuration (10 layers, 384-dim hidden size, 6/2 attention heads, 1536-dim feed-forward) contains roughly 80 million parameters (it prints the exact count on startup).

python category_classification/train_embedding_classifier.py \
    --data ./lspc_dataset \
    --out  ./embedding_classifier_prodcat \
    --epochs 3 --batch_size 64 --grad_accum 2 --save_every 1 --amp --memory_summary_freq 5000

Metrics for validation/test are written to metrics.json, and the trained weights + tokenizer are saved under the output directory. Every --save_every epochs (default: 1) a checkpoint-epochX.pt snapshot is written, so you can resume later with --resume path/to/checkpoint-epochX.pt. Language metrics are enabled by default (requires ftlangdetect or langdetect); use --no_lang_metrics to disable them and --lang_min_samples / --lang_top_k to control which languages are reported. Add --amp to enable CUDA mixed precision (AMP) for larger batches and faster training. Use --grad_accum to keep an effective batch size near 1–2k tokens without running out of memory, and optionally set --memory_summary_freq for periodic allocator stats.
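The --grad_accum flag spreads each optimizer step across several micro-batches, so the effective batch size is batch_size × grad_accum. A minimal sketch of the pattern (hypothetical numbers, toy loss):

```python
import torch

# Effective batch size = batch_size * grad_accum (e.g. 64 * 2 = 128).
batch_size, grad_accum = 64, 2
param = torch.nn.Parameter(torch.zeros(3))
opt = torch.optim.AdamW([param], lr=1e-2)

for step in range(4):  # 4 micro-batches -> 2 optimizer steps
    loss = (param - torch.ones(3)).pow(2).mean()
    (loss / grad_accum).backward()  # scale so accumulated gradients average correctly
    if (step + 1) % grad_accum == 0:
        opt.step()
        opt.zero_grad()
```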

Enhanced Classifier Head (v2)

embedding_classifier_v2.py reuses the backbone but swaps the classifier head for a residual bottleneck block with RMSNorm and configurable dropout. The paired trainer, train_embedding_classifier_v2.py, layers in label smoothing and focal loss to further improve macro-F1 without touching the baseline script.
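The residual-bottleneck idea can be sketched as below. This is a rough illustration under assumed dimensions, not the exact v2 head; the real layer sizes and wiring live in embedding_classifier_v2.py.

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: no mean subtraction, learned scale."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class ResidualBottleneckHead(nn.Module):
    """Hypothetical head: RMSNorm -> down-project -> GELU -> up-project,
    added back to the pooled input, then a linear classifier."""
    def __init__(self, hidden=384, bottleneck=96, num_labels=6, dropout=0.1):
        super().__init__()
        self.norm = RMSNorm(hidden)
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(hidden, num_labels)

    def forward(self, pooled):
        h = self.down(self.norm(pooled))
        h = pooled + self.drop(self.up(nn.functional.gelu(h)))  # residual add
        return self.out(h)

head = ResidualBottleneckHead()
logits = head(torch.randn(4, 384))  # shape: (batch, num_labels)
```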

python category_classification/train_embedding_classifier_v2.py \
    --data ./lspc_dataset_full \
    --out ./embedding_classifier_prodcat_v2 \
    --epochs 5 --batch_size 128 --grad_accum 4 \
    --lr_schedule cosine --lr_min_scale 0.05 \
    --label_smoothing 0.05 --focal_gamma 1.5 --amp

Language metrics, checkpointing, and the cosine/exponential LR schedules behave exactly like the baseline trainer. Refer to EXPERIMENT_LOG.md for the latest run commands, thermal/cooling notes, and planned architectural tweaks.

Full Corpus Split

If you want to train on all LSPC examples while still keeping clean evaluation splits, build the dataset with category_classification/build_lspc_dataset_full.py. It behaves like the original builder but adds a "full" split (configurable via --full_split_name). You can point your trainer at train/validation/test for benchmarking and use the full split for large-scale pretraining or distillation runs.

Installation

Using Poetry (Recommended)

This project uses Poetry for dependency management. Poetry automatically handles virtual environments and ensures reproducible builds.

  1. Install Poetry (if not already installed):
# On Windows (PowerShell)
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -

# On macOS/Linux
curl -sSL https://install.python-poetry.org | python3 -
  2. Install dependencies:
poetry install
  3. Activate the virtual environment:
poetry shell

Alternative: pip install

If you prefer using pip directly:

pip install datasets transformers peft scikit-learn torch torchvision torchaudio accelerate bitsandbytes tqdm numpy

Required Files

  • lspcV2020.zip: Download from Web Data Commons
  • PDC2020_map.tsv: Auto-generated during dataset preparation

Hardware Requirements

  • GPU: 8GB+ VRAM recommended
  • RAM: 16GB+ system memory
  • Storage: 50GB+ free space
