23daVinci/ANLI-Classifier

ANLI Round 2 — NLI Classifier

3-way Natural Language Inference (entailment / neutral / contradiction) on the Adversarial NLI Round 2 benchmark using DeBERTa-v3.

Results

| Model | Params | Test Accuracy | Test Macro F1 |
| --- | --- | --- | --- |
| Random Uniform (50-seed avg) | – | 33.5% | 0.335 |
| Majority Class (neutral) | – | 33.3% | 0.167 |
| TF-IDF + LogReg (concatenated) | – | 35.0% | 0.335 |
| LinearSVC (separate TF-IDF + NLI features) | – | 42.5% | 0.421 |
| BERT-base (fine-tuned) | 110M | 43.5% | 0.431 |
| DeBERTa-v3-base (MNLI+FEVER+ANLI) | 86M | 54.6% | 0.546 |
| DeBERTa-v3-large (MNLI+FEVER+ANLI) | 304M | 68.2% | 0.682 |

ANLI R2 is adversarially constructed — annotators wrote hypotheses specifically designed to fool transformer models. DeBERTa-v3-large reaches 68.2% accuracy, 34.9 points above chance (33.3%) and 24.7 points above fine-tuned BERT-base.

Project Structure

anli-nli-classifier/
├── notebooks/
│   ├── phase1_eda.ipynb              # Exploratory Data Analysis
│   ├── phase2_baselines.ipynb        # Baseline models
│   └── phase3_deberta.ipynb          # DeBERTa-v3 evaluation + error analysis
├── figures/                          # Plots from EDA and evaluation
├── best_model/                       # Model weights (not in repo — see Setup)
│   ├── base/                         # DeBERTa-v3-base (~360MB)
│   └── large/                        # DeBERTa-v3-large (~1.2GB, optional)
├── main.py                           # FastAPI inference server
├── static/
│   ├── index.html                    # Web UI with model switching
│   └── presentation.html             # Interactive project presentation
├── download_model.py                 # Model download script (base / large / both)
├── Dockerfile
├── docker-compose.yml
├── requirements.txt                  # API dependencies (CPU-only torch)
├── results_summary.json              # Evaluation metrics
├── presentation/
└── README.md

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Model weights in best_model/ directory (see below)

Model Setup

The model weights are not included in the repository due to size. Download them with the included script:

# Download base model only (~360MB) — fast inference, 54.6% accuracy
python download_model.py

# Download large model only (~1.2GB) — higher accuracy (68.2%)
python download_model.py --model large

# Download both models
python download_model.py --all

The script downloads from HuggingFace Hub, auto-installs huggingface_hub if needed, and skips models that already exist.
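
The script's behavior can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual `download_model.py`; the HuggingFace Hub repo ids (Laurer's MNLI+FEVER+ANLI checkpoints) and the local paths are assumptions.

```python
from pathlib import Path

# Assumed Hub repo ids and local destinations — not pinned by this README.
MODELS = {
    "base": ("MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli", "best_model/base"),
    "large": ("MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli", "best_model/large"),
}

def targets(choice: str) -> list[str]:
    """Map the CLI choice ('base', 'large', or 'all') to model keys."""
    return list(MODELS) if choice == "all" else [choice]

def download(keys: list[str]) -> None:
    # Deferred import: the real script can pip-install huggingface_hub first.
    from huggingface_hub import snapshot_download
    for key in keys:
        repo_id, dest = MODELS[key]
        if Path(dest).exists():              # skip checkpoints already on disk
            print(f"{key}: already present, skipping")
            continue
        snapshot_download(repo_id=repo_id, local_dir=dest)

# `python download_model.py --all` would amount to: download(targets("all"))
```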

Run with Docker Compose (Recommended)

# Build and start
docker compose up --build

# Or run detached
docker compose up --build -d

The server auto-detects available models and loads the best one at startup (prefers large if downloaded).

  • Web UI: http://localhost:8000 — interactive classifier with model switching
  • Presentation: http://localhost:8000/presentation — full project walkthrough with figures
  • Swagger UI: http://localhost:8000/docs — API documentation with test interface
  • Health Check: http://localhost:8000/health

Run with Docker Directly

docker build -t anli-nli-classifier .
docker run -p 8000:8000 -v "$(pwd)/best_model:/app/model:ro" anli-nli-classifier

Run without Docker

pip install -r requirements.txt
export MODEL_DIR=./best_model
uvicorn main:app --host 0.0.0.0 --port 8000

Model Switching

The server supports live model switching without a restart, from both the Web UI and the API.

Via Web UI

Click the "Switch to DeBERTa-v3-large" button in the header badge row. The model loads in a few seconds and all subsequent predictions use the new model.

Via API

# List available models
curl http://localhost:8000/models

# Switch to large model
curl -X POST http://localhost:8000/models/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "large"}'

# Switch back to base
curl -X POST http://localhost:8000/models/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "base"}'
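
The same switch can be driven from Python using only the standard library. A minimal client sketch (the endpoint shape follows the curl examples above; the validation helper is an addition of this sketch, not part of the API):

```python
import json
import urllib.request

VALID_KEYS = {"base", "large"}

def switch_payload(model_key: str) -> bytes:
    """Build and validate the JSON body for POST /models/switch."""
    if model_key not in VALID_KEYS:
        raise ValueError(f"unknown model key: {model_key!r}")
    return json.dumps({"model": model_key}).encode()

def switch_model(base_url: str, model_key: str) -> dict:
    """POST /models/switch and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/models/switch",
        data=switch_payload(model_key),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires the server to be running):
# switch_model("http://localhost:8000", "large")
```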

API Reference

GET /health — Health Check

curl http://localhost:8000/health
{
  "status": "healthy",
  "model_loaded": true,
  "active_model": "base",
  "device": "cpu",
  "model_dir": "/app/model"
}

GET /models — List Models

curl http://localhost:8000/models
{
  "active_model": "base",
  "models": [
    {"key": "base", "name": "DeBERTa-v3-base", "params": "86M", "downloaded": true, "active": true},
    {"key": "large", "name": "DeBERTa-v3-large", "params": "304M", "downloaded": true, "active": false}
  ]
}

POST /predict — Single Prediction

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "premise": "The Parma trolleybus system comprises four urban routes.",
    "hypothesis": "The trolleybus system has over 2 urban routes."
  }'
{
  "label": "entailment",
  "confidence": 0.9883,
  "probabilities": {
    "entailment": 0.9883,
    "neutral": 0.0107,
    "contradiction": 0.0008
  },
  "inference_time_ms": 472.23,
  "model": "base"
}
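
The same call from Python, using only the standard library (a sketch mirroring the curl example above; `top_label` is a convenience helper of this sketch, not part of the API):

```python
import json
import urllib.request

def predict(base_url: str, premise: str, hypothesis: str) -> dict:
    """Call POST /predict and return the parsed JSON response."""
    body = json.dumps({"premise": premise, "hypothesis": hypothesis}).encode()
    req = urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def top_label(probabilities: dict) -> str:
    """Recover the predicted label from the returned probability dict."""
    return max(probabilities, key=probabilities.get)

# Usage (server must be running):
# result = predict("http://localhost:8000", "It rained all day.", "The day was dry.")
# print(top_label(result["probabilities"]))
```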

POST /predict/batch — Batch Prediction (up to 64 pairs)

curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "pairs": [
      {
        "premise": "The Parma trolleybus system comprises four urban routes.",
        "hypothesis": "The trolleybus system has over 2 urban routes."
      },
      {
        "premise": "She moved to Paris in 2010.",
        "hypothesis": "She has never lived in France."
      }
    ]
  }'
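
A small helper can build the batch body and enforce the 64-pair limit client-side before sending — a sketch under the payload shape shown above:

```python
MAX_BATCH = 64  # server-side limit stated above

def batch_payload(pairs: list[tuple[str, str]]) -> dict:
    """Build the body for POST /predict/batch, enforcing the 64-pair limit."""
    if not pairs:
        raise ValueError("at least one premise/hypothesis pair is required")
    if len(pairs) > MAX_BATCH:
        raise ValueError(f"batch too large: {len(pairs)} > {MAX_BATCH}")
    return {"pairs": [{"premise": p, "hypothesis": h} for p, h in pairs]}
```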

POST /models/switch — Switch Active Model

curl -X POST http://localhost:8000/models/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "large"}'
{
  "message": "Switched to 'large' successfully.",
  "active_model": "large",
  "load_time_seconds": 4.32
}

Interactive Docs

Swagger UI at http://localhost:8000/docs with pre-filled examples for all endpoints.

Methodology

Dataset

ANLI Round 2 (Nie et al., 2020) — 45,460 training / 1,000 dev / 1,000 test examples. Created via an adversarial human-and-model-in-the-loop process (HAMLET) where annotators wrote hypotheses to fool RoBERTa ensembles. Premises sourced from Wikipedia via HotpotQA.

Models

| Variant | Architecture | Training Data | Parameters |
| --- | --- | --- | --- |
| Base | DeBERTa-v3-base | MNLI + Fever-NLI + ANLI (763K pairs) | 86M |
| Large | DeBERTa-v3-large | MNLI + Fever-NLI + ANLI + LingNLI + WANLI (885K pairs) | 304M |

DeBERTa-v3 uses disentangled attention and replaced token detection (RTD) pre-training, giving it a structural advantage over MLM-based models like BERT and RoBERTa on adversarial NLI data.
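
For inference outside the API, a checkpoint of this family can be loaded with `transformers` roughly as follows. This is a sketch, not the project's `main.py`; the three-way label order is an assumption, and the local path matches the tree above.

```python
import math

ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}  # assumed label order

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over raw classifier scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(premise: str, hypothesis: str, model_dir: str = "best_model/base") -> dict:
    """Score one premise/hypothesis pair with a locally downloaded checkpoint."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    inputs = tok(premise, hypothesis, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return {ID2LABEL[i]: p for i, p in enumerate(softmax(logits))}
```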

Key Findings

  • 68.2% accuracy with DeBERTa-v3-large on the ANLI R2 test set (34.9 points above chance)
  • 54.6% accuracy with DeBERTa-v3-base — scaling to large adds 13.6 points
  • BERT-base fine-tuned reaches only 43.5%, barely above LinearSVC (42.5%) — ANLI was built to fool MLM models
  • Contradiction is the hardest class: recall of 50.2% (base) / 66.1% (large)
  • The dominant error pattern is contradiction → entailment (29.4% of true contradictions), where the model misses logical reversals
  • 58.1% of base model errors are high-confidence (>0.9), indicating poor calibration on adversarial examples
  • Quantifier reasoning has the highest error rate (50.9%), followed by surface-level reasoning (48.2%)
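
The calibration statistic above (fraction of errors made with confidence above 0.9) can be computed with a small helper — a sketch assuming per-example records with `pred`, `gold`, and `confidence` fields (the field names are assumptions):

```python
def high_confidence_error_rate(records: list[dict], threshold: float = 0.9) -> float:
    """Fraction of misclassified examples whose top-class confidence exceeds `threshold`."""
    errors = [r for r in records if r["pred"] != r["gold"]]
    if not errors:
        return 0.0
    return sum(r["confidence"] > threshold for r in errors) / len(errors)
```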

Configuration

| Environment Variable | Default | Description |
| --- | --- | --- |
| MODEL_DIR | /app/model | Path to the model weights directory (contains base/ and/or large/ subdirectories) |
| MAX_LENGTH | 256 | Maximum token sequence length |
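
How the server plausibly reads these settings at startup — a sketch, not the actual `main.py`:

```python
import os

# Fall back to the documented defaults when the variables are unset.
MODEL_DIR = os.environ.get("MODEL_DIR", "/app/model")
MAX_LENGTH = int(os.environ.get("MAX_LENGTH", "256"))
```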

References

  1. Nie et al. (2020) — Adversarial NLI: A New Benchmark for Natural Language Understanding
  2. He et al. (2023) — DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
  3. Laurer et al. (2022) — Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI

About

End-to-end NLI classifier on the Adversarial NLI (ANLI) Round 2 benchmark — EDA, baselines, DeBERTa-v3 evaluation (base & large), error analysis, FastAPI inference API with live model switching, Docker deployment, and interactive web UI.
