VisionCoder — RLVR for Screenshot-to-HTML Generation

Fine-tuning a vision-language model with Reinforcement Learning from Verifiable Rewards (RLVR) to convert UI screenshots into clean HTML/CSS.

Input: UI screenshot → Output: HTML + CSS that visually reproduces it

Results

Trained on 500 steps of GRPO with HuggingFaceM4/WebSight (2 000 samples), evaluated on the Design2Code benchmark (484 held-out examples).

Metric	Qwen3-VL-2B (base)	VCoder-GRPO-CLIP	Δ
Overall	0.232	0.788	+240%
Block-Match	0.101	0.795	+687%
Text Match	0.107	0.884	+722%
Position	0.088	0.706	+703%
Color	0.091	0.698	+663%
CLIP Similarity	0.772	0.858	+11%

The base model struggles to produce structured HTML in the correct format; a single epoch of GRPO training with rendering-based rewards closes most of the gap.

Training Overview

Key observations across 500 steps:

Total reward rises steadily from ~2.4 to ~5.0
CLIP reward (visual fidelity) climbs from near-zero to ~2.4 after 3× boosting
Format + validity rewards converge to near-perfect within ~100 steps, freeing the model to focus on visual quality
Completion length drops sharply (~1 000 → ~460 tokens) as the model learns to generate clean, minimal HTML
Entropy decreases monotonically, indicating a confident but not collapsed policy

Reward Design

Reward	Weight	Signal
`boosted_clip_reward`	3×	CLIP image-image similarity between rendered HTML and reference screenshot
`format_reward`	1×	Presence of `<think>` + `<html>` structure
`html_validity_reward`	1×	HTML parses without critical errors
`structural_similarity_reward`	1×	DOM-level structural similarity to reference

All rewards are computed without any human annotation. Rendering is done with a headless Playwright browser pool.

Detailed Training Curves

Reward Signals

Training Dynamics

Completion Statistics

Architecture

vcoder/
├── pipelines/training.py       # GRPO training entry point (accelerate launch)
├── rewards/
│   ├── visual_rewards.py       # CLIP + SSIM rewards via async Playwright rendering
│   ├── structural_rewards.py   # DOM tree similarity reward
│   ├── validity_rewards.py     # HTML validity reward
│   └── format_rewards.py       # Format / thinking-tag reward
├── rendering/
│   ├── html_renderer.py        # Headless browser rendering
│   └── browser_pool.py         # Async Playwright browser pool
├── data/websight.py             # WebSight dataset loader
├── eval/
│   ├── generate_predictions.py # Batch inference via VLLM server
│   └── extract_testset.py      # Extract Design2Code parquet → PNG/HTML
└── demo/inference.py            # Single-image inference
experiments/
└── plot_run.py                  # Plot training curves from trainer_state.json

Setup

# Install package
pip install -e . --no-deps

# Install Playwright for rendering
playwright install chromium

Training

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    --config_file configs/accelerate_4gpu.yaml \
    vcoder/pipelines/training.py \
    --model_id Qwen/Qwen3-VL-2B-Instruct \
    --output_dir outputs/vcoder-grpo-clip \
    --max_samples 2000 \
    --num_train_epochs 1 \
    --batch_size 4 \
    --num_generations 8 \
    --max_completion_length 2048

Evaluation

1. Start VLLM inference servers

# Base model
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-VL-2B-Instruct --port 8000 \
    --gpu-memory-utilization 0.45 --max-model-len 8192 --trust-remote-code

# Fine-tuned model
CUDA_VISIBLE_DEVICES=1 python3 -m vllm.entrypoints.openai.api_server \
    --model outputs/vcoder-grpo-clip/checkpoint-500 --port 8001 \
    --gpu-memory-utilization 0.45 --max-model-len 8192 --trust-remote-code

2. Generate predictions

python3 vcoder/eval/generate_predictions.py \
    --model vcoder-grpo-clip \
    --testset_dir ../Design2Code/testset_final_extracted

3. Run Design2Code eval

cd ../Design2Code/Design2Code
python3 metrics/multi_processing_eval.py

Plot Training Curves

python3 experiments/plot_run.py                          # latest checkpoint
python3 experiments/plot_run.py --run_dir outputs/vcoder-grpo-clip --checkpoint 300

Plots saved to <run_dir>/plots/.

Model

Base model: Qwen/Qwen3-VL-2B-Instruct
Training: GRPO (TRL), 4× A40 GPUs, ~500 steps, ~7 hours
Dataset: 2 000 samples from HuggingFaceM4/WebSight
Benchmark: Design2Code — 484 held-out screenshot→HTML pairs

Team

Amal Joe · Job J

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
assets/plots		assets/plots
configs		configs
experiments		experiments
notebooks		notebooks
scripts		scripts
tests		tests
vcoder		vcoder
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
test_train.py		test_train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VisionCoder — RLVR for Screenshot-to-HTML Generation

Results

Training Overview

Reward Design

Detailed Training Curves

Reward Signals

Training Dynamics

Completion Statistics

Architecture

Setup

Training

Evaluation

Plot Training Curves

Model

Team

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

VisionCoder — RLVR for Screenshot-to-HTML Generation

Results

Training Overview

Reward Design

Detailed Training Curves

Reward Signals

Training Dynamics

Completion Statistics

Architecture

Setup

Training

Evaluation

Plot Training Curves

Model

Team

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages