
Model Optimizer - Windows: Benchmark Reference

This document provides a summary of the performance and accuracy measurements of Model Optimizer - Windows for several popular models. The benchmark results in the following tables serve as reference points and should not be viewed as the maximum performance achievable by Model Optimizer - Windows.

1 Performance And Accuracy Comparison: ONNX INT4 vs ONNX FP16 Models

1.1 Performance Comparison

All performance metrics are tested using the onnxruntime-genai perf benchmark with the DirectML backend.

  • Configuration: Windows OS, GPU RTX 4090, NVIDIA Model Optimizer v0.19.0.
  • Batch Size: 1

Memory savings and inference speedup are compared to the ONNX FP16 baseline.

| Model | Input Prompt Length | Output Token Length | GPU Memory Saving | Generation-Phase Inference Speedup |
|---|---|---|---|---|
| Llama3.1-8B-Instruct | 128 | 256 | 2.44x | 2.68x |
| Phi3.5-mini-Instruct | 128 | 256 | 2.53x | 2.51x |
| Mistral-7B-Instruct-v0.3 | 128 | 256 | 2.88x | 3.41x |
| Llama3.2-3B-Instruct | 128 | 256 | 1.96x | 2.19x |
| Gemma-2b-it | 128 | 256 | 1.64x | 1.94x |

1.2 Accuracy Comparison

1.2.1 MMLU

For accuracy evaluation, we use the Massive Multitask Language Understanding (MMLU) benchmark. Please refer to the detailed instructions for running the MMLU accuracy benchmark.

The table below shows the MMLU 5-shot scores for several models.

  • FP16 ONNX model: Generated using GenAI Model Builder with DML EP
  • INT4 AWQ model: Generated by quantizing FP16 ONNX model using ModelOpt-Windows
  • Configuration: Windows OS, GPU RTX 4090, nvidia-modelopt v0.19.0, onnxruntime-genai-directml 0.4, transformers 4.44

| Model | ONNX FP16 | ONNX INT4 |
|---|---|---|
| Llama3.1-8B-Instruct | 68.45 | 66.1 |
| Phi3.5-mini-Instruct | 68.9 | 65.7 |
| Mistral-7B-Instruct-v0.3 | 61.76 | 60.73 |
| Llama3.2-3B-Instruct | 60.8 | 57.71 |
| Gemma-2b-it | 37.01 | 37.2 |
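As a rough illustration of how multiple-choice MMLU scoring works, the sketch below picks the answer choice with the highest model log-likelihood for each question and averages correctness across questions. The data and helper names are hypothetical; a real MMLU run obtains per-choice log-likelihoods from the model under test.

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# The log-likelihoods below are synthetic; a real harness queries the model.

def pick_answer(choice_logprobs):
    """Return the index of the choice (A/B/C/D) with the highest log-likelihood."""
    return max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])

def mmlu_accuracy(examples):
    """examples: list of (choice_logprobs, correct_index) pairs."""
    correct = sum(1 for lps, gold in examples if pick_answer(lps) == gold)
    return correct / len(examples)

# Two toy questions: the model is right on the first, wrong on the second.
examples = [
    ([-2.1, -0.4, -3.0, -1.8], 1),  # predicted 1, gold 1 -> correct
    ([-0.9, -1.5, -2.2, -1.1], 3),  # predicted 0, gold 3 -> wrong
]
print(mmlu_accuracy(examples))  # -> 0.5
```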

1.2.2 Perplexity (PPL)

Perplexity measures how well a probability model predicts a sample; lower perplexity values indicate better model quality. The following table shows perplexity values at an input sequence length of 1024 with a chunk size of 512.

Learn more about Perplexity: Perplexity - Wikipedia | Hugging Face - Perplexity of Fixed-Length Models
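The chunked evaluation described above can be sketched as follows: score the sequence chunk by chunk, pool the per-token negative log-likelihoods, and exponentiate the mean. This is a pure-Python illustration with synthetic log-probabilities, not the evaluation harness used for the table below.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp of the mean negative log-likelihood over all scored tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def chunked_perplexity(token_logprobs, chunk_size=512):
    """Score the sequence in fixed-size chunks and pool all token NLLs."""
    pooled = []
    for start in range(0, len(token_logprobs), chunk_size):
        pooled.extend(token_logprobs[start:start + chunk_size])
    return perplexity(pooled)

# Sanity check: uniform predictions over a 50,000-token vocabulary give PPL = 50000.
vocab = 50_000
lps = [math.log(1.0 / vocab)] * 1024
print(round(chunked_perplexity(lps)))  # -> 50000
```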

  • FP16-MB: Baseline FP16 genai model (Model Builder)
  • Mixed AWQ-MO: Important linear layers in INT8, rest in INT4 (AWQ), with ModelOpt.
  • Mixed RTN-MO: Important linear layers in INT8, rest in INT4 (RTN), with ModelOpt.
  • Pure INT4 AWQ-MO: All linear layers INT4 (AWQ) with ModelOpt.
  • Pure INT4 RTN-MO: All linear layers INT4 (RTN) with ModelOpt.
  • Pure INT8 RTN-MO: All linear layers INT8 (RTN) with ModelOpt.
  • Pure INT8 AWQ-MO: All linear layers INT8 (AWQ) with ModelOpt.
  • Configuration: Windows OS, GPU RTX 5090, nvidia-modelopt v0.39.0, onnxruntime-genai-cuda 0.9.2, onnxruntime-gpu 1.23.0, torch 2.8.0+cu128, transformers 4.49.0
| Model | FP16-MB | Mixed AWQ-MO | Mixed RTN-MO | Pure INT4 AWQ-MO | Pure INT4 RTN-MO | Pure INT8 RTN-MO | Pure INT8 AWQ-MO |
|---|---|---|---|---|---|---|---|
| DeepSeek R1 Distill Qwen 1.5B | 39.447 | 41.699 | 44.332 | 44.213 | 46.304 | 39.802 | 39.713 |
| Llama 3.2 1B Instruct | 12.631 | 13.852 | 14.176 | 14.549 | 16.900 | 12.664 | 12.637 |
| Phi-3.5 Mini Instruct | 6.046 | 6.500 | 6.599 | 6.711 | 7.070 | - | - |
| Phi-4 Mini Instruct | 9.039 | 9.673 | 9.712 | 10.015 | 10.911 | - | - |
| Qwen 2.5 1.5B Instruct | 9.216 | 10.084 | 10.338 | 10.495 | 10.933 | 9.227 | 9.232 |

For detailed instructions on evaluating perplexity, please refer to the Perplexity Evaluation Guide.

1.2.3 KL-divergence

KL-divergence (Kullback-Leibler divergence) quantifies the distributional difference between the quantized model and the baseline model. Lower KL-divergence values indicate that the quantized model's output distribution is closer to the original model.

Learn more about KL-divergence: KL Divergence - Wikipedia | Understanding KL Divergence
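Concretely, the metric compares the next-token probability distributions implied by the two models' logits. The sketch below computes KL(P || Q) from raw logits with a numerically stable softmax; the toy logits are illustrative, not taken from any model in the table.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between next-token distributions; P is the baseline model,
    Q the quantized model. Zero iff the distributions are identical."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

baseline = [2.0, 1.0, 0.1]
quantized = [1.9, 1.1, 0.2]  # slightly perturbed logits
print(kl_divergence(baseline, baseline))   # identical models -> 0.0
print(kl_divergence(baseline, quantized))  # small positive value
```

In practice this per-position divergence is averaged over many token positions of an evaluation corpus.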

Supported backends: PyTorch, onnxruntime-cuda, and onnxruntime-trt-rtx-ep are all supported for evaluation.

  • Baseline model: Hugging Face FP16 model
  • Quantized models: Models where quantization is simulated (a.k.a. fake quantization), typically evaluated with the PyTorch-CUDA backend. In fake quantization, the weights are quantized and immediately dequantized to simulate quantization error while compute stays in floating point. The Inference Backend column in the table below indicates whether the reported results come from PyTorch simulation or ONNX Runtime-based inference.
  • Configuration: Windows OS, GPU RTX 5090, nvidia-modelopt v0.39.0, onnxruntime-genai-cuda 0.9.2, onnxruntime-gpu 1.23.0, torch 2.8.0+cu128, transformers 4.49.0
| Model | Quantization Method | Quantization Granularity | KL-divergence | Inference Backend |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | Base FP16 (Baseline) | - | 0.000 | PyTorch (FP16) |
| Qwen2.5-1.5B-Instruct | int4+int8 Blockwise-max_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.336 | PyTorch (fake quantization) |
| Qwen2.5-1.5B-Instruct | int4+int8 max_algo-mixed_quant (simulated, per-channel) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.337 | PyTorch (fake quantization) |
| Llama-3.2-3B-Instruct | Base FP16 (Baseline) | - | 0.000 | PyTorch (FP16) |
| Llama-3.2-3B-Instruct | int4+int8 Blockwise-awq-lite_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.228 | PyTorch (fake quantization) |
| Llama-3.2-3B-Instruct | int4+int8 per-channel-awq-lite_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.230 | PyTorch (fake quantization) |
| Llama-3.2-3B-Instruct | int4+int8 Blockwise-max_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.238 | PyTorch (fake quantization) |
| Llama-3.2-3B-Instruct | int4+int8 per-channel-max_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.238 | PyTorch (fake quantization) |
| Llama-3.2-3B-Instruct | int4 Blockwise-max_algo only (simulated) | INT4: per-block (block-size=128) | 0.334 | PyTorch (fake quantization) |

All KL-divergence results above are obtained via PyTorch fake-quantization simulation unless otherwise noted. ONNX Runtime-based inference can also be evaluated.
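To make the fake-quantization mechanism concrete, here is a minimal sketch of blockwise signed INT4 simulation matching the per-block (block-size=128) granularity listed in the table: each block is scaled to the [-8, 7] range, rounded, clamped, and immediately dequantized, so compute stays in floating point. This is an illustrative max-calibration sketch, not the ModelOpt implementation.

```python
def fake_quant_int4_blockwise(weights, block_size=128):
    """Simulated ("fake") blockwise INT4 quantization: per-block max scaling
    to the signed 4-bit range [-8, 7], round, clamp, and dequantize in place."""
    out = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Max calibration: map the largest magnitude in the block to 7.
        scale = max(abs(w) for w in block) / 7.0 or 1.0  # avoid 0 for all-zero blocks
        for w in block:
            q = max(-8, min(7, round(w / scale)))  # quantize to an int4 code
            out.append(q * scale)                  # dequantize immediately
    return out

w = [0.05, -0.7, 0.31, 0.02, -0.11, 0.68]
wq = fake_quant_int4_blockwise(w, block_size=4)
# Each dequantized value stays within half a quantization step of the original.
print(wq)
```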

For detailed instructions on computing KL-divergence, please refer to the KL-divergence Evaluation Guide.