This repository contains the official implementation of the CVPR 2026 paper "FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation".
FOZO proposes a novel backpropagation-free paradigm for Test-Time Adaptation (TTA).
Traditional TTA methods typically rely on backpropagation to update model parameters, which is challenging to deploy on edge devices or quantized models. FOZO optimizes a small number of visual prompts inserted into the model through zeroth-order optimization. To address instability in TTA data streams, we introduce a dynamic decay perturbation mechanism, combined with an unsupervised loss function that integrates deep and shallow feature statistics alignment and prediction entropy minimization.
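The unsupervised objective can be sketched as follows. The helper names (`entropy_loss`, `stats_alignment_loss`, `fitness`) and the exact way the two terms are combined via `lam` (mirroring the `--fitness_lambda` flag) are illustrative assumptions, not the repository's actual implementation:

```python
import numpy as np

def entropy_loss(logits):
    """Mean prediction entropy over a batch (to be minimized)."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

def stats_alignment_loss(feats, src_mean, src_var):
    """L1 distance between test-batch feature statistics and source statistics."""
    mu, var = feats.mean(axis=0), feats.var(axis=0)
    return float(np.abs(mu - src_mean).mean() + np.abs(var - src_var).mean())

def fitness(logits, feats, src_mean, src_var, lam=0.4):
    """Combined unsupervised objective; the weighting scheme is an assumption."""
    return stats_alignment_loss(feats, src_mean, src_var) + lam * entropy_loss(logits)
```

Because both terms are computed from a single forward pass's outputs, no labels or gradients are required.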
- Pure Forward-Only Inference: Completely eliminates the need for gradient computation or storing intermediate activations, resulting in extremely low memory overhead.
- Dynamic Perturbation Strategy: Automatically adjusts the zeroth-order gradient perturbation scale $\epsilon$ and learning rate $\eta$ based on loss fluctuations.
- Strong Robustness: Achieves SOTA performance on ImageNet-C (5K), ImageNet-R, and ImageNet-Sketch.
- Quantization-Friendly: Natively supports INT8 quantized models (e.g., PTQ4ViT), addressing the challenge of updating weights in quantized models.
- Efficient and Practical: Completes adaptation with only 2 forward passes per batch, making it suitable for edge device deployment.
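A minimal sketch of such a dynamic decay rule, assuming a simple moving-average comparison of recent losses; the paper's actual schedule may differ, and `adjust`, `decay`, and `window` are hypothetical names:

```python
def adjust(eps, lr, loss_history, decay=0.9, window=8):
    """Shrink the perturbation scale and learning rate when the
    recent average loss stops improving (illustrative rule only)."""
    if len(loss_history) >= 2 * window:
        recent = sum(loss_history[-window:]) / window
        prev = sum(loss_history[-2 * window:-window]) / window
        if recent >= prev:              # no improvement: decay both step sizes
            eps, lr = eps * decay, lr * decay
    return eps, lr
```

The intuition: large perturbations explore quickly early on, while decayed perturbations reduce gradient-estimate variance once the loss plateaus.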
FOZO is particularly suitable for the following scenarios:
- Edge Device Deployment: Test-time adaptation on devices with limited computational resources
- Quantized Models: Adaptation for low-precision models (INT8/INT4)
- Real-time Applications: Online learning scenarios requiring fast response
- Cross-Domain Generalization: Rapid adaptation of models to new data domains
- Privacy Protection: No need to store intermediate activations, reducing privacy leakage risks
The core idea of FOZO is to estimate gradients through zeroth-order optimization (Simultaneous Perturbation Stochastic Approximation, SPSA), thereby updating learnable visual prompt parameters. The algorithm flow is as follows:
- Initialization: Insert a small number of learnable prompts into the input layer of the Vision Transformer.
- Zeroth-Order Gradient Estimation: Estimate the gradient with two forward passes (positive and negative perturbation): $g(Z) = (l^+ - l^-) / (2\epsilon_t)$.
- Dynamic Adjustment: Dynamically adjust the perturbation scale $\epsilon_t$ and learning rate $\eta$ based on loss changes.
- Parameter Update: Update the prompt parameters using the estimated gradient.
- Feature Alignment: Optimize the objective function through deep and shallow feature statistics alignment and entropy minimization.
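The steps above can be sketched as a single SPSA update. This is a minimal illustration, not the repository's exact code; `spsa_step` and the Rademacher perturbation direction are assumptions consistent with standard SPSA:

```python
import numpy as np

def spsa_step(loss_fn, prompts, eps, lr, rng):
    """One zeroth-order update: estimate the gradient of loss_fn at
    `prompts` with two forward passes, then take a gradient step."""
    u = rng.choice([-1.0, 1.0], size=prompts.shape)   # Rademacher perturbation
    l_pos = loss_fn(prompts + eps * u)                # forward pass 1
    l_neg = loss_fn(prompts - eps * u)                # forward pass 2
    g = (l_pos - l_neg) / (2.0 * eps) * u             # SPSA gradient estimate
    return prompts - lr * g, l_pos, l_neg
```

For example, running this update on a simple quadratic loss drives the parameters toward the minimum using only function evaluations, which is exactly why the method needs no backpropagation:

```python
rng = np.random.default_rng(0)
x = np.ones(5)
f = lambda z: float((z ** 2).sum())
for _ in range(200):
    x, _, _ = spsa_step(f, x, eps=0.01, lr=0.05, rng=rng)
# x is now close to the zero vector
```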
We recommend using a Python 3.9+ and PyTorch 2.0+ environment.

```bash
# Create and activate the conda environment
conda env create -f environment.yml
conda activate fozo
```

Prepare datasets according to the following structure and specify their paths via arguments (e.g., `--data_corruption`) in `main.py`:
Used for source domain statistics calculation and baseline testing:
```bash
# Download the ImageNet validation set (50,000 images)
# from https://www.image-net.org/download.php
# and extract it into the following directory structure:
ILSVRC2012_img_val/
└── val/
    ├── n01440764/
    ├── n01443537/
    └── ...
```

Contains 15 types of image corruptions (noise, blur, weather, etc.), each with 5 severity levels:
- Step 1: Download from ImageNet-C: zenodo link
- Step 2: Extract and organize as follows:
```bash
imagenet-c/
├── gaussian_noise/
│   ├── 1/
│   ├── 2/
│   ├── 3/
│   ├── 4/
│   └── 5/
├── shot_noise/
├── impulse_noise/
├── defocus_blur/
├── glass_blur/
├── motion_blur/
├── zoom_blur/
├── snow/
├── frost/
├── fog/
├── brightness/
├── contrast/
├── elastic_transform/
├── pixelate/
└── jpeg_compression/
```
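Before launching long runs, it can help to verify the layout. The helper below (`missing_dirs` is an illustrative utility, not part of the repository) reports any missing corruption/severity directories:

```python
import os

# The 15 ImageNet-C corruption types, each with severity levels 1-5
CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur",
    "glass_blur", "motion_blur", "zoom_blur", "snow", "frost", "fog",
    "brightness", "contrast", "elastic_transform", "pixelate",
    "jpeg_compression",
]

def missing_dirs(root):
    """Return (corruption, severity) pairs absent under `root`."""
    return [(c, s) for c in CORRUPTIONS for s in range(1, 6)
            if not os.path.isdir(os.path.join(root, c, str(s)))]
```

An empty return value means all 75 expected subdirectories are present.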
Used to test model generalization on resampled ImageNet data:
- Step 1: Download from ImageNet-V2: HuggingFace link
- Step 2: Extract `imagenetv2-matched-frequency.tar.gz` and organize:
```bash
imagenet-v2/
└── imagenetv2-matched-frequency-format-val/
    ├── 1/
    ├── 2/
    ├── 3/
    ├── 4/
    ├── 5/
    └── ...
```
Contains 30,000 images across 200 categories including art, cartoons, sketches, etc.:
- Step 1: Download from ImageNet-R: download link
- Step 2: Extract the tar file
Contains 50,000 hand-drawn sketches:
- Step 1: Download from ImageNet-Sketch: Google Drive link
- Step 2: Extract the zip file
Before running experiments, ensure that the dataset paths are set correctly in `main.py` or via command-line arguments:

```bash
--data /path/to/imagenet/val             # ImageNet original validation set
--data_corruption /path/to/imagenet-c    # ImageNet-C
--data_rendition /path/to/imagenet-r     # ImageNet-R
--data_sketch /path/to/imagenet-sketch   # ImageNet-Sketch
--data_v2 /path/to/imagenet-v2           # ImageNet-V2
```

Run FOZO on ImageNet-C (5K) with default parameters:
```bash
python main.py \
--algorithm fozo \
--data /path/to/imagenet/val \
--data_corruption /path/to/imagenet-c \
--num_prompts 3 \
--fitness_lambda 0.4 \
--lr 0.08 \
--zo_eps 0.5 \
--batch_size 64 \
--continual
```

Run the no-adaptation baseline for comparison:

```bash
python main.py \
--algorithm no_adapt \
--data /path/to/imagenet/val \
--data_corruption /path/to/imagenet-c
```

To test performance on quantized models, add the `--quant` flag:
```bash
python main.py \
--algorithm fozo \
--quant \
--data /path/to/imagenet/val \
--data_corruption /path/to/imagenet-c \
--tag _quant_experiment
```

We provide an example script `run.sh` that can be run directly:

```bash
bash run.sh
```

Results on ImageNet-C (5K subset, severity level 5) based on the ViT-Base model:
| Method | Top-1 Acc (%) | Memory (MiB) | FP Count | Runtime |
|---|---|---|---|---|
| NoAdapt | 55.57 | 819 | 1 | 94 |
| FOA | 58.13 | 831 | 2 | 224 |
| ZOA | 58.56 | 859 | 2 | 198 |
| FOZO (Ours) | 59.52 | 831 | 2 | 179 |
Note: FP represents forward pass count. FOZO achieves faster convergence while maintaining low memory.
Faster convergence: on ImageNet-C, FOZO needs only 66% of the test time of previous methods (FOA/ZOA) to reach the same 65% accuracy.
If you use this code or reference the paper in your research, please cite:
```bibtex
@inproceedings{fozo2026,
  title={FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation},
  author={Anonymous},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```

This project's code partially references the following excellent works:
- FOA - Forward-Only Adaptation method
- RobustBench - Standardized robustness evaluation benchmark
- PTQ4ViT - Vision Transformer quantization tool
- VPT - Visual Prompt Tuning method
