This repository provides the official PyTorch implementation for Wiki-PRF, a novel three-stage method for Knowledge-Based Visual Question Answering (KB-VQA). Wiki-PRF consists of Processing, Retrieval, and Filtering stages that dynamically extract multimodal cues, perform joint visual-text knowledge retrieval, and filter irrelevant results. The paper has been accepted at NeurIPS 2025.
- ✅ Release core test implementation
- ✅ Complete README documentation
- ✅ Release core train implementation
- ✅ Add a more detailed Quick Start
- (2026.3.3) Our new work has been accepted to CVPR 2026. If you're interested, please follow the link to learn more.
- (2026.2.27) The core training code has been released. Please report any problems via GitHub Issues.
- (2026.2.17) The dataset setup follows EchoSight (https://github.com/Go2Heart/EchoSight).
- (2026.2.2) Model checkpoint released! See the Hugging Face link to download the model.
- (2025.10.17) 📄 Paper released on arXiv.
- (2025.9.19) 🎉 Our paper (Wiki-PRF) has been accepted at NeurIPS 2025!

Wiki-PRF achieves state-of-the-art results on KB-VQA benchmarks.
Knowledge-based visual question answering (KB-VQA) requires models to combine visual understanding with external knowledge. While retrieval-augmented generation (RAG) helps, it often suffers from poor multimodal queries and noisy retrieved content.
We propose Wiki-PRF, a three-stage framework:
- 🔍 Processing: Dynamically invokes visual tools to extract precise multimodal cues for querying.
- 📚 Retrieval: Integrates visual and text features to retrieve relevant knowledge.
- 🧹 Filtering: Filters out irrelevant or low-quality results using reinforcement learning rewards based on answer accuracy and format consistency.
Our method significantly improves performance on E-VQA and InfoSeek, achieving new state-of-the-art results.
Our framework consists of three main components:

- 🔍 **Processing Module**: uses vision-language tools to generate accurate, grounded queries for knowledge retrieval.
- 📚 **Multimodal Retrieval Module**: combines image and text embeddings to retrieve the top-k relevant passages from a knowledge base.
- 🧹 **Filtering & Refinement Module**: applies RL-based filtering to discard noisy context and refine the final answer generation.
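As a rough illustration, the three stages can be sketched as a minimal pipeline. All names and heuristics below are illustrative placeholders (keyword matching stands in for the learned retrieval and RL-trained filtering), not the repository's actual API:

```python
# Minimal sketch of the Processing -> Retrieval -> Filtering pipeline.
# All function names and heuristics are illustrative placeholders.

def process(image, question):
    """Processing: extract multimodal cues to build a precise retrieval query.
    A real implementation would invoke visual tools (captioning, grounding, OCR)."""
    return {"text": question, "visual_cues": f"caption({question})"}

def retrieve(query, knowledge_base, top_k=5):
    """Retrieval: rank passages by a joint visual-text relevance score
    (here approximated by simple keyword overlap)."""
    scored = sorted(
        knowledge_base,
        key=lambda passage: sum(w in passage for w in query["text"].split()),
        reverse=True,
    )
    return scored[:top_k]

def filter_passages(passages, question):
    """Filtering: drop passages irrelevant to the question
    (the paper trains this stage with RL rewards; here a keyword heuristic)."""
    keywords = set(question.lower().split())
    return [p for p in passages if keywords & set(p.lower().split())]

def answer(image, question, knowledge_base):
    query = process(image, question)
    passages = retrieve(query, knowledge_base)
    context = filter_passages(passages, question)
    return context  # a VLM would generate the final answer from this context

kb = ["The Eiffel Tower is in Paris.", "Bananas are yellow."]
print(answer(None, "Where is the Eiffel Tower?", kb))
```

In the actual method, the retrieval score fuses visual and text embeddings, and the filter is a VLM fine-tuned with rewards for answer accuracy and format consistency.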
All values are accuracy (%). Best results in bold, second best underlined.
| Method | Model | Retriever | E-VQA (Single-Hop) | E-VQA (All) | InfoSeek (Unseen-Q) | InfoSeek (Unseen-E) | InfoSeek (All) |
|---|---|---|---|---|---|---|---|
| BLIP-2 | Flan-T5XL | – | 12.6 | 12.4 | 12.7 | 12.3 | 12.5 |
| InstructBLIP | Flan-T5XL | – | 11.9 | 12.0 | 8.9 | 7.4 | 8.1 |
| LLaVA-v1.5 | Vicuna-7B | – | 16.3 | 16.9 | 9.6 | 9.4 | 9.5 |
| GPT-4V | – | – | 26.9 | 28.1 | 15.0 | 14.3 | 14.6 |
| Qwen2.5-VL-3B (Base) | – | – | 17.9 | 19.6 | 20.4 | 21.9 | 21.4 |
| Qwen2.5-VL-7B (Base) | – | – | 21.7 | 20.3 | 22.8 | 24.1 | 23.7 |
| Method | Model | Retriever | E-VQA (Single-Hop) | E-VQA (All) | InfoSeek (Unseen-Q) | InfoSeek (Unseen-E) | InfoSeek (All) |
|---|---|---|---|---|---|---|---|
| DPRV+T | Multi-passage BERT | CLIP ViT-B/32 | 29.1 | – | – | – | 12.4 |
| RORA-VLM | Vicuna-7B | CLIP + Google Search | – | 20.3 | 25.1 | 27.3 | – |
| EchoSight | Mistral-7B / LLaMA-3-8B | EVA-CLIP-8B | 19.4 | – | – | – | 27.7 |
| Wiki-LLaVA | Vicuna-7B | CLIP ViT-L/14 + Contriever | 17.7 | 20.3 | 30.1 | 27.8 | 28.9 |
| ReflectiVA | LLaMA-3.1-8B | EVA-CLIP-8B | 28.0 | 29.2 | 40.4 | 39.8 | 40.1 |
| MMKB-RAG | LLaMA-3.1-8B | EVA-CLIP-8B | 39.7 | 35.9 | 36.4 | 36.3 | 36.4 |
| VLM-PRF (w/o RL) | Qwen-2.5VL-3B | EVA-CLIP-8B | 26.6 | 25.6 | 34.2 | 33.7 | 34.0 |
| VLM-PRF (w/o RL) | Qwen-2.5VL-7B | EVA-CLIP-8B | 28.9 | 28.6 | 40.0 | 39.4 | 39.5 |
| Method | Model | Retriever | E-VQA (Single-Hop) | E-VQA (All) | InfoSeek (Unseen-Q) | InfoSeek (Unseen-E) | InfoSeek (All) |
|---|---|---|---|---|---|---|---|
| VLM-PRF (Ours) | LLaMA-3.1-8B | EVA-CLIP-8B | 36.3 | 35.5 | 41.3 | 40.6 | 40.8 |
| VLM-PRF (Ours) | Qwen-2.5VL-3B | EVA-CLIP-8B | 31.1 | 32.4 | 39.7 | 38.8 | 39.0 |
| VLM-PRF (Ours) | Qwen-2.5VL-7B | EVA-CLIP-8B | 37.1 | 36.0 | 43.3 | 42.7 | 42.8 |
| VLM-PRF (Ours) | InternVL3-8B | EVA-CLIP-8B | 40.1 | 39.2 | 43.5 | 42.1 | 42.5 |
```bash
git clone https://github.com/cqu-student/Wiki-PRF.git
cd Wiki-PRF
pip install -r test/requirements.txt
```

Some components require API keys. Set them before running:
```bash
# Required for GPT-4-based answer generation
export OPENAI_API_KEY=<your-openai-api-key>

# Required for Google PaLM-based answer generation
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
export GOOGLE_CLOUD_PROJECT=<your-gcp-project-id>
export GOOGLE_CLOUD_REGION=us-central1  # optional; defaults to us-central1

# Required for training with Weights & Biases logging
export WANDB_API_KEY=<your-wandb-api-key>
```

Run inference and evaluation with the trained model:
```bash
cd Wiki-PRF
python test/test.py
```

Key paths to configure inside test/test.py:

- `MODEL_PATH`: path to the base model (e.g. Qwen2.5-VL-3B-Instruct)
- `PEFT_MODEL_PATH`: path to the fine-tuned LoRA checkpoint
- `DATA_ROOT`: path to the evaluation data config YAML
- `IMAGE_ROOT`: path to the image directory
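For reference, the configuration block at the top of test/test.py might look like the following. The variable names come from this README; all paths are placeholders to adapt to your local setup:

```python
# Hypothetical example of the paths configured in test/test.py.
# Variable names are from the README; values are placeholders.
MODEL_PATH = "/models/Qwen2.5-VL-3B-Instruct"      # base model
PEFT_MODEL_PATH = "/checkpoints/wiki-prf-lora"     # fine-tuned LoRA checkpoint
DATA_ROOT = "configs/eval_data.yaml"               # evaluation data config YAML
IMAGE_ROOT = "/data/images"                        # image directory
```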
Launch distributed training with the provided script:
```bash
cd Wiki-PRF/train
export WANDB_API_KEY=<your-wandb-api-key>
bash train.sh
```

Key paths to configure inside train/train.sh:

- `--model_name_or_path`: path to the base model
- `--dataset_name`: path to the data config YAML (default: `data_config/rag_data.yaml`)
- `--image_root`: path to the image directory
- `--output_dir`: directory to save checkpoints
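As a hypothetical excerpt, the relevant lines inside train/train.sh might look like the following. Only the flag names are from this README; the launcher command (`torchrun` here) and all paths are placeholder assumptions:

```bash
# Hypothetical excerpt of train/train.sh (launcher and paths are placeholders).
torchrun --nproc_per_node=8 train.py \
  --model_name_or_path /models/Qwen2.5-VL-7B-Instruct \
  --dataset_name data_config/rag_data.yaml \
  --image_root /data/images \
  --output_dir checkpoints/wiki-prf
```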
