This repository provides the official PyTorch implementation for Wiki-PRF, a novel three-stage method for Knowledge-Based Visual Question Answering (KB-VQA). Wiki-PRF consists of Processing, Retrieval, and Filtering stages that dynamically extract multimodal cues, perform joint visual-text knowledge retrieval, and filter irrelevant results. The paper has been accepted at NeurIPS 2025.
- ✅ Release core test implementation
- ✅ Complete README documentation
- ✅ Release core train implementation
- ✅ Add a more detailed Quick Start
- (2026.3.3) Our new work has been accepted to CVPR 2026. If you're interested, please follow the link to learn more.
- (2026.2.27) The core training code has been released. Please report any problems via GitHub Issues.
- (2026.2.17) The dataset setup follows EchoSight (https://github.com/Go2Heart/EchoSight).
- (2026.2.2) Model checkpoint released! See the Hugging Face link to download the model.
- (2025.10.17) 📄 Paper released on arXiv.
- (2025.9.19) 🎉 Our paper (Wiki-PRF) has been accepted at NeurIPS 2025!

Wiki-PRF achieves state-of-the-art results on KB-VQA benchmarks.
Knowledge-based visual question answering (KB-VQA) requires models to combine visual understanding with external knowledge. While retrieval-augmented generation (RAG) helps, it often suffers from poor multimodal queries and noisy retrieved content.
We propose Wiki-PRF, a three-stage framework:
- 🔍 Processing: Dynamically invokes visual tools to extract precise multimodal cues for querying.
- 📚 Retrieval: Integrates visual and text features to retrieve relevant knowledge.
- 🧹 Filtering: Filters out irrelevant or low-quality results using reinforcement learning rewards based on answer accuracy and format consistency.
Our method significantly improves performance on E-VQA and InfoSeek, achieving new state-of-the-art results.
Our framework consists of three main components:

- 🔍 **Processing Module**: uses vision-language tools to generate accurate, grounded queries for knowledge retrieval.
- 📚 **Multimodal Retrieval Module**: combines image and text embeddings to retrieve the top-k relevant passages from a knowledge base.
- 🧹 **Filtering & Refinement Module**: applies RL-based filtering to discard noisy context and refine the final answer generation.
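As a rough illustration, the three stages can be sketched as a minimal pipeline. All names and heuristics below are illustrative placeholders (keyword matching stands in for the learned retrieval and RL-trained filtering), not the repository's actual API:

```python
# Minimal sketch of the Processing -> Retrieval -> Filtering pipeline.
# All function names and heuristics are illustrative placeholders.

def process(image, question):
    """Processing: extract multimodal cues to build a precise retrieval query.
    A real implementation would invoke visual tools (captioning, grounding, OCR)."""
    return {"text": question, "visual_cues": f"caption({question})"}

def retrieve(query, knowledge_base, top_k=5):
    """Retrieval: rank passages by a joint visual-text relevance score
    (here approximated by simple keyword overlap)."""
    scored = sorted(
        knowledge_base,
        key=lambda passage: sum(w in passage for w in query["text"].split()),
        reverse=True,
    )
    return scored[:top_k]

def filter_passages(passages, question):
    """Filtering: drop passages irrelevant to the question
    (the paper trains this stage with RL rewards; here a keyword heuristic)."""
    keywords = set(question.lower().split())
    return [p for p in passages if keywords & set(p.lower().split())]

def answer(image, question, knowledge_base):
    query = process(image, question)
    passages = retrieve(query, knowledge_base)
    context = filter_passages(passages, question)
    return context  # a VLM would generate the final answer from this context

kb = ["The Eiffel Tower is in Paris.", "Bananas are yellow."]
print(answer(None, "Where is the Eiffel Tower?", kb))
```

In the actual method, the retrieval score fuses visual and text embeddings, and the filter is a VLM fine-tuned with rewards for answer accuracy and format consistency.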
All values are accuracy (%). Best results in bold, second best underlined.
| Method | Model | Retriever | E-VQA (Single-Hop) | E-VQA (All) | InfoSeek (Unseen-Q) | InfoSeek (Unseen-E) | InfoSeek (All) |
|---|---|---|---|---|---|---|---|
| BLIP-2 | Flan-T5XL | – | 12.6 | 12.4 | 12.7 | 12.3 | 12.5 |
| InstructBLIP | Flan-T5XL | – | 11.9 | 12.0 | 8.9 | 7.4 | 8.1 |
| LLaVA-v1.5 | Vicuna-7B | – | 16.3 | 16.9 | 9.6 | 9.4 | 9.5 |
| GPT-4V | – | – | 26.9 | 28.1 | 15.0 | 14.3 | 14.6 |
| Qwen2.5-VL-3B (Base) | – | – | 17.9 | 19.6 | 20.4 | 21.9 | 21.4 |
| Qwen2.5-VL-7B (Base) | – | – | 21.7 | 20.3 | 22.8 | 24.1 | 23.7 |
| Method | Model | Retriever | E-VQA (Single-Hop) | E-VQA (All) | InfoSeek (Unseen-Q) | InfoSeek (Unseen-E) | InfoSeek (All) |
|---|---|---|---|---|---|---|---|
| DPRV+T | Multi-passage BERT | CLIP ViT-B/32 | 29.1 | – | – | – | 12.4 |
| RORA-VLM | Vicuna-7B | CLIP + Google Search | – | 20.3 | 25.1 | 27.3 | – |
| EchoSight | Mistral-7B / LLaMA-3-8B | EVA-CLIP-8B | 19.4 | – | – | – | 27.7 |
| Wiki-LLaVA | Vicuna-7B | CLIP ViT-L/14 + Contriever | 17.7 | 20.3 | 30.1 | 27.8 | 28.9 |
| ReflectiVA | LLaMA-3.1-8B | EVA-CLIP-8B | 28.0 | 29.2 | 40.4 | 39.8 | 40.1 |
| MMKB-RAG | LLaMA-3.1-8B | EVA-CLIP-8B | 39.7 | 35.9 | 36.4 | 36.3 | 36.4 |
| VLM-PRF (w/o RL) | Qwen-2.5VL-3B | EVA-CLIP-8B | 26.6 | 25.6 | 34.2 | 33.7 | 34.0 |
| VLM-PRF (w/o RL) | Qwen-2.5VL-7B | EVA-CLIP-8B | 28.9 | 28.6 | 40.0 | 39.4 | 39.5 |
| Method | Model | Retriever | E-VQA (Single-Hop) | E-VQA (All) | InfoSeek (Unseen-Q) | InfoSeek (Unseen-E) | InfoSeek (All) |
|---|---|---|---|---|---|---|---|
| VLM-PRF (Ours) | LLaMA-3.1-8B | EVA-CLIP-8B | 36.3 | 35.5 | 41.3 | 40.6 | 40.8 |
| VLM-PRF (Ours) | Qwen-2.5VL-3B | EVA-CLIP-8B | 31.1 | 32.4 | 39.7 | 38.8 | 39.0 |
| VLM-PRF (Ours) | Qwen-2.5VL-7B | EVA-CLIP-8B | 37.1 | 36.0 | 43.3 | 42.7 | 42.8 |
| VLM-PRF (Ours) | InternVL3-8B | EVA-CLIP-8B | 40.1 | 39.2 | 43.5 | 42.1 | 42.5 |
```bash
git clone https://github.com/cqu-student/Wiki-PRF.git
cd Wiki-PRF
pip install -r test/requirements.txt
```

Some components require API keys. Set them before running:
```bash
# Required for GPT-4-based answer generation
export OPENAI_API_KEY=<your-openai-api-key>

# Required for Google PaLM-based answer generation
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
export GOOGLE_CLOUD_PROJECT=<your-gcp-project-id>
export GOOGLE_CLOUD_REGION=us-central1  # optional; defaults to us-central1

# Required for training with Weights & Biases logging
export WANDB_API_KEY=<your-wandb-api-key>
```

Run inference and evaluation with the trained model:
```bash
cd Wiki-PRF
python test/test.py
```

Key paths to configure inside test/test.py:

- `MODEL_PATH`: path to the base model (e.g. Qwen2.5-VL-3B-Instruct)
- `PEFT_MODEL_PATH`: path to the fine-tuned LoRA checkpoint
- `DATA_ROOT`: path to the evaluation data config YAML
- `IMAGE_ROOT`: path to the image directory
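For reference, the configuration block at the top of test/test.py might look like the following. The variable names come from this README; all paths are placeholders to adapt to your local setup:

```python
# Hypothetical example of the paths configured in test/test.py.
# Variable names are from the README; values are placeholders.
MODEL_PATH = "/models/Qwen2.5-VL-3B-Instruct"      # base model
PEFT_MODEL_PATH = "/checkpoints/wiki-prf-lora"     # fine-tuned LoRA checkpoint
DATA_ROOT = "configs/eval_data.yaml"               # evaluation data config YAML
IMAGE_ROOT = "/data/images"                        # image directory
```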
Launch distributed training with the provided script:
```bash
cd Wiki-PRF/train
export WANDB_API_KEY=<your-wandb-api-key>
bash train.sh
```

Key paths to configure inside train/train.sh:

- `--model_name_or_path`: path to the base model
- `--dataset_name`: path to the data config YAML (default: `data_config/rag_data.yaml`)
- `--image_root`: path to the image directory
- `--output_dir`: directory to save checkpoints
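As a hypothetical excerpt, the relevant lines inside train/train.sh might look like the following. Only the flag names are from this README; the launcher command (`torchrun` here) and all paths are placeholder assumptions:

```bash
# Hypothetical excerpt of train/train.sh (launcher and paths are placeholders).
torchrun --nproc_per_node=8 train.py \
  --model_name_or_path /models/Qwen2.5-VL-7B-Instruct \
  --dataset_name data_config/rag_data.yaml \
  --image_root /data/images \
  --output_dir checkpoints/wiki-prf
```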
