
Knowledge-based Visual Question Answering with Multimodal Processing, Retrieval, and Filtering

arXiv · NeurIPS 2025 · Python · PyTorch · Hugging Face

This repository provides the official PyTorch implementation for Wiki-PRF, a novel three-stage method for Knowledge-Based Visual Question Answering (KB-VQA). Wiki-PRF consists of Processing, Retrieval, and Filtering stages that dynamically extract multimodal cues, perform joint visual-text knowledge retrieval, and filter irrelevant results. The paper has been accepted at NeurIPS 2025.

🪵 TODO List

  • ✅ Release core test implementation
  • ✅ Complete README documentation
  • ✅ Release core train implementation
  • ✅ Add a more detailed Quick Start

🔥 What's New

  • (2026.3.3) Our new work has been accepted to CVPR 2026. If you're interested, please follow the link to learn more.
  • (2026.2.27) The core training code has been released. Please report any problems via GitHub issues.
  • (2026.2.17) The dataset setup can be found in EchoSight (https://github.com/Go2Heart/EchoSight).
  • (2026.2.02) Model checkpoint released! See the Hugging Face link to download the model.
  • (2025.10.17) 📄 Paper released on arXiv.
  • (2025.9.19) 🎉 Our paper (Wiki-PRF) was accepted at NeurIPS 2025!

🧠 Wiki-PRF: A Three-Stage Framework for Knowledge-Based Visual Question Answering

Official PyTorch implementation of Wiki-PRF, accepted at NeurIPS 2025.

Wiki-PRF achieves state-of-the-art results on KB-VQA benchmarks.


📌 Abstract

Knowledge-based visual question answering (KB-VQA) requires models to combine visual understanding with external knowledge. While retrieval-augmented generation (RAG) helps, it often suffers from poor multimodal queries and noisy retrieved content.

We propose Wiki-PRF, a three-stage framework:

  • 🔍 Processing: Dynamically invokes visual tools to extract precise multimodal cues for querying.
  • 📚 Retrieval: Integrates visual and text features to retrieve relevant knowledge.
  • 🧹 Filtering: Filters out irrelevant or low-quality results using reinforcement learning rewards based on answer accuracy and format consistency.
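The filtering stage's reward can be pictured as a combination of an answer-accuracy term and a format-consistency term. The function below is an illustrative sketch only: the `<answer>...</answer>` tag convention, the exact-match metric, and the 0.9/0.1 weighting are assumptions, not the paper's actual reward design.

```python
import re

def filtering_reward(prediction: str, gold_answer: str, acc_weight: float = 0.9) -> float:
    """Hypothetical RL reward combining answer accuracy and format consistency.

    This is a sketch of the idea described in the Filtering stage; the real
    Wiki-PRF reward may use different tags, metrics, and weights.
    """
    # Format consistency: did the model wrap its answer in the expected tags?
    match = re.search(r"<answer>(.*?)</answer>", prediction, re.DOTALL)
    format_reward = 1.0 if match else 0.0

    # Answer accuracy: case/whitespace-insensitive exact match on the tagged span.
    answer = match.group(1).strip().lower() if match else ""
    accuracy_reward = 1.0 if answer == gold_answer.strip().lower() else 0.0

    return acc_weight * accuracy_reward + (1.0 - acc_weight) * format_reward
```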

Our method significantly improves performance on E-VQA and InfoSeek, achieving new state-of-the-art results.


🏗️ Architecture

Wiki-PRF Architecture

Our framework consists of three main components:

  1. 🔍 Processing Module
    Uses vision-language tools to generate accurate, grounded queries for knowledge retrieval.

  2. 📚 Multimodal Retrieval Module
    Combines image and text embeddings to retrieve top-k relevant passages from a knowledge base.

  3. 🧹 Filtering & Refinement Module
    Applies RL-based filtering to discard noisy context and refine the final answer generation.
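A minimal sketch of the retrieval step (component 2), assuming precomputed CLIP-style embeddings for the query image, the query text, and the knowledge-base passages. The linear fusion weight `alpha` and the use of cosine similarity are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(image_emb, text_emb, kb_embs, k=5, alpha=0.5):
    """Fuse image and text query embeddings, then return the indices of the
    top-k most similar knowledge-base passages by cosine similarity.

    image_emb, text_emb: (d,) query embeddings (e.g. from a CLIP-style encoder)
    kb_embs: (n, d) precomputed passage embeddings
    alpha: visual/textual fusion weight (assumed, not from the paper)
    """
    query = F.normalize(alpha * image_emb + (1 - alpha) * text_emb, dim=-1)
    kb = F.normalize(kb_embs, dim=-1)
    scores = kb @ query  # (n,) cosine similarities
    return torch.topk(scores, k=min(k, kb.shape[0])).indices

# Toy usage with random embeddings
torch.manual_seed(0)
img, txt = torch.randn(512), torch.randn(512)
kb = torch.randn(100, 512)
top = retrieve_top_k(img, txt, kb, k=5)
```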


📊 Results

📊 Main Results on E-VQA and InfoSeek

All values are accuracy (%).

Zero-shot MLLMs

| Method | Model | Retriever | E-VQA (Single-Hop) | E-VQA (All) | InfoSeek (Unseen-Q) | InfoSeek (Unseen-E) | InfoSeek (All) |
|---|---|---|---|---|---|---|---|
| BLIP-2 | Flan-T5-XL | – | 12.6 | 12.4 | 12.7 | 12.3 | 12.5 |
| InstructBLIP | Flan-T5-XL | – | 11.9 | 12.0 | 8.9 | 7.4 | 8.1 |
| LLaVA-v1.5 | Vicuna-7B | – | 16.3 | 16.9 | 9.6 | 9.4 | 9.5 |
| GPT-4V | – | – | 26.9 | 28.1 | 15.0 | 14.3 | 14.6 |
| Qwen2.5-VL-3B (Base) | – | – | 17.9 | 19.6 | 20.4 | 21.9 | 21.4 |
| Qwen2.5-VL-7B (Base) | – | – | 21.7 | 20.3 | 22.8 | 24.1 | 23.7 |

Retrieval-Augmented Models

| Method | Model | Retriever | E-VQA (Single-Hop) | E-VQA (All) | InfoSeek (Unseen-Q) | InfoSeek (Unseen-E) | InfoSeek (All) |
|---|---|---|---|---|---|---|---|
| DPR V+T | Multi-passage BERT | CLIP ViT-B/32 | 29.1 | – | – | – | 12.4 |
| RORA-VLM | Vicuna-7B | CLIP + Google Search | 20.3 | – | 25.1 | 27.3 | – |
| EchoSight | Mistral-7B / LLaMA-3-8B | EVA-CLIP-8B | 19.4 | – | – | – | 27.7 |
| Wiki-LLaVA | Vicuna-7B | CLIP ViT-L/14 + Contriever | 17.7 | 20.3 | 30.1 | 27.8 | 28.9 |
| ReflectiVA | LLaMA-3.1-8B | EVA-CLIP-8B | 28.0 | 29.2 | 40.4 | 39.8 | 40.1 |
| MMKB-RAG | LLaMA-3.1-8B | EVA-CLIP-8B | 39.7 | 35.9 | 36.4 | 36.3 | 36.4 |
| VLM-PRF (w/o RL) | Qwen-2.5VL-3B | EVA-CLIP-8B | 26.6 | 25.6 | 34.2 | 33.7 | 34.0 |
| VLM-PRF (w/o RL) | Qwen-2.5VL-7B | EVA-CLIP-8B | 28.9 | 28.6 | 40.0 | 39.4 | 39.5 |

Retrieval-Augmented Models with Reinforcement Learning (Ours)

| Method | Model | Retriever | E-VQA (Single-Hop) | E-VQA (All) | InfoSeek (Unseen-Q) | InfoSeek (Unseen-E) | InfoSeek (All) |
|---|---|---|---|---|---|---|---|
| VLM-PRF (Ours) | LLaMA-3.1-8B | EVA-CLIP-8B | 36.3 | 35.5 | 41.3 | 40.6 | 40.8 |
| VLM-PRF (Ours) | Qwen-2.5VL-3B | EVA-CLIP-8B | 31.1 | 32.4 | 39.7 | 38.8 | 39.0 |
| VLM-PRF (Ours) | Qwen-2.5VL-7B | EVA-CLIP-8B | 37.1 | 36.0 | 43.3 | 42.7 | 42.8 |
| VLM-PRF (Ours) | InternVL3-8B | EVA-CLIP-8B | 40.1 | 39.2 | 43.5 | 42.1 | 42.5 |

🚀 Get Started

```shell
git clone https://github.com/cqu-student/Wiki-PRF.git
cd Wiki-PRF
pip install -r test/requirements.txt
```

Environment Variables

Some components require API keys. Set them before running:

```shell
# Required for GPT-4 based answer generation
export OPENAI_API_KEY=<your-openai-api-key>

# Required for Google PaLM based answer generation
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
export GOOGLE_CLOUD_PROJECT=<your-gcp-project-id>
export GOOGLE_CLOUD_REGION=us-central1  # optional, defaults to us-central1

# Required for training with Weights & Biases logging
export WANDB_API_KEY=<your-wandb-api-key>
```
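To fail fast on a misconfigured environment, the variables above can be validated before launching. This is a hedged convenience sketch (not part of the repository); the variable names follow the list above, and `GOOGLE_CLOUD_REGION` is treated as optional with its stated default.

```python
import os

# Variable names taken from the environment-variable list above.
REQUIRED = [
    "OPENAI_API_KEY",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GOOGLE_CLOUD_PROJECT",
    "WANDB_API_KEY",
]

def check_env(env=None):
    """Return (missing_required_vars, region); region defaults to us-central1."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    region = env.get("GOOGLE_CLOUD_REGION", "us-central1")
    return missing, region
```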

🧪 Test (Evaluation)

Run inference and evaluation with the trained model:

```shell
cd Wiki-PRF
python test/test.py
```

Key paths to configure inside test/test.py:

  • MODEL_PATH: path to the base model (e.g. Qwen2.5-VL-3B-Instruct)
  • PEFT_MODEL_PATH: path to the fine-tuned LoRA checkpoint
  • DATA_ROOT: path to the evaluation data config YAML
  • IMAGE_ROOT: path to the image directory

🏋️ Train

Launch distributed training with the provided script:

```shell
cd Wiki-PRF/train
export WANDB_API_KEY=<your-wandb-api-key>
bash train.sh
```

Key paths to configure inside train/train.sh:

  • --model_name_or_path: path to the base model
  • --dataset_name: path to the data config YAML (default: data_config/rag_data.yaml)
  • --image_root: path to the image directory
  • --output_dir: directory to save checkpoints
