This repository accompanies our paper, Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models, and contains FUGU, a benchmark for studying how vision-language models understand simple data visualizations. It includes code for generating scatter-plot stimuli and associated prompts (as well as config files for generating bar and line charts), running model evaluations, analyzing model responses, and performing intervention-based experiments.
Paper: arXiv
- `behavioral_eval.py`: Main evaluation script. Runs a specified model on a dataset of stimuli/prompts and saves per-item responses.
- `analyze_results.py`: Post-processes saved model outputs, extracts answers, and uses an LLM judge to validate correctness or analyze chain-of-thought responses.
- `vanilla_interventions.py`: Code for intervention and counterfactual experiments using the local `pyvene` package.
- `requirements.txt`: Conda-style environment/package list used for the project.
- `LICENSE`: Repository license.
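As an illustration of the answer-extraction step that `analyze_results.py` performs before handing responses to the LLM judge, here is a minimal regex-based sketch. It is not the repo's actual code; the function name and the `Answer:` marker convention are assumptions for illustration.

```python
import re

def extract_final_answer(response: str):
    """Pull a final numeric answer out of a free-form model response.

    Illustrative sketch only: the repo uses an LLM judge for validation;
    this regex fallback just shows the basic extraction idea.
    """
    # Prefer an explicit "Answer: ..." marker, common in chain-of-thought output.
    marker = re.search(r"answer\s*[:=]\s*(-?\d+(?:\.\d+)?)", response, re.IGNORECASE)
    if marker:
        return marker.group(1)
    # Otherwise fall back to the last number mentioned in the response.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None

print(extract_final_answer("There are three clusters. Answer: 3"))  # 3
```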
- `datasets/`: Dataset configs and generated data (including FUGU and FUGU Ensemble).
  - `datasets/fugu/`: Main stimulus set (FUGU) with `stimuli/`, `metadata/`, `segmentation/`, and `stimuli_and_prompts.csv`.
  - `datasets/fugu_ensemble/`: Ensemble/statistical stimuli (FUGU Ensemble) with the same basic structure as `datasets/fugu/`.
  - `datasets/*.yaml`: YAML configs for generating different chart/dataset variants.
- `models/`: Model loading and inference wrappers.
  - `models/models.py`: Unified handlers for supported VLMs and API-based models.
- `utils/`: Helper code for prompt generation, stimulus construction, segmentation, and chain-of-thought templates.
  - `utils/generate_prompts.py`: Builds prompts/answers for the benchmark from stimulus metadata.
  - `utils/stimuli.py`: Stimulus generation, plotting, segmentation-mask creation, and related utilities.
  - `utils/cot_templates.py`: Prompt templates for different chain-of-thought conditions.
- `pyvene/`: Local copy of the intervention library used by the intervention experiments.
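A quick way to get oriented is to iterate over `stimuli_and_prompts.csv`. The sketch below uses an in-memory CSV with hypothetical column names (`stimulus_path`, `task`, `prompt`, `answer`); check the actual header of `datasets/fugu/stimuli_and_prompts.csv` before relying on any of them.

```python
import csv
import io

# Hypothetical rows mirroring datasets/fugu/stimuli_and_prompts.csv; the real
# column names may differ -- inspect the CSV header first.
sample = io.StringIO(
    "stimulus_path,task,prompt,answer\n"
    "stimuli/scatter_0001.png,counting,How many points are shown?,12\n"
)

for row in csv.DictReader(sample):
    print(row["task"], row["answer"])  # counting 12
```

To read the real file, replace the `io.StringIO` object with `open("datasets/fugu/stimuli_and_prompts.csv", newline="")`.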
- This repo currently contains benchmark data and evaluation code for counting, position, distance, extrema, mean, correlation, clustering, function identification, and outlier detection tasks. Other tasks may be added in the future; it is also quite straightforward to add your own tasks by editing the `utils/generate_prompts.py` file.
- Evaluation scripts typically write outputs to a `results/` directory.
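A new task boils down to mapping stimulus metadata to a (prompt, ground-truth answer) pair, in the style of `utils/generate_prompts.py`. The sketch below is purely illustrative: the function name and the `"points"` metadata key are assumptions, not the repo's actual API.

```python
# Hypothetical sketch of a new task in the style of utils/generate_prompts.py.
# The metadata schema ("points") and function name are illustrative only.

def make_counting_prompt(metadata: dict):
    """Build a (prompt, ground-truth answer) pair from stimulus metadata."""
    prompt = "How many data points are shown in the scatter plot?"
    answer = str(len(metadata["points"]))
    return prompt, answer

prompt, answer = make_counting_prompt({"points": [(1, 2), (3, 4), (5, 6)]})
print(answer)  # 3
```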