This repository accompanies our paper, Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models, and contains FUGU, a benchmark for studying how vision-language models understand simple data visualizations. It includes code for generating scatter-plot stimuli and associated prompts (as well as config files for generating bar and line charts), running model evaluations, analyzing model responses, and performing intervention-based experiments.
Paper: arXiv
- `behavioral_eval.py`: Main evaluation script. Runs a specified model on a dataset of stimuli/prompts and saves per-item responses.
- `analyze_results.py`: Post-processes saved model outputs, extracts answers, and uses an LLM judge to validate correctness or analyze chain-of-thought responses.
- `vanilla_interventions.py`: Code for intervention and counterfactual experiments using the local `pyvene` package.
- `requirements.txt`: Conda-style environment/package list used for the project.
- `LICENSE`: Repository license.
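As an illustration of the answer-extraction step that `analyze_results.py` performs before handing responses to the LLM judge, here is a minimal regex-based sketch. It is not the repo's actual code; the function name and the `Answer:` marker convention are assumptions for illustration.

```python
import re

def extract_final_answer(response: str):
    """Pull a final numeric answer out of a free-form model response.

    Illustrative sketch only: the repo uses an LLM judge for validation;
    this regex fallback just shows the basic extraction idea.
    """
    # Prefer an explicit "Answer: ..." marker, common in chain-of-thought output.
    marker = re.search(r"answer\s*[:=]\s*(-?\d+(?:\.\d+)?)", response, re.IGNORECASE)
    if marker:
        return marker.group(1)
    # Otherwise fall back to the last number mentioned in the response.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None

print(extract_final_answer("There are three clusters. Answer: 3"))  # 3
```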
- `datasets/`: Dataset configs and generated data (including FUGU and FUGU Ensemble).
  - `datasets/fugu/`: Main stimulus set (FUGU) with `stimuli/`, `metadata/`, `segmentation/`, and `stimuli_and_prompts.csv`.
  - `datasets/fugu_ensemble/`: Ensemble/statistical stimuli (FUGU Ensemble) with the same basic structure as `datasets/fugu/`.
  - `datasets/*.yaml`: YAML configs for generating different chart/dataset variants.
- `models/`: Model loading and inference wrappers.
  - `models/models.py`: Unified handlers for supported VLMs and API-based models.
- `utils/`: Helper code for prompt generation, stimulus construction, segmentation, and chain-of-thought templates.
  - `utils/generate_prompts.py`: Builds prompts/answers for the benchmark from stimulus metadata.
  - `utils/stimuli.py`: Stimulus generation, plotting, segmentation-mask creation, and related utilities.
  - `utils/cot_templates.py`: Prompt templates for different chain-of-thought conditions.
- `pyvene/`: Local copy of the intervention library used by the intervention experiments.
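A quick way to get oriented is to iterate over `stimuli_and_prompts.csv`. The sketch below uses an in-memory CSV with hypothetical column names (`stimulus_path`, `task`, `prompt`, `answer`); check the actual header of `datasets/fugu/stimuli_and_prompts.csv` before relying on any of them.

```python
import csv
import io

# Hypothetical rows mirroring datasets/fugu/stimuli_and_prompts.csv; the real
# column names may differ -- inspect the CSV header first.
sample = io.StringIO(
    "stimulus_path,task,prompt,answer\n"
    "stimuli/scatter_0001.png,counting,How many points are shown?,12\n"
)

for row in csv.DictReader(sample):
    print(row["task"], row["answer"])  # counting 12
```

To read the real file, replace the `io.StringIO` object with `open("datasets/fugu/stimuli_and_prompts.csv", newline="")`.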
- This repo currently contains benchmark data and evaluation code for counting, position, distance, extrema, mean, correlation, clustering, function identification, and outlier detection tasks. Other tasks may be added in the future; it is also quite straightforward to add your own tasks by editing the `utils/generate_prompts.py` file.
- Evaluation scripts typically write outputs to a `results/` directory.
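A new task boils down to mapping stimulus metadata to a (prompt, ground-truth answer) pair, in the style of `utils/generate_prompts.py`. The sketch below is purely illustrative: the function name and the `"points"` metadata key are assumptions, not the repo's actual API.

```python
# Hypothetical sketch of a new task in the style of utils/generate_prompts.py.
# The metadata schema ("points") and function name are illustrative only.

def make_counting_prompt(metadata: dict):
    """Build a (prompt, ground-truth answer) pair from stimulus metadata."""
    prompt = "How many data points are shown in the scatter plot?"
    answer = str(len(metadata["points"]))
    return prompt, answer

prompt, answer = make_counting_prompt({"points": [(1, 2), (3, 4), (5, 6)]})
print(answer)  # 3
```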