cogtoolslab/fugu

Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models

This repository accompanies our paper, Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models, and contains FUGU, a benchmark for studying how vision-language models understand simple data visualizations. It includes code for generating scatter-plot stimuli and associated prompts (as well as config files for generating bar and line charts), running model evaluations, analyzing model responses, and performing intervention-based experiments.

Paper: arXiv

Files

  • behavioral_eval.py: Main evaluation script. Runs a specified model on a dataset of stimuli/prompts and saves per-item responses.
  • analyze_results.py: Post-processes saved model outputs, extracts answers, and uses an LLM judge to validate correctness or analyze chain-of-thought responses.
  • vanilla_interventions.py: Code for intervention and counterfactual experiments using the local pyvene package.
  • requirements.txt: Conda-style environment/package list used for the project.
  • LICENSE: Repository license.

Folders

  • datasets/: Dataset configs and generated data (including FUGU and FUGU Ensemble).
  • datasets/fugu/: Main stimulus set (FUGU) with stimuli/, metadata/, segmentation/, and stimuli_and_prompts.csv.
  • datasets/fugu_ensemble/: Ensemble/statistical stimuli (FUGU Ensemble) with the same basic structure as datasets/fugu/.
  • datasets/*.yaml: YAML configs for generating different chart/dataset variants.
  • models/: Model loading and inference wrappers.
  • models/models.py: Unified handlers for supported VLMs and API-based models.
  • utils/: Helper code for prompt generation, stimulus construction, segmentation, and chain-of-thought templates.
  • utils/generate_prompts.py: Builds prompts/answers for the benchmark from stimulus metadata.
  • utils/stimuli.py: Stimulus generation, plotting, segmentation-mask creation, and related utilities.
  • utils/cot_templates.py: Prompt templates for different chain-of-thought conditions.
  • pyvene/: Local copy of the intervention library used by the intervention experiments.
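The stimulus-generation pipeline can be sketched as follows: sample point coordinates, then record ground-truth quantities that prompts are later built against. This is a simplified, hypothetical illustration; the actual utils/stimuli.py also renders the plots and produces segmentation masks, and its metadata schema may differ.

```python
import random

def make_scatter_stimulus(n_points: int, seed: int = 0) -> dict:
    """Generate coordinates and ground-truth metadata for one
    scatter-plot stimulus (rendering and segmentation omitted)."""
    rng = random.Random(seed)  # seed for reproducible stimuli
    points = [(rng.uniform(0, 100), rng.uniform(0, 100))
              for _ in range(n_points)]
    ys = [y for _, y in points]
    return {
        "points": points,
        # Ground-truth answers for several task types:
        "n_points": n_points,             # counting
        "mean_y": sum(ys) / n_points,     # mean estimation
        "max_x_point": max(points),       # extrema
    }

meta = make_scatter_stimulus(25)
print(meta["n_points"])
```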

Notes

  • This repo currently contains benchmark data and evaluation code for counting, position, distance, extrema, mean, correlation, clustering, function identification, and outlier detection tasks. Other tasks may be added in the future; adding your own is also straightforward: edit utils/generate_prompts.py.
  • Evaluation scripts typically write outputs to a results/ directory.
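A new task added via utils/generate_prompts.py amounts to a function that maps stimulus metadata to a prompt/answer pair. The sketch below is hypothetical: the field names ("n_points", "prompt", "answer") are illustrative and may not match the repo's actual schema.

```python
def make_counting_prompt(metadata: dict) -> dict:
    """Build a prompt/answer pair for a counting task from stimulus
    metadata. Illustrative only; not the repo's actual schema."""
    return {
        "prompt": ("How many points are shown in this scatter plot? "
                   "Answer with a single number."),
        # Ground truth is stored as a string for uniform comparison
        "answer": str(metadata["n_points"]),
    }

item = make_counting_prompt({"n_points": 17})
print(item["answer"])
```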
