
ConStory-Bench

Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

Project Page arXiv Dataset Leaderboard License


πŸ” Overview

LLMs can generate stories with tens of thousands of words, but they often contradict themselves along the way — characters forget their backstories, timelines break, and world rules silently change.

ConStory-Bench is a benchmark for evaluating narrative consistency in long-form story generation. It includes prompts, an automated evaluation pipeline (ConStory-Checker), and pre-computed results for a wide range of models.

ConStory-Checker detects consistency errors across 5 categories (19 subtypes):

  • Characterization — memory contradictions, knowledge conflicts, skill/power fluctuations, forgotten abilities
  • Factual Detail — appearance mismatches, nomenclature confusions, quantitative errors
  • Narrative Style — perspective shifts, tone inconsistencies, style breaks
  • Timeline & Plot — time contradictions, duration errors, causality violations, abandoned plots
  • World-building & Setting — rule violations, social norm conflicts, geographical contradictions
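For quick programmatic filtering, the taxonomy above can be captured as a small lookup table. The sketch below contains only the subtypes named in this README (the full benchmark defines 19 subtypes, so this list is a subset):

```python
# Error taxonomy as listed in this README. The full benchmark defines
# 19 subtypes; the entries below are only the ones named above.
ERROR_TAXONOMY = {
    "Characterization": [
        "memory contradictions", "knowledge conflicts",
        "skill/power fluctuations", "forgotten abilities",
    ],
    "Factual Detail": [
        "appearance mismatches", "nomenclature confusions",
        "quantitative errors",
    ],
    "Narrative Style": [
        "perspective shifts", "tone inconsistencies", "style breaks",
    ],
    "Timeline & Plot": [
        "time contradictions", "duration errors",
        "causality violations", "abandoned plots",
    ],
    "World-building & Setting": [
        "rule violations", "social norm conflicts",
        "geographical contradictions",
    ],
}

print(len(ERROR_TAXONOMY))  # 5
```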

Figure: GRR leaderboard

Figure: CED vs. average output length

πŸ† With ConStory-Bench, we aim to track how well LLMs maintain narrative consistency as they scale. View our Leaderboard (updating).

🔥 News

  • [2026-04-07] Our paper Lost in Stories: Consistency Bugs in Long Story Generation by LLMs was accepted to ACL 2026.

📄 Paper

📦 Dataset

All data is hosted on HuggingFace: jayden8888/ConStory-Bench

File               Description
prompts.parquet    Benchmark prompts (4 task types)
stories.parquet    Generated stories from multiple models
evaluations/*.csv  ConStory-Checker results per model

Load Data

from datasets import load_dataset

# Load prompts
prompts = load_dataset("jayden8888/ConStory-Bench", data_files="prompts.parquet", split="train")
print(len(prompts))  # 2000

# Load all stories
stories = load_dataset("jayden8888/ConStory-Bench", data_files="stories.parquet", split="train")

Or with pandas:

import pandas as pd

prompts = pd.read_parquet("hf://datasets/jayden8888/ConStory-Bench/prompts.parquet")
stories = pd.read_parquet("hf://datasets/jayden8888/ConStory-Bench/stories.parquet")

⚡ Quick Start

Install

git clone https://github.com/Picrew/ConStory-Bench.git
cd ConStory-Bench
pip install -r requirements.txt

Step 1 — Generate Stories

Use any OpenAI-compatible API:

export OPENAI_API_KEY="your-key"

python -m constory.generate \
    --input data/prompts.parquet \
    --output data/stories/my_model.parquet \
    --model gpt-4o \
    --concurrent 5

Also works with local servers (vLLM, Ollama, etc.):

python -m constory.generate \
    --input data/prompts.parquet \
    --output data/stories/llama3.parquet \
    --model meta-llama/Llama-3-70B-Instruct \
    --api-base http://localhost:8000/v1 \
    --api-key token-abc123

Step 2 — Evaluate with ConStory-Checker

python -m constory.judge \
    --input data/stories/my_model.parquet \
    --story-column generated_story \
    --model-name my_model \
    --concurrent 3

Step 3 — Compute Metrics

# All models
python -m constory.metrics \
    --eval-dir evaluations/ \
    --config configs/models.yaml \
    --mode both

# Single model
python -m constory.metrics \
    --eval-dir evaluations/ \
    --mode ced \
    --eval-file my_model.csv \
    --story-column generated_story \
    --model-name my_model
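As a rough mental model of an error-density score, the sketch below computes consistency errors per thousand generated words. This formula is an illustrative assumption, not the official CED definition — see the paper for the exact metric:

```python
def error_density(num_errors: int, num_words: int) -> float:
    """Hypothetical error-density score: errors per 1,000 words.

    NOTE: an illustrative assumption, not the official CED formula;
    the paper defines the actual metric.
    """
    if num_words == 0:
        return 0.0
    return num_errors / (num_words / 1000)

# A 6,000-word story with 3 detected consistency errors:
print(error_density(3, 6000))  # 0.5
```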

Step 4 — Error Correlation Analysis

Compute conditional probability matrices P(B|A) between the 5 error categories. For example: "Given a story has Timeline errors, what is the probability it also has Factual errors?"

# All models
python -m constory.correlation \
    --eval-dir evaluations/ \
    --config configs/models.yaml

# 8 representative models from the paper
python -m constory.correlation \
    --eval-dir evaluations/ \
    --config configs/models.yaml \
    --models "GPT-5-Reasoning,Claude-Sonnet-4.5,Gemini-2.5-Pro,Qwen3-235B-A22B-Thinking,GLM-4.6,DeepSeek-V3.2-Exp,Kimi-K2-2509,GPT-4o-1120"
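The conditional-probability computation can be sketched in a few lines, assuming each story's evaluation yields the set of error categories flagged in it (the sample data below is hypothetical):

```python
from itertools import product

CATEGORIES = ["Characterization", "Factual Detail", "Narrative Style",
              "Timeline & Plot", "World-building & Setting"]

# Hypothetical per-story results: the set of error categories flagged.
story_errors = [
    {"Timeline & Plot", "Factual Detail"},
    {"Timeline & Plot"},
    {"Characterization", "Factual Detail"},
    set(),  # a fully consistent story
]

def conditional_matrix(stories):
    """P(B|A): among stories with an error of category A, the fraction
    that also contain an error of category B."""
    matrix = {}
    for a, b in product(CATEGORIES, repeat=2):
        with_a = [s for s in stories if a in s]
        matrix[(a, b)] = (
            sum(1 for s in with_a if b in s) / len(with_a) if with_a else 0.0
        )
    return matrix

m = conditional_matrix(story_errors)
# Given Timeline errors, probability of also having Factual errors:
print(m[("Timeline & Plot", "Factual Detail")])  # 0.5
```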

Step 5 — Error Positional Distribution

Analyze where in the story errors appear — the position (0–100%) where the original fact is established vs. where the contradiction occurs, and the gap between them.

# 8 representative models from the paper
python -m constory.positional \
    --eval-dir evaluations/ \
    --config configs/models.yaml \
    --models "GPT-5-Reasoning,Claude-Sonnet-4.5,Gemini-2.5-Pro,Qwen3-235B-A22B-Thinking,GLM-4.6,DeepSeek-V3.2-Exp,Kimi-K2-2509,GPT-4o-1120"
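The position/gap normalization can be sketched as follows, assuming each detected error records the character offsets where the fact was established and where it was contradicted (this record layout is a hypothetical illustration, not the checker's actual schema):

```python
def positional_stats(errors, story_length):
    """Normalize establishment/contradiction offsets to 0-100% of the
    story and report the gap between them.

    `errors` is a list of (establish_offset, contradict_offset) character
    positions; this layout is an assumption for illustration.
    """
    stats = []
    for establish, contradict in errors:
        est_pct = 100 * establish / story_length
        con_pct = 100 * contradict / story_length
        stats.append({
            "establish_pct": round(est_pct, 1),
            "contradict_pct": round(con_pct, 1),
            "gap_pct": round(con_pct - est_pct, 1),
        })
    return stats

# A fact set up at 10% of the story and contradicted at 75%:
print(positional_stats([(1000, 7500)], story_length=10_000))
# [{'establish_pct': 10.0, 'contradict_pct': 75.0, 'gap_pct': 65.0}]
```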

Leaderboard

Full results on our 🏆 Leaderboard (updating).

Model                          Category              CED     Avg Words  Total
GPT-5-Reasoning                Proprietary           0.113   9,050      1,990
Gemini-2.5-Pro                 Proprietary           0.302   5,091      1,996
Gemini-2.5-Flash               Proprietary           0.305   5,504      1,996
Claude-Sonnet-4.5              Proprietary           0.520   8,929      1,998
GLM-4.6                        Open-source           0.528   4,949      2,000
Qwen3-32B                      Open-source           0.537   6,237      2,000
Ring-1T                        Open-source           0.539   5,264      1,999
DeepSeek-V3.2-Exp              Open-source           0.541   3,724      2,000
Qwen3-235B-A22B-Thinking       Open-source           0.559   5,424      2,000
GLM-4.5                        Open-source           0.595   5,421      2,000
LongWriter-Zero-32B            Capability-enhanced   0.669   13,393     1,857
Grok-4                         Proprietary           0.670   2,765      2,000
SuperWriter                    Agent-enhanced        0.674   6,036      2,000
Ling-1T                        Open-source           0.699   5,088      2,000
GPT-4o-1120                    Proprietary           0.711   1,241      1,774
Step3                          Open-source           0.845   3,793      1,916
Qwen3-Next-80B-Thinking        Open-source           0.959   4,820      1,973
DOME                           Agent-enhanced        1.033   8,399      1,969
Doubao-1.6-Thinking-2507       Proprietary           1.217   3,713      2,000
Kimi-K2-2509                   Open-source           1.300   3,227      1,792
Kimi-K2-2507                   Open-source           1.330   3,046      2,000
Mistral-Medium-3.1             Proprietary           1.355   2,447      2,000
Qwen3-235B-A22B                Open-source           1.447   3,246      2,000
Qwen3-Next-80B                 Open-source           1.603   4,013      2,000
Qwen3-4B-Instruct-2507         Open-source           1.685   4,919      1,997
Nvidia-Llama-3.1-Ultra         Open-source           1.833   1,224      1,998
Qwen3-30B-A3B-Instruct-2507    Open-source           2.130   2,968      2,000
DeepSeek-V3                    Open-source           2.422   670        2,000
Suri-ORPO                      Capability-enhanced   2.445   4,279      2,000
QwenLong-L1-32B                Open-source           3.413   1,234      2,000
DeepSeek-R1                    Open-source           3.419   680        1,952
MiniMax-M1-80k                 Open-source           3.447   1,442      1,716
LongAlign-13B                  Capability-enhanced   3.664   1,624      2,000

Repository Structure

ConStory-Bench/
├── README.md
├── LICENSE                    # MIT
├── requirements.txt
├── assets/                    # Logo, figures from paper
├── configs/
│   └── models.yaml            # Model registry (name, file, column, category)
├── constory/                  # Core Python package
│   ├── __init__.py
│   ├── generate.py            # Story generation (OpenAI-compatible API)
│   ├── judge.py               # ConStory-Checker (LLM-as-judge)
│   ├── metrics.py             # CED & GRR computation
│   ├── correlation.py         # Error correlation analysis (P(B|A))
│   └── positional.py          # Error positional distribution analysis
├── prompts/                   # Judge prompt templates (5 categories)
│   ├── characterization.md
│   ├── factual_detail.md
│   ├── narrative_style.md
│   ├── timeline_plot.md
│   └── world_building.md
└── scripts/
    ├── run_generation.sh
    └── run_judge.sh

πŸ“ Citation

@misc{li2026loststoriesconsistencybugs,
  title={Lost in Stories: Consistency Bugs in Long Story Generation by LLMs},
  author={Junjie Li and Xinrui Guo and Yuhao Wu and Roy Ka-Wei Lee and Hongzhi Li and Yutao Xie},
  year={2026},
  eprint={2603.05890},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.05890}
}

License

MIT License
