Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
LLMs can generate stories with tens of thousands of words, but they often contradict themselves along the way: characters forget their backstories, timelines break, and world rules silently change.
ConStory-Bench is a benchmark for evaluating narrative consistency in long-form story generation. It includes prompts, an automated evaluation pipeline (ConStory-Checker), and pre-computed results for a wide range of models.
ConStory-Checker detects consistency errors across 5 categories (19 subtypes):
- Characterization: memory contradictions, knowledge conflicts, skill/power fluctuations, forgotten abilities
- Factual Detail: appearance mismatches, nomenclature confusions, quantitative errors
- Narrative Style: perspective shifts, tone inconsistencies, style breaks
- Timeline & Plot: time contradictions, duration errors, causality violations, abandoned plots
- World-building & Setting: rule violations, social norm conflicts, geographical contradictions
With ConStory-Bench, we aim to track how well LLMs maintain narrative consistency as they scale. View our Leaderboard (updating).
- [2026-04-07] Our paper Lost in Stories: Consistency Bugs in Long Story Generation by LLMs was accepted to ACL 2026.
- arXiv Abstract: https://arxiv.org/abs/2603.05890
- arXiv PDF: https://arxiv.org/pdf/2603.05890
All data is hosted on HuggingFace: jayden8888/ConStory-Bench
| File | Description |
|---|---|
| prompts.parquet | Benchmark prompts (4 task types) |
| stories.parquet | Generated stories from multiple models |
| evaluations/*.csv | ConStory-Checker results per model |
from datasets import load_dataset
# Load prompts
prompts = load_dataset("jayden8888/ConStory-Bench", data_files="prompts.parquet", split="train")
print(len(prompts)) # 2000
# Load all stories
stories = load_dataset("jayden8888/ConStory-Bench", data_files="stories.parquet", split="train")

Or with pandas:
import pandas as pd
prompts = pd.read_parquet("hf://datasets/jayden8888/ConStory-Bench/prompts.parquet")
stories = pd.read_parquet("hf://datasets/jayden8888/ConStory-Bench/stories.parquet")

git clone https://github.com/Picrew/ConStory-Bench.git
cd ConStory-Bench
pip install -r requirements.txt

Use any OpenAI-compatible API:
export OPENAI_API_KEY="your-key"
python -m constory.generate \
--input data/prompts.parquet \
--output data/stories/my_model.parquet \
--model gpt-4o \
--concurrent 5

Also works with local servers (vLLM, Ollama, etc.):
python -m constory.generate \
--input data/prompts.parquet \
--output data/stories/llama3.parquet \
--model meta-llama/Llama-3-70B-Instruct \
--api-base http://localhost:8000/v1 \
--api-key token-abc123

Run ConStory-Checker (LLM-as-judge) on the generated stories:
python -m constory.judge \
--input data/stories/my_model.parquet \
--story-column generated_story \
--model-name my_model \
--concurrent 3

Compute the CED and GRR metrics from the evaluation results:
# All models
python -m constory.metrics \
--eval-dir evaluations/ \
--config configs/models.yaml \
--mode both
# Single model
python -m constory.metrics \
--eval-dir evaluations/ \
--mode ced \
--eval-file my_model.csv \
--story-column generated_story \
--model-name my_model

Compute conditional probability matrices P(B|A) between the 5 error categories. For example: "Given a story has Timeline errors, what is the probability it also has Factual errors?"
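
To make the P(B|A) idea concrete, here is a minimal sketch on toy data. It is not the code in `constory/correlation.py`. The function name `conditional_matrix` and the input layout (one dict of category flags per story) are assumptions made for illustration. Only the category names come from this README.

```python
from typing import Dict, List, Optional

# Category names as listed in this README.
CATEGORIES = [
    "Characterization",
    "Factual Detail",
    "Narrative Style",
    "Timeline & Plot",
    "World-building & Setting",
]


def conditional_matrix(
    stories: List[Dict[str, bool]],
) -> Dict[str, Dict[str, Optional[float]]]:
    """Hypothetical helper: matrix[a][b] estimates P(story has b | story has a).

    Each story is a dict mapping a category name to True when at least one
    error of that category was found. Missing keys count as False.
    """
    matrix: Dict[str, Dict[str, Optional[float]]] = {}
    for a in CATEGORIES:
        # Condition on the stories that contain at least one error of type a.
        with_a = [s for s in stories if s.get(a, False)]
        row: Dict[str, Optional[float]] = {}
        for b in CATEGORIES:
            if not with_a:
                row[b] = None  # P(b | a) is undefined when no story has a
            else:
                both = sum(1 for s in with_a if s.get(b, False))
                row[b] = both / len(with_a)
        matrix[a] = row
    return matrix


# Toy data (made up): four stories with per-category error flags.
toy_stories = [
    {"Timeline & Plot": True, "Factual Detail": True},
    {"Timeline & Plot": True},
    {"Factual Detail": True},
    {},
]

matrix = conditional_matrix(toy_stories)
# Two stories have Timeline errors and one of them also has Factual errors.
print(matrix["Timeline & Plot"]["Factual Detail"])  # 0.5
```

The matrix is generally not symmetric: P(B|A) is normalized by the number of stories with A, while P(A|B) is normalized by the number with B.
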
# All models
python -m constory.correlation \
--eval-dir evaluations/ \
--config configs/models.yaml
# 8 representative models from the paper
python -m constory.correlation \
--eval-dir evaluations/ \
--config configs/models.yaml \
--models "GPT-5-Reasoning,Claude-Sonnet-4.5,Gemini-2.5-Pro,Qwen3-235B-A22B-Thinking,GLM-4.6,DeepSeek-V3.2-Exp,Kimi-K2-2509,GPT-4o-1120"

Analyze where in the story errors appear: the position (0-100%) where the original fact is established, the position where the contradiction occurs, and the gap between them.
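
The three positional quantities can be illustrated with a small sketch. It assumes positions are measured as character offsets relative to story length. `constory/positional.py` may measure them differently (for example, in words), and the helper names `relative_position` and `positional_gap` are hypothetical.

```python
from typing import Optional, Tuple


def relative_position(story: str, snippet: str) -> Optional[float]:
    """Return where `snippet` first starts in `story` as a percentage (0-100).

    Returns None when the story is empty or the snippet is not found.
    Character offsets are an assumption of this sketch.
    """
    if not story:
        return None
    index = story.find(snippet)
    if index < 0:
        return None
    return 100.0 * index / len(story)


def positional_gap(
    story: str, fact_snippet: str, contradiction_snippet: str
) -> Optional[Tuple[float, float, float]]:
    """Return (fact position, contradiction position, gap), all in percent."""
    fact_pos = relative_position(story, fact_snippet)
    contradiction_pos = relative_position(story, contradiction_snippet)
    if fact_pos is None or contradiction_pos is None:
        return None
    return fact_pos, contradiction_pos, contradiction_pos - fact_pos


# Toy story of exactly 100 characters (made up for illustration).
fact = "Mira has green eyes."              # 20 characters
contradiction = "Mira's blue eyes narrowed."  # 26 characters
toy_story = "A" * 10 + fact + "B" * 30 + contradiction + "C" * 14

result = positional_gap(toy_story, fact, contradiction)
print(result)  # (10.0, 60.0, 50.0)
```

Here the fact is established 10% of the way into the story and contradicted at 60%, so the gap is 50 percentage points.
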
# 8 representative models from the paper
python -m constory.positional \
--eval-dir evaluations/ \
--config configs/models.yaml \
--models "GPT-5-Reasoning,Claude-Sonnet-4.5,Gemini-2.5-Pro,Qwen3-235B-A22B-Thinking,GLM-4.6,DeepSeek-V3.2-Exp,Kimi-K2-2509,GPT-4o-1120"

Full results are on our Leaderboard (updating).
| Model | Category | CED | Avg Words | Total |
|---|---|---|---|---|
| GPT-5-Reasoning | Proprietary | 0.113 | 9,050 | 1,990 |
| Gemini-2.5-Pro | Proprietary | 0.302 | 5,091 | 1,996 |
| Gemini-2.5-Flash | Proprietary | 0.305 | 5,504 | 1,996 |
| Claude-Sonnet-4.5 | Proprietary | 0.520 | 8,929 | 1,998 |
| GLM-4.6 | Open-source | 0.528 | 4,949 | 2,000 |
| Qwen3-32B | Open-source | 0.537 | 6,237 | 2,000 |
| Ring-1T | Open-source | 0.539 | 5,264 | 1,999 |
| DeepSeek-V3.2-Exp | Open-source | 0.541 | 3,724 | 2,000 |
| Qwen3-235B-A22B-Thinking | Open-source | 0.559 | 5,424 | 2,000 |
| GLM-4.5 | Open-source | 0.595 | 5,421 | 2,000 |
| LongWriter-Zero-32B | Capability-enhanced | 0.669 | 13,393 | 1,857 |
| Grok-4 | Proprietary | 0.670 | 2,765 | 2,000 |
| SuperWriter | Agent-enhanced | 0.674 | 6,036 | 2,000 |
| Ling-1T | Open-source | 0.699 | 5,088 | 2,000 |
| GPT-4o-1120 | Proprietary | 0.711 | 1,241 | 1,774 |
| Step3 | Open-source | 0.845 | 3,793 | 1,916 |
| Qwen3-Next-80B-Thinking | Open-source | 0.959 | 4,820 | 1,973 |
| DOME | Agent-enhanced | 1.033 | 8,399 | 1,969 |
| Doubao-1.6-Thinking-2507 | Proprietary | 1.217 | 3,713 | 2,000 |
| Kimi-K2-2509 | Open-source | 1.300 | 3,227 | 1,792 |
| Kimi-K2-2507 | Open-source | 1.330 | 3,046 | 2,000 |
| Mistral-Medium-3.1 | Proprietary | 1.355 | 2,447 | 2,000 |
| Qwen3-235B-A22B | Open-source | 1.447 | 3,246 | 2,000 |
| Qwen3-Next-80B | Open-source | 1.603 | 4,013 | 2,000 |
| Qwen3-4B-Instruct-2507 | Open-source | 1.685 | 4,919 | 1,997 |
| Nvidia-Llama-3.1-Ultra | Open-source | 1.833 | 1,224 | 1,998 |
| Qwen3-30B-A3B-Instruct-2507 | Open-source | 2.130 | 2,968 | 2,000 |
| DeepSeek-V3 | Open-source | 2.422 | 670 | 2,000 |
| Suri-ORPO | Capability-enhanced | 2.445 | 4,279 | 2,000 |
| QwenLong-L1-32B | Open-source | 3.413 | 1,234 | 2,000 |
| DeepSeek-R1 | Open-source | 3.419 | 680 | 1,952 |
| MiniMax-M1-80k | Open-source | 3.447 | 1,442 | 1,716 |
| LongAlign-13B | Capability-enhanced | 3.664 | 1,624 | 2,000 |
ConStory-Bench/
├── README.md
├── LICENSE                  # MIT
├── requirements.txt
├── assets/                  # Logo, figures from paper
├── configs/
│   └── models.yaml          # Model registry (name, file, column, category)
├── constory/                # Core Python package
│   ├── __init__.py
│   ├── generate.py          # Story generation (OpenAI-compatible API)
│   ├── judge.py             # ConStory-Checker (LLM-as-judge)
│   ├── metrics.py           # CED & GRR computation
│   ├── correlation.py       # Error correlation analysis (P(B|A))
│   └── positional.py        # Error positional distribution analysis
├── prompts/                 # Judge prompt templates (5 categories)
│   ├── characterization.md
│   ├── factual_detail.md
│   ├── narrative_style.md
│   ├── timeline_plot.md
│   └── world_building.md
└── scripts/
    ├── run_generation.sh
    └── run_judge.sh
@misc{li2026loststoriesconsistencybugs,
title={Lost in Stories: Consistency Bugs in Long Story Generation by LLMs},
author={Junjie Li and Xinrui Guo and Yuhao Wu and Roy Ka-Wei Lee and Hongzhi Li and Yutao Xie},
year={2026},
eprint={2603.05890},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.05890}
}

