Surprise! Evaluating AI’s ability to tell surprising stories
In a new paper lead-authored by Anneliese Bissell and Ella Paulin and presented at the 2025 Workshop on Narrative Understanding, we introduce a novel framework for evaluating narrative surprise in stories produced by large language models (LLMs), grounding our approach in psychological theories of storytelling and reader response.
As theorists like Brewer, Lichtenstein, and Ortony have argued, narrative surprise emerges from the gap between what we expect and what actually happens, while still maintaining logical coherence. It is a surprisingly complex form of communication and equally difficult to measure. To bring more coherence to the field moving forward, we propose six concrete criteria for assessing narrative surprise: initiatoriness (how well the ending explains prior events), immutability violation (whether the ending breaks established “facts” of the story world), predictability, post-dictability (retrospective coherence), importance (impact on the protagonist), and valence (emotional positivity or negativity).
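To make the rubric concrete, here is a minimal sketch of what an annotation record for a single ending might look like; the field names and integer rating scale are illustrative shorthand for this post, not the exact annotation interface used in the study.

```python
# Illustrative schema for one annotated ending; field names and the rating
# scale are shorthand, not the study's actual annotation interface.
from dataclasses import dataclass

@dataclass
class SurpriseAnnotation:
    initiatoriness: int           # how well the ending explains prior events
    immutability_violation: int   # whether the ending breaks established "facts" of the story world
    predictability: int           # how foreseeable the ending was
    postdictability: int          # how coherent the ending feels in retrospect
    importance: int               # impact of the ending on the protagonist
    valence: int                  # emotional positivity or negativity of the ending
```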
To test the framework, we collected 30 mystery stories from the Reedsy platform, a site for user-generated storytelling. Each story was truncated before its resolution and completed with four alternate endings: one written by the original human author and three generated by language models (GPT-4 and Phi-3) using different prompting strategies. A team of trained readers then rated each ending on the criteria above and indicated which ending they found most and which least surprising.
The results showed that while human-written endings were clearly preferred, GPT-4 was not far behind. In fact, GPT-4 endings were chosen as most surprising about as often as the human-written ones, but because GPT-4 contributed twice as many candidate endings per story, human endings were still more likely to be preferred relative to their baseline chances.
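To spell out the baseline argument, here is a minimal sketch of the chance calculation, assuming each story’s four endings split as one human-written, two GPT-4 (one per prompting strategy), and one Phi-3 ending; the numbers are for intuition only, not results from the paper.

```python
# Minimal sketch of the chance baseline, assuming a 1 human / 2 GPT-4 /
# 1 Phi-3 split of the four candidate endings per story.
N_STORIES = 30
endings_per_story = {"human": 1, "gpt4": 2, "phi3": 1}
total = sum(endings_per_story.values())  # 4 candidate endings per story

# Expected number of "most surprising" picks if readers chose at random:
baseline_wins = {src: N_STORIES * n / total for src, n in endings_per_story.items()}
print(baseline_wins)  # {'human': 7.5, 'gpt4': 15.0, 'phi3': 7.5}

# If human and GPT-4 endings win a similar number of times, humans exceed
# their chance baseline (7.5) while GPT-4 falls short of its larger one (15).
```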
Importantly, we found that four of the six criteria were strongly associated with reader preferences, suggesting that this is a viable framework for studying narrative surprise. Endings that clearly explained earlier events (high initiatoriness) and made sense in retrospect (high post-dictability) were significantly more likely to be chosen as most surprising. Meanwhile, endings that were predictable or overly positive were far less likely to engage readers. The strongest single predictor was predictability: the more obvious the twist, the less it surprised (not surprisingly!).
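As an illustration of how criterion ratings can be related to reader choices (not necessarily the analysis used in the paper), one could fit a simple logistic regression predicting whether an ending was picked as most surprising from its six criterion scores; the file and column names below are hypothetical.

```python
# Illustrative only: a simple logistic regression linking criterion ratings
# to reader choices; not necessarily the paper's analysis. Assumes a
# hypothetical CSV with one row per ending: six criterion ratings plus a
# 0/1 column marking whether readers picked that ending as most surprising.
import pandas as pd
import statsmodels.api as sm

ratings = pd.read_csv("ending_ratings.csv")  # hypothetical file
criteria = [
    "initiatoriness", "immutability_violation", "predictability",
    "postdictability", "importance", "valence",
]

X = sm.add_constant(ratings[criteria])
y = ratings["chosen_most_surprising"]
print(sm.Logit(y, X).fit().summary())
# The paper's findings would correspond to a strong negative coefficient on
# predictability and positive coefficients on initiatoriness and post-dictability.
```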
While the study highlights current limitations in LLM-generated fiction, such as a tendency toward feel-good endings and occasionally incoherent twists, it also offers a roadmap for improvement. Our evaluation criteria can be used as benchmarks to test AI’s ability to surprise readers. As the figure below shows, our annotation team’s ratings of GPT-4 endings correlated fairly strongly with their ratings of the human-written ones.
Looking ahead, we suggest expanding this research to larger datasets and to other genres, where surprise is less structurally codified than it is in mysteries.