Announcing NARRABENCH: A New Framework for Testing Narrative Intelligence in Language Models

In our new paper, NARRABENCH: A Comprehensive Framework for Narrative Benchmarking, we (Sil Hamilton, Matthew Wilkens, and Andrew Piper) introduce the first systematic framework for evaluating narrative understanding in AI. Drawing on decades of narrative theory, NARRABENCH defines four foundational dimensions of storytelling (story, narration, discourse, and situatedness) and maps fifty specific narrative skills that large language models should be able to demonstrate, from recognizing characters and events to grasping perspective, style, and moral intent.
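
For readers who want a concrete sense of how such a taxonomy might be organized in practice, here is a minimal, hypothetical sketch in Python. The four dimension names come from the paper; the example skills and their assignment to dimensions are illustrative placeholders drawn only from the skills mentioned above, not the paper's actual fifty-item inventory.

```python
# Hypothetical sketch (not the paper's data): one minimal way to organize the
# four NARRABENCH dimensions alongside a few of the skills named in this post.
# The skill-to-dimension assignments below are illustrative placeholders.
NARRATIVE_DIMENSIONS: dict[str, list[str]] = {
    "story":        ["recognizing characters", "recognizing events"],
    "narration":    ["grasping perspective"],
    "discourse":    ["grasping style"],
    "situatedness": ["grasping moral intent"],
}

def skills_for(dimension: str) -> list[str]:
    """Return the example skills filed under a given dimension."""
    return NARRATIVE_DIMENSIONS.get(dimension, [])

if __name__ == "__main__":
    # Print each dimension with its illustrative skills.
    for dim, skills in NARRATIVE_DIMENSIONS.items():
        print(f"{dim}: {', '.join(skills)}")
```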

After surveying 78 existing NLP benchmarks, we found that only 27% meaningfully capture narrative understanding, with major gaps in areas such as perspective, revelation, and style. In other words, most current tests focus on what happens in stories but ignore how and why those stories are told.

NARRABENCH provides a taxonomy and roadmap for filling those gaps—an evolving, community-driven effort to create better, more interpretable benchmarks for narrative intelligence. The goal is not only to help AI researchers build models that can follow a plot, but also to assess whether machines can engage with the interpretive, subjective, and moral dimensions that make human storytelling so powerful.

In short: if we want AIs that can truly understand stories, we need benchmarks that reflect how complex storytelling really is. NARRABENCH is a step toward that future.