Can AI Understand Stories? Large Language Models Take on Narrative Topic Labeling
For years, researchers in the digital humanities and computational social sciences have relied on topic models like LDA to identify the contours of narrative content at large scale. But anyone who has worked with these models knows their limitations—ambiguous labels, fiddly parameters, and sometimes baffling “topics” like “seemed, appeared, length, moment…”
Recent advances in large language models (LLMs) like GPT-4 and open-weight alternatives such as Gemma and LLaMA promise a better way forward. In our new study, we evaluate whether LLMs can generate meaningful and reader-preferred topic labels for narrative texts—both factual (news articles) and fictional (novels). Spoiler: they can, and in some cases, they even outperform humans.
Moving Beyond Bag-of-Words
Traditional topic models rely on word co-occurrence patterns, which can reduce richly textured texts to shallow clusters of frequent words. In contrast, LLMs can be prompted to generate explicit topic labels using natural language, allowing them to consider context, tone, and narrative structure.
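To make the contrast concrete, here is a minimal sketch of what "prompting an LLM for a topic label" can look like in practice. The prompt wording, model choice, and use of the OpenAI Python client are our own illustrative assumptions, not the exact setup from the study.

```python
# Minimal sketch: ask a chat model for a short, reader-friendly topic label.
# Prompt wording and model choice are illustrative assumptions, not the
# study's exact configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def label_passage(passage: str, model: str = "gpt-4") -> str:
    prompt = (
        "Read the passage below and reply with a short topic label "
        "(one to three words) that a general reader would recognize.\n\n"
        f"Passage:\n{passage}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(label_passage("The two sisters quarrelled over their late father's estate..."))
```

Because the model answers in natural language, the label can reflect context and narrative framing rather than just the most frequent words in the passage.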
Our study compares LLM-generated topic labels to human annotations using a ranked voting survey with 200 crowdworkers. Participants read passages and ranked five topic labels—one from a human annotator and four from different LLMs. We tested both news and fiction.
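For readers curious how ranked ballots can be turned into a single preference score, here is one simple Borda-style tally; the study's exact aggregation may differ, so treat this as an illustrative scheme with made-up ballots.

```python
# Sketch: aggregate ranked votes over candidate labels with a Borda-style count.
# Each ballot lists labels from most to least preferred; higher ranks earn more
# points. Illustrative only; not necessarily the aggregation used in the study.
from collections import defaultdict

def borda_scores(ballots: list[list[str]]) -> dict[str, int]:
    scores: dict[str, int] = defaultdict(int)
    for ballot in ballots:
        n = len(ballot)
        for rank, label in enumerate(ballot):
            scores[label] += n - 1 - rank  # top rank gets n-1 points, last gets 0
    return dict(scores)

# Hypothetical ballots from two participants ranking five candidate labels
ballots = [
    ["family relationships", "sibling relationship", "inheritance", "grief", "domestic life"],
    ["inheritance", "family relationships", "domestic life", "sibling relationship", "grief"],
]
print(sorted(borda_scores(ballots).items(), key=lambda kv: -kv[1]))
```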
Key Findings

Fiction is where LLMs shine. In our fiction sample, labels from LLMs like Gemma2 consistently beat human annotations in reader preferences. For instance, readers favored a model's broader label "family relationships" over a human annotator's narrower "sibling relationship," suggesting the models have a better grasp of the level of generality a narrative topic label needs.
For news, it’s a tie. While GPT-4 slightly edged out the others, most models performed similarly—and on par with humans—when labeling topics in news articles.
LLMs are consistent. Across different models, outputs were often strikingly similar. Human annotations matched at least one model's label 50% of the time for news and 72% of the time for fiction.
Interpretability is a win. Compared to LDA, which often produces generic or grammatical clusters (e.g., “seemed,” “though,” “looking”), LLMs surface richer concepts like “revolution,” “civil war,” or “social pressure.”
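To see where clusters like "seemed, though, looking" come from, here is a minimal LDA sketch using gensim. The toy corpus and parameter choices are ours, purely for illustration of the output format, not the corpora or settings used in the study.

```python
# Minimal LDA sketch with gensim: topics come out as weighted word lists that
# the reader must interpret. Toy corpus and parameters are illustrative only.
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["seemed", "though", "looking", "moment", "length"],
    ["revolution", "war", "soldiers", "nation", "liberty"],
    ["marriage", "sister", "family", "estate", "mother"],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10, random_state=0)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)  # e.g. '0.20*"seemed" + 0.20*"though" + ...'
```

The word lists still have to be interpreted and named by a human, which is exactly the step the LLM labels perform directly.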
Case Study: What Novels Talked About in the 19th Century
We applied our methods to 25,000 passages from 19th-century British novels to explore how topics changed over time. Using Dunning's log-likelihood to compare label frequencies across periods, we found that LLMs captured meaningful historical shifts: rising attention to themes like "slavery," "civil war," and "marriage" after 1850, and a decline in topics like "territory" and "Native American culture." LDA, in contrast, surfaced more linguistic patterns than thematic ones.
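For readers who want to try the keyness test themselves, here is a compact sketch of Dunning's log-likelihood (G²) comparing how often a label occurs in two subcorpora, such as passages before and after 1850. The counts in the example are invented for illustration.

```python
# Sketch: Dunning's log-likelihood (G^2) comparing a topic label's frequency
# in two subcorpora (e.g., passages before vs. after 1850).
# The counts below are invented for illustration.
import math

def dunning_g2(count_a: int, total_a: int, count_b: int, total_b: int) -> float:
    """G^2 keyness statistic; larger values mean a bigger frequency difference."""
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    if count_a > 0:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        g2 += count_b * math.log(count_b / expected_b)
    return 2 * g2

# Hypothetical counts: a label appears in 120 of 12,000 post-1850 passages
# versus 40 of 13,000 pre-1850 passages.
print(dunning_g2(120, 12_000, 40, 13_000))
```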
