How Thousands of Citizen Readers Helped Build the Largest Open-Vocabulary Dataset of Narrative Emotions
CR4-NarrEmote is a project we released at EMNLP 2025. It’s the first large-scale, open-vocabulary dataset of emotions in narrative text—built not by professional annotators or microtask workers but by 3,738 volunteer readers from around the world. Over four months, they generated more than 200,000 emotion annotations across 43,000 passages of long-form fiction and nonfiction using our Citizen Readers platform.
Most emotion datasets in NLP are built from tweets, headlines, or other social-media posts, and most rely on closed label sets: fixed taxonomies of preset emotions. But emotions are multifaceted, and narrative emotions can be subtle. Experiences like “dread,” “relief,” “bittersweetness,” “envy,” “apprehension,” or “trying to be brave” each capture a different facet of a character’s experience. An open-vocabulary approach lets those nuances appear on the page.
How the Project Worked
We built the project on Zooniverse.org, the world’s largest citizen-science platform, as part of our larger SSHRC-funded Citizen Readers project. Volunteers saw a single sentence from a book with a character highlighted; their job was to type in any number of emotions the character might be feeling. Every sentence was annotated by at least five different people before being retired.

The results:
- 1,880 unique emotion terms (after cleaning)
- A long-tail distribution: the top ~200 labels cover 80% of all annotations
- Only 15% of emotion words appear in the sentence itself, meaning readers infer emotions far beyond the surface text
To make the data usable for researchers, we mapped these labels into two widely used affective frameworks:
1. VAD (Valence, Arousal, Dominance)
We implemented three models—from lexicon lookup to neural regressors—to place each emotion word on a continuous 0–1 scale. Interestingly:
- Lexicon-only models were expressive but exaggerated extremes.
- Sentence-embedding models smoothed everything toward neutrality.
- A hybrid approach (lexicon + embeddings) offered the best balance.
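The hybrid idea can be sketched as a weighted blend of a sharp lexicon lookup and a smoother estimate for unseen words. Everything below is illustrative: the lexicon entries, the toy "embedding" stand-in, and the blend weight are assumptions, not the project's actual models.

```python
# Hypothetical sketch of a lexicon + embedding hybrid for VAD scoring.
# The lexicon values and the neighbour-averaging "regressor" are toy
# stand-ins, not the project's actual models.

# Toy VAD lexicon: word -> (valence, arousal, dominance), each in [0, 1].
LEXICON = {
    "dread":  (0.10, 0.80, 0.25),
    "relief": (0.85, 0.30, 0.60),
    "envy":   (0.25, 0.55, 0.40),
}

def embedding_vad(word, lexicon):
    """Crude stand-in for a neural regressor: average the VAD of the
    two 'nearest' lexicon entries (here, nearest by word length)."""
    neighbours = sorted(lexicon, key=lambda w: abs(len(w) - len(word)))[:2]
    dims = zip(*(lexicon[n] for n in neighbours))
    return tuple(sum(d) / len(neighbours) for d in dims)

def hybrid_vad(word, lexicon, alpha=0.5):
    """Blend the lexicon lookup (expressive but extreme) with the
    smoother estimate; fall back to the estimate for unseen words."""
    est = embedding_vad(word, lexicon)
    if word not in lexicon:
        return est
    lex = lexicon[word]
    return tuple(alpha * l + (1 - alpha) * e for l, e in zip(lex, est))
```

The `alpha` weight is the knob that trades off the lexicon's exaggerated extremes against the embedding model's pull toward neutrality.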
2. NRC’s Basic Emotions
Using embeddings and valence constraints, we mapped labels to eight categories: joy, sadness, fear, anger, disgust, trust, anticipation, and surprise. Surprisingly (pun intended), anticipation was the most common emotion across the data, suggesting that narratives trade on looking forward more than looking back.
This allowed us to compare our dataset with others while preserving its open-ended expressivity.
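The mapping step can be sketched as nearest-category retrieval with a valence filter: an open-vocabulary label is assigned to the most similar NRC category whose valence sign matches. The three-dimensional vectors below are invented for illustration; the real mapping would use sentence embeddings.

```python
import math

# Toy 3-dim "embeddings" for the eight NRC categories, each paired with
# a valence sign (+1 positive, -1 negative). All values are invented.
NRC = {
    "joy":          ([0.9, 0.1, 0.2], +1),
    "sadness":      ([0.1, 0.9, 0.2], -1),
    "fear":         ([0.1, 0.8, 0.7], -1),
    "anger":        ([0.2, 0.6, 0.9], -1),
    "disgust":      ([0.1, 0.5, 0.8], -1),
    "trust":        ([0.8, 0.2, 0.1], +1),
    "anticipation": ([0.6, 0.3, 0.6], +1),
    "surprise":     ([0.5, 0.4, 0.5], +1),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def map_to_nrc(label_vec, label_valence):
    """Nearest NRC category by cosine similarity, restricted to
    categories whose valence sign matches the open-vocabulary label's."""
    candidates = {c: v for c, (v, s) in NRC.items() if s == label_valence}
    return max(candidates, key=lambda c: cosine(candidates[c], label_vec))
```

The valence constraint keeps a negative label like "dread" from landing in a positive category just because their embeddings happen to sit close together.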
Did Volunteers Agree with Each Other? Surprisingly, Yes.
Emotion labeling is subjective. But we found convergences at multiple levels:
- High semantic alignment (0.93 cosine similarity) even when people used different words
- Nearly 50% categorical agreement on NRC emotions
- Thousands of cases with perfect agreement across annotators
- Patterns far above chance in permutation tests
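A permutation test of this kind can be sketched as: measure observed agreement, then repeatedly shuffle labels across passages and check how often chance alone does as well. The toy data and the all-annotators-agree metric below are illustrative assumptions, not the project's actual test.

```python
import random

def agreement(annotations):
    """Fraction of passages where every annotator chose the same label."""
    return sum(len(set(a)) == 1 for a in annotations) / len(annotations)

def permutation_p_value(annotations, n_perm=2000, seed=0):
    """How often does reshuffling labels across passages produce
    agreement at least as high as the observed agreement?"""
    rng = random.Random(seed)
    observed = agreement(annotations)
    pool = [label for passage in annotations for label in passage]
    sizes = [len(passage) for passage in annotations]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        shuffled, i = [], 0
        for k in sizes:
            shuffled.append(pool[i:i + k])
            i += k
        if agreement(shuffled) >= observed:
            hits += 1
    return hits / n_perm

# Toy data: five annotators per passage, high within-passage agreement.
data = [["joy"] * 5, ["fear"] * 5, ["joy"] * 5, ["sadness"] * 5,
        ["fear"] * 4 + ["anger"]]
```

On data like this the shuffled agreement almost never reaches the observed level, which is exactly the "far above chance" pattern the test is designed to detect.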
We also ran a validation study with trained project moderators. Result? No meaningful differences between expert and citizen judgments across valence, arousal, or dominance.
Citizen science—that is, everyday people—proved to be a reliable, nuanced source of emotional intuition.
Can Today’s AI Models Do This? Not Very Well (Yet)
We benchmarked three approaches:
1. Supervised classification
A logistic regression trained on SBERT embeddings reached ~57% accuracy, noticeably below performance on GoEmotions or similar datasets. Narrative emotions are harder.
2. Embedding-based retrieval
Zero-shot retrieval models could only recover a small portion of the citizen-label space.
3. GPT-4o (zero-shot prompting)
Better—but still far from matching human richness and diversity.
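As a toy illustration of the supervised setup, the sketch below swaps both components for stdlib stand-ins: a bag-of-words nearest-centroid classifier in place of SBERT embeddings and logistic regression. The sentences and labels are invented; only the train/predict/accuracy shape of the evaluation mirrors the benchmark.

```python
from collections import Counter, defaultdict

def bow(sentence):
    """Bag-of-words vector for one sentence (a crude embedding stand-in)."""
    return Counter(sentence.lower().split())

def train_centroids(pairs):
    """Average bag-of-words vector per emotion label."""
    sums, counts = defaultdict(Counter), Counter()
    for sentence, label in pairs:
        sums[label].update(bow(sentence))
        counts[label] += 1
    return {lab: {w: c / counts[lab] for w, c in vec.items()}
            for lab, vec in sums.items()}

def predict(sentence, centroids):
    """Label whose centroid overlaps most with the sentence's words."""
    v = bow(sentence)
    def score(cen):
        return sum(cen.get(w, 0.0) * n for w, n in v.items())
    return max(centroids, key=lambda lab: score(centroids[lab]))

# Invented toy corpus, not CR4-NarrEmote data.
train = [("she trembled in the dark hallway", "fear"),
         ("he laughed and hugged his sister", "joy"),
         ("the letter left her in tears", "sadness"),
         ("she trembled at the dark stair", "fear"),
         ("they laughed together all evening", "joy")]
test = [("he trembled in the dark", "fear"),
        ("they laughed with joy", "joy")]

centroids = train_centroids(train)
accuracy = sum(predict(s, centroids) == y for s, y in test) / len(test)
```

A toy classifier like this leans entirely on surface vocabulary, which is precisely why the real benchmark is hard: only about 15% of citizen emotion labels appear in the sentence itself.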
The takeaway: Narrative emotion understanding remains a frontier problem for AI.
Next step: a deep dive into the emotions of fiction. Coming soon!
