Detecting Literary Characters

We are pleased to announce that a new paper of ours has been accepted at this year’s Conference on Empirical Methods in Natural Language Processing (EMNLP-15). The paper offers methods beyond named entity recognition (NER) for identifying characters in novels.

This work is part of our ongoing project on social networks in fiction. As we’ve come to realize, just extracting characters is a non-trivial task. (Building interaction networks is another thing altogether.) We’ve been inspired by the work of David Bamman, Ted Underwood, and Noah Smith and are working hard to improve upon the results they’ve achieved. Over this past year we’ve realized how diverse the naming conventions around characters can be, and in particular how many characters have functional names rather than proper names (like “the minister” or “the undertaker”). Traditional systems aren’t very good at detecting these, and yet, depending on the genre you’re looking at, they may play an important role (this is particularly true, for example, in detective fiction).
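To make the problem concrete, here is a toy sketch in Python. It is not the method from our paper: the capitalization check is only a crude stand-in for a proper-name detector, the role-noun list is a tiny hand-picked assumption, and the sample sentences are invented for illustration.

```python
import re

# Hypothetical, hand-picked role nouns; a real system would need a far larger lexicon.
ROLE_NOUNS = {"minister", "undertaker", "inspector", "landlady"}


def proper_name_mentions(text):
    """Crude stand-in for NER: capitalized words that do not start a sentence."""
    mentions = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        # Skip the first word, which is capitalized whatever it is.
        mentions.extend(w for w in words[1:] if w[0].isupper())
    return mentions


def functional_name_mentions(text):
    """Find 'the <role noun>' phrases using the small lexicon above."""
    pattern = r"\bthe\s+(" + "|".join(sorted(ROLE_NOUNS)) + r")\b"
    return [m.group(1).lower() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


# Invented sample text, not taken from any corpus.
sample = "At noon the minister called on Holmes. Later the undertaker spoke with Watson."

print(proper_name_mentions(sample))      # ['Holmes', 'Watson']
print(functional_name_mentions(sample))  # ['minister', 'undertaker']
```

The capitalization check finds Holmes and Watson but never sees the minister or the undertaker, and the role-noun lookup only finds what its list already contains. That gap is why functional names need more than a proper-name detector.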

We will continue to work on character detection and social networks this year. In particular, we’re interested in studying the nature of characterization in novels. What kinds of linguistic entities are characters, and what kinds of distributions do we see around characters in novels? Understanding their social relationships is the ultimate goal, but an important first step is to better understand how characters function in literature. It’s another reminder of how digital methods and text annotation tools can reveal how little we know, or how much we simply assume, about some of our most basic literary categories. Character doesn’t seem complex until you try to use it to study something else. Then you realize how interesting it is all by itself.

And to give a sneak peek at an initial finding about character numbers: so far, taking a broad view of characters in novels, we see a remarkably flat average number of characters per novel, with very wide variation around it. There is much more to study here, but I have a feeling that novelistic characters are going to look pretty systematic as a cultural construct. But we’ll see.
