LIWC for Literature: Releasing Data on 25,000 Documents

Increasing emphasis is being placed in the humanities on sharing data. Projects like the Open Syllabus Project, for example, have made a tremendous effort in discovering, collecting, and cleaning large amounts of data relevant to humanities research. Much of our data, however, is still locked up behind copyright and paywalls within university libraries, even when the underlying information is in the public domain. This is one of the main inhibitors to the field’s development.

In an effort to contribute to the opening of closed humanities data, I am sharing LIWC tables for 25,000+ documents that were used in a recent study I did on “fictionality.” The documents consist of both fiction and non-fiction texts drawn from a number of different periods (the nineteenth century canon, Hathi Trust nineteenth-century documents, the twentieth century repositories of Gutenberg and Amazon, and multiple contemporary literary genres from mysteries to prizewinners) as well as two separate languages (German and English). They allow us to explore a variety of literary historical questions across broad swaths of time and place.

LIWC stands for Linguistic Inquiry and Word Count and is a lexicon-based software tool that aggregates individual words into larger semantic and syntactic categories. Some of these categories, like punctuation marks or personal pronouns, are more straightforward than others, like “cognitive insight” or the thematics of “home.” But as we know in literary studies, even straightforward marks can have multiple meanings.

LIWC is far from perfect. Much work has been done to address the problems of polysemy of individual words. Nevertheless I want to make the case that it can be an effective tool for solving three problems within the computational study of literature and culture.

1) It gives us a useful way of beginning to categorize the lexical orientation of different populations of texts. Unlike topic modeling, where labels are provided after the fact, LIWC categories allow us to test hypotheses in advance. The categories are independent of the texts we are observing.

2) It gives us a way of reducing the dimensional complexity of linguistic features. Given enough documents, novels, poetry, or plays, you can easily end up with tens if not hundreds of thousands of word types when you’re building a given model. That’s often way too many variables from which to make statistical inferences. LIWC offers a very straightforward way of reducing the number of dimensions according to categories that are intrinsically relevant to the study of literature. There is plenty more to be done here to better understand the correlation between features or how much information is lost in this process. But with ca. 80 dimensions you are on much better footing for a variety of modeling tasks than you are with three- or ten-thousand.

3) Finally, these aggregate features allow us to share data that is otherwise not sharable. This is a huge problem in the humanities right now. LIWC provides a solution. Again, it’s not perfect. But it is better than keeping the data locked up.
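To make the second point above concrete, here is a minimal sketch in Python of how a lexicon-based tool collapses word counts into a handful of category frequencies. The tiny `LEXICON` and its category names are invented for illustration and are not LIWC's actual dictionary; real LIWC also matches word stems with wildcards, which this sketch ignores.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: category -> set of words (not LIWC's real dictionary).
LEXICON = {
    "pronoun": {"i", "we", "she", "he", "they"},
    "home": {"house", "kitchen", "family"},
    "insight": {"think", "know", "consider"},
}

def category_profile(text, lexicon=LEXICON):
    """Return each category's share of all tokens in the text, as a percentage."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {category: 0.0 for category in lexicon}
    return {
        category: 100.0 * sum(counts[w] for w in words) / total
        for category, words in lexicon.items()
    }

profile = category_profile("I think we know the house. She left the kitchen.")
```

The text is now described by three numbers rather than one count per word type. The original wording cannot readily be rebuilt from such a table, which is what makes it shareable.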

I hope in the future that people will do more of this kind of sharing of transformed data. While it is always better to have the underlying data so you can understand and be in control of the process of transformation (let alone collection), we can at least start to generate some shared data sets. This is a point nicely discussed by Sarah Allison in the Journal of Cultural Analytics (CA) as well as in a forthcoming piece by Andrew Goldstone.

Perhaps the most important point though is that LIWC is just one way of reducing the complexity of a text into higher-level categories, which are largely in LIWC semantically oriented. We can argue about those categories and the degree of ambiguity within them. More importantly though we need to also think beyond purely semantic models of texts (as many are increasingly doing). We need “LIWC for literature” in a different sense — i.e. in the creation of new kinds of literary features that are derived from texts that aren’t purely semantic. Here I’m thinking of plot features, character features, dialogue features, you name it. There is so much work to do to identify features specific to literary texts. These will ultimately help not only in the sharing of data, but also in the process of literary modeling more generally.

This is a core area of exciting new research that I hope people increasingly engage with.

Upward Looking and Forward Thinking? The Stance of the Modern Novel

What would it mean for the novel to take a stance? To position itself relative to the world? How would it do so and how might we understand this positioning? At the individual level, we can imagine how certain novels are written from a particular orientation to the world,  from “below” as in the case of Notes from Underground, or “above” (Cloud Atlas), “on” (On the Road), or even “along,” as in “along the river” (Huckleberry Finn). Prepositions begin to tell us how novels aren’t just mirrors of the worlds around them, but have a distinct location.

Prepositions tend to be those words that we think have little meaning. Like other small words — conjunctions and articles come to mind — prepositions seem to serve a purpose more than anything else. We often refer to them as “function” words. Occasionally, we just need them. But they can also begin to accumulate, to add up to an overall orientation. Prepositions don’t just serve a purpose, but can also make a point.

These are some of the questions I started with when I began to investigate prepositions in the novel over the last two centuries. What kinds of “positional communities” might we find when we look at novels that tend to use certain types of prepositions in higher than average amounts? And could we discern any larger changes over time in how novels position themselves vis-à-vis the world?

I began with the assumption that there should be no large historical shifts in the positional orientation of the novel. My assumption was that novels could talk about all sorts of relationships (above, below, through, near, far, among, within, etc.) and that any single orientation would get drowned out in the overall proliferation. Prepositions comprise roughly 12% of all words in a novel, which, to put it in perspective, means about 14,000 words in a novel like Pride and Prejudice. It seemed unlikely that this might add up to some kind of spatial bias or that such bias might change distinctly over time. As far as I can tell, I was wrong.

Below is a time / space graph of the novel over two centuries. The further right a novel is, the more it tends to use words like ahead, beyond, next, after and toward instead of before, behind, and back. The further up, the more it favours words like above, up, on, and over instead of below, beneath, down, and under. Prepositions are famously polysemous, and some of these words are not always used as prepositions. But I think the value of thinking at scale is that, in aggregation, these words can begin to show us a larger sense of orientation beyond any single use. As we can see, there does appear to be an overall shift from the nineteenth century (grey) to the late twentieth (black) that moves from the lower left quadrant (below and behind) towards the upper right (after and above).

Ratio of time / space prepositions in 3000 U.S. novels from 1790 to 1990.
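One way the two coordinates of such a plot could be computed is sketched below in Python. The word lists are the ones named above. The log-ratio scoring and the add-one smoothing are assumptions made for illustration, not necessarily the calculation behind the figures.

```python
import math
import re
from collections import Counter

# Word lists taken from the description above.
FORWARD = {"ahead", "beyond", "next", "after", "toward"}
BACKWARD = {"before", "behind", "back"}
UP = {"above", "up", "on", "over"}
DOWN = {"below", "beneath", "down", "under"}

def log_ratio(counts, positive, negative):
    """Smoothed log ratio: > 0 favours `positive`, < 0 favours `negative`."""
    pos = sum(counts[w] for w in positive)
    neg = sum(counts[w] for w in negative)
    return math.log((pos + 1) / (neg + 1))  # add-one smoothing avoids division by zero

def stance(text):
    """Return (x, y): x is the forward/backward axis, y the up/down axis."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return log_ratio(counts, FORWARD, BACKWARD), log_ratio(counts, UP, DOWN)

x, y = stance("She looked back and down, beneath the cliffs, before going below.")
```

Here both coordinates come out negative, which would place this sentence in the lower left quadrant. Note that the sketch counts every occurrence of a word, whether or not it is functioning as a preposition.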

To illustrate just how stark this shift is, I include just novels from the first half of the nineteenth century and then those published after the second world war.

Ratio of time / space prepositions in novels published between 1800 and 1850.
Ratio of time / space prepositions in postwar novels in English.

While I do not yet have gender information about all novels in this data set, for a smaller subset of 200 canonical novels published between 1770 and 2000 we see that the shift looks much the same in the way men and women use these prepositions. This is an interesting finding given that we might assume women’s marginal authorial status would translate into a particular orientation to the world.

Ratio of time / space words in 200 English-novels from the late eighteenth-century to the end of the twentieth colour-coded by gender (red = women, blue = men).

It is also important to note that when we look at other languages we see some differences, both small and large. The caveat is that these data sets are considerably smaller (on the order of 150 to 300 novels) so need to be taken with caution. Nonetheless, given that the canonical English-language novels follow the same path as the larger collection of English novels, it is not unreasonable to assume that this would be the case for German and French.

As we see with the German novel, the move is less bi-axial (right and up) and more one from below to above. In addition, the significance of that shift seems much less pronounced (p-values to come). Though again, for now my German novels stop in the 1930s and thus the postwar shift is unaccounted for.

Excess_NovelStance_German_Plot_18C

Excess_NovelStance_German_Plot_20C

In the French novel, the phenomenon appears to move in reverse! Novels written around 1800 tend to be more forward-looking and those written during the modernist period orient backwards. The question is whether this might look different with more data, or whether this reflects a distinctly different cultural sensibility. Does the Anglo-American novel look forward and upward over time, while the French novel moves back and down and the German hovers somewhat ambiguously in between? Are these reflective of cultural stances, effects of the different samples, or potential flaws in the model?

Excess_NovelStance_French_Plot_Romantic

Excess_NovelStance_French_Plot_Modernism

The larger question of course is what might this all mean. Answering that question will take a great deal more work. But provisionally, if we stay with the English-language novel for now, I believe that we are seeing a trend in how novels handle their relationship to time and space differently. The early-nineteenth-century novel manifests a broadly Romantic picturesque point of view — in which a character surveys the world below her to experience a degree of emotional intensity, to lose herself in melancholy or sublime feeling. Take for example, this passage from The Mysteries of Udolpho:

Emily, often as she travelled among the clouds, watched in silent awe their billowy surges rolling below; sometimes, wholly closing upon the scene, they appeared like a world of chaos, and, at others, spreading thinly, they opened and admitted partial catches of the landscape—the torrent, whose astounding roar had never failed, tumbling down the rocky chasm, huge cliffs white with snow, or the dark summits of the pine forests, that stretched mid-way down the mountains. But who may describe her rapture, when, having passed through a sea of vapour, she caught a first view of Italy.

And then compare it to the openings of works of both popular and high fiction from the twentieth century:

Jewel and I come up from the field, following the path in single file. Although I am fifteen feet ahead of him, anyone watching us from the cotton house can see Jewel’s frayed and broken straw hat a full head above my own. (William Faulkner, As I Lay Dying)

At that very moment, in the very sort of Park Avenue co-op apartment that so obsessed the Mayor…twelve-foot ceilings…two wings, one for the white Anglo-Saxon Protestants who own the place and one for the help…Sherman McCoy was kneeling in his front hall trying to put a leash on a dachshund. The floor was a deep green marble, and it went on and on. It led to a five-foot-wide walnut staircase that swept up in a sumptuous curve to the floor above. It was the sort of apartment the mere thought of which ignites flames of greed and covetousness under people all over New York and, for that matter, all over the world. But Sherman burned only with the urge to get out of this fabulous spread of his for thirty minutes. (Tom Wolfe, Bonfire of the Vanities)

As I said above, we can see how these words can mean different kinds of things in specific contexts (“over” is a great case of a word that can be used in numerous different ways and entire books have been written by linguists to study it). But taken together, these novels begin to orient our minds in a specific direction, one that is both positionally different than Radcliffe and the Romantic novel as well as attitudinally different. Ideology follows orientation, or as Kenneth Burke liked to point out, ideas have a dramatic structure — his famous scene/act ratio suggests that specific thoughts need specific spaces. Prepositions (and adverbs) can give us a sense of relational space, and thus of the types of ideas that can be possible within novelistic writing. At any given moment, a novel may turn us down or up or back or forward. But when it does so passage after passage, again and again in a particular direction, this is saying something about how narrative is relating readers to the world. The question becomes, Can we understand the spectrum of ideas that correspond to the novel’s stance? This is the question I would like to follow next in more detail.

 

 

Prizewinners versus Bestsellers. Timeless Reads or the Spotlight of Fame

This post is the first in a series by this year’s .txtLAB interns. It is authored by Eva Portelance.

Building Corpuses

The first step in our search for answers required that we build solid corpuses for comparison. The prizewinner (PW) corpus was selected from five main literary awards given in the United States, Canada and Britain. These were the National Book Awards, the PEN/Faulkner Award for Fiction, the Governor General Literary Award for Fiction, the Scotiabank Giller Prize and the Man Booker Prize; this last one is also open to international authors who have been published in the United Kingdom. From these awards, all shortlisted books, including the winners, from the years 2005 to 2014 that were available as e-publications in Canada were selected. This amounted to 216 books. Publications that had won several prizes were only added once to the set. As for the bestseller (BS) corpus, the 200 most popular books from the New York Times Bestsellers list from 2008 to 2014 were selected. This criterion was defined by the number of weeks spent on the list. An additional criterion, that the novels had to have been published after 2000, was also applied to better match the publication dates of the PW corpus.

Defining Dictionaries

With the corpuses created, we began testing different avenues in search of clues that could give us a clearer picture of what made these groups distinct within their shared fictionality. The two sets were rather similar, but the most interesting differences seemed to lie in their distinct lexicons, suggesting different themes and approaches to written work in general. To illustrate these differences, we built dictionaries highlighting these themes and behaviours. The process that led to their creation avoided subjective criteria as far as possible to ensure their validity. First, we ran a likelihood test that creates a matrix of words common to a first set, that is, words that seem to be present throughout the corpus and are thus possibly representative of it. This matrix is then cross-referenced with a second set so that only words present in both corpuses are considered, and a Wilcoxon rank-sum test is used to rank and select the 400 most distinctive words, which in turn are likely to be indicative of characteristics of the first set. We ran the test in both directions, thereby creating a dictionary representing each of the corpuses. It is important to note that both sets were stripped of stop words and stemmed, so do not be surprised by the unconventional orthography or lack of inflection in the resulting dictionaries presented in the graphs below. The words used for the subsequent dictionaries investigating theme and language use were selected from these two resulting lists.
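The ranking step described above can be sketched in Python as follows. This is not the code we ran: it assumes per-book relative frequencies as input, skips the initial likelihood test, and uses the normal approximation to the Wilcoxon rank-sum statistic (average ranks for ties, no tie correction to the variance). The function names and the toy numbers are hypothetical.

```python
import math

def rank_sum_z(a, b):
    """z-score of the Wilcoxon rank-sum statistic for sample `a` versus `b`."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        ranks_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n, m = len(a), len(b)
    mean = n * (n + m + 1) / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    return (ranks_a - mean) / sd

def distinctive_words(freqs_a, freqs_b, top_n=400):
    """Rank words present in both corpuses by how strongly they favour corpus A."""
    shared = set(freqs_a) & set(freqs_b)
    scored = {w: rank_sum_z(freqs_a[w], freqs_b[w]) for w in shared}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]

# Toy per-book relative frequencies (invented numbers).
pw = {"river": [0.9, 0.8, 0.7], "phone": [0.1, 0.2, 0.1], "snow": [0.5, 0.4, 0.6]}
bs = {"river": [0.2, 0.1, 0.3], "phone": [0.8, 0.9, 0.7]}
top_pw = distinctive_words(pw, bs, top_n=1)
top_bs = distinctive_words(bs, pw, top_n=1)
```

Running the function in both directions, as in the last two lines, yields one dictionary per corpus. A word like "snow" that appears in only one set is dropped before ranking.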

Timelessness and Momentary

Recurring themes for the PW corpus seemed to be family, nature as well as sadness and spirituality: the key components of a good soul-searching endeavor. This concentration on nature also suggested the importance of descriptive passages. As for BS, the interesting sets of words we explored were technology-related words and vernacular words. What piqued my curiosity the most, however, was not necessarily the distinct themes themselves, but rather the distinctive words used within similar categories. The example I will share here is that of time. PW seemed to use words that spoke of time in terms of visual cues or spatial relations, referencing the age of characters, or seasons, whereas the BS had words that were based on the factual nature of time, like “minute”, “hour” or “yesterday”. With this in mind, I found that the other key categories mentioned for PW could also be looked at in this light, as seeking universal and timeless values such as family and spirituality, whether they are discussed in a positive or negative light. Those mentioned for BS speak of things in passing: technology and popular speech are always evolving and certainly do not represent language or ideas that are expected to withstand time, often expiring even within a few years. These are things that readers will understand and breathe in the moment. To this extent, they propose a very different relation to time than do the writings in the PW corpus. They speak of momentary ideas, and if this also applies to their storylines, it would suggest events of ephemeral pleasure or pain, rather than contemplation.

Language and Thought

To generalise this idea even further, I question whether the use of language in PW and BS is indicative of different intuitions not only about language, but also about the world it chooses to describe. Whether something is well written is often largely a matter of prescriptive rules, and thus there is less interest in knowing what makes a good book. However, what is chosen to be written about and the perspective used to do so is anchored in descriptive thought processing. Therefore, I center my attention for further reflection on a new question: Is the language used by the authors of these books from two distinct sets indicative of a shared thought process or perspective of writing, or even the world they choose to describe?

List of Graphs

Here are bar graphs representing each dictionary mentioned. The dictionaries were originally a little longer, but the most indicative words were selected here. This selection was based on the sparsity level of a word through its most representative corpus. The data presented compares the scaled word count of a specific variable throughout the sets.
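The selection criterion can be read as a document-frequency threshold: the share of books in a corpus in which a word occurs at least once. The Python sketch below illustrates that reading only. The function names and the toy books are hypothetical, and the analysis itself may have defined sparsity differently.

```python
def document_frequency(word, books):
    """Share of books, as a percentage, that contain `word` at least once."""
    if not books:
        return 0.0
    hits = sum(1 for tokens in books if word in tokens)
    return 100.0 * hits / len(books)

def select_words(candidates, books, min_pct):
    """Keep candidate words present in at least `min_pct` percent of the books."""
    return [w for w in candidates if document_frequency(w, books) >= min_pct]

# Toy corpus: each book is a list of stemmed tokens (invented data).
books = [
    ["mother", "river", "snow"],
    ["mother", "phone"],
    ["mother", "river"],
    ["river", "snow"],
]
kept = select_words(["mother", "river", "phone"], books, min_pct=75)
```

The percentages quoted in the captions below correspond to what `document_frequency` returns for each word in each corpus.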

 

The words chosen were present in between 95 and 100 percent of the books in the PW corpus, indicating very high relevance. It is interesting to note the lower presence of male characters in BS compared to female counterparts.
The words chosen were present in between 75 and 100 percent of the books in the PW corpus, indicating very high relevance.
The words chosen were present in between 83 and 95 percent of the books in the PW corpus, indicating very high relevance.
The words chosen were present in between 70 and 90 percent of the books in the PW corpus, indicating very high relevance.
The words chosen were present in between 71 and 98 percent of the books in the BS corpus, indicating very high relevance, whereas in the PW corpus they appear in only 30 to 90 percent of the books, a much less uniform presence.
These words are less common than the other sets; however, most are still very common in the BS corpus and very uncommon in the PW corpus, indicating a clear point of comparison. They were present in between 40 and 95 percent of the books in the BS corpus, whereas in the PW corpus they appear in only 23 to 80 percent of the books.

 

The New Young Adult Fiction. More Human, More Me.

What difference does an editor make?

This was the question posed by a recent profile of the highly successful editor of young adult fiction, Julie Strauss-Gabel, who manages the imprint Dutton Children’s Books. Her titles have consistently performed well over recent years and it was a timely reminder of the impact that a good editor can have on writers’ careers. In an age of dwindling resources in publishing – yet another sign of the disappearing middle – a good editor appears to make a big difference.

Here at .txtLAB, we were curious to see whether Strauss-Gabel’s books were recognizably different from other successful young adult fiction. Was the success of her books due more to extrinsic factors, like reputation or marketing, or did her list have something unique about it in terms of content? Understanding what makes these books stand out is not only a way for other editors to keep up with the competition. It could also be useful for aspiring writers who don’t have access to high-powered agents or editors (or pre-established reputations like John Grisham).

 

So what difference does a Dutton imprint make? To find out, we compared the 22 most recent Dutton books to a collection of 200 titles drawn from the Goodreads Best of 2014 young adult fiction list and Amazon’s list of best-selling YA fiction. We used the popular data mining tool, LIWC, developed by James W. Pennebaker, which compares linguistic features across 85 different dimensions, including grammatical features like pronouns and punctuation and more complex phenomena like social, cognitive, and perceptual processes.

The first and most salient point that we found is that there is a statistically significant difference between Ms. Strauss-Gabel’s books and other popular young adult fiction. This is not something that should be taken for granted. But her editorial sensibility has indeed produced a unique signature among her books.
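For readers curious what a significance test of this kind can look like, here is a minimal Python sketch of a two-sided permutation test on a single feature. It is an illustration only: the post does not say which test was used, and the per-book values below are invented.

```python
import random

def permutation_p_value(group_a, group_b, trials=10000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # reassign books to groups at random
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(mean(a) - mean(b)) >= observed:
            extreme += 1
    return (extreme + 1) / (trials + 1)

# Invented per-book rates of first-person pronouns (percent of tokens).
dutton = [9.1, 8.7, 9.5, 8.9, 9.3, 9.0]
other = [6.2, 6.8, 5.9, 6.5, 6.1, 6.4, 6.0, 6.6]
p = permutation_p_value(dutton, other, trials=2000)
```

A small `p` means that random reshuffling of the books rarely produces a gap as large as the one observed between the two groups.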

When we looked more closely at which features made her books stand out, we found Dutton books were defined by stylistic aspects like the use of first person pronouns (I, We), a vocabulary of inclusivity (and, with, plus), and a greater use of conjunctions and commas, suggesting more complex sentence structure (sentences were also longer on average). An emphasis on time-words also emerged, suggesting a greater degree of narrative sophistication (or at least diversity). Finally, the Dutton books tended to focus more explicitly on “humans” (adults, boys, girls, etc.), suggesting an investment in the description of people rather than emotions. Perhaps surprisingly, the only significant theme to emerge was “money.”

Recent discussions of Ms. Strauss-Gabel’s list have indeed emphasized the sophistication of her taste (the Times spoke of her “high-quality”), and we can see this reflected in the way her books rely on longer and more complex sentences, an important point if you’re trying to guide your own young adult towards more high-brow material. More interesting, and so far unnoticed, is the way her books are more focused on individuals as well as social belonging. The features that were most indicative of non-Dutton books, for example, were body-related words (heart, head, hands), words related to feelings (caress, feel, grab), and a range of negative emotions (anger and anxiety being the highest). These struck me as the more stereotypically teenage: an attention to physique, physical sensation, and emotional negativity capture the adolescent imaginary rather well. It is decidedly interesting that the Dutton books don’t fit this mold, indicating a potential new trend towards more humanistic YA fiction, away from the dystopian worlds of hunger games and fantasy conflict.

These trends became even more pronounced when we subsetted our Amazon list by the top- and bottom-selling books. Top sellers emphasized death far more than their bottom counterparts, as well as the pronouns “we” and “they,” suggesting a highly binary, and collective, moral universe. They focused more on space and the present tense, while the lesser-selling books focused on the past tense, positive emotions, friends, religion, sex and family. If you wanted to make it to the top of the heap in the last few years in young adult fiction, then your best bet was to go negative, stay in the present, and avoid families and positive emotions. Once again, the Dutton books seem to stand out in how different they are.

One question we had was whether Ms. Strauss-Gabel’s list was more diverse than other samples of young adult fiction. As the New York Times noted, “Ms. Strauss-Gabel’s books are strikingly diverse.” According to three different measures of similarity, we found that the Dutton books tended to be significantly less similar to each other than comparable sample sizes of young adult fiction from our other collections. There appears to be more linguistic range to the Dutton books than in the typical subset of the genre as a whole, one more indication of these books’ sophistication.
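The three similarity measures are not named here. As one illustrative possibility, the Python sketch below uses mean pairwise cosine similarity over feature vectors (for example, LIWC category scores) and compares a small group against random samples of the same size drawn from a larger collection. All names and numbers are hypothetical.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all pairs in a group."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def share_of_samples_more_similar(group, collection, trials=1000, seed=0):
    """Fraction of same-size random samples that are more self-similar than `group`."""
    rng = random.Random(seed)
    observed = mean_pairwise_similarity(group)
    more = sum(
        mean_pairwise_similarity(rng.sample(collection, len(group))) > observed
        for _ in range(trials)
    )
    return more / trials

# Invented feature vectors: a varied group versus a homogeneous collection.
group = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collection = [[1.0, 1.0, 1.0 + 0.01 * i] for i in range(20)]
share = share_of_samples_more_similar(group, collection, trials=200)
```

A share close to 1 means that almost every comparable random sample is more internally similar than the group in question, which is the pattern reported for the Dutton books.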

This last point made us curious whether the sophistication of Dutton Books indicated that they were in fact more “adult” than “young adult.” As the Times pointed out, more and more adults are reading YA fiction these days. Is this a dumbing down of readers or a scaling-up of the genre? When we compared our young adult fiction collection with a collection of bestselling fiction, it turned out that the non-Dutton books tended to be slightly more similar to their adult counterparts, though the significance of this was minimal. What this suggests is that the Dutton books in particular are not mimicking bestselling adult books in any overtly recognizable way. Indeed, young adult fiction in general tends to look more like itself than adult bestsellers. While many people are critical of the rising popularity of young adult books, I was pleased to see that genre differences continued to exist for readers of different ages. Young adult doesn’t seem to be blending into adult (or vice versa), at least for now.

These were just some of the things that we were able to learn in our lab in a few days studying these books. The insights are of course coarser than what a highly trained human reader might be able to offer. But they are also more generalizable and less dependent on individual judgment. Every writer knows someone whom she can ask for an opinion of her manuscript. But not every writer has access to an understanding of a genre as a whole or trends in readers’ taste. This is where computers can be useful and, I feel, democratizing. They can make something as complex as the publishing industry – which can look like a secret society from the outside – seem more transparent.

 

Validation and Subjective Computing

Like many others I have been following the debate between Matthew Jockers and Annie Swafford regarding the new syuzhet R package created by Jockers, which has been given a very nice storified version by Eileen Clancy. As others have pointed out, the best part of the exchange has been the civility and depth of replies, a rare thing online these days.

To me, what the debate has raised more than anything else is the question of validation and its role within the digital humanities. Validation is not a process that humanists are familiar with or trained in. We don’t validate a procedure; we just read until we think we have enough evidence to convince someone of something. But as Swafford has pointed out, right now we don’t have enough evidence to validate or — and this is a key point — invalidate Jockers’ findings. It’s not enough to say that sentiment analysis fails on this or that example or the smoothing effect no longer adequately captures the underlying data. One has to be able to show at what point the local errors of sentiment detection impact the global representation of a particular novel or when the discrepancy between the transformed curve and the data points it is meant to represent (goodness of fit) is no longer legitimate, when it passes from “ok” to “wrong,” and how one would go about justifying that threshold. Finally, one would have to show how these local errors then impact the larger classification of the 6 basic plot types.
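One way to put a number on the discrepancy between a transformed curve and the data it summarizes is sketched below in Python. A simple moving average stands in for the smoothing step and root-mean-square error for goodness of fit. Both are assumptions chosen for illustration and are not the transformation used in the syuzhet package. The sentence scores are invented.

```python
import math

def moving_average(values, window):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def rmse(observed, fitted):
    """Root-mean-square error between raw scores and the smoothed curve."""
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, fitted)) / len(observed))

# Invented sentence-level sentiment scores for a short text.
scores = [1, -1, 2, -2, 3, -3, 2, -1, 0, 1]
curve = moving_average(scores, window=3)
fit = rmse(scores, curve)
```

The number by itself does not say when a fit passes from "ok" to "wrong". That threshold still has to be argued for, which is exactly the point at issue.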

As these points should hopefully indicate, and they have been duly addressed by both Jockers and Swafford, what is really at stake is not just validation per se, but how to validate something that is inherently subjective. How do we know when a curve is “wrong”? Readers will not universally agree on the sentiment of a sentence, let alone more global estimates of sentimental trajectories in a novel. Plot arcs are not real things. They are constructions or beliefs about the directionality of fortune in a narrative.  The extent to which readers disagree is however something that can and increasingly must be studied, so that it can be included in our models. As we’ve recently undertaken here at .txtLAB, in order to study social networks in literature we decided to study the extent to which readers agree on basic narrative units within stories, like characters, relationships, and interactions. It has been breathtaking to see just how much disagreement there is (you’d never guess that readers do not agree on how many characters there are in 3 Little Pigs — and it’s in the title). Before we extract something as subjectively constructed as a social network or a plot, we need to know the correlations between our algorithms and ourselves. Do computers fail most when readers do, too?
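Reader disagreement of this kind can be quantified. The Python sketch below computes Cohen's kappa, a standard chance-corrected agreement score for two annotators. The labels are invented, and nothing here should be taken as the measure the lab actually used.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Two hypothetical readers deciding whether each candidate is a character.
reader_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
reader_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
kappa = cohens_kappa(reader_1, reader_2)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. The same score can be computed between a reader and an algorithm, which is one way to ask whether computers fail where readers do.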

What I’m suggesting is that while validation has a role to play, we need a particularly humanistic form of it. As I’ve written elsewhere on conversional plots in novels, validation should serve as a form of discovery, not confirmation of previously held beliefs (see the figure below). Rather than start with some pre-made estimates of plot arcs, we should be asking what do these representations tell us about the underlying novels? Which novels have the worst fit according to the data? Which ones have the worst fit according to readers? How can this knowledge be built into the analytical process in a feedback loop rather than a single, definitive statement? How can we build perspective into our exercises of validating algorithms?

While I don’t have any clear answers right now, I know this is something imperative for our field. We can’t import the standard model of validation from computer science because we start from the fundamental premise that our objects of study are inherently unstable and dissensual. But we also need some sort of process to arrive at interpretive consensus about the validity of our analysis. We can’t not validate either.

The debate between Jockers and Swafford is an excellent case in point where (in)validation isn’t possible yet. We have the novel data, but not the reader data. Right now DH has all the texts, but not enough perspectives.

Here’s a suggestion: build a public platform for precisely these subjective validation exercises. It would be a way of basing our field on new principles of readerly consensus rather than individual genius. I think that’s exciting.

This diagram captures the different stages of computational reading and the different types of practices each stage entails. Traditional close reading encompasses the first stage of “belief.” Current understandings of distant reading bring us as far as “measurement.” This model advocates for the continuation of the process in an oscillatory fashion, moving back and forth between close and distant forms of reading in order to approach an imaginary conceptual center. The initial sample (here Augustine’s Confessions) is chosen and understood with reference to a larger category (here “The Novel”), as is the new sample of quantitatively significant texts derived from the model (“Sample2”). “Sample2” is also mediated by the larger sample from which it is drawn (“Whole’”, here my subset of 450 novels that are representative of “The Novel”). The process of interpreting “Sample2” is both one of validation – did the model work – and also one of refinement – in what other ways can we understand and thus measure this group of texts? The overall process is represented as a spiral that does not return to the initial sample, but gradually, though never completely, converges on an imagined generic center.

 

The Eighteenth-Century Family

[Animation: Novel_Feelings_Family_1750_1799]

 

 

This animation represents the emotional network of the family in the eighteenth-century novel. It measures the co-occurrence of emotions and family members within sentences in a sample of eighty novels in English published between 1750 and 1800. It begins with the most strongly weighted connection (“man”-“good”) and then gradually grows to include the entire network. Overall what is striking about this network (compared to the general emotion network) is the high degree of heterogeneity of emotions surrounding family members. I had expected far clearer divisions, but while the eighteenth-century family does have a fairly coherent core, its larger network appears to involve quite a range of emotions. Families have been complicated for a long time.

Some notable moments to look for:

– the opening dyad of “man” and “good” tells us a great deal about beliefs about the family

– the dyad gradually grows to include “man”, “woman”, and “god” organized around “good”, “love”, and “fortune”

– “person” appears before “mother”

– the first negative emotion is “cried”

– with “passion” comes “pleasure” and “death”

– brothers appear before sisters, but girls appear before boys

– “fear” comes before “bad,” which is followed by “pride”

– “child” enters quite late, along with those moral words like “respect”, “friendship”, “care”

– “desire” and “melancholy” enter with “afraid” but also “tenderness”

– more and more sad words will accrue around “mind”, while more and more happy words will accrue around “woman”

– finally, a load of anger words (“revenge”, “aversion”, “prejudice”) enter last.
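The logic behind the animation described above (count the sentences in which a family term and an emotion word co-occur, then reveal connections from strongest to weakest) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind the animation: the two word lists are tiny made-up stand-ins, the sentence splitter is deliberately naive, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, tiny word lists; the study's actual lists are not given here.
FAMILY = {"man", "woman", "mother", "brother", "sister", "child"}
EMOTION = {"good", "love", "fear", "pride", "cried", "pleasure"}


def sentences(text):
    """Naive sentence splitter on ., ! and ? (an assumption for illustration)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def family_emotion_edges(text):
    """Count sentences in which a family term and an emotion word co-occur."""
    edges = Counter()
    for sentence in sentences(text):
        # A set, so each pair is counted at most once per sentence.
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        for fam in words & FAMILY:
            for emo in words & EMOTION:
                edges[(fam, emo)] += 1
    return edges


def reveal_order(edges):
    """Edges from strongest to weakest, ties broken alphabetically."""
    return sorted(edges.items(), key=lambda item: (-item[1], item[0]))


sample = (
    "He was a good man. The man was good to his mother. "
    "Her mother cried. The child felt fear and love."
)
for (fam, emo), weight in reveal_order(family_emotion_edges(sample)):
    print(fam, emo, weight)
```

Revealing the edges in this order is what would let a strong dyad such as “man” and “good” appear first, with weaker and rarer pairings joining the network later.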

Emotion Networks in the Novel

For my ongoing project on the history of emotions in the novel, I thought I’d post a first pass of emotion networks that appear in the Romantic Novel versus the Postwar Novel. The networks are based on emotion words that occur in the same sentence. The more often two emotion words appear in the same sentence, the stronger their connection and the closer together they will appear. The size of the word is an indication of the number of different emotion words that each word connects with.
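A rough sketch of how a network of this kind could be assembled is given below. It assumes sentences have already been split into tokens, uses a made-up eight-word dictionary in place of the real emotion dictionary, and treats word size as the count of distinct neighbours; all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical mini-dictionary mapping emotion words to emotion categories.
EMOTION_WORDS = {
    "joy": "Joy", "happy": "Joy", "love": "Love", "tenderness": "Love",
    "grief": "Sadness", "melancholy": "Sadness", "fear": "Fear", "rage": "Anger",
}


def build_network(tokenized_sentences):
    """Return {word: {neighbour: weight}} for emotion words sharing a sentence."""
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in tokenized_sentences:
        # Distinct emotion words present in this sentence.
        present = sorted({t.lower() for t in tokens if t.lower() in EMOTION_WORDS})
        for a, b in combinations(present, 2):
            # Edge weight = number of sentences in which the pair co-occurs.
            graph[a][b] += 1
            graph[b][a] += 1
    return {word: dict(neighbours) for word, neighbours in graph.items()}


def degree(graph, word):
    """Number of different emotion words that `word` connects with."""
    return len(graph.get(word, {}))


sents = [
    ["She", "felt", "joy", "and", "love"],
    ["Love", "and", "fear", "mingled"],
    ["His", "grief", "was", "love", "too"],
    ["Joy", "and", "love", "again"],
]
network = build_network(sents)
for word in sorted(network, key=lambda w: -degree(network, w)):
    print(word, degree(network, word), network[word])
```

The edge weights would drive how close two words sit in a layout, while the degree would drive the size of each word.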

The initial finding of interest here is the way the postwar network is both less dense and also more heterogeneous (what network scientists would call a decline of assortativity). The emotional intensity of the novel has declined, but the emotional complexity has arguably increased. Emotion words are not grouping quite as strongly with other words from their own emotion category. The hypothesis would be that there is more emotional conflict happening at the sentence level of the novel as it appears in the second half of the twentieth century.

These networks represent small sets of around 40 novels each. I am taking a second pass on larger data sets and am curious if the results hold. I will also be calculating the actual measures of things like density and assortativity to better understand the extent of this shift. The next step will be going in and finding out what it means when different kinds of emotion words appear in sentences together. What is being captured here?
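For reference, the two measures mentioned here can be computed along the following lines. This is a sketch under assumptions rather than the calculation planned for the project: it ignores edge weights, takes each undirected edge once, and uses Newman's assortativity coefficient for a categorical attribute (here the emotion category). The small example graph is invented.

```python
from collections import defaultdict


def density(nodes, edges):
    """Share of possible undirected edges that are actually present."""
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(edges) / (n * (n - 1))


def attribute_assortativity(edges, category):
    """Newman's assortativity coefficient for a categorical node attribute.

    `edges` lists each undirected edge once; `category` maps node -> label.
    """
    mixing = defaultdict(float)
    for u, v in edges:
        # Count each edge in both directions so the mixing matrix is symmetric.
        mixing[(category[u], category[v])] += 1
        mixing[(category[v], category[u])] += 1
    total = sum(mixing.values())
    if total == 0:
        raise ValueError("need at least one edge")
    row_share = defaultdict(float)  # share of edge ends attached to each category
    same_share = 0.0                # share of edge ends joining the same category
    for (cat_u, cat_v), count in mixing.items():
        row_share[cat_u] += count / total
        if cat_u == cat_v:
            same_share += count / total
    chance = sum(share ** 2 for share in row_share.values())
    if chance == 1.0:
        raise ValueError("undefined when every edge end has the same category")
    return (same_share - chance) / (1 - chance)


# Hypothetical four-word network with two emotion categories.
category = {"joy": "Joy", "happy": "Joy", "grief": "Sadness", "melancholy": "Sadness"}
edges = [("joy", "happy"), ("grief", "melancholy"), ("joy", "grief")]
print(density(category.keys(), edges))
print(attribute_assortativity(edges, category))
```

A coefficient near 1 means emotion words connect mostly within their own category, a value near 0 means categories mix about as much as chance would predict, and negative values mean words connect mostly across categories.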

I think these graphs give a nice initial idea of the ways in which the emotional networks of the novel have changed over the course of two centuries.

Network of emotions in 40 novels written in English between 1800 and 1851, from Maria Edgeworth’s Castle Rackrent to Nathaniel Hawthorne’s House of the Seven Gables. Yellow = Joy, Green = Love, Blue = Sadness, Purple = Fear, and Red = Anger. The underlying edges between emotion words have been removed for clarity. Emotions are based on a dictionary of 872 emotion words.
Network of emotions in 42 novels written in English between 1943 and 2000, from Betty Smith’s A Tree Grows in Brooklyn to Zadie Smith’s White Teeth. Yellow = Joy, Green = Love, Blue = Sadness, Purple = Fear, and Red = Anger. The underlying edges between emotion words have been removed for clarity. Emotions are based on a dictionary of 872 emotion words.

 

 

NovelTM

This partnership brings together 21 researchers and partners from academic and non-academic institutions in order to produce the first large-scale, cross-cultural study of the novel according to quantitative methods. Ever since its putative rise in the eighteenth century, the novel has emerged as a central means of expressing what it means to be modern. And yet despite this cultural significance, we still lack a comprehensive study of the novel’s place within society that accounts for the vast quantity of novels produced since the eighteenth century, the period most often identified as marking the origins of the novel’s quantitative rise. Our aim is thus twofold: 1) to enliven our understanding of one of the most culturally significant modern art forms according to new computational means, and 2) to establish the methodological foundations of a new disciplinary formation. Text mining is arguably one of the most important fields driving growth, innovation, and even citizenship within a modern information economy. This partnership seeks to bring the unique knowledge of literary studies to bear on larger debates about text mining and the place of information technology within society. In so doing, it will impact how we think about the nature of reading and the way we increasingly access our cultural heritage today.