How Cultural Capital Works: Prizewinning Novels, Bestsellers, and the Time of Reading

This new essay published in Post45 is about the relationship between prizewinning novels and their economic counterparts, bestsellers. It is about the ways in which social distinction is symbolically manifested within the contemporary novel and how we read social difference through language. Not only can we observe very strong stylistic differences between bestselling and prizewinning writing, but this process of cultural distinction appears to revolve most strongly around the question of time. The high cultural work of prizewinning novels is defined above all by an attention to childhood, nature, and retrospection, while the economic work of bestsellers is defined by a diligent attention to the moment. As the forthcoming work of James English shows, it is these temporal frameworks, or what Bakhtin might have called “chronotopes,” that emerge as some of the more meaningful ways to distinguish the work of cultural capital from that of economic capital.

The approach we use draws on the emerging field of textual analytics within the framework of Bourdieu’s theory of the literary field. Our interest lies in exploring a larger population of works, but also the ways in which groups of works help to mutually define one another through their differences. As Bourdieu writes, “Only at the level of the field of positions is it possible to grasp both the generic interests associated with the fact of taking part in the game and the specific interests attached to different positions.” We wanted to test the extent to which “bestsellers” and “prizewinners” cohere as categories and how this coherence may be based on meaningful, and meaningfully distinguishing, textual features.

Our aim is to make visible the values and ideological investments that accrue around different cultural categories (the act of “position taking,” in Bourdieu’s words) and the ways in which these horizons of expectation help maintain positions of power and social hierarchy. Our project is ultimately about asking how forms of social and symbolic distinction correspond and how that knowledge may be used to critique normative assumptions about what counts as significant within the literary field. The high cultural investment in retrospection that we see on display should not be taken as a default, but can also be seen in a critical light: as privileging a more regressive, backward-looking narrative mode.

For the full article, you can go here.

How I predicted the Giller Prize (and still lost the challenge)

This Fall we created a lab challenge to see if anyone could predict this year’s Giller Prize winner using a computer. The winner was announced last night, and it turns out I correctly predicted the winner. But I still lost the challenge. In this lies an instructive tale about humans, computers, and predicting human behaviour. Let me explain.

The rules for our challenge were straightforward. Given data about prizewinners and jury members over the past ten years, could your algorithm correctly predict both last year’s winner and this year’s? (Since you have a 1 in 12 chance of guessing correctly once the long list is announced, we wanted to lower the odds a bit.) My algorithm predicted André Alexis’s Fifteen Dogs, which was the winner of this year’s prize. However, if you go to my earlier post and look at our predictions, you’ll see he isn’t listed anywhere. Here’s how I predicted his book correctly and why I changed my answer.

First, the prediction.

As we’ve learned over the past year, identifying what makes a literary prizewinner stand out from the pack is actually quite challenging. This is somewhat surprising, as one of the things we have been learning in our lab is just how formulaic writing is in general. We can predict novels out of a pool of random books very easily (with about 95-96% accuracy), just as we can tell the difference between certain types of novels, say romances and science fiction, with a high degree of accuracy (usually around 98%). We can even predict bestsellers, though currently not nearly as well (around 75%); I suspect that number could improve with more work.
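For readers who want a feel for how a classification like this works in practice, here is a minimal sketch in R. It is an illustration only, not our lab’s exact pipeline: the choice of a linear SVM, the use of word frequencies as features, and the object names are all assumptions.

```r
# Minimal sketch of cross-validated text classification (not the lab's exact code).
# Assumes `dtm` is a document-term matrix of relative word frequencies (rows = novels)
# and `genre` is a factor of labels such as "romance" / "scifi"; both names are hypothetical.
library(e1071)

fit <- svm(x = dtm, y = genre, kernel = "linear", cross = 10)  # 10-fold cross-validation
fit$tot.accuracy  # cross-validated accuracy, comparable to the percentages quoted above
```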

Prizewinners, on the other hand, present a different story. At the bottom of this post are two graphs that show just how unclear the distinctions are between prizewinners and their next of kin (non-winners).

But this messiness begins to clear up if you approach the problem not as one of similarity (what do prizewinners have in common?) but as one of dissimilarity: what do prizewinners do differently from other books, including previous prizewinners?

The way I modelled the winning novel over the past two years was simply to look for the novel that was most dissimilar from previous prizewinners and from the jury members’ own novels (based on about 80 different linguistic features derived from the Linguistic Inquiry and Word Count (LIWC) software). Instead of looking for those prizewinning features that all novels have in common (make ’em cry, Johnny!), I began to think about the problem as one of modelling human behaviour. It turns out, at least in this instance, jury members are looking for something fresh, something they haven’t seen before, either in past winners or in their own writing. André Alexis’s Fifteen Dogs is just such a book.
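For the curious, here is a rough sketch in R of that dissimilarity ranking. It illustrates the idea rather than reproducing my exact script: the feature table, group labels, and function name are all hypothetical.

```r
# Sketch of the dissimilarity ranking. Assumes a data frame `feats` of ~80 LIWC-style
# feature scores per novel (rows = novels) and a character vector `group` with values
# "candidate", "past_winner", or "jury"; those names are hypothetical.
rank_dissimilarity <- function(feats, group) {
  X <- scale(as.matrix(feats))                      # z-score each feature, Delta-style
  cand <- X[group == "candidate", , drop = FALSE]
  delta_to <- function(ref_rows) {
    ref <- colMeans(X[ref_rows, , drop = FALSE])    # mean profile of the reference group
    apply(cand, 1, function(v) mean(abs(v - ref)))  # Burrows-style Delta to that profile
  }
  d_winners <- delta_to(group == "past_winner")
  d_jury    <- delta_to(group == "jury")
  # Rank so that 1 = most dissimilar, then combine the two ranks (as in Fig. 1 below)
  combined <- rank(-d_winners) + rank(-d_jury)
  sort(combined)                                    # lowest combined rank = prediction
}
```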

Now here is where the human operator comes in and messes everything up.

As you can see in the figure below (fig. 1), both André Alexis and Sean Michaels (the past two winners) rank lowest in terms of being similar to previous winners and jury members. OK, I was done. Being curious, however, I wanted to review my findings, so I went ahead and read the book! Instead of helping, this actually messed things up. For those who have read Fifteen Dogs, you will know it is a wacky, intense, challenging book. It is great and wild. My human intuition told me no prize committee would ever pick such a challenging book. I’ve sat on prize committees before. I’ve read a lot of prizewinners. I didn’t think this could happen.

So instead of being satisfied with my answer, I went looking for data to back me up. If you look at the second set of columns (fig. 2) you see a different way of measuring similarity. And there Sean Michaels didn’t look so unique. André Alexis still did. That second measure was a kind of reality check that assessed just how different people’s writing styles were from each other overall. Sean Michaels wasn’t radically different from his peers, but Alexis was. So I added another filter that said, you have to be the most dissimilar according to the first measure, but near the mean for the second. What this said was that you needed to be dissimilar to previous years’ winners, but not too far out in terms of style.
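Here is a sketch of that second check and the (ultimately misguided) combined filter, again as an illustration with hypothetical object names rather than my exact code.

```r
# Second check: cosine similarity of each candidate's feature vector to the group mean.
# Values near the middle of the pack mean the style is not a radical outlier.
cosine_to_mean <- function(feats) {
  X <- as.matrix(feats)
  m <- colMeans(X)
  apply(X, 1, function(v) sum(v * m) / (sqrt(sum(v^2)) * sqrt(sum(m^2))))
}

# The combined filter: most dissimilar on the Delta ranks, but near the mean on cosine
# similarity. Assumes both vectors are named and in the same candidate order; the
# one-standard-deviation threshold is illustrative.
pick_winner <- function(delta_rank, cos_sim) {
  near_mean <- abs(cos_sim - mean(cos_sim)) <= sd(cos_sim)
  names(which.min(delta_rank[near_mean]))
}
```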

I was modelling human behaviour according to my theory of committees. It turns out I made a bad choice. And now I get to be forever known as the guy who predicted the Giller Prize but doesn’t get to take credit for it because he second-guessed the computer.

Why does any of this matter? Because it shows just how involved humans are in the process of data modelling. I made those initial choices to correctly predict the winner. The computer was a useful tool. I then thought more about my problem, looked more closely at the data, and decided to model my problem differently. The computer was still a useful tool, only my thinking about the problem no longer was.

Let’s say Martin John, my final prediction, had won. I’d look like a genius (or whatever) for having constructed my model to take into account expectations about group behaviour. But it turns out I didn’t do a good job of anticipating this group. That’s either bad luck or means I needed more data about how small groups make decisions.

Why do I love this work so much? Because it shows us how human behaviour is so unpredictable, whether it is the work of committees or data scientists. This is especially true when it comes to deciding which novels we love to read.

 

Fig. 1. This table shows the relative similarities between the authors and the average for a given category. The far left column refers to previous prizewinners, the column to its right to the jury members’ novels. The values are ranks, so a lower rank means being more dissimilar from the mean of that group. As you can see, both Sean Michaels and André Alexis score lowest on combined ranks. To calculate the first similarity score I use Burrows’s Delta. What are we measuring in the novels? About 80 different features derived from the Linguistic Inquiry and Word Count (LIWC) software designed by James Pennebaker.

 

Fig. 2. This image shows the second set of measures, in which Sean Michaels scores nearer the middle of the pack and André Alexis continues to be far from the average. Here similarity is measured as the cosine similarity between each feature vector and the group mean.

 

Fig. 3. Results of prizewinner prediction using the stylo package in R. Past winners are highlighted in green. The 3,000 most frequent words and Classic Delta were used. In general there was stronger clustering the more words were included in the model, with 3,000 representing an optimal limit.
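For reference, a clustering like the one in Fig. 3 can be produced with a stylo call along these lines; the corpus directory is a hypothetical path, and the texts are assumed to be plain .txt files named author_title.txt.

```r
# Sketch of a stylo run roughly matching the settings described in Fig. 3.
library(stylo)
stylo(gui = FALSE,
      corpus.dir = "corpus",           # folder of plain-text novels (hypothetical path)
      corpus.format = "plain",
      mfw.min = 3000, mfw.max = 3000,  # the 3,000 most frequent words
      distance.measure = "dist.delta", # Classic (Burrows's) Delta
      analysis.type = "CA")            # cluster analysis dendrogram
```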
Fig. 4. Cluster plot showing the similarities between prizewinning novels and novels reviewed in the New York Times Sunday Book Review. Under current methods, our chances are no better than random when guessing which novels will eventually win prizes. The cluster plot was generated using principal component analysis of 80 features derived using the LIWC linguistic analysis software.
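The cluster plot itself is ordinary principal component analysis. A minimal sketch, assuming a data frame `liwc` of the 80 LIWC feature scores per novel and a vector `label` marking prizewinners (both names are hypothetical):

```r
# PCA of the LIWC feature scores, plotting the first two components.
pca <- prcomp(liwc, center = TRUE, scale. = TRUE)
plot(pca$x[, 1:2],
     col = ifelse(label == "prizewinner", "darkgreen", "grey50"),
     pch = 19, xlab = "PC1", ylab = "PC2",
     main = "Prizewinners vs. NYT-reviewed novels (LIWC features)")
```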

 

Can a computer predict a literary prize?

This evening the Giller Prize winner will be announced. For those not in the know, the Giller Prize is Canada’s most prestigious literary award. Like the Man Booker in the UK or the National Book Award in the US, the Giller Prize serves as a way of signalling important new fiction to Canadian readers. It relies on the judgments of experts, not markets. This is what makes literary prizes unique, but also challenging to predict (and of course fun, since the outcome is highly uncertain in advance).

For the past year we have been studying what makes prizewinning novels unique when compared with novels that don’t win prizes. (Our newest paper will be available shortly). As part of that project we created a lab challenge to see if anyone in the lab was able to predict this year’s Giller Prize winner. We wanted to see how well a computer could track human judgments when it comes to literary value. Are there common traits that unite prizewinning novels?

For the purposes of the competition, we supplied students in our lab with all novels from the Giller long lists over the past two years, one sample novel from each jury member for the past two years, and then 10 years of historical data on past winners and shortlists for both the Giller Prize and 4 other literary prizes.

The rules for the competition were straightforward:

a. your algorithm has to predict both last year’s winner and this year’s in order to win (since you have a 1 in 12 chance of being right in any given year once the long list is announced, we thought we’d up the ante to avoid just being lucky…)

b. your predictions must be submitted before Monday, October 5, 2015, when the short list will be announced. They will be forwarded to the Dean’s Office for safe-keeping.

c. Good luck. You’ll need it 🙂

And the predictions are:

 

  1. All True Not a Lie in It by Alix Hawley
  2. All True Not a Lie in It by Alix Hawley
  3. Close to Hugh by Marina Endicott
  4. Outline by Rachel Cusk
  5. Martin John by Anakana Schofield

And the winner is…

Prizewinners versus Bestsellers: Timeless Reads or the Spotlight of Fame

This post is the first in a series by this year’s .txtLAB interns. It is authored by Eva Portelance.

Building Corpora

The first step in our search for answers required that we build solid corpora for comparison. The PW (prizewinner) corpus was selected from five main literary awards given in the United States, Canada, and Britain: the National Book Award, the PEN/Faulkner Award for Fiction, the Governor General’s Literary Award for Fiction, the Scotiabank Giller Prize, and the Man Booker Prize (this last one also awards international authors who have been published in the United Kingdom). From these awards, all shortlisted books, including the winners, from 2005 to 2014 that were available as e-publications in Canada were selected. This amounted to 216 books. Publications that had won several prizes were only added once to the set. As for the BS (bestseller) corpus, the 200 most popular books from the New York Times Bestsellers list from 2008 to 2014 were selected, with popularity defined by the number of weeks spent on the list. The additional criterion that the novels had to have been published post-2000 was also applied, to better match the publication dates of the PW.

Defining Dictionaries

With the corpora created, we began testing different avenues in search of clues that could help us build a clearer picture of what made these groups distinct within their shared fictionality. The two sets were rather similar, but the most interesting differences seemed to lie in their distinct lexicons, suggesting different themes and approaches to writing in general. To illustrate these differences, dictionaries highlighting these themes and behaviours were created. The process that led to their creation was thorough and avoided subjective criteria as much as possible to ensure their validity. First, we ran a likelihood test that produces a matrix of words common to a first set, that is, words that appear throughout the corpus and are thus possibly representative of it. This matrix is then cross-referenced with a second set so that we only consider words present in both corpora, and a Wilcoxon rank-sum test is used to rank and select the 400 most distinctive words, which in turn are likely to be indicative of characteristics of the first set. We ran the test in both directions, thereby creating a dictionary representing each of the corpora. It is important to note that both sets were stripped of stop words and stemmed, so do not be surprised by the unconventional orthography or lack of inflection in the resulting dictionaries presented in the graphs below. The words used for the subsequent dictionaries investigating theme and language use were selected from these two resulting lists.
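A rough sketch of this dictionary-building step in R, as an illustration only: the document-term matrices and their names are assumptions, and the initial likelihood filtering is omitted, but the Wilcoxon ranking and the 400-word cutoff follow the description above.

```r
# Sketch of the dictionary construction. Assumes `dtm_pw` and `dtm_bs` are
# document-term matrices (rows = books, columns = stemmed words, stop words removed)
# holding relative frequencies; the object names are hypothetical.
shared <- intersect(colnames(dtm_pw), colnames(dtm_bs))   # words present in both corpora

# For each shared word, test whether its use differs between the two sets
p_vals <- sapply(shared, function(w) wilcox.test(dtm_pw[, w], dtm_bs[, w])$p.value)

# Split by direction and keep the 400 most distinctive words for each corpus
leans_pw <- colMeans(dtm_pw[, shared]) > colMeans(dtm_bs[, shared])
pw_dictionary <- head(names(sort(p_vals[leans_pw])), 400)
bs_dictionary <- head(names(sort(p_vals[!leans_pw])), 400)
```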

The Timeless and the Momentary

Recurring themes for the PW corpus seemed to be family and nature, as well as sadness and spirituality: the key components of a good soul-searching endeavour. This concentration on nature also suggested the importance of descriptive passages. As for the BS, interesting sets of words that were explored were technology-related words and vernacular words. What piqued my curiosity the most, however, was not necessarily the distinct themes themselves, but rather the distinctive words used within similar categories. The example I will share here is that of time. The PW seemed to use words that spoke of time in terms of visual cues or spatial relations, referencing the age of characters or the seasons, whereas the BS had words based on the factual nature of time, like “minute”, “hour”, or “yesterday”. With this in mind, I found that the other key categories mentioned for the PW could also be looked at in this light, seeking universal and timeless values such as family and spirituality, whether they are discussed in a positive or negative light. Those mentioned for the BS speak of things in passing: technology and popular speech are always evolving and certainly do not represent language or ideas that are expected to withstand time, often expiring within a few years. These are things that readers will understand and breathe in the moment. To this extent, they propose a very different relation to time than do the writings in the PW corpus. They speak of momentary ideas, and if this also applies to their storylines, it would suggest events of ephemeral pleasure or pain, rather than contemplation.

Language and Thought

To generalise this idea even further, I question whether the use of language in the PW and BS is indicative of different intuitions about language, but also about the world it chooses to describe. Whether something is well written is often largely a matter of prescriptive rules, and is thus of less interest when asking what makes a good book. However, what is chosen to be written about, and the perspective used to do so, is anchored in descriptive thought processes. Therefore, I centre my attention for further reflection on a new question: Is the language used by the authors of these two distinct sets of books indicative of a shared thought process or perspective on writing, or even on the world they choose to describe?

List of Graphs

Here are bar graphs representing each dictionary mentioned. The dictionaries were originally a little longer, but the most indicative words were selected here. This selection was based on the sparsity of each word across its most representative corpus. The data presented compare the scaled word counts of the selected words across the two sets.
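For those wanting to reproduce a graph of this kind, here is a minimal sketch, reusing the hypothetical `dtm_pw` and `dtm_bs` matrices from above together with one of the dictionaries (`dict`):

```r
# Sketch of one bar graph: scaled counts of each dictionary word in the two sets.
pw_counts <- colSums(dtm_pw[, dict]) / sum(dtm_pw)   # scaled word counts, PW corpus
bs_counts <- colSums(dtm_bs[, dict]) / sum(dtm_bs)   # scaled word counts, BS corpus

barplot(rbind(PW = pw_counts, BS = bs_counts),
        beside = TRUE, las = 2, legend.text = TRUE,
        ylab = "Scaled word count")
```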

 

The words chosen were present in 95 to 100 percent of the books in the PW corpus, indicating very high relevance. It is interesting to note the lower presence of male characters in the BS compared to female counterparts.
The words chosen were present in 75 to 100 percent of the books in the PW corpus, indicating very high relevance.
The words chosen were present in 83 to 95 percent of the books in the PW corpus, indicating very high relevance.
The words chosen were present in 70 to 90 percent of the books in the PW corpus, indicating very high relevance.
The words chosen were present in 71 to 98 percent of the books in the BS corpus, indicating very high relevance, whereas in the PW they appear in only 30 to 90 percent of them, a much less unified spread.
These words are less common than those in the other sets; however, most are still very common in the BS and very uncommon in the PW, making for a clear point of comparison. They were present in 40 to 95 percent of the books in the BS corpus, whereas in the PW they appear in only 23 to 80 percent of them.