How do we Model Stereotypes without Stereotyping (Again)? How about information theory.

In a previous post, we explored how language models and the idea of “perplexity” allow us to study stereotypes in movie character roles using their dialogue as a basis. We examined a corpus of 750 Hollywood films released between 1970 and 2014 and tried to model the assumption from prior research that people of colour are more often criminalized, or depicted in criminal roles, than White actors.

In this post, we want to discuss how entropy, and information theory more broadly, can also be a useful approach to this kind of research. Surprisal is a measure of how “surprising” an event is (i.e., how much “information” it carries), based on the probability of that event occurring: the less probable, the more surprising. In the previous post, we used a crime language model, built from crime TV shows, to approximate film character dialogue (not limited to any genre). A perplexity score, measuring how surprising the new dialogue was, told us how different that dialogue was from the model.
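As a quick illustration of the idea (a toy sketch, not code from the study): the surprisal, or self-information, of an event with probability p is -log2(p) bits.

```python
import math

def surprisal(p: float) -> float:
    """Self-information of an event with probability p, in bits."""
    return -math.log2(p)

# A likely event carries little information; a rare one carries more.
print(surprisal(0.5))   # 1.0 bit
print(surprisal(0.01))  # about 6.64 bits
```

Averaging surprisal over all outcomes of a distribution gives its entropy.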

In coming up with potential models to explore the feature of the “criminality” of a role, we discovered a huge flaw in this kind of research: creating a model for stereotyping presupposes an existing stereotype that you, the researcher, have to define. In an effort to call attention to pigeonholing and tokenism, your own biases, however subconscious, will undoubtedly come forward.

One method to circumvent this is to get rid of a particular (and potentially subjective) language model and search for more general linguistic variability between groups. Forget any model or any expectation of how these groups would sound, and ask, how similar do the groups sound to each other?

So, we tried a new approach. Sticking with information theory and the idea of surprisal, we turned to Kullback-Leibler divergence (KLD), or relative entropy. Instead of building a model to which the dialogue will be compared, the dialogue of one group serves as the model to approximate the dialogue of another. KLD, then, is valuable because of the asymmetry it offers. Corpus A serves as the model for Corpus B, and we get a surprisal score telling us how well Corpus A predicts the words that we find in Corpus B. Then we go the other way, seeing how well Corpus B predicts the words we find in Corpus A. These two directions will not necessarily yield the same result, because one corpus could be far more varied than the other while still containing all of the words that the other has.
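To make the asymmetry concrete, here is a minimal sketch of a word-level KLD over two corpora. The additive smoothing constant and whitespace tokenization are our assumptions for illustration, not necessarily the setup used in the study:

```python
import math
from collections import Counter

def kld(p_counts: Counter, q_counts: Counter, alpha: float = 1.0) -> float:
    """Smoothed D_KL(P || Q) in bits: how surprised a model built from Q
    is by the word distribution of P. Additive (Laplace) smoothing keeps
    q(w) > 0 for words Q never uses."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    divergence = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / p_total
        q = (q_counts[w] + alpha) / q_total
        divergence += p * math.log2(p / q)
    return divergence

# Toy corpora: b contains everything a says, plus extra material,
# so the two directions give different scores.
a = Counter("the cop ran".split())
b = Counter("the cop ran the cop hid the suspect fled".split())
print(kld(a, b), kld(b, a))
```

Because of the smoothing and the toy scale, the numbers themselves are not comparable to the scores reported below; only the asymmetry of the two directions is the point.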

We have to tread carefully here: we encountered a recurring challenge in various phases of our original study on racial and ethnic representation in Hollywood movies. Our linguistic analysis models would often break because of the difference in group sizes (our biggest group, White characters, was almost ten times larger than the second biggest, Black characters). In other cases, they could give us significant results that did nothing more than point to the gaping difference in sample size. Ultimately, this measure required doing something we had spent the whole project avoiding: turning our comparison into one of White characters versus not-White characters, effectively bundling all people of colour into one group. Although it is far from ideal to lose the nuanced representation of different racial and ethnic groups, research in this direction still unveils valuable insight into how the casting of actors into certain roles functions. In the end, with the data that we had, this research was not possible without such a grouping.

From our data set, we ended up with 3,525 White characters and 533 non-White characters (how we came to this dataset, and the distributions of characters among racial and ethnic groups, is explained in our original publication).

We repeatedly sampled 100 characters from each of the two groups, with replacement, under the assumption that, if the smaller group was the same size as the larger, it would continue to exhibit the same kinds of linguistic patterns we were already seeing. We did this 1,000 times, and ran a KLD measure, assessing how well dialogue from White characters could be used to predict dialogue from characters who were not-White, and how well dialogue from non-White characters could model dialogue from White characters.
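The resampling loop described above might look like the following sketch, with toy dialogue strings standing in for the real character files and a simple smoothed word-level KLD standing in for whatever divergence implementation the study used (both are our assumptions for illustration):

```python
import math
import random
from collections import Counter

def kld_bits(p_counts: Counter, q_counts: Counter, alpha: float = 1.0) -> float:
    """Smoothed D_KL(P || Q) in bits over the joint vocabulary."""
    vocab = set(p_counts) | set(q_counts)
    p_tot = sum(p_counts.values()) + alpha * len(vocab)
    q_tot = sum(q_counts.values()) + alpha * len(vocab)
    return sum(
        (p_counts[w] + alpha) / p_tot
        * math.log2(((p_counts[w] + alpha) / p_tot) / ((q_counts[w] + alpha) / q_tot))
        for w in vocab
    )

def bootstrap_kld(group_a, group_b, n_iter=1000, sample_size=100, seed=42):
    """group_a / group_b: lists of per-character dialogue strings.
    Each iteration samples characters with replacement from both groups,
    pools their words, and scores KLD in both directions."""
    rng = random.Random(seed)
    a_to_b, b_to_a = [], []
    for _ in range(n_iter):
        sample_a = Counter(" ".join(rng.choices(group_a, k=sample_size)).split())
        sample_b = Counter(" ".join(rng.choices(group_b, k=sample_size)).split())
        a_to_b.append(kld_bits(sample_a, sample_b))  # B's words model A's
        b_to_a.append(kld_bits(sample_b, sample_a))  # A's words model B's
    return a_to_b, b_to_a

# Toy stand-ins for the 3,525 and 533 real character files.
white = ["you have the right to remain silent", "let's go home"]
nonwhite = ["let's go home", "we need to talk about the case"]
scores_ab, scores_ba = bootstrap_kld(white, nonwhite, n_iter=50, sample_size=10)
```

Sampling with replacement to a fixed size in both groups is what lets the 533-character group be compared to the 3,525-character group on equal footing.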

Using a density plot to visualize the distribution of 1,000 KLD scores in each direction, we see a significant (t = 73.643, df = 1963.8, p-value < 2.2e-16) difference in the way these two models function:
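The statistics reported here have the shape of a Welch two-sample t-test (unequal variances, non-integer degrees of freedom). A sketch of that comparison with simulated stand-in distributions; the means mirror the averages reported below, while the spread and normality are assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated KLD score distributions, 1,000 bootstrap scores per direction.
scores_nw_to_w = rng.normal(2.88, 0.2, 1000)  # non-White dialogue models White
scores_w_to_nw = rng.normal(3.67, 0.2, 1000)  # White dialogue models non-White

# equal_var=False gives Welch's t-test, which does not assume equal variances.
t, p = stats.ttest_ind(scores_w_to_nw, scores_nw_to_w, equal_var=False)
print(t, p)
```

With real scores in place of the simulated arrays, this reproduces the kind of t, df, and p-value figures quoted throughout.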

When dialogue from non-White characters was used to approximate dialogue from White characters, there was an average KLD score of 2.88. Going in the other direction, using dialogue from White characters to approximate the dialogue from non-White characters, yielded a KLD score, on average, of 3.67.

That is to say, the words spoken by non-White characters are a better predictor of the words spoken by White characters than vice versa. Non-White characters say some of the same things as White characters, which increases their ability to predict those sequences, but they also say things that don't occur, or don't occur as frequently, in the dialogue files of White characters. When the words spoken by White characters are used as the predictor (i.e., when we use White characters' speech to approximate non-White characters' speech), the model encounters more surprise, raising the divergence score. A significant portion of the language specific to non-White characters cannot be accounted for by White speech.

Looking at only the top quartile of roles (in this case, only the characters who spoke more than 1,186 words), we still see a significant difference between the two groups in their ability to approximate each other's dialogue (t = 42.381, df = 1985.3, p-value < 2.2e-16). Specifically, using dialogue from non-White characters led to an average KLD of 4.58, and the other direction led to an average KLD of 5.38. We see the same pattern that we saw overall. Thus it is not the case that top actors of colour begin to sound more White; they retain their linguistic distinctiveness.

The KLD scores, however, are higher when comparing top roles, and significantly so. Both groups are worse at approximating the other in the top roles than they are overall, meaning White actors and non-White actors occupy even more distinct starring roles.

The same thing happens at the bottom quartile, where characters had fewer than 370 words in a film.

The average KLD was 4.55 when using dialogue from non-White characters to approximate dialogue from White characters, and 5.38 in the other direction (t = 43.354, df = 1992.5, p-value < 2.2e-16). Again, the models perform worse in the bottom roles than they do overall.

If you are approaching this kind of research to determine whether people of colour are relegated to playing the same small set of roles over and over again, you might assume that their language would make up a subset of that of White characters; that is, White actors would get to play a more varied array of roles. This isn't what we see here. Non-White characters play similar roles to White characters, but they also talk about things that White characters don't, in roles that don't appear in the repertoires of White actors in this dataset.

The disparity between roles played by people of colour and those played by White people is heightened when zeroing in on top or bottom roles. Without a close study of the films involved, or an analysis of the specific words and patterns that distinguish the two groups, the mechanism driving the perceptible difference in the auditory feature space of film roles is difficult to define. A systemic, significant distinction between these groups that is only magnified at the levels of both leading roles and bit parts is unsettling, however, especially when combined with the other measures in our paper and with other research on racial and ethnic representation in films. It can point to evidence of tokenism, where roles are designed specifically for actors of colour, which can lead to stereotypical representations of these groups and to their pigeonholing into narrow sets of characteristics.

Relative entropy, with its emphasis on asymmetry, was a useful starting point for examining how the linguistic difference between these two groups functions. We've moved away from any kind of defining model, so it's not enough to say that non-White characters sound different from White characters. Here we find that non-White characters live in the same universe as White characters but also occupy another, distinct space. What are the features of these spaces, and how are they different? You can find the data that we used for our paper and these measures here, and have a look for yourself.