How do we model stereotypes without stereotyping?
We recently put out a paper on how racial bias functions in Hollywood films. This work built on a few studies that came before it, namely this one, from USC Annenberg. We presented numerical analyses such as the number of characters in different racial and ethnic groups, the number of words spoken by these groups, and who occupied the top roles in these films. These numbers give us tangible measures of the visual aspects of these films, but they exclude the entire other half of film: dialogue. We wanted to take this research a step further than other studies and learn more about racial bias in casting and writing through an analytical study of the dialogue spoken by these characters. The idea was to analyze the actual “quality” of the language as a stand-in for the “quality” of a role, and to answer questions like: are people of colour being relegated to the same kinds of roles in the disproportionately few times that they do appear on screen?
This was, predictably, much more difficult to carry out than we had initially thought when we started out last summer.
By using text mining and computational methods, we aimed to distance ourselves from any kind of subjective, close interpretation of the dialogue. One way we were able to do this is laid out in the paper. We found that characters whose racial or ethnic identity could be mapped to a corresponding geographical location (e.g., Latinx characters and Latin America) were more likely to reference cities and countries in that region than white characters were.
This was a relatively straightforward and objective measure. We tried to present it equally objectively and not pull any far-fetched analyses from it. We felt comfortable putting this into our paper without causing any controversy. But we wanted to do more, and try to see whether, on a measurable, linguistic level, people of colour are pigeon-holed in ways that their white counterparts are not.
We tried many text mining methods to answer this question: topic models to find unique themes in the speech of different groups, vector subtraction to find unique words used by each group, and machine learning to see if a computer could classify speech by racial or ethnic group. Each one of these failed because the groups in our database were of dramatically different sizes. White characters made up 86.9% of our dataset. Many of our methods don’t function well when the largest group has 3,525 members and the next largest has 385.
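To make the size problem concrete, here is a minimal sketch of what happens to a classifier that simply guesses the largest group every time. It uses only the two group sizes quoted above; the label `next_largest` and the function name `majority_baseline` are placeholders for illustration, not names from our data or code.

```python
def majority_baseline(counts):
    """Score a 'classifier' that always predicts the largest group.

    counts maps a group label to its number of characters.
    Returns (predicted_label, accuracy, balanced_accuracy).
    """
    majority = max(counts, key=counts.get)
    total = sum(counts.values())
    accuracy = counts[majority] / total
    # Recall is 1.0 for the group that is always predicted and 0.0 for the rest.
    recalls = [1.0 if group == majority else 0.0 for group in counts]
    balanced_accuracy = sum(recalls) / len(recalls)
    return majority, accuracy, balanced_accuracy


# The two largest groups, using the sizes quoted in the text.
counts = {"white": 3525, "next_largest": 385}
label, accuracy, balanced = majority_baseline(counts)
print(label, round(accuracy, 3), balanced)
```

Between these two groups alone, always answering "white" is right about nine times out of ten while never once identifying a character from the smaller group. A model can look accurate without having learned anything about the dialogue.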
But we kept trying. In doing research for our project, we came across analyses of Hollywood and television that indicated that criminals in movies were more likely to be played by men of colour. Was there a way to measure this linguistically? First off, we needed to come up with a way to model criminal language. Building a dictionary of “criminal” words is incredibly dangerous, open to our systemic, implicit biases. The model would need to be more objective, built from an existing body of text rather than from our own word choices. Newspaper crime reports were one candidate, but they are of a completely different register: could we expect a movie criminal to say, “He was granted a $50,000 personal surety bond and will be arraigned on June 25”? Unlikely, although this is probably an example of that implicit bias we were just talking about. This process of thinking through various models showed us the importance of ensuring that the domain and register of our model match those of the corpus we wanted to test, and so we settled on dialogue from crime TV shows.
But, understandably, this model was still not perfect. We could not automatically pull out criminals and build the model based solely on their dialogue. Thus, the model was based on the speech of all characters, including criminals, but also police officers, victims, lawyers, etc. Further, if Hollywood is subject to biases from producers, directors, and writers, television is not much better off, and we may be basing our measure on an already biased model. Biases against some groups in television will carry over into the model, which will then pick out those groups as the most criminal in the data set! This makes for a very problematic model that we are about to use to make some very sensitive, very controversial claims. Great. So, what did we find?
First, it is important to note that the model is not completely far-fetched. Conditioning on the most commonly occurring words in the crime television show scripts, we find these as the top 125 words:
Sprinkled among common English words like those seen above, murder, shot, gun, problem, and words for family members (husband, wife, kids) also show up in the top 200 words, while detective, security, death, truth, blood, killer, evidence, crime, and agent show up in the top 300 words.
We used a measure called “perplexity”. It was used in this paper, which analyzed interview questions posed to tennis players of different genders to determine to what extent questions directed at women tennis players diverged from the kind of language used in tennis commentary. Perplexity works with sentence probabilities: it measures how likely each word is to appear given the word that precedes it. You build a language model from some reference text, in our case the crime television dialogue, and if the tested corpus follows the same patterns found in the model, you get a low “perplexity”. That is, the program isn’t confused, and the model can reasonably be used to predict the test corpus.
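As a rough illustration of the idea, here is a minimal sketch of sentence perplexity under a bigram model. The add-one smoothing, the toy sentences, and the function names `train_bigram` and `perplexity` are assumptions made for this example; it is not the code or the exact formulation behind the numbers in this post.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences with boundary markers."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(tokens)
        contexts.update(padded[:-1])  # every word that precedes another word
        bigrams.update(zip(padded[:-1], padded[1:]))
    return contexts, bigrams, len(vocab)


def perplexity(tokens, contexts, bigrams, vocab_size):
    """Perplexity of one sentence, using add-one smoothing for unseen pairs."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(padded[:-1], padded[1:]):
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    n_predictions = len(padded) - 1
    return math.exp(-log_prob / n_predictions)


# Toy stand-in for the crime television dialogue (invented sentences).
crime_dialogue = [
    ["where", "were", "you", "last", "night"],
    ["we", "found", "the", "gun"],
    ["how", "are", "you"],
]
model = train_bigram(crime_dialogue)

seen = perplexity(["how", "are", "you"], *model)
unseen = perplexity(["i", "love", "gardening"], *model)
print(seen, unseen)
```

The sentence that already appears in the toy model gets the lower score, and the sentence made entirely of unfamiliar words gets the higher one. Lower values mean the sentence looks more like the text the model was built from.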
Here’s what our data looked like:
The box plots show us that all the groups were very, very close to each other. Near Eastern characters, however, had significantly lower perplexity scores than all other groups except Latinx characters, meaning that they were significantly closer to the crime model than other groups, and Latinx characters were close behind. South Asian characters had significantly higher perplexity scores than all other groups. They were significantly farther from the crime model. Other than these two extremes, no other groups were significantly different. This tells us some things, but it doesn’t exactly corroborate the idea of disproportionate casting of men of colour as criminals presented in past studies.
But this isn’t the whole picture. This is:
There is a lower bound to perplexity scores. All groups will have sentences with scores of 0, sentences that appear word-for-word in the crime model (as we capped the minimum sentence length at 3 words, a common one was “How are you?”). There is no upper bound. There is no limit to how foreign a sentence can be to the model, and yet, white characters have the largest range. They have the sentences with the highest perplexities. These numbers correspond almost exactly to the size of the groups.
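A small simulation shows why the range can track group size even when nothing else differs. This is a sketch under loud assumptions: the scores are drawn from an arbitrary lognormal distribution, not from our data, and the only real numbers in it are the two group sizes mentioned earlier. The function name `average_max` is made up for the example.

```python
import random


def average_max(group_size, trials=200, seed=0):
    """Average, over many trials, of the highest score in a group of a given size.

    Every group draws from the same distribution, so any difference in the
    result comes from group size alone.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.lognormvariate(0, 1) for _ in range(group_size))
    return total / trials


small_group = average_max(385)
large_group = average_max(3525)
print(small_group, large_group)
```

Both groups are drawn from exactly the same distribution, yet the larger one reliably produces the more extreme maximum, simply because it gets more draws. A wider range of scores for a larger group is therefore not, by itself, evidence that the group speaks differently.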
Variation and diversity in language depend on the size of the group. “This is obvious,” you scoff. Yes, but think about the implications this has for film. If (by some miracle) the crime model we designed is an accurate representation of criminal language in movies, then we may have just shown that there is not that much disproportionate casting of people of colour into criminalized roles. In fact, a Kullback-Leibler divergence (KLD) measure between equal-sized samples from each group and the crime model, which was another way to test how well the model could predict the dialogue, found no significant differences in divergence between any of the groups.
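For readers unfamiliar with KLD, here is a minimal sketch of the comparison, assuming word-frequency (unigram) distributions with add-one smoothing over a shared vocabulary. The toy word lists, the smoothing choice, and the helper names `equal_sample`, `word_distribution`, and `kl_divergence` are assumptions for illustration; they are not the setup used for the results above.

```python
import math
import random
from collections import Counter


def equal_sample(tokens, size, seed=0):
    """Draw the same number of tokens from each group so sizes are comparable."""
    return random.Random(seed).sample(tokens, size)


def word_distribution(tokens, vocab, alpha=1.0):
    """Smoothed word probabilities over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}


def kl_divergence(p, q):
    """KL(p || q) in nats; p and q must cover the same vocabulary."""
    return sum(p[word] * math.log(p[word] / q[word]) for word in p)


# Invented word lists standing in for the crime model and two groups.
model_tokens = "where were you last night we found the gun".split()
group_a = "we found the gun last night".split()
group_b = "i love gardening in the sun".split()

vocab = sorted(set(model_tokens) | set(group_a) | set(group_b))
q = word_distribution(model_tokens, vocab)
kl_a = kl_divergence(word_distribution(equal_sample(group_a, 6), vocab), q)
kl_b = kl_divergence(word_distribution(equal_sample(group_b, 6), vocab), q)
print(kl_a, kl_b)
```

The group whose words overlap with the model ends up with the smaller divergence. Sampling the same number of tokens from each group before comparing is what keeps group size from driving the result.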
The perception that disproportionate casting is happening is coming from the fact that the groups are so small that they need more positive appearances to counteract negative portrayals.
The point is, this work is tricky. This kind of large-scale research on dialogue is quite new, so we find ourselves in uncharted territory. Furthermore, without accurate transcriptions of film dialogue, we are limited to scripts, which don’t provide a complete picture. How did the dialogue change when the actor was cast?
Our main question was this: how do you find a way to measure how people talk without making judgements on the way they talk? The majority of research is led by a hypothesis. Here, a hypothesis can also be very dangerous. It can lead to the further stereotyping of groups of people. A year after starting our project, we didn’t find a way to test the variability between the groups in our dataset in a way that we felt comfortable publishing. But these questions still need to be answered. How do we answer them without our biases clouding the process, effectively answering our question before we’ve even formed it? TBD, as they say.
You can find the data we used for our paper (and for these measures) here.