I am pleased to announce the publication of a new piece I have written that appears today in CA: Journal of Cultural Analytics. The aim of the piece is to take a first look at the ways in which fictional language distinguishes itself from non-fiction using computational approaches. When authors set out to write an imaginary narrative as opposed to an ostensibly “true” one, what kinds of language do they use to signal such fictionality? One of the interesting findings that the piece offers is the way such signalling has remained remarkably constant for the past two centuries. Using a classification algorithm trained on nineteenth-century fiction, we can still predict contemporary fiction with above 91% accuracy (down from about 95% when tested against data from its own time period). These results hold across at least one other European language (German). In the future I hope to be able to test more languages to better understand just how constant such fictional discourse can be said to be.

In addition to seeing the constancy of these features across time and languages, the piece also highlights the specific nature of those features. As I argue in the piece, fictional language distinguishes itself most strongly through a phenomenological investment: an attention to a language of embodied individuals who sense and perceive. It is this heightened focus on sense perception, the world’s feltness, that makes fiction stand out as a genre. When we look at the ways novels in particular distinguish themselves from other kinds of fictional texts, we see a language of “doubt” and “prevarication” emerge, suggesting that the novel does not put us into the world in a fundamentally realist way, but places people in a skeptical, testing, hypothetical relationship to the world around them.

This piece is part of a nascent project to use computation to better understand creative human practices. The aim is not to replace human judgments about literary meaning or quality, but to make more transparent the semantic profiles of different types of cultural practices. Computation can be a useful tool in showing us how different cultures use different kinds of writing to convey meaning to readers over time. It helps us move beyond the impressionistic ideas we develop when we read a smaller sample of novels or stories, and test the extent to which these beliefs hold across much broader collections of writing.

While the original text data could not be shared in this project, all derived data has been shared as part of the article. One of the advantages of using non-word-based feature sets, as I do in the piece, is that the derived data can then be freely shared.

The Sweep of History

This is the second in a series of posts by .txtLAB interns. This post is authored by Magdalene Klassen.

Many if not most contemporary historians would probably agree with the statement that “the typical mode of explanation used by historians [is] narrative” (Roberts 2001). Storytelling, then, is not the difference between history and fiction. Instead, we could say, the scope of the story is what differentiates historical and fictional writing. For the past four months, I have been comparing a corpus of historical texts with a corpus of novels in English, French, and German. Based on my interpretation of the results, fictional texts have a smaller scope than histories, thematically, structurally, and lexically.

I considered works published between 1770 and 1930. All of the novels were in the third person for comparative purposes. My results should be taken with caution, as my data included more novels than histories.[1] Few nineteenth-century histories have been well digitized because the historical narrative has changed, and these texts have become primary rather than secondary sources. For example, Edward Gibbon’s The Decline and Fall of the Roman Empire is no longer an authoritative account – we now think of this “fall” as a transformation. Now his text is a means to understand how late-eighteenth-century historians understood their task, and the Roman Empire.

I have defined a “history” as an account of people and their actions in wider societal events in the past. This definition is meant to exclude:

  • histories of disciplines (science/philosophy/art)
  • speculative evolutionary histories (dawn of time/state of nature)
  • memoirs/histories written by those who took part

I ran five main tests on the data sets, studying sentence length, type-token ratio, corpus homogeneity, vocabulary likeliness, and dictionary frequency. The first two were primarily structural tests, and the last three had a greater focus on the words of the corpora.

Sentence Length

I was only able to run the sentence length test on English and German texts, due to the software, and in both languages the historical texts had longer sentences than novels. This is far more pronounced in English; sentences in novels are on average 21.47150 words long, while in histories the average sentence is 25.69378 words long (p-value 5.137e-07). The shorter sentences of novels are likely a result of the frequent use of dialogue, which is often composed of short interjections. Alternatively, these results may confirm that historical, academic writing is denser than fiction. A further test not including dialogue might yield more conclusive results.

The difference in German is much smaller and is not statistically significant at conventional thresholds: novels have sentences that are on average 21.53037 words long, while the average for histories is 22.62373 words (p-value 0.2505). Again, a test which did not include dialogue would be helpful to better understand the genre difference in sentence length in German. My results may suggest that the difference between functional and literary language is smaller in German, or that the difference between dialogue and description is less pronounced.
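
As a rough illustration of the measurement, the sketch below computes the average sentence length of a text in words. The regular-expression sentence splitter, the function name, and the two sample passages are assumptions made for illustration; this is not the software used for the figures above, and a real tokenizer would handle abbreviations and quotation marks more carefully.

```python
import re

def mean_sentence_length(text):
    """Average number of words per sentence (naive splitter, illustrative only)."""
    # Split after ., ! or ? when the mark is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    # Count whitespace-separated words in each sentence.
    lengths = [len(s.split()) for s in sentences]
    return sum(lengths) / len(lengths)

# Invented sample passages, not drawn from the corpora.
novel = '"No," she said. He waited. The rain fell on the empty street.'
history = "The treaty, signed after long negotiation, ended the war between the two crowns."
print(mean_sentence_length(novel), mean_sentence_length(history))
```

Short lines of dialogue pull the average down quickly, which is why a version of the test that excludes dialogue would be a useful check.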



Type-Token Ratio

Histories also have a higher type-token ratio than novels, in all three languages.

TTR Novel History P-value
English 0.2158489 0.2404620 9.53E-09
German 0.2786909 0.3232653 8.802E-16
French 0.2549798 0.2666072 0.005711

These results suggest that, although history describes the past, it does so with more novel language than the novel, especially in English and German. These nineteenth-century histories may have introduced new vocabularies because they were often about exotic others, whether ancient or far away. In contrast, the vast majority of French histories – though I did try to maintain a variety – were about the French Revolution. Although one might think that conjuring up an entire literary world is harder than writing about one everybody agrees existed – for example, Fontane describing Berlin in Irrungen, Wirrungen compared to Franz Kugler describing the Berlin of more than a century earlier in his Friedrich der Große – it may be that in a fictional effort, the author repeats the same words in order to solidify those characteristics in a reader’s mind.
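
For readers unfamiliar with the measure, the type-token ratio is the number of distinct word forms (types) divided by the total number of words (tokens). The sketch below is a minimal version under stated assumptions: the tokenizer, the function name, and the optional fixed-size sample are mine. The sample is a common precaution because the ratio falls as texts get longer; the post does not say how text length was handled.

```python
import re

def type_token_ratio(text, sample_size=None):
    """Distinct word forms (types) divided by total words (tokens)."""
    # Lowercase and keep runs of letters, including accented letters and umlauts.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if sample_size is not None:
        # Assumption: compare texts on their first sample_size tokens only.
        tokens = tokens[:sample_size]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Invented example: 11 tokens, 5 distinct types.
print(type_token_ratio("the cat saw the dog and the dog saw the cat"))
```

A text that keeps returning to the same words scores lower, which is the pattern the novels show here.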





Novels as a genre are significantly more linguistically homogeneous than histories. I determined this by a series of correlation measures, by which each text was compared to every other text in the same genre corpus, to determine the overall similarity of the corpus to itself.

The stark difference in homogeneity across all three languages is difficult to explain, but once again these results suggest that novels have a smaller scope than histories. Whereas histories argue a thesis or perspective that must be defended, novels seek to convey a recognizable social reality.

Homogeneity Novel History P-value
English 0.6700383 0.4467730 < 2.2e-16
French 0.6843885 0.4686732 < 2.2e-16
German 0.7715615 0.6196034 < 2.2e-16
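
One way to picture the homogeneity score is as the mean correlation between the word-frequency profiles of every pair of texts in a corpus. The sketch below assumes Pearson correlation over relative word frequencies; the post does not specify which correlation measure or features were used, so the function names, the measure, and the toy sentences are all illustrative assumptions.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def corpus_homogeneity(texts):
    """Mean pairwise correlation between the word-frequency profiles of the texts."""
    if len(texts) < 2:
        raise ValueError("need at least two texts to compare")
    counts = [Counter(t.lower().split()) for t in texts]
    # Shared vocabulary so that every profile has the same dimensions.
    vocab = sorted(set().union(*counts))
    profiles = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid dividing by zero for an empty text
        profiles.append([c[w] / total for w in vocab])
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Invented toy corpora, not drawn from the project data.
novels = ["she felt her heart beat", "he felt her hand tremble", "her heart felt heavy"]
histories = ["the army crossed the river", "parliament passed the reform act", "the treaty ended the war"]
print(corpus_homogeneity(novels), corpus_homogeneity(histories))
```

A score near 1 means the texts in a corpus distribute their words in much the same way; a lower score means they diverge.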


Distinctive Words

I measured a word’s likelihood in a given corpus with a non-parametric test for two independent samples: the Wilcoxon rank-sum test. The results strongly confirmed the thematic differences between nineteenth-century novels and histories: the words most characteristic of novels focused on individual emotions and bodies, while histories tended to use larger-scale words about war, diplomacy, and geopolitics. My results confirm earlier impressions about the difference between these genres. Yet the degree to which they validate conceptions of nineteenth-century historical methodologies is fascinating. For example, in all three languages, the name Alexander (the Great, most likely) indicated histories. In German, 53 of the words characteristic of histories were either first or last names. This is consistent with the influence of the so-called Great Man Theory, which was popularized by the Scottish historian Thomas Carlyle in the 1840s. In many ways, history has a broader scope than fiction, but certain common threads still remain. Significantly, these commonalities confirm what we understand about the established historiography of the nineteenth century.
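
To make the test concrete, here is a minimal sketch of a Wilcoxon rank-sum test applied to one word's relative frequency in each text of two corpora. It uses the normal approximation with a tie correction and no continuity correction, so p-values for very small samples are only approximate. The function name and the frequency values are invented for illustration and are not the project's code or data.

```python
from math import erfc, sqrt

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (W, p), where W is the sum of the ranks of the values in x.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    # Pool the values, remembering which sample (0 or 1) each came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    w = 0.0         # rank sum for sample x
    tie_term = 0.0  # sum of t^3 - t over groups of tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        w += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean_w = n1 * (n + 1) / 2
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        return w, 1.0  # all values tied: no evidence of a difference
    z = (w - mean_w) / sqrt(var_w)
    return w, erfc(abs(z) / sqrt(2))

# Invented per-text relative frequencies of a single word (hypothetical values).
freq_in_novels = [0.0021, 0.0018, 0.0025, 0.0030, 0.0019]
freq_in_histories = [0.0004, 0.0007, 0.0002, 0.0009, 0.0005]
print(rank_sum_test(freq_in_novels, freq_in_histories))
```

Running the test once per word and sorting by p-value gives a list of the words most characteristic of each corpus. Because the test works on ranks, a single text that uses a word very heavily does not dominate the result.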



I based the final test on several thematic dictionaries I collated, but here I focus on only one. This dictionary consisted of approximately 100 words in each language that referenced the document’s status as a text. Words included historian, archive, metaphor, and storytelling, as well as “neutral” words that applied to both genres, such as editor and page. I found that histories are much more reflexive than novels, though the dictionary was meant to represent equally “history words” and “novel words.”

Reflexivity Novel History P-value
English 0.2158489 0.2404620 9.53E-09
German 0.2786909 0.3232653 8.802E-16
French 0.2549798 0.2666072 0.005711
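
A dictionary measure of this kind can be sketched as the share of a text's words that appear in a word list. The example below uses only the six English words named above; the full dictionary is not reproduced here, and normalizing by the total number of tokens is an assumption, as are the function name and the sample sentence.

```python
import re

def dictionary_frequency(text, dictionary):
    """Share of a text's tokens that appear in the given word list."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in dictionary)
    return hits / len(tokens)

# Only the words mentioned in the post; the real list had roughly 100 per language.
reflexive_words = {"historian", "archive", "metaphor", "storytelling", "editor", "page"}

print(dictionary_frequency("The historian opened the archive", reflexive_words))
```

This version matches exact word forms only, so plurals and inflected forms would need to be listed separately or lemmatized first, which matters especially for German and French.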



Historical texts draw their legitimacy from their ability to interact with other texts, and so references to themselves as texts, surrounded by other texts, are crucial to their authoritative function. In contrast, novels attempt to immerse the reader in the world of the novel, despite the occasional “dear reader” moment. Where novels pull the reader inward, histories draw them outward, offering an explicitly broader scope and frequent examples.


[1] English: 86 histories/108 novels, French: 83 histories/100 novels, German: 75 histories/110 novels. I also tried not to use individual volumes from multivolume works, which severely limited my choice of histories, and so some first or second volumes were included.