Should We Teach Literature Students How To Analyze Texts Algorithmically?

Textual analysis like the type that revealed J.K. Rowling’s nom de plume could change the way we understand the very concept of writing style. Is this the answer to the staleness and despair that has crept into the study of literature?

When the U.K. newspaper the Sunday Times outed J.K. Rowling as the author of detective novel The Cuckoo’s Calling earlier this year, computer scientists were among the first people called in. Although the novel was published under the pen name Robert Galbraith, two computational scholars—including Duquesne University’s Patrick Juola—were tasked with confirming or denying whether the novel belonged to the Harry Potter author, or one of three other possible writers.

That Juola succeeded (his conclusions were later confirmed by Rowling herself) speaks volumes about the potential of algorithms and computer science, even when applied to a field as notoriously subjective as literature. Which raises an interesting question: Can we use software to help us think about literature?

Reverse Engineering J.K. Rowling

To explore that question, we should first look at how Juola cracked Rowling’s writing style. To begin the process, Juola loaded 1,000-word samples of The Cuckoo’s Calling into his self-designed Java Graphical Authorship Attribution Program (JGAAP), along with several other texts, including The Casual Vacancy, Rowling’s first post-Harry Potter novel. A freely available, modular Java program for textual analysis, categorization, and authorship attribution, JGAAP analyzed the texts on four different variables: word-length distribution, the use of common words like "the" and "of," recurring word pairings, and the distribution of "character 4-grams," or groups of four adjacent characters, which can span whole words, parts of words, or the spaces between them. The computer analysis took around 30 minutes in total.
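To make those four variables concrete, here is a minimal Python sketch of how each could be counted for a single text sample. It is an illustration only, not JGAAP's code: the function names, the tokenizing rule, and the short COMMON_WORDS list are all assumptions made for this example, and a real attribution study would go on to compare these profiles across candidate authors.

```python
import re
from collections import Counter

# Hypothetical short list of common words; a real study would use a far longer one.
COMMON_WORDS = ("the", "of", "and", "a", "to", "in")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def extract_features(text):
    """Count the four kinds of stylistic features described above for one sample."""
    words = tokenize(text)
    total = len(words) or 1  # avoid dividing by zero on an empty sample
    counts = Counter(words)
    lowered = text.lower()
    return {
        # How many words of each length appear in the sample.
        "word_lengths": Counter(len(w) for w in words),
        # Relative frequency of each common word.
        "common_words": {w: counts[w] / total for w in COMMON_WORDS},
        # Adjacent word pairs, such as ("the", "cat").
        "word_pairs": Counter(zip(words, words[1:])),
        # Every run of four adjacent characters, spaces included.
        "char_4grams": Counter(lowered[i:i + 4] for i in range(len(lowered) - 3)),
    }


if __name__ == "__main__":
    features = extract_features("The cat sat on the mat, and the dog sat on the cat.")
    print(features["word_pairs"].most_common(3))
    print(features["char_4grams"].most_common(3))
```

None of these counts means much in isolation. The idea is that the same profile, built for a disputed text and for samples by each candidate author, gives an analyst something measurable to compare.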

"Nothing that we’re doing is magic," Juola said recently about the process. "What we are doing is the same type of judgment that experts have always done about reading documents and figuring out something about the author—just a lot faster, and more accurate than most."

Juola’s work is hardly the first piece of evidence that algorithms may have a useful role to play in the field of textual analysis. From researchers using algorithms to determine the semantic differences between tweets written by men and by women, to companies like Narrative Science, which use machine learning tools to generate entirely new works in a number of styles, there are growing signs that computer science can change the way books are read.

Of all the people to celebrate literary analysis’s "algorithmic turn," perhaps none has been more outspoken about the subject than Jonathan Gottschall, a literary scholar at Washington and Jefferson College in Pennsylvania, who specializes in the field of literature and evolution.

Not only does Gottschall believe that algorithmic analysis can be used to change the way we read literature, he also believes it should profoundly alter the way we study it. Most recently the author of The Storytelling Animal: How Stories Make Us Human, Gottschall has spent much of the past five years arguing that literary studies needs to adopt a more scientific approach, including up-to-date scientific theories, research methods, statistical tools, and an insistence on hypothesis and proof.

Curing The Despair Over Humanities

Speaking to FastCo.Labs, Gottschall says that he was prompted to embrace the digital humanities by "frustration and near despair with the way academic literary studies within the humanities were conducting themselves." In particular, he points to literary studies' inability (or refusal) to ever get closer to an objective truth. There is, he claims, little accumulation of knowledge from generation to generation, and little work that meaningfully builds on that of predecessors without first trying to shoot them down.

"The sciences are full of vigor," Gottschall says. "They’re full of energy—there’s a real sense that, boy, we’re doing important stuff, and that we’re really getting somewhere. Compare that to my field of English, within the humanities, where there’s the feeling that our best days are behind us, that this is a dying field with no cultural prestige, that jobs aren’t safe—and all of those things are really quite true."

Gottschall believes that these problems (the humanities’ increasing irrelevance, decreasing popularity, and the subjects' inability to ever answer questions conclusively) are not only connected, but can be solved by implementing the kind of data mining tools and computational textual analysis carried out by the likes of Patrick Juola.

"For me, it all comes down to the question that we are asking," Gottschall says, "and often that is a question that cannot be solved without the empirical method. If you look back over the history of literary studies, what you will see is an endless argument that never gets anywhere. If you want to get a conclusion, you can turn to the sciences."

Anatomy Of The "Reading Machines" Of Tomorrow

Taking something of a less positivistic approach is Stephen Ramsay, associate professor of English and a fellow at the Center for Digital Research in the Humanities at the University of Nebraska-Lincoln, as well as author of last year’s Reading Machines: Toward an Algorithmic Criticism. In his book, Ramsay attacks the idea that the digital humanities should take over entirely from traditional literary studies, while additionally detailing the ways in which algorithms can be usefully integrated into the field without attempting to turn the humanities into a branch of statistical science in the process.

Algorithms can, Ramsay points out, be used to determine "vocabulary richness" by measuring the number of different words that appear in a 50,000-word block of text. This, in turn, can reveal useful insights like the fact that a "popular" author like Sinclair Lewis (sometimes derided for his supposed lack of style) regularly demonstrates twice the vocabulary of Nobel Prize laureate William Faulkner, whose work is considered notoriously difficult.
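A measure like this is simple enough to sketch in a few lines of Python. The version below is an assumption about how such a count might be done, not Ramsay's own procedure: it takes the first block of words from a text (50,000 by default) and reports how many distinct words appear in it. The function name and the tokenizing rule are hypothetical.

```python
import re


def vocabulary_richness(text, block_size=50_000):
    """Return the number of distinct words in the first `block_size` words of text."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    block = words[:block_size]  # compare authors on equally sized blocks
    return len(set(block))


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    print(vocabulary_richness(sample))
```

Fixing the block size matters because longer texts naturally contain more distinct words, so comparing two authors on blocks of different lengths would be misleading.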

Tools like Google Books, meanwhile, offer the possibility of completely transforming the way in which literary scholars approach questions and comparisons, by allowing the simultaneous searching of up to 35,000 novels, and perhaps opening up the cultural space for works such as Pierre Bayard’s How To Talk About Books You Haven’t Read.

As Ramsay observes, "The rigid calculus of computation, which knows nothing about the nature of what it’s examining, can shock us out of our preconceived notions on a particular subject. When we read, we do so with all kinds of biases. Algorithms have none of those. Because of that they can take us off our rails and make us say, ‘A-ha! I’d never noticed that before.’"

The idea that the literary studies of the (near) future may involve more data visualization and machine learning than speculation on the "death of the author" or studying the sexism of the Western canon could be enough to unsettle some of those working within the field. But it might also be closer to becoming a reality than many of us realize.

[Image: Flickr user Mikael Altemark]

  • This article reads like Sinclair Lewis wasn't a Nobel laureate, when in fact he was the first American to win the Nobel Prize in Literature.

  • As a writer, I can see a lot of value in being able to scientifically evaluate a piece of literature. I think the real value will be found in discovering/understanding (and marketing) to individual readers. Imagine if publishing companies (or sellers like Amazon) could group fiction not by genre, but by word flow for customers. Imagine being able to type in a list of favorite authors or books and being able to learn hundreds or thousands of other books that share a similar word flow. I wouldn't be surprised if Google isn't already working on it. I'd pay to use such a tool. The reason Literature studies is so subjective (and to a large degree pointless) is because people prefer different word flows. Few people even know they have a word flow preference; what they like is well written, what they hate is badly written. There are other aspects to literary preferences, but those too could be sussed out by the right pattern searching software! If this isn't the future, it should be.

  • Alex Ezell

    There's a long history of more scientific textual analyses of literature. The Dominicans built the first concordance of the Bible in 1230. Similar concordances can now be compiled on any text in a matter of seconds or minutes. While concordances are the simplest form of this analysis, knowing that an author uses the word "crepuscular" 17 times in a given novel might be helpful in deriving some critical analysis on tone or point of view or setting.

    I'd agree that it is one tool among many that should enable deeper thinking about a given text. I decry the notion that an algorithm might someday tell me that a novel is good or bad or recommended for me based on textual analysis only.

  • trapezium

    Once we've demonstrated that Sinclair Lewis uses a larger vocabulary than William Faulkner, we have an objective truth that Gottschall calls for.

    And then so what?

    The humanities aren't a science. The purpose of studying them isn't the same as the purpose of studying the properties of benzene or the components of an atom.

    It baffles me to hear that people expect objective truths from studying literature. They seem to have so thoroughly missed the point.

  • Jan Svensson

    Alas, I'm more twit than twitterer, but I just want to say that this is one of the most interesting articles I've read this year, and I plan to look into all of the references.