Of the thousands of books published this year, how can we predict which will become bestsellers? Book lovers Jodie Archer and Matt Jockers set out to answer this million-dollar question with an algorithm. In this episode of ‘Fixed That For You,’ we merge art and science by turning words into numbers. Discover the secret similarities between The Da Vinci Code and Fifty Shades of Grey, and learn how an open-source library can help you write a best-seller of your own.
Here’s a list of Stephen Marche’s books. Check out “Literature Is Not Data”, Stephen’s essay, which was mentioned in this episode. Curious to learn more about Adam Hammond’s role as a digital humanist? Learn more on his website.
A similar case for algorithmic judgment has been made in contemporary pop music.
Jodie Archer: There are 20, 30, 40, 50 novels coming out every week, every month. Why is it that one takes the whole cultural imagination by storm? Why do we have taste en masse?
Matt Jockers: What are the things that all best selling writers seem to sort of leave behind in an almost, you know, crime scene-y kind of way, that human beings don't notice?
Cara Santa Maria: Welcome to Fixed That For You, an original podcast from Segment about solving problems with data and algorithms. I'm Cara Santa Maria, and in this episode we merge art and science by turning words into numbers. It's a story of how Jodie Archer and Matt Jockers built an algorithm to predict the next Da Vinci Code or Fifty Shades of Grey.
Stephen Marche: What they were looking at was how you could use statistical modeling to create better analysis of literature.
Cara Santa Maria: And then how a writer named Stephen Marche tweaked that code.
Stephen Marche: What I'm really doing is using it as a tool to help me write better literature.
Cara Santa Maria: In order to write the perfect sci-fi short story. Jodie Archer's hunt for the mysterious elements that can predict a bestseller started with a mystery novel. She was working as an editor at Penguin Books in England when Dan Brown published The Da Vinci Code.
Jodie Archer: It was just Dan Brown week after week after week, just taking the top spot in every bookstore and why? This question just haunted me. What was it about that book?
Cara Santa Maria: The novel was panned by critics, but it kept on selling. 80 million copies at last count.
Jodie Archer: When I left Penguin to go to Stanford to do a PhD, that question was still with me. I had no idea that it was possible to bring machine learning to how we study novels.
Cara Santa Maria: So at Stanford, Jodie met Matt, a pioneer in the statistical analysis of literature.
Matt Jockers: Early on in my career I developed sort of two parallel tracks of interest. I was doing a PhD in Literature and then I was doing this sort of hobby of computing on the side.
Cara Santa Maria: This was the early '90s, so hobby of computing means he was writing some very rudimentary software. Back then, computers couldn't handle video or even still images, but they could collect data by searching through text.
Matt Jockers: There was actually a very specific moment when I was in my last year of my PhD program and I was doing an analysis of Beowulf, the poem, and I was studying the meter and the rhythm of the poem and I remember, I was writing down all these numbers, my calculations about the meter on sheets of yellow paper. And then I was going back and computing the statistics by hand and it sort of dawned on me at a certain point that you know, I really could do this with the computer.
Cara Santa Maria: Matt went on to use computing to analyze 19th century literature, but Jodie wanted to understand less literary novelists like John Grisham and Danielle Steel.
Jodie Archer: There's a real snobbery against contemporary bestsellers in most universities. So I had to go and say, "Okay, Matt, let's look at Stieg Larsson and Dan Brown and not James Joyce. And let's see if we can find a reason why some of these books best-sell, regardless of genre." My hunch was there's something similar in all of those books that readers are responding to.
Matt Jockers: And so that's when we really began digging down and trying to build the algorithms that would identify what the features were that are similar across a big corpus of best sellers.
Cara Santa Maria: On Fixed That For You, we like to define problem solving in terms of data and algorithms. So here we go. Matt and Jodie's data set was no surprise: books.
Jodie Archer: Go to the bookstore, right? There's a pile of them at the front that says bestselling table and then there's all the rest. So, we've got two categories.
Cara Santa Maria: They decided to focus on books from The New York Times best seller list, and the plan was to compare them to every other book that had never been a best seller, so thousands and thousands of books.
Cara Santa Maria: But books don't exist in a vacuum. There are factors outside of the actual writing that impact sales, like appearing on Oprah's Book Club, but that effect is short lived and easily screened out of the data set.
Matt Jockers: In order to be in our corpus as a best seller, the book had to have been on the list for at least 10 weeks, so we weren't looking at the one week wonders.
Cara Santa Maria: That took care of marketing. Reputation sales is when a book sells just because the author's previous books were popular, like John Grisham novels, so they needed a way to account for this in their comparisons.
Matt Jockers: We configured the algorithms in such a way that any time it was looking at an individual author like Grisham, the machine was told not to allow any of the other Grisham works in the model, so that it was treating every novel as if it was a first novel.
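The author hold-out Matt describes can be sketched in a few lines of Python. This is only one reading of his description, not the team's actual code (which was written in R): whenever an author's book is being judged, every book by that author is kept out of the training set. The titles, authors, and labels below are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical toy corpus: (title, author, is_bestseller). Illustrative only.
books = [
    ("Book A", "Author One", True),
    ("Book B", "Author One", True),
    ("Book C", "Author Two", False),
    ("Book D", "Author Three", True),
    ("Book E", "Author Two", False),
]

def leave_one_author_out(records):
    """Yield (author, train, test) so that no author appears on both sides."""
    by_author = defaultdict(list)
    for record in records:
        by_author[record[1]].append(record)
    for author, held_out in by_author.items():
        # Train only on books by other authors, so each held-out
        # novel is judged as if it were a first novel.
        train = [r for r in records if r[1] != author]
        yield author, train, held_out

for author, train, test in leave_one_author_out(books):
    print(author, "train:", len(train), "test:", len(test))
```

Splitting by author rather than by book is what keeps a famous name from leaking into the model's judgment of that author's other titles.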
Cara Santa Maria: As for the algorithm that does all of this, Matt and Jodie didn't design it from scratch. They repurposed a program that had been designed for something a little more serious.
Matt Jockers: It was originally developed for analyzing gene data, DNA data, and differentiating between different types of cancer.
Cara Santa Maria: The program was written in something called R, a language popular with statisticians mostly because it's really good at generating graphs. They gave the algorithm some broad parameters to start looking at: theme, plot, style and character.
Matt Jockers: Those are what you might call composites, made up of many, many features. At one time, we were working with around 28,000 different individual features.
Jodie Archer: What the computer is doing is looking at all of the words and it will come back with a cluster and say, okay, this word “bar” appears a lot of times in the books that you've given me, and sometimes the word “bar” is around the word “law” and it's around the word “courtroom” and sometimes it's around the word “whiskey” and “evening” and “bartender”.
Matt Jockers: And what Jodie just described is called latent Dirichlet allocation, more commonly known as topic modeling. And it's a fantastic unsupervised machine learning technique.
Cara Santa Maria: Topic modeling is key to detecting themes, but instead of just adding up how many times a word is used, it calculates what words are used near it most often, giving it context. So the algorithm detects the word “bar” and it knows the writer's referring to lawyers, not alcohol.
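The idea of reading a word through its neighbors can be illustrated with a small Python sketch. To be clear, this is not latent Dirichlet allocation; it is a much simpler co-occurrence count, and the function name, window size, and sample sentences are all assumptions made for illustration.

```python
from collections import Counter
import re

def context_counts(text, target, window=5):
    """Count the words that appear within `window` tokens of `target`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            start = max(0, i - window)
            # Words before and after the target, excluding the target itself.
            neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts

# Hypothetical sample sentences, one per sense of "bar".
legal = "The lawyer passed the bar and argued in the courtroom."
drinks = "The bartender at the bar poured whiskey all evening."

legal_ctx = context_counts(legal, "bar")
drinks_ctx = context_counts(drinks, "bar")
print(legal_ctx.most_common(3))
print(drinks_ctx.most_common(3))
```

A real topic model learns these groupings across thousands of books without being told which word to look at; the sketch only shows why surrounding words carry the signal.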
Cara Santa Maria: But that doesn't really explain why John Grisham's books sell so much more than other courtroom novels.
Jodie Archer: The question on Grisham about whether law is going to sell or not is just one feature, so it's got to also have, you know, many, many, many, many, many, 2,000 other features that we're getting yes on. How he's using style, how he's using character, how he's using plot.
Cara Santa Maria: Processing that many books and that many data points takes time, a lot of math and a ton of processing power.
Matt Jockers: I actually did the analysis on a high performance compute cluster at the University of Nebraska.
Cara Santa Maria: In the end, Matt and Jodie were only interested in the most repetitive stuff.
Jodie Archer: The computer is showing you, hey, this seems to be a theme and it's coming up in 50 of the 100 books you've given me.
Cara Santa Maria: One theme they thought would be popular fell flat.
Jodie Archer: The working title of my thesis at Stanford was "Sex Does Not Sell," because we saw that this sex element was actually much stronger in all the stuff that isn't on the best seller list.
Cara Santa Maria: That's the kind of information you only get from objective data, but it was still surprising when they ran Fifty Shades of Grey through the algorithm.
Jodie Archer: I mean, it's quantified. So it's not our impression, the sex scene was 10 pages in the whole story and in a 300 page novel, it's not a high percentage that that book is about sex.
Cara Santa Maria: But if Fifty Shades of Grey isn't about sex, what is it about?
Jodie Archer: From the data perspective, just as much, there are scenes of relationship discussion and this theme of human closeness that we have found is the magic ingredient in bestsellers across all genres. It's those scenes where you can slow down action, whether that's in a bedroom or in a courtroom or in a murder trial and have two people have a moment of just soft human, I don't mean sexual, but closeness.
Matt Jockers: As people from a literary background who care deeply about the arts and literature as it intersects with our humanity, right? This was a great finding to see that, hey, the books that really do well are ones that engage a human dimension and relationships.
Cara Santa Maria: So that's theme. But what about plot? How can a machine identify plot structure?
Matt Jockers: What I'm measuring is the emotional valence of words over narrative time, and when there's a lot of negative words, that's a downturn in the plot.
Jodie Archer: What the computer draws is a graph of the plot line based on changes in sentiment. So, bad thing happens, good thing happens, bad thing happens. You know that punchy, page turner, kind of beat.
Cara Santa Maria: It turns out when you track a novel that way, you see some very strange pairings.
Jodie Archer: Fifty Shades of Grey and Dan Brown, The Da Vinci Code, we found, have despite everything that seems different about them, two of the best-selling books of the last century, and they have an almost identical plot line.
Matt Jockers: They have those peaks and valleys in a very rhythmic sort of way. And the peaks are about the same height, the valleys are about the same depth and they kind of hover back and forth over neutral terrain.
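The plot line described here, emotional valence tracked over narrative time, can be approximated in Python as follows. This is a minimal sketch under stated assumptions: the tiny word list, the chunk count, and the sample text are hypothetical, and a real analysis would use a full sentiment lexicon and smoothing.

```python
import re

# Hypothetical mini-lexicon: +1 for positive words, -1 for negative ones.
LEXICON = {"love": 1, "happy": 1, "safe": 1, "win": 1,
           "fear": -1, "dead": -1, "lost": -1, "pain": -1}

def sentiment_arc(text, n_chunks=4):
    """Split the text into equal slices of narrative time and
    return the mean valence of the words in each slice."""
    tokens = re.findall(r"[a-z']+", text.lower())
    arc = []
    for i in range(n_chunks):
        start = i * len(tokens) // n_chunks
        end = (i + 1) * len(tokens) // n_chunks
        chunk = tokens[start:end]
        score = sum(LEXICON.get(t, 0) for t in chunk)
        arc.append(score / len(chunk) if chunk else 0.0)
    return arc

# Toy "story" that alternates good and bad stretches.
story = ("happy love safe win fear dead lost pain "
         "love happy win safe pain fear lost dead")
arc = sentiment_arc(story)
print(arc)
```

Plotting the returned list against chunk position gives the kind of peak-and-valley curve the speakers compare across novels.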
Cara Santa Maria: Next, Matt and Jodie tackled style.
Matt Jockers: If you want to get after style, you're interested in the use of words. You're interested in the order of the words, the syntax and the units which are sentences.
Cara Santa Maria: Some of the things that differentiate the two datasets were shockingly simple. Like how many times they use the word “the”. There's a sweet spot, not too many, not too few, that is indicative of best sellers.
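A style feature like this one is easy to compute. The Python sketch below measures how often “the” appears as a share of all words; the function name and sample sentence are assumptions, and the episode does not say where the sweet spot lies, so no threshold is shown.

```python
import re

def relative_frequency(text, word="the"):
    """Share of all word tokens in `text` that are exactly `word`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count(word) / len(tokens)

# Hypothetical sample sentence.
sample = "The judge read the verdict while the room waited."
rate = relative_frequency(sample)
print(f"{rate:.1%} of the words are 'the'")
```

Using a rate rather than a raw count keeps long and short books comparable.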
Cara Santa Maria: Okay, so for the fourth element, characters, we need to go back to grammar class and review our verbs.
Jodie Archer: The algorithm was really putting certain verbs in box A and certain in box B, and so we needed some interpretation for what does this mean and we looked at it for a long time and realized that the best selling hero or heroine is absolutely, definitely active.
Jodie Archer: So, this is someone who has a want, has a need, drives it, they drive, they take action, they think, they do, they stare, they shout, they speak, and then the characters in the books that weren't making it were doing things like hesitate, whisper, suggest, ponder, daydream, as in life, right?
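The two boxes of verbs can be turned into a rough character profile. The Python sketch below uses only verbs quoted in this passage; the function name and sample text are assumptions, and it matches exact word forms only, so “shouted” would not count unless added to the list.

```python
import re
from collections import Counter

# Verb lists taken from the examples quoted above (base forms only).
ACTIVE = {"think", "do", "stare", "shout", "speak"}
HESITANT = {"hesitate", "whisper", "suggest", "ponder", "daydream"}

def verb_profile(text):
    """Count how many tokens fall in each of the two verb boxes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    profile = Counter()
    for token in tokens:
        if token in ACTIVE:
            profile["active"] += 1
        elif token in HESITANT:
            profile["hesitant"] += 1
    return profile

# Hypothetical sample text.
sample = "They think, they shout, they speak. Others hesitate and ponder."
profile = verb_profile(sample)
print(dict(profile))
```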
Cara Santa Maria: Theme, plot, style and characters are all terms that anyone interested in writing will be aware of. The difference is, Matt and Jodie were able to definitively show how manipulating those factors determines whether a book will end up on the best seller shelf or in a discount bin.
Cara Santa Maria: In a retrospective blind test, the algorithm had an 80% success rate in identifying past books on The New York Times best seller list. Matt and Jodie think if applied to the future with new books, their code could help editors answer a million dollar question: Where's the next best seller going to come from?
Cara Santa Maria: So they started consulting with publishers and authors and they wrote about the algorithm in their own book called The Bestseller Code, but other people are using their work for a different purpose.
Stephen Marche: Writing a novel in particular is a very cruel process.
Cara Santa Maria: Stephen Marche is a freelance writer for The New York Times, the Atlantic and Esquire magazine. That's what he does for a living. Writing novels and short stories is what he does for fun.
Stephen Marche: It's like a campaign of war, like you have to devote yourself everyday to it for several years.
Cara Santa Maria: Okay, so maybe not exactly fun.
Stephen Marche: I mean, I write out of compulsion, I would say. I write because I feel I have to.
Cara Santa Maria: He's an artist driven by passion and he was once completely opposed to Matt and Jody's system.
Stephen Marche: I wrote an essay that was called "Literature Is Not Data," in which I sort of pointed out that one of the problems with statistical analysis of literature is that it removed context and that you know, you could say the same exact phrases in two different literary works and they would have totally different meanings and that this was kind of the point of literature.
Cara Santa Maria: That was his initial reaction.
Stephen Marche: But while I was writing that critique, I actually, it was kind of one of those things while you're critiquing it, I kind of became fascinated with what it could be.
Cara Santa Maria: Specifically, Stephen saw beyond the predictive function to something he thought had more potential.
Stephen Marche: When you're writing a short story anyway, what you're doing is you're trying to imitate the best, like you learn how short stories work by reading the great short story writers and then taking what's best in them and fusing it into something that is your own.
Cara Santa Maria: Stephen wondered if he could use an algorithm to tell him which storytelling techniques to take from other writers.
Stephen Marche: So, essentially you're taking the process that writers use anyway and adding this technical component to it. Really, a calculator. It's like instead of doing the arithmetic in your own head, you've essentially created a machine to help you do it.
Cara Santa Maria: It's the same thought that had occurred to Matt Jockers back when he was studying Beowulf, so Stephen wanted to create a digital writing coach, but he's a writer, not a programmer, so he turned to Adam Hammond, a digital humanities professor at the University of Toronto.
Cara Santa Maria: Adam took a long look at what Matt Jockers had created.
Adam Hammond: I modified his code a little bit here and there, but to do this project with Stephen, it's pretty much just using Jockers' code.
Cara Santa Maria: Adam used Matt's algorithm to create a set of rules for writing, but based on a different data set, one provided by Stephen.
Adam Hammond: He sent us 50 short stories that he thought were good science fiction.
Cara Santa Maria: Adam then compared those 50 to 2,000 other sci-fi short stories.
Adam Hammond: I did topic modeling to derive what are the topics that he needs to avoid or embrace.
Cara Santa Maria: So, if a group of words showed up repeatedly in Stephen's 50, but not very often in the other stories, Adam declared that to be a rule.
Adam Hammond: Apple tree, farm house, cow. Like it was really obvious that it was a farm topic, so one of his rules was that there had to be a scene set on a farm.
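The rule-finding step described above, a group of words that is common in the favorite stories but rare in the rest, can be sketched in Python. This is a simplification and not Adam's method: the word groups are given by hand rather than learned by a topic model, and the `min_ratio` cutoff and sample stories are hypothetical.

```python
def doc_share(docs, topic_words):
    """Fraction of documents containing at least one word from topic_words."""
    hits = sum(1 for doc in docs if topic_words & set(doc.lower().split()))
    return hits / len(docs)

def is_rule(favorites, reference, topic_words, min_ratio=2.0):
    """True when the topic is at least `min_ratio` times more common
    in the favorites than in the reference stories."""
    fav = doc_share(favorites, topic_words)
    ref = doc_share(reference, topic_words)
    return fav > 0 and (ref == 0 or fav / ref >= min_ratio)

# Hypothetical one-line "stories" standing in for the 50 and the 2,000.
favorites = ["the cow stood by the farmhouse",
             "an apple tree on the farm",
             "a ship left at night"]
reference = ["the ship left orbit",
             "robots marched at dawn",
             "a cow on mars",
             "the city slept"]

farm_rule = is_rule(favorites, reference, {"cow", "farmhouse", "farm", "apple"})
ship_rule = is_rule(favorites, reference, {"ship"})
print("farm:", farm_rule, "ship:", ship_rule)
```

In this toy data the farm words clear the cutoff and the ship word does not, because ships are nearly as common in the reference set.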
Cara Santa Maria: Some were like, really specific.
Adam Hammond: They needed to have a scene in which a large metal ship escaped from a building at night and the ship had to have a bed in it.
Cara Santa Maria: Others were confusing for Stephen.
Stephen Marche: One of the rules was it must be set on a foreign planet and the other rule was it must be set on Earth. And so I was like, "I don't know how that's possible." And then it occurred to me like, what I'll do is I'll have people on earth watching a distant planet without traveling to it. So, the algorithm did actually give me a pretty interesting premise, which would never have occurred to me.
Cara Santa Maria: Next, Stephen asked the algorithm to find repeated styles and turn those into rules.
Stephen Marche: Like for example, this is a question I ask myself every time I write a short story: how much dialogue should I have? Because there's different schools of thought on this. What I got from this algorithm was your story needs to have exactly 29% dialogue.
Cara Santa Maria: Those quotas were put into an interactive program that would analyze his writing in real time.
Stephen Marche: I could take my story and put it in, and then there would be a series of red lights on the side showing what was off.
Adam Hammond: His first draft was way off, like none of the lights turned green. They were either over or under.
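The red and green lights can be imitated with a short Python check. The 29% dialogue target comes from the episode; everything else here is an assumption for illustration, including the tolerance, the way dialogue is detected (words inside straight double quotes), and the sample draft.

```python
import re

def dialogue_share(text):
    """Fraction of words that fall inside straight double quotes."""
    total = len(re.findall(r"[A-Za-z']+", text))
    quoted = re.findall(r'"([^"]*)"', text)
    inside = sum(len(re.findall(r"[A-Za-z']+", q)) for q in quoted)
    return inside / total if total else 0.0

def light(value, target, tolerance):
    """Report whether a measured value is over, under, or on target."""
    if value > target + tolerance:
        return "red: over"
    if value < target - tolerance:
        return "red: under"
    return "green"

# Hypothetical draft text.
draft = 'She looked up. "Is anyone out there?" Nobody answered for a long time.'
share = dialogue_share(draft)

# Target from the episode; the tolerance is a made-up value.
print(f"dialogue {share:.0%}:", light(share, target=0.29, tolerance=0.02))
```

Because every quota is measured over the whole story, fixing one light can push another out of range, which is the ripple effect Stephen describes later.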
Cara Santa Maria: The algorithm was like the pickiest English teacher you ever had, judging not just the number of adverbs or adjectives, but what kind. That's one of the things Adam and his team added to the original algorithm: the ability to assess the quality of Stephen's vocabulary.
Stephen Marche: So, like if there were too many fancy adjectives, there was a red light that said you need to decrease the effusiveness of your adjectives, and then I would do that until it became green.
Cara Santa Maria: That might be as simple as changing the word crimson to red, but every change had a ripple effect.
Stephen Marche: The interesting thing about it is that it really made clear from the algorithm's point of view, a story is a nexus, right? It's a network. And you know when you edit a short story normally, it's sentence by sentence. You improve one sentence and then you go onto the next. And the thing about this process is, it was holistic. So it's like if you add an adjective to make that sentence better, you have to take out an adjective later, in order to fit the algorithm. Like, you solve the green face of the Rubik's cube and you've messed up the yellow face.
Adam Hammond: He just kept editing it and editing it.
Cara Santa Maria: Then one day, all the lights turned green.
Stephen Marche: So it's like you have completed the algorithm, right? It's done. You have fulfilled the statistical terms which you set out.
Adam Hammond: We were so picky about everything, and he did it. It's incredible.
Cara Santa Maria: It satisfies all the rules. But is Stephen's short story actually incredible? It was good enough for WIRED magazine to publish it, but that was based more on its technology roots.
Stephen Marche: The universe could only exist under conditions in which ourselves and the others were there to witness it. At its peak, the institution for the study of extraterrestrial life had employed 264 fully trained researchers at the banks of screens and everybody called it the yonder.
Cara Santa Maria: Whether it's a great story is kind of beside the point. The purpose of the experiment was to create something that resembled the 50 stories Stephen had chosen.
Stephen Marche: I actually love the story. I mean, it was not like anything that I would have written, but I will tell you that, I hope this doesn't sound vain, but if I'd come across it in a 1950s, you know, Asimov stories or one of the Classics of Science Fiction anthologies and magazines? I would have thought it was a really great short story.
Cara Santa Maria: But we should be clear that what Adam Hammond and Stephen Marche did with Matt and Jodie's algorithm was not at all what it was built for.
Matt Jockers: I don't think you can engineer creativity.
Cara Santa Maria: That's Matt Jockers again.
Matt Jockers: I'm still very much a believer in sort of the creative enterprise and that if you're sitting there in front of your word processor and every time you type a word, you're getting some sort of a pop up that says you've used the word “the” too many times, I think that's going to interrupt the creative process.
Cara Santa Maria: But for Stephen, this was a new way of thinking about the creative process.
Stephen Marche: I wanted to think of it as an engineer rather than a writer. And I mean, when you're a writer, you're thinking about what kind of writer am I? And, what will the market be for this? And, is this going to make me famous?
Stephen Marche: I mean, the whole point of this experiment is to get away from all that and to get to something external and objective, like an engineer has. Plus, it was just fascinating. Like you're learning like, well, what are the metrical components of a short story, of a science fiction short story?
Cara Santa Maria: Think like an engineer. That's what we try to do on every episode of Fixed That For You, a podcast by Segment about unusual problems solved with data and algorithms.
Cara Santa Maria: If you want to learn more about Matt and Jodie's story and read Stephen's short story, check out the show notes. And you can find us at segment.com/podcast. Plus subscribe at Apple Podcasts, Google Podcasts, Spotify, or wherever you do that sort of thing. We drop a new episode every two weeks. I'm Cara Santa Maria. Thanks for listening.