Can Writing Styles Be Boiled Down to Statistics?
’Twas the week before Christmas Break, when all through the school, students were complaining, and trying to do new content would make me a fool… so we did something kind of interesting in my statistics class: a cool application that reviews a lot of great stuff about inference testing. It was inspired by Ben Blatt’s book Nabokov’s Favorite Word Is Mauve, a super cool statistics-based analysis of literature, and by this post on the Stats Medic, “Does Beyonce Write Her Own Lyrics?”. The basic questions are “How can we use basic statistics to examine and tell apart writing styles? What do statistics about your own writing say about your style?”
STEP 1: WHICH OF THESE AUTHORS ARE THE SAME?
To start, I gave them a page (available here, spoilers down below) from each of three different books (page 154 from each book, thanks Siri for the random number!). I told them two of these were written by the same author and one was written by a different author. How could we tell who wrote what? I told them the story of Hamilton, Madison and the disputed Federalist Papers to whet their appetite with a “real-world” example. To be honest, they didn’t care about that, but they were VERY intrigued by the puzzle of figuring out which authors were the same.
And the statistics began, but not from a canned dataset that they ran prescribed tests on. In fact, I was scrambling that week and hadn’t tried anything myself. I had no idea this was going to work! What things could we measure about the text to tell the difference between them? Some suggestions were too difficult to measure (e.g. tone), some had nothing to do with the writer (e.g. how many lines there were on the page), but others seemed easy to measure and perhaps distinctive of a writer (frequency of commas, length of words etc.). The students were skeptical that those things could distinguish authors, but we went after it anyway! We spent about a half-hour counting various things about the text, collected them in a shared document and then highlighted which two of the three pages were more alike on each measure.
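For anyone who would rather not count by hand, here is a minimal Python sketch of the kind of counting we did. It is not what we used in class (we counted on paper), and the function name `style_features`, the feature names and the sample sentence are all made up for illustration. The sentence-splitting and "-ly" rules are deliberately crude assumptions.

```python
import re

def style_features(text):
    """Return a few simple style measurements for a passage of text."""
    # Words: runs of letters, allowing internal apostrophes (e.g. "didn't").
    words = re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)*", text)
    # Sentences: split after ., ! or ? followed by whitespace. A rough rule
    # that will miscount abbreviations such as "Mr. Smith".
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    n_words = len(words)
    # Words ending in "ly" are only a proxy for adverbs ("family" also counts).
    ly_words = [w for w in words if w.lower().endswith("ly")]
    return {
        "words": n_words,
        "sentences": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": n_words / len(sentences),
        "commas_per_100_words": 100 * text.count(",") / n_words,
        "ly_words_per_100_words": 100 * len(ly_words) / n_words,
    }

# Made-up sample sentence, not taken from any of the three book pages.
sample = "She walked slowly, quietly, and alone. Why? Nobody really knew."
features = style_features(sample)
for name, value in features.items():
    print(f"{name}: {value:.2f}")
```

Reporting commas and "-ly" words per 100 words rather than as raw counts keeps pages of different lengths comparable.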
Of the 16 things we measured, 8 were the same between writers G and U (and 2 others were pretty much the same between all 3). Here come some interesting statistical questions… Why might one random page be off? (One sample could be skewed for no reason other than randomness.) What are the advantages and disadvantages of measuring a bunch of things? (More things = more opportunity for random associations, but also more opportunity to see a pattern.) Which of these differences are “significant”?
We then spent about a class on that last question. Given that we know a chi-squared test and a t-test, how could you use those on the things we measured? We did this in R, and I can give some details about that for anyone interested, but the interesting part here is getting kids to imagine how you would format the data so that you could use a statistical test. What do you feed into a t-test to compare sentence length? How could -ly adverbs become a chi-squared test? Is either even appropriate here? (Meh, mostly… )
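Here is one way the data formatting could look, as a hedged sketch rather than what we actually ran (we used R in class). It is plain standard-library Python, the helper names `welch_t` and `chi_squared_2x2` are my own, and every number in it is invented for illustration. For the t-test, each page becomes a list of sentence lengths. For the chi-squared test, each page becomes a row of counts: -ly words versus all other words.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic and its approximate degrees of freedom."""
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df

def chi_squared_2x2(table):
    """Pearson chi-squared statistic (no continuity correction) and p-value
    for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# HYPOTHETICAL data: words per sentence on two of the pages.
page_g = [12, 25, 8, 31, 17, 22, 9, 14]
page_u = [15, 21, 11, 28, 19, 24, 10, 16]
t, df = welch_t(page_g, page_u)
print(f"t = {t:.3f}, df = {df:.1f}")

# HYPOTHETICAL counts: rows are pages, columns are (-ly words, other words).
ly_table = [[9, 291], [4, 296]]
stat, p = chi_squared_2x2(ly_table)
print(f"chi-squared = {stat:.3f}, p = {p:.3f}")
```

The sketch stops at the t statistic and degrees of freedom because turning them into a p-value needs the t distribution, which is what R's `t.test` handles for you. Whether sentences on a single page are independent enough for either test is exactly the "Meh, mostly" caveat.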
Wait… what? Those are three different authors. NOT SO FAST! Robert Galbraith is actually a pseudonym for… J.K. Rowling! (I wish I had played that up a bit more.) So our statistics worked in a way: there were more similarities between G and U than between the other combinations. So even when J.K. Rowling was writing under a pseudonym, her writing style was similar. Cool!!!!!!
STEP 2: WHAT DOES YOUR WRITING STYLE LOOK LIKE?
Now, I wanted them to do something similar with their own writing. They had just written a joint paper with a partner, and I wanted them to see if their joint paper more closely resembled their own writing or their partner’s. HAHAHAHAHAHA. They were hilariously sheepish about this idea, which told me immediately who had done what 🙂 (but it was all in good fun).
Enter a new tool, Count Wordsworth, an online tool that automatically measures a WHOLE BUNCH of statistics about any text that you paste in there (at which point they got mad at me because they had done so much by hand for the pages of the books, but they’re always mad at me for stuff like that). For example, here is just part of the output when I put in my teaching philosophy from my teaching portfolio:
I had them all put in a recent English paper and then find the THREE biggest differences between their paper and their partner’s. Again, a bunch of fun data questions: do the quotes in the paper mess things up? How about the number of words? What about the type of paper (an English paper vs. a lab report)?
Then, once they had discovered the three biggest differences, I had them put in their joint paper and try to figure out whose writing style it more closely resembled. This class was a blast, and once they finished this, they were so curious that they just kept exploring… Some kids put in their freshman year papers, some put in the headmaster’s emails etc. Lots of fun curiosity!
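The kids eyeballed the "whose style is it closer to?" question, but it can also be turned into a single number. Below is a rough Python sketch under my own assumptions: the function `style_distance` is hypothetical, the feature names are not Count Wordsworth's actual labels, and the three profiles are invented numbers. It averages the relative difference on each shared measure, so that a measure with big values (sentence length) doesn't drown out one with small values (word length).

```python
def style_distance(profile_a, profile_b):
    """Mean relative difference across the measures both profiles share."""
    shared = sorted(set(profile_a) & set(profile_b))
    if not shared:
        raise ValueError("profiles have no measures in common")
    diffs = []
    for name in shared:
        a, b = profile_a[name], profile_b[name]
        scale = (abs(a) + abs(b)) / 2
        # If both values are zero, the profiles agree on this measure.
        diffs.append(abs(a - b) / scale if scale else 0.0)
    return sum(diffs) / len(diffs)

# HYPOTHETICAL profiles: invented numbers, not real student data.
partner_1 = {"avg_sentence_length": 18.0, "commas_per_100_words": 6.0,
             "avg_word_length": 4.5}
partner_2 = {"avg_sentence_length": 26.0, "commas_per_100_words": 9.0,
             "avg_word_length": 5.0}
joint = {"avg_sentence_length": 24.0, "commas_per_100_words": 8.0,
         "avg_word_length": 4.9}

distances = {
    "partner_1": style_distance(joint, partner_1),
    "partner_2": style_distance(joint, partner_2),
}
closest = min(distances, key=distances.get)
print(distances)
print("Joint paper looks more like:", closest)
```

Treating every measure as equally important is a design choice, not a fact about writing. Restricting the comparison to the three biggest differences the partners found, as we did in class, is another reasonable option.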
STEP 3: HOW DOES A PROFESSIONAL STATISTICIAN DO THE SAME SORT OF ANALYSIS?
Lastly, we read a short 10-page segment of the book I mentioned in the beginning, Ben Blatt’s Nabokov’s Favorite Word Is Mauve, specifically a chapter called “Searching for Fingerprints.” It was fun to see what a professional statistician does, and we talked about how he could possibly measure some of the things that he did with the computing power we have nowadays.
Good stuff! Happy Holidays everyone!