In one of the Java courses I took on Coursera, a simple exercise consists of counting the occurrences of the words in a file. The examples we used were plays by Shakespeare, and while I’ve read a few of them, I prefer Lovecraft and Python.

The idea is simple:

  • read a file
  • split it into words
  • count the occurrences of the words that are not common English words.

We’ll repeat this operation for several of Lovecraft’s short stories and visualize the most frequent words of each one. Hopefully, this should give us an idea of what each story is about.

This post was created with a Jupyter Notebook and a few libraries:

%matplotlib inline
import os
import re
from collections import defaultdict
from itertools import islice, repeat

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

matplotlib.style.use('ggplot')

Counting the relevant words

Here are the functions we’ll need to count the words.

def count_words_frequency(inputfile, excluded=()):
    """Count the frequency of the words in the inputfile.
    
    Excluded words can be specified with the keyword `excluded`.
    
    """
    excluded = set(excluded)  # a set makes the membership test fast
    frequencies = defaultdict(int)
    with open(inputfile) as f:
        for line in f:
            for word in line.split():
                # Lowercase and keep only letters
                word = re.sub("[^a-z]", "", word.lower())
                if word and word not in excluded:
                    frequencies[word] += 1
    return frequencies


def count_total_words(frequencies):
    """Count the total number of words from the given frequency table"""
    return sum(frequencies.values())
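
As a quick sanity check of these two functions (the sample sentence and the temporary file are only for illustration):

import tempfile

sample = "The cult, the cult! Cthulhu waits; the cult chants."
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)

freqs = count_words_frequency(tmp.name, excluded=['the'])
print(dict(freqs))               # {'cult': 3, 'cthulhu': 1, 'waits': 1, 'chants': 1}
print(count_total_words(freqs))  # 6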

In order to exclude the words that occur frequently in the English language and would distort the results, I used the 10000 most common English words as determined by n-gram frequency analysis of Google’s Trillion Word Corpus.

CURRENT_DIR = os.path.expanduser("~/Dropbox/Projects/Lovecraft")
ENGLISH_WORDS_FILE = os.path.join(CURRENT_DIR, "data", "google-10000-english.txt")

def get_most_common_english_words(max_word_count=1000):
    """Return a list of the most common English words.
    
    The maximum number of words can be specified with `max_word_count`.
    The maximum number of available words is 10000.
    
    """
    with open(ENGLISH_WORDS_FILE) as f:
        # The file is sorted by decreasing frequency: take the first lines
        return [line.strip() for line in islice(f, max_word_count)]
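
A quick check of the output (the exact words depend on the data file, but the top of the list looks like this):

print(get_most_common_english_words(5))
# e.g. ['the', 'of', 'and', 'to', 'a']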

Now the processing part. There’s a bit of calibration to do to determine how many of the most common English words to exclude. Indeed, if we exclude the 10000 most common words, we’re mainly left with the names of the protagonists and rare words. In some cases, this may not tell us much about what is happening but rather who is in the story. On the other hand, if we choose a lower value, we may get some useful words but others may be irrelevant. After some tests, I chose to exclude the 5000 most common words. Note that we could adapt this value to each short story but, for the sake of simplicity, we’ll stick to 5000 for the six texts we’ll analyze.
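
To eyeball the effect of the cutoff, one could compare the top words for a few values. A quick sketch reusing the functions above (output omitted, and not part of the original run):

inputfile = os.path.join(CURRENT_DIR, "data", "cthulhu.txt")
for cutoff in (1000, 5000, 10000):
    excluded = get_most_common_english_words(cutoff)
    freqs = count_words_frequency(inputfile, excluded)
    top5 = sorted(freqs, key=freqs.get, reverse=True)[:5]
    print('%5d -> %s' % (cutoff, top5))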

Now let’s apply our word-frequency function (count_words_frequency) to each text and store the results in a pandas.DataFrame:

filenames = ("cthulhu.txt", "mountains_of_madness.txt", "the_unnamable.txt", "charles_dexter_ward.txt", 
         "wall_of_sleep.txt", "erich_zann.txt")

english_words = get_most_common_english_words(5000)

words_frequency = {}
total_words = {}
dfs = {}

for filename in filenames:
    inputfile = os.path.join(CURRENT_DIR, "data", filename)
    name = filename[:-4]
    print 'Reading %s...' % filename
    words_frequency[name] = count_words_frequency(inputfile, english_words)
    total_words[name] = count_total_words(words_frequency[name])
    dfs[name] = pd.DataFrame(words_frequency[name].items(), columns=('word', 'frequency'))
    

Let’s look at the example of ‘The Call of Cthulhu’ and see what the 3 most frequent words are:

dfs['cthulhu'].sort_values(by='frequency', ascending=False).head(3)

word frequency
cult 30
johansen 23
cthulhu 22

From what we see, ‘The Call of Cthulhu’ is mainly about a cult, a person called Johansen and the famous Cthulhu. Sounds right!

Visualization

We visualize the most frequent words with a word cloud, using the wordcloud package:

def plot_word_cloud(data):
    """Plot the given (word, frequency) pairs as a word cloud"""
    # Expand each word into `frequency` repetitions
    text = []
    for word, freq in data:
        text.extend(repeat(word, freq))
    text = ','.join(text)

    # Generate a word cloud image
    wordcloud = WordCloud(background_color='white').generate(text)

    # Display the generated image the matplotlib way
    plt.figure(figsize=(8, 6))
    plt.imshow(wordcloud)
    plt.axis("off")
    return plt
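
Note that wordcloud can also build the image straight from a frequency table with WordCloud.generate_from_frequencies, which avoids materializing the repeated words. A minimal variant of the function above (same plotting code, different entry point):

def plot_word_cloud_from_frequencies(data):
    """Plot (word, frequency) pairs without expanding them into text"""
    frequencies = {word: int(freq) for word, freq in data}
    # generate_from_frequencies takes a word -> frequency mapping directly
    wordcloud = WordCloud(background_color='white').generate_from_frequencies(frequencies)
    plt.figure(figsize=(8, 6))
    plt.imshow(wordcloud)
    plt.axis("off")
    return plt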

The Call of Cthulhu

cthulhu = dfs['cthulhu'].sort_values(by='frequency', ascending=False).head(10)
cthulhu

word frequency
cult 30
johansen 23
cthulhu 22
uncle 21
legrasse 20
wilcox 19
angell 13
whilst 12
manuscript 11
basrelief 11

plot_word_cloud(cthulhu.values);

[Word cloud for ‘The Call of Cthulhu’]

At the Mountains of Madness

mountains_of_madness = dfs['mountains_of_madness'].sort_values(by='frequency', ascending=False).head(10)
mountains_of_madness

word frequency
danforth 53
antarctic 48
vast 43
sculptures 41
specimens 37
monstrous 31
curious 31
carvings 30
abyss 29
peaks 28

plot_word_cloud(mountains_of_madness.values);

[Word cloud for ‘At the Mountains of Madness’]

The Unnamable

the_unnamable = dfs['the_unnamable'].sort_values(by='frequency', ascending=False).head(10)
the_unnamable

word frequency
manton 13
unnamable 9
attic 9
tomb 7
deserted 7
whispered 5
slab 4
spectral 4
legends 3
slate 3

plot_word_cloud(the_unnamable.values);

[Word cloud for ‘The Unnamable’]

The Case of Charles Dexter Ward

charles_dexter_ward = dfs['charles_dexter_ward'].sort_values(by='frequency', ascending=False).head(10)
charles_dexter_ward

word frequency
willett 169
curwen 158
ye 87
wards 52
pawtuxet 50
curwens 43
providence 37
weeden 35
capt 33
curious 33

plot_word_cloud(charles_dexter_ward.values);

[Word cloud for ‘The Case of Charles Dexter Ward’]

Beyond the Wall of Sleep

wall_of_sleep = dfs['wall_of_sleep'].sort_values(by='frequency', ascending=False).head(10)
wall_of_sleep

word frequency
slater 25
waking 6
decadent 5
cosmic 5
cannot 4
couch 4
ethereal 4
valleys 4
oppressor 4
luminous 4

plot_word_cloud(wall_of_sleep.values);

[Word cloud for ‘Beyond the Wall of Sleep’]

The Music of Erich Zann

erich_zann = dfs['erich_zann'].sort_values(by='frequency', ascending=False).head(10)
erich_zann

word frequency
zann 17
viol 13
rue 11
dauseil 11
garret 8
erich 8
dumb 6
zanns 5
strains 4
shutter 4

plot_word_cloud(erich_zann.values);

[Word cloud for ‘The Music of Erich Zann’]

Conclusion

From the results, we clearly see the topics Lovecraft treated in his stories: strange cults, dreams, madness and powerless men facing terrifying creatures. From a technical point of view, this approach is a bit simplistic, but it gives a good approximation. The study could be improved with natural language processing: instead of deriving the meaning from a list of words, we could analyze the sentiment of each short story.
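
As a hypothetical starting point for that idea, NLTK’s VADER analyzer could score a whole text in a few lines (this assumes nltk is installed and the vader_lexicon data has been downloaded; VADER is tuned for short social-media texts, so the scores would only be a rough signal):

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
with open(os.path.join(CURRENT_DIR, "data", "cthulhu.txt")) as f:
    scores = sia.polarity_scores(f.read())
print(scores)  # a dict with 'neg', 'neu', 'pos' and 'compound' scores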