In one of the Java courses I took on Coursera, a simple exercise consists of counting the occurrences of words in a file. The examples we used were plays by Shakespeare, and while I’ve read a few of them, I prefer Lovecraft and Python.

The idea is simple:

  • we read a file
  • split it into words
  • count the occurrences of the words that are not common English words.

We’ll repeat this operation for many short stories of Lovecraft and we’ll visualize the most frequent words of each short story. Hopefully, this should give us an idea of what each one is about.

This post was created in a Jupyter notebook with a few libraries:

%matplotlib inline
import os
import re
from collections import defaultdict
from itertools import repeat

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

matplotlib.style.use('ggplot')

Counting the relevant words

Here are the functions we’ll need to count the words.

def count_words_frequency(inputfile, excluded=()):
    """Count the frequency of the words in the inputfile.

    Excluded words can be specified with the keyword `excluded`;
    passing it as a set makes the membership test much faster.
    """
    frequencies = defaultdict(int)
    with open(inputfile) as f:
        for line in f:
            for word in line.split():
                # Normalize: lowercase and keep only letters
                word = re.sub("[^a-z]", "", word.lower())
                if word and word not in excluded:
                    frequencies[word] += 1
    return frequencies

def count_total_words(frequencies):
    """Count the total number of words from the given frequency table"""
    return sum(frequencies.values())

In order to exclude the words that occur frequently in the English language and would distort the results, I used the 10000 most common English words as determined by n-gram frequency analysis of Google’s Trillion Word Corpus.

CURRENT_DIR = os.path.expanduser("~/Dropbox/Projects/Lovecraft")
ENGLISH_WORDS_FILE = CURRENT_DIR + "/data/google-10000-english.txt"

def get_most_common_english_words(max_word_count=1000):
    """Return a list of the most common English words.

    The maximum number of words can be specified with `max_word_count`.
    The maximum number of available words is 10000.
    """
    most_common_english_words = []
    # The file lists one word per line, most frequent first
    with open(ENGLISH_WORDS_FILE) as f:
        for n, line in enumerate(f):
            if n >= max_word_count:
                break
            most_common_english_words.append(line.strip())
    return most_common_english_words

Now the processing part. There’s a bit of calibration to do to determine how many of the most common English words to exclude. If we exclude the full 10000 most common words, we’ll mainly be left with the names of the protagonists or rare words. In some cases, this may not tell us much about what is happening but rather who is in the story. On the other hand, if we choose a lower value we may get some useful words but others may be irrelevant. After some tests, I chose to exclude the 5000 most common words. Note that we could adapt this value to each short story but, for the sake of simplicity, we’ll stick to 5000 for the 6 texts we’ll analyze.
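To see what the cutoff changes in practice, we can compare the words that survive different exclusion lists on a single story. This is a quick sketch reusing the functions defined above; it isn’t part of the original analysis:

for cutoff in (1000, 5000, 10000):
    # A set makes the `word not in excluded` test fast
    excluded = set(get_most_common_english_words(cutoff))
    freqs = count_words_frequency(os.path.join(CURRENT_DIR, "data", "cthulhu.txt"),
                                  excluded)
    # Keep the 5 most frequent surviving words for each cutoff
    top = sorted(freqs.items(), key=lambda item: item[1], reverse=True)[:5]
    print('cutoff=%5d: %s' % (cutoff, [word for word, _ in top]))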

Then let’s apply our function for counting the frequency of the words (count_words_frequency) and store the results in a pandas.DataFrame:

filenames = ("cthulhu.txt", "mountains_of_madness.txt", "the_unnamable.txt",
             "charles_dexter_ward.txt", "wall_of_sleep.txt", "erich_zann.txt")

# Use a set for fast membership tests in count_words_frequency
english_words = set(get_most_common_english_words(5000))

words_frequency = {}
total_words = {}
dfs = {}

for filename in filenames:
    inputfile = os.path.join(CURRENT_DIR, "data", filename)
    name = filename[:-4]  # strip the ".txt" extension
    print('Reading %s...' % filename)
    words_frequency[name] = count_words_frequency(inputfile, english_words)
    total_words[name] = count_total_words(words_frequency[name])
    dfs[name] = pd.DataFrame(list(words_frequency[name].items()),
                             columns=('word', 'frequency'))

Let’s look at the example of ‘The Call of Cthulhu’ and see what the 3 most frequent words are:

dfs['cthulhu'].sort_values(by='frequency', ascending=False).head(3)

word frequency
cult 30
johansen 23
cthulhu 22

From what we see, ‘The Call of Cthulhu’ is mainly about a cult, a person called Johansen and the famous Cthulhu. Sounds right!
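Raw counts depend on the length of each story, and we already computed total_words, so we can also express the counts as relative frequencies. A small illustrative snippet, not shown in the original post:

# Share of the counted (non-excluded) words for the top 3 entries
top3 = dfs['cthulhu'].sort_values(by='frequency', ascending=False).head(3)
for word, freq in zip(top3['word'], top3['frequency']):
    print('%s: %.2f%% of the counted words' % (word, 100.0 * freq / total_words['cthulhu']))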


We visualize the most frequent words with a word cloud, using the wordcloud package:

def plot_word_cloud(data, n=20):
    """Plot the n most frequent words with wordcloud.

    `data` is an iterable of (word, frequency) pairs.
    """
    # Repeat each word proportionally to its frequency
    text = []
    for word, freq in list(data)[:n]:
        text.extend(repeat(word, freq))

    text = ','.join(text)
    # Generate a word cloud image
    wordcloud = WordCloud(background_color='white').generate(text)

    # Display the generated image, the matplotlib way
    plt.figure(figsize=(8, 6))
    plt.imshow(wordcloud)
    plt.axis('off')
    return plt
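The plotting calls are omitted in the sections below; each word cloud would be generated along these lines (a hypothetical invocation, consistent with the function above):

df = dfs['cthulhu'].sort_values(by='frequency', ascending=False)
plot_word_cloud(zip(df['word'], df['frequency']))
plt.show()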

The Call of Cthulhu

cthulhu = dfs['cthulhu'].sort_values(by='frequency', ascending=False).head(10)

word frequency
cult 30
johansen 23
cthulhu 22
uncle 21
legrasse 20
wilcox 19
angell 13
whilst 12
manuscript 11
basrelief 11



The Mountains of Madness

mountains_of_madness = dfs['mountains_of_madness'].sort_values(by='frequency', ascending=False).head(10)

word frequency
danforth 53
antarctic 48
vast 43
sculptures 41
specimens 37
monstrous 31
curious 31
carvings 30
abyss 29
peaks 28



The Unnamable

the_unnamable = dfs['the_unnamable'].sort_values(by='frequency', ascending=False).head(10)

word frequency
manton 13
unnamable 9
attic 9
tomb 7
deserted 7
whispered 5
slab 4
spectral 4
legends 3
slate 3



The Case of Charles Dexter Ward

charles_dexter_ward = dfs['charles_dexter_ward'].sort_values(by='frequency', ascending=False).head(10)

word frequency
willett 169
curwen 158
ye 87
wards 52
pawtuxet 50
curwens 43
providence 37
weeden 35
capt 33
curious 33



Beyond the Wall of Sleep

wall_of_sleep = dfs['wall_of_sleep'].sort_values(by='frequency', ascending=False).head(10)

word frequency
slater 25
waking 6
decadent 5
cosmic 5
cannot 4
couch 4
ethereal 4
valleys 4
oppressor 4
luminous 4



The Music of Erich Zann

erich_zann = dfs['erich_zann'].sort_values(by='frequency', ascending=False).head(10)

word frequency
zann 17
viol 13
rue 11
dauseil 11
garret 8
erich 8
dumb 6
zanns 5
strains 4
shutter 4




From the results, we clearly see the topics Lovecraft treated in his stories: strange cults, dreams, madness and powerless men facing terrifying creatures. From a technical point of view, this approach is a bit simplistic, but it’s a good approximation. This study could be improved with natural language processing: instead of inferring meaning from a list of words, we could analyze the sentiment of each short story.
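As a small taste of that next step, here is a minimal sketch using NLTK’s VADER sentiment analyzer. This is my own addition, not part of the original study, and it assumes the nltk package is installed:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
analyzer = SentimentIntensityAnalyzer()
with open(os.path.join(CURRENT_DIR, "data", "cthulhu.txt")) as f:
    scores = analyzer.polarity_scores(f.read())
# VADER targets short texts, so averaging per-sentence scores would be
# more faithful; this whole-text score is only a first approximation.
print(scores)  # neg/neu/pos/compound scores for the whole text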