3. Cleaning and Counting#

At the end of the last chapter, we briefly discussed the complexities involved in working with textual data. Textual data presents a number of challenges – which stem as much from general truths about language as they do from data representations – that we need to address so that we can formalize text in a computationally tractable manner.

This formalization enables us to count words. Nearly all methods in text analytics begin by counting the number of times a word occurs and by taking note of the context in which that word occurs. With these two pieces of information, counts and context, we can identify relationships among words and, on this basis, formulate interpretations.

This chapter will discuss how to wrangle the messiness of text so we can count it. We’ll continue with Frankenstein and learn how to prepare text so as to generate valuable metrics about the words within the novel (later sessions will use these metrics for multiple texts).

Learning Objectives

By the end of this chapter, you will be able to:

  • Clean textual data

  • Recognize how cleaning changes the findings of text analysis

  • Implement preliminary counting operations on cleaned text

  • Use a statistical measure (pointwise mutual information) to identify distinctive phrases

3.1. Preliminaries#

3.1.1. Overview#

Think back to the end of the last chapter. There, we discussed differences between how computers represent and process text and our own way of reading. A key difference involves details like spelling and capitalization. For us, the meaning of text tends to cut across these details. But they make all the difference in how computers track information. So, if we want to work at a higher order of meaning, not just character sequences, we need to eliminate as much of this variance in the textual data as possible.

Eliminating this variance is known as cleaning text. The entire process typically happens in steps, which include:

  1. Resolving word cases

  2. Removing punctuation

  3. Removing numbers

  4. Removing extra whitespaces

  5. Removing stop words

Note though that there is no pre-set way to clean text. The steps you need to perform all depend on your data and the questions you have. We’ll walk through each of these steps below and, along the way, compare how they alter the original text to show why you might (or might not) implement them.
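
To preview where we're headed, here is a rough sketch of these steps gathered into a single function. The name clean_text and its stoplist parameter are just illustrative, not part of any library; the sections below build each step up properly, against real data, and explain the choices involved.

import re

def clean_text(doc, stoplist):
    """A rough sketch of the cleaning steps covered in this chapter."""
    doc = doc.lower()                    # 1. resolve word cases
    doc = re.sub(r"[-—]", " ", doc)      # 2. replace dashes with spaces...
    doc = re.sub(r"[^\w\s]", "", doc)    #    ...then drop other punctuation
    doc = re.sub(r"_", "", doc)          #    ...and underscores
    doc = re.sub(r"[0-9]", "", doc)      # 3. drop numbers
    tokens = doc.split()                 # 4. splitting handles extra whitespace
    return [tok for tok in tokens if tok not in stoplist]  # 5. drop stop words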

3.1.2. Setup#

To make these comparisons, load in Frankenstein.

with open("data/session_one/shelley_frankenstein.txt", 'r') as fin:
    frankenstein = fin.read()

And import some libraries.

import re
from collections import Counter
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import collocations
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Finally, define a simple function to count words. This will help us quickly check the results of a cleaning step.

def count_words(doc):
    """Count words in a document."""
    doc = doc.split()
    counts = Counter(doc)

    return counts

Let’s use this function to get the original number of words in Frankenstein.

counts = count_words(frankenstein)
print("Unique words:", len(counts))
Unique words: 11590

3.2. Basic Cleaning#

3.2.1. Case normalization#

The first step in cleaning is straightforward. Since computers are case-sensitive, we need to convert all characters to upper- or lowercase. It’s standard to change all letters to their lowercase forms.

uncased = frankenstein.lower()

This should reduce the number of unique words.

uncased_counts = count_words(uncased)
print("Unique words:", len(uncased_counts))
Unique words: 11219

Double check: will we face the same problems from the last chapter?

print("'Letter' in `uncased`:", ("Letter" in uncased_counts))
print("Number of times 'the' appears:", uncased_counts['the'])
'Letter' in `uncased`: False
Number of times 'the' appears: 4152

So far so good. Now, “the” has become even more prominent in the counts: there are ~250 more instances of this word after changing its case (it was 3,897 earlier).

3.2.2. Removing Punctuation#

Time to tackle punctuation. This step is trickier and it typically involves some back and forth between inspecting the original text and the output. This is because punctuation marks have different uses, so they can’t all be handled the same way.

Consider the following:

s = "I'm a self-taught programmer."

It seems most sensible to remove punctuation with regular expressions, or “regex,” using re.sub(), which substitutes anything matching a regex pattern with something else. For example, using regex to remove any character that is not (^) a word character (\w) or a whitespace character (\s) will give the following:

print(re.sub(r"[^\w\s]", "", s))
Im a selftaught programmer

This method has its advantages. It sticks the “m” in “I’m” back onto the “I.” While this isn’t perfect, as long as we remember that, whenever we see “Im,” we mean “I’m,” it’s workable. That said, this method also sticks “self” and “taught” together, which we don’t want: it would be better to separate those two words than to create a new one altogether. Ultimately, this is a tokenization question: what do we define as acceptable tokens in our data, and how are we going to create those tokens?

Different NLP libraries in Python will handle this question in different ways. For example, the word_tokenize() function from nltk returns the following:

nltk.word_tokenize(s)
['I', "'m", 'a', 'self-taught', 'programmer', '.']

See how it handles each punctuation mark differently, depending on its context?

In our case, we’ll want to separate phrases like “self-taught” into their components. The best way to do so is to process punctuation marks in stages. First, remove hyphens, then remove other punctuation marks.

s = re.sub(r"-", " ", s)
s = re.sub(r"[^\w\s]", "", s)
print(s)
Im a self taught programmer

Let’s use the same logic on Frankenstein. Note that we’re actually removing two different kinds of dashes, the hyphen (-) and the em dash (—). We also use different replacement strategies depending on the type of character.

no_punct = re.sub(r"[-—]", " ", uncased)
no_punct = re.sub(r"[^\w\s]", "", no_punct)

Finally, we remove underscores. Regex classes them as word characters, so the [^\w\s] expression above does not capture this punctuation.

no_punct = re.sub(r"_", "", no_punct)

Tip

If you didn’t want to do this separately, you could always include underscores in the pattern that handles hyphens, as in the sketch below. That said, punctuation removal is almost always a multi-step process, the honing of which involves multiple iterations.
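
For example, a single pattern can fold dashes and underscores together. The variable name alternative is just for illustration; note that this version replaces underscores with a space rather than deleting them outright, so “a_b” would become two tokens.

alternative = re.sub(r"[-—_]", " ", uncased)
alternative = re.sub(r"[^\w\s]", "", alternative)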

With punctuation removed, here is the text:

print(no_punct[:351])
letter 1

to mrs saville england


st petersburgh dec 11th 17 


you will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings i arrived here yesterday and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking

3.2.3. Removing numbers#

Removing numbers presents less of a problem. All we need to do is find the digits 0-9 and delete them.

no_punctnum = re.sub(r"[0-9]", "", no_punct)

Now that we’ve removed punctuation and numbers, unique word counts should significantly decrease. This is because we’ve stripped these characters out of word sequences, so “letter:” and “letter.” will both count as “letter”.

no_punctnum_counts = count_words(no_punctnum)
print("Unique words:", len(no_punctnum_counts))
Unique words: 6992

That’s nearly a 40% reduction in the number of unique words!

3.2.4. Text formatting#

Our punctuation and number removal introduced extra whitespace in the text (recall that we used a whitespace as a replacement character for some punctuation). We need to remove it, along with newlines and tabs. There are regex patterns for doing so, but Python’s .split() method splits on any run of whitespace characters and discards the empty strings left behind. In fact, the count_words() function above has been doing this all along. So tokenizing our text as before will also take care of this step.

cleaned = no_punctnum.split()
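
To see that behavior on its own, a quick check on a made-up string confirms that .split() collapses runs of spaces, tabs, and newlines alike:

print("letter   1\n\nto mrs saville\tengland".split())
['letter', '1', 'to', 'mrs', 'saville', 'england']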

With that, we are back to the list representation of Frankenstein that we worked with in the last chapter – but this time, our metrics are much more robust. Here are the top 25 words:

cleaned_counts = Counter(cleaned)
for entry in cleaned_counts.most_common(25):
    print(entry)
('the', 4194)
('and', 2976)
('i', 2850)
('of', 2642)
('to', 2094)
('my', 1776)
('a', 1391)
('in', 1129)
('was', 1021)
('that', 1017)
('me', 867)
('but', 687)
('had', 686)
('with', 667)
('he', 608)
('you', 574)
('which', 558)
('it', 547)
('his', 535)
('as', 528)
('not', 510)
('for', 498)
('by', 460)
('on', 460)
('this', 402)

3.3. Stopword Removal#

With the first few steps of our cleaning done, let’s pause and look more closely at our output. Inspecting the counts above shows a pattern: nearly all of them are deictic words, or words that are highly dependent on the context in which they appear. We use these words constantly to refer to specific times, places, and persons – indeed, they’re the very sinew of language, and their high frequency counts reflect this.

3.3.1. High frequency words#

Plotting word counts shows the full extent of these high frequency words. Below, we define a function to show them.

def plot_counts(counts, n_words = 200, by_x = 10):
    """Plot word counts."""
    counts = pd.DataFrame(counts.items(), columns = ('word', 'count'))
    counts.sort_values('count', ascending = False, inplace = True)

    fig, ax = plt.subplots(figsize = (9, 6))
    g = sns.lineplot(x = 'word', y = 'count', data = counts[:n_words], ax = ax)
    g.set(xlabel = "Words", ylabel = "Counts")
    plt.xticks(rotation = 90, ticks = range(0, n_words, by_x));

plot_counts(cleaned_counts, len(cleaned_counts), 150)
[Figure: line plot of word counts for every unique word in the cleaned text, sorted from most to least frequent]

See that giant drop? Let’s look at the 200-most frequent words.

plot_counts(cleaned_counts, 200, 5)
[Figure: line plot of word counts for the 200 most frequent words]

Only a few non-deictic words appear in the first half of this graph – “eyes,” “night,” “death,” for example. All others are words like “my,” “from,” etc. And the words in this second set have incredibly high frequency counts. In fact, the 50-most frequent words in Frankenstein comprise nearly 50% of the total number of words in the novel!

top50 = sum(count for word, count in cleaned_counts.most_common(50))
total = cleaned_counts.total()
print(f"Percentage of 50-most frequent words: {top50 / total * 100:.02f}%")
Percentage of 50-most frequent words: 47.71%

The problem here is that, even though these high frequency words help us mean what we say, they paradoxically don’t seem to have much meaning in and of themselves. These words are so common and so context-dependent that it’s difficult to find much to say about them in isolation. Worse still, every novel we put through the above analyses is going to have a very similar distribution of terms – such words are just a general fact of language.

We need a way to handle these words. The most common way to do this is to remove them altogether. This is called stopping the text. But how do we know which stopwords to remove?

3.3.2. Defining a stop list#

The answer comes in two parts. First, compiling various stop lists has been an ongoing research area in NLP since the emergence of information retrieval in the 1950s. There are several popular lists, which capture many of the words we’d think to remove: “the,” “do,” “as,” etc. Popular NLP packages even come preloaded with generalized lists; we’ll be using one compiled by the developers of Voyant, a text analysis portal.

with open("data/voyant_stoplist.txt", 'r') as fin:
    stopwords = fin.read().split("\n")

print(stopwords[:10])
['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost']

Removing stopwords can be done with a list comprehension:

stopped = [word for word in cleaned if word not in stopwords]
stopped_counts = Counter(stopped)
plot_counts(stopped_counts, 200, 5)
[Figure: line plot of word counts for the 200 most frequent words after stopword removal]

This is looking good, but it would be nice if we could also remove “said” from our text. Doing so brings us to the second, and most important, part of the answer to the question of which stopwords to use. In truth, which stopwords to remove depends on your texts and your research question(s). We’re looking at a novel, so we know there will be a lot of dialogue. But dialogue markers aren’t useful for understanding something like topicality (i.e. what the novel is about). So, we’ll remove those markers.

But in other texts, or with other research questions, we might not want to do so. A good stop list, then, is application-specific. You may in fact find yourself using different stop lists for different parts of a project.
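
As a small illustration, adapting a general-purpose list to a particular project can be as simple as adding a few words to it. The extra terms below are hypothetical choices for studying dialogue-heavy fiction, not part of the Voyant list; converting to a set also speeds up the membership checks we perform below.

fiction_stoplist = set(stopwords) | {"said", "replied", "exclaimed"}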

That all said, there is a broad set of NLP tasks that really depends on keeping stopwords in your text. These are tasks associated with part-of-speech tagging: they rely on stopwords to parse the grammatical structure of text. We’ll discuss one such task below, but for now we’ll go ahead with our current stop list.

Let’s add “said” to our stop list, redo the stopping process, and count the results. Note that it’s also customary to remove words that are two characters long or less (this prevents us from seeing things like “st,” for street).

stopwords += ['said']
stopped = [word for word in cleaned if word not in stopwords]
stopped = [word for word in stopped if len(word) > 2]
stopped_counts = Counter(stopped)
plot_counts(stopped_counts, 200, 5)
[Figure: line plot of word counts for the 200 most frequent words after adding “said” to the stop list and dropping short words]

Finally, here are the 50-most frequent words from our completely cleaned text:

for entry in stopped_counts.most_common(50):
    print(entry)
('man', 132)
('life', 115)
('father', 113)
('shall', 105)
('eyes', 104)
('time', 98)
('saw', 94)
('night', 92)
('elizabeth', 88)
('mind', 85)
('heart', 81)
('day', 80)
('felt', 80)
('death', 79)
('feelings', 76)
('thought', 74)
('dear', 72)
('soon', 71)
('friend', 71)
('passed', 67)
('miserable', 65)
('place', 64)
('like', 63)
('heard', 62)
('became', 61)
('love', 59)
('clerval', 59)
('little', 58)
('human', 58)
('appeared', 57)
('country', 54)
('misery', 54)
('words', 54)
('friends', 54)
('justine', 54)
('nature', 53)
('cottage', 51)
('feel', 50)
('great', 50)
('old', 50)
('away', 50)
('hope', 50)
('felix', 50)
('return', 49)
('happiness', 49)
('know', 49)
('days', 49)
('despair', 49)
('long', 48)
('voice', 46)

3.4. Advanced Cleaning#

With stopwords removed, all primary text cleaning steps are complete. But there are two more steps that we could perform to further process our data: stemming and lemmatizing. We’ll consider these separately from the steps above because they entail making significant changes to our data. Instead of simply removing pieces of irrelevant information, as with stopword removal, stemming and lemmatizing transform the forms of words.

3.4.1. Stemming#

Stemming algorithms are rule-based procedures that reduce words to their root forms. They cut down on the amount of morphological variance in a corpus, merging plurals into singulars, reducing gerunds to their base verbs, etc. As with the steps above, stemming a corpus cuts down on lexical variety. More, it enacts a shift to a generalized form of words’ meanings: instead of counting “have” and “having” as two different words with two different meanings, stemming would enable us to count them as a single entity, “have.”

nltk implements stemming with its PorterStemmer. It’s a class, which must be instantiated and assigned to a variable before use.

stemmer = PorterStemmer()

Let’s look at a few words.

to_stem = ('books', 'having', 'running', 'complicated', 'complicity')
for word in to_stem:
    print(f"{word:<12} => {stemmer.stem(word)}")
books        => book
having       => have
running      => run
complicated  => complic
complicity   => complic

There’s a lot of potential value in enacting these transformations. So far we haven’t developed a method for handling plurals; the stemmer can take care of them. Likewise, it usefully merges “have” with “having.” It would be difficult to come up with a custom algorithm that could handle the complexities of such a transformation.
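
To gauge the scale of that effect on our novel, we could run the stemmer over the stopped tokens and compare unique counts. This is just a quick sketch using the objects defined above; the exact numbers will depend on your earlier cleaning choices.

stemmed = [stemmer.stem(word) for word in stopped]
print("Unique words before stemming:", len(set(stopped)))
print("Unique words after stemming:", len(set(stemmed)))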

That said, the problem with stemming is that the process is rule-based and struggles with certain words. It can inadvertently merge what should be two separate words, as with “complicated” and “complicity” both becoming “complic.” And more, “complic” isn’t really a word. How would we know what it means when looking at word distributions?

3.4.2. Lemmatizing#

Lemmatizing solves some of these problems, though at the cost of more complexity. Like stemming, lemmatization strips away the inflected forms of words, returning them to their base, dictionary forms. While it tends to be more conservative in its approach, it is better at avoiding lexical merges like “complic.” More, the result of lemmatization is always a fully readable word.
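
As a quick preview of the difference, we can compare the two approaches on the words that tripped up the stemmer. This uses the WordNetLemmatizer we set up properly in the workflow below and assumes the WordNet data has been downloaded via nltk.download().

lemmatizer = WordNetLemmatizer()
print(stemmer.stem("complicity"), "vs.", lemmatizer.lemmatize("complicity"))
print(stemmer.stem("having"), "vs.", lemmatizer.lemmatize("having", pos='v'))
complic vs. complicity
have vs. have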

3.4.2.1. Part-of-speech tags and dependency parsing#

Lemmatizers can do all this because they use the context provided by part-of-speech tags (POS tags). To get the best results, you need to pipe in a tag for each word, which the lemmatizer uses to make its decisions. In principle, this is easy enough to do. Libraries like nltk can assign POS tags automatically; a closely related process, dependency parsing, goes further and analyzes the grammatical structure of a text string to map out how its words relate to one another.

But now for the catch: to work at their best, taggers and dependency parsers require both stopwords and some punctuation marks. Because of this, if you know you want to lemmatize your text, you’ll need to tag your text before doing other steps in the text cleaning process. Many lemmatizers also rely on assumptions made by associated tokenization processes, so it’s best to use those processes when lemmatizing.

A revised text cleaning workflow with the above considerations in mind would look like this:

  1. Tokenize with an automatic tokenizer

  2. Assign POS tags

  3. Lemmatize

  4. Remove punctuation

  5. Remove numbers

  6. Remove extra whitespace

  7. Remove stop words

3.4.2.2. Sample lemmatization workflow#

We won’t do all of this for Frankenstein, but in the next session, when we start to use classification models to understand the difference between texts, we’ll do so using texts cleaned according to the above workflow. For now, we’ll demonstrate POS tagging and lemmatization using nltk’s tokenizer in concert with its tagger and lemmatizer.

example = """The strong coffee, which I had after lunch, was $3. 
It kept me going the rest of the day."""

tokenized = nltk.word_tokenize(example)
for token in tokenized:
    print(token)
The
strong
coffee
,
which
I
had
after
lunch
,
was
$
3
.
It
kept
me
going
the
rest
of
the
day
.

Assigning POS tags:

tagged = nltk.pos_tag(tokenized)
for token in tagged:
    print(token)
('The', 'DT')
('strong', 'JJ')
('coffee', 'NN')
(',', ',')
('which', 'WDT')
('I', 'PRP')
('had', 'VBD')
('after', 'IN')
('lunch', 'NN')
(',', ',')
('was', 'VBD')
('$', '$')
('3', 'CD')
('.', '.')
('It', 'PRP')
('kept', 'VBD')
('me', 'PRP')
('going', 'VBG')
('the', 'DT')
('rest', 'NN')
('of', 'IN')
('the', 'DT')
('day', 'NN')
('.', '.')

Time to lemmatize. We do so in two steps. First, we need to convert the POS tags nltk produces to tags that this lemmatizer expects. We did say this is more complicated!

wordnet = nltk.corpus.wordnet

def convert_tag(tag):
    """Convert a TreeBank tag to a WordNet tag."""
    if tag.startswith('J'):
        tag = wordnet.ADJ
    elif tag.startswith('V'):
        tag = wordnet.VERB
    elif tag.startswith('N'):
        tag = wordnet.NOUN
    elif tag.startswith('R'):
        tag = wordnet.ADV
    else:
        tag = ''

    return tag

tagged = [(word, convert_tag(tag)) for (word, tag) in tagged]

Now we can lemmatize.

lemmatizer = WordNetLemmatizer()

def lemmatize(word, tag):
    """Lemmatize a word."""
    if tag:
        return lemmatizer.lemmatize(word, pos = tag)

    return lemmatizer.lemmatize(word)

lemmatized = [lemmatize(word, tag) for (word, tag) in tagged]

Joining the list entries back into a string will give the following:

joined = " ".join(lemmatized)
print(joined)
The strong coffee , which I have after lunch , be $ 3 . It keep me go the rest of the day .

From here, pass this string back through the cleaning steps we’ve already covered.
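
For instance, a condensed pass over joined using the steps from earlier in the chapter (lowercasing, removing punctuation and numbers, splitting, and stopping) might look like the following; the variable name recleaned is just for illustration.

recleaned = joined.lower()
recleaned = re.sub(r"[-—]", " ", recleaned)
recleaned = re.sub(r"[^\w\s]", "", recleaned)
recleaned = re.sub(r"_", "", recleaned)
recleaned = re.sub(r"[0-9]", "", recleaned)
print([word for word in recleaned.split() if word not in stopwords and len(word) > 2])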

3.5. Chunking with N-Grams#

We are now finished cleaning text. The last thing we’ll discuss in this session is chunking. Chunking is closely related to tokenization. It involves breaking text into multi-token spans. This is useful if you want to find phrases in data, or even entities. For example, all the steps above would dissolve “New York” into “new” and “york.” A multi-token span, on the other hand, would keep this string intact.

In this sense, it’s often useful to count not only single words in text, but continuous two-word strings, or even longer ones. These strings are called n-grams, where n is the number of tokens in the span. “Bigrams” are two-token spans. “Trigrams” have three tokens, while “4-grams” and “5-grams” have four and five tokens, respectively. Technically, there’s no limit to n-gram sizes, though their usefulness depends on your data and research questions.
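
To see what such spans look like, nltk’s ngrams() helper slides a window of size n over a token list (the toy sentence below is just for illustration):

tokens = "the old man sat by the fire".split()
print(list(nltk.ngrams(tokens, 2)))
[('the', 'old'), ('old', 'man'), ('man', 'sat'), ('sat', 'by'), ('by', 'the'), ('the', 'fire')]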

To finish this chapter, we’ll produce bigram counts on Frankenstein. nltk has built-in functionality to help us do so. There are a few options here. We’ll use objects from the collocations module.

A BigramCollocationFinder will find the bigrams.

finder = collocations.BigramCollocationFinder.from_words(stopped)

Access its ngram_fd attribute for counts, which we’ll store in a pandas DataFrame.

bigrams = finder.ngram_fd

bigrams = [(word, pair, count) for (word, pair), count in bigrams.items()]
bigrams = pd.DataFrame(bigrams, columns = ('word', 'pair', 'count'))
bigrams.sort_values('count', ascending = False, inplace = True)

Top bigrams:

bigrams.head(10)
word pair count
826 old man 32
623 native country 15
3333 natural philosophy 14
6999 taken place 13
5901 fellow creatures 12
3395 dear victor 10
6784 short time 9
7606 young man 9
2631 long time 9
2434 poor girl 8

Looks good! Some phrases are peeking through. But while raw counts provide us with information about frequently occurring phrases in text, it’s hard to know how unique these phrases are. For example, “man” appears throughout the novel, so it’s likely to appear in many bigrams. How, then, might we determine whether there’s something distinctive about the way “old” and “man” stick together?

One way to do this is with a PMI, or pointwise mutual information, score. PMI measures the association strength between a pair of outcomes. In our case, the higher the score, the more strongly the two words of a bigram are associated with one another, relative to how often each of those words appears alongside other words in the text.
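
Formally, the PMI of a bigram (x, y) is log2 of p(x, y) divided by p(x) times p(y). As a sanity check, we can compute it by hand for “old man” with the counts we already have; the result should closely match the score nltk assigns below.

import math

N = len(stopped)
p_old_man = finder.ngram_fd[('old', 'man')] / N
p_old = stopped_counts['old'] / N
p_man = stopped_counts['man'] / N
print(math.log2(p_old_man / (p_old * p_man)))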

We can get a PMI score for each bigram using our finder’s .score_ngrams() method in concert with a BigramAssocMeasures object; we send the latter as an argument to the former. As before, let’s format the result for a pandas DataFrame.

measures = collocations.BigramAssocMeasures()

bigram_pmi = finder.score_ngrams(measures.pmi)
bigram_pmi = [(word, pair, val) for (word, pair), val in bigram_pmi]
bigram_pmi = pd.DataFrame(bigram_pmi, columns = ('word', 'pair', 'pmi'))
bigram_pmi.sort_values('pmi', ascending = False, inplace = True)

The 10 lowest-scoring bigrams:

bigram_pmi.tail(10)
word pair pmi
29025 eyes eyes 1.491758
29026 eyes shall 1.477952
29027 man saw 1.293655
29028 saw man 1.293655
29029 father father 1.252280
29030 life father 1.226969
29031 man eyes 1.147804
29032 man shall 1.133998
29033 shall man 1.133998
29034 man father 1.028065

And 10 random bigrams with scores above the 75th percentile.

bigram_pmi[bigram_pmi['pmi'] > bigram_pmi['pmi'].quantile(0.75)].sample(10)
word pair pmi
4519 sense term 11.307675
6457 sufficient conquer 10.570710
5677 boy singular 10.722713
3775 perilous situation 11.570710
429 aspired omnipotence 13.892638
4198 dome mont 11.307675
5751 expiration month 10.722713
4517 sensation helplessness 11.307675
3360 enters cabin 11.722713
473 colonization trade 13.892638

Among the worst-scoring bigrams there are words that are likely to appear alongside many different words: “man,” “shall,” “father,” etc. On the other hand, among the best-scoring bigrams there are coherent entities and suggestive pairings. The latter especially begin to sketch out the specific qualities of Shelley’s prose style.