2. Text Annotation with spaCy#

This chapter introduces workshop participants to the general field of natural language processing, or NLP. While NLP is often used interchangeably with text mining/analytics in introductory settings, the former differs in important ways from many of the core methods in the latter. We will highlight a few such differences over the course of this session, and then more generally throughout the workshop series as a whole.

Learning objectives

By the end of this chapter, you will be able to:

  • Explain how document annotation differs from other representations of text data

  • Describe, in general terms, how spaCy models and their pipelines work

  • Extract linguistic information about text using spaCy

  • Describe key terms in NLP, like part-of-speech tagging, dependency parsing, etc.

  • Know how/where to look for more information about the linguistic data spaCy makes available

2.1. NLP vs. Text Mining: In Brief#

The short space of this reader, as well as that of our series, necessarily limits a full conversation about what distinguishes NLP from text mining/analytics. That there are differences at all is itself worth noting and merits further exploration. But for the purposes of this series, two of these differences are especially worth calling out.

2.1.1. Data structures#

At the outset, one way of distinguishing NLP from text mining has to do with NLP’s underlying data structure. Generally speaking, NLP methods are maximally preservative when it comes to representing textual information in a way computers can read. Unlike text mining’s atomizing focus on bags of words, in NLP we often use literal transcriptions of the input text and run our analyses directly on that. This is because much of the information NLP methods provide is context-sensitive: we need to know, for example, the subject of a sentence in order to do dependency parsing; part-of-speech taggers are most effective when they have surrounding tokens to consider. Accordingly, our workflow needs to retain as much information about our documents as possible, for as long as possible. In fact, many NLP methods build on each other, so data about our documents will grow over the course of processing them (rather than getting pared down, as with text mining). The dominant paradigm, then, for thinking about how text data is represented in NLP is annotation: NLP tends to add, associate, or tag documents with extra information.
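To make the contrast concrete, here is a purely illustrative sketch in plain Python (no spaCy yet). A bag-of-words representation reduces a text to counts, while an annotation-style representation keeps the original text and attaches information alongside it.

from collections import Counter

text = "The difference is spreading. The difference is spreading."

# Bag of words: order and context are discarded; only counts remain
bag_of_words = Counter(text.lower().replace('.', '').split())

# Annotation, schematically: the text stays intact and extra layers are
# added alongside it rather than replacing it
annotated = {
    'text': text,
    'tokens': text.split(),
    'sentences': [s for s in text.split('. ') if s],
}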

2.1.2. Model-driven methods#

The other key difference between text mining and NLP – which goes hand-in-hand with the idea of annotation – lies in the fact that the latter tends to be more model-driven. NLP methods often rely on statistical models to produce the annotations described above, and these models have a lot of assumptions baked into them. Such assumptions range from philosophy of language (how do we know we’re analyzing meaning?) to the data on which the models are trained (what does that data represent, and what biases might it carry?). Of course, it’s possible to build your own models, and indeed a later chapter will show you how to do so, but you’ll often find yourself using other researchers’ models when doing NLP work. It’s thus very important to know how researchers have built their models so you can do your own work responsibly.

Keep in mind

Throughout this series, we will be using NLP methods in the context of text-based data, but NLP applies more widely to speech data as well.

2.2. spaCy Language Models#

Much of this workshop series will use language models from spaCy, one of the most popular NLP libraries in Python. spaCy is both a framework and a model resource. It offers access to models through its own set of coding workflows, which we’ll discuss below (you can also train your own models with the library). Learning these workflows will help us annotate documents with extra information that will, in turn, enable us to perform a number of different NLP tasks.

2.2.1. spaCy pipelines#

In essence, a spaCy model is a collection of sub-models arranged into a pipeline. The idea here is that you send a document through this pipeline, and the model does the work of annotating your document. Once it has finished, you can access these annotations to perform whatever analysis you’d like to do.

[Figure: the spaCy pipeline, which is broken up into separate components, or pipes]

Every component, or pipe, in a spaCy pipeline performs a different task, from tokenization to part-of-speech tagging and named-entity recognition. Each model comes with a specific ordering of these tasks, but you can mix and match them after the fact, adding or removing pipes as you see fit. The result is a wide set of options; the present workshop series only samples a few core aspects of the library’s overall capabilities.
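For example, once a model is loaded you can inspect its pipeline and switch pipes off when you don’t need them. The snippet below is a small sketch of that workflow; it assumes the en_core_web_md model introduced in the next section is already installed.

import spacy

nlp = spacy.load('en_core_web_md')

# The components in the model's pipeline, in the order they run
print(nlp.pipe_names)

# Load the model again with some pipes disabled, e.g. if you only need
# tokenization and part-of-speech tags and want the extra speed
nlp_light = spacy.load('en_core_web_md', disable=['parser', 'ner'])
print(nlp_light.pipe_names)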

2.2.2. Downloading a model#

The specific model we’ll be using is spaCy’s medium-sized English model: en_core_web_md. It’s been trained on the OntoNotes corpus and it features several useful pipes, which we’ll discuss below.

If you haven’t used spaCy before, you’ll need to download this model. You can do so by running the following in a command line interface:

python -m spacy download en_core_web_md

Just be sure you run this while working in the Python environment you’d like to use!
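If you’d rather stay inside Python (in a notebook, for example), you can also trigger the same download programmatically. This is just an alternative to the command above, assuming spaCy itself is already installed in your active environment.

import spacy

# Downloads en_core_web_md through spaCy's command line interface, from Python
spacy.cli.download('en_core_web_md')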

Once this downloads, you can load the model with the code below. Note that it’s conventional to assign the model to a variable called nlp.

import spacy

nlp = spacy.load('en_core_web_md')

2.3. Annotations#

With the model loaded, we can send a document through the pipeline, which will in turn produce our text annotations. To annotate a document with the spaCy model, simply run it through the core function, nlp(). We’ll do so with a short poem by Gertrude Stein.

with open('data/session_one/stein_carafe.txt', 'r') as f:
    stein_poem = f.read()
    
carafe = nlp(stein_poem)

With this done, we can inspect the result…

carafe
A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing. All this and not ordinary, not unordered in not resembling. The difference is spreading.

…which seems to be no different from a string representation! This output is a bit misleading, however. Our carafe object actually has a ton of extra information associated with it, even though, on the surface, it appears to be a plain old string.

If you’d like, you can inspect all these attributes and methods with:

attributes = [i for i in dir(carafe) if i.startswith("_") is False]

We won’t show them all here, but suffice it to say, there are a lot!

print("Number of attributes in a SpaCy doc:", len(attributes))
Number of attributes in a SpaCy doc: 51

This high number of attributes points to something important to keep in mind when working with spaCy and NLP generally: as we mentioned before, the primary data model in NLP aims to maximally preserve information about your document. It keeps documents intact and in fact adds far more information about them than Python’s base string type provides. In this sense, we might say that spaCy is additive in nature, whereas text mining methods are subtractive, or reductive.

2.3.1. Document Annotations#

So, while the base representation of carafe looks like a string, under the surface there are all sorts of annotations about it. To access them, we use the attributes counted above. For example, spaCy adds extra segmentation information about a document, like which parts of it belong to different sentences. We can check to see whether this information has been attached to our text with the .has_annotation() method.

carafe.has_annotation('SENT_START')
True

We can use the same method to check for a few other annotations:

annotation_types = {'Dependencies': 'DEP', 'Entities': 'ENT_IOB', 'Tags': 'TAG'}
for a, t in annotation_types.items():
    print(
        f"{a:>12}: {carafe.has_annotation(t)}"
    )
Dependencies: True
    Entities: True
        Tags: True

Let’s look at sentences. We can access them with .sents.

carafe.sents
<generator at 0x11c014f40>

…but you can see that there’s a small complication here: .sents returns a generator, not a list. The reason has to do with memory efficiency. Because spaCy adds so much extra information about your document, this information could slow down your code or overwhelm your computer if the library didn’t store it in an efficient manner. Of course this isn’t a problem with our small poem, but you can imagine how it could become one with a big corpus.

To access the actual sentences in carafe, we’ll need to convert the generator to a list.

import textwrap

sentences = list(carafe.sents)
for s in sentences:
    s = textwrap.shorten(s.text, width=100)
    print(s)
A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an [...]
All this and not ordinary, not unordered in not resembling.
The difference is spreading.

One very useful attribute is .noun_chunks. It returns the base noun phrases in a document: nouns along with the words that modify them.

noun_chunks = list(carafe.noun_chunks)

for noun in noun_chunks:
    print(noun)
A kind
glass
a cousin
a spectacle
nothing
a single hurt color
an arrangement
a system
All this
The difference

See how this picks up not only nouns, but articles and compound information? Articles could be helpful if you wanted to track singular/plural relationships, while compound nouns might tell you something about the way a document refers to the entities therein. The latter could have repeating patterns, and you might imagine how you could use noun chunks to create and count n-gram tokens and feed that into a classifier.
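To give a flavor of that last idea, here’s a small, hypothetical sketch: join the words of each multi-word chunk with underscores, so that whole phrases become single “tokens” a classifier could count as features.

# Multi-word noun chunks as single, underscore-joined feature tokens
chunk_tokens = [
    '_'.join(token.text.lower() for token in chunk)
    for chunk in carafe.noun_chunks
    if len(chunk) > 1
]
print(chunk_tokens)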

Consider this example from The Odyssey. Homer used many epithets and repeating phrases throughout his epic. According to some theories, these act as mnemonic devices, helping a performer keep everything in their head during an oral performance (the poem wasn’t written down in Homer’s day). Using .noun_chunks in conjunction with a Python Counter, we may be able to identify these in Homer’s text. Below, we’ll do so with The Odyssey Book XI.

First, let’s load and model the text.

with open('data/session_one/odyssey_book_11.txt', 'r') as f:
    book_eleven = f.read()
    
odyssey = nlp(book_eleven)

Now we’ll import a Counter and populate it with the document’s noun chunks, using a list comprehension. Be sure to grab only the text from each chunk; we’ll explain why in a little while.

from collections import Counter

noun_counts = Counter([chunk.text for chunk in odyssey.noun_chunks])

With that done, let’s look for repeating noun chunks with three or more words.

import pandas as pd

chunks = []
for chunk, count in noun_counts.items():
    chunk = chunk.split()
    if (len(chunk) > 2) and (count > 1):
        joined = ' '.join(chunk)
        chunks.append({
            'PHRASE': joined,
            'COUNT': count
        })
        
chunks = pd.DataFrame(chunks).set_index('PHRASE')
chunks
COUNT
PHRASE
the sea shore 2
a fair wind 2
the poor feckless ghosts 2
the same time 2
the other side 2
his golden sceptre 2
your own house 2
her own son 2
the Achaean land 2
her own husband 2
my wicked wife 2
all the Danaans 2
the poor creature 2

Excellent! Looks like we turned up a few: “the poor feckless ghosts,” “my wicked wife,” and “all the Danaans” are likely the kind of repeating phrases scholars think of in Homer’s text.

Another way to look at entities in a text is with .ents. spaCy uses named-entity recognition to extract significant objects, or entities, in a document. In general, anything that has a proper name associated with it is considered an entity, but things like expressions of time and geographic location are also often tagged. Here are the first five from Book XI above.

entities = list(odyssey.ents)

for entity in entities[:5]:
    print(entity)
Circe
Circe
Here Perimedes and Eurylochus
first
thirdly

You can select particular entities using the .label_ attribute. Here are all the temporal entities in Book XI.

[e.text for e in odyssey.ents if e.label_ == 'TIME']
['morning']

And here is a unique listing of all the people.

set(e.text for e in odyssey.ents if e.label_ == 'PERSON')
{'Achilles',
 'Anticlea',
 'Ariadne',
 'Bacchus',
 'Chloris',
 'Circe',
 'Cretheus',
 'Dia',
 'Diana',
 'Echeneus',
 'Epeus',
 'Ephialtes',
 'Erebus',
 'Eriphyle',
 'Gorgon',
 'Hades',
 'Hebe',
 'Helen',
 'Hercules',
 'Ithaca',
 'Jove',
 'Leto',
 'Maera',
 'Megara',
 'Memnon',
 'Minerva',
 'Neleus',
 'Neptune',
 'Pelias',
 'Penelope',
 'Pero',
 'Pollux',
 'Priam',
 'Proserpine',
 'Pylos',
 'Queen',
 'Salmoneus',
 'Sisyphus',
 'Telemachus',
 'Theban',
 'Theban Teiresias',
 'Thebes',
 'Theseus',
 'Thetis',
 'Troy',
 'Tyro',
 'Ulysses',
 'jailor Hades'}

Don’t see an entity that you know to be in your document? You can add more to the spaCy model. Doing so is beyond the scope of our workshop session, but the library’s EntityRuler() documentation will show you how.
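That documentation is the place to go, but just to give a flavor, a minimal sketch with spaCy 3 looks something like the following. The patterns here are hypothetical; swap in whatever your model misses.

# Add an EntityRuler pipe and hand it patterns for entities the statistical
# model doesn't catch (a sketch; see the EntityRuler docs for the details)
ruler = nlp.add_pipe('entity_ruler', before='ner')
ruler.add_patterns([
    {'label': 'PERSON', 'pattern': 'Elpenor'},
    {'label': 'LOC', 'pattern': [{'LOWER': 'house'}, {'LOWER': 'of'}, {'LOWER': 'hades'}]},
])

# Any document processed after this point will pick up the new patterns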

2.3.2. Token Annotations#

In addition to storing all of this information about texts, spaCy creates a substantial number of annotations for each of the tokens in a document. The same logic as above applies to accessing this information.

Let’s return to the Stein poem. Indexing carafe will return individual tokens:

carafe[3]
glass

Like carafe, each token has a large number of attributes:

token_attributes = [i for i in dir(carafe[3]) if i.startswith("_") is False]

print("Number of token attributes:", len(token_attributes))
Number of token attributes: 94

That’s a lot!

These attributes range from simple booleans, like whether a token is an alphabetic character:

carafe[3].is_alpha
True

…or whether it is a stop word:

carafe[3].is_stop
False

…to more complex pieces of information, like tracking back to the sentence this token is part of:

carafe[3].sent
A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing.

…sentiment scores:

carafe[3].sentiment
0.0

…and even vector space representations (more about these on day three!):

carafe[3].vector
array([-0.60665  , -0.19688  , -0.1227   , -0.63598  ,  0.89807  ,
       -0.26537  , -0.015889 ,  0.28961  , -0.10494  , -0.46685  ,
        0.36891  ,  0.15006  , -0.22227  ,  0.38622  ,  0.29493  ,
       -0.9845   , -0.32567  ,  0.23355  ,  0.49492  , -0.57569  ,
       -0.10022  ,  0.2888   ,  0.786    , -0.054396 ,  0.14151  ,
       -0.40768  ,  0.061544 ,  0.39313  , -1.092    , -0.14511  ,
        0.42103  , -0.76163  ,  0.29339  ,  0.41215  , -0.41869  ,
        0.39918  ,  0.39688  , -0.46966  ,  0.74839  ,  0.50759  ,
        0.48458  ,  0.51865  ,  0.20757  , -0.32406  ,  0.72689  ,
       -0.31903  , -0.23761  , -0.6614   ,  0.28486  ,  0.44251  ,
        0.10102  , -0.26663  ,  0.55344  ,  0.26378  , -0.41728  ,
       -0.30876  ,  0.41034  , -0.54028  , -0.42292  ,  0.22636  ,
        0.31144  ,  0.27504  ,  0.18149  ,  0.21325  , -0.55227  ,
       -0.1011   ,  0.25542  ,  0.45925  ,  0.60463  ,  0.35682  ,
        0.56958  ,  0.036307 ,  0.83937  ,  0.48093  , -0.19525  ,
        0.46602  ,  0.022897 ,  0.11539  , -0.24179  ,  0.48036  ,
       -1.0714   , -0.36915  ,  0.16971  , -0.053726 , -0.97381  ,
        0.42214  ,  1.1142   ,  1.4597   , -0.50529  , -0.45446  ,
        0.41132  , -0.33779  ,  0.32959  ,  0.49328  ,  0.56425  ,
        0.012859 ,  0.25048  , -0.87307  , -0.21436  , -0.20508  ,
        0.42198  , -0.22507  , -0.53125  ,  0.44767  , -0.18567  ,
       -1.3365   ,  0.089993 , -0.64042  ,  1.0186   ,  0.21136  ,
        0.19435  ,  0.17786  , -0.13835  , -0.37329  ,  0.31726  ,
        0.16039  , -0.19666  ,  0.049621 ,  0.094966 , -0.037447 ,
       -0.29013  ,  0.48434  , -0.12724  ,  0.67386  ,  0.32174  ,
       -0.32181  , -0.16145  , -0.60097  ,  0.60743  ,  0.22665  ,
       -0.28438  , -0.29495  ,  0.04462  , -0.013322 ,  0.11204  ,
        0.63518  , -0.022868 , -0.034192 , -0.20114  ,  0.11167  ,
       -0.37905  ,  0.36819  , -0.018246 ,  0.25284  , -0.088927 ,
       -0.077314 , -0.35745  ,  0.14237  ,  0.14386  ,  0.24076  ,
       -0.29973  , -0.03639  ,  0.22725  ,  0.10916  , -0.94986  ,
        0.046044 ,  0.81856  , -0.33634  , -0.15261  ,  0.60717  ,
       -0.2718   ,  0.22678  , -0.18739  ,  0.623    ,  0.82995  ,
       -0.20024  ,  0.73056  , -0.0084901, -0.39045  ,  0.18742  ,
       -0.73338  , -0.077664 ,  0.23246  ,  0.87546  ,  0.21647  ,
       -0.072898 ,  0.36288  ,  0.2308   , -0.30565  ,  0.11694  ,
       -0.33983  , -0.15152  ,  0.92947  ,  0.58747  , -0.12973  ,
        0.005775 , -0.092589 ,  0.18885  , -0.40725  ,  0.18249  ,
       -0.90487  , -0.42429  ,  0.099451 , -0.61899  , -0.060563 ,
        0.16057  , -0.43026  ,  0.33288  ,  0.5276   ,  0.16662  ,
       -0.11755  ,  0.17048  ,  0.20785  , -0.18347  ,  0.35685  ,
       -0.07018  , -0.064295 ,  0.46135  , -0.0083938, -0.24064  ,
       -0.18537  ,  0.067484 , -0.072876 , -0.80117  ,  0.11181  ,
        0.43282  ,  0.58373  , -0.31169  ,  0.032395 ,  0.33231  ,
        0.086075 ,  0.14476  ,  0.081413 ,  0.31666  , -0.49782  ,
       -0.15804  ,  0.3573   , -0.56534  , -0.58402  ,  0.023733 ,
       -0.29734  ,  0.13347  , -0.12673  ,  0.44492  , -0.33062  ,
       -0.50343  ,  0.43491  ,  0.028932 ,  0.17392  , -0.26669  ,
        0.020031 ,  0.91117  , -0.069201 ,  0.49468  ,  0.043936 ,
        0.083269 ,  0.14409  ,  0.15961  , -0.54771  ,  0.38417  ,
        0.51976  ,  0.17693  ,  0.36414  , -0.11335  , -0.078741 ,
       -0.41143  , -0.15808  ,  0.086959 ,  0.20845  ,  0.27017  ,
        0.018805 ,  1.2245   ,  0.12708  ,  0.25326  ,  0.10445  ,
       -0.058079 , -0.030512 ,  0.63189  , -0.56298  , -0.25331  ,
       -0.64939  , -0.48705  ,  0.18973  , -0.39923  ,  0.78043  ,
       -0.13467  ,  0.14517  ,  0.48435  ,  0.51201  , -0.80074  ,
       -0.42834  ,  0.17614  ,  0.59832  , -0.37692  ,  0.029607 ,
        0.09632  ,  0.40852  ,  0.62755  , -0.038655 ,  0.17166  ,
        0.451    ,  0.20851  ,  0.267    ,  0.24261  , -0.23774  ,
       -0.48429  ,  0.24286  , -0.20696  , -0.25682  , -0.1432   ],
      dtype=float32)

Here’s a listing of some attributes you might want to know about when text mining.

sample_attributes = []
for token in carafe:
    sample_attributes.append({
        'INDEX': token.i,
        'TEXT': token.text,
        'LOWERCASE': token.lower_,
        'ALPHABETIC': token.is_alpha,
        'DIGIT': token.is_digit,
        'PUNCTUATION': token.is_punct,
        'STARTS SENTENCE': token.is_sent_start,
        'LIKE URL': token.like_url
    })

sample_attributes = pd.DataFrame(sample_attributes).set_index('INDEX')
sample_attributes.head(10)
TEXT LOWERCASE ALPHABETIC DIGIT PUNCTUATION STARTS SENTENCE LIKE URL
INDEX
0 A a True False False True False
1 kind kind True False False False False
2 in in True False False False False
3 glass glass True False False False False
4 and and True False False False False
5 a a True False False False False
6 cousin cousin True False False False False
7 , , False False True False False
8 a a True False False False False
9 spectacle spectacle True False False False False

We’ll discuss some of the more complex annotations later on, both in this session and others. For now, let’s collect some simple information about each of the tokens in our document. We’ll use list comprehensions to do so. We’ll also use the .text attribute for each token, since we only want the text representation. Otherwise, we’d be building lists of full Token objects, each of which carries all of those attributes along with it (and a reference back to the entire document)! (This is why we made sure to only use .text in our work with The Odyssey above.)

words = ' '.join([token.text for token in carafe if token.is_alpha])
punctuation = ' '.join([token.text for token in carafe if token.is_punct])

print(
    f"Words\n-----\n{textwrap.shorten(words, width=100)}",
    f"\n\nPunctuation\n-----------\n{punctuation}"
)
Words
-----
A kind in glass and a cousin a spectacle and nothing strange a single hurt color and an [...] 

Punctuation
-----------
, . , . .

Want some linguistic information? We can get that too. For example, here are prefixes and suffixes:

prefix_suffix = []
for token in carafe:
    if token.is_alpha:
        prefix_suffix.append({
            'TOKEN': token.text,
            'PREFIX': token.prefix_,
            'SUFFIX': token.suffix_
        })

prefix_suffix = pd.DataFrame(prefix_suffix).set_index('TOKEN')
prefix_suffix.head(10)
PREFIX SUFFIX
TOKEN
A A A
kind k ind
in i in
glass g ass
and a and
a a a
cousin c sin
a a a
spectacle s cle
and a and

And here are lemmas:

lemmas = []
for token in carafe:
    if token.is_alpha:
        lemmas.append({
            'TOKEN': token.text,
            'LEMMA': token.lemma_
        })

lemmas = pd.DataFrame(lemmas).set_index('TOKEN')
lemmas[24:]
LEMMA
TOKEN
All all
this this
and and
not not
ordinary ordinary
not not
unordered unordered
in in
not not
resembling resemble
The the
difference difference
is be
spreading spread

With such attributes at your disposal, you might imagine how you could work spaCy into a text mining pipeline. Instead of using separate functions to clean your corpus, those steps could all be accomplished by accessing attributes.
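For instance, a cleaning step that filters out punctuation and stop words and lemmatizes what’s left could be collapsed into a single pass over a document’s tokens. Here’s a rough sketch, using the carafe document from above:

# One-pass "cleaning" via token attributes: keep alphabetic, non-stop tokens
# and take their lowercased lemmas
cleaned = [
    token.lemma_.lower()
    for token in carafe
    if token.is_alpha and not token.is_stop
]
print(cleaned)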

Before you do this, however, you should consider two things: 1) whether the increased computational/memory overhead is worthwhile for your project; and 2) whether spaCy’s base models will work for the kind of text you’re using. This second point is especially important. While spaCy’s base models are incredibly powerful, they are built for general-purpose applications and may struggle with domain-specific language. Medical text and early modern print are two examples where the base models can interpret your documents in unexpected ways, complicating, maybe even ruining, parts of a text mining pipeline that relies on them. Sometimes, in other words, it’s best to stick with a text mining pipeline that you know to be effective.

That all said, there are ways to train your own spaCy model on a specific domain. This can be an extensive process, one which exceeds the limits of our short workshop, but if you want to learn more about doing so, you can visit this page. There are also third-party models available, which you might find useful, though your mileage may vary.

2.4. Part-of-Speech Tagging#

One of the most common tasks in NLP involves assigning part-of-speech, or POS, tags to each token in a document. As we saw in the text mining series, these tags are a necessary step for certain text cleaning processes, like lemmatization; you might also use them to identify subsets of your data, which you could separate out and model. Beyond text cleaning, POS tags can be useful for tasks like word sense disambiguation, where you try to determine which particular facet of meaning a given token represents.

Regardless of the task, the process of getting POS tags from spaCy will be the same. Each token in a document has an associated tag, which is accessible as an attribute.

pos = []
for token in carafe:
    pos.append({
        'TOKEN': token.text,
        'POS_TAG': token.pos_
    })

pos = pd.DataFrame(pos).set_index('TOKEN')
pos
POS_TAG
TOKEN
A DET
kind NOUN
in ADP
glass NOUN
and CCONJ
a DET
cousin NOUN
, PUNCT
a DET
spectacle NOUN
and CCONJ
nothing PRON
strange ADJ
a DET
single ADJ
hurt ADJ
color NOUN
and CCONJ
an DET
arrangement NOUN
in ADP
a DET
system NOUN
to ADP
pointing VERB
. PUNCT
All DET
this PRON
and CCONJ
not PART
ordinary ADJ
, PUNCT
not PART
unordered ADJ
in ADP
not PART
resembling VERB
. PUNCT
The DET
difference NOUN
is AUX
spreading VERB
. PUNCT

If you don’t know what a tag means, you can use spacy.explain().

spacy.explain('CCONJ')
'coordinating conjunction'

spaCy actually has two types of POS tags. The ones accessible with the .pos_ attribute are the basic tags, whereas those under .tag_ are more detailed (these come from the Penn Treebank project). We’ll print them out below, along with information about what they mean.

detailed_tags = []
for token in carafe:
    detailed_tags.append({
        'TOKEN': token.text,
        'POS_TAG': token.tag_,
        'EXPLANATION': spacy.explain(token.tag_)
    })

detailed_tags = pd.DataFrame(detailed_tags).set_index('TOKEN')
detailed_tags
POS_TAG EXPLANATION
TOKEN
A DT determiner
kind NN noun, singular or mass
in IN conjunction, subordinating or preposition
glass NN noun, singular or mass
and CC conjunction, coordinating
a DT determiner
cousin NN noun, singular or mass
, , punctuation mark, comma
a DT determiner
spectacle NN noun, singular or mass
and CC conjunction, coordinating
nothing NN noun, singular or mass
strange JJ adjective (English), other noun-modifier (Chin...
a DT determiner
single JJ adjective (English), other noun-modifier (Chin...
hurt JJ adjective (English), other noun-modifier (Chin...
color NN noun, singular or mass
and CC conjunction, coordinating
an DT determiner
arrangement NN noun, singular or mass
in IN conjunction, subordinating or preposition
a DT determiner
system NN noun, singular or mass
to IN conjunction, subordinating or preposition
pointing VBG verb, gerund or present participle
. . punctuation mark, sentence closer
All PDT predeterminer
this DT determiner
and CC conjunction, coordinating
not RB adverb
ordinary JJ adjective (English), other noun-modifier (Chin...
, , punctuation mark, comma
not RB adverb
unordered JJ adjective (English), other noun-modifier (Chin...
in IN conjunction, subordinating or preposition
not RB adverb
resembling VBG verb, gerund or present participle
. . punctuation mark, sentence closer
The DT determiner
difference NN noun, singular or mass
is VBZ verb, 3rd person singular present
spreading VBG verb, gerund or present participle
. . punctuation mark, sentence closer

2.4.1. Use case: word sense disambiguation#

This is all well and good in the abstract, but the power of POS tags lies in how they support other kinds of analysis. We’ll do a quick word sense disambiguation task here but will return to do something more complex in a little while.

Between the two strings:

  1. “I am not going to bank on that happening.”

  2. “I went down to the river bank.”

How can we tell which sense of the word “bank” is being used? Well, we can model each with spaCy and see whether the POS tags for these two tokens match. If they don’t match, this will indicate that the tokens represent two different senses of the word “bank.”

All this can be accomplished with a for loop and nlp.pipe(). The latter function enables you to process different documents with the spaCy model all at once. This can be great for working with a large corpus, though note that, because .pipe() is meant to work on text at scale, it will return a generator.

banks = ["I am not going to bank on that happening.", "I went down to the river bank."]

nlp.pipe(banks)
<generator object Language.pipe at 0x1253b09e0>
for doc in nlp.pipe(banks):
    for token in doc:
        if token.text == 'bank':
            print(
                f"{doc.text}\n+ {token.text}: "
                f"{token.tag_} ({spacy.explain(token.tag_)})\n"
            )
I am not going to bank on that happening.
+ bank: VB (verb, base form)

I went down to the river bank.
+ bank: NN (noun, singular or mass)

See how the tags differ between the two instances of “bank”? This indicates a difference in usage and, by proxy, a difference in meaning.

2.5. Dependency Parsing#

Another tool that can help with tasks like disambiguating word sense is dependency parsing. We’ve actually used it already: it’s what allowed us to extract those noun chunks above. Dependency parsing involves analyzing the grammatical structure of text (usually sentences) to identify relationships between the words therein. The basic idea is that every word in a linguistic unit (e.g. a sentence) is linked to at least one other word via a tree structure, and these linkages are hierarchical in nature, with various modifications occurring across the levels of sentences, clauses, phrases, and even compound nouns. Dependency parsing can tell you information about:

  1. The primary subject of a linguistic unit (and whether it is an active or passive subject)

  2. Various heads, which determine the syntactic categories of a phrase; these are often nouns and verbs, and you can think of them as the local subjects of subunits

  3. Various dependents, which modify, either directly or indirectly, their heads (think adjectives, adverbs, etc.)

  4. The root of the unit, which is often (but not always!) the primary verb

Linguists have developed a number of different methods to parse dependencies, which we won’t discuss here. Take note, though, that the most popular one in NLP is the Universal Dependencies framework; spaCy, like most NLP models, uses it. The library also has some functionality for visualizing dependencies, which will help clarify what these relationships look like in the first place. Below, we visualize a sentence from the Stein poem.

from spacy import displacy

to_render = list(carafe.sents)[2]
displacy.render(to_render, style='dep')
[displaCy rendering of “The difference is spreading.” with part-of-speech tags (DET, NOUN, AUX, VERB) and dependency arcs labeled det, nsubj, and aux]

See how the arcs have arrows? Arrows point to the dependents within a linguistic unit, that is, they point to modifying relationships between words. Arrows arc out from a segment’s head, and the relationships they indicate are all specified with labels. As with the POS tags, you can use spacy.explain() on the dependency labels, which we’ll do below. The whole list of them is also available in this table of typologies. Finally, somewhere in the tree you’ll find a word with no arrows pointing to it (here, “spreading”). This is the root. One of its dependents is the subject of the sentence (here, “difference”).
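You can also get at these pieces programmatically: a sentence (or any Span) has a .root attribute, and every token has .children for its direct dependents. Here’s a quick look at the sentence we just visualized:

root = to_render.root
print('Root:', root.text)

# The root's direct dependents, with their dependency labels
for child in root.children:
    print(f"+ {child.text}: {child.dep_} ({spacy.explain(child.dep_)})")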

These relationships are quite useful in and of themselves, but the real power of dependency parsing comes from all the extra data it can provide about a token. Using this technique, you can link tokens back to their heads, or find local groupings of tokens that all refer to the same head.

Here’s how you could formalize that with a dataframe. Given this sentence:

sentence = odyssey[2246:2260]
sentence.text
"Then I tried to find some way of embracing my mother's ghost."

We can construct a for loop, which rolls through each token and retrieves its dependency info.

dependencies = []
for token in sentence:
    dependencies.append({
        'INDEX': token.i,
        'TOKEN': token.text,
        'DEPENDENCY_SHORTCODE': token.dep_,
        'DEPENDENCY': spacy.explain(token.dep_),
        'HEAD_INDEX': token.head.i,
        'HEAD': token.head
    })
    
dependencies = pd.DataFrame(dependencies).set_index('INDEX')
dependencies
TOKEN DEPENDENCY_SHORTCODE DEPENDENCY HEAD_INDEX HEAD
INDEX
2246 Then advmod adverbial modifier 2248 tried
2247 I nsubj nominal subject 2248 tried
2248 tried ROOT None 2248 tried
2249 to aux auxiliary 2250 find
2250 find xcomp open clausal complement 2248 tried
2251 some det determiner 2252 way
2252 way dobj direct object 2250 find
2253 of prep prepositional modifier 2252 way
2254 embracing pcomp complement of preposition 2253 of
2255 my poss possession modifier 2256 mother
2256 mother poss possession modifier 2258 ghost
2257 's case case marking 2256 mother
2258 ghost dobj direct object 2254 embracing
2259 . punct punctuation 2248 tried

How many tokens are associated with each head?

dependencies.groupby('HEAD').size()
HEAD
tried        5
find         2
way          2
of           1
embracing    1
mother       2
ghost        1
dtype: int64

Which tokens are in each of these groups?

groups = []
for group in dependencies.groupby('HEAD'):
    head, tokens = group[0].text, group[1]['TOKEN'].tolist()
    groups.append({
        'HEAD': head,
        'GROUP': tokens
    })
    
groups = pd.DataFrame(groups).set_index('HEAD')
groups
GROUP
HEAD
tried [Then, I, tried, find, .]
find [to, way]
way [some, of]
of [embracing]
embracing [ghost]
mother [my, 's]
ghost [mother]

spaCy also has a special .subtree attribute for each token, which will also produce a similar set of local groupings. Note however that .subtree captures all tokens that hold a dependent relationship with the one in question, meaning that when you find the subtree of the root, you’re going to print out the entire sentence.
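To see this for yourself, grab the root of the sentence span we indexed above and join up its subtree; you should get the whole sentence back.

root = sentence.root

# The root's subtree reproduces the entire sentence
print(''.join(token.text_with_ws for token in root.subtree))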

As you might expect by now, .subtree returns a generator, so convert it to a list or use list comprehension to extract the tokens. We’ll do this in a separate function. Within this function, we’re going to use the .text_with_ws attribute of each token in the subtree to return an exact, string-like representation of the tree (this will include any whitespace characters that are attached to a token).

def subtree_to_text(subtree):
    # Join each token's text (with its trailing whitespace) into one string
    subtree = ''.join([token.text_with_ws for token in subtree])
    subtree = subtree.strip()
    return subtree

sentence_trees = []
for token in sentence:
    subtree = subtree_to_text(token.subtree)
    sentence_trees.append({
        'TOKEN': token.text,
        'DEPENDENCY': token.dep_,
        'SUBTREE': subtree
    })

sentence_trees = pd.DataFrame(sentence_trees).set_index('TOKEN')
sentence_trees
DEPENDENCY SUBTREE
TOKEN
Then advmod Then
I nsubj I
tried ROOT "Then I tried to find some way of embracing my...
to aux to
find xcomp to find some way of embracing my mother's ghost
some det some
way dobj some way of embracing my mother's ghost
of prep of embracing my mother's ghost
embracing pcomp embracing my mother's ghost
my poss my
mother poss my mother's
's case 's
ghost dobj my mother's ghost
. punct .

2.6. Putting Everything Together#

Now that we’ve walked through all these options (which are really only a small sliver of what you can do with spaCy!), let’s put them into action. Below, we’ll construct two short examples of how you might combine different aspects of token attributes to analyze a text. Both of them are essentially information retrieval tasks, and you might imagine doing something similar to extract and analyze particular words in your corpus, or to find different grammatical patterns that could be of significance (as we’ll discuss in the next session).

2.6.1. Finding lemmas#

In the first, we’ll use the .lemma_ attribute to search through Book XI of The Odyssey and match its tokens to a few key words. If you’ve read The Odyssey, you’ll know that Book XI is where Odysseus and his fellow sailors travel down to Hades, the underworld, to speak with the dead. We already saw one example of this: Odysseus attempts to embrace his dead mother after communing with her. The whole trip to Hades is an emotionally tumultuous experience for the travelers, and peppered throughout Book XI are expressions of grief.

With .lemma_, we can search for these expressions. We’ll roll through the text and determine whether a token lemma matches one of a selected set. When we find a match, we’ll get the subtree of this token’s head. That is, we’ll find the head upon which this token depends, and then we’ll use that to reconstruct the local context for the token.

sorrowful_lemmas = []
for token in odyssey:
    if token.lemma_ in ('cry', 'grief', 'grieve', 'sad', 'sorrow', 'tear', 'weep'):
        subtree = subtree_to_text(token.head.subtree)
        sorrowful_lemmas.append({
            'TOKEN': token.text,
            'SUBTREE': subtree
        })

sorrowful_lemmas = pd.DataFrame(sorrowful_lemmas).set_index('TOKEN')
sorrowful_lemmas
SUBTREE
TOKEN
cried cried when I saw him: 'Elpenor
sad sad
tears tears
sorrow all my sorrow
sad sad
tears tears both night and day
grieves He grieves continually about your never having...
sad sad
sorrows our sorrows
grief grief
grief great grief for the spite the gods had borne him
grief grief
sadder still sadder
weeping weeping
wept I too wept and pitied him as I beheld him
weeping weeping and talking thus sadly with one anothe...
tear a tear
cries such appalling cries

2.6.2. Verb-subject relations#

For this next example, we’ll use dependency tags to find the subjects of sentences in Book XI. As before, we’ll go through each token in the document, this time checking to see whether it has the nsubj or nsubjpass tag for its .dep_ attribute; these denote the subject of a sentence’s root. We’ll also check to see whether a token is a noun (otherwise we’d get a lot of pronouns, like “who,” “them,” etc.). If a token matches these two conditions, we’ll find its head verb as well as the token’s subtree. Note that this time, the subtree will refer directly to the token in question, not to its head. This will let us capture some descriptive information about each sentence subject.

nsubj = []
for token in odyssey:
    if token.dep_ in ('nsubj', 'nsubjpass') and token.pos_ in ('NOUN', 'PROPN'):
        nsubj.append({
            'SUBJECT': token.text,
            'HEAD': token.head.text,
            'HEAD_LEMMA': token.head.lemma_,
            'SUBTREE': subtree_to_text(token.subtree)
        })

nsubj_df = pd.DataFrame(nsubj).set_index('SUBJECT')
nsubj_df
HEAD HEAD_LEMMA SUBTREE
SUBJECT
Circe sent send Circe, that great and cunning goddess,
sails were be her sails
sun went go the sun
darkness was be darkness
rays pierce pierce the rays of the sun
... ... ... ...
Mercury helped help Mercury and Minerva
thousands came come so many thousands of ghosts
Proserpine send send Proserpine
ship went go the ship
wind sprang spring a fair wind

140 rows × 3 columns

Let’s look at a few subtrees. Note how sometimes they are simple noun chunks, while in other cases they expand to whole phrases.

for chunk in nsubj_df['SUBTREE'].sample(10):
    print(f"+ {chunk}")
+ Proserpine
+ Diana
+ many others also of the Ceteians
+ King Alcinous
+ the steam
+ the blame
+ the Trojan prisoners and Minerva
+ the gods
+ her abominable crime
+ scared birds

Time to zoom out. How many times does each of our selected subjects appear?

nsubj_df.groupby('SUBJECT').size().sort_values(ascending=False).head(25)
SUBJECT
ghost         6
heaven        4
Proserpine    4
ghosts        4
wife          4
man           4
one           4
Ulysses       4
people        3
judgement     2
gods          2
life          2
mother        2
creature      2
son           2
Teiresias     2
ship          2
Jove          2
wind          2
Neleus        2
will          2
Circe         2
sweat         1
others        1
wave          1
dtype: int64

What heads are associated with each subject? (Note that we’re using the lemmatized form of the verbs.)

nsubj_df.groupby(['SUBJECT', 'HEAD_LEMMA']).size().sort_values(ascending=False).head(25)
SUBJECT       HEAD_LEMMA
Ulysses       answer        2
ghost         come          2
son           be            2
Proserpine    send          2
people        hear          1
              bless         1
              be            1
others        fall          1
one           tell          1
man           kill          1
one           invite        1
prisoners     be            1
prophecyings  speak         1
one           get           1
              be            1
mother        come          1
              answer        1
match         marry         1
Aegisthus     be            1
queen         say           1
judgement     rankle        1
heaven        make          1
              take          1
              vouchsafe     1
heroes        lie           1
dtype: int64

Such information provides another way of looking at something like topicality. Rather than using, say, a bag-of-words approach to build a topic model, you could instead segment your text into chunks like the above and start tallying up token distributions. Such distributions might help you identify the primary subject in a passage of text, whether that be a character or something like a concept. Or, you could leverage them to investigate how different subjects are talked about, say by throwing POS tags into the mix to add further nuance to the relationships across entities.
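As a hypothetical sketch of that last suggestion, you could tally the POS tags that appear in each sentence subject’s subtree (reusing the Counter we imported earlier), which starts to show how different subjects get described:

# For each sentence subject found above, tally the POS tags of the other
# tokens in its subtree (a rough sketch of mixing dependencies and POS tags)
subtree_pos = Counter()
for token in odyssey:
    if token.dep_ in ('nsubj', 'nsubjpass') and token.pos_ in ('NOUN', 'PROPN'):
        for sub in token.subtree:
            if sub is not token:
                subtree_pos[(token.text, sub.pos_)] += 1

print(subtree_pos.most_common(10))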

Our next session will demonstrate what such investigations look like in action. For now, however, the main takeaway is that the above annotation structures provide you with a host of different ways to segment and facet your text data. You are by no means limited to single token counts when analyzing text computationally. Indeed, sometimes the most compelling ways to explore a corpus lie in the broader, fuzzier relationships that NLP annotations help us identify.