11. Text Annotation with spaCy#

This chapter introduces the general field of natural language processing, or NLP. While NLP is often used interchangeably with text mining/analytics in introductory settings, the former differs in important ways from many of the core methods in the latter. We will highlight a few such differences over the course of this session, and then more generally throughout the workshop series as a whole.

Learning objectives

By the end of this chapter, you will be able to:

  • Explain how document annotation differs from other representations of text data

  • Know how spaCy models and their respective pipelines work

  • Extract linguistic information about text using spaCy

  • Describe key terms in NLP, like part-of-speech tagging, dependency parsing, etc.

  • Know how/where to look for more information about the linguistic data spaCy makes available

11.1. NLP vs. Text Mining: In Brief#

11.1.1. Data structures#

One way to distinguish NLP from text mining has to do with the various data structures we use in the former. Generally speaking, NLP methods are maximally preservative when it comes to representing textual information in a way computers can read. Unlike text mining’s atomizing focus on bags of words, in NLP we often use literal transcriptions of the input text and run our analyses directly on that. This is because much of the information NLP methods provide is context-sensitive: we need to know, for example, the subject of a sentence in order to do dependency parsing; part-of-speech taggers are most effective when they have surrounding tokens to consider. Accordingly, our workflow needs to retain as much information about our documents as possible, for as long as possible. In fact, many NLP methods build on each other, so data about our documents will grow over the course of processing them (rather than getting pared down, as with text mining). The dominant paradigm, then, for thinking about how text data is represented in NLP is annotation: NLP tends to add, associate, or tag documents with extra information.

11.1.2. Model-driven methods#

The other key difference between text mining and NLP lies in the way the latter tends to be more model-driven. NLP methods often rely on statistical models to produce the annotations described above, and ultimately these models have a lot of assumptions baked into them. Such assumptions range from philosophy of language (how do we know we’re analyzing meaning?) to the kind of data on which the models are trained (what does a model represent, and what biases might thereby be involved?). Of course, it’s possible to build your own models, and indeed a later chapter will show you how to do so, but you’ll often find yourself using other researchers’ models when doing NLP work. It’s thus very important to know how researchers have built their models so you can do your own work responsibly.

11.2. spaCy Language Models#

11.2.1. spaCy pipelines#

Much of this workshop series will use language models from spaCy, a very popular NLP library for Python. In essence, a spaCy model is a collection of sub-models arranged into a pipeline. The idea here is that you send a document through this pipeline, and the model does the work of annotating your document. Once it has finished, you can access these annotations to perform whatever analysis you’d like to do.

The spaCy pipeline, which is broken up into separate components, or pipes

Every component, or pipe, in a spaCy pipeline performs a different task, from tokenization to part-of-speech tagging and named-entity recognition. Each model comes with a specific ordering of these tasks, but you can mix and match them after the fact.
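To make this concrete, here’s a minimal sketch of inspecting and trimming a pipeline (it assumes you’ve already downloaded the en_core_web_md model, which we cover next):

import spacy

# Load a model and list its pipes, in order
nlp = spacy.load('en_core_web_md')
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

# If you don't need certain annotations, you can disable their pipes
# at load time, which speeds up processing considerably
nlp_light = spacy.load('en_core_web_md', disable=['parser', 'ner'])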

11.2.2. Downloading a model#

The specific model we’ll be using is spaCy’s medium-sized English model: en_core_web_md. It’s been trained on the OntoNotes corpus and it features several useful pipes, which we’ll discuss below.

If you haven’t used spaCy before, you’ll need to download this model. You can do so by running the following in a command line interface:

python -m spacy download en_core_web_md

11.3. Preliminaries#

Once your model has downloaded, it’s time to set up an environment. Here are the libraries you’ll need for this chapter.

from pathlib import Path
from collections import Counter, defaultdict
from tabulate import tabulate
import spacy

And here’s the data directory we’ll be working from:

indir = Path("data/section_two/s1")

Finally, we initialize the model.

nlp = spacy.load('en_core_web_md')

11.4. Annotations#

To annotate a document, simply pass it to the model. We’ll use a short poem by Gertrude Stein to show this.

with indir.joinpath("stein_carafe.txt").open('r') as fin:
    poem = fin.read()

carafe = nlp(poem)

With this done, we inspect the result…

carafe
A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing. All this and not ordinary, not unordered in not resembling. The difference is spreading.

…which seems to be no different from a string representation! This output is misleading, however. Our annotated poem now has a ton of extra information associated with it, which is accessible via attributes and methods.

attributes = [i for i in dir(carafe) if not i.startswith("_")]
print("Number of attributes in a spaCy doc:", len(attributes))
Number of attributes in a spaCy doc: 52

This high number of attributes indicates an important point to keep in mind when working with spaCy and NLP generally: as we mentioned before, the primary data model for NLP aims to maximally preserve information about your document. It keeps documents intact and in fact adds much more information about them than Python’s base string methods provide. In this sense, we might say that spaCy is additive in nature, whereas text mining methods are subtractive, or reductive.

11.4.1. Document annotations#

spaCy annotations apply to either documents or individual tokens. Here are some document-level annotations:

annotations = {'Sentences': 'SENT_START', 'Dependencies': 'DEP', 'Tags': 'TAG'}
for annotation, tag in annotations.items():
    print(f"{annotation:<12} {carafe.has_annotation(tag)}")
Sentences    True
Dependencies True
Tags         True

Let’s look at sentences, which we access with .sents.

carafe.sents
<generator at 0x10c715ea0>

…with a slight hitch: this returns a generator, not a list. spaCy aims to be memory efficient (especially important for big corpora), so many of its annotations are stored this way. We’ll need to iterate through this generator to see its contents.

for sent in carafe.sents:
    print(sent.text)
A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing.
All this and not ordinary, not unordered in not resembling.
The difference is spreading.

One very useful attribute is .noun_chunks. It returns nouns and compound nouns in a document.

for chunk in carafe.noun_chunks:
    print(chunk)
A kind
glass
a cousin
a spectacle
nothing
a single hurt color
an arrangement
a system
All this
The difference

See how this picks up not only nouns, but articles and compound information? Articles could be helpful if you wanted to track singular/plural relationships, while compound nouns might tell you something about the way a document refers to the entities therein. The latter could have repeating patterns, and you might imagine how you could use noun chunks to create and count n-gram tokens and feed that into a classifier.

Consider this example from The Odyssey. Homer used many epithets and repeating phrases throughout his epic. According to some theories, these act as mnemonic devices, helping a performer keep everything in their head during an oral performance (the poem wasn’t written down in Homer’s day). Using .noun_chunks in conjunction with a Python Counter, we may be able to identify these in Homer’s text. Below, we’ll do so with The Odyssey Book XI.

First, let’s load and model the text.

with indir.joinpath("odyssey_book_11.txt").open('r') as fin:
    book11 = fin.read()

odyssey = nlp(book11)

Now we pass our noun chunks to the Counter. Be sure to grab only the .text attribute from each chunk; we don’t need the other attributes.

counts = Counter([chunk.text for chunk in odyssey.noun_chunks])

With that done, let’s look for repeating noun chunks with three or more words.

repeats = []
for chunk, count in counts.items():
    length = len(chunk.split())
    if length > 2 and count > 1:
        repeats.append([chunk, length])

print(tabulate(repeats, ['Chunk', 'Length']))
Chunk                       Length
------------------------  --------
the sea shore                    3
a fair wind                      3
the poor feckless ghosts         4
the same time                    3
the other side                   3
his golden sceptre               3
your own house                   3
her own son                      3
the Achaean land                 3
her own husband                  3
my wicked wife                   3
all the Danaans                  3
the poor creature                3

Another way to look at entities of this sort is with .ents. spaCy uses named-entity recognition (NER) to extract significant objects, or entities, in a document. In general, anything that has a proper name associated with it is likely to be an entity, but things like expressions of time and geographic location are also often tagged.

for ent in odyssey.ents[:5]:
    print(ent)
Circe
Oceanus
Cimmerians
one
Oceanus

Entities come with labels that differentiate what kind of entity they are. Using the .label_ attribute, we extract temporal entities in Book XI.

"; ".join(e.text for e in odyssey.ents if e.label_ == 'TIME')
'both night; all night; morning'

And here is a unique set of all the people:

set(e.text for e in odyssey.ents if e.label_ == 'PERSON')
{'Achilles',
 'Ajax',
 'Antiope',
 'Ariadne',
 'Cassandra',
 'Creon',
 'Dia',
 'Diana',
 'Echeneus',
 'Epeus',
 'Epicaste',
 'Eriphyle',
 'Eurypylus',
 'Gorgon',
 'Hebe',
 'Helen',
 'Hercules',
 'Iasus',
 'Jove',
 'King Alcinous',
 'Leda',
 'Leto',
 'Maera',
 'Minos',
 'Neleus',
 'Neoptolemus',
 'O Phaecians',
 'Orestes',
 'Ossa',
 'Panopeus',
 'Peleus',
 'Pelias',
 'Periclymenus',
 'Pero',
 'Phaedra',
 'Phylace',
 'Priam',
 'Proserpine',
 'Pylos',
 'Pytho',
 'Queen',
 'Salmoneus',
 'Scyros',
 'Sisyphus',
 'Teiresias',
 'Telamon',
 'Telemachus',
 'Theban Teiresias',
 'Thetis',
 'Troy',
 'Tyro',
 'Ulysses'}

Don’t see an entity that you know to be in your document? You can add more to the spaCy model. Doing so is beyond the scope of our workshop session, but the library’s EntityRuler documentation will show you how.
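If you’re curious, here is a minimal sketch of what that looks like in spaCy 3; the “Elpenor” pattern is purely illustrative (he appears in Book XI but is missing from the PERSON list above):

# Insert an entity_ruler pipe before the statistical NER so that
# hand-written patterns take precedence over the model's guesses
ruler = nlp.add_pipe('entity_ruler', before='ner')
ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Elpenor'}])

# Re-processing the text would now tag "Elpenor" as a PERSON
# odyssey = nlp(book11)

# Remove the pipe so the rest of this chapter is unaffected
nlp.remove_pipe('entity_ruler')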

11.4.2. Token annotations#

In addition to storing all of this information about documents, spaCy creates a substantial number of annotations for every token in a document. The same logic as above applies to accessing this information.

Let’s return to the Stein poem. Indexing it will return individual tokens.

carafe[3]
glass

A token’s attributes and methods range from simple booleans, like whether a token is alphabetic:

carafe[3].is_alpha
True

…or whether it is a stop word:

carafe[3].is_stop
False

…to more complex pieces of information, like tracing back to the sentence in which this token appears:

carafe[3].sent
A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing.

…or the token’s vector representation (more on this in the third session):

carafe[3].vector
array([-2.375   , -4.6198  , -3.2252  ,  3.8549  , -0.038654, -5.3678  ,
        3.1497  ,  5.764   ,  0.19113 ,  1.7685  ,  5.8596  , -1.6302  ,
       -2.8097  , -1.2745  ,  1.937   , -1.9711  ,  0.82933 ,  0.13711 ,
        5.9818  , -2.1468  ,  3.0062  ,  1.5146  ,  1.0455  , -1.6433  ,
       -5.714   , -3.2725  , -7.9325  , -1.5346  ,  1.6383  , -0.38003 ,
        0.3552  , -5.9469  ,  4.5539  , -1.2032  ,  0.66832 ,  0.74726 ,
       -0.17968 ,  2.0502  ,  3.0344  ,  0.55895 ,  1.0194  , -7.562   ,
        1.3125  ,  1.002   ,  1.7906  ,  1.8453  ,  2.9978  , -2.7806  ,
        3.1258  ,  2.3834  ,  3.1675  ,  3.2206  ,  2.3995  , -2.3102  ,
       -4.7973  ,  4.1421  ,  3.5867  , -2.0717  ,  1.9329  ,  4.7482  ,
        1.0141  , -3.5112  , -1.5582  , -2.9616  , -4.4105  , -0.79301 ,
       -3.3079  , -1.0187  ,  2.9166  , -0.03589 , -0.48378 , -0.10681 ,
       -2.1924  ,  4.543   , -0.89629 , -2.554   , -1.0931  , -3.2937  ,
       -1.5559  , -2.248   ,  1.5028  , -0.56177 ,  4.9519  , -5.1972  ,
        4.0128  , -0.5309  ,  2.8961  ,  0.4134  , -4.5235  , -1.6513  ,
       -4.3762  ,  3.5658  ,  1.4299  , -6.1665  ,  3.2851  , -4.8415  ,
        4.8492  , -2.0114  ,  0.48911 ,  1.2261  , -5.114   , -2.2333  ,
        1.8823  , -4.2833  ,  7.3651  , -0.79045 , -0.67217 , -4.5771  ,
        2.7402  ,  2.1193  , -3.2789  , -1.0388  ,  2.5737  , -2.5655  ,
        1.2905  ,  1.9414  ,  5.8181  , -2.0797  , -0.13823 , -0.63757 ,
       -4.3091  , -4.0641  , -1.7641  ,  6.1505  ,  0.479   , -3.3209  ,
       -0.92649 , -4.0634  ,  9.5734  , -2.6268  , -6.0084  ,  2.0715  ,
        7.7494  , -3.1998  , -5.1168  , -5.4073  ,  3.4898  , -3.068   ,
       -0.22946 ,  2.7993  ,  0.62587 ,  5.1366  , -4.2662  , -4.8456  ,
        3.6147  ,  2.196   , -3.3093  ,  2.9666  ,  3.9719  , -0.6797  ,
        0.68821 ,  0.71232 , -3.333   ,  0.29148 , -0.39584 ,  0.42873 ,
        6.4946  , -0.823   , -0.020651, -2.7477  , -2.6581  ,  3.1154  ,
        4.6083  , -0.25886 ,  4.4504  ,  1.9561  , -1.6606  , -1.2628  ,
        1.808   , -2.4556  , -2.2795  , -1.5849  ,  0.82376 ,  1.7998  ,
       -5.5956  ,  3.03    , -0.099104, -2.6838  , -1.5193  , -3.0473  ,
       -1.7901  , -3.6805  ,  0.46672 , -2.4157  , -3.5586  , -2.2314  ,
        3.0696  , -1.6802  , -3.1026  ,  1.6336  ,  0.11814 , -0.38136 ,
       -1.0735  , -0.41647 , -0.33304 ,  4.421   , -0.33598 , -0.11295 ,
        2.7921  , -1.2169  ,  6.9349  , -5.8916  ,  1.3116  ,  2.5849  ,
       -2.3761  , -1.9785  , -0.16168 ,  5.3954  ,  3.486   , -1.4455  ,
        2.4938  ,  3.6603  ,  2.8105  ,  3.5898  , -0.027119, -0.4058  ,
        1.0381  ,  2.0197  , -8.8603  ,  0.93775 ,  1.4575  ,  2.1531  ,
       -1.1737  , -2.0585  , -1.904   , -3.7927  ,  2.4029  ,  5.7362  ,
       -0.17973 , -1.7102  , -1.7022  ,  0.012114,  3.8014  , -4.9803  ,
       -1.8146  ,  0.27379 , -1.8618  ,  1.1915  ,  2.8944  ,  2.4326  ,
       -0.047042,  4.4882  ,  3.6984  ,  4.3737  ,  1.9543  , -4.9321  ,
       -2.9894  ,  5.1766  ,  1.9559  , -0.71326 , -2.1107  ,  4.4851  ,
       -0.40627 , -2.2319  ,  1.1195  , -3.6748  , -0.69771 ,  3.9417  ,
       -3.5574  , -3.2113  ,  0.16818 ,  4.9451  ,  0.21031 , -1.7497  ,
        4.2453  , -1.7539  , -0.25824 ,  4.7119  , -6.0458  , -1.823   ,
        4.9671  ,  0.56035 , -2.061   , -2.0699  , -5.2793  , -3.4587  ,
       -2.1138  ,  3.0765  , -0.2651  , -1.373   , -1.6493  ,  3.4943  ,
        4.5286  ,  1.7202  ,  3.2332  ,  1.4228  , -4.3931  , -0.198   ,
       -1.1396  ,  2.7613  ,  1.0529  ,  0.082944,  3.6753  ,  1.901   ,
       -3.4038  , -2.3816  ,  0.72765 ,  3.7799  ,  1.6928  , -3.4926  ],
      dtype=float32)

Here’s a listing of some attributes that are relevant for text mining:

attributes = [[tok.text, tok.is_punct, tok.like_url] for tok in carafe]
print(tabulate(attributes, ['Text', 'Is punctuation', 'Is a URL']))
Text         Is punctuation    Is a URL
-----------  ----------------  ----------
A            False             False
kind         False             False
in           False             False
glass        False             False
and          False             False
a            False             False
cousin       False             False
,            True              False
a            False             False
spectacle    False             False
and          False             False
nothing      False             False
strange      False             False
a            False             False
single       False             False
hurt         False             False
color        False             False
and          False             False
an           False             False
arrangement  False             False
in           False             False
a            False             False
system       False             False
to           False             False
pointing     False             False
.            True              False
All          False             False
this         False             False
and          False             False
not          False             False
ordinary     False             False
,            True              False
not          False             False
unordered    False             False
in           False             False
not          False             False
resembling   False             False
.            True              False
The          False             False
difference   False             False
is           False             False
spreading    False             False
.            True              False

We’ll discuss some of the more complex annotations later on, both in this session and others. For now, let’s collect some simple information about each of the tokens in our document. To do so, we filter tokens on their boolean attributes and join the .text attribute of each match.

words = ' '.join(tok.text for tok in carafe if tok.is_alpha)
punct = ' '.join(tok.text for tok in carafe if tok.is_punct)

print("Words:", words)
print("Punctuation:", punct)
Words: A kind in glass and a cousin a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing All this and not ordinary not unordered in not resembling The difference is spreading
Punctuation: , . , . .

Want some linguistic information? We can get that too. For example, here are lemmas:

lemmas = [[tok.text, tok.lemma_] for tok in carafe]
print(tabulate(lemmas, ['Token', 'Lemma']))
Token        Lemma
-----------  -----------
A            a
kind         kind
in           in
glass        glass
and          and
a            a
cousin       cousin
,            ,
a            a
spectacle    spectacle
and          and
nothing      nothing
strange      strange
a            a
single       single
hurt         hurt
color        color
and          and
an           an
arrangement  arrangement
in           in
a            a
system       system
to           to
pointing     point
.            .
All          all
this         this
and          and
not          not
ordinary     ordinary
,            ,
not          not
unordered    unordered
in           in
not          not
resembling   resemble
.            .
The          the
difference   difference
is           be
spreading    spread
.            .

With such attributes at your disposal, you might imagine how you could work spaCy into a text mining pipeline. Instead of using separate functions to clean your corpus, those steps could all be accomplished by accessing attributes.
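As a minimal sketch, a cleaning function built entirely from token attributes might look like this:

def clean(doc):
    """Lemmatize a spaCy doc, dropping stop words and punctuation."""
    return [
        tok.lemma_.lower()
        for tok in doc
        if tok.is_alpha and not tok.is_stop
    ]

cleaned = clean(carafe)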

But before you do this, you should consider 1) whether the increased computational/memory overhead is worthwhile for your project; and 2) whether spaCy’s base models will work for the kind of text you’re using. This second point is especially important. While spaCy’s base models are incredibly powerful, they are built for general-purpose applications and may struggle with domain-specific language. Medical text and early modern print are two examples where the base models can interpret your documents in unexpected ways, complicating, maybe even ruining, parts of a text mining pipeline that relies on them.

That all said, there are ways to train your own spaCy model on a specific domain. This can be an extensive process, one which exceeds the limits of our short workshop, but if you want to learn more about doing so, you can visit spaCy’s training documentation. There are also third-party models available, which you might find useful, though your mileage may vary.

11.5. Part-of-Speech Tagging#

One of the most common tasks in NLP involves assigning part-of-speech, or POS, tags to each token in a document. As we saw in the text mining series, these tags are a necessary step for certain text cleaning processes, like lemmatization; you might also use them to identify subsets of your data, which you could separate out and model. Beyond text cleaning, POS tags can be useful for tasks like word sense disambiguation, where you try to determine which particular facet of meaning a given token represents.

Regardless of the task, the process of getting POS tags from spaCy will be the same. Each token in a document has an associated tag, which is accessible as an attribute.

pos_tags = [[tok.text, tok.pos_] for tok in carafe]
print(tabulate(pos_tags, ['Token', 'Tag']))
Token        Tag
-----------  -----
A            DET
kind         NOUN
in           ADP
glass        NOUN
and          CCONJ
a            DET
cousin       NOUN
,            PUNCT
a            DET
spectacle    NOUN
and          CCONJ
nothing      PRON
strange      ADJ
a            DET
single       ADJ
hurt         NOUN
color        NOUN
and          CCONJ
an           DET
arrangement  NOUN
in           ADP
a            DET
system       NOUN
to           ADP
pointing     VERB
.            PUNCT
All          DET
this         PRON
and          CCONJ
not          PART
ordinary     ADJ
,            PUNCT
not          PART
unordered    ADJ
in           ADP
not          PART
resembling   VERB
.            PUNCT
The          DET
difference   NOUN
is           AUX
spreading    VERB
.            PUNCT

If you don’t know what a tag means, use spacy.explain().

spacy.explain('CCONJ')
'coordinating conjunction'
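These tags also make it easy to pull out subsets of your data, as mentioned above. A quick sketch, grabbing every verb in the Stein poem (the output follows from the table of tags we just printed):

# Collect every verb in the poem, counting auxiliaries like "is"
verbs = [tok.text for tok in carafe if tok.pos_ in ('VERB', 'AUX')]
print(verbs)
['pointing', 'resembling', 'is', 'spreading']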

spaCy actually has two types of POS tags. The ones accessible with the .pos_ attribute are the basic tags, whereas those under .tag_ are more detailed (these come from the Penn Treebank project).

treebank = [[tok.text, tok.tag_, spacy.explain(tok.tag_)] for tok in carafe]
print(tabulate(treebank, ['Token', 'Tag', 'Explanation']))
Token        Tag    Explanation
-----------  -----  --------------------------------------------------
A            DT     determiner
kind         NN     noun, singular or mass
in           IN     conjunction, subordinating or preposition
glass        NN     noun, singular or mass
and          CC     conjunction, coordinating
a            DT     determiner
cousin       NN     noun, singular or mass
,            ,      punctuation mark, comma
a            DT     determiner
spectacle    NN     noun, singular or mass
and          CC     conjunction, coordinating
nothing      NN     noun, singular or mass
strange      JJ     adjective (English), other noun-modifier (Chinese)
a            DT     determiner
single       JJ     adjective (English), other noun-modifier (Chinese)
hurt         NN     noun, singular or mass
color        NN     noun, singular or mass
and          CC     conjunction, coordinating
an           DT     determiner
arrangement  NN     noun, singular or mass
in           IN     conjunction, subordinating or preposition
a            DT     determiner
system       NN     noun, singular or mass
to           IN     conjunction, subordinating or preposition
pointing     VBG    verb, gerund or present participle
.            .      punctuation mark, sentence closer
All          PDT    predeterminer
this         DT     determiner
and          CC     conjunction, coordinating
not          RB     adverb
ordinary     JJ     adjective (English), other noun-modifier (Chinese)
,            ,      punctuation mark, comma
not          RB     adverb
unordered    JJ     adjective (English), other noun-modifier (Chinese)
in           IN     conjunction, subordinating or preposition
not          RB     adverb
resembling   VBG    verb, gerund or present participle
.            .      punctuation mark, sentence closer
The          DT     determiner
difference   NN     noun, singular or mass
is           VBZ    verb, 3rd person singular present
spreading    VBG    verb, gerund or present participle
.            .      punctuation mark, sentence closer

11.6. Dependency Parsing#

Another tool that can help with tasks like disambiguating word sense is dependency parsing. Dependency parsing involves analyzing the grammatical structure of text (usually sentences) to identify relationships between the words therein. The basic idea is that every word in a sentence is linked to at least one other word via a tree structure, and these linkages are hierarchical. Dependency parsing can tell you information about:

  1. The primary subject of a sentence (and whether it is an active or passive subject)

  2. Various heads, which determine the syntactic categories of a phrase; these are often nouns and verbs

  3. Various dependents, which modify, either directly or indirectly, their heads (think adjectives, adverbs, etc.)

  4. The root of the sentence, which is often (but not always!) the primary verb

Linguists have developed a number of different methods to parse dependencies, which we won’t discuss here. Take note, though, that the most popular one in NLP is the Universal Dependencies framework; spaCy, like most NLP models, uses this. The library also has some functionality for visualizing dependencies, which will help clarify what dependencies are in the first place. Below, we visualize a sentence from the Stein poem.

to_render = list(carafe.sents)[2]
spacy.displacy.render(to_render, style = 'dep')
[Dependency visualization of “The difference is spreading.” showing det, nsubj, and aux arcs, with “spreading” as the root.]

See how the arcs have arrows? Arrows point to the dependents within a phrase or sentence, that is, they point to modifying relationships between words. Arrows arc out from a head, and the relationships they indicate are all specified with labels. As with the POS tags, you can use spacy.explain() on the dependency labels, which we’ll do below. The whole list of them is also available in this table of typologies. Finally, somewhere in the tree you’ll find a word with no arrows pointing to it (here, “spreading”). This is the root. One of its dependents is the subject of the sentence (here, “difference”).
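You don’t have to read this information off the visualization, either. Here is a minimal sketch that recovers the root and its direct dependents programmatically, using a sentence’s .root attribute and a token’s .children; the expected values in the comments follow from the parse above:

sent = list(carafe.sents)[2]

# The root is the one token with no incoming arc
print(sent.root.text)                # spreading

# A token's direct dependents are its children
for child in sent.root.children:
    print(child.text, child.dep_)    # difference nsubj; is aux; . punct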

These relationships are quite useful in and of themselves, but the real power of dependency parsing comes in all the extra data it can provide about a token. Using this technique, you can link tokens back to their heads, or find local groupings of tokens that all refer to the same head.

With this sentence, for example:

sentence = odyssey[2246:2260]
sentence.text
"Then I tried to find some way of embracing my mother's ghost."

We can construct a for loop to roll through each token and retrieve dependency info.

dependencies = []
for tok in sentence:
    info = [tok.text, tok.dep_, spacy.explain(tok.dep_), tok.head.text]
    dependencies.append(info)

print(tabulate(dependencies, ['Text', 'Dependency', 'Explanation', 'Head']))
Text       Dependency    Explanation                Head
---------  ------------  -------------------------  ---------
Then       advmod        adverbial modifier         tried
I          nsubj         nominal subject            tried
tried      ROOT          root                       tried
to         aux           auxiliary                  find
find       xcomp         open clausal complement    tried
some       det           determiner                 way
way        dobj          direct object              find
of         prep          prepositional modifier     way
embracing  pcomp         complement of preposition  of
my         poss          possession modifier        mother
mother     poss          possession modifier        ghost
's         case          case marking               mother
ghost      dobj          direct object              embracing
.          punct         punctuation                tried

How many tokens are associated with each head?

heads = Counter(head for (tok, dep, exp, head) in dependencies)
print(tabulate(heads.items(), ['Head', 'Count']))
Head         Count
---------  -------
tried            5
find             2
way              2
of               1
mother           2
ghost            1
embracing        1

We can also find which tokens are associated with each head. spaCy has a special .subtree attribute for each token, which produces this grouping. As you might expect by now, .subtree returns a generator, so convert it to a list or use a comprehension to extract the tokens. We’ll do this in a separate function. Within this function, we use a token’s .text_with_ws attribute to return an exact string representation of the subtree, whitespace included.

def subtree_to_text(subtree):
    """Convert a subtree to its text representation."""
    subtree = ''.join(tok.text_with_ws for tok in subtree)

    return subtree.strip()

subtrees = []
for tok in sentence:
    subtree = subtree_to_text(tok.subtree)
    subtrees.append([tok.text, tok.dep_, subtree])

print(tabulate(subtrees, ['Token', 'Dependency', 'Subtree']))
Token      Dependency    Subtree
---------  ------------  --------------------------------------------------------------
Then       advmod        Then
I          nsubj         I
tried      ROOT          "Then I tried to find some way of embracing my mother's ghost.
to         aux           to
find       xcomp         to find some way of embracing my mother's ghost
some       det           some
way        dobj          some way of embracing my mother's ghost
of         prep          of embracing my mother's ghost
embracing  pcomp         embracing my mother's ghost
my         poss          my
mother     poss          my mother's
's         case          's
ghost      dobj          my mother's ghost
.          punct         .

11.7. Putting Everything Together#

Now that we’ve walked through all these options, let’s put them into action. Below, we construct two short examples of how you might combine different aspects of token attributes to analyze a text. Both are essentially information retrieval tasks, and you might imagine doing something similar to extract and analyze particular words in your corpus, or to find different grammatical patterns that could be of significance.

11.7.1. Finding lemmas#

In the first, we use the .lemma_ attribute to search through Book XI and match its tokens to a few key words. If you’ve read The Odyssey, you’ll know that Book XI is where Odysseus and his fellow sailors have to travel down to the underworld Hades, where they speak with the dead. We already saw one example of this: Odysseus attempts to embrace his dead mother after communing with her. The whole trip to Hades is an emotionally tumultuous experience for the travelers, and peppered throughout Book XI are expressions of grief.

With .lemma_, we can search for these expressions. We’ll roll through the text and determine whether a token lemma matches one of a selected set. When we find a match, we get the subtree of this token’s head. That is, we find the head upon which this token depends, and then we use that to reconstruct the local context for the token.

target = ('cry', 'grief', 'grieve', 'sad', 'sorrow', 'tear', 'weep')
retrieved = []
for tok in odyssey:
    if tok.lemma_ in target:
        subtree = subtree_to_text(tok.head.subtree)
        retrieved.append([tok.text, subtree])

print(tabulate(retrieved, ['Token', 'Subtree']))
Token    Subtree
-------  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cried    I was very sorry for him, and cried when I saw him: 'Elpenor,' said I, 'how did you come down here into this gloom and darkness?
sad      sad talk
tears    to tears
sorrow   for all my sorrow
sad      this sad place
tears    in tears
grieves  He grieves continually about your never having come home, and suffers more and more as he grows older.
sad      sad comfort
sorrows  of our sorrows
grief    of grief
grief    in great grief for the spite the gods had borne him
grief    for grief
sadder   the still sadder tale of those of my comrades who did not fall fighting with the Trojans
weeping  As soon as he had tasted the blood he knew me, and weeping bitterly stretched out his arms towards me to embrace me; but he had no strength nor substance any more, and I too wept and pitied him as I beheld him.
wept     he had no strength nor substance any more, and I too wept and pitied him as I beheld him.
weeping  As we two sat weeping and talking thus sadly with one another the ghost of Achilles came up to us with Patroclus, Antilochus, and Ajax who was the finest and goodliest man of all the Danaans after the son of Peleus
tear     wipe a tear from his cheek
cries    uttered such appalling cries

11.7.2. Verb-subject relations#

For our second example, we use dependency tags to find the subject of sentences in Book XI. As before, we iterate through each token in the document, this time checking to see whether it has the nsubj or nsubjpass tag for its .dep_ attribute. We also check whether a token is a noun (otherwise we’d get many pronouns like “who,” “them,” etc.). If a token matches these two conditions, we find its head verb as well as the token’s subtree. Note that this time, the subtree will refer directly to the token in question, not to the head. This will let us capture some descriptive information about each sentence subject.

subj = []
for tok in odyssey:
    if tok.dep_ in ('nsubj', 'nsubjpass') and tok.pos_ in ('NOUN', 'PROPN'):
        subtree = subtree_to_text(tok.subtree)
        subj.append([tok.text, tok.head.text, tok.head.lemma_, subtree])

print(tabulate(subj, ['Subject', 'Head', 'Head lemma', 'Subtree']))
Subject       Head          Head lemma    Subtree
------------  ------------  ------------  ----------------------------------------------------------------
Circe         sent          send          Circe, that great and cunning goddess,
wind          headed        head          the wind and helmsman
sails         were          be            her sails
sun           went          go            the sun
darkness      was           be            darkness
rays          pierce        pierce        the rays of the sun
wretches      live          live          the poor wretches
Circe         told          tell          Circe
Eurylochus    held          hold          Eurylochus
Teiresias     have          have          Teiresias
blood         run           run           the blood
ghosts        came          come          the ghosts
men           worn          wear          old men
armour        smirched      smirch        their armour
ghosts        come          come          the poor feckless ghosts
Teiresias     answered      answer        Teiresias
ghost         was           be            The first ghost 'that came
Elpenor       said          say           Elpenor
ghost         saying        say           the ghost of my comrade
ghost         came          come          the ghost of my dead mother Anticlea, daughter to Autolycus
ghost         came          come          the ghost of Theban Teiresias, with his golden sceptre
man           left          leave         poor man
heaven        make          make          heaven
ship          reaches       reach         your ship
hardship      reach         reach         much hardship
people        heard         hear          the people
wayfarer      meet          meet          A wayfarer
death         come          come          death
life          ebb           ebb           your life
people        bless         bless         your people
ghost         close         close         my poor mother's ghost
taste         talk          talk          taste of the blood
ghost         went          go            the ghost of Teiresias
prophecyings  spoken        speak         his prophecyings
mother        came          come          my mother
man           cross         cross         no man
heaven        vouchsafe     vouchsafe     heaven
wife          intends       intend        my wife
mother        answered      answer        My mother
wife          remains       remain        Your wife
one           got           get           No one
Telemachus    holds         hold          Telemachus
lands         undisturbed   undisturbed   your lands
one           invites       invite        every one
father        remains       remain        your father
weather       comes         come          the warm weather
heaven        take          take          heaven
Proserpine    want          want          Proserpine
people        are           be            all people
sinews        hold          hold          The sinews
life          left          leave         life
soul          flits         flit          the soul
Proserpine    sent          send          anon Proserpine
one           told          tell          each one as I questioned her
wave          arched        arch          a huge blue wave
god           accomplished  accomplish    the god
embraces      are           be            the embraces of the gods
Pelias        was           be            Pelias
rest          were          be            The rest of her children
gods          proclaimed    proclaim      the gods
gods          borne         bear          the gods
Epicaste      went          go            Epicaste
spirits       haunted       haunt         the avenging spirits
Chloris       given         give          Chloris, whom Neleus married for her beauty,
Neleus        married       marry         Neleus
woman         round         round         marvellously lovely woman Pero, who was wooed by all the country
Neleus        give          give          Neleus
man           was           be            The only man who would undertake to raid
will          was           be            the will of heaven
rangers       caught        catch         the rangers of the cattle
year          passed        pass          a full year
season        came          come          the same season
Iphicles      set           set           Iphicles
heroes        lying         lie           Both these heroes
Orion         excepted      except        Orion
Apollo        killed        kill          Apollo, son of Leto,
Theseus       carrying      carry         Theseus
Diana         killed        kill          Diana
Bacchus       said          say           Bacchus
guests        sat           sit           the guests
Arete         said          say           Arete
friends       spoke         speak         My friends
queen         said          say           our august queen
decision      rests         rest          the decision
thing         done          do            The thing
guest         is            be            Our guest
Ulysses       answered      answer        Ulysses
Ulysses       replied       reply         "Ulysses,"
evenings      are           be            The evenings
time-         go            go            bed time-
Ulysses       answered      answer        Ulysses
Proserpine    dismissed     dismiss       Proserpine
ghost         came          come          the ghost of Agamemnon son of Atreus
Neptune       raise         raise         Neptune
enemies       make          make          your enemies
foes          despatch      despatch      my foes
Aegisthus     were          be            Aegisthus and my wicked wife
comrades      slain         slay          my comrades
daughter      scream        scream        Priam's daughter Cassandra
Clytemnestra  killed        kill          Clytemnestra
crime         brought       bring         her abominable crime
Jove          hated         hate          Jove
Clytemnestra  hatched       hatch         Clytemnestra
wife          is            be            your wife, Ulysses,
Penelope      is            be            Penelope
child         grown         grow          This child
wife          allow         allow         my wicked wife
son           is            be            your son
ghost         came          come          the ghost of Achilles
descendant    knew          know          The fleet descendant of Aeacus
one           was           be            no one
limbs         fail          fail          his limbs
judgement     was           be            his judgement
Nestor        were          be            Nestor and I
man           kill          kill          Many a man
others        fell          fall          many others also of the Ceteians
Epeus         made          make          Epeus
leaders       drying        dry           all the other leaders and chief men among the Danaans
rage          is            be            the rage of Mars
ghost         strode        stride        the ghost of Achilles
ghosts        stood         stand         The ghosts of other dead men
Thetis        offered       offer         Thetis
prisoners     were          be            the Trojan prisoners and Minerva
judgement     rankle        rankle        the judgement about that hateful armour
blame         laid          lay           the blame
Jove          bore          bear          Jove
ghosts        gathered      gather        the ghosts
Tityus        stretched     stretch       Tityus son of Gaia
vultures      digging       dig           Two vultures on either side of him
creature      stooped       stoop         the poor creature
creature      stretched     stretch       the poor creature
wind          tossed        toss          the wind
weight        be            be            its weight
stone         come          come          the pitiless stone
sweat         ran           run           the sweat
steam         rose          rise          the steam
ghosts        screaming     scream        The ghosts
man           do            do            The man who made that belt
Hercules      knew          know          Hercules
Ulysses       are           be            my poor Ulysses, noble son of Laertes,
Mercury       helped        help          Mercury and Minerva
thousands     came          come          many thousands of ghosts
Proserpine    send          send          Proserpine
ship          went          go            the ship
wind          sprang        spring        a fair wind

How many times does each of our subjects appear?

subjects = Counter(subject for (subject, head, lemma, subtree) in subj)
print(tabulate(subjects.items(), ['Subject', 'Count']))
Subject         Count
------------  -------
Circe               2
wind                3
sails               1
sun                 1
darkness            1
rays                1
wretches            1
Eurylochus          1
Teiresias           2
blood               1
ghosts              5
men                 1
armour              1
ghost               9
Elpenor             1
man                 5
heaven              3
ship                2
hardship            1
people              3
wayfarer            1
death               1
life                2
taste               1
prophecyings        1
mother              2
wife                4
one                 4
Telemachus          1
lands               1
father              1
weather             1
Proserpine          4
sinews              1
soul                1
wave                1
god                 1
embraces            1
Pelias              1
rest                1
gods                2
Epicaste            1
spirits             1
Chloris             1
Neleus              2
woman               1
will                1
rangers             1
year                1
season              1
Iphicles            1
heroes              1
Orion               1
Apollo              1
Theseus             1
Diana               1
Bacchus             1
guests              1
Arete               1
friends             1
queen               1
decision            1
thing               1
guest               1
Ulysses             4
evenings            1
time-               1
Neptune             1
enemies             1
foes                1
Aegisthus           1
comrades            1
daughter            1
Clytemnestra        2
crime               1
Jove                2
Penelope            1
child               1
son                 1
descendant          1
limbs               1
judgement           2
Nestor              1
others              1
Epeus               1
leaders             1
rage                1
Thetis              1
prisoners           1
blame               1
Tityus              1
vultures            1
creature            2
weight              1
stone               1
sweat               1
steam               1
Hercules            1
Mercury             1
thousands           1

Which heads are associated with each subject?

subject_heads = defaultdict(list)
for item in subj:
    subject, head, *_ = item
    subject_heads[subject].append(head)

associations = [
    [subject, ", ".join(heads)] for subject, heads in subject_heads.items()
]
print(tabulate(associations, headers=["Subject", "Associated heads"]))
Subject       Associated heads
------------  --------------------------------------------------------
Circe         sent, told
wind          headed, tossed, sprang
sails         were
sun           went
darkness      was
rays          pierce
wretches      live
Eurylochus    held
Teiresias     have, answered
blood         run
ghosts        came, come, stood, gathered, screaming
men           worn
armour        smirched
ghost         was, saying, came, came, close, went, came, came, strode
Elpenor       said
man           left, cross, was, kill, do
heaven        make, vouchsafe, take
ship          reaches, went
hardship      reach
people        heard, bless, are
wayfarer      meet
death         come
life          ebb, left
taste         talk
prophecyings  spoken
mother        came, answered
wife          intends, remains, is, allow
one           got, invites, told, was
Telemachus    holds
lands         undisturbed
father        remains
weather       comes
Proserpine    want, sent, dismissed, send
sinews        hold
soul          flits
wave          arched
god           accomplished
embraces      are
Pelias        was
rest          were
gods          proclaimed, borne
Epicaste      went
spirits       haunted
Chloris       given
Neleus        married, give
woman         round
will          was
rangers       caught
year          passed
season        came
Iphicles      set
heroes        lying
Orion         excepted
Apollo        killed
Theseus       carrying
Diana         killed
Bacchus       said
guests        sat
Arete         said
friends       spoke
queen         said
decision      rests
thing         done
guest         is
Ulysses       answered, replied, answered, are
evenings      are
time-         go
Neptune       raise
enemies       make
foes          despatch
Aegisthus     were
comrades      slain
daughter      scream
Clytemnestra  killed, hatched
crime         brought
Jove          hated, bore
Penelope      is
child         grown
son           is
descendant    knew
limbs         fail
judgement     was, rankle
Nestor        were
others        fell
Epeus         made
leaders       drying
rage          is
Thetis        offered
prisoners     were
blame         laid
Tityus        stretched
vultures      digging
creature      stooped, stretched
weight        be
stone         come
sweat         ran
steam         rose
Hercules      knew
Mercury       helped
thousands     came

Such information provides another way of looking at something like topicality. Rather than using, say, a bag of words approach to build a topic model, you could instead segment your text into chunks like the above and start tallying up token distributions. Such distributions might help you identify the primary subject in a passage of text, whether that be a character or something like a concept. Or, you could leverage them to investigate how different subjects are talked about, say by throwing POS tags into the mix to further nuance relationships across entities.
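As a minimal sketch of that second idea, we can reuse the subj list from above and tally which head lemmas co-occur with a given subject:

# Tally head lemmas per subject
subject_lemmas = defaultdict(Counter)
for subject, head, lemma, subtree in subj:
    subject_lemmas[subject][lemma] += 1

# What do ghosts most often do in Book XI? Mostly, they come
print(subject_lemmas['ghost'].most_common(3))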

Our next session will demonstrate what such investigations look like in action. For now, however, the main takeaway is that the above annotation structures provide you with a host of different ways to segment and facet your text data. You are by no means limited to single token counts when computationally analyzing text. Indeed, sometimes the most compelling ways to explore a corpus lie in the broader, fuzzier relationships that NLP annotations help us identify.