2. Text Annotation with spaCy#
This chapter introduces workshop participants to the general field of natural language processing, or NLP. While NLP is often used interchangeably with text mining/analytics in introductory settings, the former differs in important ways from many of the core methods in the latter. We will highlight a few such differences over the course of this session, and then more generally throughout the workshop series as a whole.
Learning objectives
By the end of this chapter, you will be able to:
Explain how document annotation differs from other representations of text data
Have a general sense of how spaCy models and their respective pipelines work
Extract linguistic information about text using spaCy
Describe key terms in NLP, like part-of-speech tagging, dependency parsing, etc.
Know how/where to look for more information about the linguistic data spaCy makes available
2.1. NLP vs. Text Mining: In Brief#
The short space of this reader, as well as that of our series, necessarily limits a conversation about everything that distinguishes NLP from text mining/analytics. That there are differences at all is in itself worth noting and merits further exploration. But given this series’s focus, two of these differences are especially worth calling out.
2.1.1. Data structures#
At the outset, one way of distinguishing NLP from text mining has to do with NLP’s underlying data structure. Generally speaking, NLP methods are maximally preservative when it comes to representing textual information in a way computers can read. Unlike text mining’s atomizing focus on bags of words, in NLP we often use literal transcriptions of the input text and run our analyses directly on that. This is because much of the information NLP methods provide is context-sensitive: we need to know, for example, the subject of a sentence in order to do dependency parsing; part-of-speech taggers are most effective when they have surrounding tokens to consider. Accordingly, our workflow needs to retain as much information about our documents as possible, for as long as possible. In fact, many NLP methods build on each other, so data about our documents will grow over the course of processing them (rather than getting pared down, as with text mining). The dominant paradigm, then, for thinking about how text data is represented in NLP is annotation: NLP tends to add, associate, or tag documents with extra information.
2.1.2. Model-driven methods#
The other key difference between text mining and NLP – which goes hand-in-hand with the idea of annotation – lies in the fact that the latter tends to be more model-driven. NLP methods often rely on statistical models to create the above information, and ultimately these models have a lot of assumptions baked into them. Such assumptions range from philosophy of language (how do we know we’re analyzing meaning?) to the data on which the models are trained (what does a model represent, and what biases might thereby be involved?). Of course, it’s possible to build your own models, and indeed a later chapter will show you how to do so, but you’ll often find yourself using other researchers’ models when doing NLP work. It’s thus very important to know how researchers have built their models so you can do your own work responsibly.
Keep in mind
Throughout this series, we will be using NLP methods in the context of text-based data, but NLP applies more widely to speech data as well.
2.2. spaCy Language Models#
Much of this workshop series will use language models from spaCy, one of the most popular NLP libraries in Python. spaCy is both a framework and a model resource. It offers access to models through a unique set of coding workflows, which we’ll discuss below (you can also train your own models with the library). Learning about these workflows will help us annotate documents with extra information that will, in turn, enable us to perform a number of different NLP tasks.
2.2.1. spaCy pipelines#
In essence, a spaCy model is a collection of sub-models arranged into a pipeline. The idea here is that you send a document through this pipeline, and the model does the work of annotating your document. Once it has finished, you can access these annotations to perform whatever analysis you’d like to do.
Every component, or pipe, in a spaCy pipeline performs a different task, from tokenization to part-of-speech tagging and named-entity recognition. Each model comes with a specific ordering of these tasks, but you can mix and match them after the fact, adding or removing pipes as you see fit. The result is a wide set of options; the present workshop series only samples a few core aspects of the library’s overall capabilities.
2.2.2. Downloading a model#
The specific model we’ll be using is spaCy’s medium-sized English model: en_core_web_md. It’s been trained on the OntoNotes corpus and it features several useful pipes, which we’ll discuss below.
If you haven’t used spaCy before, you’ll need to download this model. You can do so by running the following in a command line interface:
python -m spacy download en_core_web_md
Just be sure you run this while working in the Python environment you’d like to use!
Once this downloads, you can load the model with the code below. Note that it’s conventional to assign the model to a variable called nlp.
import spacy
nlp = spacy.load('en_core_web_md')
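If you’re curious which pipes this model contains (recall the pipeline discussion in 2.2.1), you can list them, and you can even load the model with some of them disabled. The snippet below is a minimal sketch; the exact pipe names will vary with your spaCy and model versions.
print(nlp.pipe_names)
# Load the same model without its named-entity recognizer
nlp_no_ner = spacy.load('en_core_web_md', disable=['ner'])
print(nlp_no_ner.pipe_names)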
2.3. Annotations#
With the model loaded, we can send a document through the pipeline, which will in turn produce our text annotations. To annotate a document with the spaCy model, simply run it through the core function, nlp(). We’ll do so with a short poem by Gertrude Stein.
with open('data/session_one/stein_carafe.txt', 'r') as f:
    stein_poem = f.read()
carafe = nlp(stein_poem)
With this done, we can inspect the result…
carafe
A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing. All this and not ordinary, not unordered in not resembling. The difference is spreading.
…which seems to be no different from a string representation! This output is a bit misleading, however. Our carafe object actually has a ton of extra information associated with it, even though, on the surface, it appears to be a plain old string.
If you’d like, you can inspect all these attributes and methods with:
attributes = [i for i in dir(carafe) if i.startswith("_") is False]
We won’t show them all here, but suffice it to say, there are a lot!
print("Number of attributes in a SpaCy doc:", len(attributes))
Number of attributes in a SpaCy doc: 51
This high number of attributes indicates an important point to keep in mind when working with spaCy and NLP generally: as we mentioned before, the primary data model for NLP aims to maximally preserve information about your document. It keeps documents intact and in fact adds much more information about them than Python’s base string methods provide. In this sense, we might say that spaCy is additive in nature, whereas text mining methods are subtractive, or reductive.
2.3.1. Document Annotations#
So, while the base representation of carafe looks like a string, under the surface there are all sorts of annotations about it. To access them, we use the attributes counted above. For example, spaCy adds extra segmentation information about a document, like which parts of it belong to different sentences. We can check to see whether this information has been attached to our text with the .has_annotation() method.
carafe.has_annotation('SENT_START')
True
We can use the same method to check for a few other annotations:
annotation_types = {'Dependencies': 'DEP', 'Entities': 'ENT_IOB', 'Tags': 'TAG'}
for a, t in annotation_types.items():
    print(
        f"{a:>12}: {carafe.has_annotation(t)}"
    )
Dependencies: True
Entities: True
Tags: True
Let’s look at sentences. We can access them with .sents.
carafe.sents
<generator at 0x11c014f40>
…but you can see that there’s a small complication here: .sents returns a generator, not a list. The reason has to do with memory efficiency. Because spaCy adds so much extra information about your document, this information could slow down your code or overwhelm your computer if the library didn’t store it in an efficient manner. Of course this isn’t a problem with our small poem, but you can imagine how it could become one with a big corpus.
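That said, you don’t always need to materialize a list: a generator can be consumed lazily, one sentence at a time. A quick sketch:
for sent in carafe.sents:
    print(sent.text)
# Or pull out just the first sentence
first_sentence = next(carafe.sents)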
To work with the actual sentences in carafe and index into them, we’ll convert the generator to a list.
import textwrap
sentences = list(carafe.sents)
for s in sentences:
    s = textwrap.shorten(s.text, width=100)
    print(s)
A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an [...]
All this and not ordinary, not unordered in not resembling.
The difference is spreading.
One very useful attribute is .noun_chunks. It returns nouns and compound nouns in a document.
noun_chunks = list(carafe.noun_chunks)
for noun in noun_chunks:
    print(noun)
A kind
glass
a cousin
a spectacle
nothing
a single hurt color
an arrangement
a system
All this
The difference
See how this picks up not only nouns, but articles and compound information? Articles could be helpful if you wanted to track singular/plural relationships, while compound nouns might tell you something about the way a document refers to the entities therein. The latter could have repeating patterns, and you might imagine how you could use noun chunks to create and count n-gram tokens and feed that into a classifier.
Consider this example from The Odyssey. Homer used many epithets and repeating phrases throughout his epic. According to some theories, these act as mnemonic devices, helping a performer keep everything in their head during an oral performance (the poem wasn’t written down in Homer’s day). Using .noun_chunks in conjunction with a Python Counter, we may be able to identify these in Homer’s text. Below, we’ll do so with The Odyssey Book XI.
First, let’s load and model the text.
with open('data/session_one/odyssey_book_11.txt', 'r') as f:
    book_eleven = f.read()
odyssey = nlp(book_eleven)
Now we’ll import a Counter and initialize it. Then we’ll get the noun chunks from the document and populate the counter with a list comprehension. Be sure to only grab the text from each chunk; we’ll explain why in a little while.
from collections import Counter
noun_counts = Counter([chunk.text for chunk in odyssey.noun_chunks])
With that done, let’s look for repeating noun chunks with three or more words.
import pandas as pd
chunks = []
for chunk, count in noun_counts.items():
    chunk = chunk.split()
    if (len(chunk) > 2) and (count > 1):
        joined = ' '.join(chunk)
        chunks.append({
            'PHRASE': joined,
            'COUNT': count
        })
chunks = pd.DataFrame(chunks).set_index('PHRASE')
chunks
PHRASE | COUNT |
---|---|
the sea shore | 2 |
a fair wind | 2 |
the poor feckless ghosts | 2 |
the same time | 2 |
the other side | 2 |
his golden sceptre | 2 |
your own house | 2 |
her own son | 2 |
the Achaean land | 2 |
her own husband | 2 |
my wicked wife | 2 |
all the Danaans | 2 |
the poor creature | 2 |
Excellent! Looks like we turned up a few: “the poor feckless ghosts,” “my wicked wife,” and “all the Danaans” are likely the kind of repeating phrases scholars think of in Homer’s text.
Another way to look at entities in a text is with .ents. spaCy uses named-entity recognition to extract significant objects, or entities, in a document. In general, anything that has a proper name associated with it is considered an entity, but things like expressions of time and geographic location are also often tagged. Here are the first five from Book XI above.
entities = list(odyssey.ents)
count = 0
while count < 5:
    print(entities[count])
    count += 1
Circe
Circe
Here Perimedes and Eurylochus
first
thirdly
You can select particular entities using the .label_ attribute. Here are all the temporal entities in Book XI.
[e.text for e in odyssey.ents if e.label_ == 'TIME']
['morning']
And here is a unique listing of all the people.
set(e.text for e in odyssey.ents if e.label_ == 'PERSON')
{'Achilles',
'Anticlea',
'Ariadne',
'Bacchus',
'Chloris',
'Circe',
'Cretheus',
'Dia',
'Diana',
'Echeneus',
'Epeus',
'Ephialtes',
'Erebus',
'Eriphyle',
'Gorgon',
'Hades',
'Hebe',
'Helen',
'Hercules',
'Ithaca',
'Jove',
'Leto',
'Maera',
'Megara',
'Memnon',
'Minerva',
'Neleus',
'Neptune',
'Pelias',
'Penelope',
'Pero',
'Pollux',
'Priam',
'Proserpine',
'Pylos',
'Queen',
'Salmoneus',
'Sisyphus',
'Telemachus',
'Theban',
'Theban Teiresias',
'Thebes',
'Theseus',
'Thetis',
'Troy',
'Tyro',
'Ulysses',
'jailor Hades'}
Don’t see an entity that you know to be in your document? You can add more to the spaCy model. Doing so is beyond the scope of our workshop session, but the library’s EntityRuler() documentation will show you how.
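For reference, here’s a minimal sketch of what that can look like using spaCy 3’s add_pipe workflow; the patterns below are purely illustrative examples rather than part of the workshop data, and the details may differ across spaCy versions.
ruler = nlp.add_pipe('entity_ruler', before='ner')
ruler.add_patterns([
    {'label': 'PERSON', 'pattern': 'Elpenor'},
    {'label': 'LOC', 'pattern': [{'LOWER': 'house'}, {'LOWER': 'of'}, {'LOWER': 'hades'}]}
])
Documents processed with nlp after this point would tag these patterns as entities alongside the model’s statistical predictions.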
2.3.2. Token Annotations#
In addition to storing all of this information about texts, spaCy creates a substantial number of annotations for each of the tokens in a document. The same logic as above applies to accessing this information.
Let’s return to the Stein poem. Indexing carafe will return individual tokens:
carafe[3]
glass
Like carafe, each one has several attributes:
token_attributes = [i for i in dir(carafe[3]) if i.startswith("_") is False]
print("Number of token attributes:", len(token_attributes))
Number of token attributes: 94
That’s a lot!
These attributes range from simple booleans, like whether a token is an alphabetic character:
carafe[3].is_alpha
True
…or whether it is a stop word:
carafe[3].is_stop
False
…to more complex pieces of information, like tracking back to the sentence this token is part of:
carafe[3].sent
A kind in glass and a cousin, a spectacle and nothing strange a single hurt color and an arrangement in a system to pointing.
…sentiment scores:
carafe[3].sentiment
0.0
…and even vector space representations (more about these on day three!):
carafe[3].vector
array([-0.60665 , -0.19688 , -0.1227 , -0.63598 , 0.89807 ,
-0.26537 , -0.015889 , 0.28961 , -0.10494 , -0.46685 ,
0.36891 , 0.15006 , -0.22227 , 0.38622 , 0.29493 ,
-0.9845 , -0.32567 , 0.23355 , 0.49492 , -0.57569 ,
-0.10022 , 0.2888 , 0.786 , -0.054396 , 0.14151 ,
-0.40768 , 0.061544 , 0.39313 , -1.092 , -0.14511 ,
0.42103 , -0.76163 , 0.29339 , 0.41215 , -0.41869 ,
0.39918 , 0.39688 , -0.46966 , 0.74839 , 0.50759 ,
0.48458 , 0.51865 , 0.20757 , -0.32406 , 0.72689 ,
-0.31903 , -0.23761 , -0.6614 , 0.28486 , 0.44251 ,
0.10102 , -0.26663 , 0.55344 , 0.26378 , -0.41728 ,
-0.30876 , 0.41034 , -0.54028 , -0.42292 , 0.22636 ,
0.31144 , 0.27504 , 0.18149 , 0.21325 , -0.55227 ,
-0.1011 , 0.25542 , 0.45925 , 0.60463 , 0.35682 ,
0.56958 , 0.036307 , 0.83937 , 0.48093 , -0.19525 ,
0.46602 , 0.022897 , 0.11539 , -0.24179 , 0.48036 ,
-1.0714 , -0.36915 , 0.16971 , -0.053726 , -0.97381 ,
0.42214 , 1.1142 , 1.4597 , -0.50529 , -0.45446 ,
0.41132 , -0.33779 , 0.32959 , 0.49328 , 0.56425 ,
0.012859 , 0.25048 , -0.87307 , -0.21436 , -0.20508 ,
0.42198 , -0.22507 , -0.53125 , 0.44767 , -0.18567 ,
-1.3365 , 0.089993 , -0.64042 , 1.0186 , 0.21136 ,
0.19435 , 0.17786 , -0.13835 , -0.37329 , 0.31726 ,
0.16039 , -0.19666 , 0.049621 , 0.094966 , -0.037447 ,
-0.29013 , 0.48434 , -0.12724 , 0.67386 , 0.32174 ,
-0.32181 , -0.16145 , -0.60097 , 0.60743 , 0.22665 ,
-0.28438 , -0.29495 , 0.04462 , -0.013322 , 0.11204 ,
0.63518 , -0.022868 , -0.034192 , -0.20114 , 0.11167 ,
-0.37905 , 0.36819 , -0.018246 , 0.25284 , -0.088927 ,
-0.077314 , -0.35745 , 0.14237 , 0.14386 , 0.24076 ,
-0.29973 , -0.03639 , 0.22725 , 0.10916 , -0.94986 ,
0.046044 , 0.81856 , -0.33634 , -0.15261 , 0.60717 ,
-0.2718 , 0.22678 , -0.18739 , 0.623 , 0.82995 ,
-0.20024 , 0.73056 , -0.0084901, -0.39045 , 0.18742 ,
-0.73338 , -0.077664 , 0.23246 , 0.87546 , 0.21647 ,
-0.072898 , 0.36288 , 0.2308 , -0.30565 , 0.11694 ,
-0.33983 , -0.15152 , 0.92947 , 0.58747 , -0.12973 ,
0.005775 , -0.092589 , 0.18885 , -0.40725 , 0.18249 ,
-0.90487 , -0.42429 , 0.099451 , -0.61899 , -0.060563 ,
0.16057 , -0.43026 , 0.33288 , 0.5276 , 0.16662 ,
-0.11755 , 0.17048 , 0.20785 , -0.18347 , 0.35685 ,
-0.07018 , -0.064295 , 0.46135 , -0.0083938, -0.24064 ,
-0.18537 , 0.067484 , -0.072876 , -0.80117 , 0.11181 ,
0.43282 , 0.58373 , -0.31169 , 0.032395 , 0.33231 ,
0.086075 , 0.14476 , 0.081413 , 0.31666 , -0.49782 ,
-0.15804 , 0.3573 , -0.56534 , -0.58402 , 0.023733 ,
-0.29734 , 0.13347 , -0.12673 , 0.44492 , -0.33062 ,
-0.50343 , 0.43491 , 0.028932 , 0.17392 , -0.26669 ,
0.020031 , 0.91117 , -0.069201 , 0.49468 , 0.043936 ,
0.083269 , 0.14409 , 0.15961 , -0.54771 , 0.38417 ,
0.51976 , 0.17693 , 0.36414 , -0.11335 , -0.078741 ,
-0.41143 , -0.15808 , 0.086959 , 0.20845 , 0.27017 ,
0.018805 , 1.2245 , 0.12708 , 0.25326 , 0.10445 ,
-0.058079 , -0.030512 , 0.63189 , -0.56298 , -0.25331 ,
-0.64939 , -0.48705 , 0.18973 , -0.39923 , 0.78043 ,
-0.13467 , 0.14517 , 0.48435 , 0.51201 , -0.80074 ,
-0.42834 , 0.17614 , 0.59832 , -0.37692 , 0.029607 ,
0.09632 , 0.40852 , 0.62755 , -0.038655 , 0.17166 ,
0.451 , 0.20851 , 0.267 , 0.24261 , -0.23774 ,
-0.48429 , 0.24286 , -0.20696 , -0.25682 , -0.1432 ],
dtype=float32)
Here’s a listing of some attributes you might want to know about when text mining.
sample_attributes = []
for token in carafe:
    sample_attributes.append({
        'INDEX': token.i,
        'TEXT': token.text,
        'LOWERCASE': token.lower_,
        'ALPHABETIC': token.is_alpha,
        'DIGIT': token.is_digit,
        'PUNCTUATION': token.is_punct,
        'STARTS SENTENCE': token.is_sent_start,
        'LIKE URL': token.like_url
    })
sample_attributes = pd.DataFrame(sample_attributes).set_index('INDEX')
sample_attributes.head(10)
INDEX | TEXT | LOWERCASE | ALPHABETIC | DIGIT | PUNCTUATION | STARTS SENTENCE | LIKE URL |
---|---|---|---|---|---|---|---|
0 | A | a | True | False | False | True | False |
1 | kind | kind | True | False | False | False | False |
2 | in | in | True | False | False | False | False |
3 | glass | glass | True | False | False | False | False |
4 | and | and | True | False | False | False | False |
5 | a | a | True | False | False | False | False |
6 | cousin | cousin | True | False | False | False | False |
7 | , | , | False | False | True | False | False |
8 | a | a | True | False | False | False | False |
9 | spectacle | spectacle | True | False | False | False | False |
We’ll discuss some of the more complex annotations later on, both in this session and others. For now, let’s collect some simple information about each of the tokens in our document. We’ll use list comprehension to do so. We’ll also use the .text attribute for each token, since we only want the text representation. Otherwise, we’d be creating a list of full Token objects, each of which carries all those attributes! (This is why we made sure to only use .text in our work with The Odyssey above.)
words = ' '.join([token.text for token in carafe if token.is_alpha])
punctuation = ' '.join([token.text for token in carafe if token.is_punct])
print(
f"Words\n-----\n{textwrap.shorten(words, width=100)}",
f"\n\nPunctuation\n-----------\n{punctuation}"
)
Words
-----
A kind in glass and a cousin a spectacle and nothing strange a single hurt color and an [...]
Punctuation
-----------
, . , . .
Want some linguistic information? We can get that too. For example, here are prefixes and suffixes:
prefix_suffix = []
for token in carafe:
    if token.is_alpha:
        prefix_suffix.append({
            'TOKEN': token.text,
            'PREFIX': token.prefix_,
            'SUFFIX': token.suffix_
        })
prefix_suffix = pd.DataFrame(prefix_suffix).set_index('TOKEN')
prefix_suffix.head(10)
TOKEN | PREFIX | SUFFIX |
---|---|---|
A | A | A |
kind | k | ind |
in | i | in |
glass | g | ass |
and | a | and |
a | a | a |
cousin | c | sin |
a | a | a |
spectacle | s | cle |
and | a | and |
And here are lemmas:
lemmas = []
for token in carafe:
    if token.is_alpha:
        lemmas.append({
            'TOKEN': token.text,
            'LEMMA': token.lemma_
        })
lemmas = pd.DataFrame(lemmas).set_index('TOKEN')
lemmas[24:]
TOKEN | LEMMA |
---|---|
All | all |
this | this |
and | and |
not | not |
ordinary | ordinary |
not | not |
unordered | unordered |
in | in |
not | not |
resembling | resemble |
The | the |
difference | difference |
is | be |
spreading | spread |
With such attributes at your disposal, you might imagine how you could work spaCy into a text mining pipeline. Instead of using separate functions to clean your corpus, those steps could all be accomplished by accessing attributes.
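As a rough sketch of what that might look like (the function name and filtering choices here are illustrative, not a prescribed recipe):
def clean_tokens(doc):
    # Keep lowercased lemmas of alphabetic, non-stop-word tokens
    return [
        token.lemma_.lower()
        for token in doc
        if token.is_alpha and not token.is_stop
    ]
print(clean_tokens(carafe)[:10])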
Before you do this, however, you should consider two things: 1) whether the increased computational/memory overhead is worthwhile for your project; and 2) whether spaCy’s base models will work for the kind of text you’re using. This second point is especially important. While spaCy’s base models are incredibly powerful, they are built for general-purpose applications and may struggle with domain-specific language. Medical text and early modern print are two examples where the base models can interpret your documents in unexpected ways, thereby complicating, maybe even ruining, parts of a text mining pipeline that relies on them. Sometimes, in other words, it’s just best to stick with a text mining pipeline that you know to be effective.
That all said, there are ways to train your own spaCy model on a specific domain. This can be an extensive process, one which exceeds the limits of our short workshop, but if you want to learn more about doing so, you can visit this page. There are also third-party models available, which you might find useful, though your mileage may vary.
2.4. Part-of-Speech Tagging#
One of the most common tasks in NLP involves assigning part-of-speech, or POS, tags to each token in a document. As we saw in the text mining series, these tags are a necessary step for certain text cleaning processes, like lemmatization; you might also use them to identify subsets of your data, which you could separate out and model. Beyond text cleaning, POS tags can be useful for tasks like word sense disambiguation, where you try to determine which particular facet of meaning a given token represents.
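For instance, here’s a quick, illustrative way to pull out one such subset (the verbs) from the Stein poem:
verbs = [token.text for token in carafe if token.pos_ == 'VERB']
print(verbs)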
Regardless of the task, the process of getting POS tags from spaCy will be the same. Each token in a document has an associated tag, which is accessible as an attribute.
pos = []
for token in carafe:
    pos.append({
        'TOKEN': token.text,
        'POS_TAG': token.pos_
    })
pos = pd.DataFrame(pos).set_index('TOKEN')
pos
TOKEN | POS_TAG |
---|---|
A | DET |
kind | NOUN |
in | ADP |
glass | NOUN |
and | CCONJ |
a | DET |
cousin | NOUN |
, | PUNCT |
a | DET |
spectacle | NOUN |
and | CCONJ |
nothing | PRON |
strange | ADJ |
a | DET |
single | ADJ |
hurt | ADJ |
color | NOUN |
and | CCONJ |
an | DET |
arrangement | NOUN |
in | ADP |
a | DET |
system | NOUN |
to | ADP |
pointing | VERB |
. | PUNCT |
All | DET |
this | PRON |
and | CCONJ |
not | PART |
ordinary | ADJ |
, | PUNCT |
not | PART |
unordered | ADJ |
in | ADP |
not | PART |
resembling | VERB |
. | PUNCT |
The | DET |
difference | NOUN |
is | AUX |
spreading | VERB |
. | PUNCT |
If you don’t know what a tag means, you can use spacy.explain().
spacy.explain('CCONJ')
'coordinating conjunction'
spaCy actually has two types of POS tags. The ones accessible with the .pos_ attribute are the basic tags, whereas those under .tag_ are more detailed (these come from the Penn Treebank project). We’ll print them out below, along with information about what they mean.
detailed_tags = []
for token in carafe:
    detailed_tags.append({
        'TOKEN': token.text,
        'POS_TAG': token.tag_,
        'EXPLANATION': spacy.explain(token.tag_)
    })
detailed_tags = pd.DataFrame(detailed_tags).set_index('TOKEN')
detailed_tags
TOKEN | POS_TAG | EXPLANATION |
---|---|---|
A | DT | determiner |
kind | NN | noun, singular or mass |
in | IN | conjunction, subordinating or preposition |
glass | NN | noun, singular or mass |
and | CC | conjunction, coordinating |
a | DT | determiner |
cousin | NN | noun, singular or mass |
, | , | punctuation mark, comma |
a | DT | determiner |
spectacle | NN | noun, singular or mass |
and | CC | conjunction, coordinating |
nothing | NN | noun, singular or mass |
strange | JJ | adjective (English), other noun-modifier (Chin... |
a | DT | determiner |
single | JJ | adjective (English), other noun-modifier (Chin... |
hurt | JJ | adjective (English), other noun-modifier (Chin... |
color | NN | noun, singular or mass |
and | CC | conjunction, coordinating |
an | DT | determiner |
arrangement | NN | noun, singular or mass |
in | IN | conjunction, subordinating or preposition |
a | DT | determiner |
system | NN | noun, singular or mass |
to | IN | conjunction, subordinating or preposition |
pointing | VBG | verb, gerund or present participle |
. | . | punctuation mark, sentence closer |
All | PDT | predeterminer |
this | DT | determiner |
and | CC | conjunction, coordinating |
not | RB | adverb |
ordinary | JJ | adjective (English), other noun-modifier (Chin... |
, | , | punctuation mark, comma |
not | RB | adverb |
unordered | JJ | adjective (English), other noun-modifier (Chin... |
in | IN | conjunction, subordinating or preposition |
not | RB | adverb |
resembling | VBG | verb, gerund or present participle |
. | . | punctuation mark, sentence closer |
The | DT | determiner |
difference | NN | noun, singular or mass |
is | VBZ | verb, 3rd person singular present |
spreading | VBG | verb, gerund or present participle |
. | . | punctuation mark, sentence closer |
2.4.1. Use case: word sense disambiguation#
This is all well and good in the abstract, but the power of POS tags lies in how they support other kinds of analysis. We’ll do a quick word sense disambiguation task here but will return to do something more complex in a little while.
Between the two strings:
“I am not going to bank on that happening.”
“I went down to the river bank.”
How can we tell which sense of the word “bank” is being used? Well, we can model each with spaCy and see whether the POS tags for these two tokens match. If they don’t match, this will indicate that the tokens represent two different senses of the word “bank.”
All this can be accomplished with a for loop and nlp.pipe(). The latter function enables you to process different documents with the spaCy model all at once. This can be great for working with a large corpus, though note that, because .pipe() is meant to work on text at scale, it will return a generator.
banks = ["I am not going to bank on that happening.", "I went down to the river bank."]
nlp.pipe(banks)
<generator object Language.pipe at 0x1253b09e0>
for doc in nlp.pipe(banks):
    for token in doc:
        if token.text == 'bank':
            print(
                f"{doc.text}\n+ {token.text}: "
                f"{token.tag_} ({spacy.explain(token.tag_)})\n"
            )
I am not going to bank on that happening.
+ bank: VB (verb, base form)
I went down to the river bank.
+ bank: NN (noun, singular or mass)
See how the tags differ between the two instances of “bank”? This indicates a difference in usage and, by proxy, a difference in meaning.
2.5. Dependency Parsing#
Another tool that can help with tasks like disambiguating word sense is dependency parsing. We’ve actually used it already: it allowed us to extract those noun chunks above. Dependency parsing involves analyzing the grammatical structure of text (usually sentences) to identify relationships between the words therein. The basic idea is that every word in a linguistic unit (e.g. a sentence) is linked to at least one other word via a tree structure, and these linkages are hierarchical in nature, with various modifications occurring across the levels of sentences, clauses, phrases, and even compound nouns. Dependency parsing can tell you information about:
The primary subject of a linguistic unit (and whether it is an active or passive subject)
Various heads, which determine the syntactic categories of a phrase; these are often nouns and verbs, and you can think of them as the local subjects of subunits
Various dependents, which modify, either directly or indirectly, their heads (think adjectives, adverbs, etc.)
The root of the unit, which is often (but not always!) the primary verb
Linguists have developed a number of different methods to parse dependencies, which we won’t discuss here. Take note, though, that the most popular one in NLP is the Universal Dependencies framework; spaCy, like most NLP models, uses this. The library also has some functionality for visualizing dependencies, which will help clarify what dependencies are in the first place. Below, we visualize a sentence from the Stein poem.
from spacy import displacy
to_render = list(carafe.sents)[2]
displacy.render(to_render, style='dep')
See how the arcs have arrows? Arrows point to the dependents within a linguistic unit, that is, they point to modifying relationships between words. Arrows arc out from a segment’s head, and the relationships they indicate are all specified with labels. As with the POS tags, you can use spacy.explain() on the dependency labels, which we’ll do below. The whole list of them is also available in this table of typologies. Finally, somewhere in the tree you’ll find a word with no arrows pointing to it (here, “spreading”). This is the root. One of its dependents is the subject of the sentence (here, “difference”).
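We can verify this programmatically. The small sketch below finds the root of the sentence rendered above and then looks among its children for the nominal subject; if your model version parses the sentence differently, the lookups may need adjusting.
root = [token for token in to_render if token.dep_ == 'ROOT'][0]
subject = [child for child in root.children if child.dep_ == 'nsubj'][0]
print(f"Root: {root.text}, Subject: {subject.text}")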
These relationships are quite useful to see in and of themselves, but the real power of dependency parsing comes from all the extra data it can provide about a token. Using this technique, you can link tokens back to their heads, or find local groupings of tokens that all refer to the same head.
Here’s how you could formalize that with a dataframe. Given this sentence:
sentence = odyssey[2246:2260]
sentence.text
"Then I tried to find some way of embracing my mother's ghost."
We can construct a for loop, which rolls through each token and retrieves its dependency info.
dependencies = []
for token in sentence:
    dependencies.append({
        'INDEX': token.i,
        'TOKEN': token.text,
        'DEPENDENCY_SHORTCODE': token.dep_,
        'DEPENDENCY': spacy.explain(token.dep_),
        'HEAD_INDEX': token.head.i,
        'HEAD': token.head
    })
dependencies = pd.DataFrame(dependencies).set_index('INDEX')
dependencies
/Users/tyler/Environments/nlp/lib/python3.9/site-packages/spacy/glossary.py:19: UserWarning: [W118] Term 'ROOT' not found in glossary. It may however be explained in documentation for the corpora used to train the language. Please check `nlp.meta["sources"]` for any relevant links.
warnings.warn(Warnings.W118.format(term=term))
INDEX | TOKEN | DEPENDENCY_SHORTCODE | DEPENDENCY | HEAD_INDEX | HEAD |
---|---|---|---|---|---|
2246 | Then | advmod | adverbial modifier | 2248 | tried |
2247 | I | nsubj | nominal subject | 2248 | tried |
2248 | tried | ROOT | None | 2248 | tried |
2249 | to | aux | auxiliary | 2250 | find |
2250 | find | xcomp | open clausal complement | 2248 | tried |
2251 | some | det | determiner | 2252 | way |
2252 | way | dobj | direct object | 2250 | find |
2253 | of | prep | prepositional modifier | 2252 | way |
2254 | embracing | pcomp | complement of preposition | 2253 | of |
2255 | my | poss | possession modifier | 2256 | mother |
2256 | mother | poss | possession modifier | 2258 | ghost |
2257 | 's | case | case marking | 2256 | mother |
2258 | ghost | dobj | direct object | 2254 | embracing |
2259 | . | punct | punctuation | 2248 | tried |
How many tokens are associated with each head?
dependencies.groupby('HEAD').size()
HEAD
tried 5
find 2
way 2
of 1
embracing 1
mother 2
ghost 1
dtype: int64
Which tokens are in each of these groups?
groups = []
for group in dependencies.groupby('HEAD'):
    head, tokens = group[0].text, group[1]['TOKEN'].tolist()
    groups.append({
        'HEAD': head,
        'GROUP': tokens
    })
groups = pd.DataFrame(groups).set_index('HEAD')
groups
HEAD | GROUP |
---|---|
tried | [Then, I, tried, find, .] |
find | [to, way] |
way | [some, of] |
of | [embracing] |
embracing | [ghost] |
mother | [my, 's] |
ghost | [mother] |
spaCy also has a special .subtree attribute for each token, which will also produce a similar set of local groupings. Note however that .subtree captures all tokens that hold a dependent relationship with the one in question, meaning that when you find the subtree of the root, you’re going to print out the entire sentence.
As you might expect by now, .subtree returns a generator, so convert it to a list or use list comprehension to extract the tokens. We’ll do this in a separate function. Within this function, we’re going to use the .text_with_ws attribute of each token in the subtree to return an exact, string-like representation of the tree (this will include any whitespace characters that are attached to a token).
def subtree_to_text(subtree):
    # Join each token's text (with any trailing whitespace) into a single string
    text = ''.join([token.text_with_ws for token in subtree])
    return text.strip()
sentence_trees = []
for token in sentence:
    subtree = subtree_to_text(token.subtree)
    sentence_trees.append({
        'TOKEN': token.text,
        'DEPENDENCY': token.dep_,
        'SUBTREE': subtree
    })
sentence_trees = pd.DataFrame(sentence_trees).set_index('TOKEN')
sentence_trees
TOKEN | DEPENDENCY | SUBTREE |
---|---|---|
Then | advmod | Then |
I | nsubj | I |
tried | ROOT | "Then I tried to find some way of embracing my... |
to | aux | to |
find | xcomp | to find some way of embracing my mother's ghost |
some | det | some |
way | dobj | some way of embracing my mother's ghost |
of | prep | of embracing my mother's ghost |
embracing | pcomp | embracing my mother's ghost |
my | poss | my |
mother | poss | my mother's |
's | case | 's |
ghost | dobj | my mother's ghost |
. | punct | . |
2.6. Putting Everything Together#
Now that we’ve walked through all these options (which are really only a small sliver of what you can do with spaCy!), let’s put them into action. Below, we’ll construct two short examples of how you might combine different aspects of token attributes to analyze a text. Both of them are essentially information retrieval tasks, and you might imagine doing something similar to extract and analyze particular words in your corpus, or to find different grammatical patterns that could be of significance (as we’ll discuss in the next session).
2.6.1. Finding lemmas#
In the first, we’ll use the .lemma_ attribute to search through Book XI of The Odyssey and match its tokens to a few key words. If you’ve read The Odyssey, you’ll know that Book XI is where Odysseus and his fellow sailors have to travel down to the underworld Hades, where they speak with the dead. We already saw one example of this: Odysseus attempts to embrace his dead mother after communing with her. The whole trip to Hades is an emotionally tumultuous experience for the travelers, and peppered throughout Book XI are expressions of grief.
With .lemma_, we can search for these expressions. We’ll roll through the text and determine whether a token lemma matches one of a selected set. When we find a match, we’ll get the subtree of this token’s head. That is, we’ll find the head upon which this token depends, and then we’ll use that to reconstruct the local context for the token.
sorrowful_lemmas = []
for token in odyssey:
    if token.lemma_ in ('cry', 'grief', 'grieve', 'sad', 'sorrow', 'tear', 'weep'):
        subtree = subtree_to_text(token.head.subtree)
        sorrowful_lemmas.append({
            'TOKEN': token.text,
            'SUBTREE': subtree
        })
sorrowful_lemmas = pd.DataFrame(sorrowful_lemmas).set_index('TOKEN')
sorrowful_lemmas
TOKEN | SUBTREE |
---|---|
cried | cried when I saw him: 'Elpenor |
sad | sad |
tears | tears |
sorrow | all my sorrow |
sad | sad |
tears | tears both night and day |
grieves | He grieves continually about your never having... |
sad | sad |
sorrows | our sorrows |
grief | grief |
grief | great grief for the spite the gods had borne him |
grief | grief |
sadder | still sadder |
weeping | weeping |
wept | I too wept and pitied him as I beheld him |
weeping | weeping and talking thus sadly with one anothe... |
tear | a tear |
cries | such appalling cries |
2.6.2. Verb-subject relations#
For this next example, we’ll use dependency tags to find the subjects of sentences in Book XI. As before, we’ll go through each token in the document, this time checking to see whether it has the nsubj or nsubjpass tag for its .dep_ attribute; these denote the subjects of the sentence’s root. We’ll also check to see whether a token is a noun (otherwise we’d get a lot of pronouns like “who,” “them,” etc.). If a token matches these two conditions, we’ll find its head verb as well as the token’s subtree. Note that this time, the subtree will refer directly to the token in question, not to the head. This will let us capture some descriptive information about each sentence subject.
nsubj = []
for token in odyssey:
    if token.dep_ in ('nsubj', 'nsubjpass') and token.pos_ in ('NOUN', 'PROPN'):
        nsubj.append({
            'SUBJECT': token.text,
            'HEAD': token.head.text,
            'HEAD_LEMMA': token.head.lemma_,
            'SUBTREE': subtree_to_text(token.subtree)
        })
nsubj_df = pd.DataFrame(nsubj).set_index('SUBJECT')
nsubj_df
SUBJECT | HEAD | HEAD_LEMMA | SUBTREE |
---|---|---|---|
Circe | sent | send | Circe, that great and cunning goddess, |
sails | were | be | her sails |
sun | went | go | the sun |
darkness | was | be | darkness |
rays | pierce | pierce | the rays of the sun |
... | ... | ... | ... |
Mercury | helped | help | Mercury and Minerva |
thousands | came | come | so many thousands of ghosts |
Proserpine | send | send | Proserpine |
ship | went | go | the ship |
wind | sprang | spring | a fair wind |
140 rows × 3 columns
Let’s look at a few subtrees. Note how sometimes they are simple noun chunks, while in other cases they expand to whole phrases.
for chunk in nsubj_df['SUBTREE'].sample(10):
print(f"+ {chunk}")
+ Proserpine
+ Diana
+ many others also of the Ceteians
+ King Alcinous
+ the steam
+ the blame
+ the Trojan prisoners and Minerva
+ the gods
+ her abominable crime
+ scared birds
Time to zoom out. How many times does each of our selected subjects appear?
nsubj_df.groupby('SUBJECT').size().sort_values(ascending=False).head(25)
SUBJECT
ghost 6
heaven 4
Proserpine 4
ghosts 4
wife 4
man 4
one 4
Ulysses 4
people 3
judgement 2
gods 2
life 2
mother 2
creature 2
son 2
Teiresias 2
ship 2
Jove 2
wind 2
Neleus 2
will 2
Circe 2
sweat 1
others 1
wave 1
dtype: int64
What heads are associated with each subject? (Note that we’re using the lemmatized form of the verbs.)
nsubj_df.groupby(['SUBJECT', 'HEAD_LEMMA']).size().sort_values(ascending=False).head(25)
SUBJECT HEAD_LEMMA
Ulysses answer 2
ghost come 2
son be 2
Proserpine send 2
people hear 1
bless 1
be 1
others fall 1
one tell 1
man kill 1
one invite 1
prisoners be 1
prophecyings speak 1
one get 1
be 1
mother come 1
answer 1
match marry 1
Aegisthus be 1
queen say 1
judgement rankle 1
heaven make 1
take 1
vouchsafe 1
heroes lie 1
dtype: int64
Such information provides another way of looking at something like topicality. Rather than using, say, a bag of words approach to build a topic model, you could instead segment your text into chunks like the above and start tallying up token distributions. Such distributions might help you identify the primary subject in a passage of text, whether that be a character or something like a concept. Or, you could leverage them to investigate how different subjects are talked about, say by throwing POS tags into the mix to further nuance relationships across entities.
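As a small, hypothetical sketch of that last idea, here’s one way you might combine dependency labels with POS tags to see which adjectives appear in the subtrees of a particular subject (here “ghost,” the most frequent subject above); the filtering choices are illustrative rather than definitive.
from collections import Counter
ghost_modifiers = Counter()
for token in odyssey:
    if token.dep_ in ('nsubj', 'nsubjpass') and token.lemma_ == 'ghost':
        for sub in token.subtree:
            if sub.pos_ == 'ADJ':
                ghost_modifiers[sub.lemma_] += 1
print(ghost_modifiers.most_common())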
Our next session will demonstrate what such investigations look like in action. For now, however, the main takeaway is that the above annotation structures provide you with a host of different ways to segment and facet your text data. You are by no means limited to single token counts when computationally analyzing text. Indeed, sometimes the most compelling ways to explore a corpus lie in the broader, fuzzier relationships that NLP annotations help us identify.