4. Word Embeddings#

Our sessions so far have worked off the idea of document annotation to produce new, sometimes highly useful metadata about texts. We've used this information for everything from information retrieval tasks (Chapter 2) to predictive classification (Chapter 3). Along the way, we've also touched in passing on how such annotations quantify or identify the semantics at play in those tasks (our work with POS tags, for example). But what we haven't yet done is produce a model of semantic meaning ourselves. This is another core task of NLP, and there are several different ways to approach building a statistical representation of tokens' meanings. The present chapter discusses one of the most popular methods of doing so: word embeddings. Below, we'll overview what word embeddings are, demonstrate how to build and use them, talk about important considerations regarding bias, and apply all this to a document clustering task.

The corpus we’ll use is Melanie Walsh’s collection of ~380 obituaries from the New York Times. If you participated in our Getting Started with Textual Data series, you’ll be familiar with this corpus: we used it in the context of tf-idf scores. Our return to it here is meant to chime with that discussion, for word embeddings enable us to perform a similar kind of text vectorization. Though, as we’ll discuss, the resultant vectors will be considerably more feature-rich than what we could achieve with tf-idf alone.

Learning objectives

By the end of this chapter, you will be able to:

  • Explain what word embeddings are

  • Use gensim to train and load word embeddings models

  • Identify and analyze word relationships in these models

  • Recognize how bias can inhere in embeddings

  • Encode documents with a word embeddings model

4.1. Word Embeddings: Introduction#

Prior to the advent of Transformer models, word embedding served as a state-of-the-art technique for representing semantic relationships between tokens. The technique was first introduced in 2013, and it spawned a host of different variants that completely flooded the field of NLP until about 2018. In part, word embedding’s popularity stems from the relatively simple intuition behind it, which is known as the distributional hypothesis: “a word is characterized by the company it keeps” (Firth). Words that appear in similar contexts, in other words, have similar meanings, and what word embeddings do is represent that context-specific information through a set of features. As a result, similar words share similar data representations, and we can leverage that similarity to explore the semantic space of a corpus, to encode documents with feature-rich data, and more.

If you're familiar with tf-idf vectors, the underlying data structure of word embeddings is the same: every word is represented by a vector of n features. But a key difference lies in the sparsity of the vectors – or, in the case of word embeddings, the lack of sparsity. As we saw in the last chapter, tf-idf vectors can suffer from the curse of dimensionality, something that's compounded by the fact that such vectors must contain features for every word in the corpus, regardless of whether a document has that word. This means tf-idf vectors are highly sparse: they contain many 0s. Word embeddings, on the other hand, do not. They're what we call dense representations. Each one is a fixed-length, non-sparse vector (usually of 50-300 dimensions) that is much more information-rich than tf-idf. As a result, embeddings tend to be capable of representing more nuanced relationships between corpus words – a performance improvement that is further boosted by the fact that many of the most popular models had the advantage of being trained on billions and billions of tokens.
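
To make this contrast concrete, here is a toy illustration (unrelated to our obituary corpus) of how many zeros even a tiny tf-idf matrix contains; an embedding, by comparison, is a fixed-length vector whose entries are almost never zero:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

toy_corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets"
]

# Each row is a document, each column a vocabulary word; most cells are 0
toy_tfidf = TfidfVectorizer().fit_transform(toy_corpus).toarray()

print(f"Matrix shape: {toy_tfidf.shape}")
print(f"Zeros in the first row: {np.sum(toy_tfidf[0] == 0)} of {toy_tfidf.shape[1]}")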

The other major difference between these vectors and tf-idf lies in how the former are created. While at root, word embeddings represent token co-occurrence data (just like a document-term matrix), they are the product of millions of guesses made by a neural network. Training this network involves making predictions about a target word, based on that word's context. We are not going to delve into the math behind these predictions (though this post does); however, it is worth noting that there are two different training setups for a word embedding model:

  1. Continuous Bag of Words (CBOW): given a window of words on either side of a target, the network tries to predict what word the target should be

  2. Skip-grams: the network starts with the word in the middle of a window and picks random words within this window to use as its prediction targets

As you may have noticed, these are just mirrored versions of one another. CBOW starts from context, while skip-gram tries to rebuild context. Regardless, in both cases the network attempts to maximize the likelihood of its predictions, updating its weights accordingly over the course of training. Words that repeatedly appear in similar contexts will help shape these weights, and in turn the model will associate such words with similar vector representations. If you'd like to see all this in action, Xin Rong has produced a fantastic, interactive visualization of how word embedding models learn.
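
For reference, gensim's Word2Vec implementation (which we return to below) exposes this choice through its sg parameter. A minimal sketch, using a throwaway toy corpus rather than our obituaries:

from gensim.models import Word2Vec

toy_sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]

# sg=0 trains a CBOW model; sg=1 trains a skip-gram model
cbow = Word2Vec(sentences=toy_sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences=toy_sentences, vector_size=50, window=2, min_count=1, sg=1)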

Of course, the other way to understand how word embeddings work is to use them yourself. We’ll move on to doing so now.

4.2. Loading the Data#

Before we begin working with word embeddings in full, let’s load a corpus manifest file, which will help us keep track of all the obituaries.

import pandas as pd

manifest = pd.read_csv('data/session_three/manifest.csv', index_col=0)
manifest = manifest.assign(YEAR = pd.to_datetime(manifest['YEAR'], format='%Y').dt.year)

print(
    f"Number of obituaries: {len(manifest)}",
    f"\nDate range: {manifest['YEAR'].min()}--{manifest['YEAR'].max()}"
)
Number of obituaries: 379 
Date range: 1852--2007
manifest.groupby('YEAR').count().plot(
    figsize=(15, 5),
    y='NAME',
    title='Obituaries per Year',
    ylabel='Num. obituaries',
    xlabel='Year',
    legend=False
);
[Figure: Obituaries per Year]

Here’s a sampling of the corpus:

for idx in manifest.sample(10).index:
    name, date = manifest.loc[idx, 'NAME'], manifest.loc[idx, 'YEAR']
    print(f"{name} ({date})")
Enrico Fermi (1954)
Melvil Dewey (1931)
Sarah Orne Jewett (1909)
Thomas Mann (1955)
Jeanette Rankin (1973)
Roger Maris (1985)
Max Ernst (1976)
Frank Sinatra (1998)
Getulio Vargas (1954)
I F Stone (1989)

Now we can load the obituaries themselves. While the past two sessions have required full-text representations of documents, word embeddings work best with bags of words, especially when it comes to doing analysis with them. Accordingly, each file in the corpus has already been processed by a text cleaning pipeline: the files are the lowercase, stopped, and lemmatized versions of the originals.
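
For reference, a cleaning pipeline along those lines might look something like the sketch below (this uses spaCy and is purely illustrative; it is not the exact pipeline used to prepare these files):

import spacy

nlp = spacy.load('en_core_web_sm')

def clean(text):
    """Lowercase, remove stop words and punctuation, and lemmatize."""
    return [tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha and not tok.is_stop]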

No extra loading considerations are needed here either. We’ll just use glob to get our file paths and iterate through the list, loading each document into a corpus list. Note that we still must split the file contents.

import glob

paths = glob.glob('data/session_three/obits/*.txt')
paths.sort()

corpus = []
for path in paths:
    with open(path, 'r') as fin:
        doc = fin.read()
        doc = doc.split()
        corpus.append(doc)

With this done, we can move on to the model.

4.3. Using an Embeddings Model#

At this point, we are at a crossroads. On the one hand, we could train a word embeddings model using our corpus documents as is. The gensim library offers functionality for this, and it’s a relatively easy operation. On the other, we could use premade embeddings, which are usually trained on a more general – and much larger – set of documents. There is a tradeoff here:

  • Training a corpus-specific model will more faithfully represent the token behavior of the texts we’d like to analyze, but these representations could be too specific, especially if the model doesn’t have enough data to train on; the resultant embeddings may be closer to topic models than to word-level semantics

  • Using premade embeddings gives us the benefit of generalization: the vectors will cleave more closely to how we understand language; but such embeddings might a) miss out on certain nuances we’d like to capture, or b) introduce biases into our corpus (more on this below)

In our case, the decision is difficult. When preparing this reader, we (Tyler and Carl) found that a model trained on the obituaries alone did not produce vectors that could fully demonstrate the capabilities of the word embedding technique. The corpus is just a little too specific, and perhaps a little too small. We could've used a larger corpus, but doing so would introduce slow-downs in the workshop session. Because of this, we decided to use a premade model, in this case the Stanford GloVe embeddings (the 200-dimension version). GloVe was trained on billions of tokens, spanning Wikipedia data, newswire articles, even Twitter. What's more, the model's developers offer several different dimension sizes, which are helpful for selecting embeddings with the right amount of detail.

That said, going with GloVe introduces its own problems. For one thing, we can’t show you how to train a word embeddings model itself – at least not live. The code to do so, however, is reproduced below:

from gensim.models import Word2Vec

# Initialize a 100-dimension model, build its vocabulary from the corpus,
# then train for five epochs
n_dimensions = 100
model = Word2Vec(vector_size=n_dimensions)
model.build_vocab(corpus)
model.train(corpus, total_words=model.corpus_total_words, epochs=5)
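
If you do train your own model this way, the learned vectors live on the model's .wv attribute, a KeyedVectors object just like the premade embeddings we load below; it can be queried and saved on its own. A brief sketch (the output path is hypothetical):

# Query the trained vectors (only words that met the min_count threshold are present)
model.wv.most_similar('war', topn=5)

# Persist just the vectors for later reuse (hypothetical path)
model.wv.save('data/session_three/obit_vectors.kv')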

Another problem has to do with the data GloVe was trained on. It’s so large that we can’t account for all the content, and this becomes particularly detrimental when it comes to bias. Researchers have found that general embeddings models reproduce gender-discriminatory language, even hate speech, by virtue of the fact that they are trained on huge amounts of text data, often without consideration of whether the content of such data is something one would endorse. GloVe is known to be biased in this way. We’ll show an example later on in this chapter and will discuss this in much more detail during our live session, but for now just note that the effects of bias do shape how we represent our corpus, and it’s important to keep an eye out for this when working with the data.

4.3.1. Loading a model#

With all that said, we can move on. Below, we load GloVe embeddings into our workspace using a gensim wrapper.

from gensim.models import KeyedVectors

model = KeyedVectors.load('data/session_three/glove/glove-wiki-gigaword_200d.bin')
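
If you don't have this file locally, gensim can fetch a comparable model through its downloader module. The identifier below is the standard gensim-data name for the 200-dimension GloVe vectors; this is an alternative to the local file, not what we do in the workshop, and it is a sizable download:

import gensim.downloader as api

# Downloads and caches the vectors on first use; returns a KeyedVectors object
model = api.load('glove-wiki-gigaword-200')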

The KeyedVectors object acts almost like a dictionary. You can do certain Python operations directly on it, like using len() to find the number of tokens in the model.

n_tokens = len(model)

print(f"Number of unique tokens in the model: {n_tokens:,}")
Number of unique tokens in the model: 400,000

4.3.2. Token mappings#

Each token in the model (what gensim calls a “key”) has an associated index. This mapping is accessible via the .key_to_index attribute:

model.key_to_index
{'the': 0,
 ',': 1,
 '.': 2,
 'of': 3,
 'to': 4,
 'and': 5,
 'in': 6,
 'a': 7,
 '"': 8,
 "'s": 9,
 'for': 10,
 '-': 11,
 'that': 12,
 'on': 13,
 'is': 14,
 'was': 15,
 ...}

If you want to get the vector representation for a token, you can use either the key or the index. The syntax is just like a Python dictionary. Below, we randomly select a single token from the model vocabulary’s .index_to_key attribute and find the index associated with it.

import random

rand_token = random.choice(model.index_to_key)
rand_idx = model.key_to_index[rand_token]

print(f"The index position for '{rand_token}' is {rand_idx}")
The index position for 'darkchild' is 191458

Here’s its vector:

model[rand_idx]
array([-0.69181  , -0.38802  , -0.76594  ,  0.19935  , -0.36225  ,
        0.26337  , -0.12532  ,  0.18984  ,  0.67979  , -0.54584  ,
       -0.80352  , -0.37795  , -0.55269  ,  0.54476  , -0.029629 ,
       -0.21068  ,  0.49436  ,  0.3293   , -0.017231 ,  0.79946  ,
       -0.40382  , -0.87286  , -0.16939  , -0.12445  ,  0.031971 ,
       -0.23482  , -0.039091 ,  0.045027 ,  0.58463  ,  0.37604  ,
       -0.095765 ,  0.45375  ,  0.2442   ,  0.011691 , -0.39593  ,
       -0.5763   ,  0.57578  ,  0.16547  , -0.41061  , -0.9752   ,
        0.16886  ,  0.15708  , -0.5736   , -0.18827  ,  0.15447  ,
       -0.68973  , -0.44074  , -0.22236  , -0.017376 , -0.20254  ,
        0.26098  ,  0.21436  , -0.018241 ,  0.047523 , -0.7001   ,
       -0.32253  , -0.49242  ,  0.033699 ,  0.83126  ,  0.078904 ,
       -0.30966  , -0.15     ,  0.012531 ,  0.20949  , -0.44654  ,
        0.18191  , -0.57749  ,  0.15384  , -0.29896  , -0.78124  ,
        0.31608  , -0.12994  , -0.09851  , -0.39394  , -0.37538  ,
       -0.45069  ,  0.48411  , -0.039195 ,  0.40725  , -0.3364   ,
        0.39917  ,  0.40619  , -0.12627  ,  0.30577  , -0.20408  ,
       -0.077245 , -0.2501   , -0.37723  , -0.31394  ,  0.57691  ,
        0.42883  ,  0.030597 ,  0.099596 ,  0.28072  ,  0.20477  ,
        0.43538  , -0.091653 , -0.36189  ,  0.11607  , -0.070713 ,
       -0.71446  ,  0.47509  , -1.1435   ,  0.053046 ,  0.11112  ,
        0.67894  ,  0.29344  , -0.32388  , -0.23991  ,  0.13188  ,
        0.19395  , -0.056736 ,  0.14044  , -0.073921 ,  0.4227   ,
       -0.036191 , -0.42104  , -0.53608  , -0.33449  , -0.48462  ,
       -0.75614  , -0.28959  , -0.53963  , -0.52327  ,  0.0037486,
        0.74531  ,  0.50435  , -0.4324   ,  0.39642  ,  0.17006  ,
        0.055405 , -0.50794  , -0.55403  ,  0.005382 , -0.40653  ,
       -0.42513  , -0.21335  ,  0.53993  ,  0.22362  , -0.035143 ,
       -0.78953  ,  0.050332 , -0.18659  , -0.49351  , -0.54373  ,
       -0.83972  , -0.82726  ,  0.44814  , -0.23607  , -0.4201   ,
       -0.35265  , -0.14795  , -0.1344   ,  0.53614  ,  0.11558  ,
       -0.0094971, -0.79706  , -0.075228 , -0.33526  ,  0.29514  ,
       -0.010351 ,  0.2632   ,  0.46318  ,  0.26167  , -0.17626  ,
        0.11656  ,  0.37446  ,  0.1088   , -0.10536  , -0.47429  ,
       -0.14761  , -0.14779  ,  0.15915  ,  0.28293  , -0.73619  ,
        0.41312  ,  0.10996  ,  0.77046  ,  0.24727  , -0.058859 ,
       -0.64571  ,  0.033264 ,  0.17759  , -0.0071632, -0.42984  ,
       -0.024373 ,  0.11693  , -0.35598  , -0.28983  , -0.041846 ,
       -0.025351 , -0.43246  , -0.74205  , -0.11076  , -0.20876  ,
       -0.47641  ,  0.29342  , -0.53923  ,  0.032584 ,  0.14822  ],
      dtype=float32)

And here we show that accessing this vector with either the index or key produces the same thing:

import numpy as np

np.array_equal(model[rand_idx], model[rand_token])
True

Finally, we can store the entire model vocabulary in a set and show a few examples of the tokens therein.

model_vocab = set(model.index_to_key)

for token in random.sample(list(model_vocab), 10):
    print(token)
leagas
magnification
perjuangan
terroristas
multipurpose
polcino
olgun
feer
113-member
newgen

You may find some unexpected tokens in this output. Though it has been ostensibly trained on an English corpus, GloVe contains multilingual text. It also contains lots of noisy tokens, which range from erroneous segmentations (“drummer/percussionist” is one token, for example) to password-like strings and even HTML markup. Depending on your task, you may not notice these tokens, but they do in fact influence the overall shape of the model, and sometimes you’ll find them cropping up when you’re hunting around for similar terms and the like (more on this soon).

4.3.3. Out-of-vocabulary tokens#

While GloVe’s vocabulary sometimes seems too expansive, there are other instances where it’s too restricted.

assert 'unshaped' in model, "Not in vocabulary!"
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [12], in <cell line: 1>()
----> 1 assert 'unshaped' in model, "Not in vocabulary!"

AssertionError: Not in vocabulary!

If the model wasn’t trained on a particular word, it won’t have a corresponding vector for that word either. This is crucial. Because models like GloVe only know what they’ve been trained on, you need to be aware of any potential discrepancies between their vocabularies and your corpus data. If you don’t keep this in mind, sending unseen, or out-of-vocabulary, tokens to GloVe will throw errors in your code:

model['unshaped']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 model['unshaped']

File ~/Environments/nlp/lib/python3.9/site-packages/gensim/models/keyedvectors.py:404, in KeyedVectors.__getitem__(self, key_or_keys)
    390 """Get vector representation of `key_or_keys`.
    391 
    392 Parameters
   (...)
    401 
    402 """
    403 if isinstance(key_or_keys, _KEY_TYPES):
--> 404     return self.get_vector(key_or_keys)
    406 return vstack([self.get_vector(key) for key in key_or_keys])

File ~/Environments/nlp/lib/python3.9/site-packages/gensim/models/keyedvectors.py:447, in KeyedVectors.get_vector(self, key, norm)
    423 def get_vector(self, key, norm=False):
    424     """Get the key's vector, as a 1D numpy array.
    425 
    426     Parameters
   (...)
    445 
    446     """
--> 447     index = self.get_index(key)
    448     if norm:
    449         self.fill_norms()

File ~/Environments/nlp/lib/python3.9/site-packages/gensim/models/keyedvectors.py:421, in KeyedVectors.get_index(self, key, default)
    419     return default
    420 else:
--> 421     raise KeyError(f"Key '{key}' not present")

KeyError: "Key 'unshaped' not present"

There are a few ways to handle this problem. The most common is to simply not encode tokens in your corpus that don't have a corresponding vector in GloVe. Below, we construct three sets for our corpus data. The first contains all tokens, while the second and third contain tokens that are and are not in GloVe, respectively. We identify whether the model has a token using its .has_index_for() method.

corpus_vocab = set(token for doc in corpus for token in doc)
in_glove = set(token for token in corpus_vocab if model.has_index_for(token))
no_glove = set(token for token in corpus_vocab if not model.has_index_for(token))

print(
    f"Total words in the corpus vocabulary: {len(corpus_vocab):,}",
    f"\nNumber of corpus words in GloVe: {len(in_glove):,}",
    f"\nNumber of corpus words not in GloVe: {len(no_glove):,}"
)
Total words in the corpus vocabulary: 29,330 
Number of corpus words in GloVe: 27,488 
Number of corpus words not in GloVe: 1,842

Any subsequent code we write will need to reference these sets to determine whether it should encode a token.

While this is what we’ll indeed do below, obviously it isn’t an ideal situation. But it’s one of the consequences of using premade models. There are, however, a few other ways to handle out-of-vocabulary terms. Some models offer special “UNK” tokens, which you could associate with all of your problem tokens. This, at the very least, enables you to have some representation of your data. A more complex approach involves taking the mean embedding of the word vectors surrounding an unknown token; and depending on the model, you can also train it further, adding extra tokens from your domain-specific text. Instructions for this last option are available here in the gensim documentation.
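
As an illustration of the second strategy, here is a minimal sketch that approximates an out-of-vocabulary token by averaging the GloVe vectors of its in-vocabulary neighbors (the function name, window size, and zero-vector fallback are all our own choices, not part of gensim):

def oov_vector(doc, i, window=5):
    """Approximate the token at position i by averaging its in-vocabulary neighbors."""
    context = doc[max(0, i - window):i] + doc[i + 1:i + 1 + window]
    context = [tok for tok in context if model.has_index_for(tok)]
    if not context:
        return np.zeros(model.vector_size, dtype=np.float32)

    return np.mean(model[context], axis=0)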

4.4. Word relationships#

Later on we'll use GloVe to encode our corpus texts. But before we do, it's worth demonstrating more generally some of the properties of word vectors. Vector representations of text allow us to perform various mathematical operations on our corpus that approximate (though only ever approximate) semantics. The most common among these operations is finding the cosine similarity between two vectors. Our Getting Started with Textual Data series has a whole chapter on this measure, so if you haven't encountered it before, we recommend you read that. But in short: cosine similarity measures the difference between vectors' orientation in a feature space (here, the feature space comprises each vector's 200 dimensions). The closer two vectors are, the more likely they are to share semantic similarities.

4.4.1. Cosine similarity#

gensim provides easy access to this measure and other such vector space operations, and we can use this functionality to explore relationships between words in a model. To find the cosine similarity between the vectors for two words in GloVe, simply use the model’s .similarity() method:

a, b = 'calculate', 'compute'
sim = model.similarity(a, b)

print(f"Consine similarity score for '{a}' and '{b}': {sim:0.4f}")
Consine similarity score for 'calculate' and 'compute': 0.6991

The score above is the same measure you might produce with, say, scikit-learn's cosine similarity implementation. One difference from our earlier work with tf-idf, though: because tf-idf vectors contain no negative values, their cosine scores fall within [0,1], whereas embedding vectors can produce scores anywhere in [-1,1]. Similar words still score closer to 1, but highly dissimilar words will sit closer to -1.
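
To see that this score really is just the normalized dot product of the two vectors, we can compute it by hand with numpy and compare (reusing a and b from above):

from numpy.linalg import norm

a_vec, b_vec = model[a], model[b]
manual_sim = np.dot(a_vec, b_vec) / (norm(a_vec) * norm(b_vec))

print(f"Manually computed cosine similarity: {manual_sim:0.4f}")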

At any rate, we can get the top n most similar words for a word using .most_similar(). The method defaults to 10 entries, but you can change that with the topn parameter.

targets = random.sample(list(in_glove), 5)

for token in targets:
    similarities = model.most_similar(token)
    print(f"Tokens most similar to '{token}':")
    df = pd.DataFrame(similarities, columns=['WORD', 'SCORE'])
    display(df)
Tokens most similar to 'lelia':
WORD SCORE
0 masaga 0.764858
1 audette 0.467348
2 ferres 0.456096
3 sivivatu 0.442534
4 hunsberger 0.436561
5 vunibaka 0.432584
6 howlett 0.432057
7 hallas 0.429879
8 macnaghten 0.429140
9 sandercock 0.425308
Tokens most similar to 'disengage':
WORD SCORE
0 disengaging 0.639132
1 disengaged 0.565804
2 redeploy 0.524834
3 detach 0.511609
4 disentangle 0.506060
5 unilaterally 0.501928
6 reorient 0.491407
7 reposition 0.481564
8 reoccupy 0.478571
9 forthwith 0.476014
Tokens most similar to 'cold':
WORD SCORE
0 warm 0.661124
1 cool 0.633241
2 chill 0.629438
3 dry 0.608534
4 chilly 0.607997
5 temperatures 0.603120
6 hot 0.593781
7 freezing 0.592056
8 frigid 0.586982
9 weather 0.584291
Tokens most similar to 'sweater':
WORD SCORE
0 sweaters 0.843625
1 turtleneck 0.746769
2 pullover 0.715948
3 cashmere 0.699273
4 trousers 0.698051
5 slacks 0.697593
6 cardigan 0.690271
7 blouse 0.688263
8 pants 0.683246
9 jeans 0.675004
Tokens most similar to 'nationals':
WORD SCORE
0 foreigners 0.670123
1 expatriates 0.612945
2 citizens 0.608192
3 pakistanis 0.527161
4 canadians 0.514506
5 filipinos 0.512406
6 detained 0.512111
7 visas 0.504068
8 britons 0.501268
9 diplomats 0.500606

We can also find the least similar word. This is useful to show, because it pressures our idea of what counts as similarity. Mathematical similarity does not always align with concepts like synonyms and antonyms. For example, it’s probably safe to say that the semantic opposite of “good” – that is, its antonym – is “evil.” But in the world of vector spaces, the least similar word to “good” is:

model.most_similar('good', topn=len(model))[-1]
('cw96', -0.6553234457969666)

Just noise! Relatively speaking, the vectors for “good” and “evil” are actually quite similar.

a, b = 'good', 'evil'
sim = model.similarity(a, b)

print(f"Consine similarity score for {a} and {b}: {sim:0.4f}")
Consine similarity score for good and evil: 0.3378

How do we make sense of this? Well, it has to do with the way the word embeddings are created. Since embeddings models are ultimately trained on co-occurrence data, words that tend to appear in similar kinds of contexts will be more similar in a mathematical sense than those that don’t.

Keeping this in mind is also important for considerations of bias. Since, in one sense, embeddings reflect the interchangeability between tokens, they will reinforce negative, even harmful patterns in the data (which is to say in culture at large). For example, consider the most similar words for “doctor” and “nurse.” The latter is locked up within gendered language: according to GloVe, a nurse is like a midwife is like a mother.

for token in ['doctor', 'nurse']:
    similarities = model.most_similar(token)
    print(f"Tokens most similar to '{token}':")
    df = pd.DataFrame(similarities, columns=['WORD', 'SCORE'])
    display(df)
Tokens most similar to 'doctor':
WORD SCORE
0 physician 0.736021
1 doctors 0.672406
2 surgeon 0.655147
3 dr. 0.652498
4 nurse 0.651449
5 medical 0.648189
6 hospital 0.636380
7 patient 0.619159
8 dentist 0.584747
9 psychiatrist 0.568571
Tokens most similar to 'nurse':
WORD SCORE
0 nurses 0.714051
1 doctor 0.651449
2 nursing 0.626937
3 midwife 0.614592
4 anesthetist 0.610603
5 physician 0.610359
6 hospital 0.609222
7 mother 0.586503
8 therapist 0.580488
9 dentist 0.573556
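
One rough way to probe such associations directly is to compare each term's similarity to gendered pronouns. This is a crude diagnostic rather than a formal bias measure, but it is a quick check worth running:

for token in ['doctor', 'nurse']:
    to_he = model.similarity(token, 'he')
    to_she = model.similarity(token, 'she')
    print(f"'{token}': similarity to 'he' = {to_he:0.4f}, to 'she' = {to_she:0.4f}")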

4.4.2. Visualizing the vector space#

One way to start getting a feel for all this is to visualize the word vectors. We do so below by sampling a portion of the GloVe vectors and then reducing them into two-dimensional data, which we can plot. First, let’s build two functions.

from sklearn.manifold import TSNE

def sample_embeddings(vectors, samp=1000):
    n_vectors = vectors.shape[0]
    mask = random.sample(range(n_vectors), samp)
    vectors = vectors[mask]
    vocab = [model.index_to_key[idx] for idx in mask]
    
    return vectors, vocab

def prepare_vis_data(vectors, labels):
    reduced = TSNE(
        n_components=2,
        learning_rate='auto',
        init='random',
        random_state=357
    ).fit_transform(vectors)
    
    vis_data = pd.DataFrame(reduced, columns=['X', 'Y'])
    vis_data['LABEL'] = labels
    
    return vis_data

Now we can retrieve all the vectors from GloVe using the .key_to_index attribute. With those stored in a numpy array, it’s time to sample them and create the visualization data.

all_vectors = np.array([model[idx] for idx in model.key_to_index])

sampled, sampled_vocab = sample_embeddings(all_vectors)
vis_data = prepare_vis_data(sampled, sampled_vocab)
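
As an aside, gensim also exposes the full embedding matrix directly via the model's .vectors attribute, whose rows follow the same order as .index_to_key. That attribute avoids the Python-level loop above, though the comprehension makes the key-to-vector mapping explicit:

# Equivalent to the comprehension above: one row per key, in index order
all_vectors = model.vectors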

With the reduced embeddings made, it’s time to plot them. Have a look around at the results. What seems right to you? What surprises you?

import altair as alt

alt.Chart(vis_data).mark_circle(size=30).encode(
    x='X',
    y='Y',
    tooltip='LABEL'
).properties(
    height=650,
    width=650
).interactive()

4.4.3. Other relationships#

Beyond cosine similarity, there are other word relationships to explore via vector space math. For example, one way of modeling something like a concept is to think about what other concepts comprise it. In other words: what plus what creates a new concept? Could we identify concepts by adding together vectors to create a new vector? Which words would this new vector be closest to in the vector space? Using the .similar_by_vector() method, we can find out.

concepts = {'beach': ('sand', 'ocean'), 'hotel': ('vacation', 'room'), 'airplane': ('air', 'car')}

for concept in concepts:
    pair = concepts[concept]
    generated_concept = model[pair[0]] + model[pair[1]]
    similarities = model.similar_by_vector(generated_concept)
    print(f"Most similar tokens to '{pair[0]}' + '{pair[1]}' (for '{concept}')")
    df = pd.DataFrame(similarities, columns=['WORD', 'SCORE'])
    display(df)
Most similar tokens to 'sand' + 'ocean' (for 'beach')
WORD SCORE
0 sand 0.845458
1 ocean 0.845268
2 sea 0.687682
3 beaches 0.667521
4 waters 0.664894
5 coastal 0.632485
6 water 0.618701
7 coast 0.604373
8 dunes 0.599333
9 surface 0.597545
Most similar tokens to 'vacation' + 'room' (for 'hotel')
WORD SCORE
0 vacation 0.823460
1 room 0.810719
2 rooms 0.704233
3 bedroom 0.658199
4 hotel 0.647865
5 dining 0.634925
6 stay 0.617807
7 apartment 0.616495
8 staying 0.615182
9 home 0.606009
Most similar tokens to 'air' + 'car' (for 'airplane')
WORD SCORE
0 air 0.827957
1 car 0.810086
2 vehicle 0.719382
3 cars 0.671697
4 truck 0.645963
5 vehicles 0.637166
6 passenger 0.625993
7 aircraft 0.624820
8 jet 0.618584
9 airplane 0.610345

Not bad! Our target concept isn't the most similar word in any of these examples, but it (or a close variant, like "beaches") appears in the top 10.

Most famously, word embeddings enable quasi-logical reasoning. Though, as we mentioned earlier, relationships between antonyms and synonyms do not necessarily map to a vector space, certain analogies do – at least under the right circumstances, and with particular training data. The logic here is that we identify a relationship between two words and subtract one of those words' vectors from the other. To that new vector we add a vector for a target word, which forms the analogy. Querying for the word closest to this modified vector should produce a similar relation between the result and the target word as that between the original pair.

Here, we ask: “strong is to stronger what clear is to X?”

analogies = model.most_similar(positive=['stronger', 'clear'], negative=['strong'])
display(pd.DataFrame(analogies, columns=['WORD', 'SCORE']))
print("Ideal target: 'clearer'")
WORD SCORE
0 easier 0.633451
1 should 0.630116
2 clearer 0.621850
3 better 0.602637
4 must 0.601793
5 need 0.595918
6 meant 0.594797
7 harder 0.591297
8 anything 0.589579
9 nothing 0.589187
Ideal target: 'clearer'

And here, we ask: “Paris is to France what Berlin is to X”?

analogies = model.most_similar(positive=['france', 'berlin'], negative=['paris'])
display(pd.DataFrame(analogies, columns=['WORD', 'SCORE']))
print("Ideal target: 'Germany'")
WORD SCORE
0 germany 0.835242
1 german 0.684480
2 austria 0.612803
3 poland 0.581331
4 germans 0.574868
5 munich 0.543591
6 belgium 0.532413
7 britain 0.529541
8 europe 0.524402
9 czech 0.515241
Ideal target: 'Germany'

Both of the above produce compelling results, though your mileage may vary. Consider the following: “arm is to hand what leg is to X?”

analogies = model.most_similar(positive=['hand', 'leg'], negative=['arm'])
display(pd.DataFrame(analogies, columns=['WORD', 'SCORE']))
print("Ideal target: 'foot'")
WORD SCORE
0 final 0.543408
1 table 0.540411
2 legs 0.527352
3 back 0.523477
4 saturday 0.522487
5 round 0.516250
6 draw 0.516066
7 second 0.510900
8 place 0.509784
9 side 0.508683
Ideal target: 'foot'

Importantly, these results are always going to be specific to the data on which a model was trained. Claims made on the basis of word embeddings that aspire to general linguistic truths would be treading on shaky ground here.

4.5. Document similarity#

While the above word relationships are relatively abstract (and any findings drawn from them should be couched accordingly), we can ground them with a concrete task. In this final section, we use GloVe embeddings to encode our corpus documents. This involves associating a word vector with each token in an obituary. Of course, GloVe has not been trained on the obituaries, so there may be important differences in token behavior between that model and the corpus; but we can assume that the general nature of GloVe will give us a decent sense of the overall feature space of the corpus. The result will be an enriched representation of each document, the nuances of which may better help us identify things like similarities between obituaries in our corpus.

The other consideration for using GloVe with our specific corpus concerns the out-of-vocabulary words we’ve already discussed. Before we can encode our documents, we need to filter out tokens for which GloVe has no representation. We can do so by referencing the in_glove set we produced above.

pruned = []
for doc in corpus:
    keep = []
    for token in doc:
        if token in in_glove:
            keep.append(token)
    pruned.append(keep)

4.5.1. Encoding#

Time to encode. This is an easy operation: all we need to do is pass a document's list of tokens directly to the model object, and gensim will encode each token accordingly. The result will be an (n, 200) array, where n is the number of tokens we passed to the model; each one will have 200 dimensions.

But if we kept this array as is, we’d run into trouble. Matrix operations often require identically shaped representations, so documents with different lengths would be incomparable. To get around this, we take the mean of all the vectors in a document. The result is a 200-dimension vector that stands as a general representation of a document.

doc_embeddings = [np.mean(model[doc], axis=0) for doc in pruned]
doc_embeddings = np.array(doc_embeddings)

Let’s quickly check our work.

print(
    f"Shape of an encoded document: {model[pruned[0]].shape}",
    f"\nShape of an encoded document after taking its mean embedding: {doc_embeddings[0].shape}"
)
Shape of an encoded document: (485, 200) 
Shape of an encoded document after taking its mean embedding: (200,)

From here, we can treat these embeddings almost as if they represented words. Let’s plot our obituaries accordingly. Take a look around at this and see what you can find. As a starting point, you might focus on that cluster of nodes right in the middle of the graph, toward the top. All the obituaries there are for sports players – they’re even broken out by sport (baseball players are on the right).

vis_data = prepare_vis_data(doc_embeddings, manifest['NAME'])

alt.Chart(vis_data).mark_circle(size=30).encode(
    x='X',
    y='Y',
    tooltip='LABEL'
).properties(
    height=650,
    width=650
).interactive()

4.5.2. Clustering#

The document embeddings seem to be partitioned into different clusters. We'll end by using a hierarchical clusterer to see if we can further specify these clusters. This involves loading the AgglomerativeClustering object from scikit-learn and fitting it to our document embeddings. Hierarchical clustering requires us to predefine the number of clusters we'd like to generate. In this case, we'll go with 18.

from sklearn.cluster import AgglomerativeClustering

n_clusters = 18
agg = AgglomerativeClustering(n_clusters=n_clusters).fit(doc_embeddings)
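
Our choice of 18 is a judgment call. If you'd like a more principled way to pick the number of clusters, one option (a sketch, not something we rely on in this chapter) is to compare silhouette scores across a range of candidate values:

from sklearn.metrics import silhouette_score

for k in range(10, 26, 2):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(doc_embeddings)
    score = silhouette_score(doc_embeddings, labels)
    print(f"k = {k:>2}, silhouette score: {score:0.4f}")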

Now we can assign the clusterer’s predicted labels to the dataframe that contains our visualization data and re-plot the results.

vis_data['CLUSTER'] = agg.labels_ + 1

alt.Chart(vis_data).mark_circle(size=30).encode(
    x='X',
    y='Y',
    tooltip=['LABEL', 'CLUSTER'],
    color='CLUSTER:N'
).properties(
    height=650,
    width=650
).interactive()

Once again, take a look around and see what you can find. These clusters seem to be both detailed and nicely partitioned, bracketing off, for example, classical musicians and composers (cluster 7) from jazz and popular musicians (cluster 11).

import textwrap

for k in [7, 11]:
    people = vis_data[vis_data['CLUSTER']==k]['LABEL']
    people = ', '.join(person for person in people)
    people = textwrap.wrap(people, 80)
    print(f"Cluster: {k:>2}\n-----------")
    for entry in people:
        print(entry)
    print("\n")
Cluster:  7
-----------
Maurice Ravel, Constantin Stanislavsky, Bela Bartok, Sergei Eisenstein, Igor
Stravinsky, Otto Klemperer, Maria Callas, Arthur Fiedler, Arthur Rubinstein,
Andres Segovie, Vladimir Horowitz, Leonard Bernstein, Martha Graham, John Cage,
Carlos Montoya, Galina Ulanova


Cluster: 11
-----------
Jerome Kern, W C Handy, Billie Holiday, Cole Porter, Coleman Hawkins, Judy
Garland, Louis Armstrong, Mahalia Jackson, Stan Kenton, Richard Rodgers,
Thelonious Monk, Earl Hines, Muddy Waters, Ethel Merman, Count Basie, Benny
Goodman, Miles Davis, Dizzy Gillespie, Gene Kelly, Frank Sinatra

Consider further cluster 6, which seems to be about famous scientists.

for person in vis_data[vis_data['CLUSTER']==6]['LABEL']:
    print(person)
Martian Theory
Marie Curie
Elmer Sperry
George E Hale
C E M Clung
Max Planck
A J Dempster
Enrico Fermi
Ross G Harrison
Beno Gutenberg
J Robert Oppenheimer
Jacques Monod
William B Shockley
Linus C Pauling
Carl Sagan

There are, however, some interestingly noisy clusters, like cluster 13. With people like Queen Victoria and William McKinley in this cluster, it at first appears to be about national leaders of various sorts, but the inclusion of others like Al Capone (the gangster) and Ernie Pyle (a journalist) complicates this. If you take a closer look, what really seems to tie these obituaries together is war. Nearly everyone here was involved in war in some fashion or another – save for Capone, whose inclusion makes for strange bedfellows.

for person in vis_data[vis_data['CLUSTER']==13]['LABEL']:
    print(person)
Robert E Lee
Bedford Forrest
Ulysses Grant
William McKinley
Queen Victoria
Geronimo
John P Holland
Alfred Thayer Mahan
Ernie Pyle
George Patton
Al Capone
John Pershing
Douglas MacArthur
Chester Nimitz
Florence Blanchfield
The Duke of Windsor

Depending on your task, these detailed distinctions may not be so desirable. But for us, the document embeddings provide a wonderfully nuanced view of the kinds of people in the obituaries. From here, further exploration might involve focusing on misfits and outliers. Why, for example, is Capone in cluster 13? Or why is Lou Gehrig all by himself in his own cluster? Of course, we could always recluster this data, which would redraw such groupings, but perhaps there is something indeed significant about the way things are divided up as they stand. Word embeddings help bring us to a point where we can begin to undertake such investigations – what comes next depends on which questions we want to ask.