7. Topic Modeling#

This final chapter follows from the previous chapter’s use of cosine similarity, which we used to cluster obituaries into broad categories based on what those obituaries were about. Similarly, in this chapter we’ll use topic modeling to identify the thematic content of a corpus and, on this basis, associate themes with individual documents.

As we’ll discuss below, human interpretation plays a key role in this process: topic models produce textual structures, but it’s on us to give those structures meaning. Doing so is an iterative process, in which we fine tune various aspects of a model to effectively represent our corpus. This chapter will show you how to build a model, how to appraise it, and how to start iterating through the process of fine tuning to produce a model that best serves your research questions.

To do so, we’ll use a corpus of book blurbs sampled from the University of Hamburg Language Technology Group’s Blurb Genre Collection. The collection contains ~92,000 blurbs from Penguin Random House, ranging from Colson Whitehead’s books to steamy supermarket romances and self-help manuals. We’ll use a sample of just 1,500 – not so many that we’d be stuck waiting for hours for models to train, but enough to get a broad sense of the different topics among the blurbs.

Learning Objectives

By the end of this chapter, you will be able to:

  • Explain what a topic model is, what it represents, and how to use one to explore a corpus

  • Build a topic model

  • Use two scoring metrics, perplexity and coherence, to appraise the quality of a model

  • Understand how to improve a model by fine tuning its number of topics and its hyperparameters

7.1. How It Works#

There are a few different flavors of topic models. We’ll be using the most popular one, a latent Dirichlet allocation, or LDA, model. It involves two assumptions: 1) documents are composed of a mixture of topics; 2) topics are composed of a mixture of words. An LDA model represents these mixtures in terms of probability distributions: a given passage, with a given set of words, is more or less likely to be about a particular topic, which is in turn more or less likely to be made up of a certain grouping of words.

We initialize a model by predefining how many topics we think it should find. When the model begins training, it randomly guesses which words are most associated with which topic. But over the course of its training, it will start to keep track of the probabilities of recurrent word collocations: “river,” “bank,” and “water,” for example, might keep showing up together. This suggests some coherence, a possible grouping of words. A second topic, on the other hand, might have words like “money,” “bank,” and “robber.” The challenge here is that words can belong to multiple topics: a single instance of “bank” could sit in either the first or the second topic. Given this, how does the model tell which topic a document containing the word “bank” is more strongly associated with?

It does two things. First, the model tracks how often “bank” appears with its various collocates in the corpus. If “bank” is generally more likely to appear with “river” and “water” than “money” and “robber”, this weights the probability that this particular instance of “bank” belongs to the first topic. To put a check on this weighting, the model also tracks how often collocates of “bank” appear in the document in question. If, in this document, “river” and “water” appear more often than “robber” and “money,” then that will weight this instance of “bank” even further toward the first topic, not the second.

Using these weightings as a basis, the model assigns a probability score for a document’s association with these two topics. This assignment will also inform the overall probability distribution of topics to words, which will then inform further document-topic associations, and so on. Over the course of this process, topics become more consistent and focused and their associations with documents become stronger and weaker, as appropriate.

Here’s the formula that summarizes this process. Given topic \(T\), word \(W\), and document \(D\), the probability of assigning \(W\) to \(T\) is proportional to:

\[ P(T \mid W, D) \propto \frac{\text{count of $W$ assigned to } T + \eta}{\text{total tokens assigned to } T + V\eta} \cdot (\text{count of words in $D$ assigned to } T + \alpha) \]

Here, \(V\) is the size of the vocabulary, while \(\eta\) and \(\alpha\) are small smoothing values; we’ll return to these two as tunable hyperparameters later in the chapter.
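
To make this weighting concrete, here’s a toy version of the calculation in Python. All of the counts below are invented purely for illustration; the point is just to see how the corpus-level and document-level terms interact:

# Toy sketch of the LDA sampling weight for a single word instance.
# All counts are made up; eta and alpha are small smoothing values.
eta, alpha, vocab_size = 0.01, 0.1, 5000

def topic_weight(word_count_in_topic, tokens_in_topic, doc_words_in_topic):
    """Unnormalized weight for assigning one instance of a word to a topic."""
    word_term = (word_count_in_topic + eta) / (tokens_in_topic + vocab_size * eta)
    doc_term = doc_words_in_topic + alpha
    return word_term * doc_term

# One instance of "bank" in a document that also uses "river" and "water" often
river_topic = topic_weight(word_count_in_topic = 40, tokens_in_topic = 1000, doc_words_in_topic = 12)
money_topic = topic_weight(word_count_in_topic = 25, tokens_in_topic = 900, doc_words_in_topic = 3)

print(river_topic > money_topic)   # True: this "bank" leans toward the river topic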

7.2. Preliminaries#

Before we begin building a model, we’ll load the libraries we need and our data. As we’ve done in past chapters, we use a file manifest to keep track of things.

from pathlib import Path
import numpy as np
import pandas as pd
import tomotopy as tp
from tomotopy.utils import Corpus
from tomotopy.coherence import Coherence
import matplotlib.pyplot as plt
import seaborn as sns
import pyLDAvis

Our input directory:

indir = Path("data/section_one/s3")

And our manifest:

manifest = pd.read_csv(indir.joinpath("manifest.csv"), index_col = 0)
manifest.loc[:, 'year'] = pd.to_datetime(manifest['pub_date']).dt.year
manifest.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1500 entries, 0 to 1499
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   author     1500 non-null   object
 1   title      1500 non-null   object
 2   genre      1500 non-null   object
 3   pub_date   1500 non-null   object
 4   isbn       1500 non-null   int64 
 5   file_name  1500 non-null   object
 6   year       1500 non-null   int32 
dtypes: int32(1), int64(1), object(5)
memory usage: 87.9+ KB

A small snapshot of its contents:

print(f"Number of blurbs: {len(manifest)}")
print(f"Pub dates: {manifest['year'].min()} -- {manifest['year'].max()}")
print(f"Genres: {', '.join(manifest['genre'].unique())}")
Number of blurbs: 1500
Pub dates: 1958 -- 2018
Genres: Fiction, Classics, Nonfiction, Children’s Books, Teen & Young Adult, Poetry, Humor

7.3. Building a Topic Model#

With our preliminary work done, we’re ready to build a topic model. There are numerous implementations of LDA modeling available, ranging from the command line utility MALLET to APIs offered by both gensim and scikit-learn. We will be using tomotopy, a Python package built around tomoto, a Gibbs-sampling topic modeling library written in C++. Its API is fairly intuitive and comes with a lot of options, which we’ll leverage to build the best possible model for our corpus.

7.3.1. Initializing a corpus#

Before we build the model, we need to load the data on which it will be trained. We use tomotopy’s Corpus class to do so. Be sure to split each file into a list of tokens before adding it to this object.

corpus = Corpus()
for fname in manifest['file_name']:
    path = indir.joinpath(f"input/{fname}")
    with path.open('r') as fin:
        doc = fin.read()
        corpus.add_doc(doc.split())

7.3.2. Initializing a model#

To initialize a model with tomotopy, all we need is the corpus from above and the number of topics the model will generate. Determining how many topics to use is a matter of some debate and complexity, which you’ll learn more about below. For now, just pick a small number. We’ll also set a seed for reproducibility.

seed = 357
model = tp.LDAModel(k = 5, corpus = corpus, seed = seed)

7.3.3. Training a model#

Our model is now ready to be trained. Under the hood, this happens in an iterative fashion, so we need to set the total number of iterations for the training. With that set, it’s simply a matter of calling .train().

iters = 1000
model.train(iter = iters)

7.3.4. Inspecting the results#

Let’s look at the trained model. For each topic, we can get the words that are most associated with it. The accompanying score is the probability of that word appearing in the topic.

def top_words(model, k):
    """Print the top words for topic k in a model."""
    top_words = model.get_topic_words(topic_id = k, top_n = 5)
    top_words = [f"{word} ({score:0.4f}%)" for (word, score) in top_words]
    print(f"Topic {k}: {', '.join(top_words)}")

for i in range(model.k):
    top_words(model, i)
Topic 0: history (0.0102%), world (0.0101%), story (0.0089%), american (0.0083%), new (0.0081%)
Topic 1: book (0.0130%), guide (0.0079%), use (0.0076%), life (0.0063%), include (0.0059%)
Topic 2: life (0.0194%), love (0.0107%), new (0.0090%), year (0.0090%), woman (0.0089%)
Topic 3: book (0.0154%), little (0.0081%), make (0.0078%), best (0.0068%), new (0.0068%)
Topic 4: new (0.0075%), world (0.0067%), city (0.0055%), mystery (0.0052%), murder (0.0048%)

These results make intuitive sense: we’re dealing with 1,500 book blurbs, so we’d expect to see general words like “book,” “story,” and “new.”

The .get_topic_dist() method performs a similar function, but for a document.

def doc_topic_dist(model, idx):
    """Print the topic distribution for a document."""
    topics = model.docs[idx].get_topic_dist()
    for topic_id, prob in enumerate(topics):
        print(f"+ Topic #{topic_id}: {prob:0.2f}%")

random_title = manifest.sample().index.item()
doc_topic_dist(model, random_title)
+ Topic #0: 0.15%
+ Topic #1: 0.28%
+ Topic #2: 0.10%
+ Topic #3: 0.25%
+ Topic #4: 0.22%

tomotopy also offers some shorthand to produce the top topics for a document. Below, we sample from our manifest, send the indexes to our model, and retrieve top topics.

sampled_titles = manifest.sample(5).index
for idx in sampled_titles:
    top_topics = model.docs[idx].get_topics(top_n = 1)
    topic, score = top_topics[0]
    print(f"{manifest.loc[idx, 'title']}: #{topic} ({score:0.2f}%)")
Undone: #4 (0.45%)
The Golden Children's Bible: #1 (0.29%)
Milarepa: #1 (0.55%)
The Encyclopedia of Canadian Organized Crime: #0 (0.33%)
Cowboy to Command: #2 (0.45%)

It’s possible to get even more granular. Every word in a document has its own associated topic assignment, which can change from document to document. This is about as close to context-sensitive semantics as we can get with this method.

doc = model.docs[random_title]
word_to_topic = list(zip(doc, doc.topics))
for i in range(10):
    word, topic = word_to_topic[i]
    print(f"+ {word} ({topic})")
+ develop (1)
+ new (0)
+ maintain (1)
+ town (3)
+ defense (4)
+ genius (3)
+ student (1)
+ claire (3)
+ danvers (2)
+ discover (3)

Let’s zoom out to the level of the corpus and retrieve the topic probability distribution for each document. In the literature, this is called theta (\(\theta\)). More informally, we’ll refer to it as the document-topic matrix.

def get_theta(model, labels):
    """Get the theta matrix from a model."""
    theta = np.stack([doc.get_topic_dist() for doc in model.docs])
    theta = pd.DataFrame(theta, index = labels)

    return theta

theta = get_theta(model, manifest['title'])
theta
0 1 2 3 4
title
After Atlas 0.113308 0.081607 0.352574 0.029556 0.422954
Ragged Dick and Struggling Upward 0.688176 0.047536 0.134876 0.107657 0.021755
The Shape of Snakes 0.171699 0.009422 0.325449 0.114915 0.378514
The Setting Sun 0.284472 0.033805 0.492601 0.109370 0.079752
Stink and the Shark Sleepover 0.011927 0.052131 0.276666 0.469876 0.189400
... ... ... ... ... ...
Greetings From Angelus 0.848809 0.035545 0.088243 0.023748 0.003655
Peppa Pig's Pop-up Princess Castle 0.169732 0.011254 0.035234 0.755861 0.027919
What Should the Left Propose? 0.399707 0.420762 0.134537 0.038993 0.006002
Peter and the Wolf 0.298455 0.123670 0.109557 0.367842 0.100477
Macedonia 0.343510 0.112836 0.288597 0.113768 0.141289

1500 rows × 5 columns

It’s often helpful to know how large each topic is. There’s a caveat here, however: every word in the model technically has some probability under every topic, so saying that a topic’s size is \(n\) words is something of a heuristic. tomotopy derives the output below by multiplying each column of the theta matrix by the document lengths in the corpus. It then sums the results for each topic.

topic_sizes = model.get_count_by_topics()
print("Number of words per topic:")
for i in range(model.k):
    print(f"+ Topic #{i}: {topic_sizes[i]} words")
Number of words per topic:
+ Topic #0: 22269 words
+ Topic #1: 25184 words
+ Topic #2: 33400 words
+ Topic #3: 25518 words
+ Topic #4: 20294 words
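
To see where these counts come from, here’s a sketch that approximates them by hand, following the derivation described above (theta columns weighted by document lengths, then summed):

# Approximate each topic's size by weighting every document's topic
# probabilities by its length, then summing over the corpus
doc_lengths = np.array([len(doc.words) for doc in model.docs])
approx_sizes = (theta.values * doc_lengths[:, None]).sum(axis = 0)
for i, size in enumerate(approx_sizes):
    print(f"+ Topic #{i}: ~{size:0.0f} words")

These values should land close to what get_count_by_topics() reports.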

Finally, using the num_words attribute we can express these counts as proportions of the corpus:

print("Topic proportion across the corpus:")
for i in range(model.k):
    print(f"+ Topic #{i}: {topic_sizes[i] / model.num_words:0.2f}%")
Topic proportion across the corpus:
+ Topic #0: 0.18%
+ Topic #1: 0.20%
+ Topic #2: 0.26%
+ Topic #3: 0.20%
+ Topic #4: 0.16%

7.4. Fine Tuning: The Basics#

Everything is working so far, but our topics are extremely general. What’s more, their total proportions across the corpus are relatively homogeneous. This may be an indicator that our model does not fit our corpus particularly well.

The word-by-word topic assignments for the document above show this well. Below, we print them out again, this time alongside the top words of each assigned topic.

for i in range(5):
    word, topic = word_to_topic[i]
    print(f"Word: {word}")
    top_words(model, topic)
Word: develop
Topic 1: book (0.0130%), guide (0.0079%), use (0.0076%), life (0.0063%), include (0.0059%)
Word: new
Topic 0: history (0.0102%), world (0.0101%), story (0.0089%), american (0.0083%), new (0.0081%)
Word: maintain
Topic 1: book (0.0130%), guide (0.0079%), use (0.0076%), life (0.0063%), include (0.0059%)
Word: town
Topic 3: book (0.0154%), little (0.0081%), make (0.0078%), best (0.0068%), new (0.0068%)
Word: defense
Topic 4: new (0.0075%), world (0.0067%), city (0.0055%), mystery (0.0052%), murder (0.0048%)

It would appear that the actual words in a document do not really match the top words for its associated topic. This suggests that we need to make adjustments to the way we initialize our model so that it better reflects the specifics of our corpus.

But there are several different parameters to adjust. So what should we change?

7.4.1. Number of topics#

An easy answer would be the number of topics. If, as above, your topics seem too general, it may be because you’ve set too small a number of topics for the model. Let’s set a higher number of topics for our model and see what changes.

model10 = tp.LDAModel(k = 10, corpus = corpus, seed = seed)
model10.train(iter = iters)

for k in range(model10.k):
    top_words(model10, k)
Topic 0: book (0.0278%), story (0.0143%), little (0.0117%), child (0.0112%), reader (0.0105%)
Topic 1: war (0.0182%), political (0.0127%), world (0.0108%), america (0.0103%), power (0.0099%)
Topic 2: new (0.0222%), life (0.0204%), story (0.0179%), world (0.0120%), great (0.0117%)
Topic 3: work (0.0147%), history (0.0107%), art (0.0106%), volume (0.0105%), classic (0.0099%)
Topic 4: secret (0.0119%), man (0.0087%), new (0.0086%), mystery (0.0083%), murder (0.0077%)
Topic 5: book (0.0150%), life (0.0100%), use (0.0084%), help (0.0066%), include (0.0062%)
Topic 6: world (0.0105%), battle (0.0076%), new (0.0067%), star (0.0059%), tale (0.0054%)
Topic 7: life (0.0248%), love (0.0226%), woman (0.0178%), family (0.0146%), year (0.0129%)
Topic 8: food (0.0157%), recipe (0.0137%), guide (0.0131%), new (0.0082%), top (0.0078%)
Topic 9: make (0.0140%), just (0.0126%), new (0.0110%), time (0.0110%), like (0.0101%)

That looks better! Adding more topics spreads out the word distributions. Given that, what if we increased the number of topics even more?

model30 = tp.LDAModel(k = 30, corpus = corpus, seed = seed)
model30.train(iter = iters)

for k in range(model30.k):
    top_words(model30, k)
Topic 0: recipe (0.0313%), food (0.0255%), family (0.0142%), italian (0.0128%), meal (0.0106%)
Topic 1: new (0.0276%), make (0.0178%), just (0.0175%), time (0.0141%), like (0.0140%)
Topic 2: life (0.0384%), love (0.0193%), year (0.0189%), family (0.0185%), old (0.0126%)
Topic 3: book (0.0363%), new (0.0302%), times (0.0277%), york (0.0251%), story (0.0214%)
Topic 4: history (0.0253%), american (0.0189%), year (0.0171%), story (0.0169%), great (0.0166%)
Topic 5: school (0.0259%), kid (0.0178%), little (0.0176%), animal (0.0170%), child (0.0167%)
Topic 6: music (0.0360%), japanese (0.0207%), tale (0.0153%), junie (0.0094%), rock (0.0094%)
Topic 7: art (0.0295%), artist (0.0181%), work (0.0172%), president (0.0123%), portrait (0.0115%)
Topic 8: book (0.0252%), guide (0.0192%), use (0.0177%), learn (0.0167%), step (0.0155%)
Topic 9: mystery (0.0246%), murder (0.0244%), killer (0.0129%), police (0.0117%), dead (0.0114%)
Topic 10: city (0.0265%), rule (0.0091%), old (0.0087%), light (0.0084%), paul (0.0081%)
Topic 11: book (0.0356%), story (0.0237%), adventure (0.0185%), reader (0.0177%), classic (0.0154%)
Topic 12: life (0.0258%), practice (0.0205%), spiritual (0.0165%), art (0.0126%), peace (0.0115%)
Topic 13: van (0.0134%), whale (0.0089%), bee (0.0082%), plant (0.0074%), riley (0.0074%)
Topic 14: god (0.0551%), jesus (0.0162%), religious (0.0162%), bible (0.0143%), spiritual (0.0124%)
Topic 15: world (0.0231%), poem (0.0209%), collection (0.0156%), tree (0.0151%), feel (0.0133%)
Topic 16: penguin (0.0234%), work (0.0213%), classic (0.0184%), literature (0.0180%), text (0.0176%)
Topic 17: black (0.0218%), jane (0.0176%), vampire (0.0144%), aunt (0.0097%), queen (0.0097%)
Topic 18: game (0.0237%), team (0.0180%), baseball (0.0165%), sport (0.0149%), hockey (0.0118%)
Topic 19: health (0.0184%), body (0.0154%), healthy (0.0143%), weight (0.0131%), food (0.0124%)
Topic 20: king (0.0268%), london (0.0214%), elizabeth (0.0111%), england (0.0107%), life (0.0103%)
Topic 21: sea (0.0195%), ship (0.0164%), planet (0.0133%), island (0.0130%), human (0.0120%)
Topic 22: life (0.0213%), people (0.0157%), book (0.0157%), change (0.0121%), way (0.0121%)
Topic 23: human (0.0122%), social (0.0116%), book (0.0106%), political (0.0101%), include (0.0095%)
Topic 24: dark (0.0143%), power (0.0137%), secret (0.0128%), face (0.0122%), world (0.0113%)
Topic 25: travel (0.0217%), new (0.0177%), guide (0.0177%), eyewitness (0.0168%), top (0.0163%)
Topic 26: christmas (0.0325%), cole (0.0157%), holiday (0.0119%), longarm (0.0119%), town (0.0103%)
Topic 27: star (0.0278%), lego (0.0182%), wars (0.0177%), action (0.0118%), group (0.0118%)
Topic 28: woman (0.0356%), love (0.0241%), man (0.0212%), heart (0.0149%), romance (0.0141%)
Topic 29: war (0.0377%), world (0.0187%), military (0.0118%), soldier (0.0107%), battle (0.0105%)

This also looks pretty solid. The two models appear to share topics, but the second model, which has a higher number of topics, includes a wider range of words across its top word distributions. That all seems well and good, but we don’t yet have a way to determine whether increasing the number of topics will always produce more interpretable results. At some point, we might start splitting hairs. In fact, we can already see this beginning to happen in a few instances in the second model.

So the question is, what is an ideal number of topics?

One way to approach this question would be to run through a range of different topic sizes and inspect the results for each. In some cases, it can be perfectly valid to pick the number of topics that’s most interpretable for you and the questions you have about your corpus. But there are also a few metrics available that will assess the quality of a given model in terms of the underlying data it represents. Sometimes these metrics lead to models that aren’t quite as interpretable, but they also help us make a more empirically grounded assessment of the resultant topics.

7.4.2. Perplexity#

The first of these measures is perplexity. In text analysis, we use perplexity scoring to evaluate how well a model predicts a sample sequence of words. Essentially, it measures how “surprised” a model is by that sequence. The lower the perplexity, the better the model predicts the data it was trained on.

When you train a tomotopy model object, the model records a perplexity score for the training run.

for m in (model10, model30):
    print(f"Perplexity for the {m.k}-topic model: {m.perplexity:0.4f}")
Perplexity for the 10-topic model: 10777.7660
Perplexity for the 30-topic model: 10449.8925
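
For reference, perplexity is the exponential of the negative per-word log-likelihood of the model. A quick sketch, assuming tomotopy’s ll_per_word attribute, shows where these numbers come from:

# Perplexity = exp(-average log-likelihood per word); this should
# approximately reproduce the values reported by .perplexity
for m in (model10, model30):
    print(f"{m.k}-topic model: {np.exp(-m.ll_per_word):0.4f}")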

In this instance, the model with more topics has a better perplexity score. This would suggest that the second model is better fitted to our data and is thus a “better” model.

But can we do better? What if there’s a model with a better score that sits somewhere between these two topic numbers (or beyond them)? We can test to see whether this is the case by constructing a for loop, in which we iterate through a range of different topic numbers, train a model on each, and record the resultant scores.

k_range = range(10, 31)
p_scores = []
for k in k_range:
    _model = tp.LDAModel(k = k, corpus = corpus, seed = seed)
    _model.train(iter = iters)
    p_scores.append({'n_topics': k, 'perplexity': _model.perplexity})

Convert the results to a DataFrame:

p_scores = pd.DataFrame(p_scores)
p_scores.sort_values('perplexity', inplace = True)
p_scores
n_topics perplexity
19 29 10259.426721
8 18 10356.584460
13 23 10443.413781
20 30 10449.892548
5 15 10458.080301
16 26 10469.298786
14 24 10476.101149
11 21 10557.088274
12 22 10599.010769
17 27 10614.836784
1 11 10616.506218
10 20 10632.128637
15 25 10660.327686
18 28 10728.454395
3 13 10730.003636
0 10 10777.766018
7 17 10785.896924
4 14 10861.638410
6 16 10912.364732
2 12 10944.483263
9 19 10997.969707

We’ll train a new model using the topic number that produced the best score.

best_k = p_scores.nsmallest(1, 'perplexity')['n_topics'].item()
best_p = tp.LDAModel(k = best_k, corpus = corpus, seed = seed)
best_p.train(iter = iters)

Here are the top words:

for k in range(best_p.k):
    top_words(best_p, k)
Topic 0: emma (0.0151%), sin (0.0120%), roy (0.0113%), sheriff (0.0090%), alpine (0.0075%)
Topic 1: new (0.0199%), make (0.0176%), just (0.0152%), like (0.0142%), time (0.0131%)
Topic 2: horse (0.0234%), tale (0.0178%), wild (0.0132%), animal (0.0127%), weird (0.0122%)
Topic 3: story (0.0255%), volume (0.0169%), collection (0.0162%), include (0.0154%), book (0.0133%)
Topic 4: music (0.0262%), film (0.0157%), star (0.0138%), movie (0.0131%), thomas (0.0105%)
Topic 5: guide (0.0280%), travel (0.0156%), new (0.0124%), history (0.0124%), cover (0.0115%)
Topic 6: health (0.0189%), body (0.0160%), weight (0.0127%), healthy (0.0113%), program (0.0098%)
Topic 7: dead (0.0153%), vampire (0.0112%), john (0.0108%), death (0.0108%), cole (0.0099%)
Topic 8: recipe (0.0315%), food (0.0275%), italian (0.0147%), family (0.0117%), meal (0.0106%)
Topic 9: business (0.0201%), financial (0.0137%), money (0.0137%), new (0.0127%), success (0.0121%)
Topic 10: murder (0.0293%), mystery (0.0266%), police (0.0126%), solve (0.0117%), crime (0.0108%)
Topic 11: magic (0.0286%), adventure (0.0215%), city (0.0208%), london (0.0174%), princess (0.0134%)
Topic 12: history (0.0240%), american (0.0213%), america (0.0167%), great (0.0128%), century (0.0122%)
Topic 13: sea (0.0253%), ship (0.0157%), island (0.0126%), far (0.0107%), water (0.0100%)
Topic 14: god (0.0464%), spiritual (0.0260%), life (0.0152%), religious (0.0148%), jesus (0.0136%)
Topic 15: work (0.0308%), penguin (0.0236%), literature (0.0198%), year (0.0165%), text (0.0165%)
Topic 16: school (0.0422%), friend (0.0248%), girl (0.0248%), kid (0.0163%), day (0.0111%)
Topic 17: conan (0.0130%), ford (0.0093%), lacey (0.0087%), wit (0.0081%), sadie (0.0081%)
Topic 18: war (0.0315%), military (0.0143%), president (0.0129%), men (0.0119%), soldier (0.0109%)
Topic 19: book (0.0403%), child (0.0216%), little (0.0170%), story (0.0147%), perfect (0.0143%)
Topic 20: new (0.0491%), york (0.0397%), times (0.0376%), book (0.0329%), author (0.0313%)
Topic 21: king (0.0311%), empire (0.0143%), lady (0.0108%), sword (0.0094%), long (0.0084%)
Topic 22: human (0.0234%), science (0.0234%), planet (0.0187%), universe (0.0158%), scientist (0.0153%)
Topic 23: art (0.0320%), artist (0.0267%), step (0.0168%), create (0.0149%), paint (0.0118%)
Topic 24: life (0.0323%), world (0.0248%), story (0.0164%), year (0.0154%), new (0.0107%)
Topic 25: woman (0.0389%), love (0.0306%), life (0.0250%), father (0.0187%), family (0.0178%)
Topic 26: book (0.0276%), life (0.0139%), use (0.0115%), include (0.0109%), offer (0.0100%)
Topic 27: world (0.0179%), battle (0.0145%), star (0.0138%), power (0.0133%), know (0.0099%)
Topic 28: social (0.0212%), power (0.0115%), political (0.0110%), culture (0.0092%), theory (0.0092%)

7.4.3. Coherence#

If you find that your perplexity scores don’t translate into interpretable models, you might use a coherence score instead. Coherence scores measure the degree of semantic similarity among the top words in a topic. Some people prefer them over perplexity because they help distinguish topics that rest on consistent word co-occurrence patterns from topics that are merely artifacts of statistical inference.

There are a few ways to calculate coherence. We’ll use c_v coherence, which combines two measures we’ve already seen: pointwise mutual information (PMI) and cosine similarity. Roughly speaking, this method takes the top words of a topic, computes (normalized) PMI scores from their co-occurrence counts in the corpus, and represents each top word as a vector of those scores. It then compares these vectors to one another with cosine similarity and averages the results, first within each topic and then across the whole model.
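
The full c_v procedure has more moving parts (sliding windows, segmentation schemes, and so on), but the underlying intuition is easy to sketch: score a topic by how strongly its top words actually co-occur in the corpus. Below is a simplified, document-level NPMI average – not the measure tomotopy computes, just an illustration. The tokenized_docs list and simple_topic_coherence function are our own constructions here:

from itertools import combinations

def simple_topic_coherence(top_words, docs):
    """Average normalized PMI over all pairs of a topic's top words,
    using document-level co-occurrence counts."""
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)
        elif p12 == 1:
            scores.append(1.0)
        else:
            scores.append(np.log(p12 / (p1 * p2)) / -np.log(p12))

    return np.mean(scores)

# Rebuild the tokenized blurbs and score the top words of one topic
tokenized_docs = []
for fname in manifest['file_name']:
    with indir.joinpath(f"input/{fname}").open('r') as fin:
        tokenized_docs.append(fin.read().split())

top = [word for word, _ in best_p.get_topic_words(0, top_n = 5)]
print(f"Simplified coherence for topic #0: {simple_topic_coherence(top, tokenized_docs):0.4f}")

tomotopy’s Coherence class handles the real calculation for us.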

Let’s look at the score for the best model above:

best_p_coherence = Coherence(best_p, coherence = 'c_v')
print(f"Coherence score: {best_p_coherence.get_score():0.4f}")
Coherence score: 0.6988

As with perplexity, we can look for the best score among a set of topic numbers. Here, we’re looking for the highest score, which will be a number between 0 and 1.

c_scores = []
for k in k_range:
    _model = tp.LDAModel(k = k, corpus = corpus, seed = seed)
    _model.train(iter = iters)
    coherence = Coherence(_model, coherence = 'c_v')
    c_scores.append({'n_topics': k, 'coherence': coherence.get_score()})

Let’s format the scores.

c_scores = pd.DataFrame(c_scores)
c_scores.sort_values('coherence', ascending = False, inplace = True)
c_scores.head(10)
n_topics coherence
20 30 0.700868
19 29 0.698793
14 24 0.690023
16 26 0.689783
17 27 0.682885
15 25 0.670136
11 21 0.648808
13 23 0.645939
18 28 0.641926
10 20 0.638035

Now we select the best one and train a model on that.

best_k = c_scores.nlargest(1, 'coherence')['n_topics'].item()
best_c = tp.LDAModel(k = best_k, corpus = corpus, seed = seed)
best_c.train(iter = iters)

Here are the top words for each topic:

for k in range(best_c.k):
    top_words(best_c, k)
Topic 0: recipe (0.0313%), food (0.0255%), family (0.0142%), italian (0.0128%), meal (0.0106%)
Topic 1: new (0.0276%), make (0.0178%), just (0.0175%), time (0.0141%), like (0.0140%)
Topic 2: life (0.0384%), love (0.0193%), year (0.0189%), family (0.0185%), old (0.0126%)
Topic 3: book (0.0363%), new (0.0302%), times (0.0277%), york (0.0251%), story (0.0214%)
Topic 4: history (0.0253%), american (0.0189%), year (0.0171%), story (0.0169%), great (0.0166%)
Topic 5: school (0.0259%), kid (0.0178%), little (0.0176%), animal (0.0170%), child (0.0167%)
Topic 6: music (0.0360%), japanese (0.0207%), tale (0.0153%), junie (0.0094%), rock (0.0094%)
Topic 7: art (0.0295%), artist (0.0181%), work (0.0172%), president (0.0123%), portrait (0.0115%)
Topic 8: book (0.0252%), guide (0.0192%), use (0.0177%), learn (0.0167%), step (0.0155%)
Topic 9: mystery (0.0246%), murder (0.0244%), killer (0.0129%), police (0.0117%), dead (0.0114%)
Topic 10: city (0.0265%), rule (0.0091%), old (0.0087%), light (0.0084%), paul (0.0081%)
Topic 11: book (0.0356%), story (0.0237%), adventure (0.0185%), reader (0.0177%), classic (0.0154%)
Topic 12: life (0.0258%), practice (0.0205%), spiritual (0.0165%), art (0.0126%), peace (0.0115%)
Topic 13: van (0.0134%), whale (0.0089%), bee (0.0082%), plant (0.0074%), riley (0.0074%)
Topic 14: god (0.0551%), jesus (0.0162%), religious (0.0162%), bible (0.0143%), spiritual (0.0124%)
Topic 15: world (0.0231%), poem (0.0209%), collection (0.0156%), tree (0.0151%), feel (0.0133%)
Topic 16: penguin (0.0234%), work (0.0213%), classic (0.0184%), literature (0.0180%), text (0.0176%)
Topic 17: black (0.0218%), jane (0.0176%), vampire (0.0144%), aunt (0.0097%), queen (0.0097%)
Topic 18: game (0.0237%), team (0.0180%), baseball (0.0165%), sport (0.0149%), hockey (0.0118%)
Topic 19: health (0.0184%), body (0.0154%), healthy (0.0143%), weight (0.0131%), food (0.0124%)
Topic 20: king (0.0268%), london (0.0214%), elizabeth (0.0111%), england (0.0107%), life (0.0103%)
Topic 21: sea (0.0195%), ship (0.0164%), planet (0.0133%), island (0.0130%), human (0.0120%)
Topic 22: life (0.0213%), people (0.0157%), book (0.0157%), change (0.0121%), way (0.0121%)
Topic 23: human (0.0122%), social (0.0116%), book (0.0106%), political (0.0101%), include (0.0095%)
Topic 24: dark (0.0143%), power (0.0137%), secret (0.0128%), face (0.0122%), world (0.0113%)
Topic 25: travel (0.0217%), new (0.0177%), guide (0.0177%), eyewitness (0.0168%), top (0.0163%)
Topic 26: christmas (0.0325%), cole (0.0157%), holiday (0.0119%), longarm (0.0119%), town (0.0103%)
Topic 27: star (0.0278%), lego (0.0182%), wars (0.0177%), action (0.0118%), group (0.0118%)
Topic 28: woman (0.0356%), love (0.0241%), man (0.0212%), heart (0.0149%), romance (0.0141%)
Topic 29: war (0.0377%), world (0.0187%), military (0.0118%), soldier (0.0107%), battle (0.0105%)

And here’s a distribution plot of topic proportions. We’ll define a function to create this as we’ll be making a few of these plots later on.

def format_top_words(tm, k, top_n = 5):
    """Get a formatted string of the top words for a topic."""
    words = tm.get_topic_words(k, top_n = top_n)
    
    return f"Topic #{k}: {', '.join(word for (word, _) in words)}"

def plot_topic_proportions(tm, name = '', top_n = 5):
    """Plot the topic proportions for a model."""
    dists = tm.get_count_by_topics() / tm.num_words
    words = [format_top_words(tm, k, top_n) for k in range(tm.k)]
    data = pd.DataFrame(zip(words, dists), columns = ('word', 'dist'))
    data.sort_values('dist', ascending = False, inplace = True)

    fig, ax = plt.subplots(figsize = (15, 15))
    g = sns.barplot(x = 'dist', y = 'word', color = 'blue', data = data)
    g.set(title = f"Topic proportions for {name}", xlabel = "Proportion");

plot_topic_proportions(best_c, name = 'Best coherence')
[Figure: bar plot of topic proportions for the best coherence model]

7.5. Fine Tuning: Advanced#

7.5.1. Hyperparameters: alpha and eta#

The number of topics is not the only value we can set when initializing a model. LDA modeling has two key hyperparameters, which we can configure to control the nature of the topics a training run produces:

  • Alpha: controls document-topic density. The higher the alpha, the more evenly topic proportions are distributed within a given document. The lower the alpha, the more a document will be dominated by just a few topics rather than spread across several

  • Eta: controls word-topic density. The higher the eta, the more evenly word probabilities are distributed across a topic (this boosts the presence of otherwise low-probability words). The lower the eta, the more a topic will be dominated by a small set of high-probability words

At their core, these two hyperparameters control specificity: alpha governs how specific documents are to particular topics, while eta governs how specific topics are to particular words.

On terminology

Different LDA implementations use different names for these hyperparameters. Eta, for example, is also referred to as beta. When reading the documentation for an implementation, look for whatever term stands for the document-topic prior (alpha) and the topic-word prior (eta).

tomotopy has actually been setting values for alpha and eta all along; we just haven’t declared them ourselves. We can do so with arguments when initializing a model. Below, we boost the alpha and lower the eta. This configuration should give us a more even distribution of topics among the documents and higher probabilities for the top words in each topic. We’ll use the topic number from our best coherence model above.
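
Before training that model, here’s a quick look at what an earlier model has been using. This is a minimal check, and it assumes tomotopy exposes the trained priors as .alpha and .eta attributes:

# Inspect the priors of a trained model (attribute names assumed)
print(f"alpha (document-topic prior): {best_c.alpha}")
print(f"eta (topic-word prior): {best_c.eta}")

Now for the adjusted model: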

ae_adjusted = tp.LDAModel(
    k = best_c.k, alpha = 1, eta = 0.001, corpus = corpus, seed = seed
)
ae_adjusted.train(iter = iters)

Let’s compare with the best coherence model.

compare = {'best coherence': best_c, 'high alpha/low eta': ae_adjusted}
for name, tm in compare.items():
    probs = []
    for topic in range(tm.k):
        scores = [s for (w, s) in tm.get_topic_words(topic, top_n = 5)]
        probs.append(scores)

    probs = np.mean(probs)
    words_per_topic = np.median(tm.get_count_by_topics())

    print(f"For the {name} model:")
    print(f"+ Median words/topic: {words_per_topic:0.0f}")
    print(f"+ Mean probablity of a topic's top-five words: {probs:0.4f}%")
For the best coherence model:
+ Median words/topic: 2652
+ Mean probability of a topic's top-five words: 0.0175%
For the high alpha/low eta model:
+ Median words/topic: 4278
+ Mean probability of a topic's top-five words: 0.0260%

7.5.2. Choosing hyperparameter values#

In the literature about LDA modeling, researchers have suggested various ways of setting hyperparameters. For example, the authors of this paper suggest that the ideal alpha and eta values are \(\frac{50}{k}\) and 0.1, respectively (where \(k\) is the number of topics). Alternatively, you’ll often see people advocate for an approach called grid searching. This involves selecting a range of values for each hyperparameter and training a model on every combination of those values; a sketch of such a run follows below.
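
Here’s what that might look like with tomotopy. The value ranges are purely illustrative, and a full run over every combination can take quite a while to train:

from itertools import product

# Hypothetical value ranges; adjust to your corpus and compute budget
topic_counts = [20, 25, 30]
alphas = [0.05, 0.1, 0.5]
etas = [0.01, 0.015, 0.1]

grid_results = []
for k, alpha, eta in product(topic_counts, alphas, etas):
    _model = tp.LDAModel(k = k, alpha = alpha, eta = eta, corpus = corpus, seed = seed)
    _model.train(iter = iters)
    coherence = Coherence(_model, coherence = 'c_v')
    grid_results.append({
        'n_topics': k, 'alpha': alpha, 'eta': eta,
        'perplexity': _model.perplexity, 'coherence': coherence.get_score()
    })

grid_results = pd.DataFrame(grid_results)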

Both approaches are valid, but they don’t emphasize an important point about what our hyperparameters represent. Alpha and eta are priors, meaning they encode knowledge we have about our data before we even model it. In our case, we’re working with book blurbs. The generic conventions of these texts are fairly constrained, so it probably doesn’t make sense to raise our alpha values. The same might hold for a corpus of tweets collected around a small keyword set: the way the data was collected already encodes strong assumptions about what the model will find. Put another way, setting hyperparameters depends on your data and your research question(s). It’s as valid to ask, “do these values give me an interpretable model?” as it is to treat perplexity and coherence scores as the sole arbiters of model quality.

Here’s an example of where the interpretability consideration matters. In the model below, we set hyperparameters that produce a low perplexity score and a high coherence score.

optimized = tp.LDAModel(
    k = best_c.k, alpha = 5, eta = 2, corpus = corpus, seed = seed
)
optimized.train(iter = iters)
optimized_coherence = Coherence(optimized, coherence = 'c_v')

The scores look good.

print(f"Perplexity: {optimized.perplexity:0.4f}")
print(f"Coherence: {optimized_coherence.get_score():0.4f}")
Perplexity: 7354.1627
Coherence: 0.8756

But look at the topics:

for k in range(optimized.k):
    top_words(optimized, k)
Topic 0: recalcitrants (0.0001%), sculley (0.0001%), kropp (0.0001%), distort (0.0001%), sixth (0.0001%)
Topic 1: forces (0.0001%), rwanda (0.0001%), awakes (0.0001%), armada (0.0001%), inseparable (0.0001%)
Topic 2: haight (0.0001%), jung (0.0001%), youdon (0.0001%), reconstruction (0.0001%), gifts (0.0001%)
Topic 3: yugoslavia (0.0001%), mistery (0.0001%), brainiac (0.0001%), pilchuk (0.0001%), lola (0.0001%)
Topic 4: cuneiform (0.0001%), purport (0.0001%), quadratic (0.0001%), warts (0.0001%), alphabetic (0.0001%)
Topic 5: jinks (0.0001%), neuro (0.0001%), shivers (0.0001%), cull (0.0001%), beet (0.0001%)
Topic 6: new (0.0000%), life (0.0000%), book (0.0000%), world (0.0000%), story (0.0000%)
Topic 7: subdue (0.0001%), cooperative (0.0001%), heated (0.0001%), anara (0.0001%), improved (0.0001%)
Topic 8: new (0.0000%), life (0.0000%), book (0.0000%), world (0.0000%), story (0.0000%)
Topic 9: new (0.0000%), life (0.0000%), book (0.0000%), world (0.0000%), story (0.0000%)
Topic 10: corporal (0.0001%), piston (0.0001%), disembarks (0.0001%), oriented (0.0001%), introductionfrom (0.0001%)
Topic 11: new (0.0000%), life (0.0000%), book (0.0000%), world (0.0000%), story (0.0000%)
Topic 12: teroni (0.0001%), exhilarating (0.0001%), wifesanta (0.0001%), rosa (0.0001%), ragtag (0.0001%)
Topic 13: heartbreakingly (0.0001%), chances (0.0001%), vivacious (0.0001%), behindin (0.0001%), thorkellson (0.0001%)
Topic 14: new (0.0000%), life (0.0000%), book (0.0000%), world (0.0000%), story (0.0000%)
Topic 15: unfulfilled (0.0001%), cameo (0.0001%), nationis (0.0001%), paean (0.0001%), medalwinning (0.0001%)
Topic 16: new (0.0000%), life (0.0000%), book (0.0000%), world (0.0000%), story (0.0000%)
Topic 17: una (0.0001%), banda (0.0001%), coburn (0.0001%), eso (0.0001%), justamente (0.0001%)
Topic 18: possessiveness (0.0001%), dauntless (0.0001%), yearsfrom (0.0001%), goodfrom (0.0001%), meddle (0.0001%)
Topic 19: new (0.0000%), life (0.0000%), book (0.0000%), world (0.0000%), story (0.0000%)
Topic 20: underprivileged (0.0001%), widen (0.0001%), enchanting (0.0001%), rabbi (0.0001%), prop (0.0001%)
Topic 21: new (0.0000%), life (0.0000%), book (0.0000%), world (0.0000%), story (0.0000%)
Topic 22: new (0.0000%), life (0.0000%), book (0.0000%), world (0.0000%), story (0.0000%)
Topic 23: del (0.0001%), que (0.0001%), elendel (0.0001%), más (0.0001%), mistborn (0.0001%)
Topic 24: sneezing (0.0001%), haynes (0.0001%), choucroute (0.0001%), tanksincludes (0.0001%), literaturelabeled (0.0001%)
Topic 25: boomer (0.0001%), transformationfor (0.0001%), cosimo (0.0001%), nergal (0.0001%), coerce (0.0001%)
Topic 26: new (0.0000%), life (0.0000%), book (0.0000%), world (0.0000%), story (0.0000%)
Topic 27: new (0.0000%), life (0.0000%), book (0.0000%), world (0.0000%), story (0.0000%)
Topic 28: new (0.0056%), life (0.0052%), book (0.0050%), world (0.0037%), story (0.0035%)
Topic 29: josephus (0.0001%), employees (0.0001%), snowbound (0.0001%), chaffino (0.0001%), hatches (0.0001%)

And the proportions:

plot_topic_proportions(optimized, name = 'Hyperoptimized')
[Figure: bar plot of topic proportions for the hyperoptimized model]

The top words are incoherent and one topic all but completely dominates the topic distribution.

The challenge of setting hyperparameters, then, is that it’s a balancing act. In light of the above output, for example, you might decide to favor interpretability above everything else. But doing so can lead to overfitting. Hence the balancing act: the whole process of fine tuning involves incorporating a number of different considerations (and compromises!) that, at the end of the day, should work in the service of your research question.

7.5.3. Final configuration#

To return to the question of our own corpus, here are the best topic number/hyperparameter configurations from a grid search run:

grid_df = pd.read_csv(indir.joinpath("grid_search_results.csv"), index_col = 0)
grid_df.sort_values('coherence', ascending = False, inplace = True)
grid_df.head(10)
n_topics alpha eta perplexity coherence
315 29 0.100 0.0150 10190.744366 0.726677
335 30 0.125 0.0150 10101.535785 0.723886
323 30 0.050 0.0150 10248.757358 0.721783
318 29 0.125 0.0125 10449.013502 0.720463
310 29 0.075 0.0125 10630.975052 0.719463
326 30 0.075 0.0125 10241.915592 0.718477
330 30 0.100 0.0125 10304.032934 0.717663
275 27 0.050 0.0150 10222.881381 0.716541
319 29 0.125 0.0150 10266.766973 0.714951
311 29 0.075 0.0150 10229.393637 0.712904

If you look closely at the scores, you’ll see that they’re all very close to one another; any one of these options would make for a good model. The perplexity is a bit high, but the coherence scores look good and the topic numbers produce a nice spread of topics. Past testing has shown that the following configuration makes for a particularly good – which is to say, interpretable – one:

tuned = tp.LDAModel(
    k = 29, alpha = 0.1, eta = 0.015, corpus = corpus, seed = seed
)
tuned.train(iter = iters)
tuned_coherence = Coherence(tuned, coherence = 'c_v')

Our metrics:

print(f"Perplexity: {tuned.perplexity:0.4f}")
print(f"Coherence: {tuned_coherence.get_score():0.4f}")
Perplexity: 10170.1982
Coherence: 0.7075

Our topics:

for k in range(tuned.k):
    top_words(tuned, k)
Topic 0: war (0.0457%), history (0.0138%), soldier (0.0124%), military (0.0108%), line (0.0087%)
Topic 1: new (0.0184%), make (0.0177%), just (0.0156%), like (0.0138%), time (0.0129%)
Topic 2: child (0.0276%), mother (0.0263%), school (0.0253%), parent (0.0214%), family (0.0201%)
Topic 3: adventure (0.0180%), cat (0.0176%), love (0.0156%), tale (0.0145%), animal (0.0121%)
Topic 4: music (0.0202%), american (0.0143%), song (0.0123%), star (0.0119%), film (0.0115%)
Topic 5: team (0.0179%), money (0.0157%), game (0.0148%), sport (0.0139%), financial (0.0125%)
Topic 6: book (0.0456%), child (0.0188%), reader (0.0157%), little (0.0146%), young (0.0142%)
Topic 7: christmas (0.0259%), horse (0.0173%), town (0.0147%), cole (0.0125%), family (0.0117%)
Topic 8: murder (0.0272%), mystery (0.0225%), killer (0.0139%), crime (0.0136%), police (0.0130%)
Topic 9: band (0.0132%), aunt (0.0108%), amy (0.0102%), junie (0.0096%), jones (0.0096%)
Topic 10: political (0.0270%), america (0.0139%), states (0.0114%), united (0.0103%), politics (0.0095%)
Topic 11: secret (0.0143%), world (0.0122%), new (0.0103%), human (0.0088%), power (0.0085%)
Topic 12: life (0.0261%), story (0.0246%), year (0.0195%), world (0.0167%), family (0.0105%)
Topic 13: recipe (0.0277%), food (0.0258%), italian (0.0117%), family (0.0104%), meal (0.0095%)
Topic 14: star (0.0240%), universe (0.0146%), planet (0.0142%), lego (0.0138%), wars (0.0134%)
Topic 15: work (0.0174%), penguin (0.0171%), classic (0.0156%), introduction (0.0147%), english (0.0144%)
Topic 16: guide (0.0305%), travel (0.0138%), new (0.0134%), include (0.0128%), top (0.0121%)
Topic 17: comic (0.0179%), volume (0.0124%), tale (0.0115%), sword (0.0107%), adventure (0.0098%)
Topic 18: york (0.0554%), new (0.0526%), times (0.0514%), book (0.0283%), author (0.0269%)
Topic 19: guide (0.0175%), help (0.0136%), learn (0.0109%), health (0.0107%), program (0.0101%)
Topic 20: novel (0.0231%), reader (0.0165%), character (0.0142%), story (0.0128%), author (0.0123%)
Topic 21: van (0.0108%), ellie (0.0101%), landscape (0.0074%), riley (0.0068%), quinn (0.0068%)
Topic 22: book (0.0189%), work (0.0101%), life (0.0098%), world (0.0092%), people (0.0091%)
Topic 23: art (0.0411%), artist (0.0237%), step (0.0227%), use (0.0167%), create (0.0157%)
Topic 24: sea (0.0213%), ship (0.0159%), island (0.0101%), sam (0.0093%), crew (0.0093%)
Topic 25: woman (0.0437%), love (0.0358%), life (0.0225%), husband (0.0152%), passion (0.0130%)
Topic 26: century (0.0183%), king (0.0179%), history (0.0171%), london (0.0149%), great (0.0086%)
Topic 27: vampire (0.0156%), dead (0.0139%), hunter (0.0110%), black (0.0110%), queen (0.0101%)
Topic 28: god (0.0339%), life (0.0277%), spiritual (0.0215%), practice (0.0177%), book (0.0143%)

Our proportions:

plot_topic_proportions(tuned, name = 'Tuned')
[Figure: bar plot of topic proportions for the tuned model]

7.6. Model Exploration#

Topic proportions and top words are all helpful, but there are other ways to dig more deeply into a model. This final section will show you a few examples of this.

First, let’s rebuild a theta matrix from the fine-tuned model. Remember that theta is the document-topic matrix, where each cell is a probability score for a document’s association with a particular topic.

theta = get_theta(tuned, manifest['title'])

We’ll also make a quick set of labels for our topics, which list the top five words for each one.

labels = [format_top_words(tuned, k) for k in range(tuned.k)]

A heatmap is a natural fit for inspecting the probability distributions of theta.

n_samples = 20
blurb_set = manifest['title'].sample(n_samples)
sub_theta = theta[theta.index.isin(blurb_set)]

fig, ax = plt.subplots(figsize = (15, 8))
g = sns.heatmap(sub_theta, linewidths = 1, cmap = 'Blues', ax = ax)
g.set(xlabel = 'Topic', ylabel = 'Title', xticklabels = labels)
ax.xaxis.set_label_position('top')
ax.xaxis.tick_top()
plt.xticks(rotation = 30, ha = 'left')
plt.tight_layout();
[Figure: heatmap of document-topic probabilities for 20 sampled blurbs]

Topic 15 appears to be about the classics. What kind of titles have a particularly high association with this topic?

def doc_topic_associations(theta, manifest, k):
    """Find documents in a manifest that are most associated with topic k."""
    # Keep only documents whose single most probable topic is k
    topk = theta.loc[theta.idxmax(axis = 1) == k, k]
    associated = manifest[manifest['title'].isin(topk.index)].copy()
    associated.loc[:, f'{k}_score'] = topk.values

    return associated[['author', 'title', 'genre', f'{k}_score']]

k15 = doc_topic_associations(theta, manifest, 15)
k15.head(15)
author title genre 15_score
1 Horatio Alger Ragged Dick and Struggling Upward Classics 0.510900
66 Chris Carter, Roy Thomas, Glenn Morgan, James ... X-Files Classics: Season 1 Volume 1 Fiction 0.344544
146 Marge Piercy The Art of Blessing the Day Poetry 0.174064
258 Chretien de Troyes Arthurian Romances Classics 0.731107
271 Various American Science Fiction: Nine Classic Novels ... Fiction 0.350466
275 Nicola Sacco, Bartolomeo Vanzetti The Letters of Sacco and Vanzetti Classics 0.588130
412 Michel de Montaigne The Essays Classics 0.291393
443 David Goodis The Burglar Fiction 0.221486
464 Frances Hodgson Burnett A Little Princess Classics 0.246585
480 Laura Ingalls Wilder Laura Ingalls Wilder: The Little House Books V... Fiction 0.358375
485 Brooks Haxton They Lift Their Wings to Cry Classics 0.191476
595 Mary Boykin Chesnut Mary Chesnut's Diary Classics 0.522240
601 Kenneth Grahame The Wind in the Willows Classics 0.633616
621 Harriet Jacobs Incidents in the Life of a Slave Girl Classics 0.573353
631 Richard E. Kim The Martyred Classics 0.425864

Certain topics seem to map very nicely onto particular book genres. Here are some titles associated with topic 13, which is about recipes.

k13 = doc_topic_associations(theta, manifest, 13)
k13.head(15)
author title genre 13_score
8 Adele Yellin, Kevin West The Grand Central Market Cookbook Nonfiction 0.432363
17 Gene Daoust, Joyce Daoust The Formula Nonfiction 0.352713
188 Lilach German Cookies, Cookies & More Cookies! Nonfiction 0.382768
233 Katharine Ibbs DK Children's Cookbook Nonfiction 0.455814
279 Christina Orchid Christina's Cookbook Nonfiction 0.421846
327 Nicola Graimes The Low-Sugar Cookbook Nonfiction 0.439704
368 Sara Deseran, Joe Hargrave, Antelmo Faria, Mik... Tacolicious Nonfiction 0.663644
395 Kendra Bailey Morris The Southern Slow Cooker Nonfiction 0.509477
452 Carmen Posadas Little Indiscretions Fiction 0.201231
459 Suvir Saran, Stephanie Lyness Indian Home Cooking Nonfiction 0.683412
462 Helene Siegel Totally Salmon Cookbook Nonfiction 0.653585
549 Carly de Castro, Hedi Gores, Hayden Slater Juice Nonfiction 0.406312
578 Hi Soo Shin Hepinstall Growing up in a Korean Kitchen Nonfiction 0.379062
612 Maria del Mar Sacasa Winter Cocktails Nonfiction 0.438557
620 Frank Pellegrino Rao's Cookbook Nonfiction 0.541190

Topic 2 appears to be about girlhood.

k2 = doc_topic_associations(theta, manifest, 2)
k2.head(15)
author title genre 2_score
9 Liz Curtis Higgs Thorn in My Heart Fiction 0.306729
90 Paul D. White, Ron Arias White's Rules Nonfiction 0.260193
153 Liz Curtis Higgs Grace in Thine Eyes Fiction 0.305969
278 Julianna Margulies, Paul Margulies Three Magic Balloons Children’s Books 0.289717
328 John Ramsey Miller The Last Family Fiction 0.392691
382 Val Brelinski The Girl Who Slept with God Fiction 0.274784
416 Sarah Henstra Mad Miss Mimic Teen & Young Adult 0.216246
678 Leigh Stein The Fallback Plan Fiction 0.363485
698 Suzanne Fisher Staples The House of Djinn Teen & Young Adult 0.305969
718 Goce Smilevski Freud's Sister Fiction 0.241625
763 Edd Doerr The Case Against School Vouchers Nonfiction 0.276409
897 Adolf Schroder The Game of Cards Fiction 0.294001
921 Bobby Henderson The Gospel of the Flying Spaghetti Monster Classics 0.227076
942 Manuela Monari Zero Kisses for Me Children’s Books 0.360254
1142 Emily Bazelon Sticks and Stones Nonfiction 0.257467

Though there are some perplexing titles listed here. What’s The Gospel of the Flying Spaghetti Monster doing here? The Case Against School Vouchers is similarly odd, though maybe it’s a sensible fit given what we might expect from shared vocabulary. Topic models, remember, are ultimately counting word co-occurrences, not the different semantic valences of a word, or tone, style, etc. It’s up to us to parse the latter kinds of things.

To do so, it’s helpful to examine the overall similarities and differences between topics, much in the way we projected our documents into a vector space in the previous chapter. We’ll prepare the code to do something similar here but will save the final result for a separate webpage. Below, we produce the following:

  • A topic-term distribution matrix (word probabilities for each topic)

  • The lengths of every blurb

  • A list of the corpus vocabulary

  • The corresponding frequency counts for the corpus vocabulary

Once we’ve made these, we’ll prepare our visualization data with a package called pyLDAvis and save it.

topic_terms = np.stack([tuned.get_topic_word_dist(k) for k in range(tuned.k)])
doc_lengths = np.array([len(doc.words) for doc in tuned.docs])
vocab = list(tuned.used_vocabs)
term_frequency = tuned.used_vocab_freq

vis = pyLDAvis.prepare(
    topic_terms, theta.values, doc_lengths, vocab, term_frequency,
    start_index = 0, sort_topics = False
)

outpath = indir.joinpath("output/topic_model_plot.html")
pyLDAvis.save_html(vis, outpath.as_posix())

With that done, we’ve finished our initial work with topic models. The resultant visualization of the above is available here. It’s a scatter plot that represents topic similarity; the size of each topic circle corresponds to that topic’s proportion in the model. Explore it some and see what you find!