14. Assessment#
Instructions

To earn a micro-badge for this workshop, write code for and answer the prompts in the sections below. You should do your work in a code environment (like Jupyter); include all code, code outputs, and short answers directly in your notebook. Submit this notebook to GradPathways by exporting it to an HTML file (in Jupyter Notebooks, you can do this by going to File > Save Page As in your browser).
14.1. Environment and File Setup#
For this assessment, you will extract linguistic information about a corpus of texts and use this information to model the texts. We intend this both to reinforce the materials covered in the workshop sessions and to give you an opportunity to practice NLP methods for your own research. To that end, you are invited to use your own corpus of text files to complete the assessment; by the time you’ve finished, you’ll have built a foundation for a research project with NLP.
If you don’t want to use your own corpus, or if you don’t have one ready to hand, we’ve also provided you one. Under `data/sherlock` you’ll find 56 Sherlock Holmes short stories. You can find a corresponding file manifest at `data/manifest.csv`, which you should use as a reference when doing your work.
If you choose to use your own corpus, we suggest using between 50 and 100 documents. But note that document length matters: 100 novels is far more material than 100 short stories. Aim for a corpus with at least 150,000 total words.
This assessment does not require you to generate and store new files, but you are welcome to do so as you see fit.
While we ask that you do this work in a code notebook, you may use a local environment on your own computer or Google Colab. Regardless of which environment you use, you will need to make sure that you have all required packages installed. The requirements file for these packages is under `data/requirements.txt`.
The directory structure for this assessment is:

    requirements.txt              A list of required packages
    text_mining_assessment.pdf    These instructions
    data/                         The data directory
    |-- manifest.csv              A file manifest
    `-- sherlock/                 Plaintext files of 56 Sherlock Holmes stories
14.2. Rubric#
Readers at GradPathways will be looking for a few things in this assessment:
Working code: were you able to successfully implement code for each prompt?
Understanding the code: can you explain what your code does and why you implemented it?
Supported examples and materials: have you used graphs and other results to produce evidence for your findings?
Critical reflection: do your short answers provide context (conceptual, domain-specific, etc.) for your findings and observations? Can you use your results to reason about your corpus, or even provide preliminary hypotheses?
14.3. Prompts#
14.3.1. Processing#
Load the following data into your environment, assigning each to its respective variable:

- `nlp`: spaCy’s `en_core_web_md` model
- `manifest`: the file manifest for the corpus files

Load your files and process them with spaCy. Assign the output to a variable named `corpus`.

By default, spaCy assigns word embedding vectors both to individual tokens and to the document as a whole. You can access these with the `.vector` attribute. Use a list comprehension to get the document vectors from each file in the corpus and store the output in a `vecs` variable. Wrap this list in a NumPy array.
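Here is a minimal sketch of one way to do this, assuming the provided Sherlock corpus and a column of file names in the manifest (called `file_name` below; that column name is an assumption, so adjust it to match your own manifest):

```python
# A minimal sketch, not the only way to do this. Assumes the provided Sherlock
# corpus and a hypothetical "file_name" column in the manifest -- adjust both
# to match your own setup.
from pathlib import Path

import numpy as np
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_md")            # spaCy's medium English model
manifest = pd.read_csv("data/manifest.csv")   # file manifest for the corpus

# Read each plaintext file and run it through the spaCy pipeline
texts = [Path("data/sherlock", fname).read_text() for fname in manifest["file_name"]]
corpus = [nlp(text) for text in texts]

# Document-level embeddings live on the .vector attribute
vecs = np.array([doc.vector for doc in corpus])
```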
14.3.2. Clustering documents with word embeddings#
With the embedding vectors extracted, you can cluster and visualize the corpus. Use the `AgglomerativeClustering` object from scikit-learn to do the clustering on the vectors. Then, run a dimensionality reduction with `TSNE` (also in scikit-learn) to create two-dimensional (XY) representations of the vectors. Use the following values for `TSNE`:

- `init`: pca
- `learning_rate`: auto
- `angle`: any float under 0.3

Hint: if you’re unsure about the syntax here, refer to the final section in chapter three of the workshop reader.
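If it helps, here is a rough sketch of these two steps, assuming the `vecs` array from the previous section; parameters not listed in the prompt are left at their defaults:

```python
# A rough sketch of the clustering and dimensionality-reduction steps.
# Assumes the `vecs` array from the processing step above.
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import TSNE

agg = AgglomerativeClustering()   # n_clusters defaults to 2
agg.fit(vecs)

tsne = TSNE(init="pca", learning_rate="auto", angle=0.25)
reduced = tsne.fit_transform(vecs)   # an (n_documents, 2) array of XY positions
```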
Convert your XY data into a pandas DataFrame called `vis_data`. Assign two new columns to this DataFrame: 1) `title`: the title of each story (available in `manifest`); 2) `label`: the output clusters from `AgglomerativeClustering` (accessible via the `.labels_` attribute). Use altair to make a scatter plot of this data. The plot’s `color` argument should take in the `label` column (use `label:N` to get the coloring right); `tooltip` should take in `title`. Don’t forget to set the plot to its interactive mode!

`AgglomerativeClustering` defaults to two clusters. Based on a visual inspection of the plot you’ve made, do you think this number of clusters is adequate for your corpus? Why or why not?

Re-cluster your data a few times with a different number of clusters. Find what you think is the optimal number of clusters for the corpus. Explain your reasoning, using supporting visualizations as you see fit.
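One possible shape for the DataFrame and scatter-plot step above, assuming `reduced`, `agg`, and `manifest` from the earlier sketches and a `title` column in the manifest, is:

```python
# A sketch only: column names here assume the manifest has a "title" column.
import altair as alt
import pandas as pd

vis_data = pd.DataFrame(reduced, columns=["x", "y"])
vis_data["title"] = manifest["title"]
vis_data["label"] = agg.labels_

# Scatter plot colored by cluster label, with story titles as tooltips
chart = (
    alt.Chart(vis_data)
    .mark_circle()
    .encode(x="x", y="y", color="label:N", tooltip="title")
    .interactive()
)
chart
```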
Hint: If you’re having trouble dividing up your documents into groups in the visualization, try adjusting the parameters of `TSNE` to better reflect what the `AgglomerativeClustering` object analyzes. A good place to start is `angle`.
Note: Clustering almost always involves some element of interpretation, but, in your own work, you can also use empirical measures to help you set an appropriate number of clusters. Silhouette scoring is a common strategy for doing so
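For reference, a rough sketch of silhouette scoring with scikit-learn (assuming the `vecs` array from above) might look like this; higher scores indicate better-separated clusters:

```python
# A rough sketch: compare silhouette scores across candidate cluster counts.
# Assumes the `vecs` array from the processing step.
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

for k in range(2, 9):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(vecs)
    print(k, silhouette_score(vecs, labels))
```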
Once you’ve picked your optimal number of clusters, assign the cluster labels to a `cluster` column in `manifest`. Use a `groupby` to count the number of documents in each cluster.
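A minimal sketch of this bookkeeping step, where `agg` stands in for whatever clustering model you settled on:

```python
# A minimal sketch. `agg` is assumed to be the AgglomerativeClustering model
# you chose after picking a cluster count.
manifest["cluster"] = agg.labels_

# Count the number of documents in each cluster
manifest.groupby("cluster").size()
```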
14.3.3. Unique words#
It’s now time to explore the actual words in your corpus. First, you’ll look at intersecting words across the clusters you’ve created. To do so, create a list of sets, where each set corresponds to the unique tokens in a corpus text. Be sure to use the `.text` attribute when compiling these sets from spaCy documents. Assign this list to a new column in `manifest` called `types`.
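A minimal sketch of this step, assuming the `corpus` list of spaCy documents from earlier:

```python
# A minimal sketch: each set holds the unique token strings (types) in one document.
manifest["types"] = [set(token.text for token in doc) for doc in corpus]
```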
In a for loop, step through each unique cluster in `manifest`. Use `.groupby('cluster').get_group(<cluster_num>)['types']` to extract all the sets for a given cluster. Perform a set intersection on the list, add the result to a dictionary called `doc_intersections` (use the cluster number as a key), and print the cluster number and the number of elements in the resulting set to screen.
Hint: not sure how to do a set intersection? Take a look at this link
Get the intersection of the sets in `doc_intersections` and store it in a variable named `corpus_intersections` (the result should be relatively short). This is the overlap between all unique words in all clusters.

Return to your dictionary of sets and build one more for loop to step through each one. Within the for loop, use `<set>.difference(corpus_intersections)` to get the difference between the current set of words and the intersection across the corpus. Print the result of `.difference()` to screen along with the cluster number and inspect the contents. What do you see? Can you discern any patterns among these words that might help you understand why they’ve been grouped together?
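A rough sketch of these steps, assuming the `cluster` and `types` columns built above, might look like the following:

```python
# A rough sketch of the per-cluster intersections and differences.
# set.intersection(*iterable) intersects every set in an iterable.
doc_intersections = {}
for cluster_num in manifest["cluster"].unique():
    sets = manifest.groupby("cluster").get_group(cluster_num)["types"]
    doc_intersections[cluster_num] = set.intersection(*sets)
    print(cluster_num, len(doc_intersections[cluster_num]))

# The overlap between all unique words in all clusters
corpus_intersections = set.intersection(*doc_intersections.values())

# Words in each cluster's shared vocabulary that are not shared corpus-wide
for cluster_num, shared in doc_intersections.items():
    print(cluster_num, shared.difference(corpus_intersections))
```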
14.3.4. Linguistic features#
In the question above, you created sets from the corpus texts using spaCy’s `.text` attribute. But as you know from our series, spaCy provides several other document and token annotations, ranging from entities to part-of-speech tags and syntactic dependencies. Select three such attributes and, for each one, write a function that will gather information about that attribute from the corpus texts.

Hint: You’re probably going to be counting attributes. If so, it’s best to normalize those counts, usually by dividing them by the length of the document as a whole (refer to the advanced feature engineering section of the second chapter in the workshop reader).
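As one illustration (not a required attribute), a normalized count of a part-of-speech tag could look something like this:

```python
# An illustrative sketch only: the share of tokens in a document tagged as
# adjectives. Substitute the three attributes you actually choose.
def adjective_ratio(doc):
    """Return the proportion of tokens in a spaCy doc with the ADJ part of speech."""
    return sum(1 for token in doc if token.pos_ == "ADJ") / len(doc)
```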
Before running these functions on the corpus texts, explain your reasoning for selecting each attribute in a sentence or two. Make a prediction: do you think each attribute will help you discern differences between your clusters?
Run the functions and assign the result of each to columns in `manifest`. Then group `manifest` by `cluster` and make a histogram of each attribute you created. Based on what you see, do you think these attributes successfully partition the corpus along the lines your clustering has defined? Explain why or why not.
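A rough sketch of this last step, using the illustrative `adjective_ratio()` function above (repeat the pattern for each of your attributes):

```python
# A rough sketch: add the attribute to the manifest, then plot one histogram
# per cluster. Assumes the illustrative adjective_ratio() function above.
manifest["adj_ratio"] = [adjective_ratio(doc) for doc in corpus]
manifest.hist(column="adj_ratio", by="cluster")
```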