1. Section Overview#
In this three-part workshop series you will learn the basics of text mining with Python. We will focus primarily on unstructured text data, discussing how to format and clean text to enable the discovery of significant patterns in collections of documents. Sessions will introduce participants to core terminology in text mining/natural language processing and will walk through different methods of ranking terms and documents. We will conclude by using these methods to classify texts and to build models of “topics.” Basic familiarity with Python is required. We welcome students, postdocs, faculty, and staff from a variety of research domains, ranging from health informatics to the humanities.
2. Before We Begin…#
This reader is meant to serve both as a roadmap for the overall trajectory of the series and as a reference for later work you may do in text mining. Our sessions will follow its overall logic, but the reader itself offers substantially more details than we may have time to discuss in the sessions. The instructors will call attention to this when appropriate; you are encouraged to consult the reader whenever you would like to learn more about a particular topic.
Each session of this workshop will cover material from one or more chapters. We also ask that you read Chapter 2 in advance of our first session. It’s a review of sorts and will set up a frame for the series.
| Session | Chapters Covered | Topic |
|---|---|---|
| 0\* | Chapter 2 | Review: textual data in Python |
| 1 | Chapter 3 | Text cleaning |
| 2 | Chapters 4 & 5 | Corpus analytics and document clustering |
| 3 | Chapter 6 | Topic modeling |

\* Please read in advance
Learning Objectives
By the end of this series, you will be able to:
- Prepare textual data for analysis using a variety of cleaning processes
- Recognize and explain how these cleaning processes impact research findings
- Explain key terminology in text mining, including "tokenization," "n-grams," "lexical diversity," and more (two of these terms are previewed in the short sketch after this list)
- Use special data structures (such as document-term matrices) to efficiently analyze multiple texts
- Use statistical measures (pointwise mutual information, tf-idf) to identify significant patterns in text
- Cluster and classify texts on the basis of such measures
- Produce statistical models of "topics" from a collection of texts
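As a preview of this terminology, below is a minimal sketch of tokenization, n-grams, and lexical diversity using only the Python standard library. The sample sentence and the simple regex-based tokenizer are illustrative assumptions, not the exact approach the workshop chapters will use.

```python
import re

# An illustrative sample sentence (not from the workshop data)
text = "Text mining turns unstructured text into structured data."

# Tokenization: split the text into lowercase word tokens
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens)
# ['text', 'mining', 'turns', 'unstructured', 'text', 'into', 'structured', 'data']

# N-grams: sequences of n adjacent tokens (here, bigrams: n = 2)
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams[:3])
# [('text', 'mining'), ('mining', 'turns'), ('turns', 'unstructured')]

# Lexical diversity: unique tokens as a share of all tokens
print(len(set(tokens)) / len(tokens))  # 0.875
```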
2.1. File and Data Setup#
2.1.1. Google Colab#
We will be using Google Colab and Google Drive during the series, working with a set of pre-configured notebooks and data. You must have a Google account to work in the Colab environment. Perform the following steps to set up your environment for the course:
1. Download the data
2. Uncompress the downloaded .zip file by double-clicking on it
3. Visit the Google Colab website at https://colab.research.google.com and sign in using your Google account
4. In a separate browser tab (from the one where you are logged in to Colab), sign in to your Google Drive
5. Upload the `tm_workshop_data` directory into the root of your Google Drive
Once you have completed the above steps, your basic environment will be set up. Next, you'll need to create a blank notebook in Google Colab. To do this, go to Google Colab and choose "File -> New Notebook" from the File menu. Alternatively, select "New Notebook" in the bottom-right corner of the notebooks pop-up if it appears in your window.
Now you need to connect your Google Drive to your Colab environment. To do this, run the following code in the code cell that appears at the top of your blank notebook:
```python
from google.colab import drive

# Mount your Google Drive; Colab will prompt you to authorize access
drive.mount('/content/drive')
```
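As an optional sanity check, you can list the contents of the uploaded data directory to confirm the mount worked. This sketch assumes you placed `tm_workshop_data` in the root of your Drive, which the mount exposes at `/content/drive/MyDrive`:

```python
import os

# List the workshop data directory; assumes tm_workshop_data sits in the
# root of your Google Drive, mounted at /content/drive/MyDrive
print(os.listdir('/content/drive/MyDrive/tm_workshop_data'))
```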
Your environment should be ready to go!
2.1.2. Template code#
This workshop is hands-on, and you're encouraged to code alongside the instructors. That said, we'll also start each session with some template code from the session before. You can find these templates in this start script directory. Simply copy/paste the code from the .txt files into your Jupyter environment.
2.2. Assessment#
If you are taking this workshop to complete a GradPathways micro-credential track, you can find instructions for the assessment here.