1. Section Overview#

In this three-part workshop series you will learn the basics of text mining with Python. We will focus primarily on unstructured text data, discussing how to format and clean text to enable the discovery of significant patterns in collections of documents. Sessions will introduce participants to core terminology in text mining/natural language processing and will walk through different methods of ranking terms and documents. We will conclude by using these methods to classify texts and to build models of “topics.” Basic familiarity with Python is required. We welcome students, postdocs, faculty, and staff from a variety of research domains, ranging from health informatics to the humanities.

2. Before We Begin…#

This reader is meant to serve both as a roadmap for the overall trajectory of the series and as a reference for later work you may do in text mining. Our sessions will follow its overall logic, but the reader itself offers substantially more details than we may have time to discuss in the sessions. The instructors will call attention to this when appropriate; you are encouraged to consult the reader whenever you would like to learn more about a particular topic.

Each session of this workshop will cover material from one or more chapters. We also ask that you read Chapter 2 in advance of our first session. It’s a review of sorts and will set up a frame for the series.

| Session | Chapters Covered | Topic |
|---------|------------------|-------|
| 0* | Chapter 2 | Review: textual data in Python |
| 1 | Chapter 3 | Text cleaning |
| 2 | Chapters 4 & 5 | Corpus analytics and document clustering |
| 3 | Chapter 6 | Topic modeling |

* Please read in advance

Learning Objectives

By the end of this series, you will be able to:

  • Prepare textual data for analysis using a variety of cleaning processes

  • Recognize and explain how these cleaning processes impact research findings

  • Explain key terminology in text mining, including “tokenization,” “n-grams,” “lexical diversity,” and more

  • Use special data structures (such as document-term matrices) to efficiently analyze multiple texts (a short preview appears after this list)

  • Use statistical measures (pointwise mutual information, tf-idf) to identify significant patterns in text

  • Cluster and classify texts on the basis of such measures

  • Produce statistical models of “topics” from a collection of texts
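As a brief preview of the kind of code we will write, here is a minimal sketch of building a document-term matrix. The scikit-learn library used below is our choice for illustration only; the workshop itself may build these structures differently:

from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus of two short "documents"
docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

# CountVectorizer tokenizes each document and counts term frequencies,
# yielding a document-term matrix (one row per document, one column per term)
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary of the toy corpus
print(dtm.toarray())                       # term counts for each document

Don’t worry if any of this looks unfamiliar; the sessions on corpus analytics and document clustering will work through these ideas in detail.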

2.1. File and Data Setup#

2.1.1. Google Colab#

We will be using Google Colab’s platform and Google Drive during the series, working with a set of pre-configured notebooks and data. You must have a Google account to work in the Colab environment. Perform the following steps to set up your environment for the course:

  1. Download the data

  2. Uncompress the downloaded .zip file by double-clicking it

  3. Visit the Google Colab website at https://colab.research.google.com and sign in using your Google account

  4. In a separate browser tab (from the one where you are logged in to Colab), sign in to your Google Drive

  5. Upload the tm_workshop_data directory into the root of your Google Drive

Once you have completed the above steps, your basic environment will be set up. Next, you’ll need to create a blank notebook in Google Colab. To do this, go to Google Colab and choose “File -> New Notebook” from the File menu. Alternatively, select “New Notebook” in the bottom right corner of the notebooks pop-up if it appears in your window.

Now you need to connect your Google Drive to your Colab environment. To do this, run the following code in the code cell that appears at the top of your blank notebook:

# Mount your Google Drive so the notebook can access the files you uploaded
from google.colab import drive
drive.mount('/content/drive')

Your environment should be ready to go!
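If you’d like to verify that everything is in place, you can list the contents of the data directory from your notebook. This is an optional sanity check; the path below assumes you uploaded tm_workshop_data to the root of your Drive, which Colab exposes as MyDrive:

import os

# The root of your Google Drive is mounted at /content/drive/MyDrive
data_dir = '/content/drive/MyDrive/tm_workshop_data'
print(os.listdir(data_dir))  # should list the workshop data files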

2.1.2. Template code#

This workshop is hands-on, and you’re encouraged to code alongside the instructors. That said, we’ll also start each session with some template code from the session before. You can find these templates in this start script directory. Simply copy/paste the code from the .txt files into your Jupyter environment.

2.2. Assessment#

If you are taking this workshop to complete a GradPathways micro-credential track, you can find instructions for the assessment here.