Overview#

This week-long workshop series covers the basics of text mining and natural language processing (NLP) with Python. We will focus primarily on unstructured text data, discussing how to format and clean text to enable the discovery of significant patterns in collections of documents. Sessions will introduce participants to core terminology in text mining/NLP and will walk through methods that range from tokenization and dependency parsing to text classification, topic modeling, and word embeddings. Basic familiarity with Python is required. We welcome students, postdocs, faculty, and staff from a variety of research domains, ranging from health informatics to the humanities.

Note: this series concludes with a special session on large language models, “The Basics of Large Language Models.”

Learning Goals#

By the end of this series, you will be able to:

  • Clean and structure textual data for analysis

  • Recognize and explain how these cleaning processes impact research findings

  • Explain key concepts and terminology in text mining/NLP, including tokenization, dependency parsing, and word embeddings

  • Use special data structures such as document-term matrices to efficiently analyze multiple texts

  • Use statistical measures (pointwise mutual information, tf-idf) to identify significant patterns in text

  • Classify texts on the basis of their features

  • Produce statistical models of topics from a collection of texts

  • Produce models of word meanings from a corpus
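To give a flavor of what these goals look like in practice, here is a minimal sketch of two of them, tokenization and a tf-idf-weighted document-term matrix, using only Python's standard library. The toy documents and the `tfidf` helper are illustrative assumptions, not the workshop's own materials; the sessions will use dedicated NLP libraries for real corpora.

```python
import math
from collections import Counter

# Toy corpus: three tiny "documents" standing in for a real collection
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Tokenize with the simplest possible scheme: split on whitespace
tokenized = [doc.split() for doc in docs]

# Document-term matrix as a list of Counters (one row per document)
dtm = [Counter(tokens) for tokens in tokenized]

def tfidf(term, row, dtm):
    """Term frequency weighted by inverse document frequency."""
    tf = row[term] / sum(row.values())
    df = sum(1 for r in dtm if term in r)       # documents containing the term
    idf = math.log(len(dtm) / df)
    return tf * idf

# Score every term in the first document; words common to many
# documents (like "the") receive low weights
scores = {t: tfidf(t, dtm[0], dtm) for t in dtm[0]}
print(sorted(scores, key=scores.get, reverse=True)[:2])  # → ['cat', 'mat']
```

Terms unique to a document ("cat", "mat") get the highest tf-idf weights, while words shared across documents are downweighted, which is exactly how the measure surfaces distinctive vocabulary in a collection.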

Prerequisites#

These workshops are not an introduction to Python. Learners must have taken DataLab’s Python Basics workshop series or have equivalent prior experience using Python.

Computing Requirements#

Before the workshop, please make sure your computer has a working internet connection and the most recent versions of the following software:

You can find step-by-step installation instructions in DataLab’s Python Install Guide. If you need additional help, come chat with us in DataLab’s Office Hours.