UC Davis DataLab Text and NLP Toolkit

Overview

This page will provide you with a few basic concepts and tools for working with text data. Natural Language Processing (NLP) is a discipline of its own, so if you plan to use these methods in your own research, please consider this a starting point only. That being said, this page should give you the vocabulary required to start researching the concepts that are of interest to you and your project.

Natural Language Processing (NLP) is a combination of linguistics, computer science, data science, and statistics, which uses models of language to understand large collections of text, called corpora. Here, “natural language” refers to language people use (in contrast with something like a “coding language,” for example). In the digital humanities, NLP is sometimes referred to as “distant reading,” to distinguish it from the kind of “close reading” that typically marks literary interpretation. Good NLP projects, however, still involve plenty of manual reading, at least in part. Ultimately, the adage “garbage in, garbage out” still applies to NLP, and much of this work involves iteratively comparing data-driven analysis of a corpus with human-led reading.

Another common term used in this area is text mining. There is significant overlap between text mining and NLP, but the key distinction depends on the final use of the text data. Text mining is usually understood as the process of collecting, cleaning, and doing structural analyses of text such as word counts. While this is always part of the process of NLP, NLP itself is usually concerned with the semantic meaning of the text, and the concepts contained within.

Natural Language Processing (NLP) Concepts

NLP involves many methods. These can range from simple counts of words across a corpus, to processing the corpus to find salient themes and the documents that discuss them. Before you can pick a method though, you need to make some decisions about how you will represent your text as data. This includes how you will select and clean your text, as well as picking the unit of analysis. We will give a quick overview of these steps here, but please check out our workshop archive for NLP workshops and other resources that can better prepare you to take advantage of NLP methods in your work.

As with all data sets, you must keep in mind what you are actually sampling when you collect text data. For example, while Twitter may provide ready access to social media data through their API, it is still only a snapshot of Twitter users, not a generalizable corpus of all humanity. The same is true when you get newspapers from an archive or transcripts from speeches: it will always be just one slice of all that has ever been said or written. Readers interested in how they might frame their research with these caveats in mind are encouraged to consult research that employs data-driven approaches to book history and bibliography.

After you have selected a corpus, you’ll likely need to clean it. Even the most common NLP analyses involve substantial data cleaning and pre-processing. Consider, for example, a scenario in which you are interested in simply counting the number of times “Follow” appears in your corpus. You would need to decide things like:

  1. Does capitalization matter? Is “Follow” a proper noun or just a common verb?
  2. Does the tense of the word matter? Does Follow == Following == Followed?
  3. Does the word or a phrase with the word matter? Do you care about “Follow” or “Current Follow Count”?

You must also decide what the unit of analysis for your project will be. Are you looking at books in a subject area, or chapters in those books, or paragraphs in those chapters? Depending on the analysis, the unit of comparison can have a large effect on outcomes. For example, the co-occurrence of the terms “University” and “California” may be small if you are looking within sentences, but much higher if you expand the scope to paragraphs.
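
To make the effect of scope concrete, here is a minimal sketch in plain Python (using a made-up two-paragraph snippet) that counts how often “university” and “california” co-occur within sentences versus within paragraphs:

  # A hypothetical two-paragraph snippet, used only for illustration.
  text = (
      "The University of California has ten campuses. Davis is in California.\n\n"
      "The University announced new research funding. California lawmakers responded."
  )

  def co_occurs(unit, a="university", b="california"):
      # True if both terms appear in the same unit of text (case-insensitive).
      unit = unit.lower()
      return a in unit and b in unit

  paragraphs = [p for p in text.split("\n\n") if p.strip()]
  sentences = [s for p in paragraphs for s in p.split(". ") if s.strip()]

  print(sum(co_occurs(s) for s in sentences))   # 1 sentence-level co-occurrence
  print(sum(co_occurs(p) for p in paragraphs))  # 2 paragraph-level co-occurrences

Widening the unit of analysis from sentences to paragraphs doubles the co-occurrence count in this toy example.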

Natural Language Processing (NLP) Pre-Processing

Here we will cover some of the most common cleaning steps in NLP. Note, however, that you may not need to perform all of them; they are project-dependent, and it is possible to ruin your analyses with over-cleaning. Think carefully about how each process will affect your research question, and remember that, as with all data cleaning, textual pre-processing is an iterative process.

Lowercase Everything

One of the most common cleaning steps involves setting all your text to lowercase. This is typically done because a computer will consider capitalized letters to be different from their lowercase forms. For example, “Hello” and “hello” are, in a computational sense, different words, and you would need to choose one or the other to get a computer to count them as the same. While this is a fairly straightforward example, there are plenty of edge cases, which can throw off the meaning of the text you want to analyze. Consider the difference between “dune,” the terrain feature, and Dune, the book: converting both words to lowercase would make the proper noun disappear from your data!
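
As a minimal sketch in Python, lowercasing is a one-liner, but note how it erases the distinction between the proper noun and the common noun:

  words = ["Hello", "hello", "Dune", "dune"]
  print([w.lower() for w in words])
  # ['hello', 'hello', 'dune', 'dune'] -- the novel and the terrain feature now look identical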

Stopword Removal

Stopwords are the connective tissue of language; the term covers words like “the”, “a”, “to”, etc. While such words are integral to the way we understand the meaning of a sentence, they are so context-dependent that they often carry very little meaning in and of themselves. Because of this, stopwords are generally not very useful for basic text analysis. In fact, keeping them in your corpus can actually be detrimental: they over-inflate the number of words in your corpus, throwing off the denominator of many calculations. More advanced NLP methods do make use of these words for charting sentence structure, but otherwise it is generally recommended that you remove them. Pre-made stopword lists often come included with text analysis software; one of the most popular is the Buckley-Salton list. Sometimes it may also be prudent to alter a list: if a word is particularly common in your corpus, you may want to add it to the list so it is removed; on the other hand, if a common word has a particular meaning in your context, you may want to make sure it is preserved.
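
As a sketch, here is how stopword removal might look in Python with NLTK’s built-in English stopword list (a different list, such as Buckley-Salton, could be swapped in); the download call fetches the list on first use:

  import nltk
  from nltk.corpus import stopwords

  nltk.download("stopwords", quiet=True)  # fetch the stopword list on first use

  stops = set(stopwords.words("english"))
  tokens = ["the", "nlp", "guide", "was", "helpful", "to", "me"]
  print([t for t in tokens if t not in stops])  # ['nlp', 'guide', 'helpful']

Adding or removing entries from the stops set is how you would tailor the list to your corpus.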

Tokenization

In NLP parlance, a “token” is the unit by which you are analyzing your data. The most intuitive way to tokenize is by word, where each word in your text becomes a distinct item. However, there are cases where, instead of words, you may want to break your text into n-grams, or rolling chunks of length n. For example, if you have the sentence, “The NLP guide was helpful” you could tokenize it in the following ways:

  • word: “The”, “NLP”, “guide”, “was”, “helpful”
  • bi-gram: “The NLP”, “NLP guide”, “guide was”, “was helpful”
  • tri-gram: “The NLP guide”, “NLP guide was”, “guide was helpful”
  • letter bi-gram: “Th”, “he”, “eN”, “NL”, “LP”, etc.
  • etc.
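
In plain Python, each of these tokenizations is only a line or two; the sketch below reproduces the lists above:

  sentence = "The NLP guide was helpful"

  words = sentence.split()
  bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
  letters = sentence.replace(" ", "")  # drop spaces before taking letter n-grams
  letter_bigrams = [letters[i:i + 2] for i in range(len(letters) - 1)]

  print(words)           # ['The', 'NLP', 'guide', 'was', 'helpful']
  print(bigrams)         # ['The NLP', 'NLP guide', 'guide was', 'was helpful']
  print(letter_bigrams)  # ['Th', 'he', 'eN', 'NL', 'LP', ...]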

Stemming

Stemming is one way to standardize words so that different forms of a word are turned into the same token. For example, if you have “run”, “running”, and “ran” in your text, you may want to convert them into one token because they all convey the same meaning. To accomplish that standardization, stemming cuts off the prefixes and suffixes of words, typically on the basis of a syllable-driven splitting algorithm. So:

  • run | -> run
  • run | ning -> run
  • ran | -> ran

In this example, “ran” would not be appropriately converted. Most rule-based stemmers will miss irregular forms like this one, because stemming trades precision for speed in most cases. If your corpus contains language that is not likely to be handled well by these common rules, it may be best to look at lemmatization instead.
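
A sketch using NLTK’s Porter stemmer shows the same behavior as the example above: “running” is reduced to “run”, but the irregular form “ran” is left untouched:

  from nltk.stem import PorterStemmer

  stemmer = PorterStemmer()
  for word in ["run", "running", "ran"]:
      print(word, "->", stemmer.stem(word))
  # run -> run
  # running -> run
  # ran -> ran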

Lemmatization

Lemmatization aims to accomplish the same ends as stemming (standardizing tokens), but it is typically more comprehensive, at the cost of speed and computational usage. While some lemmatization algorithms start with a dictionary lookup for common words, many then take another pass at the input using part-of-speech tagging (see below) and language rules to catch words that the dictionary strategy misses. The algorithm then tries to reduce each word to its “lemma,” or base form, sometimes changing words entirely to achieve this. For this reason, lemmatization will catch words like “ran,” where the required change is more than the removal of a prefix or suffix.

  • run -> run
  • running -> run
  • ran -> run
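
As a sketch with NLTK’s WordNet lemmatizer (spaCy’s lemmatizer behaves similarly), supplying the part of speech lets “ran” be mapped back to “run”; the download call fetches the WordNet data on first use:

  import nltk
  from nltk.stem import WordNetLemmatizer

  nltk.download("wordnet", quiet=True)  # fetch WordNet data on first use

  lemmatizer = WordNetLemmatizer()
  for word in ["run", "running", "ran"]:
      print(word, "->", lemmatizer.lemmatize(word, pos="v"))  # pos="v" marks these as verbs
  # run -> run
  # running -> run
  # ran -> run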

Part-of-Speech Tagging

Part-of-speech (POS) tagging is just what it sounds like: the process goes through the words in your corpus and tags them with metadata, indicating whether those words are nouns, verbs, adjectives, etc. As an NLP tool, POS tagging goes into much finer detail, with one of the more common lists of tags being the Penn Treebank. Different NLP tools will use different tag lists, so make sure to research which one will be applied to your corpus.
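
As a sketch, NLTK’s default tagger returns Penn Treebank tags (exact resource names and tags may vary slightly across NLTK versions):

  import nltk

  nltk.download("punkt", quiet=True)                       # tokenizer model
  nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

  tokens = nltk.word_tokenize("The NLP guide was helpful")
  print(nltk.pos_tag(tokens))
  # roughly: [('The', 'DT'), ('NLP', 'NNP'), ('guide', 'NN'), ('was', 'VBD'), ('helpful', 'JJ')]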

Natural Language Processing (NLP) Measures

While the ways to measure text are as numerous as texts themselves, there are a few measures that are commonly used. We will list them here and provide a brief explanation of their utility.

Word Frequency

Word frequency is the simplest measure of text. It combines all of your textual data into a single, un-ordered “bag of words” and counts how often each word appears. The basic idea is that if a word occurs more frequently, it will be more important.
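
A bag-of-words count takes only a few lines with Python’s collections.Counter:

  from collections import Counter

  text = "the cat sat on the mat and the dog sat too"
  counts = Counter(text.lower().split())
  print(counts.most_common(3))  # [('the', 3), ('sat', 2), ('cat', 1)]

Note that the top word is a stopword, which is one reason stopword removal (above) matters for frequency-based measures.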

Co-occurrence

Co-occurrence of specific terms may be of interest for your question. For example, if you are interested in how a policy is portrayed in newspapers, you may look for the co-occurrence of that policy name and reaction words such as “support” or “oppose.”

Named Entity Recognition (NER)

Named Entity Recognition (NER) searches through a text to find proper nouns, concepts, and other such words that can help us understand what a text is about. A common use is to search through a piece of text for proper nouns, time words (times and dates), and locations to understand the context of the text. This can be helpful for categorizing texts based on the subjects included in them. These tags can also be used to build further models, for example by counting the number of business entities in a piece of text.
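
As a sketch with spaCy, assuming its small English model has been installed separately (python -m spacy download en_core_web_sm):

  import spacy

  nlp = spacy.load("en_core_web_sm")  # small English pipeline, installed separately
  doc = nlp("UC Davis hosted a workshop in Sacramento on March 3, 2020.")

  for ent in doc.ents:
      print(ent.text, ent.label_)
  # likely output: UC Davis ORG / Sacramento GPE / March 3, 2020 DATE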

Sentiment Analysis

Sentiment analysis looks through a text and tries to apply numerical scores to how positive or negative it thinks the text is. Typically, higher scores mean a happier, or more positive, piece of text, while negative scores can indicate anger or other negative emotions. More nuanced scoring schemes exist that instead count the number of words with specific emotional meanings, such as joy, fear, or hate. Most sentiment analyses are rather naive, simply counting the number of words with emotional weight. Some more advanced methods are becoming popular, the most crucial of which allow negation words to affect the score (note that many negation words are also stopwords, which adds a layer of complexity). For example, a simple sentiment analysis might consider “I am not happy” to be positive, rather than negative, if it could not understand the negation of “not” in that sentence.
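
As a sketch, NLTK’s VADER analyzer is a lexicon-based scorer that does handle simple negation; the download call fetches its lexicon on first use:

  import nltk
  from nltk.sentiment import SentimentIntensityAnalyzer

  nltk.download("vader_lexicon", quiet=True)

  sia = SentimentIntensityAnalyzer()
  print(sia.polarity_scores("I am happy"))      # positive compound score
  print(sia.polarity_scores("I am not happy"))  # negation flips the compound score negative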

Term Frequency Inverse Document Frequency (TF-IDF)

A term frequency inverse document frequency (TF-IDF) matrix creates a weighted score for how important a term is to each document in a corpus. The score combines two attributes. The first is the term frequency (TF), or how often the word appears within that document. It makes intuitive sense that if a word appears many times in a document, the document is about something related to that term. The second attribute is the inverse document frequency (IDF), a measure of what proportion of the documents the word appears in. If a word appears in all documents, its weight should be reduced. Conversely, if a word appears in only a few documents, it should be highly weighted for those documents.

The formula for TF-IDF is: tfidf(t,d,D) = tf(t,d) * idf(t,D)
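
In scikit-learn, the TfidfVectorizer computes this weighting directly from raw documents; a sketch with three toy documents:

  from sklearn.feature_extraction.text import TfidfVectorizer

  docs = [
      "the university of california",
      "the university announced funding",
      "california weather is sunny",
  ]

  vectorizer = TfidfVectorizer()
  tfidf = vectorizer.fit_transform(docs)  # documents x terms sparse matrix

  print(vectorizer.get_feature_names_out())
  print(tfidf.toarray().round(2))  # widely shared terms like "the" receive lower weights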

Topic Models

Topic models use a document-term matrix, such as the TF-IDF matrix discussed above, to cluster a corpus by perceived topics, which closely correspond to what human readers would call themes. The generated topics are clusters of tokens that frequently co-occur, with each topic providing a score for how prevalent that topic is for each document in the corpus. This allows you to see which topics are most common and which documents are exemplars of a particular topic. Prevalent topics can then be used as another data point for further analyses, such as showing how topic importance changes over time, between authors, or after significant events.
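
As a sketch, scikit-learn’s NMF (one common topic-modeling algorithm; LDA and others work along the same lines) factors a TF-IDF matrix into topics and per-document topic scores; the toy documents below are invented for illustration:

  from sklearn.decomposition import NMF
  from sklearn.feature_extraction.text import TfidfVectorizer

  docs = [
      "the budget and taxes dominated the debate",
      "taxes and spending shaped the state budget",
      "the team won the championship game",
      "fans celebrated the championship win",
  ]

  tfidf = TfidfVectorizer(stop_words="english")
  matrix = tfidf.fit_transform(docs)

  nmf = NMF(n_components=2, random_state=0)  # ask for two topics
  doc_topics = nmf.fit_transform(matrix)     # documents x topics scores

  terms = tfidf.get_feature_names_out()
  for i, topic in enumerate(nmf.components_):
      top_terms = [terms[j] for j in topic.argsort()[::-1][:3]]
      print("topic", i, ":", top_terms)
  print(doc_topics.round(2))  # how strongly each document loads on each topic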

Word Embeddings

Word embeddings extend the logic of topic modeling with contextual information about words in a corpus. Embeddings are numerical representations of words, generated by counting each word in the context of all other words in a corpus. This comparison allows words to be “mapped” as vectors in a high-dimensional space, wherein words clustered together have similar meanings. While very abstracted from the text, this method has achieved some very interesting results. It has also shown how bias is baked into language, revealing how some words are implicitly gendered.
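
A sketch with gensim’s Word2Vec (gensim 4.x API); the three token lists below stand in for a corpus, which in practice would need to be far larger for the vectors to be meaningful:

  from gensim.models import Word2Vec

  # Toy corpus: each document is a list of tokens. A real corpus would be much larger.
  sentences = [
      ["the", "university", "of", "california", "davis"],
      ["the", "university", "announced", "new", "research", "funding"],
      ["california", "lawmakers", "discussed", "university", "funding"],
  ]

  model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)

  print(model.wv["university"][:5])                   # first few dimensions of one word vector
  print(model.wv.most_similar("university", topn=2))  # nearest neighbors in the embedding space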

Tools for NLP

The following are some tools you can use for NLP. Many of them have been implemented as packages in larger data analysis suites like Python and R, though there are also standalone tools. If the latter is the case, the tool is marked with an asterisk (*) next to its name. We will first go over tools within Python and R, and then highlight some standalone applications that might be helpful if you are not familiar with either of those languages.

Programmatic Tools

These tools are packages within established programming languages, namely R or Python. Most of the workshops at the DataLab use one of these languages, and they are very common among academics. Learning these tools will first require knowledge of the language they are implemented in, however.

Python NLP Tools

Natural Language Toolkit (NLTK)

Natural Language Toolkit (NLTK) has been a staple for NLP for years. It is a fully featured environment for text collection, cleaning, and some modeling. Because of its prevalence, there are numerous guides online on how to use the tools NLTK provides.

SpaCy

SpaCy is relatively new but has been rapidly gaining popularity. It aims to be a fully featured suite of tools like NLTK, but builds on an updated (and much faster) architecture.

Gensim

The Gensim library offers numerous methods for producing advanced linguistic models, including topic models and word embeddings. While it offers minimal support for features like POS tagging, the package’s expansive offering of models has made it a common tool for higher-level text analysis.

scikit-learn

Among scikit-learn’s many features, it offers several easy-to-use text analysis methods, ranging from quickly producing TF-IDF scores to topic modeling. An additional advantage of using these methods is that they play nicely with the larger scikit-learn ecosystem.

R NLP Tools

quanteda

quanteda is a newer text analysis suite for R, similar to NLTK or SpaCy for Python. You can perform data management, cleaning, and analysis through the main tool and its companion packages.

Spacyr

Spacyr is an R wrapper around the Python SpaCy suite. It ports many of the SpaCy features to R but ultimately relies on Python running in the background.

text2vec

text2vec is primarily used for text modeling. It supports topic modeling, word embeddings, and text similarity testing. It is built on C++ and is thus fast and memory efficient.

tidytext

tidytext is the tidyverse entry into R NLP. It works well with other tidyverse packages and is notable for its built-in topic model capabilities.

tm

tm is a classic NLP package for R and is often used as a dependency for other packages. It can be used to ingest text data into R from a variety of sources, create a unified corpus object, and perform large-scale transformations. You can perform bulk cleaning such as stripping whitespace or lowercasing, or create your own cleaning functions. You can also apply transformations such as TF-IDF weighting.

Text Data Gathering Tools

Digital Methods Initiative

Digital Methods Initiative offers a suite of tools for collecting and analyzing internet data, all created at the Digital Methods Initiative at the University of Amsterdam. Tools are generally web-based and user-friendly, where researchers can input URLs they are interested in studying. DMI also provides various tools to analyze, organize, and visualize the collected internet data. All tools include instructions and example scenarios of use. They have tools in a variety of realms, including media analysis (media monitoring, mapping, clouding, comparative analysis), data treatment and visualization, the natively digital (The Link, URL, Tag, Domain, PageRank, etc.), specific devices/websites (Google, Yahoo, Wikipedia, Alexa, IssueCrawler, Twitter, Facebook, Amazon, iTunes, Wayback, YouTube, Instagram, Github, etc.), and spherical studies (including the web, news, tag, video, image, code, and blogo-spheres).

Text Data Cleaning/Prep Tools

The Sourcecaster

The Sourcecaster collects instructions and tutorials for using the command line for common text cleaning tasks. It is useful and time-saving for people already familiar with the command line, though it relies on several dependencies (brew, port, wget, tesseract, etc.). It may have a steeper learning curve for people who are not tech savvy.

AntSoftware

AntSoftware has several tools for cleaning and prepping text documents, such as AntCorGen (a corpus creation tool), AntFileConverter (converts PDF and Word files to plain text), AntFileSplitter (a file splitting tool), AntGram (an n-gram and p-frame generation tool), EncodeAnt (detects and converts character encodings), SarAnt (a batch search and replace tool), and VariAnt (a spelling analysis program). All programs are free and easily downloaded, and the site includes step-by-step screenshot help pages.

Text Analysis

BookNLP

BookNLP is a natural language processing pipeline developed to run on books and other long (English only) documents. It uses the Stanford Core NLP software, so it requires Java and use of the command line or more advanced programming skills. It includes the ability to do part-of-speech tagging, dependency parsing, named entity recognition, character name clustering, quotation speaker identification, pronominal coreference resolution, and supersense tagging.

FACTORIE

From the command line, FACTORIE provides a framework for probabilistic graphical models and can do a wide variety of text analysis tasks, such as topic modeling, document classification, and part-of-speech tagging. The tutorials and how-to guides are mostly code, so it is recommended for users already familiar with at least one programming language. The last update to the software was in 2014.

LDAvis *

LDAvis is a commonly used tool for exploring topic models and understanding the relationship between topics. It has implementations in many coding languages, notably R and Python.

Macro-Etymological Analyzer *

Macro-Etymological Analyzer is a Python tool for text analysis that tracks the etymological origins of the words in a text based on language family. The tool was recently updated to analyze any number of texts in 250 languages. It outputs many useful statistical descriptions of the results and can be used alongside other NLP methods such as topic modeling. It is a very useful tool for people interested in historical word usage and linguistic change.

Signature Stylometric System

Signature Stylometric System is a tool for stylometric analysis and comparison of texts, with an emphasis on author identification. The tool is well documented and seemingly well designed and implemented. Several corpora are available, with examples of usage on the website. It seems like a good starting point for any kind of effort toward author attribution, and the program is downloadable as a self-extracting ZIP file.

Sentiment 140

Sentiment 140 is an API for sentiment analysis on tweets; it can process batches of around 5,000 tweets per minute and was built using machine learning algorithms. The tool is well documented, and it seems easy to interface with and to parse the results. Users can upload JSON- and CSV-formatted data, and use of the tool requires registration.

Stanford NER *

Part of Stanford Core NLP, Stanford NER is a Java implementation, with a web interface for demonstrations, of Stanford’s named entity recognition and extraction. Users can get the names, dates, places, organizations, and other entities from texts in many languages. Users can try out the software on a web interface to determine if the tool is useful, and then download the source code to their own machine. The web page provides step-by-step instructions to download and begin using the software via the command line. It has recently been extended with bindings in many languages other than Java, such as Python, PHP, Perl, JavaScript, etc.

Stanford Sentiment Analysis *

Part of Stanford Core NLP, Stanford Sentiment Analysis is a Java implementation, with a web demo, of Stanford’s model for sentiment analysis. The model uses sentence structure to quantify the general sentiment of a text, based on a type of recursive neural network trained on Stanford’s Sentiment Treebank dataset. This is a very useful tool and a much more up-to-date and robust model for sentiment analysis than previous naive models that simply sum the scores of the words in the text without regard for order or grammatical nuance. Users can try a web-based live demo to see if the tool is useful for them, and then download the source code via the command line.

TextPlot *

TextPlot is a program that converts a document into a network of terms based on co-occurrences, with the goal of teasing out information about the high-level topic structure of the text. For each term, users can get the set of offsets in the document where the term appears and use kernel density estimation to compute a probability density function (PDF) that represents the word’s distribution across the document. Users can also compute the Bray-Curtis dissimilarity between one term’s PDF and the PDFs of all other terms in the document to measure the extent to which two words appear in the same locations. Output includes various ways of graphing these measures, including an overall word network. Users can install the source code from Github and use the program via the command line or Python.

Text Corpus Exploration Software

Overview

Overview is a document mining application to read and analyze sets of documents, including full text search, visualizations, entity detection, topic clustering, and more. The tool includes built-in OCR, document annotation, and word clouds. It also includes tagging and metadata support, and users can write their own plugins using the API. Users can run the program on the public Overview server or download source code from Github to run it on their own machines via command line.

AntConc

AntConc is a useful tool for exploratory natural language processing tasks, including clusters/N-grams, word frequency lists, and keywords. The software is well documented with several tutorials and is easy to set up and use. AntConc is free and requires no downloading; it will launch from the web. It has seven tools users can access:

  • Concordance tool: views results in a KWIC (key word in context) format to see commonly used words and phrases in a corpus of texts
  • Concordance Plot tool: plots the search results in a barcode format
  • File View tool: shows the text of individual files
  • Clusters/N-grams tool: shows clusters based on a search condition, or scans the corpus for clusters of length N
  • Collocates tool: shows the collocates of a search term to investigate non-sequential patterns in language
  • Word List tool: a frequency distribution of all words in the corpus
  • Keywords List tool: shows which words are unusually frequent in the corpus in comparison to a reference corpus

WordHoard

WordHoard is a powerful tool for anyone interested in understanding classic literary texts and the word usage they contain. It implements many core natural language processing methods, designed with close reading in mind and working towards meticulous tagging and metadata retrieval in texts. Very well documented with many examples, this tool is easy to customize through a well-designed and well-explained scripting system. The current version includes the entire canon of Early Greek epic in the original and in translation, as well as all of Chaucer, Shakespeare, and Spenser.

Sketch Engine

Sketch Engine is a subscription-based tool aimed at non-programmers that analyzes large corpora to look for common words, phrases, or phenomena, and can perform various types of text analysis, including co-occurrence analysis, term extraction, frequency lists, and part-of-speech tagging. The tool includes a 30-day free trial and then has various per-month prices for different types of researchers (academic and non-academic). Users can upload text files (including metadata) or use the automated built-in tool for downloading relevant texts from the web. The tool is built (and personalized) for translators, terminologists, lexicographers, teachers, students, and historians.

DfR-Browser

DfR-Browser is a tool to visualize and explore output from LDA topic models. Topics can be viewed in grid, scaled, list, and stacked layouts. Within topics, users can view frequency distributions, topic usage through time, and doc-topic matrices. Users can also access a word index to explore single words across the corpus, as well as topic distributions within documents. The site clearly explains what the tool does, and users can easily download the software via a .zip file.

Text Corpus Exploration Web Apps

WordandPhrase

WordandPhrase is a web interface tool for finding key words and phrases in a text based on their frequency. Users paste text into the interface and get a visualization of collocations and co-occurrence matrices, as well as the ability to import and export the data. The site also provides key word in context views and other statistics about the word usage in the source text. However, in its current state, there is no easy way to use the tool on a large corpus, so it is limited in that respect.

Lexos

Lexos is an online tool for cleaning and processing text, and users can upload data in various formats, including .txt, .html, .xml, .sgml, and .lexos. There is also a beta version where users can use a web scraper to upload data. The tool can clean and separate the uploaded text into chunks and then provides several visualization tools to run on the processed text, including word clouds, networks, and word collocation. Once uploaded, users can manage document lists, sorting them by labeled classes, sources, and excerpts. Text cleaning tools include removing punctuation, digits, whitespace, and stop words, as well as lemmatization, tokenization, creating a document-term matrix, etc. After visualization, the tool also offers several options for text analysis, including hierarchical and k-means clustering, document similarity comparisons, and topwords. Users can export the data as CSV or TSV files.

WordTree

WordTree is an online tool for visualizing the contexts in which certain phrases appear, to get a sense of the overall structure of the text. The tool provides a novel way of searching a text based on certain phrases or words that appear within it. It seems most useful for short texts rather than a full corpus, and it does not provide any analysis output beyond exploring your text visually.

WordSeer *

WordSeer is a Python webapp for textual analysis that combines visualization, information retrieval, sensemaking, and natural language processing. After installing the source code via the command line, use of WordSeer requires registration, and tools for collaboration are provided. Many standard NLP tools and visualizations are available through the web interface. Users can try out the tool on a test collection before uploading their own documents to see if the tool is right for them.

Voyant-Tools

Voyant-Tools is a web environment that implements many core NLP methods. Users paste in URLs or text, and the interface provides various text exploration outputs, including a word cloud, relative frequency graph, word correlations, vocabulary density, and more. Data can be exported; however, there is a limit on the size of the corpus you can process.

In-Browser Topic Modeling

The goal of In-Browser Topic Modeling is to make topic modeling accessible to users not familiar with R or Mallet. Users can run the program from a web browser, or download from Github, and can explore pre-loaded data or upload their own dataset (in CSV format). The UI allows users to explore topics by documents, correlations, and time series. Users can also explore the corpus vocabulary by frequency and topic specificity and can easily create custom stopword lists. The best feature of this program is that users can also download CSV formatted files of their topic model results, including document topics, topic words, topic summaries, topic-topic connections, doc-topic graphs (for Gephi) and a complete sampling state.

Qualitative Data Analysis (QDA) Suites

Qualitative Data Analysis (QDA) tools are often used by researchers to code and manage interviews or other textual documents. The advantage of these software suites is that they typically do not require coding knowledge. However, this often means they are commercial products with subscription fees or licensing agreements. The University of Surrey keeps an up-to-date list of many QDA platforms, as well as their relative advantages and disadvantages. You can see their list of QDA software options here.

Dataset Sources

Chronicling America

Chronicling America provides an API to download relevant digitized historical newspapers from the US. Compared to many other methods, this makes it easy to access large collections of digitized text for free.

Digital Humanities Toychest

The Digital Humanities Toychest maintains a list of several corpora that are ready for analysis and are good places to start your first NLP project. The website as a whole also presents several resources, similar to those here, that readers may find helpful.

HathiTrust

HathiTrust is a web portal to many databases of texts, books, and images. Users can download text versions of many of the works, and the site includes works in many languages. HathiTrust interfaces with WorldCat and other tools for finding alternate versions of the text.

Novelties

Poem Viewer

Poem Viewer is a web tool for visualizing the features, form, and style of poems. This is one of the most comprehensive and well-built tools for visualizing the many attributes of poems: users can visualize phonetic units, attributes, features, and relations, as well as word units and attributes, and semantic relations.

Nonsense Laboratory

Nonsense Laboratory has a number of machine-learning-powered language novelties, like mixing together words to create new ones, or finding new ways to spell words that elicit different accents when spoken.

WordWanderer

WordWanderer is an experimental web tool for visualizing an imported text. The web interface loads all words in the corpus and provides clickable and draggable options for users to explore the context of words and the relationships between them. It provides no output or analysis function beyond exploring your text visually.

NLP in Action

What Every1 Says

What Every1 Says is a meta-study using humanities and text methods to study the humanities. The link here shows off their gallery of one-page reports, many of which rely on NLP for their findings.

Quintessence

Quintessence seeks to add state-of-the-art data analysis and dynamic corpus exploration to the study of Early Modern period English texts. This project currently uses a corpus of approximately sixty thousand texts from the Early English Books Online (EEBO) Text Creation Partnership. Each text is standardized using Northwestern University’s MorphAdorner, which accounts for spelling changes over time. Any scholar interested in the archive can use Quintessence to run analyses ranging from individual word meanings to broad textual themes. The ability to add more collections of texts is under active development.

Contributions

This research toolkit is maintained by the UC Davis DataLab, and is open for contribution. See how you can contribute on the Github repo.

This toolkit has been made possible thanks to contributions by: