2 Intro to Text Processing and NLP for Health Data
A great deal of health data exists as unstructured text:
- EMR clinical notes
- Medical journal literature
- Surveys and questionnaires
- Interviews
- Community forums
- Social media
The goal is to take text data -> convert it into data that can be analyzed (numbers) -> run some analysis -> interpret the results.
2.1 General Methods for Analyzing Text
- word frequency
- TF-IDF: term frequency-inverse document frequency, which weights words by how distinctive they are to a given document (see the scikit-learn sketch below).
- PCA
- tokenization: splitting text into words, sentences, or other units (several of these preprocessing steps are sketched with nltk after this list).
- KWIC: Key Word in Context
- co-occurrence
- stemming
- lemmatization - removing inflectional endings to return the base dictionary form of a word (lemma).
- bigrams: pairs of adjacent tokens; a bigram language model gives the conditional probability of a token given the preceding token.
- named entity recognition (NER): identifying which items in the text map to proper names (people, places, organizations); see the spaCy sketch below.
- regular expressions (RegEx): pattern matching for exact words, parts of words, or phrases.
- feature-based linear classifiers
- topic modeling
- word embedding: mapping words or phrases to vectors of real numbers, allowing words with similar meanings to have similar representations.
- sentiment analysis: using subjective information to determine the “polarity” of a document (e.g., positive or negative reviews); see the VADER sketch below.
- part-of-speech tagging: determining the part of speech for each word in a sentence (also shown in the spaCy sketch below).
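Several of the preprocessing steps above (tokenization, word frequency, stemming, lemmatization, KWIC) take only a few lines with nltk. This is a minimal sketch, assuming nltk is installed and can download its tokenizer and WordNet data; the clinical sentence is made up for illustration.

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time data downloads (newer nltk releases may ask for
# "punkt_tab" instead of "punkt").
nltk.download("punkt")
nltk.download("wordnet")

text = "The patient reports headaches; headaches worsened after running."

# Tokenization: split the raw string into word tokens.
tokens = word_tokenize(text.lower())

# Word frequency: count how often each token occurs.
print(nltk.FreqDist(tokens).most_common(5))

# Stemming: crude suffix chopping ("running" -> "run").
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])

# Lemmatization: map each token to its dictionary form (lemma);
# by default every token is treated as a noun.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])

# KWIC: print each occurrence of a keyword in its context.
nltk.Text(tokens).concordance("headaches")
```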
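TF-IDF itself is a one-liner with scikit-learn. scikit-learn is an assumed choice here (it is not in this workshop's package list; quanteda and tidytext offer equivalents in R), and the three-document corpus is invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: three short made-up "documents".
docs = [
    "patient denies chest pain",
    "patient reports chest pain and shortness of breath",
    "follow up visit for medication review",
]

# TF-IDF: weight each term by its frequency within a document,
# discounted by how many documents it appears in.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```

Terms shared across documents (like "patient" in the first two) get lower weights; terms unique to a single document weigh more.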
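NER and part-of-speech tagging come built into spaCy's pretrained pipelines. A minimal sketch, assuming the small English model has been fetched with `python -m spacy download en_core_web_sm`; as the caveats below note, general-purpose models like this one are not tuned for clinical text.

```python
import spacy

# Load spaCy's small general-purpose English pipeline.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Dr. Smith saw the patient at UC Davis Medical Center on Tuesday.")

# Named entity recognition: spans of text mapped to entity types
# (e.g., PERSON, ORG, DATE).
for ent in doc.ents:
    print(ent.text, ent.label_)

# Part-of-speech tagging: one coarse tag per token (NOUN, VERB, ...).
for token in doc:
    print(token.text, token.pos_)
```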
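For sentiment analysis, nltk ships the lexicon-based VADER analyzer, which fits the classical (non-deep-learning) scope of this workshop. A minimal sketch; the review sentences are made up.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time download of the VADER sentiment lexicon.
nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()

# polarity_scores returns negative/neutral/positive proportions
# plus a "compound" polarity score in [-1, 1].
print(sia.polarity_scores("The staff were wonderful and caring."))
print(sia.polarity_scores("Terrible wait times and rude reception."))
```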
2.2 NLP
NLP allows us to use “distant reading” to unlock information from narrative text for extraction and classification:
- keyword detection
- topic detection
- document summarization
- document classification
- document clustering
- document similarity (see the sketch after this list)
- speech recognition
- text translation
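Several of these tasks reduce to vector arithmetic once documents are numeric. For document similarity, for example, represent each document as a TF-IDF vector and compare vectors with cosine similarity. A minimal sketch with scikit-learn (again an assumed package choice), reusing the toy corpus from above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: three short made-up "documents".
docs = [
    "patient denies chest pain",
    "patient reports chest pain and shortness of breath",
    "follow up visit for medication review",
]

# Vectorize, then compute all pairwise cosine similarities
# (1.0 on the diagonal: every document matches itself exactly).
X = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(X).round(2))
```

The first two notes share "patient", "chest", and "pain", so their similarity is highest; the third shares no terms with either, so its off-diagonal entries are 0.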
Natural Language Processing (NLP) = linguistics + computer science + information engineering + data science. We use NLP to computationally parse and analyze large amounts of natural language data.
2.3 What’s a Natural Language?
“…any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages can take different forms, such as speech or signing. They are distinguished from constructed and formal languages such as those used to program computers or to study logic.” Thanks, Wikipedia.
Thought questions: Are clinical notes a natural language? What about tweets?
2.4 NLP Trends
- 1950s - 1970s Using our understanding of language to develop rules that we feed into computer programs.
- 1980s - 2000s Classical NLP - Using statistical methods to analyze text corpora.
- 2010s - Further extension of statistical analysis of corpora and developments in deep learning. More and more untagged data is available, driving the need for unsupervised methods.
In this workshop we’re focusing on classical NLP (no deep learning).
2.5 Caveats before we get started
Garbage in, garbage out. Know your corpus. All models are wrong, but some are useful. And others are dangerous.
Natural language data cannot be deidentified.
NLP models are powerful, but they can fail when applied to jargon-heavy, niche domains. Augmenting with established medical dictionaries helps, but part-of-speech taggers, for example, are not designed for the health space. This is especially true of classical NLP techniques, although deep learning approaches hold strong promise for working with unstructured health data.
2.6 Further resources
- DataLab’s NLP Researcher Toolkit
- Curated list of ML and NLP resources for healthcare
- Post on how NLP can help clinicians
Useful packages:
- nltk (Python)
- spaCy (Python)
- quanteda (R)
- tidytext (R)
Classes:
Datasets: