2 Intro to Text Processing and NLP for Health Data

A great deal of health data exists as unstructured text, for example:

  • EMR clinical notes
  • Medical journal literature
  • Surveys and questionnaires
  • Interviews
  • Community forums
  • Social media

The goal is to take text data -> convert it into a form that can be analyzed (numbers) -> run some analysis -> interpret the results.
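A minimal sketch of that pipeline in Python, using invented note-like snippets (not real clinical data) and plain word counts as the "analysis":

    from collections import Counter
    import re

    # Invented note-like snippets for illustration only.
    notes = [
        "Patient reports chest pain and shortness of breath.",
        "No chest pain today; breathing has improved.",
        "Follow-up visit for hypertension and chest discomfort.",
    ]

    # Text -> numbers: lowercase, pull out word tokens, count them.
    tokens = [t for note in notes for t in re.findall(r"[a-z]+", note.lower())]
    counts = Counter(tokens)

    # Analysis: the most frequent words across the toy corpus.
    print(counts.most_common(5))

    # Interpretation: frequent content words (e.g., "chest", "pain") suggest what
    # the notes are about; in practice, stopwords like "and" would be removed first.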

2.1 General Methods for Analyzing Text

  • word frequency
  • TF-IDF
  • PCA
  • tokenization: splitting text into units such as words (tokens); see the sketch after this list.
  • KWIC: Key Word in Context
  • co-occurrence
  • stemming
  • lemmatization - removing inflectional endings to return the base dictionary form of a word (lemma).
  • bigrams: pairs of adjacent tokens; a bigram model gives the conditional probability of a token given the preceding token.
  • named entity recognition (NER): identifying which items in the text map to proper names (people, organizations, locations).
  • regular expressions (RegEx): pattern matching to find exact words, parts of words, or phrases.
  • feature-based linear classifiers
  • topic modeling
  • word embedding: mapping words or phrases to vectors of real numbers allowing words with similar meaning to have a similar representation.
  • sentiment analysis: using subjective information to determine the “polarity” of a document (e.g., positive or negative reviews).
  • part-of-speech tagging: determining the part of speech of each word in a sentence.
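A quick sketch of several of the methods above using Python and NLTK. This assumes NLTK is installed; the sentence is invented, and the exact names of the NLTK data packages can vary slightly across versions:

    import re
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # One-time setup: fetch the data these steps rely on
    # (package names may differ slightly by NLTK version).
    for pkg in ["punkt", "wordnet", "averaged_perceptron_tagger"]:
        nltk.download(pkg, quiet=True)

    text = "The patients were treated for fevers and are recovering quickly."

    # Tokenization: split the text into word tokens.
    tokens = nltk.word_tokenize(text)

    # Stemming: strip endings with a heuristic (Porter) stemmer.
    print([PorterStemmer().stem(t) for t in tokens])

    # Lemmatization: map each word to its dictionary form (lemma).
    print([WordNetLemmatizer().lemmatize(t) for t in tokens])

    # Bigrams: pairs of adjacent tokens.
    print(list(nltk.bigrams(tokens)))

    # Part-of-speech tagging: label each token with its part of speech.
    print(nltk.pos_tag(tokens))

    # Regular expressions: pattern matching, e.g., any word starting with "fever".
    print(re.findall(r"\bfever\w*", text, flags=re.IGNORECASE))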

2.2 NLP

NLP allows us to use “distant reading” to unlock information from narrative text for extraction and classification tasks:

  • keyword detection
  • topic detection
  • document summarizing
  • document classification
  • document clustering
  • document similarity (see the sketch at the end of this section)
  • speech recognition
  • text translation

Natural Language Processing (NLP) = linguistics + computer science + information engineering + data science. We use NLP to computationally parse and analyze large amounts of natural language data.
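As one concrete example, document similarity (from the list above) is commonly approached by representing each document as a TF-IDF vector and comparing the vectors with cosine similarity. A minimal sketch, assuming scikit-learn is installed and using invented snippets:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Invented note-like snippets for illustration only.
    docs = [
        "Chest pain and shortness of breath on exertion.",
        "Shortness of breath and chest discomfort while climbing stairs.",
        "Annual flu vaccination administered, no adverse reaction.",
    ]

    # Represent each document as a TF-IDF weighted term vector.
    X = TfidfVectorizer().fit_transform(docs)

    # Pairwise cosine similarity: values near 1 indicate similar wording.
    print(cosine_similarity(X).round(2))
    # The first two documents should score higher with each other than with the third.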

2.3 What’s a Natural Language?

“…any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages can take different forms, such as speech or signing. They are distinguished from constructed and formal languages such as those used to program computers or to study logic.” Thanks, Wikipedia.

Thought questions: Are clinical notes a natural language? What about tweets?

2.4 Caveats before we get started

Garbage in, garbage out. Know your corpus. All models are wrong, but some are useful. And others are dangerous.

Natural language data cannot be de-identified.

NLP models are powerful, but they can fail when applied to jargon-heavy, niche domains. Augmenting them with curated medical dictionaries helps, but part-of-speech taggers, for example, are not designed for the health space. This is especially true of classical NLP techniques, although deep learning approaches hold strong promise for working with unstructured health data.