Introduction¶
Learning Objectives
By the end of this session you will be able to:
explain what Optical Character Recognition (OCR) is and when to use it
use Tesseract OCR software through a Python wrapper called pytesseract
understand the main ways of configuring Tesseract, including engine selection, language, and page layout
extract and parse layout information including bounding boxes from images
identify common problems in images that may lead to poor quality OCR
implement some common text processing techniques in Python
This workshop introduces the steps involved in performing Optical Character Recognition with Python. It covers common image preprocessing strategies for improving the quality of OCR output, a walk-through of Tesseract usage through pytesseract, strategies for quantifying OCR quality, ways to clean the text output, and a discussion of how to develop an effective and reproducible workflow for performing OCR on a text corpus.
The workshop focuses on images that contain printed (not handwritten) English-language text, and on the open source Tesseract library. Because this is an introductory workshop, it concentrates on common OCR problems; it does not cover training OCR models or complex heuristics for document segmentation, and it is not a deep dive into computer vision. It also assumes that your data is already digitized, albeit as images rather than machine-encoded text, and therefore does not cover strategies for scanning or imaging physical documents. Overall, learners should walk away with a sense of how to use Tesseract OCR and an understanding of the steps in a potential research workflow built around it, including image preprocessing and text cleaning.
OCR allows us to convert images of text documents into machine-encoded text. There are many reasons to use OCR for research. Often the documents most relevant to your research question are not available in a digitized form. Or a scanned version of the document exists online, but it doesn’t contain a machine-encoded text layer. OCR can be useful even when a document does have a text layer, if the quality of that layer is poor: perhaps it was generated with an older version of OCR software, or it was transcribed manually and contains many errors. OCR is also useful for analysis that relies on the spatial position of text on the page, which is lost in a purely machine-encoded text layer. For example, if you want to extract all the footnotes from an academic paper, you can use OCR and leverage the positional data it returns.
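To make the idea of positional data concrete, here is a minimal sketch of how word positions might be used to pick out text near the bottom of a page, where footnotes usually sit. It assumes data shaped like the dictionary that pytesseract's image_to_data returns when called with output_type=pytesseract.Output.DICT. The sample values, the function name words_in_bottom_region, and the 15% cutoff are all invented for illustration; real footnote detection would need more careful rules.

```python
# Hand-made sample shaped like the dictionary returned by
# pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT).
# The values are invented for illustration; they are not real OCR output.
sample = {
    "text":   ["", "Introduction", "Main", "body", "1", "Smith", "2019"],
    "conf":   [-1, 96, 95, 93, 88, 91, 90],
    "left":   [0, 100, 100, 180, 100, 120, 200],
    "top":    [0, 80, 200, 200, 920, 920, 920],
    "width":  [800, 240, 60, 60, 10, 70, 60],
    "height": [1000, 30, 20, 20, 14, 14, 14],
}


def words_in_bottom_region(data, page_height, fraction=0.15, min_conf=0):
    """Return (text, left, top, width, height) for each word whose
    bounding box starts in the bottom `fraction` of the page."""
    cutoff = page_height * (1 - fraction)
    words = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        if not text.strip():
            continue  # skip page, block, and line entries that carry no text
        if float(conf) < min_conf:
            continue  # skip low-confidence words
        if top >= cutoff:
            words.append((text, left, top, width, height))
    return words


footer_words = words_in_bottom_region(sample, page_height=1000)
print(" ".join(word[0] for word in footer_words))
```

With a real image you would replace the sample dictionary with the output of image_to_data and pass the image's height in pixels as page_height.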
For this workshop we will be using Tesseract. Tesseract is free and open source. Its development has long been sponsored by Google, and it has many independent contributors. At the time of writing, the most recent release is version 5.1.0, released on March 1, 2022. Tesseract is a command line tool as well as an OCR library, and it runs on Windows, macOS, and Linux. In addition to being free, Tesseract has several important advantages over proprietary OCR tools. It is extremely popular, which makes it relatively easy to find resources online, such as guides and Stack Overflow posts. There are also many third-party tools that interface with Tesseract, including GUIs.
While Tesseract is a command line tool, in this workshop we will run it through a Python wrapper called pytesseract. This means that we will write Python code that runs Tesseract behind the scenes. One reason for running Tesseract through Python instead of the command line is that it can make for more reproducible workflows: all of the steps are documented within your source code, and it’s easier to connect the OCR step with the other steps in the process, such as preprocessing and cleaning, which will also be done in Python.
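As a rough preview of what running Tesseract through pytesseract looks like, here is a minimal sketch. The function name ocr_image and the file name page_001.png are placeholders invented for this example. The imports sit inside the function so the snippet can be read and loaded before the software is installed; in your own scripts you would normally put them at the top of the file.

```python
from pathlib import Path


def ocr_image(path, lang="eng", config=""):
    """Run Tesseract on a single image file and return the recognized text."""
    # Imported here so that defining the function does not require
    # Pillow or pytesseract to be installed yet.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        # pytesseract hands the image to the tesseract program
        # and collects the text it produces.
        return pytesseract.image_to_string(image, lang=lang, config=config)


if __name__ == "__main__":
    sample = Path("page_001.png")  # placeholder: point this at one of your own images
    if sample.exists():
        print(ocr_image(sample))
    else:
        print(f"{sample} not found; replace it with the path to an image of printed text")
```

The lang and config arguments are passed through to Tesseract, so a call such as ocr_image("page_001.png", config="--psm 6") asks Tesseract to treat the page as a single uniform block of text.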
Prerequisites¶
For this course you will need Python, Tesseract, and several Python packages. While it is possible to install each of these individually, we recommend using Miniconda.
Miniconda is a minimal version of Anaconda, so if you already have Anaconda installed on your system, you can continue using that. Anaconda is a collection of free and open-source data science software. Part of Anaconda is conda, a tool for installing and managing software on your system. We will use conda to install the software for this workshop. Both Miniconda and Anaconda provide conda. Set up Miniconda by downloading and running the appropriate installer for your operating system: navigate to the download page and select the most recent version for your system.
In addition, we will be using JupyterLab. JupyterLab is an IDE (integrated development environment) that runs in your browser. In this environment we can install packages, run Python code interactively, interface with the command line tools we will be using, and manage our files.
Setup using Conda¶
After installing, verify that conda is set up by opening a terminal and running conda. On Windows, this can be done by searching for ‘Anaconda Prompt’ in your start menu. On macOS, open the ‘Terminal’ application, which can be found in Applications -> Utilities -> Terminal.app. On Linux, open your favorite terminal emulator. With the terminal open, type the following command and press Enter:
conda --version
If this displays a line of information about the version, you are good to go. For example, on my machine I see conda 4.12.0.
With conda installed, we can install the other software for this workshop with the following command. Copy and paste this command into your terminal and press Enter:
conda install -c conda-forge tesseract poppler pytesseract PyPDF2 stop-words pdf2image pandas jupyter jupyterlab
Note
This command will install the listed software on your system. The -c conda-forge option tells conda to use the conda-forge channel. A channel is a repository containing packages, and conda can interface with many different channels. The conda-forge channel is a community-contributed channel with many additional packages that aren’t provided by Anaconda’s default channel.
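Once the install command has finished, one way to confirm that everything is in place is a short Python check like the sketch below. It is not part of the workshop materials. It assumes that the tesseract and pdftoppm programs (the latter is provided by poppler) should be on your PATH, and that the packages are imported under the module names listed; the function name check_setup is made up for this example.

```python
import importlib.util
import shutil

# Command line programs we expect conda to have put on the PATH.
COMMANDS = ["tesseract", "pdftoppm"]

# Import names of the Python packages (note stop-words is imported as stop_words).
MODULES = ["pytesseract", "PyPDF2", "stop_words", "pdf2image", "pandas", "jupyterlab"]


def check_setup():
    """Return a dict mapping each program or module name to True if it was found."""
    status = {}
    for command in COMMANDS:
        status[command] = shutil.which(command) is not None
    for module in MODULES:
        status[module] = importlib.util.find_spec(module) is not None
    return status


for name, found in check_setup().items():
    print(f"{name:12} {'found' if found else 'MISSING'}")
```

If anything is reported as MISSING, make sure you are running Python from the same conda installation you used for the install command, then try the command again.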
Tip
Conda can also be used to manage multiple environments. See this section of DataLab’s intermediate Python reader for more information.