OCR With Pytesseract¶
Setup¶
For this workshop, we will be using a sample set of images prepared to demonstrate some key OCR concepts. Download this zipped folder of images and extract it to a directory where you are keeping your notes.
Start by importing the pandas and pytesseract packages into your Python session with:
import pandas as pd
import pytesseract
We can verify that tesseract is installed, and check which version is being used, with the get_tesseract_version function:
pytesseract.get_tesseract_version()
Next, let's create names for some of the prepared images that we will refer to throughout this workshop.
simple_img = './data/alice_start-gutenberg.jpg'
fr_img = './data/fr_ocr-wikipedia.png'
kor_img = './data/kor_ocr-wikipedia.png'
toc_img = './data/alice_toc-gutenberg.jpg'
two_column_img = './data/two_column-google.png'
Note
Supported Image Formats:
pytesseract can operate on any PIL Image, NumPy array, or file path of an image that can be processed by Tesseract. Tesseract supports most image formats: png, jpeg, tiff, bmp, gif.
Notably, neither pytesseract nor tesseract works on PDF files. In order to perform OCR on a PDF, you must first convert it to a supported image format.
Pytesseract Usage¶
In order to maximize the quality of results from OCR with tesseract, it's often necessary to customize the behavior of the OCR through parameters. With tesseract, you can specify one or more languages you expect in the document, which OCR engine to use, and information about how the text is laid out on the page.
By default, tesseract uses its English training data. Tesseract detects characters and then tries to map each detected character to its closest neighbor. Both of these processes are greatly affected by the assumed language of the text, so tesseract lets you specify the language or languages for the OCR engine to use. Tesseract can also be configured to use different OCR 'engine modes'. This can be very useful when working with software or systems that don't support the newest engines, or when computational performance is a limiting factor. In addition, not all languages have training data for each engine mode. Finally, tesseract supports different assumptions about how the text is laid out on the page: for example, there are options for images expected to contain just a single character, a single line, multiple columns, and several others.
In addition to modifying the behavior of the OCR engine, we can configure the format of the output, including how much information we want about the extracted text, such as its location on the page and confidence values.
Simplest Usage¶
The simplest way to get the text from an image with pytesseract is with pytesseract.image_to_string:
pytesseract.image_to_string(simple_img)
'Chapter 1\n\nDown the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the bank,\nand of having nothing to do: once or twice she had peeped into the book her\nsister was reading, but it had no pictures or conversations in it, ‘and what is\nthe use of a book,’ thought Alice ‘without pictures or conversation?’\n\nSo she was considering in her own mind (as well as she could, for the hot\nday made her feel very sleepy and stupid), whether the pleasure of making a\ndaisy-chain would be worth the trouble of getting up and picking the daisies,\nwhen suddenly a White Rabbit with pink eyes ran close by her.\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, ‘Oh dear! Oh\ndear! I shall be late!’ (when she thought it over afterwards, it occurred to\nher that she ought to have wondered at this, but at the time it all seemed\nquite natural); but when the Rabbit actually TOOK A WATCH OUT OF\nITS WAISTCOAT- POCKET, and looked at it, and then hurried on, Alice\nstarted to her feet, for it flashed across her mind that she had never before\nseen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and\nburning with curiosity, she ran across the field after it, and fortunately was\njust in time to see it pop down a large rabbit-hole under the hedge.\n\nIn another moment down went Alice after it, never once considering how\nin the world she was to get out again.\n\nThe rabbit-hole went straight on like a tunnel for some way, and then\ndipped suddenly down, so suddenly that Alice had not a moment to think\nabout stopping herself before she found herself falling down a very deep well.\n\nEither the well was very deep, or she fell very slowly, for she had plenty\nof time as she went down to look about her and to wonder what was going\n\n13\n'
This returns just a string of all the text detected in the image. Notice that the returned text contains alphabetical characters, digits, and escape characters such as \n, which represents a newline. The entire text has been concatenated into a single Python string, aggregating all the lines and words detected on the page by tesseract.
This is the simplest way to extract the text from an image. When invoked without additional parameters, the image_to_string function uses tesseract's default options.
Language Specification¶
By default, tesseract uses its English training data. This can lead to very poor results if there are non-English characters in the image. This is especially true if the image contains text that doesn't use a Latin alphabet. Let's look at an example using an image containing French text, and another image containing Korean text. As before, we will be invoking the function without any additional parameters:
pytesseract.image_to_string(fr_img)
"Reconnaissance optique de caractéres\n\nLa reconnaissance optique de caractéres (ROC, ou OCR pour l'anglais optical character recognition), ou\nocérisation, désigne les procédés informatiques pour la traduction d'images de textes imprimés ou\ndactylographiés en fichiers de texte.\n\nUn ordinateur réclame pour 'exécution de cette tache un logiciel d'OCR. Celui-ci permet de récupérer le texte\ndans l'image d'un texte imprimé et de le sauvegarder dans un fichier pouvant étre exploité dans un traitement\nde texte pour enrichissement, et stocké dans une base de données ou sur un autre support exploitable par un\nsystéme informatique.\n"
pytesseract.image_to_string(kor_img)
'Bet xt ol}\n\n \n\n‘lvls, $2] 20) sahara\n\n‘BB BAt 24\\(Optical character recognition; OCR) Ateto| M74L} 7| A= last SLO] Bars o|o|x| A.\nMAS 85810} 7/717} AS + Qk= VAS Wetst= AOIch.\n\n0[0|4| Atos YS + We SM] Mt BSS ARE BS 7st BABS S92] BACs Helse 2zE\n\nAOSM Ubos OCRO|A}T SOY, OCRE 2ISAlSOILt 7124] Al2H(machine vision) 2] SP HOFS AlAHE| A\nch\n'
Notice that these results are not ideal. While the French string is quite close, there are a few errors with the accented characters, and the Korean string is essentially unusable.
In order to use OCR on languages other than English, we need to download the language's associated training data for tesseract. The tesseract package we installed from conda-forge comes with most of the language training data. Training data for tesseract can be found at the tessdata GitHub repository.
With pytesseract we can see all the available languages with:
pytesseract.get_languages()
['afr',
'amh',
'ara',
'asm',
'aze',
'aze_cyrl',
'bel',
'ben',
'bod',
'bos',
'bre',
'bul',
'cat',
'ceb',
'ces',
'chi_sim',
'chi_sim_vert',
'chi_tra',
'chi_tra_vert',
'chr',
'cos',
'cym',
'dan',
'deu',
'div',
'dzo',
'ell',
'eng',
'enm',
'epo',
'equ',
'est',
'eus',
'fao',
'fas',
'fil',
'fin',
'fra',
'frk',
'frm',
'fry',
'gla',
'gle',
'glg',
'grc',
'guj',
'hat',
'heb',
'hin',
'hrv',
'hun',
'hye',
'iku',
'ind',
'isl',
'ita',
'ita_old',
'jav',
'jpn',
'jpn_vert',
'kan',
'kat',
'kat_old',
'kaz',
'khm',
'kir',
'kmr',
'kor',
'kor_vert',
'lao',
'lat',
'lav',
'lit',
'ltz',
'mal',
'mar',
'mkd',
'mlt',
'mon',
'mri',
'msa',
'mya',
'nep',
'nld',
'nor',
'oci',
'ori',
'osd',
'pan',
'pol',
'por',
'pus',
'que',
'ron',
'rus',
'san',
'sin',
'slk',
'slv',
'snd',
'spa',
'spa_old',
'sqi',
'srp',
'srp_latn',
'sun',
'swa',
'swe',
'syr',
'tam',
'tat',
'tel',
'tgk',
'tha',
'tir',
'ton',
'tur',
'uig',
'ukr',
'urd',
'uzb',
'uzb_cyrl',
'vie',
'yid',
'yor']
To specify the language to use, pass the name of the language as the lang parameter to pytesseract.image_to_string. Let's rerun the OCR on the Korean image, this time specifying the appropriate language.
pytesseract.image_to_string(kor_img, lang='kor')
'광학 문자 인식\n\n \n\n위키백과, 우리 모두의 백과사전.\n\n광학 문자 인식(20068! 08180 『600901007; 0ㄷㅠ83:은 사람이 쓰거나 기계로 인쇄한 문자의 영상을 이미지 스\n캐너로 획득하여 기계가 읽을 수 있는 문자로 변환하는 것이다.\n\n이미지 스캔으로 얻을 수 있는 문서의 활자 영상을 컴퓨터가 편집 가능한 문자코드 등의 형식으로 변환하는 소프트\n\n웨어로써 일반적으로 0ㄷ이라고 하며, 0은 인공지능이나 기계 시각(07106 1510/의 연구분야로 시작되었\n다\n'
Tesseract supports images that contain multiple languages. We can specify which languages to use by separating them with the + character in the lang string:
pytesseract.image_to_string(kor_img, lang='kor+eng')
'광학 문자 인식\n\n \n\n위키백과, 우리 모두의 백과사전.\n\n광학 문자 인식(20068! character recognition; OCR) 사람이 쓰거나 기계로 인쇄한 문자의 영상을 이미지 스\n캐너로 획득하여 기계가 읽을 수 있는 문자로 변환하는 것이다.\n\n이미지 스캔으로 얻을 수 있는 문서의 활자 영상을 컴퓨터가 편집 가능한 문자코드 등의 형식으로 변환하는 소프트\n\n웨어로써 일반적으로 0ㄷ이라고 하며, OCRE 인공지능이나 기계 Al2H(machine 1510/의 연구분야로 시작되었\n다\n'
Engine Selection¶
Tesseract has several engine modes that can be used. There are two main implementations: the original tesseract engine and, since Tesseract version 4, an LSTM-based OCR engine. In addition, Tesseract supports using a combination of the two. The list of Tesseract's engine modes:
0 = Original Tesseract only.
1 = Neural nets LSTM only.
2 = Tesseract + LSTM.
3 = Default, based on what is available.
By default Tesseract uses mode 3, which is generally equivalent to option 2.
To set the ‘oem’ (OCR engine mode) with pytesseract we pass it as the ‘config’ parameter:
custom_oem_psm_config = r'--oem 1'
pytesseract.image_to_string(simple_img, config=custom_oem_psm_config)
'Chapter 1\n\nDown the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the bank,\nand of having nothing to do: once or twice she had peeped into the book her\nsister was reading, but it had no pictures or conversations in it, ‘and what is\nthe use of a book,’ thought Alice ‘without pictures or conversation?’\n\nSo she was considering in her own mind (as well as she could, for the hot\nday made her feel very sleepy and stupid), whether the pleasure of making a\ndaisy-chain would be worth the trouble of getting up and picking the daisies,\nwhen suddenly a White Rabbit with pink eyes ran close by her.\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, ‘Oh dear! Oh\ndear! I shall be late!’ (when she thought it over afterwards, it occurred to\nher that she ought to have wondered at this, but at the time it all seemed\nquite natural); but when the Rabbit actually TOOK A WATCH OUT OF\nITS WAISTCOAT- POCKET, and looked at it, and then hurried on, Alice\nstarted to her feet, for it flashed across her mind that she had never before\nseen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and\nburning with curiosity, she ran across the field after it, and fortunately was\njust in time to see it pop down a large rabbit-hole under the hedge.\n\nIn another moment down went Alice after it, never once considering how\nin the world she was to get out again.\n\nThe rabbit-hole went straight on like a tunnel for some way, and then\ndipped suddenly down, so suddenly that Alice had not a moment to think\nabout stopping herself before she found herself falling down a very deep well.\n\nEither the well was very deep, or she fell very slowly, for she had plenty\nof time as she went down to look about her and to wonder what was going\n\n13\n'
Note
The r before the string in the above code section tells Python to treat backslashes in the string as literal characters rather than the start of escape sequences. This is different behavior from a regular Python string. In Python, strings prefixed with r are called raw strings.
I recommend using option 1 for the best accuracy, unless you are running into specific constraints: for example, you are using an older version of Tesseract (before version 4, which doesn't have the LSTM option), you are running on a system that doesn't support the LSTM engine (apparently some Android builds), or you are running into performance issues.
Page Layouts¶
Tesseract supports a variety of common Page Segmentation Modes.
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
11 = Sparse text. Find as much text as possible in no particular order.
12 = Sparse text with OSD.
13 = Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
Just like with the OCR engine mode, we set the Page Segmentation Mode as part of the config string.
Automatic page segmentation¶
By default, tesseract will attempt to automatically detect the text layout. If we have prior knowledge of the layout, it's best to specify the mode that is most appropriate.
custom_oem_psm_config = r'--oem 1 --psm 3'
pytesseract.image_to_string(two_column_img, config=custom_oem_psm_config)
'An Overview of the Tesseract OCR Engine\n\nRay Smith\nGoogle Inc.\ntheraysmith@gmail.com\n\nAbstract\n\nThe Tesseract OCR engine, as was the HP Research\nPrototype in the UNLV Fourth Annual Test of OCR\nAccuracy[1], is described in a _ comprehensive\noverview. Emphasis is placed on aspects that are novel\nor at least unusual in an OCR engine, including in\nparticular the line finding, features/classification\nmethods, and the adaptive classifier.\n\n1. Introduction — Motivation and History\n\nTesseract is an open-source OCR engine that was\ndeveloped at HP between 1984 and 1994. Like a super-\nnova, it appeared from nowhere for the 1995 UNLV\nAnnual Test of OCR Accuracy [1], shone brightly with\nits results, and then vanished back under the same\ncloak of secrecy under which it had been developed.\nNow for the first time, details of the architecture and\nalgorithms can be revealed.\n\nTesseract began as a PhD research project [2] in HP\nLabs, Bristol, and gained momentum as a possible\nsoftware and/or hardware add-on for HP’s line of\nflatbed scanners. Motivation was provided by the fact\nthat the commercial OCR engines of the day were in\ntheir infancy, and failed miserably on anything but the\nbest quality print.\n\nAfter a joint project between HP Labs Bristol, and\nHP’s scanner division in Colorado, Tesseract had a\nsignificant lead in accuracy over the commercial\nengines, but did not become a product. The next stage\nof its development was back in HP Labs Bristol as an\ninvestigation of OCR for compression. Work\nconcentrated more on improving rejection efficiency\nthan on base-level accuracy. At the end of this project,\nat the end of 1994, development ceased entirely. The\nengine was sent to UNLV for the 1995 Annual Test of\nOCR Accuracy[1], where it proved its worth against\nthe commercial engines of the time. In late 2005, HP\nreleased Tesseract for open source. It is now available\nat http://code.google.com/p/tesseract-ocr.\n\n2. Architecture\n\nSince HP had independently-developed page layout\nanalysis technology that was used in products, (and\ntherefore not released for open-source) Tesseract never\nneeded its own page layout analysis. Tesseract\ntherefore assumes that its input is a binary image with\noptional polygonal text regions defined.\n\nProcessing follows a traditional step-by-step\npipeline, but some of the stages were unusual in their\nday, and possibly remain so even now. The first step is\na connected component analysis in which outlines of\nthe components are stored. This was a computationally\nexpensive design decision at the time, but had a\nsignificant advantage: by inspection of the nesting of\noutlines, and the number of child and grandchild\noutlines, it is simple to detect inverse text and\nrecognize it as easily as black-on-white text. Tesseract\nwas probably the first OCR engine able to handle\nwhite-on-black text so trivially. At this stage, outlines\nare gathered together, purely by nesting, into Blobs.\n\nBlobs are organized into text lines, and the lines and\nregions are analyzed for fixed pitch or proportional\ntext. Text lines are broken into words differently\naccording to the kind of character spacing. Fixed pitch\ntext is chopped immediately by character cells.\nProportional text is broken into words using definite\nspaces and fuzzy spaces.\n\nRecognition then proceeds as a two-pass process. In\nthe first pass, an attempt is made to recognize each\nword in turn. 
Each word that is satisfactory is passed to\nan adaptive classifier as training data. The adaptive\nclassifier then gets a chance to more accurately\nrecognize text lower down the page.\n\nSince the adaptive classifier may have learned\nsomething useful too late to make a contribution near\nthe top of the page, a second pass is run over the page,\nin which words that were not recognized well enough\nare recognized again.\n\nA final phase resolves fuzzy spaces, and checks\nalternative hypotheses for the x-height to locate small-\ncap text.\n'
Notice that with the default page segmentation mode (fully automatic), tesseract correctly identifies that the lines of text are split between the two columns on the page.
Other PSM options¶
Automatic page segmentation might not always be the best option, and there are cases when we want to use a different page segmentation mode. One consideration is performance: tesseract will run significantly faster on each image with modes that don't require it to estimate a layout. While this is probably not a big consideration when working with a single image, it adds up when working over hundreds or thousands! Additionally, the automatic page layout detection may give results that don't match your expectations. A common case where it's best to explicitly provide a layout option is when working with tabular data.
Tables¶
Let's look at a common failure of automatic page segmentation. This image contains the table of contents page, where the chapter titles are aligned to the left and the page numbers to the right. Automatic page segmentation will separate this into two distinct regions, grouping all the left-aligned text together and then all of the right-aligned text.
custom_oem_psm_config = r'--oem 1 --psm 3'
pytesseract.image_to_string(toc_img, config=custom_oem_psm_config)
'Contents\n\n8\n\n9\n\nDown the Rabbit-Hole\n\nThe Pool of Tears\n\nA Caucus-Race and a Long Tale\nThe Rabbit Sends in a Little Bill\nAdvice from a Caterpillar\n\nPig and Pepper\n\nA Mad Tea-Party\n\nThe Queen’s Croquet-Ground\n\nThe Mock Turtle’s Story\n\n10 The Lobster Quadrille\n\n11 Who Stole the Tarts?\n\n12 Alice’s Evidence\n\n11\n\n13\n\n19\n\n25\n\n31\n\n37\n\n43\n\n51\n\n59\n\n67\n\n73\n\n81\n\n87\n'
If no segmentation mode exactly matches the grouping of text you are looking for, you can manually reconstruct the content using the positional data output by tesseract, as in the sketch below.
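For example, here is a minimal sketch of that idea for the table of contents image. It uses the image_to_data function (covered in detail under Verbose OCR Data below) to get word positions, then rejoins words that sit in roughly the same horizontal band, so each chapter title ends up on the same line as its page number. The 20-pixel band size is an assumption that depends on the image's resolution:
# A minimal sketch: rebuild reading order from word positions.
toc = pytesseract.image_to_data(
    toc_img, config=custom_oem_psm_config, output_type='data.frame'
)
toc = toc.loc[toc["text"].notna()]
# Group words whose vertical positions fall in the same band (20 px here),
# then read each band left to right.
toc["band"] = toc["top"] // 20
lines = (
    toc.sort_values(["band", "left"])
       .groupby("band")["text"]
       .apply(lambda ws: " ".join(str(w) for w in ws))
)
print("\n".join(lines))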
Output Formats¶
So far, we have just been extracting the text as a Python string. However, tesseract and pytesseract support a variety of output options. Some of these options contain more information than can be stored in just a string, which we can use to get more out of our OCR results. Additionally, these output formats can often be interpreted by other software, or may be useful in a bigger pipeline such as a web app.
Pytesseract supports the following output formats, each produced by its own function.
| Output | Function | Return Type | Description |
|---|---|---|---|
| string | image_to_string | str | the extracted text |
| osd | image_to_osd | str | orientation and script as detected by tesseract |
| boxes | image_to_boxes | str | bounding boxes for each character |
| data | image_to_data | str/tsv or df | tab-separated table with boxes, confidences, line numbers |
| alto xml | image_to_alto_xml | str/xml | ALTO XML, a standard for representing OCR and layout data in XML |
| pdf | image_to_pdf_or_hocr (extension='pdf') | binary/pdf | searchable PDF |
| hocr | image_to_pdf_or_hocr (extension='hocr') | str/hocr | hOCR, another standard for representing OCR data as valid HTML |
Note
Each of these functions accepts the lang and config parameters we have already seen. Some have additional parameters, such as image_to_data, which accepts an output_type parameter that we will use later.
In an interactive Python session, we can read the documentation for a function with the help function. See DataLab's introductory Python reader for more information. In this case, we can see the parameters for the different pytesseract functions with:
help(pytesseract.image_to_alto_xml)
Tip
To save these outputs to disk we can use Python file objects. For example, when working with PDFs:
pdfdata = pytesseract.image_to_pdf_or_hocr('img.png', extension='pdf')
with open('output.pdf', 'w+b') as f:
    f.write(pdfdata)
Another example saving the output in ALTO XML format:
xml = pytesseract.image_to_alto_xml('img.png')
with open('output.xml', 'w') as f:
    f.write(xml)
Verbose OCR Data¶
Pytesseract's image_to_data function provides word-level data about the OCR output. Parsing this information can be useful for many types of analyses. By default, the return value is a string containing a table of tab-separated values. However, when pandas is loaded, we can have image_to_data return a pandas DataFrame object by setting the output_type parameter:
data = pytesseract.image_to_data(toc_img, config=custom_oem_psm_config, output_type='data.frame')
data.head()
| | level | page_num | block_num | par_num | line_num | word_num | left | top | width | height | conf | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2480 | 3507 | -1.000000 | NaN |
| 1 | 2 | 1 | 1 | 0 | 0 | 0 | 376 | 824 | 444 | 74 | -1.000000 | NaN |
| 2 | 3 | 1 | 1 | 1 | 0 | 0 | 376 | 824 | 444 | 74 | -1.000000 | NaN |
| 3 | 4 | 1 | 1 | 1 | 1 | 0 | 376 | 824 | 444 | 74 | -1.000000 | NaN |
| 4 | 5 | 1 | 1 | 1 | 1 | 1 | 376 | 824 | 444 | 74 | 96.769348 | Contents |
Note
Pandas DataFrames are incredibly powerful for data analysis. For an introduction to Pandas and DataFrames, see DataLab's Python Basics reader.
Here is a summary of each column in this table, adapted from this blog post.
| Column | Description |
|---|---|
| level | 1: page, 2: block, 3: paragraph, 4: line, 5: word |
| page_num | starts at 1, indicates page, only useful for multi-page documents |
| block_num | starts at 0, pages > blocks > paragraphs > lines > words |
| par_num | starts at 0 |
| line_num | starts at 0 |
| word_num | starts at 0 |
| left | x coordinate in pixels of the top left corner of the bounding box, measured from the top left corner of the image |
| top | y coordinate in pixels of the top left corner of the bounding box, measured from the top left corner of the image |
| width | width in pixels of the bounding box |
| height | height in pixels of the bounding box |
| conf | confidence value for the word, 0-100, -1 for any row that isn't a word |
| text | detected word, NaN or empty for any row that isn't a word |
A bounding box refers to a rectangular region within the image. Bounding boxes can be used to represent a page, a block, a paragraph, a line, a word or even a character.
Analysis of this data can be very useful for projects that rely on the layout of the documents. One use of this data is to quickly classify types of pages within your document set; for example, you could develop heuristics for detecting whether a page contains a table of contents and filter those pages out. You could also use this data to extract titles, headers, or other sequences of text that have differing text heights. Additionally, if you only care about the text within a certain region of the page, for example the main body of a journal article, you could filter out the rows that aren't within that region, as in the sketch below.
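Here is a rough, hypothetical sketch of those last two ideas, using the data DataFrame from above. The pixel bounds and the height multiplier are placeholders you would tune for your own images, not values taken from this workshop's data:
# Keep only the rows that correspond to detected words.
words = data.loc[data["text"].notna()].copy()
# 1. Keep words whose bounding boxes fall inside an assumed region of interest
#    (placeholder pixel coordinates).
x_min, x_max = 300, 2000
y_min, y_max = 800, 3000
in_region = words.loc[
    (words["left"] >= x_min)
    & (words["left"] + words["width"] <= x_max)
    & (words["top"] >= y_min)
    & (words["top"] + words["height"] <= y_max)
]
# 2. Flag unusually tall words, which often belong to titles or headers.
tall = words.loc[words["height"] > 1.5 * words["height"].median()]
print(tall["text"].tolist())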
In addition to information about the layout, this table contains the confidence values associated with each word of detected text. These scores range from 0-100 and reflect the engine’s confidence in the detected word.
Assessing Accuracy¶
OCR is a very challenging problem, and while current tools are very advanced and built using the latest technologies, they are imperfect.
Confidence scores¶
One quantitative way of evaluating the OCR's performance is by analyzing the confidence values returned from pytesseract.image_to_data. With pandas we can compute some summary statistics on those values.
data["conf"].loc[data["text"].notna()].describe()
count 65.000000
mean 95.051439
std 2.124442
min 88.821213
25% 93.293381
50% 96.125114
75% 96.529099
max 96.911034
Name: conf, dtype: float64
We can also sort the words by their confidence scores:
data.sort_values('conf', ascending=False)[["text", "conf"]].loc[data["text"].notna()]
| | text | conf |
|---|---|---|
| 8 | 8 | 96.911034 |
| 12 | 9 | 96.907730 |
| 128 | 67 | 96.851425 |
| 23 | of | 96.804993 |
| 67 | Story | 96.780762 |
| ... | ... | ... |
| 66 | Turtle’s | 91.321640 |
| 55 | Tea-Party | 90.653679 |
| 74 | Quadrille | 90.321297 |
| 87 | Alice’s | 89.817886 |
| 132 | 73 | 88.821213 |
65 rows × 2 columns
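A simple way to act on these scores is to flag words below a confidence threshold for manual review. The threshold of 90 below is an arbitrary choice; what counts as 'low' confidence will depend on your documents and your tolerance for errors:
# Words whose confidence falls below an (arbitrary) threshold.
threshold = 90
low_conf = data.loc[data["text"].notna() & (data["conf"] < threshold), ["text", "conf"]]
print(low_conf)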
Vocabulary¶
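Another quick qualitative check is to look at the words tesseract detected and how often each one occurs: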
data["text"].value_counts()
The 5
a 3
11 2
A 2
the 2
and 2
12 1
Tarts? 1
Stole 1
Contents 1
Who 1
Evidence 1
Quadrille 1
Lobster 1
10 1
Story 1
Alice’s 1
25 1
13 1
19 1
Mock 1
31 1
37 1
43 1
51 1
59 1
67 1
73 1
81 1
Turtle’s 1
Tea-Party 1
Croquet-Ground 1
Queen’s 1
9 1
Down 1
Rabbit-Hole 1
Pool 1
of 1
Tears 1
Caucus-Race 1
Long 1
Tale 1
Rabbit 1
Sends 1
in 1
Little 1
Bill 1
Advice 1
from 1
Caterpillar 1
Pig 1
Pepper 1
Mad 1
8 1
87 1
Name: text, dtype: int64
Image Considerations¶
The quality of OCR outputs is heavily dependent on the quality of the input image. There are many potentially problematic features that can lead to very poor OCR results. Many of these problems can be programmatically resolved before passing the image to the OCR engine through image preprocessing, but there are some instances where no amount of preprocessing will be sufficient to get high quality OCR results. In order to maximize the quality of OCR results from tesseract, it is important to consider a few things, many of which are quite intuitive if we consider how tesseract is performing the task of OCR. So far we have worked with images that are very well suited to OCR, for several reasons: it is easy to distinguish the text from the background, they are properly aligned, they contain nothing but text, they are high resolution (measured in dots per inch), and they use standard fonts.
So what can we do with images that don't look like the ones we have seen so far? It is a good idea to start by randomly selecting some of your images and inspecting the OCR data. It is also good to consider what you are seeking to get out of OCRing your images: Do you really need a perfect transcription? How many images are you hoping to process? How much manual intervention can you apply?
Tesseract provides detailed documentation on ways to improve accuracy, many of which involve preprocessing the images. You can do all of this with whatever graphical image editor you prefer, which may be your best option if you are working with relatively few documents. If you are working with many documents, you can develop a preprocessing scheme with the graphical interface and then replicate the workflow programmatically using ImageMagick or OpenCV, as in the sketch below.
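Here is a minimal preprocessing sketch using OpenCV (the opencv-python package, which is not part of this workshop's setup). It converts an image to grayscale and binarizes it with Otsu's threshold, then passes the resulting NumPy array straight to pytesseract; the right preprocessing steps will vary from one document set to another:
import cv2
import pytesseract
# Load the image and convert it to grayscale.
img = cv2.imread('./data/alice_start-gutenberg.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Binarize with Otsu's method to separate text from the background.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# pytesseract accepts NumPy arrays, so no need to write the image back to disk.
text = pytesseract.image_to_string(binary)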
Working with PDFs¶
Tesseract does not operate on PDFs. To run OCR on a PDF with tesseract, you must first convert the pages of the PDF to an image file format, as shown below.
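One way to do that conversion in Python is with the pdf2image package, which wraps the poppler command line tools. Neither is part of this workshop's setup, so treat this as a sketch of the general approach:
from pdf2image import convert_from_path
import pytesseract
# Render each page of the PDF as a PIL image (requires poppler to be installed).
pages = convert_from_path('./data/two_column-google.pdf')
# Run OCR on every page and join the results.
text = "\n".join(pytesseract.image_to_string(page) for page in pages)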
Extracting existing text layers¶
Oftentimes PDFs will already have a text layer embedded, in which case it may not be necessary to run OCR at all. We can extract existing text layers with the PyPDF2 library.
from PyPDF2 import PdfFileReader
reader = PdfFileReader('./data/two_column-google.pdf')
page = reader.pages[0]
page.extractText()
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [_reader.py:1065]
'An Overview of the Tesseract OCR Engi ne \n \n \nRay Smith \nGoogle Inc. \ntheraysmith@gmail.com \n \nAbst ra ct \n \nThe Tesseract OCR engine, as was the HP Research \nPrototype in the UNLV Fourth Annual Test of OCR \n\nA ccuracy[1], is described in a comprehensive \n\noverview. Emphasis is placed on aspects that are novel \n\nor at least unusual in an OCR engine, including in \n\nparticular the line finding, features/classification \n\nmethods, and the adaptive classifier. \n \n \n1. Introduction ŒMotivation and History \n \nTesseract is an open-source OCR engine that was \ndeveloped at HP between 1984 and 1994. Like a super-\n\nnova, it appear ed fr om nowher e for the 1995 UNLV \n\nAnnual Test of OCR Accuracy [1], shone brightly with \n\nits results, and then vanished back under the same \n\ncloak of secrecy under which it had been developed. \n\nNow for the first time, deta ils of the architecture and \nalg orith ms can be revealed . \nTesser act began as a PhD r esear ch pr oject [2] in HP \nLabs, Bristol, and gain ed momen tum as a possible \n\nsoftware and/or hardware add-on for HP™s line of \n\nflatbed scanners. Motivation was provided by the fact \n\nthat the commercial OCR engines of the day were in \n\ntheir infancy, and failed miserably on anything but the \n\nbest quality print. \nAfter a joint project between HP Labs Bristol, and \nHP™s scann er division in Colorado, Tesseract had a \n\nsignificant lea d in a ccura cy over the commercia l \n\nengines, but did not become a product. The next stage \nof its development was back in HP Labs Bristol as an \ninvestigation of OCR for compression. Work \n\nconcentrated more on improving rejection efficiency \n\nthan on base-level accuracy. At the end of this project, \n\nat the end of 1994, development ceased entirely. The \n\nengine was sent to UNLV for the 1995 Annual Test of \n\nOCR Ac cura cy[1], where it proved its worth a gainst \n\nthe commercial engines of the time. In late 2005, HP \n\nreleased Tesseract for open source. It is now available \n\nat http://code.google.com/p/tesseract-ocr. \n\n \n2. Archi tecture \n \nSince HP had independently-developed page layout \nanalysis technology that was used in products, (and \n\ntherefore not released for open-source) Tesseract never \n\nneeded its own page layout analysis. Tesseract \n\ntherefore assumes that its input is a binary image with \n\noptional polygonal text regions defined. \nProcessing follows a traditional step-by-step \npipeline, but some of the stages wer e unusual in their \n\nda y, a nd possibly remain so even now. The first step is \n\na connected component analysis in which outlines of \n\nthe components are stored. This was a computationally \nexpensive design decision at the time, but had a \nsignificant advantage: by inspection of the nesting of \n\noutlines, and the number of child and grandchild \n\noutlines, it is simple to detect inverse text and \n\nrecognize it as easily as black-on-white text. Tesseract \n\nwas probably the first OCR engine able to handle \n\nwhite-on-black text so trivially. At this stage, outlines \n\nare gathered together, purely by nesting, into \nBlobs\n. \nBlobs are organized into text lines, and the lines and \nr egions ar e analyzed for fixed pitch or pr oportional \n\ntext. Text lines are broken into words differently \n\naccor ding to the kind of character spacing. Fixed pitch \ntext is chopped immediately by character cells. \nProportional text is broken into words using definite \n\nspaces and fuzzy spaces. 
\nRe c ognition then proceeds a s a two-pa ss process. In \nthe first pass, an attempt is made to recognize each \n\nword in turn. Each word that is satisfactory is passed to \n\nan ada ptive cla ssifier a s training data. The a da ptive \n\nclassifier then gets a chance to more accurately \n\nrecognize text lower down the page. \nS ince the a da ptive cla ssifier ma y ha ve lea rned \nsomething useful too late to make a contribution near \nthe top of the page, a second pass is run over the page, \nin which words that were not recognized well enough \n\nare recognized again. \nA final phase resolves fuzzy spaces, and checks \nalternative hypotheses for the x-height to locate small-\n\ncap text. \n'
Text Cleaning¶
Depending on what analysis you want to do with your text, it's helpful to know some common text cleaning methods and ways of implementing them in Python. Preprocessing text for computational analysis is a huge topic that won't be covered here. You can read more about text processing in Python in DataLab's getting started with textual data series, as well as the upcoming Natural Language Processing series.
Here is some sample code demonstrating common preprocessing techniques applied to the OCR output:
from stop_words import get_stop_words
words = data["text"].loc[data["text"].notna()]  # keep only rows with detected words
words = words.str.lower()  # lowercase everything
words = words.str.replace(r'[^\w\s]', '')  # strip punctuation
words = words.str.replace(r'\d+', '')  # strip digits
stopwords = get_stop_words('en')
words = [w for w in words if w not in stopwords]  # drop English stop words
words = [w for w in words if w]  # drop empty strings
FutureWarning: The default value of regex will change from True to False in a future version. [1984863378.py:4]
FutureWarning: The default value of regex will change from True to False in a future version. [1984863378.py:5]