11 Optical Character Recognition
This lesson focuses on extracting data from non-digital sources, such as printed documents, using several packages for Optical Character Recognition.
11.1 Learning Objectives
After this lesson, you should be able to:
- Understand the usefulness of OCR
- Describe how OCR ‘reads’
- Differentiate between how
print()
andcat()
represent formatting - Describe different strategies for improving OCR accuracy
11.2 What is Optical Character Recognition?
Much of the data we’ve used in the course thus far has been born-digital. That is, we’ve used data that originates from a digital source and does not exist elsewhere in some other form. Think back, for example, to the lecture on strings in R: your homework required you to type text directly into RStudio, manipulate it, and print it to screen. But millions, even billions, of data-rich documents do not originate from digital sources. The United States Census, for example, dates back to 1790; we still have these records and could go study them to get a sense of what the population was like hundreds of years ago. Likewise, printing and publishing far precedes the advent of computers; much of the literary record is still bound up between the covers books or stowed away in archives. Computers, however, can’t read the way we read, so if we wanted to use digital methods to analyze such materials, we’d need to convert them into a computationally tractable form. How do we do so?
One way would be to transcribe documents by hand, either by typing out plaintext versions with word processing software or by using other data entry methods like keypunching to record the information those documents contain. Amazon’s Mechanical Turk service is an example of this kind of data entry. It’s also worth noting that, for much of the history of computing, data entry was highly gendered and considered to be “dumb”, secretarial work that young women would perform. Much of the divisions between “cool” coding and computational grunt work that, in a broad, cultural sense, continue to inform how we think about programming, and indeed who gets to program, stem from such perceptions. In spite of (or perhaps because of) such perceptions, huge amounts of data owe their existence to manual data entry. That said, the process itself is expensive, time consuming, error-prone, and, well, dull.
Optical character recognition, or OCR, is an attempt to offload the work of digitization onto computers. Speaking in a general sense, this process ingests images of print pages (such as those available on Google Books or HathiTrust), applies various preprocessing procedures to those images to make them a bit easier to read, and then scans through them, trying to match the features it finds with a “vocabulary” of text elements it keeps as a point of reference. When it makes a match, OCR records a character and enters it into a text buffer (a temporary data store). Oftentimes this buffer also includes formatting data for spaces, new lines, paragraphs, and so on. When OCR is finished, it outputs its matches as a data object, which you can then further manipulate or analyze using other code.
11.3 Loading Page Images
OCR “reads” by tracking pixel variations across page images. This means every
page you want to digitize must be converted into an image format. For the
purposes of introducing you to OCR, we won’t go through the process of creating
these images from scratch; instead, we’ll be using ready-made examples. The most
common page image formats you’ll encounter are pdf
and png
. They’re
lightweight, portable, and usually retain the image quality OCR software needs
to find text.
The pdftools
package is good for working with these files:
Once you’ve downloaded/installed it, you can load a pdf
into RStudio from your
computer by entering its filesystem location as a string and assigning that
string to a variable, like so:
pdf <- "./pdf_sample.pdf"
Note that we haven’t used a special load function, like read.csv()
or
readRDS()
. pdftools
will grab this file from its location and load it
properly when you run a process on it. (You can also just write the string out in
whatever function you want to call, but we’ll keep our pdf
location in a variable
for the sake of clarity.)
The same method works with web addresses. We’ll be using web material. First, write out an address and assign it to a variable.
Some pdf
files will have text data already encoded into them. This is
especially the case if someone made a file with word processing software (like
when you write a paper in Word and email a pdf
to your TA or professor). You
can check whether a pdf
has text data with pdf_text()
. Assign this function’s
output to a variable and print it to screen with cat()
, like so:
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
##
##
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
##
##
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
##
##
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
##
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
##
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
##
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
##
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
## The quick brown fox jumps over the lazy dog.
See how the RStudio terminal recreates the original formatting from the pdf
.
If you were to use print()
on the text output, you’d see all the line breaks
and spaces pdf_text()
created to match its output with the file. This
re-creation would be even more apparent if you were to save the output to a new
file with write()
. Doing so would produce a close, plaintext approximation of
the original pdf
.
You can also process multi-page pdf
files with pdf_text()
, with more or less
success. It can transcribe whole books and will keep them in a single text
buffer, which you can then assign to a variable or save to a file. Keep in mind,
however, that if your pdf
files have headers, footers, page numbers, chapter
breaks, or other such paratextual information, pdf_text()
will pick this up
and include it in its output. A later lecture in the course will discuss how to
go about dealing with this extra data.
If, when you run pdf_text()
, you find that your file already contains text
data, you’re set! There’s no need to perform OCR and you can immediately start
working with your data. However, if you run the function and find that it outputs
a blank character string, you’ll need to OCR it. The next section shows you how.
11.4 Running OCR
First, you’ll need to download/install another package, tesseract
, which
complements pdftools
. The latter only loads/reads pdfs
, whereas tesseract
actually performs OCR. Download/install tesseract
:
And assign a new pdf
to a new variable:
To run OCR on this pdf
, use the following:
Print the output to screen with cat()
and see if the process worked:
## THE SLEREXE COMPANY LIMITED
## SAPORS LANE . BOOLE - DORSET . BH25 8ER
## ‘aurrmowe noun (945 18) SU6I7 - run 123486
## Our Ref. 350/PIC/EAC gen January, 1972.
## De. P.M. Cundall,
## Mining Surveys Ltd.,
## Holroyd Road,
## Reading,
## Berks.
## Dear Pete,
##
## Permit me to introduce you to the facility of facsinile
## transmission.
##
## In facsimile @ photocell is caused to perform a raster scan over
## the subject copy, The variations of print density on the docunent
## cause the photocell to generate an analogous electrical video signal.
## ‘This signal is used to modulate a carrier, which is transmitted to a
## remote destination over a radio or cable Communications Link.
##
## At che remote tersinal, demodulation reconstructs the video
## signal, which is used to modulate the density of print produced by
## printing device. This device is scanning in raster scan synchronised
## ith chac at the transmitting terminal. As a result, a facsimile
## copy of the subject docusent is produced.
##
## Probably you have uses for this facility in your organisation.
##
## Yours sincerely,
## PJ. CROSS
## Group Leader ~ Facsimile Research
Voila! You’ve just digitized text. The formatting is a little off, but things look good overall. And most importantly, it looks like everything has been transcribed correctly.
As you ran this process, you might’ve noticed that a new png
file briefly
appeared on your computer. This is because tesseract
converts pdfs
to png
under the hood as part of its pre-processing work and then silently deletes that
png
when it outputs its matches. If you have a collection of pdf
files that
you’d like to OCR, it can sometimes be faster and less memory intensive to
convert them all to png
first. You can perform this conversion like so:
## Warning in sprintf(filenames, pages, format): 2 arguments not used by format
## 'img/png_example.png'
## Converting page 1 to img/png_example.png... done!
In addition to making a png
object in your RStudio environment, pdf_convert()
will also save that file directly to your working directory. Imagine, for
example, how you might put a vector of pdf
files through a for
loop and save
them to a directory, where they can be stored until you’re ready to OCR them.
pdfs <- c("list.pdf", "of.pdf", "files.pdf", "to.pdf", "convert.pdf")
outfiles <- c("list.png", "of.png", "files.png", "to.png", "convert.png")
for (i in 1:length(pdfs)) {
pdf_convert(pdfs[i], format="png", filenames=outfiles[i])
}
ocr()
can work with a number of different image types. It takes pngs
in the
same way as it takes pdfs
:
11.5 Accuracy
If you cat()
the output from the above png
file, you might notice that the
text is messier than it was when we used pdf_text_ocr()
.
## THE SLEREXE COMPANY LIMITED
## SAPORS LANE . BOOLE - DORSET . BH25 8ER
## ‘aurrmowe noun (945 18) SU6I7 - run 123486
## Our Ref. 350/PIC/EAC gen January, 1972.
## De. P.M. Cundall,
## Mining Surveys Ltd.,
## Holroyd Road,
## Reading,
## Berks.
## Dear Pete,
##
## Permit me to introduce you to the facility of facsinile
## transmission.
##
## In facsimile @ photocell is caused to perform a raster scan over
## the subject copy, The variations of print density on the docunent
## cause the photocell to generate an analogous electrical video signal.
## ‘This signal is used to modulate a carrier, which is transmitted to a
## remote destination over a radio or cable Communications Link.
##
## At che remote tersinal, demodulation reconstructs the video
## signal, which is used to modulate the density of print produced by
## printing device. This device is scanning in raster scan synchronised
## ith chac at the transmitting terminal. As a result, a facsimile
## copy of the subject docusent is produced.
##
## Probably you have uses for this facility in your organisation.
##
## Yours sincerely,
## PJ. CROSS
## Group Leader ~ Facsimile Research
This doesn’t have to do with the png
file format per se but rather with the
way we created our file. If you open it, you’ll see that it’s quite blurry,
which has made it harder for ocr()
to match the text it represents:
This blurriness is because pdf_convert()
defaults its conversions to 72 dpi,
or dots per inch. Dpi is a measure of image resolution (originally from
inkjet printing), which describes the amount of pixels your computer uses to
create images. More pixels means higher image resolution, though this comes with
a trade off: images with a high dpi are also bigger and take up more space on your
computer. Usually, a dpi of 150 is sufficient for most OCR jobs, especially if
your documents were printed with technologies like typewriters, dot matrix
printers, and so on and if they feature fairly legible typefaces (Times New Roman,
for example). A dpi of 300, however, is ideal. You can set the dpi in
pdf_convert()
by adding a dpi argument in the function:
## Warning in sprintf(filenames, pages, format): 2 arguments not used by format
## './img/hi_res_png_example.png'
## Converting page 1 to ./img/hi_res_png_example.png... done!
Another function, ocr_data()
outputs a data frame that contains all of the words
tesseract
found when it scanned through your image, along with a column of
confidence scores. These scores, which range from 0-100, provide valuable
information about how well the OCR process has performed, which in turn may
tell you whether you need to modify your pdf
or png
files further before
OCRing them (more on this below). Generally, you can trust scores of 93 and
above.
To get confidence scores for an OCR job, call ocr_data()
and subset the
confidence
column, like so:
## [1] 92.85288 93.29081 91.28961 89.58909 90.84585 93.28822 89.73586 42.98426
## [9] 42.98426 84.53579 93.22416 92.29937 78.53416 92.69026 92.69026 90.78268
## [17] 96.54684 96.20255 92.06576 90.66714 95.93098 95.93098 94.80000 91.29604
## [25] 86.38274 89.78857 96.21648 93.23919 91.63251 95.72250 96.07732 96.30678
## [33] 95.42773 96.16511 96.46468 96.18211 96.56435 96.26118 96.44928 96.84418
## [41] 95.99975 96.43151 96.05421 96.64143 96.58707 93.48241 95.80854 96.30167
## [49] 95.68130 94.79024 96.29650 95.70312 95.70312 96.16078 95.46851 96.13533
## [57] 96.07069 95.75951 28.89212 96.41727 95.87022 95.88496 95.70441 96.74840
## [65] 96.97053 96.28452 96.16819 96.95196 96.38844 96.38844 96.35432 96.92095
## [73] 95.38261 96.23360 96.32274 96.62334 95.21851 96.04497 96.48308 95.83801
## [81] 96.36586 96.01486 96.15874 96.15874 96.38681 96.57597 95.66638 96.50586
## [89] 96.53349 96.74413 95.86657 95.37359 95.37359 96.42490 96.42490 95.37225
## [97] 95.37225 95.67689 96.46522 96.46522 96.16774 95.67158 94.74640 94.74640
## [105] 96.43452 96.19952 96.89110 96.50343 96.91463 96.44903 95.93738 96.54442
## [113] 96.52963 96.17621 96.17621 96.27686 96.17960 96.20045 96.02205 96.35922
## [121] 96.08511 96.84263 94.95956 96.05283 96.05283 96.13543 96.48267 96.53674
## [129] 96.53783 96.63319 96.92226 96.41508 96.30360 96.21509 96.03458 96.52768
## [137] 95.88564 96.76052 96.69682 94.36144 95.98904 96.43297 96.19821 95.59159
## [145] 95.59159 96.73177 96.55448 95.66377 96.74652 95.87988 95.87988 96.37433
## [153] 96.56387 96.92600 96.76073 96.82498 96.14220 96.14220 96.24170 96.72656
## [161] 96.72889 96.51418 96.85883 96.14581 54.90906 34.23709 55.07772 96.81318
## [169] 96.47118 91.64275 89.22007 96.26302 96.66982 96.11583 94.74653 89.28475
## [177] 96.12286 91.70045 94.22867 81.54822 50.47863 96.34515 33.15864 39.17315
## [185] 80.93199 82.44621 40.23938
The mean is a good indicator of the overall OCR quality:
## [1] 92.27528
Looks pretty good, though there were a few low scores that dragged the score down a bit. Let’s look at the median:
## [1] 96.13543
We can work with that!
If we want to check our output a bit more closely, we can do two things. First,
we can look directly at ocr_data
and compare, row by row, a given word and its
confidence score.
## word confidence bbox
## 1 SAPORS 92.85288 422,194,497,208
## 2 LANE 93.29081 508,194,560,208
## 3 - 91.28961 570,203,575,205
## 4 BOOLE 89.58909 585,193,651,208
## 5 - 90.84585 661,202,666,205
## 6 DORSET 93.28822 676,193,755,208
## 7 - 89.73586 764,203,769,205
## 8 BH 42.98426 784,193,808,208
## 9 25 42.98426 813,189,834,217
## 10 8ER 84.53579 842,193,883,208
## 11 TELEPHONE 93.22416 449,232,534,243
## 12 BOOLE 92.29937 544,232,589,243
## 13 (94513) 78.53416 600,229,664,246
## 14 51617 92.69026 675,229,719,244
## 15 - 92.69026 730,237,735,240
## 16 TELEX 90.78268 746,232,792,243
## 17 123456 96.54684 804,228,857,244
## 18 Our 96.20255 211,392,246,408
## 19 Ref. 92.06576 261,391,306,407
## 20 350/PJC/EAC 90.66714 325,389,459,409
## 21 18th 95.93098 863,389,910,405
## 22 January, 95.93098 924,389,1020,408
## 23 1972. 94.80000 1038,388,1095,405
## 24 Dr. 91.29604 212,492,244,508
## 25 P.N. 86.38274 262,492,307,508
That’s a lot of information though. Something a little more sparse might be
better. We can use base R’s table()
function to count the number of times
unique words appear in the OCR data. We do this with the word
column in our
ocr_data
variable from above:
Let’s look at the first 30 words:
## Var1 Freq
## 1 - 5
## 2 : 1
## 3 (94513) 1
## 4 @O 1
## 5 / 1
## 6 1 1
## 7 123456 1
## 8 18th 1
## 9 1972. 1
## 10 2038 1
## 11 25 1
## 12 350/PJC/EAC 1
## 13 51617 1
## 14 8ER 1
## 15 a 9
## 16 an 1
## 17 analogous 1
## 18 As 1
## 19 at 1
## 20 At 1
## 21 Berks. 1
## 22 BH 1
## 23 BOOLE 2
## 24 by 1
## 25 cable 1
## 26 carrier, 1
## 27 cause 1
## 28 caused 1
## 29 communications 1
## 30 copy 1
This representation makes it easy to spot errors like discrepancies in spelling. We could correct those either manually or with string matching. One way to further examine this table is to look for words that only appear once or twice in the output; among such entries you’ll often find misspellings. The table does, however, have its limitations. Looking at this data can quickly become overwhelming if you send in too much text. Additionally, notice that punctuation “sticks” to words and that uppercase and lowercase variants of words are counted separately, rather than together. These quirks are fine, useful even, if we’re just spot-checking for errors, but we’d need to further clean this data if we wanted to use it in computational text analysis. A later lecture will discuss other methods that we can use to clean text.
When working in a data-forensic mode with page images, it’s a good idea to pull
a few files at random and run them through ocr_data()
to see what you’re
working with. OCR accuracy is often wholly reliant on the quality of the page
images, and most of the work that goes into digitizing text involves properly
preparing those images for OCR. Adjustments include making sure images are
converted to black and white, increasing image contrast and brightness, increasing
dpi, and rotating images so that their text is more or less horizontal.
tesseract
performs some of these tasks itself, but you can also do them ahead
of time and often you’ll have more control over quality this way. The tesseract
documentation goes into detail about what you can do to improve accuracy before
even opening RStudio; we can’t cover this in depth, but keep the resource in mind
as you work with this type of material. And remember: the only way to completely
trust your accuracy is to go through the OCR output yourself. It’s a very common
thing to have to make small tweaks to output. In this sense, we haven’t quite left
the era of hand transcription.
11.6 Unreadable Text
All that said, these various strategies for improving accuracy will only get you so far if your page images are composed in a way OCR just can’t read. OCR systems contain a lot of in-built assumptions about what “normal” text is, and they are incredibly brittle when they encounter text that diverges from that norm. Early systems, for example, required documents to be printed with special, machine-readable typefaces; texts that contained anything other than this design couldn’t be read. Now, OCR is much better at handling a variety of text styling, but systems still struggle with old print materials like blackletter.
Running:
ballad <- "https://ebba.english.ucsb.edu/images/cache/hunt_1_18305_2448x2448.jpg"
ballad_out <- ocr(ballad)
Produces:
## CAditeription of optons falcehod |
## nf Pogke thyze, and of bis fatall farewel,
## ache fatal fine of 2rattours lees :
## \ 4Bp guflice due, deferuyng foe. . :
## Flate (alas) the great bntruth = Whe Crane woldetipedpto the funne, sotto, bis futtrpug long (be fare) 7
## Mf Wrattours, hotw ttfpep % beavdit once of olde: _ Cpl pay bis foes atlak: : :
## MU do life toknolv, Hal herex-.aae And with the kyng ofbpresdinitrine his metepe moued once alway, :
## oly late allegeance fed. wp Fame, J beardittoloe ~~ He Hall them quight out café j
## @ Ff Riuersrage againt the Sea. Qnodotwae ihe wolve not fal fre ne, WH ith fentence int fo2 their bntruth, :
## And {well with foddeine rayne s Wut higher pil nimovry ? And boeakpng of his tupll: :
## Wolw glad ave they to fall agapne, Gil pat her veach (faith woe repo2te) Che fruits of their fedictous feds, :
## Andtrace their twonted traine? Shame madea backe recour Che barnes of earth thail fpll.
## M) Af fire by force wolde forge the fall 3 touch no Armes beret at all, Whete foules God wot foze clogd w crime
## Df any fumptuoufe place, sButthelvafable wpe: _ and their potteritie
## A) Af water flods byd him leaueof, We hote moall fence doth reper MWelpotted fore with theirabule,
## dis flames be wyll difgrace. DF clpmers bye the guyle. And ttand by thete folie.
## 3f Goo command the wyrides to ceale, Aho buyloes a boule of many > heir linpngs left thetr namea thame,
## ‘ Ibis blattes are lapo full low: andlatth not ground forks heir dedes with popfon (ped: :
## HB 3f God command the {eas to calmes 4But doth ertortethe ground ig, SHhbeirdeathesatvagefor wantofgrace
## } Whey tyll notrageo2 flow. _ Wis builopng can not dure, SUbeir honours quite ts dead, q
## H All thinges at Gods commandemet be, € Who fekes (urmifing to dip heir eth tofedethe kytesandcrowes
## f Af bhetheir flateregarde: : a Kuler fentby GOD? heir armes a maze fo2 men; u
## H Audnoman lines whole oeftinie 4s fubiect {ure, Deuoide of grace. Sheir guerdon as eramples are ks
## wy hint ts bupeeparde. be caufe of bis olune rod. Co Daly dolte Dunces den, ke
## f) iBut when a mait forlakes thefhip, - A byave that Wyll bernett defple holy vp pour fnouts pou fluggih forte Q
## f° Androlwles in wallowing wanes; ‘By right ould lofe a wyng: Poumumming malkpng route : :
## And of bis valuntarte wyil, Gua thents thee no flping fon! €rtoll pour erclantations vp,
## ‘ Wis one gwdbap depraues + {But How as other thyng. iBaals chapletnes,champions ffoute. &
## HM Uolw hal bebopetolcape the gulfe 2 anodbe that lofeth all at games, Make fate fo2 pardons, papitts boaue, ii
## : Wow Mhal be thinke todeale 2 2 (pendes infotule ercefle: Fo2traitours indulgence ; Q
## A olw thal his fanfic baing him found Andhopesbyhapsto healebisharme, . fend out fome purgatorte (craps, ‘
## i Zo Safties toe with faple 2 uit deivke of Beare diftrefte. Home wWulls with Peter pence. E
## : Jotw Chall bis fraight in fine fuccede 2 Go fpeake of b2pdles to reftrapne ® fwarine of D20nes, how dare pe pl a
## 4 Alas what thall begapne 2 SChis wylfull waplward cretpe ¢ With labourpng Wes contend 2
## fH Udbatfeare by oes do makebimquake bey care not foz the bake of God, ou fought forhonte fromthe hiues, ;
## j Wot ofte fubiece to payne 2 Oo princes, men bntrue. Wut gall pou foundinend. F
## | Yoin fundzie tintes in Dangers den Co cuntrye, caufers of much woe, hele walpes do walk, their lings beout
## : 4s th2owne theman onivpfe 2 Spotl freeones, afall: Whee {pight wyll not auaple : i
## f) Who climes withouten holde on bye, ano Gxtpeir one eftates, altpng, BWhele Peacocks proudearenakedlette fF
## x 4Beware, 3 hint adutse. a others, Harpeas gall. Of their oifplayed tayle. #
## f Alifuchas trutt to falle contracds, D Lorne, how long thele Liserds lurkt, hele Lurkype cocks tu cullour red, é
## | D2 fecret harmes confpire? GurvG O D> bow great a whple _ Solong bauelurkt alofe : é
## E iBefure, with Portonstbey altafte Were they in Handwith fetgnedbarts Whe Beare (although but foty of fote) 4
## 5 A right deferucd hire. @Cbheir cuntrye to defple 2 Hath pluc his wynages by profe. i
## & hep can not lie fo2 better (pede, Holv did they frame theirfurniture? . Whe Done her bozotyenlighthath lol,
## if So death for fuch to fells Wolv fit thep made their toles ¢ She wapnedas wwe lee : ;
## © Godgrant the tuftice of the wozlde Wow Symon fought our engipt) Drote Who hoped by bap of othersharmes, EB
## ie qput by the papnes of hell. To bapng to Romaine (coles. © full Bone once to be. 2
## B F02(uchapentiuercale it is, Holy Simon Magus playd his parte, SCbe Lyon ({uffrea long the Wuil, 3
## : Dhat Cnglith barts oid dare 3aiv Wabilon balwde pidrage: Wis noble mynd to trpe: :
## H 30 pale the boundes of duties latuc, Aolv walan bulles begon to bell, Wntpll the wWull was rageyng tod, H
## ‘ @2 of theircuntric care. How Judas fought bis wage. And from bis fake dia hye. #
## And mercie hath fo long releat Wotw Jannes and Jamb2es od abpde aUben time ft was to bid him fap :
## | Dffendours (God doth knoiw) aie bount of baaineficke ads, yerforce, bis hoanes to cut + :
## Al Gnd bountic of our curteous Nuene oww Dathan, Choe, Abiram femd Andmake him leaue bis rageing tanes 4
## 4 Go long hath fpared ber foe. Ho path our Moyles facs. Au fcilence to be put. &
## fy SBut God, whole grace infpiresherbarte, Wow womaine marchant feta fret Andall the calues of Wafan kynd i
## 4 WA pil not abyoe the fpight ifs pardons beaue a fale, Are Wweanedfrom their with ; h
## | Of Rebels rage, who rampeto reach Jor alwapes fomeagaint the ruth Whe Wiccan Wigers tamed now, a
## i From ber, her title quight. Molde dzeame afencelestale. __ lemathon eates no fith. bs
## A Although the flowe in pitifull seale, Gods bicar from bis god receaued 4Bebolode before pour balefullepes i
## t nd loueth to fucke no biwd Gbekeyestolofcandbynds Lhe purchace of pour parte, aH
## eet Goda caueat wyll her lend sBaals chapleinthogbt bo fire wole” te. SHuruep pour foseinefozrowful fight 4
## a © appeale thole Wipers mabe. Such was bis pagan mynd. With fighes of pubbie barte, 4
## B) aman that fees bts bouteon fire, Godiorre howhits thetertthete sts Lament thelackeof pour alies :
## 3 WH pli (eke to quench the fame: That faith fuch men tall he _ Religious rebells ail: 2
## i) Gls from the fpople fomeparteconucy, Au their religion hot nozcolve ABewepe that pli fuccelle of yours, :
## 4 Cis feke the heate to tame, Of much baviette. Come curfe pour fodcine fall. “i
## Ht Who leea penthoule wether beate, And Cundey foats of fects furts And when pe haue pour guilesoutfought #
## ‘ And beares a botttrouleivpndes - Wiuifion thall appeare ¢ , And all pour craft app2oued, 3
## % sButhedefull faletie of himfelfe, Againk thefather, tonnes —7UL> gpeccauimus thall be pour fong :
## i WA pil force him fuccour fynde 2 Gaink mother, saughter +k our ground tpozke is remoued, :
## ache pitifull pactent Pelican, 4s {t not come to palfe troty pra? Andlokebow Poetons(pedtheirvwills &
## i er blwd although He tea: aged, battards (ure they bes _ Euen fo thetr fect hall baue, ie
## i ject inyll he femeherdateto ends Cal ho our god mother Nueneo - 4 po better let themhope to gayne “i
## ‘ : Datare her poung be fpen. 6 he ied buen “a $ut gallowes without graue.
## (1) she Cagle fipngesbher pong ones done Can@od ig bengeance long retath See 4
## 3} What fight a at cate? eatbere bis ‘aes feruants fele cFINIS. pee eo. fe
## i winpertec folwles the aeadly bates, Aniurioute fights of godlete men, Op :
## “ And rightly Cuch mile. ‘ Moho turne as doth a wwhele 2 cA Q so CK OS &
## # ‘ ie
## a Ampzinted at London by Alerander Lacte, foz Henrie Iyskehans, divellpng at the fiqne a
## __ of the blacke Move, at the nitdole srozth doze of Paules church. i i
## i i i
(Note by the way that this example used a jpg
file. ocr()
can handle those too.)
Even though every page is an “image” to OCR, OCR struggles with imagistic or unconventional page layouts, as well as inset graphics. Add to that sub-par scans of archival documents, as in the newspaper page below, and the output will contain way more errors than correct matches.
newspaper <- "https://chroniclingamerica.loc.gov/data/batches/mimtptc_inkster_ver01/data/sn88063294/00340589130/1945120201/0599.pdf"
newspaper_ocr <- ocr(newspaper)
Beautifully messy output results:
## y ~ <a
##
## =
##
## 7 i ‘= : “|
##
## Relorm-elypewritin =F
## ——
One strategy you might use to work with sources like this is to crop out
everything you don’t want to OCR. This would be especially effective if, for
example, you had a newspaper column that always appeared in the top left-hand
corner of the page. You could preprocess your page images so that they only
showed that part of the newspaper and left out any ads, images, or extra text.
Doing so would likely increase the quality of your OCR output. Such a strategy
can be achieved outside of R with software ranging from Adobe Photoshop or the
open-source GIMP to Apple’s Automator workflows. Within R, packages like
tabulizer
and magick
enable this. You won’t, however, be required to use
these tools in the course, though we may have a chance to demonstrate some of them
during lecture.
There are several other scenarios where OCR might not be able to read text. Two final (and major) ones are worth highlighting. First, for a long time OCR support for non-alphabetic writing systems was all but nonexistent. New datasets have been released in recent years that mostly rectify these absences, but sometimes support remains spotty and your mileage may vary. Second, OCR continues to struggle with handwriting. While it is possible to train unsupervised learning processes on datasets of handwriting and get good results, as of yet there is no general purpose method for OCRing handwritten texts. The various ways people write just don’t conform to the standardized methods of printing that enable computers to recognize text in images. If, someday, you figure out a solution for this, you’ll have solved one of the most challenging problems in computer vision and pattern recognition to date!