OCR
[Figure: processing stages, in order]
Connected Component Analysis -> Character Outlines -> Find Text Lines and Words -> Character Outlines Organized Into Words -> Recognize Word (Pass 1) -> Recognize Word (Pass 2)
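To make the connected component stage concrete, here is a minimal sketch of labeling 8-connected groups of ink pixels in a binary image. It is an illustration only: the function name `connected_components` and the bounding-box output are assumptions made for this example, and it is not Tesseract's own code, which works with character outlines rather than boxes.

```python
from collections import deque

def connected_components(binary):
    """Return one bounding box (x0, y0, x1, y1), inclusive, for each
    8-connected group of 1-pixels in a list-of-rows binary image."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes

# Hypothetical 5x4 page with two separate blobs
page = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
boxes = connected_components(page)
```

Each box is one candidate blob. Grouping blobs into text lines and words is the job of the later stages.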
Adaptive Thresholding is Essential
Some examples of how difficult it can be to produce a binary image,
taken from the UNLV Magazine set
(http://www.isri.unlv.edu/ISRI/OCRtk).
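One way to see why a single global threshold fails on such pages is to compare it with a local-mean threshold on a synthetic image whose background brightens from left to right. The sketch below is a generic local-mean method using an integral image, not the thresholding Tesseract itself uses. The function names, the `window` and `offset` parameters, and the test image are all assumptions made for illustration.

```python
def adaptive_threshold(gray, window=7, offset=10):
    """Binarize a grayscale image (list of rows of 0..255 ints).
    A pixel becomes 1 (ink) when it is darker than the mean of its
    window x window neighbourhood minus `offset`."""
    h, w = len(gray), len(gray[0])
    # Integral image with a zero border: integ[y][x] = sum of gray[:y][:x]
    integ = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integ[y + 1][x + 1] = integ[y][x + 1] + row_sum
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            area = (y1 - y0) * (x1 - x0)
            total = integ[y1][x1] - integ[y0][x1] - integ[y1][x0] + integ[y0][x0]
            # Same test as gray < mean - offset, kept in integer arithmetic
            if gray[y][x] * area < total - offset * area:
                out[y][x] = 1
    return out

def global_threshold(gray, level=128):
    """Binarize with one fixed level for the whole image."""
    return [[1 if v < level else 0 for v in row] for row in gray]

# Synthetic page: background brightens left to right (100..217),
# with two vertical strokes, each 60 levels darker than its background.
image = [[100 + 3 * x - (60 if x in (5, 30) else 0) for x in range(40)]
         for _ in range(20)]
local = adaptive_threshold(image)
fixed = global_threshold(image)
```

On this image the local method marks only the two strokes, while the fixed level turns the dark left margin into ink and loses the stroke on the bright side.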
Baselines are rarely perfectly straight
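A small numeric illustration of the point: when the bottoms of the characters follow a gentle curve, a straight line leaves a residual of several pixels, while a curved model follows them. The sketch below uses a plain least-squares polynomial fit as a stand-in and is not Tesseract's baseline fitter. The `polyfit` helper and the sample points are assumptions made for this example.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit. Returns [c0, c1, ...] for
    y = c0 + c1*x + c2*x**2 + ..."""
    n = degree + 1
    # Normal equations A c = b, A[i][j] = sum x^(i+j), b[i] = sum y*x^i
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs

def max_error(xs, ys, coeffs):
    """Largest absolute distance between the points and the fitted curve."""
    return max(abs(y - sum(c * x ** k for k, c in enumerate(coeffs)))
               for x, y in zip(xs, ys))

# Hypothetical character bottoms along a slightly bowed text line (pixels)
xs = list(range(0, 200, 10))
ys = [50 + 0.0005 * (x - 100) ** 2 for x in xs]

line_err = max_error(xs, ys, polyfit(xs, ys, 1))
curve_err = max_error(xs, ys, polyfit(xs, ys, 2))
```

Here `line_err` is a few pixels, while `curve_err` is close to zero. A few pixels is enough to confuse features that depend on position relative to the baseline.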
[Figure: word recognizer. Box labels:]
Character Chopper
Character Associator
Done? (Yes / No)
Adapt to Word
Static Character Classifier
Adaptive Character Classifier
Dictionary
Number Parser
Outline Approximation
[Figure: training data flow]
User-words
Word List -> Wordlist2dawg -> Word-dawg, Freq-dawg
Training page images -> Tesseract + manual correction -> Box files
Tesseract -> Character Features (*.tr files)
Character Features (*.tr files) -> mfTraining -> inttemp, pffmtable
Character Features (*.tr files) -> cnTraining -> normproto
Box files -> Unicharset_extractor -> unicharset -> Addition of character properties -> unicharset
Manual Data Entry -> DangAmbigs
Tesseract Dictionaries
Word lists (Infrequent Word List, Frequent Word List) -> Wordlist2dawg -> Word-dawg, Freq-dawg
Tesseract Data Files: User-words (usually empty), Word-dawg, Freq-dawg
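The idea behind compiling a word list into a graph can be sketched with a plain trie, in which words sharing a prefix share a path. This is only an illustration: a DAWG additionally merges shared suffixes, and the on-disk format written by Wordlist2dawg is not reproduced here. The `Trie` class and the sample words are assumptions made for the example.

```python
class Trie:
    """Prefix tree over a word list. Unlike a DAWG, it does not merge
    common suffixes, so it is larger but simpler."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def _walk(self, text):
        """Follow `text` from the root; return the node reached or None."""
        node = self.root
        for ch in text:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and "$" in node

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None

# Hypothetical word list
words = Trie(["cat", "cart", "care", "dog"])
```

The prefix test is what makes such a structure useful during recognition: a partial word that matches no prefix can be rejected before the remaining characters are classified.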
Tesseract Shape Data
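Training page images -> Tesseract + manual correction -> Tesseract -> Character Features (*.tr files)
Character Features (*.tr files) -> mfTraining -> inttemp, pffmtable
Character Features (*.tr files) -> cnTraining -> normproto
Tesseract Data Files: inttemp, pffmtable, normproto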
Training page images -> Tesseract + manual correction -> Box files -> Unicharset_extractor -> unicharset -> Addition of character properties -> unicharset
unicharset: list of characters + ctype information
Manual Data Entry -> DangAmbigs
DangAmbigs: typical OCR errors, e.g. e<->c, rn<->m, etc.
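A table of typical confusions such as e<->c and rn<->m can be used to propose alternative readings of a word that is not in the dictionary. The sketch below shows the idea only. The actual DangAmbigs file format and the way Tesseract applies it are not reproduced, and the names `AMBIGUITIES`, `variants` and `correct` are assumptions made for this example.

```python
# Confusion pairs taken from the examples above; each applies in both directions
AMBIGUITIES = [("e", "c"), ("rn", "m")]

def variants(word, ambiguities=AMBIGUITIES):
    """All strings obtained by applying one substitution at one position."""
    found = set()
    for a, b in ambiguities:
        for src, dst in ((a, b), (b, a)):
            start = word.find(src)
            while start != -1:
                found.add(word[:start] + dst + word[start + len(src):])
                start = word.find(src, start + 1)
    return found

def correct(word, dictionary):
    """Return the word itself if known, else its variants found in the dictionary."""
    if word in dictionary:
        return [word]
    return sorted(v for v in variants(word) if v in dictionary)

# Hypothetical dictionary and misrecognized words
dictionary = {"corner", "modern"}
fixed_comer = correct("comer", dictionary)
fixed_modem = correct("modem", dictionary)
```

With this dictionary, "comer" is repaired to "corner" and "modem" to "modern", both through the rn<->m pair.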
Accuracy Results
Comparison of current results against 1995 UNLV results