Linux Commando
This blog is about the Linux Command Line Interface (CLI), with an occasional foray into GUI territory. Instead of just
giving you information like some man page, I hope to illustrate each command in real-life scenarios.
Thursday, January 9, 2014
OCR Scanning
This post describes how to scan pages from a printed book and convert the image to text using
Optical Character Recognition (OCR) technology.
The tools that I use are:
1. Simple Scan
2. tesseract
Preparation
http://linuxcommando.blogspot.com/2014/01/ocr-scanning.html
4/24/2015
Simple Scan is a GUI scan application that comes preinstalled in many Linux distributions
(including Debian Wheezy).
To manually install it on Debian:
$ sudo apt-get install simple-scan
tesseract is a command-line OCR program.
To install:
$ sudo apt-get install tesseract-ocr
If English is the language used, that is all you need to install. If you require another language, you
must install additional tesseract language packs. Examples are tesseract-ocr-rus for Russian,
tesseract-ocr-deu for German, and tesseract-ocr-fra for French.
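The language pack names above follow a tesseract-ocr-<code> pattern. As a small sketch, assuming that pattern holds for the languages you need, a helper can turn language codes into package names. The name lang_packages is made up for this illustration; it is not part of tesseract or apt.

```shell
# Hypothetical helper: print one Debian package name per language code,
# assuming the tesseract-ocr-<code> naming pattern.
lang_packages() {
    for code in "$@"; do
        printf 'tesseract-ocr-%s\n' "$code"
    done
}

# Possible usage (left commented out because it installs packages):
# sudo apt-get install $(lang_packages rus deu fra)
```

If a package is not found, the pattern probably does not apply to that language, and it is worth searching the package list instead.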
OCR Procedure
1. Scan the pages using Simple Scan.
2. Save the image.
3. Run the tesseract command:
$ tesseract OnWritingWell.jpg out
Tesseract Open Source OCR Engine v3.02 with Leptonica
The first parameter is the input image file name. The second parameter is the desired
base name of the output text file. The default txt extension is added to the base name, e.g.,
out.txt.
If the language is not English, you need to specify the language on the command line using
a 3-character language code (refer to the tesseract man page). The following command
specifies the use of 3 languages: Russian, German and French.
$ tesseract OnWritingWell.jpg myout -l rus+deu+fra
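A book usually yields more than one scanned page, so the command above can be wrapped in a loop. The following is only a sketch, not a tested workflow: the function name ocr_pages is made up, and it assumes the scans are saved as .jpg files in a single directory. The optional second argument is passed to -l and defaults to eng.

```shell
# Hypothetical helper: run tesseract on every .jpg in a directory.
# $1 = directory holding the scans (default: current directory)
# $2 = language code(s) for -l, e.g. rus+deu+fra (default: eng)
ocr_pages() {
    dir=${1:-.}
    langs=${2:-eng}
    for img in "$dir"/*.jpg; do
        # Skip the unexpanded pattern when no .jpg files exist.
        [ -e "$img" ] || continue
        # Output base name = image name without .jpg; tesseract adds .txt.
        tesseract "$img" "${img%.jpg}" -l "$langs"
    done
}

# Possible usage:
# ocr_pages ~/scans rus+deu+fra
```

Each page1.jpg then gets a matching page1.txt next to it, which keeps the text files easy to pair with their source images.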
Accuracy
In the above example, there were a total of 734 words. Within the output text file, 119 words (16%
of total) require some form of manual correction. This roughly translates to 84% OCR accuracy.
The sample size is too small to be scientific, or statistically valid. What is the performance that
you are getting from OCR?
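To compare your own results, the percentages can be worked out on the command line. This is a minimal sketch using awk; the function name ocr_accuracy is made up, and the two word counts still have to be obtained by hand.

```shell
# Hypothetical helper: print the error rate and accuracy as percentages.
# $1 = total number of words, $2 = words needing manual correction
ocr_accuracy() {
    awk -v total="$1" -v bad="$2" 'BEGIN {
        printf "error rate: %.1f%%\n", 100 * bad / total
        printf "accuracy: %.1f%%\n", 100 * (total - bad) / total
    }'
}

# Possible usage with the figures from this post:
# ocr_accuracy 734 119
```

With the figures above, this gives an error rate of 16.2% and an accuracy of 83.8%, in line with the rounded numbers quoted in the text.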
Posted by Peter Leung at 5:07 PM
3 comments:
Jesus Emilio Villa Giraldo said...
Thanks a lot. very easy.
February 4, 2014 at 10:35 AM
professor de sociologia said...
Thanks, man! It really helped me!
September 15, 2014 at 6:02 PM
Anonymous said...
Many thanks for clear command line example
sriharikonakanchi
November 23, 2014 at 8:01 AM