Abstract—For shadowed text images, the character recognition performance of Tesseract drops significantly. In this paper, we propose a new method to process shadowed text images for the Tesseract optical character recognition engine. First, a local adaptive threshold algorithm transforms the grayscale image into a binary image that captures the contours of the text. Next, to remove the salt-and-pepper noise in the shadow areas, we propose a double-filtering algorithm, in which a projection method removes the noise between text lines and a median filter removes the noise within characters. Finally, the processed binary image is fed into the Tesseract optical character recognition engine. Experimental results show that the proposed method achieves better character recognition performance.

Keywords- Tesseract; shadow; local adaptive threshold; projection denoising; median filter

I. INTRODUCTION

Optical Character Recognition (OCR) technology has a wide range of applications in document processing. With an excellent OCR system, text image information can be entered into a computer quickly and accurately [1]. In most cases, text images are printed in a black-on-white style. However, the recognition of degraded text images with uneven illumination is still a great challenge for researchers.

The Tesseract OCR engine is an open-source system that was developed originally at HP Labs between 1985 and 1995, shelved for 10 years, open-sourced in 2006, and now developed mostly at Google. It has gained in popularity due to its accuracy and its support for a great many languages [2,3]. The project is available at http://code.google.com/p/tesseract-ocr. If a printed text image is not severely corrupted by noise, the recognition rate of Tesseract is fairly high. On the contrary, the performance of Tesseract drops significantly if the input image contains much noise, e.g., shadows. Fig. 1(a) shows a typical text image captured by a mobile phone, where shadows can be found on the right side of the picture. Tesseract employs the Otsu segmentation method to process the input image, so we transform Fig. 1(a) into binary format using the Otsu method; the result is shown in Fig. 1(b). The shadow area of Fig. 1(a) becomes a solid black region in Fig. 1(b) after Otsu segmentation. Consequently, the OCR performance of Tesseract becomes very poor, because Fig. 1(a) is not correctly pre-processed for Tesseract.

To improve the recognition accuracy for shadowed text images, several techniques have been proposed. Gamma correction, often simply called gamma, is a nonlinear operation used to encode and decode luminance or tristimulus values in video or still-image systems. Gamma correction can improve the light distribution of a text image a little, but its performance is not good [4]. Besides, adaptive histogram equalization (AHE) and improved algorithms based on AHE have been presented to correct uneven illumination [5-7]. These approaches have proved very effective for gray-level images, but they are not preferred for text images. In most cases, the frequency distribution of a text image differs significantly from that of a graphic image. To obtain a good recognition result, the contrast between text and background in a text image is supposed to be large, while the AHE scheme increases pixel values in the non-character information range, thus reducing the contrast of the image. What is more, adaptive histogram equalization produces a blocky effect, which is detrimental to subsequent processing.

To solve the problem of Tesseract's weak performance on shadowed text images, we propose to generate binary images with less noise before they are input into Tesseract. First of all, a local adaptive threshold is presented to replace the traditional Otsu method; it generates a binary image from the grayscale image, but this binary image still contains a certain amount of salt-and-pepper noise, mainly distributed in the shadow areas. To remove the noise from the binary image, a double filter is further proposed in consideration of both the denoising effect and the computational cost. We first project the binary image in the horizontal and vertical directions to remove the noise between the text lines; this step removes the noise in the non-text areas effectively. After removing the majority of the noise in the shadow areas, we employ a median filter to eliminate the noise within the character regions. Since the input of Tesseract can be a color, gray-level, or binary image, we finally feed the binary image generated by our scheme into the Tesseract engine and evaluate its recognition performance.

The rest of this paper is organized as follows. In Section II, we provide a detailed scheme of our shadow removal method for Tesseract text recognition. Experimental results and discussions are presented in Section III. We draw conclusions in Section IV.
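As an illustration of the first stage, a simple mean-based local adaptive threshold can be sketched as follows. This is only a sketch: the block size, the offset, and the mean-based rule itself are our own illustrative assumptions, not the exact formula of the proposed scheme (which is detailed in Section II).

```python
import numpy as np

def local_adaptive_threshold(gray, block=15, offset=10):
    """Binarize a grayscale image by comparing each pixel to its local mean.

    Illustrative stand-in for a local adaptive threshold; `block` and
    `offset` are hypothetical parameters, not values from the paper.
    Output: 0 for text (darker than the local mean minus offset),
    1 for background.
    """
    pad = block // 2
    padded = np.pad(gray.astype(float), pad, mode="edge")
    # Integral image for fast local block sums.
    ii = np.cumsum(np.cumsum(padded, axis=0), axis=1)
    ii = np.pad(ii, ((1, 0), (1, 0)))
    H, W = gray.shape
    s = (ii[block:block + H, block:block + W] - ii[:H, block:block + W]
         - ii[block:block + H, :W] + ii[:H, :W])
    mean = s / (block * block)
    return np.where(gray.astype(float) < mean - offset, 0, 1)
```

Unlike a global Otsu threshold, the decision at each pixel depends only on its neighborhood, so a smooth shadow does not turn into a solid black region; the price is the salt-and-pepper noise discussed above.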
B. Projection Denoising
The image BI usually contains a lot of noise, which decreases the OCR performance significantly. A median filter is usually utilized to reduce salt-and-pepper noise. However, if the window size of the median filter is small, the filtering performance is poor because some outlier noise points remain after filtering; if the window size is large, the text content is damaged and the computational cost increases. Therefore, we first use a projection method to remove the majority of the noise in BI, and then a median filter with a window size of 3 × 3 is employed to reduce the remaining isolated noise points.

For most languages, all text lines appear either horizontally or vertically [12]. Therefore, we take the horizontal text layout into consideration in this case.

Fig. 3. Flowchart of vertical projection denoising

Fig. 3 demonstrates the vertical projection denoising process. In detail, the process runs step by step as follows:
(1) Set j1 = 1 and j2 = J initially.
(2) If C(j1) ≤ Tc, set Cs = j1 and go to (3); else, set j1 = j1 + 1 and repeat (2) until j1 = J.
(3) If C(j2) ≤ Tc, set Ce = j2 and go to (4); else, set j2 = j2 − 1 and repeat (3) until j2 = 1.
(4) For j from 1 to Cs and from Ce to J, set BI(i, j) = 1.
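The four steps above can be transcribed literally into NumPy (0-indexed). Two assumptions on our part, since neither convention is stated explicitly in this chunk: BI stores background as 1 and text/noise as 0, and C(j) is the number of black pixels in column j.

```python
import numpy as np

def vertical_projection_denoise(BI, Tc):
    """Literal transcription of steps (1)-(4), 0-indexed.

    Assumes background = 1 and text/noise = 0 (our convention).
    """
    J = BI.shape[1]
    # Vertical projection: count of black pixels per column.
    C = np.sum(BI == 0, axis=0)
    # Steps (1)-(2): scan from the left for the first column with C(j) <= Tc.
    j1 = 0
    while j1 < J - 1 and C[j1] > Tc:
        j1 += 1
    Cs = j1
    # Step (3): scan from the right for the first column with C(j) <= Tc.
    j2 = J - 1
    while j2 > 0 and C[j2] > Tc:
        j2 -= 1
    Ce = j2
    # Step (4): set all columns outside [Cs, Ce] to background.
    out = BI.copy()
    out[:, :Cs + 1] = 1
    out[:, Ce:] = 1
    return out
```

The horizontal projection denoising of Step 1 and Step 2 is analogous, scanning row counts against the threshold Tr instead of column counts against Tc.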
Fig. 4(a) and (b) show the horizontal and vertical projection results of Fig. 1(c), respectively, with the thresholds Tr and Tc marked by red dashed lines. After the process depicted by Step 1 and Step 2, the image BPI is generated from BI; BPI contains only a small amount of noise, as shown in Fig. 1(d).

C. Median Filter Denoising

The last stage of our scheme reduces the noise in the regions of characters with a median filter. The median filter works on binary images and is effective in eliminating isolated noise points [14,15]. The main idea of the median filter is to run through the image pixel by pixel, replacing each pixel with the median of the neighboring pixels sorted numerically. The pattern of neighbors is called the window, which slides, pixel by pixel, over the entire image. The output of the 2D (two-dimensional) median filter is obtained by:

g(i, j) = med{ f(i − k, j − l), (k, l) ∈ W }   (8)

where f(i, j) and g(i, j) are the input image and the output image, respectively. W denotes the 2D template, which can have different shapes; we use a square template in our scheme. As discussed earlier, we set the median filter window size to 3 × 3, considering both the recognition performance and the computational cost. We finally obtain the image BMI, which contains only a few noise points, as shown in Fig. 1(e). BMI is then sent to the Tesseract engine in the experiments.

TABLE I
OCR ACCURACY COMPARISON

Test Images | No Preprocessing | AHE Method | Gamma Correction | Our Scheme
T1          | 78.2%            | 80.3%      | 81.6%            | 88.8%
T2          | 86.5%            | 86.5%      | 86.5%            | 94.3%
T3          | 86.9%            | 88.4%      | 89.6%            | 94.6%
T4          | 75.5%            | 73.2%      | 77.1%            | 84.2%
T5          | 92.0%            | 92.0%      | 92.7%            | 97.8%

In TABLE I, the OCR accuracy is given by

accuracy = (u − π)/u   (9)

where u is the number of characters in a text image and π is the number of errors. From Fig. 5, we can draw the conclusion that the AHE method does not balance the illumination of text images well, and the shadows become even more prominent. In addition, gamma correction seems to work somewhat better, but it is far from an ideal solution because most shadows still remain. The experimental results in TABLE I show that our scheme helps improve Tesseract's OCR performance, with an average improvement of 8.1%.
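The 3 × 3 binary median filtering of Eq. (8) can be sketched in NumPy as follows. The edge-replication border policy is our assumption; the paper does not state how border pixels are handled.

```python
import numpy as np

def median_filter_3x3(f):
    """Eq. (8): g(i,j) = med{ f(i-k, j-l) : (k,l) in W } with a 3x3 square W.

    Border pixels are handled by edge replication (an assumption).
    """
    padded = np.pad(f, 1, mode="edge")
    H, W = f.shape
    # Stack the nine shifted views of the image and take the
    # per-pixel median across them.
    windows = np.stack([padded[di:di + H, dj:dj + W]
                        for di in range(3) for dj in range(3)])
    return np.median(windows, axis=0).astype(f.dtype)
```

On a binary image, an isolated noise pixel is outvoted by its eight background neighbors and disappears, while strokes wider than one window survive, which is why this step is applied only after the projection stage has removed the bulk of the shadow noise.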
ACKNOWLEDGMENT
This research is funded by the National Natural Science Foundation of China (No. 61375011) and the Natural Science Foundation of Zhejiang Province of China (LY13F030015).
REFERENCES
[1] M. W. C. Reynaert, "Character confusion versus focus word-based correction of spelling and OCR variants in corpora." International Journal on Document Analysis & Recognition, vol.14, no.2, pp.173-187, 2011.
[2] R. Unnikrishnan and R. Smith, "Combined script and page orientation estimation using the Tesseract OCR engine." Proceedings of the International Workshop on Multilingual OCR, ACM, 2009.
[3] R. Smith, “An Overview of the Tesseract OCR Engine.” International
Conference on Document Analysis and Recognition IEEE Computer
Society, pp.629-633, 2007.
[4] J. Scott, and M. Pusateri, “Towards real-time hardware gamma cor-
rection for dynamic contrast enhancement.” Applied Imagery Pattern
Recognition Workshop IEEE, pp.1-5, 2009.
[5] J. Y. Kim, L. S. Kim, and S. H. Hwang, "An advanced contrast enhancement using partially overlapped sub-block histogram equalization." IEEE Transactions on Circuits & Systems for Video Technology, vol.11, no.4, pp.475-484, 2001.
[6] K. N. Chen, C. H. Chen, and C. C. Chang, "Efficient illumination compensation techniques for text images." Digital Signal Processing, vol.22, no.5, pp.726-733, 2012.
[7] M. Sahani, S. K. Rout, L. M. Satpathy, and A. Patra, “Design of an
embedded system with modified contrast limited adaptive histogram
equalization technique for real-time image enhancement.” International
Conference on Communications and Signal Processing IEEE, pp.0332-
0335, 2015.
[8] T. R. Singh, S. Roy, O. I. Singh, et al., "A New Local Adaptive Thresholding Technique in Binarization." International Journal of Computer Science Issues, vol.8, no.6, 2011.
[9] M. K. Kim, “Adaptive Thresholding Technique for Binarization of
License Plate Images.” Journal of the Optical Society of Korea, vol.14,
no.4, pp.368-375, 2010.
[10] P. N. Druzhkov, V. L. Erukhimov, N. Y. Zolotykh, et al., "New object detection features in the OpenCV library." Pattern Recognition & Image Analysis, vol.21, no.3, pp.384-386, 2011.
[11] I. Culjak, D. Abram, T. Pribanic, H. Dzapo and M. Cifrek, “A brief
introduction to OpenCV,” 2012 Proceedings of the 35th International
Convention MIPRO, Opatija, pp. 1725-1730, 2012.
[12] R. Smith, D. Antonova, and D. S. Lee. “Adapting the Tesseract open
source OCR engine for multilingual OCR.” Proceedings of the Interna-
tional Workshop on Multilingual OCR. ACM, 2009.
[13] F. F. Zeng, Y. Y. Gao, and X. L. Fu, “Application of denoising algorithm
based on document image,” Computer Engineering and Design, vol.33,
pp.2701-2705, 2012.
[14] S. Esakkirajan, T. Veerakumar, A. N. Subramanyam, and C. H. Prem-
chand, “Removal of High Density Salt and Pepper Noise Through
Modified Decision Based Unsymmetric Trimmed Median Filter.” IEEE
Signal Processing Letters, vol.18, no.5, pp.287-290, 2011.
[15] M. Monajati, S. M. Fakhraie, and E. Kabir, "Approximate Arithmetic for Low-Power Image Median Filtering." Circuits Systems & Signal Processing, vol.34, no.10, pp.3191-3219, 2015.

Fig. 5. Test image T1 and results of various schemes: (a) The original image; (b) Result with AHE; (c) Result with gamma correction (γ = 0.40); (d) Result with our scheme.
IV. CONCLUSION

In view of Tesseract's weak OCR performance for shadowed text images, we have proposed a novel technique to preprocess such images in order to obtain better recognition performance. With our scheme, these images are transformed into binary images with few noise points. According to the experimental results, the OCR accuracy is clearly increased. In our program, the values of the thresholds Tr and Tc are important factors for preprocessing. In the future, we aim to research how to set these coefficients better.