
Unsupervised Evaluation Methods Based on Local Gray-Intensity Variances for Binarization of Historical Documents
Marte A. Ramírez-Ortegón* and Raúl Rojas
Institut für Informatik, Freie Universität Berlin,
Takustr. 9, 14195 Berlin, Germany
* mars.sasha@gmail.com, rojas@inf.fu-berlin.de
Abstract: We evaluate the efficacy of six unsupervised evaluation
methods for tuning Sauvola's threshold in optical character
recognition (OCR) applications. We propose local implementations of
well-known measures based on gray-intensity variances. Additionally,
we derive four new measures from them using the unbiased variance
estimator and gray-intensity logarithms. In our experiment, we
selected the best-binarized images according to each measure and
computed the accuracy of the text recognized from each. The results
show that the weighted and uniform variance (using logarithms) are
suitable measures for OCR applications.
Index Terms: binarization; unsupervised; evaluation
I. INTRODUCTION
Libraries such as the National Archives of Egypt and the Library of
Congress (United States of America) have been digitizing historical
printed documents, such as ancient codices, maps, and books, to
preserve and spread cultural heritage through digital libraries.
The main problem in the construction of digital libraries lies in the
extraction of information from hundreds of thousands of ancient
documents. The digitization of bibliographic records is the only
feasible solution to that problem.
This problem can be roughly divided into three parts: detection of
the objects of interest (binarization), text extraction, and text
recognition. Here, we ignore the text extraction problem and assume
that text recognition is performed by an optical character
recognition (OCR) application that works as a black-box algorithm.
That is, the OCR performance depends mainly on the input image, while
the OCR parameters have little influence on the output. Therefore,
the evaluation of the binarization algorithm and its parameters plays
the most important role in the system. The natural question is then:
which parameters should be set in the binarization algorithm to
maximize the OCR performance?
Manual tuning of the binarization parameters by human experts is
inadequate because it is time-consuming and expensive. Instead, the
binarization performance may be assessed with unsupervised evaluation
methods, which analyze the segmentation quality through properties
and principles of the segmentation. These methods need neither human
intervention nor ground truth. Consequently, they can be used on a
large scale. Furthermore, they enable the objective comparison of
both different segmentation methods and different parameters of a
single method. They also enable the self-tuning of algorithms based
on evaluation results.
Measures based on gray-intensity variance are popular for evaluating
binarized images [1], [2], [3] because, intuitively, both foreground
and background should be uniform and homogeneous regions.
Unfortunately, few authors have analyzed the mathematical and
experimental behavior of these measures [4], [5]. We therefore study
their efficacy for tuning binarization methods in order to maximize
the accuracy of OCR applications. In our test, we analyzed Sauvola's
method [6] (binarization method) and TopOCR [7] (OCR software), but
the same methodology can be applied to other binarization methods and
OCR software.
We propose local implementations of classic and recent measures to
cope with images with a composite background (two or more
sub-regions). Afterward, we propose modeling the gray intensities of
both foreground and background as lognormally distributed.
The rest of this paper is organized as follows. Section II
introduces the examined unsupervised evaluation methods.
The comparison study is described in Section III. Results of
the experiment and conclusions are presented in Section IV.
II. EVALUATION METHODS FOR BINARIZATION
Binarization is the process of dividing the set of pixels $P$ into
$\hat{F}$ and $\hat{B}$ with the aim of estimating the foreground $F$
and background $B$, respectively. In the binarization context, $F$
represents the set of pixels containing the objects of interest and
$B$ is the complement of $F$ in $P$.
All binarization algorithms reported in [3], [8], [9] assume that
foreground pixels can be distinguished by extracting diverse features
based on their gray intensities. Under this assumption, authors such
as [10], [2], [3] conjecture that the variances of the gray
intensities of both foreground and background are smaller in
well-binarized images than the corresponding variances in poorly
binarized images. However, this conjecture is false for images with
composite foreground and/or background, as in Fig. 1. As a result,
evaluation measures,
2010 International Conference on Pattern Recognition
1051-4651/10 $26.00 2010 IEEE
DOI 10.1109/ICPR.2010.500
2029
[Figure 1: plot of density versus gray intensity (150 to 200).]
Fig. 1. Two different regions form the background. Although the gray
intensities of each background region are approximately normally
distributed, the gray intensities of the entire background are not.
based on gray-intensity variances, could be misleading. To overcome
this difficulty, we analyze local implementations of these measures.
A. Notation
We define the neighborhood $P_r(p)$ as the set of pixels within a
square centered at the pixel $p$ with sides of length $2r + 1$. We
abbreviate the intersection between $F$ and $P_r(p)$ as
$F_r(p) = F \cap P_r(p)$. Similarly, we define $\hat{F}_r(p)$,
$B_r(p)$, and $\hat{B}_r(p)$. The cardinality of a set $A$ is denoted
as $|A|$.
Given a set $A$, we denote the following statistics and estimators of
gray intensities: the expected value with $\mu_A$; the variance with
$\sigma^2_A$; the mean with $\hat{\mu}_A$ (an estimator of $\mu_A$);
the unbiased sample variance with $\hat{\sigma}^2_A$ (an unbiased
estimator of $\sigma^2_A$), where $\hat{\sigma}^2_A = 0$ if
$|A| < 2$; the biased sample variance with $S^2_A$ (a biased
estimator of $\sigma^2_A$), where $S^2_A = 0$ if $|A| < 1$; and the
unbiased sample variance of gray-intensity logarithms
$$\tilde{\sigma}^2_A = \ln\left(1 + \frac{\hat{\sigma}^2_A}{[\hat{\mu}_A]^2}\right). \tag{1}$$
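The estimators above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `gray_stats` is hypothetical, the input is a plain list of gray intensities, and returning 0 for the mean of an empty set and for the logarithm-based variance when the mean is not positive are conventions chosen here, not taken from the paper.

```python
import math

def gray_stats(values):
    """Return (mean, unbiased variance, biased variance, log-variance).

    Hypothetical helper: `values` is a list of gray intensities of a set A.
    """
    n = len(values)
    if n == 0:
        # Assumed convention for the empty set (the text only fixes S^2 = 0).
        return 0.0, 0.0, 0.0, 0.0
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    biased = squared_dev / n                              # S^2_A
    unbiased = squared_dev / (n - 1) if n >= 2 else 0.0   # 0 if |A| < 2
    # Eq. (1): variance of the logarithms under a lognormal model.
    log_var = math.log(1.0 + unbiased / mean ** 2) if mean > 0 else 0.0
    return mean, unbiased, biased, log_var
```

For three pixels with intensities 100, 110, and 120, the mean is 110, the unbiased variance is 100, and the biased variance is 200/3.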
B. Unsupervised evaluation methods
To evaluate binarized images, Levine and Nazif [11] proposed the
gray-intensity uniformity (GU) measure. With the same aim, Sezgin and
Sankur [3] derived the region non-uniformity (NU) measure from GU.
These measures are defined as
$$GU = S^2_{\hat{B}} + S^2_{\hat{F}} \quad \text{and} \quad NU = \frac{|\hat{F}|\, S^2_{\hat{F}}}{|P|\, S^2_P}. \tag{2}$$
Otsu [12] proposed the weighted variance (WV), defined as
$$WV = \frac{|\hat{B}|\, S^2_{\hat{B}} + |\hat{F}|\, S^2_{\hat{F}}}{|P|}. \tag{3}$$
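As a rough illustration of (2) and (3), the sketch below computes the three global measures for a flat list of gray intensities and a matching list of booleans marking the estimated foreground. The names `biased_variance` and `global_measures` are hypothetical, and returning 0 for NU when the whole image has zero variance is an assumption made here to avoid a division by zero.

```python
def biased_variance(values):
    """Biased sample variance S^2; 0 for an empty set."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def global_measures(gray, mask):
    """Return (GU, NU, WV) for flat lists `gray` and `mask`.

    mask[i] is True when pixel i belongs to the estimated foreground.
    """
    fg = [g for g, m in zip(gray, mask) if m]
    bg = [g for g, m in zip(gray, mask) if not m]
    s2_f = biased_variance(fg)
    s2_b = biased_variance(bg)
    s2_p = biased_variance(gray)
    n = len(gray)
    gu = s2_b + s2_f
    # Assumed convention: NU = 0 when the image is perfectly flat.
    nu = (len(fg) * s2_f) / (n * s2_p) if s2_p > 0 else 0.0
    wv = (len(bg) * s2_b + len(fg) * s2_f) / n
    return gu, nu, wv
```

On the toy image `[10, 12, 200, 202]`, the mask that marks the two dark pixels as foreground gives WV = 1, whereas a mask that mixes dark and bright pixels in each class gives WV = 9025, which matches the intuition that lower values indicate better binarization.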
Ramírez-Ortegón et al. [1] proposed the uniform variance (UV)
measure, which is defined with the local gray-intensity standard
deviations as
$$UV_r(p) = \frac{|\hat{B}_r(p)|\, \hat{\sigma}_{\hat{B}_r(p)} + |\hat{F}_r(p)|\, \hat{\sigma}_{\hat{F}_r(p)}}{|P_r(p)|}. \tag{4}$$
All four measures assume that the better the binarization, the lower
the evaluation measurement.
GU, NU, and WV can easily be transformed into the local measures
$GU_r$, $NU_r$, and $WV_r$ by replacing $P$, $\hat{F}$, and $\hat{B}$
with $P_r(p)$, $\hat{F}_r(p)$, and $\hat{B}_r(p)$, respectively.
However, their local implementations lack desirable properties: the
$NU_r$ measure is zero if $\hat{F}_r(p) = \emptyset$, and the
expected values of both $WV_r$ and $GU_r$ are not minimum if
$\hat{B}_r(p) = B_r(p)$. For example, assume that all pixels are
background and $\hat{B}_r(p) = B_r(p)$; then
$$E(GU_r) = E(S^2_{B_r(p)}) = \frac{|B_r(p)| - 1}{|B_r(p)|}\, \sigma^2_{B_r(p)}, \tag{5}$$
where $E(\cdot)$ denotes the expected value. However, if
$\hat{B}_r(p) = B_r(p) \setminus \{p\}$ and $\hat{F}_r(p) = \{p\}$,
then
$$E(GU_r) = E(S^2_{B_r(p) \setminus \{p\}}) = \frac{|B_r(p)| - 2}{|B_r(p)| - 1}\, \sigma^2_{B_r(p)}, \tag{6}$$
which is smaller than (5).
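The comparison between (5) and (6) can be checked directly. Writing $n = |B_r(p)|$ (a shorthand introduced here only for this check) and assuming $n \geq 2$, the difference between the two factors is

```latex
\frac{n-1}{n} - \frac{n-2}{n-1}
  = \frac{(n-1)^2 - n(n-2)}{n(n-1)}
  = \frac{1}{n(n-1)} > 0 .
```

So, in this all-background example, labeling a single background pixel as foreground always lowers the expected $GU_r$, although the gap is tiny for large windows: for a window with $r = 50$ that is not clipped by the image border, $n = 101^2 = 10201$.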
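We propose the $r$-local weighted variance measure, whose expected
value is minimum if $\hat{F}_r(p) = F_r(p)$ (see footnote 1):
$$\widehat{WV}_r(p) = \begin{cases} \dfrac{|\hat{B}_r(p)|\, \hat{\sigma}^2_{\hat{B}_r(p)} + |\hat{F}_r(p)|\, \hat{\sigma}^2_{\hat{F}_r(p)}}{|P_r(p)|} & \text{if } |\hat{B}_r(p)| \geq 2 \text{ and } |\hat{F}_r(p)| \geq 2, \\ \hat{\sigma}^2_{P_r(p)} & \text{otherwise.} \end{cases} \tag{7}$$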
Similarly, we define $\widehat{UV}_r(p)$.
Experiments in [1] suggested that both foreground and background gray
intensities locally behave as lognormally rather than normally
distributed. Hence, we derived $\widetilde{WV}_r(p)$ and
$\widetilde{UV}_r(p)$ from $\widehat{WV}_r(p)$ and
$\widehat{UV}_r(p)$. These measures replace
$\hat{\sigma}_{\hat{F}_r(p)}$ and $\hat{\sigma}_{\hat{B}_r(p)}$ with
$\tilde{\sigma}_{\hat{F}_r(p)}$ and $\tilde{\sigma}_{\hat{B}_r(p)}$,
respectively; see (1).
The binarization performance, in terms of the $r$-local weighted
variance measure, is evaluated as
$$\widehat{WV}_r(B) = \frac{1}{|P|} \sum_{p \in P} \widehat{WV}_r(p), \tag{8}$$
where $B$ represents the binarized image. Likewise, we define the
rest of the measures.
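The local measure (7) and its image-level average (8) might be sketched as follows. The function names are hypothetical, images are nested lists (rows of gray intensities and rows of booleans for the estimated foreground), and clipping the window at the image border is an assumption, since border handling is not specified in the text.

```python
def unbiased_variance(values):
    """Unbiased sample variance; 0 when fewer than two values."""
    n = len(values)
    if n < 2:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def local_wv(gray, mask, row, col, r):
    """r-local weighted variance at pixel (row, col), as in Eq. (7)."""
    h, w = len(gray), len(gray[0])
    fg, bg = [], []
    # Assumed border handling: the (2r+1)-sided window is clipped to the image.
    for y in range(max(0, row - r), min(h, row + r + 1)):
        for x in range(max(0, col - r), min(w, col + r + 1)):
            (fg if mask[y][x] else bg).append(gray[y][x])
    if len(fg) >= 2 and len(bg) >= 2:
        weighted = len(bg) * unbiased_variance(bg) + len(fg) * unbiased_variance(fg)
        return weighted / (len(fg) + len(bg))
    # Otherwise-branch of Eq. (7): variance of the whole neighborhood.
    return unbiased_variance(fg + bg)

def image_wv(gray, mask, r):
    """Average of the local measure over all pixels, as in Eq. (8)."""
    h, w = len(gray), len(gray[0])
    total = sum(local_wv(gray, mask, y, x, r) for y in range(h) for x in range(w))
    return total / (h * w)
```

The fallback to the variance of the whole neighborhood is what penalizes a binarization that leaves a mixed window entirely in one class; in a small test image with a dark and a bright half, an all-background mask scores far worse than the mask that separates the two halves.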
III. COMPARATIVE STUDY
We evaluated the efficacy of $v = 6$ measures in relation to OCR
performance. Figure 2 shows the evaluation processing flow. For
simplicity, we denote by $M^{(k)}$, for $k = 1, \ldots, v$, the
$k$-th measure of the list
$M = \{GU_r, NU_r, \widehat{WV}_r, \widetilde{WV}_r, \widehat{UV}_r, \widetilde{UV}_r\}$.
Footnote 1. We have constructed a formal treatment of this argument,
using some probability assumptions on gray-intensity differences.
This work has been submitted for publication.
[Figure 2: flow diagram with four stages. Input: gray-intensity
images $I^{(i)}$ for $i = 1, \ldots, n$. Binarization: binary images
$B^{(i,j)}$ by Sauvola's method using $\theta_j$ for
$j = 1, \ldots, m$. Evaluation of binarization: measurement
$M_k(B^{(i,j)})$ of image $B^{(i,j)}$ using measure $M_k$ for
$k = 1, \ldots, v$, and selection of
$\bar{B}^{(i,k)} = \arg\min_{B^{(i,j)},\, j = 1, \ldots, m} M_k(B^{(i,j)})$.
OCR and AC evaluation: $x^*_i = \max_{j = 1, \ldots, m} AC(B^{(i,j)})$,
$x_{i,k} = AC(\bar{B}^{(i,k)})$, and $y_{i,k} = x_{i,k} / x^*_i$.]
Fig. 2. Overview of the evaluation process.
Our test database is composed of $n = 86$ gray-intensity images
$I^{(i)}$, for $i = 1, \ldots, n$, that contain degraded text (ink
stains and weak strokes, to mention some kinds of degradation). These
images were extracted from 61 maps of the historical atlas Theatrum
orbis terrarum, sive, Atlas novus (Blaeu Atlas) [13] with 150 dpi
resolution.
We chose Sauvola's method [6] to perform the binarization because it
was top-ranked by [3], [8]. Sauvola's threshold is defined as
$$T(p) = \hat{\mu}_{P_{r'}(p)} \left[1 + k \left(\frac{\hat{\sigma}_{P_{r'}(p)}}{R} - 1\right)\right], \tag{9}$$
where $r'$, $k$, and $R$ are parameters. The pixel $p$ is classified
as foreground if its gray intensity is lower than $T(p)$. Table I
presents the range of each of Sauvola's parameters that we used in
our experiment. Varying the parameters of Sauvola's method, we
computed $m = 5{,}454$ binary images $B^{(i,j)}$ for each image
$I^{(i)}$. Later on, we computed $M^{(k)}(B^{(i,j)})$, which
represents the measurement of $B^{(i,j)}$ with $M^{(k)}$. Then,
$$\bar{B}^{(i,k)} = \arg\min_{B^{(i,j)},\, j = 1, \ldots, m} M^{(k)}(B^{(i,j)}) \tag{10}$$
denotes the best-binarized image among $B^{(i,j)}$ in terms of
measure $M^{(k)}$.
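A minimal sketch of the thresholding rule in (9) is given below, using only the standard library. It assumes a nested-list image, a window clipped at the image border, and the unbiased sample standard deviation as the local deviation; the function name `sauvola_binarize` and the argument names are illustrative, not taken from [6].

```python
import math

def sauvola_binarize(gray, r, k, R):
    """Return a nested list of booleans, True = foreground (gray < T(p))."""
    h, w = len(gray), len(gray[0])
    out = [[False] * w for _ in range(h)]
    for row in range(h):
        for col in range(w):
            # Assumed border handling: clip the (2r+1)-sided window.
            window = [gray[y][x]
                      for y in range(max(0, row - r), min(h, row + r + 1))
                      for x in range(max(0, col - r), min(w, col + r + 1))]
            n = len(window)
            mean = sum(window) / n
            # Assumed estimator: unbiased sample variance (0 if n < 2).
            var = sum((v - mean) ** 2 for v in window) / (n - 1) if n >= 2 else 0.0
            threshold = mean * (1.0 + k * (math.sqrt(var) / R - 1.0))
            out[row][col] = gray[row][col] < threshold
    return out
```

For a 5x5 image of intensity 200 with a single dark pixel of intensity 20 in the center, calling `sauvola_binarize(image, 2, 0.2, 128)` marks only the center pixel as foreground. This per-pixel loop is written for clarity; a practical implementation would use integral images or a library routine.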
We used TopOCR [7] to recognize the text from the binarized images
using four parameter sets. We measure the accuracy of the recognized
text as
$$AC(B^{(i,j)}) = \frac{\#(\text{characters of } T^{(i,j)}_{match})}{\#(\text{characters of } T^{(i)}_{in})}, \tag{11}$$
where $T^{(i)}_{in}$ is the original text in $I^{(i)}$ and
$T^{(i,j)}_{match}$ denotes the maximum matching text between
$T^{(i)}_{in}$ and the OCR output. $T^{(i,j)}_{match}$ is computed
using the Needleman-Wunsch algorithm [14]. AC is an important measure
for OCR applications because the higher the AC measurement, the
greater the possibility of extracting, by further algorithms,
relevant information from the recognized text.
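The accuracy in (11) can be illustrated with a small global-alignment routine. The scoring scheme is an assumption, since the text does not state one: with a score of 1 for a match and 0 for mismatches and gaps, the Needleman-Wunsch optimum equals the maximum number of characters that can be matched in order. The function names are hypothetical.

```python
def matched_characters(truth, ocr):
    """Needleman-Wunsch alignment score with match = 1, mismatch = gap = 0.

    Under this assumed scoring, the score is the number of matched characters.
    """
    prev = [0] * (len(ocr) + 1)
    for a in truth:
        curr = [0]
        for j, b in enumerate(ocr, start=1):
            diagonal = prev[j - 1] + (1 if a == b else 0)
            curr.append(max(diagonal, prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def accuracy(truth, ocr):
    """AC as in Eq. (11): matched characters over ground-truth characters."""
    if not truth:
        return 0.0  # assumed convention for an empty ground truth
    return matched_characters(truth, ocr) / len(truth)
```

For instance, if the ground truth is "THEATRUM" and the OCR output is "THEATRVM", seven of the eight characters match and the accuracy is 0.875.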
In our evaluation, $x^*_i$ represents the maximum AC among all the
binarized images of $I^{(i)}$, and $x_{i,k}$ represents the OCR
accuracy of the best-binarized image of $I^{(i)}$ in terms of measure
$M^{(k)}$. Hence, our statistics and observations are mainly based on
$$y_{i,k} = \frac{x_{i,k}}{x^*_i} \quad \text{(AC efficacy)}, \tag{12}$$
which represents the efficacy of $M^{(k)}$ for tuning Sauvola's
method in order to maximize the accuracy. Observe that $x_{i,k}$
highly depends on $x^*_i$ and, consequently, we cannot infer from it
alone how efficient $M^{(k)}$ is at assessing the binarization
method. For example, suppose that, whatever the parameters of
Sauvola's method are, the OCR accuracy is at most 0.5
($x^*_i = 0.5$). If $x_{i,k} = 0.45$, for instance, this could be
interpreted either as low OCR performance or as low binarization
performance, but the ratio of $x_{i,k}$ to $x^*_i$ is
$y_{i,k} = 0.90$, which means that $M^{(k)}$ is highly efficient at
maximizing the OCR accuracy despite the intrinsically low OCR
(binarization method) performance on $I^{(i)}$.
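The selection in (10) and the ratio in (12) amount to a few lines for a single image and a single measure. In this sketch the function name `ac_efficacy` and the toy numbers are illustrative; `measure_values[j]` stands for the measurement of the j-th binarization and `accuracies[j]` for its AC.

```python
def ac_efficacy(measure_values, accuracies):
    """Return y = x / x* for one image and one measure.

    The binarization with the lowest measurement is selected (Eq. 10), and
    its accuracy is divided by the best accuracy over all binarizations.
    """
    best_j = min(range(len(measure_values)), key=measure_values.__getitem__)
    x_star = max(accuracies)
    if x_star == 0:
        return 0.0  # assumed convention when no binarization yields any match
    return accuracies[best_j] / x_star
```

With accuracies `[0.30, 0.45, 0.50]` and measurements `[3.0, 1.0, 2.0]`, the measure selects the second binarization and the efficacy is 0.45 / 0.5 = 0.9, as in the example in the text.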
TABLE I
RANGE OF SAUVOLA'S PARAMETERS. SWEEPING THE PARAMETERS $r'$, $k$,
AND $R$, WE GENERATED $m = 5{,}454$ DIFFERENT PARAMETER COMBINATIONS
$\theta_j = \{r'_j, k_j, R_j\}$.

Parameter   From/To    Increment
$r'$        10/50      5
$k$         0.0/1.0    0.01
$R$         32/196     32
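The number of combinations in Table I can be verified by enumerating the grid. The variable names below are illustrative; note that stepping from 32 in increments of 32 reaches 192 as the last value not exceeding 196.

```python
from itertools import product

window_radii = range(10, 51, 5)             # 10, 15, ..., 50 (9 values)
k_values = [i / 100 for i in range(101)]    # 0.00, 0.01, ..., 1.00 (101 values)
R_values = range(32, 197, 32)               # 32, 64, ..., 192 (6 values)

# Every parameter combination (r', k, R) of the sweep.
grid = list(product(window_radii, k_values, R_values))
print(len(grid))  # 9 * 101 * 6 = 5454
```

The count agrees with the value of $m$ reported in the caption.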
IV. RESULTS AND CONCLUSION
In our experiment, we set $r = 50$ for all measures. Table II and
Fig. 3 present statistics of the points $(i, y_{z_{i,k},k})$, where
$z_{i,k}$ are indexes such that
$y_{z_{1,k},k} \geq \ldots \geq y_{z_{n,k},k}$ for
$k = 1, \ldots, v$. Table II also presents the pairwise comparison
between values $y_{i,k}$.
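The statistics described above might be computed as in the following sketch. It assumes that the quantile rows of Table II report the fraction of images whose efficacy is at least a given level, and that the pairwise comparison is the fraction of images on which one measure gives a strictly higher efficacy than another; function names and sample values are made up for illustration.

```python
def efficacy_curve(y_k):
    """Points (i/n, y) with the efficacies of one measure in decreasing order."""
    ordered = sorted(y_k, reverse=True)
    n = len(ordered)
    return [((i + 1) / n, y) for i, y in enumerate(ordered)]

def fraction_at_least(y_k, level):
    """Fraction i/n of images whose efficacy is >= level (assumed reading)."""
    return sum(1 for y in y_k if y >= level) / len(y_k)

def pairwise(y_a, y_b):
    """Empirical P(y_a > y_b) over images, compared image by image."""
    return sum(1 for a, b in zip(y_a, y_b) if a > b) / len(y_a)
```

For the made-up efficacies `[1.0, 0.9, 0.5, 0.8]` and `[0.9, 0.9, 0.7, 0.2]` of two measures on four images, the first measure wins on two images (0.5) and the second on one (0.25); ties count for neither, which is why a row and its mirrored column in a pairwise table need not sum to one.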
The measure $\widehat{WV}_r$ is the best in overall performance (mean
and variance). However, $\widetilde{UV}_r$ performed better in the
first quartile of measurements $y_{z_{i,k},k}$. $\widehat{UV}_r$ and
$\widetilde{WV}_r$ have an acceptable performance to a lesser degree.
Results in Table II indicate that $GU_r$ and $NU_r$ are ineffective
for tuning Sauvola's parameters; see Fig. 3. Notice that Sauvola's
threshold can be interpreted as the acceptable deviation from the
expected gray intensity. As the parameter $k$ increases, this
tolerance increases and, consequently, more and more pixels are
classified as background until all pixels are in the estimated
background. Therefore, high values of $k$ are chosen by those
evaluation measures that penalize the estimated background only
lightly or not at all. In particular, $NU_r$ yields white images
while $GU_r$ renders degraded characters.
After inspecting the binarized images visually, we concluded that
$\widetilde{UV}_r$ outperforms $\widehat{UV}_r$ (Table II) because
$\widehat{UV}_r$ generates more false positive spots (connected
components with four or more pixels), which are scattered all around
the
[Figure 3: two plots of the AC efficacy $y_{z_{i,k},k}$ versus $i/n$.
Left panel: full range (both axes from 0 to 1) for $GU_r$, $NU_r$,
$\widehat{WV}_r$, $\widetilde{WV}_r$, $\widehat{UV}_r$, and
$\widetilde{UV}_r$. Right panel: detail with $y_{z_{i,k},k}$ from 0.8
to 1 and $i/n$ from 0 to 0.5, without $NU_r$.]
Fig. 3. Each AC efficacy graph is sorted in decreasing order to make
visual inspection easier.
TABLE II
AC EVALUATION OVERVIEW.

Method      $x^*_i$  $GU_r$  $NU_r$  $\widehat{WV}_r$  $\widetilde{WV}_r$  $\widehat{UV}_r$  $\widetilde{UV}_r$
Mean        0.907    0.536   0.059   0.805             0.719               0.731             0.769
Std. Dev.   0.094    0.229   0.118   0.209             0.207               0.244             0.247

Quantiles: $i/n$ for each level of $y_{z_{i,k},k}$
$y_{z_{i,k},k}$  $x^*_i$  $GU_r$  $NU_r$  $\widehat{WV}_r$  $\widetilde{WV}_r$  $\widehat{UV}_r$  $\widetilde{UV}_r$
1.00             0.25     0.01    0.00    0.07              0.05                0.03              0.09
0.95             0.39     0.02    0.00    0.13              0.05                0.12              0.19
0.90             0.61     0.06    0.00    0.38              0.14                0.28              0.34
0.85             0.77     0.10    0.00    0.59              0.29                0.41              0.49
0.80             0.90     0.15    0.00    0.70              0.48                0.51              0.60
0.75             0.90     0.24    0.00    0.81              0.53                0.60              0.71
0.60             1.00     0.40    0.00    0.90              0.78                0.83              0.86
0.50             1.00     0.53    0.02    0.93              0.91                0.91              0.90

Pairwise comparison $P(y_{i,a} > y_{i,b})$ (a: row, b: column)
                    $GU_r$  $NU_r$  $\widehat{WV}_r$  $\widetilde{WV}_r$  $\widehat{UV}_r$  $\widetilde{UV}_r$
$GU_r$              0.00    0.97    0.07              0.06                0.20              0.10
$NU_r$              0.00    0.00    0.00              0.00                0.00              0.00
$\widehat{WV}_r$    0.85    0.98    0.00              0.66                0.55              0.47
$\widetilde{WV}_r$  0.79    0.98    0.22              0.00                0.33              0.28
$\widehat{UV}_r$    0.76    0.95    0.21              0.52                0.00              0.23
$\widetilde{UV}_r$  0.77    0.93    0.30              0.62                0.51              0.00
background. In addition to this noise, the binarized images selected
with $\widehat{UV}_r$ occasionally overestimate the foreground
contours.
We also concluded that measures based on the lognormal distribution
yield sharper foreground boundaries than those based on the normal
distribution. However, we suppose that $\widehat{WV}_r$ surpasses
both $\widetilde{UV}_r$ and $\widetilde{WV}_r$ because
$\widehat{WV}_r$ conserves the foreground contours fairly well and,
at the same time, generates little noise in comparison with
$\widetilde{UV}_r$ and $\widetilde{WV}_r$.
ACKNOWLEDGMENT
This research was partially supported by the National Council on
Science and Technology (CONACYT) of Mexico (grant number 218253).
REFERENCES
[1] M. A. Ramírez-Ortegón, E. Tapia, L. L. Ramírez-Ramírez, R. Rojas,
and E. Cuevas, "Transition pixel: A concept for binarization based on
edge detection and gray-intensity histograms," Pattern Recognition,
vol. 43, pp. 1233-1243, 2010.
[2] P. K. Sahoo, S. Soltani, A. K. Wong, and Y. C. Chen, "A survey of
thresholding techniques," Computer Vision, Graphics, and Image
Processing, vol. 41, no. 2, pp. 233-260, 1988.
[3] M. Sezgin and B. Sankur, "Survey over image thresholding
techniques and quantitative performance evaluation," Journal of
Electronic Imaging, vol. 13, no. 1, pp. 146-168, January 2004.
[4] S. Chabrier, B. Emile, C. Rosenberger, and H. Laurent,
"Unsupervised performance evaluation of image segmentation," Journal
on Applied Signal Processing, vol. 2006, pp. 1-12, 2006.
[5] H. Zhang, J. E. Fritts, and S. A. Goldman, "Image segmentation
evaluation: A survey of unsupervised methods," Computer Vision and
Image Understanding, vol. 110, pp. 260-280, September 2008.
[6] J. Sauvola and M. Pietikäinen, "Adaptive document image
binarization," Pattern Recognition, vol. 33, no. 2, pp. 225-236, 2000.
[7] T. Soft, Top OCR. Top Soft, 2008. [Online]. Available:
http://www.topocr.com/
[8] P. Stathis, E. Kavallieratou, and N. Papamarkos, "An evaluation
technique for binarization algorithms," Journal of Universal Computer
Science, vol. 14, no. 18, pp. 3011-3030, October 2008.
[9] Ø. D. Trier and A. K. Jain, "Goal-directed evaluation of
binarization methods," Transactions on Pattern Analysis and Machine
Intelligence, vol. 17, no. 12, pp. 1191-1201, 1995.
[10] M. D. Levine and A. M. Nazif, "Dynamic measurement of computer
generated image segmentations," IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 7, no. 2, pp. 155-164, 1985.
[11] Y. J. Zhang, "A survey on evaluation methods for image
segmentation," Pattern Recognition, vol. 29, no. 8, pp. 1335-1346,
1996.
[12] N. Otsu, "A threshold selection method from grey-level
histograms," IEEE Transactions on Systems, Man, and Cybernetics,
vol. 9, no. 1, pp. 62-66, January 1979.
[13] W. Janszoon and J. Blaeu, Theatrum Orbis Terrarum, Sive, Atlas
Novus. Blaeu Atlas, 1645. [Online]. Available:
http://www.library.ucla.edu/yrl/reference/maps/blaeu
[14] S. B. Needleman and C. D. Wunsch, "A general method applicable
to the search for similarities in the amino acid sequence of two
proteins," Journal of Molecular Biology, vol. 48, no. 3, pp. 443-453,
March 1970.