Characteristics of Optical Text Recognition Programs

V. L. Arlazarov*, A. S. Loginov**, and O. A. Slavin*

* Institute for Systems Analysis, Russian Academy of Sciences, pr. 60-letiya Oktyabrya 9, Moscow, 117312 Russia
** Moscow State Institute of Engineering Physics, Kashirskoe sh. 31, Moscow, 115409 Russia
e-mail: oslavin@cs.isa.ac.ru

Received November 28, 2001

Abstract

Characteristics of optical recognition programs are described from the standpoint of typical recognition program modules. Not only quality criteria for the recognition of separate characters but also parameters of other important stages of document input, such as character boundary segmentation, binarization, page segmentation, and storing of results, are discussed in detail. The set of characteristics presented can be used for the optimization of both separate recognition stages and the whole process of document input.
1. INTRODUCTION
Optical character recognition (OCR) programs are
currently widely used for the transformation of a
scanned document image into a text representation.
Stages of optical input of documents (such as binariza-
tion, page segmentation, and line recognition) and
properties of the recognition results (accuracy, ability
to reproduce the document structure) are described in
[1, 2]. Modern recognition algorithms have a rather complex multistage structure. However, many publications devoted to text recognition discuss mainly problems of the recognition of separate characters [3] and focus on the accuracy and rate of character recognition. Other properties of character recognition algorithms are set aside. However, in the development of an entire recognition complex, additional properties of the recognition algorithms often play a key role. This happens, for example, when combining recognition algorithms, bringing them into accord, and attaching a spelling checker. In this paper, we describe in detail sets of quantitative and qualitative characteristics of the functioning of various mechanisms used in OCR programs.
It should be emphasized that it is the characteristics
of the programs themselves (program modules and
groups of recognition tools, further referred to as librar-
ies) that are discussed in the paper. The case in point is
program implementations of the algorithms, whereas
properties of the algorithms themselves from the theo-
retical standpoint are not considered, in spite of the
importance of such studies. The emphasis on the pro-
gram characteristics can be explained by the following
reasons. The developer using a program implementation of a recognition algorithm often has no chance to go into the details of the program. However, if the characteristics of the algorithm are known, the programmer has an opportunity to carry the desired properties of the recognition algorithm over to higher-level modules and, in a number of cases, to improve them. With the help of the
set of characteristics, the optimization of the entire set
of the OCR program modules is possible. It does not
seem advisable (most likely, even impossible) to repre-
sent such a set as an abstract algorithm for the sake of
the theoretical study in view of its complexity and the
great number of intermodule interfaces. When describ-
ing characteristics of recognition algorithms, we will
gradually introduce the required terminology, thus creating a specific vocabulary aimed at reducing linguistic problems arising in the communication of OCR developers.
The need for a description of recognition algorithm characteristics is explained by specific features of
the development of a complex of OCR modules. There
is a feedback as well: namely, knowledge of these char-
acteristics makes it possible to design recognition algo-
rithms with desired properties. Note that characteristics
of program implementations of recognition algorithms
have certain advantages compared to those of abstract
prototypes. For example, if there are several recogni-
tion algorithms of the same type implemented in accor-
dance with a certain interface unification, they can be
compared in terms of their characteristics calculated on
representative test samples. The set of such characteris-
tics is helpful in revealing advantages of various algo-
rithms. Finally, if a recognition algorithm is used as a
component of a macroalgorithm for solving a certain
problem, the characteristics make it possible to formu-
late and set problems of improving both the general
algorithm and its components. In this paper, we
describe a scheme of the OCR functioning. The charac-
teristics presented below are grouped according to the
stages of the described recognition model. The discussion of the characteristics of character recognition algorithms (CRAs), the central part of the document recognition process, is the most detailed; for the other stages, only the most important characteristics are described.



2. OCR MODEL
The scheme of the multipass adaptive process of document recognition depicted in Fig. 1 consists of several stages, with each stage being associated with one or several libraries designed for solving problems of a certain kind.

Fig. 1. Stages of optical recognition of a document: scanning and binarization; page segmentation and separation of text lines; line recognition; postprocessing; additional learning (adaptation); formatting.

Scanning and binarization create a digital image of a page (or its fragment) in a form that guarantees an acceptable recognition of the graphical images contained in it. As a rule, the required image is binary (black-and-white), and the same geometrical region can be extracted many times with different parameters or binarization thresholds. The binarization itself transforms a grayscale or color image into a binary one. Grayscale or color images can be required, e.g., for filling illustration zones in final documents.

In the process of segmentation, the document is divided into zones such that each zone contains information that is homogeneous from the standpoint of the chosen classification: text segments, illustrations, tables, formulas, separating lines, etc. In turn, in text zones and tables, lines of text characters or text cells, respectively, are formed.
The recognition of lines is implemented in several passes. On each pass, characters with a priori unknown boundaries are recognized. The characters are classified with regard to the results of the recognition on previous passes. After the completion of the current pass, the results of the recognition are used for learning, for example, for the adaptation to specific features of the fonts used in the document. The postprocessing includes learning specific features of the document and modification of the description of its structure, i.e., the set of segments, their types, and characteristics. On the next pass, the characters are recognized with regard to the learning, i.e., by taking into account the information about the fonts used.
After the completion of all recognition passes, the results are formatted, i.e., reduced to one of the commonly accepted text formats, and subjected to the final postprocessing, which is aimed at improving the document appearance while keeping its structure unchanged.

In addition to the libraries that solve meaningful problems, interface program modules ensuring communication between the libraries and storing intermediate results are required.

The scheme described above displays not only the diversity of the problems being solved but also the need for the definition and formalization of criteria and characteristics of these computational processes, which is discussed in this paper.
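To make the interaction of these stages concrete, the following minimal Python sketch (our own illustration with hypothetical class and function names, not the structure of any particular OCR product) wires the stages of Fig. 1 into a multipass loop.

    # Minimal sketch of the multipass OCR scheme of Fig. 1 (hypothetical interfaces).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Zone:
        kind: str                                   # "text", "table", "illustration", ...
        lines: List[str] = field(default_factory=list)

    @dataclass
    class Document:
        image: object                               # binarized page image
        zones: List[Zone] = field(default_factory=list)
        fonts: dict = field(default_factory=dict)   # adaptation data learned between passes

    def scan_and_binarize(raw_page):                # scanning and binarization
        return Document(image=raw_page)

    def segment_page(doc):                          # page segmentation and separation of text lines
        doc.zones = [Zone(kind="text")]
        return doc

    def recognize_lines(doc):                       # line recognition using the fonts learned so far
        for zone in doc.zones:
            if zone.kind == "text":
                zone.lines = ["..."]                # placeholder recognition result
        return doc

    def postprocess_and_learn(doc):                 # postprocessing and additional learning (adaptation)
        doc.fonts["learned"] = True
        return doc

    def format_result(doc):                         # formatting: reduce to a common text format
        return "\n".join(line for z in doc.zones for line in z.lines)

    def recognize(raw_page, passes=2):
        doc = segment_page(scan_and_binarize(raw_page))
        for _ in range(passes):                     # multipass adaptive recognition
            doc = postprocess_and_learn(recognize_lines(doc))
        return format_result(doc)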
3. SOME GENERAL CHARACTERISTICS
OF RECOGNITION METHODS
Let a base B of samples be given, which is meant in a wide sense as a set of objects (graphical images of separate lines or characters, whole pages, sets of lines, and the like) that are to be recognized. Let a set S of generalized codes be given. The elements of this set can be viewed as certain structures describing properties or encodings of elements from B; for example, for character recognition algorithms, this is the set of all possible character codes; for page segmentation algorithms, this is the set of all possible structures describing block boundaries and the types of blocks belonging to them; and the like. We will use the notation M for the product of the sets, M = S × W, where W = {−1, 0, 1, 2, ..., W_max} is a finite set of estimates and W_max is the maximal possible estimate.

An algorithm recognizing the test set B is a function A(b) defined on the base B with the range belonging to the m-dimensional space M^m (m is the maximum possible number of alternatives used by a given method) and satisfying the condition

$$W_1 \ge W_2 \ge \ldots \ge W_m. \qquad (1)$$

Thus, A(b) = ((S_1(b), W_1), (S_2(b), W_2), ..., (S_m(b), W_m)), where S_i(b) is the generalized code of the i-th alternative and W_i is the estimate of this alternative. This set is referred to as a collection of alternatives.

Alternatives with nonnegative estimates and those with the estimate −1 are said to be nonempty and empty, respectively. Note that, for the latter alternatives (S_i, −1), the value S_i does not matter, and we use the symbol of the empty set ∅ for it. If the number of alternatives returned by a method is less than m, the last places in the collection of alternatives are completed by empty alternatives. Accordingly, if all alternatives of a collection are empty, it is referred to as an empty collection.

Let, on a given base, a decoding function K(b) be defined that maps the base into the space M^m such that only the first alternative of the vector K(b) is nonempty, K(b) = ((S_1(b), W_1), (∅, −1), ..., (∅, −1)). Suppose that, in the space of generalized codes S, a function r is defined that identifies elements of this set, such that r(s_1, s_2) = 0 if the structures s_1 and s_2 from the set S are considered to be identical by the expert, and r(s_1, s_2) = 1 otherwise. For example, for algorithms with fixed coding, such as character recognition algorithms, this function is defined as the distance

$$r(s_1, s_2) = \begin{cases} 1, & \text{if } s_1 \ne s_2, \\ 0, & \text{if } s_1 = s_2. \end{cases}$$
A similar function r can be defined in the space S^m. The recognition accuracy of an algorithm A on a base B for a given identifying function r is the function defined as

$$v = v(A, B, r) = \frac{\sum_{b \in B} \bigl(1 - r(K(b), A(b))\bigr)}{|B|}, \qquad (2)$$

where |B| denotes the number of elements in the base B. Note that, in this definition (as well as in several of the subsequent definitions, up to (7)), the estimates themselves are not used.
Consider two important cases of the definition of the identifying function in M^m (or, more precisely, in S^m) by a given function r on S. Let U, V ∈ S^m, U = (S_1, S_2, ..., S_m), and V = (T_1, T_2, ..., T_m). The identifying function in S^m can be defined in the two following ways:

$$r_1(U, V) = r(S_1, T_1) \quad \text{and} \quad r_2(U, V) = \prod_{i=1}^{m} r(S_1, T_i).$$

A collection A(b) is said to be correct if r_1(K(b), A(b)) = 0 and complete if r_2(K(b), A(b)) = 0. By means of the functions r_1 and r_2 and definition (2), we now introduce the notions of the recognition accuracy in a strong and weak sense.

The recognition accuracy in a strong sense of a method A with respect to a base B is defined as

$$v(A, B, r_1) = \frac{\sum_{b \in B} \bigl(1 - r_1(K(b), A(b))\bigr)}{|B|}. \qquad (3)$$

The recognition accuracy in a weak sense of a method A with respect to a base B is defined as

$$v(A, B, r_2) = \frac{\sum_{b \in B} \bigl(1 - r_2(K(b), A(b))\bigr)}{|B|}. \qquad (4)$$

We emphasize again that only generalized codes are used in the above definitions, and the estimates of the alternatives are not used.
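As an illustration of definitions (2)-(4), the following Python sketch (our own helper code; a base element is a pair of an image and its true code, and a recognizer returns a collection of (code, estimate) pairs) computes the recognition accuracy in the strong and weak sense.

    # Sketch: recognition accuracy in the strong (3) and weak (4) sense.
    # A collection of alternatives is a list of (code, estimate) pairs; an empty
    # alternative is (None, -1). K(b) is represented by the ground-truth code of b.

    def r(s1, s2):
        """Base identifying function: 0 if the codes are identical, 1 otherwise."""
        return 0 if s1 == s2 else 1

    def r1(true_code, collection):
        """Strong metric: compare the true code with the leading alternative only."""
        return r(true_code, collection[0][0]) if collection else 1

    def r2(true_code, collection):
        """Weak metric: 0 if the true code occurs among any of the alternatives."""
        prod = 1
        for code, _ in collection:
            prod *= r(true_code, code)
        return prod if collection else 1

    def accuracy(base, algorithm, metric):
        """v(A, B, r_i): fraction of base elements b with metric(K(b), A(b)) = 0."""
        return sum(1 - metric(true, algorithm(img)) for img, true in base) / len(base)

    # Example with a hypothetical recognizer that always returns two alternatives:
    base = [("img1", "a"), ("img2", "b")]
    algorithm = lambda img: [("a", 240), ("b", 55)]
    strong = accuracy(base, algorithm, r1)   # 0.5
    weak = accuracy(base, algorithm, r2)     # 1.0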
Another characteristic of an algorithm is associated with its stability to noise. The robustness of an algorithm A to a perturbation b* = F(b) with respect to a base B in a strong sense is defined as

$$\frac{\sum_{b \in B} \bigl(1 - r_1(K(b), A(b))\bigr)\bigl(1 - r_1(K(b^*), A(b^*))\bigr)}{|B|}. \qquad (5)$$

In the case of character recognition, this quantity is equal to the ratio of the number of coincidences of the correct codes of the first alternatives S_1(b) and S_1(b*), when recognizing the image b and the image b* obtained from b by means of a geometrical transformation F (rotation, scaling, or distortion), to the total number of elements of the base. In practice, distortions of this kind occur when scanning a text page.

The robustness of an algorithm A to a perturbation b* = F(b) with respect to a base B in a weak sense is defined as

$$\frac{\sum_{b \in B} \bigl(1 - r_1(A(b), A(b^*))\bigr)}{|B|}, \qquad (6)$$

which is the ratio of the number of coincidences of the codes of the first alternatives S_1(b) and S_1(b*) when recognizing the images b and b* to the total number of elements of the base. Definitions similar to (5) and (6) can be introduced for the function r_2 as well.
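Under the same conventions, the robustness characteristics (5) and (6) can be estimated by recognizing each base element together with its perturbed copy; the perturbation F and the recognizer here are hypothetical placeholders.

    # Sketch: robustness to a perturbation F in the strong (5) and weak (6) sense.
    def strong_robustness(base, algorithm, perturb, r1):
        total = 0
        for img, true in base:
            ok_orig = 1 - r1(true, algorithm(img))
            ok_pert = 1 - r1(true, algorithm(perturb(img)))
            total += ok_orig * ok_pert            # both b and b* recognized correctly
        return total / len(base)

    def weak_robustness(base, algorithm, perturb):
        # Coincidence of the leading codes for b and b*, regardless of correctness.
        hits = sum(1 for img, _ in base
                   if algorithm(img)[0][0] == algorithm(perturb(img))[0][0])
        return hits / len(base)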
The estimate distribution for an algorithm A is the set of frequencies {v(−1), v(0), v(1), ..., v(W_max)} corresponding to all possible estimates, where

$$v(W) = \frac{\sum_{b \in B} H(W_1(b) - W)\bigl(1 - r_1(K(b), A(b))\bigr)}{|B|} = \frac{N(W)}{|B|}. \qquad (7)$$

Here, N(W) = Σ_{b∈B} [H(W_1(b) − W)(1 − r_1(K(b), A(b)))] is the number of images correctly recognized by the first alternative with the estimate W, and

$$H(x) = \begin{cases} 1, & x > 0, \\ 0, & x \le 0 \end{cases}$$

is the Heaviside unit-step function.

Some of the above characteristics can be interpreted as probabilities of certain events. For example, the estimate v(W) is the probability of the correct recognition of an element b with the estimate W. An elementary event in this case is the result of the recognition of the element b by the algorithm A. The set of all such events on the base B is a field of elementary events, with each of them having the probability 1/|B|. The recognition accuracy in the strong sense, defined by formula (3), is the probability
of the correct recognition of the element b by the method A. In this case, the space of elementary events is the same as in the previous case. The event with the probability equal to the recognition accuracy in the weak sense (defined by formula (4)) is the complete recognition of the element b on the same field of elementary events. The robustness characteristics also have probabilistic meanings. For example, the strong robustness of an algorithm A to a perturbation F is equal to the probability of the correct recognition of both the element b and its perturbation b* = F(b) by the algorithm. An elementary event is the result of the operation of the algorithm A on the element b and its perturbation b*. The field of elementary events consists of |B| equiprobable events.
The character recognition algorithms that have estimate distributions with low frequencies of the leading estimates require a statistical recalculation of the estimates. A generalization of the frequency v(W) of taking a given value W is the frequency v(c, d) of taking a value belonging to a given half-interval (c, d]:

$$v(c, d) = \frac{\sum_{b \in B} H(W_1(b) - c)\bigl(1 - H(W_1(b) - d)\bigr)\bigl(1 - r_1(K(b), A(b))\bigr)}{|B|} = \frac{N(c, d)}{|B|}, \qquad (8)$$

where N(c, d) = Σ_{b∈B} [H(W_1(b) − c)(1 − H(W_1(b) − d))(1 − r_1(K(b), A(b)))] is the number of images recognized with estimates belonging to the half-interval (c, d].
The monotonicity of estimates is the property of the estimates of the alternatives (in the first place, the leading ones) to characterize the reliability of character recognition. The idea behind the introduction of such a characteristic is that, for a number of reasons, we want to have algorithms that possess the following property. Let two recognition estimates be given, W_1 < W_2. The probability of the correct recognition of an object under the condition that the recognition estimate is W_1 is to be less than that under the condition that the estimate returned by the algorithm is equal to W_2. Real algorithms may not possess this property; therefore, there is a need for a similar, but less restrictive, characteristic. The latter can be obtained by replacing the condition of the exact equality of the estimate to a given value by the condition that this estimate belongs to a given interval. Note that it is possible to use algorithms that do not yield estimates at all or generate identical estimates for the alternatives of the collection, W_i = W_j (where i ≠ j), if this drawback is compensated by other characteristics, e.g., by a good performance of the algorithm used. However, if an algorithm A does generate estimates, we arrive at the question of their reliability in the sense described above. Let Π = {0 = x_0 < x_1 < ... < x_n = W_max} be a partitioning of the interval [0, W_max]. If there exists a partitioning with a monotonically increasing sequence of frequencies, v(x_i, x_{i+1}) ≤ v(x_{i+1}, x_{i+2}), i = 0, 1, ..., n − 2, then the algorithm is considered to be monotonous.
Let α ∈ [0, 1] and W = αW_max. We introduce one more characteristic related to the monotonicity. The threshold monotonicity is defined as

$$\frac{\sum_{b \in B} \bigl(1 - H(W - W_1(b))\bigr)\, r_1(K(b), A(b))}{\sum_{b \in B} \bigl(1 - H(W - W_1(b))\bigr)} = \frac{N_{\mathrm{err}}(W)}{N(W)}, \qquad (9)$$

where N_err(W) is the number of incorrectly recognized images with the estimate W_1 ≥ W and N(W) is the number of correctly recognized images with the estimate W_1 ≥ W.

The value of the threshold monotonicity for α = 1 is of interest.
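One possible way of computing the estimate distribution (7), the interval frequencies (8), and the threshold monotonicity (9) on a test base is sketched below (our own helper code under the same conventions as above; the denominator of (9) is taken here as the number of all images whose leading estimate is at least W).

    # Sketch: estimate distribution (7), interval frequencies (8),
    # and threshold monotonicity (9) for a recognizer with integer estimates.

    def leading(collection):
        return collection[0]                      # (code, estimate) of the leading alternative

    def estimate_distribution(base, algorithm, w_max):
        """v(W) for W = -1, 0, ..., W_max: correctly recognized images with W_1(b) > W."""
        results = [(leading(algorithm(img)), true) for img, true in base]
        return {w: sum(1 for (code, est), true in results if est > w and code == true) / len(base)
                for w in range(-1, w_max + 1)}

    def interval_frequency(base, algorithm, c, d):
        """v(c, d): correctly recognized images whose leading estimate lies in (c, d]."""
        n = sum(1 for img, true in base
                if c < leading(algorithm(img))[1] <= d and leading(algorithm(img))[0] == true)
        return n / len(base)

    def threshold_monotonicity(base, algorithm, w):
        """Fraction of errors among the images whose leading estimate is at least w."""
        selected = [(leading(algorithm(img)), true) for img, true in base
                    if leading(algorithm(img))[1] >= w]
        if not selected:
            return 0.0
        return sum(1 for (code, _), true in selected if code != true) / len(selected)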
4. RECOGNITION OF SEPARATE CHARACTERS
Consider the recognition of separate characters as a program module function. An intuitively clear characteristic, which is most frequently used in the literature, is the recognition error, meant as the fraction of incorrectly identified images in the set of all objects to be classified [4]. In the case of the recognition of decimal digits, this definition needs no comments; however, in the general case, the notion of accuracy needs to be refined, which is done below. Another intuitive characteristic, performance, meant as the number of character images recognized per unit time, needs to be redefined immediately, since it depends on the computational platform and on a particular implementation of the recognition algorithm. Since, in this work, we consider only program modules rather than theoretical algorithms, the way the algorithm is implemented is not important by virtue of the fact that different implementations of recognition algorithms as program modules are considered to be different character recognition algorithms. It is worth noting that we used mainly the Wintel (IBM PC, Intel CPU, Microsoft Windows OS) and Apple Mac platforms, i.e., personal computer platforms.

In what follows, we assume that character recognition algorithms (CRAs) are implemented as program modules with the following standard interface. The input parameter of a CRA is a graphical image of a character (further referred to simply as an image or a character). Sources of images may be different; for example, graphical images and the corresponding codes can be extracted from a database of graphical images [5]. Thus, the test set B for such an algorithm is the set of all possible graphical images. The set of generalized codes S in the simplest case is the set of all possible codes of characters of a certain language; e.g., in the case of English, S = {0, 1, ..., 9, a, b, ..., z, A, B, ..., Z}. The result
of the operation of a character recognition algorithm A on an image b is the set of alternatives

$$A(b) = \{(S_1, W_1), \ldots, (S_m, W_m)\} = \{(S_1(b), W_1(b)), \ldots, (S_m(b), W_m(b))\},$$

where S_i(I) is a character code, also referred to as an alternative code, and W_i(I) is a character estimate, 1 ≤ i ≤ m. For the set of codes, the abridged notation

$$S(I) = \{S_1, \ldots, S_m\} = \{S_1(I), \ldots, S_m(I)\}$$

is also used.

The alternative (S_1, W_1) is referred to as the leading alternative. If the images are extracted from a test sequence B of the type described in [5], it is possible to determine whether an image b is recognized correctly, as described in Section 3. The strong and weak recognition accuracy for a character recognition algorithm can be defined by formulas (3) and (4), respectively. In this case, it is assumed that the process of learning the algorithm on a different sequence of images has been completed and that the sequence of images {b} is representative from the standpoint of its acceptability for studying CRAs (these assumptions are implied in the subsequent definitions of other characteristics).

Let us list several characteristics of character recognition algorithms that are closely related to the accuracy characteristic.

We start from the type of images being recognized, which may be printed, handprinted, or handwritten, as shown in Fig. 2. In the majority of cases, the input data for a character recognition algorithm is not an original image I but rather a certain mapping (representation) Φ(I) of the image into the same, or a different, space. An example of such a mapping into the same space is the normalization of a bit image by size. Another example is a mapping of a graphical object into a certain finite-dimensional space of attributes. In other words, the accuracy of a character recognition algorithm on an array of images {I_i} may depend on (or be limited by) the losses associated with the transformation {Φ(I_i)}.
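The standard CRA interface described above can be summarized by the following sketch (hypothetical type names): a module takes a character image and returns a collection of alternatives ordered by nonincreasing estimate, padded with empty alternatives when necessary.

    # Sketch of the standard CRA interface: image in, collection of alternatives out.
    from dataclasses import dataclass
    from typing import List, Optional, Protocol

    @dataclass
    class Alternative:
        code: Optional[str]     # generalized code S_i(I); None for an empty alternative
        estimate: int           # estimate W_i(I); -1 marks an empty alternative

    class CharacterRecognizer(Protocol):
        def recognize(self, image) -> List[Alternative]:
            """Return the collection ((S_1, W_1), ..., (S_m, W_m)) with W_1 >= ... >= W_m."""
            ...

    def pad_to_m(collection: List[Alternative], m: int) -> List[Alternative]:
        """Fill the trailing places of a short collection with empty alternatives."""
        return collection + [Alternative(code=None, estimate=-1)] * (m - len(collection))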
High accuracy of recognition based on the use of a given mapping on a certain type of images does not guarantee that the accuracy is high on another type of images. For example, a representation (mapping) sufficient for reliable identification of handprinted ZIP codes (Fig. 2c) may be insufficient for more diverse handprinted images.

A character recognition algorithm is said to be universal if it is (i) font-independent and (ii) size-independent. The former property means that the algorithm can work with different fonts; the latter implies that the algorithm can recognize images without restrictions on their size. One and the same character recognition algorithm trained at different degrees of universality (e.g., for one font of one size, for one font of an arbitrary size, or for any font of an arbitrary size) yields different accuracy characteristics.
When considering the recognition accuracy, we cannot do without the notion of an alphabet, i.e., a list of classes of recognizable images. There are several characteristics associated with the alphabet. We begin with the learning alphabet, which is the list of classes of a learning sequence. For OCR programs, these classes are unions of character subsets of various languages (see, e.g., Fig. 3), digits, and special characters (e.g., @#$%&*, etc.). For simple types, this list is unambiguous; for example, for ZIP codes or handprinted digits, the learning alphabet is the set of ten digits. Characters that are more diverse from the standpoint of their outline, e.g., images of printed fonts, can have several graphemes, i.e., design types corresponding to one character; a letter of the Russian alphabet, for example, can have several such graphemes. These graphemes can be grouped into one or several classes corresponding to one and the same character. If a character recognition algorithm was trained on learning sequences containing the same images but different names of graphemes, we get different accuracy. The notion of the factor alphabet of a CRA, defined as the subset of the learning alphabet in which the CRA can distinguish all classes, is also very important. At first glance, the notion of the factor alphabet may seem artificial; however, the following example demonstrates its usefulness. A CRA that uses scaleable representations is not able to distinguish between capital and lowercase letters if they have the same outline. However, many letters in various alphabets have this feature (e.g., a number of letters of the Russian alphabet), and an OCR program must distinguish between them.

Fig. 2. Examples of characters of different types: (a) printed, (b) handwritten, and (c) handprinted ZIP code.

Fig. 3. Characters from different languages.
Therefore, we need accuracy characteristics of character recognition algorithms obtained on the factor alphabet. If the method fails to distinguish some classes of the learning alphabet, additional CRAs or other tools resolving this problem should be used in the OCR program.

Thus, the recognition accuracy of a character recognition algorithm is connected with the type, representation, and alphabet that characterize the learning and test sequences. CRAs based on one and the same theoretical algorithm but trained differently can give considerably different values of the recognition accuracy. A high recognition accuracy calculated on the factor alphabet by itself does not guarantee high accuracy of the OCR program, as was demonstrated by the above example.

For a CRA, the recognition accuracy in the weak sense v(A, B, r_2), defined by formula (4), is also referred to as the recognition completeness of the CRA (CRA completeness).
Generally, an image that is recognized correctly is also recognized completely. Of interest is the case of completely recognized characters that are recognized incorrectly, i.e., the case where the original code of a character coincides with the code of an alternative that is not the leading one, K(I) = S_i(I), i > 1. Incorrectly recognized, but complete, collections have a chance to be corrected or can generate new correct collections (K(I) = S_1(I)) in combination with collections of other methods. Sometimes, it is possible to find a mechanism that can transform incorrect complete collections into correct ones. Clearly, it is always possible to complete a collection up to a complete one by adding alternatives with low estimates. However, an increase in the number of the considered alternatives, when the probability of the occurrence of the correct character in the collection does not grow, can result in a reduction of the accuracy of the final recognition, to say nothing of the fact that the operation time of the algorithm increases in this case. An example of a correction mechanism is context-linguistic methods that use character alternatives formed in advance. Thus, the improvement of the completeness characteristic of the final algorithm greatly depends on the algorithm of the subsequent processing of the set of alternatives. A formal improvement of the recognition completeness through the addition of new alternatives can result in ambiguous subsequent processing of the results M(I) and, in particular, worsen the results. In other words, the determination and optimization of the recognition completeness characteristic should be agreed with the other OCR algorithms that follow the character recognition.
We now discuss other characteristics of character recognition algorithms.

The recognition rate (performance) is the number of images recognized per unit time in processing a test sequence. The recognition rate allows for the time expenditures of the base method but does not take into account the overhead of reading images from the depository. This characteristic depends on the implementation of the algorithm on a particular computer with the use of some compiler, assuming that the compilation parameters are set to provide the maximum performance. To compare the rates of different methods, they have to be implemented under the same conditions. The meaningful optimization of the performance of a method suggests both the reduction of the algorithm complexity and the use of specific features of the processors on which the method library is implemented.
Refusability is the property of a character recognition algorithm to generate collections of zero size (empty collections). A CRA refusal implies that the algorithm has encountered an unfamiliar image or an image that differs considerably from those used in the learning process. The image that resulted in a refusal can be interpreted in two ways: it is either a noncharacter or a character considerably differing from those in the learning sample.
In Section 3, we defined the notions of the weak and strong robustness of a character recognition algorithm to perturbations and the notion of the estimate distribution {v(0), v(1), ..., v(W_max)} for a given estimate W or for a given interval (see (7) and (8)). The important properties of these algorithms are those related to the estimate monotonicity and the robustness to perturbations (see (5) and (6)). The character recognition algorithms that have estimate distributions with low (or zero) frequencies of the leading estimates require a statistical recalculation of the estimates.

The monotonicity of estimates is the property of the estimates of the alternatives (in the first place, the leading ones) to characterize the reliability of character recognition (see formula (9)). Note that it is possible to use algorithms that do not yield estimates at all or generate identical estimates W_i = W_j (where i ≠ j) of the alternatives of the collection if this drawback is compensated by other characteristics, e.g., by a good performance of the algorithm used.
The easiness of learning depends on the ratio of the size of the learning image base to the time required for the learning. The primary learning on large graphic image bases may take as long as desired; however, learning in the process of recognition is possible only in methods that admit correction without learning on a primary character sample. This characteristic is closely connected with the capabilities of the computers used and the OCR requirements. In the simplest case, the easiness of learning can be defined as N/T,
where N is the number of images in the learning sequence and T is the learning time. This formula can be redefined to take into account the frequencies of character occurrences.

When used on platforms with limited memory, the program compactness, defined as the amount of memory occupied by the CRA together with the loaded data (e.g., tables of templates), is an important characteristic of the program. For OCR programs operating on personal computers, this characteristic is not important.

Note that some of the above characteristics depend on the contents of the image bases. The stability of these characteristics on large bases allows these CRA properties to be considered invariant with respect to particular samples of character images.

Some of the above characteristics (accuracy, performance, estimate monotonicity) characterize the CRA performance. The others (alphabet, representation, type) are important from the standpoint of the applicability of the algorithm to particular OCR problems or in terms of combining it with other character recognition algorithms.
5. RECOGNITION OF TEXT LINES
In the OCR methods, the recognition cannot gener-
ally rely on the knowledge of approximate character
boundaries. Only boundaries of the text line are a priori
reliable, whereas elements in the line may be images of
characters, or parts of characters, or combinations of
characters (or their parts), or images that are not text
characters.
In view of the possibility of combining several connectivity components under the condition of possible distortion of the page image, the segmentation (search for boundaries and localization) of the characters of a printed text suggests several interrelated procedures, which are described, e.g., in [7]. These are (i) finding the regions where the segmentation of glued characters is required, (ii) construction of a set of separative curves that are candidates for the segmentation of connectivity components, and (iii) search for possible variants for finding an optimal path in the graph of cutting curves.

Thus, the search for character boundaries under the condition of real scanning defects suggests combining various images extracted from the region within the boundaries of the text line. The character segmentation algorithm can use estimates of images generated by the character recognition algorithm and penalizing mechanisms [8] for comparing variants and distinguishing between characters and noncharacters. In the general case, the number of possible variants in such searches is great, and, in practice, various heuristic criteria are used for decision making (e.g., in the segmentation of character boundaries) [9].
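The search over segmentation variants can be organized, for example, as a search for an optimal path over candidate cutting positions. The following dynamic-programming sketch is only one simple way of doing this (the scoring function standing in for the CRA estimates and the limit on character width are hypothetical); it is not the algorithm of [7]-[9].

    # Sketch: choosing cutting positions in a line by dynamic programming over
    # candidate cuts, scoring each candidate character image by a CRA-like estimate.

    def best_segmentation(cuts, score):
        """cuts: sorted candidate cut positions (including line start and end);
        score(i, j): estimate of the image between cuts[i] and cuts[j].
        Returns (best total score, chosen cut positions)."""
        n = len(cuts)
        best = [float("-inf")] * n
        prev = [0] * n
        best[0] = 0.0
        for j in range(1, n):
            for i in range(max(0, j - 6), j):       # limit a character to at most 6 gaps
                candidate = best[i] + score(i, j)
                if candidate > best[j]:
                    best[j], prev[j] = candidate, i
        path, j = [cuts[-1]], n - 1
        while j > 0:
            j = prev[j]
            path.append(cuts[j])
        return best[-1], path[::-1]

    # Example with a toy scoring function (a real system would call the CRA here):
    cuts = [0, 10, 18, 30, 41]
    score = lambda i, j: 1.0 if (cuts[j] - cuts[i]) >= 10 else -1.0
    print(best_segmentation(cuts, score))           # (3.0, [0, 10, 30, 41])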
For the estimation of the quality of segmentation algorithms, the same characteristics that are used for the estimation of the properties of character recognition algorithms are used: alphabet, easiness of learning, recognition accuracy, recognition completeness, performance, and estimate monotonicity. Many CRA characteristics are inherited by the segmentation algorithms (SAs). For example, the performance of an SA and that of the CRA used by it are related directly, since the segmentation algorithm spends all the time remaining after the recognition on manipulations with images. The recognition completeness is determined only by that of the method used, and the monotonicity can be improved through the use of the penalizing mechanisms of the SA itself. When examining these characteristics, one faces the problem of choosing an appropriate metric in the space of generalized codes. Elements of this space are certain structures, for example, circumscribed rectangles that enclose separate characters and possible candidates for the cutting curves. The determination of functions evaluating the closeness of such objects is a separate problem.

The notion of the SA alphabet is introduced in view of the fact that the outlines of characters in different alphabets require different models of generation of segmentation variants, different penalizing functions, and so on.

As applied to segmentation algorithms, the easiness of learning is meant as the time expenditures on learning new geometric configurations that cannot be correctly analyzed by the existing segmentation means. An SA module can be trained both manually (by incorporating new algorithms) and automatically [7].
Characteristics of the segmentation include also
special criteria related to the segmentation quality. We
discuss in more detail three of them: segmentation
accuracy, segmentation completeness, and segmenta-
tion performance. We will use the term zone to denote
a region containing a group of images the boundaries of
which are to be determined (character localization).
The generalized code for a segmentation algorithm
is a sequence of cutting curves that separate character
images to be recognized by a character recognition
algorithm. The result of the segmentation of a given
zone by an algorithm is a sequence of rectangles (or, in
the context of this paper, a sequence of graphical
images), which will be further subjected to the recogni-
tion. As discussed in Section 3, alternative results of the
zone segmentation can be represented as the sequence
where A
i
= A
i
(b) is a variant of partitioning the zone into
separate characters upon the action of the algorithm A
on an element of the test set b and W
i
is an estimate of
the segmentation variant. In fact, the structure A
i
should
contain also a sequence of pairs of bit images I
ij
with
the collections M
ij
of the recognition alternatives for
these images and references to the images I
ij
:
A
1
W
1
, ( ) A
m
W
m
, ( ) , , { },
M
i1
I
i1
, ( ) M
ik
I
ik
, ( ) , , { }.
Ideally, the number k of the images found should coincide with the zone content, i.e., with the number of actual characters located in the zone.

Suppose that, in the space of generalized codes S, a base (or threshold) distance r is defined, and let r_1 and r_2 be the strong and weak metrics, respectively, generated by the base distance (see Section 3). Then, formulas (3) and (4) introduced in Section 3 define the segmentation accuracy in the strong and weak sense, respectively. Generally speaking, the segmentation accuracy has nothing to do with the recognition accuracy. For example, if all characters from the test sequence are recognized incorrectly, the segmentation of their boundaries can still be correct; i.e., the recognition accuracy may be zero when the segmentation accuracy equals 100%. The segmentation accuracy depends both on the adequacy of the segmentation variants generated by the SA and on the tools for the verification of hypotheses about whether the generated images belong to the sets of characters or noncharacters (i.e., on the definition of the base distance).
The notion of the zone segmentation completeness, defined in Section 3 for the general case, can be extended to the case of segmentation algorithms as well. An example of sequences of almost indistinguishable characters is shown in Fig. 4. The choice between them can be made by means of context-linguistic methods, or immediately in the process of segmentation, or at subsequent stages. If the stages of the additional recognition of character lines are separated in time, storage of the alternatives is necessary. The segmentation completeness is the fraction of the zones that are subjected to the segmentation and whose alternatives contain correct segmentation variants.

The segmentation performance is defined as the number of zones segmented per unit time. This characteristic directly depends on the mathematical complexity of the algorithm for searching for segmentation variants and on the performance of the method used for the recognition of one character. It is advisable to evaluate the segmentation performance for zones consisting of a fixed number of characters and then apply averaging by the formula

$$\frac{\sum_{i=2}^{N} T_i v_i}{\sum_{i=2}^{N} v_i},$$
where T_i is the segmentation performance for zones of length i, v_i is the occurrence frequency for zones of length i, and N is the maximum number of characters in a segmentation zone.
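The averaging over zone lengths is a plain weighted mean; a minimal sketch with purely illustrative numbers follows.

    # Sketch: averaged segmentation performance over zone lengths 2..N.
    def averaged_performance(t, v):
        """t[i], v[i]: performance and occurrence frequency for zones of length i."""
        return sum(t[i] * v[i] for i in t) / sum(v[i] for i in t)

    t = {2: 900.0, 3: 700.0, 4: 520.0}     # zones per second, by zone length (illustrative)
    v = {2: 0.5, 3: 0.3, 4: 0.2}           # occurrence frequencies of those lengths
    print(averaged_performance(t, v))      # 764.0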
The characteristics of a segmentation algorithm are
evaluated on a test sample of zones containing hetero-
geneous text images with several characters rather than
on sequences of separate characters, which do not con-
tain structural context of recognition of words belong-
ing to the document.
As can be seen, segmentation algorithms are characterized by both their own characteristics and those of
the character recognition algorithms used. For example, when defining the base distance in the space of generalized codes of line segmentation, it is admissible, and even desirable, to use character recognition algorithms in order that the basic metric take into account the possibility of subsequent correct recognition of the separate characters occurring in the rectangles obtained after the segmentation. Thus, valid debugging and optimization of a segmentation algorithm are possible only if the properties of the segmentation and character recognition algorithms match.

The segmentation algorithm is an important, but not the only, algorithmic component of the text line recognition module. This module must also include the recognition of punctuation marks; the placement of spaces; and the determination of base lines, character case [8], and attributes of text characters. Naturally, these processes are controlled by means of quality characteristics similar to those of character recognition (recognition accuracy, recognition completeness, performance, learning ability, and robustness to image distortions in the classification of spaces, punctuation marks, special characters, and character case and attributes).

The stage of text line recognition can be repeated after additional training, which is aimed at learning specific features of the printed fonts used and providing opportunities for improving the quality of the character recognition. The quality of the multipass scheme, which is represented as the combination of the additional learning and adaptive recognition, is estimated in terms of the following characteristics: learning performance (time expenditures on learning specific page features and constructing the fonts used in the document), characteristics of adaptive character recognition (recognition quality, completeness, performance, and other characteristics of the CRA that uses the fonts constructed as a result of the learning), and adaptive recognition efficiency (the fraction of the characters whose quality of recognition was improved at the current repeated recognition stage).
Algorithms for the recognition of text lines are more complicated than character recognition algorithms (although the latter are the most important components of the former). This observation is substantiated by the availability of additional characteristics of the text line recognition algorithms, as well as by the fact that the problem of choosing base distances for general characteristics of algorithms is a difficult one.

Fig. 4. Example of almost indistinguishable character sequences.
6. PAGE SEGMENTATION
Page segmentation is aimed at finding text fragments, structural blocks (tables and forms), and graphical blocks. Structural blocks, in turn, can contain other structures [11], in particular, text lines. In what follows, the search for blocks and their classification is referred to as typification. The page segmentation can be repeated many times with the use of the recognition results obtained on the previous page processing passes.

In view of the above-mentioned relationship between the page segmentation and recognition stages, we assume that the input data for the page segmentation are primary segments, for example, connectivity components found in the binary image of the page or character images obtained after the current stage of line recognition. The set of primary segments is analyzed by page segmentation algorithms, and the segments are grouped into typical blocks, some of which possess their own structure. In accordance with this, we can define accuracy and completeness characteristics related to two levels of the page segmentation, namely, to the block determination and to the block structure.

We discuss first the characteristics of the block determination. Elements of the space of generalized codes for the page segmentation are structures describing sets of blocks, where each block is a polygon (in the simplest case, a rectangle) of a certain type together with the primary segments assigned to it. For the test set, or the sample base, we take a set of document samples containing blocks of various types. Such sample bases can contain typical classes, such as book pages, magazine advertisement pages, and the like.
By choosing the base distance in the space of generalized codes, certain specific segmentation characteristics can be introduced, such as the accuracy of the typification and the accuracy of the block boundary segmentation. In the latter case, the definition of the base distance should include conditions related to penalties for missing blocks, for excessive primary segments within the boundaries of the separated regions, and for the intersection of the separated blocks.

Such a description of the accuracy of the typification and boundary segmentation is formulated after the elaboration of criteria of the typification correctness and criteria of matching the boundaries of primary segments and blocks. First, we discuss the typification. In view of the small number of block types, the problem is similar to that of the classification of characters of a given alphabet. This analogy extends also to the difficulties of unambiguous typification in some simple cases. Figure 5 shows an example of an image that can be classified both as an illustration (cul-de-lampe) and as an image of the letter H. The choice of the image type is not objective and depends on the subsequent processing; for example, when exporting the results to a polygraphic system, the illustration is preferable; if the results are saved in a text format, the letter should be chosen.

The estimation of the correctness of the selection of block boundaries and their mutual location is also an ambiguous process, which depends on the model of the subsequent processing and page image recognition. Figure 6 shows different variants of the segmentation of two text columns flowing around an illustration block. The choice of the best segmentation variant depends on the means (and desire) available for storing the illustration in the recognition results, as well as on the text block recognition algorithms. In particular, for the algorithms that rely on text homogeneity in a block, e.g., for the learning within the block, variants (b) and (d) are preferable since they result in larger blocks.

The above examples show the necessity of an alternative classification and selection of block boundaries, which could assign each primary segment to one or to several blocks (of one or several types).
Let us refine the notion of the estimate of segmentation results. We assume that each document image from the test set is made to correspond to a description of alternatives of the boundaries and types of the blocks contained in it. We assume also that the most preferable segmentation variants for a certain model of storing results are known. Then, the estimation of the segmentation variants suggested by the page segmentation module consists in the comparison of the suggested variants of segments with those described in the test. Let us consider a technique for the distance determination in the page segmentation. We assume that the page segmentation results in a block structure containing blocks of three types: text blocks, rectangles of tables, and illustration blocks. The text and illustration blocks are polygons with the edges parallel to the coordinate axes. The distance between two instantiations of the page segmentation is defined as the Hausdorff distance ρ(X, Y) between the sets X and Y,

$$\rho(X, Y) = \max\Bigl\{\max_{x \in X}\,\min_{y \in Y}\rho(x, y),\ \max_{y \in Y}\,\min_{x \in X}\rho(x, y)\Bigr\}.$$

Fig. 5. Image with dual typification.
First, we introduce metrics for the simplest model of the page structuring, which consists of only rectangular blocks of one type, for example, text blocks. We begin with the definition of the distance between two rectangles. Let two rectangles B = (B_1, B_2, B_3, B_4) and C = (C_1, C_2, C_3, C_4), with vertices B_i and C_i, be given. Then, the distance between them is the Hausdorff distance between the sets of their vertices,

$$\rho(B, C) = \max\Bigl\{\max_{i}\,\min_{j}\rho(B_i, C_j),\ \max_{j}\,\min_{i}\rho(B_i, C_j)\Bigr\}.$$
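A direct implementation of the Hausdorff distance between finite point sets, and of the rectangle distance built on it, might look as follows (our own sketch; rectangles are given by their vertex lists).

    # Sketch: Hausdorff distance between finite point sets and between two
    # rectangles represented by their vertex lists.
    from math import hypot

    def point_dist(p, q):
        return hypot(p[0] - q[0], p[1] - q[1])

    def hausdorff(xs, ys, dist=point_dist):
        """Maximum of the two directed distances between the finite sets xs and ys."""
        d_xy = max(min(dist(x, y) for y in ys) for x in xs)
        d_yx = max(min(dist(x, y) for x in xs) for y in ys)
        return max(d_xy, d_yx)

    # Distance between rectangles B and C as the Hausdorff distance of their vertex sets.
    B = [(0, 0), (0, 10), (20, 10), (20, 0)]
    C = [(1, 0), (1, 12), (21, 12), (21, 0)]
    print(hausdorff(B, C))    # about 2.236: the worst-matched vertices differ by (1, 2)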
If at least one of the blocks, B or C, is not rectangular, then the distance between them is defined as the Hausdorff distance between the boundaries of the blocks. Let, for example, B and C be blocks of one type, and let one of them or both be not rectangular. Denote the boundaries of these blocks by ∂B and ∂C. Then, the distance between the blocks is defined as

$$\rho(B, C) = \rho(\partial B, \partial C).$$
The distance between table blocks should take into account the inner structure of the table. The definition of the distance that takes into account the inner table structure is also based on the Hausdorff distance; this issue needs special consideration. First, we consider the case where the table has the simplest structure of a rectangular matrix; i.e., the cells of the table are obtained by dividing the table by vertical and horizontal separating lines that go from one edge of the table to the other. For two tables T_1 = (R_1, h_1, v_1) and T_2 = (R_2, h_2, v_2), where R_1 and R_2 are the table rectangles, h_1 = {h_1^1, h_2^1, ..., h_{n_1}^1} and h_2 = {h_1^2, h_2^2, ..., h_{n_2}^2} are the sets of relative coordinates of the horizontal separating lines, and v_1 = {v_1^1, v_2^1, ..., v_{m_1}^1} and v_2 = {v_1^2, v_2^2, ..., v_{m_2}^2} are the sets of relative coordinates of the vertical separating lines, the distance is defined as

$$\rho(T_1, T_2) = \max\{\rho(R_1, R_2),\ \rho(h_1, h_2),\ \rho(v_1, v_2)\}.$$

In the last formula, weight coefficients can be used to take into account the different effects of errors in the determination of the outer block boundaries and of those in the recognition of the inner block structure. It should be noted that the problem of finding an appropriate balance between the different distances used in this formula is not an easy one. Therefore, it may prove advisable to introduce metrics separately for the table structures.

For two segmentations S = {B_1, B_2, ..., B_n} and T = {C_1, C_2, ..., C_m} consisting of blocks of one type, with B_p = {B_1^p, B_2^p, B_3^p, B_4^p} and C_q = {C_1^q, C_2^q, C_3^q, C_4^q}, the distance is defined as

$$\rho(S, T) = \max\Bigl\{\max_{p}\,\min_{q}\rho(B_p, C_q),\ \max_{q}\,\min_{p}\rho(B_p, C_q)\Bigr\}.$$
Before defining the distance between two block instantiations, we define the distance between two sets of one-type blocks for the case where one of the sets is empty and the other is not. In this case, the distance should be set equal to a sufficiently large number, e.g., a multiple of the diagonal of the page where the rectangles are located; for the multiplier, the number of rectangles in the nonempty set of blocks should be taken. In addition, the distance between two empty segmentations is equal to zero by definition. Suppose that
both segmentations, S and T, are decomposed into three sets of blocks, S_A, S_T, S_P and T_A, T_T, T_P, respectively, where S_A and T_A are text blocks, S_T and T_T are tables, and S_P and T_P are pictures. In this case, the distance is defined as max{ρ(S_A, T_A), ρ(S_T, T_T), ρ(S_P, T_P)}, or by means of some other metric, depending on whether we distinguish between the errors in the recognition of blocks of different types. The calculation of the characteristics described in Section 3 (segmentation accuracy and completeness and typification completeness) is based on the distance defined above. Note that metrics based on the Hausdorff distance can be used for solving problems of the segmentation of character boundaries, including those discussed in Section 5.
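Putting the pieces together, the distance between two page segmentations decomposed by block type can be sketched as follows (rho_block is an assumed block-level distance, e.g., the vertex Hausdorff distance above; the "sufficiently large number" convention for a single empty set is taken as a multiple of the page diagonal).

    # Sketch: distance between two page segmentations decomposed by block type.
    def set_distance(blocks_s, blocks_t, rho_block, page_diagonal):
        """Hausdorff-style distance between two sets of one-type blocks."""
        if not blocks_s and not blocks_t:
            return 0.0
        if not blocks_s or not blocks_t:
            # One set is empty: a large number proportional to the page diagonal
            # and to the size of the nonempty set.
            return page_diagonal * max(len(blocks_s), len(blocks_t))
        d_st = max(min(rho_block(b, c) for c in blocks_t) for b in blocks_s)
        d_ts = max(min(rho_block(b, c) for b in blocks_s) for c in blocks_t)
        return max(d_st, d_ts)

    def segmentation_distance(seg_s, seg_t, rho_block, page_diagonal):
        """seg_s, seg_t: dicts mapping 'text', 'table', 'picture' to lists of blocks."""
        return max(set_distance(seg_s.get(k, []), seg_t.get(k, []), rho_block, page_diagonal)
                   for k in ("text", "table", "picture"))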
Let us show that the same ideas can be used for defining the distance between two tables of a more general structure. Naturally, the definition of this distance should rely on the description of the inner structure of the table. A segment of a separating line enclosed between two adjacent separating lines of the orthogonal family is referred to as a bridge. The boundaries of the table are considered separating lines as well. To describe the structure of a table, we assign a visibility attribute to each bridge. Only visible bridges are shown in the table image. The table structure is thus described by two families of bridges: visible and invisible. For both these families, we can define the Hausdorff distances and take the maximum of the two as the resulting distance (as was already done for sets of objects of different types).
Criteria and characteristics of the quality of structure determination for complex table blocks and blocks containing forms reduce to a set of simpler properties of graphical objects. Below is the list of algorithms required for the determination of the structure of table objects:

– decomposition of a structural block into atomic table blocks;
– finding of the elements of a table block (heading, base, side columns, kernel, types of boundaries);
– segmentation of the cells in the elements of a table block and of the cell directions (vertical or horizontal);
– determination of aggregates of cells of one kind;
– determination of the types and attributes of table cells (direction, negativity, etc.).
For each of these algorithms, criteria are formulated that make it possible to estimate the quality of the segmentation of the inner structure of tables compared to their ideal descriptions. Clearly, the complexity of the inner structure of table blocks suggests a multivalent segmentation and alternative properties of table elements. An additional condition included in the definition of the base distance for the page segmentation is the minimality of the number of blocks, which is used in combination with the basic quality criteria and is aimed at avoiding situations similar to that shown in Fig. 6c.
Consider now the characteristics of the block structure determination. Text blocks have the simplest structure, which is a set of text lines (horizontal or vertical). An image of each text line can be uniquely localized on the page image, in the sense that each primary segment containing one character or a part of a character corresponds to one line from the test sample. Primary segments containing several characters from different lines, or formed by a union of a character with parts of illustrations (Fig. 7), can be divided into separate primary segments or duplicated in several lines. A particular way of manipulating multiblock primary segments depends on the possibilities of the subsequent processing; in particular, an alternative line formation is possible.
The quality of segmentation of one text line is estimated by the line formation accuracy defined as

$$W_{\max} - P_1 N_{\mathrm{doubles}} - P_2 N_{\mathrm{lost}} - P_3 N_{\mathrm{nontext}},$$

where W_max is the maximum possible line estimate; N_doubles is the number of primary line segments duplicated in other lines; N_lost is the number of lost primary line segments; N_nontext is the number of nontext primary segments assigned to the given line; and P_1, P_2, and P_3 are penalties for the corresponding imperfections of the line formation.
Fig. 6. Different segmentations of a two-column text flowing around an illustration (a): two polygonal blocks not intersecting the illustration block (b), six rectangular blocks not intersecting the illustration block (c), two rectangular blocks intersecting the illustration block (d).

The selection of the penalties P_1, P_2, and P_3 is based on the capabilities of the line recognition module; namely, lines should be formed in such a way that the relationships between the duplicated, missing, and alien primary segments are optimal from the standpoint of the final recognition of the set of test lines.
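The line formation accuracy is a simple penalized score; a minimal sketch with purely illustrative penalty values follows.

    # Sketch: line formation accuracy = W_max - P1*N_doubles - P2*N_lost - P3*N_nontext.
    def line_formation_accuracy(w_max, n_doubles, n_lost, n_nontext,
                                p1=5, p2=20, p3=10):
        # The penalty values here are purely illustrative; in practice they are
        # tuned to the capabilities of the line recognition module.
        return w_max - p1 * n_doubles - p2 * n_lost - p3 * n_nontext

    print(line_formation_accuracy(w_max=255, n_doubles=1, n_lost=0, n_nontext=2))  # 230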
7. BINARIZATION
Binarization algorithms transform a color or grayscale original image into a binary image to be recognized. The binarization algorithms either use the colors of a digital image or reduce the color image to a grayscale one, which is then subjected to the binarization [10]. The binarization of a grayscale image can be either threshold or dynamical. The threshold binarization transforms grayscale point values into black-and-white ones by the rule

$$B(P) = H(G(P) - G_l),$$

where G(P) is the grayscale intensity of the point P belonging to a certain range, G_l is a threshold grayscale value, which is the same for the whole document image, B(P) is the binary value of the point P, and H(x) is the Heaviside unit-step function.
In the process of the dynamical binarization, the binarization thresholds for separate points are taken different in different regions (corresponding, e.g., to one word or one character) of the document image. Independently of whether a binarization algorithm is threshold or dynamical, its goal is to create optimal variants of graphical images for the operation of the subsequent recognition algorithms, starting from the segmentation and ending with formatting.
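For illustration, the global threshold rule B(P) = H(G(P) − G_l) and a simple dynamical variant with a per-block threshold can be sketched with NumPy as follows (a toy illustration, not the binarization algorithm of [10]).

    # Sketch: global threshold binarization and a simple dynamical (per-block) variant.
    import numpy as np

    def binarize_global(gray, threshold):
        """B(P) = H(G(P) - G_l): 1 where the intensity exceeds the global threshold."""
        return (gray > threshold).astype(np.uint8)

    def binarize_dynamic(gray, block=32):
        """Choose a separate threshold (the local mean) for each block x block region."""
        h, w = gray.shape
        out = np.zeros_like(gray, dtype=np.uint8)
        for y in range(0, h, block):
            for x in range(0, w, block):
                region = gray[y:y + block, x:x + block]
                out[y:y + block, x:x + block] = (region > region.mean()).astype(np.uint8)
        return out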
The binarization accuracy is defined as

W_max − P_1 N_cut − P_2 N_glue − P_3 N_cont, (10)

where W_max is the maximum possible estimate; N_cut is the number of glued images (not necessarily text ones) that are to be separated; N_glue is the number of disintegrated text images that are to be assembled; N_cont is the number of images whose outline has been changed; and P_1, P_2, and P_3 are penalties for the corresponding imperfections of the binarization.
When evaluating the binarization accuracy, actual image defects existing in the hard copy of the original document are not taken into account. The selection of the penalties P_1, P_2, and P_3 is based on the capabilities of the line and character recognition algorithms, which, as was discussed above, are aimed at the determination of character boundaries and the recognition of distorted character images.
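The binarization accuracy (10) follows the same penalty pattern as the line formation accuracy; a sketch with assumed penalty values might look as follows.

def binarization_accuracy(w_max, n_cut, n_glue, n_cont, p_cut, p_glue, p_cont):
    # W_max minus penalties for images to be cut apart, images to be glued
    # together, and images whose outline has been changed (formula (10)).
    return w_max - p_cut * n_cut - p_glue * n_glue - p_cont * n_cont

# Illustrative usage with assumed counts and penalties.
score = binarization_accuracy(w_max=100.0, n_cut=3, n_glue=1, n_cont=4,
                              p_cut=4.0, p_glue=6.0, p_cont=1.0)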
The quality of a binarization algorithm can be estimated by formula (10) only in sufficiently simple cases, since the latter involves a nonautomated process of comparing the binarization results with the source images. Figure 8 illustrates possibilities of the binarization and the difficulties that can arise in the subsequent recognition. In more complex cases, the quality of a binarization algorithm is estimated by the characteristics of the page segmentation and recognition obtained on the binary images it produces.
The performance of a binarization algorithm is defined as the time expenditures on the binarization. These expenditures have to be balanced against the accuracy and performance of the subsequent page processing. Clearly, a slow dynamical binarization that produces a page image that will be recognized perfectly (and, perhaps, rapidly) at the subsequent stages is preferable to a fast binarization that leads to difficulties in the recognition of a noise-contaminated document image.
Binarization modules can be equipped with the capability of learning. In such programs, an optimal tuning occurs on learning sequences containing grayscale images of documents or their parts. The optimality criterion is the result of the subsequent recognition.
8. FORMATTING
Formatting algorithms are the final stage of the OCR operation. Even errors that are minor from the standpoint of the algorithms discussed above can here result in completely unsatisfactory operation of the system as a whole.
Fig. 7. Example of appearance of primary segments (connectivity components) belonging to several text lines.

Fig. 8. Binarization of a word with different thresholds.

Characteristics of the formatting algorithms can be divided into two groups. Most often discussed in the literature is the capability of an OCR program to save a text in formats with given properties (multicolumn text, illustrations saved, separating lines, attributes or styles of text characters) [2]. These properties are greatly
affected by the page segmentation and text line recog-
nition algorithms, and the formatting algorithm repro-
duces the accumulated information in text formats, the
list of which is determined by the conversion character-
istics. Another group of characteristics estimates properties of the formatting algorithms themselves and the correspondence of the results of their operation to the original document image.
Similar to the case of previously considered mod-
ules, the formatting accuracy is the most important
characteristic of the formatting algorithms; it estimates
the correspondence of the results obtained to the origi-
nal document structure. The formatting accuracy char-
acteristic is based on a set of elementary characteristics
aimed at the estimation of closeness of the resulting and
source documents. Clearly, the formatting into different
formats (text format, HTML, RTF) has certain restric-
tions imposed by the formats themselves. For example,
illustrations cannot be reproduced in the text format.
Note that, for some documents (for example, for com-
plicated polygraphic pages), the complete correspon-
dence of the results of formatting to the originals is
impossible even for the RTF format. For general-pur-
pose formats (RTF), we use the following elementary
formatting characteristics:
(i) accuracy in the reproduction of the numbers of
text columns and headings, illustrations, and separating
lines, which implies the correspondence of the numbers
of the resulting segments to those of the original text
segments in the scanned document;
(ii) accuracy in the reproduction of sizes of the doc-
ument segments;
(iii) accuracy in the alignment of lines;
(iv) accuracy of the reproduction of mutual loca-
tions of the document segments.
Each of the elementary characteristics (as applied to
a particular recognized document) is calculated as a
sum of penalties for the distortion of metric relations
for the corresponding dimensions. An approximate
preservation of metric relations means that a dimension
X (width, height, distance from another object, etc.) of
the original fragment is sufficiently close to the corresponding dimension X' of the resulting document, |X − X'| ≤ ε, where ε is a prescribed formatting error. The characteristic X' is measured either in a text processor that maps coordinates of elements comprising the document or in a printed hard copy. The preservation of the document sizes is estimated in a similar way. The preservation of fonts is estimated by their types, characteristics, and absolute sizes. The total formatting accuracy Q(I) of a document image I is given by the summary penalty Q(I) = Σ_i W_i P_i(I), where the summation is taken over all characteristics, P_i(I) is the penalty of the ith characteristic estimating the recognition of the image I, and W_i is the weight of this characteristic. To determine the formatting accuracy, we use sets of images consisting of
documents of different classes (magazine, newspaper,
and advertisement pages). The results are averaged, giving the formatting accuracy A. Accordingly, a nonzero
formatting accuracy that is less than a certain threshold
value is considered acceptable (the base distance is
assumed zero), and the threshold of an acceptable pen-
alty is set experimentally.
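A sketch of the two formatting checks described above: the metric relation |X − X'| ≤ ε for a single dimension and the summary penalty Q(I) = Σ_i W_i P_i(I). The weights, tolerances, and dictionary keys are assumptions for this example only.

def metric_preserved(x_original: float, x_result: float, eps: float) -> bool:
    # A dimension is considered preserved if it deviates from the original
    # by no more than the prescribed formatting error eps.
    return abs(x_original - x_result) <= eps

def total_formatting_accuracy(penalties: dict, weights: dict) -> float:
    # Summary penalty Q(I): weighted sum of the elementary penalties
    # (column/heading counts, segment sizes, line alignment, mutual layout).
    return sum(weights[name] * value for name, value in penalties.items())

q = total_formatting_accuracy(
    penalties={"columns": 0.0, "sizes": 2.0, "alignment": 1.0, "layout": 0.5},
    weights={"columns": 1.0, "sizes": 0.5, "alignment": 0.5, "layout": 1.0})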
Another important characteristic is robustness to
variations of images, which is calculated by the general
formulas (5) and (6) for the base distance introduced
above. As the perturbed image I* obtained from an original image I, we consider an image produced by rotation or by a variation of the scanning brightness.
Other formatting characteristics, such as perfor-
mance and ability to learn, are also worth mentioning.
The latter characteristic implies that the formatting
algorithm can keep some information about typical
configurations obtained in the process of learning for
further use.
9. DISCUSSION
In the paper, some quality characteristics of the
algorithms required for the OCR functioning have been
discussed.
The characteristics discussed can be used for the
OCR programs designed for the input of hard copies of
documents into a computer. Characteristics of program
modules implementing the recognition algorithms have
been examined.
Some characteristics (accuracy, completeness, per-
formance) are considered as a set of properties of inter-
module interfaces of an OCR program divided into
modules in accordance with the logic of its operation
described in Section 2. Such an approach suggests that the OCR program is considered as a hierarchical system whose design requires solving two different problems: optimization of its separate parts and optimization of the entire system as a whole. A number of the
characteristics described serve as quality criteria. The
others (properties of estimates, alphabet, image types,
learning ability) make it possible to compare program
implementations of similar algorithms designed for
solving the same problems. Clearly, the great number
of characteristics may require redefinitions or an ordering of the criteria from the standpoint of a more general problem.
For the OCR programs functioning on modern per-
sonal computers, the most important characteristics are
recognition (segmentation) accuracy, recognition (seg-
mentation) completeness, and estimate monotonicity.
For systems designed for the recognition of other
objects or working on other platforms, the priority of
characteristics may be different.
REFERENCES
1. Taghva, K., Borsack, J., and Condit, A., Evaluation of Model-Based Retrieval Effectiveness with OCR Text, ACM Trans. Information Systems, 1996, vol. 14, no. 1, pp. 64–93.
2. Haigh, S., Optical Character Recognition (OCR) as a Digitization Technology, Network Notes, no. 37, Information Technology Services, National Library of Canada, 1996. http://www.nlc-bnc.ca/publications/netnotes/notes37.htm.
3. Portegys, T.E., A Search Technique for Pattern Recognition Using Relative Distances, IEEE Trans. Pattern Analysis Machine Intelligence, 1995, vol. 17, no. 9, pp. 910–912.
4. LeCun, Y., et al., Back-Propagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1989, vol. 1, pp. 541–551.
5. Slavin, O.A., Tools for Management of Graphical Character Image Bases and Their Place in Recognition Systems, in Razvitie bezbumazhnykh tekhnologii v organizatsiyakh (Development of Paper-free Technologies for Offices), Moscow: URSS, 1999, pp. 277–289.
6. Wang, J. and Jean, J., Segmentation of Merged Characters by Neural Networks and Shortest Path, Pattern Recognition, 1994, vol. 27, no. 5, pp. 649–658.
7. Arlazarov, V.L., Korolkov, G.V., and Slavin, O.A., A Linear Criterion in OCR Problems, in Razvitie bezbumazhnykh tekhnologii v organizatsiyakh (Development of Paper-free Technologies for Offices), Moscow: URSS, 1999, pp. 17–23.
8. Arlazarov, V.L., Kuratov, P.A., and Slavin, O.A., Line Recognition for Printed Texts, in Metody i sredstva raboty s dokumentami (Methods and Tools for Paper Work), Moscow: URSS, 2000, pp. 31–51.
9. Wu, V., Manmatha, R., and Riseman, E.M., TextFinder: An Automatic System to Detect and Recognize Text in Images, IEEE Trans. Pattern Analysis Machine Intelligence, 1999, vol. 21, no. 11, pp. 1224–1229.
10. Klyahzkin, V., Shchepin, E., and Zingerman, K., Hierarchical Analysis of Multi-column Texts, Pattern Recognition Image Analysis, 1995, vol. 5, no. 1, pp. 1–12.