1. Introduction
A character recognition system is designed for the machine
replication of human reading. It involves techniques for
recognizing both printed and handwritten characters
from a text image. The concept of an OCR
system was first developed in 1929 [1]. In the middle of the
1940s, the first OCR system appeared with the invention
of the digital computer [2]. A commercial OCR machine
became available in the 1950s. Since then, a large number of
research papers and reports have appeared, and many new
recognition techniques have been developed. OCR systems
improve human-machine interaction in many applications,
including office automation, cheque verification, and a
large variety of banking, business and data entry
applications [1], [2].
1997 IEEE TENCON - Speech and Image Technologies for Computing and Telecommunications 53 1
... increases the classes to be recognized from 28 to 100.
iii. Some characters can only appear at the beginning or
at the end of a word or sub-word. An Arabic word
could have one or more sub-words. This is due to
the fact that some characters are not connectable
from the left side with the succeeding character.
iv. Most characters (17 out of 28) have a dot, two dots,
or zigzags associated with the character and these
can be above, below, or inside the character.
v. There are only three zigzags that represent vowels.
Other vowels are represented by diacritics in the
form of over-scores or under-scores. The use of
diacritics is limited to the cases where the word is
foreign or where the pronunciation requires a stress.
vi. Some characters may overlap vertically (without
touching each other) within a word.
vii. There are no upper or lower cases in Arabic.

Because of the above characteristics, Arabic character
segmentation and recognition are far more difficult than
the recognition of Latin or Chinese characters.

2.1 Image Acquisition and Preprocessing
A document is quantized by a scanner in space and
amplitude (i.e. image sampling and gray-level
quantization) to acquire a digitized representation of it [5].
This is controlled by the user interface of the system.

Preprocessing is the process of enhancing the acquired
image to increase the ease of feature extraction and to
compensate for the eventual poor quality of the scanned
documents. Binarization, noise elimination and thinning
[6]-[10] have also been implemented.

2.2 Segmentation
Segmentation is a crucial step of OCR systems as it
extracts meaningful regions for analysis. A poor
segmentation process produces mis-recognition or
rejection. It is especially important for Arabic OCR
systems due to the cursive nature of Arabic script and the
fact that some Arabic words overlap vertically. Page layout
analysis and character separation are used to segment
characters from the preprocessed image.
Character separation uses the histogram produced by the
vertical projection of each word image, with the threshold

$$M = \frac{1}{N} \sum_{i=1}^{N} C_i \qquad (1)$$

where $N$ is the number of columns of the word image and $C_i$ is
the number of black pixels in the $i$th column. Hence, each
part showing a value less than $M$ is segmented into a
different character. However, if the histogram produced
from the vertical projection does not satisfy the rules given
below, the character remains unsegmented.
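The thresholding step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the system: the word image is assumed to be a list of rows of 0/1 values (1 for a black pixel), and the threshold M is assumed to be the mean of the column counts C_i. The function names are hypothetical, and the peak-distance rules are not modelled.

```python
def vertical_projection(word):
    """Return C_i, the number of black pixels (value 1) in each column."""
    return [sum(column) for column in zip(*word)]


def below_mean_runs(word):
    """Return (start, end) column ranges, end exclusive, whose projection
    value is below the mean M of all column counts (assumed threshold)."""
    counts = vertical_projection(word)
    mean = sum(counts) / len(counts)        # M = (1/N) * sum of C_i
    runs, start = [], None
    for i, count in enumerate(counts):
        if count < mean and start is None:
            start = i                       # a below-threshold part begins
        elif count >= mean and start is not None:
            runs.append((start, i))         # the part ends before column i
            start = None
    if start is not None:                   # part reaching the last column
        runs.append((start, len(counts)))
    return runs


# Toy 3 x 5 word image: column counts are [3, 1, 3, 1, 0], mean 1.6.
word = [
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
]
print(below_mean_runs(word))  # [(1, 2), (3, 5)]
```

How each below-threshold part is turned into a final character boundary is left open here, since that depends on the rules that follow.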
$$d_i \le \frac{d_L}{3} \qquad (2)$$

where $d_i$ is the distance between the $i$th peak and the $(i+1)$th peak,
and $d_L$ is the total width of the character. Examination of
Arabic characters shows that the distance between peaks does not
exceed 1/3 of the width of an Arabic character. Moreover, at
the end of a word or sub-word, the following rule is
applied:

$$L_{i+1} > 1.5 \times L_i \qquad (3)$$

where $L_i$ is the $i$th peak in the histogram.

2.3 Feature Extraction
The end result of image acquisition, preprocessing, and
segmentation is an array of numbers that represents the
character in some way. In the general case, however,
matching these numbers to a template may be too time
consuming and not flexible enough. Therefore, feature
extraction is essential in character recognition systems.
The feature extraction process uses a set of measurements
that represent unique features to describe the character.
These measurements may then be represented in the
feature space for classification. Seven second-order
moments, $\eta_{pq}$, described in [14], are used to calculate the
features of each segmented Arabic character. They are:

$$\phi_1 = \eta_{20} + \eta_{02} \qquad (4)$$

$$\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \qquad (5)$$

$\mu_{pq}$ is the central moment and can be expressed as follows:

$$\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q \, p(x, y)$$

where $\bar{x} = m_{10} / m_{00}$ and $\bar{y} = m_{01} / m_{00}$.
$m_{pq}$ is the $(p+q)$th-order geometric moment of a digital
density distribution function $p(x, y)$, which can be expressed
as follows:

$$m_{pq} = \sum_x \sum_y x^p y^q \, p(x, y)$$

where $p, q = 0, 1, 2, \ldots$ and $p(x, y)$ is the gray-level value of a
pixel at $(x, y)$.

2.4 Statistical Classification
The classification process is carried out at the final stage to
recognize the characters. The classification process assigns
an input character to one or more pre-specified classes
which are based on the extracted features and their
analysis. In our system, the minimum distance is obtained
by calculating the sum of squared errors between the
seven moment descriptors of the input character and those
of the 100 characters stored in a database.

3. Experimental Results and Discussion
We have fully implemented the Arabic OCR system
described above and provided a friendly graphical
window-based user interface. In that system, document
images are acquired through a scanner by calling a
command from the system's menu. After the recognition,
the users are able to edit their documents through the
system as well. Many images have been tested. It has been
shown that the system has an accuracy of 85% and is able
to run in real time at around 16 char/sec. Some example
results are shown in Figures 4 and 5.

Figure 4 (a) the original document (b) the
recognized result with the system menu.

The major problem that we have encountered is the
character separation through the use of the vertical
projection technique described in Section 2.2. Some of the
characters could not be segmented as those characters
horizontally overlap each other. Moreover, in
our experimental investigation, Amin's algorithm was not
always satisfied during the segmentation of characters. This
might be because the algorithm is only suitable for some
font types.

4. Conclusion and Future Work
A user-friendly, statistically based Arabic optical character
recognition system was implemented. Many tests were
carried out and showed that the system is able to recognize
Arabic texts with an accuracy of 85%. It is also a real-time
system which recognizes documents at around 16 char/sec.
In our Arabic OCR system, the text image is first
preprocessed to enhance the quality of the image. Then it
is divided into paragraphs, lines are extracted, and
each word is decomposed into its constituent characters
using the algorithm described in [11]. Feature extraction
and statistical classification are then carried out to identify
the Arabic characters. The accuracy of the system is
enormously affected by the segmentation process.

We have developed an Arabic word segmentation method
which can successfully segment vertically overlapped
Arabic words [15]. This process should increase the
segmentation accuracy of Arabic characters. Currently, we
are investigating an improved character separation
technique, and believe that it will increase the recognition
accuracy of our system.

References
[1] S. Mori, C. Y. Suen and K. Yamamoto, “Historical Review of OCR Research and Development,” Proc. IEEE, Vol. 80, No. 7, pp. 1029-1058, 1992.
[2] V. K. Govindan and A. P. Shivaprasad, “Character Recognition - A Review,” Pattern Recognition, Vol. 23, No. 7, pp. 671-683, 1990.
[3] I. S. Abuhaiba, S. A. Mahmoud and R. J. Green, “Cluster Number Estimation and Skeleton Refining Algorithms for Arabic Characters,” The Arabian Journal for Science and Engineering, Vol. 16, No. 4B, pp. 519-530, Oct 1991.
[4] K. M. Jambi, “Arabic Character Recognition: Many Approaches and One Decade,” The Arabian Journal for Science and Engineering, Vol. 16, No. 4B, pp. 501-509, Oct 1991.
[5] W. Niblack, “An Introduction to Digital Image Processing,” Prentice Hall International.
[6] A. Cheung, M. Bennamoun and N. W. Bergmann, “A New Thinning Algorithm for Arabic Characters,” ISAS, Caracas, Venezuela, Oct 1997.
[7] C. J. Hilditch, “Comparison of Thinning Algorithms on a Parallel Processor,” Image Vision Computer, Vol. 1, No. 3, pp. 115-132, 1983.
[8] L. Lam and C. Y. Suen, “An Evaluation of Parallel Thinning Algorithms for Character Recognition,” IEEE Trans. on PAMI, Vol. 17, No. 1, Jan 1995.
[9] P. S. P. Wang and Y. Y. Zhang, “A Fast and Flexible Thinning Algorithm,” IEEE Trans. on Computers, Vol. 38, No. 5, pp. 741-745, 1989.
[10] S. Suzuki and K. Abe, “Binary Picture Thinning by an Iterative Parallel Two Sub-cycle Operations,” Pattern Recognition, Vol. 10, No. 3, pp. 297-307, 1987.
[11] A. Amin, “Recognition of Arabic Handprinted Mathematical Formulae,” The Arabian Journal for Science and Engineering, Vol. 16, No. 4B, pp. 532-542, Oct 1991.
[12] A. Amin and G. Masini, “Machine Recognition of Multifonts Printed Arabic Texts,” Proc. 8th Inter. Conf. on Pattern Recognition (Paris, France), pp. 392-395, 1986.
[13] A. Amin and J. F. Mari, “Machine Recognition and Correction of Printed Arabic Text,” IEEE Trans. on System, Man, and Cybernetics, Vol. 19, No. 5, pp. 1300-1306, 1989.
[14] F. El-Khaly and M. A. Sid-Ahmed, “Machine Recognition of Optically Captured Machine Printed Arabic Text,” Pattern Recognition, Vol. 23, No. 11, pp. 1207-1214, 1990.
[15] A. Cheung, M. Bennamoun, and N. W. Bergmann, “A New Word Segmentation Algorithm for Arabic Script,” DICTA'97: The 4th Conference on Digital Imaging Computing Techniques and Applications, Auckland, New Zealand, 1997 (to appear).

Figure 5 (a) the original document (b) the
recognized result with the system menu.
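As an illustration of the feature extraction and classification steps of Sections 2.3 and 2.4, the following Python sketch computes the first two moment descriptors, equations (4) and (5), and picks the stored character with the smallest sum of squared errors. It is a sketch under stated assumptions rather than the system's implementation: the image is a non-empty list of rows of gray-level values, the moments $\eta_{pq}$ are assumed to be central moments normalised by $\mu_{00}^{1+(p+q)/2}$ (the text does not give this relation), only two of the seven descriptors are used, and the function names and the two-entry database are hypothetical.

```python
def geometric_moment(img, p, q):
    """m_pq: sum of x^p * y^q * p(x, y) over all pixels."""
    return sum((x ** p) * (y ** q) * value
               for y, row in enumerate(img)
               for x, value in enumerate(row))


def central_moment(img, p, q):
    """mu_pq: moment taken about the centroid (x_bar, y_bar)."""
    m00 = geometric_moment(img, 0, 0)
    x_bar = geometric_moment(img, 1, 0) / m00
    y_bar = geometric_moment(img, 0, 1) / m00
    return sum(((x - x_bar) ** p) * ((y - y_bar) ** q) * value
               for y, row in enumerate(img)
               for x, value in enumerate(row))


def normalized_moment(img, p, q):
    """eta_pq under the ASSUMED normalisation mu_pq / mu_00^(1 + (p+q)/2)."""
    mu00 = central_moment(img, 0, 0)
    return central_moment(img, p, q) / mu00 ** (1 + (p + q) / 2)


def first_two_descriptors(img):
    """Return (phi_1, phi_2) as in equations (4) and (5)."""
    n20 = normalized_moment(img, 2, 0)
    n02 = normalized_moment(img, 0, 2)
    n11 = normalized_moment(img, 1, 1)
    return (n20 + n02, (n20 - n02) ** 2 + 4 * n11 ** 2)


def classify(features, database):
    """Return the label whose stored descriptors give the smallest
    sum of squared errors against the input descriptors."""
    def squared_error(stored):
        return sum((f - s) ** 2 for f, s in zip(features, stored))
    return min(database, key=lambda label: squared_error(database[label]))


# Hypothetical two-entry database built from toy shapes.
square = [[1, 1], [1, 1]]
bar = [[1, 1, 1]]
database = {
    "square": first_two_descriptors(square),
    "bar": first_two_descriptors(bar),
}
print(classify(first_two_descriptors(bar), database))  # bar
```

A full version would extend `first_two_descriptors` to all seven descriptors and store one descriptor vector for each of the 100 character classes.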