
Thinning Arabic Characters for Feature Extraction

Dr John Cowell
Dept. of Computer Science, De Montfort University,
The Gateway, Leicester, LE1 9BH, England.
jcowell@dmu.ac.uk

Dr Fiaz Hussain
School of Information Technology, Dubai Polytechnic,
PO Box 1457, Dubai, United Arab Emirates (UAE).
fhussain@dubaipolytechnic.com

Proceedings of the Fifth International Conference on Information Visualisation (IV’01)
0-7695-1195-3/01 $10.00 © 2001 IEEE

Abstract
A successful approach to the recognition of Latin characters is to extract features from each character, such as the number of strokes, stroke intersections and holes, and to use ad-hoc tests to differentiate between characters which have similar features. The first stage in this process is to produce thinned, 1 pixel thick representations of the characters to simplify feature extraction. This approach works well with printed Latin characters which are of high quality. With poor quality characters, however, the thinning process itself is not straightforward and can introduce errors which are manifested in the later stages of the recognition process. The recognition of poor quality Arabic characters is a particular problem since the characters are calligraphic, with printed characters having widely varying stroke thicknesses to simulate the drawing of the character with a calligraphy pen or brush. This paper describes the problems encountered when thinning large poor quality Arabic characters prior to the extraction of their features and submission to a syntactic recognition system.

Keywords
Arabic, characters, thinning, optical character recognition, OCR, Urdu.

Introduction
The development of Optical Character Recognition systems for Latin script has been a success. Commercial OCR systems are widely available at low cost and perform well with high quality printed characters. Even for Latin script, however, the recognition of poor quality printed characters for common applications such as reading postal or zip codes on letters or reading vehicle licence plates as part of traffic control systems is still a subject for research. Details of the techniques used in commercial recognition systems are usually not available in the public domain as they form part of the products of companies; however, the techniques used are widely known. Such systems are usually based on feature extraction techniques, counting features such as the number of strokes which make up the character, the curvature of the strokes and the number of holes. This approach readily organises the Latin characters into groups with similar characteristics. Ad-hoc tests can then be used to differentiate between the characters in each group. The tests tend to be very specific for the characters concerned. Recognition systems for the Latin characters, for example, do not consider the number or position of dots, which is crucial in the Arabic case. The techniques used are often sensitive to variations in the orientation of the characters, do not deal with poor quality characters with ragged edges and are not readily transferable to other character sets such as Arabic and Urdu. The calligraphic nature of these character sets means that the approach used for Latin characters is not directly transferable. A recognition system which can deal with a variety of poor quality characters from the Arabic, Latin and other character sets must use an alternative approach.

A number of approaches for thinning have been reported in the literature. Lam and Suen [1] referenced 138 publications in the area in 1992. A simple search, today, for material on thinning identifies 150 papers. This shows the importance of thinning in pattern recognition and the variety of approaches used to yield skeletons of shapes. Iterative approaches based on having black edge pixels with at
least one non-black neighbouring pixel have been developed by Rutovitz [2], Hilditch [3], and Arcelli [3,4]. Holt [5] and Hall [6] discuss some practical problems associated with this approach. Associated with this is the development of parallel thinning algorithms. Performance of these is reported by Chen and Hsu [7], whilst a critical evaluation of recent developments is given by Lam and Suen [8]. An alternative approach to thinning is based on the medial axis transform. This effectively works by removing the “flesh” of an object to return the skeleton [9, 10]. This is a widely used approach since skeletons can be produced quickly and are invariant to the orientation of the input character. The skeletons produced, however, may not be connected and therefore secondary processing is required to produce the skeletons in a form suitable for easy feature extraction. In addition, the medial axis approach relies on certain shape characteristics (such as parallel stems) which are not common in the Arabic case. An iterative algorithm is reported here that produces thinned characters which are suitable for feature extraction. One feature of the approach used is that the thinned forms produced vary slightly depending on the orientation of the original character. It was found, however, that the differences were not sufficient to cause difficulties in the extraction of the features required for identification of the characters.

The paper also provides experimental results that show the problems associated with the approach when applied to large characters of size 100 by 100 pixels. The work is seen as an intermediate stage towards developing a full system for identifying and recognising characters through thinning.

The Thinning Process
The recognition system developed is based on the extraction of features from the character. The recognition of the characters is based on a syntactic approach which expresses the spatial relations between these features in terms of a new pattern grammar. The features which the system requires are: the number of stroke end points, the number of holes, the number and position of dots, and a representation of the curvature of the strokes which make up the character. The extraction of these features is made easier by producing thinned forms of the characters. Thinning is a well documented process which is widely used as the first stage in the recognition of high quality Latin characters. When conventional thinning techniques were applied, however, to test forms of the Arabic characters, some problems quickly became apparent.

The Thinning Application
An application was developed which showed the form of the character at each stage of the thinning process, so that it could be examined. The software also allowed the initial image to be edited prior to the thinning process, so that the effect of adding or deleting a few pixels could be examined. The running application is shown in figure 1.

Figure 1: The thinning application.

The Language menu option allows the character set to be specified; in this prototype only Arabic and Latin are available. The File menu option allows a file containing an isolated character to be specified. When the character is displayed in the top left square, pixels may be added or removed by clicking anywhere on the image. Clicking on the 'Thin shape' button removes a layer of pixels from the edges of the character. The resulting character is displayed in the next square on the right; when the first row is full the next row is used. The number of pixels deleted in producing a partially thinned form is shown below each image. The application also has a zoom facility which allows the current character to be displayed as a larger image. The 'Print current pattern' button prints the zoomed character with a background grid, so that the x, y position of individual pixels can easily be read. The 'Identify features' button extracts structural information about the character and is reported in Hussain and Cowell [in preparation].

The Thinning Algorithm
The thinning algorithm in the initial phase used a conventional approach. A list of edge pixels was produced. An edge pixel is one which is black (a pixel which is a part of the character) and touches at
least one white pixel (a pixel which is a part of the background). 8-connectedness, not 4-connectedness, was used; that is, two pixels touch if they are diagonally adjacent in addition to being horizontally or vertically adjacent. A pixel touches eight surrounding pixels. Edge pixels are candidates for deletion. To determine whether a particular edge pixel should be deleted, a set of 3×3 templates is applied to each of the edge pixels in turn, so that the pixel under consideration is at the centre of the template. If there is a match, the pixel is deleted. There are eight pixels surrounding the central pixel, giving 256 possible templates. In fact, because of a four-fold symmetry only 64 templates need to be considered. The deletion templates all have the following characteristics:
1. The deletion of the central pixel still leaves all other pixels in the template connected.
2. The deletion of the pixel should leave at least two black pixels in the template.
Applying these criteria produced 32 templates.

Part (a) of Figure 2 shows a template which is used to identify a pixel suitable for deletion. In part (b) the central pixel is not suitable for deletion since it would leave the right pixel disconnected from the right three pixels, that is, the first criterion is not met. Parts (c) and (d) indicate pixels not suitable for deletion since these do not match the second criterion. Deleting pixels which match these two templates would have the effect of erasing the final one pixel thick form.

Figure 2: Typical thinning templates, parts (a) to (d).

The deletion of an edge pixel alters the partially thinned character and may have an impact on whether other edge pixels are deleted or not. After considering every edge pixel, the remaining edge pixels which have not been deleted are considered again until no more are deleted. The whole image is considered again to produce a new list of edge pixels and the whole process repeated until no more pixels can be deleted. The outcome is that one-pixel thick representations are produced.

Thinning Problems
It was anticipated that the thinned forms produced by this process would be suitable for feature extraction since the algorithm outlined produced one pixel skeletal forms such that:
• An end pixel is identified as a black pixel which touches only one other black pixel.
• A black pixel which touches three or more other black pixels marks an intersection of two or more strokes.
• All other black pixels touch two other black pixels.
However, a number of significant problems emerged.

Dots play a significant part in the identification of Arabic characters. For example, jeem, haa, and chaa are distinguished solely by the position of the dot. Haa does not use a dot, whilst jeem and chaa have one dot positioned respectively middle and top with respect to the main part of the character. Thinning a dot does not produce a meaningful output. Patterns which approach a circle or square are not amenable to thinning and the thinned form is susceptible to minor variations in the shape of the dot.

Characters of poor quality may contain a single white pixel which grows into a hole in the thinning process.

The thinning process produces distorted characters, particularly where stroke intersections occur.

The cursive nature of the Arabic characters increases the likelihood that spurious 'tails' will be present in the thinned forms. This problem is inherent to the approach used and although it occurs less frequently if an algorithm based on the medial axis is used, this difficulty is replaced by another problem, namely disconnected skeletons. This requires a further phase of thinning, using an algorithm of the type described here, to produce skeletons of a form from which features can be readily extracted for analysis and the recognition of the character.

Thinning Problems – Observations and Solutions
Each of these problems has been successfully solved by the application to produce thinned forms which allow us to extract the structural information which we need for the recognition process.

The thinning of dots is not a valid process; indeed it is not required. The significance of dots in the original character is determined by their number and position,
not by their shape, and therefore the original character is pre-processed to remove dots and note their position and number prior to submission to the thinning process.

The problem of single white pixels has also been dealt with by pre-processing. Holes which are three pixels or less in size are converted to black pixels prior to submission to the thinning process. This successfully solved the problem for the test characters.

The next two problems, character distortion and the growth of spurious 'tails', are not so easily resolved.

Character Distortion
The problem of distortion of the characters is more intractable and is a fundamental feature of this approach to thinning. Figure 3 shows successive forms of the character hamza.

Figure 3: Distortions in the thinning process.

The original form, to an Arabic reader, is clearly made up of one continuous stroke with one abrupt change in direction. The thinned form clearly shows three strokes and, although recognisable, does not look as one would anticipate an ideal thinned form of this character would look. Note also the distortion where the apparent intersection occurs. This is an inherent feature of this approach to thinning, and when extracting structural information from the thinned form it is important to appreciate the limitations of this algorithm. There is no solution to this problem; however, an awareness of it allows it to be taken into account in the latter parts of the recognition process.

Removing 'Tails'
The final problem, the growth of spurious tails, is one which occurs when the characters to be thinned are large and the thinning process requires many stages. For the 100×100 pixel resolution characters in the test set, the maximum number of iterations required was seventeen, for the character haa. This size of character is much higher than used by many researchers and results from improved hardware which allows scanning at higher resolutions. A 12 point character scanned at 600 dpi, which can be achieved by a cheap scanner, will yield an image which is 12 × 600/72 = 100 pixels high, since a point is 1/72 of an inch. The growth of these tails is illustrated in figure 4, which shows successively thinned forms of the character faa. Note that the first square shows the character as read from the file, including a dot. This is removed and its position logged prior to thinning.

Figure 4: Growth of superfluous 'tails'.

Two approaches were used to resolve this problem. The first approach was to prevent the growth of tails wherever possible. The second was to remove them where the first approach had failed.

An analysis of the characters which exhibited tails indicated that they grew from a small number of seed pixels on a ragged edge. The 3×3 templates were not sufficient to identify this problem and therefore two larger 5×5 templates were used, as indicated in figure 5. The grey pixels indicate that it does not matter what colour the pixel is.

Figure 5: 5×5 templates to prevent tail growth.

The application of these templates and their rotational variants was sufficient to prevent the growth of many tails. Figure 6 part (a) shows the original character caf, part (b) shows the thinned form without the larger templates and part (c) shows the final form when the larger templates are employed in addition to the 3×3 templates.
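
For reference, the 3×3 deletion test described in The Thinning Algorithm, which the larger templates supplement, can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' template set: the two criteria are evaluated directly instead of through pre-computed templates, pixels outside the image are treated as white, and the names `can_delete` and `black_neighbours` are hypothetical.

```python
from typing import List, Set, Tuple

Image = List[List[int]]  # 1 = black (character), 0 = white (background)

# Offsets of the eight pixels that touch a pixel under 8-connectedness.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def black_neighbours(img: Image, r: int, c: int) -> Set[Tuple[int, int]]:
    """Offsets of the black pixels among the eight neighbours of (r, c)."""
    rows, cols = len(img), len(img[0])
    return {
        (dr, dc)
        for dr, dc in NEIGHBOURS
        if 0 <= r + dr < rows and 0 <= c + dc < cols and img[r + dr][c + dc] == 1
    }


def can_delete(img: Image, r: int, c: int) -> bool:
    """Apply the two deletion criteria to the pixel at (r, c)."""
    if img[r][c] != 1:
        return False
    black = black_neighbours(img, r, c)
    if len(black) == 8:  # no white neighbour, so this is not an edge pixel
        return False
    if len(black) < 2:   # criterion 2: at least two black pixels must remain
        return False
    # Criterion 1: the remaining black pixels must stay 8-connected to each
    # other without passing through the centre pixel.
    start = next(iter(black))
    seen, stack = {start}, [start]
    while stack:
        pr, pc = stack.pop()
        for q in black - seen:
            if max(abs(q[0] - pr), abs(q[1] - pc)) == 1:
                seen.add(q)
                stack.append(q)
    return seen == black
```

Under these assumptions, the corner of a solid block passes the test, while an interior pixel, the end of a one-pixel line and the middle of a one-pixel line all fail it, so a finished skeleton is left intact.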

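
The pre-processing of small holes described above (holes of three pixels or less are converted to black) can be sketched in the same way. The sketch assumes that a hole is a 4-connected region of white pixels which does not reach the image border; the paper does not state which connectivity was used for the background, and the name `fill_small_holes` is hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # 1 = black (character), 0 = white (background)


def fill_small_holes(img: Image, max_size: int = 3) -> Image:
    """Return a copy of img with enclosed white regions of at most
    max_size pixels converted to black."""
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    visited = [[False] * cols for _ in range(rows)]
    for r0 in range(rows):
        for c0 in range(cols):
            if out[r0][c0] != 0 or visited[r0][c0]:
                continue
            # Flood-fill one white region (4-connectivity is an assumption).
            region: List[Tuple[int, int]] = []
            touches_border = False
            stack = [(r0, c0)]
            visited[r0][c0] = True
            while stack:
                r, c = stack.pop()
                region.append((r, c))
                if r in (0, rows - 1) or c in (0, cols - 1):
                    touches_border = True
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and out[nr][nc] == 0 and not visited[nr][nc]):
                        visited[nr][nc] = True
                        stack.append((nr, nc))
            # Only small regions fully enclosed by the character are holes.
            if not touches_border and len(region) <= max_size:
                for r, c in region:
                    out[r][c] = 1
    return out
```

Larger enclosed regions, such as the genuine holes counted as features, are left untouched, as is the background surrounding the character.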
Figure 6: Effect of larger templates, parts (a) to (c).

This approach, while successful, did not remove all tails from the test sample of characters. For example, a tail remains on the sample character faa shown in figure 4. To remove this tail and others which were not removed by the larger templates, the final thinned form was checked for strokes which were less than a specified threshold in size. For the test sample, 8 pixels were found to be sufficient. The end points of strokes are identified readily, since these pixels only touch one other pixel. These pixels are followed until another end point or an intersection is encountered. If the number of pixels in this stroke is less than the threshold, the stroke is considered to be a tail and is deleted. Part (a) of figure 7 shows the original form of faa, part (b) shows the thinned form without the tail removal and part (c) shows the form after successful removal.

Figure 7: Tail removal, parts (a) to (c).

Conclusions
Recognition systems often use thinning as the first stage in the extraction of features from a character in order to recognise it. This approach works well with high quality Latin characters, but difficulties are encountered when dealing with a calligraphic character set such as Arabic, due to wide changes in the thickness of the strokes which mimic the writing of the characters by a calligraphy pen or brush. The increased resolution of scanned characters means that the thinning process of successively removing layers of edge pixels requires more iterations, which results in greater distortion of the characters. To resolve these problems when working with poor quality Arabic characters, pre-processing is required to remove white pixels embedded in the body of the character and to remove any dots in the character after noting their position and number, which is significant in the recognition process. The thinned forms produced by applying a set of 3×3 templates often have superfluous tails which require further processing before the features can be extracted as the next phase of the recognition process. An application has been successfully implemented using iterative thinning with post processing to produce thinned forms of Arabic characters, including the pre-processing and tail removal phases.

Bibliography
[1] Lam L and Suen Y, Thinning Methodologies – A comprehensive survey, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 14, no. 9, Sept 1992, pp869-885.
[2] Rutovitz D. Pattern Recognition, Journal Royal Statistical Society, vol. 129, Series A, pp504-530 (1966).
[3] Hilditch C.J. Linear skeletons from square cupboards, Machine Intelligence, New York: Amer. Elsevier, 1969, vol. 4, pp403-420.
[4] Arcelli C. and Sanniti di Baja G., On the sequential line transformation. IEEE Trans Sys Man Cyber, vol. SMC-8, no. 2, pp139-144 (1978).
[5] Arcelli C. Pattern thinning by contour tracing, Computer Graphics and Image Processing, vol. 17, pp130-144 (1981).
[6] Hall R.W. Fast parallel thinning algorithms: parallel speed and connectivity preservation. Comm ACM, vol. 32, no. 1, pp124-131 (1989).
[7] Chen Y.S. and Hsu W.H. A comparison of some one-pass parallel thinnings. Pattern Recognition Letters, vol. 11, no. 1, pp35-41, 1990.
[8] Lam L and Suen C Y, An evaluation of parallel thinning algorithms for character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 9, September 1995.
[9] Xia Y. Skeletonization via the realisation of the fire front's propagation and extinction in digital binary shapes. IEEE Transactions PAMI, vol. 11, no. 10, pp1076-1086, October 1989.
[10] Jang B.K. and Chin R.T. Analysis of thinning algorithms using mathematical morphology. IEEE Transactions PAMI, vol. 12, no. 6, pp541-551, June 1990.
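
As a supplementary illustration, the tail-removal rule from the section Removing 'Tails' can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes a one-pixel thick skeleton as input, judges end points and intersections on the skeleton as it was before any deletion, and uses the threshold of 8 pixels reported for the test sample as a default. The names `remove_tails` and `black_neighbours` are hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # 1 = black (skeleton), 0 = white (background)
Pixel = Tuple[int, int]
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def black_neighbours(img: Image, r: int, c: int) -> List[Pixel]:
    """Coordinates of the black pixels that touch (r, c) under 8-connectedness."""
    rows, cols = len(img), len(img[0])
    return [
        (r + dr, c + dc)
        for dr, dc in NEIGHBOURS
        if 0 <= r + dr < rows and 0 <= c + dc < cols and img[r + dr][c + dc] == 1
    ]


def remove_tails(img: Image, threshold: int = 8) -> Image:
    """Delete strokes shorter than threshold pixels that start at an end point."""
    out = [row[:] for row in img]
    end_points = [
        (r, c)
        for r in range(len(img))
        for c in range(len(img[0]))
        if img[r][c] == 1 and len(black_neighbours(img, r, c)) == 1
    ]
    for start in end_points:
        if out[start[0]][start[1]] == 0:
            continue  # already deleted as part of another short stroke
        stroke = [start]
        prev: Optional[Pixel] = None
        cur = start
        while True:
            nxt = [p for p in black_neighbours(img, *cur) if p != prev]
            if len(nxt) != 1:
                break  # reached another end point
            prev, cur = cur, nxt[0]
            if len(black_neighbours(img, *cur)) >= 3:
                break  # reached an intersection, which is kept
            stroke.append(cur)
        if len(stroke) < threshold:
            for r, c in stroke:
                out[r][c] = 0
    return out
```

Because neighbour counts are taken from the input skeleton rather than the partially pruned copy, the result does not depend on the order in which end points are visited. Short isolated strokes are also deleted under this rule, which is consistent with dots having been removed and logged before thinning.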
