
IMAGE SEGMENTATION OF HISTORICAL DOCUMENTS
Carlos A.B. Mello and Rafael D. Lins
Department of Electronics and Systems, UFPE, Brazil
{cabm, rdl}@cin.ufpe.br

ABSTRACT
This paper presents a new entropy-based segmentation algorithm for images of documents. The algorithm is used to eliminate the noise inherent to the paper itself, especially in documents written on both sides. It generates good-quality monochromatic images, increasing the hit rate of commercial OCR tools.
I. INTRODUCTION
We are interested in the processing and automatic transcription of historical documents from the nineteenth century onwards. Image segmentation [2] of this kind of document is more difficult than that of more recent documents because, while the paper colour darkens with age, the printed part, either handwritten or typed, tends to fade. These two factors acting simultaneously narrow the discrimination gap between the two predominant colour clusters of documents. If a document is typed or written on both sides and the opacity of the paper is such as to allow the back printing to be visualized on the front side, the degree of difficulty of a good segmentation increases enormously. A new set of hues of paper and printing colours appears, and better filtering techniques are needed to filter out those pixels, reducing back-to-front noise.
The segmentation algorithm presented here was applied to documents from Joaquim Nabuco's1 file [5,12], held by the Joaquim Nabuco Foundation (a research center in Recife, Brazil). The segmentation process is used to generate high-quality greyscale or monochromatic images. Figure 1 shows the application of a nearest-colour algorithm for decreasing the number of colours of a sample document from Nabuco's bequest, using Adobe Photoshop [10]. As the document is written on both sides, the colour reduction process has not produced satisfactory results: the ink on one side of the paper interferes with the monochromatic image of the other side.
This paper introduces a new entropy-based segmentation algorithm and compares it with three of the most important entropy-based segmentation algorithms described in the literature. Two different grounds for comparison are presented: visual inspection of the filtered document and the response of Optical Character Recognition (OCR) tools.

1 Brazilian statesman, writer, and diplomat, one of the key figures in the campaign for freeing black slaves in Brazil, and Brazilian ambassador to London (b. 1861, d. 1910).
II. ENTROPY-BASED SEGMENTATION
The documents of Nabuco's file are digitized at 200 dpi in true colour and then converted to 256-level greyscale format by using the equation:
C = 0.3*R + 0.59*G + 0.11*B
where C is the new greyscale colour and R, G and B are,
respectively, the Red, Green and Blue components of the
palette of the original colour image.
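For illustration, a minimal sketch of this conversion in Python, assuming the colour image is held in a NumPy array; the function name is our own:

```python
import numpy as np

def to_greyscale(rgb):
    """Convert an RGB image (H x W x 3, values 0-255) to 256-level greyscale
    using the weights quoted above: C = 0.3*R + 0.59*G + 0.11*B."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    grey = 0.3 * r + 0.59 * g + 0.11 * b
    return np.clip(np.round(grey), 0, 255).astype(np.uint8)
```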
Three segmentation algorithms based on the entropy function [1], applied to greyscale images, are studied here: Pun [9], Kapur et al. [3] and Johannsen [8].
A. Pun's Algorithm
Pun's algorithm analyses the entropy of the black pixels, Hb, and the entropy of the white pixels, Hw, bounded by the threshold value t. The algorithm suggests that t is such that it maximizes the function H = Hb + Hw, where Hb and Hw are defined by:
$$H_b = -\sum_{i=0}^{t} p[i]\,\log(p[i]) \qquad \text{(Eq. 1)}$$

$$H_w = -\sum_{i=t+1}^{255} p[i]\,\log(p[i]) \qquad \text{(Eq. 2)}$$

where p[i] is the probability of occurrence in the image of a pixel i with colour colour[i]. The logarithm function is taken in base 256. Figure 2 presents the application of Pun's algorithm to the sample image shown in Figure 1 (left).
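For illustration, a minimal sketch of Eqs. (1) and (2) in Python, assuming a normalised 256-bin histogram p of the greyscale image; the function name and the guard against log(0) are our own choices:

```python
import numpy as np

def class_entropies(p, t, base=256):
    """Entropy of the colours up to t (Eq. 1) and above t (Eq. 2) for a
    normalised 256-bin histogram p, with logarithms taken in the given base."""
    safe = np.where(p > 0, p, 1.0)            # empty bins contribute 0 to the sum
    terms = -p * np.log(safe) / np.log(base)  # change of base
    hb = terms[:t + 1].sum()                  # Eq. 1: colours 0..t
    hw = terms[t + 1:].sum()                  # Eq. 2: colours t+1..255
    return hb, hw
```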
B. Kapur et al.'s Algorithm
Reference [3] defines a probability distribution A for the object and a probability distribution B for the background of the document image, such that:

$$A:\; p_0/P_t,\; p_1/P_t,\; \ldots,\; p_t/P_t$$
$$B:\; p_{t+1}/(1 - P_t),\; p_{t+2}/(1 - P_t),\; \ldots,\; p_{255}/(1 - P_t)$$

where $P_t = \sum_{i=0}^{t} p_i$.

The entropy values Hw and Hb are evaluated using equations (1) and (2) above, with p[i] following the distributions A and B defined above. The maximization of the function Hw + Hb is analysed to define the threshold value t. The sample image of Figure 1 (left) is segmented with this algorithm and the result is presented in Figure 3.

Figure 3. Kapur et al.'s segmentation.
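A minimal sketch of this criterion in Python, again assuming a normalised 256-bin histogram p; the function name, the use of the natural logarithm (the base does not affect the argmax) and the handling of degenerate thresholds are our own choices:

```python
import numpy as np

def kapur_threshold(p):
    """Return the threshold t maximising the sum of the entropies of the
    object distribution A and the background distribution B defined above."""
    def entropy(q):
        q = q[q > 0]
        return float(-(q * np.log(q)).sum())

    cum = np.cumsum(p)                          # cum[t] = P_t
    best_t, best_h = 0, -np.inf
    for t in range(255):
        Pt = cum[t]
        if Pt <= 0.0 or Pt >= 1.0:              # skip degenerate splits
            continue
        hb = entropy(p[:t + 1] / Pt)            # entropy of distribution A
        hw = entropy(p[t + 1:] / (1.0 - Pt))    # entropy of distribution B
        if hb + hw > best_h:
            best_t, best_h = t, hb + hw
    return best_t
```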
C. Johannsen's Algorithm
Another variation of an entropy-based algorithm is proposed by Johannsen, which tries to minimize the function Sb(t) + Sw(t), with:
$$S_w(t) = \log\!\left(\sum_{i=t+1}^{255} p_i\right) + \frac{1}{\sum_{i=t+1}^{255} p_i}\left[E(p_t) + E\!\left(\sum_{i=t+1}^{255} p_i\right)\right]$$

and

$$S_b(t) = \log\!\left(\sum_{i=0}^{t} p_i\right) + \frac{1}{\sum_{i=0}^{t} p_i}\left[E(p_t) + E\!\left(\sum_{i=0}^{t-1} p_i\right)\right]$$
where E(x) = -x log(x) and t is the threshold value. Figure 4 presents the application of this algorithm to the image of the document under study.

Figure 4. Johannsen's segmentation.
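A minimal sketch of this minimisation in Python, under our reading of the summation limits in the formulas above; the names and the guards against empty classes are our own:

```python
import numpy as np

def johannsen_threshold(p, eps=1e-12):
    """Return the threshold t minimising Sb(t) + Sw(t), with E(x) = -x log(x)."""
    def E(x):
        return -x * np.log(x) if x > eps else 0.0

    cum = np.cumsum(p)                    # cum[t] = sum of p_i for i = 0..t
    best_t, best_s = 0, np.inf
    for t in range(1, 255):
        pb = cum[t]                       # probability mass of the black class
        pw = 1.0 - cum[t]                 # probability mass of the white class
        if pb <= eps or pw <= eps or p[t] <= eps:
            continue
        sb = np.log(pb) + (E(p[t]) + E(cum[t - 1])) / pb
        sw = np.log(pw) + (E(p[t]) + E(pw)) / pw
        if sb + sw < best_s:
            best_t, best_s = t, sb + sw
    return best_t
```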


III. A NEW SEGMENTATION ALGORITHM

Figure 1. (left) Original image in 256 greyscale levels and (right) its monochromatic version generated by Photoshop.

Figure 2. Pun's algorithm applied to the document.

The algorithm scans the image looking for the most frequent colour, which is likely to belong to the image background (the paper). This colour is used as the initial threshold value, t, to evaluate Hw and Hb as defined in equations (1) and (2).
The entropy H of the complete histogram of the
image is also evaluated. It must be noticed that in this
new algorithm the logarithmic function used to evaluate
H, Hw and Hb is taken with a base equal to the product of
the dimensions of the image. This means that, if the image has dimensions x by y, the logarithmic base is x.y. As can be seen in [4], this does not change the
concept of entropy.
Using the value of H, two multiplicative factors, mw and mb, are defined following the rules:
If H ≤ 0.25, then mw = 2 and mb = 3
If 0.25 < H < 0.30, then mw = 1 and mb = 2.6
If 0.30 ≤ H < 0.305, then mw = 1 and mb = 2
If H ≥ 0.305, then mw = 0.8 and mb = 0.8

These values of mw and mb were found empirically after several experiments. For now, they can be applied to images of historical documents only; for any other kind of image, these values must be analysed again. We emphasise that this new algorithm was developed to work with images with the characteristics of historical documents.
The greyscale image is scanned again and each pixel i with colour[i] is turned white if:
colour[i]/256 ≥ mw*Hw + mb*Hb
Otherwise, its colour remains the same (to generate a new greyscale image) or it is turned to black (generating a monochromatic image).
This condition can be inverted, generating a new image where the pixels with colours corresponding to the ink are eliminated and only the pixels classified as paper remain.
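Putting the pieces together, a minimal sketch of the whole procedure in Python as we read it, assuming the greyscale image is held in a NumPy array; all names are our own, and the ≥ comparison in the whitening test follows our reading of the rule above:

```python
import numpy as np

def segment(grey, to_binary=True):
    """Entropy-based segmentation of a greyscale historical-document image
    (values 0-255), following the rules described above."""
    x, y = grey.shape
    base = x * y                                   # logarithmic base = x.y
    p = np.bincount(grey.ravel(), minlength=256) / grey.size
    t = int(np.argmax(p))                          # most frequent colour = initial threshold

    safe = np.where(p > 0, p, 1.0)                 # empty bins contribute 0
    terms = -p * np.log(safe) / np.log(base)
    H = terms.sum()                                # entropy of the whole histogram
    Hb = terms[:t + 1].sum()                       # Eq. 1, taken in base x.y
    Hw = terms[t + 1:].sum()                       # Eq. 2, taken in base x.y

    # Empirical multiplicative factors for historical documents
    if H <= 0.25:
        mw, mb = 2.0, 3.0
    elif H < 0.30:
        mw, mb = 1.0, 2.6
    elif H < 0.305:
        mw, mb = 1.0, 2.0
    else:
        mw, mb = 0.8, 0.8

    white = (grey / 256.0) >= (mw * Hw + mb * Hb)  # pixels classified as paper
    if to_binary:
        return np.where(white, 255, 0).astype(np.uint8)
    out = grey.copy()                              # greyscale output: paper turned white
    out[white] = 255
    return out
```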
This new segmentation algorithm was used for two kinds of applications: 1) to create high-quality monochromatic images of the documents for minimum storage space and efficient network transmission, and 2) to look for better hit rates from commercial OCR tools. The application of the algorithm to the sample document of Figure 1 (left) can be found in Figure 5.

Figure 5. Application of the new segmentation algorithm to the document presented in Figure 1 (left).
Comparing Figures 2, 3, 4 and 5, one can observe that the algorithm proposed in this paper yielded the best quality image, with most of the back-to-front interference removed.
It is also important to notice that the new
algorithm presented the lowest processing time amongst
the algorithms analysed.
The entropy filtering presented here was applied to a set of 40 images of documents and letters from Nabuco's bequest. Unsatisfactory images, requiring the intervention of an operator, were produced in only four cases.

Figure 6 zooms into one of these documents and shows the output obtained.
For typed documents (also from Nabuco's file), the segmentation algorithm was applied in search of better responses from commercial OCR tools. In previous tests [6], the OCR tool Omnipage [11] from Caere Corp. achieved the best hit rates1 amongst the six commercial software tools analysed. These rates reached almost 99% in some cases. When applied to historical documents, however, this rate decreased to much lower values. The segmented images for a sample typed document can be seen in detail in Figure 7.

Figure 6. (top left) Original image; (top right) original image in black-and-white; (center left) original image segmented by Pun's algorithm; (center right) application of Kapur et al.'s algorithm; (bottom left) Johannsen's algorithm; and (bottom right) our algorithm applied to the original image.
Table 1 below presents the hit rate of Omnipage for four typed documents representative of Nabuco's bequest after segmentation with the four entropy-based algorithms presented here. They are compared with the use of the original image with no pre-processing besides that performed by the software itself (the column labeled Omnipage).
Image    Omnipage    Johannsen    Pun     Kapur et al    New Scheme
D023     80.3        78.3         43.3    91.7           91.4
D064     84.4        84.5         63.7    85.2           80.1
D077     80.1        80.1         71.8    77.3           92.4
D097     75.4        5.1          69.5    73.4           88.0

Table 1. Hit rate (in percentage) of Omnipage for images of typed historical documents.

1 Number of characters correctly transcribed from image to text.

A little degradation in the hit rate of the software, when compared with its use after the application of the new segmentation technique, can be seen in one of the cases (the D064 image). This degradation can be justified by a possible loss of part of some characters in the segmentation process, producing errors in the character recognition process. Even so, the segmentation algorithm proposed in this paper reached the best rates on average.

Figure 7 (bottom) shows another application of the algorithm, as explained before, where the frequencies classified as ink are eliminated and only the background of the image (the paper) remains. This image is used in another part of the system for the generation of paper texture for historical documents [7].


Figure 7. (top left) Original image, (top right) segmented image (ink) and (bottom) negative segmentation (paper).


The algorithm was also tested against other segmentation methods, such as iterative selection, yielding better results both in terms of OCR hit rates and in the visual inspection of the quality of the monochromatic images.


IV. CONCLUSION
This paper introduces a new segmentation algorithm for historical documents, which is particularly suitable for reducing the back-to-front noise of documents written on both sides. Applied to a set of 40 samples from Nabuco's bequest, it worked satisfactorily in 90% of them, producing, under visual inspection, better quality images than the best-known algorithms described in the literature.
The automatic image-to-text transcription of those documents using Omnipage 9.0, a commercial OCR tool from Caere Corp. [11], improved after segmentation.
The algorithm presented did not work well with very faded documents. We are currently working on re-tuning the algorithm for this class of documents.

V. REFERENCES
[1] N. Abramson. Information Theory and Coding. McGraw-Hill Book Company, 1963.
[2] R. Gonzalez and P. Wintz. Digital Image Processing. Addison-Wesley, 1987.
[3] J. N. Kapur, P. K. Sahoo and A. K. C. Wong. A New Method for Gray-Level Picture Thresholding using the Entropy of the Histogram. Computer Vision, Graphics and Image Processing, 29(3), 1985.
[4] S. Kullback. Information Theory and Statistics. Dover Publications, Inc., 1997.
[5] R. D. Lins et al. An Environment for Processing Images of Historical Documents. Microprocessing & Microprogramming, pp. 111-121, North-Holland, 1995.
[6] C. A. B. Mello and R. D. Lins. A Comparative Study on Commercial OCR Tools. Vision Interface '99, pp. 224-323, Québec, Canada, 1999.
[7] C. A. B. Mello and R. D. Lins. Generating Paper Texture Using Statistical Moments. IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, Turkey, June 2000.
[8] J. R. Parker. Algorithms for Image Processing and Computer Vision. John Wiley and Sons, 1997.
[9] T. Pun. Entropic Thresholding, A New Approach. Graphics and Image Processing, 16(3), 1981.
[10] Adobe Systems Inc. http://www.adobe.com
[11] Caere Corporation. http://www.caere.com
[12] Nabuco Project. http://www.di.ufpe.br/~nabuco

