DOCUMENTS
Carlos A.B. Mello and Rafael D. Lins
Department of Electronics and Systems UFPE Brazil
{cabm, rdl}@cin.ufpe.br
ABSTRACT
This paper presents a new entropy-based segmentation
algorithm for images of documents. The algorithm is used
to eliminate the noise inherent to the paper itself, especially
in documents written on both sides. It generates
good-quality monochromatic images, increasing the hit rate
of commercial OCR tools.
I. INTRODUCTION
We are interested in the processing and automatic
transcription of historical documents from the nineteenth
century onwards. Image segmentation [2] of this kind of
document is more difficult than that of more recent documents
because, while the paper colour darkens with age, the
printed part, either handwritten or typed, tends to fade.
These two factors acting simultaneously narrow the
discrimination gap between the two predominant colour
clusters of documents. If a document is typed or written
on both sides and the opacity of the paper is such as to
allow the back printing to be seen on the front side,
the difficulty of good segmentation increases
enormously: a new set of hues of paper and printing
colours appears. Better filtering techniques are needed to
filter out those pixels, reducing back-to-front noise.
The segmentation algorithm presented was
applied to documents from Joaquim Nabuco's file [5,12],
held by the Joaquim Nabuco Foundation (a research center in
Recife, Brazil). The segmentation process is used to
generate high-quality greyscale or monochromatic images.
Figure 1 shows the application of a nearest colour
algorithm for decreasing the colours of a sample
document from Nabuco's bequest, using Adobe Photoshop
[10]. Because the document is written on both sides, the
colour reduction process has not produced satisfactory
results: the ink of one side of the paper interferes with the
monochromatic image of the other side.
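The interference described above can be illustrated with a minimal sketch of nearest colour reduction to a two-colour palette. This is not the Adobe Photoshop implementation: the function names, the squared RGB distance, and the three pixel values are assumptions chosen for illustration only.

```python
def nearest_colour(pixel, palette):
    """Return the palette entry with the smallest squared RGB distance to pixel."""
    return min(palette, key=lambda c: sum((a - b) ** 2 for a, b in zip(pixel, c)))


def reduce_colours(image, palette):
    """Map every pixel of a row-major image to its nearest palette colour."""
    return [[nearest_colour(px, palette) for px in row] for row in image]


BLACK, WHITE = (0, 0, 0), (255, 255, 255)

# Hypothetical pixel values: aged paper, front-side ink, ink showing through
# from the back side.
row = [(225, 205, 170), (60, 45, 40), (120, 100, 85)]
print(reduce_colours([row], [BLACK, WHITE]))
```

In this toy case the show-through pixel is closer to black than to white, so it ends up in the monochromatic image together with the front-side ink.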
This paper introduces a new entropy-based segmentation
algorithm and compares it with three of the most
important entropy-based segmentation algorithms
$$S_w(t) = \log\Big(\sum_{i=t+1}^{255} p_i\Big) + \frac{1}{\sum_{i=t+1}^{255} p_i}\Big[E(p_t) + E\Big(\sum_{i=t+1}^{255} p_i\Big)\Big]$$
and
$$S_b(t) = \log\Big(\sum_{i=0}^{t} p_i\Big) + \frac{1}{\sum_{i=0}^{t} p_i}\Big[E(p_t) + E\Big(\sum_{i=0}^{t-1} p_i\Big)\Big]$$
where E(x) = -x log(x) and t is the threshold value. Figure 4
presents the application of this algorithm to the image of
the document under study.
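A minimal sketch of how S_b(t) and S_w(t) could be evaluated over a grey-level histogram is given below. Several points are assumptions rather than statements from this text: p_i is taken to be the relative frequency of grey level i, and the threshold is taken to be the t that minimises S_b(t) + S_w(t), which is the usual formulation of the method attributed to Johannsen and Bille. The summation limits follow what can be read from the text; `outer_white_offset` is a hypothetical switch that starts the two outer white-class sums at i = t instead of i = t + 1, mirroring S_b. All function names are illustrative.

```python
import math


def entropy_term(x):
    """E(x) = -x * log(x), with E(0) taken as 0 (the limit as x -> 0)."""
    return -x * math.log(x) if x > 0 else 0.0


def class_entropies(p, t, outer_white_offset=1):
    """Return (S_b, S_w) at threshold t, or None when a class is empty.

    outer_white_offset is a hypothetical switch, not part of the text:
    1 starts every white-class sum at i = t + 1,
    0 starts the two outer white-class sums at i = t.
    """
    black_outer = sum(p[: t + 1])                   # sum over i = 0 .. t
    black_inner = sum(p[:t])                        # sum over i = 0 .. t - 1
    white_outer = sum(p[t + outer_white_offset:])   # i = t + 1 .. end by default
    white_inner = sum(p[t + 1:])                    # sum over i = t + 1 .. end
    if black_outer <= 0 or white_outer <= 0:
        return None
    s_b = math.log(black_outer) + (entropy_term(p[t]) + entropy_term(black_inner)) / black_outer
    s_w = math.log(white_outer) + (entropy_term(p[t]) + entropy_term(white_inner)) / white_outer
    return s_b, s_w


def select_threshold(histogram, outer_white_offset=1):
    """Assumed criterion: pick the t that minimises S_b(t) + S_w(t)."""
    total = sum(histogram)
    if total <= 0:
        raise ValueError("histogram is empty")
    p = [count / total for count in histogram]
    best_t, best_score = None, math.inf
    for t in range(len(p)):
        entropies = class_entropies(p, t, outer_white_offset)
        if entropies is None:
            continue  # one of the two classes holds no pixels at this t
        score = sum(entropies)
        if score < best_score:
            best_t, best_score = t, score
    return best_t


def binarize(pixels, t):
    """Grey levels up to t become black (0), the rest white (255)."""
    return [[0 if value <= t else 255 for value in row] for row in pixels]
```

For a 256-bin histogram `hist` of a greyscale image `pixels`, `binarize(pixels, select_threshold(hist))` would give the monochromatic image under these assumptions.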
Document   OmniPage   Johannsen   Pun    Kapur et al   New Scheme
D023       80.3       78.3        43.3   91.7          91.4
D064       84.4       84.5        63.7   85.2          80.1
D077       80.1       80.1        71.8   77.3          92.4
D097       75.4       5.1         69.5   73.4          88.0
IV. CONCLUSION
This paper introduces a new segmentation algorithm for
historical documents, which is particularly suitable for
reducing the back-to-front noise of documents written on both
sides. Applied to a set of 40 samples from Nabuco's
bequest, it worked satisfactorily in 90% of them.
V. REFERENCES
C.