
2016 International Conference on Computer Communication and Informatics (ICCCI-2016), Jan. 07 - 09, 2016, Coimbatore, INDIA

A Technical Study and Analysis of Text Classification Techniques in N-Lingual Documents

An Overview of Various Methodologies for Identifying and Categorizing Textual Data in N-Lingual Printed Imaged and Non-Scanned Documents

Shalini Puri
Research Scholar
BIT, Mesra, Ranchi, Noida Campus
Noida, India
eng.shalinipuri30@gmail.com

S. P. Singh
Department of Computer Science
BIT, Mesra, Ranchi, Noida Campus
Noida, India
spsinghbit@yahoo.co.in

Abstract- In the current era, there is a high demand for accurate text identification and categorization methods for N-lingual non-scanned and scanned machine printed documents, where N represents mono, bi, tri or multi mode. In this paper, a technical study and analysis is presented to show N-lingual document classification for normal text, printed and handwritten documents. Text classification for normal text documents is simple, whereas in scanned machine printed systems it inherently begins with the correct recognition of text, i.e., characters and words. The steps involved in the latter case are script identification, page layout determination, separation of text and non-text data, line segmentation, word detection and finally character recognition. After these processing steps, the text or script is identified and separated. Three statistically analyzed charts are also shown, which are based on content type classification, language-mode pair and most-to-least preferred languages of existing algorithms.

Keywords- N-Lingual Documents; Scanned and Non-scanned Documents; Script Identification; Text Detection; Text Classification

I. INTRODUCTION

With increased technological advancement, the digitization of the world and the rapid development of automated systems, there is a high demand for efficient and accurate classification systems [1-9]. Such systems not only cover the categorization of textual data (.doc, .txt etc.) [5-7]; the field has also been extended to machine printed (scanned image) [8-9] and handwritten (scanned form) data. In this context, important information presented in the form of images has attracted a lot of attention from researchers over the last two decades. Information obtained from such images may lead to a wide range of applications: content based image retrieval systems [8-13], automatic document readers [2-22], automatic information extraction from images [8-9], [11], [16-17], [19], [21-22] and video frames etc. Nowadays, document classification systems, for either scanned or non-scanned text, are no longer limited to one language; they are being developed to accept documents in more than one language. So, these systems must be capable of categorizing them efficiently into the desired categories. Many efficient techniques and methods have been developed in this direction.

Systems based on different languages and scripts are also gaining much attention from researchers. Today, documents and data are available in many languages using different or the same scripts. Additionally, many online and offline systems have been built for operations like translation, conversion, keyword searching etc. Such systems accept scanned/non-scanned monolingual [2-4], [11], [22], bilingual [5-10], [12-19], [21], trilingual and multilingual documents [20]. Many researchers are also focusing on multilingual text classification (either scanned or non-scanned) irrespective of the increased system and time complexity.

The paper is organized as follows. Section II discusses theoretical perspectives, which include document classification, a description of text extraction and classification of printed images, and the structure of N-lingual documents and systems. In the last case, a hierarchy is designed to depict the wide use and applications of such systems. In Section III, a comparative study of different models is presented. In Section IV, three charts are shown to present the analyzed results. Section V concludes the paper.

II. THEORETICAL PERSPECTIVES

A. Document Classification

Document classification [2-3], [5-14] refers to classifying a given set of documents into predefined categories under the supervised learning paradigm, so that each document is categorized and grouped into the most suitable category. It is the opposite of clustering, which uses unsupervised learning to form clusters of data. In classification, a set of training documents is first preprocessed and used to train the classifier on the basis of predefined labels. Then, testing documents are automatically classified by the classifier. Such a mechanism is used not only to classify web pages, text documents, images, video and speech, but also to separate scripts and to recognize characters, words and text lines in machine printed imaged documents.
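As a rough illustration of the train-then-classify workflow described for supervised document classification, the following minimal sketch trains a classifier on labeled documents and then labels an unseen one. It assumes a multinomial Naive Bayes model with add-one smoothing and a whitespace tokenizer; these choices, the function names and the toy documents are illustrative assumptions and are not taken from any of the surveyed systems (most of which use SVM or other classifiers).

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for real preprocessing)."""
    return text.lower().split()


def train(labeled_docs):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    doc_counts = Counter()               # number of training documents per label
    word_counts = defaultdict(Counter)   # word frequencies per label
    vocabulary = set()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-posterior for an unseen document."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / total_docs)  # log prior
        label_total = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (label_total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set with two predefined categories.
training_docs = [
    ("the court ruled on the appeal", "law"),
    ("the judge heard the case in court", "law"),
    ("the team won the final match", "sports"),
    ("the player scored in the match", "sports"),
]
model = train(training_docs)
predicted = classify(model, "the judge dismissed the appeal")
```

Here `training_docs` plays the role of the preprocessed training set with predefined labels, and the call to `classify` corresponds to the automatic classification of a testing document.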

978-1-4673-6680-9/16/$31.00 ©2016 IEEE



B. Text Extraction and Classification in Printed Images

In recent years, many researchers have focused on N-lingual printed imaged text classification. For printed documents, researchers have worked on the extraction of written text [8-9], [16-17], [19], [21], separation of topics [10], separation of texts in two languages, identification of lines and words, separation of text and non-text documents, OCR [18-19], [21-22], text classification [8], word classification [9] and more. Some further related aspects are given below.

• Feature extraction and feature selection methods used to identify characters, words and lines for different languages.

• Consideration of font, style, size, regular/basic, bold/italics/underline, portrait/landscape, colored/BW etc.

• Processing of scanned/non-scanned documents with one/two/three/multiple languages.

The basic steps of text extraction and classification from a printed imaged document include image acquisition, preprocessing, page layout detection and segmentation, line detection and segmentation, word identification, character recognition and post processing. This method is widely used in script classification and image mining.

C. N-Lingual Documents and Systems

N-lingual documents are characterized as documents whose text is written using mono/bi/tri/multi scripts. Systems for such documents work on classification, extraction or separation of text. In N-lingual documents, text written in N languages is contained in one document. Researchers have done a lot of work in this field and have designed different techniques, methodologies and mechanisms for it. Fig. 1 depicts a taxonomy of N-lingual based systems, which shows the wide scope of fields and applications. Here the input documents can be in Indic languages, non-Indic languages or both.

As an example, consider the case of the Hindi language. Hindi [5], [12], [21], written in Devnagari script [5], [18], [19] and an official language of India, has received a lot of attention from researchers and scientists in the last decade. It is the native language of 10 states in India. Today, a great deal of research work is going on for the identification of characters and words, and for the separation of a letter into three zones, the shirorekha and lines [18-19], [21-22] etc.
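For non-scanned text, the mono/bi/tri/multi distinction can be approximated by counting the scripts that occur in a document. The sketch below is only an illustration under stated assumptions: it takes the first word of each character's Unicode name (for example `LATIN` or `DEVANAGARI`) as a stand-in for its script, and the function names are hypothetical. It counts scripts, not languages, so two languages sharing one script (such as English and Spanish) would be reported as a single script.

```python
import unicodedata
from collections import Counter


def script_of(char):
    """Approximate a character's script by the first word of its Unicode name."""
    if not char.isalpha():
        return None  # skip digits, punctuation, whitespace and combining signs
    name = unicodedata.name(char, "")
    return name.split(" ")[0] if name else None


def script_profile(text):
    """Count alphabetic characters per script in a non-scanned text."""
    return Counter(s for s in map(script_of, text) if s is not None)


def lingual_mode(text):
    """Label a text as mono/bi/tri/multi by the number of scripts found."""
    n = len(script_profile(text))
    return {0: "none", 1: "mono", 2: "bi", 3: "tri"}.get(n, "multi")


sample = "Hindi is written as हिन्दी in Devanagari."
profile = script_profile(sample)
mode = lingual_mode(sample)
```

For the sample sentence the profile contains the two scripts `LATIN` and `DEVANAGARI`, so `mode` is `"bi"`. Scanned documents would first need the recognition steps listed in Section II-B before such a count could be made.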

[Figure 1: tree diagram. Root: N-Lingual Based Systems. First level: Recognizers, Search Engines, Editor, Other Supplements. Second level: Translators/Converters, Optical Character Recognizer, Speech Recognizers, Reader (Pronunciation), Writer, Dictionary, Lexicons, Glossary. Third level: Into Target Language, Image, Text to Speech, Speech to Text.]

Figure 1. Taxonomy of N-Lingual Based Systems

III. A TECHNICAL STUDY OF N-LINGUAL MODELS BASED ON DIFFERENT MODES

Efficient bilingual document classification depends on a number of key concerns which need to be evaluated and analyzed to get accurate and correct results. Table I provides a detailed summary of these concerns and compares different existing techniques and algorithms. The summary covers the primary concerns addressed by the respective models and algorithms. Each algorithm is designed for a particular problem of N-lingual document classification, for which it provides a solution. In addition, the review includes the mode (text / printed / handwritten), lingual type, target languages, classifier used and data sets for the experiments. Such a study helps in understanding the concepts underlying each model or system and also aids in evaluating its scope. Most of the systems use the bilingual mode, with English as the preferred language and SVM as the classifier. This study provides an overview and summary of their usage in the current document classification environment.

TABLE I. COMPARISON AMONG VARIOUS EXISTING ALGORITHMS BASED ON KEY ASPECTS

| Author | Problem Focused | Proposed Solution | Primary Aspects | Mode | N-Lingual | Target Languages | Classifier | Data Set |
|---|---|---|---|---|---|---|---|---|
| Sinha [5] | History of Indian Scripts and Computer Processing of Indian Languages | Focus on Design Challenges and Structure Similarity | Coding Concerns; Word Processing; Translation; Character Recognition | Text | Mono Lingual | Devnagari (Hindi) | - | - |
| Civera et al. [6] | Bilingual Text Classification | Inclusion of Cross Lingual Structure using IBM1 Model | Language Pairs | Text | Bi Lingual | English, Spanish and French | SVM | Traveller Set (Spanish - English); BAF Corpus (French - English) |
| Civera et al. [7] | Bilingual Text Categorization | 3 State Transducer Algorithms; Joined Naive Smoothed N-Grams | - | Text | Bi Lingual | English and Spanish | Bilingual Classifiers | Two Categorized Bilingual Corpora |
| Lin et al. [8] | Textual Entities Classification | First find Chinese Parts Before Merging and then send Latter Characters to Chinese Recognizer | Character Recognizers for Chinese, Alphanumeric & Punctuation | Printed | Bi Lingual | Chinese and English | Decision-Tree SVM (Using Local SVMs) | 1517 Images from Newspapers and Magazines. Samples are: 365672 for Training; 91418 for Validation; 91418 for Testing |
| Haboubi et al. [9] | Word Classification | Separate Words by using Statistical and Geometric Analysis | Dot, Line and Word Detection; Character Recognizer | Printed | Bi Lingual | Arabic and Latin | Neural Network (NN) | 57 Documents. Words obtained: 4229 for Training; 846 for Testing |
| Zhang [10] | Bilingual Topic Taxonomy Generation | Document Clustering Before or After Text Feature Reconstruction | Using Bilingual Dictionary; Find Cluster Net Similarity, Entropy and Purity | Text | Bi Lingual | Chinese and English | Affinity Propagation | 5 Different Sized Law and Language Domains of Parallel Corpora |
| Ibrahim et al. [11] | Text & Non-Text Segmentation and Classification from Document Images | Block Segmentation with Dilation, Labeling and Zoning | Character Recognition | Printed | Mono Lingual | English | SVM and BPNN | 240 Images for Testing - 60 for each Text & Non-Text by using SVM and BPNN separately |
| Joshi et al. [12] | Evaluation of Machine Translation Systems | Human and Automatic Evaluation | Translators: Online (µsoft, Google, Babylon); Trained (Moses Phrase, E5 Moses Syntax & E6 Example Based) | Text | Bi Lingual | English and Hindi | - | Sentences Taken: 15000 Training - from Tourism Domain, 3300 for Tuning (Tuning from ACL's workshop) and 1300 for Testing |
| Liu et al. [13] | Web Page Classification | Dictionary Based Text Categorization | Character Recognizer; Encoding Detection and Integration | Text | Bi Lingual | Chinese and English | KNN | Three Leveled ChiN Editors having 341 Categories |
| Al-Ghuribi et al. [14] | Web Page Text Classification | Technical Study | Represent Document; Measuring Factors | Text | Bi Lingual | Arabic and English | - | - |
| Wu et al. [15] | Topic Aspect (Subtopic or Facet) Classification | Identifying Contiguous Passages Using Sentiment Analysis | Similarity Measures; Optimal LSA Dimension; Centroid Correction | Text | Bi Lingual | Chinese and English | KNN & LSA | TDT3 & TDT4; 3 Sources for Each Language; 3388 Chinese & 38083 English Documents |
| Mohanty et al. [16] | Indian Scripts Recognition | Script Separation using Horizontal Histogram | Laser Print Documents; Variable Font Styles and Sizes | Printed | Bi Lingual | English and Oriya | SVM | From Books and Newspapers and more than 10000 Samples used for Training and Testing |
| Dhanya et al. [17] | Script Identification | 3 Spatial Zone Word Subdivision; Distribute Word Directional Energy using Gabor Filters | Word Separation; Languages having Variable Styles & Fonts | Printed | Bi Lingual | English and Tamil | SVM, NN and KNN | From Magazines and Newspapers: Total Samples - 1008 in each Training and Testing having equal Number of Tamil and English Words |
| Pal et al. [18] | Text Line Identification | Structural and Statistical Features | Character Recognition | Printed and Handwritten | Bi Lingual | Bangla and Devnagari | Three Tier Tree Classifier | Handwritten Documents of 1500 Individuals for Statistical Analysis and 600 Document Images for Testing |
| Chaudhari et al. [19] | Reading Two Indian Language Scripts | Detect Skew & Zone Separation; Segmenting Line, Word & Character | Character Recognition Only for Single Font | Printed | Bi Lingual | Bangla and Devnagari | Decision Binary Tree | Corpus of 2.5 Million and 3 Million Words in both Languages. Checked Errors on 10000 Words |
| Gaizauskas et al. [20] | Identification of the Domain of a Term | Domain Classification of the Document | 5 Language Pairs | Text | Multi Lingual | English, German, Spanish, Czech, Lithuanian and Latvian | Domain Classifier | Multilingual Thesaurus and Wikipedia |
| Jawahar et al. [21] | OCR | Text Block Identification and Extraction, Line and Word Segmentation | - | Printed | Bi Lingual | Hindi - Telugu | SVM and KNN | Collected from Newspapers etc. 2 Lakh Sample Characters & 90000 Samples to Experiment on Font Independence |
| Kunte et al. [22] | Recognition of Complete Set of Kannada Characters | Based on Wavelet based Feature Extraction Approach | Character Recognizer; Font & Size Independence | Printed | Mono Lingual | Kannada | Two-Stage Multi-NN | From Magazines and Newspapers: 30 Training Samples with 9000 Characters and 6000 Test Characters |
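
The language-preference percentages discussed in Section IV can be cross-checked by tallying the Target Languages column of Table I. The following is a small sketch of that arithmetic, not the authors' procedure: the dictionary is a manual transcription of the table keyed by reference number, and it assumes that Devnagari entries are counted as Hindi, following the "Devnagari (Hindi)" entry for [5]. The names `TARGET_LANGUAGES` and `language_usage` are hypothetical.

```python
from collections import Counter

# Target languages per surveyed system, transcribed by hand from Table I.
# Assumption: Devnagari is counted as Hindi for [5], [18] and [19].
TARGET_LANGUAGES = {
    5: ["Hindi"],
    6: ["English", "Spanish", "French"],
    7: ["English", "Spanish"],
    8: ["Chinese", "English"],
    9: ["Arabic", "Latin"],
    10: ["Chinese", "English"],
    11: ["English"],
    12: ["English", "Hindi"],
    13: ["Chinese", "English"],
    14: ["Arabic", "English"],
    15: ["Chinese", "English"],
    16: ["English", "Oriya"],
    17: ["English", "Tamil"],
    18: ["Bangla", "Hindi"],
    19: ["Bangla", "Hindi"],
    20: ["English", "German", "Spanish", "Czech", "Lithuanian", "Latvian"],
    21: ["Hindi", "Telugu"],
    22: ["Kannada"],
}


def language_usage(systems):
    """Return the percentage of systems that target each language."""
    counts = Counter(lang for langs in systems.values() for lang in langs)
    total = len(systems)
    # most_common() orders languages from most to least used.
    return {lang: 100.0 * n / total for lang, n in counts.most_common()}


usage = language_usage(TARGET_LANGUAGES)
least_used = [lang for lang, pct in usage.items() if pct < 6.0]
```

Under these assumptions English is targeted by 12 of the 18 systems (about 66.67%), and ten languages are each targeted by a single system (about 5.5%).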

IV. ANALYSIS RESULTS AND DISCUSSION

This section discusses three analytical results which are obtained from Table I and are based on the different measuring criteria given below. These criteria provide statistically comparative results for content type classification, N-lingual - mode pair and language preference.

A. Based on Content Type Classification

The first criterion for analyzing the different architectures and algorithms is the content type used for text classification, as shown in Fig. 2. The content types are script identification or classification, domain or topic classification, web page classification, text categorization and machine translation. The first and fifth types come under the category of recognizers in Fig. 1, whereas the other three are related to text, domain and web content categorization. The statistical measurements in Fig. 2 show that script classification has the highest share, i.e., 50%, whereas machine translation has the lowest, i.e., 5.5%.

[Figure 2: chart. Legend: Script Identification/Classification, Domain/Topic Classification, Web Page Classification, Text Categorization, Machine Translation.]

Figure 2. Graph Depicting Content Type Comparisons for Document Classification

B. Based on Language-Mode Pair

Fig. 3 depicts an analysis graph for the different languages and modes used by the proposed algorithms. It shows the use of one/two/three/multiple languages along with normal text (non-scanned), printed text (image) and handwritten text (image). The figures show that 38.38% of the systems work on bilingual normal text, whereas monolingual text, bilingual printed and handwritten, and multilingual text documents are the least preferred ones for document classification.

[Figure 3: horizontal bar chart. Bars: Bilingual Text, Bilingual Printed, Monolingual Printed, Monolingual Text, Bilingual Printed and Handwritten, Multi-lingual Text. Horizontal axis: % Use of Each Pair (0 to 100).]

Figure 3. An Analysis Graph for Different Language-Mode Pairs

C. Based on Most Preferred to Least Preferred Language

To differentiate the proposed categorization techniques, a third criterion is to find the language which is primarily used in the system designs. English, at 66.67%, was the most popular and preferred language among all the systems, whereas 10 languages, each at 5.5% use, are the least preferred. This is depicted in Fig. 4.

[Figure 4: horizontal bar chart. Bars: English 66.67%, Hindi 27.78%, Chinese 22.22%, Spanish 16.7%, Arabic 11.1%, Bangla 11.1%, Latvian 5.5%, Lithuanian 5.5%, Czech 5.5%, German 5.5%, Latin 5.5%, French 5.5%, Oriya 5.5%, Tamil 5.5%, Kannada 5.5%, Telugu 5.5%. Horizontal axis: % Use of Target Languages (0 to 100).]

Figure 4. Analyzing the Usage of Most Preferred to Least Preferred Languages

V. CONCLUSION

In this paper, we presented a detailed technical study and review of different N-lingual models for document classification. It depicts a framework of multi mode systems with respect to the current scenario. The algorithms are compared on different aspects. Three analytical charts are also presented and discussed, covering content type classification, language - mode pair and the languages preferred from most to least. Such an analytical study provides a good platform for better understanding the problem and the direction in which research can proceed.

Some findings from the existing algorithms are as follows. First, text identification and classification in scanned printed documents is more complicated than normal text classification. Secondly, the complexity of text classification or script identification increases with the number of languages used.

This paper is limited to a set of N-lingual document classification algorithms. It can be further extended to include more multilingual algorithms and techniques, which can provide a wider view of classification algorithms.

REFERENCES

[1] M. H. Dunham, "Data Mining: Introductory and Advanced Topics," First Edition, Pearson Education India, 2006.
[2] S. Puri and S. Kaushik, "An Enhanced Fuzzy Similarity Based Concept Mining Model for Text Classification Using Feature Clustering," IEEE Students' Conference on Engineering and Systems, pp. 1 - 6, 2012.
[3] S. Puri, "A Fuzzy Similarity Based Concept Mining Model for Text Classification," International Journal of Advanced Computer Science and Applications, Vol. 2, pp. 115 - 121, 2011.
[4] S. Puri and S. Kaushik, "A Technical Review and Analysis on Fuzzy Similarity Based Models on Text Classification," International Journal of Data Mining & Knowledge Management Process, Vol. 2, No. 2, pp. 1 - 15, 2012.
[5] R. M. K. Sinha, "A Journey from Indian Scripts Processing to Indian Language Processing," IEEE Annals of the History of Computing, Vol. 31, Issue 1, pp. 8 - 31, 2009.
[6] J. Civera and A. Juan-Ciscar, "Bilingual Text Classification using the IBM 1 Translation Model," International Conference on Language Resources and Evaluation, pp. 58 - 61, 2008.
[7] J. Civera, E. Cubel, and E. Vidal, "Bilingual Text Classification," Pattern Recognition and Image Analysis, Vol. 4477, pp. 265 - 273, 2007.
[8] X. R. Lin, C. Y. Guo, and F. Chang, "Classifying Textual Components of Bilingual Documents with Decision-Tree Support Vector Machines," IEEE International Conference on Document Analysis and Recognition, pp. 498 - 502, 2011.
[9] S. Haboubi, S. Maddouri, and H. Amiri, "Word Classification in Bilingual Printed Documents," 6th IEEE International Conference on Sciences of Electronics, Technologies of Information and Telecommunications, pp. 502 - 506, 2012.
[10] C. Z. Zhang, "Bilingual Topic Taxonomy Generation based on Bilingual Documents Clustering," Proceedings of IEEE International Conference on Machine Learning and Cybernetics, Vol. 4, pp. 1889 - 1895, 2011.
[11] Z. Ibrahim, D. Isa, and R. Rajkumar, "Text and Non-text Segmentation and Classification from Document Images," IEEE International Conference on Computer Science and Software Engineering, Vol. 1, pp. 973 - 976, 2008.
[12] N. Joshi, I. Mathur, H. Darbar, A. Kumar, and P. Jain, "Evaluation of Some English-Hindi MT Systems," IEEE International Conference on Advances in Computing, Communications and Informatics, pp. 1751 - 1758, 2014.
[13] J. Liu, C. Liang, and J. Qi, "Dictionary-based Bilingual Web Page Classification," IEEE 4th International Conference on Wireless Communications, Networking and Mobile Computing, pp. 1 - 4, 2008.

[14] S. M. Al-Ghuribi and S. Alshomrani, "A Simple Study of Web page Text Classification Algorithms for Arabic and English Languages," IEEE International Conference on IT Convergence and Security, pp. 1 - 5, 2013.
[15] Y. Wu and D. W. Oard, "English and Chinese Bilingual Topic Aspect Classification: Exploring Similarity Measures, Optimal LSA Dimensions, and Centroid Correction of Translated Training Examples," ASIST, pp. 1 - 12, 2013.
[16] S. Mohanty and H. N. D. Bebartta, "A Novel Approach for Bilingual (English - Oriya) Script Identification and Recognition in a Printed Document," International Journal of Image Processing, Vol. 4, pp. 175 - 191, 2010.
[17] D. Dhanya, A. G. Ramakrishnan, and P. B. Pati, "Script Identification in Printed Bilingual Documents," Sadhana, Vol. 27, Issue 1, pp. 73 - 82, 2002.
[18] U. Pal and B. B. Chaudhari, "Machine - Printed and Hand - written Text Lines Identification," Pattern Recognition Letters, pp. 431 - 441, 2001.
[19] B. B. Chaudhari and U. Pal, "An OCR System to Read Two Indian Language Scripts: Bangla and Devnagari (Hindi)," Proceedings of Fourth IEEE International Conference on Document Analysis and Recognition, Vol. 2, pp. 1011 - 1015, 1997.
[20] R. Gaizauskas, E. Barker, M. L. Paramita, and A. Aker, "Assigning Terms to Domains by Document Classification," Proceedings of the 4th International Workshop on Computational Terminology, pp. 11 - 21, 2014.
[21] C. V. Jawahar, P. Kumar, and S. S. Ravikiran, "A Bilingual OCR for Hindi-Telugu Documents and its Applications," Proceedings of Seventh IEEE International Conference on Document Analysis and Recognition, Vol. 1, pp. 408 - 412, 2003.
[22] R. S. Kunte and R. D. S. Samuel, "An OCR System for Printed Kannada Text Using Two - Stage Multi-network Classification Approach Employing Wavelet Features," IEEE International Conference on Computational Intelligence and Multimedia Applications, Vol. 2, pp. 349 - 353, 2007.
[23] X. Ni, J. T. Sun, J. Hu, and Z. Chen, "Cross Lingual Text Classification by Mining Multilingual Topics from Wikipedia," Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 375 - 384, 2011.
