Abstract- In the current era, there is a high demand for accurate text identification and categorization methods for N-lingual non-scanned and scanned machine printed documents, where N represents mono, bi, tri or multi mode. In this paper, a technical study and analysis of N-lingual document classification is presented for normal text, printed and handwritten documents. Text classification for normal text documents is simple, whereas in scanned machine printed systems it inherently begins with the correct recognition of text, i.e., characters and words. The steps involved in the latter case are script identification, page layout determination, separation of text and non-text data, line segmentation, word detection and finally character recognition. After these processing steps, the text or script is identified and separated. Three statistically analyzed charts are also shown, which are based on content type classification, language-mode pair and most-to-least preferred languages of existing algorithms.

Keywords- N-Lingual Documents; Scanned and Non-scanned Documents; Script Identification; Text Detection; Text Classification

I. INTRODUCTION

With increasing technological advancement, the digitization of the world and the rapid development of automated systems, there is a high demand for efficient and accurate classification systems [1-9]. Such systems not only cover the categorization of textual data (.doc, .txt etc.) [5-7]; the field has also been extended to machine printed (scanned image) [8-9] and handwritten (scanned form) data. In this regard, important information presented in the form of images has attracted considerable attention from researchers over the last two decades. Information obtained from such images may lead to a wide range of applications: content based image retrieval systems [8-13], automatic document readers [2-22], automatic information extraction from images [8-9], [11], [16-17], [19], [21-22] and video frames, etc. Nowadays, document classification systems, whether for scanned or non-scanned text, are not limited to one language; they are being developed to accept documents in more than one language. These systems must therefore be capable of categorizing them efficiently into the desired categories. Many efficient techniques and methods have been developed in this direction.

Systems based on different languages and scripts are also gaining considerable attention from researchers. Today, documents and data are available in many languages using different or the same scripts. Additionally, many online and offline systems have been built for operations such as translation, conversion and keyword searching. Such systems accept scanned/non-scanned monolingual [2-4], [11], [22], bilingual [5-10], [12-19], [21], trilingual and multilingual documents [20]. Many researchers are also focusing on multilingual text classification (either scanned or non-scanned) despite the increased system and time complexity.

The paper is organized as follows. Section II discusses theoretical perspectives, which include document classification, a description of text extraction and classification of printed images, and the structure of N-lingual documents and systems. In the last case, a hierarchy is designed to depict the wide use and applications of such systems. In Section III, a comparative study of different models is presented. In Section IV, three charts are shown that summarize the results of the analysis. Section V presents the conclusion.

II. THEORETICAL PERSPECTIVES

A. Document Classification

Document classification [2-3], [5-14] refers to classifying a given set of documents into predefined categories under the supervised learning paradigm, so that each document is grouped into the most suitable category. It contrasts with clustering, which uses unsupervised learning to form clusters of data. In classification, a set of training documents is first preprocessed and used to train the classifier on the basis of predefined labels. Then, test documents are automatically classified by the classifier. Such a mechanism is used not only to classify web pages, text documents, images, video and speech but also to separate scripts and to recognize characters, words and text lines in machine printed imaged documents.
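
The train-then-classify workflow described in Section II-A can be illustrated with a minimal sketch. The classifier chosen here (multinomial Naive Bayes with add-one smoothing), the whitespace tokenizer, the function names and the toy training set are all assumptions made for illustration; the paper does not prescribe a particular classifier at this point.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real preprocessing."""
    return text.lower().split()


def train(labeled_docs):
    """Collect per-category word counts from (text, label) training pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    doc_counts = Counter()               # label -> number of training documents
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest smoothed log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Prior: fraction of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set with two predefined categories.
training_docs = [
    ("the match ended with a late goal", "sports"),
    ("the team won the cup final", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stock markets fell as rates rose", "finance"),
]
model = train(training_docs)
predicted = classify("the team scored a goal", model)
```

In this sketch the labeled training documents play the role of the predefined categories, and `classify` stands in for the automatic labeling of test documents. A clustering approach would instead receive the same texts without labels.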
B. Text Extraction and Classification in Printed Images

In recent years, many researchers have focused on N-lingual printed imaged text classification. For printed documents, researchers have worked on the extraction of written text [8-9], [16-17], [19], [21], separation of topics [10], separation of text in two languages, identification of lines and words, separation of text and non-text documents, OCR [18-19], [21-22], text classification [8], word classification [9] and more. Some further related aspects are given below.

• Feature extraction and feature selection methods used to identify characters, words and lines for different languages.

• Consideration of font, style, size, regular/basic, bold/italics/underline, portrait/landscape, colored/BW etc.

• Processing of scanned/non-scanned documents with one/two/three/multi languages.

The basic steps of text extraction and classification from a printed imaged document include image acquisition, preprocessing, page layout detection and segmentation, line detection and segmentation, word identification and character recognition, and post processing. Such a method is widely used in script classification and image mining.

C. N-Lingual Documents and Systems

N-lingual documents are characterized as documents whose text is written using mono/bi/tri/multi scripts; that is, text written in N languages is contained in one document. N-lingual systems work on the classification, extraction or separation of such text. Researchers have done a lot of work in this field and have designed different techniques, methodologies and mechanisms for it. Fig. 1 depicts a taxonomy of N-lingual based systems, which shows their wide scope of fields and applications. Here the input documents can be in Indic languages, non-Indic languages or both.

For example, consider the Hindi language. Hindi [5], [12], [21], written in Devnagari script [5], [18], [19] and an official language of India, has received a lot of attention from researchers and scientists in the last decade. It is the native language of 10 states in India. Today, considerable research is under way on the identification of characters and words, the separation of a letter into three zones, and the detection of the shirorekha and lines [18-19], [21-22], etc.
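
The line detection and segmentation step listed in Section II-B, which also underlies the line identification mentioned for Devnagari text, can be sketched with a horizontal projection profile. This is a commonly used technique and is chosen here only as an assumption; the surveyed systems may use other methods. The function name, the 0/1 image encoding and the toy page are hypothetical.

```python
def segment_lines(binary_image):
    """Return (start_row, end_row) pairs, end exclusive, for each text line.

    binary_image is a list of rows; each row is a list of 0 (background)
    and 1 (ink) pixel values, i.e. the page after binarization.
    """
    # Horizontal projection profile: amount of ink in every pixel row.
    profile = [sum(row) for row in binary_image]

    lines = []
    start = None
    for row_index, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = row_index                 # a text line begins
        elif ink == 0 and start is not None:
            lines.append((start, row_index))  # a blank row closes the line
            start = None
    if start is not None:                     # text reaches the bottom edge
        lines.append((start, len(profile)))
    return lines


# Hypothetical 7x6 binarized page with two text lines separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
text_lines = segment_lines(page)  # [(1, 3), (4, 6)]
```

Applying the same idea to columns within one detected line gives a simple word separator. Within a Devnagari line, the row with the most ink is often taken as a candidate for the shirorekha (row 4 in the toy page), although real scans usually need noise and skew handling first.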
Fig. 1. Taxonomy of N-lingual based systems (diagram; node labels: Recognizers, Search Engines, Editor, Other Supplements; Translators/Converters, Optical Character Recognizer, Speech Recognizers, Reader (Pronunciation), Writer, Dictionary, Lexicons, Glossary; Into Target Language, Image, Text to Speech, Speech to Text)
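
For non-scanned text, the mono/bi/tri/multi distinction used to characterize N-lingual documents can be illustrated by counting the scripts present in a string. The sketch below assumes that the first word of a character's Unicode name (for example LATIN or DEVANAGARI) is an adequate script label; the function names are hypothetical. It counts scripts, not languages, so two languages that share a script (such as English and Spanish) would not be told apart.

```python
import unicodedata
from collections import Counter


def scripts_in(text):
    """Count letters per script, using the first word of each Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():                      # skip digits, spaces, signs
            name = unicodedata.name(ch, "")   # "" if the character is unnamed
            if name:
                counts[name.split()[0]] += 1
    return counts


def lingual_mode(text):
    """Label text as mono, bi, tri or multi according to its script count."""
    labels = {1: "mono", 2: "bi", 3: "tri"}
    n = len(scripts_in(text))
    if n == 0:
        return "none"
    return labels.get(n, "multi")


sample = "Hindi हिन्दी"
mode = lingual_mode(sample)      # "bi": Latin and Devanagari letters
per_script = scripts_in(sample)
```

Scanned documents cannot be handled this way, because the characters are not yet encoded; there the script has to be identified from image features before recognition.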
III. A TECHNICAL STUDY OF N-LINGUAL MODELS BASED ON DIFFERENT MODES

Efficient bilingual document classification depends on a number of key concerns which need to be evaluated and analyzed to get accurate results. Table I provides a detailed summary of these concerns and compares different existing techniques and algorithms. The summary covers the primary concerns addressed in the respective models and algorithms, each of which is designed to solve some problem of N-lingual document classification. In addition, the review includes the mode (text / printed / handwritten), lingual type, target languages, classifier used and data sets used for the experiments. Such a study helps in understanding the concepts underlying each model or system and aids in evaluating its scope. Most of the systems use bilingual mode, with English as the preferred language and SVM as the classifier. This study provides a good overview and summary of their usage in the current document classification environment.
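
The attributes recorded for each reviewed system (mode, lingual type, target languages, classifier) lend themselves to a simple tally of the kind that underlies usage charts. The sketch below is only an illustration: the record type, the function name and the four sample records are hypothetical and are not rows taken from Table I.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SurveyedSystem:
    mode: str            # "text", "printed" or "handwritten"
    lingual_type: str    # "mono", "bi", "tri" or "multi"
    languages: tuple     # target languages handled by the system
    classifier: str


def usage_percentages(systems):
    """Percentage of systems that target each language, highest first."""
    counts = Counter(lang for s in systems for lang in set(s.languages))
    total = len(systems)
    shares = {lang: round(100 * n / total, 1) for lang, n in counts.items()}
    # Sort by descending share, then alphabetically for a stable order.
    return dict(sorted(shares.items(), key=lambda item: (-item[1], item[0])))


# Hypothetical records, not rows from Table I.
systems = [
    SurveyedSystem("printed", "bi", ("English", "Hindi"), "SVM"),
    SurveyedSystem("text", "bi", ("English", "Arabic"), "Naive Bayes"),
    SurveyedSystem("printed", "mono", ("Kannada",), "Neural network"),
    SurveyedSystem("text", "bi", ("English", "Chinese"), "SVM"),
]
shares = usage_percentages(systems)
most_common_mode = Counter(s.lingual_type for s in systems).most_common(1)[0][0]
```

Because one system can target several languages, the percentages computed this way need not sum to 100.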
2016 International Conference on Computer Communication and Informatics (ICCCI-2016), Jan. 07 - 09, 2016, Coimbatore, INDIA
Figure 4. Analyzing the Usage of Most Preferred to Least Preferred Languages (bar chart; x-axis: % use of target languages, 0 to 100; bars shown here: Spanish 16.7%; Arabic 11.1%; Bangla 11.1%; Latvian, Lithuanian, Czech, German, Latin, French, Oriya, Tamil, Kannada and Telugu 5.5% each)

... of Bilingual Documents with Decision-Tree Support Vector Machines," IEEE International Conference on Document Analysis and Recognition, pp. 498-502, 2011.

[9] S. Haboubi, S. Maddouri, and H. Amiri, "Word Classification in Bilingual Printed Documents," 6th IEEE International Conference on Sciences of Electronics, Technologies of Information and Telecommunications, pp. 502-506, 2012.

[10] C. Z. Zhang, "Bilingual Topic Taxonomy Generation based on Bilingual Documents Clustering," Proceedings of IEEE International Conference on Machine Learning and Cybernetics, Vol. 4, pp. 1889-1895, 2011.

[11] Z. Ibrahim, D. Isa, and R. Rajkumar, "Text and Non-text Segmentation and Classification from Document Images," IEEE International Conference on Computer Science and Software Engineering, Vol. 1, pp. 973-976, 2008.

[12] N. Joshi, I. Mathur, H. Darbar, A. Kumar, and P. Jain, "Evaluation of Some English-Hindi MT Systems," IEEE International Conference on Advances in Computing, Communications and Informatics, pp. 1751-1758, 2014.

[13] J. Liu, C. Liang, and J. Qi, "Dictionary-based Bilingual Web Page Classification," IEEE 4th International Conference on Wireless Communications, Networking and Mobile Computing, pp. 1-4, 2008.
[14] S. M. Al-Ghuribi and S. Alshomrani, "A Simple Study of Web page Text Classification Algorithms for Arabic and English Languages," IEEE International Conference on IT Convergence and Security, pp. 1-5, 2013.

[15] Y. Wu and D. W. Oard, "English and Chinese Bilingual Topic Aspect Classification: Exploring Similarity Measures, Optimal LSA Dimensions, and Centroid Correction of Translated Training Examples," ASIST, pp. 1-12, 2013.

[16] S. Mohanty and H. N. D. Bebartta, "A Novel Approach for Bilingual (English - Oriya) Script Identification and Recognition in a Printed Document," International Journal of Image Processing, Vol. 4, pp. 175-191, 2010.

[17] D. Dhanya, A. G. Ramakrishnan, and P. B. Pati, "Script Identification in Printed Bilingual Documents," Sadhana, Vol. 27, Issue 1, pp. 73-82, 2002.

[18] U. Pal and B. B. Chaudhari, "Machine-Printed and Hand-written Text Lines Identification," Pattern Recognition Letters, pp. 431-441, 2001.

[19] B. B. Chaudhari and U. Pal, "An OCR System to Read Two Indian Language Scripts: Bangla and Devnagari (Hindi)," Proceedings of Fourth IEEE International Conference on Document Analysis and Recognition, Vol. 2, pp. 1011-1015, 1997.

[20] R. Gaizauskas, E. Barker, M. L. Paramita, and A. Aker, "Assigning Terms to Domains by Document Classification," Proceedings of the 4th International Workshop on Computational Terminology, pp. 11-21, 2014.

[21] C. V. Jawahar, P. Kumar, and S. S. Ravikiran, "A Bilingual OCR for Hindi-Telugu Documents and its Applications," Proceedings of Seventh IEEE International Conference on Document Analysis and Recognition, Vol. 1, pp. 408-412, 2003.

[22] R. S. Kunte and R. D. S. Samuel, "An OCR System for Printed Kannada Text Using Two-Stage Multi-network Classification Approach Employing Wavelet Features," IEEE International Conference on Computational Intelligence and Multimedia Applications, Vol. 2, pp. 349-353, 2007.

[23] X. Ni, J. T. Sun, J. Hu, and Z. Chen, "Cross Lingual Text Classification by Mining Multilingual Topics from Wikipedia," Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 375-384, 2011.