
KUMARAGURU COLLEGE OF TECHNOLOGY

VIRTUOSO 2K8 - NATIONAL LEVEL TECHNICAL SYMPOSIUM

ARTIFICIAL INTELLIGENCE

THE AUTOMATED CONVERSION OF TAMIL - HINDI FOR NETWORKING BY


NATURAL LANGUAGE PROCESSING

ARASU ENGINEERING COLLEGE

KUMBAKONAM

R.Muthukumaran (B.E., II - CSE)
(E-mail: muthukumaran.tdr@gmail.com)
S.P.Balakumar (B.E., II - CSE)
(E-mail: milkysympo@gmail.com)

ABSTRACT

In this paper, we propose the automatic translation of Tamil to Hindi by Natural Language Processing (NLP), a subfield of Artificial Intelligence and linguistics. Natural language understanding systems convert samples of human language into more formal representations that are easier for gadgets and networked systems to process.

The automated conversion between Tamil and Hindi draws on lexical resources and computational models. Tamil and Hindi are morphologically rich languages: most grammatical functions are embedded into the word in the form of inflections. This language conversion involves phonetic analysis, text analysis, morphological analysis, syntactic analysis, semantic analysis, discourse analysis and pragmatic analysis.

A machine translation system for Tamil-Hindi basically has three major components: the morphological analyser of the source language, the mapping unit and the target language generator. The morphological analyser splits a word into its constituent morphemes; the root word and its inflections are then mapped and generated as equivalent target language terms.
INTRODUCTION

Today, people from all walks of life, including professionals, are confronted by unprecedented volumes of information, the vast bulk of which is stored as unstructured text. Beyond printed materials, this information is increasingly available electronically on the World Wide Web. A large and growing fraction of the working and leisure time of professionals and students is spent navigating and accessing this universe of information.

Natural language understanding is a mundane task that humans perform easily but that is very difficult to automate with computers; it reflects the artificial intelligence of the computer. Natural Language Processing involves Natural Language Understanding and Natural Language Generation. Natural Language Understanding refers to the steps involved in the language conversion, while Natural Language Generation refers to the generation of the target language.

Natural language processing involves many steps, depending on the source and target languages. According to the sentence pattern of the language, NLP is divided into two types: the Aspects Model Standard Theory and the Extended Standard Theory.

ASPECTS MODEL STANDARD THEORY

In the Aspects of the Theory of Syntax, nouns are chosen on the basis of context-free rules; verbs are then chosen on the basis of context-sensitive rules, which are the terms that express the lexical features. Since nouns are the first words to be chosen, they are identified by lexical features only.

Verbs and adjectives require additional features to indicate the environments in which they can appear. In this model, the grammar is organized into three major components.

The Aspects Model Standard Theory is therefore used for non-positional languages, for example Tamil-English translation.

EXTENDED STANDARD THEORY (EST)

Substantial criticisms of the Standard Theory showed that the surface structure plays a much more important role in semantic interpretation than the deep structure.

Here the partial representation of meaning is determined by grammatical structure. The derivation of logical form proceeds step by step, through a derivational process analogous to those of syntax and phonology.

The Extended Standard Theory is therefore used for positional languages, for example Tamil-Hindi translation. But positional languages also differ in their discourse sentences and pragmatic sentences, so the logical form is used to convert them.

ANALYSIS OF NLP IN POSITIONAL LANGUAGES

Natural language processing involves the following steps, which are used in the automated conversion of the language, because positional languages differ in many respects. These steps are:

• Phonetic analysis
• Morphological analysis
• Syntactic analysis
• Semantic analysis
• Discourse and Pragmatic analysis
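The steps above can be sketched as a simple pipeline in which the output of each stage feeds the next. The stage bodies, the romanized Tamil sample sentence and the data shapes below are illustrative assumptions, not part of any specific system:

```python
# A minimal NLP pipeline sketch: each stage transforms the representation
# produced by the previous one. The stage bodies are placeholders; a real
# system would implement each analysis in depth.

def phonetic_analysis(raw):
    """Read the source text (e.g. scanned/phonetic input) into words."""
    return raw.split()

def morphological_analysis(words):
    """Split each word into (root, inflections); here a trivial identity."""
    return [(w, []) for w in words]

def syntactic_analysis(morphemes):
    """Attach a structure; a real system would build a parse tree."""
    return {"words": morphemes, "structure": "flat"}

def semantic_analysis(parse):
    """Derive a meaning representation from the parse."""
    return {"parse": parse, "meaning": "unresolved"}

def pipeline(raw):
    # Compose the stages in the order listed above.
    return semantic_analysis(
        syntactic_analysis(morphological_analysis(phonetic_analysis(raw))))

result = pipeline("naan pazham saappitten")
print(result["parse"]["words"])  # [('naan', []), ('pazham', []), ('saappitten', [])]
```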

PHONETIC ANALYSIS

Phonetic analysis reads the sentences in the case of written language; for example, it translates the phonemes or scanned text into words, that is, it reads the source sentence from any application software.

It thus acts as an interpreter, converting the data of the user's desired application into a form suitable for natural language analysis. This data is then sent as the input to the morphological analysis.

MORPHOLOGICAL ANALYSIS

In morphological analysis, a word is split into its constituent morphemes. A morpheme is the smallest unit of a word that conveys meaning. These morphemes collectively describe the word grammatically; thus the complete grammatical information of a word is obtained from its morphemes.

The root word is recovered from the morphemes. The collective group of morphemes reflects the sentence pattern of the source language. This identification is used in the syntactic analysis and the semantic analysis.

SYNTACTIC ANALYSIS

Syntactic analysis helps determine the meaning of a sentence by working out possible word structures. Rules of syntax are specified by writing a grammar for the language. A grammar specifies allowable sentence structures in terms of basic categories such as nouns and verbs. A given grammar, however, is unlikely to cover all possible grammatical sentences.

Sentences are parsed to help determine their meanings, not just to check that they are correct. A good starting point is a simple context-free grammar. A parse tree illustrates the syntactic structure of the sentence.
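As an illustration, a tiny context-free grammar can be parsed with a naive recursive-descent routine. The grammar and lexicon below are toy assumptions, not the actual rules of any Tamil or Hindi grammar:

```python
# Toy context-free grammar: S -> NP VP, NP -> Det N | N, VP -> V NP | V.
# parse() returns a nested tuple (a parse tree) or None if no parse exists.

LEXICON = {
    "Det": {"the", "a"},
    "N": {"john", "apple", "time"},
    "V": {"wanted", "ate"},
}

def parse_np(tokens, i):
    # Try NP -> Det N first, then NP -> N.
    if i < len(tokens) and tokens[i] in LEXICON["Det"]:
        if i + 1 < len(tokens) and tokens[i + 1] in LEXICON["N"]:
            return ("NP", ("Det", tokens[i]), ("N", tokens[i + 1])), i + 2
    if i < len(tokens) and tokens[i] in LEXICON["N"]:
        return ("NP", ("N", tokens[i])), i + 1
    return None, i

def parse_vp(tokens, i):
    # Try VP -> V NP, falling back to VP -> V.
    if i < len(tokens) and tokens[i] in LEXICON["V"]:
        np, j = parse_np(tokens, i + 1)
        if np:
            return ("VP", ("V", tokens[i]), np), j
        return ("VP", ("V", tokens[i])), i + 1
    return None, i

def parse(sentence):
    tokens = sentence.lower().split()
    np, i = parse_np(tokens, 0)
    if not np:
        return None
    vp, j = parse_vp(tokens, i)
    if vp and j == len(tokens):
        return ("S", np, vp)
    return None

print(parse("john wanted the apple"))
# ('S', ('NP', ('N', 'john')), ('VP', ('V', 'wanted'), ('NP', ('Det', 'the'), ('N', 'apple'))))
```

Note that, as the text says, such a grammar is unlikely to cover all grammatical sentences: anything outside the toy lexicon fails to parse.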
SEMANTIC ANALYSIS

The stages of semantic and pragmatic analysis are concerned with getting the meaning of a sentence. Semantics is a partial representation of the meaning, obtained from the possible syntactic structure(s) of the sentence and from the meanings of the words. Pragmatics is the meaning elaborated from contextual and world knowledge.

The meaning of the whole sentence can be put together from the meanings of its parts. The division of the sentence into meaningful parts is done by syntactic analysis, and this approach is called computational semantics. In general, the meaning of a sentence may be represented using any of the knowledge representation schemes.

When a word, even within the same syntactic category, has more than one meaning, this is called semantic ambiguity. When it is unclear which object a pronoun refers to, this is referential or pragmatic ambiguity. These ambiguities are removed by the semantic analysis.

DISCOURSE AND PRAGMATIC ANALYSIS

Discourse integration is one of the steps; the inter-sentence connections are made here. For example, consider the following sentences: "The apple was black. John wanted it. He always had." Here "it" refers to a previously mentioned thing, namely the apple, whereas "John" connects to "he" in the third sentence. This type of integration is done during the discourse analysis.

Pragmatic analysis handles the production of correct responses that do not follow from grammar alone. For example, the grammatical answer to the question "Can you tell the time?" is "Yes", but the ordinary answer gives the actual time. The pragmatic analysis removes these types of ambiguity.
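A minimal sketch of the discourse step on the example above: each pronoun is resolved to the most recently mentioned compatible noun. The animacy word lists and the recency heuristic are assumptions made purely for illustration; real anaphora resolution is far richer:

```python
# Resolve "it" to the most recent inanimate noun and "he" to the most
# recent person, using a naive recency heuristic over a tiny word list.

INANIMATE = {"apple", "time"}   # assumed inanimate nouns
PERSONS = {"john"}              # assumed person names

def resolve_pronouns(text):
    """Return (pronoun, antecedent) pairs in order of appearance."""
    words = [w.strip(".").lower() for w in text.split()]
    last = {"it": None, "he": None}  # most recent candidate per pronoun
    resolutions = []
    for w in words:
        if w in INANIMATE:
            last["it"] = w
        elif w in PERSONS:
            last["he"] = w
        elif w in last and last[w] is not None:
            resolutions.append((w, last[w]))
    return resolutions

print(resolve_pronouns("The apple was black. John wanted it. He always had."))
# [('it', 'apple'), ('he', 'john')]
```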

TAMIL-HINDI SYSTEM

The choice of the Tamil-Hindi machine-aided translation (MAT) pair is made because both are free word-order languages, unlike English, which is a positional language. Ultimately our aim is to build a Human-Aided Machine Translation system for Tamil-Hindi. An MT system basically has three major components:

• Morphological analyser of the source language,
• Mapping unit and
• The target language generator
MORPHOLOGICAL ANALYSER

A source language sentence is first processed by the morphological analyser (MA). The MA splits the sentence into words, and in turn the words are split into morphemes. The root word is obtained by this process, and it is given as input to the mapping block along with the other morphemes. The other morphemes include the tense marker, GNP (gender-number-person) marker, vibakthi (case marker), etc. A dictionary is used for splitting a word into morphemes.

Typically this dictionary contains the root words and inflections of the Tamil language in its first field. The inflections include the GNP marker, TAM (tense-aspect-modality) marker and vibakthi. A given word is compared with the words/morphemes in the first field of the dictionary. Matching is done from right to left; thus the inflections of the word are split off and finally we arrive at the root form.

Each root word, along with its inflections, is given as input to the mapping unit. The morphological analyzer thus performs the phonetic analysis and the morphological analysis.
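The right-to-left matching described above can be sketched as repeated suffix stripping against a small dictionary. The romanized word forms and suffix lists here are simplified illustrative assumptions, not the actual dictionary contents:

```python
# Sketch of a dictionary-driven morphological analyser: inflections are
# stripped right-to-left until a known root remains. Toy romanized data.

ROOTS = {"paadu", "vandhu", "maram"}       # assumed root words
SUFFIXES = {"kal", "ai", "kiraan", "aal"}  # assumed inflectional morphemes

def analyse(word):
    """Split `word` into (root, [inflections]) by right-to-left matching."""
    inflections = []
    changed = True
    while changed and word not in ROOTS:
        changed = False
        # Try longer suffixes first so e.g. "kal" wins over a shorter match.
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suf) and len(word) > len(suf):
                inflections.insert(0, suf)  # keep morphemes in word order
                word = word[: -len(suf)]
                changed = True
                break
    return (word, inflections) if word in ROOTS else (None, [])

print(analyse("maramkalai"))  # ('maram', ['kal', 'ai'])
```

The tuple returned here, root plus inflection list, is exactly the shape of input the mapping unit expects.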

TAMIL MORPHOLOGICAL ANALYSER

The current coverage of the morphological analyser is greater than 95% when tested over the three-million-word CIIL corpus. It follows the paradigm-based approach and is implemented as a finite state machine. This version can analyse nearly 3.5 million word forms. The objective of this Tamil morphological analyser API is to retrieve the root of a word from its inflected form.

Words in Tamil have a strong postpositional inflectional component. For verbs, for example, these inflections carry information on the gender, person and number of the subject; further, modal and tense information for verbs is also collocated in the inflections. For nouns, inflections serve to mark the case (accusative, dative, etc.) of the noun. The aim of this morphological analyser is to retrieve the root of the word along with the inflectional information.

MAPPING UNIT
The root word and its inflections are mapped to equivalent target language terms in this block. Explaining the structure of the dictionary is useful at this juncture: the dictionary has seven fields that aid the process of mapping. As said earlier, the first field contains the Tamil root words and inflections.

The second field contains the paradigm type followed by the paradigm number, which are useful in the generation of words. Subsequent fields contain the category of the word, the equivalent Hindi meaning(s) and the gender information. The last field contains information about the dictionary entry, kept for maintenance work.

The gender information is especially important for Hindi, because every noun in Hindi takes one of the two genders, and this information is very helpful for semantic analysis. The corresponding Hindi equivalents of the words are taken and given as input to the generator part of the MT system. All equivalent Hindi words for a Tamil word are given in the dictionary, separated from one another. The mapping unit thus involves the syntactic analysis and the semantic analysis.
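The mapping step can be sketched as a lookup over records carrying the fields described above. The field layout is condensed and the Tamil/Hindi entries are placeholders, not the actual dictionary contents:

```python
# Sketch of the mapping unit: each record carries paradigm, category,
# Hindi equivalents and gender, as described above. Entries are
# illustrative placeholders, not real dictionary data.

DICTIONARY = {
    # tamil_root: (paradigm, category, hindi_equivalents, gender, note)
    "maram": ("N7", "noun", ["ped"], "masc", ""),
    "pazham": ("N3", "noun", ["phal"], "masc", ""),
}

def map_word(tamil_root, inflections):
    """Map a Tamil root to its Hindi equivalents; pass inflections through."""
    entry = DICTIONARY.get(tamil_root)
    if entry is None:
        return None  # word not in dictionary
    paradigm, category, hindi_words, gender, _note = entry
    return {
        "hindi": hindi_words,        # all stored equivalents
        "category": category,
        "gender": gender,            # needed for Hindi agreement
        "inflections": inflections,  # forwarded to the generator
    }

print(map_word("maram", ["kal"]))
```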

GENERATOR
This is the reverse process of the analyser. Given a root word and its inflections, it generates the equivalent Hindi word. While generating, it takes into account all the information, such as the gender and tense, and the equivalent word is generated accordingly. For the generator to generate the word, its input should be in a proper order. The order followed here is: Hindi root, Category, Gender, Number, Person and finally TAM (Tense-Aspect-Modality, if any). The Hindi generator used here is from IIIT, Hyderabad, and is also used for other anusaaraka products; it is used here as a black box. The generator thus involves the discourse and pragmatic analysis.
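The generator's documented input order can be sketched as assembling that tuple and producing a surface form. The suffix table below is invented for illustration; the real Hindi generator is a black box whose internals are not described here:

```python
# Sketch of the generation step: arrange the mapped information in the
# order the generator expects (Hindi root, Category, Gender, Number,
# Person, TAM) and produce a surface form. The suffix table is invented.

SUFFIX_TABLE = {
    # (gender, number) -> invented agreement marker
    ("masc", "sg"): "",
    ("masc", "pl"): "on",
}

def generate(hindi_root, category, gender, number, person, tam=None):
    """Assemble generator input in the documented order; return (input, form)."""
    ordered_input = (hindi_root, category, gender, number, person, tam)
    suffix = SUFFIX_TABLE.get((gender, number), "")
    return ordered_input, hindi_root + suffix

inp, form = generate("ped", "noun", "masc", "pl", "3")
print(form)  # "pedon"
```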
CONCLUSION

Nowadays millions of PCs are on the Internet, yet networked computers are not well utilized because of the language barrier. These networked computers can be better utilized by applying NLP so that users can work in their own language. We, as technical people, must construct software having these capabilities; developing this kind of software will improve our computer network communication as well.
