
A PROJECT SYNOPSIS ON

“Transliteration between English and Marathi”

SUBMITTED BY

SHRIKANT NAYAK

PRASANNA MEHTA

RAHUL AMBADKAR

DISHA YADAV

SUPERVISOR

Prof. VARUNAKSHI BHOJANE

Department of Information Technology

MES’s Pillai Institute of Information Technology,
Engineering, Media Studies and Research,

New Panvel, Navi Mumbai 410 206

2014-15

Department of Information Technology
Pillai Institute of Information Technology,

Engineering, Media Studies & Research

New Panvel – 410 206

This is to certify that the requirements for the synopsis entitled
‘Transliteration between English and Marathi’ have been
successfully completed by the following students:

Name Roll No.

SHRIKANT NAYAK

PRASANNA MEHTA

RAHUL AMBADKAR

DISHA YADAV

in partial fulfillment of the degree of Bachelor of Engineering of Mumbai
University in the Department of Information Technology, Pillai Institute of
Information Technology, Engineering, Media Studies & Research, New Panvel,
during the academic year 2014 – 2015.

Supervisor External guide

Mrs. Varunakshi Bhojane

Internal Examiner External Examiner


ACKNOWLEDGEMENT
We feel privileged to express our deepest sense of gratitude and
sincere thanks to our project guide, Prof. Varunakshi Bhojane, for her
excellent guidance throughout our project work. Her prompt and kind help led
to the completion of the dissertation work.

We are immensely thankful to our Project Co-ordinator, Prof. Suresh Babu.
We would also like to thank our H.O.D., Dr. Satish L. Verma, for
approving our project and giving us ideas regarding the project.

We are immensely thankful to our Principal, Dr. R.I.K. Moorthy.

We would also like to thank them for their patience and co-operation, which
proved beneficial for us. We owe a substantial share of our success to the
whole faculty and the staff members who provided us with the facilities
required to complete the project work.

Finally, we wish to express our sincere appreciation and thanks to our
college library and all those who have guided and helped us directly or
indirectly in accomplishing our goal.

SHRIKANT NAYAK

PRASANNA MEHTA

RAHUL AMBADKAR

DISHA YADAV

ABSTRACT
Machine transliteration is an important problem in an increasingly
multilingual world, as it plays a critical role in many downstream
applications, such as machine translation or Cross Lingual Information
Retrieval (CLIR) systems. In this project, we propose compositional machine
transliteration systems, where multiple transliteration components may be
composed either to improve existing transliteration quality, or to enable
transliteration functionality between languages even when no direct parallel
names corpora (sets of texts) exist between them. Specifically, we propose
parallel composition, in which evidence from multiple transliteration paths
between X → Z is aggregated to improve the quality of a direct system. We
demonstrate the functionality and performance benefits of the compositional
methodology using a state-of-the-art machine transliteration framework
between English and Marathi.

Finally, we underscore the utility and practicality of our compositional
approach by showing that a CLIR system integrated with compositional
transliteration systems performs consistently on par with, and sometimes
better than, one integrated with a direct transliteration system.

General transliteration or interpretation is just what it sounds like: the
transliteration or interpretation of non-specific language that does not
require any specialized vocabulary or knowledge. However, the best
translators and interpreters read extensively in order to be up to date with
current events and trends, so that they can do their work to the best of
their ability, knowing what they might be asked to convert. In addition,
good translators and interpreters make an effort to read about whatever
topic they are currently working on.

TABLE OF CONTENTS

i Abstract

1 Introduction
1.1 Aims and objectives
1.2 Problem Statement
1.3 Scope of the project
1.4 Advantages
1.5 Disadvantages

2 Literature Survey
2.1 Introduction
2.2 Feasibility study
2.3 Requirement analysis
2.4 System analysis

3 Existing system

4 Proposed System Methodology
4.1 Proposed methodology
4.2 Features provided by our system
4.3 Applications

5 Analysis Details of Hardware & Software

6 Design details

7 Implementation Plan for next semester

8 References

Chapter 1

INTRODUCTION

1.1 OBJECTIVE:

General conversion is just what it sounds like: the transliteration or
interpretation of non-specific language that does not require any
specialized vocabulary or knowledge. However, the best translators and
interpreters read extensively in order to be up to date with current events
and trends, so that they can do their work to the best of their ability,
knowing what they might be asked to convert. In addition, good translators
and interpreters make an effort to read about whatever topic they are
currently working on. If a translator is asked to translate an article on
organic farming, for example, he or she would be well served to read about
organic farming in both languages in order to understand the topic and the
accepted terms used in each language.

Specialized transliteration or interpretation refers to domains which
require, at the very least, that the person be extremely well read in the
domain. Even better is training in the field (such as a college degree in
the subject, or a specialized course in that type of transliteration or
interpretation). Some common types of specialized transliteration and
interpretation are:

• language converter

• legal converter

• literary converter

• medical converter

• scientific converter

• technical converter

1.2 Problem Statement:

To design a machine translator for English to Marathi using a hybrid
approach that combines rule-based and example-based methods, so as to obtain
a good-enough translation for English statements in SVO
(subject-verb-object) format.

1.3 Scope:

In this project, we have studied the effect of transliteration on human
readability by analyzing the eye movements of participants subjected to
reading stimuli.

Transliteration is the process of converting a text from one writing script
to another by substituting the letters of one alphabet for those of the
other. Here the substitution is done from the English alphabet (source
script) to the Marathi alphabet (target script). Across transliterations,
however, the pronunciation of the words remains unaltered. Of late,
transliteration is seen quite frequently, especially in digital
communication such as email, chat and blogs. The target language in the
majority of cases is observed to be English. This is because typing in
English is easier than typing on a Marathi-layout keyboard. The reverse is
also seen in practice, where an English word appears in a script other than
Marathi. This is mostly seen in the case of borrowed vocabulary. The
globalized use of English as an official language is considered the main
reason for it.

The abundant use of transliteration in digital communication has created a
need for better design of text input media, and product designers are now
considering factors affecting readability in order to come up with better
display devices. These are challenging issues, however: investigating the
factors that contribute to a better reading or writing experience is not
straightforward, because writing and reading are not just physical
activities but also a unique cognitive ability of humans, and cognitive
aspects are hard to articulate, identify or measure directly. Here we have
made an effort towards identifying such factors by exploring the
eye-tracking technique, using transliterated text instead of regular text.
We have chosen the Marathi and English languages, written in the Devanagari
and Latin scripts respectively, owing to the high availability of
Marathi-English bilingual speakers in the neighborhood.
1.4 Advantages:

1. The system gives the Marathi pronunciation, which is useful for those
learning standard-level English.
2. User friendly environment.
3. Better user interface.
4. Fast mechanism.
5. Small memory factor.

1.5 Disadvantages:

1. Users don’t know the standard pronunciation of words.
2. Cannot transliterate Indian languages among themselves.
3. Lacks user input.
4. Cannot be relied upon.

Chapter 2

LITERATURE SURVEY

What is transliteration?
Transliteration is a representation of the words of one language in the
script of another, i.e., it is the transcription of one alphabet into
another. Some other interesting definitions are:

• The representation of characters or words of one language by
corresponding characters or words of another language.

• A systematic way to convert characters in one alphabet or phonetic sounds
into another alphabet.

• The translation of text from one writing system into another where the
writing conventions of the target writing system are applied. The
transliterated text should read naturally in the target script.

• A letter-for-letter or sound-for-letter spelling of a word to represent a
word in another language.

3.1: P H Rathod, M L Dhore, R M Dhore.[4]

Hindi and Marathi are written using the Devanagari script. The Devanagari
script used for Hindi and Marathi has 12 pure vowels, 2 loan vowels from
the Sanskrit language and 1 loan vowel from English. There are a total of 34
consonants, 5 conjuncts, 7 loan consonants and 2 traditional signs in the
Devanagari script, and each consonant has 14 variations through integration
of the 14 vowels [32-34]. Table 1 shows the Devanagari script along with its
equivalent phonetic mapping in Roman. The consonant /ळ/ is used only in
Marathi and not in Hindi.

Name in Devanagari → महाराष्ट्र STUs → [म | हा | रा | ष्ट्र]

Name in Devanagari → ओंकारेश्वर STUs → [ओं | का | रे | श्व | र]

Name in Devanagari → नोवरोझाबाद STUs → [नो | व | रो | झा | बा | द]

Name in Devanagari → अब्दुल्लाहगंज STUs → [अ | ब्दु | ल्ला | ह | गं | ज]

Name in Devanagari → निरंजनकुमार STUs → [नि | रं | ज | न | कु | मा | र]

Name in Devanagari → नारायणगावकर STUs → [ना | रा | य | ण | गा | व | क | र]

Name in Devanagari → त्रिभुवननारायण STUs → [त्रि | भु | व | न | ना | रा | य | ण]
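
The unit lists above split each name into syllable-like pieces. A minimal sketch of one way to produce such a split is given below, in Python for brevity (the project itself targets Java). The rule used here is an assumption for illustration, not necessarily the segmentation procedure of the cited paper: a new unit starts at every independent vowel or consonant, unless the consonant follows a virama (halant), in which case it joins the current conjunct. The function names are hypothetical.

```python
# Sketch: split a Devanagari name into syllable-like units.
# The rule set is an illustrative assumption, not the cited paper's algorithm.

VIRAMA = "\u094D"  # halant: joins consonants into a conjunct


def is_consonant(ch):
    # Basic consonants (U+0915..U+0939) and nukta consonants (U+0958..U+095F)
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def is_independent_vowel(ch):
    # Independent vowels (U+0904..U+0914)
    return "\u0904" <= ch <= "\u0914"


def segment_units(word):
    """Return the list of units that make up a Devanagari word."""
    units = []
    prev = ""
    for ch in word:
        starts_unit = is_independent_vowel(ch) or (
            is_consonant(ch) and prev != VIRAMA
        )
        if starts_unit or not units:
            units.append(ch)
        else:
            # Vowel signs, anusvara, virama and conjunct members
            # attach to the unit currently being built.
            units[-1] += ch
        prev = ch
    return units


print(" | ".join(segment_units("नारायणगावकर")))
```

Conjuncts stay in one unit because a consonant that follows a virama never opens a new unit. Characters outside the two ranges checked here are simply attached to the current unit, which a fuller implementation would need to refine.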

Interpreters -

1. This process involves two or more speakers who may not be speaking the
same language.

2. It is basically an oral activity, which may also involve sign language,
aimed at effective communication.

3. Interpreters may therefore be required to convey the needs of clients
accurately, so that they can be taken up for implementation by the service
provider or manufacturer.

4. It is important to realize that not all countries follow English as their
medium of communication.

5. Interpreters are also highly useful in providing customer support
services for telecom services.

Machine Transliteration - This kind of transliteration employs a computer
program that produces the transliteration result without any human
intervention. In reality, however, a lot of intervention by translators is
required for the pre- and post-editing work.

Transliteration Services - These are also computer-assisted transliteration
services, except that the software employed is highly efficient and
proficient in translating a particular language. Over the Internet,
transliteration software can be used from remote locations to translate web
pages and client-provided content. There are experienced players in the
transliteration field who offer language transliteration as a SaaS
(Software as a Service) offering. They provide continuous improvements in
transliteration speed and quality, along with rapid development of new
languages for high-volume transliteration deployments.

3.2: A KUMARAN, MITESH M. KHAPRA and PUSHPAK BHATTACHARYYA [5]

In this paper, we introduce the concept of Compositional Transliteration
Systems as a composition of multiple transliteration systems to achieve
transliteration functionality or to enhance the transliteration quality
between a given pair of languages. We propose two distinct forms of
composition: serial and parallel.

In serial compositional systems, the transliteration systems are combined
serially; that is, transliteration functionality between two languages X & Z
may be created by combining the transliteration engines X → Y and Y → Z.
Such compositions may be useful where no parallel data exists between two
languages X & Z, but sufficient parallel names data exists between X & Y,
and Y & Z. Such partial availability of pair-wise data is common in many
situations, where one central language dominates many languages of a country
or a region. For example, there are 22 constitutionally recognized languages
in India, but it is more likely that parallel names data exists between
Hindi and a foreign language, say, Russian, than between any other Indian
language and Russian. In such situations, a transliteration system between
Kannada, an Indian language, and Russian may be created by composing two
transliteration modules, one between Kannada and Hindi, and the other
between Hindi and Russian. Such compositions, if successful quality-wise,
may alleviate the need for developing and maintaining parallel names corpora
between many language pairs, and leverage existing resources whenever
possible, offering a less resource-intensive approach to developing
transliteration functionality among a group of languages.

In parallel compositional systems, we explore combining transliteration
evidence from multiple transliteration paths in parallel, in order to
develop a good-quality transliteration system between a pair of languages.
While it is generally accepted that the transliteration quality of
data-driven approaches grows with more data, the quality typically
plateaus, accruing only marginal benefit beyond a certain size of training
corpus. In parallel compositional systems, we explore whether
transliteration quality between X & Z can be improved by leveraging
evidence from multiple transliteration paths between X & Z. Such systems
could be very useful when data is available between many different pairs
among a set of n languages. Again, such situations arise naturally in many
multicultural and multilingual societies, such as India and the European
Union. For example, parallel names data exists between many language pairs
of the Indian subcontinent, as most states enforce a 3-language policy,
under which all government records, such as census data, telephone
directories, railway databases, etc., exist in English, Hindi and one of the
regional languages. Similarly, many countries publish their parliamentary
proceedings in multiple languages as mandated by legislative processes. In
our research we explore compositional transliteration functionality among a
group of languages, and in this paper, our specific contributions are:

(1) Proposing the idea of compositionality of transliteration functionality,
in two different methodologies: serial and parallel.

(2) Composing serially two transliteration systems, namely X → Y and Y → Z,
to provide practical transliteration functionality between two languages X &
Z with no direct parallel data between them.

(3) Improving the quality of an existing X → Z transliteration system
through a parallel compositional methodology.

(4) Finally, demonstrating the effectiveness of different compositional
transliteration systems, both serial and parallel, in an important
downstream application domain: Crosslingual Information Retrieval.

Serial Compositional Methodology

It is well known that transliteration is lossy, and hence the composition
of two transliteration systems is expected to have lower quality than each
of the individual systems X → Y and Y → Z, as well as a direct system
X → Z. We carry out a series of compositional experiments among a set of
languages to measure and quantify the expected drop in the accuracy of such
compositional transliteration systems with respect to the baseline direct
system. We train two baseline CRF-based transliteration systems, between the
languages X and Y, and between the languages Y and Z, using appropriate
parallel names corpora between them. For testing, each name in language X
was provided as input to the X → Y transliteration system, and the top-10
candidate strings in language Y produced by the system were in turn given as
input to the system Y → Z. The outputs of this system were merged and
re-ranked by their probability scores. Finally, the top-10 of the merged
outputs were returned as the compositional system output.
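
The test procedure just described can be sketched as follows, in Python for brevity. This is an illustration under stated assumptions, not the cited paper's implementation: each system is modelled as a hypothetical function returning `(string, probability)` pairs, a path is scored as the product of its two probabilities, and scores of duplicate Z strings are summed. The source states only that the outputs were merged and re-ranked by probability.

```python
# Sketch of serial composition X -> Y -> Z with top-k merging.
# Scoring (product of path probabilities, duplicates summed) is an assumption.


def top_k(candidates, k):
    """Return the k highest-probability (string, probability) pairs."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:k]


def compose_serial(name_x, system_xy, system_yz, k=10):
    """Transliterate name_x into language Z through bridge language Y."""
    merged = {}
    # Feed each of the top-k Y candidates into the Y -> Z system.
    for cand_y, p_xy in top_k(system_xy(name_x), k):
        for cand_z, p_yz in system_yz(cand_y):
            merged[cand_z] = merged.get(cand_z, 0.0) + p_xy * p_yz
    # Re-rank the merged outputs and keep the top k.
    return top_k(merged.items(), k)


# Toy stand-ins for trained systems (hypothetical data).
XY = {"x1": [("y1", 0.6), ("y2", 0.4)]}
YZ = {"y1": [("z1", 0.5), ("z2", 0.5)], "y2": [("z1", 1.0)]}

ranked = compose_serial("x1", lambda s: XY.get(s, []), lambda s: YZ.get(s, []))
print(ranked)
```

In the toy data, `z1` is reachable through both bridge candidates, so it collects evidence from two paths and ranks above `z2`.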

Parallel Compositional Methodology

In this section, we explore whether, when data is available between X and
multiple languages, the accuracy of the X→Z system can be improved by
capturing transliteration evidence from multiple languages. Specifically, we
explore whether the information captured by a direct X→Z system may be
enhanced with a serial X→Y→Z system, if we have data between all the
languages. We evaluate this hypothesis with the following methodology,
assuming that we have sufficient pair-wise parallel names corpora between
X, Y & Z. First, we train an X→Z system using the direct parallel names
corpora between X & Z. This system is called the Direct System. Next, we
build a serially composed transliteration system from two components:
first, an X→Y transliteration system, trained on the 15K data available
between X & Y; and second, a fuzzy transliteration system Y→Z, trained on a
training set that pairs the top-k outputs of the X→Y system in language Y,
for a given string in language X, with the reference string in language Z
corresponding to that string in language X.
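
A rough sketch of the two ideas in this section is given below, in Python for brevity. The first function builds the "fuzzy" Y→Z training pairs as described above. The second combines candidate scores from the direct path and the bridge path by linear interpolation. The interpolation and its `weight` parameter are assumptions for illustration, since the text here does not state how the evidence is aggregated. All names and data are hypothetical.

```python
# Sketch of parallel composition. Linear interpolation of the two paths'
# scores is an assumed combination rule, not taken from the source.


def fuzzy_training_pairs(parallel_xz, system_xy, k=10):
    """Pair each top-k Y output for x with the Z reference string of x."""
    pairs = []
    for x, z_ref in parallel_xz:
        best = sorted(system_xy(x), key=lambda c: c[1], reverse=True)[:k]
        for y, _score in best:
            pairs.append((y, z_ref))
    return pairs


def combine_parallel(direct, bridge, weight=0.5, k=10):
    """Interpolate candidate scores from the direct and bridge paths.

    direct, bridge: dicts mapping a candidate Z string to its score.
    weight: share given to the direct system (hypothetical value).
    """
    combined = {}
    for cand in set(direct) | set(bridge):
        combined[cand] = (weight * direct.get(cand, 0.0)
                          + (1.0 - weight) * bridge.get(cand, 0.0))
    # Sort by score, then by string so that ties are deterministic.
    return sorted(combined.items(), key=lambda c: (-c[1], c[0]))[:k]


# Toy example: the bridge path reinforces candidate "zb".
print(combine_parallel({"za": 0.5, "zb": 0.5}, {"zb": 1.0}))
```

A candidate missing from one path contributes a score of zero for that path, so agreement between paths is rewarded.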

Chapter 3
EXISTING SYSTEM
Existing system:

The previous system only converts an English word into the Marathi
language; the user cannot learn the actual pronunciation of that word.

Disadvantages of Existing system:

1. Users don’t know the standard pronunciation of words.

2. Cannot transliterate Indian Languages among themselves.

3. Lacks user input.

4. Cannot be relied upon.

3.1 Hunterian system


The Hunterian system is the "national system of romanization in India" and
the one officially adopted by the Government of India.

The Hunterian system was developed in the nineteenth century by William
Wilson Hunter, then Surveyor General of India. When it was proposed, it
immediately met with opposition from supporters of the earlier,
non-systematic and often distorting "Sir Roger Dowler method" (an early
corruption of Siraj ud-Daulah) of phonetic transcription. The dispute
climaxed in a dramatic showdown at an India Council meeting on 28 May 1872,
where the new Hunterian method carried the day. The Hunterian method was
inherently simpler and extensible to several Indic scripts because it
systematized grapheme transliteration, and it came to prevail and gain
government and academic acceptance. Opponents of the grapheme
transliteration model continued to mount unsuccessful attempts at reversing
government policy until the turn of the century, with one critic appealing
to "the Indian Government to give up the whole attempt at scientific (i.e.
Hunterian) transliteration, and decide once and for all in favour of a
return to the old phonetic spelling."

3.2 ITRANS scheme


ITRANS is an extension of Harvard-Kyoto. Many web pages are written in
ITRANS. Many forums are also written in ITRANS.

The ITRANS transliteration scheme was developed for the ITRANS software
package, a preprocessor for Indic scripts. The user inputs in Roman letters
and the ITRANS preprocessor converts the Roman letters into Devanagari (or
other Indic scripts). The latest version of ITRANS is version 5.30 released in
July, 2001.
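
To make the Roman-to-Devanagari step concrete, here is a minimal sketch in Python (used for brevity) of a converter in the style of ITRANS. It is not the ITRANS package's code: the mapping tables cover only a small lowercase subset, and the greedy longest-match scan and the function name `to_devanagari` are illustrative assumptions.

```python
# Simplified Roman -> Devanagari converter in the style of ITRANS.
# The tables are a small illustrative subset, not the full scheme.

VIRAMA = "\u094D"  # halant: suppresses a consonant's inherent vowel

# Roman vowel -> (independent vowel, dependent sign used after a consonant)
VOWELS = {
    "a": ("अ", ""), "aa": ("आ", "\u093E"),
    "i": ("इ", "\u093F"), "ii": ("ई", "\u0940"),
    "u": ("उ", "\u0941"), "uu": ("ऊ", "\u0942"),
    "e": ("ए", "\u0947"), "ai": ("ऐ", "\u0948"),
    "o": ("ओ", "\u094B"), "au": ("औ", "\u094C"),
}

CONSONANTS = {
    "k": "क", "kh": "ख", "g": "ग", "gh": "घ", "ch": "च", "j": "ज",
    "t": "त", "th": "थ", "d": "द", "dh": "ध", "n": "न",
    "p": "प", "ph": "फ", "b": "ब", "bh": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व", "sh": "श", "s": "स", "h": "ह",
}


def to_devanagari(roman):
    out = []
    pending = False  # True while the last consonant still awaits its vowel
    i = 0
    while i < len(roman):
        # Greedy longest match: try two letters before one.
        for size in (2, 1):
            chunk = roman[i:i + size]
            if chunk in VOWELS or chunk in CONSONANTS:
                break
        else:
            chunk = None
        if chunk is None:
            # Unknown character: close any open consonant, copy it through.
            if pending:
                out.append(VIRAMA)
            out.append(roman[i])
            pending = False
            i += 1
            continue
        if chunk in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # consonant cluster becomes a conjunct
            out.append(CONSONANTS[chunk])
            pending = True
        else:
            independent, sign = VOWELS[chunk]
            out.append(sign if pending else independent)
            pending = False
        i += len(chunk)
    if pending:
        out.append(VIRAMA)  # word-final bare consonant
    return "".join(out)


print(to_devanagari("bhaarata"), to_devanagari("namaste"))
```

As in ITRANS-style input, the inherent vowel must be typed explicitly as `a`; a consonant not followed by a vowel is written with a virama. Uppercase letters, which the real scheme uses for retroflex and long sounds, are outside this subset and are copied through unchanged.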

3.3 Quillpad

Quillpad is a predictive transliteration tool for inputting Indian
languages. Unlike rule-based phonetic transliteration solutions, where
users had to type by memorizing clumsy key combinations, Quillpad provided
a large leap in ease of use by letting users type freestyle, without
having to follow any rigid typing rules. Launched in 2006, Quillpad is
described as the first Indic transliteration solution to use a statistical
machine learning method to convert user-entered free-style phonetic input
into its accurate representation in a chosen Indian language.

3.4 Google Transliteration

Google transliteration (formerly Google Indic Transliteration) is a
transliteration typing service for Hindi and other languages. This tool
first appeared in Blogger, Google's popular blogging service, and later
became available as a separate online tool. In view of its popularity it
was embedded in Gmail and Orkut. In December 2009 Google released an
offline version named Google IME.

15
Chapter 4

PROPOSED SYSTEM

In our application, an English word is taken as input. The input words are
then converted into tokens. The tokens are compared with the dictionary,
and the final result is given as English-Marathi words.

When any English word is provided as input to our Transliteration Machine
System, the corresponding Marathi output is produced.

Advantages of proposed system:

1. The system gives the Marathi pronunciation, which is useful for those
learning standard-level English.

2. User friendly environment.

3. Better user interface.

4. Fast mechanism.

5. Small memory factor.

Block Diagram:

Chapter 5

ANALYSIS DETAILS OF HARDWARE & SOFTWARE

Hardware:

1. Processor: Pentium 4
2. RAM: 512 MB or more
3. Hard disk: 16 GB or more
Software:

1. Java JDK 1.6
2. NetBeans
3. MySQL

1. Java JDK 1.6:

The Java Development Kit (JDK) is an implementation of one of the Java SE,
Java EE or Java ME platforms, released by Oracle Corporation in the form of
a binary product aimed at Java developers on Solaris, Linux, Mac OS X or
Windows. The JDK includes a private JVM and a few other resources needed to
complete the development of a Java application. Since the introduction of
the Java platform, it has been by far the most widely used Software
Development Kit (SDK).

2. NetBeans:
NetBeans is an integrated development environment (IDE) for developing
primarily with Java, but also with other languages, in particular PHP,
C/C++, and HTML5. It is also an application platform framework for Java
desktop applications and others.

3. MySQL:
MySQL is a popular choice of database for use in web applications, and is a
central component of the widely used LAMP open source web application
software stack (and other 'AMP' stacks).
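
As an illustration of how a database could back the system's dictionary, the sketch below creates a small lookup table and queries it. Several things here are assumptions: that the dictionary is stored in the database at all, the table and column names, and the sample rows. Python's built-in `sqlite3` module is used as a stand-in for MySQL so that the example runs without a server; the SQL itself is plain enough to carry over.

```python
import sqlite3

# In-memory database standing in for the project's MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dictionary ("
    " english TEXT PRIMARY KEY,"
    " marathi TEXT NOT NULL)"
)

# Hypothetical sample entries; the project's real dictionary is not shown.
conn.executemany(
    "INSERT INTO dictionary (english, marathi) VALUES (?, ?)",
    [("pune", "पुणे"), ("mumbai", "मुंबई")],
)
conn.commit()


def lookup(connection, word):
    """Return the Marathi form of an English word, or None if absent."""
    row = connection.execute(
        "SELECT marathi FROM dictionary WHERE english = ?",
        (word.lower(),),
    ).fetchone()
    return row[0] if row else None


print(lookup(conn, "Pune"))
```

Using a parameterized query (the `?` placeholder) keeps user input out of the SQL string, which matters because the words come straight from a text box.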

Chapter 6

Design Details

System flowchart:

A flowchart is a type of diagram that represents an algorithm or process,
showing the steps as boxes of various kinds, and their order by connecting
them with arrows. This diagrammatic representation illustrates a solution to
a given problem. Process operations are represented in these boxes, and the
arrows show the order in which they occur; data flows are not drawn
explicitly but are implied by the sequencing of operations. Flowcharts are
used in analyzing, designing, documenting or managing a process or program
in various fields.

Algorithm:-

1. Enter English words as input in the text box of the Transliteration
Machine System.

2. Convert these input words into tokens.

3. Check the tokens of the given input words.

4. Compare these words with the dictionary of the Transliteration Machine
System.

5. The equivalent transliterated English-Marathi words are obtained as
output.
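
The five steps above can be sketched as follows, in Python for brevity (the project itself targets Java). The dictionary contents, the function names and the choice to report unknown tokens as `None` are assumptions for illustration, since the actual dictionary and checking rules are not specified here.

```python
import re

# Hypothetical dictionary entries; the project's real dictionary is not shown.
DICTIONARY = {"mumbai": "मुंबई", "pune": "पुणे", "india": "इंडिया"}


def tokenize(text):
    """Step 2: split the input into lowercase word tokens."""
    return re.findall(r"[A-Za-z]+", text.lower())


def transliterate(text, dictionary=DICTIONARY):
    """Steps 3-5: check each token and look it up in the dictionary."""
    results = []
    for token in tokenize(text):
        # Step 4: compare the token with the dictionary.
        marathi = dictionary.get(token)
        # Tokens missing from the dictionary are reported with None.
        results.append((token, marathi))
    # Step 5: English-Marathi word pairs.
    return results


# Step 1: the text a user would type into the input box.
for english, marathi in transliterate("Pune, Mumbai and Nagpur"):
    print(english, "->", marathi)
```

Returning pairs rather than only the Marathi words keeps the link between each input token and its result, which makes out-of-dictionary words easy to spot.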

Flowchart

Start
  ↓
Enter the English words as input
  ↓
Divide words into tokens
  ↓
Tokens are checked
  ↓
Compare with dictionary
  ↓
English-Marathi words are generated
  ↓
Stop

DFD Level 0:

English word → Our system → English-Marathi word

DFD Level 1:

Enter English words as input
  ↓
Convert word into token
  ↓
Check the token of word
  ↓
Compare with the dictionary
  ↓
Find out checked words
  ↓
English-Marathi words generated

CHAPTER 7
IMPLEMENTATION PLAN

IMPLEMENTATION PLAN:
The implementation plan includes a description of all the activities that
must occur to implement the new system and to put it into operation. It
identifies the personnel responsible for the activities and prepares a time
chart for implementing the system. The implementation plan consists of the
following steps.

• List all files required for implementation.

• Identify all data required to build new files during the implementation.

• List all new documents and procedures that go into the new system.

The implementation plan should anticipate possible problems and must be
able to deal with them. The usual problems may be missing documents, mixed
data formats between current and new files, errors in data transliteration,
missing data, etc.

Implementation includes all those activities that take place to convert from
the old system to the new. The old system consists of manual operations,
which is operated in a very different manner from the proposed new system.
A proper implementation is essential to provide a reliable system to meet the
requirements of the organizations. An improper installation may affect the
success of the computerized system.

IMPLEMENTATION METHODS:
There are several methods for handling the implementation and the
consequent conversion from the old to the new computerized system.

The most secure method for conversion from the old system to the new
system is to run the old and new systems in parallel. In this approach, a
person may operate the older manual processing system while also starting
to operate the new computerized system. This method offers high security,
because even if there is a flaw in the computerized system, we can depend
upon the manual system. However, the cost of maintaining two systems in
parallel is very high, and this outweighs its benefits.

Another commonly used method is a direct cut-over from the existing manual
system to the computerized system. The change may take place within a week
or within a day. There are no parallel activities. However, there is no
remedy in case of a problem, so this strategy requires careful planning.

A working version of the system can also be implemented in one part of the
organization, where the personnel pilot the system and changes can be made
as and when required. This method is less preferable, however, because the
system is not exercised in its entirety.

Conclusion
Thus, we conclude our account of the proposed transliteration system. It is
an effective token-based system for transliteration between English and
Marathi. As English and Marathi are structurally similar languages, it
generates target-language sentences that retain a flavor of the source
language. It should be noted that transliteration is not performed here in
the linguistic sense; rather, word-for-word transliteration is performed. It
requires limited linguistic effort and tools to achieve this goal. The
results demonstrate the potential advantage and accuracy of our approach.

The translator has successfully realised his intention. Referentially, the main
ideas of the SL text are reproduced. The language is rather more informal
than it is in the original, which is in line with the difference between
educated English and Marathi. There are several instances of under
translation, sometimes inevitable in the context of different collocations and
normal and natural usage. In fact the use of more general words helps to
strengthen the pragmatic effect, since, being common and frequently used,
they have more connotations and are more emotive than specific, let alone
technical, words which are purely referential.

REFERENCES

1. Carbonell, J., Cullingford, R., & Gershman, A. 1981. Steps Towards
Knowledge-Based Machine Translation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, PAMI-3.

2. Guida, G. & Mauri, G. 1986. Evaluation of Natural Language Processing
systems: Issues and approaches. Proceedings of the IEEE, 74(7): 1026-1035.

3. Altintas, K. & Cicekli, I. 2002. A machine translation system between a
pair of closely related languages. In: Proceedings of the International
Symposium on Computer and Information Sciences; Scannell, K. P. 2006.
Machine translation for closely related language pairs. In: Proceedings
of the Language Resource Evaluation Conference.

4. Rathod, P. H. & Dhore, M. L. (Department of Computer Engineering,
Vishwakarma Institute of Technology, Pune) and Dhore, R. M. (Pune
Vidhyarthi Griha’s College of Engineering and Technology, Pune). Hindi and
Marathi to English Machine Transliteration Using SVM.

5. Kumaran, A., Khapra, M. M. & Bhattacharyya, P. Compositional Machine
Transliteration. Microsoft Research India; Indian Institute of Technology
Bombay.

6. Bharti, A., Vineet, C. & Sangal, R. 1994. Natural language processing: a
Paninian perspective. Prentice-Hall of India, New Delhi; Patel, K. &
Pareek, J. 2010. Rule base to resolve translation problems due to
differences in gender properties in sibling language pair Gujarati–Hindi.
In: Proceedings of IEEE International Conference on Computer and
Communication Technology.
