
Sinhala Language Processor

1. Introduction
Sinhala, also called Sinhalese (සිංහල, ISO 15919: siṁhala, pronounced [ˈsiŋɦələ], earlier referred
to as Singhalese), is the language of the Sinhalese, the largest ethnic group of Sri Lanka. It
belongs to the Indo-Aryan language family. Sinhala is spoken by about 19 million people in Sri
Lanka, about 16 million of whom are native speakers. It is one of the constitutionally recognized
official languages of Sri Lanka, along with Tamil. Sinhala has its own writing system, the Sinhala
script, which descends from the Indian Brahmi script. The oldest Sinhala inscriptions were written
in the 3rd and 2nd centuries BCE; the oldest existing literary works date from the 9th century CE.
The closest relative of Sinhala is Dhivehi, the language of the Maldives. Sinhala word order is
SOV (subject-object-verb), an order it shares with Japanese, Korean and many other Asian
languages.[1]
Natural Language Processing (NLP) is the ability of a computer program to understand human
speech as it is spoken. NLP is a component of artificial intelligence (AI). The development of
NLP applications is challenging because computers traditionally require humans to speak to them
in a programming language that is precise, unambiguous and highly structured, or perhaps through
a limited number of clearly enunciated voice commands. Human speech, however, is not always
precise: it is often ambiguous, and its linguistic structure can depend on many complex
variables, including slang, regional dialects and social context. Current approaches to NLP are
based on machine learning, a type of artificial intelligence that examines and uses patterns in
data to improve a program's own understanding.[2]
This project aims to combine the Sinhala language with NLP technology so that this great
language survives for future generations. It combines a Sinhala speech-to-text system, a Sinhala
text-to-speech system and a Sinhala OCR system, integrated with Google Translate. The second
chapter of this paper (Background and Motivation) describes the present state of Sinhala language
processing. The third chapter (Problem in Brief) explains the problem. The fourth chapter sets
out the aims and objectives of the project. The proposed solution and its design are then
discussed with reference to relevant resources.

2. Background and Motivation


Several Natural Language Processing research efforts and projects related to the Sinhala
language have been carried out in recent years. Among them, the Language Technology Research
Laboratory (LTRL) at the University of Colombo School of Computing was established in 2004 to
address the growing need for local language computing in Sri Lanka through localization and
language processing research and development. With the award of a Canadian IDRC grant, LTRL
started work on the PAN Localization Project, aiming to produce a large Sinhala corpus, a
lexical resource, a text-to-speech (TTS) engine, speech recognition, machine translation and an
optical character recognition (OCR) application, which are some of the fundamental requirements
for language processing tasks.[3]

Figure 2.1 UCSC Sinhala TTS Application

Computer science students at the University of Moratuwa have also developed AMoRA, a TTS
system for Sinhala Unicode text, initiated with the intention of offering the advantages of a
typical TTS system to Sri Lankans who speak Sinhala. The core of the system is based on
Festival, a free, open-source text-to-speech engine. To localize it, linguistic and prosody
rules for the Sinhala language were applied.[4]
In addition, Google Translate now supports Sinhala, and anyone can help Google improve its
translations of English words and phrases into Sinhala. According to Google, Google Translate is
a statistical machine translation tool.
However, compared with other languages, research on the Sinhala language is still insufficient.
Most Sinhala Natural Language Processing systems are built on an English dictionary. There is
still no proper Sinhala language processing system built on a core of its own for the Sinhala
language, yet such a system is needed for Sinhala to survive in the IT world.

3. Problem in Brief
In the modern world, the Sinhala language has been left behind because few people attempt new
work on Sinhala in the field of Information Technology. Languages such as English, French,
German and Japanese have built strong positions of their own in the IT world.
As a result, a Sinhala speech-to-text system has still not been developed. Such a system would
help people with disabilities who cannot write.
Likewise, no proper, accurate Sinhala text-to-speech (TTS) system is available yet. Some Sinhala
TTS systems have been developed, but they do not have a core of their own; they are based on
English dictionaries.
There is also still a need for a proper optical character recognition system for the Sinhala
language, which would allow recognized text to be translated into another language. Such systems
would be very helpful for foreign tourists in Sri Lanka, who face considerable difficulty
translating the local language into their own.
Intelligent applications (such as Google Now, Cortana and Siri) have not even begun to be
developed for the Sinhala language because there is no proper API for Sinhala language
processing.

4. Aim and Objectives


Our aim is to propose an intelligent Sinhala language processing system for everyone who works
with the Sinhala language, in order to overcome the problems described above. It will help to
increase the use of the Sinhala language in the IT world and support its survival in a
technological world.
To fulfill this aim, several objectives must be met. Our main objective is to carry out a
dedicated analysis for identifying the characters of the Sinhala language through their voice
frequencies and relevant sound features.
An optical Sinhala character recognition system must also be developed, and it needs to identify
individual characters clearly.
A recognition system for Sinhala grammar and other Sinhala language patterns is also required.
Finally, a dedicated API for Sinhala language processing needs to be developed.
The final application will consist of the following major components, developed using the
techniques mentioned above.
I. Sinhala Speech to Text component

II. Sinhala Text to Speech component

III. Sinhala text detection and its translation into a given language

Figure 4.1 OCR Application Demo

5. Proposed Solution
This chapter proposes a possible software solution for fulfilling the aim and overcoming the
problems mentioned in the previous chapters. We are going to develop a desktop application for
Sinhala language processing. The proposed application will be able to convert Sinhala speech to
Sinhala text, convert Sinhala text to speech (in a male or female voice), perform Sinhala
optical character recognition and translate Sinhala into a given language. We intend to use the
Google Translate service for translation.
Sinhala Text to Speech
Input: Sinhala text
Process:

[Diagram, "Sinhala Text to Speech Processing": TEXT → Natural Language Processing (linguistic
formalisms, inference engines, logical inferences) → Digital Signal Processing (mathematical
models, algorithm computations) → SPEECH]

Figure 5.1 Sinhala TTS system overview

Output: Speech in a male or female voice corresponding to the input text


The TTS system converts arbitrary Sinhala text to speech. The first step extracts the phonetic
components of the message, producing a string of symbols that represent sound units (phonemes or
allophones) and the boundaries between words, phrases and sentences, along with a set of prosody
markers (indicating speed, intonation and so on). The second step matches this sequence of
symbols against the appropriate items stored in the phonetic inventory and binds them together
to form the acoustic signal for the voice output device.
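
The two steps described above can be sketched roughly as follows. This is only an illustration under strong assumptions, not the project's implementation: the romanized symbol table `G2P`, the `PHONE_INVENTORY` of synthetic tones standing in for recorded sound units, and all function names are hypothetical, and prosody markers are left out for brevity.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

# Hypothetical, heavily simplified symbol table: a few romanized letters
# mapped to phoneme symbols. A real system would work on Sinhala Unicode.
G2P = {"a": "a", "m": "m", "k": "k", "i": "i"}


def tone(freq_hz, duration_ms=50):
    """Return a synthetic sine burst standing in for a recorded sound unit."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]


# Hypothetical phonetic inventory: phoneme symbol -> stored waveform.
PHONE_INVENTORY = {"a": tone(220), "m": tone(180), "k": tone(300), "i": tone(260)}
PAUSE = [0.0] * (SAMPLE_RATE * 30 // 1000)  # silence used at word boundaries


def text_to_symbols(text):
    """Step 1: convert text into phoneme symbols plus word-boundary markers."""
    symbols = []
    for word in text.lower().split():
        symbols.extend(G2P[ch] for ch in word if ch in G2P)
        symbols.append("#")  # word-boundary marker
    return symbols


def symbols_to_signal(symbols):
    """Step 2: look up each symbol in the inventory and concatenate the units."""
    signal = []
    for sym in symbols:
        signal.extend(PAUSE if sym == "#" else PHONE_INVENTORY[sym])
    return signal


symbols = text_to_symbols("amma ki")
signal = symbols_to_signal(symbols)
```

Keeping the symbol stage separate from the signal stage means the inventory (for example, a male or a female voice) could be swapped without touching the text analysis.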


Sinhala Speech to Text


There is no proper mechanism or tool for converting Sinhala speech to text. We are going to
introduce a new speech recognition application. First, we detect the voice, filter out noise and
segment the audio file using algorithms. We then detect the voiced parts in the correct
frequency range, compare them with the matching stored sounds and convert them to text.
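
As a rough illustration of the voice detection and segmentation step, the sketch below marks runs of frames whose average energy exceeds a threshold. The proposal does not name its algorithms, so short-time energy thresholding is an assumption chosen for simplicity, and the function names, frame length and threshold value are hypothetical.

```python
def frame_energies(samples, frame_len):
    """Split the signal into fixed-size frames and return mean energy per frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def voiced_segments(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample indices of runs of frames above the threshold."""
    segments = []
    seg_start = None
    for i, energy in enumerate(frame_energies(samples, frame_len)):
        if energy >= threshold and seg_start is None:
            seg_start = i * frame_len  # a voiced run begins here
        elif energy < threshold and seg_start is not None:
            segments.append((seg_start, i * frame_len))  # the run has ended
            seg_start = None
    if seg_start is not None:
        # The signal ended while still inside a voiced run.
        segments.append((seg_start, (len(samples) // frame_len) * frame_len))
    return segments


# Toy signal: silence, a burst, silence, a quieter burst.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 4 + [0.3, -0.3] * 2
segments = voiced_segments(signal)
```

Real recordings would need much longer frames (tens of milliseconds) and a threshold tuned to the microphone's noise floor.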

[Diagram: Input → Filtering (Time Algorithm + Other Algorithm) → Feature Extraction →
Comparison (against the Speech Database) → Convert to Sinhala Text → Recognized Speech]

Figure 5.2 Sinhala STT system overview
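
The feature extraction and comparison stages of the overview could, for instance, look like the sketch below. It is a deliberately crude illustration: it assumes the only feature is the dominant frequency of a segment, computed with a naive discrete Fourier transform, and the `DATABASE` of reference frequencies and its labels are hypothetical. A single frequency is far too little to tell real speech sounds apart, so a practical system would use richer features.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest DFT bin, ignoring the DC bin."""
    n = len(samples)
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sample_rate / n


def closest_label(feature, database):
    """Return the label whose stored feature value is nearest to `feature`."""
    return min(database, key=lambda label: abs(database[label] - feature))


SAMPLE_RATE = 1600  # assumed sampling rate for the toy segment
# Hypothetical speech database: label -> reference dominant frequency in Hz.
DATABASE = {"sound_A": 200.0, "sound_B": 400.0}

# Toy segment: a pure 200 Hz sine, 64 samples long.
segment = [math.sin(2 * math.pi * 200 * i / SAMPLE_RATE) for i in range(64)]
feature = dominant_frequency(segment, SAMPLE_RATE)
label = closest_label(feature, DATABASE)
```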

Sinhala Text Detection (OCR) and Its Translation into a Given Language
The currently available Sinhala OCR systems support only one font style and one font size at a
time. The currently available Sinhala OCR engine is also unable to handle low-quality images
with noise and other disturbances: accuracy is drastically low, and sometimes it will not
produce any output at all. It is therefore necessary to enhance the existing engine or build a
new Sinhala OCR engine that is capable of image preprocessing and supports multi-font and
multi-size character recognition. Our priority is to fill that gap.
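
One common image preprocessing step is binarization, which separates dark ink from a lighter background before characters are recognized. The sketch below uses Otsu's method to choose the threshold. The proposal does not specify which preprocessing it will use, so this technique is an assumption, and the function names and the tiny sample image are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]          # number of pixels with value <= t
        sum_dark += t * hist[t]    # sum of those pixel values
        w_light = total - w_dark
        if w_dark == 0 or w_light == 0:
            continue               # one class is empty, so no valid split
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(image):
    """Map a grayscale image (rows of 0-255 ints) to 1 for ink, 0 for background."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


# Tiny grayscale sample: dark strokes (20-30) on a light page (200-220).
image = [
    [210, 30, 205, 220],
    [200, 25, 20, 215],
    [205, 210, 28, 200],
]
binary = binarize(image)
```

A global threshold like this works only when lighting is fairly even; unevenly lit photographs would need a locally adaptive variant.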
Google Translate is now also available for the Sinhala language, so it is easy to translate
detected words into any supported language. Detected words are also passed to the text-to-speech
component of our proposed solution.

6. Design
The proposed application has three major functions: Sinhala speech to text, Sinhala text to
speech, and translation of OCR-identified Sinhala text into a given language. Initially, we
intend to develop a Windows desktop application.

Figure 6.1 Design overview of proposed system

The speech text interface provides methods to start, pause, resume, fast forward, rewind and
stop the TTS engine during speech. The attribute interface gives access to the settings that
control the basic behavior of the TTS engine.
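
The control methods listed above might be organized as a small state machine, as in the sketch below. The class name `SpeechControl`, its attributes (`voice`, `rate`) and the idea of stepping through a list of speech units are all hypothetical choices made for illustration, not the proposed system's actual interface.

```python
class SpeechControl:
    """Hypothetical playback controller over a list of speech units."""

    def __init__(self, units, voice="female", rate=1.0):
        self.units = list(units)
        self.voice = voice      # attribute interface: which voice to use
        self.rate = rate        # attribute interface: speaking rate
        self.position = 0
        self.state = "stopped"

    def start(self):
        self.position = 0
        self.state = "speaking"

    def pause(self):
        if self.state == "speaking":
            self.state = "paused"

    def resume(self):
        if self.state == "paused":
            self.state = "speaking"

    def fast_forward(self, step=1):
        self.position = min(self.position + step, len(self.units))

    def rewind(self, step=1):
        self.position = max(self.position - step, 0)

    def stop(self):
        self.position = 0
        self.state = "stopped"

    def next_unit(self):
        """Return the next unit to speak, or None if paused, stopped or finished."""
        if self.state != "speaking" or self.position >= len(self.units):
            return None
        unit = self.units[self.position]
        self.position += 1
        return unit


ctl = SpeechControl(["w1", "w2", "w3", "w4"])
ctl.start()
first = ctl.next_unit()
ctl.pause()
while_paused = ctl.next_unit()
ctl.resume()
ctl.fast_forward()
after_skip = ctl.next_unit()
```

Here `pause` and `resume` change only the state, while `stop` also resets the position, so a later `start` always begins from the first unit.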

7. Resource Requirements

A good collection of Sinhala voice audio files, to perform the character analysis and train the
system properly

A good-quality microphone for sound capture

A good-quality camera for capturing images for the OCR system

MATLAB for developing the application core

8. References
[1] Sinhala alphabet, pronunciation and language - Omniglot.
http://www.omniglot.com/writing/sinhala.htm (accessed 19th February 2015)
[2] What is natural language processing (NLP)? Definition.
http://searchcontentmanagement.techtarget.com/definition/natural-language-processing-NLP (accessed 19th February 2015)
[3] Language Technology Research Laboratory - University of Colombo School of Computing.
http://ucsc.cmb.ac.lk/ltrl/?page=home&lang=en&style=default (accessed 19th February 2015)
[4] AMoRA Sinhala TTS - University of Moratuwa.
http://lms.uom.lk/sf/shantha/Project-web-sites/2009-10/PI-58-AMoRA/index.html (accessed 19th February 2015)


Appendix
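Action Plan (Timeline for 2015)

[Timeline chart: tasks 1 to 10, listed below, are scheduled in sequence across the months Feb,
Mar, Apr, May, June, July, Aug, Sept, Oct and Nov; tasks 1, 2 and 3 are grouped together.]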

1. Study related projects and prepare the requirements needed to start the project.
2. Gather requirements for starting the project.
3. Design and plan the product.
4. Start the implementation.
5. Release a basic product with test cases for interim evaluation.
6. Start the second iteration by gathering more complex requirements.
7. Design, plan and implement a complete product for release.
8. Give careful attention to the user-facing visual elements and design them properly, taking
HCI principles into account.
9. Finalize the project with test cases.
10. Prepare for the final evaluation.
