
2016 5th International Conference on Informatics, Electronics and Vision (ICIEV)

A Comprehensive Text Analysis for Bengali TTS using Unicode
Sheikh Abujar
Dept. of CSE, Britannia University, Comilla, Bangladesh
cybermanbd@gmail.com

Mahmudul Hasan
Dept. of CSE, Comilla University, Comilla, Bangladesh
mhasanraju@gmail.com

978-1-5090-1269-5/16/$31.00 ©2016 IEEE

Abstract—Communication is a natural characteristic of every living creature. We communicate with one another through symbols and through many formal languages, and every language we use supports both oral and written communication. Writing symbols is a way to express our intentions using a physical medium, while speech lets us say or read aloud exactly what a written text contains. A machine cannot speak on its own; it needs commands, instructions and a little intelligence. To speak a language properly we must first learn its basic elements: if we understand its letters, we can speak its words and then its sentences. Without understanding and analyzing a text completely, it is therefore difficult to build a good TTS (Text To Speech) [11] system in any language. In this paper we discuss various ways of analyzing Bengali text and of preparing it for TTS development. We also discuss several aspects of the model and an implementation approach based on text analysis. Bengali text parsing and analysis are discussed extensively.

Keywords—Text to speech, Bengali text analysis, dictionary and non-dictionary based model, parsing.

I. INTRODUCTION
Text to speech tools can greatly assist blind and otherwise disabled people. According to a WHO report there are more than 39 million blind people, and there are more than 250 million native Bengali speakers worldwide. Bengali ranks seventh among the most widely spoken languages [5]. Text analysis is one of the key components in developing a better TTS [1][11] engine. TTS is a very useful tool between machines and humans. We read large amounts of text every day, which takes a lot of time that a TTS [13] [14] engine could save. It is very important to pronounce every single word properly. Words may be ambiguous or unknown, and users may want to hear new words. Considering all these circumstances, we discuss various aspects of Bengali TTS development in which text analysis plays the major role.

Bengali text input is one of the major factors; if the input form is not accurate, the rest of the analysis becomes difficult. For this reason we use Unicode, because only Unicode can represent the data as is, everywhere, so we can convert the data into any form required. In Java we can either analyze the Bengali text letter by letter directly, or convert it to Unicode values first and then send those values to the analysis section. A good TTS needs several key parts in its model, including a synthesis module and a speech module. These two parts work properly only if they receive simplified text. Our TTS uses prerecorded voice: we first parse the entire text into individual words and then, if necessary, split the words into smaller tokens. The dictionary based model helps us produce a better pronounced prerecorded voice that sounds natural. English has the most efficient and powerful TTS engines because of its immense dictionary resources: most text is played from the dictionary database, so every word sounds like real speech. No such complete lexical resource exists for Bengali TTS. We therefore discuss a few models with which TTS performance can be maximized.

II. LITERATURE REVIEW
Research on Bangla TTS engines has advanced considerably in the last few years. A few techniques have been applied broadly, for instance text analysis, speech synthesis, and word or alphabet databases. Some studies applied alphabetical rules, others lexical rules, and some both. Very recent research introduced pronunciation rules. A few tools have been widely used for TTS development; FESTIVAL is one of them and has been used by many Bengali TTS developers. The Festival system is written in C++, uses the Edinburgh Speech Tools Library for its low level architecture, and has a Scheme (SIOD) based command interpreter for control.

Firoj, Mumit et al. [2] used FESTIVAL. They performed grapheme to phoneme conversion using pronunciation and letter to sound rules. Prosodic rules were used to identify stress, phrasing, etc. The authors introduced NSWs (Non-Standard Words), e.g. numbers (year, time, ordinal, cardinal, floating point), abbreviations, acronyms, currency, dates and URLs. They used text normalization to convert NSWs to SWs (Standard Words) and disambiguated ambiguous tokens using rules. In their research
they did not work with Unicode directly because Festival does not support Unicode, so they converted Unicode text to ASCII, for example ক = k, আ = aa, দ = da, etc.

Nayemul Islam Ayon [3] set out to develop an independent Bengali TTS engine rather than use a third party TTS engine such as FESTIVAL, and built a tool on the Java platform. He introduced a few pronunciation rules to improve the output speech and described the necessity of a rich lexical database.

In another paper, Firoz Mahmud et al. [4] published a simplified method that uses text analysis to identify the exact text required from the given input in order to produce the equivalent output speech. Using the Unicode resource [5], they distinguished between independent and dependent Bangla words.

III. METHODOLOGY
The purpose of this research is to build a better Bengali TTS that overcomes problems of pronunciation and accuracy, and to provide a way to enrich the database through a non-dictionary training module. Another aim is to provide a better understanding of how to develop our own Bengali TTS API, which could be used online from any device, such as Android or iOS. We will use cloud technologies, which will reduce OS related complexities. We used Java and Unicode because Unicode gives every character of every native language its own identity. Letters and words need not be converted to ASCII values or to any other format, because Unicode provides an individual code for each. Very rich lexical resources are mandatory for any such TTS. However, non-dictionary words are likely to appear, and the TTS engine should be able to generate the best possible output for them, which is achievable only if the text analysis is done well. Among all the required modules, the text analysis module is therefore one of the major parts we need to focus on.

IV. ARCHITECTURE FOR TTS
Figure 1 shows the basic architecture of the proposed text to speech model. It includes all TTS requirements, for example text analysis and speech generation. The text analysis module is divided into several stages, such as parsing, word analysis and final word processing. Both dictionary and non-dictionary methods are included and used as necessary. The model also covers a few other methods: how to normalize the input text, how to deal with non-dictionary words, and how to update the lexical resources with speech for unknown text by joining several speech files into a single one. More importantly, automated training and learning modules are implemented through this model.

Figure 1: Proposed Architecture for text to speech.

A. Text Analysis
Text analysis is a major part of developing a TTS engine. If the input text is taken in an appropriate format and the data can be normalized, better speech output results. Not every word comes from the dictionary; words may arrive in various unknown forms. For example, Rahim = রহিম is a person's name and does not exist in the dictionary. Data may therefore come in any format, and we make the engine capable of dealing with known and unknown forms as far as possible.

1) Parsing the text
Bengali is written with vowels and consonants. Words may be built from vowels only, from consonants only, or from both. For example, এই = এ + ই: the word এই is built from two different vowels, এ and ই. Another example is চমক = চ + ম + ক: the word চমক is built from consonants only.

Consonants and vowels have a large impact on speech utterance. For example, the word এইদিকে is stored exactly as এ + ই + দ + ◌ি + ক + ◌ে, but it is spoken as এ + ই + দি + কে; here দ + ◌ি = দি and ক + ◌ে = কে. The vowel signs ◌ি (= ই) and ◌ে (= এ) here are the dependent forms of the independent vowels ই and এ. The classification of Bangla keywords as dependent and independent is given in [4], [5]. Dependent vowels change the pattern of utterance, so it is
very important to analyze the whole sentence or passage carefully.

We developed our TTS engine using the Java language and Unicode. Unicode gives us a unique identification number for every individual letter, with which we can identify the pattern of the text and process it properly.

Figure 2: Unicode conversion and parsing

The Bangla input passes through the input parser, which separates each word into individual letters; spaces and other symbols are considered too. Every individual letter yields its own Unicode value for further processing.

2) Word Analysis of dictionary and non-dictionary based model
We can assume that most input comes from the dictionary, but many other words are non-dictionary words, for which speech output is difficult to generate. Prerecorded voices are widely used in TTS engines, so there must be a process for handling non-dictionary words. We use spaces to separate words. Every word is checked against the dictionary resource database; if it exists, its output speech is added to a temporary token that waits until the whole process finishes. If a word does not exist in the dictionary resource, we send it to the non-dictionary data processor, which generates new output speech by concatenating the prerecorded speech of individual letters. Dictionary words are added to a queue; when a non-dictionary word is found, it is processed immediately by the unknown word processor, the single newly generated audio file is added to the main queue, and the rest is processed in the same sequence. Dealing with a dictionary word is thus very simple: generate the Unicode of the input, match it against the dictionary database, and add the .wave file to the final queue.

For non-dictionary words, we process the required .wave files, which are joined into a single speech. For example, if we treat the input sentence "আমি ভাত খাই" as consisting entirely of unknown words, it is separated as আ + ম + ◌ি + space + ভ + ◌া + ত + space + খ + ◌া + ই. Here the text আমি is considered as আ + ম + ◌ি (◌ি = ই). But because ◌ি is a dependent vowel sign, it changes the whole pronunciation: there is a huge difference in pronunciation between ম + ◌ি spoken separately and মি.

In the non-dictionary speech generator, we first check how many dependent vowels exist. If we find a dependent vowel, we set a new token as a new letter, such as ম + ◌ি = মি. A dependent vowel always occupies the position immediately after the letter it belongs to. For this reason it is not difficult to tokenize a new, separate virtual letter such as মি, whose accurate speech is prerecorded. Unicode helps here by combining the values of the two letters into a new one.

Figure 3: Simplifying input text as new virtual letter.

In Figure 3 there are three dependent vowels, two of which are the same. The first is ◌ি, attached to ম, and the other two are ◌া, attached to ভ and খ respectively. Finally we replace the letters with a new virtual letter, for example ভ + ◌া = ভা, which redirects to the accurate speech file of "ভা" with the identity 09ad1 (09ad + 09be). For every 09be (◌া) we append the extra value "1" to the Unicode of the associated letter, and "2" for every 09bf (◌ি).

Bangla has many compound characters. Sample sets of compound characters have been discussed in several studies [4]. We also treat the compound characters together as a compound character set. These are available in the database too.

3) Applying the prime rules of auto learning
Auto learning is the process through which the TTS engine learns new unknown words; their required speech output is added to the dictionary database for further use. If that unknown word appears again as input, it is treated as a known word. Processing the speech of unknown words is a major part. Normalizing dependent vowels increases the chance of generating accurate speech. Dependent vowels are of two types [4], single part and two part. When generating speech for non-dictionary words, considering the dependent vowels together with their consonants is a major factor and plays a very big role in pronunciation. Once we have detected the pronunciation pattern, we join the different prerecorded files together to produce a single speech output file, which is then used for that unknown word's speech. The pattern we use for our TTS engine is given in the following table (Table 1).
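
The handling of unknown words described in subsections 2) and 3) can be sketched in Java as follows. This is only a minimal illustration under stated assumptions, not the engine's actual code: the class name `UnknownWordSketch` and its methods are hypothetical, only the two suffixes given in the text ("1" for 09be and "2" for 09bf) are mapped, and the dictionary stores a list of audio identifiers per word rather than a joined wave file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: virtual-letter tokenization plus auto learning.
final class UnknownWordSketch {
    // Suffixes stated in the text: "1" for U+09BE (aa sign), "2" for U+09BF (i sign).
    // Other dependent vowel signs would come from Table 1 and are not handled here.
    private static final Map<Integer, String> SUFFIX = Map.of(0x09BE, "1", 0x09BF, "2");

    // Stand-in for the dictionary database: word -> audio identifiers to play.
    private final Map<String, List<String>> dictionary = new HashMap<>();

    /** Merges each letter with a directly following dependent vowel sign into one id. */
    static List<String> tokenize(String word) {
        List<String> ids = new ArrayList<>();
        int[] cps = word.codePoints().toArray();
        for (int i = 0; i < cps.length; i++) {
            String id = String.format("%04x", cps[i]);
            // A dependent vowel sign always sits right after the letter it modifies.
            if (i + 1 < cps.length && SUFFIX.containsKey(cps[i + 1])) {
                id += SUFFIX.get(cps[i + 1]);
                i++; // the vowel sign has been consumed
            }
            ids.add(id);
        }
        return ids;
    }

    /** Builds the playback queue; an unknown word is tokenized once and then remembered. */
    List<String> buildQueue(String sentence) {
        List<String> queue = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            queue.addAll(dictionary.computeIfAbsent(word, UnknownWordSketch::tokenize));
        }
        return queue;
    }

    boolean knows(String word) {
        return dictionary.containsKey(word);
    }

    public static void main(String[] args) {
        UnknownWordSketch tts = new UnknownWordSketch();
        // The example sentence "ami bhat khai", written with Unicode escapes.
        System.out.println(tts.buildQueue("\u0986\u09AE\u09BF \u09AD\u09BE\u09A4 \u0996\u09BE\u0987"));
    }
}
```

For the example sentence the sketch yields the identifiers 0986, 09ae2, 09ad1, 09a4, 09961 and 0987, so ভা maps to 09ad1 as in the text, and after the first call all three words count as known.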

Table 1: Unicode modifier character set.

The non-dictionary word processor generates a single speech output after analyzing the whole word, and after the entire process completes, the word is added to the dictionary database. Each time an unknown word is processed, the database is updated so that the word becomes a known word or resource.

V. DATABASE AND WAVE FILE JOINING RULES ANALYSIS
The need for a rich word database cannot be overstated. In our TTS engine we process every single input. We separate the words into two classes, dictionary and non-dictionary words. The speech of dictionary words is taken from the "Dictionary Database". To process unknown words we need the speech of the alphabet as well as the speech of the newly proposed modified virtual alphabets, for example কা, কে, কি, কো, কৌ, etc., which are used to generate the speech of new words. For this purpose we concatenate different audio files.

Figure 4: Database model for non-dictionary words.

Figure 4 presents the database model for converting unknown words from text to speech and shows how we update the dictionary database with new unknown words. The compound character set is available in the alphabet database; we consider those characters modified virtual alphabets too.

Prerecorded speech files are required for a TTS engine. In our TTS model we propose different types of speech files: letters, modified letters and words. Letters include consonants, vowels, digits, etc. Modified letters are used specifically to generate new speech for unknown words through the auto training process. The training process finishes by joining the required audio files together. The algorithm for joining audio files is given below.

String Playlist[]
Try
    Get Audio Input Stream (Playlist[])
    Set New Sequence Input Stream
    Get frame length (Playlist[])
    Write appended files, Type .WAVE
Exception
    Print Stack Trace

If we consider the word "মহাখালী" an unknown word, it is processed as "ম + হ + ◌া (আ) + খ + ◌া (আ) + ল + ◌ী (ঈ)", but the pronunciation of "মহাখালী" is not exactly what we would synthesize this way: the speech of "হ + আ" is never similar to "হা". This is why we use the modified character set, in which the accurate speech of "হা" is stored. We only have to find the dependent vowels and the different types of additional letters.

The equivalent speech of a token is provided by the alphabet database and the dictionary database. Letters and modified characters belong to the same database, called the alphabet database.

VI. PROSODIC COMPLEXITY ANALYSIS
We recorded different types of voice and segmented them into letters, the modified character set and words. Words are used as segments of a full sentence, and letters or modified characters as segments of words. Because we join different audio files together, the pitches may not always match. The highest acceptance rate is obtained when the noise frames of each audio file are reduced before joining. For example, suppose we create new speech for the word "আমি" by joining the speech of the letter "আ" and of the modified character "মি". Comparing the waveform of the natural speech with that of the newly generated speech reveals the prosodic complexity. The duration of the actual word's speech is 0.7 second. The two separate speech files, however, last 0.3 second and 0.5 second, so joining them gives newly generated speech of 0.8 second. The figure below shows the actual waveform comparison.

Figure 5: Wave form example of joining speech (আ + মি compared with আমি).
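
The joining procedure listed in Section V, and the duration comparison made in Section VI, could be written with the standard `javax.sound.sampled` API roughly as follows. This is a sketch under assumptions rather than the engine's actual audio processor: the class name `WaveJoiner` is hypothetical, all clips are assumed to share one audio format, and exceptions are propagated to the caller instead of printing a stack trace.

```java
import java.io.File;
import java.io.IOException;
import java.io.SequenceInputStream;
import java.util.List;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;

// Hypothetical sketch of the wave file joining step.
final class WaveJoiner {
    /** Appends the clips in playlist order and writes them to target as one WAVE file. */
    static void join(List<File> playlist, File target)
            throws IOException, UnsupportedAudioFileException {
        if (playlist.isEmpty()) {
            throw new IllegalArgumentException("playlist is empty");
        }
        AudioInputStream joined = AudioSystem.getAudioInputStream(playlist.get(0));
        for (File next : playlist.subList(1, playlist.size())) {
            AudioInputStream clip = AudioSystem.getAudioInputStream(next);
            // Chain the audio data and add up the frame lengths (same format assumed).
            joined = new AudioInputStream(
                    new SequenceInputStream(joined, clip),
                    joined.getFormat(),
                    joined.getFrameLength() + clip.getFrameLength());
        }
        try (AudioInputStream out = joined) {
            AudioSystem.write(out, AudioFileFormat.Type.WAVE, target);
        }
    }

    /** Duration in seconds, computed as frame length divided by frame rate. */
    static double seconds(File wav) throws IOException, UnsupportedAudioFileException {
        try (AudioInputStream in = AudioSystem.getAudioInputStream(wav)) {
            return in.getFrameLength() / (double) in.getFormat().getFrameRate();
        }
    }
}
```

Joining a 0.3 second clip and a 0.5 second clip this way gives a file for which `seconds` reports 0.8, which is the figure compared against the 0.7 second natural recording above. Plain concatenation does nothing about pitch mismatch or noise frames, so any trimming would have to happen before `join` is called.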

VII. OUTPUT SPEECH PROCESSING
Every prerecorded voice was taken from a native speaker. Both male and female voices are available. During recording, noise was reduced as far as possible. To ensure noise free audio files, each audio file was processed after recording with the WaveSurfer tool, an open source application, in which the frames of every audio file were assessed and confirmed for inclusion in the TTS engine. After processing by the text analysis module, we obtain the final token list of audio files to be played. The process for generating the speech of a new word is similar to that for generating a new sentence: for a word we join letters and modified characters, and for a sentence we join all word files into a single wave file to be played. All of these tasks are done by the audio speech processor, which was built using the Java sound library.

VIII. PERFORMANCE
This Bengali TTS engine was examined by many different users. An Android version of the engine was installed on various Android mobile phones and tablets. The users were of different ages and included university students, teachers, officials and technical persons. More than 150 people assessed the engine. Scores were given out of 5.00. The average score was approximately 4.1, which means about 81% accuracy.

The assessment was based on a few parameters: speech accuracy, pronunciation, prosody, speech utterance and comparison with native speech. The figure below gives an overview of the results.

Figure 6: Performance report.

IX. CONCLUSION
A complete text analysis, together with the process of generating the speech of new words through auto learning, helps to enrich the lexical resources continuously. Auto learning and the database updating procedure reduce repeated use of the audio processor and make it possible to serve speech output faster than before. The words we process through our TTS still do not reach 100 percent accuracy. As we continue to improve the engine, however, it should become possible in the near future to overcome the complexities of handling exceptional words that do not follow any rule. Input and text analysis are done using Unicode, which reduces the chance of running into machine related complexities. Because we developed our TTS engine on the Java platform, it will be possible to release a complete Application Programming Interface (API) for other Bengali TTS developers to use. Releasing an API will help us avoid third party TTS resources, and we will be able to use our own TTS engine on any electronic device. Further development will follow in the near future, with the intention of building a flawless Bengali TTS engine.

REFERENCES
[1] Khan, S.; Roy, R., "Creation of acoustic signal dictionary for ESNOLA based concatenated Bangla and Nepali TTS system", International Conference on Speech Database and Assessments (Oriental COCOSDA), pages 162-167, 26-28 Oct. 2011, Hsinchu, Taiwan.
[2] Dr. Mumit Khan, Firoj Alam, Promila Kanti Nath, "Text To Speech for Bangla Language using Festival", 2011, BRAC University, Bangladesh.
[3] Nayemul Islam Ayon, "Text to Speech System for Bengali Language using Pronunciation", 2013, AIUB, Bangladesh.
[4] Mahmud, F.; Abdullah-al-Mamun, M.; Aktar, M.; Afroge, S., "A Novel Training Based Concatenative Bangla Speech Synthesizer Model", International Conference on Electrical Engineering and Information Communication Technology (ICEEICT), 21-23 May 2015, Dhaka, Bangladesh.
[5] Debnath, R.; Hanumante, V.; Bhattacharjee, D.; Tripathi, D.; Roy, S., "Multilingual speech translator using MATLAB", Electrical, Electronics, Signals, Communication and Optimization (EESCO), 2015 International Conference on, pages 1-5.
[6] Sultana, S.; Akhand, M.A.H.; Das, P.K.; Hafizur Rahman, M.M., "Bangla Speech-to-Text conversion using SAPI", International Conference on Computer and Communication Engineering (ICCCE), 3-5 July 2012, Kuala Lumpur, Malaysia.
[7] Deshmukh, S.; Laulkar, C.; Rajankar, S., "Automatic recognition of class variants of Marathi consonants", Pervasive Computing (ICPC), 2015 International Conference on, pages 1-4.
[8] A. Sen, "Bangla Pronunciation Rules and a Text-to-Speech System", Symposium on Indian Morphology, Phonology & Language Engineering, 2004, pp. 39.
[9] Firoj Alam, S.M. Murtoza Habib, and Mumit Khan, "Text normalization system for Bangla", Proc. of Conf. on Language and Technology, Lahore, pp. 22-24, 2009.
[10] Daniel Erro, Asunción Moreno, and Antonio Bonafonte, "Flexible Harmonic/Stochastic Speech Synthesis", Proc. 6th ISCA Speech Synthesis Workshop, 2007.
[11] D. Sasirekha, E. Chandra, "Text To Speech: A Simple Tutorial", International Journal of Soft Computing and Engineering (IJSCE), Volume-2, Issue-1, March 2012.
[12] K. Knight and D. Marcu, "Statistics-based summarization - step one: Sentence compression", in Proceedings of AAAI-01, pages 703-710, Austin, Texas, 2001.
[13] K.R. Aida-Zade, C. Ardil and A.M. Sharifova, "The Main Principles of Text-to-Speech Synthesis System", World Academy of Science, Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering, Vol:7, No:3, 2013.
[14] Kurematsu, M.; Hakura, J.; Fujita, H., "The Framework of the Speech Communication System with Emotion Processing", Proceedings of the 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases, Corfu Island, Greece, February 16-19, 2007, pp. 46-52.
