
St. Mary's University

School of Graduate Studies, Department of Computer Science

REPORT ON:

Language Model for the Tigrigna Language

By:
Gerabirhan Paulos    SGS/0448/2010A
Sara Adamu           SGS/0456/2010A
Genet Jordano        SGS/0447/2010A
Bezawit G/mariam     SGS/0442/2010A

Submitted to: Martha Yifiru (PhD)

June, 2018

Addis Ababa, Ethiopia


TABLE OF CONTENTS

1 INTRODUCTION
1.1 Language Model (LM)
1.2 Tigrigna Language
2 LANGUAGE MODELING FOR THE TIGRIGNA LANGUAGE
2.1 Data
2.2 Software
2.3 N-Gram Language Model Implementations
2.3.1 Unigram Language Model Implementation
2.3.2 Bigram Language Model Implementation
2.3.3 Trigram Language Model Implementation
3 EVALUATION
4 SMOOTHING THE N-GRAM MODEL
4.1 Add-One/Laplace Smoothing
REFERENCES
1 INTRODUCTION
This practical assignment report describes the construction and testing of a language
model, that is, a model for estimating the probability of an arbitrary string of symbols
(in our case, sequences of Tigrigna words). The data used for the assignment are
described in Section 2.1.

Natural Language Processing (NLP), also variously called computer speech and language
processing, human language technology, or computational linguistics, is a field of
computer science, artificial intelligence, and linguistics concerned with the
interactions between computers and human (natural) languages [33] [34].

The field of NLP involves making computers perform useful tasks with the natural
languages humans use. The input and output of an NLP system can be speech or written text.

Automatic Speech Recognition (ASR) and Statistical Machine Translation (SMT), among
other natural language applications, rely on a language model (LM) to generate coherent
natural language text. The LM plays a crucial role in identifying the correct word
sequences, by providing a strong prior over word sequences, in the often prohibitively
large and complex hypothesis space of these systems [4].

1.1 Language Model (LM)


A language model is a stochastic process model for word sequences: a mechanism for
computing the probability of a sequence, P(w1, ..., wT). It plays a crucial role in
identifying the correct word sequences by providing a strong prior over them.
Probabilities are essential in any task in which we have to identify words in noisy,
ambiguous input, such as speech recognition or handwriting recognition [2] [3] [4].
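
Concretely, the chain rule decomposes the sequence probability into a product of
conditional probabilities, which an N-gram model approximates by truncating the
history; for a bigram model, for example:

P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-1})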

Language models are important for different NLP applications, such as [2]:

 An important component of a speech recognition system (helps discriminate between
similar-sounding words and helps reduce search costs)
 In statistical machine translation, a language model characterizes the target
language and captures fluency
 Selecting among alternatives in summarization and generation
 Text classification (style, reading level, language, topic, ...)
 Language models can be used for more than just words: letter sequences (language
identification), speech-act sequence modeling, case and punctuation restoration

Whether estimating probabilities of next words or of whole sequences, the N-gram model
is one of the most important tools in speech and language processing [3].

1.2 Tigrigna Language


There are some ninety languages spoken in Ethiopia [1]. Tigrigna (also
spelled Tigrinya) is one of the major ones: it is spoken in the Tigray region of
northern Ethiopia, where it is the working language of the regional state, and in
Eritrea, where it is one of the most widely used languages.

Tigrigna is an Afro-Asiatic language of the Semitic group, closely related
to Geʿez, or Ethiopic, and it is written in the Geʿez script.

2 LANGUAGE MODELING FOR THE TIGRIGNA LANGUAGE
In this section we assign probabilities to Tigrigna sentences and sequences of words
using N-grams. An N-gram is a sequence of N words: a 2-gram (or bigram) is a two-word
sequence of words like "please turn" or "turn your", and a 3-gram (or trigram) is a
three-word sequence of words like "please turn your" [3]. In this part we compute
probabilities for the Tigrigna language from sample sentences in our corpus using N-grams.
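
For illustration only (a minimal sketch, independent of the SRILM toolkit used below),
N-grams can be read directly off a tokenized sentence:

# Minimal sketch: extract N-grams from a whitespace-tokenized sentence.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "please turn your homework".split()
print(ngrams(sentence, 2))  # [('please', 'turn'), ('turn', 'your'), ('your', 'homework')]
print(ngrams(sentence, 3))  # [('please', 'turn', 'your'), ('turn', 'your', 'homework')]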

2.1 Data
We first collected (downloaded) corpus data from different websites, such as VOA
Tigrigna, BBC Tigrigna, SBS Tigrigna news, and the JW Tigrigna Bible. We then combined
the corpora, tokenized the sentences, and split the corpus into training data (90%)
and test data (10%).

This produces the text files described below:

 All-Tigrigna-Corpus.txt: the text of the original corpus (all words)
 All-Tigrigna-Corpus-10tail.data: the text of the test data (10%)
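
A simple way to produce such a split, sketched here under the assumption that the
corpus holds one sentence per line and that the test set is the last 10% of lines (as
the "10tail" name suggests; the training file name is hypothetical):

# Sketch: hold out the last 10% of a one-sentence-per-line corpus as test data.
with open("All-Tigrigna-Corpus.txt", encoding="utf-8") as f:
    sentences = f.readlines()

cut = int(len(sentences) * 0.9)  # 90% boundary
with open("All-Tigrigna-Corpus-train.txt", "w", encoding="utf-8") as f:  # hypothetical name
    f.writelines(sentences[:cut])
with open("All-Tigrigna-Corpus-10tail.data", "w", encoding="utf-8") as f:
    f.writelines(sentences[cut:])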

2.2 Software
We used the SRILM toolkit to develop and test the language model. SRILM is a collection
of C++ libraries, executable programs, and helper scripts designed to allow both production
of and experimentation with statistical language models for speech recognition and other
applications. SRILM is freely available for noncommercial purposes [7]. The SRI Language
Modeling (SRILM) toolkit offers tools for building and applying statistical language models
(LMs) for use in speech recognition, statistical tagging and segmentation, and machine
translation [6].

2.3 N-Gram Language Model Implementations


Let's begin with the task of computing P(w|h), the probability of a word w given some
history h. Suppose the history h is "its water is so transparent that" and we want to
know the probability that the next word is "the" [3]:

P(the | its water is so transparent that).
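
Such conditional probabilities are estimated from corpus counts by relative frequency
(maximum likelihood) estimation [3]; for a bigram, the count of the pair divided by the
count of its first word:

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}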

2.3.1 Unigram Language Model Implementation

The unigram counts form a vector containing the number of occurrences of each word in
the dataset; for example, ኺበልዕ = 6 and ይዕመጻ = 2 are the numbers of times those words
occur in the dataset.

Under a pure maximum likelihood estimate, any word unseen in training gets a zero
probability, which is generally a bad idea, since with finite training data you often
don't want to rule things out entirely. The workhorse program for estimating n-gram
models in SRILM is ngram-count. We can try it for counting unigrams in our
Tigrigna corpus:

./ngram-count -text /home/cats/Desktop/LM/All-Tigrigna-Corpus.txt -order 1 -write /home/cats/Desktop/LM/Tigrigna-Corpus-unigram.count

Figure 1. Example of unigram counts (screen capture).


By default, SRILM computes a smoothed model, so we have to specify that we want
an unsmoothed model:

./ngram-count -text /home/cats/Desktop/LM/All-Tigrigna-Corpus.txt -order 1 -addsmooth 0 -lm /home/cats/Desktop/LM/Tigrigna-Corpus.lm1
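
For intuition, the unigram maximum likelihood estimate is just each word's count
divided by the total token count; a minimal sketch, independent of SRILM:

from collections import Counter

# Count every token in the corpus.
tokens = open("All-Tigrigna-Corpus.txt", encoding="utf-8").read().split()
counts = Counter(tokens)
total = sum(counts.values())

def unigram_prob(word):
    # MLE: C(w) / N; unseen words get probability 0.
    return counts[word] / total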

word:  ናይ        ምስሊ       መግለጺ      ኣብ        ድርኩኺት     ምጥፋእ      ዝርከብ      መለኸት
P(.):  4.141222  2.889108  2.874561  1.635011  4.461557  4.461557  3.622136  4.122739
Table 1. Unigram probabilities for eight words in the Tigrigna project corpus of 9325
lines (values are negative base-10 log probabilities, as written by SRILM).

2.3.2 Bigram Language Model Implementation


The bigram counts form a matrix containing the count of each pair of adjacent words in
the dataset; for example, የሩሳሌምን ይሁዳን = 2 and አስቤዛ ብምዕዳግ = 1 are the numbers of times
those pairs occur. The workhorse program for estimating n-gram models in SRILM is
ngram-count. We can try it for counting bigrams in our Tigrigna corpus:

./ngram-count -text /home/cats/Desktop/LM/All-Tigrigna-Corpus.txt -order 2 -write /home/cats/Desktop/LM/Tigrigna-Corpus-bigram.count

Figure 2. Example of bigram counts (screen capture).


By default, SRILM computes a smoothed model, so we have to specify that we want
an unsmoothed model:

./ngram-count -text /home/cats/Desktop/LM/All-Tigrigna-Corpus.txt -order 2 -addsmooth 0 -lm /home/cats/Desktop/LM/Tigrigna-Corpus.lm2
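
The corresponding bigram estimate divides each pair count by the count of its first
word; again a minimal sketch rather than SRILM's implementation:

from collections import Counter

tokens = open("All-Tigrigna-Corpus.txt", encoding="utf-8").read().split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent pairs

def bigram_prob(w1, w2):
    # MLE: C(w1 w2) / C(w1); zero if the pair (or w1) was never seen.
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0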

          ናይ        ምስሊ       መግለጺ        ኣብ         ድርኩኺት     ምጥፋእ      ዝርከብ       መለኸት
ናይ       3.103119  0.851481  0           3.404149   0         0         3.103119   3.404149
ምስሊ      0         0         0.06117361  0          0         0         0          0
መግለጺ     0         0         0           0.9123856  0         0         0          0
ኣብ       2.01668   0         0           3.867939   2.964849  3.867939  2.788757   0
ድርኩኺት    1.041393  0         0           0          0         1.041393  0          0
ምጥፋእ     0         0         0           1.041393   0         0         0.7403627  0
ዝርከብ     1.880814  0         0           0          0         0         0          1.579784
መለኸት     0         0         0           1.079181   0         0         0          0
Table 2. Bigram probabilities for eight words in the Tigrigna project corpus of 9325
lines (nonzero values are negative base-10 log probabilities; 0 marks an unseen bigram).

2.3.3 Trigram Language Model Implementation

The trigram counts form a matrix containing the count of each triple of adjacent words
in the dataset; for example, መንእሰያትን ድኽመታት ከምዘሎ = 1 and ክርእዮ ኸሎ ኣዝዮም = 2 are the
numbers of times those triples occur. The workhorse program for estimating n-gram
models in SRILM is ngram-count. We can try it for counting trigrams in our
Tigrigna corpus:

./ngram-count -text /home/cats/Desktop/LM/All-Tigrigna-Corpus.txt -order 3 -write /home/cats/Desktop/LM/Tigrigna-Corpus-trigram.count

Figure 3. Example of trigram counts (screen capture).

By default, SRILM computes a smoothed model, so we have to specify that we want
an unsmoothed model:

./ngram-count -text /home/cats/Desktop/LM/All-Tigrigna-Corpus.txt -order 3 -addsmooth 0 -lm /home/cats/Desktop/LM/Tigrigna-Corpus.lm3

          ናይ ምስሊ   ምስሊ መግለጺ     መግለጺ ኣብ   ኣብ ድርኩኺት   ድርኩኺት ምጥፋእ   ምጥፋእ ዝርከብ   ዝርከብ መለኸት
ናይ       0         0.001214811  0          0           0             0            0
ምስሊ      0         0            0.845098   0           0             0            0
መግለጺ     0         0            0          0           0             0            0
ኣብ       0         0            0          0           0             0            0
ድርኩኺት    0         0            0          0           0             0            0
ምጥፋእ     0         0            0          0           0             0            0
ዝርከብ     0         0            0          0           0             0            0
መለኸት     0         0            0          0           0             0            0
Table 3. Trigram probabilities for eight words in the Tigrigna project corpus of 9325
lines (rows give the first word of the trigram, columns the remaining word pair;
0 marks an unseen trigram).

The resulting model file begins with a header that tells you how many unique n-gram
types were observed for each order n, up to the maximum order of the model:

./ngram-count -text /home/cats/Desktop/LM/All-Tigrigna-Corpus.txt -order 3 -addsmooth 0 -lm /home/cats/Desktop/LM/Tigrigna-Corpus.lm3

7
\data\
ngram 1=71381
ngram 2=219164
ngram 3=25085

Example output (each line gives a base-10 log probability, the n-gram, and, for
lower-order entries, a backoff weight):

\1-grams:
-5.20192  "'ሀ'ን  -99
-5.50295  "'ሓኪም  -99
-5.20192  "'ሰብ  -7.478543
-5.50295  "'ሰብነት  -99
-5.50295  "'ሰብኣይ  -99
...
-5.20192  ፍሉይ  -99

\2-grams:
-1.079181  "ሎሚ ለይቲ
-1.079181  "ሎሚ መዓልቲ
-1.079181  "ሎሚ ሰንበት
-1.079181  "ሎሚ ብሓቂ
-1.079181  "ሎሚ ኣብዚ
...
0  ፍሉይ ዓቕሚ  -0.455402

\3-grams:
-0.1760913  <s> "ሕዚ ናብቲ
-0.39794  <s> "ሕጂ ሓቢርና
-0.39794  <s> "ሕጂ እቲ
-0.4771213  ከኣ፡ "መምህር፡ ሓደስ
-0.544068  ኸኣ፡ "መምህር፡ እዚ
...
0  መቐለ — </s>

Table 4. Example of the n-gram entries observed for each order n in the model file.

3 EVALUATION
For an intrinsic evaluation of a language model we need a test set. The probabilities
of an N-gram model come from the corpus it is trained on, the training set or training
corpus. We can then measure the quality of an N-gram model by its performance on some
unseen data called the test set or test corpus [3].
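
Perplexity, the metric reported below, is the inverse probability of the test set
normalized by the number of words; for a test set W = w1 w2 ... wN:

PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}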

Calculating model perplexity with SRILM: once you have a language model written to a
file, you can calculate its perplexity on a new dataset using SRILM's ngram command,
using the -lm option to specify the language model file and the -ppl option to specify
the test-set file.

./ngram -lm /home/cats/Desktop/LM/Tigrigna-Corpus.lm3 -ppl /home/cats/Desktop/LM/All-Tigrigna-Corpus-10tail.data

932 sentences, 89731 words, 0 OOVs
0 zeroprobs, logprob= -115181 ppl= 18.6393 ppl1= 19.2143
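
These values can be checked by hand: SRILM computes ppl = 10^(-logprob / (words - OOVs
+ sentences)) and ppl1 = 10^(-logprob / (words - OOVs)), so here
10^(115181 / (89731 + 932)) ≈ 18.64 and 10^(115181 / 89731) ≈ 19.21, matching the
reported ppl and ppl1 (ppl includes the end-of-sentence tokens in the normalization,
ppl1 does not).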

4 SMOOTHING THE N-GRAM MODEL
4.1 Add-One/Laplace Smoothing
The simplest way to do smoothing is to add one to all the bigram counts before we
normalize them into probabilities. All the counts that used to be zero will now have a
count of 1, the counts of 1 will be 2, and so on. This algorithm is called Laplace
smoothing [3].
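
For a vocabulary of V word types, the add-one bigram estimate adjusts the maximum
likelihood estimate as [3]:

P_{\mathrm{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}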

          ናይ        ምስሊ       መግለጺ      ኣብ        ድርኩኺት     ምጥፋእ      ዝርከብ      መለኸት
ናይ       4.391617  2.314855  0         4.567708  0         0         4.391617  4.567708
ምስሊ      0         0         2.302187  0         0         0         0         0
መግለጺ     0         0         0         3.131879  0         0         0         0
ኣብ       3.038962  0         0         4.595265  3.942052  4.595265  3.782351  0
ድርኩኺት    4.552613  0         0         0         0         4.552613  0         0
ምጥፋእ     0         0         0         4.552613  0         0         4.376522  0
ዝርከብ     4.553009  0         0         0         0         0         0         4.376917
መለኸት     0         0         0         4.376601  0         0         0         0
Table 5. Add-one smoothed bigram probabilities for eight words in the Tigrigna project
corpus of 9325 lines (nonzero values are negative base-10 log probabilities).

          ናይ ምስሊ   ምስሊ መግለጺ   መግለጺ ኣብ   ኣብ ድርኩኺት   ድርኩኺት ምጥፋእ   ምጥፋእ ዝርከብ   ዝርከብ መለኸት
ናይ       0         2.30186    0          0           0             0            0
ምስሊ      0         0          3.13974    0           0             0            0
መግለጺ     0         0          0          0           0             0            0
ኣብ       0         0          0          0           0             0            0
ድርኩኺት    0         0          0          0           0             0            0
ምጥፋእ     0         0          0          0           0             0            0
ዝርከብ     0         0          0          0           0             0            0
መለኸት     0         0          0          0           0             0            0
Table 6. Add-one smoothed trigram probabilities for eight words in the Tigrigna project
corpus of 9325 lines (rows give the first word of the trigram, columns the remaining
word pair; nonzero values are negative base-10 log probabilities).

REFERENCES
[1] Benjamin Elisha Sawe, "What Languages Are Spoken in Ethiopia," worldatlas.com,
August 1, 2017. [Online]. Available: www.worldatlas.com/articles/what-languages-
are-spoken-in-ethiopia.html. [Accessed April 1, 2018].

[2] D. Hillard and S. Petersen, lecture notes, topic: "N-gram Language Modeling Tutorial."

[3] Daniel Jurafsky and James H. Martin, Speech and Language Processing, draft of
August 7, 2017.

[4] Ariya Rastrow, "Practical and Efficient Incorporation of Syntactic Features into
Statistical Language Models," PhD dissertation, Johns Hopkins University,
Baltimore, Maryland, May 2012.

[5] R. Fabri et al., in Natural Language Processing of Semitic Languages, I. Zitouni,
Ed. Springer, 2014, pp. 3-38. [E-book].

[6] SRI International, "SRI Language Modeling Toolkit." [Online]. Available:
https://www.sri.com/engage/products-solutions/sri-language-modeling-toolkit.
[Accessed April 1, 2018].

[7] Andreas Stolcke, "SRILM - An Extensible Language Modeling Toolkit," Speech
Technology and Research Laboratory, SRI International, Menlo Park, CA, USA.
[Online]. Available: http://www.speech.sri.com/. [Accessed April 1, 2018].
