Mary’s University
REPORT ON: LANGUAGE MODELING FOR TIGRIGNA LANGUAGE
June, 2018
CONTENTS
1 INTRODUCTION
2 LANGUAGE MODELING FOR TIGRIGNA LANGUAGE
2.1 Data
2.2 Software
3 EVALUATIONS
4 SMOOTHING N-GRAM MODEL
4.1 Add-one/Laplace Smoothing
REFERENCES
1 INTRODUCTION
This practical assignment report is about constructing a language model and testing it, that is, estimating probabilities for arbitrary strings of symbols (in our case, sequences of Tigrigna words). Data for the assignment have been generated using a probabilistic grammar.
Natural Language Processing (NLP), also variously called computer speech and language processing, human language technology, or computational linguistics, is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages [33] [34].
The field of NLP involves making computers perform useful tasks with the natural languages humans use. The input and output of an NLP system can be speech or written text.
Automatic Speech Recognition (ASR) and Statistical Machine Translation (SMT), among other natural language applications, rely on a language model (LM) to generate coherent natural language text. The LM plays a crucial role in identifying the correct word sequences by providing a strong prior over word sequences in the often prohibitively large and complex hypothesis space of these systems [4].
Language models are important for different NLP applications, such as [2]:
- In statistical machine translation, a language model characterizes the target language and captures fluency.
- Selecting among alternatives in summarization and generation.
- Text classification (style, reading level, language, topic, ...).
- Language models can be used for more than just words: letter sequences (language identification), speech-act sequence modeling, and case and punctuation restoration.
Whether estimating probabilities of next words or of whole sequences, the N-gram model is one of the most important tools in speech and language processing [3].
2 LANGUAGE MODELING FOR TIGRIGNA LANGUAGE
In this section we assign probabilities to Tigrigna sentences and word sequences using N-grams. An N-gram is a sequence of N words: a 2-gram (or bigram) is a two-word sequence of words like "please turn", "turn your", and a 3-gram (or trigram) is a three-word sequence of words like "please turn your" [3]. In this part we estimate probabilities for the Tigrigna language using sample sentences from our corpus and N-grams.
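The way an N-gram model scores a sentence can be sketched in Python. This is a minimal illustration with a toy two-sentence corpus (placeholder English tokens; in our case the tokens would be Tigrigna words), using maximum-likelihood bigram estimates:

```python
from collections import Counter

# Toy tokenized corpus; <s> and </s> mark sentence boundaries.
corpus = [
    ["<s>", "please", "turn", "your", "book", "</s>"],
    ["<s>", "please", "turn", "around", "</s>"],
]

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter(p for sent in corpus for p in zip(sent, sent[1:]))

def sentence_prob(sent):
    """Maximum-likelihood bigram probability: product of C(w1 w2) / C(w1)."""
    prob = 1.0
    for w1, w2 in zip(sent, sent[1:]):
        prob *= bigram[(w1, w2)] / unigram[w1]
    return prob

sentence_prob(["<s>", "please", "turn", "around", "</s>"])  # 0.5
```

Here P(around | turn) = 1/2 because "turn" occurs twice in the toy corpus but is followed by "around" only once.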
2.1 Data
We first collected (downloaded) the corpus data from different websites such as VoA Tigrigna, BBC Tigrigna, SBS Tigrigna news, and the JW Tigrigna Bible. We then combined the corpora, tokenized the sentences, and split the corpus into training data (90%) and test data (10%).
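The 90/10 split described above can be sketched as follows (a minimal illustration; the sentence list is a stand-in for the combined, sentence-tokenized corpus):

```python
import random

# Stand-in for the combined, sentence-tokenized corpus.
sentences = [f"sentence {i}" for i in range(100)]

random.seed(0)             # fixed seed so the split is reproducible
random.shuffle(sentences)  # mix the different sources before splitting

cut = int(0.9 * len(sentences))
train_set, test_set = sentences[:cut], sentences[cut:]
len(train_set), len(test_set)  # (90, 10)
```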
2.2 Software
We used the SRILM software to develop and test the language model. SRILM is a collection of C++ libraries, executable programs, and helper scripts designed to allow both production of and experimentation with statistical language models for speech recognition and other applications. SRILM is freely available for noncommercial purposes [7]. The SRI Language Modeling (SRILM) toolkit offers tools for building and applying statistical language models (LMs) for use in speech recognition, statistical tagging and segmentation, and machine translation [6].
A language model assigns conditional probabilities such as P(the | its water is so transparent that), the probability of the word "the" given the preceding words.
Unigram counts form a vector containing the number of occurrences in the dataset of each word. For example, ኺበልዕ = 6 and ይዕመጻ = 2 are the numbers of times these words occur in the dataset.
In the unigram case, an unseen word gets a zero probability, which is generally a bad idea, since you often don't want to rule things out entirely when you have finite training data. The workhorse program for estimating n-gram models in SRILM is ngram-count. We can try this for counting unigrams in our Tigrigna-Corpus:
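What ngram-count computes for order 1 can be sketched in Python (the toy lines and counts below are illustrative stand-ins for Tigrigna-Corpus):

```python
from collections import Counter

# Count every word token, as ngram-count does for order 1.
lines = ["ናይ ምስሊ መግለጺ", "ኣብ ምስሊ ናይ", "ናይ መለኸት"]
unigram_counts = Counter(w for line in lines for w in line.split())

unigram_counts["ናይ"]   # 3
unigram_counts["ጽሑፍ"]  # 0: an unseen word, hence a zero MLE probability
```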
Word    ናይ        ምስሊ       መግለጺ      ኣብ        ድርኩኺት     ምጥፋእ      ዝርከብ      መለኸት
P(.)    4.141222   2.889108   2.874561   1.635011   4.461557   4.461557   3.622136   4.122739
Table 1. Unigram probabilities (negative base-10 log values) for eight words in the Tigrigna Project corpus of 9325 lines.
The workhorse program for estimating n-gram models in SRILM is ngram-count. We can try this for counting bigrams in our Tigrigna-Corpus:
          ናይ         ምስሊ         መግለጺ        ኣብ          ድርኩኺት       ምጥፋእ        ዝርከብ        መለኸት
ናይ        3.103119    0.851481     0            3.404149     0            0            3.103119     3.404149
ምስሊ       0           0            0.06117361   0            0            0            0            0
መግለጺ      0           0            0            0.9123856    0            0            0            0
ኣብ        2.01668     0            0            3.867939     2.964849     3.867939     2.788757     0
ድርኩኺት     1.041393    0            0            0            0            1.041393     0            0
ምጥፋእ      0           0            0            1.041393     0            0            0.7403627    0
ዝርከብ      1.880814    0            0            0            0            0            0            1.579784
መለኸት      0           0            0            1.079181     0            0            0            0
Table 2. Bigram probabilities (negative base-10 log values) for eight words in the Tigrigna Project corpus of 9325 lines.
Trigram counts form a table of counts of three adjacent words in the dataset. For example, መንእሰያትን ድኽመታት ከምዘሎ = 1 and ክርእዮ ኸሎ ኣዝዮም = 2 are the numbers of times these trigrams occur in the dataset. The workhorse program for estimating n-gram models in SRILM is ngram-count. We can try this for counting trigrams in our Tigrigna-Corpus:
Figure 3. Example of trigram counts (taken from a screenshot).
The resulting model file begins with a header that tells you how many unique n-gram types were observed of each order n, up to the maximum order of the model:
Example output:

\data\
ngram 1=71381
ngram 2=219164
ngram 3=25085
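The header can also be read programmatically. A small sketch that extracts the n-gram type counts from the \data\ section of an ARPA-format model file (using the counts shown above):

```python
# Counts as shown in the model-file header above.
header = """\\data\\
ngram 1=71381
ngram 2=219164
ngram 3=25085"""

sizes = {}
for line in header.splitlines():
    if line.startswith("ngram "):
        # Each line has the form "ngram <order>=<count>".
        order, count = line.split()[1].split("=")
        sizes[int(order)] = int(count)

sizes  # {1: 71381, 2: 219164, 3: 25085}
```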
3 EVALUATIONS
For an intrinsic evaluation of a language model we need a test set. The probabilities of an N-gram model come from the corpus it is trained on, the training set or training corpus. We can then measure the quality of an N-gram model by its performance on some unseen data called the test set or test corpus [3].
Calculating model perplexity with SRILM: once you have a language model written to a file, you can calculate its perplexity on a new dataset using SRILM's ngram command, with the -lm option to specify the language model file and the -ppl option to specify the test-set file.
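The quantity being reported can be sketched as follows: perplexity is the inverse probability of the test set, normalized by the number of words. A minimal illustration with made-up base-10 log probabilities (SRILM works in base-10 logs):

```python
# One illustrative log10 probability per test-set word.
logprobs = [-2.5, -3.0, -1.5, -2.0]

# perplexity = 10 ** (-(total log10 probability) / word count)
ppl = 10 ** (-sum(logprobs) / len(logprobs))
round(ppl, 2)  # 177.83
```

A lower perplexity on the test set means the model predicts the unseen data better.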
4 SMOOTHING N-GRAM MODEL
4.1 Add-one/Laplace Smoothing
The simplest way to do smoothing is to add one to all the bigram counts before we normalize them into probabilities. All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so on. This algorithm is called Laplace smoothing [3].
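The Laplace estimate can be sketched with toy bigram counts (illustrative numbers; V is the vocabulary size, which joins the denominator because one imaginary count is added for every possible next word):

```python
from collections import Counter

# Toy counts; in the report these would come from the Tigrigna corpus.
unigram = Counter({"ናይ": 4, "ኣብ": 2, "ምስሊ": 1})
bigram = Counter({("ናይ", "ምስሊ"): 2, ("ኣብ", "ናይ"): 1})
V = len(unigram)  # vocabulary size

def laplace(w1, w2):
    # P_add1(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V)
    return (bigram[(w1, w2)] + 1) / (unigram[w1] + V)

laplace("ናይ", "ምስሊ")  # seen bigram:   (2 + 1) / (4 + 3) = 3/7
laplace("ናይ", "ኣብ")   # unseen bigram: (0 + 1) / (4 + 3) = 1/7, not zero
```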
         ናይ ምስሊ    ምስሊ መግለጺ    መግለጺ ኣብ    ኣብ ድርኩኺት    ድርኩኺት ምጥፋእ    ምጥፋእ ዝርከብ    ዝርከብ መለኸት
ናይ       0          2.30186       0           0            0               0             0
ምስሊ      0          0             3.13974     0            0               0             0
መግለጺ     0          0             0           0            0               0             0
ኣብ       0          0             0           0            0               0             0
ድርኩኺት    0          0             0           0            0               0             0
ምጥፋእ     0          0             0           0            0               0             0
ዝርከብ     0          0             0           0            0               0             0
መለኸት     0          0             0           0            0               0             0
Table 6. Add-one smoothed trigram probabilities for eight words in the Tigrigna Project corpus of 9325 lines (rows give the first word of the trigram; columns give the following two words).
REFERENCES
[1] Benjamin Elisha Sawe, "What Languages Are Spoken in Ethiopia," worldatlas.com, August 1, 2017. [Online]. Available: www.worldatlas.com/articles/what-languages-are-spoken-in-ethiopia.html. [Accessed April 1, 2018].
[2] D. Hillard and S. Petersen, lecture notes, topic: "N-gram Language Modeling Tutorial."
[3] Daniel Jurafsky and James H. Martin, "Speech and Language Processing," draft of August 7, 2017.
[4] Ariya Rastrow, "Practical and Efficient Incorporation of Syntactic Features into Statistical Language Models," PhD dissertation, Johns Hopkins University, Baltimore, Maryland, May 2012.