
Learning agglutinated Indian languages with linguistically motivated Adaptor Grammars
Author 1, Affiliation 1, email1@domain1.com
Author 2, Affiliation 2, email2@domain2.com

Abstract
In this paper we learn the complex agglutinative morphology of Indian languages using adaptor grammars and linguistic rules of morphology. Adaptor grammars are a compositional Bayesian framework for grammatical inference; we define a morphological grammar for agglutinated languages, and morpheme boundaries are inferred from a corpus of plain text. Once the grammar produces a morphological segmentation, regular expressions encoding sandhi and orthographic rules are applied to obtain the final segmentation. We test our algorithm on three complex languages from the Dravidian family and evaluate the results against other state-of-the-art unsupervised morphology learning systems.

1 Introduction

Morphological processing is an important component of natural language processing systems. In morphological processing, a word is segmented into its constituent morphemes. Most morphological systems are hand-built, which is a time-consuming and costly process; finite-state morphology learning (Beesley, 1998) is one example. For this reason, least-resourced languages lack this important component, which acts as a major hurdle for building NLP systems. Unsupervised learning of morphology is one solution to this problem. In unsupervised morphology learning systems (for a survey, see Hammarstrom and Borin, 2011), the morphology of a language is learned from a corpus of plain text, with statistical measures guiding the learning process. Unsupervised morphology learning systems produce state-of-the-art results for many languages, such as English and Finnish (Creutz and Lagus, 2005; Goldsmith, 2001). On Dravidian languages, however, they produce poor results because of a lack of knowledge of the orthography and of morphological complexities such as sandhi, a morpho-phonemic change that occurs at word or morpheme boundaries during concatenation. In Section 1.1, we briefly discuss the morphological properties and orthography of popular languages of the Dravidian family that make unsupervised learning difficult. There have been efforts to test Dravidian languages on state-of-the-art systems, but the results are poor, e.g. vasudevanlittle and (Bhat, 2012). These studies suggest a combined rule-based and statistical model that can work well on these languages.
Recent research in morphology learning has shifted towards semi-supervised learning, which produces better results than fully unsupervised learning (Kohonen et al., 2010a; Kohonen et al., 2010b). Inspired by these works, we propose a semi-supervised morphological processing system based on adaptor grammars and linguistic rules to deal with the complex orthography of these languages. Our system combines statistical and rule-based methods.
Adaptor grammars are Bayesian non-parametric models that can express linguistic structure. They are a non-parametric extension of Probabilistic Context-Free Grammars (PCFGs), designed for unsupervised structure learning, and have been applied successfully to various natural language processing tasks. In Section 1.2, we give an informal definition of adaptor grammars and the inference procedure.
We use adaptor grammars to learn a model of morphology; once the model produces its output, we use regular expressions created from morphological rules to refine the results. The key observation is that, because these languages are agglutinative, suffixes are stacked together to create long word forms: as the length of a word increases, the number of morphemes it contains also increases.

We test our system on three major languages of the Dravidian family, namely Tamil, Malayalam and Kannada. As these languages are least-resourced, we created a plain-text corpus from Wikipedia. In the results section, we compare our results with other state-of-the-art unsupervised morphological processing systems.
1.1 Challenges related to Dravidian languages

The Dravidian family of languages comprises the mother tongues of more than 150 million people. These languages are highly agglutinative and inflected. The most widely spoken members of the family are Tamil, Telugu, Kannada and Malayalam; in this study we focus on Tamil, Kannada and Malayalam. These languages use alphasyllabic writing systems: in Malayalam (Taylor and Olson, 1995), for example, /ka/ represents a consonant together with the short vowel /a/, and similarly in the other languages. They also contain a large number of diacritics and digraphs. As a result, morpheme boundaries are marked at the syllable level. Because these languages are agglutinative, they create long word forms, and because the writing system is alphasyllabic, the letters are syllables rather than individual characters. All these languages are inflected: nouns inflect for case, gender, number and postposition, and verbs inflect for tense and gender. Phonological changes occur during the concatenation of morphemes and words, resulting in a change in orthography; for example, in Tamil, nari + ya → nariyya (Steever, 1998).
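Such boundary alternations can be encoded as rewrite rules over transliterated text. As a rough illustration (the transliteration and the glide rule below are invented for exposition, not the actual Tamil rule), a sandhi rule and its inverse might be sketched as:

```python
import re

def join_with_glide(stem: str, suffix: str) -> str:
    """Concatenate stem and suffix, inserting a glide y after a front vowel.

    A hypothetical sandhi rule: a stem-final front vowel followed by a
    vowel-initial suffix surfaces with an intervening glide.
    """
    if re.search(r"[iīe]$", stem) and re.match(r"^[aāo]", suffix):
        return stem + "y" + suffix
    return stem + suffix

def split_glide(word: str) -> list:
    """Undo the glide insertion to recover candidate underlying morphs."""
    m = re.match(r"^(.*[iīe])y([aāo].*)$", word)
    return [m.group(1), m.group(2)] if m else [word]
```

Running the pair in both directions makes the rule invertible, which is what the post-processing step relies on.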
Compounding is another challenge for unsupervised learning. In compounds, sub-compounds are embedded within a compound and themselves become further sub-compounds (Mohanan, 1986). For example, meesha + petti + kasala means 'tables, chairs and box'. The languages contain co-compounds and sub-compounds with phonological changes, for example ka.t.tilemaram 'forest-tree', which is a co-compound, and tvan..ti 'train'. A further challenge is that named entities are also inflected with case and number markers, which becomes more difficult when the named entity is a loaned word: for example Computerinte, which is a combination of Computer and a case marker.
1.2 Adaptor Grammars

An adaptor grammar is a 7-tuple (N, W, R, S, θ, A, C), where (N, W, R, S, θ) is a PCFG. In this PCFG, N is the set of non-terminals and W the set of terminals, S ∈ N is the start symbol, R is the rule set and θ is the vector of rule probabilities, with θ_r the probability of rule r ∈ R. A ⊆ N is the set of adapted non-terminals, and C is a vector of adaptors indexed by the elements of A, such that C_X is the adaptor for adapted non-terminal X ∈ A.
C_X maps the base distribution H_X to a distribution over trees whose support is T_X, the set of subtrees whose root is X ∈ N. In an adaptor grammar, H_X is determined by the PCFG rules expanding X and the probability distribution θ; for more information, see (Johnson et al., 2006). Various non-parametric stochastic processes can be used as adaptors, such as the Dirichlet Process; Johnson uses the Dirichlet Process as the adaptor for word segmentation of Sesotho (Johnson, 2008). Adaptor grammars have been applied to various NLP tasks, such as word segmentation (Johnson, 2008), named entity recognition (Elsner et al., 2009) and machine transliteration (Wong et al., 2012).
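As an informal illustration of the base distribution H_X, the sketch below samples derivation trees from a toy PCFG; the grammar fragment, syllables and probabilities are invented for illustration only:

```python
import random

# A toy PCFG: each non-terminal maps to (right-hand side, probability) pairs.
# Strings absent from RULES are treated as terminal syllables.
RULES = {
    "Word":   [(["Stem", "Suffix"], 0.7), (["Stem"], 0.3)],
    "Stem":   [(["ka"], 0.5), (["ma"], 0.5)],
    "Suffix": [(["il"], 0.6), (["um"], 0.4)],
}

def sample_tree(symbol):
    """Sample a derivation tree rooted at `symbol` from the PCFG (this is H_X)."""
    if symbol not in RULES:              # terminal syllable
        return symbol
    expansions, weights = zip(*RULES[symbol])
    rhs = random.choices(expansions, weights=weights)[0]
    return (symbol, [sample_tree(s) for s in rhs])

def yield_of(tree):
    """Read off the terminal string (the word form) of a sampled tree."""
    if isinstance(tree, str):
        return tree
    return "".join(yield_of(child) for child in tree[1])
```

An adaptor C_X then reweights whole trees drawn from this base distribution, caching frequent subtrees rather than re-deriving them rule by rule.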

2 Adaptor grammars on Dravidian languages and the inference procedure

We use the Pitman-Yor process (Pitman and others, 2002) for learning the complex morphology of Dravidian languages. The Pitman-Yor process is a stochastic process that can be represented in the form of the Chinese Restaurant Process metaphor, and this representation supports inference in the model. Adding a Pitman-Yor adaptor to a PCFG relaxes the assumption that rewrite rules are chosen independently: the Pitman-Yor process is a caching model, meaning that it increases the probability of frequently occurring trees. We define a model similar to that of (Goldwater et al., 2005). Our model is more complex than theirs, because they assume that a word is composed of one stem and one suffix; in agglutinative languages this assumption does not hold, since many suffixes can be stacked together to form a word consisting of many morphemes. Considering this, we define a richer model, Word → stem suffixes, where suffixes acts as a submorph, as in (Sirts and Goldwater, 2013).
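The caching behaviour can be seen directly in the Chinese Restaurant representation. The following sketch (with arbitrary discount and concentration values, not those used in our experiments) shows how frequently seated labels become more probable with each draw:

```python
import random

class PitmanYorCRP:
    """Chinese Restaurant Process view of a Pitman-Yor process.

    Each table is labelled with a draw from the base distribution; existing
    tables are re-served with probability proportional to (count - discount),
    producing the rich-get-richer caching effect.
    """

    def __init__(self, discount, concentration, base):
        self.d = discount          # 0 <= d < 1
        self.c = concentration     # c > -d
        self.base = base           # zero-argument function: fresh base draw
        self.tables = []           # customers seated at each table
        self.labels = []           # label served at each table

    def draw(self):
        n = sum(self.tables)
        # probability of opening a new table (always 1 for the first customer)
        p_new = (self.c + self.d * len(self.tables)) / (self.c + n) if n else 1.0
        if random.random() < p_new:
            self.tables.append(1)
            self.labels.append(self.base())
            return self.labels[-1]
        # otherwise re-seat at an existing table, proportional to count - d
        weights = [cnt - self.d for cnt in self.tables]
        r = random.uniform(0, sum(weights))
        for i, w in enumerate(weights):
            r -= w
            if r <= 0:
                self.tables[i] += 1
                return self.labels[i]
        self.tables[-1] += 1       # numerical fallback
        return self.labels[-1]
```

With a base distribution that generates suffix strings, repeated draws concentrate probability mass on a small inventory of frequent suffixes, which is exactly the behaviour we want for agglutinative morphology.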

Consider an agglutinated word form from Malayalam: sansthanann al.ileannan.. A PCFG tree can represent any segmentation of this word, such as san + sthana + n n al.il + eannan.; infinitely many segmentations are possible, but we need only the correct morpheme segmentation, which is sansth + anann al.ilea + nnan.. From this we define a general model of the morpheme structure of these languages:
Word → stem suffixes
Word → stem
stem → syllables
suffixes → M1 M2 M3 M4 M5
suffixes → M1
suffixes → M1 M2
M1 → syllables
Here, suffixes is an adapted non-terminal, and M1, M2, M3, M4 and M5 can be various morphological features, such as gender, number and case. This grammar represents every segmentation that can be produced, but we place a Pitman-Yor process on the adapted non-terminal suffixes. It is also important to note that the morphemes in a word are not completely independent entities; there are various dependencies between morphemes. We model this as a bigram dependency, so the grammar defined above is similar to the collocation adaptor grammar described in (Johnson, 2008), with syllables as terminals. We use the inference algorithm described in (Johnson et al., 2006).
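To make the grammar concrete, the sketch below enumerates the analyses the rules above license for a syllabified word: a non-empty stem followed by at most five non-empty suffix morphs. The helper names and the syllabification are hypothetical:

```python
def segmentations(syllables, max_suffixes=5):
    """Enumerate analyses licensed by Word -> stem suffixes | stem,
    where the stem and every suffix slot cover at least one syllable."""
    n = len(syllables)
    results = []
    for stem_end in range(1, n + 1):          # stem must be non-empty
        stem = "".join(syllables[:stem_end])
        rest = syllables[stem_end:]
        if not rest:
            results.append([stem])            # Word -> stem
            continue
        for tail in suffix_splits(rest, max_suffixes):
            results.append([stem] + tail)     # Word -> stem suffixes
    return results

def suffix_splits(syllables, k):
    """Split syllables into at most k non-empty suffix morphs."""
    if not syllables:
        return [[]]
    if k == 0:
        return []
    out = []
    for i in range(1, len(syllables) + 1):
        head = "".join(syllables[:i])
        for tail in suffix_splits(syllables[i:], k - 1):
            out.append([head] + tail)
    return out
```

The enumeration grows exponentially in word length; the Pitman-Yor adaptor over suffixes is what biases inference towards the few segmentations whose suffix morphs recur across the corpus.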
2.1 Morphological rules as regular expressions

We encode the sandhi rules as regular expressions. Our work is similar to (Vempaty and Nagalla, 2011), but we create FST rules at the syllable level. The idea is to transliterate a syllable into the corresponding orthographic script depending on the position of the syllable. For example, in the Malayalam word marat the, if the syllable t occurs word-medially it is realised by one script symbol, and otherwise by another. We use the forma library for this purpose. The algorithm is as follows.
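A minimal sketch of this post-processing step (the rule table and the example transliterations are invented for illustration, not our actual rule set):

```python
import re

# Each entry pairs a pattern over a proposed morph boundary with a
# replacement that restores the underlying split. Patterns operate on
# "morph + morph" strings produced by the adaptor grammar.
SANDHI_RULES = [
    # undo glide gemination at a boundary: nariy + ya -> nari + ya
    (re.compile(r"iy \+ y"), "i + y"),
    # restore a glide before a vowel-initial suffix: i + a -> i + ya
    (re.compile(r"i \+ a"), "i + ya"),
]

def apply_sandhi_rules(segmentation: str) -> str:
    """Rewrite a space-separated segmentation with each matching rule."""
    for pattern, replacement in SANDHI_RULES:
        segmentation = pattern.sub(replacement, segmentation)
    return segmentation
```

Because the rules fire only where their boundary context matches, segmentations without sandhi pass through unchanged.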

3 Data and Experiments

To test our method, we extracted a corpus of five million words for each language from Wikipedia and newspaper websites. The scripts of the languages were converted to 8-bit extended ASCII to deal with the complex orthography. The conversion works as follows: the Malayalam word .tarcal is converted to tutarcl, with the non-ASCII character .t converted to the ASCII character t. We then run the inference algorithm for 100 iterations, take different samples and process them with the regular expressions of the languages.
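The script-to-ASCII conversion can be sketched as a simple character mapping; the table below is a tiny invented fragment, not the full conversion table used in our experiments:

```python
# Map non-ASCII romanisation symbols to ASCII stand-ins, one character at
# a time. Unmapped characters pass through unchanged.
ASCII_MAP = {
    "ṭ": "t",   # retroflex t -> plain ASCII t
    "ṇ": "n",   # retroflex n
    "ḷ": "l",   # retroflex l
    "ā": "a",   # long a
}

def to_ascii(text: str) -> str:
    """Replace non-ASCII romanisation symbols with ASCII stand-ins."""
    return "".join(ASCII_MAP.get(ch, ch) for ch in text)
```

The mapping must be unambiguous within a language so the original script can be restored after segmentation.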

4 Evaluation, Error Analysis and Discussion

To evaluate the presented algorithms, we use 10,000 morphologically segmented words for each language1. The evaluation measures how well the methods predict the morpheme boundaries, from which we calculate precision, recall and F-score. Information about the test data is provided in the table below. We used the Python suite provided on the Morpho Challenge website for evaluation. We also trained the Morfessor baseline, Morfessor-CAP and Undivide systems with the same number of tokens.
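Boundary-based evaluation can be sketched as follows (an illustration of the metric, not the Morpho Challenge script itself):

```python
def boundary_prf(gold, predicted):
    """Precision, recall and F-score over morpheme boundary positions.

    Each argument is a list of morphs for the same word, e.g.
    gold = ["mara", "ttil"]; boundaries are character offsets.
    """
    def boundaries(morphs):
        positions, offset = set(), 0
        for morph in morphs[:-1]:      # no boundary after the last morph
            offset += len(morph)
            positions.add(offset)
        return positions

    g, p = boundaries(gold), boundaries(predicted)
    correct = len(g & p)
    precision = correct / len(p) if p else 1.0
    recall = correct / len(g) if g else 1.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

Word-level scores are then averaged over the test set.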
                      Tamil    Kannada   Malayalam
Token frequency       500000   500000    500000
No. segmented tokens  10000    10000     10000
No. RE expressions    34       62        34
No. suffixes          22       28        38
The results of the experiment are presented in Table 1.

Table 1: Results compared to state-of-the-art systems

Method                      Kannada        Malayalam      Tamil
Morfessor-base              48.1   60.4    53.5   47.3    60.0   52.9
NPY                         66.8   58.0    62.1   60.3    59.6   59.9
Morfessor-CAP               66.8   58.0    62.1   60.3    59.6   59.9
Adaptor grammar and rules   66.8   58.0    62.1   60.3    59.6   59.9
Undivide                    66.8   58.0    62.1   60.3    59.6   59.9

5 Conclusion and future research

We have presented a semi-supervised morphology learning technique that uses statistical measures and linguistic rules. The proposed method outperforms other state-of-the-art unsupervised morphology learning techniques. Another important aspect of our experiments is that we tested adaptor grammars on real-world data from highly agglutinative and complex languages. We also used a large amount of data to train the model, whereas previous experiments were carried out on toy corpora.

1 Available on the website.


References
Kenneth R Beesley. 1998. Arabic morphology using only finite-state operations. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, pages 50–57. Association for Computational Linguistics.
Suma Bhat. 2012. Morpheme segmentation for Kannada standing on the shoulder of giants. In 24th International Conference on Computational Linguistics, page 79.
Mathias Creutz and Krista Lagus. 2005. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Helsinki University of Technology.
Micha Elsner, Eugene Charniak, and Mark Johnson. 2009. Structured generative models for unsupervised named-entity clustering. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 164–172. Association for Computational Linguistics.
John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198.
Sharon Goldwater, Mark Johnson, and Thomas L Griffiths. 2005. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, pages 459–466.
Harald Hammarstrom and Lars Borin. 2011. Unsupervised learning of morphology. Computational Linguistics, 37(2):309–350.
Mark Johnson, Thomas L Griffiths, and Sharon Goldwater. 2006. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems, pages 641–648.
Mark Johnson. 2008. Unsupervised word segmentation for Sesotho using adaptor grammars. In Proceedings of the Tenth Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 20–27. Association for Computational Linguistics.

Oskar Kohonen, Sami Virpioja, and Krista Lagus. 2010a. Semi-supervised learning of concatenative morphology. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 78–86. Association for Computational Linguistics.
Oskar Kohonen, Sami Virpioja, Laura Leppanen, and Krista Lagus. 2010b. Semi-supervised extensions to Morfessor baseline. In Proceedings of the Morpho Challenge 2010 Workshop, pages 30–34.
Karuvannur P Mohanan. 1986. The Theory of Lexical Phonology: Studies in Natural Language and Linguistic Theory. Dordrecht: D. Reidel.
Jim Pitman et al. 2002. Combinatorial stochastic processes. Technical Report 621, Dept. of Statistics, UC Berkeley. Lecture notes for St. Flour course.
Kairit Sirts and Sharon Goldwater. 2013. Minimally-supervised morphological segmentation using adaptor grammars. Transactions of the Association for Computational Linguistics, 1:255–266.
Sanford B Steever. 1998. The Dravidian Languages. Routledge, London.
Insup Taylor and David R Olson. 1995. Scripts and Literacy: Reading and Learning to Read Alphabets, Syllabaries, and Characters, volume 7. Springer Science & Business Media.
Phani Chaitanya Vempaty and Satish Chandra Prasad Nagalla. 2011. Automatic sandhi splitting method for Telugu, an Indian language. Procedia - Social and Behavioral Sciences, 27:218–225.
Sze-Meng Jojo Wong, Mark Dras, and Mark Johnson. 2012. Exploring adaptor grammars for native language identification. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 699–709. Association for Computational Linguistics.
