Graph-Based Morphological Analysis

JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 19, ISSUE 2, JULY 2013
S. Iazzi, A. Yousfi, M. Bellafkih, and D. Aboutajdine
Abstract: In this article, we propose a new morphological analysis system for Arabic words. The system combines the Buckwalter approach with a graph-based approach to morphological analysis. It relies on very restricted dictionaries and searches for solutions in a global network by using the Viterbi algorithm. Our approach was tested on a corpus of 4000 Arabic words, of which 92% were correctly analyzed.
Index Terms: Correspondence Table, Graph, Viterbi Algorithm, Morphological Analysis, Buckwalter Analyzer.

1 INTRODUCTION
Arabic morphological analysis is one of the most frequently used tools in automatic language processing applications such as document retrieval, electronic dictionaries and tagging systems.
Several studies have aimed to develop morphological analyzers for the Arabic language. In the present study, we focus on two of these approaches: finite-state automata and the Buckwalter analyzer [5].
- Analyzer based on a finite-state automaton.
Finite-state automata are used to represent morphological relations in a given language. A finite-state automaton delimits the possible combinations between the letters of a given lexicon.
Finite-state automata are defined by the following elements:
- A finite set of states: $Q = \{q_I, q_1, q_2, \ldots, q_n, q_F\}$
- A transition function that takes a state and a symbol and forwards to another state: $\delta : Q \times \Sigma \rightarrow Q$, where $\Sigma$ is the set of symbols
- One or several final states: $q_F \in Q$
Many authors have used finite-state automata in their analysis of the morphology of the Arabic language, for example in [2], [3], [4], [5], [9], [13] and [18].
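As an illustration only, the three elements of the definition above can be sketched in Python. The state names, the alphabet and the toy lexicon (one "a" followed by one or more "b") are assumptions chosen for this example; they do not come from any of the systems cited.

```python
from typing import Dict, Set, Tuple

def accepts(word: str,
            delta: Dict[Tuple[str, str], str],
            initial: str,
            finals: Set[str]) -> bool:
    """Return True if the automaton is in a final state after reading word."""
    state = initial
    for symbol in word:
        key = (state, symbol)
        if key not in delta:          # no transition defined: the word is rejected
            return False
        state = delta[key]
    return state in finals

# Hypothetical transition function: (state, symbol) -> next state.
DELTA = {
    ("qI", "a"): "q1",
    ("q1", "b"): "qF",
    ("qF", "b"): "qF",
}

if __name__ == "__main__":
    for w in ["ab", "abbb", "a", "ba"]:
        print(w, accepts(w, DELTA, "qI", {"qF"}))
```

The transition function is stored as a dictionary keyed by (state, symbol) pairs, so a missing key directly models a combination of letters that the lexicon does not allow.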
- The Buckwalter analyzer [5]:
Buckwalter developed a morphological analysis system for the Arabic language. The system determines all the possible segmentations of a word and looks up the resulting segments in the lists of stems, suffixes and prefixes. It then checks whether the elements are compatible with each other by consulting three tables of permitted combinations: prefix-stem, prefix-suffix and stem-suffix [5].
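The segment-and-check procedure described above can be sketched as follows. This is a minimal illustration and not Buckwalter's implementation: the lexicon entries and the three compatibility tables are hypothetical placeholders written in Latin letters, and the function names are ours.

```python
from typing import List, Tuple

# Hypothetical toy lexicon (placeholders, not real Buckwalter data).
PREFIXES = {"", "a", "al"}
STEMS = {"ktb", "lktb"}
SUFFIXES = {"", "h", "at"}

# Hypothetical compatibility tables: the pairs that may occur together.
PREFIX_STEM = {("", "ktb"), ("al", "ktb"), ("a", "lktb")}
STEM_SUFFIX = {("ktb", ""), ("ktb", "at"), ("lktb", "")}
PREFIX_SUFFIX = {("", ""), ("al", ""), ("", "at"), ("a", "")}

def segmentations(word: str) -> List[Tuple[str, str, str]]:
    """All (prefix, stem, suffix) splits of word with a non-empty stem."""
    n = len(word)
    return [(word[:i], word[i:j], word[j:])
            for i in range(n) for j in range(i + 1, n + 1)]

def analyze(word: str) -> List[Tuple[str, str, str]]:
    """Keep the splits whose parts are in the lexicon and pairwise compatible."""
    results = []
    for prefix, stem, suffix in segmentations(word):
        in_lexicon = (prefix in PREFIXES and stem in STEMS
                      and suffix in SUFFIXES)
        compatible = ((prefix, stem) in PREFIX_STEM
                      and (stem, suffix) in STEM_SUFFIX
                      and (prefix, suffix) in PREFIX_SUFFIX)
        if in_lexicon and compatible:
            results.append((prefix, stem, suffix))
    return results

if __name__ == "__main__":
    print(analyze("alktb"))
    print(analyze("alktbat"))
```

In this toy setting every stem variant of a root must be listed explicitly in STEMS, which is the dictionary-size issue discussed in the next paragraph.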
- Drawbacks of the two methods
- The finite-state automaton approach uses all the words of a lexicon to form a global network [18].
- A dictionary of stems is difficult to develop: it requires knowledge of the different morphological transformations of every root. Such a dictionary has the further disadvantage of being large, since one can find at least five stems for every root. For example, for the root "" in the dictionary of stems, one can find "" "" "" "" ""...
To solve these two problems, we propose a new approach that uses graphs and a restricted dictionary of stems to carry out the morphological analysis of a word.
We have named this new system the IBN-GINNY¹ Morphological Analyzer.
2 INTRODUCING THE IBN-GINNY ANALYZER
To reduce the size of the stem dictionary used in the Buckwalter analyzer [5], we use graphs to group together the prefixes, suffixes and infixes of the Arabic language. The search for a morphological analysis of a word is then carried out in the graph.
In this case, every root is represented by one single stem.
Example:

To put this idea into practice, we use a graph-based method in which every word is represented by a single graph: the radical letters are represented by one state that loops on itself, and the affixes are represented by the letters that form them. Consider the following example.
The words "", "", are represented in the following graph:

¹ We have chosen to name our analyzer after (and in honor of) the eminent Arab linguist Ibn-Ginny.

- S. Iazzi. Laboratory GSCM-LRIT, Mohammed V Agdal University, Faculty of Sciences, Rabat, Morocco.
- A. Yousfi. Team ERADIASS, Mohammed V Souissi University, FSJES, Rabat, Morocco.
- M. Bellafkih. Laboratory GSCM-LRIT, Mohammed V Agdal University, Faculty of Sciences, Rabat, Morocco.
- D. Aboutajdine. Laboratory GSCM-LRIT, Mohammed V Agdal University, Faculty of Sciences, Rabat, Morocco.


Fig. 1. Graph for the word "".
For the word "", a graph would consist of the following elements:
- $q_I$: the initial state.
- $q_F$: the final state.
- "": a letter of the prefix of the word "".
- "": an infix of the word "".
- "", "": two letters of the suffix of the word "".
- A = "", "", "": the letters that make up the root of the word, in which case state A loops three times.
Based on all the prefixes, infixes and suffixes of the Arabic language, a global network of states is built with one input state $q_I$ and one output state $q_F$.
Our global network is wholly defined by:
- The set of all states: this consists of all the letters that make up the suffixes, prefixes and infixes, the state A, the initial state qI and the final state qF:
Q={qI, qF, A,"","","","",,"","","",}
- The set of all possible transitions that link the letters of the suffixes, prefixes and infixes with the states A, qI and qF.
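A minimal sketch of how such a network could be assembled from affix lists alone is given below. It rests on several assumptions that are ours and not the paper's: states are tagged with their kind ("pre", "inf" or "suf") so that the same letter can appear in different roles, each affix is laid out as a chain of letter states, and the affixes used in the usage example are Latin placeholders.

```python
from typing import Iterable, Set, Tuple

State = Tuple[str, str]            # (kind, letter)
Q_I: State = ("qI", "")            # initial state
Q_F: State = ("qF", "")            # final state
A: State = ("A", "")               # root-letter state, loops on itself

def build_network(prefixes: Iterable[str],
                  infixes: Iterable[str],
                  suffixes: Iterable[str]
                  ) -> Tuple[Set[State], Set[Tuple[State, State]]]:
    """Return (states, transitions) built only from lists of affixes."""
    states: Set[State] = {Q_I, Q_F, A}
    # A word made of root letters only: qI -> A, A -> A, A -> qF.
    trans: Set[Tuple[State, State]] = {(Q_I, A), (A, A), (A, Q_F)}

    def chain(kind: str, affix: str, entry: State, exit_: State) -> None:
        """Link entry -> first letter -> ... -> last letter -> exit_."""
        if not affix:
            return
        nodes = [(kind, ch) for ch in affix]
        states.update(nodes)
        trans.add((entry, nodes[0]))
        trans.update(zip(nodes, nodes[1:]))
        trans.add((nodes[-1], exit_))

    for p in prefixes:
        chain("pre", p, Q_I, A)    # prefixes sit between qI and A
    for i in infixes:
        chain("inf", i, A, A)      # infixes sit between two root letters
    for s in suffixes:
        chain("suf", s, A, Q_F)    # suffixes sit between A and qF
    return states, trans

if __name__ == "__main__":
    states, trans = build_network(["y"], ["t"], ["wn"])
    print(len(states), "states,", len(trans), "transitions")
```

No stem or root appears anywhere in the construction: the single state A stands for every radical letter, which is why the network can be built without a lexical dictionary.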

Fig. 2. Example of the path the words "" and "" take
through the global network.
In order to analyze a given word $w$, one needs to search the global network for the different possible routes associated with $w$.
The set of these routes is given by:

$S = \{\, r \in B : P(w \mid r) \neq 0 \,\}$    (1)

$B$: the set of possible routes in our network.
$r$: a possible route in the global network, with the same length as $w$, which can represent the word $w$.
The solutions are all the paths that maximize the non null probability of emitting the word $w$ through these paths. To simplify and reduce the computation in formula (1), we use the Viterbi algorithm, which is given by the following formula:
$\delta_t(c_j) = \mathrm{NL}_{c_i}\big(\delta_{t-1}(c_i)\, a_{ij}\, \mathbb{1}_{c_j}(w_t)\big)$

NL(x) is the function that gives the non null values of x. We search for the states $c_i$ that yield non null values of $\delta_{t-1}(c_i)\, a_{ij}\, \mathbb{1}_{c_j}(w_t)$.
$\delta_T(q_F)$ is the maximal probability of emitting the word $w$ along a given path. Through recursive computation, one retrieves all the possible paths that yield these non null values of NL ($T$: the length of the word $w$).
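with:
$c_j$: the $j$-th state, $c_j \in Q$.
$a_{ij}$: the transition probability from state $c_i$ to state $c_j$:
$a_{ij} = \begin{cases} 1 & \text{if the transition is possible} \\ 0 & \text{otherwise} \end{cases}$
$w_t$: the $t$-th letter in the word $w$.
$\mathbb{1}_{c_j}(w_t) = \begin{cases} 1 & \text{if } c_j = w_t \\ 0 & \text{otherwise} \end{cases}$
We take:
$\mathbb{1}_{A}(w_t) = 1$
Initialization:
$\delta_0(c_i) = \begin{cases} 1 & \text{if } c_i = q_I \\ 0 & \text{otherwise} \end{cases}$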
Example:
Let the word be $w$ = "". The analysis is carried out according to the following steps:
$\delta_0(q_i) = \begin{cases} 1 & \text{if } q_i = q_I \\ 0 & \text{otherwise} \end{cases}$
$\delta_1(\text{""}) = \mathrm{NL}_{q_i}\big(\delta_0(q_i)\, a_{q_i \text{""}}\, \mathbb{1}_{\text{""}}(w_1)\big) = 1$
In fact: $\delta_0(q_I) = 1$ and $\delta_0(q_i) = 0 \;\; \forall q_i \neq q_I$.
$a_{q_I \text{""}} = 1$: all the states are linked to the initial state.
$\mathbb{1}_{\text{""}}(w_1) = 1$ because $w_1$ = "".
$\delta_1(A) = \delta_0(q_I)\, a_{q_I A}\, \mathbb{1}_{A}(w_1) = 1$
In the same way, we compute $\delta_t(c_i)$ for $t = 2, \ldots, 5$ and for all the states.


In the end, we find the following paths:

TABLE 1
POSSIBLE PATHS ASSOCIATED WITH THE WORD ""
Suffix  Prefix  Proposed paths
0  0  $q_I$ A A A A A $q_F$
0  $q_I$ A A A A $q_F$
0  $q_I$ A A A $q_F$
0  $q_I$ A A $q_F$
0  $q_I$ A A A $q_F$
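The recursion and the retrieval of paths described in this section can be sketched as follows. This is an illustrative reading of the formulas and not the authors' program: since every $a_{ij}$ and every indicator is 0 or 1, $\delta_t$ is stored as the set of states whose value is non null, together with all their predecessors. The state names ("pre:y", "suf:w", ...), the toy network and the Latin test word are assumptions made for the example.

```python
from typing import Dict, List, Set

def find_paths(word: str, succ: Dict[str, Set[str]]) -> List[List[str]]:
    """Enumerate every path qI -> ... -> qF that can emit word.

    succ[c_i] is the set of states c_j such that a_ij = 1.
    State "A" emits any letter; a state "kind:x" emits only the letter x.
    """
    def emits(state: str, letter: str) -> bool:
        return state == "A" or state.endswith(":" + letter)

    # delta[t]: states with a non null delta_t; back[t][c_j]: predecessors c_i.
    delta: List[Set[str]] = [{"qI"}]
    back: List[Dict[str, List[str]]] = [{}]
    for letter in word:
        preds: Dict[str, List[str]] = {}
        for ci in delta[-1]:
            for cj in succ.get(ci, ()):
                if cj != "qF" and emits(cj, letter):
                    preds.setdefault(cj, []).append(ci)
        delta.append(set(preds))
        back.append(preds)

    def backtrack(t: int, state: str) -> List[List[str]]:
        """All partial paths from qI that reach state at step t."""
        if t == 0:
            return [[state]]
        return [path + [state]
                for prev in back[t][state]
                for path in backtrack(t - 1, prev)]

    T = len(word)
    paths: List[List[str]] = []
    for last in delta[T]:
        if "qF" in succ.get(last, ()):        # the path must be able to end
            paths.extend(p + ["qF"] for p in backtrack(T, last))
    return sorted(paths)

# Hypothetical toy network: one prefix "y" and one suffix "wn".
SUCC = {
    "qI": {"A", "pre:y"},
    "pre:y": {"A"},
    "A": {"A", "suf:w", "qF"},
    "suf:w": {"suf:n"},
    "suf:n": {"qF"},
}

if __name__ == "__main__":
    for path in find_paths("yabwn", SUCC):
        print(" ".join(path))
```

Because state A accepts any letter, the search always proposes the path made of A states only, alongside the paths that read the first or last letters as affixes; the later verification steps are what discard the implausible ones.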

3 TESTS AND RESULTS
3.1 Tests
To evaluate our new approach, we first created separate dictionaries for suffixes, prefixes and stems. Then, starting from the list of prefixes, suffixes and infixes, we generated a global network of states, as previously described, without using lexical dictionaries (this is a great advantage of our analysis compared to the Buckwalter analyzer and to an analysis based on finite-state automata).

Fig. 3. Diagram of our global network
In developing the dictionary of stems, a distinction is made between two types of roots: unaffected roots and affected ones ()1. Unaffected roots are entered in our dictionary as the roots themselves. For the affected roots, however, we have kept the Buckwalter stems [5].
The root "" has five stems in this dictionary: . However, the unaffected root "" is represented only as "".
In this way, we have reduced the size of the dictionary of stems by 62.5% compared to that of Buckwalter [5].
In the same vein, we have developed a prohibition table between prefixes and suffixes, with the aim of ruling out impossible solutions.

TABLE 2
EXTRACT FROM THE PREFIX-SUFFIX PROHIBITION TABLE

3.2 Application of the approach:
To put our approach into practice, we developed a Java program that consists of three main classes:
- The ConstRseau class: it builds the global network from the list of prefixes, suffixes and infixes.
- The AnalyMorpholo class: it uses the Viterbi algorithm to search for all the possible paths associated with a given word in the global network built by the preceding class.
- The VerfCompatib class: it checks the validity of the solutions proposed by the AnalyMorpholo class, by checking that the root exists in the dictionary of stems and that the suffixes and prefixes are compatible.
Example:
For an analysis of the word "", our system follows these steps:
- The search for the different paths in our global network:
TABLE 3
POSSIBLE PATHS FOR THE WORD "" IN THE GLOBAL NETWORK WITH THEIR SUGGESTED ROOTS
Suffix  Prefix  Possible paths  Suggested roots  No.
0 AAAA 1
0 AAAAA 2
0 0 AAAAAAA 3
AAA 4
AAAA 5
0 AAAAAA 6
AA 7
AAA 8
0 AAAAA 9

- The extraction of the roots associated with the paths, by locating the positions of the "A" states in the word "".
- Checking the existence of roots in the stem dictionary.
This step keeps the following solutions only:
- -
- -
- Checking the compatibility between the suffixes and
prefixes for the remaining solutions. It keeps only:
- -
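The three steps above (root extraction, dictionary check, prefix-suffix check) can be sketched as follows. This is a hedged illustration and not the VerfCompatib class itself: the path encoding ("pre:x", "A", "suf:x"), the stem dictionary, the prohibition table and the Latin test word are all hypothetical.

```python
from typing import List, Set, Tuple

def split_by_path(word: str, path: List[str]) -> Tuple[str, str, str]:
    """Read (prefix, root, suffix) off a path qI ... qF, one state per letter."""
    inner = path[1:-1]                      # drop qI and qF
    if len(inner) != len(word):
        raise ValueError("path and word must have the same length")

    def pick(kind: str) -> str:
        # Collect the letters of word whose state is of the given kind.
        return "".join(ch for ch, st in zip(word, inner) if st.startswith(kind))

    return pick("pre"), pick("A"), pick("suf")

def verify(word: str,
           paths: List[List[str]],
           stems: Set[str],
           forbidden: Set[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Keep the analyses whose root is in the stem dictionary and whose
    (prefix, suffix) pair is not listed in the prohibition table."""
    kept = []
    for path in paths:
        prefix, root, suffix = split_by_path(word, path)
        if root in stems and (prefix, suffix) not in forbidden:
            kept.append((prefix, root, suffix))
    return kept

# Hypothetical candidate paths for the toy word "yabwn".
PATHS = [
    ["qI", "A", "A", "A", "A", "A", "qF"],
    ["qI", "pre:y", "A", "A", "A", "A", "qF"],
    ["qI", "A", "A", "A", "suf:w", "suf:n", "qF"],
    ["qI", "pre:y", "A", "A", "suf:w", "suf:n", "qF"],
]

if __name__ == "__main__":
    print(verify("yabwn", PATHS, {"ab", "yab"}, {("y", "wn")}))
```

The dictionary check runs before the compatibility check, mirroring the order of the steps listed above; in this sketch the order does not change the result, only how many candidates reach the second test.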
[Fig. 3 diagram labels: $q_I$, states of prefixes, A, states of suffixes, $q_F$]

Fig. 4. Diagram of the IBN-GINNY morphological analysis process.
3.3 Results:
The test was carried out on 4000 words that represent different grammatical categories (verbs, nouns...). 92% of these words were correctly analyzed, with our analyzer IBN-GINNY offering the different possible analyses for each of them, while it failed to do so for the remaining 8%. 90% of these errors are due to the fact that our analyzer does not take into account the compatibility relations between prefixes, roots and suffixes.
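For orientation, the percentages above translate into the following word counts, assuming the reported figures are exact rather than rounded:

```latex
\begin{align*}
\text{words correctly analyzed} &= 0.92 \times 4000 = 3680 \\
\text{words not analyzed} &= 0.08 \times 4000 = 320 \\
\text{failures due to the main cause named above} &= 0.90 \times 320 = 288 \\
\text{failures due to other causes} &= 320 - 288 = 32
\end{align*}
```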
4 CONCLUSION AND PERSPECTIVES
In the present article, we have proposed a method of graph-based morphological analysis. This method combines a dictionary-based approach with an approach based on finite-state automata.
Our approach has significantly reduced the size of the dictionary of stems. Moreover, building the global network requires no dictionary of stems, since this network is built only from a list of affixes.
On a test corpus of 4000 words, 92% of the words were correctly analyzed, which shows the promise of the approach. As future work, we plan to integrate two further correspondence tables (prefix-root and suffix-root) into this analyzer.
REFERENCES
[1] Blanchard, A. Analyse morphologique des réponses d'apprenants en environnement d'Apprentissage Assisté par Ordinateur. Université Stendhal-Grenoble III, UFR des Sciences du Langage.
[2] Audebert, C. and Jaccarini, A. (1988). De la reconnaissance des mots outils et des tokens. Annales islamologiques 24, Institut français d'archéologie orientale du Caire.
[3] Beesley, K. R. (1998). Arabic Morphology Using Only Finite-State Operations. Proceedings of the Workshop on Computational Approaches to Semitic Languages, Montreal, Quebec, pp. 50-57.
[4] Beesley, K. R. (1996). Arabic Finite-State Morphological Analysis and Generation. Proceedings of the 16th Conference on Computational Linguistics, Vol. 1, Copenhagen, Denmark: Association for Computational Linguistics, pp. 89-94.
[5] Buckwalter, T. (2002). Buckwalter Arabic Morphological Analyzer, Version 1.0. Linguistic Data Consortium, catalog number LDC2002L49, ISBN 1-58563-257-0.
[6] Darwish, K. (2002). Building a shallow Arabic morphological analyser in one day. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Philadelphia, PA, 2002.
[7] Darwish, K. (2002). Building a Shallow Morphological Analyzer in One Day. Proceedings of the Workshop on Computational Approaches to Semitic Languages in the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, PA, USA.
[8] El-Sadany, T. A. and Hashish, M. A. (1989). An Arabic Morphological System. IBM Systems Journal, Vol. 28, No. 4, 600-612.
[9] Gaubert, C. (1996). Analyse morphologique d'un texte par ordinateur. Résultats et évaluation. AnIsl 29, IFAO, pp. 283-311.
[10] Goldsmith, J. A. (2001). Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2), 153-198.
[11] Hanafi. (1914). - - - - , . 1914 .
[12] Hegazi, N. and ElSharkawi, A. (1986). Natural Arabic Language Processing. Proceedings of the 9th National Computer Conference and Exhibition, Riyadh, Saudi Arabia, 1-17.
[13] Iazzi, S., Yousfi, A., Bellafkih, M. and Aboutajdine, D. (2013). Morphological Analyzer of Arabic Words Using the Surface Pattern. IJCSI International Journal of Computer Science Issues, Vol. 10, Issue 2, No 1, March 2013, ISSN (Print): 1694-0814, ISSN (Online): 1694-0784.
[14] Jaccarini, A. (1997). Grammaires modulaires de l'arabe. Thèse de doctorat, Université de Paris-Sorbonne.
[15] Khoja, S. and Garside, R. (1999). Stemming Arabic text. Computer Science Department, Lancaster University, Lancaster, UK.
[16] Koskenniemi, K. (1983). Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Publication No. 11, Department of General Linguistics, University of Helsinki, Helsinki.
[17] Saliba, B. and Al-Dannan, A. (1989). Automatic Morphological Analysis of Arabic: A Study of Content Word Analysis. Proceedings of the First Kuwait Computer Conference, Kuwait, March 3-5.
[18] Yousfi, A. (2010). The morphological analysis of Arabic verbs by using the surface patterns. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 11, May 2010.
[19] Wehrli, E. (1997). L'analyse syntaxique des langues naturelles : problèmes et méthodes. Paris, Masson.
[Fig. 4 diagram labels: list of suffixes, prefixes, infixes; building the global network; word to analyze; morphological analysis (Viterbi algorithm), returning all possible paths associated with a word; checking against the correspondence table and the dictionary of stems; final results]
