
Urdu Grammar Report

PARC, July 24th, 2007


Urdu Grammar Report - ParGram Meeting PARC July 2007



Table of contents:
- Verbs
- Nouns
- Adjectives
- Pruning the network
- Reduplication
- FST Demo
- Correlatives

Current Grammar Team


- Miriam Butt
- Tina Bögel
- Annette Hautli
- Sebastian Roth
- Sebastian Sulger


Resources
- Grammar books, mainly Ruth Schmidt's Urdu Grammar and Eugene Glassman's Spoken Urdu
- Urdu classes at Konstanz
- Transliteration systems developed in various Master's theses
- Other computational work on Urdu morphology/lexicon: CRULP (Lahore); Savoie (http://www.lama.univsavoie.fr/~humayoun/UrduMorph/)


Current Activities
- Shift to a more principled, broader-coverage FST morphology
- Integration 90% complete
- Systematize and increase morphological tags
- Expand grammar (currently: correlatives, more complex predicates)


Morphological Analyzer
- Operates on an ASCII transliteration in order to allow for processing of both Urdu (Arabic-based script) and Hindi (Devanagari)
- FST transliterators from the Urdu and Hindi scripts exist, but remain to be integrated


Verbs
- Previous morphology: problems with massive overgeneration of verbal morphology
- New verbal morphology: flags included to impose restrictions on overgeneration (particularly in the future paradigm)
- So far 28 verbs included; need to differentiate more verb classes
- Irregularities are now treated via phonological rules


Verb Network size: 3

Anticipated Tokenization Problems


Future forms such as milEgI:

xfst[1]: up milEgI
mil+Verb+3P+Sg+Fut+Fem
mil+Verb+2P+Sg+Fut+Fem

are morphologically one word, but are not written that way in Urdu (only in Hindi).

Anticipated Tokenization Problems


ho-g-a
be-Fut-M.Sg

[Figure: the same poem by Ghalib, written in Hindi script and in Urdu script]



Nouns I
- So far: includes 216 nouns
- Size: 157.3 Kb, 4768 states
- Includes gender (masc / fem), number (pl / sg), case (Obl / Nom)
- Based largely on the use of flags (reducing rules and size (?))
- Version using more continuation classes is being worked on (efficiency difference?)

Nouns II
Types of nouns:
- Fem: kursI 'chair'; Masc: kamrA 'room'
- natural gender depending on subject: laRk(I/A) 'girl/boy'
- both genders (no gender marking): jIrAf 'giraffe'
- Arabic/Persian loanwords: tAliba 'female student', tAlib+Noun(Ar)+Fem+Pl+Nom

To Do:
- Need to redo natural gender
- Better differentiation of noun classes, pronouns, case
- Systematize and increase morphological tags (names, etc.)

Urdu Adjectives/Adverbs
- 262 adjectives so far
- Numbers up to 100 (each number has a different name)
- 38 adverbs so far
- 48.7 Kb, 1418 states, 1995 arcs, 1397 paths
- Difference between unmarked and marked adjectives


Unmarked Adjectives
Don't overtly agree with the noun they modify:

amIr laRkA      rich boy+Sg+Masc
amIr laRkIAN    rich girl+Pl+Fem


Marked Adjectives
Overt morphology for gender, number, and case, which agrees with the noun they modify.


Marked Adjectives
e.g.

cHOtA little+Sg+Masc+Nom    laRkA boy+Sg+Masc
cHOtI little+Sg+Fem         laRkI girl+Sg+Fem
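The marked/unmarked contrast can be sketched in a few lines of Python. This is only an illustration, not the FST itself: the function name `inflect_adjective` is hypothetical, the rule "an adjective ending in A is marked" is a simplifying assumption, and the masculine oblique/plural ending E is an assumption taken from standard Urdu grammar rather than from these slides.

```python
def inflect_adjective(stem, gender, number="Sg", case="Nom"):
    """Return the agreeing form of an adjective in the ASCII transliteration.

    Assumption: adjectives ending in A are 'marked'; all others are 'unmarked'.
    """
    # Unmarked adjectives (e.g. amIr) never change their form.
    if not stem.endswith("A"):
        return stem
    base = stem[:-1]
    # Feminine forms take I (cHOtA -> cHOtI).
    if gender == "Fem":
        return base + "I"
    # Masculine singular nominative keeps A.
    if number == "Sg" and case == "Nom":
        return base + "A"
    # Assumed: masculine plural and oblique forms take E.
    return base + "E"


print(inflect_adjective("cHOtA", "Fem"))           # cHOtI
print(inflect_adjective("amIr", "Fem", "Pl"))      # amIr
```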


Integration into one FST


- No problems so far with unwanted phonological rule interactions
- Main problem: resulting stack size (3.4 Mb)
- Current morphology makes heavy use of flag diacritics; these are incompatible with XLE and have to be eliminated ahead of time (eliminate flag X)
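For readers unfamiliar with flag diacritics, the following Python sketch shows roughly how they constrain a path at lookup time. It is a simplified model and not the xfst implementation: it covers only the P (set), R (require), D (disallow), C (clear) and U (unify) operators, omits N (negative set), and the feature name TENSE and the symbol sequences are hypothetical examples rather than flags from the Urdu morphology.

```python
import re

# Matches flag symbols such as @P.TENSE.FUT@ or @R.TENSE@ (value optional).
FLAG = re.compile(r"^@([PRDCU])\.([^.@]+)(?:\.([^.@]+))?@$")


def path_is_valid(symbols):
    """Check whether a sequence of arc symbols satisfies its flag diacritics."""
    features = {}
    for sym in symbols:
        m = FLAG.match(sym)
        if not m:
            continue  # ordinary symbol, no constraint
        op, feat, val = m.groups()
        current = features.get(feat)
        if op == "P":            # set the feature to the given value
            features[feat] = val
        elif op == "C":          # clear (unset) the feature
            features.pop(feat, None)
        elif op == "U":          # unify: set if unset, else values must match
            if current is None:
                features[feat] = val
            elif current != val:
                return False
        elif op == "R":          # require: feature set (to val, if given)
            if current is None or (val is not None and current != val):
                return False
        elif op == "D":          # disallow: feature must not be set (to val)
            if current is not None and (val is None or current == val):
                return False
    return True


# Hypothetical future-tense path: the ending is only licensed after the setter.
ok = ["m", "A", "r", "@P.TENSE.FUT@", "E", "@R.TENSE.FUT@", "g", "I"]
bad = ["m", "A", "r", "@R.TENSE.FUT@", "g", "I"]
print(path_is_valid(ok), path_is_valid(bad))  # True False
```

Eliminating a flag compiles such constraints into the network structure itself, so that a consumer that does not interpret flags, as the slide says of XLE, sees only ordinary symbols.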


Pruning the network


- Lexicon/rules pruned by Tina (phonological rules now all in one file; turn stack commands instead of intermediate stack saving)
- The eliminate flag X command reduces the network by ~3 Mb
- Question: why? eliminate flag supposedly causes an increase in network size (Beesley & Karttunen 2003, 360)
- Resulting network is smaller; is it also faster to apply at runtime?


Reduplication
- Any content word in Urdu can be reduplicated
- Accomplished with the compile-replace operator
- A phonological change in the onset can occur with nouns (echo forms)

Example:

kursI kursI    kursI+Noun+Fem+Sg+Nom+Redup    'chair after chair, many chairs'
kursI vursI    kursI+Noun+Fem+Sg+Nom+Echo     'something like a chair'
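The two surface patterns in the example can be sketched in plain Python. This is not the compile-replace implementation used in the FST; the function names, the vowel inventory of the transliteration, and the rule "replace the whole initial consonant onset with v" are assumptions made for illustration.

```python
# Assumed vowel symbols of the ASCII transliteration (capital = long vowel).
VOWELS = set("aAiIuUeEoO")


def reduplicate(word):
    """Full reduplication: the word is simply repeated."""
    return f"{word} {word}"


def echo_form(word, echo_onset="v"):
    """Echo formation: repeat the word with its onset replaced by echo_onset."""
    i = 0
    # Skip the initial consonant(s); stop at the first vowel.
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    return f"{word} {echo_onset}{word[i:]}"


print(reduplicate("kursI"))  # kursI kursI
print(echo_form("kursI"))    # kursI vursI
```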


Demo
- Demo: reduplication, flag elimination
- Note: reduplication is still to be integrated into the tokenizer for the grammar (on the to-do list)


Correlatives
- (only) 28 verbs included, (only) one verb class
- Irregularities are treated via phonological rules
- Network size: 365.6 Kb, 9336 paths
- Flags included to impose restrictions on the overgenerating future sublexicon
- Future forms such as mArEgI:

xfst[1]: up mArEgI
mAr+Verb+3P+Sg+Fut+Fem
mAr+Verb+2P+Sg+Fut+Fem

are treated as one word, although this is the case in Hindi only (written Urdu: mArE gI)
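One possible way to bridge the gap between written Urdu (mArE gI) and the one-word analysis is a pre-tokenization step that rejoins the future marker with the preceding verb form. The sketch below is hypothetical and is not the project's tokenizer: the function name and the marker inventory gA, gI, gE are assumptions, and a real tokenizer would need to rule out cases where such a token is an independent word.

```python
# Assumed future markers in the ASCII transliteration (not taken from the slides).
FUTURE_MARKERS = {"gA", "gI", "gE"}


def join_future_forms(tokens):
    """Attach a free-standing future marker to the token that precedes it."""
    joined = []
    for token in tokens:
        if token in FUTURE_MARKERS and joined:
            joined[-1] = joined[-1] + token  # mArE + gI -> mArEgI
        else:
            joined.append(token)
    return joined


print(join_future_forms("mArE gI".split()))  # ['mArEgI']
```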

