Assignment 1
Part of speech tagging and chunking (partial parsing)
Due: Tuesday Sep 29, 2015 Midnight to the D2L Dropbox
Some parts require you to use the NLTK and some simple programming in
Python; with permission, a student from a non-informatics program may do the
alternative that does not require programming. Put each part of the assignment
into its own file. Use tar or zip (only) to combine the directory into a single file
for the drop box.
>>> import nltk
>>> corpus = GetCorpus(nltk.corpus.brown, 'news')
>>> PlotNumberOfTags(corpus)
...show a plot with axes: X - number of tags (1, 2...) and Y - number of words having this number of tags...
>>> cfd = GetAmbiguousWords(corpus, 4)
<conditionalFrequency ...>
>>> TestGetAmbiguousWords(cfd, 4)
All words occur with more than 4 tags.
>>> ShowExamples('book', cfd, corpus)
book as NN: ....
book as VB: ....
(a) Function 1: Write a function GetCorpus(corpusName,categoryName)
that returns the tagged words for that corpus and category, so that
you can call the remaining functions without repeating this work.
(Hint: the body of this function should be a single statement.)
(b) Function 2: Write a function PlotNumberOfTags(corpus) that plots
the number of words having a given number of tags. The X-axis
should show the number of tags and the Y-axis the number of words
having exactly this number of tags.
If you do not have access to the pylab module, you can simplify your
plot by printing an X at the y-value.
If you have access to the pylab module, you can use the following
example from the NLTK book as an inspiration:
import nltk
from nltk.corpus import brown

def performance(cfd, wordlist):
    # Lookup table: each word mapped to its most frequent tag
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    # Words ordered from most to least frequent
    word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
    words_by_freq = [w for (w, _) in word_freqs]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()
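(c) Function 3: Write a function GetAmbiguousWords(corpus,N) that finds
the words with more than N observed tags. The function should return
a ConditionalFreqDist object where the conditions are the words and
the frequency distribution indicates the tag frequencies for each word.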
(d) Function 4: Write a test function TestGetAmbiguousWords(cfd,N)
that verifies that the words indeed have more than N distinct tags in
the returned value.
(e) Function 5: Write a function ShowExamples(word,corpus) that, given
a word, finds one example of usage of the word with each of the
different tags in which it can occur. (The corpus can be the tagged
sentences or tagged words, according to what is most convenient.)
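The structure described above (words as conditions, tag frequencies as the distribution) and the text-only plot suggested for those without pylab can be tried out on a tiny hand-made word list. This is only an illustrative sketch, not a solution: it uses plain Python collections as a stand-in for nltk.ConditionalFreqDist, and the toy data, the names tagged_words, tag_counts, histogram and text_plot, and the choice to lowercase words are all assumptions.

```python
from collections import Counter, defaultdict

# Hand-made (word, tag) pairs standing in for real corpus output.
# These are NOT taken from the Brown corpus; they are an assumption for illustration.
tagged_words = [
    ("I", "PPSS"), ("book", "VB"), ("a", "AT"), ("flight", "NN"),
    ("the", "AT"), ("book", "NN"), ("is", "BEZ"), ("a", "AT"),
    ("light", "JJ"), ("read", "NN"), ("I", "PPSS"), ("read", "VB"),
    ("by", "IN"), ("the", "AT"), ("light", "NN"), ("light", "VB"),
]

# word -> Counter of tags (conditions are words, samples are tags).
tag_counts = defaultdict(Counter)
for word, tag in tagged_words:
    tag_counts[word.lower()][tag] += 1

# number of distinct tags -> number of words having exactly that many tags.
histogram = Counter(len(tags) for tags in tag_counts.values())

def text_plot(hist):
    """Draw one row per x-value, with a single X placed at column y."""
    rows = []
    for n_tags in sorted(hist):
        rows.append(f"{n_tags} | " + " " * (hist[n_tags] - 1) + "X")
    return "\n".join(rows)

print(text_plot(histogram))
```

On a real corpus the y-values are far too large to use directly as column positions, so they would need to be scaled (for example, logarithmically) before printing.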
NON-PROGRAMMING OPTION
(a) Write a set of 4 sentences and then tag them with the Penn
Treebank tagset. (Include both the untagged and tagged text.) The
sentences should contain at least 3 words with at least 3 different tags.
You should use the online tagger program: ONLINE POS DEMO to
assign the tags.
(b) Create a table that includes all words in your hand-written text that
received more than one tag, showing the overall frequency of the
word, the number of tags received, as well as the actual tags with
their frequency. For example,
word    Total Occurrences    Total Tags    Distribution
book    3                    2             NN(2), VB(1)
(c) Select an existing text of at least 500 words that is available on
the web that appears to have at least 3 ambiguous words, tag it
using the Penn Treebank tagset, and then create a table for this text
that includes all words that received more than one tag, showing the
overall frequency of the word, the number of tags received, as well as
the actual tags with their frequency, as before.
PART 2:
NP: {<DT>?<JJ>*<NN>}
Another symbol you might use is +, which indicates one or more
occurrences.
2. (Exercise 7.5 from NLTK) Write a tag pattern to cover noun phrases that
contain gerunds, e.g. the/DT receiving/VBG end/NN, assistant/NN
managing/VBG editor/NN. For each pattern, include an example phrase
that matches the pattern. If programming, add these patterns to the
grammar, one per line and test your work using some tagged sentences
of your own devising. Non-programming students should provide a set of
tagged sentences and mark by hand the noun phrases with gerunds that
match the new patterns; you can also use the online tagger demo to help
check your work.
3. (Exercise 7.6 from NLTK) Write one or more tag patterns to handle
coordinated noun phrases, e.g. July/NNP and/CC August/NNP,
all/DT your/PRP$ managers/NNS and/CC supervisors/NNS,
company/NN courts/NNS and/CC adjudicators/NNS. For each pattern
include an example that matches the pattern.
4. Provide a text fragment of at least 120 words from an online source that
includes both noun phrases with gerunds and with coordination and show
the chunk boundaries. (Include both your chunk-marked text and a link
to the original source.)
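
Before writing the patterns the exercises ask for, it can help to try the tag-pattern notation from Part 2 on a hand-tagged sentence. The sketch below compares the NP pattern given above with a variant that uses + in place of *. It does not cover gerunds or coordination. The toy sentence, the variable names, and the helper np_chunks are assumptions made for illustration; nltk.RegexpParser needs no corpus download.

```python
import nltk

# Hand-tagged toy sentence (Penn Treebank tags); an assumption for illustration.
sentence = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
    ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

def np_chunks(grammar, tagged_sentence):
    """Return the (word, tag) lists of every NP chunk the grammar finds."""
    parser = nltk.RegexpParser(grammar)
    tree = parser.parse(tagged_sentence)
    return [subtree.leaves()
            for subtree in tree.subtrees()
            if subtree.label() == "NP"]

# <JJ>* allows zero or more adjectives, so "the cat" is chunked too.
optional_adjectives = np_chunks("NP: {<DT>?<JJ>*<NN>}", sentence)

# <JJ>+ requires at least one adjective, so only "the little dog" matches.
required_adjective = np_chunks("NP: {<DT>?<JJ>+<NN>}", sentence)

print(optional_adjectives)
print(required_adjective)
```

Additional patterns can be added to the same grammar string, one per line under the NP label, and checked against further hand-tagged sentences in the same way.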