
Automatic Creation of Quality Multi-word Lexica from Noisy Text Data

Francesca Frontini
ILC - CNR Pisa, Italy

Valeria Quochi
ILC - CNR Pisa, Italy

Francesco Rubino
Synthema srl Pisa, Italy

francesca.frontini@ilc.cnr.it · valeria.quochi@ilc.cnr.it · francesco.rubino@synthema.it

ABSTRACT

This paper describes the design of a tool for the automatic creation of multi-word lexica that is deployed as a web service and runs on automatically web-crawled data within the framework of the PANACEA platform. The main purpose of our work is to provide a computationally light tool that creates a full, high quality lexical resource of multi-word items. Within the platform, this tool is typically inserted in a workflow whose first step is automatic web crawling. The input data of our lexical extractor is therefore intrinsically noisy. The paper evaluates the capacity of the tool to deal with noisy data, in particular with texts containing a significant amount of duplicated paragraphs. The accuracy of the extraction of multi-word expressions from the original crawled corpus is compared to the accuracy of the extraction from a de-duplicated version of the corpus. The paper shows that our method extracts with sufficiently good precision even from the original, noisy crawled data. The output of our tool is a multi-word lexicon formatted and encoded in XML according to the Lexical Markup Framework.

Categories and Subject Descriptors

I.2.7 [Natural Language Processing]: Text analysis; H.3.5 [Information Storage and Retrieval]: Web-based services

General Terms

Experimentation, performance

Keywords

Lexical induction, multi-word extraction, web-based distributed platform, noisy data

1. INTRODUCTION

In the last few decades a large amount of research has been devoted to the automatic acquisition, learning or extraction of multi-word expressions (MWEs hereafter) from textual data, applying different methods and adopting different assumptions. Some of the experiments also report high accuracy. Yet MWEs still pose problems for most language technology applications (e.g. information retrieval, text mining, semantic web and machine translation), and readily available, and possibly customizable, tools for the acquisition of MWEs in languages other than English are not widespread. Also, the experiments described in the scientific literature mostly have a narrow focus, either in the specific type of multi-words targeted or in the evaluations performed, or both. Furthermore, they are usually tested on relatively clean data, i.e. reference corpora. This paper describes the implementation of a tool for the automatic creation of large-scale MWE lexicons which is integrated as a web service in the PANACEA distributed platform1, a virtual, distributed production line where different interoperable components can be chained in workflows to produce different types of lexical resources for different languages. In a typical PANACEA workflow, data are generated by real-time web crawling of domain corpora based on user-provided seeds; lexical extraction workflows normally include POS-tagging, and sometimes parsing, as well as the lexical extractor proper and some converters. In line with the overall objective of the platform, the tool has been implemented using computationally light methods. The main purpose is to provide a robust and free tool which creates a full lexical resource from web-crawled data. Noise, in particular in the form of duplicated paragraphs, is frequent in PANACEA crawled corpora (only whole-document de-duplication is performed in real time by the crawler) and may reduce the performance of the extractors.

The aim of this paper is to evaluate the impact of duplicated paragraphs on the precision of our tool. To this purpose we use two versions of an automatically crawled corpus: the original crawled and automatically cleaned corpus, with duplicates at paragraph level;

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. AND '12, Mumbai, India. Copyright 2012 ACM 978-1-4503-1919-5/12/12 ...$10.00.

the same corpus de-duplicated at paragraph level2; and we show how our method can perform equally well on clean and noisy data.

1 See www.panacea-lr.eu
2 The de-duplication service, http://registry.elda.org/services/159, was developed at ILSP [12] [13].

The structure of the paper is as follows: Section 2 discusses previous approaches; Section 3 describes the implementation of our MWE acquisition tool/service; Section 4 evaluates the tool, with a particular focus on assessing the impact of duplicate text on performance.

2. RELATED WORK

MWE acquisition procedures all adopt some kind of statistical approach and usually involve two steps: 1) the identification of candidates (usually based on n-grams or pattern matching); 2) filtering and candidate ranking according to some statistical score and an experimental threshold. First approaches made use of plain text corpora and identified candidates on the basis of (positional) n-grams (sequences of n adjacent words), optionally using POS filtering to clean the candidate lists and stop word lists to reduce the search space (e.g. [2], [16]). More recent methods make use of tagged or parsed corpora to first identify relevant patterns in an attempt to improve precision, although for parsed corpora the improvement is not clearly proven, due to parsing errors ([1]; also see the interesting review in [15, 73-74]). The ranking of candidates is then achieved by applying some association measure (hereafter AM) calculated on the basis of the co-occurrence frequency of the content words involved in the candidates. Several works have also carried out detailed comparisons of the methods used in the literature, evaluating the association measures used (among others, [10], [4], [5], [11]). Overall, it seems that the simplest measures (frequency, MLE and Log Likelihood) perform best. Although much of the work is on English data, research on MWE extraction has been carried out for many other languages as well, such as German ([8]), Dutch ([17]), Czech ([11]), French ([9]) and Portuguese ([18]), among others. Notwithstanding the vast literature, evaluation is often either not reported or not detailed enough, and precision and recall may vary considerably: for example, Smadja reports a precision of 80% for his XTRACT system on English texts (but of 40% before the syntax-based filtering), with evaluation carried out by manual inspection by a lexicographer.
[15, 80] performed experiments in four languages and reported different figures for precision (English = 0.42-0.58, Spanish = 0.39-0.42, French = 0.46-0.35, Italian = 0.32-0.37). Most importantly, all of the aforementioned works are generally designed to extract multi-words or collocates for a single word or a list of words or lemmas, and never report performance issues related to data cleanness (most likely because they generally make use of reference corpora, which are relatively clean and manually normalised). Our method instead automatically extracts all possible MWEs from a corpus, without needing a list of lemmas; it is also designed to work on noisy data (i.e. automatically web-crawled data).
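For illustration, the two association measures most often compared in this literature can be computed from pair and marginal frequencies. This is a sketch under standard definitions, not the implementation of any of the cited systems:

```python
import math

def pmi(f_ab, f_a, f_b, n):
    """Pointwise Mutual Information for a pair co-occurring f_ab times,
    with marginal frequencies f_a and f_b, in a sample of n pairs."""
    return math.log2((f_ab * n) / (f_a * f_b))

def log_likelihood(f_ab, f_a, f_b, n):
    """Log-likelihood ratio (G^2) from the 2x2 contingency table implied
    by the pair frequency and the two marginal frequencies."""
    cells = [
        (f_ab, f_a * f_b / n),                              # both words
        (f_a - f_ab, f_a * (n - f_b) / n),                  # first word only
        (f_b - f_ab, (n - f_a) * f_b / n),                  # second word only
        (n - f_a - f_b + f_ab, (n - f_a) * (n - f_b) / n),  # neither word
    ]
    # each term compares observed vs. expected count under independence
    return 2.0 * sum(o * math.log(o / e) for o, e in cells if o > 0)
```

Note how PMI rewards rare-but-exclusive pairs, which is consistent with its poor precision on top ranks reported below (Figure 1): low-frequency pairs with accidental exclusivity dominate the ranking.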

Our MWE acquisition component is inspired by the seminal work of Smadja [16], but also integrates more recently developed statistical methods and association measures for filtering and ranking the acquired candidates, and thus for producing a cleaner output lexicon, promoting precision over recall. Our tool requires as input a part-of-speech tagged corpus in the CoNLL format3, performs a sequence of steps, each implementing different methods, in such a way as to be efficient in terms of processing time and memory usage, and outputs an LMF-XML lexicon. Within a typical PANACEA workflow, the corpus used for extraction is first crawled by a Focussed Monolingual Crawler4, then tokenised, lemmatised and POS-tagged with a FreeLing-based web service5, and finally converted to CoNLL format by a converter6. All these components are deployed as web services and no human intervention is foreseen. Thus, although the corpus is automatically normalised and cleaned after crawling [12], it is still to be considered noisy for tasks such as automatic lexical acquisition or induction. In particular, duplicated paragraphs are present, which may have a heavy impact on the precision of lexical extractors by giving some repeated strings of text an abnormally high frequency in the corpus. Our lexicon acquisition tool works in six steps, described in the remainder of this section.

Step 1: Window-based collocation extraction.


The extractor takes as input a targeted pair of POS tags, which refers to the first and last word of the targeted MWEs, and the maximum length of the MWEs to be retrieved. The first and last targeted elements are referred to as the (word) pair or collocation. The complete sequence, including the intervening elements, is referred to as a pattern, and the maximum span of a pattern is called a window. If, for instance, Noun-Noun and window 5 are selected, the extractor will retrieve all sequences of tokens in the corpus that begin and end with a noun and have length from 2 to 5. The result is a data structure containing word sequences of minimum length 2 and maximum length equal to the window size. Both the POS tag pair and the window size are passed as user-configurable parameters to the system. The output of this step is a list of candidate collocation pairs, with their related frequencies as well as all the patterns they produce.
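A minimal sketch of this windowed scan follows. Function and structure names are our own, and the sketch indexes patterns by token only, whereas the actual (Java) tool stores Token+Lemma+POS for each element:

```python
from collections import defaultdict

def extract_collocations(tagged_tokens, first_pos, last_pos, window):
    """Collect all token sequences of length 2..window that start with a
    first_pos token and end with a last_pos token, grouped by the
    (first word, last word) pair, with pattern frequencies."""
    pairs = defaultdict(lambda: defaultdict(int))  # pair -> pattern -> freq
    for i, (tok_i, pos_i) in enumerate(tagged_tokens):
        if pos_i != first_pos:
            continue
        # scan forward so the whole sequence fits inside the window
        for j in range(i + 1, min(i + window, len(tagged_tokens))):
            tok_j, pos_j = tagged_tokens[j]
            if pos_j == last_pos:
                pattern = tuple(t for t, _ in tagged_tokens[i:j + 1])
                pairs[(tok_i, tok_j)][pattern] += 1
    return pairs
```

With Noun-Noun and window 5, a sentence fragment like *gas a effetto serra* yields the pair (gas, serra) with the full four-token pattern, plus the shorter nested pairs.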

Step 2: Pre-filter and collocation ranking through association measures.


The data structure produced in Step 1 may grow to a considerable size7. A pre-filtering stage based on the raw frequency distribution of the collocations is therefore necessary, filtering out pairs below a given proportional threshold. Collocation frequencies normally show a Zipfian distribution, with a long tail of low frequency pairs and hapaxes8 that are
3 http://ufal.mff.cuni.cz/conll2009-st/task-description.html#Dataformat
4 http://registry.elda.org/services/160
5 http://registry.elda.org/services/214
6 http://registry.elda.org/services/213
7 Our current Java implementation of the tool requires a maximum heap memory of 4 GB for a 37 million word corpus when Noun-Noun pairs are extracted with a window of 5, while processing time is around 20 minutes. Obviously, less frequent POS pairs and a smaller window reduce the memory requirements.
8 By hapaxes we mean here word pairs of frequency 1.

3. THE PANACEA MW EXTRACTOR AND A TYPICAL WORKFLOW

Due to the need to operate in a distributed web-service environment, where processing time is critical, and because of possible computing and memory limitations, it was decided to avoid computationally intensive methods. Also, as often reported in the literature, simpler methods seem to perform equally well, if not better, in constrained set-ups.

not tractable and need to be filtered away. In this experiment we used the AverageFrequency PreFilter, which sets the threshold to the average collocation frequency as in (1):

AF = (1/n) · Σ_{i=1}^{n} f_i    (1)

where f_i is the frequency of the i-th collocation/pair and n is the number of collocations extracted. After applying the pre-filter, Log Likelihood and Pointwise Mutual Information are calculated and may be used for re-ranking and further filtering the collocations. In our experiments raw frequency and Log Likelihood seem to produce similar interpolated precision curves, with frequency slightly above Log Likelihood. Both perform significantly better than Pointwise Mutual Information (see Figure 1), which has precision very close to zero in its top 5000 ranks. It is known in the literature [11] that different kinds of AMs give different results depending on the kind of MWE extraction task. Frequency is also known to perform well [7] when filtering on POS is used.
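The average-frequency pre-filter in (1) amounts to keeping only items at or above the mean frequency. A minimal sketch follows; whether the actual tool uses a strict or non-strict comparison is our assumption:

```python
def average_frequency_filter(freqs):
    """Keep only items whose frequency reaches the average frequency,
    as in formula (1): AF = (1/n) * sum(f_i)."""
    if not freqs:
        return {}
    af = sum(freqs.values()) / len(freqs)
    # a Zipfian tail of hapaxes pushes AF well above 1, so most of the
    # long tail is discarded in a single pass
    return {item: f for item, f in freqs.items() if f >= af}
```

Because the threshold is proportional to the data, no language- or corpus-specific constant has to be tuned, which fits the platform's no-human-intervention design.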

Step 4: Pattern-based collocation filtering and MWE selection.


Recall that our search algorithm works on pairs of POS tags in a given window frame n; thus, for a given pair of words (or candidate collocation), the algorithm collects all word sequences of maximum length n that have those words as their extremes. This is what we call the set of patterns for the given collocation. The collocations retained after pre-filtering go through an additional filtering step based on the distribution of their set of patterns: each candidate collocation is represented as a vector of frequencies of all its patterns, and the algorithm detects and filters out vectors that show an even distribution; otherwise, only the significant patterns (i.e. the outliers) are retained as good candidate MWEs. For example, the collocation GAS - SERRA produces a pattern vector v1:

v1 = [7151, 1, 1, 1, 1, 51, 21, 2, 43, 1, 1, 38, 1547, 1, 1, 10, 1, 2, 4, 1, 5, 1, 1, 1, 1, 1, 1, 14, 1, 1, 6, 5, 71, 1365]

where all elements of v1 are the frequencies of patterns extracted for GAS-SERRA. v1 contains three clearly outstanding patterns for this collocation (with frequencies 7151, 1547 and 1365, corresponding to patterns A, B and C in Step 3). On the other hand, the collocation MARE-COSTA (sea - coast) produces a long vector as in v2:

v2 = [1, 1, 1, 1, 8, 1, 1, 1, 1, 1, 1, ...]

with no clearly outstanding patterns. The intuition here is that this fact is evidence of the lower fixedness of the second collocation with respect to the first, and thus serves as a criterion for rejecting the collocation and its set of patterns. In order to quantify this intuition, the mean (X̄) and standard deviation (σ) are calculated for each set of patterns:
σ = sqrt( (1/k) · Σ_{i=1}^{k} (f_i − X̄)² )    (2)

Figure 1: Interpolated precision graph for the different rankings.

Step 3: Pattern extraction.


Having reduced the extraction to a tractable size in Step 2 by discarding low frequency pairs, the algorithm can now perform a more refined analysis on the patterns generated by the remaining pairs. The rationale behind pattern extraction is to retrieve all possible intervening items for each targeted pair of words A and B. Patterns are retrieved complete with all the available information: for each element in the pattern, the token, the lemma and the POS are available. A distinct pattern will thus be a sequence of elements, each one being the concatenation of Token+Lemma+POS. Frequencies for each pattern are also retrieved. For instance, for the pair GAS-SERRA (gas-greenhouse, combined frequency: 10353) the algorithm retrieves several patterns such as9:

A gas serra (greenhouse gas): frequency 7151
B gas ad effetto serra (greenhouse effect gas): frequency 1547
C gas a effetto serra (greenhouse effect gas): frequency 1365
...

9 For reasons of space, only the sequence of tokens is given here. Notice how B and C differ only in the alternate spelling of the preposition a; they have the same lemma but not the same token.


where f_i is the frequency of the i-th pattern and k is the number of patterns extracted. We empirically assessed that only collocations whose vectors show σ > 1 have some chance of producing at least one significant pattern. This excludes collocations that are normally made up of long series of very low frequency items, differing from each other very little in frequency. The only exception is when σ = 0, because the collocation extracts just one pattern of identical frequency; in this case the algorithm proposes the pattern as a good MWE candidate without further analysis10.

10 This is the case for very fixed MWEs such as stati membri (member states).

Now that much noise and many irrelevant collocations have been filtered out, the algorithm has to select the good patterns

to be encoded in the output lexicon. This is done by using the same variance analysis of the distribution in the sets of patterns described above, with the aim of selecting more than one relevant pattern. In particular, the presence of outliers is evaluated in terms of standard deviations above the mean. Empirical evidence showed that one standard deviation above the mean is a good enough threshold in our case; it normally extracts 1-3 significant MWEs. Notice how this approach differs from the one used in [16]. There the task was searching for collocates on the basis of a given list of words: the vector was built in order to determine the position of any word w with respect to a word from the list. Thus, in Smadja's approach a high standard deviation indicates randomness, and therefore low association strength between the two words. In our case the pair, not a single word, is considered. Consequently, the only case in which a low σ is indicative of a good candidate pair is when σ equals zero, because its set of patterns actually contains only one element; randomness is null in this case. In all other cases, good pattern vectors will show a strongly uneven distribution, ideally with one or a few very frequent patterns and a long tail of very low frequency elements11. The performance of this algorithm (henceforth SigmaPatternExtraction) has been evaluated in [14] against a simpler method which consists in retrieving only the most frequent pattern. SigmaPatternExtraction showed an increase in recall while not losing on the precision side. In this paper we shall instead apply our best method, SigmaPatternExtraction, to the original crawled corpus and to the de-duplicated one, in order to verify the impact of paragraph de-duplication.
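The σ-based selection of Step 4 can be sketched as follows. The thresholds (σ > 1 to retain a collocation, one standard deviation above the mean to select patterns) mirror those stated in the text; the function name and the exact handling of ties are our own assumptions:

```python
import statistics

def sigma_pattern_extraction(pattern_freqs, k=1.0):
    """Select significant patterns for one collocation: keep patterns whose
    frequency lies more than k standard deviations above the mean."""
    freqs = list(pattern_freqs.values())
    mean = statistics.fmean(freqs)
    sigma = statistics.pstdev(freqs)  # population standard deviation, as in (2)
    if sigma == 0:
        # single pattern shape: accept without further analysis
        return list(pattern_freqs)
    if sigma <= 1:
        # evenly distributed low-frequency patterns: reject the collocation
        return []
    return [p for p, f in pattern_freqs.items() if f > mean + k * sigma]
```

A vector dominated by one pattern (like v1 for GAS-SERRA) yields a large σ and a high cut-off, so only the outstanding patterns survive; a flat vector (like v2 for MARE-COSTA) yields σ ≤ 1 and is rejected outright.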

Step 5: Post-filtering.

Post-filtering12 is needed to correct the noise introduced by the extraction process, due both to the extraction methodology and to the characteristics of the corpus. Several levels of post-filtering can be applied, some of which may be fine-tuned for a specific language, for instance using stop word lists. In this paper we concentrate on language-independent post-filtering, specifically on two post-filters: the average frequency post-filter and nested string removal.

AverageF Post-filter. The first post-filter is a repetition of the average frequency pre-filter as in (1), only this time performed on pattern frequency instead of collocation frequency. Notice that a collocation with frequency above the threshold in (1) can produce several patterns, some of which may have much smaller frequency. Some of them will be filtered away by previous steps, but others will remain whose frequency is often too low to provide any statistical evidence. We thus calculate the average frequency of the remaining patterns and filter away those which fall below this threshold. In theory the whole extraction algorithm could be run without pre-filtering, applying only the average frequency post-filtering; from the point of view of optimization, however, it is useful to progressively reduce the extraction dataset in order to minimize processing time. The results show that, while the method adopted is robust with respect to de-duplication, post-filtering is a necessary step to increase precision overall, in both settings.

Nested Strings Post-filter. The second post-filter aims to remove an effect of the extraction that can be quite pervasive, especially when using a 5-token window: the extraction of nested strings. In our case, for example, the system extracts as MWEs both morti sui luoghi and morti sui luoghi di lavoro (deaths in workplaces), which have the same frequency (1499); clearly, the genuine MWE is morti sui luoghi di lavoro, and the other is a substring of it that never occurred independently of the containing expression. It would be possible, in principle, to avoid extracting substrings altogether, but this would mean losing important information: when a substring also occurs independently of the longer string that contains it, this can be considered evidence of it being a genuine MWE. For instance, the substring luoghi di lavoro (workplaces) has a higher frequency than morti sui luoghi di lavoro, and should be retained as a genuine MWE. The nested string filter thus looks for MWEs that are substrings of others with the same frequency, and keeps only the longer one.
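A possible sketch of the nested-string post-filter just described, assuming plain string containment (the real tool may instead compare token sequences):

```python
def nested_string_filter(mwe_freqs):
    """Drop an MWE if it is a substring of a longer MWE with the same
    frequency, i.e. it never occurs outside the longer expression."""
    kept = dict(mwe_freqs)
    for short, f_short in mwe_freqs.items():
        for long_, f_long in mwe_freqs.items():
            # equal frequency + containment means the short form has no
            # independent occurrences, so only the long form is genuine
            if short != long_ and short in long_ and f_short == f_long:
                kept.pop(short, None)
                break
    return kept
```

On the example above, *morti sui luoghi* (1499) is removed because it is nested in *morti sui luoghi di lavoro* (1499), while *luoghi di lavoro*, whose frequency differs, survives as an independent MWE.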

Step 6: The lexicon builder.


The final step of the tool is lexicon building, which compiles the MWEs selected in the steps described above into a full lexicon, encoded according to the LMF standard [6].
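The shape of such an output can be sketched with a minimal LMF-style serialization. The element inventory below is a simplified illustration of an LMF lexicon, not the exact PANACEA encoding:

```python
import xml.etree.ElementTree as ET

def build_lmf_lexicon(mwes, language="it"):
    """Serialize selected MWEs into a minimal LMF-style XML lexicon.
    mwes maps the MWE citation form to its corpus frequency."""
    root = ET.Element("LexicalResource")
    lexicon = ET.SubElement(root, "Lexicon")
    ET.SubElement(lexicon, "feat", att="language", val=language)
    for mwe, freq in mwes.items():
        entry = ET.SubElement(lexicon, "LexicalEntry")
        lemma = ET.SubElement(entry, "Lemma")
        # LMF expresses properties as feat att/val pairs
        ET.SubElement(lemma, "feat", att="writtenForm", val=mwe)
        ET.SubElement(entry, "feat", att="frequency", val=str(freq))
    return ET.tostring(root, encoding="unicode")
```

Keeping the frequency as a feature of the entry lets downstream workflow components re-rank or threshold the lexicon without re-running the extractor.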

4. EVALUATION

11 As pointed out by one of the reviewers of this paper, a vector with only a few high frequency items is also imaginable, with small standard deviation but significant individual components. In fact we found such cases to be very rare, at least for MWE extraction. For other tasks, such as Named Entity Recognition, they may be more likely to occur, and one could imagine an absolute or relative threshold above which a pattern is always selected.
12 We refer to [14] for precision and recall figures prior to post-filtering.

In order to assess the impact of duplicate text in the web-crawled corpus on our MWE induction service, we conducted experiments using an Italian crawled corpus for the environment domain13. The corpus was first automatically normalised (e.g. the character encoding was set to UTF-8), cleaned (e.g. boilerplate was removed) and de-duplicated at document level; at a second stage it was de-duplicated at the level of paragraphs14. In our experiments we used the two versions of the corpus for evaluation; for simplicity, we will refer to them as crawled and de-duplicated, where the latter refers to de-duplication at paragraph level. The crawled corpus has about 37 million tokens, while the de-duplicated one has about 30 million tokens. Evaluation is performed intrinsically, i.e. against a gold standard, using the standard evaluation measures of precision, recall and F-measure.
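The three measures reduce to simple set arithmetic over extracted and gold MWEs. A sketch (treating both resources as sets of citation forms, which ignores the variant matching discussed later):

```python
def prf(extracted, gold):
    """Precision, recall and F1 of an extracted MWE set vs. a gold standard."""
    tp = len(extracted & gold)  # true positives: extracted MWEs in the gold
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```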

4.1 The gold standard

13 The choice of the domain is dictated by the PANACEA project requirements.
14 All these steps were performed by a web-service component developed by colleagues at ILSP, described in [12] and mentioned above.

A gold standard, or reference resource, was created by semi-manually collecting and POS-annotating nominal MWEs from several authoritative glossaries and thesauri for the environment domain. For each multi-word collected, its frequency in the corpus was computed using simple regular expressions to search for potential morphological variants. Only MWEs that occurred at least once in the corpus were retained in the gold standard15. In the gold standard the citation forms were kept as they were found in the source resources; no structuring was created, nor were different lemmatizations merged. The resulting gold standard consists of 2192 MWE entries.
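A crude sketch of such variant-tolerant frequency counting follows. The stemming heuristic (dropping the final letter of longer words and allowing a short ending) is entirely our assumption; the paper does not specify the regular expressions actually used:

```python
import re

def variant_regex(word):
    # crude stemming: drop the final letter of longer words and allow a
    # short inflectional ending, so e.g. "fonte" also matches "fonti"
    stem = word[:-1] if len(word) > 3 else word
    return re.escape(stem) + r"\w{0,3}"

def corpus_frequency(mwe, corpus_text):
    """Count corpus occurrences of an MWE, tolerating simple
    morphological variants of each component word."""
    pattern = r"\b" + r"\s+".join(variant_regex(w) for w in mwe.split()) + r"\b"
    return len(re.findall(pattern, corpus_text, flags=re.IGNORECASE))
```

This is enough to catch the singular/plural alternations cited in the evaluation (fonte/fonti di inquinamento), at the cost of some over-matching on short function words.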

4.2 System experimental set-ups

Two extractions have been performed, one for the original corpus and one for the de-duplicated one. Both shared the following general parameters:

target = extraction of nominal multi-words, i.e. multi-words whose first and last word is a noun (N-N henceforth)
window = 5 tokens including the first and last element (i.e. the extracted MWEs have a maximum length of 5 words)
prefilter = AverageFrequency PreFilter
pattern extraction = SigmaPatternExtraction algorithm

The Average Frequency pre-filter (1) sensibly reduces the number of collocations and patterns the service has to deal with: without this pre-filter, the corpus of 37 million words produces 2,052,140 collocations, containing a long tail of hapaxes. With the pre-filter the collocations reduce to 260,33916; the lowest retained collocation has frequency 5, the highest has frequency 10,353. The Average Frequency post-filter further reduces the lexicon to 25,000-30,000 entries, the lowest having a frequency of around 6. Although recall in terms of collocations of the unfiltered extraction is maximal17, recall in terms of MWEs is only 0.59; that is, SigmaPatternExtraction sometimes fails to recognize the correct pattern, probably because it is too rare with respect to the others. Recall drops considerably when applying pre- and post-filtering, due to the fact that low frequency terms were included in the gold standard. This heavy filtering was nonetheless necessary in order to achieve higher precision in an acceptable processing time, which was our primary goal. Also consider that very low-frequency collocations defy analysis with any association measure and are discarded in several approaches [4].

4.3 Evaluation of MWE extraction

In this section we compare the results of our tool on the two versions of the corpus mentioned above. As our system extracts all possible MWE candidates for two nouns at maximum distance 5, the number of extracted patterns is in the order of tens of thousands; thus precision calculated over the entire extracted set would be necessarily very low (recall that the gold standard contains little more than 2000 MWEs) and unfair. Thus, only MWE patterns whose collocation also appears in the gold standard are selected for evaluation. For example: if acqua di mare (sea water) is in the gold standard, we search for a collocation of the form ACQUA + MARE, and consider all MWEs that the algorithm has extracted for that pair. Also, in order for variants of the first pattern, such as fonte di inquinamento > fonti di inquinamento18, to be retrieved as well, a more flexible comparison is applied, allowing an edit distance (as described in [3]) of up to 3 between the strings. Table 1 gives the results of this evaluation for both the original corpus (crawled) and the de-duplicated one (de-duplicated).

              test   precision   recall   F1
crawled       1077   0.66        0.37     0.47
de-duplicated 1095   0.67        0.38     0.48

Table 1: Evaluation for crawled and de-duplicated corpus.

As the table shows, the performance on the de-duplicated corpus is only 1% higher than that achieved on the original corpus. We can thus infer that the precision of our tool is not greatly affected by the presence of duplicated paragraphs, which is typical of a noisy, web-crawled corpus that has only undergone light cleaning. As for the overall precision, manual inspection of the false positives shows that precision is actually much higher for both extractions. For instance, in addition to zona di pressione (pressure zone), which is present in the gold standard, the tool also extracts zona di bassa pressione (low pressure zone), which is not in the gold standard but is in fact a genuine MWE. By analyzing false positives we get a final estimate of our tool's accuracy of around 81% precision for the original (crawled) corpus.

4.3.1 Manual evaluation of MWE extraction
15 We are aware that this is penalizing for our system since, with pre-filtering, expressions occurring fewer than 5 times in the corpus will never be extracted. This choice was made in order to realistically assess recall over known MWEs.
16 This is a number of pairs our algorithm can further process in a reasonable amount of time.
17 This is not surprising, considering that the gold standard contains only MWEs that are present in the corpus.

The evaluation in the previous section can also be seen as an evaluation of the performance of the algorithm in extracting good patterns from good pairs/collocations (since only pairs present in the gold standard are selected). However, this evaluation, while useful for development purposes, does not fully assess either the quality of the output lexicon as a whole or the impact of noisy data. In fact, our system produces an MWE lexicon in the order of 200-20K entries, depending on the different possible set-ups and filters applied, while the gold standard consists of only 2K entries. In order to get a better idea of the quality of our lexicon, we decided to manually evaluate at least the top portion
18 The second MWE is the plural of the first: source of pollution > sources of pollution.

of our retrieved lexicon. Given that collocation frequency seemed to perform better than other association measures (see Figure 1) in assigning good MWEs a high rank, we decided to assess the accuracy of the 1000 most frequent MWEs, by manually checking the false positives for good MWEs that were not present in the gold standard.

              test   precision
crawled       1000   0.80
de-duplicated 1000   0.79

machine translation, syntactic parsing, or subcategorization frame acquisition (for which it would be interesting to assess the impact of automatically acquired multi-word prepositions).

5. CONCLUSION

Table 2: Evaluation for crawled and de-duplicated corpus, first 1000 MWEs.

As shown in Table 2, the precision for the 1000 most frequent items in both extractions is fairly high, with the original crawl producing a slightly better list than the de-duplicated one. Notice also that the real precision of the top-1000 portion of the extraction is very close to the precision assessed by manually checking the false positives in the reduced evaluation setting.

We have presented a tool for the acquisition of multi-word expressions of various lengths that generates a lexicon as output. The tool is deployed as a web service within the PANACEA platform. In a typical scenario, the service will use a corpus of automatically crawled and processed texts for a given domain, which will be intrinsically noisy although it undergoes several automatic cleaning and normalisation steps. Results of the evaluation both on data containing duplicate paragraphs and on a de-duplicated corpus have been reported, showing, somewhat surprisingly, that only a slight increase in precision is observable with de-duplication, and that even on relatively noisy data our lightweight approach achieves around 80% precision.

6. ACKNOWLEDGMENTS

4.3.2 Discussion and Future Work

Our results for both kinds of evaluation are in line with the precision performance of Smadja's XTRACT; it is therefore possible to build an MWE acquisition tool with state-of-the-art precision that runs without a pre-defined list of target words and on noisy web-crawled data. Yet some problems remain to be dealt with, in particular in the choice of the POS pair. For instance, among the false positives we found a number of partial strings such as dati meteo in tempo and produzione di energia da fonti. These expressions are clearly fragments of the longer expressions dati meteo in tempo reale (real-time weather data) and produzione di energia da fonti rinnovabili (production of energy from renewable sources), which can only be retrieved when targeting a Noun + Adjective collocation pair. It should thus be possible for the tool to accept a list of possible POS pairs as targets for the extraction, such as, here, Noun - Noun OR Adjective. Once the longer string is retrieved, the post-filter should then be able to remove these fragments. In addition to this simple improvement, our analysis of the output suggests further additions to the algorithm:

1. Extracting MWEs with POS patterns by progressive test and reduction of patterns, for learning the free slots in the multi-words. For instance, we may want to derive a pattern of the form articolo NUM della legge (article NUM of the law) from a series of patterns of the form articolo 6 della legge, articolo 12 della legge, articolo 23 della legge, ...
2. Language-specific fine tuning, in the form of optional stop word lists and legitimate-pattern checks, as well as head detection heuristics.

Finally, an important development would be a task-based evaluation, in order to see how useful the automatically induced MWEs are for improving the performance of other NLP tools. Interesting tasks for this could be: rule-based

This work was carried out at CNR-ILC within the EU FP7 funded project PANACEA (Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies), under grant agreement n. 248064.

7. REFERENCES

[1] T. Baldwin. Deep lexical acquisition of verb-particle constructions. Comput. Speech Lang., 19(4):398-414, Oct. 2005.
[2] Y. Choueka. Looking for needles in a haystack or locating interesting collocation expressions in large textual databases. In Proceedings of RIAO, pages 38-43, 1988.
[3] F. J. Damerau. A technique for computer detection and correction of spelling errors. Commun. ACM, 7(3):171-176, Mar. 1964.
[4] S. Evert. The statistics of word cooccurrences: word pairs and collocations. Doctoral dissertation, Institut für maschinelle Sprachverarbeitung, Universität Stuttgart, 2004.
[5] S. Evert and B. Krenn. Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4):450-466, 2005. Special issue on Multiword Expressions.
[6] G. Francopoulo, L. Romary, M. Monachini, and N. Calzolari. Lexical Markup Framework (LMF). In LREC 2006, 2006.
[7] J. S. Justeson and S. M. Katz. Principled disambiguation: Discriminating adjective senses with modified nouns. Computational Linguistics, 21(1):1-27, 1995.
[8] B. Krenn and S. Evert. Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL Workshop on Collocations, Toulouse, France, 2001.
[9] E. Laporte, T. Nakamura, and S. Voyatzi. A French corpus annotated for multiword nouns. In Proceedings of the Language Resources and Evaluation Conference, Workshop Towards a Shared Task on Multiword Expressions, pages 27-30, Marrakech, Morocco, 2008.
[10] D. Pearce. A comparative evaluation of collocation extraction techniques. In Third International Conference on Language Resources and Evaluation, Las Palmas, Spain, 2002.
[11] P. Pecina. Lexical association measures and collocation extraction. Language Resources and Evaluation, 44:137-158, 2010.
[12] P. Pecina, A. Toral, V. Papavassiliou, P. Prokopidis, and J. van Genabith. Domain adaptation of statistical machine translation using web-crawled resources: a case study. In M. Cettolo, M. Federico, L. Specia, and A. Way, editors, EAMT 2012: Proceedings of the 16th Annual Conference of the European Association for Machine Translation, pages 145-152, Trento, Italy, 2012.
[13] P. Pecina, A. Toral, A. Way, P. Prokopidis, V. Papavassiliou, and M. Giagkou. Towards using web-crawled data for domain adaptation in statistical machine translation. In Proceedings of the 15th Annual Conference of the European Association for Machine Translation, pages 297-304, Leuven, Belgium, May 2011.
[14] V. Quochi, F. Frontini, and F. Rubino. A MWE acquisition and lexicon builder web service. In Proceedings of the 24th International Conference on Computational Linguistics, Mumbai, India, 2012.
[15] V. Seretan and E. Wehrli. Multilingual collocation extraction with a syntactic parser. Language Resources and Evaluation, 43(1):71-85, March 2009.
[16] F. Smadja. Retrieving collocations from text: Xtract. Comput. Linguist., 19(1):143-177, Mar. 1993.
[17] B. Villada Moirón. Data-driven identification of fixed expressions and their modifiability. PhD thesis, University of Groningen, 2005.
[18] A. Villavicencio, C. Ramisch, A. Machado, H. de Medeiros Caseli, and M. José Finatto. Identificação de Expressões Multipalavra em Domínios Específicos. Linguamática, 2(1):15-33, Apr. 2010.
