
ISTILAH SAINS: A Malay-English Terminology Retrieval System

Tengku Mohd. T. Sembok, Kulothunkan Palasundram, Nazlena Mohd Ali, Aidanismah Yahya
Universiti Kebangsaan Malaysia
Email: tmts@pkrisc.ukm.my

Abstract

The emergence of the information era has increased the importance of translating articles from Malay to English, and vice versa, to assist the dissemination of information into and from South East Asia, mainly Malaysia, Indonesia and Brunei, where the Malay language is widely used. This has created the need for a good computer-aided translation system to facilitate the translation of terms. In this article, the authors describe the design and implementation of a Malay-English and English-Malay scientific terms retrieval software named Istilah Sains UKM. The system was designed and implemented using object-oriented techniques, with user-friendly interfaces and some basic facilities. For example, the database maintainer can add, change or delete terms in the database and reindex files. The general user can retrieve terms through the Internet using any browser with a Java plug-in. The indexing and retrieval strategies are based on stemming and n-gram methods. The system was developed in Java, is therefore Web-enabled, and can be accessed at http://www.ftsm.ukm.my/istilah/.

Introduction

Research in automatic information retrieval started in the 1940s (Frakes 1992). Many information retrieval systems have been developed since then, such as CONIT (Marcus 1983), and SMART and SIRE (Salton & McGill 1983). However, research on information retrieval for the Malay language is still new and scarce (Fatimah 1995). In an attempt to build a good Malay document retrieval system, SISDOM was developed (Belal 1995). SISDOM uses a Malay-English and English-Malay terms translation module called Istilah. The ever-growing need to translate articles from English to Malay and from Malay to English has triggered the need for independent translation software to maintain and retrieve terms. This article describes the design and implementation of Istilah Sains UKM, a bilingual (Malay and English) scientific terms translation software.

Design Approach

The design of Istilah Sains UKM is based on object-oriented modeling. The architecture (meta module) of Istilah Sains UKM consists of two main modules: the maintenance module and the Internet access module. The maintenance module consists of indexing and retrieving modules. The function of the indexing module is to update the main data file, which keeps the terms, and to create new index files for the system.

The indexing process can be divided into two steps: first, clustering the terms into root words using a stemming algorithm; second, clustering the root words into two-character subwords (bigrams). The stemming algorithm used for clustering Malay terms is Fatimah's algorithm (Fatimah et al. 1996). Porter's suffix-stripping stemming algorithm (Porter 1980) is used to cluster the English terms. The function of the retrieving module is to retrieve words that match the query word. The partial string matching techniques used are stemming and bigram matching, which determine whether two words match each other. For bigram matching, a Dice coefficient threshold of 0.6 is used for partial matching. Figure 1 shows an example of the output for the query "solar energy".
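
The bigram matching step can be illustrated with a minimal Python sketch. It is not the system's Java implementation: the function names are hypothetical, and it assumes that bigrams are taken as a set of distinct two-character subwords without padding and that a similarity equal to the threshold counts as a match, neither of which is stated in the paper.

```python
def bigrams(word):
    """Return the set of distinct two-character subwords of a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice(a, b):
    """Dice coefficient between the bigram sets of two words."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0  # words shorter than two characters have no bigrams
    return 2 * len(x & y) / (len(x) + len(y))


def matches(query, term, threshold=0.6):
    """Partial match: similarity at or above the threshold (assumed inclusive)."""
    return dice(query, term) >= threshold


if __name__ == "__main__":
    print(dice("tenaga", "bertenaga"))     # 5 shared bigrams out of 5 and 8
    print(matches("tenaga", "bertenaga"))
    print(matches("solar", "energy"))
```

Under these assumptions, "tenaga" and "bertenaga" share five bigrams, giving a Dice value of 10/13 (about 0.77), which is above the 0.6 threshold, whereas "solar" and "energy" share none.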

Figure 1: Output Interface

Evaluation Procedure

The experiments are performed to evaluate the retrieval effectiveness of various techniques employed within the framework of n-gram matching in the context of the automatic query expansion approach set out by Lennon et al. [1981]. This involves calculating a string similarity measure between each unique term in the dictionary and a specified query term, and ranking the terms accordingly. The effectiveness of retrieval is evaluated using recall and precision measures based on the number of words retrieved and relevant (R&R). The recall, R, is defined as the proportion of relevant terms in the dictionary that are actually retrieved for a specified query term. The precision, P, is defined as the proportion of retrieved terms that are actually relevant. The relevance of each term in the dictionary to a specified query term is determined manually by exhaustively scanning the dictionary with the help of the Find string-matching utility in Microsoft Excel. Two terms are regarded as relevant to each other if they share a common root. The measure of van Rijsbergen, E, which is a weighted combination of recall and precision, is also used to calculate the retrieval effectiveness:

$$E = 100 \times \left( 1 - \frac{(1 + \beta^2) P R}{\beta^2 P + R} \right)$$

where $\beta$, in the range of 0 to infinity, is used to reflect the relative importance of recall and precision. $\beta = 1.0$ reflects attaching equal importance to precision and recall, while $\beta = 2.0$ reflects attaching twice as much importance to recall as to precision. In our experiments we used the value $\beta = 2.0$. There is an inverse relationship between the E value and the effectiveness.
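
As an illustration of the E measure, the following Python sketch computes it from raw counts. The function name and argument names are hypothetical, and returning the worst value of 100 when nothing relevant is retrieved is an assumption made here to avoid a division by zero; the paper does not say how that case is handled.

```python
def e_measure(retrieved_relevant, retrieved, relevant, beta=2.0):
    """van Rijsbergen's E on a 0-100 scale (lower means more effective).

    retrieved_relevant: number of terms both retrieved and relevant (R&R)
    retrieved:          number of terms retrieved (the cutoff)
    relevant:           number of relevant terms in the dictionary
    """
    if retrieved_relevant == 0 or retrieved == 0 or relevant == 0:
        return 100.0  # assumed convention: no relevant term retrieved
    p = retrieved_relevant / retrieved   # precision
    r = retrieved_relevant / relevant    # recall
    b2 = beta * beta
    return 100.0 * (1.0 - (1.0 + b2) * p * r / (b2 * p + r))


if __name__ == "__main__":
    print(e_measure(4, 10, 5))             # beta = 2.0, as in the experiments
    print(e_measure(4, 10, 5, beta=1.0))   # equal weight on P and R
```

For instance, with 4 relevant terms among the top 10 retrieved and 5 relevant terms in the dictionary, P = 0.4 and R = 0.8, which gives an E value of about 33.3 for a weight of 2.0 and about 46.7 for a weight of 1.0.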
For each query term, the terms in the dictionary are ranked by the similarity measure specified in each experiment. The E values are calculated at cutoff points of the top 10 to 100 ranked dictionary terms, at intervals of 10. This is done for all 84 query terms, and the mean value of E is calculated at each cutoff. The average of the mean values of E over the ten cutoff points is then calculated and used to evaluate the effectiveness of each technique. Since this average generally serves as a good indicator of performance, we use the symbol aE to denote it. The statistical sign test is used to measure the significance of the difference between two methods under consideration, with the null hypothesis

P[Xi > Yi] = P[Xi < Yi] = 1/2

where Xi and Yi are the two scores for a matched pair obtained from the two methods. In applying the sign test, we focus on the direction of the difference between every Xi and Yi, noting whether the sign of the difference is positive or negative. The binomial distribution with p = q = 1/2 is used to calculate the probability, p, associated with the occurrence of a particular number of positives and negatives. If the number of matched pairs, N, is larger than 35, the normal approximation to the binomial distribution is used, expressed in terms of the z value [Siegel and Castellan, 1988].
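
The sign test procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' procedure: it assumes tied pairs have already been dropped from N, that the test is one-tailed on the count of negatives, and that the normal approximation uses a continuity correction. The function name is hypothetical.

```python
from math import comb, sqrt


def sign_test(n, negatives):
    """One-tailed sign test for n matched pairs with the given negatives.

    Returns ("p", probability) from the exact binomial distribution when
    n <= 35, otherwise ("z", value) from the normal approximation.
    """
    if n <= 35:
        # P[X <= negatives] for X ~ Binomial(n, 1/2)
        p = sum(comb(n, i) for i in range(negatives + 1)) / 2 ** n
        return "p", p
    # Normal approximation with continuity correction (assumed)
    z = (n - 2 * negatives - 1) / sqrt(n)
    return "z", z


if __name__ == "__main__":
    print(sign_test(25, 6))   # exact binomial probability
    print(sign_test(45, 5))   # z value, since N > 35
```

With N = 25 and 6 negatives the exact probability is about .007, and with N = 45 and 5 negatives the z value is about 5.07; these appear to agree with the corresponding entries of Table_3 and Table_5 (the latter printed as 5.06).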

The Results

The commonly used similarity coefficients are the Dice and Overlap coefficients. We ran experiments using both coefficients on digrams and trigrams, and the results obtained are given in Table_1. Using the E value averages as the basis of comparison, Table_1 shows that the Overlap coefficient performed better than the Dice coefficient for both digrams and trigrams. This may be due to Malay morphology, which allows long affixes to be attached to a root word; the Overlap coefficient performs better under this condition. In digram matching, the Overlap coefficient gives aE = 72.58 and the Dice coefficient gives aE = 73.27. In trigram matching, the Overlap coefficient gives aE = 75.07 and the Dice coefficient gives aE = 75.41. However, the difference between the performance of the Overlap and Dice coefficients is not significant, as shown in Table_2. We also tried combining the Dice and Overlap coefficients in digram matching by taking the average of the two, in order to measure the performance of such a fusion of similarity coefficients. The E values of this run are also given in Table_1; its performance is worse than that of the Overlap coefficient at all cutoff points except the cutoff of 10. It nevertheless performs better than the Dice coefficient, with aE = 72.64 as compared to 73.27. Generally, we can conclude that the Overlap coefficient performs better than both the Dice coefficient and the fusion of the two.
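
The effect of long affixes on the two coefficients can be seen in a small Python sketch. The paper does not state the formula it uses for the Overlap coefficient; the sketch assumes the usual definition, the number of shared n-grams divided by the size of the smaller n-gram set, and takes n-grams as sets of distinct subwords without padding. All function names are hypothetical.

```python
def ngrams(word, n=2):
    """Return the set of distinct n-character subwords of a word."""
    w = word.lower()
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def dice(a, b, n=2):
    x, y = ngrams(a, n), ngrams(b, n)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def overlap(a, b, n=2):
    """Overlap coefficient (assumed definition): shared / smaller set."""
    x, y = ngrams(a, n), ngrams(b, n)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


def fusion(a, b, n=2):
    """Average of the Dice and Overlap coefficients."""
    return (dice(a, b, n) + overlap(a, b, n)) / 2


if __name__ == "__main__":
    root, derived = "ajar", "pembelajaran"
    print(dice(root, derived), overlap(root, derived), fusion(root, derived))
```

For the root "ajar" and the affixed form "pembelajaran", all three digrams of the root occur in the longer word, so the Overlap value is 1.0 while the Dice value is only 3/7 (about 0.43), because Dice is penalised by the eleven digrams of the longer word.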

Digrams and Trigrams Matching

Looking at the E values in Table_1, digram matching performs better than trigram matching at all cutoff levels and for all coefficients used. A sign test is performed on the performance of digram and trigram matching using the Overlap coefficient. The result indicates that digrams perform significantly better, as the figures in Table_3 show. We also tried a fusion of digrams and trigrams by taking the average of the two values using the Overlap coefficient. The results of this fusion are given in Table_6, compared with the approach using digrams only. The fusion approach performs worse at all cutoff points except the cutoff of 10. Thus we conclude that the fusion approach is not worth pursuing further.

Incorporating-stemming Approach

Stemming is incorporated into the n-gram matching framework in order to investigate its effectiveness. The first approach is to stem the query term before performing n-gram matching against the unstemmed dictionary. This approach is applied to digrams and trigrams using the Overlap similarity coefficient. The results, given in Table_7, show that digram matching performs better than trigram matching, with aE = 64.61 as compared to 66.45. It also performs significantly better than the best non-stemmed approach from the previous section, i.e. digram matching with the Overlap similarity coefficient, with aE = 72.58, as shown by the sign test figures given in Table_4. The second approach is to stem both the query term and the dictionary terms. The results obtained using this approach are also given in Table_7. Digram matching again performs better than trigram matching, with aE = 61.04 as compared to 61.35, and clearly better than the previous approach in which only the query terms are stemmed. Table_5 shows the sign test results, which indicate that digram matching using a stemmed query and a stemmed dictionary performs significantly better than with a stemmed query alone.

Comparison with Stemmed-Boolean Matching

We also ran experiments using conventional stemmed-boolean matching between the stemmed query and the stemmed dictionary to compare its effectiveness with the approach of incorporating stemming in the n-gram framework. Comparing these two approaches poses some problems: stemmed-boolean matching is based on a boolean match, whereas the latter is based on a ranking approach with a degree of similarity. E values are used to compare the performance, calculated using the following methods:

1) Using variable cutoff points based on the number of words retrieved by the stemmed-boolean match: for each query term, if the stemmed-boolean match retrieved x terms, then a cutoff point of x is taken for that query. The mean E values obtained for this method are shown in Table_8.

2) Using a cutoff with weighted transformation, at cutoff points of 5 and 10, on the number of words retrieved by the stemmed-boolean match: if c is the cutoff point and x is the number of words retrieved by the stemmed-boolean match, then the following transformation is applied: if x < c, add c - x non-relevant words to the number of words retrieved; otherwise, multiply the number of words retrieved and relevant by c/x. The mean E values obtained for this method are given in Table_8.

Table_8 shows that, under the variable cutoff evaluation, the stemmed-boolean approach performs better than the incorporating-stemming approach. Table_9 shows that the difference in performance is significant.
However, this method of evaluation is biased towards the stemmed-boolean approach. On the other hand, under the cutoff with weighted transformation evaluation, the incorporating-stemming approach performs better, but not significantly, as Table_10 shows. Thus, we cannot say unequivocally which performs better. The two methods are based on two different approaches, one on boolean matching and the other on matching with an uncertainty measure, and both have their functional purposes.
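
The weighted transformation used to put the stemmed-boolean result on a fixed cutoff can be sketched in Python as below. The function name is hypothetical. The sketch assumes that, after the transformation, the number of retrieved words is taken to be c in both branches; the paper states this explicitly only for the padding case.

```python
def weighted_transform(x, retrieved_relevant, c):
    """Map a boolean result of x retrieved words onto a fixed cutoff c.

    Returns (retrieved, retrieved_relevant) after the transformation.
    """
    if x < c:
        # Pad with c - x non-relevant words: R&R is unchanged.
        return c, float(retrieved_relevant)
    # Otherwise scale the R&R count by c / x (retrieved assumed to be c).
    return c, retrieved_relevant * c / x


if __name__ == "__main__":
    print(weighted_transform(3, 2, 5))     # fewer than c words retrieved
    print(weighted_transform(20, 8, 10))   # more than c words retrieved
```

For example, a boolean match that retrieves 3 words, 2 of them relevant, is treated at cutoff 5 as 5 retrieved words with 2 relevant, while a match that retrieves 20 words, 8 of them relevant, is treated at cutoff 10 as 10 retrieved words with 4 relevant.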

Conclusions

From the experiments performed, we can conclude that the Overlap coefficient performs better than the Dice coefficient, but not significantly, and that digram matching performs significantly better than trigram matching. The use of stemming prior to digram and trigram matching significantly enhances performance. Applying stemming to both the query terms and the dictionary performs significantly better than applying stemming to the query terms alone, which in turn performs significantly better than no stemming at all. However, we cannot unequivocally conclude which is better between the conventional stemmed-boolean approach and the approach of incorporating stemming in n-gram matching. Furthermore, the two approaches are based on different paradigms, one on boolean retrieval and the other on best-match ranking retrieval, and each has advantages and disadvantages of its own.

References

Belal Mustafa, A.A., Tengku Mohd. T.S., & Mohd. Yusoff. 1995. SISDOM: A Multilingual Document Retrieval System. Asian Libraries, Vol. 4, No. 3. MCB University Press Limited, England.

Fatimah Ahmad. 1995. Satu Sistem Capaian Dokumen Bahasa Melayu: Satu Pendekatan Eksperimen Dan Analisis. Ph.D. Thesis, Universiti Kebangsaan Malaysia.

Fatimah Ahmad, Mohammed Yusoff, & Tengku Mohd. T. Sembok. 1996. Experiments with A Malay Stemming Algorithm. Journal of American Society of Information Science.

Frakes, W.B. 1992. Stemming Algorithms. In Frakes, W.B. & Baeza-Yates, R. (eds.). Information Retrieval: Data Structures & Algorithms: 131-160. Englewood Cliffs: Prentice Hall.

Marcus, R.S. 1983. An Experimental Comparison of the Effectiveness of Computers and Humans as Search Intermediaries. Journal of the American Society for Information Science 34(6): 381-404.

Porter, M.F. 1980. An Algorithm for Suffix Stripping. Program 14: 130-137.

Salton, G. & McGill, M.J. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.

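Table_1: Mean values at cutoff points to assess similarity coefficients performance

| Measure | n-gram | Coeff | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | Cutoff Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of R&R | Digrams | Dice | 2.85 | 3.40 | 3.67 | 4.04 | 4.23 | 4.42 | 4.53 | 4.63 | 4.71 | 4.85 | 4.13 |
| Number of R&R | Digrams | Overlap | 2.89 | 3.57 | 3.90 | 4.17 | 4.39 | 4.50 | 4.65 | 4.81 | 4.91 | 5.06 | 4.28 |
| Number of R&R | Trigrams | Dice | 2.78 | 3.22 | 3.53 | 3.65 | 3.75 | 3.90 | 3.94 | 4.06 | 4.08 | 4.11 | 3.70 |
| Number of R&R | Trigrams | Overlap | 2.84 | 3.36 | 3.54 | 3.72 | 3.82 | 3.94 | 4.00 | 4.08 | 4.17 | 4.26 | 3.77 |
| E | Digrams | Dice | 57.66 | 64.15 | 68.47 | 71.48 | 73.87 | 76.29 | 78.08 | 79.69 | 81.05 | 82.01 | 73.27 |
| E | Digrams | Overlap | 56.85 | 62.25 | 67.35 | 70.75 | 73.44 | 75.95 | 77.78 | 79.25 | 80.60 | 81.58 | 72.58 |
| E | Digrams | Fusion | 55.85 | 62.82 | 67.56 | 71.21 | 73.82 | 76.13 | 78.11 | 79.15 | 80.24 | 81.52 | 72.64 |
| E | Trigrams | Dice | 58.72 | 65.85 | 69.96 | 73.81 | 76.70 | 78.61 | 80.62 | 81.96 | 83.37 | 84.54 | 75.41 |
| E | Trigrams | Overlap | 58.03 | 64.33 | 70.03 | 73.62 | 76.52 | 78.58 | 80.48 | 81.93 | 83.11 | 84.14 | 75.07 |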
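
As a check on how aE is obtained, the following Python sketch averages the ten mean E values reported in Table_1 for digram matching with the Overlap coefficient. The variable and function names are hypothetical; the figures are copied from the table.

```python
from statistics import mean

# Mean E values at cutoffs 10, 20, ..., 100 (Table_1, digrams, Overlap)
E_DIGRAM_OVERLAP = [56.85, 62.25, 67.35, 70.75, 73.44,
                    75.95, 77.78, 79.25, 80.60, 81.58]


def a_e(mean_e_values):
    """Average of the mean E values over all cutoff points."""
    return round(mean(mean_e_values), 2)


if __name__ == "__main__":
    print(a_e(E_DIGRAM_OVERLAP))
```

The result, 72.58, agrees with the Cutoff Mean reported for this configuration.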

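Table_2: Sign test to determine whether Overlap coefficient performs significantly better than Dice coefficient.

| n-gram | Cutoff = | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Digrams | N | 27 | 27 | 23 | 29 | 23 | 22 | 22 | 26 | 26 | 23 |
| Digrams | negatives | 13 | 8 | 8 | 12 | 9 | 9 | 10 | 13 | 12 | 10 |
| Digrams | p | .500 | .061 | .105 | .229 | .202 | .262 | .416 | .577 | .423 | .339 |
| Digrams | significant (α = 0.05) | no | yes | no | no | no | no | no | no | no | no |
| Trigrams | N | 18 | 15 | 11 | 13 | 16 | 15 | 15 | 16 | 13 | 12 |
| Trigrams | negatives | 6 | 3 | 5 | 6 | 7 | 7 | 6 | 8 | 5 | 3 |
| Trigrams | p | .119 | .018 | .500 | .500 | .400 | .500 | .304 | .598 | .291 | .073 |
| Trigrams | significant (α = 0.05) | no | yes | no | no | no | no | no | no | no | no |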

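Table_3: Sign test to determine whether digrams performs significantly better than trigrams (overlap coefficient is used).

| Cutoff = | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|
| N | 26 | 25 | 29 | 32 | 25 | 26 | 30 | 30 | 30 | 25 |
| negatives | 12 | 6 | 5 | 6 | 0 | 2 | 3 | 5 | 6 | 3 |
| p | .423 | .007 | .001 | .000 | .000 | .000 | .000 | .000 | .001 | .000 |
| significant (α = 0.05) | no | yes | yes | yes | yes | yes | yes | yes | yes | yes |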

Table_4: Sign test to determine whether digrams with stemmed-query performs significantly better than digrams with non-stemmed query (overlap coefficient is used).

| Cutoff = | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|
| N | 34 | 30 | 32 | 30 | 30 | 32 | 31 | 31 | 31 | 29 |
| negatives | 2 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1 |
| p | .000 | .000 | .000 | .000 | .000 | .000 | .000 | .000 | .000 | .000 |
| significant (α = 0.05) | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |

Table_5: Sign test to determine whether digrams with stemmed-query and stemmed-dictionary performs significantly better than digrams with stemmed-query only (overlap coefficient is used).

| Cutoff = | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|
| N | 45 | 29 | 22 | 19 | 15 | 14 | 14 | 13 | 12 | 12 |
| negatives | 5 | 3 | 2 | 3 | 2 | 1 | 1 | 2 | 1 | 1 |
| p or z | 5.06 | .000 | .000 | .002 | .004 | .001 | .001 | .011 | .003 | .003 |
| significant (α = 0.05) | yes | yes | yes | yes | yes | yes | yes | yes | yes | yes |

Table_6: Mean values at cutoff points to assess the fusion between digrams and trigrams against digrams only (overlap coefficient is used).

| Measure | n-gram | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | Cutoff Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of R&R | Digrams | 2.89 | 3.57 | 3.90 | 4.17 | 4.39 | 4.50 | 4.65 | 4.81 | 4.91 | 5.06 | 4.28 |
| Number of R&R | Fusion | 3.06 | 3.57 | 3.78 | 4.03 | 4.22 | 4.38 | 4.53 | 4.72 | 4.91 | 4.97 | 4.21 |
| E | Digrams | 56.85 | 62.25 | 67.35 | 70.75 | 73.44 | 75.95 | 77.78 | 79.25 | 80.60 | 81.58 | 72.58 |
| E | Fusion | 54.97 | 62.59 | 68.20 | 71.62 | 74.25 | 76.50 | 78.32 | 79.54 | 80.53 | 81.79 | 72.83 |

Table_7: Mean values at cutoff points when stemming is incorporated.

| Measure | n-gram | Stemmed Query | Stemmed Dictionary | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | Cutoff Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of R&R | Digrams | yes | no | 4.19 | 5.11 | 5.45 | 5.70 | 5.82 | 5.94 | 6.00 | 6.09 | 6.10 | 6.15 | 15.1 |
| Number of R&R | Trigrams | yes | no | 4.00 | 4.64 | 4.92 | 5.28 | 5.48 | 5.57 | 5.66 | 5.70 | 5.83 | 5.86 | 14.1 |
| Number of R&R | Digrams | yes | yes | 5.26 | 6.07 | 6.21 | 6.25 | 6.27 | 6.32 | 6.35 | 6.35 | 6.36 | 6.38 | 6.18 |
| Number of R&R | Trigrams | yes | yes | 5.26 | 6.02 | 6.13 | 6.14 | 6.17 | 6.26 | 6.28 | 6.28 | 6.28 | 6.28 | 6.11 |
| Number of R&R | Fusion | yes | yes | 5.27 | 6.07 | 6.20 | 6.23 | 6.29 | 6.32 | 6.34 | 6.35 | 6.35 | 6.36 | 6.18 |
| E | Digrams | yes | no | 42.70 | 49.80 | 56.80 | 61.81 | 65.99 | 69.11 | 71.92 | 74.10 | 76.15 | 77.73 | 64.61 |
| E | Trigrams | yes | no | 45.05 | 53.14 | 60.05 | 63.91 | 67.49 | 70.69 | 73.19 | 75.44 | 76.97 | 78.59 | 66.45 |
| E | Digrams | yes | yes | 31.22 | 42.35 | 51.78 | 58.57 | 63.61 | 67.35 | 70.42 | 73.02 | 75.14 | 76.95 | 61.04 |
| E | Trigrams | yes | yes | 31.34 | 42.80 | 52.22 | 59.07 | 63.97 | 67.57 | 70.65 | 73.24 | 75.40 | 77.23 | 61.35 |
| E | Fusion | yes | yes | 31.01 | 42.33 | 51.83 | 58.62 | 63.48 | 67.35 | 70.46 | 73.02 | 75.19 | 76.99 | 61.03 |

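Table_8: The mean values to compare the stemmed-boolean and the incorporating-stemming approaches.

| Measure | Approach | Stemmed Query | Stemmed Dictionary | Variable Cutoff | Cutoff with Weighted Transformation c/x, cutoff 5 | Cutoff with Weighted Transformation c/x, cutoff 10 |
|---|---|---|---|---|---|---|
| No. Retrieved and Relevant | Digrams (overlap) | yes | yes | 5.44 | 3.46 | 5.26 |
| No. Retrieved and Relevant | Stemmed-boolean | yes | yes | 5.60 | 3.48 | 5.15 |
| E | Digrams (overlap) | yes | yes | 13.43 | 36.94 | 31.22 |
| E | Stemmed-boolean | yes | yes | 8.19 | 37.13 | 33.24 |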

Table_9: Sign test using variable cutoff to determine whether stemmed-boolean approach performs significantly better than incorporating-stemming approach (overlap coefficient is used).

| Cutoff = | Variable |
|---|---|
| N | 12 |
| negatives | 0 |
| p | .000 |
| significant (α = 0.05) | yes |

Table_10: Sign test using cutoff with weighted transformation to determine whether incorporating-stemming approach performs significantly better than stemmed-boolean approach (overlap coefficient is used).

| Cutoff = | 5 | 10 |
|---|---|---|
| N | 11 | 16 |
| negatives | 3 | 6 |
| p | .113 | .227 |
| significant (α = 0.05) | no | no |
