
International Journal of Wisdom Based Computing, Vol. 1(3), December 2011

Text Segmentation Based Pattern Search Algorithm


Radha krishna Vangipuram
Associate Professor of CSE
TallaPadmavathi College of Engineering
Somidi, Kazipet, Warangal, A.P, INDIA
VangipuramRadhakrishna@gmail.com

Sravya, Jahnavi, Sandeep
Department of CSE
Tallapadmavathi College of Engineering
Somidi, Kazipet, A.P, INDIA
jahnavi_185@yahoo.com

Anji Reddy
Department of CSE
Vaageshwari Engg College
Karimnagar, A.P
anjireddy.knr@gmail.com

Abstract: In this paper we propose a fast pattern matching algorithm based on text segmentation, in which the text is sliced into segments each equal in size to the pattern. The idea is to preprocess both the text and the pattern strings before beginning the search for the pattern in the text, so as to achieve a substantial speed-up in the search process. The experimental results show that the proposed algorithm is superior to the other algorithms compared, even when the pattern is at the end of the text.

Keywords: pattern, segmentation, pattern matching

I. INTRODUCTION

Pattern matching is one of the important areas that have been studied in the literature. In a standard formulation it is required to search for the pattern string in the text string. If the pattern string is present in the text string, we have to find the position of the first occurrence of the pattern string in the text string. This process of searching for a pattern string in a given text string is called pattern matching. Network security applications such as virus scan software, anti-spam software, and firewalls use pattern matching algorithms to extract threats from the network by tracking incoming traffic for suspicious content. When pattern matching algorithms are used in such applications, the speed of the algorithm usually forms the bottleneck. Many algorithms have been designed in the literature as improvements of the Brute Force algorithm, each of which tries to avoid the problems of existing algorithms. Still, which of the algorithms is best depends on the application where the algorithm is to be used.

Generally, pattern matching algorithms make use of a single window whose size is equal to the pattern length. The searching process starts by aligning the pattern to the left end of the text, and then the corresponding characters from the pattern and the text are compared. Character comparisons continue until a whole match is found or a mismatch occurs; in either case the window is shifted to the right by a certain distance. The shift value, the direction of the sliding window, and the order in which comparisons are made vary among pattern matching algorithms. Some pattern matching algorithms concentrate on the pattern itself [5]. Other algorithms compare the corresponding characters of the pattern and the text from left to right. Others perform character comparisons from right to left. The performance of the algorithms can be enhanced when comparisons are done in a specific order. In some algorithms, such as the Brute Force and Horspool algorithms [7], the order of comparisons is irrelevant.

II. RELATED WORKS

Several pattern matching algorithms have been developed with a view to enhancing the searching process by minimizing the number of comparisons performed [14-16]. To reduce the number of comparisons, the matching process is usually divided into two phases: the pre-processing phase and the searching phase. The pre-processing phase determines the distance (shift value) that the pattern window will move. The searching phase uses this shift value while searching for the pattern in the text with as few character comparisons as possible. In the Brute Force algorithm (BF), no pre-processing phase is performed. It compares the pattern with the text from left to right. After each attempt, it shifts the pattern by exactly one position to the right. The time complexity of the searching phase is O(mn) in the worst case, where n is the text length and m is the pattern length, and the expected number of text character comparisons is 2n. Ways to reduce the number of comparisons by moving the pattern more than one position at a time are proposed by many algorithms, such as the Boyer-Moore (BM) [11,17] and Knuth-Morris-Pratt (KMP) [6,18] algorithms. In [20] the authors propose a pattern search algorithm using two sliding windows, where the pattern was preprocessed.
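As a baseline for the comparison counts discussed above, the Brute Force behaviour can be sketched as follows. This is only an illustrative sketch: the function name brute_force_search and the choice to return the comparison count alongside the match index are assumptions made here, not part of any cited algorithm's published form.

```python
def brute_force_search(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    # Try every alignment of the pattern, shifting by exactly one position.
    for start in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or a whole match.
        while j < m:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return start, comparisons
    return -1, comparisons


index, count = brute_force_search("abcabd", "abd")
print(index, count)
```

Counting comparisons in this way is what makes the different algorithms in the results section comparable independently of machine speed.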


In this paper we propose a pattern search algorithm based on preprocessing of both the text and the pattern string, as opposed to other algorithms in which either the text or the pattern is preprocessed. The principle is based on text segmentation. The approach aims at reducing the number of comparisons, thereby substantially reducing the time taken for the search process and making it more efficient than existing algorithms.

III. PROPOSED ALGORITHM

The proposed algorithm consists of three phases:
1. Preprocessing phase of the pattern
2. Preprocessing phase of the text
3. Searching phase

A. PREPROCESSING PHASE OF PATTERN AND TEXT

In this phase, the pattern to be searched for in the given text is considered and the number of occurrences of each character in the pattern is counted. This is followed by computation of the one or more characters occurring the least number of times in the pattern, which are finally stored in c_min. In other words, c_min denotes the character(s) occurring the least number of times in the pattern. The length of the c_min array gives the number of characters whose occurrence is minimum.

Step 1: Initially, the text in which the pattern is to be searched for is divided into segments, each of whose length is equal to the pattern length.
Step 2: Each segment is then searched for the minimum occurrence character c_min. If a segment contains c_min, the number of occurrences of c_min in that segment is found. If more than one c_min exists, the occurrence count of each c_min in the segments where it was found is obtained.

In the preprocessing phase of the text, the text string is divided into a number of segments equal to n/m, where n and m are the text and pattern lengths respectively. The base address of each segment is stored in Segm_pos[] and each segment is stored in the Segm_str array. Each segment is then checked for the character appearing the minimum number of times in the pattern. If it exists in a particular segment, the number of occurrences of that character in that segment is counted. If more than one minimum occurrence character exists, the segment count is obtained for each such character and the character with the minimum segment count is chosen, say c_min. If more than one character exists with equal segment count, the number of occurrences of those characters in the respective segments is found, the maximum count is selected, and then the character with the minimum value is taken, say c_min.

Process 2: Preprocessing phase of pattern and text

In the pseudocode of this process, n denotes the pattern length.

1. Find the distinct characters in the given pattern, represented in the variable cmin_1[].

    // preprocessing module
    flag = false
    for i = 0, k = 0 to n do
        check whether the current character is available in a previous position
        if available set flag = true, otherwise set flag = false
        if (flag == false) then
            c = 0
            for j = i+1 to n do
                if (pattern[i] == pattern[j])
                    c++
                end if
            end for
            // count of number of occurrences of the char
            char_count[k] = c + 1
            char_pos[k++] = current char position in pattern
        end if
    end for

2. From the above array char_count, obtain the minimum occurrence count and store the corresponding character(s) in c_min1[], whose length is c_min_count.

3. Obtain the segment base address Segm_pos[] and store each segment in the Segm_str array.

    for i = 0 to no_of_segments do
        Segm_pos[i] = i*n
        Segm_str[i] = substring of text between index i*n and (i+1)*n
    end for
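Steps 1 to 3 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name preprocess and the variable c_min_candidates are hypothetical, collections.Counter replaces the explicit counting loops, and a shorter trailing segment is kept when the text length is not a multiple of the pattern length (the paper's no_of_segments uses integer division instead).

```python
from collections import Counter


def preprocess(text, pattern):
    """Count pattern characters, list the least frequent ones, segment the text."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    seg_len = len(pattern)

    # Step 1: occurrences of each distinct character in the pattern.
    char_count = Counter(pattern)

    # Step 2: every character whose count equals the minimum count.
    least = min(char_count.values())
    c_min_candidates = [c for c in char_count if char_count[c] == least]

    # Step 3: base index and contents of each segment.
    # Assumption: the final segment may be shorter than the pattern.
    segm_pos = list(range(0, len(text), seg_len))
    segm_str = [text[p:p + seg_len] for p in segm_pos]

    return char_count, c_min_candidates, segm_pos, segm_str


counts, candidates, positions, segments = preprocess("abracadabra", "abba")
print(candidates, positions, segments)
```

When c_min_candidates holds more than one character, the selection rule that follows decides which of them becomes c_min.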

If c_min_count > 1, obtain the segment count for each minimum occurrence character and choose the character with the minimum segment count, say c_min; otherwise go to Process 3. If more than one character exists with equal segment count, find the number of occurrences of those characters in the respective segments, select the maximum count, then take the character with the minimum value, say c_min, and go to Process 3.

B. SEARCHING PHASE

Process 1: Searching phase to check whether the pattern is at the end of the text
1. m: text length and n: pattern length
2. res = m mod n and no_of_segments = m/n
3. if (res != 0)
4.     for j = n-1, i = m-1 to 0 do
4.1.       if (text[i] == pattern[j])
4.2.           i--; j--;
5. if (j+1 == 0)
5.1.   found = true
   else
5.2.   found = false
6. if (found == true)
6.1.   Pattern match at i+1
   else
6.2.   Go to Process 2
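Process 1 can be sketched in Python as below. The function name matches_at_end and the use of -1 to signal "go to Process 2" are assumptions made for this illustration; the variable names m and n follow the Process 1 convention (m is the text length, n is the pattern length).

```python
def matches_at_end(text, pattern):
    """Return the start index if the pattern is a suffix of the text, else -1."""
    m, n = len(text), len(pattern)
    if n == 0 or n > m:
        return -1
    # Process 1 only applies when the text does not split evenly into segments.
    if m % n == 0:
        return -1
    i, j = m - 1, n - 1
    # Compare right to left, starting from the last character of both strings.
    while j >= 0 and text[i] == pattern[j]:
        i -= 1
        j -= 1
    # j + 1 == 0 means every pattern character matched.
    return i + 1 if j < 0 else -1


print(matches_at_end("abcdefg", "efg"))
```

A return value of -1 here corresponds to step 6.2, where control passes on to the preprocessing and segment-based search.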

Process 3: Searching phase if Process 1 fails

Step 1: Find the segments in which c_min appears.
Step 2: For each segment in which c_min appears, align the pattern with the segment at c_min and compare one half of the pattern first, followed by the latter half.
Step 3: If the pattern is found, return success; otherwise return failure.

IV. RESULTS AND DISCUSSIONS

Consider, for example, the following text and pattern strings:

Text: AACATCATAACCCTAATTGGCAGAGAGAGAATCA ATCGAATCA; Text length = 47
Pattern: GAATCAAT; Pattern length = 8

The figure below shows that the pattern search was done in 8 comparisons.

Several experiments have been conducted using the proposed algorithm. In each experiment, we consider Book1 from the Calgary corpus to be the text. Book1 consists of 141,274 words (752,149 characters). Patterns of different lengths are also taken from Book1. The searching process is performed using the proposed algorithm on Book1, and the index at which the pattern is found is also shown in the table below for each case. Fig. 1 shows a graphical comparison of the algorithms TSW, BR, BM, KMP, and BF.

Fig. 1: Graphical comparison of various pattern matching algorithms

Pattern   Index   Proposed    TSW   BR    BM    KMP    BF
Length            Algorithm
4         67      4           29    17    22    72     72
5         44      7           15    10    12    38     38
6         72      68          28    17    22    89     89
7         29      6           17    12    14    46     46
8         94      8           28    18    24    109    109
9         284     9           65    44    43    278    280
10        482     27          120   69    93    607    607
11        212     17          47    26    29    202    202
12        2279    58          402   226   269   2537   2541

Table 1: Number of comparisons performed to search for the first appearance of the selected pattern from the beginning of the text
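To make the segment-based search of Section III concrete, the following minimal Python sketch runs it on the example text and pattern of this section. Several points are assumptions made for illustration and not the authors' implementation: the function names are hypothetical, the tie-break among least frequent characters is simplified to "present in the fewest segments, first such character wins", the half-by-half comparison is replaced by a single slice comparison, the space inside the printed example text is dropped, and a shorter trailing segment is scanned as well.

```python
from collections import Counter


def choose_c_min(pattern, segments):
    """Pick a least frequent pattern character (simplified tie-break)."""
    counts = Counter(pattern)
    least = min(counts.values())
    candidates = [c for c in counts if counts[c] == least]
    # Prefer the candidate that occurs in the fewest text segments.
    return min(candidates, key=lambda c: sum(c in seg for _, seg in segments))


def segment_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    # Slice the text into (base index, segment) pairs of pattern length.
    segments = [(base, text[base:base + m]) for base in range(0, n, m)]
    c_min = choose_c_min(pattern, segments)
    anchor = pattern.index(c_min)  # position of c_min inside the pattern
    for base, seg in segments:
        if c_min not in seg:
            continue  # segments without c_min are skipped entirely
        for offset, ch in enumerate(seg):
            if ch != c_min:
                continue
            # Align the pattern so that its c_min sits on this text position.
            start = base + offset - anchor
            if 0 <= start <= n - m and text[start:start + m] == pattern:
                return start
    return -1


text = "AACATCATAACCCTAATTGGCAGAGAGAGAATCAATCGAATCA"
pattern = "GAATCAAT"
print(segment_search(text, pattern))
```

Because every occurrence of the pattern must place its c_min on some c_min in the text, checking only those alignments cannot miss a match, while segments without c_min cost a single membership test each.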


V. CONCLUSION

In conclusion, this paper proposes a pattern matching algorithm as an improvement over existing algorithms. The algorithm preprocesses both the text and the pattern strings, as opposed to other existing algorithms, which preprocess either the text or the pattern, or do no preprocessing at all, as in the Brute Force algorithm. The behavior of the algorithm depends on the minimum occurrence character in the pattern. The search is performed only in the segments where the minimum occurrence character of the pattern is found, thus skipping the comparisons in the segments that do not contain it, which reduces the number of comparisons performed. Furthermore, as the amount of data available in the applications where string searching is used is doubling every 3 years, the algorithms used should be efficient. Thus there is a continuing need to develop efficient algorithms for this purpose.

REFERENCES

[1] Wang, Y. and H. Kobayashi, 2006. High performance pattern matching algorithm for network security. IJCSNS, 6: 83-87. URL: http://paper.ijcsns.org/07_book/200610/200610A3.pdf
[2] Navarro, G. and M. Raffinot, 2002. Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences. First Edition. Cambridge University Press, New York. ISBN: 0-521-81307-7
[3] Crochemore, M. and W. Rytter, 2002. Jewels of Stringology. First Edition. World Scientific, Singapore. ISBN: 9789810247829
[4] Smyth, W.F., 2003. Computing Patterns in Strings. First Edition. Pearson Addison Wesley, United States. ISBN: 978-0-201-39839-7
[5] Charras, C. and T. Lecroq, 2004. Handbook of Exact String Matching Algorithms. First Edition. King's College London Publications. ISBN: 0954300645
[6] Knuth, D.E., J.H. Morris and V.R. Pratt, 1977. Fast pattern matching in strings. SIAM J. Comput., 6: 323-350.
[7] Horspool, R.N., 1980. Practical fast searching in strings. Software Practice Experience, 10: 501-506.
[8] Berry, T. and S. Ravindran, 1999. A fast string matching algorithm and experimental results. In: Proceedings of the Prague Stringology Club Workshop 99, Liverpool John Moores University, pp: 16-28.
[9] Crochemore, M. and D. Perrin, 1991. Two-way string-matching. J. ACM, 38: 651-675. DOI: http://doi.acm.org/10.1145/116825.116845
[10] Thathoo, R. et al., 2006. TVSBS: A fast exact pattern matching algorithm for biological sequences. Current Sci., 91: 47-53.
[11] Boyer, R.S. and J.S. Moore, 1977. A fast string searching algorithm. Commun. ACM, 20: 762-772. DOI: 10.1145/359842.359859
[12] Goodrich, M.T. and R. Tamassia, 2002. Algorithm Design: Foundations, Analysis and Internet Examples. First Edition. John Wiley and Sons, Inc., USA. ISBN: 0-471-38365-1
[13] He, L., F. Binxing and J. Sui, 2005. The wide window string matching algorithm. Theor. Comput. Sci., 332: 391-404. DOI: 10.1016/j.tcs.2004.12.002
[14] Hume, A. and D. Sunday, 1991. Fast string searching. Software Practice Experience, 21: 1221-1248. DOI: 10.1002/spe.4380211105
[15] Lecroq, T., 1995. Experimental results on string matching algorithms. Software Practice Experience, 25: 727-765. DOI: 10.1002/spe.4380250703
[16] Davies, G. and S. Bowsher, 1996. Algorithms for pattern matching. Software Practice Experience, 16: 575-601. DOI: 10.1002/spe.4380160608