
String Processing

Engr. Tazeen Muzammil


Basic Terminologies
• Each programming language contains a character
set that is used to communicate with the computer.
The character set usually includes the following:
• Alphabet: A, B, C, D, ..., Z
• Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
• Special characters: +, -, /, *, ^, &, %, =, etc.
• A finite sequence of 0 or more characters is called a
string.
• The number of characters in a string is called its
length.
• The string with zero characters is called the empty
string or null string.
Storing Strings
Strings are stored in three types of
structures:

1. Fixed-length structure
2. Variable-length structure
3. Linked Structure
Fixed-Length Storage
• Record-Oriented
– In fixed-length storage each line of print is viewed as a
record, where all records have the same length, i.e.
each record accommodates the same number of
characters. We assume a record length of 80 unless
otherwise stated.
• Suppose the input consists of a program. Using a
record-oriented, fixed-length storage medium,
the input data will appear in memory as shown in
the figure, where we assume that 200 is the address
of the first character of the program.
Program

C PROGRAM PRINTING TWO INTEGERS IN INCREASING ORDER
READ *,J,K
IF(J.LE.K) THEN
PRINT *,J,K
ELSE
PRINT *,K,J
ENDIF
STOP
END
Records stored sequentially in the
computer
[Figure: the program stored as consecutive 80-character records. The comment record (C PROGRAM PRINTING TWO ...) begins at address 200, the READ *,J,K record at 280, the IF ... THEN record at 360, and the END record at 840.]
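The address arithmetic behind the figure can be sketched in C++. This is only an illustration: the record length of 80 and the base address of 200 come from the slides, while the names `recordAddress` and `padRecord` are hypothetical helpers introduced here.

```cpp
#include <cstddef>
#include <string>

// Values taken from the slides: 80-character records, with the first
// character of the program stored at address 200.
const std::size_t kRecordLength = 80;
const std::size_t kBaseAddress = 200;

// Address of the first character of the k-th record (k starts at 1).
// Hypothetical helper, not part of the slides.
std::size_t recordAddress(std::size_t k) {
    return kBaseAddress + (k - 1) * kRecordLength;
}

// Store one line as a fixed-length record: short lines are padded with
// blanks, and longer lines are cut off at the record length.
std::string padRecord(const std::string& line) {
    std::string record = line.substr(0, kRecordLength);
    record.resize(kRecordLength, ' ');
    return record;
}
```

The program has nine records, so `recordAddress(9)` evaluates to 840, the address shown for END. Any record can be located directly this way, but `padRecord("STOP")` still occupies all 80 characters.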
Advantages and Disadvantages
• Advantages
– The ease of accessing data from any given record
– The ease of updating data in any given record (as
long as the length of the new data does not
exceed the record length)
• Disadvantages
– Time is wasted reading an entire record if most of
the storage consists of blank spaces.
– Certain records may require more space than is
available.
– When the correction consists of more or fewer
characters than the original text, changing a
misspelled word requires the entire record to be
changed.
Variable-Length Storage with
Fixed Maximum
• Although strings may be stored in fixed-length
memory locations as above, there are advantages
in knowing the actual length of each string; one
does not have to read the entire record when the
string occupies only the beginning part of the
memory location.
• The storage of variable-length strings in memory
cells with fixed lengths can be done in two
general ways:
1. One can use a marker, such as two $$ signs, to signal
the end of the string.
2. One can list the length of the string as an additional
item in the pointer array.
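Both ways can be sketched in C++. This is a minimal illustration, not the slides' own code: the cell length of 80 is carried over from the record example, and the names `storeWithMarker`, `loadWithMarker`, `LengthEntry`, `storeWithLength` and `loadWithLength` are hypothetical. The struct stands in for a pointer-array entry that carries the length.

```cpp
#include <cstddef>
#include <string>

const std::size_t kCellLength = 80;  // assumed fixed maximum per cell

// Way 1: store the text followed by the end marker "$$", then pad with
// blanks. Assumes text plus marker fits in the cell and that the text
// itself does not contain "$$".
std::string storeWithMarker(const std::string& text) {
    std::string cell = text + "$$";
    cell.resize(kCellLength, ' ');
    return cell;
}

// Read only up to the marker instead of using the whole cell.
std::string loadWithMarker(const std::string& cell) {
    return cell.substr(0, cell.find("$$"));
}

// Way 2: keep the actual length alongside the fixed-length cell.
struct LengthEntry {
    std::size_t length;  // number of meaningful characters
    std::string cell;    // blank-padded storage
};

LengthEntry storeWithLength(const std::string& text) {
    std::string cell = text.substr(0, kCellLength);
    const std::size_t length = cell.size();
    cell.resize(kCellLength, ' ');
    return LengthEntry{length, cell};
}

std::string loadWithLength(const LengthEntry& entry) {
    return entry.cell.substr(0, entry.length);
}
```

The marker version has to scan for "$$", while the length version can copy the right number of characters immediately.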
Linked Storage
• A computer must be able to correct and
modify the printed matter, which usually
means deleting, changing, and inserting
words, phrases, sentences and even
paragraphs in the text. Fixed-length
memory cells do not easily lend
themselves to these operations. For this
reason strings are stored by means of
linked lists.
Linked List

• A linked list, or one-way list, is a
linear collection of data elements
called nodes, where the linear order is
given by means of pointers.
Linked Lists

[Figure: Head points to node A, which links to B, then C, then ∅]

• A linked list is a series of connected
nodes
• Each node contains at least
– A piece of data (any type)
– Pointer to the next node in the list
• Head: pointer to the first node
• The last node points to NULL

[Figure: a node holding the data A and a pointer]
Linked Storage
• Strings may be stored in a linked list as follows. Each
memory cell is assigned one character or a fixed
number of characters, and a link contained in the cell
gives the address of the cell containing the next
character or group of characters in the string. For
example:
To be or not to be, that is the question.
Linked Storage

[Figure: linked list with nodes T, O, B, ...]

One character per node

[Figure: linked list with nodes holding T O B E, O R ..., ...]

Four characters per node

String Operations
• Substring (substr(pos,len))
– Accessing a substring from a given string requires two pieces of information:
1. The position of the first character of the substring, and
2. The length of the substring.

• Indexing (find())
– Indexing refers to finding the location of a substring within a string.
find(string)
find(string, positionFirstChar)
find(string, positionFirstChar, len)
rfind() (finds the last occurrence of a string or substring)

• Concatenation
– String concatenation is the operation of joining two character strings end to end.
For example, the strings "snow" and "ball" may be concatenated to give "snowball".

• Length (length(), size())
– The number of characters in the string is called the length or size of the string.
Example
string s = "abc def abc";
string s2 = "abcde uvwxyz";
char c;
char ch[] = "aba daba do";
char *cp = ch;
• Substring
s = s2.substr(1,4);
s = s2.substr(1,50);
• Length
i = s.length();
i = s.size();
• Concatenation
s2 = s2 + "x";
s2 += "x";
• Find
i = s.find("ab",4);
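The calls above can be collected into a small self-contained C++ sketch. The struct `StringOpsResult` and the function `runStringOps` are hypothetical names used only to return the values. The single space in `s2` is an assumption, because the literal is broken across lines in the slide.

```cpp
#include <cstddef>
#include <string>

// Holds the results of the slide's example calls (illustrative only).
struct StringOpsResult {
    std::string sub4;       // s2.substr(1, 4)
    std::string sub50;      // s2.substr(1, 50)
    std::size_t length;     // s.length()
    std::size_t size;       // s.size()
    std::string concat;     // s2 with "x" appended twice
    std::size_t found;      // s.find("ab", 4)
    std::size_t foundLast;  // s.rfind("abc")
};

StringOpsResult runStringOps() {
    std::string s = "abc def abc";
    std::string s2 = "abcde uvwxyz";  // assumed: one space between the parts
    StringOpsResult r;
    r.sub4 = s2.substr(1, 4);    // 4 characters starting at index 1: "bcde"
    r.sub50 = s2.substr(1, 50);  // a length past the end is clipped to the end
    r.length = s.length();       // 11
    r.size = s.size();           // same value as length()
    r.concat = s2 + "x";         // operator+ builds a new string
    r.concat += "x";             // operator+= appends in place
    r.found = s.find("ab", 4);   // search starts at index 4 and finds index 8
    r.foundLast = s.rfind("abc");  // last occurrence, also index 8
    return r;
}
```

Positions in `std::string` start at 0, so the value 8 returned by `find` corresponds to the ninth character.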
Word Processing
• The operations usually associated with
word processing are:
– Replacement
• Replacing one string in the text by another
replace(pos1, len1, string)
replace(pos1, len1, string, pos2, len2)
– Insertion
• Inserting a string in the middle of the text
insert()
– Deletion
• Deleting a string from the text.
erase(positionFirstChar)
erase(positionFirstChar,len)
Example
string s = "abc def abc";
string s2 = "abcde uvwxyz";
char c;
char ch[] = "aba daba do";
char *cp = ch;
• Replace
s.replace(4,3,"x");
• Erase
s.erase(4,5);
s.erase(4);
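A short C++ sketch of what these calls produce when each one is applied to a fresh copy of "abc def abc". The function names are hypothetical wrappers. The `insert` call with a position and a string is one of the standard `std::string::insert` overloads, chosen here as an assumption since the slide lists `insert()` without arguments.

```cpp
#include <string>

// Replace 3 characters starting at index 4 ("def") by "x".
std::string replaceDemo() {
    std::string s = "abc def abc";
    s.replace(4, 3, "x");
    return s;  // "abc x abc"
}

// Erase 5 characters starting at index 4 ("def a").
std::string eraseRangeDemo() {
    std::string s = "abc def abc";
    s.erase(4, 5);
    return s;  // "abc bc"
}

// Erase everything from index 4 to the end of the string.
std::string eraseTailDemo() {
    std::string s = "abc def abc";
    s.erase(4);
    return s;  // "abc " (the trailing blank is kept)
}

// Insert "xyz " before the character at index 4.
std::string insertDemo() {
    std::string s = "abc def abc";
    s.insert(4, "xyz ");
    return s;  // "abc xyz def abc"
}
```

In the slide the two `erase` calls run one after the other on the same string, so there the second call acts on the result of the first.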
Question
A. A text T and a pattern P are in
memory. Write an algorithm which
deletes every occurrence of P in T.
1. [Find the index of P] Set K = Find(T,P)
2. Repeat while K != 0
a) [Delete P from T]
Set T = Delete(T, Find(T,P), Length(P))
b) [Update index] Set K = Find(T,P)
[End of loop]
3. Write T
4. Exit
B. A text T and patterns P and Q are
in memory. Write an algorithm
which replaces every occurrence of
P in T by Q.
1. [Find the index of P] Set K = Find(T,P)
2. Repeat while K != 0
a) [Replace P by Q]
Set T = Replace(T,P,Q)
b) [Update index] Set K = Find(T,P)
[End of loop]
3. Write T
4. Exit
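Both answers can be sketched in C++ with `std::string`. This is a sketch under stated assumptions: `deleteAll` and `replaceAll` are hypothetical names, `std::string::npos` plays the role of index 0 ("not found"), and an empty pattern is returned unchanged. In `replaceAll` the search resumes after the inserted Q. This departs from the pseudocode, and it is done so that the loop still ends when Q contains P.

```cpp
#include <cstddef>
#include <string>

// Part A: delete every occurrence of p from t.
std::string deleteAll(std::string t, const std::string& p) {
    if (p.empty()) return t;            // guard: an empty pattern always matches
    std::size_t k = t.find(p);          // [Find the index of P]
    while (k != std::string::npos) {    // repeat while P still occurs in T
        t.erase(k, p.size());           // [Delete P from T]
        k = t.find(p);                  // [Update index], rescanning from the start
    }
    return t;
}

// Part B: replace every occurrence of p in t by q.
std::string replaceAll(std::string t, const std::string& p,
                       const std::string& q) {
    if (p.empty()) return t;            // guard: an empty pattern always matches
    std::size_t k = t.find(p);          // [Find the index of P]
    while (k != std::string::npos) {
        t.replace(k, p.size(), q);      // [Replace P by Q]
        k = t.find(p, k + q.size());    // [Update index], skipping the new Q
    }
    return t;
}
```

Because `deleteAll` rescans from the start, occurrences created by a deletion are removed too: `deleteAll("aabb", "ab")` yields the empty string.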
Pattern matching Algorithm
• Given strings T (text) and P (pattern), the
pattern matching problem consists of finding a
substring of T equal to P
• T: “the rain in spain stays mainly on the plain”
• P: “n th”
• We assume that the length of the pattern does not
exceed the length of the text.
• Applications:
– Text editors
– Web search engines (e.g. Google)
The Brute Force Algorithm
• Check each position in the text T to
see if the pattern P starts in that position

[Figure: two panels showing T: a n d r e w with P: r e w aligned at successive positions]

P moves 1 char at a time through T
The Brute Force Algorithm
• The first pattern matching algorithm is the one in which we compare a given
pattern P with each of the substrings of T, moving from left to right, until we
get a match.
• Let Wk denote the substring of T having the same length as P and beginning
with the Kth character of T.
Wk = Substring(T,K,LENGTH(P))
• First we compare P, character by character, with the first substring W1.
• If all the characters are the same, then P = W1 and so P appears in T and
Index(T,P)=1.
• If some character of P is not the same as the corresponding character of W1, then
P is not equal to W1 and we can move on to the next substring W2.
• The process stops when we find a match of P with some substring Wk, and
so P appears in T and Index(T,P)=K, or
• We exhaust all the Wk with no match, which means P does not appear in T.
• The maximum value of K is equal to Length(T)-Length(P)+1.
The Brute Force Algorithm
• P and T are strings with lengths R and S, respectively, and are stored
as arrays with one character per element. The algorithm finds the
Index of P in T
1. [Initialize] Set K=1 and MAX=S-R+1
2. Repeat Steps 3 to 5 while K<=MAX
3. Repeat for L=1 to R [Test each character of P]
If P[L]!=T[K+L-1], then: Go to Step 5.
[End of inner loop]
4. [Success] Set INDEX=K, and Exit
5. Set K=K+1
[End of Step 2 outer loop]
6. [Failure] Set INDEX=0
7. Exit.
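The steps above can be sketched in C++ as follows. The function name `bruteForceIndex` is hypothetical. It keeps the pseudocode's convention of 1-based positions with 0 meaning failure, and it shifts the subscripts by one because C++ strings are indexed from 0.

```cpp
#include <string>

// Returns the 1-based index of the first occurrence of p in t,
// or 0 if p does not occur in t (the INDEX value of the pseudocode).
int bruteForceIndex(const std::string& t, const std::string& p) {
    const int s = static_cast<int>(t.size());  // S = length of T
    const int r = static_cast<int>(p.size());  // R = length of P
    const int maxK = s - r + 1;                // Step 1: MAX = S - R + 1
    for (int k = 1; k <= maxK; ++k) {          // Step 2, with Step 5 as ++k
        int l = 1;
        // Step 3: compare P[L] with T[K+L-1]; stop at the first mismatch.
        while (l <= r && p[l - 1] == t[k + l - 2]) {
            ++l;
        }
        if (l > r) {
            return k;                          // Step 4: success, INDEX = K
        }
    }
    return 0;                                  // Step 6: failure, INDEX = 0
}
```

For example, `bruteForceIndex("andrew", "rew")` returns 4, and a pattern longer than the text returns 0 because MAX is then less than 1.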
Analysis
• Brute force pattern matching runs in time O(mn) in the worst case,
where m is the length of P and n is the length of T.

• But most searches of ordinary text take
O(m+n), which is very quick.

• Example of a worst case:
– T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
– P: "aaah"

• Example of a more average case:
– T: "a string searching example is standard"
– P: "store"
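The difference between the two cases can be made concrete by counting character comparisons. The sketch below is illustrative: `countComparisons` is a hypothetical helper that runs the brute force search and returns how many comparisons were made before the search stopped.

```cpp
#include <string>

// Counts the character comparisons made by brute force matching of p
// against t, stopping at the first match or when the text is exhausted.
long countComparisons(const std::string& t, const std::string& p) {
    const int n = static_cast<int>(t.size());
    const int m = static_cast<int>(p.size());
    long comparisons = 0;
    for (int k = 0; k + m <= n; ++k) {  // each alignment of P against T
        int j = 0;
        while (j < m) {
            ++comparisons;              // one comparison of P[j] with T[k+j]
            if (p[j] != t[k + j]) {
                break;                  // mismatch: move P one position on
            }
            ++j;
        }
        if (j == m) {
            return comparisons;         // full match found
        }
    }
    return comparisons;
}
```

With a text of 26 a's followed by h (an assumed length, chosen for the arithmetic) and P = "aaah", every alignment costs 4 comparisons: 23 failed alignments and one match, 96 in total. In ordinary text most alignments fail on the first character, so the count stays close to n.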
The Boyer-Moore Algorithm
• The Boyer-Moore pattern matching algorithm is based
on two techniques.

• 1. The looking-glass technique
– find P in T by moving backwards through P, starting at its end
• 2. The character-jump technique
– when a mismatch occurs at T[i] = x, i.e. the character
in the pattern P[j] is not the same as T[i]

• There are 3 possible cases, tried in order.
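The three cases are not listed in this part of the slides, so the sketch below does not try to reproduce them. It shows one common simplified form of Boyer-Moore that uses only a last-occurrence table for the character jump. The names `buildLast` and `boyerMooreIndex` and the 256-entry table are assumptions made for illustration. The result is a 0-based index, or -1 if P does not occur.

```cpp
#include <algorithm>
#include <array>
#include <string>

// last[c] = largest index j with p[j] == c, or -1 if c is not in p.
std::array<int, 256> buildLast(const std::string& p) {
    std::array<int, 256> last;
    last.fill(-1);
    for (int j = 0; j < static_cast<int>(p.size()); ++j) {
        last[static_cast<unsigned char>(p[j])] = j;
    }
    return last;
}

// Simplified Boyer-Moore (character jump only); returns the 0-based
// index of the first occurrence of p in t, or -1 if there is none.
int boyerMooreIndex(const std::string& t, const std::string& p) {
    const int n = static_cast<int>(t.size());
    const int m = static_cast<int>(p.size());
    if (m == 0) return 0;   // empty pattern matches at the start
    if (m > n) return -1;   // pattern longer than text
    const std::array<int, 256> last = buildLast(p);
    int i = m - 1;          // position in T
    int j = m - 1;          // position in P, starting at its end
    while (i <= n - 1) {
        if (p[j] == t[i]) {
            if (j == 0) return i;  // all of P matched
            --i;                   // looking-glass: move backwards
            --j;
        } else {
            // Character jump: shift P using the last occurrence of T[i].
            const int lo = last[static_cast<unsigned char>(t[i])];
            i += m - std::min(j, 1 + lo);
            j = m - 1;
        }
    }
    return -1;
}
```

On T = "andrew" and P = "rew", the first comparison is T[2] = d against P[2] = w. Since d does not occur in P, the pattern jumps three positions at once and matches at index 3.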
