
A Quick Guide To

PERL Regular Expressions

This is a quick reference guide for PERL regular expressions
(also known as regexps or regexes).
These tools are used to describe text as motifs or patterns
for matching, quoting, substituting or transliterating. Each
programming language (Perl, C, Java, Python...) defines its
own regular expressions, and the syntax may differ from one
language to another in small details or extensively. In this
guide we concentrate on the Perl regexp syntax; we assume that
the reader has some preliminary knowledge of Perl programming.

Perl uses a traditional Nondeterministic Finite Automaton
(NFA) match engine. This means that it compares each
element of the motif to the input string, keeping track of
the positions. The engine chooses the first leftmost match
after greedy quantifiers (i.e., those that take the longest
possible match) have matched.

References
For more information on Perl regexps and other syntaxes
you can refer to O'Reilly's book Mastering Regular
Expressions.

Examples
The following sentence, stored in $ex, will be used in all our examples:
The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345

Motif finding: match operator m//

EXPR =~ m/MOTIF/cgimosx
EXPR =~ /MOTIF/cgimosx
EXPR !~ m/MOTIF/cgimosx
EXPR !~ /MOTIF/cgimosx

Example: match any SwissProt ID for a rat protein
if ($ex =~ m/\w{2,5}_RAT/) { print "Rat entry\n"; }

will match (the matched part is UBP5_RAT):

The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345

and as a result print Rat entry.

Options
cg  continue after a failure in /g (keep the current position)
g   global matches (matches all occurrences)
i   case insensitive
m   multiline: ^ and $ also match next to an embedded newline (\n)
o   compile MOTIF only once
s   single line: dot . also matches newline (\n)
x   ignore whitespace and allow comments (#) in MOTIF

Search & Replace: substitution operator s///

EXPR =~ s/MOTIF/REPLACE/egimosx

Example: correct typo for the word rabbit
$ex =~ s/rabit/rabbit/g;

Here is the content of $ex:

The ID sp:UBP5_RAT is similar to the rabbit AC tr:Q12345

Example: find and tag any TrEMBL AC
$ex =~ s/tr:/trembl_ac=/g;

Here is the content of $ex:

The ID sp:UBP5_RAT is similar to the rabit AC trembl_ac=Q12345

Options
e   evaluate REPLACE as an expression
g   global matches (matches all occurrences)
i   case insensitive
m   multiline: ^ and $ also match next to an embedded newline (\n)
o   compile MOTIF only once
s   single line: dot . also matches newline (\n)
x   ignore whitespace and allow comments (#) in MOTIF

Quoting: quote and compile operator qr//

EXPR =~ qr/MOTIF/imosx

Example: reuse of a precompiled regexp
$myregexp = qr/\w{2,5}_\w{2,5}/;
if ($ex =~ m/$myregexp/) { print "SwissProtID\n"; }

will match (the matched part is UBP5_RAT):

The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345

and as a result will print SwissProtID.

Options
i   case insensitive
m   multiline: ^ and $ also match next to an embedded newline (\n)
o   compile MOTIF only once
s   single line: dot . also matches newline (\n)
x   ignore whitespace and allow comments (#) in MOTIF
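
The match, substitution and quoting operators described above can be tried together in a short script. This is only a minimal sketch: the variable names ($rat_id, $fixed, $id_re, $has_id, @codes) are illustrative and not part of the guide, and the motif [a-z]{2}: is an assumption chosen to pick out the database codes of the example sentence.

```perl
use strict;
use warnings;

# Example sentence used throughout the guide.
my $ex = "The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345";

# m//: test for a SwissProt rat ID; the parentheses capture the match in $1.
my $rat_id = '';
if ($ex =~ m/(\w{2,5}_RAT)/) {
    $rat_id = $1;
    print "Rat entry: $rat_id\n";       # prints "Rat entry: UBP5_RAT"
}

# s///: edit a copy so that the original sentence stays untouched.
(my $fixed = $ex) =~ s/rabit/rabbit/g;
print "$fixed\n";

# qr//: compile the motif once, then reuse it inside m//.
my $id_re  = qr/\w{2,5}_\w{2,5}/;
my $has_id = ($ex =~ m/$id_re/) ? 1 : 0;
print "SwissProtID\n" if $has_id;

# m//g in list context returns every match at once.
my @codes = $ex =~ m/[a-z]{2}:/g;
print "Database codes: @codes\n";       # prints "Database codes: sp: tr:"
```

Copying the sentence before applying s/// keeps $ex unchanged, so every later example still starts from the original text.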

Character classes
[...]   Match any one character of a class
[^...]  Match any one character not in the class
.       Match any character except newline ([^\n]); in
        single-line mode (/s) it also matches newline
\d      Any digit. Equivalent to [0-9] or [[:digit:]]
\D      Any non-digit.
\s      Any whitespace. [ \t\n\r\f] or [[:space:]]
\S      Any non-whitespace.
\w      Any word character. [a-zA-Z0-9_] or [[:alnum:]_]
\W      Any non-word character. Warning: \w != \S
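
The character classes above, including the warning that \w and \S are not the same, can be checked on the example sentence of this guide. The following is a sketch under stated assumptions: the variable names are hypothetical and the sentence is stored in $ex as in the other examples.

```perl
use strict;
use warnings;

my $ex = "The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345";

# \d+ : every run of digits.
my @digits = $ex =~ /\d+/g;             # ("5", "12345")

# [A-Z]+ : every run of uppercase letters (a bracketed class with a range).
my @upper = $ex =~ /[A-Z]+/g;           # ("T", "ID", "UBP", "RAT", "AC", "Q")

# [^:]+ : negated class, everything up to the first colon.
my ($before_colon) = $ex =~ /([^:]+)/;  # "The ID sp"

# \w and \S differ: ":" is a non-whitespace character but not a word character.
my @words  = $ex =~ /\w+/g;             # "sp" and "UBP5_RAT" are separate items
my @chunks = $ex =~ /\S+/g;             # "sp:UBP5_RAT" is a single item

print "digits: @digits\n";
print "upper:  @upper\n";
print "before first colon: $before_colon\n";
print scalar(@words), " \\w+ matches, ", scalar(@chunks), " \\S+ matches\n";
```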

POSIX Character class
[[:class:]]
class can be any of:
alnum alpha ascii blank cntrl digit graph lower
print punct space upper xdigit

Special characters
\a      alert (bell)
\b      backspace (inside a character class only; elsewhere
        \b is a word boundary)
\e      escape
\f      form feed
\n      newline
\r      carriage return
\t      horizontal tabulation
\nnn    octal nnn
\xnn    hexadecimal nn
\cX     control character X

Repetitions
?       Zero or one occurrence of the previous item.
*       Zero or more occurrences of the previous item.
+       One or more occurrences of the previous item.
{n,m}   Match the previous item at least n times but no
        more than m times.
{n,}    Match n or more times
{n}     Match exactly n times
{}?     Non-greedy match (i.e., match the shortest string);
        a ? can follow any of the quantifiers above
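
The difference between greedy and non-greedy repetition is easiest to see side by side. The sketch below assumes the example sentence of this guide in $ex; the motifs s.*: and \d{2,4} and the variable names are illustrative choices, not taken from the guide.

```perl
use strict;
use warnings;

my $ex = "The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345";

# Greedy: .* takes as much as possible, so the match runs to the LAST colon.
my ($greedy) = $ex =~ /(s.*:)/;
print "greedy: $greedy\n";   # prints "greedy: sp:UBP5_RAT is similar to the rabit AC tr:"

# Non-greedy: .*? takes as little as possible, so the match stops at the FIRST colon.
my ($lazy) = $ex =~ /(s.*?:)/;
print "lazy:   $lazy\n";     # prints "lazy:   sp:"

# {n,m}: at least 2 and at most 4 digits; the lone "5" in UBP5 is skipped.
my ($some_digits) = $ex =~ /(\d{2,4})/;
print "digits: $some_digits\n";   # prints "digits: 1234"
```

Both motifs start at the same leftmost position (the s of sp:); only the amount of text taken by the quantifier changes.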

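Anchors
^ or \A  Match beginning of the string (with /m, ^ also
         matches at the beginning of each line)
$ or \Z  Match end of the string, or just before a final
         newline (with /m, $ also matches at the end of
         each line)
\z       Match the very end of the string in any match mode
\b       Match word boundary
\B       Match non-word boundary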
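
The anchors above behave differently with and without the /m option, and \z differs from $ when the string ends with a newline. The following sketch illustrates this on a two-line string; $text and the other variable names are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Two lines, each ending with a newline.
my $text = "sp:UBP5_RAT\ntr:Q12345\n";

# Without /m, ^ matches only at the very beginning of the string.
my @first_only = $text =~ /^(\w+):/g;     # ("sp")

# With /m, ^ also matches at the beginning of each line.
my @each_line  = $text =~ /^(\w+):/mg;    # ("sp", "tr")

# $ and \Z accept a final newline after the match; \z does not.
my $dollar  = ($text =~ /Q12345$/)  ? 1 : 0;   # 1
my $big_z   = ($text =~ /Q12345\Z/) ? 1 : 0;   # 1
my $small_z = ($text =~ /Q12345\z/) ? 1 : 0;   # 0

# "_" is a word character, so there is no word boundary between "_" and "R".
my $rat_bounded = ($text =~ /\bRAT\b/) ? 1 : 0;   # 0
my $rat_end     = ($text =~ /RAT\b/)   ? 1 : 0;   # 1

print "without /m: @first_only\n";
print "with /m:    @each_line\n";
print "\$=$dollar \\Z=$big_z \\z=$small_z\n";
print "\\bRAT\\b=$rat_bounded RAT\\b=$rat_end\n";
```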

Capture & Grouping

(...)   Group several characters together for later use or
        capture as a single unit
|       Match either subexpression (equivalent to OR)

Example: match any database code in the list
$ex =~ m/(sp:|tr:|rs:)/g;

will match:

The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345

\n      Back reference (n is a number: \1, \2, ...). Match
        the same text as the captured group number n that
        was previously matched in the same MOTIF.
$n      Substring of captured group n

Example: match several instances with back reference
$ex =~ m/(the).+\1/i;

will match (from The up to the second the):

The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345

Example: rename any tr:AC to trembl_AC= using a capture
$ex =~ s/tr:([[:alnum:]]{6})/trembl_AC=$1/gi;

Here is the content of $ex:

The ID sp:UBP5_RAT is similar to the rabit AC trembl_AC=Q12345

Text-span modifiers
\Q      Quote following metacharacters until \E or end of
        motif (allows interpolated scalars to be matched
        literally)
\u      Force next character to uppercase
\l      Force next character to lowercase
\U      Force all following characters to uppercase
\L      Force all following characters to lowercase
\E      End a span started with \Q, \U or \L

Extended Regexp
(?#...)   Substring ... is a comment
(?=...)   Positive lookahead. Match if ... matches next,
          without consuming it (e.g., allows overlapping
          matches in global mode)
(?!...)   Negative lookahead. Match if ... does not match next
(?<=...)  Positive lookbehind. Fixed length only.
(?<!...)  Negative lookbehind. Fixed length only.
(?imsx)   Modify matching options

Transliteration: translate operator tr///

EXPR =~ tr/SEARCHLIST/REPLACELIST/cds

Transliteration is not a regular expression and does not use
one, but it is frequently associated with regexps in PERL. Thus
we decided to include it in this guide.

Example: reverse and complement a DNA sequence
$DNA = "AAATATTTCATCGTACAT";
$revcom = reverse $DNA;
$revcom =~ tr/ACGTacgt/TGCAtgca/;

The transliteration will produce the following:

print($DNA);
AAATATTTCATCGTACAT
print($revcom);
ATGTACGATGAAATATTT

Options
c   complement SEARCHLIST
d   delete found characters that have no replacement
s   squeeze runs of the same replaced character into one

Unicode matches
Perl 5.8 supports Unicode 3.2. However, it would take too
long to describe all the properties in detail here. For more
information see Mastering Regular Expressions.

\p{PROP}  Matches any character with the Unicode property PROP
\P{PROP}  Matches any character without the Unicode property PROP
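
Captures, back references, lookaround and \Q described above can be combined in a short script. This is a minimal sketch, assuming the example sentence of this guide in $ex; all variable names and the motifs other than (the).+ with a back reference are illustrative and not taken from the guide.

```perl
use strict;
use warnings;

my $ex = "The ID sp:UBP5_RAT is similar to the rabit AC tr:Q12345";

# Captures: with /g in list context, (code, identifier) pairs fill a hash.
my %ids = $ex =~ /(sp|tr):(\w+)/g;
print "$_ => $ids{$_}\n" for sort keys %ids;   # sp => UBP5_RAT, tr => Q12345

# Back reference: \2 must repeat what group 2 captured ("The", then "the" with /i).
my ($span) = $ex =~ /((the).+\2)/i;
print "$span\n";             # prints "The ID sp:UBP5_RAT is similar to the"

# Positive lookahead: word characters that are followed by "_RAT",
# without including "_RAT" in the match.
my ($prefix) = $ex =~ /(\w+)(?=_RAT)/;         # "UBP5"

# Positive lookbehind: word characters that are preceded by "tr:".
my ($ac) = $ex =~ /(?<=tr:)(\w+)/;             # "Q12345"

# \Q...\E: an interpolated scalar is matched literally.
my $dot     = 'tr.Q';
my $as_meta = ($ex =~ /$dot/)      ? 1 : 0;    # 1: "." matches the ":" of "tr:Q"
my $as_text = ($ex =~ /\Q$dot\E/)  ? 1 : 0;    # 0: no literal "tr.Q" in $ex

print "prefix=$prefix ac=$ac meta=$as_meta literal=$as_text\n";
```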

This document was written and designed by Laurent Falquet
and Vassilios Ioannidis from the Swiss EMBnet node and is
distributed by the P&PR Publications Committee of EMBnet.
EMBnet - European Molecular Biology Network - is a
network of bioinformatics support centers situated
primarily in Europe. Most countries have a
national node which can provide training courses and other
forms of help for users of bioinformatics software.
You can find information about your national node from the
EMBnet site:
http://www.embnet.org/

A Quick Guide To PERL Regular Expressions


First edition 2005
