
Advanced - Regular Expressions Tutorial https://ryanstutorials.net/regular-expressions-tutorial/regular-expressions...


Now there is looking back.

Now that you've got a feel for regular expressions, we'll add a
bit more complexity. In demonstrating the features on this
page we will also be using features introduced in the Basic
(./regular-expressions-basics.php) and Intermediate
(./regular-expressions-intermediate.php) sections of this
tutorial. If some of this stuff seems a bit confusing it may be
worth reviewing those sections first. Once you complete this
section (and understand it) you won't be a complete Regular
Expressions guru but you will be well on your way and you
should be armed with enough Regular Expressions ammo to
tackle the majority of problems you encounter.

We may group several characters together in our regular expression using brackets '( )' (also referred to as
parentheses). There are then various things which can be done with that group. Some of these we'll look at further
down this page. They also allow us to add a multiplier to that group of characters (as a whole).

So, for instance, we may want to find out if a particular person is mentioned. Their name is John Reginald Smith but
the middle name may or may not be present.

John (Reginald )?Smith

John Reginald Smith is sometimes just called John Smith.


Tip

Notice where the spaces are and aren't in the regular expression above. It's important to remember that
they are part of your regular expression, so you need to make sure they appear exactly where they should
and nowhere else.

The above tip is very important and a common source of problems when people first start playing with regular
expressions. Below is a common mistake that people make.

John (Reginald)? Smith

The problem with this regular expression is that it will match John Reginald Smith perfectly fine, and John  Smith
(two spaces between John and Smith), but not John Smith (one space). Can you see why?
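If you'd like to check this for yourself, here is a minimal Python sketch (using the built-in re module) that runs both patterns against the three spellings. The variable names are just for illustration and aren't part of the tutorial.

```python
import re

# The correct pattern: the space travels with the optional middle name.
good = re.compile(r"John (Reginald )?Smith")
# The common mistake: a space is required on both sides of the optional group.
bad = re.compile(r"John (Reginald)? Smith")

names = ["John Reginald Smith", "John Smith", "John  Smith"]  # last one has two spaces

# Map each name to (matches good pattern, matches bad pattern).
results = {
    name: (bool(good.fullmatch(name)), bool(bad.fullmatch(name)))
    for name in names
}

for name, (is_good, is_bad) in results.items():
    print(f"{name!r}: good={is_good} bad={is_bad}")
```

In the mistaken pattern both spaces sit outside the optional group, so they are always required, even when Reginald is absent.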

We aren't limited to just normal characters in the brackets. You may include special characters in there (including
multipliers) as well.

For instance, maybe we would like to find instances of IP addresses. An IP address is a set of 4 numbers (between
0 and 255) separated by full stops (eg. 192.168.0.5).

\b(\d{1,3}\.){3}\d{1,3}\b

The server has an address of 10.18.0.20 and the printer has an address of 10.18.0.116.

Let's break it down as this is starting to get a little complex:

\b indicates a word boundary so we can be sure the IP address is not part of something else.
We have broken the IP address into 3 chunks, each consisting of a number and a full stop, followed by a final
number.
In the brackets we handle the first 3 chunks: \d{1,3} indicates we are looking for between 1 and 3 digits, and
we remember to escape the full stop to remove its special meaning. We are looking for exactly 3 of these so
we place the multiplier {3} just outside the brackets.
Finally we include the fourth number with \d{1,3} and end with another word boundary.
Note that \d{1,3} matches any number from 0 to 999, so this expression checks the shape of an IP address but
not that each number is between 0 and 255.
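The breakdown above can be tried out with a short Python sketch. The function name find_ip_addresses is made up for this example, and the extra range check in plain Python is our own addition, since \d{1,3} on its own will happily accept a number such as 999.

```python
import re

IP_PATTERN = re.compile(r"\b(\d{1,3}\.){3}\d{1,3}\b")

def find_ip_addresses(text):
    """Return candidates that look like IP addresses and whose
    four numbers are all between 0 and 255."""
    found = []
    for match in IP_PATTERN.finditer(text):
        candidate = match.group(0)  # the whole match, not just the bracketed group
        if all(int(part) <= 255 for part in candidate.split(".")):
            found.append(candidate)
    return found

sample = ("The server has an address of 10.18.0.20 and the printer "
          "has an address of 10.18.0.116.")
print(find_ip_addresses(sample))
```

We use finditer and group(0) here because the pattern contains brackets, and some functions (such as re.findall) return only the bracketed part when a group is present.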

The above expression uses elements that have been covered in the previous sections of this tutorial. Be sure to
review these sections if need be.

Tip

As you can see, regular expressions can soon get hard to read once you get various brackets and
backslashes in there. This makes it easy to make silly mistakes by missing or misplacing one of these
characters and the mistakes can be hard to spot. Remember the strategies (./#learning) for handling this.


Whenever we match something within brackets, that value is actually stored in a variable which we may refer to later
on in the regular expression. To access these variables we use the escape character ( \ ) followed by a digit. The
first set of brackets is referred to with \1, the second set of brackets with \2 and so on.

Let's say we want to find lines with two mentions of a person whose last name is Smith. We don't know what their first
name may be, however. We could do the following:

(\b[A-Z]\w+\b) Smith.*\1 Smith

Harold Smith went to meet John Smith but John Smith was not there.

In the above example you'll notice that we matched the text between the two instances of John Smith as well, but in
this case that is ok as we are not too concerned with what was matched, only that there was a match.
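Here is a small Python sketch of the back reference at work on the sample line. The variable names are illustrative only.

```python
import re

# \1 must repeat exactly what the first set of brackets matched.
pattern = re.compile(r"(\b[A-Z]\w+\b) Smith.*\1 Smith")

line = "Harold Smith went to meet John Smith but John Smith was not there."
match = pattern.search(line)

print(match.group(1))  # the first name remembered by the brackets
print(match.group(0))  # everything that was matched, including the text in between

# A line where no first name appears twice before Smith does not match.
no_repeat = pattern.search("Harold Smith went to meet John Smith.")
print(no_repeat)
```

Harold is tried first, but there is no second Harold Smith later in the line, so the search moves on and succeeds with John.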

With alternation we are looking for something or something else. We have seen a very basic example of alternation
with the range operator (./regular-expressions-basics.php#ranges). This allows us to perform alternation with a
single character, but sometimes we would like to perform the operation with a larger set of characters. We can
achieve this with the pipe symbol ( | ) which means or.

So for instance, if we wanted to find all instances of either 'dog' or 'cat' we could do the following:

dog|cat

Harold Smith has two dogs and one cat.

We can also use more than one | to include more options.

dog|cat|bird

Harold Smith has two dogs, one cat and three birds.

Maybe we only want alternation to happen on a part of the regular expression instead of the whole regular
expression. To achieve this we use brackets.

Maybe we want to match Harold Smith or John Smith but not any other Smith.

(John|Harold) Smith

Harold Smith went to meet John Smith but instead bumped into Jane Smith.
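The alternation examples above can be run with a short Python sketch like the one below. The third search, without brackets, is our own addition to show what the brackets are doing.

```python
import re

pets = "Harold Smith has two dogs, one cat and three birds."
found_pets = re.findall(r"dog|cat|bird", pets)
print(found_pets)

people = "Harold Smith went to meet John Smith but instead bumped into Jane Smith."

# With brackets the alternation only covers the first name.
with_brackets = [m.group(0) for m in re.finditer(r"(John|Harold) Smith", people)]
print(with_brackets)

# Without brackets the choice is between 'John' and 'Harold Smith'.
without_brackets = re.findall(r"John|Harold Smith", people)
print(without_brackets)
```

Without the brackets the pipe splits the whole expression in two, so John is matched on its own regardless of what follows it.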


Lookaheads and Lookbehinds are the final thing we are going to introduce in this tutorial and they can be one of the
trickiest things you will encounter in regular expressions. Both of them operate in one of two modes:

Positive - in which we are seeking to find something which matches.


Negative - in which we are seeking to find something which doesn't match.

The main idea of both the lookahead and lookbehind is to see if something matches (or doesn't) and then to throw
away what was actually matched.

Lookaheads
With a lookahead we want to look ahead (hence the name) in our string and see if it matches the given pattern, but
then disregard it and move on. The concept is best illustrated with an example.

Let's say we wish to identify numbers greater than 4000 but less than 5000. This is a problem which seems simple
but is in fact a little trickier than you suspect. A common first attempt is to try the following:

\b4\d\d\d\b

This looks promising with 4021 but unfortunately also matches 4000.

Then you realise that the way we can tackle this is to say we are looking for a '4' followed by 3 digits where at least
one of those digits is not a '0'. For us as humans that seems like a simple thing to look for but with what we have
learnt so far in regular expressions, it is not so easy. We could try something like:

\b4([1-9]\d\d|\d[1-9]\d|\d\d[1-9])\b

Now we will match 4010 but not 4000.

That is, use alternation to check three different scenarios, each with a different one of the three digits not being '0'.

I reckon you're probably looking at the above and thinking that's a lot of regular expression to match just 4
characters. Worse still, think about how that would increase if instead of between 4000 and 5000 we wanted
between 40000 and 50000. It soon becomes clear that the above regular expression works but it is not elegant and
it doesn't scale.

It turns out that a negative lookahead can solve problems like this quite well. A negative lookahead is set up as
follows:


(?!x)

Our negative lookahead is contained within brackets and the first two characters inside the brackets are ?!. Replace
x with what it is you don't want to match.

Now we can set up our regular expression as follows:

\b4(?!000)\d\d\d\b

Now we still match 4010 but not 4000.

That might seem a little confusing so let's break it down:

First we look for the character '4'.


When we find a '4' the negative lookahead returns true if the next 3 characters are not '000'.
If this returns true we go back to just after the '4' and continue with our regular expression.

In plain English we could say: "We are looking for a '4' which is not followed by 3 '0's but is followed by 3 digits".
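To convince ourselves that the negative lookahead does the same job as the longer alternation version, we can test both against a range of numbers. This is a rough Python sketch; the helper matched_numbers and the five digit pattern at the end are our own illustrations, not part of the tutorial.

```python
import re

alternation = re.compile(r"\b4([1-9]\d\d|\d[1-9]\d|\d\d[1-9])\b")
lookahead = re.compile(r"\b4(?!000)\d\d\d\b")

def matched_numbers(pattern, low, high):
    """Return every number from low to high (inclusive) that the pattern fully matches."""
    return [n for n in range(low, high + 1) if pattern.fullmatch(str(n))]

by_alternation = matched_numbers(alternation, 3990, 5010)
by_lookahead = matched_numbers(lookahead, 3990, 5010)

print(by_alternation == by_lookahead)
print(by_lookahead[0], by_lookahead[-1])

# Scaling up to 40000-50000 only needs one more '0' and one more digit.
wider = re.compile(r"\b4(?!0000)\d{4}\b")
print(bool(wider.fullmatch("40000")), bool(wider.fullmatch("40001")))
```

Both patterns accept the same numbers, but only the lookahead version stays about the same size as the range grows.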

A positive lookahead works in the same way but the characters inside the lookahead have to match rather than not
match. The syntax for a positive lookahead is as follows:

(?=x)

All we need to do is replace the '!' with an '='.
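As a quick illustration of a positive lookahead, here is a small Python sketch. The sentence and the idea of picking out dollar amounts are made up for this example.

```python
import re

text = "The order was 30 dollars for 4 books and 12 dollars for postage."

# Match a number only if it is followed by ' dollars'.
# The lookahead checks for ' dollars' but does not include it in the match.
amounts = re.findall(r"\d+(?= dollars)", text)
print(amounts)
```

The 4 is skipped because it is followed by ' books', and the word 'dollars' itself never ends up in the results because the lookahead throws away what it matched.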

Lookbehinds
Lookbehinds work similarly to lookaheads but instead of looking forwards then throwing it away, we look backwards
and then throw it away. Similar to lookaheads, they are available in both positive and negative. They follow a similar
syntax but include a '<' after the '?' (Think of it as an arrow pointing backwards).

(?<=x) and (?<!x)

These are the syntax for a positive lookbehind and a negative lookbehind respectively.

Let's say we would like to find instances of the name 'Smith' but only if it is a surname. To achieve this we have
said that we want to look at the word before it and if that word begins with a capital letter we'll assume it is a
surname (the more astute of you will have already seen the flaw in this, i.e. what if Smith is the second word in a
sentence, but we'll ignore that for now).


(?<=[A-Z]\w* )Smith

Now we won't identify Smith Francis but we will identify Harold Smith.
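How well this works depends on your tool. As far as we are aware, Python's built-in re module only accepts lookbehinds of a fixed width, so the \w* in the expression above is rejected there. The sketch below shows that, along with a fixed width lookbehind and one possible workaround using ordinary brackets. The sentence and variable names are our own.

```python
import re

sentence = "Smith Francis met Harold Smith."

# Python's re module raises an error for a lookbehind of variable width.
try:
    re.compile(r"(?<=[A-Z]\w* )Smith")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)

# A lookbehind of fixed width (here exactly 'Harold ') is accepted.
fixed = [m.start() for m in re.finditer(r"(?<=Harold )Smith", sentence)]
print(fixed)

# Workaround: match the preceding word normally and put Smith in brackets,
# then ask only for the position of the bracketed part.
surname_starts = [m.start(1) for m in re.finditer(r"\b[A-Z]\w* (Smith)", sentence)]
print(surname_starts)
```

Only the second Smith is reported in both cases, as it is the one with a capitalised word in front of it.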

Lookaheads and lookbehinds can be a bit tough to get your head around at first. I would suggest you experiment with
a few different searches yourself to get the hang of it.

Tip

Applications and programming languages differ in how they implement lookaheads and lookbehinds.
Some will allow you to use other regular expression features within a lookahead and lookbehind, some
will not. Some will allow some features but not all of them. If you are getting unexpected behaviour you
may need to find out which features are and aren't implemented for your particular application or
programming language.

You've now learnt enough about regular expressions to get you through the majority of problems you will probably
face. You've really only been introduced to the building blocks though. Learning how to put the building blocks
together into effective patterns is something which will take time and practice. Don't worry if some of this stuff is still
a little confusing at this point in time. With practice it will all become clearer and you will become very powerful in
terms of the things you can achieve.

Stuff We Learnt

()
Group part of the regular expression.

\1 \2 etc
Refer to something matched by a previous grouping.

|
Match what is on either the left or right of the pipe symbol.

(?=x)
Positive lookahead.

(?!x)
Negative lookahead.


(?<=x)
Positive lookbehind.

(?<!x)
Negative lookbehind.


By Ryan Chadwick (https://plus.google.com/105636787773904848687) © 2017