
ARTIFICIAL

INTELLIGENCE

Reference book: Stuart Russell & Peter Norvig, Artificial Intelligence: A Modern Approach

Chapter-1
Characterizations of
Artificial Intelligence

Artificial Intelligence is not an easy science to describe, as it has fuzzy borders with
mathematics, computer science, philosophy, psychology, statistics, physics, biology
and other disciplines. It is often characterized in various ways, some of which are
given below. I'll use these categorizations to introduce various important issues in
AI.

1.1 Long Term Goals


Just what is the science of Artificial Intelligence trying to achieve? At a very high
level, you will hear AI researchers categorized as either 'weak' or 'strong'. The
'strong' AI people think that computers can achieve consciousness (although they
may not be working on consciousness issues). The 'weak' AI people don't go that
far. Other people talk of the difference between 'Big AI' and 'Small AI'. Big AI is
the attempt to build robots of intelligence equaling that of humans, such as
Lieutenant Commander Data from Star Trek. Small AI is all about getting programs
to work for small problems and trying to generalize the techniques to work on
larger problems. Most AI researchers don't worry about things like consciousness
and concentrate on some of the following long term goals.
Firstly, many researchers want to:

Produce machines which exhibit intelligent behavior.

Machines in this sense could simply be personal computers, or they could be robots
with embedded systems, or a mixture of both. Why would we want to build
intelligent systems? One answer appeals to the reasons why we use computers in
general: to accomplish tasks which, if we did them by hand, would be error prone.
For instance, how many of us would not reach for our calculator if required to
multiply two six-digit numbers together? If we scale this up to more intelligent
tasks, then it should be possible to use computers to do some fairly complicated
things reliably. This reliability may be very useful if the task is beyond some
cognitive limitation of the brain, or when human intuition is counter-constructive,
such as in the Monty Hall problem described below, which many people - some of
whom call themselves mathematicians - get wrong.
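
To see just how counter-intuitive this can be, here is a minimal simulation sketch (in
Python, and not part of the original problem description) which estimates the win rate
for sticking with the first door against switching:

    import random

    def monty_hall_trial(switch):
        """Play one round of the Monty Hall game; return True if the player wins the car."""
        doors = [0, 1, 2]
        car = random.choice(doors)
        first_pick = random.choice(doors)
        # The host opens a door that hides a goat and was not picked.
        opened = random.choice([d for d in doors if d != first_pick and d != car])
        if switch:
            final_pick = next(d for d in doors if d not in (first_pick, opened))
        else:
            final_pick = first_pick
        return final_pick == car

    trials = 100000
    for switch in (False, True):
        wins = sum(monty_hall_trial(switch) for _ in range(trials))
        print("switch =", switch, "win rate ~", wins / trials)   # roughly 1/3 against 2/3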
Another reason we might want to construct intelligent machines is to enable us to
do things we couldn't do before. A large part of science is dependent on the use of
computers already, and more intelligent applications are increasingly being
employed. The ability for intelligent software to increase our abilities is not limited
to science, of course, and people are working on AI programs which can have a
creative input to human activities such as composing, painting and writing.
Finally, in constructing intelligent machines, we may learn something about
intelligence in humanity and other species. This deserves a category of its own.
Another reason to study Artificial Intelligence is to help us to:

Understand human intelligence in society.

AI can be seen as just the latest tool in the philosopher's toolbox for answering
questions about the nature of human intelligence, following in the footsteps of
mathematics, logic, biology, psychology, cognitive science and others. Some
obvious questions that philosophy has wrangled with are: "We know that we are
more 'intelligent' than the other animals, but what does this actually mean?" and
"How many of the activities which we call intelligent can be replicated by
computation (e.g., algorithmically)?"
For example, the ELIZA program discussed below is a classic example from the
sixties where a very simple program raised some serious questions about the nature
of human intelligence. Amongst other things, ELIZA helped philosophers and
psychologists to question the notion of what it means to 'understand' in natural
language (e.g., English) conversations.
In stating that AI helps us understand the nature of human intelligence in society,
we should note that AI researchers are increasingly studying multi-agent systems,
which are, roughly speaking, collections of AI programs able to communicate and
cooperate/compete on small tasks towards the completion of larger tasks. This
means that the social, rather than individual, nature of intelligence is now a subject
within range of computational studies in Artificial Intelligence.
Of course, humans are not the only life-forms, and the question of life (including
intelligent life) poses even bigger questions. Indeed, some Artificial Life (ALife)
researchers have grand plans for their software. They want to use it to:

Give birth to new life forms.

A study of Artificial Life will certainly throw light on what it means for a complex
system to be 'alive'. Moreover, ALife researchers hope that, in creating artificial
life-forms, given time, intelligent behaviour will emerge, much like it did in human
evolution. Hence, there may be practical applications of an ALife approach. In
particular, evolutionary algorithms (where programs and parameters are evolved to
perform a particular task, rather than to exhibit signs of life) are becoming fairly
mainstream in AI.
A less obvious long term goal of AI research is to:

Add to scientific knowledge.

This is not to be confused with the applications of AI programs to other sciences,
discussed later. Rather, it is worth pointing out that some AI researchers don't write
intelligent programs and are certainly not interested in human intelligence or
breathing life into programs. They are really interested in the various scientific
problems that arise in the study of AI. One example is the question of algorithmic
complexity - how bad will a particular algorithm get at solving a particular problem
(in terms of the time taken to find the solution) as the problem instances get bigger?
These kinds of studies certainly have an impact on the other long term goals, but the
pursuit of knowledge itself is often overlooked as a reason for AI to exist as a
scientific discipline. We won't be covering issues such as algorithmic complexity in
this course, however.

1.2 Inspirations
Artificial Intelligence research can be characterised in terms of how the following
question has been answered:
"Just how are we going to get a computer to perform intelligent tasks?"
One way to answer the question is to say that:

Logic makes a science out of various forms of reasoning, which play their part in
intelligence. So, let's build our programs as implementations of logical theories.

This has led to the use of logic - drawing on mathematics and philosophy - in a
great deal of AI research. This means that we can be very precise about the
algorithms we implement, write our programs in very clear ways using logic
programming languages, and even prove things about the programs we produce.
However, while it's theoretically possible to do certain intelligent things (such as
prove some easy mathematics theorems) with programs based on logic alone, such
methods are held back by the very large search spaces involved. People began to
think about heuristics - rules of thumb - which they could use to enable their
programs to get jobs done in a reasonable time. They answered the question like
this:

We're not sure that humans reason with perfect logic all the time, but we are
certainly intelligent. So, let's use introspection and tell our AI programs how to
think like us.

In answering this question, AI researchers started building expert systems, which
encapsulated factual, procedural and heuristic knowledge about particular domains.

1.4 General Tasks to Accomplish


Once you've worried about why you're doing AI, what has inspired you and how you're
going to approach the job, then you can start to think about what task it is that you want
to automate. AI is so often portrayed as a set of problem-solving techniques, but I think
the relentless shoe-horning of intelligent tasks into one problem formulation or another
is holding AI back. That said, we have determined a number of problem solving tasks in
AI - most of which have been hinted at previously - which can be used as a
characterization. The categories overlap a little because of the generality of the
techniques. For instance, planning could be found in many categories, as this is a
fundamental part of solving many types of problem.

1.5 Generic Techniques Developed


In the pursuit of solutions to various problems in the above categories, various
individual techniques have sprung up which have been shown to be useful for solving a
range of problems (usually within the general problem category). These techniques are
established enough now to have a name and provide at least a partial characterisation of
AI. The following list is not intended to be complete, but rather to introduce some
techniques you will learn later in the course. Note that some of these overlap with the
general techniques above.

Forward/backward chaining (reasoning)
Resolution theorem proving (reasoning)
Proof planning (reasoning)
Constraint satisfaction (reasoning)
Davis-Putnam method (reasoning)
Minimax search (games)
Alpha-Beta pruning (games)
Case-based reasoning (expert systems)
Knowledge elicitation (expert systems)
Neural networks (learning)
Bayesian methods (learning)
Explanation based (learning)
Inductive logic programming (learning)
Reinforcement (learning)
Genetic algorithms (learning)
Genetic programming (learning)
Strips (planning)
N-grams (NLP)
Parsing (NLP)
Behavior based (robotics)
Cell decomposition (robotics)

1.6 Representations/Languages Used


Many people are taught AI with the opening line: "The three most important things in
AI are representation, representation and representation". While choosing the way of
representing knowledge in AI programs will always be a key concern, many techniques
now have well-chosen ways to represent data which have been shown to be useful for
that technique. Along the way, much research has been undertaken into discovering the
best ways to represent certain types of knowledge. The way in which knowledge can be
represented is often taken as another way to characterize Artificial Intelligence. Some
general representation schemes include:

First order logic
Higher order logic
Logic programs
Frames
Production Rules
Semantic Networks
Fuzzy logic
Bayes nets
Hidden Markov models
Neural networks
Strips

Some standard AI programming languages have been developed in order to build
intelligent programs efficiently and robustly. These include:

Prolog
Lisp
ML

Note that other languages are used extensively to build AI programs, including:

Perl
C++
Java
C

1.7 Application Areas


Individual applications often drive AI research much more than the long term goals
described above. Much of AI literature is grouped into application areas, some of which
are:

Agriculture
Architecture
Art
Astronomy
Bioinformatics
Email classification
Engineering
Finance
Fraud detection
Information retrieval
Law

Mathematics
Military
Music
Scientific discovery
Story writing
Telecommunications
Telephone services
Transportation
Tutoring systems
Video games
Web search engines

Chapter-2
Artificial Intelligence Agents
In the previous lecture, we discussed what we will be talking about in Artificial
Intelligence and why those things are important. This lecture is all about how we
will be talking about AI, i.e., the language, assumptions and concepts which will be
common to all the topics we cover.
These notions should be considered before undertaking any large AI project. Hence,
this lecture also serves to add to the systems engineering information you have/will
be studying. For AI software/hardware, of course, we have to worry about which
programming language to use, how to split the project into modules, etc. However,
we also have to worry about higher level notions, such as: what does it mean for our
program/machine to act rationally in a particular domain, how will it use knowledge
about the environment, and what form will that knowledge take? All these things
should be taken into consideration before we worry about actually doing any
programming.

2.1 Autonomous Rational Agents


In many cases, it is inaccurate to talk about a single program or a single robot, as
the combination of hardware and software in some intelligent systems is
considerably more complicated. Instead, we will follow the lead of Russell and
Norvig and describe AI through the autonomous, rational intelligent agents
paradigm. We're going to use the definitions from chapter 2 of Russell and Norvig's
textbook, starting with these two:

An agent is anything that can be viewed as perceiving its environment through
sensors and acting upon that environment through effectors.
A rational agent is one that does the right thing.

We see that the word 'agent' covers humans (where the sensors are the senses and
the effectors are the physical body parts) as well as robots (where the sensors are
things like cameras and touch pads and the effectors are various motors) and
computers (where the sensors are the keyboard and mouse and the effectors are the
monitor and speakers).
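
As a rough sketch of this abstraction (an illustration of my own, not code from Russell
and Norvig), an agent can be treated as a mapping from percepts to actions, driven by a
simple sense-act loop; the environment interface used here (percept() and apply()) is an
assumed placeholder.

    class Agent:
        """An agent maps what it perceives through its sensors to an action for its effectors."""

        def act(self, percept):
            """Choose an action given the latest percept; overridden by concrete agents."""
            raise NotImplementedError

    def run(agent, environment, steps):
        """A generic sense-act loop."""
        for _ in range(steps):
            percept = environment.percept()    # read the sensors
            action = agent.act(percept)        # decide what to do
            environment.apply(action)          # act on the environment via the effectors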
To determine whether an agent has acted rationally, we need an objective measure
of how successful it has been, and we need to worry about when to make an
evaluation using this measure. When designing an agent, it is important to think
hard about how to evaluate its performance, and this evaluation should be
independent from any internal measures that the agent undertakes (for example as
part of a heuristic search - see the next lecture). The performance should be
measured in terms of how rationally the program acted, which depends not only on
how well it did at a particular task, but also on what the agent experienced from its
environment, what the agent knew about its environment and what actions the agent
could actually undertake.

Acting Rationally
Al Capone was finally convicted for tax evasion. Were the police acting rationally?
To answer this, we must first look at how the performance of police forces is
viewed: arresting and convicting the people who have committed a crime is a start,
but their success in getting criminals off the street is also a reasonable, if
contentious, measure. Given that they didn't convict Capone for the murders he
committed, they failed on that measure. However, they did get him off the street, so
they succeeded there. We must also look at what the police knew and what they
had experienced about the environment: they had experienced murders which they
knew were undertaken by Capone, but they had not experienced any evidence
which could convict Capone of the murders. However, they had evidence of tax
evasion. Given the knowledge about the environment that they can only arrest if
they have evidence, their actions were therefore limited to arresting Capone on tax
evasion. As this got him off the street, we could say they were acting rationally.
This answer is controversial, and highlights the reason why we have to think hard
about how to assess the rationality of an agent before we consider building it.
To summarize, an agent takes input from its environment and affects that
environment. The rational performance of an agent must be assessed in terms of the
task it was meant to undertake, its knowledge and experience of the environment
and the actions it was actually able to undertake. This performance should be
objectively measured independently of any internal measures used by the agent.
In English language usage, autonomy means an ability to govern one's actions
independently. In our situation, we need to specify the extent to which an agent's
behavior is affected by its environment. We say that:

The autonomy of an agent is measured by the extent to which its behaviour is
determined by its own experience.

At one extreme, an agent might never pay any attention to the input from its
environment, in which case, its actions are determined entirely by its built-in
knowledge. At the other extreme, if an agent does not initially act using its built-in
knowledge, it will have to act randomly, which is not desirable. Hence, it is
desirable to have a balance between complete autonomy and no autonomy.
Thinking of human agents, we are born with certain reflexes which govern our
actions to begin with. However, through our ability to learn from our environment,
we begin to act more autonomously as a result of our experiences in the world.
Imagine a baby learning to crawl around. It must use in-built information to enable
it to correctly employ its arms and legs, otherwise it would just thrash around.
However, as it moves, and bumps into things, it learns to avoid objects in the
environment. When we leave home, we are (supposed to be) fully autonomous
agents ourselves. We should expect similar of the agents we build for AI tasks: their
autonomy increases in line with their experience of the environment.

2.3 Internal Structure of Agents


We have looked at agents in terms of their external influences and behaviors: they
take input from the environment and perform rational actions to alter that
environment. We will now look at some generic internal mechanisms which are
common to intelligent agents.

Architecture and Program

The program of an agent is the mechanism by which it turns input from the
environment into an action on the environment. The architecture of an agent is the
computing device (including software and hardware) upon which the program
operates. On this course, we mostly concern ourselves with the intelligence behind
the programs, and do not worry about the hardware architectures they run on. In
fact, we will mostly assume that the architecture of our agents is a computer getting
input through the keyboard and acting via the monitor.
RHINO, a robotic museum tour guide that we will use as a running example, consisted
of the robot itself, including the necessary hardware for
locomotion (motors, etc.) and state of the art sensors, including laser, sonar, infrared
and tactile sensors. RHINO also carried around three on-board PC workstations and
was connected by a wireless Ethernet connection to a further three off-board SUN
workstations. In total, it ran up to 25 different processes at any one time, in parallel.
The program employed by RHINO was even more complicated than the
architecture upon which it ran. RHINO ran software which drew upon techniques
ranging from low level probabilistic reasoning and visual information processing to
high level problem solving and planning using logical representations.

An agent's program will make use of knowledge about its environment and methods
for deciding which action to take (if any) in response to a new input from the
environment. These methods include reflexes, goal based methods and utility based
methods.

Knowledge of the Environment

We must distinguish between knowledge an agent receives through its sensors and
knowledge about the world from which the input comes. Knowledge about the
world can be programmed in, and/or it can be learned through the sensor input. For
example, a chess playing agent would be programmed with the positions of the
pieces at the start of a game, but would maintain a representation of the entire board
by updating it with every move it is told about through the input it receives. Note
that the sensor inputs are the opponent's moves and this is different to the
knowledge of the world that the agent maintains, which is the board state.
There are three main ways in which an agent can use knowledge of its world to
inform its actions. If an agent maintains a representation of the world, then it can
use this information to decide how to act at any given time. Furthermore, if it stores
its representations of the world, then it can also use information about previous
world states in its program. Finally, it can use knowledge about how its actions
affect the world.
The RHINO agent was provided with an accurate metric map of the museum and
exhibits beforehand, carefully mapped out by the programmers. Having said this,
the layout of the museum changed frequently as routes became blocked and chairs
were moved. By updating its knowledge of the environment, however, RHINO
consistently knew where it was, to an accuracy better than 15cm. RHINO didn't
move objects other than itself around the museum. However, as it moved around,
people followed it, so its actions really were altering the environment. It was
because of this (and other reasons) that the designers of RHINO made sure it
updated its plan as it moved around.

Reflexes

If an agent decides upon and executes an action in response to a sensor input
without consulting its representation of the world, then this can be considered a
reflex response.
Humans flinch if they touch something very hot, regardless of the particular social
situation they are in, and this is clearly a reflex action. Similarly, chess agents are
programmed with lookup tables for openings and endings, so that they do not have
to do any processing to choose the correct move, they simply look it up. In timed
chess matches, this kind of reflex action might save vital seconds to be used in more
difficult situations later.
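
A reflex of this kind can be sketched as nothing more than a table lookup. The positions
and moves below are placeholders for illustration, not real opening theory.

    # A toy "opening book": map a board position (represented here simply as a string)
    # straight to a move, with no deliberation at all.
    OPENING_BOOK = {
        "start position": "e2e4",
        "after e2e4 e7e5": "g1f3",
    }

    def reflex_move(position):
        """Return a move immediately if the position is in the table."""
        return OPENING_BOOK.get(position)      # None means: fall back to proper deliberation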

Unfortunately, relying on lookup tables is not a sensible way to program intelligent
agents: a chess agent would need around 35^100 entries in its lookup table
(considerably more entries than there are atoms in the universe). And if we remember
that the world of a chess agent consists of only 32 pieces on 64 squares, it's obvious
that we need more intelligent means of choosing a rational action.
For RHINO, it is difficult to identify any reflex actions. This is probably because
performing an action without consulting the world representation is potentially
dangerous for RHINO, because people get everywhere, and museum exhibits are
expensive to replace if broken!

Goals

One possible way to improve an agent's performance is to enable it to have some
details of what it is trying to achieve. If it is given some representation of the goal
(e.g., some information about the solution to a problem it is trying to solve), then it
can refer to that information to see if a particular action will lead to that goal. Such
agents are called goal-based. Two tried and trusted methods for goal-based agents
are planning (where the agent puts together and executes a plan for achieving its
goal) and search (where the agent looks ahead in a search space until it finds the
goal). Planning and search methods are covered later in the course.
In RHINO, there were two goals: get the robot to an exhibit chosen by the visitors
and, when it gets there, provide information about the exhibit. Obviously, RHINO
used information about its goal of getting to an exhibit to plan its route to that
exhibit.

Utility Functions

A goal based agent for playing chess is infeasible: every time it decides which
move to play next, it sees whether that move will eventually lead to a checkmate.
Instead, it would be better for the agent to assess its progress not against the overall
goal, but against a localized measure. Agents' programs often have a utility function
which calculates a numerical value for each world state the agent would find itself
in if it undertook a particular action. Then it can check which action would lead to
the highest value being returned from the set of actions it has available. Usually the
best action with respect to a utility function is taken, as this is the rational thing to
do. When the task of the agent is to find something by searching, if it uses a utility
function in this manner, this is known as a best-first search.
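
As a sketch, choosing an action with a utility function amounts to evaluating the state
each available action would lead to and taking the best; the result and utility functions
here are assumed to be supplied by the agent designer.

    def choose_action(state, actions, result, utility):
        """Pick the available action whose resulting world state scores highest.

        result(state, action) returns the state reached by taking the action, and
        utility(state) returns a numerical value for a state.
        """
        return max(actions, key=lambda action: utility(result(state, action)))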
RHINO searched for paths from its current location to an exhibit, using the distance
from the exhibit as a utility function. However, this was complicated by visitors
getting in the way.

2.4 Environments
We have seen that intelligent agents should take into account certain information
when choosing a rational action, including information from its sensors,
information from the world, information from previous states of the world,
information from its goal and information from its utility function(s). We also need
to take into account some specifics about the environment it works in. On the
surface, this consideration would appear to apply more to robotic agents moving
around the real world. However, the considerations also apply to software agents
which are receiving data and making decisions which affect the data they receive - in
this case we can think of the environment as the flow of information in the data
stream. For example, an AI agent may be employed to dynamically update web
pages based on the requests from internet users.
We follow Russell and Norvig's lead in characterizing information about the
environment:

Accessibility

In some cases, certain aspects of an environment which should be taken into
account in decisions about actions may be unavailable to the agent. This could
happen, for instance, because the agent cannot sense certain things. In these cases,
we say the environment is partially inaccessible. In this case, the agent may have to
make (informed) guesses about the inaccessible data in order to act rationally.
The builders of RHINO talk about "invisible" objects that RHINO had to deal with.
These included glass cases and bars at various heights which could not be detected
by the robotic sensors. These are clearly inaccessible aspects of the environment,
and RHINO's designers took this into account when designing its programs.

Determinism

If we can determine what the exact state of the world will be after an agent's action,
we say the environment is deterministic. In such cases, the state of the world after
an action is dependent only on the state of the world before the action and the
choice of action. If the environment is non-deterministic, then utility functions will
have to make (informed) guesses about the expected state of the world after
possible actions if the agent is to correctly choose the best one.
RHINO's world was non-deterministic because people moved around, and they
move objects such as chairs around. In fact, visitors often tried to trick the robot by
setting up roadblocks with chairs. This was another reason why RHINO's plan was
constantly updated.

Episodes

If an agent's current choice of action does not depend on its past actions, then the
environment is said to be episodic. In non-episodic environments, the agent will
have to plan ahead, because its current action will affect subsequent ones.
Considering only the goal of getting to and from exhibits, the individual trips
between exhibits can be seen as episodes in RHINO's actions. Once it had arrived at
one exhibit, how it got there would not normally affect its choices in getting to the
next exhibit. If we also consider the goal of giving a guided tour, however, RHINO
must at least remember the exhibits it had already visited, in order not to repeat
itself. So, at the top level, its actions were not episodic.

Static or Dynamic

An environment is static if it doesn't change while an agent's program is making the
decision about how to act. When designing agents to operate in dynamic (non-static)
environments, the underlying program may have to refer to the changing
environment while it deliberates, or to anticipate the change in the environment
between the time when it receives an input and when it has to take an action.
RHINO was very fast in making decisions. However, because of the amount of
visitor movement, by the time RHINO had planned a route, that plan was
sometimes wrong because someone was now blocking the route. However, because
of the speed of decision making, instead of referring to the environment during the
planning process, as we have said before, the designers of RHINO chose to enable
it to continually update its plan as it moved.

Discrete or Continuous

The nature of the data coming in from the environment will affect how the agent
should be designed. In particular, the data may be discrete (composed of a limited
number of clearly defined parts) or continuous (seemingly without discernible
sections). Of course, given the nature of computer memory (in bits and bytes), even
streaming video can be shoe-horned into the discrete category, but an intelligent
agent will probably have to deal with this as if it is continuous. The mathematics in
your agent's programs will differ depending on whether the data is taken to be
discrete or continuous.

Chapter-3
Search in Problem Solving
If Artificial Intelligence can inform the other sciences about anything, it is about
problem solving and, in particular, how to search for solutions to problems. Much
of AI research can be explained in terms of specifying a problem, defining a search
space which should contain a solution to the problem, choosing a search strategy
and getting an agent to use the strategy to find a solution.
If you are hired as an AI researcher/programmer, you will be expected to come
armed with a battery of AI techniques, many of which we cover later in the course.
However, perhaps the most important skill you will bring to the job is to effectively
seek out the best way of turning some vague specifications into concrete problems
requiring AI techniques. Specifying those problems in the most effective way will
be vital if you want your AI agent to find the solutions in a reasonable time. In this
lecture, we look at how to specify a search problem.

3.1 Specifying Search Problems


In our agent terminology, a problem to be solved is a specific task where the agent
starts with the environment in a given state and acts upon the environment until the
altered state has some pre-determined quality. The set of states which are possible
via some sequence of actions the agent takes is called the search space. The series
of actions that the agent actually performs is its search path, and the final state is a
solution if it has the required property. There may be many solutions to a particular
problem. If you can think of the task you want your agent to perform in these terms,
then you will need to write a problem solving agent which uses search.
It is important to identify the scope of your task in terms of the problems which will
need to be solved. For instance, there are some tasks which are single problems
solved by searching, e.g., find a route on a map. Alternatively, there are tasks such
as winning at chess, which have to be broken down into sub-problems (searching
for the best move at each stage). Other tasks can be achieved without searching
whatsoever, e.g., multiplying two large numbers together - you wouldn't dream of
searching through the number line until you came across the answer!
There are three initial considerations in problem solving (as described in Russell
and Norvig):

Initial State

Firstly, the agent needs to be told exactly what the initial state is before it starts its
search, so that it can keep track of the state as it searches.

Operators

An operator is a function taking one state to another via an action undertaken by the
agent. For example, in chess, an operator takes one arrangement of pieces on the
board to another arrangement by the action of the agent moving a piece.

Goal Test

It is essential when designing a problem solving agent to know when the problem
has been solved, i.e., to have a well defined goal test. Suppose the problem we had
set our agent was to find a name for a newborn baby, with some properties. In this
case, there are lists of "accepted" names for babies, and any solution must appear in
that list, so goal-checking amounts to simply testing whether the name appears in
the list. In chess, on the other hand, the goal is to reach a checkmate. While there
are only a finite number of ways in which the pieces on a board can represent a
checkmate, the number of these is huge, so checking a position against them is a
bad idea. Instead, a more abstract notion of checkmate is used, whereby our agent
checks that the opponent's king cannot move without being captured.
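
Putting the three considerations together, a search problem can be sketched as a small
data structure holding an initial state, a set of operators and a goal test. The
accepted-names list below is a made-up placeholder, anticipating the naming example used
later in this chapter.

    class SearchProblem:
        """A search problem: an initial state, some operators and a goal test."""

        def __init__(self, initial_state, operators, goal_test):
            self.initial_state = initial_state
            self.operators = operators     # functions taking one state to another
            self.goal_test = goal_test     # returns True when a state solves the problem

    ACCEPTED_NAMES = {"DAN", "ANN"}        # placeholder standing in for a book of names

    naming_problem = SearchProblem(
        initial_state="",
        operators=[lambda s, letter=letter: s + letter for letter in "DNA"],
        goal_test=lambda s: s in ACCEPTED_NAMES,
    )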

3.2 General Considerations for Search


If we can specify the initial state, the operators and the goal check for a search
problem, then we know where to start, how to move and when to stop in our search.
This leaves the important question of how to choose which operator to apply to
which state at any stage during the search. We call an answer to this question a
search strategy. Before we worry about exactly what strategy to use, the following
need to be taken into consideration:

Path or Artifact

Broadly speaking, there are two different reasons to undertake a search: to find an
artifact (a particular state), or to find a path from one given state to another given
state. Whether you are searching for a path or an artifact will affect many aspects of
your agent's search, including its goal test, what it records along the way and the
search strategies available to you.
For example, in the maze below, the game involves finding a route from the top left
hand corner to the bottom right hand corner. We all know what the exit looks like (a
gap in the outer wall), so we do not search for an artifact. Rather, the point of the
search is to find a path, so the agent must remember where it has been.

However, in other searches, the point of the search is to find something, and it may
be immaterial how you found it. For instance, suppose we play a different game: to
find an anagram of the phrase:

ELECTING NEIL
The answer is, of course: (FILL IN THIS GAP AS AN EXERCISE). In this case,
the point of the search is to find an artifact - a word which is an anagram of
"electing neil". No-one really cares in which order to actually re-arrange the letters,
so we are not searching for a path.

Completeness

It's also worth trying to estimate the number of solutions to a problem, and the
density of those solutions amongst the non-solutions. In a search problem, there
may be any number of solutions, and the problem specification may involve finding
just one, finding some, or finding all the solutions. For example, suppose a military
application searches for routes that an enemy might take. The question: "Can the
enemy get from A to B" requires finding only one solution, whereas the question:
"How many ways can the enemy get from A to B" will require the agent to find all
the solutions.
When an agent is asked to find just one solution, we can often program it to prune
its search space quite heavily, i.e., rule out particular operators at particular times to
be more efficient. However, this may also prune some of the solutions, so if our
agent is asked to find all of them, the pruning has to be controlled so that we know
that pruned areas of the search space either contain no solutions, or contain
solutions which are repeated in another (non-pruned) part of the space.
If our search strategy is guaranteed to find all the solutions eventually, then we say
that it is complete. Often, it is obvious that all the solutions are in the search space,
but in other cases, we need to prove this fact mathematically to be sure that our
space is complete. A problem with complete searches is that - while the solution is
certainly there - it can take a very long time to find the solution, sometimes so long
that the strategy is effectively useless. Some people use the word exhaustive when
they describe complete searches, because the strategy exhausts all possibilities in
the search space.

Time and Space Tradeoffs

In practice, you are going to have to stop your agent at some stage if it has not
found a solution by then. Hence, if we can choose the fastest search strategy, then
this will explore more of the search space and increase the likelihood of finding a
solution. There is a problem with this, however. It may be that the fastest strategy is
the one which uses most memory. To perform a search, an agent needs at least to
know where it is in a search space, but lots of other things can also be recorded. For
instance, a search strategy may involve going over old ground, and it would save
time if the agent knew it had already tried a particular path. Even though RAM
capacities in computers are going steadily up, for some of the searches that AI
agents are employed to undertake, they often run out of memory. Hence, as in
computer science in general, AI practitioners often have to devise clever ways to
trade memory and time in order to achieve an effective balance.

Soundness

You may hear in some application domains - for example automated theorem
proving - that a search is "sound and complete". Soundness in theorem proving
means that the search to find a proof will not succeed if you give it a false theorem
to prove. This extends to searching in general, where a search is unsound if it finds
a solution to a problem with no solution. This kind of unsound search may not be
the end of the world if you are only interested in using it for problems where you
know there is a solution (and it performs well in finding such solutions). Another
kind of unsound search is when a search finds the wrong solution to a problem. This
is more worrying and the problem will probably lie with the goal testing
mechanism.

Additional Knowledge in Search

The amount of extra knowledge available to your agent will affect how it performs.
In the following sections of this lecture, we will look at uninformed search
strategies, where no additional knowledge is given, and heuristic searches, where
any information about the goal, intermediate states and operators can be used to
improve the efficiency of the search strategy.

3.3 Uninformed Search Strategies


To be able to undertake an uninformed search, all our agent needs to know is the
initial state, the possible operators and how to check whether the goal has been
reached. Once these have been described, we must then choose a search strategy for
the agent: a pre-determined way in which the operators will be applied.
The example we will use is the case of a genetics professor searching for a name for
her newborn baby boy - of course, it must only contain the letters D, N and A. The
states in this search are strings of letters (but only Ds, Ns and As), and the initial
state is an empty string. Also, the operators available are: (i) add a 'D' to an existing
string (ii) add an 'N' to an existing string and (iii) add an 'A' to an existing string.
The goal check is possible using a book of boys names against which the professor
can check a string of letters.
To help us think about the different search strategies, we use two analogies. Firstly,
we suppose that the professor keeps an agenda of actions to undertake, such as: add
an 'A' to the string 'AN'. So, the agenda consists of pairs (S,O) of states and
operators, whereby the operator is to be applied to the state. The action at the top of
the agenda is the one which is carried out, then that action is removed. How actions
are added to the agenda differs for each search strategy. Secondly, we think of a
search graphically: by making each state a node in a graph and each operator an
edge, we can think of the search progressing as movement from node to node along
edges in the graph. We then allow ourselves to talk about nodes in a search space
(rather than the graph) and we say that a node in a search space has been expanded
if the state that node represents has been visited and searched from. Note that
graphs which have no cycles in them are called trees, and many AI searches can be
represented as trees.
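
The agenda analogy translates almost directly into code. The sketch below is an
illustration of my own (not a prescribed implementation): it keeps an agenda of
(state, operator, depth) triples and differs between the two strategies described
next only in whether new actions are added to the bottom or the top of the agenda.

    from collections import deque

    def agenda_search(initial_state, operators, goal_test, depth_first=False, depth_limit=None):
        """Generic agenda-based search over (state, operator) pairs.

        Breadth first: new actions are added to the bottom of the agenda.
        Depth first:   new actions are added to the top (optionally depth limited).
        """
        if goal_test(initial_state):
            return initial_state
        agenda = deque((initial_state, op, 1) for op in operators)   # 1 = depth of the resulting state
        while agenda:
            state, op, depth = agenda.popleft()      # carry out the action at the top of the agenda
            new_state = op(state)
            if goal_test(new_state):
                return new_state
            if depth_limit is not None and depth >= depth_limit:
                continue                             # don't add actions beyond the depth limit
            new_actions = [(new_state, o, depth + 1) for o in operators]
            if depth_first:
                agenda.extendleft(reversed(new_actions))   # top of the agenda
            else:
                agenda.extend(new_actions)                 # bottom of the agenda
        return None

    # The professor's naming problem: states are strings over the letters D, N and A.
    ACCEPTED = {"DAN"}                               # stand-in for the book of boys' names
    ops = [lambda s, letter=letter: s + letter for letter in "DNA"]
    print(agenda_search("", ops, lambda s: s in ACCEPTED))                                    # breadth first
    print(agenda_search("", ops, lambda s: s in ACCEPTED, depth_first=True, depth_limit=3))   # depth first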

Breadth First Search

Given a set of operators o1, ..., on in a breadth first search, every time a new state s
is reached, an action for each operator on s is added to the bottom of the agenda,
i.e., the pairs (s,o1), ..., (s,on) are added to the end of the agenda in that order.
In our example, the first three actions on the agenda would be:
1. (empty, add 'D')
2. (empty, add 'N')
3. (empty, add 'A')
Once the 'D' state has been found by carrying out the first action, the actions:
('D', add 'D'), ('D', add 'N') and ('D', add 'A')
are added to the bottom of the agenda, so it would look like this:
1. (empty, add 'D')
2. (empty, add 'N')
3. (empty, add 'A')
4. ('D', add 'D')
5. ('D', add 'N')
6. ('D', add 'A')

However, we can remove the first agenda item as this action has been undertaken.
Hence there are actually 5 actions on the agenda after the first step in the search
space. Indeed, after every step, one action will be removed (the action just carried
out), and three will be added, making a total addition of two actions to the agenda.
It turns out that this kind of breadth first search leads to the name 'DAN' after 20
steps. Also, after the 20th step, there are 43 tasks still on the agenda to do.
It's useful to think of this search as the evolution of a tree, and the diagram below
shows how each string of letters is found via the search in a breadth first manner.
The numbers above the boxes indicate at which step in the search the string was
found.

We see that each node leads to three others, which corresponds to the fact that after
every step, three more steps are put on the agenda. This is called the branching
rate of a search, and seriously affects both how long a search is going to take and
how much memory it will use up.
Breadth first search is a complete strategy: given enough time and memory, it will
find a solution if one exists. Unfortunately, memory is a big problem for breadth
first search. We can think about how big the agenda grows, but in effect we are just
counting the number of states which are still 'alive', i.e., there are still steps in the
agenda involving them. In the above diagram, the states which are still alive are
those with fewer than three arrows coming from them: there are 14 in all.
It's fairly easy to show that in a search with a branching rate of b, if we want to
search all the way to a depth of d, then the largest number of states the agent will
have to store at any one time is b^(d-1). For example, if our professor wanted to search
for all names up to length 8, she would have to remember (or write down) 2187
different strings to complete a breadth first search. This is because she would need
to remember 3^7 = 2187 strings of length 7 in order to be able to build all the strings
of length 8 from them. In searches with a higher branching rate, the memory
requirement can often become too large for an agent's processor.

Depth First Search

Depth first search is very similar to breadth first, except that things are added to the
top of the agenda rather than the bottom. In our example, the first three things on
the agenda would still be:
1. (empty, add 'D')
2. (empty, add 'N')
3. (empty, add 'A')
However, once the 'D' state had been found, the actions:
('D', add 'D'), ('D', add 'N') and ('D', add 'A')
would be added to the top of the agenda, so it would look like this:
1. ('D', add 'D')
2. ('D', add 'N')
3. ('D', add 'A')
4. (empty, add 'N')
5. (empty, add 'A')

Of course, carrying out the action at the top of the agenda would introduce the
string 'DD', but then this would cause the action:
('DD',add 'D')
to be added to the top, and the next string found would be 'DDD'. Clearly, this can't
go on indefinitely, and in practice, we must specify a depth limit to stop it going
down a particular path forever. That is, our agent will need to record how far down
a particular path it has gone, and avoid putting actions on the agenda if the state in
the agenda item is past a certain depth.

Note that our search for names is special: no matter what state we reach, there will
always be three actions to add to the agenda. In other searches, the number of
actions available to undertake on a particular state may be zero, which effectively
stops that branch of the search. Hence, a depth limit is not always required.
Returning to our example, if the professor stipulated that she wanted very short
names (of three or fewer letters), then the search tree would look like this:

We see that 'DAN' has been reached after the 12th step, so there is an improvement
on the breadth first search. However, it was lucky in this case that the first letter
explored is 'D' and that there is a solution at depth three. If the depth limit had been
set at 4 instead, the tree would have looked very much different:

It looks like it will be a long time until it finds 'DAN'. This highlights an important
drawback to depth first search. It can often go deep down paths which have no
solutions, when there is a solution much higher up the tree, but on a different
branch. Also, depth first search is not, in general, complete.
Rather than simply adding the next agenda item directly to the top of the agenda, it
might be a better idea to make sure that every node in the tree is fully expanded
before moving on to the next depth in the search. This is the kind of depth first
search which Russell and Norvig explain. For our DNA example, if we did this, the
search tree would look like this:

The big advantage to depth first search is that it requires much less memory to
operate than breadth first search. If we count the number of 'alive' nodes in the
diagram above, it amounts to only 4, because the ones on the bottom row are not to
be expanded due to the depth limit. In fact, it can be shown that if an agent wants to
search for all solutions up to a depth of d in a space with branching factor b, then in
a depth first search it only needs to remember up to a maximum of b*d states at any
one time.
To put this in perspective, if our professor wanted to search for all names up to
length 8, she would only have to remember 3 * 8 = 24 different strings to complete
a depth first search (rather than 2187 in a breadth first search).

Iterative Deepening Search

So, breadth first search is guaranteed to find a solution (if one exists), but it eats all
the memory. Depth first search, however, is much less memory hungry, but not
guaranteed to find a solution. Is there any other way to search the space which
combines the good parts of both?
Well, yes, but it sounds silly. Iterative Deepening Search (IDS) is just a series of
depth first searches where the depth limit is increased by one every time. That is, an
IDS will do a depth first search (DFS) to depth 1, followed by a DFS to depth 2,
and so on, each time starting completely from scratch. This has the advantage of
being complete, as it covers all depths of the search tree. Also, it only requires the
same memory as depth first search (obviously).
However, you will have noticed that this means that it completely re-searches the
entire space searched in the previous iteration. This kind of redundancy will surely
make the search strategy too slow to contemplate using in practice? Actually, it isn't
as bad as you might think. This is because, in a depth first search, most of the effort
is spent expanding the last row of the tree, so the repetition over the top part of the
tree is not a major factor. In fact, the effect of the repetition reduces as the
branching rate increases. In a search with branching rate 10 and depth 5, the number
of states searched is 111,111 with a single depth first search. With an iterative
deepening search, this number goes up to 123,456. So, there is only a repetition of
around 11%.
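
As a sketch, iterative deepening is just a loop around the depth-limited depth first
search from the agenda sketch given earlier; the max_depth cap here is an arbitrary
safety limit, not part of the strategy itself.

    def iterative_deepening(initial_state, operators, goal_test, max_depth=20):
        """Run depth first searches with depth limits 1, 2, 3, ... until a solution is found."""
        for limit in range(1, max_depth + 1):
            solution = agenda_search(initial_state, operators, goal_test,
                                     depth_first=True, depth_limit=limit)
            if solution is not None:
                return solution
        return None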

Bidirectional Search

We've concentrated so far on searches where the point of the search is to find a
solution, not the path to the solution. In other searches, we know the solution, and
we know the initial state, but we don't know how to get from one to the other, and
the point of the search is to find a path. In these cases, in addition to searching
forward from the initial state, we can sometimes also search backwards from the
solution. This is called a bidirectional search.
For example, consider the 8-puzzle game in the diagram below, where the point of
the game is to move the pieces around so that they are arranged in the right hand
diagram. It's likely that in the search for the solution to this puzzle (given an
arbitrary starting state), you might start off by moving some of the pieces around to
get some of them in their end positions. Then, as you got closer to the solution state,
you might work backwards: asking yourself, how can I get from the solution to
where I am at the moment, then reversing the search path. In this case, you've used
a bidirectional search.

Bidirectional search has the advantage that search in both directions is only required
to go to a depth half that of normal searches, and this can often lead to a drastic
reduction in the number of paths looked at. For instance, if we were looking for a
path from one town to another through at most six other towns, we only have to
look for a journey through three towns from both directions, which is fairly easy to
do, compared to searching all paths through six towns in a normal search.
Unfortunately, it is often difficult to apply a bidirectional search because (a) we
don't really know the solution, only a description of it (b) there may be many
solutions, and we have to choose some to work backwards from (c) we may not be able
to reverse our operators to work backwards from the solution and (d) we have to
record all the paths from both sides to see if any two meet at the same point - this
may take up a lot of memory, and checking through both sets repeatedly could take
up too much computing time.

3.5 Heuristic Search Strategies


Generally speaking, a heuristic search is one which uses a rule of thumb to improve
an agent's performance in solving problems via search. A heuristic search is not to
be confused with a heuristic measure. If you can specify a heuristic measure, then
this opens up a range of generic heuristic searches which you can try to improve
your agent's performance, as discussed below. It is worth remembering, however,
that any rule of thumb, for instance, choosing the order of operators when applied in
a simple breadth first search, is a heuristic.
In terms of our agenda analogy, a heuristic search chooses where to put a (state,
operator) pair on the agenda when it is proposed as a move in the state space. This
choice could be fairly complicated and based on many factors. In terms of the graph
analogy, a heuristic search chooses which node to expand at any point in the search.
By definition, a heuristic search is not guaranteed to improve performance for a
particular problem or set of problems, but they are implemented in the hope of
either improving the speed with which a solution is found and/or the quality of the
solution found. In fact, we may be able to find optimal solutions, which are as good
as possible with respect to some measure.

Optimality

The path cost of a solution is calculated as the sum of the costs of the actions which
led to that solution. This is just one example of a measure of value on the solution
of a search problem, and there are many others. These measures may or may not be
related to the heuristic functions which estimate the likelihood of a particular state
being in the path to a solution. We say that - given a measure of value on the
possible solutions to a search problem - one particular solution is optimal if it scores
higher than all the others with respect to this measure (or costs less, in the case of
path cost). For example, in the maze example given in section 3.2, there are many
paths from the start to the finish of the maze, but only one which crosses the fewest
squares. This is the optimal solution in terms of the distance travelled.
Optimality can be guaranteed through a particular choice of search strategy (for
instance the uniform path cost search described below). Alternatively, an agent can
choose to prove that a solution is optimal by appealing to some mathematical
argument. As a last resort, if optimality is necessary, then an agent must exhaust a
complete search strategy to find all solutions, then choose the one scoring the
highest (alternatively costing the lowest).

Uniform Path Cost Search

A breadth first search will find the solution with the shortest path length from the
initial state to the goal state. However, this may not be the least expensive solution
in terms of the path cost. A uniform path cost search chooses which node to expand
by looking at the path cost for each node: the node which has cost least to get to is
expanded first. Hence, if, as is usually the case, the path cost of a node increases
with the path length, then this search is guaranteed to find the least expensive
solution. It is therefore an optimal search strategy. Unfortunately, this search
strategy can be very inefficient.

Greedy Search

If we have a heuristic function for states, defined as above, then we can simply
measure each state with respect to this measure and order the agenda items in terms
of the score of the state in the item. So, at each stage, the agent determines which
state scores lowest and puts agenda items on the top of the agenda which contain
operators acting on that state. In this way, the most promising nodes in a search
space are expanded before the less promising ones. This is a type of best first
search known specifically as a greedy search.
In some situations, a greedy search can lead to a solution very quickly. However, a
greedy search can often go down blind alleys, which look promising to start with,
but ultimately don't lead to a solution. Often the best states at the start of a search
are in fact really quite poor in comparison to those further in the search space. One
way to counteract this blind-alley effect is to turn off the heuristic until a proportion
of the search space has been covered, so that the truly high scoring states can be
identified. Another problem with a greedy search is that the agent will have to keep
a record of which states have been explored in order to avoid repetitions (and
ultimately end up in a cycle), so a greedy search must keep all the agenda items it
has undertaken in its memory. Also, this search strategy is not optimal, because the
optimal solution may have nodes on the path which score badly for the heuristic
function, and hence a non-optimal solution will be found before an optimal one.
(Remember that the heuristic function only estimates the path cost from a node to a
solution).

A* Search

A* search combines the best parts of uniform cost search, namely the fact that it's
optimal and complete, and the best parts of greedy search, namely its speed. This
search strategy simply combines the path cost function g(n) and the heuristic
function h(n) by summing them to form a new heuristic measure f(n):
f(n) = g(n) + h(n)
Remembering that g(n) gives the path cost from the start state to state n and h(n)
estimates the path cost from n to a goal state, we see that f(n) estimates the cost of
the cheapest solution which passes through n.
The most important aspect of A* search is that, given one restriction on h(n), it is
possible to prove that the search strategy is complete and optimal. The restriction to
h(n) is that it must always underestimate the cost to reach a goal state from n. Such
heuristic measures are called admissible. See Russell and Norvig for proof that A*
search with an admissible heuristic is complete and optimal.
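
A minimal A* sketch is given below, assuming that states are hashable, that
successors(state) yields (next_state, step_cost) pairs, and that h is admissible.
These assumptions and the function names are mine, not Russell and Norvig's.

    import heapq
    import itertools

    def a_star(start, successors, goal_test, h):
        """A* search: always expand the state with the lowest f(n) = g(n) + h(n)."""
        counter = itertools.count()          # tie-breaker so states themselves are never compared
        frontier = [(h(start), next(counter), 0, start, [start])]
        best_g = {start: 0}                  # cheapest known path cost to each state
        while frontier:
            f, _, g, state, path = heapq.heappop(frontier)
            if goal_test(state):
                return path, g               # solution path and its path cost
            for nxt, step_cost in successors(state):
                new_g = g + step_cost
                if new_g < best_g.get(nxt, float("inf")):
                    best_g[nxt] = new_g
                    heapq.heappush(frontier,
                                   (new_g + h(nxt), next(counter), new_g, nxt, path + [nxt]))
        return None, float("inf")

Note that uniform path cost search falls out as the special case where h always returns 0.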

IDA* Search

A* search is a sophisticated and successful search strategy. However, a problem
with A* search is that it must keep all states in its memory, so memory is often a
much bigger consideration than time in designing agents to undertake A* searches.
We overcame the same problem with breadth first search by using an iterative
deepening search (IDS), and we do similar with A*.
Like IDS, an IDA* search is a series of depth first searches where the depth is
increased after each iteration. However, the depth is not measured in terms of the
path length, as it is in IDS, but rather in terms of the A* combined function f(n) as
described above. To do this, we need to define contours as regions of the search
space containing states where f is below some limit for all the states, as shown
pictorially here:

Each node in a contour scores less than a particular value and IDA* search agents
are told how much to increase the contour boundary by on each iteration. This
defines the depth for successive searches. When using contours, it is useful for the
function f(n) to be monotonic, i.e., f is monotonic if whenever an operator takes a
state s1 to a state s2, then f(s2) >= f(s1). In other words, if the value of f never
decreases along a path, then f is monotonic. As an exercise, why do we need
monotonicity to ensure optimality in IDA* search?

SMA* Search

IDA* search is very good from a memory point of view. In fact, it can be criticised
for not using enough memory - using more memory can increase the efficiency, so
really our search strategies should use all the available memory. Simplified
Memory-Bounded A* search (SMA*) is a search which does just that. This is a
complicated search strategy, with details given in Russell and Norvig.

Hill Climbing

As we've seen, in some problems, finding the search path from initial to goal state is
the point of the exercise. In other problems, the path and the artefact at the end of
the path are both important, and we often try to find optimal solutions. For a certain
set of problems, the path is immaterial, and finding a suitable artefact is the sole
purpose of the search. In these cases, it doesn't matter whether our agent searches
down a path for 10 or 1000 steps, as long as it finds a solution in the end.
For example, consider the 8-queens problem, where the task is to find an
arrangement of 8 queens on a chess board such that no one can "take" another (one
queen can take another if it is on the same horizontal, vertical or diagonal line). A
solution to this problem is:

One way to specify this problem is with states where there are a number of queens
(1 to 8) on the board, and an action is to add a queen in such a way that it can't take
another. Depending on your strategy, you may find that this search requires much
back-tracking, i.e., towards the end, you find that you simply can't place the
remaining queens anywhere, so you have to move one of the queens you put down
earlier (you go back up the search tree).
An alternative way of specifying the problem is that the states are boards with 8
queens already on them, and an action is a movement of one of the queens. In this
case, our agent can use an evaluation function and do hill climbing. That is, it
counts the number of pairs of queens where one can take the other, and only moves
a queen if that movement reduces the number of pairs. When there is a choice of
movements both resulting in the same decrease, the agent can choose one randomly
from the choices. In the 8-queens problem, there are only 56 * 8 = 448 possible
ways to move one queen, so our agent only has to calculate the evaluation function
448 times at each stage. If it only chooses moves where the situation with respect to
the evaluation function improves, it is doing hill climbing (or gradient descent if
it's better to think of the agent going downhill rather than uphill).
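A small Python sketch may make the numbers concrete. For brevity it keeps one queen per column and represents a board as a tuple of rows, so a move changes one queen's row; these representational choices are mine, not the notes', which allow a queen to move to any empty square.

import random

def attacking_pairs(board):
    """board[col] is the row of the queen in that column; count pairs that can take each other."""
    pairs = 0
    for c1 in range(len(board)):
        for c2 in range(c1 + 1, len(board)):
            same_row = board[c1] == board[c2]
            same_diagonal = abs(board[c1] - board[c2]) == c2 - c1
            if same_row or same_diagonal:
                pairs += 1
    return pairs

def hill_climb_step(board):
    """Evaluate every single-queen move and return one of the best, if it improves matters."""
    current = attacking_pairs(board)
    best_moves, best_score = [], current
    for col in range(len(board)):
        for row in range(len(board)):
            if row == board[col]:
                continue
            candidate = board[:col] + (row,) + board[col + 1:]
            score = attacking_pairs(candidate)
            if score < best_score:
                best_moves, best_score = [candidate], score
            elif best_moves and score == best_score:
                best_moves.append(candidate)
    # None means no single move improves the board: the search is stuck at a local optimum.
    return random.choice(best_moves) if best_moves else None

Repeatedly applying hill_climb_step from a random board, and starting again from a fresh random board whenever it returns None, gives the random-restart behaviour discussed below.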
A common problem with this search strategy is local maxima: the search has not
yet reached a solution, but it can only go downhill in terms of the evaluation
function. For example, we might get to the stage where only two queens can take
each other, but moving any queen increases this number to at least three. In cases
like this, the agent can perform a random restart, whereby it randomly chooses a
state from which to begin the whole process again. This search strategy has the
appeal of never requiring the agent to store more than one state at any one time (the
part of the hill the agent is on). Russell and Norvig make the analogy that this kind
of search is like trying to climb Mount Everest in the fog with amnesia, but they do
concede that it is often the search strategy of choice for some industrial problems.
Local and global maxima and minima are represented in the diagram below:

Simulated Annealing

One way to get around the problem of local maxima, and related problems such as
ridges and plateaux in hill climbing is to allow the agent to go downhill to some
extent. In simulated annealing - named because of an analogy with cooling a liquid
until it freezes - the agent chooses to consider a random move. If the move
improves the evaluation function, then it is always carried out. If the move doesn't
improve the evaluation function, then the agent will carry out the move with some
probability between 0 and 1. The probability decreases as the move gets worse in
terms of the evaluation function, so really bad moves are rarely carried out. This
strategy can often nudge a search out of a local maximum and the search can
continue towards the global maximum.
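The notes don't spell out the acceptance rule, but a common choice (borrowed from the physical analogy) accepts a worsening move with probability exp(delta / temperature), where delta is the (negative) change in the evaluation function. Below is a hedged Python sketch, with evaluate, random_neighbour and schedule as assumed problem-specific helpers.

import math
import random

def anneal(initial, evaluate, random_neighbour, schedule):
    """Sketch of simulated annealing for a maximisation problem.
    evaluate(state) is the evaluation function, random_neighbour(state) proposes a move,
    and schedule(t) gives the 'temperature' at step t, decreasing towards zero."""
    state = initial
    for t in range(1, 100000):              # arbitrary cap on the number of steps for this sketch
        temperature = schedule(t)
        if temperature <= 0:
            return state
        candidate = random_neighbour(state)
        delta = evaluate(candidate) - evaluate(state)
        if delta > 0 or random.random() < math.exp(delta / temperature):
            # Uphill moves are always taken; downhill moves are taken with a probability
            # that shrinks both as the move gets worse and as the temperature falls.
            state = candidate
    return state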

Random Search

Some problems to be solved by a search agent are more creative in nature, for
example, writing poetry. In this case, it is often difficult to project the word
'creative' on to a program because it is possible to completely understand why it
produced an artefact, by looking at its search path. In these cases, it is often a good
idea to try some randomness in the search strategy, for example randomly choosing
an item from the agenda to carry out, or assigning values from a heuristic measure
randomly. This may add to the creative appeal of the agent, because it makes it
much more difficult to predict what the agent will do.

3.6 Assessing Heuristic Searches


Given a particular problem you want to build an agent to solve, there may be more
than one way of specifying it as a search problem, more than one choice for the
search strategy and different possibilities for heuristic measures. To a large extent,
it is difficult to predict what the best choices will be, and it will require some
experimentation to determine them. In some cases - if we calculate the effective
branching rate, as described below - we can tell for sure whether one heuristic
measure is always being out-performed by another.

The Effective Branching Rate

Assessing heuristic functions is an important part of AI research: a particular
heuristic function may sound like a good idea, but in practice give no discernible
increase in the quality of the search. Search quality can be determined
experimentally in terms of the output from the search, and by using various
measures such as the effective branching rate. Suppose a particular problem P has
been solved by search strategy S by expanding N nodes, and the solution lay at
depth D in the space. Then the effective branching rate of S for P is calculated by
comparing S to a uniform search U. An example of a uniform search is a breadth
first search where the number of branches from any node is always the same (as in
our baby naming example). We then suppose the (uniform) branching rate of U is
such that, on exhausting its search to depth D, it too would have expanded exactly N
nodes. This imagined branching rate, written b*, is the effective branching rate of S
and is calculated thus:
N = 1 + b* + (b*)^2 + ... + (b*)^D.
Rearranging this equation will provide a value for b*. For example (taken from
Russell and Norvig), suppose S finds a solution at depth 5 having expanded 52
nodes. In this case:
52 = 1 + b* + (b*)^2 + ... + (b*)^5.

and it turns out that b* = 1.91. To calculate this, we use the well known identity for
the sum of a geometric series:

1 + b* + (b*)^2 + ... + (b*)^D = ((b*)^(D+1) - 1) / (b* - 1)

This enables us to write a polynomial for which b* is a zero, and we can solve this
using numerical techniques such as Newton's method.
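As a quick check on the worked example, the equation can also be solved numerically in a few lines of Python. Bisection is used here instead of Newton's method purely because it needs no derivative; the function name is mine, and the sketch assumes N > D + 1 so that b* > 1.

def effective_branching_rate(n_expanded, depth, tolerance=1e-6):
    """Solve 1 + b + b^2 + ... + b^depth = n_expanded for b by bisection."""
    def total(b):
        return sum(b ** i for i in range(depth + 1))
    low, high = 1.0, float(n_expanded)      # assumes n_expanded > depth + 1, so the root lies here
    while high - low > tolerance:
        mid = (low + high) / 2
        if total(mid) < n_expanded:
            low = mid
        else:
            high = mid
    return (low + high) / 2

print(round(effective_branching_rate(52, 5), 2))   # prints 1.91, matching the example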
It is usually the case that the effective branching rate of a search strategy is similar
over all the problems it is used for, so that it is acceptable to average b* over a
small set of problems to give a valid account. If a heuristic search has a branching
rate near to 1, then this is a good sign. We say that one heuristic function h1
dominates another, h2, if a search using h1 always has a lower effective branching
rate than the same search using h2. Having a lower effective branching rate is
clearly desirable because it means a quicker search.

Chapter-4
Knowledge Representation
To recap, we now have some characterizations of AI, so that when an AI problem
arises, you will be able to put it into context, find the correct techniques and apply
them. We have introduced the agents language so that we can talk about intelligent
tasks and how to carry them out. We have also looked at search in the general case,
which is central to AI problem solving. Most pieces of software have to deal with
data of some type, and in AI we use the more grandiose title of "knowledge" to
stand for data including (i) facts, such as the temperature of a patient (ii)
procedures, such as how to treat a patient with a high temperature and (iii) meaning,
such as why a patient with a high temperature should not be given a hot bath.
Accessing and utilizing all these kinds of information will be vital for an intelligent
agent to act rationally. For this reason, knowledge representation is our final general
consideration before we look at particular problem types.
To a large extent, the way in which you organize information available to and
generated by your intelligent agent will be dictated by the type of problem you are
addressing. Often, the best ways of representing knowledge for particular
techniques are known. However, as with the problem of how to search, you will
need a lot of flexibility in the way you represent information. Therefore, it is worth
looking at four general schemes for representing knowledge, namely logic,
semantic networks, production rules and frames. Knowledge representation
continues to be a much-researched topic in AI because of the realization fairly early
on that how information is arranged can often make or break an AI application.

4.1 Logical Representations


If all human beings spoke the same language, there would be a lot less
misunderstanding in the world. The problem with software engineering in general is
that there are often slips in communication which mean that what we think we've
told an agent and what we've actually told it are two different things. One way to
reduce this, of course, is to specify and agree upon some concrete rules for the
language we use to represent information. To define a language, we need to specify
the syntax of the language and the semantics. To specify the syntax of a
language, we must say what symbols are allowed in the language and what are legal
constructions (sentences) using those symbols. To specify the semantics of a
language, we must say how the legal sentences are to be read, i.e., what they mean.
If we choose a particular well defined language and stick to it, we are using a
logical representation.

Certain logics are very popular for the representation of information, and range in
terms of their expressiveness. More expressive logics allow us to translate more
sentences from our natural language (e.g., English) into the language defined by the
logic.
Some popular logics are:

Propositional Logic

This is a fairly restrictive logic, which allows us to write sentences about
propositions - statements about the world - which can either be true or false. The
symbols in this logic are (i) capital letters such as P, Q and R which represent
propositions such as: "It is raining" and "I am wet", (ii) connectives, which are: and
(∧), or (∨), implies (→) and not (¬), (iii) brackets and (iv) T, which stands for
the proposition "true", and F, which stands for the proposition "false". The syntax of
this logic is the set of rules specifying where in a sentence the connectives can go, for
example a binary connective such as ∧ must go between two propositions, or between
a bracketed conjunction of propositions, etc.
The semantics of this logic are rules about how to assign truth values to a sentence
if we know whether the propositions mentioned in the sentence are true or not. For
instance, one rule is that the sentence P ∧ Q is true only in the situation when both P
and Q are true. The rules also dictate how to use brackets. As a very simple
example, we can represent the knowledge in English that "I always get wet and
annoyed when it rains" as:

It is raining → (I am wet ∧ I am annoyed)

Moreover, if we program our agent with the semantics of propositional logic, then if
at some stage, we tell it that it is raining, it can infer that I will get wet and annoyed.
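As a toy illustration of what those semantics license, the snippet below uses plain Python booleans to stand in for the propositions (the names are purely illustrative) and checks which assignments to "I am wet" and "I am annoyed" are consistent with the rule once we are told it is raining:

def implies(p, q):
    """Truth table for the 'implies' connective: false only when p is true and q is false."""
    return (not p) or q

raining = True
# "It is raining" implies ("I am wet" and "I am annoyed"); given that it is raining,
# the only consistent reading is that both wet and annoyed are true.
for wet in (True, False):
    for annoyed in (True, False):
        if implies(raining, wet and annoyed):
            print(wet, annoyed)       # prints only: True True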

First Order Predicate Logic

This is a more expressive logic because it builds on propositional logic by allowing
us to use constants, variables, predicates, functions and quantifiers in addition to
the connectives we've already seen. For instance, the sentence: "Every Monday and
Wednesday I go to John's house for dinner" can be written in first order predicate
logic as:
∀X ((day_of_week(X, monday) ∨ day_of_week(X, wednesday)) →
(go_to(me, house_of(john)) ∧ eat_meal(me, dinner))).

Here, the symbols monday, wednesday, me, dinner and john are all constants:
base-level objects in the world about which we want to talk. The symbols
day_of_week, go_to and eat_meal are predicates which represent relationships
between the arguments which appear inside the brackets. For example in
eat_meal, the relationship specifies that a person (first argument) eats a particular
meal (second argument). In this case, we have represented the fact that me eats
dinner. The symbol X is a variable, which can take on a range of values. This
enables us to be more expressive, and in particular, we can quantify X with the
'forall' symbol ∀, so that our sentence of predicate logic talks about all possible X's.
Finally, the symbol house_of is a function, and - if we can - we are expected to
replace house_of(john) with the output of the function (john's house) given the
input to the function (john).
The syntax and semantics of predicate logic are covered in more detail as part of the
lectures on automated reasoning.

Higher Order Predicate Logic

In first order predicate logic, we are only allowed to quantify over objects. If we
allow ourselves to quantify over predicate or function symbols, then we have
moved up to the more expressive higher order predicate logic. This means that we
can represent meta-level information about our knowledge, such as "For all the
functions we've specified, they return the number 10 if the number 7 is input":

∀f (f(7) = 10).

Fuzzy Logic

In the logics described above, we have been concerned with truth: whether
propositions and sentences are true. However, with some natural language
statements, it's difficult to assign a "true" or "false" value. For example, is the
sentence: "Prince Charles is tall" true or false? Some people may say true, and
others false, so there's an underlying probability that we may also want to represent.
This can be achieved with so-called "fuzzy" logics. The originator of fuzzy logics,
Lotfi Zadeh, advocates not thinking about particular fuzzy logics as such, but rather
thinking of the "fuzzification" of current theories, and this is beginning to play a
part in AI. The combination of logics with theories of probability, and programming
agents to reason in the light of uncertain knowledge are important areas of AI
research. Various representation schemes such as Stochastic Logic Programs have
an aspect of both logic and probability.

Other logics

Other logics you may consider include:


Multiple valued logics, where different truth values such as "unknown" are allowed.
These have some of the advantages of fuzzy logics, without necessarily worrying
about probability.
Modal logics, which cater for individual agents' beliefs about the world. For
example, one agent could believe that a certain statement is true, but another may
not. Modal logics help us deal with statements that may be believed to be true to
some, but not all agents.
Temporal logics, which enable us to write sentences involving considerations of
time, for example that a statement may become true some time in the future.
It's not difficult to see why logic has been a very popular representation scheme in
AI:

It's fairly easy to represent knowledge in this way. It allows us to be expressive
enough to represent most knowledge, while being constrained enough to be precise
about that knowledge.
There are whole branches of mathematics devoted to the study of it.
We get a lot of reasoning for free (theorems can be deduced about information in a
logical representation and patterns can be similarly induced).
Some programming languages grew from logical representations, in particular
Prolog. So, if you understand the logic, it's fairly easy to write programs.

Chapter-5
Game Playing
We have now dispensed with the necessary background material for AI problem
solving techniques, and we can move on to looking at particular types of problems
which have been addressed using AI techniques. The first type of problem we'll
look at is getting an agent to compete, either against a human or another artificial
agent. This area has been extremely well researched over the last 50 years. Indeed,
some of the first chess programs were written by Alan Turing, Claude Shannon and
other fore-fathers of modern computing. We only have one lecture to look at this
topic, so we'll restrict ourselves to looking at two person games such as chess
played by software agents. If you are interested in games involving more teamwork
and/or robotics, then a good place to start would be with the RoboCup project.

5.1 MinMax Search
Parents often get two children to share a cake fairly by asking one to cut the cake
and the other to choose which half they want to eat. In this two player cake-scoffing
game, there is only one move (cutting the cake), and player one soon learns that if
he wants to maximize the amount of cake he gets, he had better cut the cake into
equal halves, because his opponent is going to try and minimize the cake that player
1 gets by choosing the biggest half for herself.
Suppose we have a two player game where the winner scores a positive number at
the end, and the loser scores nothing. In board games such as chess, the score is
usually just 1 for a win and 0 for a loss. In other games such as poker, however, one
player wins the (cash) amount that the other player loses. These are called zero-sum games, because when you add one player's winnings to the other player's loss,
the sum is zero.
The minimax algorithm is so called because it assumes that you and your opponent
are going to act rationally, and so you will choose moves to try to maximise your
final score and your opponent will choose moves to try to minimise your final score.
To demonstrate the minimax algorithm, it is helpful to have a game where the
search tree is fairly small. For this reason, we will invent the following very trivial
game:

Take a pack of cards and deal out four cards face up. Two players take it in turn to
choose a card each until they have two each. The object is to choose two cards so

that they add up to an even number. The winner is the one with the largest even
number n (picture cards all count as 10), and the winner scores n. If both players get
the same even number, it is a draw, and they both score zero.

Suppose the cards dealt are 3, 5, 7 and 8. We are interested in which card player one
should choose first, and the minimax algorithm can be used to decide this for us. To
demonstrate this, we will draw the entire search tree and put the scores below the
final nodes on paths which represent particular games.

Our aim is to write the best score on the top branches of the tree that player one can
guarantee to score if he chooses that move. To do this, starting at the bottom, we
will write the final scores on successively higher branches of the search tree until
we reach the top. Whenever there is a choice of scores to write on a particular
branch, we will assume that player two will choose the card which minimises player
one's final score, and player one will choose the card which maximises his/her
score. Our aim is to move the scores all the way up the graph to the top, which will
enable player one to choose the card which leads to the best guaranteed score for
the overall game. We will first write the scores on the edges of the tree in the
bottom two branches:

Now we want to move the scores up to the next level of branches in the tree.
However, there is a choice. For example, for the first branch on the second row, we
could write either 10 or -12. This is where our assumption about rationality comes
into play. We should write 10 there, because, supposing that player two has
actually chosen the 5, then player one can choose either 7 or 8. Choosing 7 would
result in a score of 10 for player 1, choosing 8 would result in a score of -12.
Clearly, player 1 would choose the 7, so the score we write on this branch is 10.
Hence, we should choose the maximum of the scores to write on the edges in the
row above. Doing the same for all the other branches, we get the following:

Finally, we want to put the scores on the top edges in the tree. Again, there is a
choice. However, in this case, we have to remember that player two is making the
choices, and they will act in order to minimise the score that player 1 gets. Hence,
in the case when player one chooses the 3 card, player 2 will choose the 7 to
minimise the score player 1 can get. Hence, we choose the minimum possibility of
the three to put on the edges at the top of the tree as follows:

To choose the correct first card, player one simply looks at the topmost edges of the
final tree and chooses the one with the highest score. In this case, choosing the 7
will guarantee that player one scores at least 10 in this game (assuming that player
one chooses according to the minimax strategy for move 2, but - importantly -
making no assumptions about how player two will choose).
Note that the process above was in order for player one to choose his/her first move.
The whole process would need to be repeated for player two's first move, and
player one's second move, etc. In general, agents playing games using a minimax
search have to calculate the best move at each stage using a new minimax search.
Don't forget that just because an agent thinks their opponent will act rationally,
doesn't mean they will, and hence they cannot assume a player will make a
particular move until they have actually done it.
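The hand simulation above can be mechanised in a few lines. The sketch below is my own Python rendering of minimax for this four-card game, not code from the notes; scores are taken from player one's point of view, so -12 means player two wins by 12, matching the tree above, and the function names are illustrative.

def game_value(p1_cards, p2_cards):
    """Score from player one's point of view: +n if player one wins with even total n,
    -n if player two wins with even total n, and 0 for a draw."""
    s1, s2 = sum(p1_cards), sum(p2_cards)
    e1 = s1 if s1 % 2 == 0 else None
    e2 = s2 if s2 % 2 == 0 else None
    if e1 is not None and (e2 is None or e1 > e2):
        return e1
    if e2 is not None and (e1 is None or e2 > e1):
        return -e2
    return 0

def minimax(remaining, p1_cards, p2_cards, p1_to_move):
    """Return (best value player one can guarantee, best card to take now)."""
    if not remaining:
        return game_value(p1_cards, p2_cards), None
    outcomes = []
    for i, card in enumerate(remaining):
        rest = remaining[:i] + remaining[i + 1:]
        if p1_to_move:
            value, _ = minimax(rest, p1_cards + [card], p2_cards, False)
        else:
            value, _ = minimax(rest, p1_cards, p2_cards + [card], True)
        outcomes.append((value, card))
    pick = max if p1_to_move else min          # maximise on player one's turn, minimise on player two's
    return pick(outcomes, key=lambda outcome: outcome[0])

print(minimax([3, 5, 7, 8], [], [], True))     # (10, 7): taking the 7 first guarantees at least 10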

5.2 Cutoff Search


To use a minimax search in a game playing situation, all we have to do is program our agent to look at
the entire search tree from the current state of the game, and choose the minimax solution before
making a move. Unfortunately, only in very trivial games such as the one above is it possible to
calculate the minimax answer all the way from the end states in a game. So, for games of higher
complexity, we are forced to estimate the minimax choice for world states using an evaluation
function. This is, of course, a heuristic function such as those we discussed in the lecture on search.
In a normal minimax search, we write down the whole search space and then propagate the scores
from the goal states to the top of the tree so that we can choose the best move for a player. In a cutoff
search, however, we write down the whole search space up to a specific depth, and then write down
the evaluation function for each of the states at the bottom of the tree. We then propagate these values
from the bottom to the top in exactly the same way as minimax.
The depth is chosen in advance to ensure that the agent doesn't take too long to choose a move: if it
has longer, then we allow it to go deeper. If our agent has a given time limit for each move, then it
makes sense to enable it to carry on searching until the time runs out. There are many ways to do the
search in such a way that a game playing agent searches as far as possible in the time available. As an
exercise, what possible ways can you think of to perform this search? It is important to bear in mind
that the point of the search is not to find a node in the above graph, but to determine which move the
agent should make.

Evaluation Functions

Evaluation functions estimate the score that can be guaranteed if a particular world state is reached. In
chess, such evaluation functions were known long before computers came along. One such
function simply counts the number of pieces on the board for a particular player. A more sophisticated
function scores more for the more influential pieces such as rooks and queens: each pawn is worth 1,
knights and bishops score 3, rooks score 5 and queens score 9. These scores are used in a weighted
linear function, where the number of pieces of a certain type is multiplied by a weight, and all the
products are added up. For instance, if in a particular board state, player one has 6 pawns, 1 bishop, 1
knight, 2 rooks and 1 queen, then the evaluation function, f for that board state, B, would be calculated
as follows:
f(B) = 1*6 + 3*1 + 3*1 + 5*2 + 9*1 = 31
The first number in each product is the weight in this evaluation function (i.e., the score assigned to
that type of piece).
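In code, such a weighted linear function is a one-liner; the sketch below simply reproduces the arithmetic above, using the standard material values quoted in the text (the function and dictionary names are mine).

PIECE_WEIGHTS = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}

def material_evaluation(piece_counts):
    """Weighted linear evaluation: sum of weight * number of pieces of each type."""
    return sum(PIECE_WEIGHTS[piece] * count for piece, count in piece_counts.items())

# The board state from the text: 6 pawns, 1 bishop, 1 knight, 2 rooks and 1 queen.
print(material_evaluation({"pawn": 6, "bishop": 1, "knight": 1, "rook": 2, "queen": 1}))   # 31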
Ideally, evaluation functions should be quick to calculate. If they take a long time to calculate, then
less of the space will be searched in a given time limit. Ideally, evaluation functions should also match
the actual score in goal states. Of course, this isn't true for our weighted linear function in chess,
because goal states only score 1 for a win and 0 for a loss. In fact, we don't need the match to be exact
- we can use any values for an evaluation function, as long it scores more for better board states.
A bad evaluation function can be disastrous for a game playing agent. There are two main problems
with evaluation functions. Firstly, certain evaluation functions only make sense for game states which
are quiescent. A board state is quiescent for an evaluation function, f, if the value of f is unlikely to
exhibit wild swings in the near future. For example, in chess, board states where one piece can take
another without a similarly valued piece being taken back in the next move (such as when a queen is
threatened by a pawn) are not quiescent for evaluation functions such as the weighted linear
evaluation function mentioned above. To get around this problem, we can make an agent's search
more sophisticated by implementing a quiescence search, whereby, given a non-quiescent state we
want to evaluate the function for, we expand that game state until a quiescent state is reached, and we
take the value of the function for that state. If quiescent positions are much more likely to occur than
non-quiescent positions in a search, then such an extension to the search will not slow things down too
much. In chess, a search strategy may choose to delve further into the space whenever a queen is
threatened, to try to avoid the quiescence problem.
It is also worth bearing in mind the horizon problem, where a game-playing agent cannot see far
enough into the search space. An example of the horizon problem given in Russell and Norvig is the
case of promoting a pawn to a queen in chess. In the board state they present, this can be forestalled
for a certain number of moves, but is inevitable. However, with a cutoff search at a certain depth, this
inevitability cannot be noticed until too late. It is likely that the agent trying to forestall the move
would have been better off doing something else with the moves it had available.
In the card game example above, game states are collections of cards, and a possible evaluation
function would be to add up the card values and take that if it was an even number, but score it as zero
if the sum is an odd number. This evaluation function matches exactly with the actual scores in goal
states, but is perhaps not such a good idea. Suppose the cards dealt were: 10, 3, 7 and 9. If player one
was forced to cut off the search after only the first card choice, then the cards would score: 10, 0, 0 and
0 respectively. So player one would choose card 10, which would be disastrous, as this will inevitably
lead to player one losing that game by at least twelve points. If we scale the game up to choosing
cards from 40 rather than 4, we can see that a more sophisticated heuristic involving the cards left
unchosen might be a better idea.
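That naive evaluation function takes only a few lines, and running it on the 10, 3, 7, 9 deal shows the problem directly (the function name is mine):

def naive_card_evaluation(cards_held):
    """Score a hand as its total if that total is even, and zero otherwise."""
    total = sum(cards_held)
    return total if total % 2 == 0 else 0

# Cutting off after player one's first choice from the 10, 3, 7, 9 deal:
print([naive_card_evaluation([c]) for c in (10, 3, 7, 9)])   # [10, 0, 0, 0]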

5.3 Pruning
Recall that pruning a search space means deciding that certain branches should not be explored. If an
agent knows for sure that exploring a certain branch will not affect its choice for a particular move,
then that branch can be pruned with no concern at all (i.e., no effect on the outcome of the search for a
move), and the speed up in search may mean that extra depths can be searched.
When using a minimax approach, either for an entire search tree or in a cutoff search, there are often
many branches which can be pruned because we find out fairly quickly that the best value down a
whole branch is not as good as the best value from a branch we have already explored. Such pruning
is called alpha-beta pruning.
As an example, suppose that there are four choices for player one, called moves M1, M2, M3 and M4,
and we are looking only two moves ahead (1 for player one and 1 for player two). If we do a depth
first search for player one's move, we can work out the score they are guaranteed for M1 before even
considering move M2. Suppose that it turns out that player one is guaranteed to score 10 with move
M1. We can use this information to reject move M2 without checking all the possibilities for player
two's move. For instance, suppose that the first choice possible for player two after M2 from player
one means that player one will score only 5 overall. In this case, we know that the maximum player
one can score with M2 is 5 or less. Of course, player one won't choose this, because M1 will score 10
for them. We see that there's no point checking all the other possibilities for M2. This can be seen in the
following diagram (ignore the X's and N's for the time being):

We see that we could reject M2 straight away, thus saving ourselves 3 nodes in the search space. We
could reject M3 after we came across the 9, and in the end M4 turns out to be better than M1 for player
one. In total, using alpha-beta pruning, we avoided looking at 5 end nodes out of 16 - around 30%. If
the calculation to assess the scores at end-game states (or estimate them with an evaluation function)
is computationally expensive, then this saving could enable a much larger search. Moreover, this kind
of pruning can occur anywhere on the tree. The general principles are that:
1. Given a node N which can be chosen by player one, then if there is another node, X, along any path,
such that (a) X can be chosen by player two (b) X is on a higher level than N and (c) X has been shown
to guarantee a worse score for player one than N, then all the nodes with the same parent as N can
be pruned.

2. Given a node N which can be chosen by player two, then if there is a node X along any path such that
(a) player one can choose X (b) X is on a higher level than N and (c) X has been shown to guarantee a
better score for player one than N, then all the nodes with the same parent as N can be pruned.

As an exercise: which of these principles did we use in the M1 - M4 pruning example above? (To
make it easy, I've written on the N's and X's).
Because we can prune using the alpha-beta method, it makes sense to perform a depth-first search
using the minimax principle. Compared to a breadth first search, a depth first search will get to goal
states quicker, and this information can be used to determine the scores guaranteed for a player at
particular board states, which in turn is used to perform alpha-beta pruning. If a game-playing agent
used a breadth first search instead, then only right at the end of the search would it reach the goal
states and begin to perform minimax calculations. Hence, the agent would miss much potential to
perform pruning.
Using a depth first search and alpha-beta pruning is fairly sensitive to the order in which we try
operators in our search. In the example above, if we had chosen to look at move M4 first, then we would
have been able to do more pruning, due to the higher minimum value (11) from that branch. Often, it
is worth spending some time working out how best to order a set of operators, as this will greatly
increase the amount of pruning that can occur.
It's obvious that a depth-first minimax search with alpha-beta pruning dominates minimax
search alone. In fact, if the effective branching rate of a normal minimax search is b, then utilising
alpha-beta pruning (with a good ordering of moves) will reduce this rate to roughly √b. In chess, this
means that the effective branching rate reduces from 35 to around 6, meaning that alpha-beta search
can look considerably further ahead than a normal minimax search with cutoff for the same effort.
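For completeness, here is a generic depth-limited minimax with alpha-beta pruning, sketched in Python. It is not the notes' own code: successors and evaluate are assumed problem-specific helpers, with evaluate playing the role of the evaluation function from the cutoff search section.

def alpha_beta(state, depth, alpha, beta, maximising, successors, evaluate):
    """Depth-limited minimax with alpha-beta pruning.
    alpha is the best score the maximiser can already guarantee higher up the tree,
    beta the best (lowest) score the minimiser can guarantee; once they cross, the
    remaining successors of this node cannot affect the choice and are pruned."""
    children = successors(state, maximising)
    if depth == 0 or not children:
        return evaluate(state)
    if maximising:
        value = float("-inf")
        for child in children:
            value = max(value, alpha_beta(child, depth - 1, alpha, beta, False,
                                          successors, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                       # the minimiser will never allow this branch
        return value
    else:
        value = float("inf")
        for child in children:
            value = min(value, alpha_beta(child, depth - 1, alpha, beta, True,
                                          successors, evaluate))
            beta = min(beta, value)
            if beta <= alpha:
                break                       # the maximiser already has something better
        return value

# A typical top-level call for the player to move:
# best_value = alpha_beta(root, depth, float("-inf"), float("inf"), True, successors, evaluate)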

Chapter-6
First-Order Logic
6.1 There's Reasoning, and then There's Reasoning
As humans, we have always prided ourselves on our ability to think things through: to reason things
out and come to the only conclusion possible in a Sherlock Holmes kind of way. But what exactly do
we mean by "reasoning" and can we automate this process?
We can take Sherlock Holmes as a case study for describing different types of reasoning. Suppose
after solving another major case, he says to Dr. Watson: "It was elementary my dear Watson. The
killer always left a silk glove at the scene of the murder. That was his calling card. Our investigations
showed that only three people have purchased such gloves in the past year. Of these, Professor
Doolally and Reverend Fisheye have iron-clad alibis, so the murderer must have been Sergeant
Heavyset. When he tried to murder us with that umbrella, we knew we had our man."
At least five types of reasoning can be identified here.

Firstly, how do we know that the killer always left a silk glove at the murder scene? Well, this is
because Holmes has observed a glove at each of the murders and basically guessed that they have
something to do with the murder simply by always being there. This type of reasoning is called
inductive reasoning, where a hypothesis has been induced from some data. We will cover this in the
lectures on machine learning.
Secondly, Holmes used abductive reasoning to dredge from his past experience the explanation that
the gloves are left by the murderer as a calling card. We don't really cover abductive reasoning in
general on this course, unfortunately.

Thirdly, Sherlock tracked down the only three people who bought the particular type of glove
left at the scene. This can be seen - perhaps quite loosely - as model generation, which plays a
part in the reasoning process. Models are usually generated to prove that something exists, or
to disprove a hypothesis by providing a counterexample to it. We cover model
generation in brief detail.

Fourthly, Sherlock managed to obtain alibis for two suspects, but not for the third. Hence, he ruled
out two possibilities leaving only one. This can be seen as constraint-based reasoning, and we will
cover this in the lecture on constraint solving.

Finally, Sherlock had two pieces of knowledge about the world, which he assumed were true: (i) the
killer leaves a silk glove at the murder scene (ii) the only person who could have left a glove was
Sergeant Heavyset. Using this knowledge, he used deductive reasoning to infer the fact that the killer
must be Heavyset himself. It's so obvious that we hardly see it as a reasoning step, but it is one: it's
called using the Modus Ponens rule of inference, which we cover in the lectures on automated
reasoning following this one.

As an aside, it's worth pointing out that - presumably for heightened tension - in most Sherlock
Holmes books, the murderer confesses, either by sobbing into a cup of tea and coming quietly, or by
trying to kill Holmes, Watson, the hapless inspector Lestrade or all three. This means that the case
never really has to go to trial. Just once, I'd like to see the lawyers get involved, and to see the
spectacle of Holmes trying to justify his reasoning. This could be disastrous as all but his deductive
reasoning was unsound. Imagine a good lawyer pointing out that all five victims happened - entirely
coincidentally - to be members of the silk glove appreciation society.....
Automating Reasoning is a very important topic in AI, which has received much attention, and has
found applications in the verification of hardware and software configurations, amongst other areas.
The topic known as "Automated Reasoning" in AI concentrates mostly on deductive reasoning,
where new facts are logically deduced from old ones. It is important to remember that this is only one
type of reasoning, and there are many others. In particular, in our lectures on machine learning later,
we cover the notion of inductive reasoning, where new facts are guessed at, using empirical evidence.
Automated Reasoning is, at present, mostly based on how we wish we reasoned: logically, following
prescribed rules to start from a set of things we know are true (called axioms), and end with new
knowledge about our world. The way we actually reason is much more sloppy: we use creativity, refer
to previous examples, perform analogies, wait for divine inspiration, and so on. To make this more
precise, we say that automated reasoning agents are more formal in their reasoning than humans.
The formal approach to reasoning has advantages and disadvantages. In general, if a computer
program has proved something fairly complex (for instance that a circuit board functions as
specified), then people are more happy to accept the proof than one done by a human. This is because
there is much less room for error in a well-written automated reasoning program. On the other hand,
by being less formal, humans can often skip around the search space much more efficiently and prove
more complicated results. Humans are still much more gifted at deducing things than computers are
likely to be any time soon.
In order to understand how AI researchers gave agents the ability to reason, we first look at how
information about the world is represented using first-order logic. This will lead us into the
programming language Prolog, and we will use Prolog to demonstrate a simple but effective type of
AI program known as an expert system.

6.2 Syntax and Semantics

Propositional logic is restricted in its expressiveness: it can only represent true and false facts about
the world. By extending propositional logic to first-order logic - also known as predicate logic and
first order predicate logic - we enable ourselves to represent much more information about the
world. Moreover, as we will see in the next lecture, first-order logic enables us to reason about the
world using rules of deduction.
We will think about first-order logic as simply a different language, like French or German. We will
need to be able to translate sentences from English to first-order logic, in order to give our agent
information about the world. We will also need to be able to translate sentences from first-order logic
into English, so that we understand what our agent has deduced from the facts we gave it. To do this,
we will look at the combinations of symbols we are allowed to use in first-order logic (the syntax of
the language). We will also determine how we assign meaning to the sentences in the language (the
semantics), and how we translate from one language to another, i.e., English to Logic and vice-versa.

Predicates

First and foremost in first-order logic sentences, there are predicates. These are indications that some
things are related in some way. We call the things which are related by a predicate the arguments of
the predicate, and the number of arguments which are related is called the arity of the predicate. The
following are examples of predicates:
lectures_ai(simon)                 ("simon lectures AI")             arity is 1 here
father(bob, bill)                  ("bob is bill's father")          arity is 2 here
lives_at(bryan, house_of(jack))    ("bryan lives at jack's house")   arity is 2 here

Connectives

We can string predicates together into a sentence by using connectives in the same way that we did
for propositional logic. We call a set of predicates strung together in the correct way a sentence. Note
that a single predicate can be thought of as a sentence.
There are five connectives in first-order logic. First, we have "and", which we write ∧, and "or",
which we write ∨. These connect predicates together in the obvious ways. So, if we wanted to say
that "Simon lectures AI and Simon lectures bioinformatics", we could write:

lectures_ai(simon) ∧ lectures_bioinformatics(simon)

Note also, that now we are talking about different lectures, it might be a good idea to change our
choice of predicates, and make ai and bioinformatics constants:

lectures(simon, ai) ∧ lectures(simon, bioinformatics)

The other connectives available to us in first-order logic are (a) "not", written ¬, which negates the
truth of a predicate, (b) "implies", written →, which can be used to say that one sentence being true
follows from another sentence being true, and (c) "if and only if" (also known as "equivalence"),
written ↔, which can be used to state that the truth of one sentence is always the same as the truth of
another sentence.
For instance, if we want to say that "if Simon isn't lecturing AI, then Bob must be lecturing AI", we
could write it thus:

¬lectures(simon, ai) → lectures(bob, ai)

The things which predicates relate are terms: these may be constants, variables or the output from
functions.

Constants

Constants are things which cannot be changed, such as england, black and barbara. They stand for
one thing only, which can be confusing when the constant is something like blue, because we know
there are different shades of blue. If we are going to talk about different shades of blue in our
sentences, however, then we should not have made blue a constant, but rather used shade_of_blue as a
predicate, in which we can specify some constants, such as navy_blue, aqua_marine and so on. When
translating a sentence into first-order logic, one of the first things we must decide is what objects are
to be the constants. One convention is to use lower-case letters for the constants in a sentence, which
we also stick to.

Functions

Functions can be thought of as special predicates, where we think of all but one of the arguments as
input and the final argument as the output. For each set of things which are classed as the input to a
function, there is exactly one output to which they are related by the function. To make it clear that we
are dealing with a function, we can use an equality sign. So, for example, if we wanted to say that the
cost of an omelette at the Red Lion pub is five pounds, the normal way to express it in first-order logic
would probably be:
cost_of(omelette, red_lion, five_pounds)

However, because we know this is a function, we can make this clearer:

cost_of(omelette, red_lion) = five_pounds

Because we know that there is only one output for every set of inputs to a function, we allow
ourselves to use an abbreviation when it would make things clearer. That is, we can talk about the
output from a function without explicitly writing it down, but rather replacing it with the left hand side
of the equation. So, for example, if we wanted to say that the price of omelettes at the Red Lion is less
than the price of pancakes at the House Of Pancakes, we would normally write something like this:
cost_of(omelette, red_lion) = X ∧ cost_of(pancake, house_of_pancakes) = Y ∧ less_than(X, Y).

This is fairly messy, and involves variables (see next subsection). However, allowing ourselves the
abbreviation, we can write it like this:
less_than(cost_of(omelette, red_lion), cost_of(pancake, house_of_pancakes))

which is somewhat easier to follow.

Variables and Quantifiers

Suppose now that we wanted to say that there is a meal at the Red Lion which costs only 3 pounds. If
we said that cost_of(meal, red_lion) = three_pounds, then this states that a particular meal (a constant,
which we've labeled meal) costs 3 pounds. This does not exactly capture what we wanted to say. For a
start, it implies that we know exactly which meal it is that costs 3 pounds, and moreover, the landlord
at the Red Lion chose to give this the bizarre name of "meal". Also, it doesn't express the fact that
there may be more than one meal which costs 3 pounds.
Instead of using constants in our translation of the sentence "there is a meal at the Red Lion costing 3
pounds", we should have used variables. If we had replaced meal with something which reflects the
fact that we are talking about a generic, rather than a specific meal, then things would have been
clearer. When a predicate relates something that could vary (like our meal), we call these things
variables, and represent them with an upper-case word or letter.
So, we should have started with something like
meal(X) ∧ cost_of(red_lion, X) = three_pounds,

which reflects the fact that we're talking about some meal at the Red Lion, rather than a particular one.
However, this isn't quite specific enough. We need to tell the reader of our translated sentence
something more about our beliefs concerning the variable X. In this case, we need to tell the reader
that we believe there exists such an X. There is a specific symbol in predicate logic which we use for
this purpose, called the 'exists symbol'. This is written: ∃. If we put it around our pair of predicates,
then we get a fully formed sentence in first-order logic:

∃X (meal(X) ∧ cost_of(red_lion, X) = three_pounds)

This is read as "there is something called X, where X is a meal and X costs three pounds at the Red
Lion".
But what now if we want to say that all meals at the Red Lion cost three pounds. In this case, we need
to use a different symbol, which we call the 'forall' symbol. This states that the predicates concerning
the variable to which the symbol applies are true for all possible instances of that variable. So, what
would happen if we replaced the exists symbol above by our new forall symbol? We would get this:

∀X (meal(X) ∧ cost_of(red_lion, X) = three_pounds)

Is this actually what we wanted to say? Aren't we saying something about all meals in the universe?
Well, actually, we're saying something about every object in the Universe: everything is a meal which
you can buy from the Red Lion. For three pounds! What we really wanted to say should have been
expressed more like this:

∀X (meal(X) → cost_of(red_lion, X) = three_pounds)

This is read as: forall objects X, if X is a meal, then it costs three pounds in the Red Lion. We're still
not there, though. This implies that every meal can be brought at the Red Lion. Perhaps we should
throw in another predicate: serves(Pub, Meal) which states that Pub serves the Meal. We can now
finally write what we wanted to say:

∀X ((meal(X) ∧ serves(red_lion, X)) → cost_of(red_lion, X) = three_pounds)

This can be read as: for all objects X, if X is a meal and X is served in the Red Lion, then X costs
three pounds.
The act of making ourselves clear about a variable by introducing an exists or a forall sign is called
quantifying the variable. The exists and forall signs are likewise called quantifiers in first-order logic.
Substituting a ground term for a variable is often called "grounding a variable", "applying a
substitution" or "performing an instantiation". An example of instantiation is: turning the sentence
"All meals are five pounds" into "Spaghetti is five pounds" - we have grounded the value of the
variable meal to the constant spaghetti to give us an instance of the sentence.

Translating from English to First-Order Logic Pitfalls

We have now seen some examples of first order sentences, and you should practice writing down
English sentences in first-order logic, to get used to them.


There are many ways to translate things from English to Predicate Logic incorrectly, and we can
highlight some pitfalls to avoid. Firstly, there is often a mix up between the "and" and "or"
connectives. We saw in a previous lecture that the sentence "Every Monday and Wednesday I go to
John's house for dinner" can be written in first order first-order logic as:

X ((day_of_week(X, monday)
(go_to(me, house_of(john))

day_of_week(X, wednesday))
eat_meal(me, dinner)))

and it's important to note that the "and" in the English sentence has changed to an "or" sign in the
first-order logic translation. Because we have turned this sentence into an implication, we need to
make it clear that if the day of the week is Monday or Wednesday, then we go to John's house for
dinner. Hence the disjunction sign (the "or" sign) is introduced. Note that we call the "and" sign the
conjunction sign.
Another common problem is getting the choice, placement and order of the quantifiers wrong. We
saw this with the Red Lion meals example above. As another example, try translating the sentence:
"Only red things are in the bag". Here are some incorrect answers:
∃X (in_bag(X) ∧ red(X))

∀X (red(X) → in_bag(X))

∀X ∃Y (bag(X) ∧ in_bag(Y, X) ∧ red(Y))

Question: "Why are these incorrect, what are they actually saying, and what is the correct answer?"
Another common problem is using commonsense knowledge to introduce new predicates. While this
may simplify things, the agent you're communicating with is unlikely to know the piece of
commonsense knowledge you are expecting it to. For example, some people translate the sentence:
"Any child of an elephant is an elephant" as:

∀X ∀Y ((parent(X, Y) ∧ elephant(X)) → elephant(Y))

even though they're told to use the predicate child. What they have done here is use their knowledge
about the world to substitute the predicate 'parent' for 'child'. It's important to never assume this kind
of commonsense knowledge in an agent: unless you've specifically programmed it to, an agent will
not know the relationship between the child predicate and the parent predicate.

Translating from First-Order Logic to English

There are tricks to compress what is written in logic into a succinct, understandable English sentence.

For instance, look at this sentence from earlier:


∃X (meal(X) ∧ cost_of(red_lion, X) = three_pounds)

This is read as "there is something called X, where X is a meal and X costs three pounds at the Red
Lion". We can abbreviate this to: "there is a meal, X, which costs three pounds at the Red Lion", and
finally, we can ignore the X entirely: "there is a meal at the Red Lion which costs three pounds". In
performing these abbreviations, we have interpreted the sentence.
Interpretation is fraught with danger. Remember that the main reason we will want to translate from
first-order logic is so that we can read the output from a reasoning agent which has deduced
something new for us. Hence it is important that we don't ruin the good work of our agent by misinterpreting the information it provides us with.

6.3 The Prolog Programming Language


Most programming languages are procedural: the programmer specifies exactly the right instructions
(algorithms) required to get an agent to function correctly. It comes as a surprise to many people that
there is another way to write programs. Declarative programming is when the user declares what the
output to a function should look like given some information about the input. The agent then searches
for an answer which fits the declaration, and returns any it finds.
As an example, imagine a parent asking their child to run to the shop and buy some groceries. To do
this in a declarative fashion, the parent simply has to write down a shopping list. The parents have
"programmed" their child to perform their task in the knowledge that the child has underlying search
routines which will enable him or her to get to the shop, find and buy the groceries, and come home.
To instruct their child in a procedural fashion, they would have to tell the child to go out of the front
door, turn left, walk down the street, stop after 70 steps, and so on.
We see that declarative programming languages can have some advantages over procedural ones. In
fact, it is often said that a Java program written to do the same as a Prolog program usually takes
about 10 times the number of lines of code. Many AI researchers try out an idea in Prolog before
implementing it more fully in other languages, because Prolog can be used to perform searches easily
(see later).
A well-known declarative language which is used a lot by AI researchers is Prolog, which is based on
first-order logic. For any declarative programming language, the two most important aspects are: how
information is represented, and the underlying search routines upon which the language is based.
Robert Kowalski put this in a most succinct way:
Algorithm = Logic + Control.

Representation in Prolog - Logic Programs

If we impose some additional constraints on first-order logic, then we get to a representation language
known as logic programs. The main restriction we impose is that all the knowledge we want to
encode is represented as Horn clauses. These are implications which comprise a body and a head,
where the predicates in the body are conjoined and they imply the single predicate in the head. Horn
clauses are universally quantified over all the variables appearing in them. So, an example Horn
clause looks like this:

∀x, y, z (b1(x,y) ∧ b2(x) ∧ ... ∧ bn(x,y,z) → h(x,y))

We see that the body consists of predicates bi and the head is h(x,y). We can make this look a lot more
like the Prolog programs you are used to writing by making a few syntactic changes: first, we turn the
implication around and write it as :- thus:
∀x, y, z (h(x,y) :- b1(x,y) ∧ b2(x) ∧ ... ∧ bn(x,y,z))

Next, we change the ∧ symbols to commas:

∀x, y, z (h(x,y) :- b1(x,y), b2(x), ..., bn(x,y,z))

Finally, we remove the universal quantification (it is assumed in Prolog), make the variables capital
letters (Prolog requires this), and put a full stop at the end:
h(X,Y) :- b1(X,Y), b2(X), ..., bn(X,Y,Z).

Note that we use the notation h/2 to indicate that predicate h has arity 2. Also, we call a set of Horn
clauses like these a logic program. Representing knowledge with logic programs is less expressive
than full first order logic, but it can still express lots of types of information. In particular, disjunction
can be achieved by having different Horn clauses with the same head. So, this sentence in first-order
logic:
∀x ((a(x) ∨ b(x)) → (c(x) ∧ d(x)))

can be written as the following logic program:


c(x) :- a(x).
c(x) :- b(x).
d(x) :- a(x).
d(x) :- b(x).

We also allow ourselves to represent facts as atomic ground predicates. So, for instance, we can state

that:
parent(georgesenior, georgedubya).
colour(red).

and so on.

Search mechanisms in Prolog

We can use this simple Prolog program to describe how Prolog searches:
president(X) :- first_name(X, georgedubya), second_name(X, bush).
prime_minister(X) :- first_name(X, maggie), second_name(X, thatcher).
prime_minister(X) :- first_name(X, tony), second_name(X, blair).
first_name(tonyblair, tony).
first_name(georgebush, georgedubya).
second_name(tonyblair, blair).
second_name(georgebush, bush).

If we loaded this into a Prolog implementation such as Sicstus, and queried the database:
?- prime_minister(P).

then Sicstus would search in the following manner: it would run through its database until it came
across a Horn clause (or fact) for which the head was prime_minister and the arity of the predicate
was 1. It would first look at the president clause, and reject this, because the name of the head
doesn't match with the head in the query. However, next it would find that the clause:
prime_minister(X) :- first_name(X, maggie), second_name(X, thatcher).

fits the bill. It would then look at the predicates in the body of the clause and see if it could satisfy
them. In this case, it would try to find a match for first_name(X, maggie). However, it would fail,
because no such information can be found in the database. That means that the whole clause fails, and
Sicstus would backtrack, i.e., it would go back to looking for a clause with the same head as the
query. It would, of course, next find this clause:
prime_minister(X) :- first_name(X, tony), second_name(X, blair).

Then it would look at the body again, and try to find a match for first_name(X, tony). It would
look through the database and find X=tonyblair a good assignment, because the fact

first_name(tonyblair, tony) is found towards the end of the database. Likewise, having assigned
X=tonyblair, it would then look for a match to: second_name(tonyblair, blair), and would
succeed. Hence, the answer tonyblair would make the query succeed, and this would be reported

back to us.
The important thing to remember is that Prolog implementations search from the top to the bottom of
the database, and try each term in the body of a clause in the order in which they appear. We say that
Sicstus has proved the query prime_minister(P) by finding something which satisfied the
declaration of what a prime minister is: Tony Blair. It is also worth remembering that Sicstus assumes
negation as failure. This means that if it cannot prove a predicate, then the predicate is false. Hence
the query:
?- \+ president(tonyblair).

Returns an answer of 'true', because Sicstus cannot prove that Tony Blair is a president.
Note that, as part of its search, Prolog also makes inferences using the generalised Modus-Ponens
rule of inference and unification of clauses. We will look in detail at these processes in the next
lecture.

6.4 Logic-based Expert Systems


Expert systems are agents which are programmed to make decisions about real world situations. They
are put together by using knowledge elicitation techniques to extract information from human
experts. A particularly fruitful area is in diagnosis of diseases, where expert systems are used to
decide (suggest) what disease a patient has, given their symptoms.
Expert systems are one of the major success stories of AI. Russell and Norvig give a very nice
example from medicine:

"A leading expert on lymph-node pathology describes a fiendishly difficult case to the expert system,
and examines the system's diagnosis. He scoffs at the system's response. Only slightly worried, the
creators of the system suggest he ask the computer for an explanation of the diagnosis. The machine
points out the major factors influencing its decision and explains the subtle interaction of several of
the symptoms in this case. The expert admits his error, eventually."

Often, the rules from the expert are encoded as if-then rules in first-order logic and the
implementation of the expert system can be fairly easily achieved in a programming language such as

Prolog.
We can take our card game from the previous lecture as a case study for the implementation of a
logic-based expert system. The rules were: four cards are laid on the table face up. Player 1 takes the
first card, and they take it in turns until they both have two cards each. To see who has won, they each
add up their two card numbers, and the winner is the one with the highest even number. The winner
scores the even number they have. If there's no even number, or both players achieve the same even
number, then the game is drawn.
It could be argued that undertaking a minimax search is a little unnecessary for this game, because we
could easily just specify a set of rules for each player, so that they choose cards rationally. To
demonstrate this, we will write down some Prolog rules which specify how player one should choose
the first card.
For example, suppose the cards dealt were: 4, 5, 6, 10. In this case, the best choice of action for player
one is to choose the 10, followed presumably by the 4, because player two will pick the 6. We need to
abstract from this particular example to the general case: we see that there were three even numbers
and one odd one, so player one is guaranteed another even number to match the one they chose. This
is also true if there are four even numbers. Hence we have our first rule:

If there are three or four even numbered cards, then player one should choose the highest even
numbered card in their first go.

When there are three or four odd cards it's not difficult to see that the most rational action for player
one is to choose the highest odd numbered card:

If there are three or four odd numbered cards, then player one should choose the highest odd
numbered card in their first go.

The only other situation is when there are two even and two odd cards. In this case, I'll leave it as an
exercise to convince yourselves that there are no rules governing the choice of player one's first card:
they can simply choose randomly, because they're not going to win unless player two makes a
mistake.
To write an expert system to decide which card to choose in a game, we will need to translate our
rules into first-order logic, and then into a Prolog implementation. Our first rule states that, in a game,
g:

(number_of_even_at_start(g,3) ∨ number_of_even_at_start(g,4)) ∧ highest_even_at_start(g,h)
    → player_one_chooses(g,h)

The meanings of the predicates should be obvious from their names. Similarly, our second rule can be written as:

(number_of_odd_at_start(g,3) ∨ number_of_odd_at_start(g,4)) ∧ highest_odd_at_start(g,h)
    → player_one_chooses(g,h)
There are many different ways to encode these rules as a Prolog program. Different implementations
will differ in their execution time, but for our simple program, it doesn't really matter which predicates
we choose to implement. We will make our top level predicate: player_one_chooses/2. This
predicate will take a list of card numbers as the first argument, and it will choose a member of this list
to put as the second argument. In this way, the same predicate can be used in order to make second
choices.
Using our above logical representation, we can start by defining:

player_one_chooses(CardList, CardToChoose) :-
    length(CardList, 4),
    number_of_evens(CardList, 3),
    biggest_even_in_list(CardList, CardToChoose).
player_one_chooses(CardList, CardToChoose) :-
    length(CardList, 4),
    number_of_evens(CardList, 4),
    biggest_even_in_list(CardList, CardToChoose).
player_one_chooses(CardList, CardToChoose) :-
    length(CardList, 4),
    number_of_odds(CardList, 3),
    biggest_odd_in_list(CardList, CardToChoose).
player_one_chooses(CardList, CardToChoose) :-
    length(CardList, 4),
    number_of_odds(CardList, 4),
    biggest_odd_in_list(CardList, CardToChoose).
player_one_chooses([CardToChoose|_], CardToChoose).

We see that there are four choices depending on the number of odds and evens in the CardList. To
make these predicates work, we need to fill in the details of the other predicates. Assuming that we
have some basic list predicates: length/2 which calculates the size of a list, sort/2 which sorts a list,
and last/2 which returns the last element in a list, then we can write down the required predicates:
iseven(A) :-
    0 is A mod 2.
isodd(A) :-
    1 is A mod 2.

even_cards_in_list(CardList, EvenCards) :-
    findall(EvenCard, (member(EvenCard, CardList), iseven(EvenCard)), EvenCards).
odd_cards_in_list(CardList, OddCards) :-
    findall(OddCard, (member(OddCard, CardList), isodd(OddCard)), OddCards).

number_of_evens(CardList, NumberOfEvens) :-
    even_cards_in_list(CardList, EvenCards),
    length(EvenCards, NumberOfEvens).
number_of_odds(CardList, NumberOfOdds) :-
    odd_cards_in_list(CardList, OddCards),
    length(OddCards, NumberOfOdds).

biggest_odd_in_list(CardList, BiggestOdd) :-
    odd_cards_in_list(CardList, OddCards),
    sort(OddCards, SortedOddCards),
    last(SortedOddCards, BiggestOdd).
biggest_even_in_list(CardList, BiggestEven) :-
    even_cards_in_list(CardList, EvenCards),
    sort(EvenCards, SortedEvenCards),
    last(SortedEvenCards, BiggestEven).

It's left as an exercise to write down the rules for player one's next choice, and player two's choices.
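To see the rules firing, we can pose a query with the example hand from above (a sketch of a session; output format varies between Prolog implementations):

?- player_one_chooses([4, 5, 6, 10], Card).
Card = 10

There are three even cards in the list, so the first clause succeeds and the biggest even card, 10, is chosen, just as our first rule dictates.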

Chapter-7
Making Deductive Inferences
We have shown how knowledge can be represented in first-order logic, and how
rule-based expert systems expressed in logic can be constructed and used. We now
look at how to take some known facts about a domain and deduce new facts from
them. This will, in turn, enable agents to prove things, i.e., to start with a set of
statements we believe to be true (axioms) and deduce whether another statement
(theorem) is true or not. We will first look at how to tell whether a sentence in
propositional logic is true or false. This will suggest some equivalences between
propositional sentences, which allow us to rewrite sentences to other sentences
which mean the same thing, regardless of the truth or meaning of the individual
propositions they contain. These are reversible inferences, in that deduction can be
applied either way. We then look at propositional and first-order inference rules in
general, which enable us to deduce new sentences if we know that certain things are
true, and which may not be reversible.

7.1 Truth Tables


(Material covered in Lecture 6)

In propositional logic, where we are restricted to expressing sentences in which
propositions are true or false, we can check whether a particular statement is true
or false by working out the truth of ever larger sub-statements using the truth of the
propositions themselves. To work out the truth of sub-statements, we need to know
how to deal with truth assignments in the presence of connectives. For instance, if
we know that is_president(barack_obama) and is_male(barack_obama) are true,
then we know that the sentence:

is_male(barack_obama) ∧ is_president(barack_obama)

is also true, because we know that a sentence of the form P ∧ Q is true when P is
true and Q is true.

The truth values of the connectives, given the truth values of the propositions they
contain, are presented in the following truth table:

P      Q      ¬P     P ∧ Q   P ∨ Q   P → Q   P ↔ Q
True   True   False  True    True    True    True
True   False  False  False   True    False   False
False  True   True   False   True    True    False
False  False  True   False   False   True    True

This table allows us to read the truth of the connectives in the following manner.
Suppose we are looking at row three. This says that, if P is false and Q is true, then
1. ¬P is true
2. P ∧ Q is false
3. P ∨ Q is true
4. P → Q is true
5. P ↔ Q is false

Note that, if P is false, then regardless of whether Q is true or false, the statement P
→ Q is true. This takes a little getting used to, but can be a very useful tool in
theorem proving: if we know that something is false, it can imply anything we want
it to! So, the following sentence is true: "Barack Obama is female" implies that
"Barack Obama is an alien", because the premise that Barack Obama is female was
false, so the conclusion that Barack Obama is an alien can be deduced in a sound
way.
Each row of a truth table defines the connectives for a particular assignment of true
and false to the individual propositions in a sentence. We call each assignment a
model: it represents a particular possible state of the world. For two propositions P
and Q there are four models.
For propositional sentences in general, a model is also just a particular assignment
of truth values to its individual propositions. A sentence with n propositions will
have 2^n possible models, and so 2^n rows in its truth table. A sentence S will be true
or false for a given model M; when S is true, we say 'M is a model of S'.
Sentences which are always true, regardless of the truth of the individual
propositions, are called tautologies (or valid sentences). Tautologies are true for all
models. For instance, if I said that "Tony Blair is prime minister or Tony Blair is not
prime minister", this is largely a content-free sentence, because we could have
replaced the predicate of being Tony Blair with any predicate and the sentence
would still have been correct.
Tautologies are not always as easy to notice as the one above, and we can use truth
tables to be certain that a statement we have written is true, regardless of the truth of
the individual propositions it contains. To do this, the columns of our truth table
will be headed with ever larger sections of the sentence, until the final column
contains the entire sentence. As before, the rows of the truth table will represent all
the possible models for the sentence, i.e. each possible assignment of truth values to
the individual propositions in the sentence. We will use these initial truth values to
assign truth values to the subsentences in the truth table, then use these new truth
values to assign truth values to larger subsentences and so on. If the final column
(the entire sentence) is always assigned true, then this means that, whatever the
truth values of the propositions being discussed, the entire sentence will turn out to
be true.
For example, the following is a tautology:

S: (X → (Y ∧ Z)) ↔ ((X → Y) ∧ (X → Z))

In English, sentence S says that X implies Y and Z if and only if X implies Y and X
implies Z. The truth table for this sentence will look like this:

X      Y      Z      Y ∧ Z   X → Y   X → Z   X → (Y ∧ Z)   (X → Y) ∧ (X → Z)   S
true   true   true   true    true    true    true          true                true
true   true   false  false   true    false   false         false               true
true   false  true   false   false   true    false         false               true
true   false  false  false   false   false   false         false               true
false  true   true   true    true    true    true          true                true
false  true   false  false   true    true    true          true                true
false  false  true   false   true    true    true          true                true
false  false  false  false   true    true    true          true                true

We see that the seventh and eighth columns (the truth values which have
been built up from the previous columns) have exactly the same truth values in
each row. Because our sentence is made up of the two sub-sentences in these
columns, this means that our overall equivalence must be correct. The truth of this
statement demonstrates that the connectives → and ∧ are related by a property
called distributivity, which we come back to later on.
Truth tables give us our first (albeit simple) method for proving a theorem: check
whether it can be written in propositional logic and, if so, whether it is a tautology; if it
is, then it must be true. So, for instance, if we were asked to prove this theorem from number
theory:

∀ n, m ((sigma(n) = n → tau(n) = m) → (tau(n) =\= m → sigma(n) =\= n))

then we could prove it straight away, because we know that this is a tautology:
(X → Y) → (¬Y → ¬X)

As we know this is a tautology, and that our number theory theorem fits into the
tautology (let X represent the proposition sigma(n)=n, and so on), we know that the
theorem must be true, regardless of what tau and sigma mean. (As an exercise,
show that this is indeed a tautology, using a truth table).
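As a small illustration of how mechanical this check is, the following Prolog program (a minimal sketch, not part of the notes, with all predicate names assumed) enumerates every model of the tautology S above and confirms that no model makes it false:

tv(true).
tv(false).

and(true, true, true).
and(true, false, false).
and(false, true, false).
and(false, false, false).

implies(true, true, true).
implies(true, false, false).
implies(false, _, true).

iff(P, P, true).
iff(true, false, false).
iff(false, true, false).

% sentence_value(X, Y, Z, V): V is the truth value of
% (X -> (Y and Z)) <-> ((X -> Y) and (X -> Z)) in the model X, Y, Z.
sentence_value(X, Y, Z, V) :-
    and(Y, Z, YZ), implies(X, YZ, Left),
    implies(X, Y, XY), implies(X, Z, XZ), and(XY, XZ, Right),
    iff(Left, Right, V).

% tautology succeeds if no assignment of truth values makes S false.
tautology :-
    \+ (tv(X), tv(Y), tv(Z), sentence_value(X, Y, Z, false)).

The query ?- tautology. succeeds, which is exactly the 'final column is always true' check we performed with the truth table.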

7.2 Equivalences & Rewrite Rules


As well as allowing us to prove trivial theorems, tautologies enable us to establish
that certain sentences are saying the same thing. In particular, if we can show that A
↔ B is a tautology, then we know A and B are true for exactly the same models, i.e.
they will have identical columns in a truth table. We say that A and B are logically
equivalent, written as the equivalence A ≡ B.
(Clearly ↔ and ≡ mean the same thing here, so why use two different symbols? It's a
technical difference: A ↔ B is a sentence of propositional logic, whereas A ≡ B is a
claim we make outside the logic.)
In natural language, we could replace the phrase "There's only one Tony Blair" by
"Tony Blair is unique", in sentences, because basically the phrases mean the same
thing. We can do exactly the same in logical languages, with an advantage: because
we are being more formal, we will have mathematically proved that two sentences
are equivalent. This means that there is absolutely no situation in which one
sentence would be interpreted in a different way to another, which is certainly
possible with natural language sentences about Tony Blair.
Equivalences allow us to change one sentence into another without affecting the
meaning, because we know that replacing one side of an equivalence with the other
will have no effect whatsoever on the semantics: it will still be true for the same
models. Suppose we have a sentence S with a sub-expression A, which we write as
S[A]. If we know A ≡ B then we can be sure the semantics of S is unaffected if we
replace A with B, i.e. S[A] ≡ S[B].
Moreover, we can also use A ≡ B to replace any sub-expression of S which is an
instance of A. An instance of a propositional expression A is a 'copy' of A where
some of the propositions of A have been consistently replaced by new sub-expressions,
e.g. every P has been replaced by Q. We call this replacement a
substitution, a mapping from propositions to expressions. Applying a substitution
U to a sentence S, we get a new sentence S.U which is an instance of S. It is easy to
show that if A ≡ B then A.U ≡ B.U for any substitution U, i.e. an instance of an
equivalence is also an equivalence. Hence an equivalence A ≡ B allows us to change
a sentence S[A'] to a logically equivalent one S[B'] if we have a substitution U such
that A' = A.U and B' = B.U.
The power to replace sub-expressions allows us to prove theorems with
equivalences: in the above example, given a theorem S[A'] ↔ S[B'], we can use the
equivalence A ≡ B to rewrite the theorem to the equivalent S[A'] ↔ S[A'], which
we know to be true. Given a set of equivalences, we can prove (or disprove) a
complex theorem by rewriting it to something logically equivalent that we already
know to be true (or false).
The fact that we can rewrite instances of A to instances of B is expressed in the
rewrite rule A => B. Of course, we can also rewrite Bs to As, so we could use the
rewrite rule B => A instead. However, it's easy to see that having an agent use both
rules is dangerous, as it could get stuck in a loop A => B => A => B => ... and so
on. Hence, we typically use just one of the rewrite rules for a particular equivalence
(we 'orient' the rule in a single direction). If we do use both then we need to make
sure we don't get stuck in a loop.
Apart from proving theorems directly, the other main use for rewrite rules is to
prepare a statement for use before we search for the proof, as described in the next
lecture. This is because some automated deduction techniques require a statement to
be in a particular format, and in these cases, we can use a set of rewrite rules to
convert the sentence we want to prove into a logically equivalent one which is in
the correct format.
Below are some common equivalences which automated theorem provers can use as
rewrite rules. Remember that the rules can be read both ways, but that in practice
either i) only one direction is used or ii) a loop-check is employed. Note also that
these are true of sentences in propositional logic, so they can also be used for
rewriting sentences in first-order logic, which is just an extension of propositional
logic.

Commutativity of Connectives

You will be aware of the fact that some arithmetic operators have a property that it
doesn't matter which way around you give the operator input. We call this property
commutativity. For example, when adding two numbers, it doesn't matter which
one comes first, because a+b = b+a for all a and b. The same is true for
multiplication, but not true for subtraction and division.
The ∧, ∨ and ↔ connectives (which operate on two sub-sentences) also have the
commutativity property. We can express this with three tautologies:

P ∧ Q ≡ Q ∧ P
P ∨ Q ≡ Q ∨ P
P ↔ Q ≡ Q ↔ P

So, if it helps to do so, whenever we see P ∧ Q, we can rewrite it as Q ∧ P, and
similarly for the other two commutative connectives.

Associativity of Connectives

Brackets are useful in order to tell us when to perform calculations in arithmetic and
when to evaluate the truth of sentences in logic. Suppose we want to add 10, 5 and
7. We could do this: (10 + 5) + 7 = 22. Alternatively, we could do this: 10 + (5 + 7)
= 22. In this case, we can alter the bracketing and the answer still comes out the
same. We say that addition is associative because it has this property with respect
to bracketing.
The ∧ and ∨ connectives are associative. This makes sense, because the order in
which we check truth values doesn't matter when we are working with sentences
only involving ∧ or only involving ∨. For instance, suppose we wanted to know
the truth of P ∧ (Q ∧ R). To do this, we just need to check that every proposition is
true, in which case the whole sentence will be true, otherwise the whole sentence
will be false. So, it doesn't matter how the brackets are arranged, and hence ∧ is
associative.
Similarly, suppose we wanted to work out the truth of:

(P ∨ Q) ∨ (R ∨ (X ∨ Z))

Then all we need to do is check whether one of these propositions is true, and the
bracketing is immaterial. As equivalences, then, the two associativity results are:

(P ∧ Q) ∧ R ≡ P ∧ (Q ∧ R)
(P ∨ Q) ∨ R ≡ P ∨ (Q ∨ R)

Distributivity of Connectives

Our last analogy with arithmetic will involve a well-used technique for playing
around with algebraic properties. Suppose we wanted to work out: 10 * (3 + 5). We
could do it like this: 10 * (3 + 5) = 10 * 8 = 80. Or we could do it like this: (10 * 3)
+ (10 * 5) = 30 + 50 = 80. In general, we know that, for any numbers, a, b and c: a *
(b + c) = (a * b) + (a * c). In this case, we say that multiplication is distributive
over addition.
You guessed it, we can distribute some of the connectives too. In particular, ∧ is
distributive over ∨ and vice versa: ∨ is also distributive over ∧. We can present
these as equivalences as follows:

P ∧ (Q ∨ R) ≡ (P ∧ Q) ∨ (P ∧ R)
P ∨ (Q ∧ R) ≡ (P ∨ Q) ∧ (P ∨ R)

Also, we saw earlier that → is distributive over ∧, and the same is true for
distribution over ∨. Therefore:

P → (Q ∧ R) ≡ (P → Q) ∧ (P → R)
P → (Q ∨ R) ≡ (P → Q) ∨ (P → R)

Double Negation

Parents are always correcting their children for the use of double negatives, but we
have to be very careful with them in natural language: "He didn't tell me not to do
it" doesn't necessarily mean the same as "He did tell me to do it". The same is true
with logical sentences: we cannot, for example, change ¬(P ∧ Q) to (¬P ∧ Q)
without risking the meaning of the sentence changing. However, there are certain
cases when we can alter expressions with negation. Two possibilities are given by
de Morgan's laws below, and we can also simplify statements by removing double
negation. These are cases when a proposition has two negation signs in front of it,
like this: ¬¬P.
You may be wondering why on earth anyone would ever write down a sentence
with such a double negation in the first place. Of course, you're right. As humans,
we wouldn't write a sentence in logic like that. However, remember that our agent
will be doing search using rewrite rules. It may be that, as part of the search, it
introduces a double negation by following a particular rewrite rule to the letter. In
this case, the agent would probably tidy it up by using this equivalence:

¬¬P ≡ P
De Morgan's Laws

Continuing with the relationship between ∧ and ∨, we can also use De Morgan's
laws to rearrange sentences involving negation in conjunction with these
connectives. In fact, there are two equivalences which, taken as a pair, are called De
Morgan's laws:

¬(P ∧ Q) ≡ ¬P ∨ ¬Q
¬(P ∨ Q) ≡ ¬P ∧ ¬Q
These are important rules and it is worth spending some time thinking about why
they are true.

Contraposition

The contraposition equivalence is as follows:

P → Q ≡ ¬Q → ¬P

This may seem a little strange at first, because it appears that we have said nothing
in the first sentence about ¬Q, so how can we infer anything from it in the second
sentence? However, suppose we know that P implies Q, and we saw that Q was
false. In this case, if we were to infer that P was true, then, because we know that P
implies Q, we would also have to conclude that Q is true. But Q was false! Hence we cannot possibly
infer that P is true, which means that we must infer that ¬P is true (because we are
in propositional logic, so P must be either true or false). This argument shows that
we can replace the first sentence by the second one, and it is left as an exercise to
construct a similar argument for the vice-versa part of this equivalence.

Other Equivalences

The following miscellaneous equivalence rules are often useful during rewriting
sessions. The first two allow us to completely get rid of implication and equivalence
connectives from our sentences if we want to:

Replace implication: P → Q ≡ ¬P ∨ Q   (this one is very useful)
Replace equivalence: P ↔ Q ≡ (P → Q) ∧ (Q → P)

The next two allow truth values to be determined regardless of the truth of the
propositions.

Consistency: P ∧ ¬P ≡ False
Excluded middle: P ∨ ¬P ≡ True

Here the "False" symbol stands for the proposition which is always false: no matter
what truth values you give to other propositions in the sentence, this one will
always be false. Similarly, the "True" symbol stands for the proposition which is
always true. In first-order logic we can treat them as special predicates with the
same properties.

An Example using Rewrite Rules


Equivalence rules can be used to show that a complicated looking sentence is
actually just a simple one in disguise. For this example, we shall show that this
sentence:
(A ↔ B) ∧ (¬A ∧ B)

conveys a meaning which is actually much simpler than you would think on first
inspection.
We can simplify this, using the following chain of rewrite steps based on the
equivalences we've stated above:

1. Using the double negation rewrite: P => ¬¬P
   (A ↔ B) ∧ (¬A ∧ ¬¬B)
2. Using De Morgan's law: ¬P ∧ ¬Q => ¬(P ∨ Q)
   (A ↔ B) ∧ ¬(A ∨ ¬B)
3. Using the commutativity of ∨: P ∨ Q => Q ∨ P
   (A ↔ B) ∧ ¬(¬B ∨ A)
4. Using 'replace implication' from right to left: ¬P ∨ Q => P → Q
   (A ↔ B) ∧ ¬(B → A)
5. Using 'replace equivalence' from left to right: P ↔ Q => (P → Q) ∧ (Q → P)
   ((A → B) ∧ (B → A)) ∧ ¬(B → A)
6. Using the associativity of ∧: (P ∧ Q) ∧ R => P ∧ (Q ∧ R)
   (A → B) ∧ ((B → A) ∧ ¬(B → A))
7. Using the consistency equivalence above: P ∧ ¬P => False
   (A → B) ∧ False
8. Using the definition of ∧ (anything conjoined with False is False): P ∧ False => False
   False

So, what does this mean? It means that our original sentence was always false: there
are no models which would make this sentence true. Another way to think about
this is that the original sentence was inconsistent with the rules of propositional
logic. In general, proving theorems by showing that their negation rewrites to
False is an example of proof by contradiction, which we discuss below.
Note that the first step of this simplification routine was to insert a double negation!
Also, at some stages, the rewritten sentence looked more complicated than the
original, so we seemed to be making matters worse, which is quite common. Is
there any other way to simplify the original statement? Of course, you'll still end up
with the answer false, but there might be a quicker way to get there. You may get
the feeling you are solving a search problem, which, of course, is exactly what
you're doing. If you think about this sentence, it may become obvious why it is

false: for (¬A ∧ B) to be true, A must be false and B must be true. But then what
about the conjoined equivalence (A ↔ B)?

7.4 Propositional Inference Rules

Equivalence rules are particularly useful
because of the vice-versa aspect, which means that we can search backwards and
forwards in a search space using them. Hence, we can perform bi-directional search,
which is a bonus. However, what if we know that one sentence (or set of sentences)
being true implies that another set of sentences is true? For instance, the following
sentence is used ad nauseam in logic text books:
All men are mortal
Socrates was a man
Hence, Socrates is mortal
This is an example of the application of a rule of deduction known as Modus
Ponens. We see that we have deduced the fact that Socrates is mortal from the two
true facts that all men are mortal and Socrates was a man. So, because we know that
the rule about men being mortal and the classification of Socrates as a man are true,
we can infer with certainty (because we know that modus ponens is sound), that
Socrates is going to die - which, of course, he did. Of course, it doesn't make sense
to go backwards as with equivalences: we would deduce that, Socrates being mortal
implies that he was a man and that all men are mortal!
The general format for the modus ponens rule is as follows: if we have a true
sentence which states that proposition A implies proposition B and we know that
proposition A is true, then we can infer that proposition B is true. The notation we
use for this is as follows:
A → B, A
--------
B

This is an example of an inference rule. The comma above the line indicates we
know both these things in our knowledge base, and the line stands for the deductive
step. That is, if we know that both the propositions above the line are true, then we
can deduce that the proposition below the line is also true. In general, an inference
rule

A
-
B

is sound if we can be sure that A entails B, i.e. B is true when A is true. More
formally, A entails B means that if M is a model of A then M is also a model of B.
We write this as A ⊨ B.

This gives us a way to check the soundness of propositional inference rules: (i)
draw up a logic table for both A and B evaluating them for all models and (ii) check
that whenever A is true, then B is also true. We don't care here about the models for
which A is false.
For instance, the truth table for the modus ponens rule is really the same as the one
for the implication connective. It looks like this:
A      B      A → B
True   True   True
True   False  False
False  True   True
False  False  True

This is a trivial example, but it highlights how we use truth tables: the first line is
the only one where both above-line propositions (A and A → B) are true. We see
that on this line, the proposition B is also true. This shows us that we have an
entailment: the above-line propositions entail the below-line one.
To see why such inference rules are useful, remember what the main application of
automated deduction is: to prove theorems. Theorems are normally part of a larger
theory, and that theory has axioms. Axioms are special theorems which are taken to
be true without question. Hence whenever we have a theorem statement we want to
prove, we should be able to start from the axioms and deduce the theorem statement
using sound inference rules such as modus ponens.
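In fact, Prolog's own search gives us this behaviour directly: each clause plays the role of an implication axiom, each fact an atomic axiom, and a query is a theorem statement to be established. A minimal sketch (the predicate names are assumed for illustration):

mortal(X) :- man(X).   % axiom: all men are mortal
man(socrates).         % axiom: Socrates is a man

% ?- mortal(socrates).
% true, by (generalised) modus ponens on the two axioms.

We will see the same Socrates example again, posed to a resolution theorem prover, later in this chapter.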
Below are some more propositional inference rules:

And-Elimination

In English, this says that "if you know that lots of things are all true, then you know
that any one of them is also true". It means that you can simplify a conjunction by
just taking one of the conjuncts (in effect, eliminating the s).

A1

A2

...

An

Ai
Note that 1 i n.

And-Introduction

In English, this says that "if we know that a lot of things are true, then we know that
the conjunction of all of them is true", so we can introduce conjunction ('and')
symbols.
A1, A2, ..., An
------------------
A1 ∧ A2 ∧ ... ∧ An

This may not seem to be saying much. However, imagine that we are working with
a lot of different sentences at different places in our knowledge base, and we know
some of them are true. Then we can make a larger sentence out of them by
conjoining the smaller ones.

Or-Introduction

If we know that one thing is true, then we know that a sentence where that thing is
in a disjunction is true. For example, we know that "Tony Blair is prime minister" is
true. From this, we can infer any disjunction as long as we include this true sentence
as a disjunct. So, we can infer that "Tony Blair is prime minister or the moon is
made of blue cheese", which makes perfect sense.
Ai
------------------
A1 ∨ A2 ∨ ... ∨ An

Again, 1 ≤ i ≤ n.

Unit Resolution

Suppose that we knew the sentence "Tony Blair is prime minister or the moon is
made of blue cheese", is true, and we later found out that the moon isn't in fact
made of cheese. Then, because the first (disjoined) sentence is true, we can infer
that Tony Blair is indeed prime minister. This typifies the essence of the unit
resolution rule:

A ∨ B,   ¬B
-----------
A
The generalised version of this inference rule is the subject of a whole area of
Artificial Intelligence research known as resolution theorem proving, which we
cover in detail in the next lecture.

7.5 First-Order Models

We proposed first-order logic as a good knowledge representation language rather


than propositional logic because it is more expressive, so we can write more of our
sentences in logic. So the sentences to which we are going to want to apply rewrites and
inference rules will include quantification. All of the rewrite rules we've seen so far
can be used in propositional logic (and hence first-order logic). We now consider
rules which rely on information about the quantifiers, so are not available to an
agent working with a propositional logic representation scheme.
Before we look at first-order inference rules we need to pause to consider what it
means for such an inference rule to be sound. Earlier we defined this as meaning the
top entails the bottom: that any model of the former was a model of the latter. But
first-order logic introduces new syntactic elements (constants, functions, variables,
predicates and quantifiers) alongside the propositional connectives. This means we
need to completely revise our definition of model, a notion of a 'possible world'
which defines whether a sentence is true or false in that world.
A propositional model was just an assignment of truth values to propositions. In
contrast, a first-order model is a pair (Δ, Θ) where

Δ is a domain, a non-empty set of 'objects', i.e. things which our first-order
sentences are referring to.
Θ is an interpretation, a procedure for calculating the truth of sentences relative
to Δ.

This seems very different from propositional logic. Fortunately, everything we have
discussed so far about deduction carries over into first-order logic when we use this new
definition of model.

Terms

First-order logic allows us to talk about properties of objects, so the first job for
our model (Δ, Θ) is to assign a meaning to the terms which represent objects. A
ground term is any combination of constant and function symbols, and Θ maps
each individual ground term to a specific object in Δ. This means that a ground term
refers to a single specific object. The meaning of subterms is always independent of
the term they appear in.
The particular way that terms are mapped to objects depends on the model.
Different models can define terms as referring to different things. Note that although
father(john) and jack are separate terms, they might both be mapped to the same
object (say Jack) in Δ. That is, the two terms are syntactically different but (in this
model) they are semantically the same, i.e. they both refer to the same thing!
Terms can also contain variables (e.g. father(X)); these are non-ground terms.
They don't refer to any specific object, and so our model can't assign any single
meaning to them directly. We'll come back to what variables mean.
Predicates

Predicates take a number of arguments (which for now we assume are ground
terms) and represent a relationship between those arguments which can be true or
false. The semantics of an n-ary predicate p(t1,...,tn) are defined by a model (Δ, Θ)
as follows: we first calculate the n objects that the arguments refer to, Θ(t1), ...,
Θ(tn). Θ maps p to a function P: Δⁿ → {true, false} which defines whether p is
true for those n elements of Δ. Different models can assign different functions P, i.e.
they can provide different meanings for each predicate.
Combining predicates, ground terms and propositional connectives gives us ground
formulae, which don't contain any variables. They are definite statements about
specific objects.
Quantifiers and Variables

So what do sentences containing variables mean? In other words, how does a
first-order model decide whether such a sentence is true or false? The first step is to
ensure that the sentence does not contain any free variables, variables which are
not bound by (associated with) a quantifier. Strictly speaking, a first-order
expression is not a sentence unless all the variables are bound. However, we usually
assume that if a variable is not explicitly bound then really it is implicitly
universally quantified.

Next we look for the outermost quantifier in our sentence. If this is ∀X then we
consider the truth of the sentence for every value X could take. When the outermost
quantifier is ∃X we need to find just a single possible value of X. To make this
more formal we can use the concept of substitution. Here {X\t} is a substitution
which replaces all occurrences of variable X with a term representing an object t:

∀X. A is true if and only if A.{X\t} is true for all t in Δ

∃X. A is true if and only if A.{X\t} is true for at least one t in Δ

Repeating this for all the quantifiers, we get a set of ground formulae which we have
to check to see if the original sentence is true or false. Unfortunately, we haven't
specified that our domain is finite (for example, it may contain the natural
numbers) so there may be an infinite number of sentences to check for a given
model! There may also be an infinite number of models. So although we have a
proper definition of model, and hence a proper semantics for first-order logic, we
can't rely on having a finite number of models as we did when drawing
propositional truth tables.
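To make this concrete in the finite case, here is a minimal Prolog sketch (the domain and the likes facts are assumed, not taken from the lecture): it checks a universally and an existentially quantified sentence over a two-object domain by trying every substitution, exactly as described above.

object(george).
object(tony).

likes(george, tony).
likes(tony, george).
likes(george, george).
likes(tony, tony).

% 'for all X, Y. likes(X, Y)': no substitution may make likes(X, Y) fail.
everyone_likes_everyone :-
    \+ (object(X), object(Y), \+ likes(X, Y)).

% 'exists X, Y. likes(X, Y)': some substitution must make likes(X, Y) succeed.
someone_likes_someone :-
    object(X), object(Y), likes(X, Y).

With an infinite domain no such enumeration is possible, which is one reason we need the inference rules of the next section rather than exhaustive checking.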

7.6 First-Order Inference Rules


Now that we have a clear definition of what a first-order model is, we can define soundness
for first-order inference rules in the same way as we did for propositional inference
rules: the rule is sound if given a model of the sentences above the line, this is
always a model of the sentence below.
To be able to specify these new rules, we must use the notion of substitution. We've
already seen substitutions which replace propositions with propositional
expressions (7.2 above) and other substitutions which replace variables with terms
that represent a given object (7.5 above). In this section we use substitutions which
replace variables with ground terms (terms without variables) so to be clear we
will call these ground substitutions. Another name for a ground substitution is an
instantiation.
For example, if we start with the wonderfully optimistic sentence that everyone
likes everyone else: ∀X, Y (likes(X, Y)), then we can choose particular values for
X and Y. So, we can instantiate this sentence to say: likes(george, tony). Because
we have chosen a particular value, the quantification no longer makes sense, so we
must drop it.

The act of performing an instantiation is a function, as there is only one possible
outcome, so we can write it using function notation. The notation

Subst({X/george, Y/tony}, likes(X,Y)) = likes(george, tony)

indicates that we have made a ground substitution.
We also have to recognise that we are working with sentences which form part of a
knowledge base of many such sentences. More to the point, there will be constants
which appear throughout the knowledge base, and some which are local to a
particular sentence.

Universal Elimination

For any sentence, A, containing a universally quantified variable, v, then for any
ground term, g, we can substitute g for v in A. We write the following to represent
this rule:
∀v A
------------------
Subst({v/g}, A)

As an example (from Russell and Norvig), this rule can be used on the following
sentence: ∀X, likes(X, ice_cream), to substitute the ground term ben for the variable X, giving us
the sentence likes(ben, ice_cream). In English, this says that, given that everyone
likes ice cream, we can infer that Ben likes ice cream. This is not exactly rocket
science, and it is worth bearing in mind that, beneath all the fancy symbols in logic,
we're really only saying simple things.

Existential Elimination

For a sentence, A, with an existentially quantified variable, v, then, for every
constant symbol k that does not appear anywhere else in the knowledge base, we
can substitute k for v in A:

∃v A
------------------
Subst({v/k}, A)

For an example, if we know that ∃X (likes(X, ice_cream)), then we can choose a
particular name for X. We could choose ben for this, giving us: likes(ben,
ice_cream), but only if the constant ben does not appear anywhere else in our
knowledge base.
So, why the condition about the existential variable being unique to the new
sentence? Basically, what you are doing here is giving a particular name to a
variable you know must exist. It would be unwise to give this a name which already
exists. For example, suppose we have the predicates brother(john,X), sister(john,
susan) then, when instantiating X, it would be unwise to choose the term susan for
the constant to ground X with, because this would probably be a false inference. Of
course, it's not impossible that John would have a sister named Susan and also a
brother named Susan, but it is not likely. However, if we choose a totally new
constant, then there can be no problems and the inference is guaranteed to be
correct.

Existential Introduction

For any sentence, A, and variable, v, which does not occur in A, then for any
ground term, g, that occurs in A, we can turn A into an existentially quantified
sentence by substituting v for g:
A
------------------
∃v Subst({g/v}, A)

So, for example, if we know that likes(jerry, ice_cream), then we can infer that ∃X
(likes(X, ice_cream)), because the variable X does not occur anywhere in the
original sentence. The condition that v does not already occur in A is there for similar
reasons as those given for the previous rule. As an exercise, find a situation where
ignoring this condition would mean that the inferred sentence did not follow
logically from the premise sentence.

7.7 Chains of Inference


We look now at how to get an agent to prove a given theorem using various search
strategies. We have noted in previous lectures that, to specify a search problem, we
need to describe the representation language for the artefacts being searched for, the
initial state, the goal state (or some information about what a goal should look like),
and the operators: how to go from one state to another.

We can state the problem of proving a given theorem from some axioms as a search
problem. Three different specifications give rise to three different ways to solve the
problem, namely forward and backward chaining and proof by contradiction. In all
of these specifications, the representation language is predicate logic (not
surprisingly), and the operators are the rules of inference, which allow us to rewrite
a set of sentences as another set. We can think of each state in our search space as a
sentence in first order logic. The operators will traverse this space, finding new
sentences. However, we are really only interested in finding a path from the start
states to the goal state, as this path will constitute a proof. (Note that there are other
ways to prove theorems such as exhausting the search for a counterexample and
finding none - in this case we don't have a deductive proof for the truth of the
theorem, but we know it is true).
Only the initial state of the space and the details of the goal differ in the three
following approaches.

Forward Chaining

Suppose we have a set of axioms which we know are true statements about the
world. If we set these to each be an initial state of the search space, and we set the
goal state to be our theorem statement, then this is a simple approach which can be
used to prove theorems. We call this approach forward chaining, because the agent
employing the search constructs chains of reasoning, from the axioms, hopefully to
the goal. Once a path has been found from the axioms to the theorem, this path
constitutes a proof and the problem has been solved.
However, the problem with forward chaining in general is that it cannot easily use
the goal (theorem statement) to drive the search. Hence it really must just explore
the search space until it comes across the solution. Goal-directed searches are often
more effective than non-goal directed ones like forward chaining.

Backward Chaining

Given that we are only interested in constructing the path, we can set our initial
state to be the theorem statement and search backwards until we find an axiom (or
set of axioms). If we restrict ourselves to just using equivalences as rewrite rules,
then this approach is OK, because we can use equivalences both ways, and any path
from the theorem to axioms which is found will provide a proof. However, if we
use inference rules to traverse from theorem to axioms, then we will have proved
that, if the theorem is true, then the axioms are true. But we already know that the
axioms are true! To get around this, we must invert our inference rules and try to
work backwards. That is, the operators in the search basically answer the question:
what could be true in order to infer the state (logical sentence) we are at right now?

If our agent starts searching from the theorem statement and reaches the axioms, it
has proved the theorem. This is also problematic, because there are numerous
answers to the inversion question, and the search space gets very large.

Proof by Contradiction

So, forward chaining and backward chaining both have drawbacks. Another
approach is to think about proving theorems by contradiction. These are very
common in mathematics: mathematicians specify some axioms, then make an
assumption. After some complicated mathematics, they have shown that an axiom
is false (or something derived from the axioms which did not involve the
assumption is false). As the axioms are irrefutably correct, this means that the
assumption they made must be false. That is, the assumption is inconsistent with the
axioms of the theory. To use this for a particular theorem which they want to prove
is true, they negate the theorem statement and use this as the assumption they are
going to show is false. As the negated theorem must be false, their original theorem
must be true. Bingo!
We can program our reasoning agents to do just the same. To specify this as a
search problem, therefore, we have to say that the axioms of our theory and the
negation of the theorem we want to prove are the initial search states. Remembering
our example in section 7.2, to do this, we need to derive the False statement to show
inconsistency, so the False statement becomes our goal. Hence, if we can deduce
the false statement from our axioms, the theorem we were trying to prove will
indeed have been proven. This means that, not only can we use all our rules of
inference, we also have a goal to aim for.
As an example, below is the input to the Otter theorem prover for the trivial
theorem about Socrates being mortal. Otter searches for contradictions using
resolution, hence we note that the theorem statement - that Socrates is mortal - is
negated using the minus sign. We discuss Otter and resolution theorem proving in
the next two lectures.
Input:
set(auto).
formula_list(usable).
all x (man(x)->mortal(x)).  % For all x, if x is a man then x is mortal
man(socrates).              % Socrates is a man
-mortal(socrates).          % Socrates is immortal (note: negated)
end_of_list.

Otter has no problem whatsoever proving this theorem, and here is the output:

Output:
---------------- PROOF ----------------
1 [] -man(x)|mortal(x).
2 [] -mortal(socrates).
3 [] man(socrates).
4 [hyper,3,1] mortal(socrates).
5 [binary,4.1,2.1] $F.

------------ end of proof -------------

Chapter-8
The Resolution Method
A minor miracle occurred in 1965 when Alan Robinson published his resolution
method. This method uses a generalised version of the resolution rule of inference
we saw in the previous lecture. It has been mathematically proven to be refutation-complete over first order logic. This means that if you write any set of sentences in
first order logic which are unsatisfiable (i.e., taken together they are false, in that
they have no models), then the resolution method will eventually derive the False
symbol, indicating that the sentences somehow contradict each other.
In particular, if the set of first order sentences comprises a set of axioms and the
negation of a theorem you want to prove, the resolution method can be used in a
proof-by-contradiction approach. This means that, if your first order theorem is true,
then proof by contradiction using the resolution method is guaranteed to find the
proof eventually. The qualifications in this statement (true, first order, eventually)
identify some drawbacks to resolution theorem proving:

It only works for true theorems which can be expressed in first order logic:
it cannot check at the same time whether a conjecture is true or false, and it
can't work in higher order logics. (There are related techniques which
address these problems, to varying degrees of success.)
While it is proven that the method will find the solution, in practice the
search space is often too large to find one in a reasonable amount of time,
even for fairly simple theorems.

Notwithstanding these drawbacks, resolution theorem proving is a complete


method: if your theorem does follow from the axioms of a domain, then resolution
can prove it. Moreover, it only uses one rule of deduction (resolution), rather than
the multitude we saw in the last lecture. Hence, it is comparatively easy to

understand how resolution theorem provers work. For these reasons, the
development of the resolution method was a major accomplishment in logic, with
serious implications for Artificial Intelligence research.
Resolution works by taking two sentences and resolving them into one, eventually
resolving two sentences to produce the False statement. The resolution rule is more
complicated than the rules of inference we've seen before, and we need to cover
some preparatory notions before we can understand how it works. In particular, we
need to look at conjunctive normal form and unification before we can state the full
resolution rule at the heart of the resolution method.

8.1 Binary Resolution


We saw unit resolution (a propositional inference rule) in the previous lecture:

A ∨ B,   ¬B
-----------
A

We can take this a little further to propositional binary resolution:

A ∨ B,   ¬B ∨ C
---------------
A ∨ C
Binary resolution gets its name from the fact that each sentence is a disjunction of
exactly two literals. We say the two opposing literals B and ¬B are resolved:
they are removed when the disjunctions are merged.
The binary resolution rule can be seen to be sound because if both A and C were
false then at least one of the sentences on the top line would be false. As this is
an inference rule, we are assuming that the top line is true. Hence we can't have
both A and C being false, which means either A or C must be true. So we can infer
the bottom line.
So far we've only looked at propositional versions of resolution. In first-order logic
we need to also deal with variables and quantifiers. As we'll see below, we don't

need to worry about quantifiers: we are going to be working with sentences that
only contain free variables. Recall that we treat these variables as implicitly
universally quantified, and that they can take any value. This allows us to state a
more general first-order binary resolution inference rule:

A ∨ B,   ¬C ∨ D
---------------
Subst(θ, A ∨ D)

This rule has the side condition Subst(θ, B) = Subst(θ, C), which requires
there to be a substitution θ which makes B and C the same before we can apply the
rule. (Note that θ can substitute fresh variables while making B and C equal. It doesn't
have to be a ground substitution!) If we can find such a θ, then we can make the
resolution step and apply θ to the outcome. In fact, the first-order binary rule is
simply equivalent to applying the substitution to the original sentences, and then
applying the propositional binary rule.
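As a small worked instance (reusing predicates from earlier examples, so the particular clauses are just assumed for illustration), we can resolve the two clauses

likes(george, X) ∨ hates(george, X)     and     ¬hates(george, tony) ∨ is_mad(tony)

using the substitution θ = {X/tony}, since Subst(θ, hates(george, X)) = Subst(θ, hates(george, tony)). The resolvent is

Subst(θ, likes(george, X) ∨ is_mad(tony)) = likes(george, tony) ∨ is_mad(tony).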

8.2 Conjunctive Normal Form


For the resolution rule to resolve two sentences, they must both be in a normalised
format known as conjunctive normal form, which is usually abbreviated to CNF.
This is an unfortunate name because the sentences themselves are made up of sets
of disjunctions. It is implicitly assumed that the entire knowledge base is a big
conjunction of the sentences, which is where conjunctive normal form gets its
name. So, CNF is actually a conjunction of disjunctions. The disjunctions are made
up of literals, each of which is either a predicate or the negation of a predicate (for
propositional logic, read: a proposition or the negation of a proposition).
So, CNF sentences are of the form:
(p1 ∨ p2 ∨ ... ∨ pm) ∧ (q1 ∨ q2 ∨ ... ∨ qn) ∧ etc.

where each pi and qj is a literal. Note that we call the disjunction of such literals a
clause. As a concrete example,
likes(george, X) ∨ likes(tony, george)

is in conjunctive normal form, but:

is_mad(maggie) → ((likes(george, X) ∨ likes(tony, george)) ∧ (is_mad(maggie) ∨ is_mad(tony)))

is not in CNF.
The following eight-stage process converts any sentence into CNF:
1. Eliminate arrow connectives by rewriting with:

   P ↔ Q => (P → Q) ∧ (Q → P)
   P → Q => ¬P ∨ Q

2. Move ¬ inwards using De Morgan's laws (including the quantifier versions) and double
   negation:

   ¬(P ∧ Q) => ¬P ∨ ¬Q
   ¬(P ∨ Q) => ¬P ∧ ¬Q
   ¬∀X. P => ∃X. ¬P
   ¬∃X. P => ∀X. ¬P
   ¬¬P => P

3. Rename variables apart: the same variable name may be reused several times for
different variables, within one sentence or between several. To avoid confusion
later rename each distinct variable with a unique name.
4. Move quantifiers outwards: the sentence is now in a form where all the
quantifiers can be moved safely to the outside without affecting the semantics,
provided they are kept in the same order.
5. Skolemise existential variables by replacing them with Skolem constants and
   functions. This is similar to the existential elimination rule from the last lecture:
   we just substitute a term for each existential variable that represents the
   'something' for which it holds. If there are no preceding universal quantifiers the
   'something' is just a fresh constant. However, if there are, then we use a function
   that takes all these preceding universal variables as arguments. When we're
   done we just drop all the universal quantifiers. This leaves a quantifier-free
   sentence. For example:

   ∀X. ∃Y. person(X) → has(X, Y) ∧ heart(Y)

   is Skolemised as

   person(X) → has(X, f(X)) ∧ heart(f(X))

6. Distribute ∨ over ∧ to make a conjunction of disjunctions. This involves rewriting
   with:

   P ∨ (Q ∧ R) => (P ∨ Q) ∧ (P ∨ R)

   and

   (Q ∧ R) ∨ P => (Q ∨ P) ∧ (R ∨ P)

7. Flatten binary connectives: replace nested ∧s and ∨s with flat lists of conjuncts
   and disjuncts:

   P ∧ (Q ∧ R) => P ∧ Q ∧ R
   (P ∧ Q) ∧ R => P ∧ Q ∧ R
   P ∨ (Q ∨ R) => P ∨ Q ∨ R
   (P ∨ Q) ∨ R => P ∨ Q ∨ R

8. The sentence is now in CNF. Further simplification can take place by
   removing duplicate literals and dropping any clause which contains both A
   and ¬A (one of them will be true, so the clause is always true; in the conjunction of
   clauses we want everything to be true, so we can drop it). There is an
   optional final step that takes it to Kowalski normal form, also known as
   implicative normal form (INF):

9. Reintroduce implication by gathering up all the negative literals (the negated
   ones) and forming their conjunction N, then taking the disjunction P of the
   positive literals, and forming the logically equivalent clause N → P.

Example: Converting to CNF


We will work through a simple propositional example:
(B ∨ (A ∧ C)) → (B ∨ ¬A)

The first thing to do is remove the implication sign:

¬(B ∨ (A ∧ C)) ∨ (B ∨ ¬A)

Next we use De Morgan's laws to move our negation sign from the outside to the
inside of the brackets:

(¬B ∧ ¬(A ∧ C)) ∨ (B ∨ ¬A)

And we can use De Morgan's law again to move a negation sign inwards:

(¬B ∧ (¬A ∨ ¬C)) ∨ (B ∨ ¬A)

Next we distribute ∨ over ∧ as follows:

(¬B ∨ (B ∨ ¬A)) ∧ ((¬A ∨ ¬C) ∨ (B ∨ ¬A))

If we flatten our disjunctions, we get our sentence into CNF form. Note the
conjunction of disjunctions:

(¬B ∨ B ∨ ¬A) ∧ (¬A ∨ ¬C ∨ B ∨ ¬A)

Finally, the first conjunct contains both B and ¬B, so it must be true.
Also, we can remove the duplicate ¬A in the second conjunct:

True ∧ (¬A ∨ ¬C ∨ B)

The truth of this sentence is only dependent on the second conjunct. If it is false, the
whole thing is false, if it is true, the whole thing is true. Hence, we can remove the
True, giving us a single clause in its final conjunctive normal form:

¬A ∨ ¬C ∨ B

If we want Kowalski normal form we take one more step to get:

(A ∧ C) → B
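Since the conversion is entirely mechanical, it can itself be programmed. The following is a minimal Prolog sketch (not from the notes) for the propositional part of the process, handling implication elimination, negation movement and distribution; the functors neg/1, and/2, or/2 and imp/2 are assumed names for the connectives, and the quantifier-handling stages are not covered.

% Step 1: eliminate implications, rewriting imp(P,Q) as or(neg(P),Q).
elim_imp(imp(P, Q), or(neg(P1), Q1)) :- !, elim_imp(P, P1), elim_imp(Q, Q1).
elim_imp(and(P, Q), and(P1, Q1))     :- !, elim_imp(P, P1), elim_imp(Q, Q1).
elim_imp(or(P, Q), or(P1, Q1))       :- !, elim_imp(P, P1), elim_imp(Q, Q1).
elim_imp(neg(P), neg(P1))            :- !, elim_imp(P, P1).
elim_imp(P, P).

% Step 2: move negations inwards (De Morgan and double negation).
nnf(neg(and(P, Q)), or(P1, Q1))  :- !, nnf(neg(P), P1), nnf(neg(Q), Q1).
nnf(neg(or(P, Q)), and(P1, Q1))  :- !, nnf(neg(P), P1), nnf(neg(Q), Q1).
nnf(neg(neg(P)), P1)             :- !, nnf(P, P1).
nnf(and(P, Q), and(P1, Q1))      :- !, nnf(P, P1), nnf(Q, Q1).
nnf(or(P, Q), or(P1, Q1))        :- !, nnf(P, P1), nnf(Q, Q1).
nnf(P, P).

% Step 6: distribute 'or' over 'and' to reach a conjunction of disjunctions.
cnf(and(P, Q), and(P1, Q1)) :- !, cnf(P, P1), cnf(Q, Q1).
cnf(or(P, Q), R)            :- !, cnf(P, P1), cnf(Q, Q1), dist(or(P1, Q1), R).
cnf(P, P).

dist(or(and(P, Q), R), and(P1, Q1)) :- !, dist(or(P, R), P1), dist(or(Q, R), Q1).
dist(or(P, and(Q, R)), and(P1, Q1)) :- !, dist(or(P, Q), P1), dist(or(P, R), Q1).
dist(P, P).

to_cnf(Formula, CNF) :- elim_imp(Formula, F1), nnf(F1, F2), cnf(F2, CNF).

For the example above, the query ?- to_cnf(imp(or(b, and(a, c)), or(b, neg(a))), CNF). produces the same (unflattened) conjunction of disjunctions as the distribution step of the hand derivation; the flattening and final simplifications are left as an exercise.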

8.3 Unification
We have said that the rules of inference for propositional logic detailed in the last
lecture can also be used in first-order logic. However, we need to clarify that a little.
One important difference between propositional and first-order logic is that the
latter has predicates with terms as arguments. So, one clarification we need to make
is that we can apply the inference rules as long as the predicates and arguments

match up. So, not only do we have to check for the correct kinds of sentence before
we can carry out a rule of inference, we also have to check that the arguments do
not forbid the inference.
For instance, suppose in our knowledge base we have these two statements:

knows(john, X) → hates(john, X)
knows(john, mary)
and we want to use the Modus Ponens rule to infer something new. In this case,
there is no problem, and we can infer that, because john hates everyone he knows,
and he knows Mary, then he must hate Mary, i.e., we can infer that hates(john,
mary) is true.
However, suppose instead that we had these two sentences:

knows(john, X) → hates(john, X)
knows(jack, mary)
Here, the predicate names have not changed, but the arguments are holding us back
from making any deductive inference. In the first case above, we could allow the
variable X to be instantiated to mary during the deduction, and the constant john
before and after the deduction also matched without problem. However, in the
second case, although we could still instantiate X to mary, we could no longer
match john and jack, because they are two different constants. So we cannot deduce
anything about john (or anybody else) from the latter two statements.
The problem here comes from our inability to make the arguments in knows(john,
X) and the arguments in knows(jack, mary) match up. When we can make two
predicates match up, we say that we have unified them, and we will look at an
algorithm for unifying two predicates (if they can be unified) in this section.
Remember that unification plays a part in the way Prolog searches for matches to
queries.

A Unification Algorithm

To unify two sentences, we must find a substitution which makes the two sentences
the same. Remember that we write V/T to signify that we have substituted term T
for variable V (read the "/" sign as "is substituted by"). The purpose of this
algorithm will be to produce a substitution (a set of pairs V/T) for a given pair of
sentences. So, for example, the output for the pair of sentences:

knows(john,X)
knows(john, mary)
will be: {X/mary}. However, for the two sentences above involving jack, the
function should fail, as there was no way to unify the sentences.
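Prolog gives us this behaviour for free through its built-in unification operator =/2, so we can try the two examples directly (a sketch of a session; output format varies between systems):

?- knows(john, X) = knows(john, mary).
X = mary

?- knows(john, X) = knows(jack, mary).
false

The first query reports the substitution {X/mary}; the second fails because the constants john and jack cannot be made to match.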
To describe the algorithm, we need to specify some functions it calls internally.

The function isa_variable(x) checks whether x is a variable.


The function isa_compound(x) checks whether x is a compound expression:
either a predicate, a function or a connective which contains subparts. The
subparts of a predicate or function are the arguments. The subparts of a
connective are the things it connects. We write args(x) for the subparts of
compound expression x. Note that args(x) outputs a list: the list of subparts.
Also, we write op(x) to signify the symbol of the compound operator (predicate
name, function name or connective symbol).
The function isa_list(x) checks whether x is a list. We write head(L) for the
first term in a list L and tail(L) for the sublist comprising all the other terms
except the head. Hence the head of [2,3,5,7,11] is 2 and the tail is [3,5,7,11]. This
terminology is common in Prolog.

It's easiest to explain the unification algorithm as a recursive method which is able
to call itself. As this is happening, a set, mu, is passed around the various parts of
the algorithm, collecting substitutions as it goes. The method has two main parts:
unify_internal(x,y,mu)

which returns a substitution which makes sentence x look exactly like sentence y,
given an already existing set of substitutions mu (although mu may be empty). This
function checks various properties of x and y and calls either itself again or the
unify_variable routine, as described below. Note that the order of the if-statements is important, and if a failure is reported at any stage, the whole function
fails. If none of the cases is true for the input, then the algorithm fails to find a
unifying set of substitutions.
unify_variable(var,x,mu)

which returns a substitution given a variable var, a sentence x and an already


existing set of substitutions mu. This function also contains a set of cases which
cause other routines to run if the case is true of the input. Again, the order of the
cases is important. Here, if none of the cases is true of the input, a substitution is
returned.
The algorithm is as follows:

unify(x,y) = unify_internal(x,y,{})
unify_internal(x,y,mu) ---------------------Cases
1. if (mu=failure) then return failure
2. if (x=y) then return mu.
3. if (isa_variable(x)) then return unify_variable(x,y,mu)
4. if (isa_variable(y)) then return unify_variable(y,x,mu)
5. if (isa_compound(x) and isa_compound(y)) then return
unify_internal(args(x),args(y),unify_internal(op(x),op(y),mu))
6. if (isa_list(x) and isa_list(y)) then return
unify_internal(tail(x),tail(y),unify_internal(head(x),head(y),mu))
7. return failure
unify_variable(var,x,mu) -----------------------Cases
1. if (a substitution var/val is in mu) then return
unify_internal(val,x,mu)
2. if (a substitution x/val is in mu) then return
unify_internal(var,val,mu)
3. if (var occurs anywhere in x) then return failure
4. add var/x to mu and return the extended mu

Some things to note about this method are:


(i) trying to match a constant to a different constant fails because they are not equal,
neither is a variable and neither is a compound expression or list. Hence none of the
cases in unify_internal is true, so it must return failure.
(ii) Cases 1 and 2 in unify_variable(var,x,mu) check whether either input has
already been substituted. If x already has a substitution, then it tries to unify the
substituted value and var, rather than x and var. It does similarly if var already has
a substitution.
(iii) Case 3 in unify_variable is known as the occurs-check case (or occur-check). This is important: imagine we got to the stage where, to complete a
unification, we needed to substitute X with, say, f(X,Y). If we did this, we would
write f(X,Y) instead of X. But this still has an X in it! So, we would need to
substitute X by f(X,Y) again, giving us: f(f(X,Y),Y) and it is obvious why we
should never have tried this substitution in the first place, because this process will
never end. The occurs check makes sure this isn't going to happen before case 4
returns a substitution. The rule is: you cannot substitute a compound for a variable
if that variable appears in the compound already, because you will never get rid of
the variable.
(iv) The unify_internal(op(x),op(y),mu) part of case 5 in unify_internal
checks that the operators of the two compound expressions are the same. This
means that it will fail if, for example, it tries to unify two predicates with
different names, or a ∧ connective with a ∨ connective.
(v) The unification algorithm returns the unique most general unifier (MGU) mu
for two sentences. This means that if U is any other unifier, then for any sentence T,
the sentence T.U (T with U applied) is always an instance of T.mu. The MGU substitutes
as little as it can get away with while still being a unifier.
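For concreteness, here is a minimal Python sketch of the algorithm above. The representation is an illustrative assumption rather than part of the pseudocode: variables are strings starting with an upper-case letter, constants are lower-case strings, and compound expressions (predicates, functions and connectives) are (operator, argument-list) pairs.

def is_variable(x):
    return isinstance(x, str) and x[:1].isupper()

def is_compound(x):
    return isinstance(x, tuple)       # a compound term is an (operator, [args...]) pair

def op(x):
    return x[0]

def args(x):
    return x[1]

def occurs_in(var, x):
    # The occurs-check: does variable var appear anywhere inside x?
    if var == x:
        return True
    if is_compound(x):
        return any(occurs_in(var, a) for a in args(x))
    if isinstance(x, list):
        return any(occurs_in(var, a) for a in x)
    return False

def unify(x, y):
    return unify_internal(x, y, {})

def unify_internal(x, y, mu):
    if mu is None:                                    # case 1: an earlier step failed
        return None
    if x == y:                                        # case 2
        return mu
    if is_variable(x):                                # case 3
        return unify_variable(x, y, mu)
    if is_variable(y):                                # case 4
        return unify_variable(y, x, mu)
    if is_compound(x) and is_compound(y):             # case 5: operators, then argument lists
        return unify_internal(args(x), args(y),
                              unify_internal(op(x), op(y), mu))
    if isinstance(x, list) and isinstance(y, list) and x and y:   # case 6: heads, then tails
        return unify_internal(x[1:], y[1:],
                              unify_internal(x[0], y[0], mu))
    return None                                       # case 7: failure

def unify_variable(var, x, mu):
    if var in mu:                                     # case 1: var already substituted
        return unify_internal(mu[var], x, mu)
    if is_variable(x) and x in mu:                    # case 2: x already substituted
        return unify_internal(var, mu[x], mu)
    if occurs_in(var, x):                             # case 3: the occurs-check fails
        return None
    extended = dict(mu)                               # case 4: add var/x to mu and return it
    extended[var] = x
    return extended

# knows(john, X) unified with knows(john, mary) gives {X/mary}:
print(unify(("knows", ["john", "X"]), ("knows", ["john", "mary"])))   # {'X': 'mary'}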

Example: Unifying Two Sentences


Suppose we wanted to unify these two sentences:
1. p(X,tony) ∧ q(george, X, Z)
2. p(f(tony),tony) ∧ q(B,C,maggie)

We can see by inspection that a way to unify these sentences is to apply the
substitution:
{X/f(tony), B/george, C/f(tony), Z/maggie}.
Therefore, our unification algorithm should find this substitution.
To run our algorithm, we set the inputs to be:

x = p(X,tony) ∧ q(george, X, Z)
and
y = p(f(tony),tony) ∧ q(B,C,maggie)
and then follow the algorithm steps.
Iteration one

unify_internal is called with inputs x, y and the empty list {}. This tries case 1,

but as mu is not failure, this is not the case. Next it tries case 2, but this is also not
the case, because x is not equal to y. Cases 3 and 4 similarly fail, because neither x
nor y is a variable. Finally, case 5 kicks in because x and y are compound terms. In
fact, they are both conjunctions connected by the ∧ connective. Using our
definitions above, args(x) = [p(X,tony), q(george,X,Z)] and
args(y) = [p(f(tony),tony), q(B,C,maggie)]. Also, op(x) = ∧ and op(y) = ∧. So, case 5
means that we call unify_internal again with inputs [p(X,tony), q(george,X,Z)] and
[p(f(tony),tony), q(B,C,maggie)]. Before we do that, the third input to the function
will be unify_internal(op(x), op(y), mu). Because our op(x) and op(y) are the
same (both ∧), this will return mu [check this yourselves]. mu is still empty,
so this gets passed on.
Iteration two
So, we're back at the top of unify_internal again, but this time with a pair of lists
as input. None of the cases match until case 6. This states that we have to split our
lists into heads and tails, then unify the heads and use this to unify the tails.
Unifying the heads means that we once again call unify_internal, this time with
predicates p(X,tony) and p(f(tony),tony).
Iteration three
Now case 5 fires again, because our two inputs are both predicates. This turns the
arguments into a list, checks that the two predicate names match and calls
unify_internal yet again, this time with lists [X,tony] and [f(tony),tony] as input.
Iteration four
In this iteration, all the algorithm does is split the lists into heads and tails, and first
calls unify_internal with X and f(tony) as inputs, and later with tony and tony
as input. In the latter case we can see that unify_internal will return mu, because
the constant symbols are equal. Hence this will not affect anything.
Iteration five
When X and f(tony) are given as input, case 3 fires because X is a variable. This
causes unify_variable(X,f(tony),{}) to be called. In this case, it checks that neither
X nor f(tony) has been subject to a substitution already, which they haven't because
the substitution list is still empty. It also makes an occurs-check, and finds that X

does not appear anywhere in f(tony), so case 3 does not fire. Hence it goes all the
way to case 4, and X/f(tony) is added to the substitution list. Finally, we have a
substitution! This returns the substitution list {X/f(tony)} as output, and causes
some other embedded calls to also return this substitution list.
It is left as an exercise to show that the same way in which the algorithm unified
p(X,tony) and p(f(tony),tony) with the substitution {X/f(tony)}, it also unifies
q(george,X,Z) and q(B,C,maggie), adding B/george, C/f(tony) and Z/maggie to
the substitution list. However, in this case, we had already made the substitution
X/f(tony). Hence, when unify_variable was finally called, it fired case 1 (check this
yourself against the cases above) to make sure that the already substituted variable
was not given another substitution.
At this stage, all the return statements start to actually return things, and the
substitution gets passed back all the way to the top. Finally, the substitution
{X/f(tony), B/george, C/f(tony), Z/maggie}.
is indeed produced by the unification algorithm. When applied to both sentences,
the result is the same sentence:
p(f(tony),tony) ∧ q(george,f(tony),maggie)

The complexity of this relatively simple example shows why it is a good idea to get
a software agent to do this, rather than doing it ourselves. Of course, if you wanted
to try out the unification algorithm, you can simply run Prolog and type in your
sentences separated by a single = sign. This asks Prolog to try to unify the two
terms. This is what happens in SICStus Prolog:
?- [p(X,tony),q(george,X,Z)]=[p(f(tony),tony),q(B,C,maggie)].
B = george, C = f(tony), X = f(tony), Z = maggie ?
yes.

We see that Prolog has come up with the same unifying substitution as before.

8.4 The Full Resolution Rule


Now that we know about unification, we can properly describe the full version of
resolution:

p1 ∨ ... ∨ pj ∨ ... ∨ pm,     q1 ∨ ... ∨ qk ∨ ... ∨ qn
Unify(pj, ¬qk) = θ
---------------------------------------------------------------------------
Subst(θ, p1 ∨ ... ∨ pj-1 ∨ pj+1 ∨ ... ∨ pm ∨ q1 ∨ ... ∨ qk-1 ∨ qk+1 ∨ ... ∨ qn)

This resolves literals pj and qk. Note that we have to add ¬ to qk to make it unify
with pj, so it is in fact pj which is the negative literal here. The rule is more general
than first-order binary resolution in that it allows an arbitrary number of literals in
each clause. Moreover, θ is the most general unifier, rather than an arbitrary
unifying substitution.
To use the rule in practice, we first take a pair of sentences and express them in
CNF using the techniques described above. Then we find two literals, pj and qk, for
which we can find a substitution θ to unify pj and ¬qk. Then we take a disjunction of
all the literals (in both sentences) except pj and qk. Finally, we apply the substitution
θ to the new disjunction to determine what we have just inferred using resolution.
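As a small illustration, consider the knows/hates example from Section 8.3. In clause form, knows(john,X) → hates(john,X) becomes ¬knows(john,X) ∨ hates(john,X). Resolving this with the single-literal clause knows(john,mary), we unify pj = ¬knows(john,X) with ¬qk = ¬knows(john,mary) using θ = {X/mary}, and the resolvent is Subst(θ, hates(john,X)) = hates(john,mary), which is exactly the Modus Ponens inference made earlier.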
In the next lecture, we will look at how resolution theorem proving is put into
action, including some example proofs, some heuristics for improving its
performance and some applications.

Chapter-11
Decision Tree Learning
As discussed in the last lecture, the representation scheme we choose to represent
our learned solutions and the way in which we learn those solutions are the most
important aspects of a learning method. We look in this lecture at decision trees - a
simple but powerful representation scheme, and we look at the ID3 method for
decision tree learning.

11.1 Decision Trees


Imagine you only ever do four things at the weekend: go shopping, watch a movie,
play tennis or just stay in. What you do depends on three things: the weather
(windy, rainy or sunny); how much money you have (rich or poor) and whether
your parents are visiting. You say to yourself: if my parents are visiting, we'll
go to the cinema. If they're not visiting and it's sunny, then I'll play tennis, but if it's

windy, and I'm rich, then I'll go shopping. If they're not visiting, it's windy and I'm
poor, then I will go to the cinema. If they're not visiting and it's rainy, then I'll stay
in.
To remember all this, you draw a flowchart which will enable you to read off your
decision. We call such diagrams decision trees. A suitable decision tree for the
weekend decision choices would be as follows:

We can see why such diagrams are called trees, because, while they are admittedly
upside down, they start from a root and have branches leading to leaves (the tips of
the graph at the bottom). Note that the leaves are always decisions, and a particular
decision might be at the end of multiple branches (for example, we could choose to
go to the cinema for two different reasons).
Armed with our decision tree, on Saturday morning, when we wake up, all we need
to do is check (a) the weather (b) how much money we have and (c) whether our
parents' car is parked in the drive. The decision tree will then enable us to make our
decision. Suppose, for example, that the parents haven't turned up and the sun is
shining. Then this path through our decision tree will tell us what to do:

and hence we run off to play tennis because our decision tree told us to. Note that
the decision tree covers all eventualities. That is, there are no values that the
weather, the parents turning up or the money situation could take which aren't
catered for in the decision tree. Note that, in this lecture, we will be looking at how
to automatically generate decision trees from examples, not at how to turn thought
processes into decision trees.

Reading Decision Trees

There is a link between decision tree representations and logical representations,


which can be exploited to make it easier to understand (read) learned decision trees.
If we think about it, every decision tree is actually a disjunction of implications (if
... then statements), and the implications are Horn clauses: a conjunction of literals
implying a single literal. In the above tree, we can see this by reading from the root
node to each leaf node:
If the parents are visiting, then go to the cinema
or
If the parents are not visiting and it is sunny, then play tennis
or

If the parents are not visiting and it is windy and you're rich, then go shopping
or
If the parents are not visiting and it is windy and you're poor, then go to cinema
or
If the parents are not visiting and it is rainy, then stay in.
Of course, this is just a re-statement of the original mental decision making process
we described. Remember, however, that we will be programming an agent to learn
decision trees from example, so this kind of situation will not occur as we will start
with only example situations. It will therefore be important for us to be able to read
the decision tree the agent suggests.
Decision trees don't have to be representations of decision making processes, and
they can equally apply to categorisation problems. If we phrase the above question
slightly differently, we can see this: instead of saying that we wish to represent a
decision process for what to do on a weekend, we could ask what kind of weekend
this is: is it a weekend where we play tennis, or one where we go shopping, or one
where we see a film, or one where we stay in? For another example, we can refer
back to the animals example from the last lecture: in that case, we wanted to
categorise what class an animal was (mammal, fish, reptile, bird) using physical
attributes (whether it lays eggs, number of legs, etc.). This could easily be phrased
as a question of learning a decision tree to decide which category a given animal is
in, e.g., if it lays eggs and is homeothermic, then it's a bird, and so on...
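To make the link between trees and if ... then rules concrete, here is a small Python sketch of the weekend tree and a function that reads a decision off it. The nested-dictionary encoding and the attribute names are illustrative choices, not something fixed by the text.

# The weekend decision tree as nested dictionaries, plus a reader that follows
# one branch per attribute test until it reaches a leaf (a decision).
weekend_tree = {
    "attribute": "parents",
    "branches": {
        "Yes": "Cinema",
        "No": {
            "attribute": "weather",
            "branches": {
                "Sunny": "Tennis",
                "Rainy": "Stay in",
                "Windy": {
                    "attribute": "money",
                    "branches": {"Rich": "Shopping", "Poor": "Cinema"},
                },
            },
        },
    },
}

def decide(tree, example):
    while isinstance(tree, dict):                  # internal node: test an attribute
        tree = tree["branches"][example[tree["attribute"]]]
    return tree                                    # leaf: the decision

# Parents not visiting and the sun is shining, as in the walk-through above:
print(decide(weekend_tree, {"parents": "No", "weather": "Sunny", "money": "Rich"}))   # Tennis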

11.2 Learning Decision Trees Using ID3

Specifying the Problem

We now need to look at how you mentally constructed your decision tree when
deciding what to do at the weekend. One way would be to use some background
information as axioms and deduce what to do. For example, you might know that
your parents really like going to the cinema, and that your parents are in town, so
therefore (using something like Modus Ponens) you would decide to go to the
cinema.
Another way in which you might have made up your mind was by generalising
from previous experiences. Imagine that you remembered all the times when you
had a really good weekend. A few weeks back, it was sunny and your parents were
not visiting, you played tennis and it was good. A month ago, it was raining and you
were penniless, but a trip to the cinema cheered you up. And so on. This
information could have guided your decision making, and if this was the case, you
would have used an inductive, rather than deductive, method to construct your

decision tree. In reality, it's likely that humans reason to solve decisions using both
inductive and deductive processes.
We can state the problem of learning decision trees as follows:

We have a set of examples correctly categorised into categories (decisions). We


also have a set of attributes describing the examples, and each attribute has a finite
set of values which it can possibly take. We want to use the examples to learn the
structure of a decision tree which can be used to decide the category of an unseen
example.

Assuming that there are no inconsistencies in the data (when two examples have
exactly the same values for the attributes, but are categorised differently), it is
obvious that we can always construct a decision tree to correctly decide for the
training cases with 100% accuracy. All we have to do is make sure every situation
is catered for down some branch of the decision tree. Of course, 100% accuracy
may indicate overfitting.

The basic idea

In the decision tree above, it is significant that the "parents visiting" node came at
the top of the tree. We don't know exactly the reason for this, as we didn't see the
example weekends from which the tree was produced. However, it is likely that the
number of weekends the parents visited was relatively high, and every weekend
they did visit, there was a trip to the cinema. Suppose, for example, the parents have
visited every fortnight for a year, and on each occasion the family visited the
cinema. This means that there is no evidence in favour of doing anything other than
watching a film when the parents visit. Given that we are learning rules from
examples, this means that if the parents visit, the decision is already made. Hence
we can put this at the top of the decision tree, and disregard all the examples where
the parents visited when constructing the rest of the tree. Not having to worry about
a set of examples will make the construction job easier.
This kind of thinking underlies the ID3 algorithm for learning decisions trees,
which we will describe more formally below. However, the reasoning is a little
more subtle, as (in our example) it would also take into account the examples when
the parents did not visit.

Entropy

Putting together a decision tree is all a matter of choosing which attribute to test at
each node in the tree. We shall define a measure called information gain which will
be used to decide which attribute to test at each node. Information gain is itself
calculated using a measure called entropy, which we first define for the case of a
binary decision problem and then define for the general case.
Given a binary categorisation, C, and a set of examples, S, for which the proportion
of examples categorised as positive by C is p+ and the proportion of examples
categorised as negative by C is p-, then the entropy of S is:
Entropy(S) = -p+ log2(p+) - p- log2(p-)

The reason we defined entropy first for a binary decision problem is because it is
easier to get an impression of what it is trying to calculate. Tom Mitchell puts this
quite well:
"In order to define information gain precisely, we begin by defining a measure
commonly used in information theory, called entropy that characterizes the
(im)purity of an arbitrary collection of examples."
Imagine having a set of boxes with some balls in. If all the balls were in a single
box, then this would be nicely ordered, and it would be extremely easy to find a
particular ball. If, however, the balls were distributed amongst the boxes, this would
not be so nicely ordered, and it might take quite a while to find a particular ball. If
we were going to define a measure based on this notion of purity, we would want to
be able to calculate a value for each box based on the number of balls in it, then
take the sum of these as the overall measure. We would want to reward two
situations: nearly empty boxes (very neat), and boxes with nearly all the balls in
(also very neat). This is the basis for the general entropy measure, which is defined
as follows:
Given an arbitrary categorisation, C, into categories c1, ..., cn, and a set of examples,
S, for which the proportion of examples in ci is pi, then the entropy of S is:
Entropy(S) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)

This measure satisfies our criteria, because of the -p*log2(p) construction: when p
gets close to zero (i.e., the category has only a few examples in it), then the log(p)
becomes a big negative number, but the p part dominates the calculation, so the
entropy works out to be nearly zero. Remembering that entropy calculates the
disorder in the data, this low score is good, as it reflects our desire to reward

categories with few examples in. Similarly, if p gets close to 1 (i.e., the category has
most of the examples in), then the log(p) part gets very close to zero, and it is this
which dominates the calculation, so the overall value gets close to zero. Hence we
see that both when the category is nearly - or completely - empty, or when the
category nearly contains - or completely contains - all the examples, the score for
the category gets close to zero, which models what we wanted it to. Note that
0*log2(0) is taken to be zero by convention.

Information Gain

We now return to the problem of trying to determine the best attribute to choose for
a particular node in a tree. The following measure calculates a numerical value for a
given attribute, A, with respect to a set of examples, S. Note that the values of
attribute A will range over a set of possibilities which we call Values(A), and that,
for a particular value from that set, v, we write Sv for the set of examples which
have value v for attribute A.
The information gain of attribute A, relative to a collection of examples, S, is
calculated as:
Gain(S,A) = Entropy(S) - Σ (over v in Values(A)) (|Sv|/|S|) * Entropy(Sv)

The information gain of an attribute can be seen as the expected reduction in


entropy caused by knowing the value of attribute A.

An Example Calculation

As an example, suppose we are working with a set of examples, S = {s1,s2,s3,s4}


categorised into a binary categorisation of positives and negatives, such that s1 is
positive and the rest are negative. Suppose further that we want to calculate the
information gain of an attribute, A, and that A can take the values {v1,v2,v3}.
Finally, suppose that:
s1 takes value v2 for A
s2 takes value v2 for A
s3 takes value v3 for A
s4 takes value v1 for A
To work out the information gain for A relative to S, we first need to calculate the
entropy of S. To use our formula for binary categorisations, we need to know the

proportion of positives in S and the proportion of negatives. These are given as:
p+ = 1/4 and p- = 3/4. So, we can calculate:
Entropy(S) = -(1/4)log2(1/4) -(3/4)log2(3/4) = -(1/4)(-2) -(3/4)(-0.415) = 0.5 + 0.311
= 0.811
Note that, to do this calculation with your calculator, you may need to remember
that: log2(x) = ln(x)/ln(2), where ln(2) is the natural log of 2. Next, we need to
calculate the weighted Entropy(Sv) for each value v = v1, v2, v3, noting that the
weighting involves multiplying by (|Svi|/|S|). Remember also that Sv is the set of
examples from S which have value v for attribute A. This means that:
Sv1 = {s4}, Sv2 = {s1, s2}, Sv3 = {s3}.
We now need to carry out these calculations:
(|Sv1|/|S|) * Entropy(Sv1) = (1/4) * (-(0/1)log2(0/1) - (1/1)log2(1/1)) = (1/4)(-0 - (1)log2(1)) = (1/4)(-0 - 0) = 0
(|Sv2|/|S|) * Entropy(Sv2) = (2/4) * (-(1/2)log2(1/2) - (1/2)log2(1/2)) = (1/2) * (-(1/2)*(-1) - (1/2)*(-1)) = (1/2) * (1) = 1/2
(|Sv3|/|S|) * Entropy(Sv3) = (1/4) * (-(0/1)log2(0/1) - (1/1)log2(1/1)) = (1/4)(-0 - (1)log2(1)) = (1/4)(-0 - 0) = 0
Note that we have taken 0 log2(0) to be zero, which is standard. In our calculation,
we only required log2(1) = 0 and log2(1/2) = -1. We now have to add these three
values together and take the result from our calculation for Entropy(S) to give us
the final result:
Gain(S,A) = 0.811 - (0 + 1/2 + 0) = 0.311
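If you want to check this kind of calculation mechanically, the following short Python sketch computes entropy and information gain and reproduces the 0.311 result; the helper names and the tuple encoding of the examples are illustrative assumptions, not part of the definitions above.

import math

def entropy(proportions):
    # 0 * log2(0) is taken to be zero, as noted above.
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def information_gain(examples, attribute_of, category_of, values):
    # Gain(S, A) = Entropy(S) minus the weighted entropies of the subsets Sv.
    def proportions(subset):
        cats = [category_of(e) for e in subset]
        return [cats.count(c) / len(subset) for c in set(cats)]
    gain = entropy(proportions(examples))
    for v in values:
        sv = [e for e in examples if attribute_of(e) == v]
        if sv:
            gain -= (len(sv) / len(examples)) * entropy(proportions(sv))
    return gain

# s1 is positive and the rest are negative; s1 and s2 take v2, s3 takes v3, s4 takes v1.
examples = [("s1", "v2", "+"), ("s2", "v2", "-"), ("s3", "v3", "-"), ("s4", "v1", "-")]
print(round(information_gain(examples, lambda e: e[1], lambda e: e[2],
                             ["v1", "v2", "v3"]), 3))   # 0.311, as calculated by hand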
We now look at how information gain can be used in practice in an algorithm to
construct decision trees.

The ID3 algorithm

The calculation for information gain is the most difficult part of this algorithm. ID3
performs a search whereby the search states are decision trees and the operator
involves adding a node to an existing tree. It uses information gain to choose which
attribute to put in each node, and performs a greedy search using this measure of
worth. The algorithm goes as follows:

Given a set of examples, S, categorised in categories ci, then:


1. Choose the root node to be the attribute, A, which scores the highest for
information gain relative to S.
2. For each value v that A can possibly take, draw a branch from the node.
3. For each branch from A corresponding to value v, calculate Sv. Then:

If Sv is empty, choose the category cdefault which contains the most examples from
S, and put this as the leaf node category which ends that branch.
If Sv contains only examples from a category c, then put c as the leaf node
category which ends that branch.
Otherwise, remove A from the set of attributes which can be put into nodes. Then
put a new node in the decision tree, where the new attribute being tested in the
node is the one which scores highest for information gain relative to Sv (note: not
relative to S). This new node starts the cycle again (from 2), with S replaced by Sv
in the calculations and the tree gets built iteratively like this.

The algorithm terminates either when all the attributes have been exhausted, or the
decision tree perfectly classifies the examples.
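A minimal Python sketch of this procedure is given below. It reuses the information_gain helper sketched earlier; the dictionary-based example and tree formats, and the parameter names, are illustrative assumptions rather than part of the algorithm's definition.

from collections import Counter

def id3(examples, attributes, category_of, values_of):
    categories = [category_of(e) for e in examples]
    if len(set(categories)) == 1:              # every example in one category: leaf node
        return categories[0]
    if not attributes:                         # attributes exhausted: most common category
        return Counter(categories).most_common(1)[0][0]
    # Step 1: choose the attribute scoring highest for information gain relative to S.
    best = max(attributes,
               key=lambda a: information_gain(examples, lambda e: e[a],
                                              category_of, values_of[a]))
    tree = {"attribute": best, "branches": {}}
    default = Counter(categories).most_common(1)[0][0]
    # Steps 2 and 3: one branch per value v, ending in a leaf or a recursive subtree.
    for v in values_of[best]:
        sv = [e for e in examples if e[best] == v]
        if not sv:
            tree["branches"][v] = default      # Sv empty: default category leaf
        else:
            tree["branches"][v] = id3(sv, [a for a in attributes if a != best],
                                      category_of, values_of)
    return tree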
The following diagram should explain the ID3 algorithm further:


11.3 A worked example


We will stick with our weekend example. Suppose we want to train a decision tree
using the following instances:
Weekend (Example)   Weather   Parents   Money   Decision (Category)
W1                  Sunny     Yes       Rich    Cinema
W2                  Sunny     No        Rich    Tennis
W3                  Windy     Yes       Rich    Cinema
W4                  Rainy     Yes       Poor    Cinema
W5                  Rainy     No        Rich    Stay in
W6                  Rainy     Yes       Poor    Cinema
W7                  Windy     No        Poor    Cinema
W8                  Windy     No        Rich    Shopping
W9                  Windy     Yes       Rich    Cinema
W10                 Sunny     No        Rich    Tennis

The first thing we need to do is work out which attribute will be put into the node at
the top of our tree: either weather, parents or money. To do this, we need to
calculate:
Entropy(S) = -pcinema log2(pcinema) - ptennis log2(ptennis) - pshopping log2(pshopping) - pstay_in log2(pstay_in)
= -(6/10) * log2(6/10) -(2/10) * log2(2/10) -(1/10) * log2(1/10) -(1/10) *
log2(1/10)
= -(6/10) * -0.737 -(2/10) * -2.322 -(1/10) * -3.322 -(1/10) * -3.322
= 0.4422 + 0.4644 + 0.3322 + 0.3322 = 1.571
and we need to determine the best of:
Gain(S, weather) = 1.571 - (|Ssun|/10)*Entropy(Ssun) - (|Swind|/10)*Entropy(Swind) - (|Srain|/10)*Entropy(Srain)
= 1.571 - (0.3)*Entropy(Ssun) - (0.4)*Entropy(Swind) - (0.3)*Entropy(Srain)
= 1.571 - (0.3)*(0.918) - (0.4)*(0.81125) - (0.3)*(0.918) = 0.70
Gain(S, parents) = 1.571 - (|Syes|/10)*Entropy(Syes) - (|Sno|/10)*Entropy(Sno)
= 1.571 - (0.5) * 0 - (0.5) * 1.922 = 1.571 - 0.961 = 0.61
Gain(S, money) = 1.571 - (|Srich|/10)*Entropy(Srich) - (|Spoor|/10)*Entropy(Spoor)
= 1.571 - (0.7) * (1.842) - (0.3) * 0 = 1.571 - 1.2894 = 0.2816

This means that the first node in the decision tree will be the weather attribute. As
an exercise, convince yourself why this scored (slightly) higher than the parents
attribute - remember what entropy means and look at the way information gain is
calculated.
From the weather node, we draw a branch for the values that weather can take:
sunny, windy and rainy:

Now we look at the first branch. Ssunny = {W1, W2, W10}. This is not empty, so we
do not put a default categorisation leaf node here. The categorisations of W1, W2
and W10 are Cinema, Tennis and Tennis respectively. As these are not all the same,
we cannot put a categorisation leaf node here. Hence we put an attribute node here,
which we will leave blank for the time being.
Looking at the second branch, Swindy = {W3, W7, W8, W9}. Again, this is not
empty, and they do not all belong to the same class, so we put an attribute node
here, left blank for now. The same situation happens with the third branch, hence
our amended tree looks like this:

Now we have to fill in the choice of attribute A, which we know cannot be weather,
because we've already removed that from the list of attributes to use. So, we need to
calculate the values for Gain(Ssunny, parents) and Gain(Ssunny, money). Firstly,
Entropy(Ssunny) = 0.918. Next, we set S to be Ssunny = {W1,W2,W10} (and, for this
part of the branch, we will ignore all the other examples). In effect, we are
interested only in this part of the table:
Weekend (Example)   Weather   Parents   Money   Decision (Category)
W1                  Sunny     Yes       Rich    Cinema
W2                  Sunny     No        Rich    Tennis
W10                 Sunny     No        Rich    Tennis

Hence we can calculate:


Gain(Ssunny, parents) = 0.918 - (|Syes|/|S|)*Entropy(Syes) - (|Sno|/|S|)*Entropy(Sno)
= 0.918 - (1/3)*0 - (2/3)*0 = 0.918
Gain(Ssunny, money) = 0.918 - (|Srich|/|S|)*Entropy(Srich) - (|Spoor|/|S|)*Entropy(Spoor)
= 0.918 - (3/3)*0.918 - (0/3)*0 = 0.918 - 0.918 = 0
Notice that Entropy(Syes) and Entropy(Sno) were both zero, because Syes contains
examples which are all in the same category (cinema), and Sno similarly contains
examples which are all in the same category (tennis). This should make it more
obvious why we use information gain to choose attributes to put in nodes.
Given our calculations, attribute A should be taken as parents. The two values from
parents are yes and no, and we will draw a branch from the node for each of these.
Remembering that we replaced the set S by the set Ssunny, looking at Syes, we see
that the only example of this is W1. Hence, the branch for yes stops at a
categorisation leaf, with the category being Cinema. Also, Sno contains W2 and
W10, but these are in the same category (Tennis). Hence the branch for no ends
here at a categorisation leaf. Hence our upgraded tree looks like this:

Finishing this tree off is left as a tutorial exercise.
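As a check on the hand calculation, the ID3 sketch given earlier can be run on the ten weekend examples from the table above; again, the dictionary encoding of the examples is an illustrative assumption.

weekends = [
    {"weather": "Sunny", "parents": "Yes", "money": "Rich", "decision": "Cinema"},   # W1
    {"weather": "Sunny", "parents": "No",  "money": "Rich", "decision": "Tennis"},   # W2
    {"weather": "Windy", "parents": "Yes", "money": "Rich", "decision": "Cinema"},   # W3
    {"weather": "Rainy", "parents": "Yes", "money": "Poor", "decision": "Cinema"},   # W4
    {"weather": "Rainy", "parents": "No",  "money": "Rich", "decision": "Stay in"},  # W5
    {"weather": "Rainy", "parents": "Yes", "money": "Poor", "decision": "Cinema"},   # W6
    {"weather": "Windy", "parents": "No",  "money": "Poor", "decision": "Cinema"},   # W7
    {"weather": "Windy", "parents": "No",  "money": "Rich", "decision": "Shopping"}, # W8
    {"weather": "Windy", "parents": "Yes", "money": "Rich", "decision": "Cinema"},   # W9
    {"weather": "Sunny", "parents": "No",  "money": "Rich", "decision": "Tennis"},   # W10
]
values_of = {"weather": ["Sunny", "Windy", "Rainy"],
             "parents": ["Yes", "No"],
             "money":   ["Rich", "Poor"]}
tree = id3(weekends, ["weather", "parents", "money"],
           lambda e: e["decision"], values_of)
print(tree["attribute"])   # weather: the same root attribute chosen by the hand calculation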


11.4 Avoiding Overfitting


As we discussed in the previous lecture, overfitting is a common problem in
machine learning. Decision trees suffer from this, because they are trained to stop
when they have perfectly classified all the training data, i.e., each branch is
extended just far enough to correctly categorise the examples relevant to that
branch. Many approaches to overcoming overfitting in decision trees have been
attempted. As summarised by Tom Mitchell, these attempts fit into two types:

Stop growing the tree before it reaches perfection.


Allow the tree to fully grow, and then post-prune some of the branches from it.

The second approach has been found to be more successful in practice. Both
approaches boil down to the question of determining the correct tree size. See
Chapter 3 of Tom Mitchell's book for a more detailed description of overfitting
avoidance in decision tree learning.

11.5 Appropriate Problems for Decision Tree Learning


It is a skilled job in AI to choose exactly the right learning representation/method
for a particular learning task. As elaborated by Tom Mitchell, decision tree learning
is best suited to problems with these characteristics:

The background concepts describe the examples in terms of attribute-value pairs,


and the values for each attribute range over finitely many fixed possibilities.
The concept to be learned (Mitchell calls it the target function) has discrete
values.
Disjunctive descriptions might be required in the answer.

In addition to this, decision tree learning is robust to errors in the data. In particular,
it will function well in the light of (i) errors in the classifications of the instances provided,
(ii) errors in the attribute-value pairs provided and (iii) missing values for certain
attributes for certain examples.

Lecture 12: Artificial Neural Networks - Two Layer Networks

Decision trees, while powerful, are a simple representation scheme. While graphical
on the surface, they can be seen as disjunctions of conjunctions, and hence are a
logical representation, and we call such schemes symbolic representations. In this
lecture, we look at a non-symbolic representation scheme known as Artificial
Neural Networks. This term is often shortened to Neural Networks, but this annoys
neuro-biologists who deal with real neural networks (inside our human heads).
As the name suggests, ANNs have a biological motivation, and we briefly look at
that first. Following this, we look in detail at how information is represented in
ANNs, then we look at the simplest type of network, two layer networks. We look
at perceptrons and linear units, and discuss the limitations that such simple
networks have. In the next lecture, we discuss multi-layer networks and the backpropagation algorithm for learning such networks.

12.1 Biological Motivation


In our discussion in the first lecture about how people have answered the question:
"How are we going to get an agent to act intelligently", one of the answers was to
realise that intelligence in individual humans is effected by our brains. Neuroscientists have told us that the brain is made up of architectures of networks of
neurons. At the most basic level, neurons can be seen as functions which, when
given some input, will either fire or not fire, depending on the nature of the input.
The input to certain neurons comes from the senses, but in general, the input to a
neuron is a set of outputs from other neurons. If the input to a neuron goes over a
certain threshold, then the neuron will fire. In this way, one neuron firing will affect
the firing of many other neurons, and information can be stored in terms of the
thresholds set and the weight assigned by each neuron to each of its inputs.
Artificial Neural Networks (ANNs) are designed to mimic the behaviour of the
brain. Some ANNs are built into hardware, but the vast majority are simulated in
software, and we concentrate on these. It's important not to take the analogy too far,
because there really isn't much similarity between artificial and animal neural
networks. In particular, while the human brain is estimated to contain around
100,000,000,000 neurons, ANNs usually contain fewer than 1000 equivalent units.
Moreover, the interconnectivity between neurons is much denser in natural systems. Also,
the way in which ANNs store and manipulate information is a gross simplification
of the way in which networks of neurons work in natural systems.

12.2 ANN Representation


ANNs are taught on AI courses because of their motivation from brain studies and
the fact that they are used in an AI task, namely machine learning. However, I
would argue that their real home is in statistics, because, as a representation
scheme, they are just fancy mathematical functions.
Imagine being asked to come up with a function to take the following inputs and
produce their associated outputs:
Input   Output
1       1
2       4
3       9
4       16
Presumably, the function you would learn would be f(x) = x². Imagine now that you
had a set of values, rather than a single instance as input to your function:
Input Output
[1,2,3] 1
[2,3,4] 5
[3,4,5] 11
[4,5,6] 19

Here, it is still possible to learn a function: for example, multiply the first and last
element and take the middle one from the product. Note that the functions we are
learning are getting more complicated, but they are still mathematical. ANNs just
take this further: the functions they learn are generally so complicated that it's
difficult to understand them on a global level. But they are still just functions which
play around with numbers.
Imagine, now, for example, that the inputs to our function were arrays of pixels,
actually taken from photographs of vehicles, and that the output of the function is
either 1, 2 or 3, where 1 stands for a car, 2 stands for a bus and 3 stands for a tank:

(Table of example vehicle images and their corresponding output categories omitted.)
In this case, the function which takes an array of integers representing pixel data
and outputs either 1, 2 or 3 will be fairly complicated, but it's just doing the same
kind of thing as the two simpler functions.
Because the functions learned (for example, to categorise photos of vehicles as a
car, bus or tank) are so complicated, we say the ANN approach is a
black box approach because, while the function performs well at its job, we cannot
look inside it to gain a knowledge of how it works. This is a little unfair, as there
are some projects which have addressed the problem of translating learned neural
networks into human readable forms. However, in general, ANNs are used in cases
where the predictive accuracy is of greater importance than understanding the
learned concept.
Artificial Neural Networks consist of a number of units which are mini calculation
devices. They take in real-valued input from multiple other nodes and they produce
a single real valued output. By real-valued input and output we mean real numbers
which are able to take any decimal value. The architecture of ANNs is as follows:
1. A set of input units which take in information about the example to be
propagated through the network. By propagation, we mean that the
information from the input will be passed through the network and an output
produced. The set of input units forms what is known as the input layer.
2. A set of hidden units which take input from the input layer. The hidden
units collectively form the hidden layer. For simplicity, we assume that
each unit in the input layer is connected to each unit of the hidden layer, but
this isn't necessarily the case. A weighted sum of the output from the input
units forms the input to every hidden unit. Note that the number of hidden
units is usually smaller than the number of input units.
3. A set of output units which, in learning tasks, dictate the category assigned
to an example propagated through the network. The output units form the
output layer. Again, for simplicity, we assume that each unit in the hidden
layer is connected to each unit in the output layer. A weighted sum of the
output from the hidden units forms the input to every output unit.
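Propagation through such a network amounts to repeated weighted sums pushed through a unit function. The following minimal Python sketch illustrates this; the function name, the list-of-lists weight layout and the example numbers are assumptions for illustration only.

def propagate(inputs, hidden_weights, output_weights, unit_function):
    # hidden_weights[j] holds the weights from every input unit to hidden unit j;
    # output_weights[k] holds the weights from every hidden unit to output unit k.
    hidden = [unit_function(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return [unit_function(sum(w * h for w, h in zip(ws, hidden)))
            for ws in output_weights]

# Two inputs, two hidden units, one output unit, with a simple step unit:
def step(s):
    return 1 if s > 0 else -1

print(propagate([1, -1], [[0.5, 0.2], [0.3, -0.7]], [[0.6, -0.4]], step))   # [1]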
Hence ANNs look like this in the general case:

Note that the w, x, y and z represent real valued weights and that all the edges in
this graph have weights associated with them (but it was difficult to draw them all
on). Note also that more complicated ANNs are certainly possible. In particular,
many ANNs have multiple hidden layers, with the output from one hidden layer
forming the input to another hidden layer. Also, ANNs with no hidden layer - where
the input units are connected directly to the output units - are possible. These tend
to be too simple to use for real world learning problems, but they are useful to study
for illustrative purposes, and we look at the simplest kind of neural networks,
perceptrons, in the next section.
In our vehicle example, it is likely that the images will all be normalised to having
the same number of pixels. Then there may be an input unit for each red, green and
blue intensity for each pixel. Alternatively, greyscale images may be used, in which
case there needs only to be an input node for each pixel, which takes in the
brightness of the pixel. The hidden layer is likely to contain far fewer units
(probably between 3 and 10) than the number of input units. The output layer will
contain three units, one for each of the categories possible (car, bus, tank). Then,
when the pixel data for an image is given as the initial values for the input units,
this information will propagate through the network and the three output units will
each produce a real value. The output unit which produces the highest value is taken
as the categorisation for the input image.
So, for instance, when this image is used as input:

then, if output unit 1 [car] produces value 0.5, output unit 2 [bus] produces value
0.05 and output unit 3 [tank] produces value 0.1, then this image has been
(correctly) classified as a car, because the output from the corresponding car output
unit is higher than for the other two. Exactly how the function embedded within a
neural network computes the outputs given the inputs is best explained using
example networks. In the next section, we look at the simplest networks of all,
perceptrons, which consist of a set of input units connected to a single output unit.

12.3 Perceptrons
The weights in any ANN are always just real numbers and the learning problem
boils down to choosing the best value for each weight in the network. This means
there are two important decisions to make before we train an artificial neural
network: (i) the overall architecture of the system (how input nodes represent given
examples, how many hidden units/hidden layers to have and how the output
information will give us an answer) and (ii) how the units calculate their real value
output from the weighted sum of real valued inputs.
The answer to (i) is usually found by experimentation with respect to the learning
problem at hand: different architectures are tried and evaluated on the learning
problem until the best one emerges. In perceptrons, given that we have no hidden
layer, the architecture problem boils down to just specifying how the input units
represent the examples given to the network. The answer to (ii) is discussed in the
next subsection.

Units


The input units simply output the value which was input to them from the example
to be propagated. Every other unit in a network normally has the same internal
calculation function, which takes the weighted sum of inputs to it and calculates an
output. There are different possibilities for the unit function and this dictates to
some extent how learning over networks of that type is performed. Firstly, there is a
simple linear unit which does no calculation; it just outputs the weighted sum
which was input to it.
Secondly, there are other unit functions which are called threshold functions,
because they are set up to produce low values up until the weighted sum reaches a
particular threshold, then they produce high values after this threshold. The simplest
type of threshold function produces a 1 if the weighted sum of the inputs is over a
threshold value T, and produces a -1 otherwise. We call such functions step
functions, due to the fact that, when drawn as a graph, it looks like a step. Another
type of threshold function is called a sigma function, which has similarities with the
step function, but advantages over it. We will look at sigma functions in the next
lecture.

Example

As an example, consider an ANN which has been trained to learn the following rule
categorising the brightness of 2x2 black and white pixel images: if it contains 3 or 4
black pixels, it is dark; if it contains 2, 3 or 4 white pixels, it is bright. We can
model this with a perceptron by saying that there are 4 input units, one for each
pixel, and they output +1 if the pixel is white and -1 if the pixel is black. Also, the
output unit produces a 1 if the input example is to be categorised as bright and -1 if
the example is dark. If we choose the weights as in the following diagram, the
perceptron will perfectly categorise any image of four pixels into dark or light
according to our rule:


We see that, in this case, the output unit has a step function, with the threshold set
to -0.1. Note that the weights in this network are all the same, which is not true in
the general case. Also, it is convenient to make the weights going in to a node add
up to 1, so that it is possible to compare them easily. The reason this network
perfectly captures our notion of darkness and lightness is because, if three white
pixels are input, then three of the input units produce +1 and one input unit
produces -1. This goes into the weighted sum, giving a value of S = 0.25*1 +
0.25*1 + 0.25*1 + 0.25*(-1) = 0.5. As this is greater than the threshold of -0.1, the
output node produces +1, which relates to our notion of a bright image. Similarly,
four white pixels will produce a weighted sum of 1, which is greater than the
threshold, and two white pixels will produce a sum of 0, also greater than the
threshold. However, if there are three black pixels, S will be -0.5, which is below
the threshold, hence the output node will output -1, and the image will be
categorised as dark. Similarly, an image with four black pixels will be categorised
as dark. As an exercise: keeping the weights the same, how low would the threshold
have to be in order to misclassify an example with three or four black pixels?
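A minimal Python sketch of this example network makes it easy to experiment with different thresholds for the exercise above; the function name and encoding are illustrative assumptions.

def step_perceptron(pixels, weights=(0.25, 0.25, 0.25, 0.25), threshold=-0.1):
    # pixels: +1 for a white pixel, -1 for a black pixel; output +1 = bright, -1 = dark.
    s = sum(w * x for w, x in zip(weights, pixels))
    return 1 if s > threshold else -1

print(step_perceptron([1, 1, 1, -1]))    # three white pixels: S = 0.5, so bright (+1)
print(step_perceptron([-1, -1, -1, 1]))  # three black pixels: S = -0.5, so dark (-1)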

Learning Weights in Perceptrons

We will look in detail at the learning method for weights in multi-layer networks
next lecture. The following description of learning in perceptrons will help clarify
what is going on in the multi-layer case. We are in a machine learning setting, so
we can expect the task to be to learn a target function which categorises examples
into categories, given (at least) a set of training examples supplied with their correct
categorisations. A little thought will be needed in order to choose the correct way of
thinking about the examples as input to a set of input units, but, due to the simple
nature of a perceptron, there isn't much choice for the rest of the architecture.
In order to produce a perceptron able to perform our categorisation task, we need to
use the examples to train the weights between the input units and the output unit,
and to train the threshold. To simplify the routine, we think of the threshold as a
special weight, which comes from a special input node that always outputs a 1. So,
we think of our perceptron like this:


Then, we say that the output from the perceptron is +1 if the weighted sum from all
the input units (including the special one) is greater than zero, and it outputs -1
otherwise. We see that weight w0 is simply the threshold value. However, thinking
of the network like this means we can train w0 in the same way as we train all the
other weights.
The weights are initially assigned randomly and training examples are used one
after another to tweak the weights in the network. All the examples in the training
set are used and the whole process (using all the examples again) is iterated until all
examples are correctly categorised by the network. The tweaking is known as the
perceptron training rule, and is as follows: If the training example, E, is correctly
categorised by the network, then no tweaking is carried out. If E is mis-classified,
then each weight is tweaked by adding on a small value, Δi. Suppose we are trying
to calculate weight wi, which is between the i-th input unit, xi, and the output unit.
Then, given that the network should have calculated the target value t(E) for
example E, but actually calculated the observed value o(E), then Δi is calculated as:
Δi = η (t(E) - o(E)) xi
Note that η is a fixed positive constant called the learning rate. Ignoring η briefly,
we see that the value that we add on to our weight wi is calculated by multiplying
the input value xi by t(E) - o(E). t(E) - o(E) will either be +2 or -2, because
perceptrons output only +1 or -1, and t(E) cannot be equal to o(E), otherwise we
wouldn't be doing any tweaking. So, we can think of t(E) - o(E) as a movement in a
particular numerical direction, i.e., positive or negative. This direction will be such
that, if the overall sum, S, was too low to get over the threshold and produce the
correct categorisation, then the contribution to S from w i * xi will be increased.
Conversely, if S is too high, the contribution from wi * xi is reduced. Because t(E) - o(E) is multiplied by xi, then if xi is a big value (positive or negative), the change to
the weight will be greater. To get a better feel for why this direction correction
works, it's a good idea to do some simple calculations by hand.

η simply controls how far the correction should go at one time, and is usually set to
be a fairly low value, e.g., 0.1. The weight learning problem can be seen as finding
the global minimum error, calculated as the proportion of mis-categorised training
examples, over a space where all the input values can vary. Therefore, it is possible
to move too far in a direction and improve one particular weight to the detriment of
the overall sum: while the sum may work for the training example being looked at,
it may no longer be a good value for categorising all the examples correctly. For
this reason, η restricts the amount of movement possible. If a large movement is
actually required for a weight, then this will happen over a series of iterations
through the example set. Sometimes, η is set to decay as the number of such
iterations through the whole set of training examples increases, so that it can move
more slowly towards the global minimum in order not to overshoot in one direction.
This kind of gradient descent is at the heart of the learning algorithm for multi-layered networks, as discussed in the next lecture.
Perceptrons with step functions have limited abilities when it comes to the range of
concepts that can be learned, as discussed in a later section. One way to improve
matters is to replace the threshold function with a linear unit, so that the network
outputs a real value, rather than a 1 or -1. This enables us to use another rule, called
the delta rule, which is also based on gradient descent. We don't look at this rule
here, because the backpropagation learning method for multi-layer networks is
similar.

12.4 Worked Example


Suppose we are trying to learn a perceptron to represent the brightness rules above,
in such a way that if it outputs a 1, the image is categorised as bright, and if it
outputs a -1, the image is categorised as dark. Remember that we said a 2x2 black
and white pixel image is categorised as bright if it has two or more white pixels in
it. We shall call the pixels p1 to p4, with the numbers going from left to right, top to
bottom in the 2x2 image. A black pixel will produce an input of -1 to the network,
and a white pixel will give an input of +1.
Given our new way of thinking about the threshold as a weight from a special input
node, our network will have five input nodes and five weights. Suppose also that we
have assigned the weights randomly to values between -1 and 1, namely -0.5, 0.7, 0.2, 0.1 and 0.9. Then our perceptron will initially look like this:


We will now train the network with the first training example, using a learning rate
of η = 0.1. Suppose the first example image, E, is this:

With two white squares, this is categorised as bright. Hence, the target output for E
is: t(E) = +1. Also, p1 (top left) is black, so the input x1 is -1. Similarly, x2 is +1, x3
is +1 and x4 is -1. Hence, when we propagate this through the network, we get the
value:
S = (-0.5 * 1) + (0.7 * -1) + (-0.2 * +1) + (0.1 * +1) + (0.9 * -1) = -2.2
As this value is less than zero, the network outputs o(E) = -1, which is not the
correct value. This means that we should now tweak the weights in light of the
incorrectly categorised example. Using the perceptron training rule, we need to
calculate the value of Δi to add on to each weight in the network. Plugging values
into the formula for each weight gives us:
Δ0 = η (t(E) - o(E)) x0 = 0.1 * (1 - (-1)) * (1) = 0.1 * (2) = 0.2
Δ1 = η (t(E) - o(E)) x1 = 0.1 * (1 - (-1)) * (-1) = 0.1 * (-2) = -0.2
Δ2 = η (t(E) - o(E)) x2 = 0.1 * (1 - (-1)) * (1) = 0.1 * (2) = 0.2
Δ3 = η (t(E) - o(E)) x3 = 0.1 * (1 - (-1)) * (1) = 0.1 * (2) = 0.2


Δ4 = η (t(E) - o(E)) x4 = 0.1 * (1 - (-1)) * (-1) = 0.1 * (-2) = -0.2


When we add these values on to our existing weights, we get the new weights for
the network as follows:
w'0 = -0.5 + Δ0 = -0.5 + 0.2 = -0.3
w'1 = 0.7 + Δ1 = 0.7 + -0.2 = 0.5
w'2 = -0.2 + Δ2 = -0.2 + 0.2 = 0
w'3 = 0.1 + Δ3 = 0.1 + 0.2 = 0.3
w'4 = 0.9 + Δ4 = 0.9 - 0.2 = 0.7
Our newly trained network will now look like this:

To see how this has improved the situation with respect to the training example, we
can propagate it through the network again. This time, we get the weighted sum to
be:
S = (-0.3 * 1) + (0.5 * -1) + (0 * +1) + (0.3 * +1) + (0.7 * -1) = -1.2
This is still negative, and hence the network categorises the example as dark, when
it should be light. However, it is less negative. We can see that, by repeatedly
training using this example, the training rule would eventually bring the network to
a state where it would correctly categorise this example.
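The following short Python sketch (the variable names are illustrative) reproduces this single application of the training rule and confirms the numbers above.

eta = 0.1
weights = [-0.5, 0.7, -0.2, 0.1, 0.9]          # w0..w4 as assigned in the example
x = [1, -1, 1, 1, -1]                          # x0 = 1; p1 black, p2 white, p3 white, p4 black
t = 1                                          # target: the image is bright

s = sum(w * xi for w, xi in zip(weights, x))   # s = -2.2, so the observed output is -1
o = 1 if s > 0 else -1
weights = [w + eta * (t - o) * xi for w, xi in zip(weights, x)]
print([round(w, 1) for w in weights])          # [-0.3, 0.5, 0.0, 0.3, 0.7]
print(round(sum(w * xi for w, xi in zip(weights, x)), 1))   # -1.2: still wrong, but less negative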

12.5 The Learning Abilities of Perceptrons


Computational learning theory is the study of what concepts particular learning
schemes (representation and method) can and can't learn. We don't look at this in
detail, but a famous example, first highlighted in a very influential book by Minsky
and Papert, involves perceptrons. It has been mathematically proven that the above
method for learning perceptron weights will converge to a perfect classifier for
learning tasks where the target concept is linearly separable.
To understand what is and what isn't a linearly separable target function, we look at
the simplest functions of all, boolean functions. These take two inputs, which are
either 1 or -1 and output either a 1 or a -1. Note that, in other contexts, the values 0
and 1 are used instead of -1 and 1. As an example function, the AND boolean
function outputs a 1 only if both inputs are 1, whereas the OR function only outputs
a 1 if either input is 1. Obviously, these relate to the connectives we studied in
first order logic. The following two perceptrons can represent the AND and OR
boolean functions respectively:
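As an illustration, one possible choice of weights and thresholds (an assumption here, not necessarily the values shown in the diagram) under which a single threshold unit computes AND and OR over inputs in {-1, +1} is the following Python sketch:

def threshold_unit(x1, x2, w1, w2, t):
    # Outputs +1 if the weighted sum exceeds the threshold t, and -1 otherwise.
    return 1 if (w1 * x1 + w2 * x2) > t else -1

def AND(x1, x2):
    return threshold_unit(x1, x2, 0.5, 0.5, 0.5)    # only (+1, +1) gets over 0.5

def OR(x1, x2):
    return threshold_unit(x1, x2, 0.5, 0.5, -0.5)   # only (-1, -1) stays below -0.5

for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, AND(a, b), OR(a, b))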

One of the major impacts of Minsky and Papert's book was to highlight the fact that
perceptrons cannot learn a particular boolean function called XOR. This function
outputs a 1 if the two inputs are not the same. To see why XOR cannot be learned,
try and write down a perceptron to do the job. The following diagram highlights the
notion of linear separability in Boolean functions, which explains why they can't be
learned by perceptrons:


In each case, we've plotted the values taken by the Boolean function when the
inputs are particular values: (-1,-1);(1,-1);(-1,1) and (1,1). For the AND function,
there is only one place where a 1 is plotted, namely when both inputs are 1. This
meant that we could draw the dotted line to separate the output -1s from the 1s. We
were able to draw a similar line in the OR case. Because we can draw these lines,
we say that these functions are linearly separable. Note that it is not possible to
draw such a line for the XOR plot: wherever you try, you never get a clean split
into 1s and -1s.
The dotted lines can be seen as the threshold in perceptrons: if the weighted sum, S,
falls below it, then the perceptron outputs one value, and if S falls above it, the
alternative output is produced. It doesn't matter how the weights are organized, the
threshold will still be a line on the graph. Therefore, functions which are not
linearly separable cannot be represented by perceptrons.
Note that this result extends to functions over any number of variables, which can
take in any input, but which produce a Boolean output (and hence could, in
principle be learned by a perceptron). For instance, in the following two graphs, the
function takes in two inputs (like Boolean functions), but the input can be over a
range of values. The concept on the left can be learned by a perceptron, whereas the
concept on the right cannot:


As an exercise, in the left hand plot, draw in the separating (threshold) line.
Unfortunately, the disclosure in Minsky and Papert's book that perceptrons cannot
learn even such a simple function was taken the wrong way: people believed it
represented a fundamental flaw in the use of ANNs to perform learning tasks. This
led to a winter of ANN research within AI, which lasted over a decade. In reality,
perceptrons were being studied in order to gain insights into more complicated
architectures with hidden layers, which do not have the limitations that perceptrons
have. No one ever suggested that perceptrons would be eventually used to solve real
world learning problems. Fortunately, people studying ANNs within other sciences
(notably neuro-science) revived interest in the study of ANNs. For more details of
computational learning theory, see chapter 7 of Tom Mitchell's machine learning
book.

Chapter-13
Multi-Layer Artificial Neural Networks
We can now look at more sophisticated ANNs, which are known as multi-layer artificial neural
networks because they have hidden layers. These will naturally be used to undertake more
complicated tasks than perceptrons. We first look at the network structure for multi-layer ANNs,
and then in detail at the way in which the weights in such structures can be determined to solve
machine learning problems. There are many considerations involved with learning such ANNs,
and we consider some of them here. First and foremost, the algorithm can get stuck in local
minima, and there are some ways to try to get around this. As with any learning technique, we
will also consider the problem of overfitting, and discuss which types of problems an ANN
approach is suitable for.

13.1 Multi-Layer Network Architectures


We saw in the previous lecture that perceptrons have limited scope in the type of concepts they
can learn - they can only learn linearly separable functions. However, we can think of
constructing larger networks by building them out of perceptrons. In such larger networks, the
units which use a step function are called perceptron units.
As with individual perceptrons, multi-layer networks can be used for learning tasks. However,
the learning algorithm that we look at (the backpropagation routine) is derived mathematically,
using differential calculus. The derivation relies on having a differentiable threshold function,
which effectively rules out using perceptron units if we want to be sure that backpropagation
works correctly. The step function in perceptrons is not continuous, hence non-differentiable. An
alternative unit was therefore chosen which had similar properties to the step function in
perceptron units, but which was differentiable. There are many possibilities, one of which is
the sigmoid unit, as described below.

Sigmoid units

Remember that the function inside a unit takes as input the weighted sum, S, of the values coming
from the units connected to it. The function inside sigmoid units calculates the following value,
given a real-valued input S:

σ(S) = 1/(1 + e^(-S))

where e is the base of natural logarithms, e = 2.718...


When we plot the output from sigmoid units given various weighted sums as input, it looks
remarkably like a step function:

Of course, getting a differentiable function which looks like the step function was the whole
point of the exercise. In fact, not only is this function differentiable, but the derivative is fairly
simply expressed in terms of the function itself:

σ'(S) = σ(S) * (1 - σ(S))

Note that the output values for the function range between 0 and 1 but never actually reach
either. This is because e^(-S) is always positive, so the denominator of the fraction is always
greater than 1; the output tends to 0 as S gets very big in the negative direction, and tends to 1
as S gets very big in the positive direction. This transition happens fairly quickly: the middle
ground between 0 and 1 is rarely seen because of the sharp (near) step in the function. Because
it looks like a step function, we can think of it firing and not firing as in a perceptron: if a large
positive real is input, the output will generally be close to 1, and if a large negative real is input,
the output will generally be close to 0.
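
As a quick illustration, a minimal Python sketch of the sigmoid function and of its derivative written in terms of the function itself:

import math

def sigmoid(s):
    # sigma(S) = 1 / (1 + e^(-S))
    return 1.0 / (1.0 + math.exp(-s))

def sigmoid_derivative(s):
    # sigma'(S) = sigma(S) * (1 - sigma(S))
    out = sigmoid(s)
    return out * (1.0 - out)

# The output stays strictly between 0 and 1 and behaves like a smooth step:
for s in [-7, -5, -1, 0, 1, 5, 7]:
    print(s, round(sigmoid(s), 4), round(sigmoid_derivative(s), 4))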

Example Multi-layer ANN with Sigmoid Units

We will concern ourselves here with ANNs containing only one hidden layer, as this makes
describing the backpropagation routine easier. Note that networks where you can feed in the
input on the left and propagate it forward to get an output are called feed forward networks.
Below is such an ANN, with two sigmoid units in the hidden layer. The weights have been set
arbitrarily between all the units.

Note that the sigmoid units have been identified with sigma signs in the nodes of the graph. As we
did with perceptrons, we can give this network an input and determine the output. We can also
look to see which units "fired", i.e., had a value closer to 1 than to 0.
Suppose we input the values 10, 30, 20 into the three input units, from top to bottom. Then the
weighted sum coming into H1 will be:
SH1 = (0.2 * 10) + (-0.1 * 30) + (0.4 * 20) = 2 - 3 + 8 = 7.
Then the function σ is applied to SH1 to give:
σ(SH1) = 1/(1+e^(-7)) = 1/(1+0.000912) = 0.999
[Don't forget to negate S]. Similarly, the weighted sum coming into H2 will be:
SH2 = (0.7 * 10) + (-1.2 * 30) + (1.2 * 20) = 7 - 36 + 24 = -5
and σ applied to SH2 gives:
σ(SH2) = 1/(1+e^(5)) = 1/(1+148.4) = 0.0067
From this, we can see that H1 has fired, but H2 has not. We can now calculate that the weighted
sum going in to output unit O1 will be:
SO1 = (1.1 * 0.999) + (0.1*0.0067) = 1.0996
and the weighted sum going in to output unit O2 will be:
SO2 = (3.1 * 0.999) + (1.17*0.0067) = 3.1047
The output sigmoid unit in O1 will now calculate the output values from the network for O1:

σ(SO1) = 1/(1+e^(-1.0996)) = 1/(1+0.333) = 0.750

and the output from the network for O2:
σ(SO2) = 1/(1+e^(-3.1047)) = 1/(1+0.045) = 0.957
Therefore, if this network represented the learned rules for a categorisation problem, the input
triple (10,30,20) would be categorised into the category associated with O2, because this has the
larger output.
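
To make the propagation concrete, here is a short Python sketch (the dictionary layout is just one way to store the example network's weights) which reproduces the forward pass for the input (10, 30, 20):

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# The example network's weights: input->hidden and hidden->output.
w_input_hidden = {"H1": [0.2, -0.1, 0.4],   # weights from I1, I2, I3 into H1
                  "H2": [0.7, -1.2, 1.2]}   # weights from I1, I2, I3 into H2
w_hidden_output = {"O1": {"H1": 1.1, "H2": 0.1},
                   "O2": {"H1": 3.1, "H2": 1.17}}

inputs = [10, 30, 20]

# Hidden layer: weighted sum of the inputs, then the sigmoid function.
hidden = {h: sigmoid(sum(w * x for w, x in zip(ws, inputs)))
          for h, ws in w_input_hidden.items()}

# Output layer: weighted sum of the hidden outputs, then the sigmoid function.
outputs = {o: sigmoid(sum(w * hidden[h] for h, w in ws.items()))
           for o, ws in w_hidden_output.items()}

print(hidden)   # roughly {'H1': 0.999, 'H2': 0.0067}
print(outputs)  # roughly {'O1': 0.750, 'O2': 0.957}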

13.2 The Backpropagation Learning Routine


As with perceptrons, the information in the network is stored in the weights, so the learning
problem comes down to the question: how do we train the weights to best categorise the training
examples? We then hope that this representation provides a good way to categorise unseen
examples.
In outline, the backpropagation method is the same as for perceptrons:
1. We choose and fix our architecture for the network, which will contain input, hidden and
output units, with sigmoid functions in the hidden and output units.

2. We randomly assign the weights between all the nodes. The assignments should be to
small numbers, usually between -0.5 and 0.5.
3. Each training example is used, one after another, to re-train the weights in the network.
The way this is done is given in detail below.
4. After each epoch (run through all the training examples), a termination condition is
checked (also detailed below). Note that, for this method, we are not guaranteed to find
weights which give the network the global minimum error, i.e., perfectly correct
categorisation of the training examples. Hence the termination condition may have to be
in terms of a (possibly small) number of mis-categorisations. We see later that this might
not be such a good idea, though.

Weight Training Calculations

Because we have more weights in our network than in perceptrons, we firstly need to introduce
the notation wij to specify the weight between unit i and unit j. As with perceptrons, we will
calculate a value Δij to add on to each weight in the network after an example has been tried. To
calculate the weight changes for a particular example, E, we first start with the information
about how the network should perform for E. That is, we write down the target values ti(E) that
each output unit Oi should produce for E. Note that, for categorisation problems, ti(E) will be
zero for all the output units except one, which is the unit associated with the correct
categorisation for E. For that unit, ti(E) will be 1.
Next, example E is propagated through the network so that we can record all the observed values
oi(E) for the output nodes Oi. At the same time, we record all the observed values hi(E) for the
hidden nodes. Then, for each output unit Ok, we calculate its error term as follows:

δOk = ok(E) * (1 - ok(E)) * (tk(E) - ok(E))

The error terms from the output units are used to calculate error terms for the hidden units. In
fact, this method gets its name because we propagate this information backwards through the
network. For each hidden unit Hk, we calculate the error term as follows:

δHk = hk(E) * (1 - hk(E)) * ΣO (wkO * δO)

In English, this means that we take the error term for every output unit and multiply it by the
weight from hidden unit Hk to the output unit. We then add all these together and multiply the
sum by hk(E)*(1 - hk(E)).
Having calculated all the error values associated with each unit (hidden and output), we can now
transfer this information into the weight changes Δij between units i and j. The calculation is as
follows: for weights wij between input unit Ii and hidden unit Hj, we add on:

Δij = η * δHj * xi

[Remembering that xi is the input to the i-th input node for example E; that η is a small value
known as the learning rate and that δHj is the error value we calculated for hidden node Hj using
the formula above].
For weights wij between hidden unit Hi and output unit Oj, we add on:

Δij = η * δOj * hi(E)

[Remembering that hi(E) is the output from hidden node Hi when example E is propagated
through the network, and that δOj is the error value we calculated for output node Oj using the
formula above].
Each alteration is added to the weights and this concludes the calculation for example E. The
next example is then used to tweak the weights further. As with perceptrons, the learning rate is
used to ensure that the weights are only moved a short distance for each example, so that the
training for previous examples is not lost. Note that the mathematical derivation for the above
calculations is based on the derivative of the sigmoid function that we saw above. For a full
description of this, see chapter 4 of Tom Mitchell's book "Machine Learning".
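
Pulling the calculations together, the following Python sketch transcribes the formulas above into one backpropagation update for a network with a single hidden layer (the data structures and names are illustrative choices, not a fixed recipe); the worked example in the next section traces the same calculation by hand:

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(w_ih, w_ho, x, target, eta=0.1):
    # One weight update for a single training example E.
    # w_ih[h][i] is the weight from input unit i to hidden unit h;
    # w_ho[o][h] is the weight from hidden unit h to output unit o.
    # Forward pass: record the hidden and output activations.
    h_out = {h: sigmoid(sum(ws[i] * x[i] for i in range(len(x)))) for h, ws in w_ih.items()}
    o_out = {o: sigmoid(sum(w * h_out[h] for h, w in ws.items())) for o, ws in w_ho.items()}
    # Error terms: deltaO = o(1-o)(t-o); deltaH = h(1-h) * sum over outputs of w * deltaO.
    d_out = {o: o_out[o] * (1 - o_out[o]) * (target[o] - o_out[o]) for o in o_out}
    d_hid = {h: h_out[h] * (1 - h_out[h]) * sum(w_ho[o][h] * d_out[o] for o in w_ho)
             for h in h_out}
    # Weight changes: Delta = eta * deltaHj * xi  and  Delta = eta * deltaOj * hi(E).
    for h in w_ih:
        for i in range(len(x)):
            w_ih[h][i] += eta * d_hid[h] * x[i]
    for o in w_ho:
        for h in w_ho[o]:
            w_ho[o][h] += eta * d_out[o] * h_out[h]
    return w_ih, w_ho

# The example network from section 13.1, trained on (10, 30, 20) with target category O1:
w_ih = {"H1": [0.2, -0.1, 0.4], "H2": [0.7, -1.2, 1.2]}
w_ho = {"O1": {"H1": 1.1, "H2": 0.1}, "O2": {"H1": 3.1, "H2": 1.17}}
backprop_step(w_ih, w_ho, [10, 30, 20], {"O1": 1, "O2": 0}, eta=0.1)
print(w_ih)
print(w_ho)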

13.3 A Worked Example


We will re-use the example from section 13.1, where our network originally looked like this:

and we propagated the values (10,30,20) through the network. When we did so, we observed the
following values:
Input units          Hidden units                               Output units

Unit   Output        Unit   Weighted Sum Input   Output         Unit   Weighted Sum Input   Output
I1     10            H1     7                    0.999          O1     1.0996               0.750
I2     30            H2     -5                   0.0067         O2     3.1047               0.957
I3     20
Suppose now that the target categorisation for the example was the one associated with O1. This
means that the network mis-categorised the example and gives us an opportunity to demonstrate
the backpropagation algorithm: we will update the weights in the network according to the
weight training calculations provided above, using a learning rate of η = 0.1.
If the target categorisation was associated with O1, this means that the target output for O1 was
1, and the target output for O2 was 0. Hence, using the above notation,
t1(E) = 1;    t2(E) = 0;    o1(E) = 0.750;    o2(E) = 0.957

That means we can calculate the error values for the output units O1 and O2 as follows:
δO1 = o1(E)(1 - o1(E))(t1(E) - o1(E)) = 0.750(1 - 0.750)(1 - 0.750) = 0.0469
δO2 = o2(E)(1 - o2(E))(t2(E) - o2(E)) = 0.957(1 - 0.957)(0 - 0.957) = -0.0394
We can now propagate this information backwards to calculate the error terms for the hidden
nodes H1 and H2. To do this for H1, we multiply the error term for O1 by the weight from H1 to
O1, then add this to the multiplication of the error term for O2 and the weight between H1 and
O2. This gives us: (1.1*0.0469) + (3.1*-0.0394) = -0.0706. To turn this into the error value for
H1, we multiply by h1(E)*(1-h1(E)), where h1(E) is the output from H1 for example E, as
recorded in the table above. This gives us:
δH1 = -0.0706 * (0.999 * (1-0.999)) = -0.0000705
A similar calculation for H2 gives the first part to be: (0.1*0.0469)+(1.17*-0.0394) = -0.0414,
and the overall error value to be:
δH2 = -0.0414 * (0.0067 * (1-0.0067)) = -0.000276
We now have all the information required to calculate the weight changes for the network. We
will deal with the 6 weights between the input units and the hidden units first:
Input unit   Hidden unit   η     δH           xi   Δ = η*δH*xi   Old weight   New weight
I1           H1            0.1   -0.0000705   10   -0.0000705    0.2          0.1999295
I1           H2            0.1   -0.000276    10   -0.000276     0.7          0.699724
I2           H1            0.1   -0.0000705   30   -0.0002115    -0.1         -0.1002115
I2           H2            0.1   -0.000276    30   -0.000828     -1.2         -1.200828
I3           H1            0.1   -0.0000705   20   -0.000141     0.4          0.399859
I3           H2            0.1   -0.000276    20   -0.000552     1.2          1.199448

We now turn to the problem of altering the weights between the hidden layer and the output
layer. The calculations are similar, but instead of relying on the input values from E, they use the
values calculated by the sigmoid functions in the hidden nodes: hi(E). The following table
calculates the relevant values:


Hidden unit   Output unit   η     δO        hi(E)    Δ = η*δO*hi(E)   Old weight   New weight
H1            O1            0.1   0.0469    0.999    0.004685         1.1          1.104685
H1            O2            0.1   -0.0394   0.999    -0.003936        3.1          3.096064
H2            O1            0.1   0.0469    0.0067   0.0000314        0.1          0.1000314
H2            O2            0.1   -0.0394   0.0067   -0.0000264       1.17         1.1699736

We note that the weights haven't altered all that much, so it might be a good idea in this situation
to use a bigger learning rate. However, remember that, with sigmoid units, small changes in the
weighted sum can produce big changes in the output from the unit.
As an exercise, check whether the re-trained network performs better with respect to the
example than the original network.

13.4 Avoiding Local Minima


The error rate of multi-layered networks over a training set could be calculated as the number of
mis-classified examples. Remembering, however, that there are many output nodes, all of which
could potentially misfire (e.g., giving a value close to 1 when it should have output 0, and vice versa), we can be more sophisticated in our error evaluation. In practice the overall network
error is calculated as:

Error = (1/2) * ΣE Σk (tk(E) - ok(E))²

This is not as complicated as it first appears. The calculation simply involves working out the
difference between the observed output for each output unit and the target output and squaring
this to make sure it is positive, then adding up all these squared differences for each output unit
and for each example.
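
A direct transcription of this error measure into Python might look like the following sketch, assuming the targets and observed outputs for each example are stored as dictionaries keyed by output unit:

def network_error(examples):
    # examples: list of (targets, observed) pairs, e.g.
    # ({'O1': 1, 'O2': 0}, {'O1': 0.750, 'O2': 0.957})
    total = 0.0
    for targets, observed in examples:
        for k in targets:
            total += (targets[k] - observed[k]) ** 2   # squared difference per output unit
    return 0.5 * total

# For the single worked example above:
print(network_error([({"O1": 1, "O2": 0}, {"O1": 0.750, "O2": 0.957})]))   # about 0.489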
Backpropagation can be seen as searching a space of network configurations (weights) in
order to find a configuration with the least error, measured in the above fashion. The more
complicated network structure means that the error surface which is searched can have local
minima, and this is a problem for multi-layer networks; we look at ways around it below.
Having said that, even if a learned network settles in a local minimum, it may still perform adequately,
and multi-layer networks have been used to great effect in real world situations (see Tom
Mitchell's book for a description of an ANN which can drive a car!)
One way around the problem of local minima is to use random re-start as described in the lecture
on search techniques. Different initial random weightings for the network may mean that it
converges to different local minima, and the best of these can be taken for the learned ANN.
Alternatively, as described in Mitchell's book, a "committee" of networks could be learned, with
the (possibly weighted) average of their decisions taken as an overall decision for a given test
example. Another alternative is to try and skip over some of the smaller local minima, as
described below.

Adding Momentum

Imagine a ball rolling down a hill. As it does so, it gains momentum, so that its speed increases
and it becomes more difficult to stop. As it rolls down the hill towards the valley floor (the
global minimum), it might occasionally wander into local hollows. However, it may be that the
momentum it has obtained keeps it rolling up and out of the hollow and back on track to the
valley floor.
The crude analogy describes one heuristic technique for avoiding local minima, called adding
momentum, funnily enough. The method is simple: for each weight, remember the previous
value of Δ which was added on to the weight in the last epoch. Then, when updating that weight
for the current epoch, add on a little of the previous Δ. How much of the previous Δ to add is
controlled by a parameter called the momentum, which is set to a value between 0 and 1.
To see why this might help bypass local minima, note that if the weight change carries on in the
direction it was going in the previous epoch, then the movement will be a little more pronounced
in the current epoch. This effect will be compounded as the search continues in the same
direction. When the trend finally reverses, then the search may be at the global minimum, in
which case it is hoped that the momentum won't be enough to take it anywhere other than where
it is. Alternatively, the search may be at a fairly narrow local minimum. In this case, even
though the backpropagation algorithm dictates that Δ will change direction, it may be that the
additional extra from the previous epoch (the momentum) may be enough to counteract this
effect for a few steps. These few steps may be all that is needed to bypass the local minimum.
In addition to getting over some local minima, when the gradient is constant in one direction,
adding momentum will increase the size of the weight change after each epoch, and the network
may converge quicker. Note that it is possible to have cases where (a) the momentum is not
enough to carry the search out of a local minimum or (b) the momentum carries the search out of
the global minimum into a local minimum. This is why this technique is a heuristic method and
should be used somewhat carefully (it is used in practice a great deal).
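
As a sketch (with names of my own choosing), the momentum-augmented update for a single weight can be written as follows, where alpha is the momentum parameter between 0 and 1 and each weight remembers the Δ it received in the previous epoch:

def update_weight_with_momentum(weight, grad_term, prev_delta, eta=0.1, alpha=0.9):
    # grad_term is the usual backpropagation term for this weight (e.g. deltaHj * xi);
    # prev_delta is the Delta that was added to this weight in the previous epoch.
    delta = eta * grad_term + alpha * prev_delta   # current change plus a little of the previous one
    return weight + delta, delta                   # new weight, and the delta to remember next time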

13.5 Overfitting Considerations


Left unchecked, back propagation in multi-layer networks can be highly susceptible to
overfitting itself to the training examples. The following graph plots the error on the training and
test set as the number of weight updates increases. It is typical of networks left to train
unchecked.

Alarmingly, even though the error on the training set continues to gradually decrease, the error
on the test set actually begins to increase towards the end. This is clearly overfitting, and it
relates to the network beginning to find and fine-tune to idiosyncrasies in the data, rather than to
general properties. Given this phenomenon, it would be unwise to use some kind of threshold for
the error as the termination condition for backpropagation.
In cases where the number of training examples is high, one antidote to overfitting is to split the
training examples into a set used to train the weights and a set held back as an internal
validation set. This is a mini test set, which can be used to keep the network in check: if the
error on the validation set reaches a minimum and then begins to increase, then it could be that
overfitting is beginning to occur.
Note that (time permitting) it is worth giving the training algorithm the benefit of the doubt as
much as possible. That is, the error on the validation set can also go through local minima, and it
is not wise to stop training as soon as the validation set error starts to increase, as a better
minimum may be achieved later on. Of course, if that minimum is never bettered, then the network
which is finally presented by the learning algorithm should be re-wound to be the one which
produced the minimum on the validation set.
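
The following sketch captures this "remember the best network seen on the validation set" idea; train_one_epoch and validation_error are hypothetical stand-ins for whatever training and evaluation routines are in use:

import copy

def train_with_validation(network, max_epochs, train_one_epoch, validation_error):
    best_error = float("inf")
    best_network = copy.deepcopy(network)
    for epoch in range(max_epochs):
        train_one_epoch(network)            # one backpropagation pass over the training set
        err = validation_error(network)     # error on the held-back validation set
        if err < best_error:                # remember the best network seen so far
            best_error = err
            best_network = copy.deepcopy(network)
    # Training continues past small rises in validation error, but the network
    # finally presented is "rewound" to the one with minimum validation error.
    return best_network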
Another way around overfitting is to decrease each weight by a small weight decay factor
during each epoch. Learned networks with large (positive or negative) weights tend to have
overfitted the data, because larger weights are needed to accommodate outliers in the data.
Hence, keeping the weights low with a weight decay factor may help to steer the network from
overfitting.

13.6 Appropriate Problems for ANN learning


As we did for decision trees, it's important to know when ANNs are the right representation
scheme for the job. The following are some characteristics of learning tasks for which artificial
neural networks are an appropriate representation:
1. The concept (target function) to be learned can be characterised in terms of a real-valued
function. That is, there is some translation from the training examples to a set of real
numbers, and the output from the function is either real-valued or (if a categorisation)
can be mapped to a set of real values. It's important to remember that ANNs are just
giant mathematical functions, so the data they play around with are numbers, rather than
logical expressions, etc. This may sound restrictive, but many learning problems can be
expressed in a way that ANNs can tackle them, especially as real numbers subsume booleans
(true and false mapped to +1 and -1) and integers, and vectors of these data types
can also be used.
2. Long training times are acceptable. Neural networks generally take a longer time to train
than, for example, decision trees. Many factors, including the number of training
examples, the value chosen for the learning rate and the architecture of the network, have
an effect on the time required to train a network. Training times can vary from a few
minutes to many hours.
3. It is not vitally important that humans be able to understand exactly how the learned
network carries out categorizations. As we discussed above, ANNs are black boxes and it
is difficult for us to get a handle on what their calculations are doing.
4. When in use for the actual purpose it was learned for, the evaluation of the target
function needs to be quick. While it may take a long time to learn a network to, for
instance, decide whether a vehicle is a tank, bus or car, once the ANN has been learned,
using it for the categorization task is typically very fast. This may be very important: if
the network was to be used in a battle situation, then a quick decision about whether the
object moving hurriedly towards it is a tank, bus, car or old lady could be vital.
In addition, neural network learning is quite robust to errors in the training data, because it is not trying
to learn exact rules for the task, but rather to minimize an error function.

Chapter-14
Inductive Logic Programming
Having studied a non-symbolic approach to machine learning (Artificial Neural
Networks), we return to a logical approach, namely Inductive Logic Programming
(ILP). As the name suggests, the representation scheme used in this approach is
logic programs, which we covered in lecture 6. As a quick overview, one search
strategy for ILP systems is to invert rules of deduction and therefore induce
hypotheses which may solve the learning problem.
In order to understand ILP, we will define a context for ILP, and use this to state the
machine learning problem being addressed. Following this, we will look at the
search operators in ILP, in particular the notion of inverting resolution in order to
generate hypotheses. We will consider how the search is undertaken and run
through a session with the Progol ILP system. We end by looking at some of the
applications of Inductive Logic Programming.

14.1 Problem Context and Specification


The development of Inductive Logic Programming has been heavily formal
(mathematical) in nature, because the major people in the field believe that this is
the only way to progress and to show progress. It means that we have to (re)introduce some notation, and we will use this to formally specify the machine
learning problem faced by ILP programs. To do this, we first need to refresh and re-represent our knowledge about logic programs, and define background, example
and hypothesis logic programs. Following this, we will specify some prior
conditions on the knowledge base that must be met before an agent attempts a
learning task. We will also specify some posterior conditions on the learned
hypothesis, in such a way that, given a problem satisfying the prior conditions, if
our learning agent finds a hypothesis which satisfies the posterior conditions, it will
have solved the learning task.

Logic Programs

Logic programs are a subset of first order logic. A logic program contains a set of
Horn clauses, which are implications in which a conjunction of
literals implies a single literal. Hence a logic program consists of implications
which look like this example:
∀ X, Y, Z ( b1(X,Y) ∧ b2(X) ∧ ... ∧ bn(X,Y,Z) → h(X,Y) )

Remember also that, in Prolog, we turn the implication around so that it points at the
head of the clause, which comes first. We also assume universal quantification over all
our literals, so the ∀ can be removed. Hence we can write Horn clauses like this:

h(X,Y) ← b1(X,Y) ∧ b2(X) ∧ ... ∧ bn(X,Y,Z)

and everybody understands what we are saying. We will also adopt the convention
of writing a conjunction of literals with a capital letter, and a single literal with a
lower case letter. Hence, if we were interested in the first literal in the body of the
above Horn clause, but not interested in the others, then we could write:

h(X,Y) ← b1, B

We see that the conjunction of literals b2(X) ∧ ... ∧ bn(X,Y,Z) has been replaced by
B, and we have used a comma instead of a ∧ sign.

Also, we need to specify when one logic program can be deduced from another. We
use the entails sign ⊨ to denote this. If logic program L1 can be proved to be true
using logic program L2, we write: L2 ⊨ L1. We use the symbol ⊭ to denote that one
logic program does not entail another. It is important to understand that if L2 ⊭ L1,
this does not mean that L2 entails that L1 is false, only that L2 cannot be used to
prove that L1 is true.
Note also that, because we have restricted our representation language to logic
programs, we can use a Prolog interpreter to prove the entailment of one logic
program from another. As a final notation, it is important to remember that a logic
program can contain just one Horn clause, and that the Horn clause could have no
body, in which case the head of the clause is a known fact about the domain.

Background, Examples and Hypothesis

We will start off with three logic programs. Firstly, we will have the logic program
representing a set of positive examples for the concept we wish to be learned, and
we denote the set of examples E+. Secondly, we will have a set of negative
examples for the concept we wish to be learned, labelled E-. Thirdly, we will have a
set of Horn clauses which provide background concepts, and we denote this logic
program B. We will denote the logic program representing the learned hypothesis
H.

Normally, E+ and E- will be ground facts, i.e., Horn clauses with no body. In this
case, we can prove that an example of E follows from the hypothesis, as they are all
still logic programs. When an example (positive or negative) is proved to be true
using a hypothesis H, we say that H (taken along with B) explains the example.

Prior Conditions

Firstly, we must make sure that our problem has a solution. If one of the negative
examples can be proved to be true from the background information alone, then
clearly any hypothesis we find will not be able to compensate for this, and the
problem is not satisfiable. Hence, we need to check the prior satisfiability of the
problem:
∀ e in E- (B ⊭ e).

Any learning problem which breaks the prior satisfiability condition has
inconsistent data, so the user should be made aware of this. Note that this condition
does not mean that B entails that any negative example is false, so it is certainly
possible to find a hypothesis which, along with B entails a negative example.
In addition to checking whether we will be able to find a solution to the problem,
we also have to check that the problem isn't solved already by the background
information. That is, if the problem satisfies the prior satisfiability condition, and
each positive example is entailed by the background information, then the
background logic program B would itself perfectly solve the problem. Hence, we
need to check that at least one positive example cannot be explained by the
background information B. We call this condition the prior necessity condition:
∃ e in E+ (B ⊭ e).

Posterior Conditions

Given a problem which satisfies the prior conditions, we define here two properties
that the hypothesis learned by our agent, H, will satisfy if it solves the concept
learning problem. Firstly, H should satisfy the posterior satisfiability condition
that, taken together with the background logic program, it does not entail any
negative example:
∀ e in E- ((B ∧ H) ⊭ e).

Also, we must check that all the positive examples are entailed if we take the
background program in conjunction with the hypothesis. This is called the
posterior sufficiency condition:
∀ e in E+ ((B ∧ H) ⊨ e).

It should be obvious that any hypothesis satisfying the two posterior conditions will
be a perfect solution to the learning problem.

Problem Specification

Given the above context for ILP, we can state the learning problem as follows: we
are given a set of positive and a set of negative examples represented as logic
programs E+ and E- respectively, and some background clauses making up a logic
program B. These logic programs satisfy the two prior conditions. Then the learning
problem is to find a logic program, H, such that H, B, E+ and E- satisfy the posterior
conditions.

Pruning and Sorting

Because we can test whether each hypothesis explains (entails) a particular
example, we can associate to a hypothesis a set of positive elements that it explains
and a similar set of negative elements. There is also a similar analogy with general
and specific hypotheses as described above: if a hypothesis G is more general than
hypothesis S, then the examples explained by S will be a subset of those explained
by G.
We will assume the following generic search strategy for an ILP system: (i) a set of
current hypotheses is maintained, QH; (ii) at each step in the search, a hypothesis H
is taken from QH and some inference rules are applied to it in order to generate some
new hypotheses which are then added to the set (we say that H has been expanded);
(iii) this continues until a termination criterion is met.
This leaves many questions unanswered. Looking first at the question of which
hypothesis to expand at a particular stage, ILP systems associate a label with each
hypothesis generated which expresses a probability of the hypothesis holding given
that the background knowledge and examples are true. Then, hypotheses with a
higher probability are expanded before those with a lower probability, and
hypotheses with zero probability are pruned from the set QH entirely. This
probability calculation is derived using Bayesian mathematics and we do not go
into the derivation here. However, we hint at two aspects of the calculation in the
paragraphs below.
In specific to general ILP systems, the inference rules are inductive, so each
operator takes a hypothesis and generalizes it. As mentioned above, this means that
the hypothesis generated will explain more examples than the original hypothesis.
As the search gradually makes hypotheses more general, there will come a stage
when a newly formed hypothesis H is general enough to explain a negative example, e-. This
should therefore score zero for the probability calculation because
it cannot possibly hold given the background and examples being true. Furthermore,
because the operators only generalize, there is no way by which H can be fixed to
not explain e-, so pruning it from QH because of the zero probability score is a good
decision.
A similar situation occurs in general to specific ILP systems, where the inference
rules are deductive, hence they specialize. At some stage, a hypothesis will become
so specialized that it fails to explain all the positive examples. In this case, a similar
pruning operation can be imposed because further specialization will not rectify the
situation. Note that in practice, to compensate for noisy data, there is more
flexibility built into the systems. In particular, the posterior conditions which
specify the problem can be relaxed, and hypotheses which explain
small numbers of negative examples may not be immediately dropped.
We can see how the examples could be used to choose between two non-pruned
hypotheses: if performing a specific to general search, then the number of positive
examples explained by a hypothesis can be taken as a value to sort the hypotheses
with (more positive examples explained being better). Similarly, if performing a
general to specific search, then the number of negatives still explained by a
hypothesis can be taken as a value to sort the hypotheses with (fewer negatives
being better).
This may, however, be a very crude measure, because many hypotheses might score
the same, especially if there is a small number of examples. When all things are
equal, an ILP system may employ a sophisticated version of Occam's razor, and
choose between two equal scoring hypotheses according to some function derived
from Algorithmic Complexity theory or some similar theory.

Chapter-16
Constraint Satisfaction Problems
I was perhaps most proud of AI on a Sunday. On this particular Sunday, a friend of
mine found an article in the Observer about the High-IQ society, a rather brash and
even more elitist version of Mensa. Their founder said that their entrance test was
so difficult that some of the problems had never been solved. The problem given
below was in the Observer as such an unsolved problem. After looking at it for a
few minutes, I confidently told my friend that I would have the answer in half an
hour.
After just over 45 minutes, I did indeed have an answer, and my friend was suitably
impressed. See the end of these notes for the details. Of course, I didn't spend my
time trying to figure it out (if you want to split the atom, you don't sharpen a knife).
Instead, I used the time to describe the problem to a constraint solver, which is
infinitely better at these things than me. The constraint solver is part of good old
Sicstus Prolog, so specifying the problem was a matter of writing it as a logic
program - it's worth pointing out that I didn't specify how to find the solution, just
what the problem was. With AI programming languages such as Prolog, every now
and then the intelligence behind the scenes comes in very handy. Once I had
specified the problem to the solver (a mere 80 lines of Prolog), it took only one
hundredth of a second to solve the problem. So not only can the computer solve a
problem which had beaten many high IQ people, it could solve 100 of these
"difficult" problems every second. A great success for AI. In this lecture, we will
look at how constraint solving works in general. Much of the material here is taken
from Barbara Smith's excellent tutorial on Constraint Solving which is available
here:

16.1 Specifying Constraint Problems


As with most successful AI techniques, constraint solving is all about solving
problems: somehow phrase the intelligent task you are interested in as a problem,
then massage it into the format of a constraint satisfaction problem (CSP), put it
into a constraint solver and see if you get a solution. CSPs consist of the following
parts:

A set of variables X = {x1, x2, ..., xn}


A finite set of values that each variable can take. This is called the domain of the
variable. The domain of variable xi is written Di
A set of constraints that specifies which values the variables can take
simultaneously

In the high-IQ problem above, there are 25 variables: one for each of the 24 smaller
square lengths, and one for the length of the big square. If we say that the smallest
square is of length 1, then the big square is perhaps of length at most 1000. Hence
the variables can each take values in the range 1 to 1000. There are many
constraints in this problem, including the fact that each length is different, and that
certain ones add up to give other lengths, for example the lengths of the three
squares along the top must add up to the length of the big square.
Depending on what solver you are using, constraints are often expressed as
relationships between variables, e.g., x1 + x2 < x3. However, to be able to discuss
constraints more formally, we use the following notation:
A constraint Cijk specifies which tuples of values variables xi, xj and xk ARE
allowed to take simultaneously. In plain English, a constraint normally talks about
things which can't happen, but in our formalism, we are looking at tuples (vi, vj, vk)
which xi, xj and xk can take simultaneously. As a simple example, suppose we have
a CSP with two variables x and y, and that x can take values {1,2,3}, whereas y can
take values {2,3}. Then the constraint that x=y would be written as:
Cxy={(2,2), (3,3)},
and the constraint that x<y would be written as
Cxy = {(1,2),(1,3),(2,3)}
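
As an illustration of this formulation, the small two-variable CSP above could be written out directly in Python as follows (the representation is a sketch of my own, not the input format of any particular solver):

# Variables, domains and extensional (allowed-tuple) constraints for the example CSP.
variables = ["x", "y"]
domains = {"x": {1, 2, 3}, "y": {2, 3}}

# Each constraint lists the value pairs the two variables may take simultaneously.
constraints = {
    ("x", "y"): {(2, 2), (3, 3)},            # the constraint x = y
    # ("x", "y"): {(1, 2), (1, 3), (2, 3)},  # alternatively, the constraint x < y
}

def satisfies(assignment):
    # Check a complete assignment (a dict) against every constraint.
    return all((assignment[a], assignment[b]) in allowed
               for (a, b), allowed in constraints.items())

print(satisfies({"x": 3, "y": 3}))  # True under the x = y constraint
print(satisfies({"x": 1, "y": 2}))  # False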
A solution to a CSP is an assignment of values, one to each variable in such a way
that no constraint is broken. It depends on the problem at hand, but the user might
want to know that there is a solution, i.e., they will take the first answer given.
Alternatively, they may require all the solutions to the problem, or they might want
to know that no solutions exists. Sometimes, the point of the exercise is to find the
optimum solution based on some measure of worth. Sometimes, it's possible to do
this without enumerating all the solutions, but other times, it will be necessary to
find all solutions, then work out which is the optimum. In the high-IQ problem, a
solution is simply a set of lengths, one per square. The shaded one is the 17th
biggest, which answers the IQ question.

16.2 Binary Constraints


Unary constraints specify that a particular variable can take certain values, which
basically restricts the domain for that variable, and hence should be taken care of
when specifying the CSP. Binary constraints relate two variables, and binary
constraint problems are special CSPs which involve only binary constraints.
Binary CSPs have a special place in the theory because all CSPs can be written as
binary CSPs (we don't go into the details of this here, and while it is possible in
theory to do so, in practice, the translation is rarely used). Also, binary CSPs can be
represented both graphically and using matrices, which can make them easier to
understand.
Binary constraint graphs such as the one below afford a nice representation of
constraint problems, where the nodes are the variables and each edge represents the
constraint between the two variables it joins
(remember that the constraints state which values can be taken at the same time).

Binary constraints can also be represented as matrices, with a single matrix for each
constraint. For example, in the above constraint graph, the constraint between
variables x4 and x5 is {(1,3),(2,4),(7,6)}. This can be represented as the following
matrix.
We see that the asterisks mark the entry (i,j) in the table such that variable x4 can
take value i at the same time that variable x5 takes value j. As all CSPs can be
written as binary CSPs, the artificial generation of random binary CSPs as a set of
matrices is often used to assess the relative abilities of constraint solvers. However,
it should be noted that in real world constraint problems, there is often much more
structure to the problems than you get from such random constructions.
A very commonly used example CSP, which we will use in the next section, is the
"n-queens" problem, which is the problem of placing n queens on a chess board in
such a way that no one threatens another along the vertical, horizontal or diagonal.
We've seen this in previous lectures. There are many possibilities for representing
this as a CSP (in fact, finding the best specification of a problem so that a solver
gets the answer as quickly as possible is a highly skilled art). One possibility is to
have the variables representing the rows and the values they can take representing
the columns on the row that a queen was situated on. If we look at the following
solution to the 4-queens problem below:

Then, counting rows from the top downwards and columns from the left, the
solution would be represented as: X1=2, X2=4, X3=1, X4=3. This is because the
queen on row 1 is in column 2, the queen in row 2 is in column 4, the queen in row
3 is in column 1 and the queen in row 4 is in column 3. The constraint between
variable X1 and X2 would be:
C1,2 = {(1,3),(1,4),(2,4),(3,1),(4,1),(4,2)}
As an exercise, work out exactly what the above constraint is saying.
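
The constraint can also be generated mechanically: queens on adjacent rows are compatible exactly when they are in different columns and not diagonally adjacent. A small Python sketch that produces C1,2 for the 4-queens problem:

n = 4
# Columns i and j are allowed for the queens on rows 1 and 2 when they neither share
# a column nor sit on a common diagonal (for adjacent rows, a column difference of 1).
C12 = {(i, j) for i in range(1, n + 1) for j in range(1, n + 1)
       if i != j and abs(i - j) != 1}
print(sorted(C12))  # [(1, 3), (1, 4), (2, 4), (3, 1), (4, 1), (4, 2)]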

16.3 Arc Consistency


There have been many advances in how constraint solvers search for solutions
(remember this means an assignment of a value to each variable in such a way that
no constraint is broken). We look first at a pre-processing step which can greatly
improve efficiency by pruning the search space, namely arc-consistency. Following
this, we'll look at two search methods, backtracking and forward checking which
keep assigning values to variables until a solution is found. Finally, we'll look at
some heuristics for improving the efficiency of the solver, namely how to order the
choosing of the variables, and how to order the assigning of the values to variables.
The pre-processing routine for binary constraints known as arc-consistency
involves calling a pair (xi, xj) an arc and noting that this is an ordered pair, i.e., it is
not the same as (xj, xi). Each arc is associated with a single constraint Cij, which
constrains variables xi and xj. We say that the arc (xi, xj) is consistent if, for all
values a in Di, there is a value b in Dj such that the assignment xi=a and xj=b
satisfies constraint Cij. Note that (xi, xj) being consistent doesn't necessarily mean
that (xj, xi) is also consistent. To use this in a pre-processing way, we take every pair
of variables and make it arc-consistent. That is, we take each pair (xi, xj) and remove
values from Di which make it inconsistent, until it becomes consistent. This
effectively removes values from the domains of variables, hence prunes the search
space and makes it likely that the solver will succeed (or fail to find a solution)
more quickly.
To demonstrate the worth of performing an arc-consistency check before starting a
search for a solution, we'll use an example from Barbara Smith's tutorial. Suppose
that we have four tasks to complete, A, B, C and D, and we're trying to schedule
them. They are subject to the constraints that:

Task A lasts 3 hours and precedes tasks B and C


Task B lasts 2 hours and precedes task D
Task C lasts 4 hours and precedes task D
Task D lasts 2 hours

We will model this problem with a variable for each of the task start times, namely
startA, startB, startC and startD. We'll also have a variable for the overall start
time: start, and a variable for the overall finishing time: finish. We will say that the
domain for variable start is {0}, but the domains for all the other variables are
{0,1,...,11}, because the sum of the durations of the tasks is 3 + 2 + 4 + 2 =
11. We can now translate our English specification of the constraints into our
formal model. We start with an intermediate translation thus:

start ≤ startA
startA + 3 ≤ startB
startA + 3 ≤ startC
startB + 2 ≤ startD
startC + 4 ≤ startD
startD + 2 ≤ finish

Then, by thinking about the values that each pair of variables can take
simultaneously, we can write the constraints as follows:

Cstart,startA = {(0,0), (0,1), (0,2), ..., (0,11)}


CstartA,start = {(0,0), (1,0), (2,0), ..., (11,0)}
CstartA,startB = {(0,3), (0,4), ..., (0,11), (1,4), (1,5), ..., (8,11)}
etc.

Now, we will check whether each arc is arc-consistent, and if not, we will remove
values from the domains of variables until we get consistency. We look first at the
arc (start, startA) which is associated with the constraint {(0,0), (0,1), (0,2), ...,
(0,11)} above. We need to check whether there is any value, P, in Dstart that does not
have a corresponding value, Q, such that (P,Q) satisfies the constraint, i.e., appears
in the set of assignable pairs. As Dstart is just {0}, we are fine. We next look at the
arc (startA, start), and check whether there is any value in DstartA, P, which doesn't
have a corresponding Q such that (P,Q) is in CstartA,start. Again, we are OK, because
all the values in DstartA appear in CstartA,start.
If we now look at the arc (startA, startB), then the constraint in question is: {(0,3),
(0,4), ..., (0,11), (1,4), (1,5), ..., (8,11)}. We see that there is no pair of the form
(9,Q) in the constraint, similarly no pair of the form (10,Q) or (11,Q). Hence, this
arc is not arc-consistent, and we have to remove the values 9, 10 and 11 from the
domain of startA in order to make the arc consistent. This makes sense, because we
know that, if task B is going to start after task A, which has duration 3 hours, and
they are all going to have started by the eleventh hour, then task A cannot start after
the eighth hour. Hence, we can - and do - remove the values 9, 10 and 11 from the
domain of startA.
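
The domain-pruning step for a single arc can be written as a small "revise" routine. The sketch below (my own naming, with constraints stored as sets of allowed pairs, as above) removes exactly the values 9, 10 and 11 from the domain of startA when revising the arc (startA, startB):

def revise(domains, constraints, xi, xj):
    # Make the arc (xi, xj) consistent by removing unsupported values from Di.
    # constraints[(xi, xj)] is the set of allowed (value-for-xi, value-for-xj) pairs.
    allowed = constraints[(xi, xj)]
    supported = {a for a in domains[xi]
                 if any((a, b) in allowed for b in domains[xj])}
    removed = domains[xi] - supported
    domains[xi] = supported
    return removed

domains = {"startA": set(range(12)), "startB": set(range(12))}
constraints = {("startA", "startB"): {(a, b) for a in range(12) for b in range(12)
                                      if a + 3 <= b}}
print(sorted(revise(domains, constraints, "startA", "startB")))  # [9, 10, 11]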
This method of removing values from domains is highly effective. As reported in
Barbara Smith's tutorial, the domains become quite small, as reflected in the
following scheduling network:

We see that the largest domain size has only 5 values in it, which means that quite a
lot of the search space has been pruned. In practice, to remove as many values as
possible in a CSP which is dependent on precedence constraints, we have to work
backwards, i.e., look at the start time of the task, T, which must occur last, then
make each arc of the form (startT, Y) consistent for every variable Y. Following
this, move on to the task which must occur second to last, etc. In CSPs which only
involve precedence constraints, arc-consistency is guaranteed to remove all values
which cannot appear in a solution to the CSP. In general, however, we cannot make
such a guarantee, but arc-consistency usually has some effect on the initial
specification of a problem.

16.4 Search Methods and Heuristics


We now come to the question of how constraint solvers search for solutions - constraint-preserving assignments of values to variables - to the CSPs they are
given. The most obvious approach is to use a depth first search: assign a value to
the first variable and check that this assignment doesn't break any constraints. Then,
move on to the next variable, assign it a value and check that this doesn't break any
constraints, then move on to the next variable and so on. When an assignment does
break a constraint, then choose a different value for the assignment until one is
found which satisfies the constraints. If one cannot be found, then this is when the
search must backtrack. In such a situation, the previous variable is looked at again,
and the next value for it is tried. In this way, all possible sets of assignments will be
tried, and a solution will be found. The following search diagram - taken from
Smith's tutorial paper - shows how the search for a solution to the 4-queens problem
progresses until it finds a solution:

We see that the first time it backtracks is after the failure to put a queen in row three
given queens in positions (1,1) and (2,3). In this case, it backtracked and moved the
queen in (2,3) to (2,4). Eventually, this didn't work out either, so it had to backtrack
further and moved the queen in (1,1) to (1,2). This led fairly quickly to a solution.
To add some sophistication to the search method, constraint solvers use a technique
known as forward checking. The general idea is to work the same as a
backtracking search, but, when checking compliance with constraints after
assigning a value to a variable, the agent also checks whether this assignment is
going to break constraints with future variable assignments. That is, supposing that
Vc has been assigned to the current variable c, then for each unassigned variable xi,
(temporarily) remove all values from Di which, along with Vc, break a constraint. It
may be that in doing so, Di becomes empty. This means that the choice of Vc for the
current variable is bad - it will not find its way into a solution to the problem,
because there's no way to assign a value to xi without breaking a constraint. In such
a scenario, even though the assignment of Vc may not break any constraints with
already assigned variables, a new value is chosen (or backtracking occurs if there
are no values left), because we know that Vc is a bad assignment.
The following diagram (again, taken from Smith's tutorial) shows how forward
checking improves the search for a solution to the 4-queens problem.
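
A compact Python sketch of backtracking search with forward checking for n-queens, using rows as variables and columns as values exactly as in the formulation above (the code is illustrative, not a particular solver's algorithm):

def compatible(row_i, col_i, row_j, col_j):
    # Two queens are compatible if they share neither a column nor a diagonal.
    return col_i != col_j and abs(col_i - col_j) != abs(row_i - row_j)

def solve(n, assignment=None, domains=None):
    assignment = assignment or {}
    domains = domains or {row: set(range(1, n + 1)) for row in range(1, n + 1)}
    if len(assignment) == n:
        return assignment
    row = min(r for r in domains if r not in assignment)     # next unassigned variable
    for col in sorted(domains[row]):
        # Forward checking: prune the domains of future variables given row = col.
        pruned = {r: {c for c in domains[r] if compatible(row, col, r, c)}
                  for r in domains if r not in assignment and r != row}
        if all(pruned.values()):                              # no future domain emptied
            result = solve(n, {**assignment, row: col}, {**domains, **pruned, row: {col}})
            if result:
                return result
    return None                                               # forces backtracking

print(solve(4))   # {1: 2, 2: 4, 3: 1, 4: 3}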

In addition to forward checking to improve the intelligence of the constraint solving


agent, there are some possibilities for a heuristic search. Firstly, our agent can
worry about the order in which it looks at the variables, e.g., in the 4-queens
problem, it might try to put a queen in row 2, then one in row 3, one in row 1 and
finally one in row 4. A solver taking such care is said to be using a variable-ordering heuristic. The ordering of variables can be done before a search is started
and rigidly adhered to during the search. This might be a good idea if there is extra
knowledge about the problem, e.g., that a particular variable should be assigned a
value sooner rather than later. Alternatively, the ordering of the variables can be
done dynamically, in response to some information gathered about how the search
is progressing during the search procedure.
One such dynamic ordering procedure is called "fail-first forward checking". The
idea is to take advantage of information gathered while forward checking during
search. In cases where forward checking highlights the fact that a future domain is
effectively emptied, then this signals that it's time to change the current assignment.
However, in the general case, the domain of the variable will be reduced but not
necessarily emptied. Suppose that of all the future variables, xf has the most values
removed from Df. The fail-first approach specifies that our agent should choose to
assign values to xf next. The thinking behind this is that, with fewer possible
assignments for xf than the other future variables, we will find out most quickly
whether we are heading down a dead-end. Hence, a better name for this approach
would be "find out if its a dead end quickest". However, this isn't as catchy a phrase
as "fail-first".
An alternative/addition to variable ordering is value ordering. Again, we could
specify in advance the order in which values should be assigned to variables, and
this kind of tweaking of the problem specification can dramatically improve search
time. We can also perform value ordering dynamically: suppose that it's possible to
assign values Vc, Vd and Ve to the current variable. Further suppose that, when
looking at all the future variables, the total number of values in their domains
reduces to 300, 20 and 50 for Vc, Vd and Ve respectively. We could then specify
that our agent assigns Vc at this stage in the search, because it has retained the greatest
number of values in the future domains. This is different from variable ordering in
two important ways:

If this is a dead end then we will end up visiting all the values for this variable
anyway, so fail-first does not make sense for values. Rather, we try and keep our
options open as much as possible, as this will help if there is a solution ahead of
us.
Unlike the variable ordering heuristics, this heuristic carries an extra cost on top of
forward checking, because the reduction in domain sizes of future variables for
every assignment of the current variable needs to be checked. Hence, it is possible
that this kind of value ordering will slow things down. In practice, this is what
happens for randomly constructed binary CSPs. On occasions, however, it can
sometimes be a very good idea to employ dynamic value ordering.

Chapter-17
Genetic Algorithms
The evolutionary approach to Artificial Intelligence is one of the neatest ideas of
all. We have tried to mimic the functioning of the brain through neural networks,
because - even though we don't know exactly how it works - we know that the brain
does work. Similarly, we know that mother nature, through the process of
evolution, has solved many problems, for instance the problem of getting animals to
walk around on two feet (try getting a robot to do that - it's very difficult). So, it
seems like a good idea to mimic the processes of reproduction and survival of the
fittest to try to evolve answers to problems, and maybe in the long run reach the
holy grail of computers which program themselves by evolving programs.


Evolutionary approaches are simple in conception:

generate a population of possible answers to the problem at hand


choose the best individuals from the population (using methods inspired by
survival of the fittest)
produce a new generation by combining these best ones (using techniques
inspired by reproduction)
stop when the best individual of a generation is good enough (or you run out of
time)

Perhaps the first landmark in the history of the evolutionary approach to computing
was John Holland's book "Adaptation in Natural and Artificial Systems", where he
developed the idea of the genetic algorithm as searching via sampling hyperplane
partitions of the space. It's important to remember that genetic algorithms (GAs),
which we look at in this lecture, and genetic programming (GP), which we look at
in the next lecture, are just fancy search mechanisms which are inspired by
evolution. In fact, using Tom Mitchell's definition of a machine learning system
being one which improves its performance through experience, we can see that
evolutionary approaches can be classed as machine learning efforts. Historically,
however, it has been more common to categorise evolutionary approaches together
because of their inspiration rather than their applications (to learning and discovery
problems).
As we will see, evolutionary approaches boil down to (i) specifying how to
represent possible problem solutions and (ii) determining how to choose which
partial solutions are doing the best with respect to solving the problem. The main
difference between genetic algorithms and genetic programming is the choice of
representation for problem solutions. In particular, with genetic algorithms, the
format of the solution is fixed, e.g., a fixed set of parameters to find, and the
evolution occurs in order to find good values for those parameters. With genetic
programming, however, the individuals in the population of possible solutions are
actually individual programs which can increase in complexity, so are not as
constrained as in the genetic algorithm approach.

17.1 The Canonical Genetic Algorithm


As with all search techniques, one of the first questions to ask with GAs is how to
define a search space which potentially contains good solutions to the problem at
hand. This means answering the question of how to represent possible solutions to
the problem. The classical approach to GAs is to represent the solutions as strings
of ones and zeros, i.e., bit strings. This is not such a bad idea, given that computers
store everything as bit strings, so any solution would eventually boil down to a
string of ones and zeros. However, there have been many modifications to the
original approach to genetic algorithms, and GA approaches now come in many
different shapes and sizes, with higher level representations. Indeed, it's possible to
see genetic programming, where the individuals in the population are programs, as
just a GA approach with a more complicated representation scheme.
Returning to the classical approach, as an example, if solving a particular problem
involved finding a set of five integers between 1 and 100, then the search space for
a GA would be bit strings where the first eight bits are decoded as the first integer,
the next eight bits become the second integer and so on. Representing the solutions
is one of the tricky parts to using genetic algorithms, a problem we come back to
later. However, suppose that the solutions are represented as strings of length L.
Then, in the standard approach to GAs, known as the canonical genetic algorithm,
the first stage is to generate an initial random population of bit strings of length L.
By random, we mean that the ones and zeros in the strings are chosen at random.
Occasionally, the initialisation procedure is carried out with a little more
intelligence, e.g., using some additional knowledge about the domain to choose the
initial population.
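
As an illustration of this encoding, here is a Python sketch for the five-integer example. Mapping each eight-bit value (0 to 255) into the range 1 to 100 with a modulo is an assumption made here for concreteness, since the text does not specify how out-of-range values should be handled:

import random

BITS_PER_INT = 8             # eight bits per integer, as in the example above
NUM_INTS = 5                 # the solution is a set of five integers
L = BITS_PER_INT * NUM_INTS  # total bit-string length (here 40)

def random_individual():
    # A random bit string of length L, stored as a list of ones and zeros.
    return [random.randint(0, 1) for _ in range(L)]

def decode(bits):
    # Decode the bit string into five integers in the range 1 to 100.
    integers = []
    for i in range(NUM_INTS):
        chunk = bits[i * BITS_PER_INT:(i + 1) * BITS_PER_INT]
        value = int("".join(str(b) for b in chunk), 2)  # 0..255
        integers.append(value % 100 + 1)                # assumption: wrap into 1..100
    return integers

# The initial random population for the canonical genetic algorithm.
population = [random_individual() for _ in range(100)]
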
After the initialisation step, the canonical genetic algorithm proceeds iteratively
using selection, mating, and recombination processes, then checking for termination.

In the next section, we look in detail at how individuals are selected, mated,
recombined (and mutated for good measure). Termination of the algorithm may
occur if one or more of the best individuals in the current generation performs well
enough with respect to the problem, with this performance specified by the user.
Note that this termination check may be related to, or even the same as, the evaluation
function (discussed later), but it may also be something entirely different.
There may not be a definitive answer to the problem you're looking at, and it may
only be possible to evolve solutions which are as good as time and resources allow. In
this case, it may not be obvious when to stop, and it may be a good idea to produce
as many generations as the available computing/time resources permit; the termination
criterion is then a specific time limit or a specific number of generations.
It is very important to note that the best individual
in your final population may not be as good as the best individual in a previous
generation (GAs do not perform hill-climbing searches, so it is perfectly possible
for generations to degrade). Hence GAs should record the best individuals from
every generation, and, as a final solution presented to the user, they should output
the best solution found over all the generations.

17.2 Selection, Mating, Recombination and Mutation


So, the point of GAs is to generate population after population of individuals which
represent possible solutions to the problem at hand in the hope that one individual
in one generation will be a good solution. We look here at how to produce the next
generation from the current generation. Note that there are various models for
whether to kill off the previous generation, or allow some of the fittest individuals
to stay alive for a while - we'll assume a culling of the old generation once the new
one has been generated.

Selection

The first step is to choose the individuals which will have a shot at becoming the
parents of the next generation. This is called the selection procedure, and its
purpose is to choose those individuals from the current population which will go
into an intermediate population (IP). Only individuals in this intermediate
population will be chosen to mate with each other (and there's still no guarantee that
they'll be chosen to mate, or that if they do mate, they will be successful - see later).
To perform the selection, the GA agent will require a fitness function. This will
assign a real number to each individual in the current generation. From this value,
the GA calculates the number of copies of the individual which are guaranteed to go
into the intermediate population and a probability which will be used to determine
whether an additional copy goes into the IP. To be more specific, if the value
calculated by the fitness function is split into its integer part and its fractional part,
then the integer part dictates the number of copies of the individual which are
guaranteed to go into the IP, and the fractional part is used as a probability: one more
copy of the individual is added to the IP with that probability. For example, if the
fractional part was 1/6, then a random number between 1 and 6 would be generated,
and another copy would be added only if it came up six.
The fitness function will use an evaluation function to calculate a value of worth
for each individual so that individuals can be compared against each other. Often the
evaluation function is written g(c) for a particular individual c. Correctly specifying
such evaluation functions is a tricky job, which we look at later. The fitness of an
individual is calculated by dividing the value it gets for g by the average value for g
over the entire population:
fitness(c) = g(c)/(average of g over the entire population)
We see that every individual has at least a chance of going into the intermediate
population unless they score zero for the evaluation function.
As an example of a fitness function using an evaluation function, suppose our GA
agent has calculated the evaluation function for every member of the population,
and the average is 17. Then, for a particular individual c0, the value of the
evaluation function is 25. The fitness of c0 would be calculated as 25/17 =
1.47. This means that one copy of c0 will definitely be added to the IP, and another
copy will be added with a probability of 0.47 (e.g., a 100-sided die is thrown, and
only if it returns 47 or less is another copy of c0 added to the IP).
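
A Python sketch of this selection procedure is given below; it assumes an evaluation function g has already been written for the problem at hand, and simply applies the integer-part/fractional-part rule described above:

import random

def select_intermediate_population(population, g):
    # Fitness of each individual is its evaluation score divided by the average score.
    scores = [g(c) for c in population]
    average = sum(scores) / len(scores)
    ip = []
    for individual, score in zip(population, scores):
        fitness = score / average       # e.g. 25 / 17 = 1.47
        copies = int(fitness)           # guaranteed copies (here 1)
        if random.random() < fitness - copies:
            copies += 1                 # one extra copy with probability 0.47
        ip.extend([individual] * copies)
    return ip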

Mating

Once our GA agent has chosen the individuals lucky enough (actually, fit enough)
to produce offspring, we next determine how they are going to mate with each
other. To do this, pairs are simply chosen randomly from the set of potential
parents. That is, one individual is chosen randomly, then another - which may be the
same as the first - is chosen, and that pair is lined up for the reproduction of one or
more offspring (dependent on the recombination techniques used). Then whether or
not they actually reproduce is probabilistic, and occurs with a probability p c. If they
do reproduce, then their offspring are generated using a recombination and mutation
procedure as described below, and these offspring are added to the next generation.
This continues until the required number of offspring has been produced. Often this
required number is the same as the current population size, to
keep the population size constant. Note that there are repeated individuals in the IP,
so some individuals may become the proud parent of multiple children.
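
A Python sketch of this mating step follows; the recombine function (which returns a pair of offspring, and is described in the next section) is passed in as a parameter, and the default reproduction probability of 0.7 is an assumed value for illustration, not taken from the text:

import random

def mate(ip, recombine, p_c=0.7, required=100):
    offspring = []
    while len(offspring) < required:
        mum = random.choice(ip)    # pairs are chosen at random from the IP;
        dad = random.choice(ip)    # the two parents may be the same individual
        if random.random() < p_c:  # they actually reproduce with probability p_c
            offspring.extend(recombine(mum, dad))
    return offspring[:required]    # trim in case the last pair overshoots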

This mating process has some analogy with natural evolution, because sometimes
the fittest organisms may not have the opportunity to find a mate, and even if they
do find a mate, it's not guaranteed that they will be able to reproduce. However, the
analogy with natural evolution also breaks down here, because individuals can mate
with themselves and there is no notion of sexes.

Recombination

During the selection and mating process, the GA repeatedly lines up pairs of
individuals for reproduction. The next question is how to generate offspring from
these parent individuals. This is called the recombination process and how this is
done is largely dependent on the representation scheme being used. We will look at
three possibilities for recombination of individuals represented as bit strings.
The population will only evolve to be better if the best parts of the best individuals
are combined, hence recombination procedures usually take parts from both parents
and place them into the offspring. In the One-Point Crossover recombination
process, a point is chosen at random on the first individual, and the same point is
chosen on the second individual. This splits both individuals into a left hand and a
right hand side. Two offspring individuals are then produced by (i) taking the LHS
of the first and adding it to the RHS of the second and (ii) by taking the LHS of the
second and adding it to the RHS of the first. In the following example, the crossover
point is after the fifth letter in the bit string:

Note that all the a's, b's, X's and Y's are actually ones or zeros. We see that the
length of the two children is the same as that of the parents because GAs use a fixed
representation (remember that the bit strings only make sense as solutions if they
are of a particular length).
In Two-point Crossover, as you would expect, two points are chosen in exactly the
same place in both individuals. Then the bits falling in-between the two points are

swapped to give two new offspring. For example, in the following diagram, the two
points are after the 5th and 11th letters:

Again, the a's, b's, X's and Y's are ones or zeros, and we see that this recombination
technique doesn't alter the string length either. As a third recombination operator,
the inversion process simply takes a segment of a single individual and produces a
single offspring by reversing the letters in-between two chosen points.
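
The three operators are straightforward to write down for bit strings. The following Python sketch chooses the crossover and inversion points at random, as described above:

import random

def one_point_crossover(mum, dad):
    # Split both parents at the same random point and swap the right-hand sides.
    point = random.randint(1, len(mum) - 1)
    return (mum[:point] + dad[point:], dad[:point] + mum[point:])

def two_point_crossover(mum, dad):
    # Swap the segment lying between two randomly chosen points.
    a, b = sorted(random.sample(range(1, len(mum)), 2))
    return (mum[:a] + dad[a:b] + mum[b:], dad[:a] + mum[a:b] + dad[b:])

def inversion(parent):
    # Reverse the segment between two randomly chosen points (a single offspring).
    a, b = sorted(random.sample(range(len(parent) + 1), 2))
    return parent[:a] + parent[a:b][::-1] + parent[b:]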

Mutation

It may appear that the above recombinations are a little arbitrary, especially as
points defining where crossover and inversion occur are chosen randomly.
However, it is important to note that large parts of the string are kept intact, which
means that if the string contained a region which scored very well with the
evaluation function, these operators have a good chance of passing that region on to
the offspring (especially if the regions are fairly small, and, like in most GA
problems, the overall string length is quite high).
The recombination process produces a large range of possible solutions. However,
it is still possible for it to guide the search into a local rather than the global maximum
with respect to the evaluation function. For this reason, GAs usually perform
random mutations. In this process, the offspring are taken and each bit in their bit
string is flipped from a one to a zero or vice versa with a given probability. This
probability is usually taken to be very small, say around 0.01, so that on average only
about one bit in a hundred is flipped.
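
In Python, the mutation step is a one-liner over the bit string; 0.01 is used here as the default flipping probability, following the figure above:

import random

def mutate(bits, p_m=0.01):
    # Flip each bit independently with a small probability p_m.
    return [1 - b if random.random() < p_m else b for b in bits]
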
In natural evolution, random mutations are often highly deleterious (harmful) to the
organism, because the change in the DNA leads to big changes in the way the body
works. It may seem sensible to protect the children of the fittest individuals in the
population from the mutation process, using special alterations to the flipping
probability distribution. However, it may be that it is actually the fittest individuals
that are causing the population to stay in a local maximum. After all, they get to
reproduce with higher frequency. Hence, protecting their offspring is not a good
idea, especially as the GA will record the best from each generation, so we won't
lose their good abilities totally. Random mutation has been shown to be effective at
getting GA searches out of local maxima, which is why it is an
important part of the process.
To summarize the production of one generation from the previous: firstly, an
intermediate population is produced by selecting copies of the fittest individuals
using probability so that every individual has at least a chance of going into the
intermediate population. Secondly, pairs from this intermediate population are
chosen at random for reproduction (a pair might consist of the same individual
twice), and the pair reproduce with a given fixed probability. Thirdly, offspring are
generated through recombination procedures such as 1-point crossover, 2-point
crossover and inversion. Finally, the offspring are randomly mutated to produce the
next generation of individuals. Individuals from the old generation may be entirely
killed off, but some may be allowed into the next generation (alternatively, the
recombination procedure might be tuned to leave some individuals unchanged).
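
Putting the pieces together, and assuming the select_intermediate_population, mate, one_point_crossover and mutate sketches from earlier in this lecture, one generation might be produced from the previous roughly like this (with the old generation entirely replaced):

def next_generation(population, g, p_c=0.7, p_m=0.01):
    ip = select_intermediate_population(population, g)      # selection
    offspring = mate(ip, one_point_crossover, p_c,          # mating and
                     required=len(population))              # recombination
    return [mutate(child, p_m) for child in offspring]      # mutation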


17.3 Two Difficulties


The first big problem we face when designing an AI agent to perform a GA-style
search is how to represent the solutions. If the solutions are textual by nature, then
ASCII strings require eight bits per letter, so the size of individuals can get very
large. This will mean that evolution may take thousands of generations to converge
onto a solution. Also, there will be much redundancy in the bit string
representations: in general many bit strings produced by the recombination process
will not represent solutions at all, e.g., they may represent ASCII characters which
shouldn't appear in the solution. In the case of individuals which don't represent
solutions, how do we measure these with the evaluation function? It doesn't
necessarily follow that they are entirely unfit, because the tweaking of a single zero
to a one might make them good solutions. The situation is better when the solution
space is continuous, or the solutions represent real-valued numbers or integers. The
situation is worse when there are only a finite number of solutions.
The second big problem we face is how to specify the evaluation function. This is
crucial to the success of the GA experiment. The evaluation function should, if
possible:

Return a real-valued number scoring higher for individuals which perform better
with respect to the problem
Be quick to calculate, as this calculation will be done many thousands of times
Distinguish well between different individuals, i.e., give a good range of values

Even with a well specified evaluation function, when populations have evolved to a
certain stage, it is possible that the individuals will all score highly with respect to
the evaluation function, so all have roughly equal chances of reproducing. In this case,
evolution will effectively have stopped, and it may be necessary to take some action
to spread the scores out (for example, by dynamically making the evaluation function
more discriminating).

17.4 An Example Application


There are many fantastic applications of genetic algorithms. Perhaps my favourite is
their use in evaluating Jazz melodies, done as part of a PhD project in Edinburgh.
The one we look at here is chosen because it demonstrates how a fairly lightweight
effort using GAs can often be highly effective. In their paper "The Application of
Artificial Intelligence to Transportation System Design", Ricardo Hoar and Joanne
Penner describe their undergraduate project, which involved representing vehicles
on a road system as autonomous agents, and using a GA approach to evolve
solutions to the timing of traffic lights to increase the traffic flow in the system. The
optimum settings for when lights come on and go off are known only for very simple
situations, so an AI-style search can be used to try and find good solutions. Hoar
and Penner chose to do this in an evolutionary fashion. They don't give details of
the representation scheme they used, but traffic light times are real-valued numbers,
so they could have used a bit-string representation.
The evaluation function they used involved the total waiting time and the total
driving time for each car in the system.

The results they produced were good (worthy of writing a paper). Two graphs in their
paper plot the decrease in overall waiting time for a simple road and for a more
complicated road (albeit not amazingly complicated).

We see that in both cases, the waiting time has roughly halved, which is a good
result. In the first case, for the simple road system, the GA evolved a solution very
similar to the ones worked out to be optimal by humans. We see that GAs can be
used to find good near-optimal solutions to problems where a more cognitive
approach might have failed (i.e., humans still can't work out how best to tune traffic
light times, but a computer can evolve a good solution).
