http://rajakishor.co.cc Page 96
Probably Learning an Approximately Correct Hypothesis
Here, we consider a particular setting for the learning problem, called the
probably approximately correct (PAC) learning model. We begin by specifying the
problem setting that defines the PAC learning model, then consider the questions of
how many training examples and how much computation are required in order to learn
various classes of target functions within this PAC model. For the sake of simplicity, we
restrict the discussion to the case of learning Boolean-valued concepts from noise-free
training data.
Error of a Hypothesis
Because we are interested in how closely the learner's output hypothesis h
approximates the actual target concept c, let us begin by defining the true error of a
hypothesis h with respect to target concept c and instance distribution D.
Definition
The true error (denoted errorD(h)) of hypothesis h with respect to target
concept c and distribution D is the probability that h will misclassify an instance drawn
at random according to D:

errorD(h) ≡ Pr_{x∈D}[c(x) ≠ h(x)]

Here, the notation Pr_{x∈D} indicates that the probability is taken over the instance
distribution D.
Informally, the true error of h is just the error rate we expect when applying h to
future instances drawn according to the probability distribution D.
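The true error can also be estimated empirically by drawing instances according to D. A minimal sketch in Python, where the target concept c, the hypothesis h, and the uniform instance distribution are all hypothetical stand-ins:

```python
import random

# Sketch: estimating the true error error_D(h) by sampling from D.
# c, h, and draw_instance are illustrative stand-ins; any Boolean-valued
# functions over the same instance space would work.

def c(x):            # target concept: is x in [0.3, 0.7]?
    return 0.3 <= x <= 0.7

def h(x):            # learned hypothesis: a slightly too-wide interval
    return 0.25 <= x <= 0.7

def draw_instance():  # draws x according to D (here: uniform on [0, 1])
    return random.random()

def estimate_error(h, c, draw, n=100_000):
    # Fraction of randomly drawn instances that h misclassifies;
    # this converges to Pr_{x in D}[h(x) != c(x)] as n grows.
    return sum(h(x) != c(x) for x in (draw() for _ in range(n))) / n

random.seed(0)
print(estimate_error(h, c, draw_instance))  # close to 0.05, the measure
                                            # of the region where h and c disagree
```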
To accommodate these two difficulties - that finite training data may leave several
consistent hypotheses, and that randomly drawn training examples can occasionally be
misleading - we weaken our demands on the learner in two ways.
1. We will not require that the learner output a zero-error hypothesis - we will
require only that its error be bounded by some constant, ε, that can be made
arbitrarily small.
2. We will not require that the learner succeed for every sequence of randomly
drawn training examples - we will require only that its probability of failure
be bounded by some constant, δ, that can be made arbitrarily small.
In short, we require only that the learner probably learn a hypothesis that is
approximately correct - hence the term probably approximately correct learning, or
PAC learning.
Consider some class C of possible target concepts and a learner L using
hypothesis space H. We can say that the concept class C is PAC-learnable by L using H if,
for any target concept c in C, L will with probability (1 - δ) output a hypothesis h with
errorD(h) < ε, after observing a reasonable number of training examples and performing
a reasonable amount of computation.
More precisely,
Consider a concept class C defined over a set of instances X of length n and a
learner L using hypothesis space H. C is PAC-learnable by L using H if for all
c ∈ C, distributions D over X, ε such that 0 < ε < ½, and δ such that 0 < δ < ½,
learner L will with probability at least (1 - δ) output a hypothesis h ∈ H such
that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).
VSH,D = {h ∈ H | (∀⟨x, c(x)⟩ ∈ D)(h(x) = c(x))}
The significance of the version space here is that every consistent learner
outputs a hypothesis belonging to the version space, regardless of the instance space X,
hypothesis space H, or training data D. The reason is simply that by definition the
version space VSH,D contains every consistent hypothesis in H. Therefore, to bound the
number of examples needed by any consistent learner, we need only bound the number
of examples needed to assure that the version space contains no unacceptable
hypotheses.
The following definition states this condition precisely.
Consider a hypothesis space H, target concept c, instance distribution D,
and set of training examples D of c. The version space VSH,D is said to be
ε-exhausted with respect to c and D if every hypothesis h in VSH,D has
error less than ε with respect to c and D.
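This condition leads to the classical sample-complexity bound for finite hypothesis spaces, m ≥ (1/ε)(ln|H| + ln(1/δ)), which guarantees that the version space is ε-exhausted with probability at least 1 - δ. The bound itself is not derived in the excerpt above; the sketch below simply evaluates it, with the Boolean-conjunction example and parameter values as illustrative choices:

```python
import math

# Sketch: the classical bound on the number of training examples needed
# by any consistent learner over a finite hypothesis space H,
#   m >= (1/eps) * (ln|H| + ln(1/delta)).
# This standard result is quoted here, not derived from the text above.

def sample_bound(h_size, eps, delta):
    return math.ceil((1.0 / eps) * (math.log(h_size) + math.log(1.0 / delta)))

# e.g. conjunctions over 10 Boolean attributes: each attribute appears
# positively, negatively, or not at all, so |H| = 3**10
print(sample_bound(3**10, eps=0.1, delta=0.05))  # 140
```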
Find-S
Initialize h to the most specific hypothesis l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ … ∧ ln ∧ ¬ln
For each positive training instance x
o Remove from h any literal that is not satisfied by x
Output hypothesis h.
FIND-S converges in the limit to a hypothesis that makes no errors, provided C ⊆
H and provided the training data is noise-free. FIND-S begins with the most specific
hypothesis (which classifies every instance as a negative example), then incrementally
generalizes this hypothesis as needed to cover observed positive training examples.
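The FIND-S steps above can be sketched as follows; the attribute names and training examples are illustrative:

```python
# Sketch of FIND-S over Boolean literals. A hypothesis is represented as
# the set of literals in the conjunction: ('x1', True) is the literal x1,
# ('x1', False) is the literal ¬x1. Attribute names are illustrative.

def find_s(attributes, positive_examples):
    # Start with the most specific hypothesis: every literal and its
    # negation, which no instance can satisfy.
    h = {(a, v) for a in attributes for v in (True, False)}
    for x in positive_examples:          # x maps attribute name -> bool
        # Drop every literal the positive instance does not satisfy.
        h = {(a, v) for (a, v) in h if x[a] == v}
    return h

attrs = ['x1', 'x2', 'x3']
positives = [{'x1': True, 'x2': False, 'x3': True},
             {'x1': True, 'x2': False, 'x3': False}]
print(sorted(find_s(attrs, positives)))  # [('x1', True), ('x2', False)],
                                         # i.e. the conjunction x1 ∧ ¬x2
```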
Instance-Based Learning
Training algorithm:
For each training example (x, f(x)), add the example to the list training-examples
Classification algorithm:
Given a query instance xq to be classified,
• Let x1, …, xk denote the k instances from training-examples that are
nearest to xq.
• Return

f̂(xq) ← argmax_{v∈V} Σ_{i=1..k} wi δ(v, f(xi))

where δ(a, b) = 1 if a = b and 0 otherwise, and

wi ≡ 1 / d(xq, xi)²
To accommodate the case where the query point xq exactly matches one of the
training instances xi, so that the denominator d(xq, xi)² is zero, we assign f̂(xq) to
be f(xi) in this case. If there are several such training examples, we assign the majority
classification among them.
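The distance-weighted classification rule, including the exact-match convention above, can be sketched as follows (the toy dataset is illustrative, and ties among several exact matches are not specially handled in this minimal version):

```python
import math
from collections import defaultdict

# Sketch of distance-weighted k-nearest-neighbour classification: each of
# the k nearest neighbours votes for its own class with weight
# w_i = 1 / d(x_q, x_i)^2, and an exact match returns that neighbour's
# class directly.

def knn_classify(training, xq, k=3):
    # training: list of (instance, label) pairs; instances are tuples
    nearest = sorted(training, key=lambda ex: math.dist(xq, ex[0]))[:k]
    votes = defaultdict(float)
    for xi, label in nearest:
        d = math.dist(xq, xi)
        if d == 0:                 # query coincides with a stored instance
            return label
        votes[label] += 1.0 / d**2
    return max(votes, key=votes.get)

data = [((0, 0), 'neg'), ((0, 1), 'neg'), ((1, 0), 'pos'), ((1, 1), 'pos')]
print(knn_classify(data, (0.9, 0.9), k=3))  # pos
```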
We can distance-weight the instances for real-valued target functions in a similar
fashion, replacing the final line of the algorithm in this case by
f̂(xq) ← (Σ_{i=1..k} wi f(xi)) / (Σ_{i=1..k} wi)

where wi is the inverse square of xi's distance from xq.
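The real-valued variant can be sketched the same way; the one-dimensional dataset below is illustrative:

```python
import math

# Sketch of the distance-weighted rule for real-valued targets: a
# weighted average of the k nearest neighbours' target values, with
# w_i = 1 / d(x_q, x_i)^2 and the exact-match case handled directly.

def knn_regress(training, xq, k=3):
    nearest = sorted(training, key=lambda ex: math.dist(xq, ex[0]))[:k]
    num = den = 0.0
    for xi, fxi in nearest:
        d = math.dist(xq, xi)
        if d == 0:
            return fxi            # exact match: return the stored value
        w = 1.0 / d**2
        num += w * fxi
        den += w
    return num / den

data = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 4.0)]
print(knn_regress(data, (1.5,), k=2))  # 2.5: equal weights on f=1 and f=4
```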
E ≡ (1/2) Σ_{x∈D} (f(x) − f̂(x))²

which led us to the gradient descent training rule

Δwj = η Σ_{x∈D} (f(x) − f̂(x)) aj(x)

1. Minimize the squared error over just the k nearest neighbors:

E1(xq) ≡ (1/2) Σ_{x ∈ k nearest nbrs of xq} (f(x) − f̂(x))²

2. Minimize the squared error over the entire set D of training examples, weighting
the error of each training example by some decreasing function K of its distance
from xq:

E2(xq) ≡ (1/2) Σ_{x∈D} (f(x) − f̂(x))² K(d(xq, x))

3. Combine 1 and 2:

E3(xq) ≡ (1/2) Σ_{x ∈ k nearest nbrs of xq} (f(x) − f̂(x))² K(d(xq, x))
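Criterion E3 can be minimized with the gradient-descent rule above applied locally. A sketch using a linear model f̂(x) = w0 + w1·x; the Gaussian kernel, learning rate, and iteration count are all illustrative assumptions:

```python
import math

# Sketch of locally weighted linear regression under criterion E3:
# squared error over the k nearest neighbours of x_q, weighted by a
# decreasing kernel K of the distance. The linear model, Gaussian
# kernel, and learning-rate settings are illustrative choices.

def K(d, width=1.0):
    return math.exp(-(d / width) ** 2)     # decreasing kernel

def lwr_predict(training, xq, k=5, eta=0.01, steps=2000):
    nearest = sorted(training, key=lambda ex: abs(ex[0] - xq))[:k]
    w0 = w1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, fx in nearest:
            err = fx - (w0 + w1 * x)       # (f(x) - f̂(x))
            kd = K(abs(x - xq))
            g0 += err * kd                 # gradient-descent rule with
            g1 += err * kd * x             # kernel-weighted errors
        w0 += eta * g0
        w1 += eta * g1
    return w0 + w1 * xq

data = [(x / 10.0, (x / 10.0) ** 2) for x in range(21)]  # f(x) = x^2
print(lwr_predict(data, 1.0))   # close to 1.0, the true f(1.0)
```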
Genetic Algorithms
Genetic Algorithms are nondeterministic stochastic search or optimization
methods that utilize the theories of evolution and natural selection to solve a problem
within a complex solution space. They are computer-based problem solving systems,
which use computational models of some of the known mechanisms in evolution as key
elements in their design and implementation. Genetic algorithms are loosely based on
natural evolution and use a “survival of the fittest” technique, wherein the best solutions
survive and are varied until we get a good result.
In a genetic algorithm, a set of candidate solutions to a problem, called
‘chromosomes’, is evaluated and ordered by performance. New candidate solutions
are then produced by selecting candidates as ‘parents’ and applying mutation or
crossover operators, which combine bits of two parents to produce one or more
children. The new set of candidates is evaluated, and this cycle continues until an
adequate solution is found.
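The evaluate/select/vary cycle just described can be sketched end-to-end on a toy problem: maximizing the number of 1-bits in a 20-bit chromosome ("one-max"). The population size, tournament selection, and rates are illustrative choices:

```python
import random

# Sketch of a minimal genetic algorithm on the "one-max" problem.
# Encoding, selection scheme, and rates are illustrative assumptions.

LENGTH, POP, GENS = 20, 30, 60

def fitness(chrom):
    return sum(chrom)                       # count of 1 bits

def select(pop):
    # tournament selection: the better of two random candidates
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    cut = random.randint(1, LENGTH - 1)     # one-point crossover
    return p1[:cut] + p2[cut:]

def mutate(chrom, rate=0.01):
    # flip each gene independently with a small probability
    return [1 - g if random.random() < rate else g for g in chrom]

random.seed(1)
pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]
best = max(pop, key=fitness)
print(fitness(best))   # typically at or near the optimum of 20
```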
Preliminaries
Genetic Algorithms are members of a wider family of algorithms known as
“Evolutionary Algorithms”.
Genetic Algorithms perform a multi-point search in the problem space. On one
hand, this ensures robustness: getting trapped in a local minimum does not mean that
the whole algorithm fails. On the other, the search may yield not just one but several
near-optimal solutions for the problem, from which the user can select.
Due to the robustness, flexibility and efficiency of Genetic Algorithms, costly
redesigns of artificial systems based on them can be avoided. Genetic
Algorithms are theoretically and empirically proven to provide robust search in
complex problem spaces.
Biological Background
Genetic algorithms are inspired by Darwin's theory: solutions to a problem can
be obtained through evolution. All living organisms consist of cells, and each cell
contains the same set of chromosomes. Chromosomes are strings of DNA and serve as
a model for the whole organism. The genes determine a chromosome's characteristics.
Each gene has several forms or alternatives, which are called alleles, producing
differences in the organism's characteristics.
Encoding of Chromosomes
The first step in a genetic algorithm is to “translate” the real problem into
“biological terms”. The format of a chromosome is called its encoding. There are four
commonly used encoding methods: binary encoding, permutation encoding, direct
value encoding and tree encoding.
i. Binary Encoding:
Binary encoding is the most common and simplest one. In binary encoding, every
chromosome is a string of bits, 0 or 1. For example:
Chromosome A: 0101101100010011
Chromosome B: 1011010110110101
i. Initialization
There are many ways to initialize and encode the initial population. It can use
binary or non-binary, fixed-length or variable-length strings, and so on. This operator
is not of much significance if the system randomly generates valid chromosomes and
evaluates each one.
ii. Reproduction
Reproduction is the process by which chromosomes of the previous generation
are carried into the next generation. The two types of reproduction are generational
reproduction and steady-state reproduction.
Generational Reproduction
In Generational Reproduction, the whole population is potentially replaced at
each generation. The most often used method is to randomly generate a
population of chromosomes. The next step is to decode chromosomes into
iii. Selection
According to Darwin's theory of evolution, the best individuals should survive and
create new offspring. There are many methods for selecting the best chromosomes, for
example roulette wheel selection, Boltzmann selection, tournament selection, rank
selection, and spatially oriented selection.
Roulette Wheel Selection
Parents are selected according to their fitness: the better a chromosome is, the
more chances it has to be selected. Imagine a roulette wheel (pie chart) where all
chromosomes in the population are placed according to their normalized fitness.
Then a random number is generated, which decides the chromosome to be selected.
chromosome to be selected. Chromosomes with bigger fitness values will be
selected more times since they occupy more space on the pie.
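The roulette wheel can be implemented as a cumulative-fitness array. A minimal sketch, with illustrative fitness values:

```python
import random
from bisect import bisect

# Sketch of roulette-wheel selection: a chromosome's probability of
# being picked is proportional to its fitness. Fitness values below
# are illustrative.

def roulette_select(population, fitnesses):
    total, wheel = 0.0, []
    for f in fitnesses:
        total += f
        wheel.append(total)              # cumulative fitness: the "wheel"
    spin = random.random() * total       # where the ball lands
    # min() guards against a spin that rounds up to the wheel's end
    return population[min(bisect(wheel, spin), len(population) - 1)]

pop = ['A', 'B', 'C', 'D']
fit = [10.0, 40.0, 30.0, 20.0]
random.seed(0)
counts = {c: 0 for c in pop}
for _ in range(10_000):
    counts[roulette_select(pop, fit)] += 1
print(counts)   # B is chosen most often, A least often
```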
Rank Selection
Roulette wheel selection runs into problems when fitness values differ greatly.
For example, if the best chromosome fitness is 90% of the entire roulette wheel
then the other chromosomes will have very few chances to be selected. Rank
selection first ranks the population and then every chromosome receives fitness
from this ranking. The worst will have fitness 1, second worst 2 etc. and the best
will have fitness N (number of chromosomes in population). After this, all the
chromosomes have a chance to be selected. However, this method can lead to
slower convergence, because the best chromosomes do not differ so much from
other ones.
Elitism
When creating a new population by crossover and mutation, there is a significant
chance of losing the best chromosome. Elitism is a method that first copies the
best chromosome (or a few best chromosomes) to the new population. The rest of
the new population is then created in the usual way.
iv. Crossover
The crossover operator is the most important operation in genetic algorithms.
Crossover is a process of yielding recombination of bit strings via an exchange of
segments between pairs of chromosomes. There are many kinds of crossovers. Certain
crossover operators are applicable for binary chromosomes and some other for
permutation chromosomes.
One-point crossover
A randomly chosen point is taken within the length of the chromosomes, and the
chromosomes are cut at that point. The first child consists of the sub-chromosome of
Parent1 up to the cut point concatenated with the sub-chromosome of Parent2
after the cut point. The second child is constructed in a similar way. For example:
P1 = 1010101|010
P2 = 1110001|110
The crossover point is between the 7th and 8th bits. Then the offspring will be
O1 = 1010101|110
O2 = 1110001|010
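One-point crossover reduces to string slicing; the sketch below reproduces the P1/P2 example from the text, cutting between the 7th and 8th bits:

```python
import random

# Sketch of one-point crossover on bit-string chromosomes: cut both
# parents at one point and swap the tails.

def one_point_crossover(p1, p2, cut=None):
    if cut is None:
        cut = random.randint(1, len(p1) - 1)   # random cut point
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

p1, p2 = '1010101010', '1110001110'
o1, o2 = one_point_crossover(p1, p2, cut=7)
print(o1, o2)   # 1010101110 1110001010
```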
v. Mutation
Mutation has the effect of ensuring that all possible chromosomes are reachable.
For example, suppose a gene position can take any value from one to twenty, but no
chromosome in the initial population has the value 6 at that position. Then with only
crossover and reproduction operators, the value 6 will never occur in any future
chromosome. The mutation operator overcomes this by randomly selecting a gene
position and changing its value.
Mutation is useful in escaping local minima, as it helps explore new regions of
the multidimensional solution space. If the mutation rate is too high, it can cause
well-bred chromosomes to be lost and thus decrease the exploitation of high-fitness
regions of the solution space.
Some systems that use random populations (noisy) created at initialization
phase do not use mutation operators at all. Some mutation operators are:
Bit Inversion
This operator applies to binary chromosomes. Bits in the chromosome are
inverted (0's become 1's and 1's become 0's) depending on the probability of
mutation. For example, 1000000001 → 1010000000, where the third and 10th bits
have been (randomly) mutated.
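Bit inversion is a per-bit coin flip; a minimal sketch, where the mutation rate is an illustrative choice:

```python
import random

# Sketch of bit-inversion mutation: each bit flips independently with
# the mutation probability. The default rate is illustrative.

def bit_inversion(chromosome, rate=0.1):
    # flip each bit with probability `rate`, keep it otherwise
    return ''.join(('1' if b == '0' else '0') if random.random() < rate else b
                   for b in chromosome)

print(bit_inversion('1000000001', rate=1.0))  # 0111111110 (every bit flips)
```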
Order Changing
This operator can be used on both binary and non-binary gene representations.
In this a portion of the chromosome is selected and the genes in that region are
randomly permuted. For example, (5 6 3 4 7 3) → (5 3 4 6 7 3), where the second,
third and fourth values have been randomly scrambled.
vi. Inversion
In his founding work on Genetic Algorithms, Holland mentioned another
operator, besides selection, crossover and mutation, that takes place in biological
reproduction. This is known as the inversion operator. An inversion is where a portion
of a chromosome detaches from the rest of the chromosome, then changes direction and
recombines with the chromosome. The process of inversion is decidedly more complex
to implement than the other operators involved in genetic algorithms. Inversion has
also attracted a substantial amount of research. For example, consider the Chromosome
Search Methods
There are three main types of traditional or conventional search method:
calculus-based, enumerative, and random.
i. Calculus-based methods
Calculus-based methods are also referred to as gradient methods. These
methods use the information about the gradient of the function to guide the direction of
search. If the derivative of the function cannot be computed, because it is discontinuous,
for example, these methods often fail.
Hill climbing is one such method: it uses gradient information to find a local
optimum by moving in the steepest permissible direction.
Calculus-based methods also depend upon the existence of derivatives or well-
defined slope values. But real-world search problems are fraught with discontinuities
and vast, multimodal, noisy search spaces.
ii. Enumerative methods
Enumerative methods work within a finite search space, or at least a discretized
infinite search space. The algorithm then starts looking at objective function values at
every point in the space, one at a time.
Enumerative methods search every point in the space. So, if the search space
grows exponentially, or if the problem is NP-hard like the Traveling Salesman
Problem, this method becomes inefficient.
iii. Random search methods
Random search methods are strictly random walks through the search space
while saving the best.
iv. Differences between Genetic Algorithms and conventional search
procedures
Genetic Algorithms differ from conventional optimization/search procedures in that:
They work with a coding of the parameter set, not the parameters themselves.
They search from a population of points in the problem domain, not a singular
point.
They use payoff (objective function) information rather than derivatives of
the problem or other auxiliary knowledge.
Fitness Function
The fitness function is a non-negative figure of merit of the chromosome. The
objective function is the basis for the computation of the fitness function, which
provides the Genetic Algorithms with feedback from the environment, feedback used to
direct the population towards areas of the search space characterized by better
solutions.
Generally, the only requirement of a fitness function is to return a value
indicating the quality of the individual solution under evaluation. This gives the
modeler almost unlimited freedom in building the model; therefore, a diverse range of
modeling structures can be incorporated into the Genetic Algorithms.
At every evolutionary step, known as a generation, the individuals in the current
population are decoded and evaluated according to some predefined quality criterion,
Dimension-1
Sequential covering algorithms learn one rule at a time, removing the covered
examples and repeating the process on the remaining examples.
In contrast, decision tree algorithms such as ID3 learn the entire set of disjuncts
simultaneously as part of the single search for an acceptable decision tree. We might,
therefore, call algorithms such as ID3 simultaneous covering algorithms, in contrast to
sequential covering algorithms such as CN2.
Which should we prefer? The key difference occurs in the choice made at the
most primitive step in the search. At each search step ID3 chooses among alternative
Dimension-2
Sequential covering algorithms learn one rule at a time, removing the covered
examples and repeating the process on the remaining examples.
In the LEARN-ONE-RULE algorithm described above, the search is from general-
to-specific hypotheses. Other algorithms we have discussed (e.g., FIND-S) search from
specific-to-general.
One advantage of general to specific search here is that there is a single
maximally general hypothesis from which to begin the search, whereas there are very
many specific hypotheses in most hypothesis spaces (i.e., one for each possible
instance).
Dimension-4
A fourth dimension is whether and how rules are post-pruned. As in decision
tree learning, it is possible for LEARN-ONE-RULE to formulate rules that perform very
well on the training data, but less well on subsequent data. As in decision tree learning,
one way to address this issue is to post-prune each rule after it is learned from the
training data. In particular, preconditions can be removed from the rule whenever this
leads to improved performance over a set of pruning examples distinct from the
training examples.
1. Relative Frequency
Let n denote the number of examples the rule matches and let nc denote the
number of these that it classifies correctly. The relative frequency estimate of rule
performance is nc/n.
2. m-estimate of accuracy
This accuracy estimate is biased toward the default accuracy expected of the
rule. It is often preferred when data is scarce and the rule must be evaluated based on
few examples.
Let n and nc denote the number of examples matched and correctly predicted by
the rule. Let p be the prior probability that a randomly drawn example from the entire
data set will have the classification assigned by the rule (e.g., if 12 out of 100 examples
have the value predicted by the rule, then p = .12). Finally, let m be the weight, or
equivalent number of examples for weighting this prior p. The m-estimate of rule
accuracy is (nc+mp)/(n+m).
Note if m is set to zero, then the m-estimate becomes the above relative
frequency estimate. As m is increased, a larger number of examples is needed to
override the prior assumed accuracy p.
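The m-estimate is one line of arithmetic; a sketch, where n, nc and m are illustrative and p = 0.12 follows the 12-out-of-100 example above:

```python
# Sketch of the m-estimate of rule accuracy, (n_c + m*p) / (n + m).
# With m = 0 it reduces to the relative frequency n_c / n; larger m
# pulls the estimate toward the prior p. The numbers are illustrative.

def m_estimate(n_c, n, p, m):
    return (n_c + m * p) / (n + m)

# a rule matches 8 examples and classifies 6 correctly; prior p = 0.12
print(m_estimate(6, 8, 0.12, 0))    # 0.75 (the relative frequency)
print(m_estimate(6, 8, 0.12, 10))   # pulled down toward the prior 0.12
```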
3. Entropy
This is the measure used by the PERFORMANCE subroutine in the
generate-and-test algorithm. Let S be the set of examples that match the rule
preconditions.
Entropy measures the uniformity of the target function values for this set of examples.
We take the negative of the entropy so that better rules will have higher scores.
−Entropy(S) = Σ_{i=1..c} pi log2 pi
where c is the number of distinct values the target function may take on, and where pi is
the proportion of examples from S for which the target function takes on the ith value.
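The negative-entropy score can be computed directly from the class labels of the matched examples; the labels below are illustrative:

```python
import math
from collections import Counter

# Sketch of the negative-entropy rule score: for the set S of examples
# matched by a rule, compute sum_i p_i * log2(p_i), so that purer rule
# coverage scores higher (0 for a perfectly pure set).

def neg_entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(neg_entropy(['pos'] * 8))                 # 0.0 (perfectly pure)
print(neg_entropy(['pos'] * 4 + ['neg'] * 4))   # -1.0 (maximally mixed)
```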