Pattern Recognition in Stock Market3947

pattern recognition
in stock market
Introduction
motivation
Our time is limited, better not to waste it
working
Life style costs money
Create someone else to do the job for you
metatrader
Online broker
Lets you trade foreign currency, stocks
and indexes
MetaQuotes Language (MQL) similar to C,
allows you to buy and sell
Can be linked with dynamic linked libraries
(dll)
Pattern recognition
Pattern recognition aims to classify data
(patterns) based either on a priori knowledge or
on statistical information extracted from the
patterns. The patterns to be classified are usually
groups of measurements or observations,
defining points in an appropriate
multidimensional space.
To understand is to perceive patterns
SVM
Number of art books purchased
Linear Support Vector Machines
A direct marketing company wants to sell a

new book:
The Art History of Florence
Nissan Levin and Jacob Zahavi in Lattin,
Carroll and Green (2003).
buyers
non-buyers
Problem: How to identify buyers and nonbuyers using the two variables:
Months since last purchase

Linear SVM: Separable Case

Main idea of SVM:
separate groups by a line.
However: There are infinitely many lines that

have zero training error
buyers
non-buyers
which line shall we choose?
buyers
non-buyers
SVM use the idea of a margin around the

separating line.
margi
n
The thinner the margin,
the more complex the model,
The best line is the one with the

largest margin.
x2
The line having the largest margin is:
w 1x 1
w 2x 2
=
+b
w 2x
+
x1
+b
2
w1x1 + w2x2 + b = 0
Where
w1
x1 = months since last purchase
margin
x2 = number of art books purchased
w 1x 1
2
wx
b=
x1
-1
Note:
w1xi 1 + w2xi 2 + b +1
w1xj 1 + w2xj 2 + b 1
for i
for j

The width of the margin is given by:
x2
w 1x 1
2
w 2x
w 2x
+
x1
w1
margin
=
+b
w 1x 1
+b
2
x2
2
w
+
b=
-1
Note:
2 w
maximize
the margin
margin
x1
1 ( 1)
w12 w 22
w 2
minimize
2
|| w ||
minimize
wx i b 1 for yi 1
wx i b 1 for yi 1
yi (wx i b) 1 0 for all i

2 w
x2
maximize
the margin
w 2
minimize
minimize
The optimization problem for SVM is:
minimize L( w ) w
margin
subject to:
w1xi 1 + w2xi 2 + b +1
w1xj 1 + w2xj 2 + b 1
x1
for i
for j

Support vectors
x2
Support vectors are those points that lie

on the boundaries of the margin
The decision surface (line) is determined

only by the support vectors. All other
points are irrelevant
x1
Linear SVM: Nonseparable Case

Non-separable case: there is no line
Training set: 1000 targeted customers separating errorlessly the two groups
x2
buyers
non-buyers
w 1x 1
x2
L( w,C ) w
C i
+w
=
+b
Here, SVM minimize L(w,C) :
maximize
the margin
minimize the
training errors
L(w,C) = Complexity +
Errors
subject to:
x1
w1xi 1 + w2xi 2 + b +1 i
w1xj 1 + w2xj 2 + b 1 + i
I,j 0
for i
for j
vectors Xi
labels yi = 1
y sign (w X b)
min :
w ,b
1
2
w C 1 yi (w X i b)
2
margin and error vectors
yi (w X i b) 1, i S
y sign ( i yi X i X b)
iS
w i yi X i
iS
Linear SVM: The Role of C

x2
x2
C=5
x1
Bigger C
increased complexity
( thinner margin )
Smaller C
C=1
x1
decreased complexity
( wider margin )
smaller number errors
bigger number errors
( better fit on the data )
( worse fit on the data )
Vary both complexity and empirical error via C by affecting the optimal w and optimal
number of training errors
Non-linear SVMs
Transform x (x)
The linear algorithm depends only on xxi, hence
transformed algorithm depends only on (x)(xi)
Use kernel function K(xi,xj) such that K(xi,xj)= (x)(xi)
Nonlinear SVM: Nonseparable Case
Mapping into a higher-dimensional space
x2
x11
x
21
xl 1
x112
2 x11 x12
x122
2
x 21

2 x 21 x 22
2
x 22
xl 2
2
l1
x12
x 22
2 xl1xl 2
2
l2
Optimization task: minimize L(w,C)
L(w ,C ) w
subject to:
C i
i
w 1x i21 w 2 2 x i 1x i 2 w 3 x i22 b 1 i
w1x 2j 1 w 2 2 x j 1x j 2 w 3 x 2j 2 b 1 j
x1
Map the data into higher-dimensional space:
x1
x
2
(-1,1)
(-1,-1)
1, 2 , 1
x22
(1,1)
x1
1, 1 1, 2 , 1
1, 1 1, 2 , 1
1, 1 1, 2 , 1
2 x1 x2
x22
1, 1
x2
2
1
x12
(1,-1)
2 x1 x2

Find the optimal hyperplane in the transformed space
x1
x
2
2 x1 x2
x22
1, 1 1, 2 , 1
1, 1 1, 2 , 1
1, 1 1, 2 , 1
1, 1
2
1
x2
(-1,1)
(-1,-1)
x22
(1,1)
x1
1, 2 , 1
x12
(1,-1)
2 x1 x2

Observe the decision surface in the original space (optional)
x1
x
2
1, 1 1, 2 , 1
1, 1 1, 2 , 1
1, 1 1, 2 , 1
1, 1
2 x1 x2
x22
2
1
x2
x22
x1
1, 2 , 1
x12
2 x1 x2

Dual formulation of the (primal) SVM minimization problem
Primal
min
w
2
Dual
C i
yi w x i b 1 i
yi 1
Subject to
i 0
max
Subject to
0 i C
yi 0
yi 1
1
2
yi yj xi xj

Dual formulation of the (primal) SVM minimization problem
x1
x
2
Dual
x12
2 x1 x2
x22
max
x ,
(x
i1
2 xi1 xi 2 , x
2
j1
, xi 2 ) ( x j1 , x j 2 )
x x
i
2
i2
, 2 x j1 x j 2 , x
2
j2
max
1
2
yi yj xi xj
2 i j yi yj ( xi ) ( xj )
i
K ( x i , x j ) ( x i ) ( x j )
(kernel function)
( x i ) ( x j )
2
i1
max
i 12 i j yi yj xi xj
i
Subject to
0 i C
yi 0
yi 1
Solving
Construct & minimise the Lagrangian
N
1
2
L(w , b, ) || w || i [ yi (wx i b) 1]
2
i 1
wrt. constraint i 0, i 1,...N
Take derivatives wrt. w and b, equate them to 0

N
L(w , b, )
w i yi x i 0
w
i 1
L(w , b, ) N
i yi 0
b
i 1
KKT cond : i [ yi (wx i b) 1] 0
parameters are expressed as a linear

combination of training points
only SVs will have non-zero i
The Lagrange multipliers i are called dual variables

Each training point has an associated dual variable.
Applications
Handwritten digits recognition
Of interest to the US Postal services
4% error was obtained
about 4% of the training data were SVs only
Text categorisation
Face detection
DNA analysis
Architecture of SVMs
Nonlinear Classifier(using kernel)
Decision function
l
f ( x) sgn( vi ( ( x) ( xi )) b)
i 1
l
sgn( vi k ( x, xi ) b)
i 1
( xi ) substitute for each

train example xi
vi i yi
vi are computed as the
solution of quadratic program
Artificial Neural Networks
Neural Network
Taxonomy of Neural Network Architecture
The architecture of the neural network refers to the arrangement

of the connection between neurons, processing element, number
of layers, and the flow of signal in the neural network. There are
mainly two category of neural network architecture: feedforward and feedback (recurrent) neural networks
Neural Network
Feed-forward network, Multilayer Perceptron
Neural Network
Recurrent network
Multilayer Perceptron (MLP)

Input Layer
Neuron processing element
x1
Hidden Layer
h1
x2
Input
Vector
x3
x4
.
.
.
x1
Output Layer
O1
x2
h2
xn
w1
w2
y F(y)
wn
xn
F(y)
MLP Structure
Backpropagation Learning
Architecture:
Feedforward network of at least one layer of non-linear
hidden nodes, e.g., # of layers L 2 (not counting the input
layer)
Node function is differentiable
most common: sigmoid function
Learning: supervised, error driven,

generalized delta rule
Call this type of nets BP nets
The weight update rule
(gradient descent approach)
Practical considerations
Variations of BP nets
Applications
Notations:
Weights: two weight matrices:
w(1,0) from input layer (0) to hidden layer (1)

w( 2,1) from hidden layer (1) to output layer (2)
w2(1,1,0) weight from node 1 at layer 0 to node 2 in layer 1
Training samples: pair of
so it is supervised learning
Input pattern:
Output pattern:
{( x p , d p ) p 1,..., P}
x p ( x p ,1 ,..., x p ,n )
Desired output:
o p (o p ,1 ,..., o p ,k )
d p (d p ,1 ,..., d p ,k )
Error:
error for output node j when xp is
l p, j d p,k o p,k
applied
sum square error
P K
(l p, j ) 2
p 1 (change
j 1
This error drives learning
and
(1, 0 )
( 2),1)
Sigmoid function again:
Differentiable:
1
1 e x
1
x
S ' ( x)
(
1
e
)'
x 2
(1 e )
1
x
e
)
x 2
(1 e )
1
e x
x
1 e 1 ex
S ( x)
Saturation
region
Saturation
region
S ( x)(1 S ( x))
When |net| is sufficiently large, it moves into one of the

two saturation regions, behaving like a threshold or ramp
function.
Chain rule of differentiation dz
dz dy dx
if z f ( y ), y g ( x), x h(t ) then f ' ( y ) g ' ( x)h' (t )
dt dy dx dt
Forward computing:
Apply an input vector x to input nodes
Computing output vector x(1) on hidden layer
x (j1) S ( net (j1) ) S ( w(j1,i,0) xi )
i
Computing the output vector o on output layer

ok S (netk( 2) ) S ( wk( 2, ,j1) x (j1) )
j
The network is said to be a map from input x to output o
Objective of learning:
Modify the 2 weight matrices to reduce sum square error
P
K
p 1 k 1(l p,k )2 for the given P training samples as much
as possible (to zero if possible)
Idea of BP learning:
Update of weights in w(2, 1) (from hidden layer to output
layer):
delta rule as in a single layer net using sum square error
Delta rule is not applicable to updating weights in w(1, 0)
(from input and hidden layer) because we dont know the
desired values for hidden nodes
Solution: Propagating errors at output nodes down to
hidden nodes, these computed errors on hidden nodes
drives the update of weights in w(1, 0) (again by delta rule),
thus called error Back Propagation (BP) learning
How to compute errors on hidden nodes is the key
Error backpropagation can be continued downward if the
net has more than one hidden layer
Proposed first by Werbos (1974), current formulation by
Rumelhart, Hinton, and Williams (1986)
Generalized delta rule:
Consider sequential learning mode: for a given sample (xp, dp)
E k (l p , k ) 2 k (d p , k o p , k ) 2
Update weights by gradient descent
For weight in w(2, 1):
For weight in w
(1, 0)
wk( 2, ,j1) (E / wk( 2, ,j1) )
w(j1,i,0) (E / w(j1,i,0) )
Derivation of update rule for w(2, 1):

since E is a function of lk = dk ok, ok is a function of
netk( 2 )
wk( 2, ,j1)
is a function of
, by chain rule
netk( 2)
, and
Derivation of update rule for
ok
w(j1,i,0)
wk( 2, ,j1)
consider hidden node j:

(1)
(1, 0 )
net
w
weight j ,i influences
j
(1)
S
(
net
it sends
j ) to all output nodes
w(j1,i,0 )
(1, 0 )
all K terms in E are functions of w j ,i
E k (d k ok ) 2 , ok S (net k( 2) ), net k( 2) j x (j1) wk( 2, ,j1) ,

x (j1) S (net (j1) ), net (j1) i xi w(j1,i,0)
by chain
rule
E
ok
S (net k( 2) )
netk( 2)
netk( 2)
x (j1)
x (j1)
net (j1)
net (j1)
w(j1,i)
Update rules:
for outer layer weights w(2, 1) :
( 2)
(
d
o
)
S
'
(
net
where k
k
k
k )
for inner layer weights w(1, 0) :
where
j (k k wk( 2, ,j1) ) S ' (net (j1) )

Weighted sum of errors
from output layer
Note: if S is a logistic function,

then S(x) = S(x)(1 S(x))
Pattern classification: an example
Classification of myoelectric signals
Input pattern: 2 features, normalized to real values
between -1 and 1
Output patterns: 3 classes
Network structure: 2-5-3

2 input nodes, 3 output nodes,
1 hidden layer of 5 nodes
= 0.95, = 0.4 (momentum)
Error bound e = 0.05

332 training samples
Maximum iteration = 20,000
When stopped, 38 patterns remain misclassified
38 patterns misclassified
Strengths of BP Learning
Great representation power
Any L2 function can be represented by a BP net
Many such functions can be approximated by BP learning
(gradient descent approach)
Easy to apply
Only requires that a good set of training samples is
available
Does not require substantial prior knowledge or deep
understanding of the domain itself (ill structured problems)
Tolerates noise and missing data in training samples
(graceful degrading)
Easy to implement the core of the learning algorithm

Good generalization power
Often produce accurate results for inputs outside the
training set
Deficiencies of BP Learning
Learning often takes a long time to converge
Complex functions often need hundreds or thousands of
epochs
The net is essentially a black box

It may provide a desired mapping between input and output
vectors (x, o) but does not have the information of why a
particular x is mapped to a particular o.
It thus cannot provide an intuitive (e.g., causal) explanation
for the computed result.
This is because the hidden nodes and the learned weights do
not have clear semantics.
What can be learned are operational parameters, not general,
abstract knowledge of a domain
Unlike many statistical methods, there is no theoretically wellfounded way to assess the quality of BP learning
What is the confidence level of o computed from input x using
such net?
What is the confidence level for a trained BP net, with the final
E (which may or may not be close to zero)?
Problem with gradient descent approach

only guarantees to reduce the total error to a local
minimum. (E may not be reduced to zero)
Cannot escape from the local minimum error state
Not every function that is representable can be
learned
How bad: depends on the shape of the error surface.

Too many valleys/wells will make it easy to be trapped
in local minima
Possible remedies:
Try nets with different # of hidden layers and hidden
nodes (they may lead to different error surfaces, some
might be better than others)
Try different initial weights (different starting points on the
surface)
Forced escape from local minima by random perturbation
(e.g., simulated annealing)
Generalization is not guaranteed even if the error

is reduced to 0
Over-fitting/over-training problem: trained net fits the training
samples perfectly (E reduced to 0) but it does not give accurate
outputs for inputs not in the training set
Possible remedies:
More and better samples
Using smaller net if possible
Using larger error bound
(forced early termination)
Introducing noise into samples
modify (x1,, xn) to (x1+1,
, xn+n) where i are small
random displacements
Cross-Validation
leave some (~10%) samples as test data (not used for weight update)
periodically check error on test data
learning stops when error on test data starts to increase
Network paralysis with sigmoid activation function

Saturation regions:
S ( x) 1 /(1 e x ), its derivative S ' ( x) S ( x)(1 S ( x)) 0

when x .
When x falls in a saturation region, S ( x) hardly changes its value
regardless how fast the magnitude of x increases
Input to an node may fall into a saturation region

when some of its incoming weights become very
large during learning. Consequently, weights stop to
change no matter how hard you try.
Possible remedies:
Use non-saturating activation functions
Periodically normalize all weights
wk , j : wk , j / w.k
The learning (accuracy, speed, and generalization)

is highly dependent of a set of learning
parameters
Initial weights, learning rate, # of hidden layers and #
of nodes...
Most of them can only be determined empirically
(via experiments)
Practical Considerations
A good BP net requires more than the core of the learning
algorithms. Many parameters must be carefully selected
to ensure a good performance.
Although the deficiencies of BP nets cannot be
completely cured, some of them can be eased by some
practical means.
Initial weights (and biases)
Random, [-0.05, 0.05], [-0.1, 0.1], [-1, 1]
Avoid bias in weight initialization
Normalize weights for hidden layer (w(1, 0)) (Nguyen-Widrow)
Random assign initial weights for all hidden nodes

For each hidden node j, normalize its weight by
w(j1,i,0) w(j1,i,0) / w(j1,0)
where 0.7 n m
m # of hiddent nodes, n # of input nodes
w(j1,0)
after normalization
Training samples:
Quality and quantity of training samples often determines the
quality of learning results
Samples must collectively represent well the problem space
Random sampling
Proportional sampling (with prior knowledge of the problem
space)
# of training patterns needed: There is no theoretically idea
number.
Baum and Haussler (1989): P = W/e, where
W: total # of weights to be trained (depends on net structure)
e: acceptable classification error rate
If the net can be trained to correctly classify (1 e/2)P of the
P training samples, then classification accuracy of this net is
1 e for input patterns drawn from the same sample space
Example: W = 27, e = 0.05, P = 540. If we can successfully
train the network to correctly classify (1 0.05/2)*540 = 526
of the samples, the net will work correctly 95% of time with
other input.
How many hidden layers and hidden nodes

per layer:
Theoretically, one hidden layer (possibly with many
hidden nodes) is sufficient for any L2 functions
There is no theoretical results on minimum necessary
# of hidden nodes
Practical rule of thumb:
n = # of input nodes; m = # of hidden nodes
For binary/bipolar data: m = 2n
For real data: m >> 2n
Multiple hidden layers with fewer nodes may be trained
faster for similar quality in some applications
Example: compressing character bitmaps.

Each character is represented by a 7 by 9 pixel
bitmap, or a binary vector of dimension 63
10 characters (A J) are used in experiment
Error range:
tight: 0.1 (off: 0 0.1; on: 0.9 1.0)
loose: 0.2 (off: 0 0.2; on: 0.8 1.0)
Relationship between # hidden nodes, error

range, and convergence rate
relaxing error range may speed up
increasing # hidden nodes (to a point) may
speed up
error range: 0.1 hidden nodes: 10 # epochs: 400+
no noticeable speed up when # hidden nodes increases
to beyond 22
Other applications.
Medical diagnosis
Input: manifestation (symptoms, lab tests, etc.)
Output: possible disease(s)
Problems:
no causal relations can be established
hard to determine what should be included as
inputs
Currently focus on more restricted diagnostic tasks
e.g., predict prostate cancer or hepatitis B based
on standard blood test
Process control
Input: environmental parameters
Output: control parameters
Learn ill-structured control functions
Stock market forecasting

Input: financial factors (CPI, interest rate, etc.) and
stock quotes of previous days (weeks)
Output: forecast of stock prices or stock indices
(e.g., S&P 500)
Training samples: stock market data of past few
years
Consumer credit evaluation
Input: personal financial information (income, debt,
payment history, etc.)
Output: credit rating
And many more
Key for successful application
Careful design of input vector (including all
important features): some domain knowledge
Obtain good training samples: time and other cost
Summary of BP Nets
Architecture
Multi-layer, feed-forward (full connection between
nodes in adjacent layers, no connection within a layer)
One or more hidden layers with non-linear activation
function (most commonly used are sigmoid functions)
BP learning algorithm
Supervised learning (samples (xp, dp))
Approach: gradient descent to reduce the total error
w E / w
(why it is also called generalized delta rule)

Error terms at output nodes
error terms at hidden nodes (why it is called error BP)
Ways to speed up the learning process
Adding momentum terms
Adaptive learning rate (delta-bar-delta)
Quickprop
Generalization (cross-validation test)
Strengths of BP learning
Great representation power
Wide practical applicability
Easy to implement
Good generalization power
Problems of BP learning
Learning often takes a long time to converge
The net is essentially a black box
Gradient descent approach only guarantees a local minimum error
Not every function that is representable can be learned
Generalization is not guaranteed even if the error is reduced to zero
No well-founded way to assess the quality of BP learning
Network paralysis may occur (learning is stopped)
Selection of learning parameters can only be done by trial-and-error
BP learning is non-incremental (to include new training samples, the
network must be re-trained with all old and new samples)
Experiments
Stock Prediction
Stock prediction is a difficult task due to the nature of the stock data
which is very noisy and time varying.
The efficient market hypothesis claim that future price of the stock is
not predictable based on publicly available information.
However theory has been challenged by many studies and a few
researchers have successfully applied machine learning approach
such as neural network to perform stock prediction
?Is the Market Predictable
Efficient Market Hypothesis (EMH) (Fama, 1965)

Stock market is efficient in that the current market prices reflect all information
available to traders, so that future changes cannot be predicted relying on past prices
or publicly available information.
Murphy's law : Anything that can go wrong will go wrong.

Fama et al. (1988) showed that 25% to 40% of the variance in
the stock returns over the period of three to five years is
predictable from past return
Pesaran and Timmerman (1999) conclude that the UK stock market is
predictable for the past 25 years.
Saad (1998) has successfully employed different neural network models
to predict the trend of various stocks on a short-term range
Optimistic report
Implementation
In this paper we propose to investigate SVM, MLP and RBF network

for the task of predicting the future trend of the 3 major stock indices
a) Kuala Lumpur Composite Index (KLCI)
b) Hongkong Hangseng index
c) Nikkei 225 stock index
using input based on technical indicators.
This paper approach the problem based on 2 class pattern
classification formulated specifically to assist investor in making
trading decisions
The classifier is asked to recognise investment opportunities that
can give a return of r% or more within the next h days. r=3% h=10
days
System Block Diagram
The classifier is to predict if the trend of the stock index increment of

more than 3% within the next 10 days period can be achieved.
Data from
daily
historical
data
converted
into
technical
analysis
indicator
Increment Achievable ??
Classifier
Yes / No
Data Used
Kuala Lumpur Stock Index (KLCI) for the period of 1992-1997
Data Used
Hangseng index (20/4/1992-1/9/1997)
Data Used
Nikkei 225 stock index (20/4/1982-1/9/1987)
TABLE 1: DESCRIPTION OF INPUT TO CLASSIFIER
xi i=1,2,3 .12 n=15
Input to Classifier
DLN (t) = sign[q(t)-q(t-N)] * ln (q(t)/q(t-N) +1) (1)

q(t) is the index level at day t and DLN (t) is the actual input to the classifier.
Prediction Formulation
Consider ymax(t) as the maximum upward movement of the stock

index value within the period t and t + . y(t) represents the stock
index level at day t
Classification
The prediction of stock trend is formulated as a two class
classification problem.
yr(t) > r% >> Class 2
yr(t) r% >> Class 1
Classification
Let (xi , yi ) 1<i<N be a set of N training examples, each input example

xi
Rn n=15 being the dimension of the input space, belongs to a class
labelled by yi +1,-1.
Yi =-1
Yi =+1
Performance Measure
True Positive (TP) is the number of positive
predicted correctly as positive class.
False Positive (FP) is the number of negative
predicted wrongly as positive class.
False Negative (FN) is the number of positive
predicted wrongly as negative class.
True Negative (TN) is the number of negative
predicted correctly as negative class.
class
class
class
class
Performance Measure
Accuracy = TP+TN / (TP+FP+TN+FN)

Precision = TP/(TP+FP)
Recall rate (sensitivity) = TP/(TP+FN)
F1 = 2 * Precision * Recall/(Precision + Recall)
Testing Method
Rolling Window Method is Used to Capture Training and
Test Data
Train
Test
Train =600 data Test= 400 data
Experiment and Result

Experiments are conducted to predict the stock trend of
three major stock indexes, KLCI, Hangseng and Nikkei.
SVM, MLP and RBF network is used in making trend
prediction based on classification and regression
approach.
A hypothetical trading system is simulated to find out the
annualized profit generated based on the given
prediction.
Trading Performance
A hypothetical trading system is used
When a positive prediction is made, one unit of money
was invested in a portfolio reflecting the stock index. If
the stock index increased by more than r% (r=3%) within
the next h days (h=10) at day t, then the investment is
sold at the index price of day t. If not, the investment is
sold on day t+1 regardless of the price. A transaction fee
of 1% is charged for every transaction made.
Use annualised rate of return .
Trading Performance
Classifier Evaluation Using Hypothetical Trading
System
Trading Performance

Classification Result

The result shows better performance of neural network
techniques when compared to K nearest neighbour
classifier. SVM shows the overall better performance on
average than MLP and RBF network in most of the
performance metric used

Comparison of Receiver Operating Curve (ROC)

Area under Curve (ROC)
Conclusion
We have investigated the SVM, MLP and RBF network as a

classifier and regressor to assess it's potential in the stock trend
prediction task
Support vector machine (SVM) has shown better performance

when compared to MLP and RBF .
SVM classifier with probabilistic output outperform MLP and RBF

network in terms of error-reject tradeoff
Both the classification and regression model can be used for a

profitable trend prediction system. The classification model has the
advantage in which pattern rejection scheme can be incorporated.
This report
Implementation
OnlineSVR by Francesco Parrella
http://onlinesvr.altervista.org/
BPN by Karsten Kutza
http://www.neural-networks-at-your-fingertips.com/
Results
Basically zero correlation between prediction and the actual

outcome
Suffer from many technical failures
Still have faith that these methods (when applied correctly) can
predict the future better then a random guess
Tried many sorts of topologies of the BPN and the input values to
SVM, looks like the secret does not lie there
Future investigation, use wavelets/noiselets coefficients as inputs
References
http://www.cs.unimaas.nl/datamining/slides2009/svm_presentation.ppt
http://merlot.stat.uconn.edu/~lynn/svm.ppt
http://www.cs.bham.ac.uk/~axk/ML_SVM05.ppt
http://www.stanford.edu/class/msande211/KKTgeometry.ppt
http://www.csee.umbc.edu/~ypeng/F09NN/lecture-notes/NN-Ch3.ppt
http://fit.mmu.edu.my/caiic/reports/report04/mmc/haris.ppt
http://www.youtube.com/watch?v=oQ1sZSCz47w
Google, Wikipedia and others

Pattern Recognition in Stock Market3947

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Pattern Recognition in Stock Market3947

Uploaded by

Copyright:

Available Formats

pattern recognition

To understand is to perceive patterns

Number of art books purchased

Linear Support Vector Machines

A direct marketing company wants to sell a

Months since last purchase

Months since last purchase

Linear SVM: Separable Case

Number of art books purchased

separate groups by a line.

However: There are infinitely many lines that

which line shall we choose?

Months since last purchase

Number of art books purchased

Linear SVM: Separable Case

SVM use the idea of a margin around the

The thinner the margin,

the more complex the model,

Months since last purchase

The best line is the one with the

Linear SVM: Separable Case

Number of art books purchased

The line having the largest margin is:

x1 = months since last purchase

x2 = number of art books purchased

Months since last purchase

Linear SVM: Separable Case

Number of art books purchased

Months since last purchase

yi (wx i b) 1 0 for all i

Linear SVM: Separable Case

The optimization problem for SVM is:

Linear SVM: Separable Case

Support vectors are those points that lie

The decision surface (line) is determined

Linear SVM: Nonseparable Case

Here, SVM minimize L(w,C) :

margin and error vectors

Linear SVM: The Role of C

smaller number errors

bigger number errors

( better fit on the data )

( worse fit on the data )

Nonlinear SVM: Nonseparable Case

Mapping into a higher-dimensional space

Optimization task: minimize L(w,C)

Nonlinear SVM: Nonseparable Case

Map the data into higher-dimensional space:

Nonlinear SVM: Nonseparable Case

Nonlinear SVM: Nonseparable Case

Nonlinear SVM: Nonseparable Case

Nonlinear SVM: Nonseparable Case

Take derivatives wrt. w and b, equate them to 0

parameters are expressed as a linear

The Lagrange multipliers i are called dual variables

( xi ) substitute for each

Artificial Neural Networks

The architecture of the neural network refers to the arrangement

Multilayer Perceptron (MLP)

Neuron processing element

Learning: supervised, error driven,

w(1,0) from input layer (0) to hidden layer (1)

When |net| is sufficiently large, it moves into one of the

Chain rule of differentiation dz