Pattern Recognition in the Stock Market
Introduction
Motivation
Our time is limited; better not to waste it working
Lifestyle costs money
Create something to do the job for you
MetaTrader
Online broker platform
Lets you trade foreign currencies, stocks and indexes
MetaQuotes Language (MQL), similar to C, allows you to program buying and selling
Can be linked with dynamic-link libraries (DLLs)
Pattern recognition
Pattern recognition aims to classify data
(patterns) based either on a priori knowledge or
on statistical information extracted from the
patterns. The patterns to be classified are usually
groups of measurements or observations,
defining points in an appropriate
multidimensional space.
SVM
Problem: how to identify buyers and non-buyers using the two variables x1 and x2?

[Figure: buyers and non-buyers plotted in the (x1, x2) plane, separated by a hyperplane with a margin on either side]

The separating hyperplane:

w1x1 + w2x2 + b = 0
The margin boundaries satisfy w1x1 + w2x2 + b = +1 and w1x1 + w2x2 + b = -1.

Note:
w1xi1 + w2xi2 + b >= +1 for buyers i
w1xj1 + w2xj2 + b <= -1 for non-buyers j
[Figure: the margin is the distance between the two boundary hyperplanes]

The width of the margin is

margin = (1 - (-1)) / sqrt(w1^2 + w2^2) = 2 / ||w||

so maximizing the margin is equivalent to minimizing ||w||^2 / 2.
Maximize the margin = minimize ||w||^2 / 2:

minimize L(w) = ||w||^2 / 2
subject to:
w1xi1 + w2xi2 + b >= +1 for buyers i
w1xj1 + w2xj2 + b <= -1 for non-buyers j
[Figure: buyers and non-buyers in the (x1, x2) plane]
Soft margin: maximize the margin and minimize the training errors.

minimize L(w, C) = ||w||^2 + C * sum_i xi_i   (L(w, C) = Complexity + Errors)
subject to:
w1xi1 + w2xi2 + b >= +1 - xi_i for buyers i
w1xj1 + w2xj2 + b <= -1 + xi_j for non-buyers j
xi_i, xi_j >= 0
Given vectors Xi with labels yi = +/-1, the classifier is y = sign(w . X + b).

min over w, b:  (1/2)||w||^2 + C * sum_i [1 - yi(w . Xi + b)]

The support vectors S are the samples with yi(w . Xi + b) = 1, i in S, and

w = sum_{i in S} alpha_i yi Xi,   so   y = sign( sum_{i in S} alpha_i yi Xi . X + b )
[Figure: the same data separated with C = 5 (thinner margin) and with C = 1 (wider margin)]

Bigger C: increased complexity (thinner margin)
Smaller C: decreased complexity (wider margin)

C varies both complexity and empirical error by affecting the optimal w and the optimal
number of training errors.
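The soft-margin objective above can be minimized directly by subgradient descent on the hinge loss. Below is a minimal numpy sketch on a toy buyers/non-buyers data set; the data, learning rate, and epoch count are all illustrative choices, not details from the report:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Soft-margin linear SVM via subgradient descent on
    L(w, b) = 0.5 * ||w||^2 + C * sum(max(0, 1 - y_i (w.x_i + b)))."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                    # inside the margin or misclassified
        grad_w = w - C * (y[viol] @ X[viol])  # regularizer + hinge subgradient
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy "buyers" (+1) vs "non-buyers" (-1) in two variables x1, x2
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_linear_svm(X, y, C=5.0)
pred = np.sign(X @ w + b)
```

On this separable toy set the learned hyperplane classifies every point correctly; a larger C would penalize margin violations more heavily, as described above.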
Non-linear SVMs
Transform x -> phi(x)
The linear algorithm depends only on x . xi, hence the
transformed algorithm depends only on phi(x) . phi(xi)
Use a kernel function K(xi, xj) such that K(xi, xj) = phi(xi) . phi(xj)
For example, the quadratic map phi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2) sends each sample (xi1, xi2), i = 1, ..., l, to the three features (xi1^2, sqrt(2) xi1 xi2, xi2^2).
In the transformed space:

minimize L(w, C) = ||w||^2 + C * sum_i xi_i
subject to:
w1 xi1^2 + w2 sqrt(2) xi1 xi2 + w3 xi2^2 + b >= +1 - xi_i for buyers i
w1 xj1^2 + w2 sqrt(2) xj1 xj2 + w3 xj2^2 + b <= -1 + xi_j for non-buyers j
Example (XOR-like data): the points (1, 1) and (-1, -1) form one class and (1, -1) and (-1, 1) the other; no line in the (x1, x2) plane separates them.

[Figure: the four points in the (x1, x2) plane, and their images in the (x1^2, sqrt(2) x1 x2, x2^2) space]

Under phi, both (1, 1) and (-1, -1) map to (1, sqrt(2), 1), while (1, -1) and (-1, 1) map to (1, -sqrt(2), 1). The two classes collapse onto two points and are separated by the plane sqrt(2) x1 x2 = 0.
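The mapping can be checked numerically; a small sketch (the point labels follow the example above):

```python
import numpy as np

def phi(x):
    # quadratic feature map (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

# XOR-like labeling: same-sign points are one class, mixed-sign the other
points = {(1, 1): 1, (-1, -1): 1, (1, -1): -1, (-1, 1): -1}
mapped = {x: phi(x) for x in points}
# after the map, the middle feature sqrt(2)*x1*x2 alone separates the classes
```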
Primal:
min over w:  (1/2)||w||^2 + C * sum_i xi_i
subject to:  yi(w . xi + b) >= 1 - xi_i,  xi_i >= 0

Dual:
max over alpha:  sum_i alpha_i - (1/2) sum_i sum_j alpha_i alpha_j yi yj (xi . xj)
subject to:  0 <= alpha_i <= C,  sum_i alpha_i yi = 0
In the transformed space the dual depends only on dot products:

phi(xi) . phi(xj) = (xi1^2, sqrt(2) xi1 xi2, xi2^2) . (xj1^2, sqrt(2) xj1 xj2, xj2^2) = (xi . xj)^2

K(xi, xj) = phi(xi) . phi(xj)   (kernel function)

max over alpha:  sum_i alpha_i - (1/2) sum_i sum_j alpha_i alpha_j yi yj K(xi, xj)
subject to:  0 <= alpha_i <= C,  sum_i alpha_i yi = 0
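The identity phi(xi) . phi(xj) = (xi . xj)^2 is easy to verify numerically; a short sketch:

```python
import numpy as np

def phi(x):
    # quadratic feature map matching the derivation above
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, y):
    return float(np.dot(x, y)) ** 2   # polynomial kernel (x . y)^2

rng = np.random.default_rng(0)
pairs = [(rng.normal(size=2), rng.normal(size=2)) for _ in range(100)]
max_err = max(abs(np.dot(phi(a), phi(b)) - K(a, b)) for a, b in pairs)
```

The kernel evaluates the same quantity without ever forming phi(x), which is the point of the kernel trick.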
Solving
Construct and minimise the Lagrangian

L(w, b, alpha) = (1/2)||w||^2 - sum_{i=1..N} alpha_i [yi(w . xi + b) - 1]

with respect to the constraints alpha_i >= 0, i = 1, ..., N.

dL(w, b, alpha)/db = -sum_{i=1..N} alpha_i yi = 0

KKT condition:  alpha_i [yi(w . xi + b) - 1] = 0
Applications
Handwritten digit recognition
Of interest to the US Postal Service
4% error was obtained
only about 4% of the training data were SVs
Text categorisation
Face detection
DNA analysis
Architecture of SVMs
Nonlinear classifier (using a kernel)
Decision function:

f(x) = sgn( sum_{i=1..l} vi (phi(x) . phi(xi)) + b )
     = sgn( sum_{i=1..l} vi k(x, xi) + b )
Neural Network
Taxonomy of neural network architectures:
Feed-forward network (multilayer perceptron)
Recurrent network
MLP Structure
[Figure: an input vector (x1, ..., xn) feeding a hidden layer (h1, h2, ...), which feeds an output layer (O1); each node applies the activation F(y) to its weighted input y = sum_i wi xi]
Backpropagation Learning
Architecture:
Feedforward network of at least one layer of non-linear
hidden nodes, i.e., number of layers L >= 2 (not counting the input
layer)
Node function is differentiable
most common: sigmoid function
Backpropagation Learning
Notations:
Training samples: {(xp, dp) | p = 1, ..., P}
Input: xp = (xp,1, ..., xp,n)
Computed output: op = (op,1, ..., op,K)
Desired output: dp = (dp,1, ..., dp,K)
Weights: two weight matrices, w(1,0) (input to hidden) and w(2,1) (hidden to output)
Error: lp,k = dp,k - op,k is the error for output node k when xp is applied
Sum square error: E = sum_{p=1..P} sum_{k=1..K} (lp,k)^2
This error drives learning (change w(1,0) and w(2,1))
Backpropagation Learning
Sigmoid function again:

S(x) = 1 / (1 + e^-x)

Differentiable:

S'(x) = (1 / (1 + e^-x))'
      = e^-x / (1 + e^-x)^2
      = (1 / (1 + e^-x)) * (e^-x / (1 + e^-x))
      = S(x)(1 - S(x))

[Figure: sigmoid curve with saturation regions at both ends, where S'(x) is near 0]

Chain rule: if z = f(y), y = g(x), x = h(t), then
dz/dt = (dz/dy)(dy/dx)(dx/dt) = f'(y) g'(x) h'(t)
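The derivative identity S'(x) = S(x)(1 - S(x)) can be sanity-checked against a finite difference; a small numpy sketch:

```python
import numpy as np

def S(x):
    return 1.0 / (1.0 + np.exp(-x))

def S_prime(x):
    # analytic derivative: S'(x) = S(x) (1 - S(x))
    return S(x) * (1 - S(x))

# compare against a central finite difference on a grid
x = np.linspace(-6.0, 6.0, 101)
h = 1e-5
numeric = (S(x + h) - S(x - h)) / (2 * h)
max_gap = np.max(np.abs(numeric - S_prime(x)))
# in the saturation regions the derivative is tiny, which is what
# slows gradient-descent learning there
```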
Backpropagation Learning
Forward computing:
Apply an input vector x to the input nodes
Compute the output vector x(1) on the hidden layer:

xj(1) = S(netj(1)) = S( sum_i wj,i(1,0) xi )

Objective of learning:
Modify the two weight matrices to reduce the sum square error
E = sum_{p=1..P} sum_{k=1..K} (lp,k)^2 for the given P training samples as much
as possible (to zero if possible)
Backpropagation Learning
Idea of BP learning:
Update of weights in w(2,1) (from hidden layer to output
layer):
delta rule as in a single-layer net using sum square error
Delta rule is not applicable to updating weights in w(1,0)
(from input layer to hidden layer) because we don't know the
desired values for hidden nodes
Solution: propagate the errors at output nodes down to the
hidden nodes; these computed errors on hidden nodes
drive the update of weights in w(1,0) (again by delta rule),
hence the name error Back Propagation (BP) learning
How to compute errors on hidden nodes is the key
Error backpropagation can be continued downward if the
net has more than one hidden layer
Proposed first by Werbos (1974); current formulation by
Rumelhart, Hinton, and Williams (1986)
Backpropagation Learning
Generalized delta rule:
Consider sequential learning mode: for a given sample (xp, dp),

E = sum_k (lp,k)^2 = sum_k (dp,k - op,k)^2

Update weights by gradient descent:
For a weight in w(2,1):  delta wk,j(2,1) = -eta (dE / dwk,j(2,1))
For a weight in w(1,0):  delta wj,i(1,0) = -eta (dE / dwj,i(1,0))
Backpropagation Learning
Derivation of the update rule for wj,i(1,0):
since hidden node j sends xj(1) = S(netj(1)) to all output nodes,
all K terms in E are functions of wj,i(1,0). By the chain rule,

dE/dwj,i(1,0) = sum_k (dE/dok)(dok/dnetk(2))(dnetk(2)/dxj(1))(dxj(1)/dnetj(1))(dnetj(1)/dwj,i(1,0))
Backpropagation Learning
Update rules:
for the outer layer weights w(2,1):

delta wk,j(2,1) = eta delta_k xj(1),  where delta_k = (dk - ok) S'(netk(2))

for the hidden layer weights w(1,0):

delta wj,i(1,0) = eta mu_j xi,  where mu_j = ( sum_k delta_k wk,j(2,1) ) S'(netj(1))
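Putting the forward pass and the two update rules together, a minimal batch BP net can be sketched in numpy. This is an illustrative toy (XOR data, 4 hidden nodes, a learning rate and epoch count chosen arbitrarily), not the network used later in the experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
S = lambda x: 1.0 / (1.0 + np.exp(-x))   # sigmoid node function

# XOR training set: inputs x_p, desired outputs d_p
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

# two weight matrices w(1,0) and w(2,1), plus biases
W1 = rng.uniform(-1, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.uniform(-1, 1, (4, 1)); b2 = np.zeros(1)

eta = 0.5
for _ in range(20000):
    # forward computing
    H = S(X @ W1 + b1)            # hidden layer outputs x(1)
    O = S(H @ W2 + b2)            # output layer o
    # generalized delta rule; S'(net) = S(net)(1 - S(net)) = O(1 - O)
    delta2 = (D - O) * O * (1 - O)            # delta_k at output nodes
    delta1 = (delta2 @ W2.T) * H * (1 - H)    # errors propagated to hidden nodes
    W2 += eta * H.T @ delta2; b2 += eta * delta2.sum(axis=0)
    W1 += eta * X.T @ delta1; b1 += eta * delta1.sum(axis=0)

pred = (S(S(X @ W1 + b1) @ W2 + b2) > 0.5).astype(float)
```

After training, thresholding the output at 0.5 reproduces the XOR targets, which a single-layer net could not learn.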
Backpropagation Learning
Pattern classification: an example
Classification of myoelectric signals
Input pattern: 2 features, normalized to real values
between -1 and 1
Output patterns: 3 classes
38 patterns misclassified
Strengths of BP Learning
Great representation power
Any L2 function can be represented by a BP net
Many such functions can be approximated by BP learning
(gradient descent approach)
Easy to apply
Only requires that a good set of training samples is
available
Does not require substantial prior knowledge or deep
understanding of the domain itself (ill structured problems)
Tolerates noise and missing data in training samples
(graceful degradation)
Deficiencies of BP Learning
Learning often takes a long time to converge
Complex functions often need hundreds or thousands of
epochs
Unlike many statistical methods, there is no theoretically well-founded way to assess the quality of BP learning
What is the confidence level of o computed from input x using
such net?
What is the confidence level for a trained BP net, with the final
E (which may or may not be close to zero)?
Practical Considerations
A good BP net requires more than the core learning
algorithm. Many parameters must be carefully selected
to ensure good performance.
Although the deficiencies of BP nets cannot be
completely cured, some of them can be eased by some
practical means.
Initial weights (and biases)
Random, [-0.05, 0.05], [-0.1, 0.1], [-1, 1]
Avoid bias in weight initialization
Normalize weights for the hidden layer w(1,0) (Nguyen-Widrow):
wj(1,0) := beta wj(1,0) / ||wj(1,0)|| after normalization,
where beta = 0.7 m^(1/n) (n input nodes, m hidden nodes)
Training samples:
Quality and quantity of training samples often determines the
quality of learning results
Samples must collectively represent well the problem space
Random sampling
Proportional sampling (with prior knowledge of the problem
space)
Number of training patterns needed: there is no theoretically ideal
number.
Baum and Haussler (1989): P = W/e, where
W: total number of weights to be trained (depends on net structure)
e: acceptable classification error rate
If the net can be trained to correctly classify (1 - e/2) * P of the
P training samples, then the classification accuracy of this net is
1 - e for input patterns drawn from the same sample space
Example: W = 27, e = 0.05, P = 540. If we can successfully
train the network to correctly classify (1 - 0.05/2) * 540 = 526
of the samples, the net will work correctly 95% of the time on
other input.
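The worked example is simple arithmetic:

```python
# a quick check of the Baum-Haussler sizing rule P = W / e
W = 27          # total number of weights to be trained
e = 0.05        # acceptable classification error rate
P = int(W / e)  # number of training samples needed
correct_needed = int((1 - e / 2) * P)   # samples that must be classified correctly
```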
Other applications.
Medical diagnosis
Input: manifestation (symptoms, lab tests, etc.)
Output: possible disease(s)
Problems:
no causal relations can be established
hard to determine what should be included as
inputs
Currently focus on more restricted diagnostic tasks
e.g., predict prostate cancer or hepatitis B based
on standard blood test
Process control
Input: environmental parameters
Output: control parameters
Learn ill-structured control functions
Summary of BP Nets
Architecture
Multi-layer, feed-forward (full connection between
nodes in adjacent layers, no connection within a layer)
One or more hidden layers with non-linear activation
function (most commonly used are sigmoid functions)
BP learning algorithm
Supervised learning (samples (xp, dp))
Approach: gradient descent to reduce the total error
delta w = -eta dE / dw
Strengths of BP learning
Great representation power
Wide practical applicability
Easy to implement
Good generalization power
Problems of BP learning
Learning often takes a long time to converge
The net is essentially a black box
Gradient descent approach only guarantees a local minimum error
Not every function that is representable can be learned
Generalization is not guaranteed even if the error is reduced to zero
No well-founded way to assess the quality of BP learning
Network paralysis may occur (learning is stopped)
Selection of learning parameters can only be done by trial-and-error
BP learning is non-incremental (to include new training samples, the
network must be re-trained with all old and new samples)
Experiments
Stock Prediction
Stock prediction is a difficult task due to the nature of stock data,
which is very noisy and time-varying.
The efficient market hypothesis claims that the future price of a stock is
not predictable based on publicly available information.
However, this theory has been challenged by many studies, and a few
researchers have successfully applied machine learning approaches
such as neural networks to perform stock prediction
Optimistic report
Implementation
[Flow diagram: daily historical data -> converted into technical analysis indicators -> classifier -> "Increment achievable?" -> Yes / No]
Data Used
Nikkei 225 stock index (20/4/1982 - 1/9/1987)
Input to Classifier
Prediction Formulation
Classification
The prediction of the stock trend is formulated as a two-class
classification problem:
yr(t) > r%   ->  Class 2 (yi = +1)
yr(t) <= r%  ->  Class 1 (yi = -1)
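As a sketch, the labeling rule is a one-liner (the function name and the percent convention are illustrative, not from the report):

```python
def label(yr, r=3.0):
    """Map the return yr(t) (in percent) to a class label:
    +1 (Class 2) if yr(t) > r%, else -1 (Class 1)."""
    return 1 if yr > r else -1
```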
Performance Measure
True Positive (TP) is the number of positive-class samples
predicted correctly as the positive class.
False Positive (FP) is the number of negative-class samples
predicted wrongly as the positive class.
False Negative (FN) is the number of positive-class samples
predicted wrongly as the negative class.
True Negative (TN) is the number of negative-class samples
predicted correctly as the negative class.
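These four counts can be computed as follows (a minimal sketch; labels are assumed to be +/-1 as in the prediction formulation):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for labels in {+1, -1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == -1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == -1 for t, p in zip(y_true, y_pred))
    tn = sum(t == -1 and p == -1 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn
```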
Testing Method
A rolling window method is used to capture training and
test data
[Figure: a Train window followed by a Test window, sliding forward through the series]
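One way to generate such windows (the sizes and step are illustrative; the report's actual window sizes are not shown here):

```python
def rolling_windows(n, train_size, test_size):
    """Yield (train_indices, test_indices) pairs that slide forward
    through n time-ordered samples by test_size each step."""
    start = 0
    while start + train_size + test_size <= n:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += test_size

windows = list(rolling_windows(10, train_size=4, test_size=2))
```

Each test window uses only past data for training, which avoids look-ahead bias in a time series.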
Trading Performance
A hypothetical trading system is used.
When a positive prediction is made, one unit of money
is invested in a portfolio reflecting the stock index. If
the stock index increases by more than r% (r = 3%) within
the next h days (h = 10) at day t, then the investment is
sold at the index price of day t. If not, the investment is
sold on day t + h regardless of the price. A transaction fee
of 1% is charged for every transaction made.
The annualised rate of return is used as the measure.
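The trading rule can be sketched as follows, assuming the fallback sale happens at the end of the h-day horizon and that the 1% fee applies to both the buy and the sell (both are readings of the description above, not details confirmed by the report):

```python
def trade(prices, signals, r=0.03, h=10, fee=0.01):
    """Hedged sketch of the hypothetical trading rule: on a positive
    prediction at day t0, invest one unit; sell as soon as the index
    is up more than r within the next h days, otherwise sell at
    day t0 + h."""
    profit = 0.0
    for t0, sig in enumerate(signals):
        if sig <= 0 or t0 + h >= len(prices):
            continue
        buy = prices[t0]
        sell = prices[t0 + h]              # default: sell at the horizon
        for t in range(t0 + 1, t0 + h + 1):
            if prices[t] / buy - 1 > r:
                sell = prices[t]           # take profit early
                break
        profit += sell / buy - 1 - 2 * fee # fee charged on buy and on sell
    return profit
```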
Trading Performance
Classifier Evaluation Using Hypothetical Trading
System
Conclusion
This report
Implementation
OnlineSVR by Francesco Parrella
http://onlinesvr.altervista.org/
BPN by Karsten Kutza
http://www.neural-networks-at-your-fingertips.com/
Results
We still have faith that these methods (when applied correctly) can
predict the future better than a random guess.
We tried many topologies for the BPN and many input values for the
SVM; it looks like the secret does not lie there.
References
http://www.cs.unimaas.nl/datamining/slides2009/svm_presentation.ppt
http://merlot.stat.uconn.edu/~lynn/svm.ppt
http://www.cs.bham.ac.uk/~axk/ML_SVM05.ppt
http://www.stanford.edu/class/msande211/KKTgeometry.ppt
http://www.csee.umbc.edu/~ypeng/F09NN/lecture-notes/NN-Ch3.ppt
http://fit.mmu.edu.my/caiic/reports/report04/mmc/haris.ppt
http://www.youtube.com/watch?v=oQ1sZSCz47w
Google, Wikipedia and others