Stock Markets
Elaborated by:
Oussema GAZZEH
Third year engineering student - EGES
within:
Quant-Dev
Abstract
Today, agents in financial markets use sophisticated trading systems with a wide range of algorithms to support their trading. A critical component of this process is to model markets and make accurate predictions.
The problem is a multi-class classification task. One main part of this project is the feature engineering, which is based on trades and quotes available from the Onetick database. The modelling phase is carried out with logistic regression algorithms. During this phase, a first model is fitted using all available features. This classifier has poor accuracy and high complexity. Then, feature selection techniques are applied and only 40 of the 85 variables are retained. A second model is fitted, but no significant improvement is observed. Finally, we move to a two-level model. The latter validates a good part of our initial objectives.
Keywords: Quantitative Trading, liquidity modeling, trades and quotes, machine learning, multi-class classification algorithms, Logistic Regression.
Acknowledgements
I would like to thank my teacher, Mr. Adel BENZINA, for his collaboration, suggestions and for the help he has given me.
I would also like to thank Mrs. Leila DRIDI for correcting this report.
Finally, I am extremely grateful to my entire Family for their assistance and support
throughout the development of my project.
Contents
Abstract
Acknowledgements
List of Figures
General Introduction
2 Mathematical Background
2.1 Introduction
2.2 Multiclass Classification Models
2.2.1 Performance Evaluation
2.2.2 Multi-Class Techniques
2.3 Multi-Class Logistic Regression
2.3.1 Binary Logistic Regression
2.3.2 Multi-Class Techniques
2.3.3 Features Importance
2.4 Data Pre-processing
2.4.1 Centering and Scaling
2.4.2 Resolve Skewness
Bibliography
General Introduction
Trade is a daily activity in our life and its history extends to the era of Egyptian
rulers. These were the times when people traded food, items and other tools for
daily use.
The basic act of trading was nearly the same until the Telephone invention. This
latter stimulated the born of a whole new era of trading. With computerized
systems this era continued to progress to the point as we know it nowadays.
Despite the fact that trad- ing by itself maintained its principles, new types of trade
mechanisms were introduced. Progress in the area of information and
communication technologies allowed people to trade from different parts of the
world and reduced the trading process time significantly.
Today, agents in financial markets use sophisticated trading systems with a wide
range of algorithms to support their trading. A critical component of this process
is to model markets and make accurate predictions.
More specifically, the aim of our work is to develop a classifier that identifies the most liquid venue based on short-term market information.
This report describes the work we have achieved since last March, when we started our internship at Quant-Dev.
• Conclusion and perspectives: the synthesis of all our efforts, together with some perspectives.
Chapter 1

General Framework of the Project
1.1 Introduction
In this first chapter, we start by presenting the host company Quant-Dev and
the general context of this internship. Then, we give an overview of trading and
markets by introducing the most known US Stock Exchanges and the market
micro-structure. Later on, we set the problem statement. Finally, we detail the objectives.
This project was carried out with one of Quant-Dev's clients, a multi-strategy hedge fund based in New York City that globally trades equities, futures, fixed income and a variety of derivatives.
Trade is a basic economic activity. It includes the purchase and the sale of goods and services, with compensation paid by a buyer to a seller.
1.3.1.1 History
The earliest forms of trade came through prehistoric peoples exchanging anything
of value for food, shelter and clothing. As the idea of exchange for sustenance
became ingrained in cultures worldwide, a physical space known as the
marketplace came into existence.[3]
The development of the first digital stock quote delivery system during the early
1960s marked the beginning of the transition towards fully automated markets.
Through the use of streaming real-time digital stock quotes, brokers could receive
specific market data on demand without having to wait for it to be delivered on
the ticker tape. The late 1980s and early 90s saw the desire to adopt automated
trading practices gravitate from institutional investors to individual retail traders.
As the personal computer became more and more powerful, and internet
connectivity technology evolved, the overwhelming push towards fully automated
markets soon overtook the holdovers from the open outcry system of trade.[5]
The strict rules built into the model attempt to determine the optimal time for an
order to be placed that will cause the least amount of impact on a stock’s price.
Large blocks of shares are usually purchased by dividing the large share block
into smaller lots and allowing the complex algorithms to decide when the smaller
blocks are to be purchased.
In this type of system, the need for a human trader's intervention is minimized and thus decision making is very quick. This enables the system to take advantage of any profit-making opportunities arising in the market long before a human trader can even spot them.
As large institutional investors deal in large amounts of shares, they are the ones who make the most use of algorithmic trading.
HFT is usually used by institutional investors, hedge funds and large investment banks, and it relies on powerful computers.
The major benefit of HFT is that it has improved market liquidity and removed bid-ask spreads that would previously have been too small. This was tested by adding fees on HFT; as a result, bid-ask spreads increased.
This recent practice became dominant around 2005 in the United States and then became established in the international trading system within a few years, posing new regulatory and ethical problems.[5]
Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy.
The 3-Vs rule that defines Big Data is, in our case, confirmed. Volume, Variety and Velocity are the three axes along which 'big' must be a central attribute of the data.[16]
• Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.
• Variety: The type and nature of the data. This helps people who analyze it
to effectively use the resulting insight.
• Velocity: In this context, the speed at which the data is generated and
processed to meet the demands and challenges that lie in the path of growth
and development.
As part of our study, it is very important to highlight the techniques that today make profiling activities increasingly sophisticated, notably by improving data analysis and statistical approaches. These various recent techniques in artificial intelligence have allowed the emergence of data mining. Data mining is defined as "the application of statistical techniques for data analysis and artificial intelligence to data exploration, in order to extract information that is new and useful to the holder of such data". In other words, the interest of data mining relies on the fact that it is a computer tool capable of "making the collected data speak".[1]
A Stock Exchange is a facility where people can buy or sell stocks, bonds, and securities through brokers and traders. Most often the traditional exchange floor is where the selling and buying take place. However, modern trading is now also done through electronic networks for their speed and lower cost. This means dark pools, electronic communication networks, and alternative trading systems are also being utilized as trading locales. Buyers and sellers are called stock investors, who may profit or lose capital depending on whether there is a bull or bear market, respectively. Stock Exchanges are usually preferred by investors for their transparency.[3] The world's two largest stock exchanges are:
• New York Stock Exchange: The New York Stock Exchange, abbreviated as NYSE and nicknamed The Big Board, is an American stock exchange. It is by far the world's largest stock exchange by market capitalization of its listed companies, at US $21.3 trillion as of June 2017. The average daily trading value was approximately US $169 billion.
• Nasdaq Stock Market: The Nasdaq Stock Market is an American stock exchange. It is the second-largest exchange in the world by market capitalization, behind only the New York Stock Exchange. The exchange platform is owned by Nasdaq, Inc.
• Limit orders: These have an inbuilt price limit that must not be breached, a maximum price for buy orders and a minimum price for sell orders. Hence, limit orders can help provide liquidity but risk failing to execute.
After-hours trading is the period of time after the market closes when an investor
can buy and sell securities outside of regular trading hours. Both the New York
Stock Exchange and the Nasdaq Stock Market normally operate from 9:30 a.m. to
4:00 p.m. Eastern Time. However, trades in the after-hours session can be
completed through electronic exchanges anytime between 4:00 p.m. and 8:00
p.m. Eastern Time. These electronic communication networks (ECNs) match
potential buyers and sellers without using a traditional stock exchange.[3] The
development of after-hours trading offers investors the possibility of great gains
with some risks and dangers.
• Less liquidity: There are far more buyers and sellers during regular hours.
During after-hours trading there may be less trading volume for your stock,
and it may be harder to convert shares to cash.
Market data includes information about current prices and recently completed
trades. It is usually referred to as Level I and Level II market data.
• Bid Size: The number of shares an investor is willing to purchase at the bid price.
Level II provides more information than Level I data. Mainly, it doesn’t just show the
highest bid and offer, but also shows bids and offers at other prices.
Quantitative trading has some drawbacks. In fact, financial markets are some of the most dynamic entities that exist. Thus, quantitative trading models must be just as dynamic to be consistently successful. Therefore, many developed models are temporarily lucrative, but they eventually fail when market conditions change.
• Depth: Depth determines the total quantity of buy and sell orders that are available for the asset around the equilibrium price.
• Resiliency: Resiliency indicates how quickly the market recovers from a shock.
It is within this context that this graduation project is performed. The main objective of this internship is to develop a predictive model that allows forecasting the most liquid venue (i.e. stock exchange) in the upcoming short-term horizon.
1.6 Conclusion
In this first chapter, we introduced the general context of the work, after starting with a presentation of the host company Quant-Dev. Then, after becoming familiar with the different financial agents and the market micro-structure, we described the problem statement. Finally, we detailed our objectives.
Chapter 2
Mathematical Background
2.1 Introduction
When the objective is to classify an instance into one specific category, then, we talk
about a classification problem.
And when we have the possibility to choose from more than two categories, it is a multi-class classification problem.
We should not confuse Multi-class classification with multi-label classification, in
which multiple labels are to be predicted for each instance.[20]
At this level, we deal with several ways used to evaluate the performance of different
classification models. In fact, comparing classifiers is not at all an easy task and
there is no single classifier that works best on all given problems.
A clean and simple way to present the prediction results of a classifier is to use a confusion matrix. This is a simple cross-tabulation of the observed and predicted classes for the data.[10]
Diagonal cells denote cases where the classes are correctly predicted, while the off-diagonal cells show the number of errors for each possible case.
Figure 2.1 shows the confusion matrix that allows us to distinguish between correctly classified and incorrectly classified instances. In fact, the green grids show the correctly classified points. In an ideal scenario, all other grids should have zero points.
In addition to this unambiguous visualization, the confusion matrix is used to calculate further evaluation performance metrics. These latter are based, essentially, on the following terms.
• True positive (TP): Items that actually belong to the positive class and are
correctly included in the positive class.
• False positive (FP): Items that actually belong to the negative class and are
incorrectly included in the positive class.
• True negative (TN): Items that actually belong to the negative class and are
correctly included in the negative class.
• False negative (FN): Items that actually belong to the positive class and are
incorrectly included in the negative class.
In Multi-class Classification, where there are more than two classes, the above items
are specified, separately, for each class.
$$FN_i = \text{Sum(Row}_i) = \sum_{\substack{k=1 \\ k \neq i}}^{N} \text{Confusion Matrix}[i, k] \qquad (2.3)$$

$$TN_i = \text{Sum(Other Elements)} = \sum_{\substack{l \neq i \\ k \neq i}} \text{Confusion Matrix}[l, k] \qquad (2.4)$$
For example, in Figure 2.1, the gray boxes represent $FP_3$ (the false positives for Class 3), while $FN_3$ is the sum of the boxes colored in yellow.
The classification accuracy rate, or overall accuracy, is the simplest metric. For each class, it is the number of correct predictions divided by the total number of predictions, multiplied by 100 to turn it into a percentage.[10]
These quantities are calculated through the following formulas:

$$\text{Classification Accuracy Rate}_i = \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i} \qquad (2.5)$$

$$\text{Classification Accuracy Percentage}_i = 100 \times \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i} \qquad (2.6)$$
2.2.1.3 Precision
Precision, also called the positive predictive value (PPV), is the fraction of instances assigned to class i that actually belong to class i:

$$PPV_i = \frac{TP_i}{TP_i + FP_i} \qquad (2.7)$$
2.2.1.4 Recall
Recall, also called the true positive rate (TPR), is the fraction of instances of class i that are correctly assigned to class i:

$$TPR_i = \frac{TP_i}{TP_i + FN_i} \qquad (2.8)$$
2.2.1.5 F1 score
The F1 score is the harmonic mean of precision and recall. The formula for the F1 score is:

$$F1_i = \frac{2 \times PPV_i \times TPR_i}{PPV_i + TPR_i} \qquad (2.9)$$
F1 score reaches its best value at 1 (in that case we talk about a perfect precision and
recall) and its worst at 0. Finally for the multi-class case, we consider the weighted
average of the F1 score of each class.
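As a rough illustration of Equations 2.3 to 2.9, the following Python sketch computes per-class precision, recall and F1 from a confusion matrix and then the support-weighted F1; the toy matrix and class counts are invented for the example.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall and F1 from a square confusion matrix.

    cm[i, j] = number of instances of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # column sums minus the diagonal
    fn = cm.sum(axis=1) - tp          # row sums minus the diagonal
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 3-class confusion matrix (rows = true class, columns = predicted class).
cm = [[50,  5,  2],
      [10, 40,  8],
      [ 3,  6, 30]]
p, r, f1 = per_class_metrics(cm)
support = np.asarray(cm).sum(axis=1)
weighted_f1 = np.average(f1, weights=support)   # weighted average over classes
print(p, r, f1, weighted_f1)
```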
In multi-class classification, accuracy, precision and recall do not provide enough information about the performance of our classifier. Usually, we resort to Cohen's Kappa statistic.
Cohen's kappa is defined as:

$$\text{Kappa} = \frac{O - E}{1 - E} = 1 - \frac{1 - O}{1 - E} \qquad (2.10)$$
where O is the observed accuracy and E is the expected accuracy. In other words, it
tells us how much better our classifier is performing over the performance of a
classifier that simply guesses at random according to the frequency of each class.
In order to understand the calculation of Cohen's Kappa statistic, let's consider a 3-class problem whose confusion matrix is shown in the following table.
                          Predicted Class
                   C1            C2            C3            Total
True     C1        a             b             c             a + b + c = C1_True
Class    C2        d             e             f             d + e + f = C2_True
         C3        g             h             i             g + h + i = C3_True
Total          a + d + g     b + e + h     c + f + i         N_obs
               = C1_Pred     = C2_Pred     = C3_Pred

Table 2.1: Confusion matrix for a 3-class problem
With N_obs the number of samples and C1, C2 and C3 the labels of classes 1, 2 and 3, Cohen's Kappa is given by:

$$\text{Kappa} = \frac{N_{obs} \times (a + e + i) - (C1_{True} \, C1_{Pred} + C2_{True} \, C2_{Pred} + C3_{True} \, C3_{Pred})}{N_{obs}^2 - (C1_{True} \, C1_{Pred} + C2_{True} \, C2_{Pred} + C3_{True} \, C3_{Pred})} \qquad (2.11)$$

In the general N-class case:

$$\text{Kappa} = \frac{N_{obs} \times \sum_{i=1}^{N} \text{Confusion Matrix}[i, i] - \sum_{i=1}^{N} Ci_{True} \, Ci_{Pred}}{N_{obs}^2 - \sum_{i=1}^{N} Ci_{True} \, Ci_{Pred}} \qquad (2.12)$$
The range of Kappa values extends from minus one to plus one and, according to its value, we can judge our classifier. In fact, a value of 0 indicates a useless classifier and a value of 1 means a perfect model.
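A minimal sketch of Equation 2.12 in Python, assuming the same convention as Table 2.1 (rows are true classes, columns are predicted classes); the numbers are invented.

```python
import numpy as np

def cohen_kappa(cm):
    """Cohen's kappa computed directly from a confusion matrix (Equation 2.12)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()                       # N_obs
    observed = np.trace(cm) / n        # observed accuracy O
    row_marg = cm.sum(axis=1)          # Ci_True
    col_marg = cm.sum(axis=0)          # Ci_Pred
    expected = (row_marg * col_marg).sum() / n**2   # expected accuracy E
    return (observed - expected) / (1 - expected)

cm = [[50,  5,  2],
      [10, 40,  8],
      [ 3,  6, 30]]
print(cohen_kappa(cm))   # 0 ~ random guessing, 1 ~ perfect agreement
```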
The existing multi-class classification techniques can be grouped into three main categories, which are transformation to binary, extension from binary and hierarchical classification.
• One-versus-Rest (OVR):
If we consider a problem of classifying among N different categories, this approach is based on the idea of dividing the problem into N binary classifiers, where each classifier compares a specific class against the other (N − 1) classes.
The next step is that each binary classifier returns a real-valued score for its decision.
Finally, to predict the class of a new instance, the classifier returning the highest score is considered the winner and its class is assigned to the new instance (a minimal sketch of both strategies follows this list).
Despite the fact that this approach is simple, it has some disadvantages. First, the scale of the returned score can differ between the classifiers. Second, this strategy can lead to imbalanced classes. In fact, by discriminating one class from the others, it is obvious that negative observations will be much more numerous than positive observations.
• One-versus-One (OVO):
In this approach, we compare the categories pairwise. That means that N(N − 1)/2 binary classifiers are built to discriminate between each pair of categories.
Just like any other approach, OVO suffers from some problems. The most ambiguous is that, for some instances, several classes may get the same number of votes.
Some Classification algorithms are still applicable for more than two classes. Those
are called extended algorithms and they include neural networks, decision trees, k-
Nearest Neighbor, Naive Bayes and Support Vector Machines.
Yet another approach for tackling the multi-class classification problem utilizes a hierarchical division of the output space, i.e. the classes are arranged into a tree. The tree is created such that the classes at each parent node are divided into a number of clusters, one for each child node. The process continues until the leaf nodes contain only a single class. At each node of the tree, a simple classifier, usually a binary classifier, makes the discrimination between the different child class clusters. Following a path from the root node to a leaf node leads to the classification of a new pattern.
Since the result is a probability value, the output p(X) should be between 0 and 1 for all values of X.
Logistic regression uses the following function, called the logistic function:

$$p(X) = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} \qquad (2.13)$$

where β0 is the intercept and β1 the coefficient of X.
Rearranging Equation 2.13 gives $\frac{p(X)}{1 - p(X)} = \exp(\beta_0 + \beta_1 X)$. The left-hand side is called the odds. It can take any positive real value: a value close to 0 indicates a very low probability of Y = 1, while a value approaching ∞ means that the probability of Y = 1 is very high.
The quantity $\log\!\left(\frac{p(X)}{1 - p(X)}\right)$ is called the log-odds or logit. An interesting result is that the logistic regression has a logit that is linear in X. This means that increasing X by one unit changes the logit by β1.
Based on the available training data, the coefficients β0 and β1, used in the above
equations, can be estimated. We can use the maximum likelihood approach to
estimate the unknown coefficients.
The basic idea behind using maximum likelihood to fit a logistic regression model can be formalized with the following mathematical equation:

$$\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i) \prod_{j:\, y_j = 0} \bigl(1 - p(x_j)\bigr) \qquad (2.16)$$
make the desired prediction. This suggests using the sigmoid function. The latter is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1. These values are then transformed into either 0 or 1 using a threshold classifier.
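A tiny numerical sketch of the sigmoid and the threshold classifier described above; the coefficient values below are arbitrary illustrations, not fitted estimates.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -1.0, 2.0          # illustrative coefficients only
x = np.array([-2.0, 0.0, 0.5, 3.0])
p = sigmoid(beta0 + beta1 * x)    # p(X) as in Equation 2.13
y_hat = (p >= 0.5).astype(int)    # threshold classifier with cutoff 0.5
print(p, y_hat)
```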
Therefore, we use the maximum likelihood method to estimate β0, β1, ..., βp.
An un-skewed distribution is one that is roughly symmetric. This means that the probability of falling on either side of the distribution's mean is approximately equal. In other words, skewness represents an imbalance with respect to the mean of a data distribution. The formula for the sample skewness statistic is the following:

$$\text{Skewness} = \frac{3 \times (\text{Mean} - \text{Median})}{\text{Standard deviation}} \qquad (2.21)$$
where :
• Median is the number that falls directly in the middle of the data distribution.
Replacing the data with the log, square root or inverse may help to remove the skew.
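A small sketch, on synthetic log-normal data, showing how the sample skewness of Equation 2.21 reacts to a log transformation; the data are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # strongly right-skewed data

def pearson_skew(v):
    """Pearson's second skewness coefficient, 3 * (mean - median) / std (Eq. 2.21)."""
    return 3.0 * (np.mean(v) - np.median(v)) / np.std(v)

print(pearson_skew(x))          # clearly positive: right skew
print(pearson_skew(np.log(x)))  # close to zero after the log transformation
```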
Data sets sometimes come with predictors that take a unique value across samples. Such uninformative predictors are more common than you might think. This kind of predictor is not only non-informative, it can break some models.
Even more common is the presence of predictors that are almost constant across samples. Constant and almost-constant predictors happen quite often. One reason is that we usually break a categorical variable with many categories into several dummy variables. Hence, when one of the categories has zero observations, it becomes a dummy variable full of zeroes.
One quick and dirty solution is to remove all predictors that satisfy some threshold criterion related to their variance.
Among the multi-class classification models sensitive to near-zero variance, we find Logistic Regression, K-Nearest Neighbors and Neural Networks.[2]
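One way to apply such a variance-threshold filter is scikit-learn's VarianceThreshold; the toy matrix and cutoff below are illustrative only, since the appropriate threshold depends on the data at hand.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy design matrix: the second column is constant, the third is almost constant.
X = np.array([[1.0, 3.0, 0.0],
              [2.0, 3.0, 0.0],
              [0.5, 3.0, 0.0],
              [1.7, 3.0, 1.0]])

selector = VarianceThreshold(threshold=0.2)   # drop constant / near-constant predictors
X_reduced = selector.fit_transform(X)
print(selector.get_support())   # mask of the predictors that are kept
print(X_reduced.shape)          # only the informative column remains
```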
Class imbalance is a very common machine learning problem. It arises when the number of instances belonging to one class is much greater than the number of instances of another class. This problem clearly affects the model's reliability and accuracy, since most machine learning algorithms work best when the numbers of instances of each class are approximately equal. To tone down this problem, we can use approaches that fall into two major categories: cost function based approaches and sampling based approaches.
The basic idea behind cost function based approaches is to build a new objective to optimize that penalizes one or more given types of error. In case we think one false negative is worse than one false positive, we will count that one false negative as, e.g., 100 false negatives instead. Then the machine learning algorithm will try to make fewer false negatives compared to false positives.
Here, the green line is the ideal decision boundary we would like to have. The blue line is the actual result. As shown in Figure 2.3, after under-sampling the upper class (the negative class), the new decision boundary is slanted, so some negative class elements will be wrongly classified as positive.
The thick positive signs indicate that there are multiple duplicated copies of that data instance. The machine learning algorithm then sees these cases many times and therefore tends to overfit to these examples specifically, resulting in the blue line boundary shown in Figure 2.4 on the right side.
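A minimal sketch of the two sampling-based approaches (random under- and over-sampling) on an invented imbalanced data set; dedicated libraries such as imbalanced-learn offer more refined variants.

```python
import pandas as pd

# Toy imbalanced data set: 950 negative rows, 50 positive rows.
df = pd.DataFrame({
    "x": range(1000),
    "label": ["neg"] * 950 + ["pos"] * 50,
})

majority = df[df["label"] == "neg"]
minority = df[df["label"] == "pos"]

# Random over-sampling: duplicate minority rows until the classes are balanced.
oversampled = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=0)]
)

# Random under-sampling: keep only as many majority rows as there are minority rows.
undersampled = pd.concat(
    [majority.sample(n=len(minority), random_state=0), minority]
)

print(oversampled["label"].value_counts())
print(undersampled["label"].value_counts())
```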
Cross validation is one of the most widely used techniques in data analysis. In fact, it represents a very powerful method for estimating the reliability and robustness of a given model based on sampling.
Cross validation can be used in many scenarios and for several objectives. In fact, it should be used when we are dealing with small data sets and also when we are building a statistical model with one or more unknown parameters. So, the idea is to optimize those parameters in order to get a model that fits the data as well as possible.[10] There are several cross validation techniques:
• Test-Set Validation: The main idea of this basic method is to divide the given data into two completely independent data sets. The first one is the training set. It is called the In Sample and it contains more than 60% of the original data. The second subset is the testing set. It is also called the Out Of Sample.
The model is built on the training set and then validated on the testing set. This means that the performance of the model is evaluated on the testing set.
• k-Fold Cross Validation: This approach divides the original sample into k subsets; we then select one subset as the validation set and the remaining k − 1 subsets constitute the training set.
As in the above approach, the performance score is calculated using the first selected subset.
Then the whole operation is repeated by selecting another validation sample from among the k − 1 subsets that have not been used yet.
This operation is repeated k times.
Finally, the global performance score is defined as the average of the k performance scores calculated at each iteration.
An advantage of the last two methods is that we are not wasting any data. In other
words, it is a clever way to use all data.
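A short sketch of k-fold cross validation with scikit-learn on synthetic data; the classifier and the choice k = 5 are arbitrary for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)

# 5-fold cross validation: the global score is the mean of the 5 fold scores.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```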
Feature selection is the automatic picking of the variables that are most relevant to the predictive model.
It is in this perspective that Robert Neuhaus said: "Feature selection is itself useful, but it mostly acts as a filter, muting out features that aren't useful in addition to your existing features."
There is a bunch of reasons why we should care about feature selection, but really, they can be classified into two main categories. One is about human beings and the other is about machines and machine learning algorithms.[11]
So, we can say that feature selection helps, hopefully, to go from a whole bunch of features to just a few features, thus making the learning problem much easier.
In the following subsections we will consider data with T independent attributes.
In best subset selection, we need to fit models with each possible combination of the T features. The total number of models to fit is $2^T$.
This method requires massive computational power. In fact, if we consider a model with 100 attributes, we would need to fit $2^{100}$ models. To reduce this computational burden, the approach is divided into two separate steps.
• Step 1: For k = 1, 2, ..., T, fit all models with exactly k predictors. Select the best among these models and call it Mk.
• Step 2: Select the best model from among M1, M2, ..., MT.
For computational reasons, best subset selection cannot be applied with a high T. For this reason, stepwise selection can be the best alternative to best subset selection.
Stepwise selection methods include forward and backward stepwise selection.
• Forward Stepwise Selection: The procedure starts with the null model (mean(Y)) and then, at each step k, only one predictor is added and a model Mk is picked. Once a variable is retained, it never drops from the model, and the total number of fitted models is $1 + \frac{T(T+1)}{2}$ (a small sketch follows below).
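As a practical stand-in for the forward stepwise procedure above, scikit-learn's SequentialFeatureSelector performs the same greedy forward search, although it stops at a user-chosen number of features rather than comparing all models M1, ..., MT; the data below are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=15, n_informative=5,
                           random_state=0)

# Greedy forward selection: add one predictor at a time, keeping the one that
# improves the cross-validated score the most, and never drop it afterwards.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5, direction="forward", cv=3)
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask of the retained predictors
```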
2.7 Conclusion
Different models have different sensitivities to the type of predictors in the model. Transformations of the data to reduce the impact of data skewness or outliers can lead to significant improvements in performance. Feature selection can also be effective.
Chapter 3

Experimental Results and Modelling
3.1 Introduction
We start this chapter by presenting the general scope of the project. Then, we turn
our focus on exploratory data analysis and the feature engineering. Finally, we
tackle the modelling phase by including the performance evaluation of the fitted
models.
The main objective of this internship is to analyze the liquidity on North America’s
stock markets by developing a predictive model that will allow forecasting which
venue is showing the highest volume in the upcoming short time horizon.
More specifically, we want to develop a classification model that allows us to
predict which venue will show the highest volume with at least Y% of the total
volume. The forecast will be made if in the current minute there is a venue showing
more than X% of the total volume. The two parameters X and Y are given as input
based on specification provided by the host company.
In what follows, X will be set to 30% and Y will be set to 40%.
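A hedged sketch of how the response could be built from per-minute volume shares under the X = 30% / Y = 40% rule described above; the venue names, shares and the helper function itself are illustrative, not the host company's actual labelling code.

```python
import pandas as pd

def label_minute(current_shares, next_shares, x=0.30, y=0.40):
    """Build the response for one minute of one symbol.

    `current_shares` and `next_shares` map venue -> share of total volume
    (values summing to 1) in the current and the next minute. Returns None
    when no prediction is made, 'No Venue' when no venue reaches Y%, and the
    winning venue name otherwise. Venue names are illustrative only.
    """
    cur = pd.Series(current_shares)
    nxt = pd.Series(next_shares)
    if cur.max() < x:                 # prediction is only made above the X% filter
        return None
    return nxt.idxmax() if nxt.max() >= y else "No Venue"

cur = {"NYSE": 0.35, "NASDAQ": 0.25, "ARCA": 0.20, "EDGX": 0.20}
nxt = {"NYSE": 0.45, "NASDAQ": 0.25, "ARCA": 0.15, "EDGX": 0.15}
print(label_minute(cur, nxt))   # -> 'NYSE'
```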
In this project, we will use data for 5 days, from 2018-03-05 to 2018-03-09. The dataset comes from the Onetick database. It contains data for trades and quotes.
From Trades:
• Bid Price: (Number) The price at which a buyer is prepared to buy a share.
• Ask Price: (Number) The price a seller is willing to accept per share.
• Ask Size: (Integer(9)) Number of round lots to be sold at the ask price (100 share units).
The ’Exchange’ column in both trades and quotes data contains a character
identifying the native exchange. It takes one of the following characters:
• NYSE: If NYSE is showing the highest volume with at least 40% of the
total volume.
• BATS Z: If BATS Z is showing the highest volume with at least 40% of the
total volume.
• BATS Y: If BATS Y is showing the highest volume with at least 40% of the
total volume.
• ARCA: If ARCA is showing the highest volume with at least 40% of the
total volume.
• EDGX: If EDGX is showing the highest volume with at least 40% of the
total volume.
• EDGA: If EDGA is showing the highest volume with at least 40% of the
total volume.
The choice of Python over R is because the latter has the disadvantage of not handling large data sets well. It is true that, when it comes to dealing with the statistical aspects of the data, R stands out as the most used language in data science. In fact, R has a big community around the world providing this language with various packages for statistical analysis.
Python is well adapted to dealing with big data. With an ecosystem propitious to statistical analysis, Python has become one of the most widely used languages among data scientists.
In data processing, handling large-scale data sets and having fast access to data can be the most crucial criteria when choosing a programming language. Python may not be the best, but it is surely better than R at dealing with medium-scale data sets (millions of rows). It is open source, easy to use and very suitable for processing data of string and sequence types.[17]
Thanks to its open-source libraries adapted to the field of data science, Python has become one of the languages most used by data analysts in recent years.[17]
Here is a list of packages that we used in this project:
• Pandas: Pandas is the Python package that we used for handling data. In fact, it is designed for fast loading, easy cleaning and quick exploration of data.
• Numpy: Numpy is one of the most basic packages of the scientific Python stack. We used it for (n-dimensional) array manipulation.
The exploratory data analysis (EDA) is a way to analyze data and represents an
essential phase in any analysis or modeling project. It helps us to understand the
data we are facing.
In statistics, EDA refers to a set of procedures for analyzing data sets and
summarizing their main characteristics, often with visual methods. For further
information about EDA, we may refer the reader to[15].
As seen in the project description, every trade that occurs in the market is recorded in a trade message. It indicates for each trade on which exchange it occurred, the price of execution and the number of shares traded.
Figure 3.1 shows a visualization of the total volume of trades of a single symbol
per exchange per half hour. The symbol used is ’AAPL’ and we visualize data for
the four most liquid exchanges.
These results are expected. In fact, they conform perfectly to the regular trading session schedule seen in Chapter 1. We recall that orders and trades are only available from 04:00 to 20:00. During this period, trades and orders occur in three different sessions:
Another important point shown in Figure 3.1 is that, during the Regular Market, trading volumes are much higher at the beginning and the end of the session, indicating a U-shaped profile.
Similar to trades, quote messages are generated for every quote event. The data used is known as Level I quotes: it indicates the best bid/ask at a given point in time.
Below we visualize the total count of quotes of one symbol per half hour. The symbol used is 'AAPL' on Fri 2018-03-22.
Figure 3.2 shows that the number of quotes is much larger in the regular market session than in the pre-market and after-market sessions.
The bid-ask spread is one of the most relevant pieces of information that can be extracted from Level I quote data. It is the amount by which the ask price exceeds the bid price for a given stock in the market. It is calculated as:

Bid-Ask Spread = Ask Price − Bid Price

In the following box-plot visualization, we display the bid-ask spread over the entire day for the 'AAPL' stock.
As marked in Figure 3.3, the bid-ask spread is much smaller in the regular market session. This fact explains the high count of trades and quotes occurring in this session.
We recall that our response is a categorical variable and it represents the venue
showing the highest volume with at least 40% of the total volume in the next
minute.
The following pie plot is generated using 5 days data for 110 symbols with various
price and volume characteristics.
There are 117 457 observations.
Figure 3.4: Venue showing the highest volume with at least 40% of the total volume
We recall that predictions are only made if in current minute there is a venue
showing at least 30% of the total volume. This fact requires visualizing the
response according to the venue verifying the above assumption.
Figure 3.5: Response per current venue showing more than 30% Volume
Liquidity usually differs from one symbol to another. This fact suggests characterizing symbols by different groups according to their price, volume and market-cap. The latter refers to market capitalization, which measures the company's size in terms of its wealth. It is calculated by multiplying all the outstanding shares by the current market price of one share.
As shown in Figure 3.6, symbols fall into three categories according to their price, volume and market-cap. So, each symbol is classified as low, mid or high price; low, mid or high volume; and low, mid or high market-cap.
Feature engineering is just as important as, if not more important than, the choice of algorithm. In fact, good features allow a simple model to beat a complex model.
Basic features are features based only on time. The idea of using such variables came from the U-shaped profile illustrated in Figure 3.1, which displays the trade volume over the day.
Since this U-shaped curve has a parabolic form, just two features are needed:
• F2: F1 squared.
Fitting a linear model based on these two basic features in order to predict the volume in the next minute produces the following results:
Figure 3.8: Linear model to predict trade volume using basic features
The blue scatter points shown in Figure 3.8 represent the actual trade volume on the BATS Y stock exchange for the corresponding minute. The black line indicates the volume predicted by the fitted linear model. Moreover, the correlation between actual volume and predicted volume is equal to 0.67.
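A small sketch of such a quadratic-in-time volume model, assuming for illustration that F1 is a minute-of-day index and F2 its square (F1 is not spelled out in this extract); the U-shaped volumes below are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic per-minute volumes with the U-shaped intraday profile of Figure 3.1.
rng = np.random.default_rng(0)
minute = np.arange(390)                                   # regular-session minutes
volume = 5_000 + 0.15 * (minute - 195) ** 2 + rng.normal(0, 2_000, minute.size)

# Basic features: F1 assumed to be the minute index, F2 its square.
X = np.column_stack([minute, minute ** 2])
model = LinearRegression().fit(X, volume)

pred = model.predict(X)
print(np.corrcoef(volume, pred)[0, 1])   # correlation between actual and fitted volume
```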
These attributes are extracted from trade data and they include:
• Trade Sum: One feature is generated for each venue and it indicates the total volume of trades that occurred on this venue in the last 5 minutes.
• Trade Count: One feature is generated for each venue and it indicates the total number of trades that occurred on this venue in the last 5 minutes.
They include all variables extracted from Level I quote data, which are essentially based on:
• Bid Flow: It describes the flow on the bid (left) side of the order book. For each quote update n, the bid-side contribution is

$$e^B_n = \mathbb{1}_{\{P^B_n > P^B_{n-1}\}} \, q^B_n - \mathbb{1}_{\{P^B_n < P^B_{n-1}\}} \, q^B_{n-1} \qquad (3.3)$$

and the bid flow is estimated as follows:

$$\text{Bid Flow} = \sum_{n=1}^{N_{events}} e^B_n \qquad (3.4)$$

• Ask Flow: It describes the flow on the ask (right) side of the order book and is calculated analogously:

$$e^A_n = \mathbb{1}_{\{P^A_n > P^A_{n-1}\}} \, q^A_{n-1} - \mathbb{1}_{\{P^A_n < P^A_{n-1}\}} \, q^A_n \qquad (3.5)$$

$$\text{Ask Flow} = \sum_{n=1}^{N_{events}} e^A_n \qquad (3.6)$$
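A minimal sketch of Equations 3.3 to 3.6 on a toy sequence of Level I quote updates; the price/size sequences are invented and the functions are illustrative, not the project's production code.

```python
import numpy as np

def bid_flow(bid_price, bid_size):
    """Bid-side flow (Equations 3.3 and 3.4): +q_n when the bid price moves up,
    -q_{n-1} when it moves down, 0 when it is unchanged."""
    p, q = np.asarray(bid_price, float), np.asarray(bid_size, float)
    up = (p[1:] > p[:-1]).astype(float)
    down = (p[1:] < p[:-1]).astype(float)
    return np.sum(up * q[1:] - down * q[:-1])

def ask_flow(ask_price, ask_size):
    """Ask-side flow (Equations 3.5 and 3.6): +q_{n-1} when the ask price moves up,
    -q_n when it moves down."""
    p, q = np.asarray(ask_price, float), np.asarray(ask_size, float)
    up = (p[1:] > p[:-1]).astype(float)
    down = (p[1:] < p[:-1]).astype(float)
    return np.sum(up * q[:-1] - down * q[1:])

# Toy sequence of level-I quote updates.
print(bid_flow([10.00, 10.01, 10.01, 10.00], [3, 5, 4, 6]))
print(ask_flow([10.02, 10.02, 10.03, 10.02], [7, 6, 2, 4]))
```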
The features generated from quotes number 56 (8 × 7) and they are calculated separately for each venue. They include:
• Bid Size: Represents the total size of shares from the bid side that appear
on level one quote data in the last 5 minutes.
• Ask Size: Represents the total size of shares appeared in the ask side on the
level one quote data in the last 5 minutes.
• Bid Count: Indicates the total number of bid messages in the last 5 minutes.
• Ask Count: Indicates the total number of ask messages in the last 5 minutes.
• Mean Bid-Ask Spread: Mean of the bid ask spread (bps) in the last minute.
Dummy variables are Boolean indicators. They take the value 0 or 1 to indicate the presence or absence of some categorical effect. In this context, dummy variables are used to specify both which venue is showing the highest volume and in which group the symbol is classified. Finally, we get the following 12 features:
• PriceGroup Low: Takes 1 only if the symbol is classified in Low price Group.
• PriceGroup High: Takes 1 only if the symbol is classified in High price Group.
At this phase of the project, we use data for 5 days, from 2018-03-05 to 2018-03-09. To measure the model's reliability and robustness, we use the first four days of data as the Training Set and keep the last day as the Testing Set.
The features used in fitting the models are 85 in number and they are extracted from the available quotes and trades data.
The classification goal is to predict which venue will be showing the highest volume
with at least 40% of the total volume in the next minute.
Figure 3.9: Initial Train        Figure 3.10: Resampled Train
After resampling, centering and scaling the data, we turn our focus to fitting logistic regression classifiers. In this section we fit our first model, which is based on all 85 features. All models are fitted using the OVR multi-class technique detailed in the second chapter.
The following figures display the performance of this model on the Testing Set.
• The accuracy is 30%, which means that in 70% of predictions the fitted classifier gives a wrong answer.
• The recall for the 'No Venue' class is equal to 0.4. So, in 60% of cases, the classifier predicts one particular venue while the truth is that there is no venue showing at least 40% of the total volume.
In what follows, our main objective is to fit other models with higher performance. More specifically, we aim to improve both the 'No Venue' recall and the accuracy of the model.
Feature selection allows us to go from a whole bunch of features to just a few features. This helps, hopefully, to make the learning problem much easier.
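Since RFE is the feature selection technique referenced later in this chapter, a minimal sketch of recursive feature elimination down to 40 features is shown below; the data set is synthetic and the step size is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=85, n_informative=25,
                           random_state=0)

# Recursive feature elimination: repeatedly fit the model, drop the weakest
# coefficients, and stop once only the requested number of features is left.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=40, step=5)
rfe.fit(X, y)
print(rfe.support_.sum())   # 40 features kept
print(rfe.ranking_[:10])    # rank 1 marks the selected features
```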
To evaluate the efficiency of the above feature selection operations, we measure the performance of a new logistic regression model using only the 40 best features retained. Below are the results:
Based on the performance of the new fitted model exposed in Figures 3.14 and 3.15, we can evaluate the best-features model along the following axes:
• 'No Venue' Recall: The recall for the 'No Venue' class goes from 0.4 to 0.58.
It seems that resorting to feature selection techniques has roughly improved the performance of our developed model.
The motivation for this two-level classification is both the fact that the false negative errors of the 'No Venue' class have to be penalized and the fact that predicting similar classes seems more logical.
Thus, the classification takes place in the two following steps.
At this level, the main objective is to predict whether there is a venue (among all seven venues) that will be showing at least 40% of the total volume.
The problem is now a binary classification task. Therefore, the target value will be:
• Yes: If there is a venue which will be showing at least 40% of the total volume
in the next minute.
• No: Otherwise.
Figure 3.16 shows the performance of this First Level Classifier on the training and
testing sets.
We recall that one of our objectives is to penalize the false negative errors of the 'No Venue' class.
The logistic regression classifier works by assigning a new instance to the class for which the probability is greatest. By default, an instance is classified in class YES if:
Thus, the logistic regression works with a threshold of 0.5. However, if we are concerned about incorrectly predicting Yes for elements whose actual class is No, then we should consider raising this threshold. Equation 3.10 then becomes:
Several performance scores resulting from this approach are illustrated in the following figure.
Figure 3.17: First Level Classifier Performance as a function of the threshold value
Figure 3.17 indicates that raising the threshold yields a vast improvement in the False Negative Rate ('No Venue' recall). However, this improvement comes at a cost.
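A small sketch of the threshold-raising idea on a synthetic binary problem: as the cutoff on P(Yes) increases, the recall of the 'No' class improves while the recall of the 'Yes' class deteriorates, which is the trade-off noted above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=20, weights=[0.7, 0.3],
                           random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba_yes = clf.predict_proba(X)[:, 1]          # P(class = 'Yes') for each instance

for threshold in (0.5, 0.6, 0.7, 0.8):
    y_hat = (proba_yes >= threshold).astype(int)
    # Recall of the 'No' class: fraction of true negatives predicted as 'No'.
    no_recall = np.mean(y_hat[y == 0] == 0)
    yes_recall = np.mean(y_hat[y == 1] == 1)
    print(threshold, round(no_recall, 3), round(yes_recall, 3))
```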
Only if the class predicted by the first classifier is YES does the second-level classification take place. So, the aim of this classifier is to predict exactly which venue, among the seven venues concerned, is the one destined to show the highest volume.
The model is trained only with the 40 best features retained from the above section. Based on the performance exposed in Figures 3.18 and 3.19, it seems that the developed solution validates a good part of our initial objectives, namely improving the accuracy rate and the recall for the 'No Venue' class. In fact:
To quantify the importance of any created feature, we use the magnitude of its coefficient times its standard deviation in each binary classification model (this follows from how the OVR method works).
• F2: F1 squared.
Figure 3.20 quantifies the influence of these attributes in the classification models.
F1 and F2 identify the historical behavior of each venue. Therefore, they are important when dealing with low-volume venues and approximately null when it comes to dynamic venues such as NYSE and NASDAQ.
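A minimal sketch of this importance rule (|coefficient| × standard deviation, computed per binary OVR model) on synthetic data; it illustrates the calculation only, not the project's actual feature set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=2000)).fit(X, y)

# Importance of feature j in the binary model for class c:
# |coefficient| times the standard deviation of the feature.
stds = X.std(axis=0)
for c, est in enumerate(ovr.estimators_):
    importance = np.abs(est.coef_.ravel()) * stds
    print(f"class {c}:", np.round(importance, 2))
```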
This result reflects a continuity in the trading flows. In fact, it indicates that when a venue is showing the highest volume, with more than 30% of the total, in the current minute, this same venue is more likely to be showing the highest volume, with at least 40% of the total, in the next minute.
These attributes are generated from trade messages extracted from Level I market
data. Figure 3.22 quantifies the influence of these attributes in the classification
models.
• Features indicating the trade count are more important locally. More specifically, these features matter most on the corresponding local venue. This result is illustrated by the red coefficients presented in Figure 3.22.
• It seems that features identifying the total volume in the last five minutes are less important. This can be justified by the fact that these variables are considered redundant. Consequently, they are removed after applying RFE feature selection.
Liquidity, usually, differs from one symbol to another. This fact suggests
characterizing symbols by different groups.
Therefore, symbols fall into three categories according to their price, volume and market-cap. So, each symbol is classified as low, mid or high price; low, mid or high volume; and low, mid or high market-cap.
From Figure 3.23, we can notice that the importance of dummy variables
characterizing symbols differs from one venue to another. This result suggests
that:
• Symbols in the Low Volume Group are likely to have EDGA as the venue
showing highest volume and at least 40% of the total volume in the next
minute.
• The blue colored coefficients indicate that the probability of predicting the corresponding venue is inversely affected by this symbol characterization. For example, when making predictions for high-volume symbols, EDGX and ARCA are unlikely to be predicted.
3.7 Conclusion
In this chapter, we have gone through the organization of the venue classification project. On this basis, we have shown the different stages of the implementation of the classification model, starting from exploratory data analysis and feature engineering. Finally, we have detailed the different modelling results upon which we decided on the best model to choose.
Conclusion and Perspectives
We have presented, throughout this report, our analysis, research and development work deployed to quantify liquidity dynamics on North America's stock markets.
In the first part of this report, we introduced the general framework of the project
that consists in implementing the working environment, as well as bringing a
global view on the world’s financial markets. Then, we described the problem
statement and the project objectives. In the second part, we presented the
theoretical background needed in multi class classification tasks. In the last part,
we showed the experimental results of our fitted models.
In this context, our work consists of logistic regression models based on the OVR technique for classifying US venues based on their short-term liquidity. After creating features from trades and quotes data, a first model is fitted using all available features. This classifier has poor accuracy and high complexity.
Finally, we have gone through a two-level model. The latter validates a good part of our initial objectives, namely improving the accuracy rate as well as penalizing the false negative errors for the 'No Venue' class.
On a personal level, the period of this internship was extremely enriching since I had the opportunity to join a dynamic team and to discover closely the lifecycle of a large trading project. This internship has also allowed me to deepen my technical knowledge, especially in Python programming, while applying our academic and theoretical knowledge of data mining.
As far as the future prospects are concerned, this project could be developed on the
following axes :
• Create other features that may better reflect the short-term dependence
between different venues.
Bibliography

[2] Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Vol. 1. New York: Springer series in statistics, 2001.
[3] Johnson, Barry. Algorithmic Trading & DMA: An introduction to direct access
trading strategies. Vol. 200. London: 4Myeloma Press, 2010.
[4] Cartea, Alvaro, and Sebastian Jaimungal. Modelling asset prices for algorithmic
and high-frequency trading. Applied Mathematical Finance 20.6 (2013).
[6] Glantz, Morton, and Robert Kissell. Multi-asset risk modeling: techniques for a
global economy in an electronic and algorithmic trading era. Academic Press,
2013.
[7] Gomber, Peter, and Martin Haferkorn. High frequency trading. Encyclopedia of
Information Science and Technology, Third Edition. IGI Global, 2015.
[8] Avellaneda, Marco, Josh Reed, and Sasha Stoikov. Forecasting Prices in the Presence of Hidden Liquidity. Preprint (2010).
[9] Zivot, Eric. Analysis of High Frequency Financial Data: Models, Methods and Software. Part I: Descriptive Analysis of High Frequency Financial Data with S-PLUS. (2005).
[10] James, Gareth, et al. An introduction to statistical learning. Vol. 112. New
York: springer, 2013.
[11] Guyon, Isabelle, and André Elisseeff. An introduction to variable and feature selection. Journal of machine learning research 3.Mar (2003).
[12] Liu, Huan, and Hiroshi Motoda. Feature selection for knowledge discovery and
data mining. Vol. 454. Springer Science & Business Media, 2012.
[13] Lisa Smith , The Auction Method: How NYSE Stock Prices are Set.
Investopedia, LLC, 2018.
[14] Donoho, David L. ”High-dimensional data analysis: The curses and blessings
of dimensionality.” AMS Math Challenges Lecture 1 (2000).
[15] Young, Forrest W., Pedro M. Valero-Mora, and Michael Friendly. Visual
statistics: seeing data with dynamic interactive graphics. Vol. 914. John Wiley &
Sons, 2011.
[16] Cukier, Kenneth, and Viktor Mayer-Schoenberger. The rise of big data: How
it’s changing the way we think about the world. Foreign Aff. 92 (2013).
[17] Layton, Robert. Learning data mining with python. Packt Publishing Ltd, 2017.
[18] Cont, Rama, Arseniy Kukanov, and Sasha Stoikov. The price impact of order
book events. Journal of financial econometrics 12.1 (2014).
[19] Derman, Emanuel. My life as a quant: reflections on physics and finance. John
Wiley & Sons, 2004.
[20] Kolo, Brian. Binary and Multiclass Classification. Lulu.com, 2011.
Appendix A: Features Importance
Figure 27: Bid Features
Figure 34: Response Portrayal per Volume Group