
Ministry of Higher Education and Scientific Research

Université de Carthage

Ecole Polytechnique de Tunisie

Graduation Project Report


to obtain:

The National Engineering Diploma from Tunisia Polytechnic School

Modelling Liquidity Dynamics in North America's

Stock Markets

Elaborated by:
Oussema GAZZEH
Third year engineering student - EGES

within: Quant-Dev

Defended on 25th June 2018 before the examination board:

Mr. Rabii JHINAOUI CEO & Founder (SPX) President

Mr. Wajih GACEM Manager SGSS Tunisia (UIB) Member

Mrs. Yousra GATI Associate Professor (TPS - LIM) Member

Mrs. Christine VAI Associate Professor (TPS) Member

Mr. Wajdi TEKAYA Quantitative Researcher (Quant-Dev) Supervisor

Mr. Adel BENZINA Associate Professor (TPS) Supervisor

Academic Year 2017/2018


Rue Elkhawarezmi BP 743 La Marsa 2078
Tel: 71 774 611 -- 71 774 699  Fax: 71 748 843
Web site: www.ept.rnu.tn
“Everyone should learn how to code...
It teaches you how to think.”

Steve JOBS
Abstract

Today, agents in financial markets use sophisticated trading systems with a wide range
of algorithms to support their trading. A critical component of this process is to model
markets and make accurate predictions.

It is in this perspective that, during this graduation project, we implemented a model to quantify liquidity on North American stock markets by developing a predictive classifier that gauges short-term liquidity.

The problem is a multi-class classification task. A main part of this project is feature engineering, based on the trades and quotes available from the Onetick database.

The modelling phase is carried out with logistic regression algorithms. During this phase, a first model is fitted using all available features; this classifier has poor accuracy and high complexity. Then, feature selection techniques are applied and only 40 of the 85 variables are retained. A second model is fitted, but no significant improvement is observed. Finally, we build a two-level model, which fulfils a good part of our initial objectives.

Keywords: Quantitative Trading, liquidity modelling, trades and quotes, machine learning, multi-class classification algorithms, Logistic Regression.
Acknowledgements

I would like to express my heartfelt thanks to my mentor Mr. Wajdi TEKAYA for having welcomed me at Quant-Dev and for his constant encouragement and the valuable advice he provided me.

I would like to thank my teacher, Mr. Adel BENZINA, for his collaboration, his suggestions and the help he has given me.

I would also like to thank Mrs. Leila DRIDI for correcting this report.

Finally, I am extremely grateful to my entire Family for their assistance and support
throughout the development of my project.

Contents

Abstract ii

Acknowledgements iii

List of Figures vi

List of Tables viii

General Introduction 1

1 General Framework of the Project 3


1.1 Introduction ................................................................................................. 3
1.2 Presentation of the host organization: Quant-Dev .................................... 3
1.3 General Context .......................................................................................... 3
1.3.1 Evolution of trading ........................................................................ 3
1.3.2 Big Data ............................................................................................... 5
1.3.3 Data Mining ......................................................................................... 6
1.4 Overview of Trading and Markets ............................................................. 6
1.4.1 US Stock Exchanges ............................................................................ 6
1.4.2 Market Micro-Structure................................................................... 7
1.4.3 Market data ...................................................................................... 8
1.5 Problem Statement and Project goals ........................................................ 8
1.6 Conclusion ....................................................................................................... 9

2 Mathematical Background 10
2.1 Introduction ............................................................................................... 10
2.2 Multiclass Classification Models .................................................................. 10
2.2.1 Performance Evaluation ................................................................ 10
2.2.2 Multi-Class Techniques......................................................................... 14
2.3 Multi-Class Logistic Regression ................................................................... 16
2.3.1 Binary Logistic Regression ................................................................ 16
2.3.2 Multi-Class Techniques......................................................................... 19
2.3.3 Features Importance....................................................................... 19
2.4 Data Pre-processing ....................................................................................... 19
2.4.1 Centering and Scaling ................................................................... 19
2.4.2 Resolve Skewness ................................................................................... 20


2.4.3 Near Zero Variance ............................................................................... 20


2.4.4 Class Imbalance ................................................................................. 21
2.5 Cross Validation ............................................................................................ 23
2.6 Features Selection .......................................................................................... 24
2.6.1 Best Subset Selection ......................................................................... 24
2.6.2 Stepwise Selection ............................................................................. 25
2.7 Conclusion ..................................................................................................... 25

3 Experimental Results and Modelling 26


3.1 Introduction ............................................................................................... 26
3.2 General Description ....................................................................................... 26
3.2.1 Problem Statement Recall ............................................................. 26
3.2.2 Data Description ............................................................................ 27
3.2.3 Software Environment................................................................... 29
3.3 Exploratory Data Analysis ....................................................................... 30
3.3.1 EDA: Data from Trades ...................................................................... 30
3.3.2 EDA: Data from Quotes ................................................................ 31
3.3.3 EDA: Response .................................................................................. 32
3.3.4 Symbols Characterization ............................................................. 33
3.4 Features Engineering ...................................................................................... 35
3.4.1 Basic Features ........................................................................................ 35
3.4.2 Features From Trades ........................................................................... 36
3.4.3 Features From Quotes .................................................................... 36
3.4.4 Dummy Variables .................................................................................. 37
3.5 Modeling: Results and Validation ........................................................... 38
3.5.1 Data Pre-processing ........................................................................... 39
3.5.2 All Features Model ............................................................................. 39
3.5.3 Best Features Model ...................................................................... 41
3.5.4 Two Levels Model ............................................................................. 43
3.6 Features Importance Interpretation........................................................... 46
3.6.1 Basic Features ........................................................................................ 46
3.6.2 Features Indicating Current Highest Venue .......................................46
3.6.3 Trade Features.................................................................................... 47
3.6.4 Features Indicating Symbols Characterization ............................. 47
3.7 Conclusion ..................................................................................................... 48

Conclusion and future perspectives 49

Bibliography 51

Appendix A: Features Importance 53

Appendix B: Response Portrayal Group 56


List of Figures

2.1 Confusion matrix for a multi-class classification problem ..........................11


2.2 Logistic Regression ....................................................................................... 18
2.3 Under-sampling.............................................................................................. 22
2.4 Over-sampling.................................................................................................22

3.1 Trades volume per venue per half hour ........................................................30


3.2 Quotes Count per half hour ....................................................................... 31
3.3 Bid-Ask Spread per half hour .......................................................................32
3.4 Venue showing the highest volume with at least 40% of the total volume . 33
3.5 Response per current venue showing more than 30% Volume .................... 33
3.6 Symbols Characterization .......................................................................... 34
3.7 Response Portrayal per Market-Cap Group .............................................. 34
3.8 Linear model to predict trade volume using basic features ...................... 35
3.9 Initial Train ........................................................................................................39
3.10 Resampled Train ................................................................................................... 39
3.11 Confusion Matrix of the All Features Model ...............................................40
3.12 Scores of the All Features Model ..................................................................40
3.13 Recursive Features Elimination ................................................................. 41
3.14 Confusion Matrix of the Best Features Model .............................................42
3.15 Scores of the Best Features Model ................................................................42
3.16 Performance of the First Level Classifier .....................................................43
3.17 First Level Classifier Performance as function of the threshold value ......44
3.18 Confusion Matrix of the Two Levels Model.................................................45
3.19 Scores of the Two Levels Model ...................................................................45
3.20 Basic Features Importance......................................................................... 46
3.21 Features Indicating Current Highest Venue .................................................... 46
3.22 Trade Features................................................................................................47
3.23 Group Characterization Features ................................................................ 48

24 Basic Features ....................................................................................................53


25 Highest Venue.............................................................................................................. 53
26 Ask Features ...................................................................................................53
27 Bid Features ...................................................................................................54
28 Bid-Ask Spread Features ..................................................................................54
29 Trade Features................................................................................................54
30 Group Characterization Features .................................................................54
31 40 - Selected Features........................................................................................55


32 Response Portrayal per Price Group ............................................................. 56


33 Response Portrayal per Market-Cap Group ............................................. 56
34 Response Portrayal per Volume Group ......................................................... 57
List of Tables

2.1 Confusion matrix for a 3-class problem ....................................................... 14

General Introduction

Trade is a daily activity in our lives, and its history extends back to the era of the Egyptian rulers, when people exchanged food, items and other tools for daily use.

The basic act of trading remained nearly the same until the invention of the telephone, which marked the birth of a whole new era of trading. With computerized systems, this era has continued to progress to the point we know today. Although trading itself kept its principles, new types of trading mechanisms were introduced. Progress in information and communication technologies allowed people to trade from different parts of the world and reduced trading times significantly.

Today, agents in financial markets use sophisticated trading systems with a wide
range of algorithms to support their trading. A critical component of this process
is to model markets and make accurate predictions.

It is in this perspective that, during this graduation project, we want to implement a model to quantify liquidity on North American stock markets by developing a predictive classifier that gauges short-term liquidity.

More specifically, the aim of our work is to develop a classifier that identifies the
most liquid venue based on short term market information.

This report describes the work we have achieved since last March, when we started our internship at Quant-Dev.

The outline of this report is as follows:

• General Framework of the Project: This chapter sets up the working environment and gives a global view of the world’s financial markets. Then, we describe the problem statement and the project objectives.

• Mathematical Background: We start this chapter by introducing the theoretical background needed for multi-class classification tasks. Then we turn our focus to some useful data mining techniques used in the development of our project. Finally, we describe logistic regression classifiers.

• Experimental Results and Modelling: We start this chapter by presenting the general scope of the project. Then, we turn our focus to exploratory data analysis and feature engineering. Finally, we tackle the modelling phase, including the performance evaluation of the fitted models.

• Conclusion and perspectives: This is a synthesis of all our efforts, together with some perspectives.
Chapter 1

General Framework of the Project

1.1 Introduction

In this first chapter, we start by presenting the host company Quant-Dev and the general context of this internship. Then, we give an overview of trading and markets by introducing the best-known US stock exchanges and the market micro-structure. Later on, we set out the problem statement. Finally, we detail the objectives.

1.2 Presentation of the host organization: Quant-Dev

Quant-Dev is a consulting company founded in 2014 and headquartered in Tunis, Tunisia. The company’s expertise leverages more than 10 years of experience in mathematical modelling, optimization and delivering data-driven quantitative tools to leading international clients.

This project was carried out with one of its clients, a multi-strategy hedge fund based in New York City that globally trades equity, futures, fixed income and a variety of derivatives.

1.3 General Context

1.3.1 Evolution of trading

Trade is a basic economic activity. It includes the purchase and sale of goods and services, with compensation paid by a buyer to a seller.


1.3.1.1 History

The evolution of trading is one of the most important components of humanity’s growth. Mankind has developed throughout the centuries, and that would not have been possible if people had been restricted to geographical frontiers.

The earliest forms of trade came through prehistoric peoples exchanging anything
of value for food, shelter and clothing. As the idea of exchange for sustenance
became ingrained in cultures worldwide, a physical space known as the
marketplace came into existence.[3]

1.3.1.2 Electronic Trading

The development of the first digital stock quote delivery system during the early
1960s marked the beginning of the transition towards fully automated markets.
Through the use of streaming real-time digital stock quotes, brokers could receive
specific market data on demand without having to wait for it to be delivered on
the ticker tape. The late 1980s and early 90s saw the desire to adopt automated
trading practices gravitate from institutional investors to individual retail traders.
As the personal computer became more and more powerful, and internet
connectivity technology evolved, the overwhelming push towards fully automated
markets soon overtook the holdovers from the open outcry system of trade.[5]

1.3.1.3 Algorithmic Trading

Algorithmic trading is a system of trading that facilitates transaction decision making in the financial markets using advanced mathematical models.

The strict rules built into the model attempt to determine the optimal time for an
order to be placed that will cause the least amount of impact on a stock’s price.
Large blocks of shares are usually purchased by dividing the large share block
into smaller lots and allowing the complex algorithms to decide when the smaller
blocks are to be purchased.

In this type of system, the need for human intervention is minimized and the decision making is very quick. This enables the system to take advantage of profit-making opportunities arising in the market long before a human trader can even spot them.

As large institutional investors deal in large amounts of shares, they are the ones who make the most use of algorithmic trading.

1.3.1.4 High frequency trading

High frequency trading (HFT) is a form of automated trading, i.e. trading on electronic platforms that rely on an algorithm to decide trading orders. It is thus an automated way to pass financial transactions through virtual operators at very high speed and without human intervention.

HFT is usually used by institutional investors, hedge funds and large investment banks, which rely on powerful computers.

The major benefit of HFT is that it has improved market liquidity and removed bid-ask spreads that would previously have been too small. This was tested by adding fees on HFT, and as a result, bid-ask spreads increased.

This practice became dominant around 2005 in the United States and, within a few years, became structured into the international trading system, posing new regulatory and ethical problems.[5]

1.3.2 Big Data

Big data is a term for data sets that are so large or complex that traditional data processing software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy.

The 3-Vs rule that defines Big Data is, in our case, confirmed: Volume, Variety and Velocity are the three axes along which ’big’ is a central attribute of the data.[16]

• Volume: The quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can actually be considered big data or not.

• Variety: The type and nature of the data. This helps people who analyze it
to effectively use the resulting insight.

• Velocity: In this context, the speed at which the data is generated and
processed to meet the demands and challenges that lie in the path of growth
and development.

1.3.3 Data Mining

As part of our study, it is very important to highlight the techniques that today make profiling activities increasingly sophisticated, notably by improving data analysis and statistical approaches. These various recent techniques in artificial intelligence have allowed the emergence of data mining. Data mining is defined as ”the application of statistical data analysis techniques and artificial intelligence to data exploration, in order to extract new information that is useful to the holder of such data”.[1] In other words, the interest of data mining lies in the fact that it is a computational tool capable of ”making the collected data speak”.

1.4 Overview of Trading and Markets

1.4.1 US Stock Exchanges

A Stock Exchange is a facility where people can buy or sell stocks, bonds and securities through brokers and traders. Most often, the traditional exchange floor is where the selling and buying take place. However, modern trading is now also done through electronic networks for their speed and lower cost. This means dark pools, electronic communication networks and alternative trading systems are also used as trading venues. Buyers and sellers are called stock investors, who may profit or lose capital depending on whether there is a bull or bear market, respectively. Stock exchanges are usually preferred by investors for their transparency.[3] The world’s two largest stock exchanges are:

• New York Stock Exchange: The New York Stock Exchange, abbreviated as NYSE and nicknamed The Big Board, is an American stock exchange. It is by far the world’s largest stock exchange by market capitalization of its listed companies, at US$21.3 trillion as of June 2017. The average daily trading value is approximately US$169 billion.

• Nasdaq Stock Market: The Nasdaq Stock Market is an American stock exchange. It is the second-largest exchange in the world by market capitalization, behind only the New York Stock Exchange. The exchange platform is owned by Nasdaq, Inc.

1.4.2 Market Micro-Structure

1.4.2.1 Order types

Orders also play an important role in market structure. An order is simply an instruction to buy or sell a specific quantity of a given asset. The two main types are:

• Market orders: These are directions to trade immediately at the best


price available. Thus, they demand liquidity and risk execution price
uncertainty.

• Limit orders: These have an inbuilt price limit that must not be breached: a maximum price for buy orders and a minimum price for sell orders. Hence, limit orders can help provide liquidity but risk failing to execute.

1.4.2.2 Off market trading

After-hours trading is the period of time after the market closes when an investor
can buy and sell securities outside of regular trading hours. Both the New York
Stock Exchange and the Nasdaq Stock Market normally operate from 9:30 a.m. to
4:00 p.m. Eastern Time. However, trades in the after-hours session can be
completed through electronic exchanges anytime between 4:00 p.m. and 8:00
p.m. Eastern Time. These electronic communication networks (ECNs) match
potential buyers and sellers without using a traditional stock exchange.[3] The
development of after-hours trading offers investors the possibility of great gains
with some risks and dangers.

• Less liquidity: There are far more buyers and sellers during regular hours.
During after-hours trading there may be less trading volume for your stock,
and it may be harder to convert shares to cash.

• Wide spreads: A lower volume in trading may result in a wide spread


between bid and ask prices. Therefore, it may be hard for an individual to
have his or her order executed at a favorable price.

• Tough competition for individual investors: While individual investors now have the opportunity to trade in an after-hours market, the reality is that they must compete against large institutional investors that have access to more resources than the average individual investor.

• Volatility: The after-hours market is thinly traded in comparison to regular-


hours trading. Therefore, you are more likely to experience severe price
fluctuations in after-hours trading than trading during regular hours.

1.4.3 Market data

Market data includes information about current prices and recently completed
trades. It is usually referred to as Level I and Level II market data.

1.4.3.1 Level I Market Data

Basic market data is known as Level I. It includes the following information:

• Bid Price: The price at which a buyer is prepared to buy a share.

• Bid Size: The number of shares an investor is willing to purchase at the bid price.

• Ask Price: The price a seller is willing to accept per share.

• Ask Size: The number of shares to be sold at the ask price.

• Last Price: The price at which the last transaction occurred.

• Last Size: The number of shares traded in the last transaction.

1.4.3.2 Level II Market Data

Level II provides more information than Level I data. Mainly, it doesn’t just show the
highest bid and offer, but also shows bids and offers at other prices.

1.5 Problem Statement and Project goals

Quantitative trading includes execution strategies based on quantitative analysis, which relies on mathematical algorithms to identify trading opportunities. This trading mechanism is typically used by hedge funds and financial institutions. Consequently, its transactions are usually large in size and may involve the buying and selling of hundreds of thousands of shares.[19]

Quantitative traders usually create their models based on several trading techniques that include high-frequency trading, algorithmic trading and statistical arbitrage. They develop computer programs that use models fitted to historical market data. The model is then backtested and, if suitable results are achieved, the system is implemented in real-time markets.

Quantitative trading has some drawbacks. Financial markets are some of the most dynamic entities that exist, so quantitative trading models must be just as dynamic to be consistently successful. As a result, many developed models are temporarily lucrative but eventually fail when market conditions change.

Liquidity is an important factor in trading. If an asset or a market lacks liquidity, a market order cannot be executed at a reasonably stable price level. This situation implies higher costs and higher return volatility.

Liquidity can be expressed in terms of immediacy, which reflects the ability to trade immediately by executing at the best available price. An asset’s price is closely linked to its liquidity.[3]
Overall, we can characterize market liquidity in terms of three main features:

• Depth: Depth determines the total quantity of buy and sell orders that are available for the asset around the equilibrium price.

• Tightness: Tightness refers to the bid-offer spread.

• Resiliency: Resiliency indicates how quickly the market recovers from a shock.

However, liquidity can be challenging to quantify.

It is within this context that this graduation project is performed. The main objective of this internship is to develop a predictive model that will allow forecasting the most liquid venue (i.e. stock exchange) over the upcoming short-term horizon.

1.6 Conclusion

In this first chapter, we introduced the general context of the work, after starting with a presentation of the host company Quant-Dev. Then, once familiarized with the different financial agents and the market micro-structure, we described the problem statement. Finally, we detailed our objectives.
Chapter 2

Mathematical Background

2.1 Introduction

In this chapter, we concentrate on the mathematical background needed to achieve our specified objectives. First, we present multi-class classification models. Then, we focus on the basic data mining techniques that are unavoidable in any data analytics project: starting with data pre-processing, going through cross validation and finishing with feature selection.

2.2 Multiclass Classification Models

When the objective is to classify an instance into one specific category, we talk about a classification problem. When we have the possibility to choose from more than two categories, it is a multi-class classification problem.
Multi-class classification should not be confused with multi-label classification, in which multiple labels are to be predicted for each instance.[20]

2.2.1 Performance Evaluation

At this level, we deal with several ways used to evaluate the performance of different
classification models. In fact, comparing classifiers is not at all an easy task and
there is no single classifier that works best on all given problems.


2.2.1.1 Confusion Matrix

A clean and simple way to present the prediction results of a classifier is to use a confusion matrix. This is a simple cross-tabulation of the observed and predicted classes for the data.[10]
Diagonal cells denote cases where the classes are correctly predicted, while the off-diagonal cells illustrate the number of errors for each possible case.

Figure 2.1: Confusion matrix for a multi-class classification problem

Figure 2.1 shows the confusion matrix, which allows us to distinguish between correctly and incorrectly classified instances. The green cells show the correctly classified points; in an ideal scenario, all other cells would contain zero points.
In addition to this unambiguous visualization, the confusion matrix is used to calculate further evaluation metrics, which are based essentially on the following terms.

• True positive (TP): Items that actually belong to the positive class and are
correctly included in the positive class.

• False positive (FP): Items that actually belong to the negative class and are
incorrectly included in the positive class.

• True negative (TN): Items that actually belong to the negative class and are
correctly included in the negative class.

• False negative (FN): Items that actually belong to the positive class and are
incorrectly included in the negative class.

In Multi-class Classification, where there are more than two classes, the above items
are specified, separately, for each class.

Therefore, for a given class i, they are calculated as follows:

\[ \mathrm{TP}_i = \mathrm{ConfusionMatrix}[i, i] \tag{2.1} \]

\[ \mathrm{FP}_i = \mathrm{Sum}(\mathrm{Column}\ i) = \sum_{\substack{k=1 \\ k \neq i}}^{N} \mathrm{ConfusionMatrix}[k, i] \tag{2.2} \]

\[ \mathrm{FN}_i = \mathrm{Sum}(\mathrm{Row}\ i) = \sum_{\substack{k=1 \\ k \neq i}}^{N} \mathrm{ConfusionMatrix}[i, k] \tag{2.3} \]

\[ \mathrm{TN}_i = \mathrm{Sum}(\mathrm{Other\ Elements}) = \sum_{l \neq i} \sum_{k \neq i} \mathrm{ConfusionMatrix}[l, k] \tag{2.4} \]

For example, in Figure 2.1, the gray boxes represent FP_3 (the false positives for class 3), while FN_3 is the sum of the boxes colored in yellow.

2.2.1.2 Classification Accuracy Rate

The classification accuracy rate, or overall accuracy, is the simplest metric. For each class, it is the number of correct predictions divided by the total number of predictions, multiplied by 100 to turn it into a percentage.[10]
These quantities are calculated with the following formulas:

\[ \mathrm{Classification\ Accuracy\ Rate}_i = \frac{\mathrm{TP}_i + \mathrm{TN}_i}{\mathrm{TP}_i + \mathrm{TN}_i + \mathrm{FP}_i + \mathrm{FN}_i} \tag{2.5} \]

\[ \mathrm{Classification\ Accuracy\ Percentage}_i = 100 \times \frac{\mathrm{TP}_i + \mathrm{TN}_i}{\mathrm{TP}_i + \mathrm{TN}_i + \mathrm{FP}_i + \mathrm{FN}_i} \tag{2.6} \]

The classification accuracy rate can be interpreted directly: a value of 1 indicates a perfect match between the predicted and the actual classes. However, this rate does not indicate the type of errors being made and does not take into account the natural frequency of each class.

2.2.1.3 Precision

Precision is the number of true positive predictions divided by the total number of positive predictions for a given class. It is also called the Positive Predictive Value (PPV).

Precision can be thought of as a measure of a classifier's exactness: it is intuitively the ability of the classifier not to label as positive a sample that is negative. It is calculated with the following formula:

\[ \mathrm{PPV}_i = \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FP}_i} \tag{2.7} \]

2.2.1.4 Recall

Recall is the number of true positive predictions divided by the number of actual positive instances in the data. It is also called Sensitivity or the True Positive Rate (TPR). The recall is intuitively the ability of the classifier to find all the positive samples, and it is calculated with the following formula:

\[ \mathrm{TPR}_i = \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FN}_i} \tag{2.8} \]

2.2.1.5 F1 score

The F1 score is a weighted average of the precision and recall. The formula for the F1 score is:

\[ F1_i = \frac{2 \times \mathrm{PPV}_i \times \mathrm{TPR}_i}{\mathrm{PPV}_i + \mathrm{TPR}_i} \tag{2.9} \]

The F1 score reaches its best value at 1 (in that case we have perfect precision and recall) and its worst at 0. Finally, for the multi-class case, we consider the weighted average of the F1 score of each class.
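For reference, these per-class metrics need not be computed by hand; the short sketch below, assuming hypothetical label vectors, shows how scikit-learn exposes them.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true and predicted labels for a 3-class problem.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

precision = precision_score(y_true, y_pred, average=None)   # PPV_i for each class (Eq. 2.7)
recall = recall_score(y_true, y_pred, average=None)         # TPR_i for each class (Eq. 2.8)
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # weighted average of F1_i (Eq. 2.9)
```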

2.2.1.6 Cohen's Kappa Statistic

In multi-class classification, accuracy, precision and recall do not provide enough information about the performance of our classifier. In that case, we usually resort to Cohen's kappa statistic, defined as:

\[ \mathrm{Kappa} = \frac{O - E}{1 - E} = 1 - \frac{1 - O}{1 - E} \tag{2.10} \]

where O is the observed accuracy and E is the expected accuracy. In other words, it tells us how much better our classifier performs than a classifier that simply guesses at random according to the frequency of each class.
In order to understand the calculation of Cohen's kappa statistic, let us consider a 3-class problem whose confusion matrix is shown in Table 2.1 below.
                                   Predicted Class
                      C1                    C2                    C3                    Total
True Class   C1       a                     b                     c                     a + b + c = C1_True
             C2       d                     e                     f                     d + e + f = C2_True
             C3       g                     h                     i                     g + h + i = C3_True
             Total    a + d + g = C1_Pred   b + e + h = C2_Pred   c + f + i = C3_Pred   N_obs

                  Table 2.1: Confusion matrix for a 3-class problem

where N_obs is the total number of samples and C1, C2 and C3 are the labels of classes 1, 2 and 3. Cohen's kappa is then given by:

\[ \mathrm{Kappa} = \frac{N_{obs} \times (a + e + i) - (C1_{True} \times C1_{Pred} + C2_{True} \times C2_{Pred} + C3_{True} \times C3_{Pred})}{N_{obs}^2 - (C1_{True} \times C1_{Pred} + C2_{True} \times C2_{Pred} + C3_{True} \times C3_{Pred})} \tag{2.11} \]

Equation 2.11 can be generalized to N classes:

\[ \mathrm{Kappa} = \frac{N_{obs} \times \sum_{i=1}^{N} \mathrm{ConfusionMatrix}[i, i] - \sum_{i=1}^{N} Ci_{True} \times Ci_{Pred}}{N_{obs}^2 - \sum_{i=1}^{N} Ci_{True} \times Ci_{Pred}} \tag{2.12} \]

The range of kappa values extends from negative one to positive one, and according to its value we can judge our classifier: a value of 0 indicates a useless classifier, while a value of 1 indicates a perfect model.
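The sketch below, assuming hypothetical label vectors, computes Cohen's kappa both with scikit-learn and directly from the observed and expected accuracies of Equation 2.10.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Hypothetical true and predicted labels.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

kappa = cohen_kappa_score(y_true, y_pred)

cm = confusion_matrix(y_true, y_pred)
n_obs = cm.sum()
observed = np.trace(cm) / n_obs                                   # O: observed accuracy
expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n_obs ** 2   # E: expected accuracy
kappa_manual = (observed - expected) / (1 - expected)             # Equation 2.10
```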

2.2.2 Multi-Class Techniques

The existing multi-class classification techniques can be grouped into three main categories: transformation to binary, extension from binary and hierarchical classification.

2.2.2.1 Transformation to Binary

A multi-class classification problem can be decomposed into several binary classification tasks, which can then be solved separately using binary classifiers. This technique can be applied through the One-versus-Rest or One-versus-One approaches.

• One-versus-Rest (OVR):
If we consider a problem of classifying among N different categories, this approach is based on the idea of dividing the problem into N binary classifiers, where each task compares a specific class with the other (N − 1) classes.
Each binary classifier then returns a real-valued score for its decision.

Finally, to predict the class of a new instance, the classifier returning the highest score is considered the winner and its class is assigned to the new instance.

Despite the fact that this approach is simple, it has some disadvantages. First, the scale of the returned score can differ between the classifiers. Second, this strategy can lead to imbalanced classes: by discriminating one class from the others, the number of negative observations becomes much larger than the number of positive observations.

• One-versus-One (OVO):
In this approach, we compare the categories pairwise. That means that N(N − 1)/2 binary classifiers are built to discriminate between each pair of categories.

Finally, when classifying a new instance, the binary classifiers vote and the class with the highest count of votes wins.

Just like any other approach, OVO suffers from some problems. The most problematic is that, for some instances, several classes may get the same number of votes. (Both decomposition strategies are illustrated in the sketch after this list.)
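The sketch below illustrates both decomposition strategies on synthetic data using scikit-learn's meta-estimators; the data set and the base classifier are hypothetical choices made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Hypothetical 3-class data set.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # N binary classifiers
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # N(N-1)/2 binary classifiers
```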

2.2.2.2 Extension from Binary

Some classification algorithms are directly applicable to more than two classes. These are called extended algorithms and include neural networks, decision trees, k-Nearest Neighbors, Naive Bayes and Support Vector Machines.

2.2.2.3 Hierarchical Classification

Yet another approach to the multi-class classification problem uses a hierarchical division of the output space, i.e. the classes are arranged into a tree. The tree is created such that the classes at each parent node are divided into a number of clusters, one for each child node. The process continues until the leaf nodes contain only a single class. At each node of the tree, a simple classifier, usually a binary classifier, makes the discrimination between the different child class clusters. Following a path from the root node to a leaf node leads to the classification of a new pattern.

2.3 Multi-Class Logistic Regression

Multi-class logistic regression is a more general version of binary logistic regression: the difference lies in the fact that the predicted variable is not restricted to two categories.

To better understand how this algorithm works, we first introduce binary logistic regression. Let us consider a simple case where the data are composed of one feature X and a response Y, where Y = 1 or 0.

2.3.1 Binary Logistic Regression

Logistic Regression is a classification model. Instead of directly predicting the response Y of the data set, the classifier models the probability that Y belongs to a specific class given a particular X, and returns p(Y = 1 | X) = p(X). These probabilities can later be converted into class predictions.

2.3.1.1 Logistic Model

Since the result is a probability, the output p(X) should be between 0 and 1 for all values of X. Logistic regression uses the following function, called the logistic function:

\[ p(X) = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} \tag{2.13} \]

where β0 is the intercept and β1 the coefficient of X.

Starting from Equation 2.13, we can write the following:

\[ \frac{p(X)}{1 - p(X)} = \exp(\beta_0 + \beta_1 X) \tag{2.14} \]

The left-hand side is called the odds. It can take any positive real value: a value close to 0 indicates a very low probability of Y = 1, while ∞ means that the probability of Y = 1 is very high.

We can deduce from Equation 2.14:

\[ \log\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X \tag{2.15} \]

The quantity log(p(X)/(1 − p(X))) is called the log-odds or logit. An interesting result is that logistic regression has a logit that is linear in X: increasing X by one unit changes the logit by β1.

2.3.1.2 Estimating Coefficients

Based on the available training data, the coefficients β0 and β1 used in the above equations can be estimated. We can use the maximum likelihood approach to estimate the unknown coefficients.
The basic idea behind using maximum likelihood to fit a logistic regression model can be formalized with the following equation:

\[ \ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{j : y_j = 0} \big(1 - p(x_j)\big) \tag{2.16} \]

The above function is called the likelihood function.
Other approaches can be used to estimate the parameters of such a statistical model, including Newton's methods and Gradient Descent. The estimates β̂0 and β̂1 are chosen by an algorithm which aims to maximize this likelihood function.
The algorithm stops when the convergence criterion is met or the maximum number of iterations is reached.

2.3.1.3 Making Predictions

Making predictions in logistic regression is a very simple matter: the probability p(X) is computed by plugging the X variable into the following logistic regression equation:

\[ p(X) = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 X)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 X)} \tag{2.17} \]

where β̂0 and β̂1 are the estimated coefficients.
This probability must then be transformed into a binary value in order to make the desired prediction. The logistic (sigmoid) function is an S-shaped curve that maps any real-valued number into a value between 0 and 1; these values are then transformed into either 0 or 1 using a threshold classifier.

The picture below illustrates an example of a logistic regression model.

Figure 2.2: Logistic Regression
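As a minimal illustration, the sketch below fits a binary logistic regression on a hypothetical one-feature data set and applies the 0.5 threshold described above; the fitted coefficients play the role of β̂0 and β̂1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature binary data set.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(x, y)
p_hat = clf.predict_proba(x)[:, 1]    # p(X) as in Equation 2.17
y_hat = (p_hat >= 0.5).astype(int)    # threshold classifier: probabilities mapped to 0 or 1
```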

2.3.1.4 Logistic Regression with Multiple predictors

In this section, we consider the task of forecasting a binary response using multiple features. Equation 2.15 can be generalized as follows:

\[ \log\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p \tag{2.18} \]

where X = (X_1, ..., X_p) are the p predictors.

Equation 2.17 can be rewritten as:

\[ p(X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)} \tag{2.19} \]

As before, we use the maximum likelihood method to estimate β0, β1, ..., βp.

2.3.2 Multi-Class Techniques

When dealing with a problem of classifying among N different categories, logistic regression can be generalized. One attractive technique is OVR, which divides the problem into N binary classifiers, where each task compares a specific class with the other (N − 1) classes.
Each binary classifier then returns the probability that a new instance belongs to the corresponding class, and the predicted class is the one with the highest probability.
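A minimal sketch of this one-versus-rest formulation is given below, assuming a hypothetical synthetic data set and a scikit-learn version that exposes the multi_class option.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical 4-class data set.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)

clf = LogisticRegression(multi_class="ovr", max_iter=1000).fit(X, y)
probabilities = clf.predict_proba(X)   # one probability per class for each instance
predicted = clf.predict(X)             # class with the highest probability
```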

2.3.3 Features Importance

Estimating the importance of each feature is an essential step in any modelling project. It applies a statistical measure to assign a score to each feature. The calculated scores help in choosing features that lead to better-performing models requiring less data. One of the simplest ways to compute the importance of a given attribute in logistic regression models is to calculate the magnitude of its coefficient times its standard deviation.
More precisely, we compute feature importance as follows:

\[ \mathrm{Importance}_{X_i} = \hat{\beta}_i \times \mathrm{Standard\ Deviation}(X_i) \tag{2.20} \]

where β̂i is the estimated coefficient of the attribute Xi.
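A short sketch of Equation 2.20, assuming a hypothetical binary data set and a fitted scikit-learn logistic model, is shown below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data set and fitted binary logistic regression.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

importance = np.abs(clf.coef_[0]) * X.std(axis=0)  # |beta_i_hat| * std(X_i), Equation 2.20
ranking = np.argsort(-importance)                  # feature indices, most important first
```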

2.4 Data Pre-processing

Data pre-processing techniques generally refer to the addition, deletion or


transformation of training set data. The need for data pre-processing is determined
by the type of model used.[10]

2.4.1 Centering and Scaling

The most straightforward data transformation is to center-scale the predictor


variables. To center a variable, the average value is subtracted from all the values.
As a result of centering, the predictor has a zero mean.
Similarly, to scale the data, each value of the predictor variable is divided by its standard deviation. Scaling coerces the values to have a common standard deviation of one. These manipulations are generally used to improve the numerical stability of some calculations.

Some multi-class classification models, such as tree-based models, are notably insensitive to this characteristic of the predictor data. Others, like Logistic Regression, K-Nearest Neighbors and Support Vector Machines, are not.
The only real downside to these transformations is the loss of interpretability of the individual values, since the data are no longer in the original units.

2.4.2 Resolve Skewness

Another common reason for transformations is to remove distributional skewness. In fact, one of the most fundamental assumptions in many predictive models is that the predictors have normal, hence un-skewed, distributions.

An un-skewed distribution is one that is roughly symmetric: the probability of falling on either side of the distribution's mean is approximately equal. In other words, skewness represents an imbalance with respect to the mean of a data distribution. The sample skewness statistic can be computed with the following formula:

\[ \mathrm{Skewness} = \frac{3 \times (\mathrm{Mean} - \mathrm{Median})}{\mathrm{Standard\ Deviation}} \tag{2.21} \]

where:

• Mean is the average of the numbers in the data distribution.

• Median is the number that falls directly in the middle of the data distribution.

• Standard deviation is a measure used to quantify the amount of variation of the data distribution.

Replacing the data with the log, square root or inverse may help to remove the skew.
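The sketch below, using a hypothetical right-skewed sample, measures skewness with pandas and shows how a log transform reduces it.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed (log-normal) predictor.
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

print(x.skew())           # strongly positive skewness for the raw values
print(np.log(x).skew())   # close to zero after the log transform
```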

2.4.3 Near Zero Variance

Data sets sometimes come with predictors that take a unique value across samples. Such an uninformative predictor is more common than you might think, and it is not only non-informative: it can break some models.
Even more common is the presence of predictors that are almost constant across samples. One reason is that we usually break a categorical variable with many categories into several dummy variables; when one of the categories has zero observations, it becomes a dummy variable full of zeroes.
One quick and dirty solution is to remove all predictors whose variance falls below some threshold criterion. Among the multi-class classification models sensitive to near zero variance are Logistic Regression, K-Nearest Neighbors and Neural Networks.[2]

2.4.4 Class Imbalance

Class imbalance is a very common machine learning problem. It arises when the number of instances belonging to one class is much greater than that of another class. This problem clearly affects the model's reliability and accuracy, since most machine learning algorithms work best when the number of instances of each class is approximately equal. To mitigate this problem, we can use approaches from two major categories: cost function based approaches and sampling based approaches.

2.4.4.1 Cost Function Based Approaches

The basic idea behind cost function based approaches is to generate a new objective function that penalizes one or more given types of error. If we think one false negative is worse than one false positive, we can count that one false negative as, e.g., 100 false negatives instead. The machine learning algorithm will then try to make fewer false negatives compared to false positives.

2.4.4.2 Sampling Based Approaches

Sampling based approaches can be summarized into two main subsets:

• Under-sampling: This is the most basic and intuitive method. We simply remove, at random, some instances from the majority class so that it has less effect on the statistical algorithm.
One frequent downside of this approach is the risk of removing some of the most relevant elements, which can affect the reliability of the developed model. This case is illustrated below:

Figure 2.3: Under-sampling

Here, the green line is the ideal decision boundary we would like to have, while the blue line is the actual result. As shown in Figure 2.3, after under-sampling the majority class (the negative class), the new decision boundary is slanted, so some negative class elements will be wrongly classified as positive.

• Over-sampling: As its name indicates, this method randomly duplicates some instances of the minority class so that it has more effect on the statistical algorithm.
The main disadvantage of this approach is that it can lead to over-fitting. This drawback is explained as follows:

Figure 2.4: Over-sampling

The thick positive signs indicate that there are multiple duplicated copies of that data instance. The machine learning algorithm then sees these cases many times and tends to overfit to these examples specifically, resulting in the blue decision boundary shown on the right side of Figure 2.4. A simple sketch of both sampling strategies follows.

2.5 Cross Validation

Cross validation is one of the most famous techniques used by data analysts. It is a very powerful method, based on sampling, for estimating the reliability and robustness of a given model.

Cross validation can be used in many scenarios and for several objectives. It is particularly useful when we are dealing with small data sets and when we are building a statistical model with one or more unknown parameters; the idea is then to tune those parameters in order to get a model that matches the data as well as possible.[10] There are several cross validation techniques:

• Test-Set Validation: The main idea of this basic method is to divide the given data into two completely independent data sets. The first one is the training set, also called the in-sample set, and it contains more than 60% of the original data. The second subset is the testing set, also called the out-of-sample set.
The model is built on the training set and then validated on the testing set. This means that the performance of the model is evaluated on the testing set.

• k-Fold Cross Validation: This approach divides the original sample into k subsets; we select one subset as the validation set and the remaining k − 1 subsets constitute the training set.
As in the above approach, the performance score is calculated on the selected validation subset.
Then the whole operation is repeated by selecting another validation subset from the k − 1 subsets that have not been used yet.
This operation is repeated k times.
Finally, the global performance score is defined as the average of the k performance scores calculated at each iteration.

• Leave-One-Out Cross Validation: It is simply a special case of the second approach with k = n. This means that we train a model on n − 1 instances, validate the model on the remaining instance, and repeat this operation n times.

An advantage of the last two methods is that we are not wasting any data. In other
words, it is a clever way to use all data.
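A minimal k-fold sketch, assuming a hypothetical data set and a logistic regression classifier, is shown below; the global score is the average of the five per-fold scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical binary data set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # one score per fold
print(scores.mean())                                                     # global performance estimate
```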

2.6 Features Selection

Feature selection is the automatic picking of the variables that are most relevant to the predictive model.
It is in this perspective that Robert Neuhaus said: ”Feature selection is itself useful, but it mostly acts as a filter, muting out features that aren’t useful in addition to your existing features.”

There are many reasons why we should care about feature selection, but they can be classified into two main categories: one is about human beings and the other is about machines and machine learning algorithms.[11]

• Knowledge Discovery: It refers to the ability to interpret the features.

• Curse of Dimensionality: The amount of data needed grows exponentially with the number of features used, so having fewer features is highly desirable.[14]

So, we can say that feature selection helps, hopefully, to go from a whole bunch of features to just a few features, thus making the learning problem much easier.

In this section we focus on some wrapper approaches for selecting subsets of predictors, including best subset and stepwise selection procedures.

In the following subsections we consider data with T independent attributes.

2.6.1 Best Subset Selection

In best subset selection, we need to fit models with every possible combination of the T features. The total number of models to fit is therefore 2^T.
This method requires massive computational power: if we consider a model with 100 attributes, we would need to fit 2^100 models. The approach proceeds in two separate steps.

• Step 1: For k = 1, 2, ..., T, fit all models that contain exactly k predictors. Select the best among these models and call it M_k.

• Step 2: Select the best model from among M_1, M_2, ..., M_T.

2.6.2 Stepwise Selection

For computational reasons, best subset selection cannot be applied with a high T .
For this reason, stepwise selection can be the best alternative to best subset
selection.
Stepwise selection methods include forward and backward stepwise selection.

• Forward Stepwise Selection: The procedure starts with the null model (mean(Y)). Then, at each step k, only one predictor is added and a model M_k is picked. Once a variable is retained it never drops from the model, and the total number of fitted models is 1 + T(T + 1)/2 (a sketch of this procedure with scikit-learn is given after this list).

• Backward Stepwise Selection: This is the reverse of forward stepwise selection. It starts with all T predictors; at each step k, only one predictor is dropped and a model M_k is picked. The total number of fitted models is again 1 + T(T + 1)/2.
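The sketch below assumes a hypothetical data set and a recent scikit-learn version providing SequentialFeatureSelector, which adds one predictor at a time in the spirit of forward stepwise selection.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Hypothetical data set with T = 20 candidate predictors.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=5, direction="forward")
X_selected = selector.fit_transform(X, y)   # keeps the five forward-selected predictors
```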

2.7 Conclusion

Different models have different sensitivities to the type of predictors in the model. Transformations of the data to reduce the impact of skewness or outliers can lead to significant improvements in performance. Feature selection can also be effective.
Chapter 3

Experimental Results and


Modelling

3.1 Introduction

We start this chapter by presenting the general scope of the project. Then, we turn our focus to exploratory data analysis and feature engineering. Finally, we tackle the modelling phase, including the performance evaluation of the fitted models.

3.2 General Description

3.2.1 Problem Statement Recall

The main objective of this internship is to analyze liquidity on North America's stock markets by developing a predictive model that forecasts which venue will show the highest volume in the upcoming short-term horizon.
More specifically, we want to develop a classification model that allows us to
predict which venue will show the highest volume with at least Y% of the total
volume. The forecast will be made if in the current minute there is a venue showing
more than X% of the total volume. The two parameters X and Y are given as input
based on specification provided by the host company.
In what follows, X will be set to 30% and Y will be set to 40%.


3.2.2 Data Description

In this project, we use data for 5 days, from 2018-03-05 to 2018-03-09. The dataset comes from the Onetick database and contains data for trades and quotes.

3.2.2.1 Input Variables

From Trades:

• Time: (yy:mm:dd:HH:MM:SS.000000000) The date and time at which the


trade occurred.

• Symbol: (Character(4)) The code of the symbol Traded.

• Price: (Number) Trade price per share.

• Size: (Integer(9)) Number of shares traded.

• Exchange: (Character) Exchange on which the trade occurred.

From Quotes : It is called Level I Quote Data and it contains:

• Time: (yy:mm:dd:HH:MM:SS.000000000), The date and time at which the


quote is entered.

• Symbol: (Character(4)) The code of the symbol quoted.

• Bid Price: (Number) The price at which a buyer is prepared to buy share.

• Bid Size: (Integer(9)) Number of round lots an investor is willing to


purchase at the bid price (100 share units).

• Ask Price: (Number) The price a seller is willing to accept per share.

• Ask Size: (Integer(9)) Number of round lots to be sold at the ask price (100
share units).

• Exchange: (Character) The exchange that issued the quote.



3.2.2.2 Native Venues

The 'Exchange' column in both trades and quotes data contains a character
identifying the native exchange. It takes one of the following values:

• N: Refers to the New York Stock Exchange (NYSE).

• T/Q: Refers to the National Association of Securities Dealers Automated Quotations (NASDAQ).

• Z: Refers to the BATS Z exchange (BATS Z).

• Y: Refers to the BATS Y exchange (BATS Y).

• P: Refers to the Archipelago Exchange (Arca).

• K: Refers to Direct Edge's first platform (EDGX).

• J: Refers to Direct Edge's second platform (EDGA).

3.2.2.3 Predicted Variable

The predicted variable, or target, is a categorical variable. It represents the
venue which will be showing the highest volume with at least 40% of the total
volume.
Each instance is classified into one of the following eight classes:

• NYSE: If NYSE is showing the highest volume with at least 40% of the total volume.

• NASDAQ: If NASDAQ is showing the highest volume with at least 40% of the total volume.

• BATS Z: If BATS Z is showing the highest volume with at least 40% of the total volume.

• BATS Y: If BATS Y is showing the highest volume with at least 40% of the total volume.

• ARCA: If ARCA is showing the highest volume with at least 40% of the total volume.

• EDGX: If EDGX is showing the highest volume with at least 40% of the total volume.

• EDGA: If EDGA is showing the highest volume with at least 40% of the total volume.

• No Venue: If there is no venue showing at least 40% of the total volume.
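As a minimal sketch of how such a label could be derived, assuming a table with one row per minute and one volume column per venue (the layout and the numbers below are illustrative assumptions):

```python
import pandas as pd

Y_THRESHOLD = 0.40  # a venue must carry at least 40% of the total volume

def label_minute(volumes: pd.Series) -> str:
    """Return the dominant venue for one minute, or 'No Venue'."""
    total = volumes.sum()
    if total == 0:
        return "No Venue"
    share = volumes / total
    top_venue = share.idxmax()
    return top_venue if share[top_venue] >= Y_THRESHOLD else "No Venue"

# Hypothetical next-minute volumes per venue (one row per minute).
next_minute = pd.DataFrame({
    "NYSE": [500, 100], "NASDAQ": [300, 120], "ARCA": [200, 110],
    "BATS Z": [0, 90], "BATS Y": [0, 80], "EDGX": [0, 70], "EDGA": [0, 60],
})
print(next_minute.apply(label_minute, axis=1))
# Minute 0 -> 'NYSE' (50% of the total); minute 1 -> 'No Venue' (no venue reaches 40%).
```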

3.2.3 Software Environment

3.2.3.1 Programming Language Python

We chose Python over R because the latter does not handle large data sets well.
It is true that, when it comes to the statistical aspects of the data, R stands
out as one of the most used languages in data science: R has a large worldwide
community providing various packages for statistical analysis.

Python, however, is well adapted to dealing with big data. With an ecosystem
propitious to statistical analysis, Python has become one of the most widely used
languages by data scientists.

In data processing, handling large-scale data sets and having fast access to data
can be the most crucial criteria when choosing a programming language. Python
may not be the best, but it is surely better than R at dealing with medium-scale
data sets (millions of rows). It is open source, easy to use and well suited to
processing string and sequence data. [17]

3.2.3.2 Libraries and Packages

Thanks to its open-source libraries adapted to the field of data science, Python has
become one of the most used languages by data analysts in recent years. [17]
Here is a list of packages that we used in this project:

• Pandas: Pandas is the Python package that we used for handling data. In
fact, it is designed for fast loading, easy cleaning and quick exploration of
data.

• Numpy: Numpy is one of the fundamental packages of the scientific Python
stack. We used it for (n-dimensional) array manipulation.

• Matplotlib: Matplotlib is a Python library for building powerful visualizations.



• Seaborn: Seaborn focuses on the visualization of statistical models. Such
visualizations include heat maps.
• Scikit-learn: Scikit-learn provides a simple and consistent interface to
common machine learning (ML) algorithms, making it easy to bring ML
into production systems. So, it is greatly helpful for statistical modelling.

3.3 Exploratory Data Analysis

The exploratory data analysis (EDA) is a way to analyze data and represents an
essential phase in any analysis or modeling project. It helps us to understand the
data we are facing.
In statistics, EDA refers to a set of procedures for analyzing data sets and
summarizing their main characteristics, often with visual methods. For further
information about EDA, we refer the reader to [15].

In the following subsections, we will present some EDA plots.

3.3.1 EDA: Data from Trades

As seen in the project description, every trade that occurs in the market is
recorded in a trade message. For each trade, it indicates the exchange on which
it occurred, the execution price and the number of shares traded.
Figure 3.1 shows a visualization of the total volume of trades of a single symbol
per exchange per half hour. The symbol used is ’AAPL’ and we visualize data for
the four most liquid exchanges.

Figure 3.1: Trades volume per venue per half hour



Figure 3.1 shows the following:

• 00:00 – 04:00: No Trades.

• 04:00 – 09:30: Low volume of Trades.

• 09:30 – 16:00: High volume of Trades.

• 16:00 – 20:00: Low volume of Trades.

• 20:00 – 00:00: No Trades.

These results are expected. In fact, they conform perfectly to the regular trading
session schedule seen in Chapter 1. We recall that orders and trades are only
available from 04:00 to 20:00. During this period, trades and orders occur in
three different sessions:

• Pre-Market: From 04:00 to 09:30.

• Regular Market: From 09:30 to 16:00.

• After Market: From 16:00 to 20:00.

Another important point shown in Figure 3.1 is that, during the Regular Market
session, trading volumes are much higher at the beginning and the end of the
session, indicating a U-shaped profile.

3.3.2 EDA: Data from Quotes

Similar to trades, quote messages are generated for every quote event. The data
used is known as Level I quote data; it indicates the best bid/ask at a given point
in time. Below, we visualize the total count of quotes of one symbol per half hour.
The symbol used is 'AAPL' on Fri 2018-03-22.

Figure 3.2: Quotes Count per half hour



Figure 3.2 shows that the number of quotes is much larger in the regular market
session than in the pre-market and after-market sessions.

The bid-ask spread is one of the most useful pieces of information that can be
extracted from Level I quote data. It is the amount by which the ask price exceeds
the bid price for a given stock in the market. It is calculated as:

Bid Ask Spread = Ask price − Bid price (3.1)

The following box-plot visualization displays the bid-ask spread over the entire
day for the 'AAPL' stock.

Figure 3.3: Bid-Ask Spread per half hour

As shown in Figure 3.3, the bid-ask spread is much smaller in the regular market
session. This partly explains the high count of trades and quotes occurring in this
session. The size of the bid-ask spread is one of the most relevant attributes used
in modeling liquidity.
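A minimal sketch of how such a spread profile could be computed from Level I quotes, assuming a DataFrame with 'time', 'bid_price' and 'ask_price' columns (the column names are assumptions):

```python
import pandas as pd

def spread_by_half_hour(quotes: pd.DataFrame) -> pd.DataFrame:
    """Per half-hour summary of the bid-ask spread (Equation 3.1)."""
    q = quotes.copy()
    q["spread"] = q["ask_price"] - q["bid_price"]    # Equation 3.1
    q["bucket"] = q["time"].dt.floor("30min")        # half-hour bins
    return q.groupby("bucket")["spread"].describe()  # stats behind a box-plot

# spread_by_half_hour(aapl_quotes)
```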

3.3.3 EDA: Response

We recall that our response is a categorical variable and it represents the venue
showing the highest volume with at least 40% of the total volume in the next
minute.

The following pie plot is generated using 5 days of data for 110 symbols with
various price and volume characteristics.
There are 117 457 observations.

Figure 3.4: Venue showing the highest volume with at least 40% of the total volume

We recall that predictions are only made if, in the current minute, there is a venue
showing at least 30% of the total volume. This suggests visualizing the response
according to the venue satisfying this condition.

Figure 3.5: Response per current venue showing more than 30% Volume

Figure 3.5 identifies the short-term dependence between venues.

3.3.4 Symbols Characterization

Liquidity usually differs from one symbol to another. This fact suggests
characterizing symbols by different groups according to their price, volume and
market-cap. The latter refers to market capitalization, which measures the
company's size in terms of its wealth: it is calculated by multiplying the number
of outstanding shares by the current market price of one share.

Figure 3.6: Symbols Characterization

As shown in Figure 3.6, symbols fall into three categories according to their price,
volume and market-cap. So, each symbol is labelled as low, mid or high price;
low, mid or high volume; and low, mid or high market-cap.

The following plots illustrate the influence of the market-cap characterization on
the response.

Figure 3.7: Response Portrayal per Market-Cap Group



3.4 Features Engineering

Feature engineering is just as important, if not more important, than the choice of
algorithm. In fact, good features can allow a simple model to beat a complex one.

3.4.1 Basic Features

Basic features are based only on time. The idea of using such variables comes
from the U-shaped profile illustrated in Figure 3.1, which displays the trade
volume over the whole day.
Since this U-shaped curve has a parabolic form, only two features are needed:

• F1: Minutes since 09:30 am.

• F2: F1 squared.

Fitting a linear model based on these two basic features to predict the volume in
the next minute produces the following results:

Figure 3.8: Linear model to predict trade volume using basic features

The blue scatter points shown in Figure 3.8 represent the actual trade volume on
the BATS Y stock exchange for the corresponding minute. The black line indicates
the volume predicted by the fitted linear model. The correlation between actual
and predicted volume is equal to 0.67.
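The following sketch reproduces the idea behind this basic-feature model on synthetic per-minute volumes; the data generation and coefficient values are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up per-minute volumes for one venue over a regular session (U shape + noise).
times = pd.date_range("2018-03-05 09:30", "2018-03-05 16:00", freq="1min")
minutes = np.asarray((times - times.normalize()
                      - pd.Timedelta(hours=9, minutes=30)).total_seconds()) / 60
rng = np.random.default_rng(0)
volume = 5000 + 0.05 * (minutes - 195) ** 2 + rng.normal(0, 500, len(times))

df = pd.DataFrame({"time": times, "volume": volume})
df["F1"] = minutes            # minutes since 09:30
df["F2"] = df["F1"] ** 2      # squared term capturing the parabola

model = LinearRegression().fit(df[["F1", "F2"]], df["volume"])
pred = model.predict(df[["F1", "F2"]])
print("Correlation(actual, predicted):", np.corrcoef(df["volume"], pred)[0, 1])
```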

3.4.2 Features From Trades

These attributes are extracted from trade data and they include:

• Percent: It represents the volume percentage of the venue showing the highest
volume in the current minute.

• Trade Sum: One feature is generated for each venue; it indicates the total
volume of trades that occurred on this venue in the last 5 minutes.

• Trade Count: One feature is generated for each venue; it indicates the total
number of trades that occurred on this venue in the last 5 minutes.

3.4.3 Features From Quotes

They include all variables extracted from Level I quote data, which are
essentially based on:

• Bid-Ask Spread in basis points: It is the bid-ask spread expressed relative to
the midpoint price, and it is calculated with the following formula:

Bid Ask Spread (bps) = (Bid Ask Spread / Midpoint Price) × 10000        (3.2)

• Bid event contribution (Be_n): It measures the contribution of the nth
event to the size of the bid queue [18]:

Be_n = 1{P_n^B > P_{n-1}^B} × q_n^B − 1{P_n^B < P_{n-1}^B} × q_{n-1}^B        (3.3)

where P_n^B and q_n^B denote the bid price and bid size of the nth quote event.

• Bid Flow: It describes the flow on the bid (left) side of the order book and it is
estimated as:

Bid Flow = Σ_{n=1}^{N_events} Be_n        (3.4)

• Ask event contribution (Ae_n): It measures the contribution of the nth
event to the size of the ask queue [18]:

Ae_n = 1{P_n^A > P_{n-1}^A} × q_{n-1}^A − 1{P_n^A < P_{n-1}^A} × q_n^A        (3.5)

where P_n^A and q_n^A denote the ask price and ask size of the nth quote event.

• Ask Flow: It describes the flow on the ask (right) side of the order book and it is
calculated as (a computational sketch of these event contributions and flows is
given at the end of this subsection):

Ask Flow = Σ_{n=1}^{N_events} Ae_n        (3.6)

The features generated from quotes number 56 in total (8 features × 7 venues);
they are calculated separately for each venue. They include:

• Bid Size: The total size of shares appearing on the bid side of the Level I
quote data in the last 5 minutes.

• Ask Size: The total size of shares appearing on the ask side of the Level I
quote data in the last 5 minutes.

• Bid Count: Indicates the total number of bid messages in the last 5 minutes.

• Ask Count: Indicates the total number of ask messages in the last 5 minutes.

• Mean Bid-Ask Spread: Mean of the bid ask spread (bps) in the last minute.

• Last Bid-Ask Spread: The last bid ask spread (bps).

• Ask Flow: The ask flow during the last 5 minutes.

• Bid Flow: The bid flow during the last 5 minutes.
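The sketch below illustrates, under assumed column names ('time', 'bid_price', 'bid_size', 'ask_price', 'ask_size'), how the event contributions of Equations 3.3 and 3.5 and the flows of Equations 3.4 and 3.6 could be computed for one venue over a rolling 5-minute window:

```python
import numpy as np
import pandas as pd

def bid_ask_flows(quotes: pd.DataFrame) -> pd.DataFrame:
    """Event contributions (Eqs. 3.3, 3.5) and 5-minute flows (Eqs. 3.4, 3.6).

    `quotes` is assumed to hold one venue's Level I updates, sorted by 'time'.
    """
    q = quotes.copy()
    # Bid event contribution: +q_n^B on a bid price rise, -q_{n-1}^B on a fall.
    up_b = q["bid_price"] > q["bid_price"].shift(1)
    dn_b = q["bid_price"] < q["bid_price"].shift(1)
    q["Be"] = np.where(up_b, q["bid_size"], 0) - np.where(dn_b, q["bid_size"].shift(1), 0)
    # Ask event contribution: +q_{n-1}^A on an ask price rise, -q_n^A on a fall.
    up_a = q["ask_price"] > q["ask_price"].shift(1)
    dn_a = q["ask_price"] < q["ask_price"].shift(1)
    q["Ae"] = np.where(up_a, q["ask_size"].shift(1), 0) - np.where(dn_a, q["ask_size"], 0)
    # Flows: sums of the contributions over the last 5 minutes of events.
    q = q.set_index("time").sort_index()
    q["bid_flow"] = q["Be"].rolling("5min").sum()
    q["ask_flow"] = q["Ae"].rolling("5min").sum()
    return q
```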

3.4.4 Dummy Variables

Dummy variables are boolean indicators. They take the value 0 or 1 to indicate the
presence or absence of some categorical effect. In this context, dummy variables are
used to specify both which venue is showing the highest volume and in which group
the symbol is classified. In total, we get the following 12 features:

• HVenue ARCA: Takes 1 only if ARCA is showing the highest volume in the current minute.

• HVenue BatsZ: Takes 1 only if BATS Z is showing the highest volume in the current minute.

• HVenue BatsY: Takes 1 only if BATS Y is showing the highest volume in the current minute.

• HVenue EDGA: Takes 1 only if EDGA is showing the highest volume in the current minute.

• HVenue EDGX: Takes 1 only if EDGX is showing the highest volume in the current minute.

• HVenue NASDAQ: Takes 1 only if NASDAQ is showing the highest volume in the current minute.

• VolumeGroup Low: Takes 1 only if the symbol is classified in the Low volume group.

• VolumeGroup High: Takes 1 only if the symbol is classified in the High volume group.

• PriceGroup Low: Takes 1 only if the symbol is classified in the Low price group.

• PriceGroup High: Takes 1 only if the symbol is classified in the High price group.

• MarketCapGroup Low: Takes 1 only if the symbol is classified in the Low market-cap group.

• MarketCapGroup High: Takes 1 only if the symbol is classified in the High market-cap group.

We should mention that, to encode a categorical attribute with n possible values,
we need exactly n − 1 dummy variables. For example, Price Group can take three
different values (Low, Mid, High); thus, this attribute is expanded into two dummy
variables: PriceGroup High and PriceGroup Low. If both PriceGroup High and
PriceGroup Low are equal to 0, the symbol is classified in the Mid price group.
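A minimal sketch of this n − 1 encoding with pandas get_dummies (the category ordering shown, which makes 'Mid' the dropped baseline, is an assumption):

```python
import pandas as pd

symbols = pd.DataFrame({
    "PriceGroup": pd.Categorical(["Low", "Mid", "High"],
                                 categories=["Mid", "Low", "High"]),
})
# drop_first=True drops the first category ('Mid'), leaving n - 1 = 2 dummies.
dummies = pd.get_dummies(symbols["PriceGroup"], prefix="PriceGroup", drop_first=True)
print(dummies)
# A 'Mid' symbol has both PriceGroup_Low and PriceGroup_High equal to 0 (False).
```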

3.5 Modeling: Results and Validation

At this phase of the project, we use data for 5 days, from 2018-03-05 to 2018-03-09.
To measure the models' reliability and robustness, the first four days are used as
the training set and the last day is kept as the testing set.
The models are fitted using 85 features extracted from the available quotes and
trades data.
The classification goal is to predict which venue will show the highest volume
with at least 40% of the total volume in the next minute.

3.5.1 Data Pre-processing

3.5.1.1 Class Imbalance

Figure 3.9: Initial Train
Figure 3.10: Resampled Train

As illustrated in Figure 3.9, the number of instances belonging to the No Venue
class is much greater than the number of instances in any of the other classes.
To remedy this problem, we resort to oversampling methods; this choice avoids
any risk of removing some of the most relevant instances.
After resampling the training data, we get a decent number of instances for each
class. The resampled training set, illustrated in Figure 3.10, is used in fitting the
logistic regression models detailed in the following sections.
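Below is a hedged sketch of one possible oversampling scheme, simple random oversampling with sklearn.utils.resample; the exact method used and the 'venue' target column name are assumptions:

```python
import pandas as pd
from sklearn.utils import resample

def oversample(train: pd.DataFrame, target: str = "venue") -> pd.DataFrame:
    """Randomly duplicate minority-class rows until every class matches the majority."""
    counts = train[target].value_counts()
    majority_size = counts.max()
    balanced = []
    for cls, n in counts.items():
        subset = train[train[target] == cls]
        if n < majority_size:
            subset = resample(subset, replace=True, n_samples=majority_size,
                              random_state=0)
        balanced.append(subset)
    # Concatenate and shuffle so classes are not grouped in blocks.
    return pd.concat(balanced).sample(frac=1, random_state=0)

# resampled_train = oversample(train_df, target="venue")
```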

3.5.1.2 Center and Scale

Logistic regression is one of the multi-class classification models that require
centering and scaling during the pre-processing step; it is sensitive to the scale
of the predictors.
In order to prepare the data, we apply the following formulas to each column i:

New Train_Xi = (Train_Xi − Mean(Train_Xi)) / StdDev(Train_Xi)        (3.7)

New Test_Xi = (Test_Xi − Mean(Train_Xi)) / StdDev(Train_Xi)        (3.8)

Note that the test set is standardized with the mean and standard deviation
computed on the training set.
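In practice, Equations 3.7 and 3.8 amount to fitting a standard scaler on the training set only and reusing its statistics on the test set; a minimal sketch with toy matrices (the numbers are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrices standing in for the 85-feature train and test sets.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Eq. 3.7: train mean and std
X_test_scaled = scaler.transform(X_test)        # Eq. 3.8: reuse train statistics
print(X_test_scaled)
```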

3.5.2 All Features Model

After resampling, centering and scaling the data, we turn our focus to fitting
logistic regression classifiers. In this section, we fit our first model, which is
based on all 85 features. All models are fitted using the OVR multi-class technique
detailed in the second chapter.
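A hedged sketch of such an OVR logistic regression fit with scikit-learn is given below; the synthetic data, train/test split and hyperparameters are assumptions standing in for the actual 85-feature data set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the scaled 85-feature data set with 8 classes.
X, y = make_classification(n_samples=2000, n_features=85, n_informative=10,
                           n_classes=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# One binary logistic regression per class (one-vs-rest).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
y_pred = ovr.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```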

The following figures display the performance of this model on the Testing Set.

Figure 3.11: Confusion Matrix of the All Features Model

Figure 3.12: Scores of the All Features Model

As shown in Figure 3.11 and Figure 3.12:

• The accuracy is 30%, meaning that in 70% of predictions the fitted classifier
gives the wrong answer.

• The recall for the No Venue class is equal to 0.4. So, in 60% of the actual
'No Venue' cases, the classifier predicts one particular venue while in truth there
is no venue showing at least 40% of the total volume.

In what follows, our main objective is to fit other models with higher performance.
More specifically, we aim to improve both the 'No Venue' recall and the accuracy
of the model.

3.5.3 Best Features Model

Feature selection allows us to go from a large set of features to just a few. This
hopefully makes the learning problem much easier.

In this section we use a Recursive Feature Elimination (RFE) approach for
selecting a subset of predictors.
RFE is a backward stepwise selection approach. It is one of the most used
methods thanks to its simplicity and efficiency.
To make sure that we are eliminating redundant and insignificant features, we
apply RFE with cross-validation on the training data.
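A minimal sketch of RFE with cross-validation using scikit-learn's RFECV (the estimator settings, CV folds and synthetic data are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training set.
X_train, y_train = make_classification(n_samples=500, n_features=30,
                                        n_informative=8, random_state=0)

# Recursively drop the weakest feature and keep the subset that maximizes
# the cross-validated accuracy.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=3,
                 scoring="accuracy").fit(X_train, y_train)
print("Optimal number of features:", selector.n_features_)
# selector.transform(X_train) keeps only the retained columns.
```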

The following curve represents the cross-validated accuracy as a function of the
number of features selected.

Figure 3.13: Recursive Features Elimination

As Figure 3.13 illustrates, using 40 features is enough to fit a model having,
approximately, the highest accuracy.

To evaluate the efficiency of the above feature selection operations, we measure the
performance of a new logistic regression model fitted with only the 40 best features.
Below are the results:

Figure 3.14: Confusion Matrix of the Best Features Model

Figure 3.15: Scores of the Best Features Model

Based on the performance of the new fitted model exposed in Figure 3.14 and
Figure 3.15, we can evaluate the best features model along the following axes:

• Accuracy: There is not much improvement in accuracy.

• 'No Venue' Recall: The recall for the 'No Venue' class goes from 0.4 to 0.58.

• Complexity: Fitting and interpreting a model with 40 features is much easier
than doing so with a model with 85 features.

It seems that resorting to feature selection techniques has somewhat improved the
performance of our developed model.

3.5.4 Two Levels Model

3.5.4.1 Basic Idea

The motivation for this two-level classification is twofold: the false negative errors
of the 'No Venue' class have to be penalized, and predicting among similar classes
seems more logical.
Thus, the classification takes place in the following two steps.

3.5.4.2 First Level Classifier

At this level, the main objective is to predict whether there is a venue (among the
seven venues) that will show at least 40% of the total volume.
The problem is now a binary classification task. Therefore, the target takes the value:

• Yes: If there is a venue that will show at least 40% of the total volume in the
next minute.

• No: Otherwise.

Figure 3.16: Performance of the First Level Classifier

Figure 3.16 shows the performance of this First Level Classifier on the training and
testing sets.
We recall that one of our objectives is to penalize the false negative errors of the ’No
Venue’ Class.

The logistic regression classifier works by assigning a new instance to the class
with the greatest probability. By default, an instance is classified in the class
YES if:

P(Y = YES | X = x) > 0.5        (3.9)



Thus, the logistic regression works with a threshold of 0.5. However, if we are
concerned about incorrectly predicting Yes for the actual No instances, then we
should consider raising this threshold. Equation 3.9 then becomes:

P(Y = YES | X = x) > Threshold        (3.10)
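A minimal sketch of applying such a custom threshold through predict_proba; the labels, data and threshold values below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict_with_threshold(model, X, threshold=0.7):
    """Predict 'YES' only when P(YES | x) exceeds the chosen threshold."""
    yes_col = list(model.classes_).index("YES")
    proba_yes = model.predict_proba(X)[:, yes_col]
    return np.where(proba_yes > threshold, "YES", "NO")

# Tiny demo on made-up one-dimensional data.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array(["NO", "NO", "NO", "YES", "YES", "YES"])
clf = LogisticRegression().fit(X, y)
print(predict_with_threshold(clf, X, threshold=0.5))
print(predict_with_threshold(clf, X, threshold=0.9))  # stricter: fewer YES predictions
```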

Several performance scores resulting from this approach are illustrated in the
following figure.

Figure 3.17: First Level Classifier Performance as function of the threshold value

Figure 3.17 indicates that raising the threshold yields a vast improvement in the
false negative rate (the 'No Venue' recall). However, this improvement comes at
a cost.

3.5.4.3 Second Level Classifier

The second level classification only arises when the class predicted by the first
classifier is YES. The aim of this classifier is to predict exactly which of the seven
venues is the one expected to show the highest volume.

3.5.4.4 Model Performance

To evaluate the improvement of the two-level classification approach, we measure
the performance of the new model on the testing set.
Below are the results:

Figure 3.18: Confusion Matrix of the Two Levels Model

Figure 3.19: Scores of the Two Levels Model

The model is trained using only the 40 best features retained in the previous section.
Based on the performance exposed in Figure 3.18 and Figure 3.19, it seems that the
developed solution validates a good part of our initial objectives, namely improving
the accuracy rate and the recall for the 'No Venue' class. In fact:

• Accuracy: is boosted; it goes from 30% to 40%.

• ’No Venue’ Recall: is raised from 0.4 to 0.76.



3.6 Features Importance Interpretation

At this level, we go through feature importance. Estimating the importance of
each feature is an essential step in any modelling project: it helps to identify the
main driving features.

To quantify the importance of a feature, we use the magnitude of its coefficient
multiplied by the feature's standard deviation in each binary classification model
(recall that the OVR method fits one binary model per class).
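A minimal sketch of this importance measure, assuming an OVR classifier like the one fitted above and a feature DataFrame; the use of the unscaled features' standard deviations is an assumption about the exact computation:

```python
import numpy as np
import pandas as pd

def feature_importance(ovr_model, X: pd.DataFrame) -> pd.DataFrame:
    """|coefficient| x feature standard deviation, one row per binary OVR model."""
    stds = X.std(axis=0).to_numpy()
    rows = [np.abs(est.coef_.ravel()) * stds for est in ovr_model.estimators_]
    return pd.DataFrame(rows, index=ovr_model.classes_, columns=X.columns)

# importance = feature_importance(ovr, X_train_df)  # rows: classes, columns: features
```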

3.6.1 Basic Features

Basic features are based only on time. They are:

• F1: Minutes since 09:30 am.

• F2: F1 squared.

Figure 3.20: Basic Features Importance

Figure 3.20 quantifies the influence of these attributes in the classification models.

F1 and F2 capture the historical behavior of each venue. Therefore, they are
important when dealing with low-volume venues and approximately null when it
comes to dynamic venues such as NYSE and NASDAQ.

3.6.2 Features Indicating Current Highest Venue

Figure 3.21 shows the importance quantification of the six dummy variables
indicating the current highest venue.

Figure 3.21: Features Indicating Current Highest Venue


Experimental Results and Modelling 47

This result characterizes a continuity in the trading flows. In fact, it indicates
that when a venue is showing the highest volume, with more than 30% of the total,
in the current minute, this same venue is more likely to be showing the highest
volume, with at least 40% of the total, in the next minute.

3.6.3 Trade Features

These attributes are generated from trade messages extracted from Level I market
data. Figure 3.22 quantifies the influence of these attributes in the classification
models.

Figure 3.22: Trade Features

Figure 3.22 shows two main results:

• Features indicating the trade count are mostly important locally; that is, they
mainly matter for the venue on which they are computed. This result is
illustrated by the red coefficients presented in Figure 3.22.

• Features identifying the total traded volume in the last five minutes appear
less important. This can be explained by the fact that they are largely
redundant; consequently, they are removed after applying RFE feature
selection.

3.6.4 Features Indicating Symbols Characterization

Liquidity, usually, differs from one symbol to another. This fact suggests
characterizing symbols by different groups.

Therefore, symbols fall into three categories according to their price, volume and
market-cap. So, each symbol is labelled as low, mid or high price; low, mid or high
volume; and low, mid or high market-cap.

Figure 3.23: Group Characterization Features

From Figure 3.23, we can notice that the importance of dummy variables
characterizing symbols differs from one venue to another. This result suggests
that:

• Symbols in the Low Volume group are likely to have EDGA as the venue
showing the highest volume, with at least 40% of the total volume, in the next
minute.

• The blue colored coefficients indicate that the probability of predicting the
corresponding venue is inversely affected by the symbol's characterization. For
example, when making predictions for high-volume symbols, EDGX and ARCA
are unlikely to be predicted.

3.7 Conclusion

In this chapter, we have gone through the organization of the venue classification
project. On this basis, we have shown the different stages of the implementation
of the classification model, starting from exploratory data analysis and feature
engineering. Finally, we have detailed the different modeling results, upon which
we have decided on the best model to choose.
Conclusion and Perspectives

Liquidity is an important factor in trading. If an asset or a market lacks liquidity,
market orders cannot be executed at a reasonably stable price level. This situation
implies higher costs and higher return volatility.

We have presented, throughout this report, our analysis, research and development
work deployed to quantify liquidity dynamics on North America's stock markets.

In the first part of this report, we introduced the general framework of the project,
which consists in setting up the working environment as well as giving a global
view of the world's financial markets. Then, we described the problem statement
and the project objectives. In the second part, we presented the theoretical
background needed for multi-class classification tasks. In the last part, we showed
the experimental results of our fitted models.

In this context, our work consists of logistic regression models based on the OVR
technique for classifying US venues according to their short-term liquidity. After
creating features from trades and quotes data, a first model is fitted using all
available features. This classifier has poor accuracy and high complexity.

Feature selection techniques are then implemented and 40 of the 85 variables are
kept. A second model is fitted, but no significant improvement is observed.

Finally, we have gone through a two-level model. This model validates a good part
of our initial objectives, namely improving the accuracy rate as well as penalizing
the false negative errors for the 'No Venue' class.

On a personal level, this internship period was extremely enriching, since I had the
opportunity to join a dynamic team and to discover closely the lifecycle of a large
trading project. This internship also allowed me to deepen my technical knowledge,
especially in Python programming, while applying our academic and theoretical
knowledge of data mining.


As far as future prospects are concerned, this project could be developed along the
following axes:

• Create other features that may better reflect the short-term dependence
between different venues.

• Think about some methods to quantify hidden liquidity.

• Explore other advanced models.


Bibliography

[1] Course materials 'Data Mining' of Mr. Wajdi TEKAYA, TPS, 2017.

[2] Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The elements of
statistical learning. Vol. 1. New York: Springer series in statistics, 2001.

[3] Johnson, Barry. Algorithmic Trading & DMA: An introduction to direct access
trading strategies. Vol. 200. London: 4Myeloma Press, 2010.

[4] Cartea, Alvaro, and Sebastian Jaimungal. Modelling asset prices for algorithmic
and high-frequency trading. Applied Mathematical Finance 20.6 (2013).

[5] Avellaneda, Marco. Algorithmic and High-frequency trading: an overview.
Retrieved May 11 (2011): 2013.

[6] Glantz, Morton, and Robert Kissell. Multi-asset risk modeling: techniques for a
global economy in an electronic and algorithmic trading era. Academic Press,
2013.

[7] Gomber, Peter, and Martin Haferkorn. High frequency trading. Encyclopedia of
Information Science and Technology, Third Edition. IGI Global, 2015.

[8] Avellaneda, Marco, Josh Reed, and Sasha Stoikov. Forecasting Prices in the
Presence of Hidden Liquidity. Preprint (2010).

[9] Zivot, Eric. Analysis of High Frequency Financial Data: Models, Methods and
Software. Part I: Descriptive Analysis of High Frequency Financial Data with
S-PLUS. (2005).

[10] James, Gareth, et al. An introduction to statistical learning. Vol. 112. New
York: springer, 2013.

[11] Guyon, Isabelle, and André Elisseeff. An introduction to variable and feature
selection. Journal of Machine Learning Research 3 (Mar. 2003).

[12] Liu, Huan, and Hiroshi Motoda. Feature selection for knowledge discovery and
data mining. Vol. 454. Springer Science & Business Media, 2012.


[13] Lisa Smith, The Auction Method: How NYSE Stock Prices are Set.
Investopedia, LLC, 2018.

[14] Donoho, David L. High-dimensional data analysis: The curses and blessings
of dimensionality. AMS Math Challenges Lecture 1 (2000).

[15] Young, Forrest W., Pedro M. Valero-Mora, and Michael Friendly. Visual
statistics: seeing data with dynamic interactive graphics. Vol. 914. John Wiley &
Sons, 2011.

[16] Cukier, Kenneth, and Viktor Mayer-Schoenberger. The rise of big data: How
it’s changing the way we think about the world. Foreign Aff. 92 (2013).

[17] Layton, Robert. Learning data mining with python. Packt Publishing Ltd, 2017.

[18] Cont, Rama, Arseniy Kukanov, and Sasha Stoikov. The price impact of order
book events. Journal of financial econometrics 12.1 (2014).

[19] Derman, Emanuel. My life as a quant: reflections on physics and finance. John
Wiley & Sons, 2004.

[20] Kolo, Brian. Binary and Multiclass Classification. Lulu.com, 2011.
Appendix A: Features Importance

Figure 24: Basic Features

Figure 25: Highest Venue

Figure 26: Ask Features

Figure 27: Bid Features

Figure 28: Bid-Ask Spread Features

Figure 29: Trade Features

Figure 30: Group Characterization Features


Figure 31: 40 - Selected Features
Appendix B: Response Portrayal per Group

Figure 32: Response Portrayal per Price Group

Figure 33: Response Portrayal per Market-Cap Group

Figure 34: Response Portrayal per Volume Group
