CFM
www.cfm.fr
Applied machine learning in finance:
Estimating missing bid-ask spreads
Presented by Marine Michaut, Elise Tellier
Thursday, November 29, 2018
Women in Machine Learning & Data Science (WiMLDS) Meetup
© CFM 2018
Proprietary and confidential - not for redistribution
Overview
Firm: quantitative, systematic asset manager
History: founded in 1991 (27 years)
Our approach: a scientific approach to finance
Investment process: enabled by research and technology
Trading: we trade liquid instruments across global markets, including futures, equities, bonds, options, spot & forward FX, and credit
Global reach: based in Paris, with offices in London | New York | Tokyo | Sydney
People: 233 employees, 30+ nationalities, 69 PhDs, who share a culture of innovation, collaboration, humility
AUM: US$10.1 billion (US$5.7 billion + US$4.4 billion)
Notional AUM are calculated across all CFM mandates, adjusted for leverage, including the AUM of the UCITS versions of certain of CFM's Alternative Beta programs, managed by a partner. Total notional AUM for the Alpha program is US$5.7bn, or US$4.4bn in equity; for Alternative Beta, notional AUM of US$4.4bn corresponds to US$3.8bn in equity.
Financial context
Definition of bid-ask spread
- Bid-ask spread is computed in real time from order book information.
- Order book = electronic list of buy and sell orders organized by price level.
[Figure: order book with BUYERS (decreasing bid prices) and SELLERS (decreasing ask prices); the best bid and the best ask are highlighted.]
- Best ask $34.44, best bid $34.32: bid-ask spread = $34.44 - $34.32 = $0.12
- Best ask $34.38, best bid $34.36: bid-ask spread = $34.38 - $34.36 = $0.02
- Minimum price increment = $0.01
- Best ask $34.38, best bid $34.37: bid-ask spread = $34.38 - $34.37 = $0.01 = minimal spread
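As a minimal sketch of the definition above, the spread can be read off an order book as the lowest ask minus the highest bid. The price levels below are hypothetical (only some of them appear on the slides), and the helper name `bid_ask_spread` is an illustration, not part of any CFM tool.

```python
from decimal import Decimal

# Hypothetical order book snapshot: price levels only, built from strings
# so that cent arithmetic is exact.
bids = [Decimal("34.36"), Decimal("34.35"), Decimal("34.32")]
asks = [Decimal("34.38"), Decimal("34.40"), Decimal("34.44")]
TICK = Decimal("0.01")  # minimum price increment from the slides


def bid_ask_spread(bids, asks):
    """Return (best_bid, best_ask, spread) for non-empty bid and ask price lists."""
    best_bid = max(bids)  # highest price a buyer is willing to pay
    best_ask = min(asks)  # lowest price a seller is willing to accept
    return best_bid, best_ask, best_ask - best_bid


best_bid, best_ask, spread = bid_ask_spread(bids, asks)
print(best_bid, best_ask, spread)  # 34.36 34.38 0.02

# A new buy order at $34.37 improves the best bid: the spread reaches the minimum increment.
bids.append(Decimal("34.37"))
_, _, tight_spread = bid_ask_spread(bids, asks)
print(tight_spread == TICK)  # True
```

`Decimal` is used here only to keep the cent-level subtraction exact; floats would work for an approximate illustration.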
Why is bid-ask spread important?
- The bid-ask spread measures the cost of completing a round trip (buy and sell).
- Example with the APPLE order book: buy at $173.82, sell at $173.80.
  cost = (ask - bid) = spread = $0.02
- We need bid-ask spreads to simulate execution costs:
\[ \text{cost} = f(\text{bid-ask spread}) \times \text{traded quantity} + \text{other costs} \]
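The round-trip example can be checked with a few lines of code. This is a sketch only: it reads the cost formula as f(spread) × traded quantity + other costs, and the default `f` (half the spread) is an assumption chosen for illustration, since the slides do not say which function is used.

```python
from decimal import Decimal


def round_trip_cost(ask, bid, quantity=1):
    """Cost of buying `quantity` shares at the ask and selling them back at the bid."""
    return (ask - bid) * quantity


def execution_cost(spread, quantity, other_costs=Decimal("0"), f=lambda s: s / 2):
    """cost = f(spread) * quantity + other_costs.

    ASSUMPTION: f defaults to the half-spread, a common convention for a
    one-way trade; the slides do not specify f.
    """
    return f(spread) * quantity + other_costs


ask, bid = Decimal("173.82"), Decimal("173.80")
print(round_trip_cost(ask, bid))                # 0.02
print(execution_cost(ask - bid, quantity=100))  # 1.00
```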
From real time to daily data
- ~10,000 trades per day for 1 stock
- ~50,000 stocks in our universe
- Our strategy needs bid-ask spreads on a daily basis
Real-time data are aggregated into one daily spread per stock, weighted by traded quantity:
\[ \text{Daily spread} = \frac{\sum_{\text{quotes/day}} \text{spread} \times \text{traded quantity}}{\sum_{\text{quotes/day}} \text{traded quantity}} \]
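The daily spread formula is a quantity-weighted average of intraday spreads. A minimal sketch, assuming each intraday observation is a hypothetical `(spread, traded_quantity)` pair:

```python
def daily_spread(quotes):
    """Quantity-weighted average spread over one day's (spread, traded_quantity) pairs."""
    total_quantity = sum(q for _, q in quotes)
    if total_quantity == 0:
        return None  # nothing traded that day: the daily spread is missing
    return sum(s * q for s, q in quotes) / total_quantity


# Hypothetical intraday observations for one stock: (spread in $, traded quantity)
quotes = [(0.02, 500), (0.01, 1500), (0.03, 0)]
print(round(daily_spread(quotes), 6))  # 0.0125
```

Returning `None` for a day without trades is a design choice of this sketch; it is one way missing daily spreads can arise.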
Bid-ask spread estimation
Spreads estimation
Scope
We need a long history to be able to add execution costs to our simulations, and
data are missing.
- Long history: data from 2000
- Coverage: world
- ~50,000 stocks and many financial markets, with:
  - different price increments
  - different traded quantities
  - different spread behaviours
What we have in our real data
[Figure: Real Bid-Ask Spread Coverage. x-axis: date; y-axis: coverage (%). Annotations: "no missing spread" and "missing spreads".]
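A coverage curve of this kind can be computed as the share of stocks with a known spread on each date. The sketch below assumes a hypothetical row layout `(date, stock_id, spread)` with `None` for a missing spread; the dates and values are made up.

```python
from collections import defaultdict


def coverage_by_date(rows):
    """Percentage of stocks with a known spread on each date.

    rows: iterable of (date, stock_id, spread), spread is None when missing.
    """
    known, total = defaultdict(int), defaultdict(int)
    for date, _stock, spread in rows:
        total[date] += 1
        known[date] += spread is not None  # True counts as 1
    return {d: 100.0 * known[d] / total[d] for d in sorted(total)}


rows = [
    ("2005-03-01", "A", 0.02), ("2005-03-01", "B", None),
    ("2005-03-01", "C", None), ("2005-03-01", "D", 0.10),
    ("2018-01-02", "A", 0.01), ("2018-01-02", "B", 0.04),
]
print(coverage_by_date(rows))  # {'2005-03-01': 50.0, '2018-01-02': 100.0}
```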
Example of missing spreads
[Figure: Daily spreads for one stock. x-axis: date. Part of the history is missing ("?").]
We need an estimator to fill missing bid-ask spreads.
Machine learning workflow
Training:
- Raw data → feature extraction
- Stocks with spreads: features X and spread y
- Train and evaluate the model
- Final model → deploy model
Prediction:
- Stocks without spreads: features X, spread y unknown (?)
- Final model → estimate missing spread y
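The split between the training branch and the prediction branch of the workflow can be sketched as follows. The row layout, the single `volume` feature, and the helper name `split_by_target` are assumptions for illustration.

```python
def split_by_target(rows, target="spread"):
    """Split rows into a training set (target known) and a prediction set (target missing)."""
    train = [r for r in rows if r.get(target) is not None]
    predict = [r for r in rows if r.get(target) is None]
    return train, predict


# Hypothetical rows after feature extraction
rows = [
    {"stock_id": 1, "volume": 256_225, "spread": 0.02},
    {"stock_id": 2, "volume": 3_781, "spread": 1.25},
    {"stock_id": 100, "volume": 131_935, "spread": None},
]
train, predict = split_by_target(rows)
X_train = [[r["volume"]] for r in train]      # features of stocks with spreads
y_train = [r["spread"] for r in train]        # known targets
X_predict = [[r["volume"]] for r in predict]  # features of stocks without spreads
print(len(X_train), len(X_predict))  # 2 1
```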
Look for determinants of bid-ask spread
High trading volume:
- Lots of buyers and sellers
- Stocks easy to trade
- Narrow spread
[Figure: order book. Buyers: $10, $9.98, $9.97, $9.96. Sellers: $10.01, $10.02, $10.03, $10.04. Spread = $0.01.]
Here the spread is equal to the minimum price increment.
Low trading volume:
- Few buyers and sellers
- Stocks more difficult to trade
- Larger spread
[Figure: order book. Buyers: $10, $9.95, $9.90, $9.80. Sellers: $11, $11.2, $11.5. Spread = $1.]
Input features
Trading volume
[Figure: bid-ask spread vs. trading volume (log scale), R² = 0.60]
The higher the trading volume, the lower the bid-ask spread.
Do we have the same correlation all around the world?
Input features
Stock exchange's zone
[Figure: bid-ask spread vs. trading volume, by stock exchange zone (log scale)]
Different correlations depending on the stock exchange's zone.
Data set draft
TRAINING SET (input features X, target y)
Stock ID | Trading volume | Country | … | x_p | y = spread
1        | 1,079,259      | US      |   |     | 0.026
2        | 41,700         | Japan   |   |     | 0.04
3        | 186,400        | Japan   |   |     | 0.014
4        | 4,195          | US      |   |     | 0.052
…        | …              |         |   |     |
N        | 2,195,298      | France  |   |     | 0.005
What about dates?
How to deal with time?
A possible approach
- Train a model on 1 year: rows dated 2 Jan 2000 to 31 Dec 2000, with P features (X) and y = spread
- Estimate spreads for the following month: rows dated 2 Jan 2001 to 31 Jan 2001, with P features (X) and unknown y (?)
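The train-on-one-year, predict-the-next-month scheme boils down to two date windows. A minimal sketch, with a hypothetical row layout and helper names:

```python
from datetime import date


def in_window(day, start, end):
    """True if start <= day <= end (both ends inclusive)."""
    return start <= day <= end


def split_year_then_month(rows, year):
    """Train on rows dated in `year`; predict for rows dated in January of `year + 1`."""
    train = [r for r in rows
             if in_window(r["date"], date(year, 1, 1), date(year, 12, 31))]
    predict = [r for r in rows
               if in_window(r["date"], date(year + 1, 1, 1), date(year + 1, 1, 31))]
    return train, predict


# Hypothetical rows; a real row would also carry the P features
rows = [
    {"date": date(2000, 1, 3), "spread": 0.02},
    {"date": date(2000, 12, 29), "spread": 0.03},
    {"date": date(2001, 1, 2), "spread": None},
    {"date": date(2001, 2, 1), "spread": None},
]
train, predict = split_year_then_month(rows, 2000)
print(len(train), len(predict))  # 2 1
```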
How to deal with time?
Non-stationarity
[Figure: Daily median spread of US stocks. x-axis: date.]
Challenge: non-stationary target
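A daily median spread series like the one plotted can be built by grouping spreads by date and taking the median across stocks. The dates and values below are invented; a level that shifts from one date to another is what makes the target non-stationary.

```python
from collections import defaultdict
from statistics import median


def daily_median_spread(rows):
    """Median spread across stocks for each date, skipping missing values."""
    by_date = defaultdict(list)
    for day, spread in rows:
        if spread is not None:
            by_date[day].append(spread)
    return {day: median(values) for day, values in sorted(by_date.items())}


# Hypothetical (date, spread) observations for several stocks
rows = [
    ("2008-09-12", 0.01), ("2008-09-12", 0.02), ("2008-09-12", 0.03),
    ("2008-10-10", 0.04), ("2008-10-10", 0.08), ("2008-10-10", 0.06),
    ("2008-10-10", None),
]
print(daily_median_spread(rows))  # {'2008-09-12': 0.02, '2008-10-10': 0.06}
```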
How to deal with time?
Date's influence on correlations
[Figure: different behaviour depending on the day]
Spreads estimation
What we could do
One model per year and zone
- Benefit: bigger training set for small countries
- Limit: estimation can be wrong, or even impossible, because of high volatility days
[Figure: Estimated and real spread for one stock. Series: real spread, estimated spread, median of US spreads. x-axis: date; y-axis: spread.]
Final data set
We choose to train a daily model per zone. Each day:
- Train on stocks with spreads
- Estimate missing spreads
TRAINING SET (X, y)
Stock ID | Date       | Traded volume | x_2 | … | x_p | Spread
1        | 2018-01-02 | 256,225       |     |   |     | 0.02
2        | 2018-01-02 | 3,781         |     |   |     | 1.25
…        | …          |               |     |   |     |
N        | 2018-01-02 | 217,608       |     |   |     | 0.046
PREDICTION SET (X, y = ?)
Stock ID | Date       | Traded volume | x_2 | … | x_p | Spread
100      | 2018-01-02 | 131,935       |     |   |     | ?
200      | 2018-01-02 | 66,101        |     |   |     | ?
Spreads estimation
Regressions
Let's train models with the following regressors:
- Linear Regression
- Support Vector Regression (SVR)
- XGBoost
How to choose the final estimation?
[Figure: Time series of median error per regressor (XGBoost, SVR, Linear Regression). y-axis: error.]
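A median-error time series per regressor can be computed by grouping errors by regressor and date. In this sketch the error is taken as the absolute difference between the estimate and the real spread, which is an assumption (the slides do not define the metric), and all values are invented.

```python
from collections import defaultdict
from statistics import median


def median_error_series(records):
    """Median absolute error per (regressor, date).

    records: iterable of (date, regressor, real_spread, estimated_spread).
    ASSUMPTION: error = |estimate - real|; the slides do not define the metric.
    """
    errors = defaultdict(list)
    for day, regressor, real, estimate in records:
        errors[(regressor, day)].append(abs(estimate - real))
    return {key: median(vals) for key, vals in sorted(errors.items())}


# Hypothetical estimates on stocks whose real spread is known
records = [
    ("2018-01-02", "SVR", 0.02, 0.03), ("2018-01-02", "SVR", 0.10, 0.10),
    ("2018-01-02", "SVR", 0.05, 0.02),
    ("2018-01-02", "LinReg", 0.02, 0.02), ("2018-01-02", "LinReg", 0.10, 0.14),
    ("2018-01-02", "LinReg", 0.05, 0.07),
]
series = median_error_series(records)
for (regressor, day), err in series.items():
    print(day, regressor, round(err, 4))
```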
Questions raised by the error plot: LinReg or SVR? Choose SVR? XGBoost and SVR?
Final spreads estimation: median of the three regressors' outputs
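Taking the median of the three outputs is straightforward once each regressor has produced its estimates. A minimal sketch with invented predictions; the dictionary keys are labels only, no model is trained here.

```python
from statistics import median


def combine_estimates(predictions):
    """Per-stock median across regressors.

    predictions: dict mapping regressor name -> list of estimates (same stock order).
    """
    return [median(values) for values in zip(*predictions.values())]


# Hypothetical outputs of the three regressors for three stocks
predictions = {
    "linear_regression": [0.020, 0.050, 1.10],
    "svr":               [0.025, 0.045, 1.30],
    "xgboost":           [0.018, 0.400, 1.25],
}
print(combine_estimates(predictions))  # [0.02, 0.05, 1.25]
```

With three regressors the median discards the most extreme estimate for each stock, as with the 0.400 outlier for the second stock.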
Back to our original problem
Fill missing spreads
[Figure: Daily spreads and estimations for one stock. Series: real spread, estimated spread. x-axis: date.]
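Filling the gaps then amounts to keeping the real spread where it exists and using the estimate elsewhere. A sketch with made-up series, aligned by date:

```python
def fill_missing(real, estimated):
    """Keep the real spread when it exists, otherwise fall back to the estimate."""
    return [r if r is not None else e for r, e in zip(real, estimated)]


# Hypothetical daily series for one stock (None marks a missing real spread)
real = [0.020, None, None, 0.030]
estimated = [0.021, 0.024, 0.027, 0.029]
print(fill_missing(real, estimated))  # [0.02, 0.024, 0.027, 0.03]
```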
What we have learnt
Myth: few input samples imply bad model performance
How to deal with an ML problem with little data?
- Stick to simple models
- Spend time designing your dataset
- Dig deeper to understand the model
- Investigate where the model does not perform well
- Try different estimators and combine them
- Try, try, try again… and succeed!
Questions?
Disclaimer
Any description or information involving investment process or allocations is provided for illustration
purposes only.
Any statements regarding correlations or modes or other similar statements constitute only subjective
views, are based upon expectations or beliefs, should not be relied on, are subject to change due to a
variety of factors, including fluctuating market conditions, and involve inherent risks and uncertainties,
both general and specific, many of which cannot be predicted or quantified and are beyond Capital
Fund Management's control. Future evidence and actual results could differ materially from those set
forth, contemplated by or underlying these statements. There can be no assurance that these
statements are or will prove to be accurate or complete in any way. All figures are unaudited.

Applied Machine Learning in Finance: Estimating Missing Bid-Ask Spreads and Detecting Anomalies by Elise Tellier, Data Scientist & Marine Michaut, Software Engineer at Capital Fund Management
