QuantCon
“Light Up Your Dark Data”
April 2016
What is dark data?
[Figure: many separate SQL, CSV, REST, and JSON data sources]
Example Datasets
Trade History
Signal History
Clearing Data
Log Files
Ref Data
Corp Actions
Market Data
Models
Firm Generated, Vendor Generated
Compounding Challenges
Accumulates Quickly
Disparate Storage
Different Vendors
Format Changes
Ad-hoc Usage
Urgent!
Workflow
Find Data
Ad-Hoc ETL
Store / Copy
Analysis
Report
Sample Environment
Oracle, MySQL, MSSQL, KDB, ZIP, CSV
SQL, Python, DSL
R, Matlab, C++, Java
Storage, ETL, Analysis
REST
Independent First Class Citizens
Expression
Compute
Data
Datashape
Structured data description language
http://datashape.pydata.org
Datashape Example
daily_bars: var * {
    date: string,
    symbol: string,
    open: float64,
    high: float64,
    low: float64,
    close: float64,
    volume: int64,
}
Language, compute, and storage independent
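As a rough illustration of what such a description enables, the sketch below mirrors the daily_bars record in plain Python and uses it to turn raw text fields into typed values. It is not the datashape library: `DAILY_BARS_FIELDS` and `coerce_row` are hypothetical names introduced here, and the sample row combines the first line of the vendor file with the symbol taken from its file name.

```python
# Plain-Python analogue of the daily_bars record (illustrative only,
# not the datashape library). Each field name maps to a converter.
DAILY_BARS_FIELDS = {
    "date": str,
    "symbol": str,
    "open": float,
    "high": float,
    "low": float,
    "close": float,
    "volume": int,
}


def coerce_row(raw, fields=DAILY_BARS_FIELDS):
    """Convert a dict of raw strings into typed values, field by field."""
    missing = [name for name in fields if name not in raw]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    return {name: convert(raw[name]) for name, convert in fields.items()}


# Raw text as it would arrive from a CSV reader.
raw_row = {
    "date": "20151111", "symbol": "aaap",
    "open": "18.5", "high": "25.9", "low": "18", "close": "24.5",
    "volume": "1584600",
}
typed_row = coerce_row(raw_row)
```

Because the description is plain data, the same mapping could in principle drive a SQL table definition or a typed array layout, which is the sense in which a schema can be independent of language, compute, and storage.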
Blaze
Write expressions independent of storage system
Push computations to the data
Lazy evaluation
Pandas-like API
http://blaze.pydata.org/
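The ideas listed for Blaze can be sketched without Blaze itself. The example below is a minimal stand-in using only the Python standard library, not the Blaze API: `MeanBy`, `compute_in_python`, and `compute_in_sql` are hypothetical names, and the four rows reuse closing prices from the sample data in this deck. One expression is defined once, holds no data, and is then evaluated either in Python or by translating it to SQL so that the database does the aggregation.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class MeanBy:
    """Lazy expression: mean of `value` grouped by `key`. Holds no data."""
    table: str
    key: str
    value: str


def compute_in_python(expr, rows):
    """Evaluate the expression over an in-memory list of dicts."""
    totals = {}
    for row in rows:
        total, count = totals.get(row[expr.key], (0.0, 0))
        totals[row[expr.key]] = (total + row[expr.value], count + 1)
    return {key: total / count for key, (total, count) in totals.items()}


def compute_in_sql(expr, conn):
    """Translate the expression to SQL so the database does the work."""
    query = (
        f'SELECT "{expr.key}", AVG("{expr.value}") '
        f'FROM "{expr.table}" GROUP BY "{expr.key}"'
    )
    return dict(conn.execute(query).fetchall())


# Building the expression touches no data (lazy evaluation).
mean_close = MeanBy(table="daily_bars", key="symbol", value="close")

rows = [
    {"symbol": "aaap", "close": 24.5},
    {"symbol": "aaap", "close": 25.0},
    {"symbol": "zyne", "close": 9.75},
    {"symbol": "zyne", "close": 9.64},
]

# The same rows loaded into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_bars (symbol TEXT, close REAL)")
conn.executemany("INSERT INTO daily_bars VALUES (:symbol, :close)", rows)

from_python = compute_in_python(mean_close, rows)
from_sql = compute_in_sql(mean_close, conn)
```

Both calls answer the same question from the same expression; only the backend changes. In the SQL case the rows never leave the database, which is what pushing computation to the data means here.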
Blaze Expressions
Flat File Repositories
Many directories and files
Dictated structure
Naming convention part of dataset
Requires one-off ad-hoc scripts
Vendor - directory structure
/daily/us/nasdaq stocks/
/daily/us/nasdaq stocks/1/
/daily/us/nasdaq stocks/2/
osn.us.txt
ostk.us.txt
…
zyne.us.txt
/daily/us/nyse etfs/
/daily/us/nyse stocks/1/
/daily/us/nyse stocks/2/
Contains ~8400 individual files
Vendor – file contents
Date,Open,High,Low,Close,Volume,OpenInt
20151111,18.5,25.9,18,24.5,1584600,0
20151112,24.25,27.12,22.5,25,83000,0
20151113,25.47,26.2,24.55,25.26,67300,0
20151116,25.01,26.19,24.13,25.02,16900,0
20151117,24.46,25.51,24.38,24.62,25900,0
20151118,24.62,26.31,24.06,25,111100,0
20151119,24.85,26,24.71,25.9,113100,0
…
Symbol is not contained within the individual data files
/daily/us/nasdaq stocks/1/aaap.us.txt
Lux
source: "lux://global-equities/data/daily/us/nasdaq stocks"
extractor: "{}/{Symbol}.{Region}.txt"
Date,Open,High,Low,Close,Volume,OpenInt,Symbol,Region
20151111,18.5,25.9,18,24.5,1584600,0,aaap,us
20151112,24.25,27.12,22.5,25,83000,0,aaap,us
20151113,25.47,26.2,24.55,25.26,67300,0,aaap,us
…
20160322,11.56,11.98,10.8894,11.09,517604,0,zyne,us
20160323,11.3,11.72,9.5,9.75,489743,0,zyne,us
20160324,9.5,10.24,9.22,9.64,188512,0,zyne,us
One dataset with ~5.5 million rows
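How Lux interprets the extractor pattern is not spelled out in the deck, so the following is only a sketch of the general idea in plain Python. It assumes that `{}` matches one unnamed path segment and that `{Symbol}` and `{Region}` capture named fields from the file name. `pattern_to_regex` and `attach_path_fields` are hypothetical helpers, and the relative path `1/aaap.us.txt` is the example file path with the source root removed.

```python
import csv
import io
import re


def pattern_to_regex(pattern):
    """Turn '{}/{Symbol}.{Region}.txt' into a compiled regex.

    Assumption: '{}' matches one unnamed path segment and '{Name}'
    captures a named field containing no '/' or '.'.
    """
    regex = ""
    for part in re.split(r"(\{\w*\})", pattern):
        if part == "{}":
            regex += r"[^/]+"
        elif re.fullmatch(r"\{\w+\}", part):
            regex += rf"(?P<{part[1:-1]}>[^/.]+)"
        else:
            regex += re.escape(part)
    return re.compile(regex)


def attach_path_fields(relative_path, csv_text, extractor):
    """Append the fields captured from the path to every CSV row."""
    match = extractor.fullmatch(relative_path)
    if match is None:
        raise ValueError(f"path does not match pattern: {relative_path}")
    extra = match.groupdict()
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader) + list(extra)
    rows = [row + list(extra.values()) for row in reader]
    return header, rows


extractor = pattern_to_regex("{}/{Symbol}.{Region}.txt")
header, rows = attach_path_fields(
    "1/aaap.us.txt",
    "Date,Open,High,Low,Close,Volume,OpenInt\n"
    "20151111,18.5,25.9,18,24.5,1584600,0\n",
    extractor,
)
```

Applying the same function to every file under the source directory and concatenating the results would give one table in which the naming convention has become ordinary columns.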
Lux Benefits
Combines individual files
No separate ETL or storage
Names become part of data
Optimized compute
Anaconda Mosaic
Interactive exploration
Intuitive interface
Advanced visualizations
Catalog of datasets and expressions
Provenance and Governance
Live Walkthrough
Project References
• Anaconda Mosaic - http://know.continuum.io/Anaconda-Mosaic
• Blaze Ecosystem - http://blaze.pydata.org
• Bokeh - http://bokeh.pydata.org

Light up Your Dark Data by Lance Ransom at QuantCon 2016