Data Science
Joaquin Vanschoren
WHAT IS DATA SCIENCE?
[Figure: Venn diagram of three skill sets (Hacking skills, Maths & Stats, Expertise), with regions labelled Machine Learning, Research, Danger zone!, and Data Science. Source: Drew Conway]
[Figure: extended Venn diagram with four circles (Maths & Stats, Hacking skills, Expertise, Evil). Region labels: (data) science officer, danger zone!, machine learning, James Bond villain, data science, outside committee member, identity thief, thesis advisor, office mate, NSA. Source: Joel Grus]
“Whenever you read about data science or data analysis, it’s about the ability to store petabytes of data, retrieve that data in nanoseconds, then turn it into a rainbow with a unicorn dancing on it.”
– David Coallier
THE HYPE
“Data Scientist: The Sexiest Job of the 21st Century”
– Harvard Business Review
THE REALITY
• You’ll clean a lot of data. A LOT
• A lot of mathematics. Get over it
• Some days will be long. Get more coffee
• Not everything is about Big Data
• Most people don’t care about data
• Spend time finding the right questions
[David Coallier]
BIG DATA: THE END OF THEORY?
“Out with every theory of human behavior, from linguistics to sociology. Who knows why people do what they do? The point is they do it… With enough data, the numbers speak for themselves.”
– Chris Anderson, WIRED
All models are wrong. But some are useful.
All models are wrong, and increasingly you can succeed without them.
Big data is a step forward. But our problems are not lack of access to data, but understanding them.
Data Scientific Method
[DJ Patil, J Elman]
START WITH A QUESTION
Based on an observation
• What (just) happened?
• Why did it happen?
• What will/could happen next?
ANALYSE CURRENT DATA
Create a hypothesis
CREATE FEATURES, EXPERIMENT
Test the hypothesis
ANALYSE RESULTS
Won’t be pretty, repeat
LET DATA FRAME THE CONVERSATION
Data gives you the what
Humans give you the why
Let the dataset change your mindset
CONVERSE
• What data is missing? Where can we get it?
• Automate data collection
• Clean data, then clean it more
• Visualize data: the brain sees
• Merge various sources of information
• Reformulate hypotheses
• Reformulate questions
DATA SCIENCE TOOLS
We can't solve problems at the
same level of thinking with which
we've created them…
Probabilistic algorithms
When polynomial time is just too slow
[1,54,853,23,4,73,…] Have we seen 73?
min H(1,54,853,…) < H(73) ?
min H2(1,54,853,…) < H2(73) ?
min H3(1,54,853,…) < H3(73) ?
MINHASHING
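One way to read the slide above: keep only the minimum hash value per hash function, so an element that was seen can never hash below the stored minimum. The same signatures also estimate how similar two sets are. The following is a minimal sketch, not the deck's own code; `hash_with_seed`, `minhash_signature`, and `estimate_jaccard` are hypothetical names, and SHA-256 with a seed prefix stands in for the hash functions H, H2, H3.

```python
import hashlib


def hash_with_seed(value, seed):
    """Deterministic 64-bit hash of `value`; each seed acts as a different hash function."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=256):
    """For every hash function, keep only the minimum hash over all items."""
    items = list(items)
    return [min(hash_with_seed(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of positions where two signatures agree estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


seen = {1, 54, 853, 23, 4, 73}
other = {1, 54, 853, 23, 4, 99}  # shares 5 of 7 distinct elements with `seen`

sig_seen = minhash_signature(seen)
sig_other = minhash_signature(other)
similarity = estimate_jaccard(sig_seen, sig_other)
print(f"estimated similarity: {similarity:.2f} (exact Jaccard: {5 / 7:.2f})")
```

The signature has a fixed size no matter how large the sets grow, which is what makes the comparison cheap.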
Bloom filters	

Find similar documents, photos, …
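A Bloom filter answers "have we seen 73?" with a small bit array instead of the full list. The sketch below is illustrative only: the class name, the sizes (`num_bits=1024`, `num_hashes=3`), and the use of seeded SHA-256 are assumptions, not taken from the slides. It can report false positives, but never false negatives.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k hash functions set k positions per item."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive one position per hash function by prefixing a seed
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not seen"; True means "probably seen"
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter()
for value in [1, 54, 853, 23, 4, 73]:
    bloom.add(value)

seen_73 = bloom.might_contain(73)
print("seen 73?", seen_73)
```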
MapReduce
[Figure: MapReduce data flow. Input data (HDFS) is read as split 1, split 2, …, split n; mapper nodes run map on local worker nodes and write intermediate results; a shuffle (remote read) delivers them to reducer nodes, which run reduce and write the output data (HDFS).]
Mapper: <a,apple> -> <a’,slices>
Input file (split 0, split 1, split 2): <p,pineapple>, <a,apple>, <o,orange>
Intermediate file (local): <p’,slices>, <a’,slices>, <o’,slices>
Reducer: reads the intermediate file after the shuffle and writes the output file
1 mapper/split, 1 reducer/key(set)
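The map, shuffle, and reduce phases can be simulated in a few lines on a single machine. This is a toy sketch under stated assumptions: `run_mapreduce`, `slice_fruit`, and `collect` are hypothetical names, and what the reducer does with the slices is not given on the slide, so here it simply collects them per key.

```python
from collections import defaultdict


def run_mapreduce(splits, mapper, reducer):
    """Single-process imitation of the map -> shuffle -> reduce flow."""
    # Map phase: every record of every split goes through the mapper
    intermediate = []
    for split in splits:
        for key, value in split:
            intermediate.extend(mapper(key, value))

    # Shuffle phase: group intermediate values by key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reducer call per key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def slice_fruit(key, fruit):
    # Hypothetical mapper: <a, apple> becomes <a', apple slices>
    yield key + "'", f"{fruit} slices"


def collect(key, values):
    # Hypothetical reducer: gather all slices that share a key
    return sorted(values)


splits = [[("p", "pineapple")], [("a", "apple")], [("o", "orange")]]
output = run_mapreduce(splits, slice_fruit, collect)
print(output)
```

In a real cluster each split would be handled by a separate mapper and the shuffle would move data between nodes; the grouping logic is the same.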
[Figure: the same MapReduce data flow, with the shuffle step shown as shuffle + parallel sort (remote read)]
Master node: assigns map/reduce jobs, reassigns if nodes fail
Nearest bar
Question: nearest [bar] within distance d?
Input: graph (node, label)
Map: ∀ [icon], search graph with radius d; emit <[node], {[bar], distance}>
Shuffle/Sort: by id
Reduce: <[node], [{[bar], distance}, {[bar], distance}]> -> min()
Output: <[node], [bar]> pairs, i.e. a marked graph
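A small sketch of the nearest-bar job, run in a single process. Several details are assumptions because the slide icons are lost: the map step is taken to run once per bar, the graph is a weighted adjacency dictionary, and the bounded search is Dijkstra's algorithm. The names `map_bar`, `reduce_nearest`, and `nearest_bar` are hypothetical.

```python
import heapq
from collections import defaultdict


def map_bar(bar, graph, radius):
    """Map: search the graph around one bar, emit (node, (bar, distance)) within `radius`."""
    best = {bar: 0.0}
    queue = [(0.0, bar)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        yield node, (bar, dist)
        for neighbour, weight in graph[node].items():
            new_dist = dist + weight
            if new_dist <= radius and new_dist < best.get(neighbour, float("inf")):
                best[neighbour] = new_dist
                heapq.heappush(queue, (new_dist, neighbour))


def reduce_nearest(node, candidates):
    """Reduce: of all (bar, distance) candidates for a node, keep the closest."""
    return min(candidates, key=lambda pair: pair[1])


def nearest_bar(graph, labels, radius):
    groups = defaultdict(list)  # shuffle: group emitted pairs by node id
    for node, label in labels.items():
        if label == "bar":
            for key, value in map_bar(node, graph, radius):
                groups[key].append(value)
    return {node: reduce_nearest(node, cands) for node, cands in groups.items()}


# Toy street graph: A - B - C - D - E, with bars at A and D
graph = {
    "A": {"B": 1},
    "B": {"A": 1, "C": 2},
    "C": {"B": 2, "D": 1},
    "D": {"C": 1, "E": 5},
    "E": {"D": 5},
}
labels = {"A": "bar", "B": "home", "C": "home", "D": "bar", "E": "home"}
marked = nearest_bar(graph, labels, radius=3)
print(marked)
```

Nodes with no bar within the radius (here E) simply do not appear in the output.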
Sensor data
Vibration
Strain (longitudinal)
Strain (transverse)
Temperature
NOISE
MATHS
Convolution
[Figure: a signal, a kernel, and their convolution]
… multiplied by 1.5 and delayed by 14 sample intervals.

Evidently, we have just described in words the following definition of discrete convolution with a response function of finite duration M:

$$(r * s)_j \equiv \sum_{k=-M/2+1}^{M/2} s_{j-k}\, r_k \qquad (13.1.1)$$

If a discrete response function is nonzero only in some range $-M/2 < k \le M/2$, where M is a sufficiently large even integer, then the response function is called a finite impulse response (FIR), and its duration is M. (Notice that we are defining M as the number of nonzero values of $r_k$; these values span a time interval of $M-1$ sampling times.) In most practical circumstances the case of finite M is the case of interest, either because the response really has a finite duration, or because we choose to truncate it at some point and approximate it by a finite-duration response function.

The discrete convolution theorem is this: If a signal $s_j$ is periodic with period N, so that it is completely determined by the N values $s_0, \ldots, s_{N-1}$, then its discrete convolution with a response function of finite duration N is a member of the discrete Fourier transform pair, …

$$g * h \equiv \int_{-\infty}^{\infty} g(\tau)\, h(t-\tau)\, d\tau \qquad (12…)$$

Note that $g * h$ is a function in the time domain and that $g * h = h * g$. It turns out that the function $g * h$ is one member of a simple transform pair,

$$g * h \Longleftrightarrow G(f)\, H(f) \qquad \text{convolution theorem} \qquad (12.0…)$$

In other words, the Fourier transform of the convolution is just the product of the individual Fourier transforms.

The correlation of two functions, denoted $\mathrm{Corr}(g, h)$, is defined by

$$\mathrm{Corr}(g, h) \equiv \int_{-\infty}^{\infty} g(\tau + t)\, h(\tau)\, d\tau \qquad (12.0…)$$

The correlation is a function of t, which is called the lag. It therefore lies in the time domain, and it turns out to be one member of the transform pair:

$$\mathrm{Corr}(g, h) \Longleftrightarrow G(f)\, H^*(f) \qquad \text{correlation theorem} \qquad (12.0…)$$
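The discrete convolution and the convolution theorem can be checked numerically. The sketch below uses only the standard library and makes two simplifying assumptions that differ from the excerpt's notation: the response is indexed from 0 (a causal kernel) rather than from -M/2+1, and the theorem is verified for the periodic (circular) case with a naive DFT. All function names are hypothetical.

```python
import cmath


def convolve(signal, kernel):
    """Full linear convolution: out[j] = sum over k of signal[j - k] * kernel[k]."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for j, s in enumerate(signal):
        for k, r in enumerate(kernel):
            out[j + k] += s * r
    return out


def circular_convolve(s, r):
    """Convolution of two period-N sequences (indices wrap around)."""
    n = len(s)
    return [sum(s[(j - k) % n] * r[k] for k in range(n)) for j in range(n)]


def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


signal = [0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
response = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Transform of the convolution vs. product of the individual transforms
lhs = dft(circular_convolve(signal, response))
rhs = [a * b for a, b in zip(dft(signal), dft(response))]
max_error = max(abs(a - b) for a, b in zip(lhs, rhs))
print("largest deviation between both sides:", max_error)
```

With a fast Fourier transform in place of the naive `dft`, the right-hand side is the cheaper way to convolve long signals.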
COMPLEXITY
MATHS, AGAIN
Scale space decomposition
[Figure: a signal convolved with kernel 1 and kernel 2, giving convolution 1 and convolution 2]
SCALE-SPACE DECOMPOSITION
Baseline (σ64)
Traffic Jams (σ16 - σ64)
Slowdown (σ4 - σ16)
Vehicles (σ0 - σ4)
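One plausible reading of the bands above is a difference of Gaussian smoothings: smooth the signal at σ = 4, 16, 64 and subtract neighbouring levels. The sketch below follows that reading, which is an assumption rather than the deck's confirmed method; the kernel truncation at about 3σ, the edge handling (repeating boundary values), and the synthetic signal are also illustrative choices.

```python
import math


def gaussian_kernel(sigma):
    """Normalised Gaussian weights, truncated at roughly 3 sigma."""
    radius = int(3 * sigma) + 1
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Gaussian smoothing; sigma 0 returns the signal unchanged."""
    if sigma == 0:
        return list(signal)
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for j in range(n):
        acc = 0.0
        for i, w in enumerate(kernel):
            idx = min(max(j + i - radius, 0), n - 1)  # repeat edge values
            acc += w * signal[idx]
        out.append(acc)
    return out


def scale_space_bands(signal, sigmas=(0, 4, 16, 64)):
    """Split a signal into detail bands plus a coarse baseline."""
    smoothed = [smooth(signal, s) for s in sigmas]
    bands = {}
    for i in range(len(sigmas) - 1):
        name = f"σ{sigmas[i]} - σ{sigmas[i + 1]}"
        bands[name] = [a - b for a, b in zip(smoothed[i], smoothed[i + 1])]
    bands[f"σ{sigmas[-1]}"] = smoothed[-1]  # baseline
    return bands


# Synthetic stand-in for a sensor trace: slow wave plus short spikes
signal = [math.sin(t / 30) + (1.0 if t % 50 == 0 else 0.0) for t in range(200)]
bands = scale_space_bands(signal)

# The bands telescope: adding them all back recovers the original signal
reconstructed = [sum(values) for values in zip(*bands.values())]
max_error = max(abs(a - b) for a, b in zip(signal, reconstructed))
print(list(bands), max_error)
```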
VOLUME
MapReduce
145 sensors
100 Hz
5 GB/day
2 TB/year
50 MB/s disk I/O
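The numbers above fit together, and they show why a single disk is the bottleneck. A quick back-of-the-envelope check, assuming decimal units (1 GB = 10^9 bytes) and one sequential pass over a year of data on one disk:

```python
SENSORS = 145
SAMPLE_RATE_HZ = 100
BYTES_PER_DAY = 5e9        # 5 GB/day
BYTES_PER_YEAR = 2e12      # 2 TB/year
DISK_BYTES_PER_S = 50e6    # 50 MB/s disk I/O

SECONDS_PER_DAY = 86_400

# How many readings arrive per day, and how large is each one on average?
samples_per_day = SENSORS * SAMPLE_RATE_HZ * SECONDS_PER_DAY
bytes_per_sample = BYTES_PER_DAY / samples_per_day

# Daily volume extrapolated to a year, in TB
year_from_daily_tb = BYTES_PER_DAY * 365 / 1e12

# Time for one disk to read a full year of data once
hours_to_scan_year = BYTES_PER_YEAR / DISK_BYTES_PER_S / 3600

print(f"{samples_per_day:,} samples/day, about {bytes_per_sample:.1f} bytes each")
print(f"{year_from_daily_tb:.2f} TB/year from the daily rate")
print(f"{hours_to_scan_year:.1f} hours to scan one year on a single disk")
```

Spreading the same scan over many worker nodes, each reading its own local split, divides that time by roughly the number of disks.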
[Figure: parallel MapReduce pipeline with stages labelled Map, Build windows, Shuffle, Reduce, Convolute]
CONVOLUTION
CONVOLUTE-ADD
Map (convolute with 0-padding)
Reduce (add)
Add values in overlapping regions
[Figure: zero-padded windows A, B, C; after adding the overlaps the output regions are A, A+B, B, B+C, C]
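The convolute-add scheme is the classic overlap-add method: each window is convolved on its own, and the tails that spill into the neighbouring windows are summed. A minimal single-process sketch follows; the function names and the window size are hypothetical, and the full convolution of a window plays the role of "convolute with 0-padding".

```python
def convolve(signal, kernel):
    """Full linear convolution; the output is len(signal) + len(kernel) - 1 long."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for j, s in enumerate(signal):
        for k, r in enumerate(kernel):
            out[j + k] += s * r
    return out


def map_window(start, window, kernel):
    """Map: convolve one window independently, remembering where it starts."""
    return start, convolve(window, kernel)


def reduce_add(total_length, pieces):
    """Reduce: add the pieces back together; overlapping regions are summed."""
    out = [0.0] * total_length
    for start, values in pieces:
        for offset, value in enumerate(values):
            out[start + offset] += value
    return out


def overlap_add(signal, kernel, window_size):
    pieces = [
        map_window(start, signal[start:start + window_size], kernel)
        for start in range(0, len(signal), window_size)
    ]
    return reduce_add(len(signal) + len(kernel) - 1, pieces)


signal = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
kernel = [1, 0, -1]

windowed = overlap_add(signal, kernel, window_size=4)
direct = convolve(signal, kernel)
max_error = max(abs(a - b) for a, b in zip(windowed, direct))
print("windowed result matches direct convolution:", max_error == 0)
```

Because convolution is linear, the windows can be processed on different nodes in any order and still add up to the same result.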
SEGMENTATION
• You don’t need 100 Hz data for everything
• Approximate signal with linear segments
• Key points: 0-crossings of 1st, 2nd, 3rd derivative
• Maths: derivative of smoothed signal = convolution with derivative of kernel
[Figure: signal, convolution, and segmentation, using 1st, 2nd, 3rd degree derivatives]
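The identity in the last bullet can be verified in discrete form, where the derivative becomes a first difference: differencing the smoothed signal gives the same result as convolving once with the differenced kernel. The sketch below also picks the zero-crossings of that first derivative as key points. The sine test signal, σ = 2, and all function names are illustrative assumptions; only the first derivative is shown.

```python
import math


def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def gaussian_kernel(sigma):
    radius = int(3 * sigma) + 1
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def zero_crossings(values):
    """Indices where consecutive values change sign."""
    return [i for i in range(1, len(values)) if values[i - 1] * values[i] < 0]


DIFF = [1.0, -1.0]  # first difference, the discrete stand-in for d/dt

signal = [math.sin(2 * math.pi * t / 80 + 0.3) for t in range(200)]
kernel = gaussian_kernel(2.0)

# Route 1: smooth first, then differentiate
smooth_then_diff = convolve(convolve(signal, kernel), DIFF)

# Route 2: differentiate the kernel once, then do a single convolution
derivative_kernel = convolve(kernel, DIFF)
direct = convolve(signal, derivative_kernel)

max_error = max(abs(a - b) for a, b in zip(smooth_then_diff, direct))

# Ignore the edges, where the zero padding distorts the result
valid = direct[len(derivative_kernel) - 1:len(signal)]
key_points = zero_crossings(valid)
print("max difference between routes:", max_error)
print("extrema found in the valid region:", len(key_points))
```

Route 2 matters at scale: the derivative kernel is computed once, so each window needs only one convolution pass.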
SEGMENTATION RESULT
VISUALIZATION
Twitter data
TRACKING NEWS STORIES
Geospatial data
OPEN SOURCE TOOLS
R: modelling, testing, prototyping
lubridate, zoo: dates, time series
reshape2: reshape data
ggplot2: visualize data
RCurl, RJSONIO: find more data
Hmisc: miscellaneous
DMwR, mlr: machine learning
forecast: time series forecasting
garch: time series modelling
quantmod: statistical financial trading
xts: extensible time series
igraph: study networks
maptools: read and view maps
PYTHON: scientific computing
numpy: linear algebra
scipy: optimization, signal/image processing, …
scikits: toolkits for scipy
scikit-learn: machine learning toolkit
statsmodels: advanced statistical modelling
matplotlib: plotting
NLTK: natural language processing
PyBrain: more machine learning
PyMC: Bayesian inference
Pattern: Web mining
NetworkX: study networks
Pandas: easy-to-use data structures
OTHER
CouchDB
D3.js
@joavanschoren
joaquin.vanschoren@gmail.com
