Modern Data Science
Alejandro Correa Bahnsen
June 2016
@albahnsen
1
Who am I?
Data Scientist
PhD in Machine Learning
Interested in Big Data Engineering
Passionate about open-source
Scikit-Learn contributor :)
Organizer of the Bogota Big Data Science Meetup
2
Who I've worked with
3
Where I work
Lead Data Scientist working on applying
Machine Learning for Security Informatics
4
Aims of this talk
Discuss what a Modern Data Scientist is
(and what it is not)
5
6
It's 2016 and there is still no
unique definition of Data
Science
7
8
“A data scientist is a statistician
who lives in San Francisco.”
“Data Science is statistics on a
Mac.”
9
Data Science is like teenage sex:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it...
10
Even worse, people use
several words interchangeably
11
12
13
14
15
Let's focus only on modern
data science
16
So what is Data
Science?
17
Data Science
18
Data Science is the intersection of
Hacking Skills, Math & Statistics
Knowledge and Substantive Expertise
Those are the pillars of data science: computing,
statistics, mathematics and quantitative disciplines
combined to analyze data for better decision making
19
Hacking Skills
Ability to build things and find clever solutions to
problems.
Programming/Coding: Python and R (and others)
Databases: MySQL, PostgreSQL, Cassandra,
MongoDB and CouchDB.
Visualization: D3, Tableau, Qlikview and Markdown.
Big Data: Hadoop, MapReduce and Spark.
20
Hacking Skills
21
Hacking Skills
http://www.kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html
22
Hacking Skills
http://www.kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html
23
Math & Statistics
Being able to understand the right solution to each
problem
Linear algebra: Matrix manipulation
Machine Learning: Random Forests, SVM, Boosting
Descriptive statistics: Describe, Cluster
Statistical inference: Generate new knowledge
24
Math & Statistics
25
Substantive Expertise
Asking good questions requires domain understanding,
which is why a data scientist cannot build data-driven
solutions without solid industry knowledge
Is this A or B or C? (classification)
Is this weird? (anomaly detection).
How much/how many? (regression).
How is it organized? (clustering).
What should I do next? (reinforcement learning)
26
How did we get here
27
Data Science
Examples
28
Netflix Prize
29
Google Flu Trends
30
Creating a Rembrandt
31
Obama campaign
32
Moneyball
33
AlphaGo
34
My recent
experience
35
Phishing Detection
36
Malware Identification
37
Man-in-the-Browser Attacks
38
Intrusion Detection
39
Fraud Detection
40
Fraud Detection
Estimate the probability that a transaction is fraudulent,
based on customer patterns and recent fraudulent
behavior
Issues when constructing a fraud detection system:
Class Imbalance
Cost-sensitivity
Short response time of the system
Dimensionality of the search space
Feature preprocessing
Model selection
41
Fraud Detection
42
Class Imbalance
Fraudulent transactions represent between 0.01% and
0.5% of all transactions
Create a balanced dataset using:
Under sampling
Over sampling
TomekLinks sampling
Condensed Nearest Neighbor
NearMiss
Synthetic Minority Over-sampling (SMOTE)
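As a rough illustration of the simplest option in the list above, random under-sampling keeps every fraud and only a random subset of the legitimate transactions. The sketch below uses plain NumPy; the function name `random_undersample`, the `ratio` parameter and the toy data are assumptions made for this example, not part of the talk.

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, seed=0):
    """Keep all fraud rows (y == 1) and a random subset of legitimate rows (y == 0).

    ratio is the desired number of legitimate rows per fraud row (hypothetical parameter).
    """
    rng = np.random.default_rng(seed)
    fraud_idx = np.flatnonzero(y == 1)
    legit_idx = np.flatnonzero(y == 0)
    # Never ask for more legitimate rows than actually exist.
    n_keep = min(len(legit_idx), int(round(ratio * len(fraud_idx))))
    kept_legit = rng.choice(legit_idx, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([fraud_idx, kept_legit]))
    return X[idx], y[idx]

# Toy data: 1,000 transactions, 5 of them fraudulent (0.5%).
y = np.zeros(1000, dtype=int)
y[:5] = 1
X = np.arange(1000, dtype=float).reshape(-1, 1)

X_bal, y_bal = random_undersample(X, y)
print(y_bal.sum(), len(y_bal))  # prints: 5 10
```

Under-sampling throws away most of the legitimate examples, which is one reason the other techniques in the list exist.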
43
Class Imbalance
Synthetic Minority Over-sampling Technique
SMOTE
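The core idea of SMOTE can be sketched in a few lines: each synthetic minority example is placed at a random point on the segment between a real minority example and one of its nearest minority neighbors. This is a simplified illustration in NumPy, not a reference implementation; the function name `smote_sketch`, the default `k=3` and the toy points are assumptions for this example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a random minority
    row and one of its k nearest minority neighbors (requires k < len(X_min))."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Pick a base row and one of its neighbors for every synthetic row.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    # Interpolate with a random weight in [0, 1).
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority class: five fraud examples inside the unit square.
X_fraud = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_sketch(X_fraud, n_new=20)
print(synthetic.shape)  # prints: (20, 2)
```

Because every synthetic row is a convex combination of two real rows, the new points stay inside the region already covered by the minority class.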
44
Cost-Sensitivity
Typical evaluation of a classification model:
                      Actual Fraud            Actual Legitimate
Predicted Fraud       True Positives (TP)     False Positives (FP)
Predicted Legitimate  False Negatives (FN)    True Negatives (TN)

$$ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} $$
$$ F_1\ \text{Score} = \frac{2\,TP}{2\,TP + FP + FN} $$
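A small worked example shows why these measures can mislead on imbalanced data. It uses the standard definitions, accuracy = (TP + TN) / total and F1 = 2TP / (2TP + FP + FN); the counts are made up for illustration (10,000 transactions with 0.5% fraud) and the "model" simply predicts every transaction as legitimate.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all transactions that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical counts: 50 frauds among 10,000 transactions,
# and a classifier that never predicts fraud.
tp, fp, tn, fn = 0, 0, 9950, 50

acc = accuracy(tp, fp, tn, fn)
f1 = f1_score(tp, fp, fn)
print(acc)  # 0.995
print(f1)   # 0.0
```

The accuracy looks excellent even though not a single fraud is caught, while F1 drops to zero.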
45
Cost-Sensitivity
These measures assume that false positives and false
negatives carry the same financial cost!
Not the case in fraud detection:
False positives: flagging a legitimate transaction as
fraudulent incurs an administrative cost
False negatives: when a fraud goes undetected, the
amount of that transaction is lost
46
Cost-Sensitivity
Cost Matrix
                      Actual Fraud            Actual Legitimate
Predicted Fraud       C_{TP_i} = C_a          C_{FP_i} = C_a
Predicted Legitimate  C_{FN_i} = AMT_i        C_{TN_i} = 0

$$ Cost(f(S)) = \sum_{i=1}^{N} \left[ y_i (1 - c_i)\, AMT_i + c_i\, C_a \right] $$
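The cost formula can be evaluated directly. In the sketch below, y_i is read as the true label (1 = fraud), c_i as the predicted label, AMT_i as the transaction amount and C_a as the administrative cost of reviewing an alert; this reading of the symbols, the function name `total_cost` and all numbers are assumptions for illustration.

```python
import numpy as np

def total_cost(y_true, y_pred, amounts, admin_cost):
    """Sum over transactions of y_i * (1 - c_i) * AMT_i + c_i * C_a."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    amounts = np.asarray(amounts, dtype=float)
    # Missed frauds lose the transaction amount; every alert costs admin_cost.
    per_transaction = y_true * (1 - y_pred) * amounts + y_pred * admin_cost
    return float(per_transaction.sum())

# Four hypothetical transactions: one TP, one FN, one FP, one TN.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
amounts = [100.0, 250.0, 40.0, 60.0]

cost = total_cost(y_true, y_pred, amounts, admin_cost=5.0)
print(cost)  # 5 (TP) + 250 (FN) + 5 (FP) + 0 (TN) = 260.0
```

A single missed fraud dominates the total here, which accuracy and F1 cannot express because they weight every error equally.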
47
Feature Engineering
Raw Features
48
Feature Engineering
Transaction aggregated features
49
Feature Engineering
Periodic Features
50
Feature Engineering
Social Networks Analysis
51
Finally - Some Models
Data
Large European Card Processing company
2012 & 2013 card present transactions
20 Million transactions
40,000 frauds
2 Million Euros in losses in the test set
52
Finally - Some Models
Algorithms
Fuzzy Rules
Neural Networks
Naive Bayes
Random Forests
Random Forests with Cost-Proportionate Sampling
Cost-Sensitive Random Patches Decision Trees
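One common form of cost-proportionate sampling is rejection sampling: each training example is kept with probability equal to its misclassification cost divided by the largest cost, and an ordinary classifier is then trained on the kept examples. Whether the models in this talk used exactly this scheme is an assumption; the function name and the cost values below are hypothetical.

```python
import numpy as np

def cost_proportionate_sample(costs, seed=0):
    """Return the indices of examples kept by rejection sampling,
    where example i is kept with probability costs[i] / max(costs)."""
    rng = np.random.default_rng(seed)
    costs = np.asarray(costs, dtype=float)
    accept_prob = costs / costs.max()
    # rng.random draws from [0, 1), so probability 1 always keeps
    # the example and probability 0 never does.
    return np.flatnonzero(rng.random(len(costs)) < accept_prob)

# Hypothetical per-example costs: a large fraud, small administrative
# costs for legitimate transactions, and one zero-cost example.
costs = [0.0, 500.0, 5.0, 5.0]
kept = cost_proportionate_sample(costs)
print(kept)
```

The kept indices can then be used to subset the training data before fitting, for instance, a random forest.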
53
Finally - Some Models
54
Takeaways
55
How could you learn more?
56
How could you learn more?
57
How could you learn more?
58
Embrace open-source
59
Support open-source
60
Modern
Data
Scientist
The sexiest job of
the 21st century
61
Thank You!
@albahnsen
albahnsen.com
62
