Data Science in 10 steps:
A framework for developing Data science applications
2018 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
sri@quantuniversity.com
www.analyticscertificate.com
2
About us:
• Data Science, Quant Finance and
Machine Learning Startup
• Technologies using MATLAB, Python
and R
• Programs
▫ Analytics Certificate Program
▫ Fintech programs
• Platform
• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers.
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Teaches Analytics in the Babson College MBA
program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO
3
4
Slides to be shared on https://researchhub.qusandbox.com
5
6
7
8
9
What’s driving data science?
• Smart Algorithms
• Hardware
• Data
10
The rise of Big Data and Data Science
Image Source: http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
11
Smarter Algorithms
Parallel and Distributed Computing Frameworks
Deep Learning Frameworks
1. Our labeled datasets were thousands of times too
small.
2. Our computers were millions of times too slow.
3. We initialized the weights in a stupid way.
4. We used the wrong type of non-linearity.
- Geoff Hinton
“Capital One was able to determine fraudulent credit
card applications in 100 milliseconds”*
* http://go.databricks.com/hubfs/pdfs/Databricks-for-FinTech-170306.pdf
12
Hardware
13
Typical data science workflows
• Data cleansing
• Feature engineering
• Training and testing
• Model building
• Model selection
• Model deployment
14
The reality of working on data science problems
• Data cleansing
• Feature engineering
• Training and testing
• Model building
• Model selection
• Model deployment
15
16
17
1. Articulate your business problem
Data science in 10 steps
18
2. The Data questions
1. Do you know what data you need?
2. Do you know if the data is available?
3. Do you have the data?
4. Do you have the right data?
5. Will you continue to have the data?
19
3. Develop a data acquisition and data prep strategy
1. Do you know how to get the data?
2. Who gets the data?
3. How do you process it?
4. How do you access it?
5. How do you version and govern the data?
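One way to approach the versioning question above is to fingerprint each dataset snapshot by its content and record a new version only when the content changes. The sketch below is a minimal illustration under stated assumptions, not the presenter's method: the in-memory `registry`, the record fields, and the sample rows are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest of the rows, independent of row order."""
    # Serialize each row deterministically, then sort so row order does not matter.
    serialized = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256()
    for line in serialized:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def register_version(registry, name, rows, source):
    """Append a version record for `name` only if the data actually changed."""
    fingerprint = dataset_fingerprint(rows)
    history = registry.setdefault(name, [])
    if history and history[-1]["fingerprint"] == fingerprint:
        return history[-1]  # unchanged data: reuse the latest version
    record = {
        "version": len(history) + 1,
        "fingerprint": fingerprint,
        "source": source,  # who or what supplied the data (governance)
        "row_count": len(rows),
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    history.append(record)
    return record


# Hypothetical sample data, for illustration only.
registry = {}
rows_v1 = [{"ticker": "ABC", "close": 10.0}, {"ticker": "XYZ", "close": 20.5}]
rows_v2 = rows_v1 + [{"ticker": "DEF", "close": 7.25}]

first = register_version(registry, "prices", rows_v1, source="vendor_feed")
same = register_version(registry, "prices", list(reversed(rows_v1)), source="vendor_feed")
second = register_version(registry, "prices", rows_v2, source="vendor_feed")
print(first["version"], same["version"], second["version"])  # prints: 1 1 2
```

In practice the registry would live in a database or a dedicated data-versioning tool; the point here is only that a content hash gives a cheap, reproducible answer to "which data did this model see?"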
20
4. Explore and evaluate your data and get it in the right format
21
5. Define your goal:
1. Summarization
2. Fact finding
3. Understanding relationships
4. Prediction
22
6. Shortlist (not “Choose”) the techniques/methodologies/algorithms
23
7. Evaluate/establish business constraints and narrow down your
choices of techniques/methodologies/algorithms
1. Cloud/Cost/Expertise/Cost-Value
2. Build/buy/access
Outcomes
Time
Quality
Cost
24
8. Establish criteria to know if the methodology/models/algorithms work
1. Is the process replicable?
2. What performance metrics do we choose?
3. Can you evaluate the performance and validate if the models meet
the criteria?
4. Does it provide business value?
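The questions above about choosing metrics and validating models against criteria can be made concrete with a small check. The sketch below assumes a binary classification problem; the labels, predictions, and minimum thresholds are made-up placeholders, and the choice of precision and recall as acceptance criteria is only one possibility.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def failed_criteria(metrics, criteria):
    """Return the names of metrics that fall below their agreed minimum."""
    return [name for name, minimum in criteria.items() if metrics[name] < minimum]


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
# Hypothetical acceptance criteria agreed with the business up front.
criteria = {"precision": 0.70, "recall": 0.80}
print(metrics)
print("Failed criteria:", failed_criteria(metrics, criteria))
```

Fixing the criteria before looking at results keeps the evaluation replicable: the same script, data, and thresholds should give the same pass/fail answer every time.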
25
9. Fine tune your algorithms and algorithm selection
1. Hyper parameter tuning
2. Bias-variance tradeoff
3. Handling imbalanced class problems
4. Ensemble techniques
5. AutoML
https://support.sas.com/resources/papers/proceedings17/SAS0514-2017.pdf
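As an illustration of item 3 above (handling imbalanced class problems), the sketch below shows random oversampling: minority-class rows are duplicated at random until every class is as large as the biggest one. This is only one of several common remedies (class weights and undersampling are others), and the tiny dataset is made up.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are the same size."""
    rng = random.Random(seed)  # seeded so the resampling is replicable
    rows_by_class = {}
    for row, label in zip(features, labels):
        rows_by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in rows_by_class.values())

    balanced_features, balanced_labels = [], []
    for label, rows in rows_by_class.items():
        # Sample with replacement to fill the gap up to the largest class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            balanced_features.append(row)
            balanced_labels.append(label)
    return balanced_features, balanced_labels


# Made-up training split: 6 examples of class 0, 2 examples of class 1.
train_x = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.2]]
train_y = [0, 0, 0, 0, 0, 0, 1, 1]

balanced_x, balanced_y = random_oversample(train_x, train_y, seed=42)
print(Counter(train_y), Counter(balanced_y))
```

Oversampling should be applied to the training split only; resampling before the train/test split leaks duplicated rows into the test set and inflates the measured performance.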
26
10. How will this process reach decision makers?
1. Deployment choices (On-prem/Cloud)
2. Frequency of data/model updates
3. Governance/Role/Responsibilities
4. Speed, Scale, Availability, Disaster recovery, Rollback, Pull-Plug
27
How do you monitor the efficacy of your solution?
1. Retuning
2. Monitoring
3. Model decay
4. Data augmentation
5. Newer innovations
Data science in 10 steps - Bonus
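A minimal sketch of how monitoring for model decay might look: keep a rolling window of recent prediction outcomes and flag the model for retuning when accuracy drops well below the level measured at deployment. The class name `DecayMonitor`, the window size, and the tolerance are hypothetical choices for illustration.

```python
from collections import deque


class DecayMonitor:
    """Track rolling accuracy and flag when it falls below a baseline."""

    def __init__(self, baseline_accuracy, window=50, tolerance=0.05):
        self.baseline = baseline_accuracy  # accuracy measured at deployment
        self.tolerance = tolerance         # acceptable drop before alerting
        self.outcomes = deque(maxlen=window)  # True/False per recent prediction

    def record(self, prediction, actual):
        """Store whether the latest prediction matched the observed outcome."""
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retuning(self):
        """Flag decay only once the window is full, to avoid noisy early alerts."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance


monitor = DecayMonitor(baseline_accuracy=0.90, window=10, tolerance=0.05)
for _ in range(10):
    monitor.record(prediction=1, actual=1)  # ten correct predictions
healthy = monitor.needs_retuning()

for _ in range(3):
    monitor.record(prediction=1, actual=0)  # three misses push out old hits
decayed = monitor.needs_retuning()
print(healthy, monitor.rolling_accuracy(), decayed)
```

This only works when true outcomes eventually become available; where labels arrive late or never, monitoring usually falls back to tracking shifts in the input data instead.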
28
29
The drivers in the markets are changing!
30
Market impact at the speed of light!
31
The Veracity of Information also affects markets
"The goal of the securities law is to provide the capital markets with accurate
information, and people's motivation are really beside the point,"
- Prof. Jill Fisch, University of Pennsylvania Law School
32
And sentiment drives markets
33
34
Goal
• Understanding sentiment in earnings call transcripts
35
Challenges
• Interpreting emotions
• Labeling data
36
What is NLP?
• AI
• Linguistics
• Computer Science
37
Sample applications
• Q/A
• Dialog systems - Chatbots
• Topic summarization
• Sentiment analysis
• Classification
• Keyword extraction - Search
• Information extraction – Prices, Dates, People, etc.
• Tone Analysis
• Machine Translation
• Document comparison – Similar/Dissimilar
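To make one of these applications concrete, the sketch below shows a very simple form of information extraction: pulling dollar amounts and long-form dates out of text with regular expressions. The patterns and the sample sentence are illustrative assumptions; production systems typically rely on trained named-entity recognizers rather than hand-written patterns.

```python
import re

# Dollar amounts with comma-grouped thousands, such as "$980" or "$1,250.50".
# Ungrouped amounts like "$12345" are deliberately out of scope for this sketch.
PRICE_PATTERN = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d+)?")

# Long-form US dates such as "March 31, 2018".
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_PATTERN = re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b")


def extract_entities(text):
    """Return the prices and dates found in the text, in order of appearance."""
    return {
        "prices": PRICE_PATTERN.findall(text),
        "dates": DATE_PATTERN.findall(text),
    }


# Made-up sentence in the style of an earnings call.
sample = (
    "Revenue rose to $1,250.50 million for the quarter ended March 31, 2018, "
    "up from $980 million a year earlier."
)
entities = extract_entities(sample)
print(entities)
```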
38
NLP in Finance
39
Language allows understanding
• If computers can understand language, it opens up huge possibilities
▫ Read and summarize
▫ Translate
▫ Describe what’s happening
▫ Understand commands
▫ Answer questions
▫ Respond in plain language
40
Traditional language AI
• Describe rules of grammar
• Describe meanings of words and their relationships
• ...including all the special cases
• ...and idioms
• ...and special cases for the idioms
• ...
• ...understand language!
https://en.wikipedia.org/wiki/Formal_language
41
What is NLP?
Jumping NLP Curves
https://ieeexplore.ieee.org/document/6786458/
42
Q: What’s hard about writing programs to understand text?
43
Language is hard to deal with
• Ambiguity:
▫ “ground”
▫ “jaguar”
▫ “The car hit the pole while it was moving”
▫ “One morning I shot an elephant in my pajamas. How he got into my pajamas, I’ll never know.”
▫ “The tank is full of soldiers.” “The tank is full of nitrogen.”
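The two "tank" sentences show why a program cannot assign a meaning to a word in isolation. A toy sketch of one classic idea, choosing the sense whose cue words overlap most with the surrounding context (a heavily simplified variant of the Lesk approach), is given below. The sense inventory is hand-written and hypothetical; real systems use large lexical resources or learned representations.

```python
import re

# Hypothetical, hand-written sense inventory, for illustration only.
SENSES = {
    "tank": {
        "military vehicle": {"soldiers", "army", "armored", "battle", "crew"},
        "storage container": {"nitrogen", "water", "fuel", "gas", "liquid"},
    }
}


def disambiguate(word, sentence, senses=SENSES):
    """Pick the sense whose cue words overlap most with the sentence context."""
    context = set(re.findall(r"[a-z]+", sentence.lower())) - {word}
    scores = {
        sense: len(context & cues) for sense, cues in senses[word].items()
    }
    best = max(scores, key=scores.get)
    # With no overlapping cue word there is no evidence, so return None.
    return best if scores[best] > 0 else None


print(disambiguate("tank", "The tank is full of soldiers."))
print(disambiguate("tank", "The tank is full of nitrogen."))
print(disambiguate("tank", "The tank is empty."))
```

The third call returns no sense at all, which illustrates the limits of rule-based approaches: every missing cue word is another special case to add by hand.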
44
45
Language is hard to deal with
• Many ways to say the same thing
▫ “the same thing can be said in many ways”
▫ “language is versatile”
▫ “The same words can be arranged in many different ways to express the same idea”
▫ …
46
Options?
• APIs
• Human Insight
• Expert Knowledge
• Build your own
47
NLP pipeline
Stage 1: Data ingestion from EDGAR
Stage 2: Pre-processing
Stage 3: Invoking APIs to label data
Stage 4: Compare APIs
Stage 5: Build a new model for sentiment analysis
• Amazon Comprehend API
• Google API
• Watson API
• Azure API
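A sketch of what Stage 4 (Compare APIs) might involve once each service has labeled the same sentences: measure how often pairs of services agree, and take a majority vote as a candidate training label. The names `api_a`, `api_b`, and `api_c` and their labels are made-up placeholders, not real outputs from Amazon Comprehend, Google, Watson, or Azure, and no vendor client code is shown.

```python
from collections import Counter
from itertools import combinations


def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two label lists agree."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def pairwise_agreement(labels_by_api):
    """Agreement rate for every pair of APIs, keyed by (name, name)."""
    return {
        (a, b): agreement_rate(labels_by_api[a], labels_by_api[b])
        for a, b in combinations(sorted(labels_by_api), 2)
    }


def majority_labels(labels_by_api):
    """Most common label per item; None when the top labels are tied."""
    result = []
    for votes in zip(*labels_by_api.values()):
        ranked = Counter(votes).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append(None)  # no clear winner: send to a human reviewer
        else:
            result.append(ranked[0][0])
    return result


# Placeholder labels for four sentences, for illustration only.
labels_by_api = {
    "api_a": ["positive", "negative", "neutral", "positive"],
    "api_b": ["positive", "negative", "positive", "neutral"],
    "api_c": ["positive", "neutral", "neutral", "negative"],
}

agreement = pairwise_agreement(labels_by_api)
consensus = majority_labels(labels_by_api)
print(agreement)
print(consensus)
```

Items without a clear majority are natural candidates for the "Human Insight" and "Expert Knowledge" options listed earlier in the deck.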
48
Step 1: Set up projects for each stage on the QuSandbox
Code Data
Environment Process
49
Compare and evaluate results
50
Build a model with MATLAB using the MATLAB IDE on
QuSandbox
51
Review results through the QuResearchHub
52
Creating replicable environments
Create and manage replicable environments (code + software + data) in a single portal
53
Test, Iterate, Snapshot experiments
54
Set up your Quant Research Pipeline on the Model
Management Studio to enable replication and automation
55
Manage tasks and errors
56
Create process/replicability documentation for each stage
57
Share results & API/App endpoints through the
QuResearchHub
58
NLP pipeline
Stage 1: Data ingestion from EDGAR
Stage 2: Pre-processing
Stage 3: Invoking APIs to label data
Stage 4: Compare APIs
Stage 5: Build a new model for sentiment analysis
• Amazon Comprehend API
• Google API
• Watson API
• Azure API
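For Stage 5 (building a new sentiment model), a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing is shown below. It is written in plain Python for illustration; the deck builds its model in MATLAB, and neither the algorithm nor the toy training sentences here come from the original slides.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSentiment:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {
            word for counts in self.word_counts.values() for word in counts
        }
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training sentences, for illustration only.
train_texts = [
    "revenue growth was strong and margins improved",
    "we are pleased with record earnings",
    "weak demand led to a decline in revenue",
    "we are disappointed by the loss this quarter",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesSentiment().fit(train_texts, train_labels)
print(model.predict("strong earnings growth"))
print(model.predict("a decline in demand"))
```

Four sentences are far too few to learn anything useful; in the pipeline above, the labels produced and compared in Stages 3 and 4 would supply the training data.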
59
Register at:
https://researchhub.qusandbox.com
for slides and code
