Data Science in the Elastic Stack
A Data Science Process?
[1]: df = pd.read_csv("data.csv")
[2]: train, test = preprocess(df)
[3]: pipe = Pipeline([('transform', ct), ('lr', LR())])
[4]: pipe.fit(train, y)
[5]: plt.plot(results)
[6]: plt.savefig('results.png')
> mkdir project
> cd project
> mkdir data
> mv ~/Downloads/data/* data
> virtualenv venv
> source venv/bin/activate
> pip install pandas numpy
matplotlib scikit-learn nltk requests bs4 boto3 jupyter
> jupyter notebook
> du -sch data/*
98.4G total
> du -sch data/*
919.8M total
data -> actionable results?
jupyter notebook
import dask
json.dump("export_FINAL.json")
with open("export_FINAL_2.csv")
google: data too big for pandas
google: deploy model to production
data -> results
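The "data too big for pandas" search usually ends in some form of streaming or chunked processing. A minimal sketch using only the standard library, assuming a CSV with a hypothetical `payment_type` column; the function name and sample data are illustrative and not from the talk. Rows are read one at a time, so memory use stays flat regardless of file size:

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_by_column(handle: TextIO, column: str) -> Counter:
    """Count the values of `column`, reading one row at a time."""
    counts: Counter = Counter()
    for row in csv.DictReader(handle):
        counts[row[column]] += 1
    return counts


# Tiny in-memory stand-in for a file that would not fit in RAM.
sample = io.StringIO("payment_type,fare\ncard,12.5\ncash,7.0\ncard,30.1\n")
totals = count_by_column(sample, "payment_type")
print(totals.most_common())
```

For a real file, the same function could be called with `open("data/trips.csv", newline="")` (a hypothetical path) in place of the in-memory sample.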
A Data Science Process
Discover, Ingest, Model, Operationalize
Discover
Goals:
● define objectives - specify key variables and related metrics
● identify relevant data sources
Methods:
● work with customers and stakeholders to understand and identify business problems
● data audit
Outcomes:
● project charter (README.md)
● data source definitions
● data dictionaries
Ingest
Goals:
● produce high-quality datasets with a clear relationship to the target objectives
● provide data to an analytical environment
● develop architecture to keep data fresh and up to date
Methods:
● ad-hoc exploratory data analysis of raw data
● development of a data pipeline that transforms raw data
Artifacts:
● charter updates (README.md)
● architecture
POST _bulk
helpers.bulk(es, data)
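The two calls on this slide are the same operation seen from two sides: `POST _bulk` is the Elasticsearch REST endpoint, and `helpers.bulk(es, data)` is the Python client helper that wraps it. A minimal sketch of the helper side, assuming the `elasticsearch` Python package is installed and a cluster is reachable. The index name `taxi-trips`, the function names, and the sample rows are hypothetical:

```python
from typing import Any, Dict, Iterable, Iterator


def to_actions(rows: Iterable[Dict[str, Any]], index: str) -> Iterator[Dict[str, Any]]:
    """Wrap each row in the action format that helpers.bulk accepts."""
    for row in rows:
        yield {"_index": index, "_source": row}


def index_rows(es: Any, rows: Iterable[Dict[str, Any]], index: str) -> int:
    """Send rows through the bulk helper and return the number indexed."""
    # Imported lazily so the rest of this sketch runs without the client installed.
    from elasticsearch import helpers

    indexed, _errors = helpers.bulk(es, to_actions(rows, index))
    return indexed


rows = [
    {"fare": 12.5, "payment_type": "card"},
    {"fare": 7.0, "payment_type": "cash"},
]
actions = list(to_actions(rows, "taxi-trips"))
print(actions[0])
```

Because `to_actions` is a generator, the rows never need to be held in memory all at once, which matters when the source data is far larger than the sample shown here.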
Model
Goals:
● identification of optimal features for machine learning modeling
● creation of a model that best fits the business objectives and modeling task
● production-ready model!
Methods:
● feature engineering
● model training
Artifacts:
● feature sets
● a standardized way of benchmarking model results
● in our case, job configs :D
Logging and Metrics: Spot an unusual drop in application requests and drill in on the troublesome server contributing to the problem.
Security Analytics: Identify unusual network activity or user behavior to pinpoint attackers before they do damage.
Business Analytics: Get notified if there is an unusual increase in abandoned shopping carts on your ecommerce site.
Application Performance Monitoring: Catch bottlenecks and slow response times so your apps can keep running smoothly.
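The "job configs" mentioned as modeling artifacts are anomaly detection job definitions. A hedged sketch of what one might look like for the Logging and Metrics use case above, built as the JSON body for `PUT _ml/anomaly_detectors/<job_id>`. The field names `host` and `@timestamp`, the 15-minute bucket span, and the function name `build_job_config` are assumptions for illustration, not taken from the talk:

```python
import json
from typing import Any, Dict


def build_job_config(partition_field: str, time_field: str, bucket_span: str = "15m") -> Dict[str, Any]:
    """Build a job body that flags unusually low event counts per partition."""
    return {
        "description": "Unusual drop in application requests, per server",
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [
                {
                    # low_count models the event rate and flags unusual drops.
                    "function": "low_count",
                    # One model per server, so the troublesome one stands out.
                    "partition_field_name": partition_field,
                }
            ],
            "influencers": [partition_field],
        },
        "data_description": {"time_field": time_field},
    }


job = build_job_config(partition_field="host", time_field="@timestamp")
print(json.dumps(job, indent=2))
```

Keeping the job definition as plain data makes it easy to version alongside the project charter and to compare configurations when benchmarking.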
Operationalize
Goals:
● user acceptance
Methods:
● validation
● project hand-off
Outcomes:
● operationally useful KPIs powered by ML
Data Science in the Elastic Stack
Goals & Objectives: identify demand in real time
Success Factors: decrease waiting times by 50%
Features: aggregate rides to and from specific locations
Data: raw taxi trip information
KPIs: average waiting time
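The "aggregate rides to and from specific locations" feature can be sketched as a count of pickups per zone and hour of day. This is an illustration only: the column names `PULocationID` and `tpep_pickup_datetime` are assumed to match recent versions of the TLC yellow-taxi data dictionary, and the sample rows and function name are made up.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def rides_per_zone_hour(trips: Iterable[Dict[str, str]]) -> Counter:
    """Count pickups per (zone id, hour of day)."""
    counts: Counter = Counter()
    for trip in trips:
        picked_up = datetime.strptime(trip["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        counts[(trip["PULocationID"], picked_up.hour)] += 1
    return counts


# Made-up sample rows standing in for raw taxi trip records.
trips = [
    {"PULocationID": "161", "tpep_pickup_datetime": "2018-06-01 08:05:00"},
    {"PULocationID": "161", "tpep_pickup_datetime": "2018-06-01 08:40:00"},
    {"PULocationID": "237", "tpep_pickup_datetime": "2018-06-01 09:10:00"},
]
demand = rides_per_zone_hour(trips)
print(demand[("161", 8)])
```

The same grouping applied to drop-off fields would give the "to" side of the feature.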
http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
TLC Objectives
● identify pockets of demand as they materialize
● identify potential failures of payment systems
● get a sense of volume in the coming weeks
helpers.bulk(es, data)
Thanks
