Farhan Abrol
Product Lead, Pure Storage
An end-to-end Spark-based data stack in the hybrid cloud
#HWCSAIS12
fabrol92@gmail.com
@F_Abrol
www.linkedin.com/in/fabrol
Outline
• Environment overview & problems
• Solutions (hint: Spark)
• More Spark More Problems
• Hybrid Cloud
– Options & Performance comparison
– Should you do it?
– Basics of datacenter
Pure1
● Fleet dashboard for IoT devices
○ Storage arrays
○ VMs
● Real-time log/metric streaming
● 16 TB logs/metrics ingested daily
● Intelligence
○ Proactive scanning for issues
○ Predictive alerting
○ Machine learned forecasting
Logs are king
[Diagram labels: S3; S3 Infrequent Access; FUSE filesystem; ad-hoc analysis by Engineering; continuous or daily ETL; historical grep; machine learning]
Problems
- Speed of running historical greps
- Bottlenecked on single machine throughput
- Resource wastage for ETL machines
- Code/maintenance for new ETL jobs
- Becoming a monolith
- ML training time
- As data grows, taking 8-12 hours
Spark all the things!
- Faster*
- Better resource utilization
- Uniform language and tooling
- Streaming / batch jobs
- One infra to maintain
Grep -> Distributed grep on Spark
rgrep "xyz" --obj-id 100 --start-date=5/13/18 --end-date=5/18/18
[Diagram: the Spark Driver hands one-day slices of the requested date range to four Spark Executors]
05/13/2018 - 05/14/2018
05/14/2018 - 05/15/2018
05/15/2018 - 05/16/2018
05/16/2018 - 05/17/2018
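The slide does not show how rgrep is implemented, so the following is only a minimal sketch of the idea it illustrates: split the requested date range into one-day slices and search each slice independently. It uses a thread pool from the Python standard library as a stand-in for Spark executors. The helper names (split_into_days, grep_slice, distributed_grep) and the in-memory LOGS store are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def split_into_days(start, end):
    """Split the half-open range [start, end) into one-day (from, to) slices."""
    slices = []
    day = start
    while day < end:
        slices.append((day, day + timedelta(days=1)))
        day += timedelta(days=1)
    return slices


def grep_slice(logs_by_day, pattern, day_slice):
    """Return the matching lines for one slice (one worker's share of the job)."""
    day, _ = day_slice
    return [line for line in logs_by_day.get(day, []) if pattern in line]


def distributed_grep(logs_by_day, pattern, start, end, workers=4):
    """Fan the slices out to workers and concatenate results in date order."""
    slices = split_into_days(start, end)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_slice = pool.map(lambda s: grep_slice(logs_by_day, pattern, s), slices)
        return [line for matches in per_slice for line in matches]


# Hypothetical in-memory log store keyed by day; real logs would sit in object
# or file storage and each slice would be read by a separate executor.
LOGS = {
    date(2018, 5, 13): ["boot ok", "xyz fault on ctrl0"],
    date(2018, 5, 15): ["xyz retry", "heartbeat"],
}

matches = distributed_grep(LOGS, "xyz", date(2018, 5, 13), date(2018, 5, 17))
```

Because each slice is searched independently, throughput is no longer bound to a single machine, which is the bottleneck the Problems slide calls out.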
Done!
Problem - AWS Cost trend
Hybrid Cloud
[Diagram labels: EC2 VMs; Direct Connect (dedicated 10G private fiber link); data center with HW; switches; Pure LUN; Pure FS; 500 TB]
Hybrid Cloud - Pricing
Data in = $0/month

Utility                    Price       Usage    Total per month
10G port                   $2.25/hr    720 hr   $1620
Data transfer out of AWS   $0.020/GB   500 TB   $10000
AWS cost                                        $11620
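The arithmetic behind the pricing figures above can be checked with a short sketch. The prices and usage numbers come from the slide; the function name monthly_aws_cost and the conversion of 1 TB = 1000 GB are assumptions made here for illustration.

```python
from decimal import Decimal

PORT_PRICE_PER_HOUR = Decimal("2.25")   # 10G port price, from the slide
EGRESS_PRICE_PER_GB = Decimal("0.020")  # data transfer out of AWS, from the slide
GB_PER_TB = 1000                        # assumption: decimal terabytes


def monthly_aws_cost(hours, egress_tb):
    """Return (port cost, egress cost, total) in dollars; data in is free."""
    port = PORT_PRICE_PER_HOUR * hours
    egress = EGRESS_PRICE_PER_GB * egress_tb * GB_PER_TB
    return port, egress, port + egress


port, egress, total = monthly_aws_cost(hours=720, egress_tb=500)
print(f"port=${port} egress=${egress} total=${total}")
```

Only the outbound volume scales the bill, so varying egress_tb shows how quickly a write-heavy workload would raise the monthly cost.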
Log analysis pipeline - Smoke test
[Diagram labels: Phonehome servers; S3 Infrequent Access; Direct Connect, 30 days of logs; 500 TB; EMR + historical grep + ML]
Aside: Storage Protocols
[Diagram labels: storage system; generic vs. optimized protocols; FlashBlade]
Three setups compared:
- AWS Only: EMR with Amazon S3
- Hybrid with EC2: EMR connected over a 5ms-20ms link to a switch and 500 TB of storage
- Hybrid with Local Compute: switch and 500 TB of storage
144-node Spark cluster
Workload - Distributed grep
~3x-10x better throughput
Hybrid with EC2
[Diagram labels: EMR; 5ms-20ms link; Switch; 500 TB]
Good for
- Read-heavy workloads
- Latency-insensitive workloads
- Low-bandwidth workloads
Performance
Costs
- Link latency
- Cloud networking stack
Hybrid with Local Compute
[Diagram labels: Switch; 500 TB]
Good for
- Read-heavy workloads
- Latency-sensitive workloads
- High-bandwidth workloads
Performance
Costs
Datacenter setup
[Diagram labels: Networking switch; Storage; Compute servers; Software. Prices shown: ~$10k; 32 vCPUs ~$10-20k; Varies]
Conclusion
⎯ Best use cases: workloads with higher read, lower write requirements
⎯ When the write portion of the read/write ratio increases, be cognizant of cumulative AWS transfer costs
⎯ High-performance cloud services can be expensive; on-prem can alleviate this cost
⎯ Unique capabilities of on-prem storage & compute:
    ⎯ Instant snapshots
    ⎯ All kinds of workloads on one platform
    ⎯ Resilience
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Farhan Abrol