An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Farhan Abrol

Farhan Abrol
Product Lead, Pure Storage
An end-to-end Spark-based data stack in the hybrid cloud
#HWCSAIS12
fabrol92@gmail.com
@F_Abrol
www.linkedin.com/in/fabrol

Outline
• Environment overview & problems
• Solutions - Hint: Spark
• More Spark More Problems
• Hybrid Cloud
– Options & performance comparison
– Should you do it?
– Basics of datacenter

Pure1
● Fleet dashboard for IoT devices
○ Storage arrays
○ VMs
● Real-time log/metric streaming
● 16 TB logs/metrics ingested daily
● Intelligence
○ Proactive scanning for issues
○ Predictive alerting
○ Machine learned forecasting

Logs are king
[Architecture diagram; labels: S3, S3 Infrequent Access, FUSE Filesystem, Ad-Hoc analysis by Engineering, Continuous or Daily ETL, Historical Grep, Machine Learning]

Problems
- Speed of running historical greps
  - Bottlenecked on single-machine throughput
- Resource wastage for ETL machines
- Code/maintenance for new ETL jobs
  - Becoming a monolith
- ML training time
  - As data grows, taking 8-12 hours

Spark all the things!
- Faster*
- Better resource utilization
- Uniform language and tooling
- Streaming / batch jobs
- One infra to maintain

Grep -> Distributed grep on Spark
rgrep "xyz" --obj-id 100 --start-date=5/13/18 --end-date=5/18/18
[Diagram: one Spark Driver and four Spark Executors, with date ranges:]
05/13/2018 - 05/14/2018
05/14/2018 - 05/15/2018
05/15/2018 - 05/16/2018
05/16/2018 - 05/17/2018
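
The fan-out on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the talk's actual rgrep implementation: `daily_windows`, `grep_window`, `distributed_grep`, and the in-memory `LOGS` dictionary are hypothetical names, the end date is treated as exclusive, and a thread pool stands in for the Spark executors. On a real cluster the same list of windows could be handed to `SparkContext.parallelize(windows, len(windows))` followed by `flatMap`, so that each executor scans one day.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def daily_windows(start, end):
    """Split [start, end) into consecutive one-day (from, to) windows."""
    windows = []
    day = start
    while day < end:
        windows.append((day, day + timedelta(days=1)))
        day += timedelta(days=1)
    return windows

# Hypothetical stand-in for the log lines stored for each day.
LOGS = {
    date(2018, 5, 13): ["boot ok", "xyz fault on ctrl0"],
    date(2018, 5, 14): ["heartbeat"],
    date(2018, 5, 15): ["xyz retry", "xyz recovered"],
}

def grep_window(pattern, window):
    """Return (day, line) pairs matching the pattern for one window."""
    day, _ = window
    return [(day, line) for line in LOGS.get(day, []) if pattern in line]

def distributed_grep(pattern, start, end, workers=4):
    """Fan the windows out to workers and merge results in date order."""
    windows = daily_windows(start, end)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results stay sorted by day.
        parts = list(pool.map(lambda w: grep_window(pattern, w), windows))
    return [hit for part in parts for hit in part]

matches = distributed_grep("xyz", date(2018, 5, 13), date(2018, 5, 18))
```

With an exclusive end date, 5/13/18 through 5/18/18 yields five one-day windows; the slide shows the first four.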

Problem - AWS Cost trend

Hybrid Cloud
[Diagram: EC2 VMs linked by Direct-Connect (dedicated 10G private fiber link) to a data center with HW; data center labels: Switch, Switch, Pure LUN, Pure FS, 500 TB]

Hybrid Cloud - Pricing
Data in = $0/month
Utility                     Price        Usage     Total per month
10G port                    $2.25/hr     720 hr    $1620
Data transfer out of AWS    $0.020/GB    500 TB    $10000
AWS Cost                                           $11620
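
The arithmetic behind this table can be checked with a few lines of Python. This is a sketch using only the rates shown on the slide; the function name `monthly_aws_cost` is hypothetical, and it assumes decimal terabytes (1 TB = 1000 GB), which is the convention that reproduces the slide's $10000 transfer figure.

```python
from decimal import Decimal

PORT_RATE_PER_HOUR = Decimal("2.25")    # 10G port, from the slide
EGRESS_RATE_PER_GB = Decimal("0.020")   # data transfer out of AWS, from the slide
HOURS_PER_MONTH = 720                   # from the slide
GB_PER_TB = 1000                        # assumption: decimal terabytes

def monthly_aws_cost(tb_out):
    """Monthly AWS-side cost in dollars for tb_out TB transferred out of AWS."""
    port = PORT_RATE_PER_HOUR * HOURS_PER_MONTH
    egress = EGRESS_RATE_PER_GB * tb_out * GB_PER_TB
    return port + egress

slide_total = monthly_aws_cost(500)   # the 500 TB case from the table
```

The port charge is fixed at $1620 per month, while the transfer charge grows linearly with the volume read out of AWS, so the total for 500 TB comes to the slide's $11620.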

Log analysis pipeline - Smoke test
[Diagram; labels: Phonehome servers, S3 Infrequent Access, DirectConnect, 30 days logs, 500 TB, EMR + Historical Grep + ML]

Aside: Storage Protocols
[Diagram; labels: Storage system, Generic, Optimized, FlashBlade]

[Diagram of three configurations:]
- AWS Only: EMR, Amazon S3
- Hybrid with EC2: EMR, 5ms-20ms link, Switch, 500 TB
- Hybrid with Local Compute: Switch, 500 TB

144-node Spark cluster
Workload - Distributed grep
~3x-10x better throughput

Hybrid with EC2
[Diagram; labels: EMR, 5ms-20ms, Switch, 500 TB]
Good for
- Read-heavy workloads
- Latency-insensitive workloads
- Low-bandwidth workloads
Performance
Costs
- Link latency
- Cloud networking stack
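
One way to see why link latency matters more for some workloads than others is a back-of-the-envelope model. The model below is an assumption for illustration, not something presented in the talk: each request pays one link latency and then streams at the full rate of the 10G link, with a single request in flight at a time. The name `effective_throughput` is hypothetical.

```python
LINK_BYTES_PER_SEC = 10e9 / 8   # the dedicated 10G link, in bytes per second

def effective_throughput(request_bytes, latency_sec, link=LINK_BYTES_PER_SEC):
    """Bytes per second for one stream issuing back-to-back requests."""
    transfer_sec = request_bytes / link
    return request_bytes / (latency_sec + transfer_sec)

# Compare small and large requests at both ends of the 5ms-20ms range.
for latency_ms in (5, 20):
    for size in (64 * 1024, 64 * 1024 * 1024):
        mb_per_sec = effective_throughput(size, latency_ms / 1000) / 1e6
        print(f"{latency_ms:>2} ms, {size // 1024:>6} KiB requests: {mb_per_sec:8.1f} MB/s")
```

Under these assumptions small requests are dominated by the link latency, while large sequential reads get much closer to the link rate, which is consistent with the slide recommending this setup for latency-insensitive workloads.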

Hybrid with Local Compute
[Diagram; labels: Switch, 500 TB]
Good for
- Read-heavy workloads
- Latency-sensitive workloads
- High-bandwidth workloads
Performance
Costs

Datacenter setup
Networking switch
Storage
Compute servers
~$10k
32 vCPUs ~$10-20k
Varies
Software

Conclusion
⎯ Best use cases: workloads with higher read, lower write requirements
⎯ When the write portion of the read/write ratio increases, be cognizant of cumulative AWS transfer costs
⎯ High-performance cloud services can be expensive; on-prem can alleviate this cost
⎯ Unique capabilities of on-prem storage & compute:
    ⎯ Instant snapshots
    ⎯ All kinds of workloads on one platform
    ⎯ Resilience
