Achieving Separation of Compute and Storage
in a Cloud World
Yonggang Hu | Chief Architect at Spectrum Computing | IBM
Bin Fan | Founding Engineer and Vice President of Open Source | Alluxio
Attendee Poll
Yonggang Hu
Distinguished Engineer, Chief Architect at Spectrum
Computing, IBM
Email: yhu@ca.ibm.com
Yonggang has been working on distributed computing, grid,
cloud and big data for the past 20 years. Before joining
Spectrum Computing, Yonggang was Vice President and
Application Architect at JPMorgan Chase focusing on
computational analytics and application infrastructure.
Yonggang holds an MS in Computer Science from Peking
University and an MBA from Cornell University.
Today’s Speaker
Bin Fan
VP of Open Source at Alluxio
Email: binfan@alluxio.com
Bin Fan is the founding engineer and VP of Open Source at
Alluxio, Inc. Prior to Alluxio, he worked for Google to build the
next-generation storage infrastructure. Bin received his Ph.D.
in Computer Science from Carnegie Mellon University on the
design and implementation of distributed systems.
Today’s Speaker
Agenda
Why storage-independent compute? And Spectrum Conductor
Alluxio Technology Overview
Real-world Use Cases
[Chart: data volume over time; the gap between Available Data and Understood Data is labeled "Enterprise Amnesia"]
• 80 million wearable health devices will be available by 2017.
• 2.5 quintillion bytes of data generated daily by connected machines.
• There will be 28 times more sensor-enabled devices than people by the year 2020.
• 25 gigabytes of data per hour is generated by a connected car. 90% of cars will be connected by 2020.
• 153 exabytes of healthcare data generated by devices in 2013, increasing to 2,314 exabytes in 2020.
• 1.7 megabytes of data per second generated by every human being on the planet by 2020.
Data & Analytics Challenges
“We have many useless 2 to 10TB Hadoop clusters”
“Our Hadoop cluster is too big to upgrade – still running Spark 1.4”
“Why there is no data locality in my HDFS?”
“Data scientists cannot run analytics in the lake”
“I just want to run ETL. Why I have to copy to lake first?”
“Spark and AI workload are compute intensive”
“Too complex and costly to run and manage”
“Need independent scaling of compute and storage”
“Big batch job taking over the cluster. On-line query is too slow”
Cluster Sprawl – The Elephant in the Room
Apache Spark is revolutionizing big data
Benefits
• 10-100x the performance of MapReduce
• In-memory, batch/streaming ETL, ML
• Simplicity
• Faster application development
• Enables de-coupled compute and storage
The Apache Spark big data processing framework will
account for more than 37% of all big data spending by
2022, according to research by Wikibon.
Last 3 Years
(July 2016 – July 2019)
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
IBM contributions to Apache Spark
IBM Developer / © 2019 IBM Corporation
Total Spark contribution:
• Total: 1195 Jira issues merged; covered areas: SQL, ML, PySpark, Spark core
• 148 commits merged in the last 12 months
• Currently we have 11 contributors with 4 committers
• Total LoC: 90K+ in main source and test code
• More than 65% are major issues, more than 95% are in the areas SQL, ML, PySpark, and Spark core, and more than 50% are improvements and features
In the latest release of Apache Spark (v2.4.0):
• IBM contributed about 100 commits and 6586 LOC in source and test code
• Blog post highlighting the Spark team's contributions in this release: https://developer.ibm.com/blogs/2018/11/29/ibm-continues-commitment-to-apache-spark/
Scaling Spark is challenging
• Inefficient workload management results in poor server utilization rates and throughput.
• Lack of security with open-source frameworks and applications introduces risk.
• Multiple Spark teams each with dedicated servers = wasted capacity and high administrative overhead.
IBM Spectrum Conductor unlocks the power in your infrastructure to build data pipelines at scale
Deliver faster results
• Maximize Spark performance with intelligent workload scheduling.
• 30 to 224% greater throughput than Spark with YARN, 25 to 88% greater than Spark with Mesos*
• Workload aware, dynamic bursting to external clouds
Drive down TCO
• Eliminate cluster sprawl with a multitenant shared infrastructure.
• Get more out of current hardware by increasing utilization and throughput.
• Simultaneously run multiple versions and instances of Spark.
Simplify administration
• Streamline deployment and management
• Integrated monitoring, alerting, reporting, and diagnostics
• Quickly onboard users and Spark instances and versions.
Dedicated Resources
First, optimize the environment
Better performance at lower cost
• Dynamic resource sharing with guaranteed SLAs
• Ability to borrow CPU cores dynamically from other
tenants to meet job needs
• Enterprise-proven at scale:
• 5K hosts, 150K cores, >1B tasks/day
Secure Multitenant Shared Resources
Shared Resources
Dedicated Resource Silos
IBM Cloud
x86
Spectrum Conductor: flexible resource sharing and SLA control
[Screenshot: resource plan listing a hierarchy of consumers (Analytics-clusters, Risk analytics, CECL-prod, Credit-Bond, Model review, Spark-test, Big-Data-Env, R-Analytics, Notebook-Service, TensorFlow, Docker-containers) with per-consumer resource values]
• Unique “Resource Plan” capability to express flexible resource sharing policies among applications (consumers) and multiple resource pools
• Hierarchical structure can reflect organizational structure and application hierarchies
• Notions of time-variant resource ownership, ranking, sharing ratios, lend & borrow limits
• Application/LOB tenants can borrow resources from others (subject to policy), improving performance and maximizing asset utilization
• Introduction of prioritization for Spark and support of FCFS, Priority, and Fairshare policies* (Spark Summit 2018)
• Flexible pre-emption policies enable tenants to quickly reclaim owned resources, ensuring that service-level objectives and business deadlines are met
Maximize utilization while ensuring tenant / LOB SLAs with unique resource sharing model
https://www.slideshare.net/databricks/dynamic-priorities-for-apache-spark-applications-resource-allocations-with-michael-feiman-and-shinnosuke-okada
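The lend, borrow, and reclaim behavior on this slide can be illustrated with a toy model. This is a minimal sketch and not Spectrum Conductor's implementation: the `Tenant` class, the `allocate` function, the tenant names, and all slot counts are hypothetical. Each tenant first receives up to what it owns, and idle slots are then lent to tenants with unmet demand, capped by each borrower's limit.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    """Hypothetical tenant: owns some slots and may borrow up to a limit."""
    name: str
    owned: int         # slots this tenant owns
    borrow_limit: int  # maximum slots it may borrow from others
    demand: int        # slots its pending work could use right now


def allocate(tenants):
    """Serve owners first, then lend idle slots to tenants with unmet demand."""
    alloc = {t.name: min(t.demand, t.owned) for t in tenants}
    idle = sum(t.owned - alloc[t.name] for t in tenants)
    for t in tenants:  # list order acts as a simple borrower ranking
        unmet = t.demand - alloc[t.name]
        extra = min(unmet, t.borrow_limit, idle)
        alloc[t.name] += extra
        idle -= extra
    return alloc


tenants = [
    Tenant("risk", owned=40, borrow_limit=30, demand=60),
    Tenant("etl", owned=40, borrow_limit=10, demand=10),
    Tenant("notebooks", owned=20, borrow_limit=0, demand=5),
]
print(allocate(tenants))   # "risk" borrows idle slots from the others

tenants[1].demand = 40     # "etl" now needs everything it owns
print(allocate(tenants))   # "etl" reclaims its slots; "risk" borrows less
```

Because owners are always served before lending, re-running `allocate` after demand changes has the effect of a reclaim: a tenant that borrowed slots gives them back as soon as the owner needs them.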
Superior Spark performance
and predictability
• 30 to 224% faster than YARN
• 25 to 88% faster than Apache Mesos
• Consistent and predictable, delivering 71% relative standard deviation (RSD) in interactive workloads
• YARN & Mesos are relatively unpredictable: 304% and 777% RSD respectively
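Relative standard deviation expresses the spread of a set of measurements as a percentage of their mean, so a lower value means more predictable behavior. A minimal sketch of the calculation follows; the sample latencies are made up for illustration, and the use of the sample (n-1) standard deviation is an assumption, since the slide does not say which form the benchmark used.

```python
import statistics


def relative_std_dev(samples):
    """Relative standard deviation (RSD) in percent: 100 * stdev / mean."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


# Hypothetical job durations in seconds, not benchmark data.
steady = [10, 12, 11, 9, 13]
spiky = [2, 3, 2, 50, 3]

print(f"steady: {relative_std_dev(steady):.1f}% RSD")
print(f"spiky:  {relative_std_dev(spiky):.1f}% RSD")
```

An RSD above 100%, as quoted for YARN and Mesos, means the standard deviation exceeds the mean, which is what a few very slow outliers among mostly fast jobs produces.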
Throughput of Spark SMB-2 benchmark workload on various Resource Managers (Jobs/hour - higher is better)

Case                                    IBM Spectrum Conductor   Apache YARN v2.7.3   Apache Mesos v1.0.1
Case 1: Sync interactive multi-user     2607                     1673                 1660
Case 2: Asynchronous batch multi-user   327                      253                  202
Case 3: Mixed multi-user                899                      582                  478
Case 4: Mixed multi-tenant              574                      177                  458
Audited benchmark results https://stacresearch.com/news/2017/05/19/IBM170405
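The quoted "30 to 224%" and "25 to 88%" ranges can be checked against the chart values. The sketch below assumes the twelve numbers are listed series by series (Conductor, then YARN, then Mesos, each in case order 1 to 4); that ordering is an inference from the chart residue, chosen because it reproduces the quoted ranges. The function names are illustrative.

```python
# Jobs/hour per resource manager, cases 1..4 (ordering is an assumption).
throughput = {
    "IBM Spectrum Conductor": [2607, 327, 899, 574],
    "Apache YARN v2.7.3": [1673, 253, 582, 177],
    "Apache Mesos v1.0.1": [1660, 202, 478, 458],
}


def gain_percent(new, base):
    """Percentage by which `new` exceeds `base`."""
    return 100.0 * (new - base) / base


def gains_over(baseline):
    """Conductor's throughput gain over `baseline` for each case, rounded."""
    conductor = throughput["IBM Spectrum Conductor"]
    return [round(gain_percent(c, b))
            for c, b in zip(conductor, throughput[baseline])]


print(gains_over("Apache YARN v2.7.3"))   # → [56, 29, 54, 224]
print(gains_over("Apache Mesos v1.0.1"))  # → [57, 62, 88, 25]
```

Under this reading the smallest gain over YARN comes to about 29%, which the slide rounds to 30%, and the largest is 224% in the mixed multi-tenant case.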
Deliver faster results
Faster, more predictable analytics
https://www.slideshare.net/databricks/deep-dive-into-apache-spark-multiuser-performance-michael-feiman-mikhail-genkin-and-peter-lankford
Faster Time to Results – GPU Support
Accelerate Spark applications with GPUs
Presented at Spark Summit San Francisco June 2016
Conductor scheduler interfaces with Spark to ensure that GPU resources are assigned
to the applications that can use them
Workload Management
Spark Application
Session Scheduler
GPU resources
CPU resources
https://www.slideshare.net/sparktc/gpu-support-in-spark-and-gpucpu-mixed-resource-scheduling-at-production-scale
To provide enhanced user experiences
Researcher
HPC
Instance #3
IT or Data
Warehouse
ETL / Reporting
Instance #4
Risk Analytics
LOB
Instance #5
Administrator
Compute Nodes
Linux
Data scientist
Customer behavior...
Trend analysis...
Instance #2
Instance #1
LOB
Marketing…
Fraud Detection…
Broader use, better business value
• Logical view - each group has its own Spark cluster
• Right resources dynamically allocated to each user
based on workload characteristics and SLA priority
• Multitenancy ensures data and application isolation
Embrace cloud, transparently leveraging cloud providers
End-to-end hybrid cloud, policy-based bursting to multiple clouds with automated data staging and cloud
staging, auto-deploy containerized workloads on IaaS or cloud-native (K8s) cloud services
Google
IBM Cloud
Microsoft
Amazon
User applications & workloads
Spectrum Computing forwards
specific workloads to multiple clouds
based on site defined policies and
multi-cluster
Spectrum Computing works with
Spectrum Scale to pre-stage
necessary data between on-premises and
cloud-based Spectrum Scale
environments before cloud hosts are
auto-provisioned
Spectrum Computing Resource
Connector dynamically resizes the
pool of hosts in the cloud based on
workload demand and policies
IBM Aspera can be used as the underlying transport
Customer example:
Augment Data Lake with a Spark
Analytics Grid
…
Spectrum Conductor
Optimized
Compute Cluster Read/Write Data
X86 ‘Data Lake’ and
Legacy Hadoop Apps
HDFS Connector
Linux on Power or x86
HDFS
POSIX
Solution
✓ Provide optimized high performance, multi-user, shared environment for Spark, Python, H2O, AI
✓ Connect to HDFS, EDW, Spectrum Scale, and Object Stores
✓ Drive up utilization with SLA policy driven sharing of compute resources
✓ Provide the foundation for AI with GPU acceleration
Business Problem
• Mounting data lake costs and risks
• Data lake not well suited to in-memory workloads
• Low compute utilization and limited flexibility
• Need to introduce new analytics applications
Enterprise Data
Warehouse
Business Problem
• 500+ users
• Large volumes of portfolio, trade and
market data
• Diverse user groups and analytics requirements
• Interactive queries and batch reports needed
• Sub-second response times necessary
• Current data warehouse and Hadoop
architectures limited on price-performance,
scalability and SLAs to users
Solution
✓ Performance-optimized Spark-based environment
✓ Shared in-memory cache and shared Spark context
✓ Intelligent workload management to meet user SLAs
✓ Multi-tenancy to enable security and isolation
✓ High performance resource management to increase infrastructure efficiency
…
Spark
Cache
HDFS/ Spectrum Scale
Traders
• Need visibility via custom
desktop application
• Analyze risk and trade
impact to determine
trading strategy
Risk Group/ CRO
• Understand risk across
the entire portfolio
Finance/ CFO
• P&L and financial
impact for the firm
Business Analysts &
Data Scientists
• Analytics
Customer example:
Risk Aggregation, Reporting and
Analytics Grid
Business Problem
• Multiple ETL silos unable to meet
the business SLA
• Multiple tools required for different data stores
• Mounting costs trying to keep up with the LOB
demand for high quality data
Solution
✓ High performance data analytics & warehouse
✓ Storage agnostic
✓ Distributed, in-memory ETL Engine
✓ Faster time to results
✓ Easy implementation & administration
✓ Improved resource utilization
✓ Life cycle management
• Multiple concurrent instances & versions of Spark
• Deploy new versions in minutes
Customer example:
Consolidate and Optimize ETL
Distributed
Parallel
ETL Engine
Data / Storage Agnostic
How to Scale Spark Workloads?
The challenges of independent scaling for data-driven workloads
Data Locality: Data is no longer local to compute, so workload processing time will increase, particularly in hybrid cloud deployments.
Data Abstraction: Data is in multiple storage systems in multiple locations. This becomes highly complex when all compute frameworks talk to all storage systems.
Data Accessibility: Data can still only be accessed using the specific storage system APIs.
STORAGE
COMPUTE
Truly independent scaling of the data stack
Data Locality, Data Accessibility, Data Abstraction
A new layer emerges between Compute & Storage
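One way to see why a layer between compute and storage reduces complexity is to count integrations. A rough sketch follows, with illustrative framework and storage names: without a shared layer every framework needs a connector to every storage system, while with one each side integrates only once.

```python
def point_to_point(frameworks, stores):
    """Every compute framework talks to every storage system directly."""
    return len(frameworks) * len(stores)


def with_shared_layer(frameworks, stores):
    """Each framework and each storage system integrates once with the layer."""
    return len(frameworks) + len(stores)


# Example names only; any mix of frameworks and stores works the same way.
frameworks = ["Spark", "Presto", "TensorFlow"]
stores = ["HDFS", "S3", "Swift", "NFS"]

print(point_to_point(frameworks, stores))     # → 12
print(with_shared_layer(frameworks, stores))  # → 7
```

The gap widens as either side grows, since the first count is a product and the second a sum.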
The Alluxio Story
2013: Originated as the Tachyon project at the UC Berkeley AMPLab by then-Ph.D. student and now Alluxio CTO, Haoyuan (H.Y.) Li.
2015: Open source project established and company founded to commercialize Alluxio.
Goal: Orchestrate Data at Memory Speed for the Cloud
for data driven apps such as Big Data Analytics, ML and AI.
2018, 2019
2019
Top 10 Big Data
2019
Top 10 Cloud Software
Fast-growing Open Source Community
4000+ GitHub Stars
1000+ Contributors
Join the community on Slack
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio
Virtual Unified File System
Java File API | HDFS Interface | S3 Interface | REST API | FUSE Interface
HDFS Driver | Swift Driver | S3 Driver | NFS Driver
Flexible APIs to Interact with data in Alluxio
Spark
> rdd = sc.textFile("alluxio://master:port/myInput")
Presto
CREATE SCHEMA hive.web
WITH (location = 'alluxio://master:port/my-table/')
POSIX
$ cat /mnt/alluxio/myInput
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
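The examples above all address the same file, only through different interfaces. A small sketch of how the two path forms relate is shown below. The helper names are hypothetical, the `/mnt/alluxio` mount point is taken from the POSIX example, and port 19998 is assumed as the master RPC port; substitute the values of your own deployment.

```python
def alluxio_uri(path, master="master", port=19998):
    """URI form used by Spark, Presto, and other Hadoop-compatible clients."""
    return f"alluxio://{master}:{port}/{path.lstrip('/')}"


def fuse_path(path, mount_point="/mnt/alluxio"):
    """Local path of the same file when Alluxio is mounted through FUSE."""
    return f"{mount_point.rstrip('/')}/{path.lstrip('/')}"


print(alluxio_uri("/myInput"))  # → alluxio://master:19998/myInput
print(fuse_path("/myInput"))    # → /mnt/alluxio/myInput
```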
Data Accessibility via popular APIs
Convert from Client-side Interface to native Storage Interface
Java File API | HDFS Interface | S3 Interface | REST API | FUSE Interface
HDFS Driver | Swift Driver | S3 Driver | NFS Driver
Data Abstraction via Unified Namespace
Enables effective data management across different Under Store
- Uses Mounting with Transparent Naming
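Mounting with transparent naming can be pictured as a table that maps path prefixes in the unified namespace to locations in the under stores. The sketch below is a toy model, not Alluxio code: the `MountTable` class and the example URIs are hypothetical. It resolves a path by longest matching mount point, so a nested mount shadows its parent.

```python
class MountTable:
    """Toy unified namespace: mount point -> under store URI."""

    def __init__(self):
        self._mounts = {}

    def mount(self, alluxio_path, ufs_uri):
        """Attach an under store location at a path in the namespace."""
        self._mounts[alluxio_path.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Translate a namespace path into the under store URI behind it."""
        # Longest mount point first, so nested mounts take precedence.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if prefix == "/":
                return self._mounts[prefix] + path
            if path == prefix or path.startswith(prefix + "/"):
                return self._mounts[prefix] + path[len(prefix):]
        raise KeyError(f"no mount covers {path}")


table = MountTable()
table.mount("/", "hdfs://namenode:8020/alluxio")
table.mount("/data/s3", "s3://bucket/warehouse")

print(table.resolve("/logs/app.log"))
print(table.resolve("/data/s3/events/part-0"))
```

Applications see only the left-hand paths, so data can move between under stores by remounting without changing application code.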
Data Locality via Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot Warm Cold
RAM SSD HDD
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
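Promotion and demotion across tiers can be sketched with one LRU queue per tier. This is an illustrative toy and not Alluxio's eviction logic: the `TieredCache` class is hypothetical, capacities are counted in blocks, and pinning and TTL are left out. A read promotes the block to the top tier, and a full tier demotes its least recently used block one level down.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier block cache; tier 0 is the fastest (e.g. RAM)."""

    def __init__(self, capacities):
        self.capacities = capacities                  # blocks per tier
        self.tiers = [OrderedDict() for _ in capacities]

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # evicted from the last tier; still in the under store
        tier = self.tiers[level]
        tier[block_id] = data
        tier.move_to_end(block_id)
        if len(tier) > self.capacities[level]:
            victim, victim_data = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, victim, victim_data)    # demote one tier

    def read(self, block_id, load_from_ufs):
        """Return (data, tier_hit); tier_hit is None on a cache miss."""
        for level, tier in enumerate(self.tiers):
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)  # promote to the top tier
                return data, level
        data = load_from_ufs(block_id)
        self._insert(0, block_id, data)
        return data, None


cache = TieredCache([2, 2])             # e.g. 2 blocks of RAM, 2 of SSD
fetch = lambda block_id: f"data-{block_id}"
for block in ["b1", "b2", "b3"]:
    cache.read(block, fetch)            # b1 is demoted when b3 arrives
print(cache.read("b1", fetch))          # hit in tier 1, promoted to tier 0
```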
Alluxio Reference Architecture
Alluxio
Master
Zookeeper /
RAFT
Standby
Master
Alluxio
Worker
Alluxio
Worker
Alluxio
Client
RAM / SSD / HDD
RAM / SSD / HDD
Under Store 1
Under Store 2
Application
WAN
Alluxio
Client
Application
Data Flow In Alluxio
1. Applications Read/Write data via the Alluxio Client
2. Read Scenarios
• Data not in Alluxio (i.e. first time, or no cache)
• Data on same node as client
• Data on different node from client
3. Write Scenarios
• Write only to Alluxio
• Write only to Under Store
• Write synchronously to Alluxio and Under Store
• Write to Alluxio and asynchronously write to Under Store
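The four write scenarios differ only in where the data lands and when. A minimal sketch, using plain dictionaries as stand-ins for the Alluxio cache and the under store: the enum member names follow the write types documented by Alluxio (MUST_CACHE, THROUGH, CACHE_THROUGH, ASYNC_THROUGH), while the `write` and `flush` functions are hypothetical.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "write only to Alluxio"
    THROUGH = "write only to the under store"
    CACHE_THROUGH = "write synchronously to Alluxio and the under store"
    ASYNC_THROUGH = "write to Alluxio, persist to the under store later"


def write(block_id, data, write_type, cache, ufs, pending):
    """Place data according to the write type; dicts stand in for storage."""
    if write_type is not WriteType.THROUGH:
        cache[block_id] = data
    if write_type in (WriteType.THROUGH, WriteType.CACHE_THROUGH):
        ufs[block_id] = data
    if write_type is WriteType.ASYNC_THROUGH:
        pending.append(block_id)  # persisted later by flush()


def flush(cache, ufs, pending):
    """Background persistence step for ASYNC_THROUGH writes."""
    while pending:
        block_id = pending.pop(0)
        ufs[block_id] = cache[block_id]


cache, ufs, pending = {}, {}, []
for wt in WriteType:
    write(wt.name, b"payload", wt, cache, ufs, pending)
print(sorted(cache), sorted(ufs))  # before the asynchronous flush
flush(cache, ufs, pending)
print(sorted(cache), sorted(ufs))  # after the asynchronous flush
```

The trade-off is between write latency and durability: data written only to the cache is fastest but is lost if the cache is, while synchronous write-through pays the under store's latency on every write.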
Accessing Alluxio Data From Spark
Writing Data: write to an Alluxio file
Reading Data: read from an Alluxio file
Code Example for Spark RDDs
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
Code Example for Spark DataFrames
Writing to Alluxio
df.write.parquet(alluxioPath)
Reading from Alluxio (spark is the SparkSession; the SparkContext sc has no read attribute)
df = spark.read.parquet(alluxioPath)
Sharing Data via Memory
Storage Engine & Execution Engine: Same Process
• Two copies of data in memory: double the memory used
• Sharing slowed down by network / disk I/O
[Diagram: two Spark processes, each with Spark Compute and Spark Storage holding its own copy of block 1 and block 3; blocks 1-4 reside in HDFS / Amazon S3]
Sharing Data via Memory
Storage Engine & Execution Engine: Different Process
• Half the memory used
• Sharing data at memory speed
[Diagram: two Spark processes (Spark Compute, Spark Storage) share blocks 1, 3, and 4 held in Alluxio; blocks 1-4 reside in HDFS / Amazon S3 and on HDFS disk]
Data Resilience During Crash
Storage Engine & Execution Engine: Same Process
• Process crash requires network and/or disk I/O to re-read data
[Diagram sequence: Spark Storage holds block 1 and block 3 inside the Spark process; after the crash the cached blocks are gone and only blocks 1-4 in HDFS / Amazon S3 remain]
Data Resilience During Crash
Storage Engine & Execution Engine: Different Process
• Process crash: data is re-read at memory speed
[Diagram sequence: blocks 1, 3, and 4 are held in Alluxio, outside the Spark process; after the crash they remain in Alluxio, backed by blocks 1-4 in HDFS / Amazon S3 and on HDFS disk]
Performance Tuning Tips
• Data Locality
• Ensure Spark Executor Locality
• Ensure Spark Task Locality
• Prioritize Locality
• Load Balancing
• Smaller Block Size
• Tune Executor Number
• DeterministicHashPolicy to load data from UFS
Read more https://dzone.com/articles/top-10-tips-for-making-the-spark-alluxio-stack-bla
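The idea behind a deterministic hashing policy is that every client maps a given block to the same worker, so a block missing from the cache is loaded from the UFS once rather than once per client. The sketch below shows the general technique only; it is not Alluxio's implementation, and the `pick_workers` function, its `shards` parameter, and the worker names are hypothetical.

```python
import hashlib


def pick_workers(block_id, workers, shards=1):
    """Deterministically map a block to `shards` candidate workers."""
    ordered = sorted(workers)  # same order on every client
    digest = hashlib.sha256(str(block_id).encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(ordered)
    count = min(shards, len(ordered))
    return [ordered[(start + i) % len(ordered)] for i in range(count)]


workers = ["worker-a", "worker-b", "worker-c", "worker-d"]
# Two clients with differently ordered worker lists agree on the target.
print(pick_workers(42, workers))
print(pick_workers(42, list(reversed(workers))))
```

Raising `shards` spreads reads of a hot block over several workers, at the cost of loading that block from the UFS up to `shards` times.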
Real-world Use Cases
Solution: Analytic Data Lake with ETL and Reporting for Large Telco
• Business Requirement
– Centralize the IT infrastructure at headquarters and collect all data from the regions at headquarters
– The 3G/4G BOSS system, critical in telecom, is centralized at headquarters with >100 million users; the 2G BOSS system is still hosted by the regions
– Siloed environments cause low utilization and missed SLAs
– Need data orchestration capability for agility and performance
• Key Values
– Performance improvements (compared to HDP) come from the Session Scheduler (fine-grained scheduling)
– Stability of HA support and enterprise capability in scale-out (the original Greenplum solution is 2 duplicated clusters with 20 machines storing the same data, with significant wasted hardware)
– Multiple tenants: the revenue sharing application uses only 6 days in a month, so the customer would like to build a sharable multi-tenant Spark cluster that other departments can use
– Strong technical and commercial support
HDFS from HDP
Alluxio (Orchestration layer)
IBM Spectrum Conductor (multiple tenants)
X86 linux
[Deployment diagram. Labels shown: HiveTez on Yarn and Hive client (data import, data analysis); GPFS FPO (2 disks/host); management hosts and info hosts carrying CwS M, CwS MC, Spark M, Alluxio M, Alluxio MC, HDFS M, HDFS MC, and Hive Meta (mysql); compute hosts each carrying CwS C, Spark D, Spark E, Alluxio w, and HDFS D; 10Gb network]
Data Orchestration for Agility
Leading Telco serving 300+ million subscribers
• Single namespace to access & address all data
• Data local to compute accelerates workloads
[Diagram: Spark and Spark ETL jobs reach multiple HDFS clusters through a data orchestration layer]
Open Source Started From UC Berkeley AMPLab
• 1000+ contributors & growing
• 4000+ GitHub Stars
• Apache 2.0 Licensed
• GitHub’s Top 100 Most Valuable Repositories Out of 96 Million
• Join the conversation on Slack: slackin.alluxio.io
DATA ORCHESTRATION SUMMIT
November 7, 2019 | Computer History Museum | Mountain View, CA
Organized by
Register Here!
Thank You
Questions? Email me: dipti@alluxio.com
Join the Alluxio Community
www.alluxio.org | www.alluxio.com | Twitter: @Alluxio | Slack

Powering Data Science and AI with Apache Spark, Alluxio, and IBM

  • 1. Achieving Separation of Compute and Storage in a Cloud World Yonggang Hu | Chief Architect at Spectrum Computing | IBM Bin Fan | Founding Engineer andVice President of Open Source | Alluxio
  • 3. Yonggang Hu Distinguished Engineer, Chief Architect at Spectrum Computing, IBM Email: yhu@ca.ibm.com Yonggang has been working on distributed computing, grid, cloud and big data for the past 20 years. Before joining Spectrum Computing, Yonggang was Vice President and Application Architect at JPMorgan Chase focusing on computational analytics and application infrastructure. Yonggang holds MS in Computer Science from Peking University and MBA from Cornell University. Today’s Speaker
  • 4. Bin Fan VP of Open Source at Alluxio Email: binfan@alluxio.com Bin Fan is the founding engineer and VP of Open Source at Alluxio, Inc. Prior to Alluxio, he worked for Google to build the next-generation storage infrastructure. Bin received his Ph.D. in Computer Science from Carnegie Mellon University on the design and implementation of distributed systems. Today’s Speaker
  • 5. Agenda Why storage-independent compute? And Spectrum Conductor AlluxioTechnology Overview Real-world Use Cases
  • 6. Data Time Available Data Understood Data Enterprise Amnesia 80 million wearable health devices will be available by 2017. 2.5 quintillion bytes of data generated daily by connected machines. There will be 28 times more sensor- enabled devices than people by the year 2020. 25 gigabytes of data per hour is generated by a connected car. 90% of cars will be connected by 2020. 153 exabytes of healthcare data generated by devices in 2013. Increasing to 2,314 exabytes in 2020. 1.7 megabytes of data per second generated by every human being on the planet by 2020.
  • 8. “We have many useless 2 to 10TB Hadoop clusters” “Our Hadoop cluster is too big to upgrade – still running Spark 1.4” “Why there is no data locality in my HDFS?” “Data scientists cannot run analytics in the lake” “I just want to run ETL. Why I have to copy to lake first?” Data & Analytics Challenges “Spark and AI workload are compute intensive” “Too complex and costly to run and manage” “Need independent scaling of compute and storage” “Big batch job taking over the cluster. Online query is too slow” Cluster Sprawl – The Elephant in the Room
  • 9. Apache Spark is revolutionizing big data Benefits • 10-100x the performance of MapReduce • In-memory, batch/streaming ETL, ML • Simplicity • Faster application development • Enables de-coupled compute and storage The Apache Spark big data processing framework will account for more than 37% of all big data spending by 2022, according to research by Wikibon. [Chart caption: Last 3 Years (July 2016 – July 2019)]
  • 11. IBM contributions to Apache Spark (IBM Developer / © 2019 IBM Corporation) Total Spark contribution: • Total: 1195 JIRAs merged; covered areas: SQL, ML, PySpark, Spark core • 148 commits merged in the last 12 months • Currently we have 11 contributors with 4 committers • Total LoC: 90K+ in main source and test code • More than 65% are major issues, more than 95% are in areas: SQL, ML, PySpark, Spark core, and more than 50% are improvements and features. In the latest release of Apache Spark (v2.4.0): • IBM contributed about 100 commits and 6586 LoC in source and test. • Blog post highlighting the Spark team's contributions in this release: https://developer.ibm.com/blogs/2018/11/29/ibm-continues-commitment-to-apache-spark/
  • 12. Inefficient workload management results in poor server utilization rates and throughput. Lack of security with open-source frameworks and applications introduces risk. Multiple Spark teams each with dedicated servers = wasted capacity and high administrative overhead. Scaling Spark is challenging
  • 13. Deliver faster results Drive down TCO • Streamline deployment and management • Integrated monitoring, alerting, reporting, and diagnostics • Quickly onboard users and Spark instances and versions. IBM Spectrum Conductor unlocks the power in your infrastructure to build data pipelines at scale • Maximize Spark performance with intelligent workload scheduling. • 30 to 224% greater throughput than Spark with YARN, 25 to 88% greater than Spark with Mesos* • Workload aware, dynamic bursting to external clouds • Eliminate cluster sprawl with a multitenant shared infrastructure. • Get more out of current hardware by increasing utilization and throughput. • Simultaneously run multiple versions and instances of Spark. Simplify administration
  • 14. Dedicated Resources First, optimize the environment Better performance at lower cost • Dynamic resource sharing with guaranteed SLAs • Ability to borrow CPU cores dynamically from other tenants to meet job needs • Enterprise-proven at scale: • 5K hosts, 150K cores, >1B tasks/day Secure Multitenant Shared Resources Shared Resources Dedicated Resource Silos IBM Cloud x86
  • 15. Spectrum Conductor: flexible resource sharing and SLA control. Maximize utilization while ensuring tenant / LOB SLAs with unique resource sharing model. [Resource plan figure: a hierarchy of consumers (Analytics-clusters, Risk analytics, CECL-prod, Credit-Bond, Model review, Spark-test, Big-Data-Env, R-Analytics, Notebook-Service, TensorFlow, Docker-containers) with per-consumer numeric settings] • Unique “Resource Plan” capability to express flexible resource sharing policies among applications (consumers) and multiple resource pools • Hierarchical structure can reflect organizational structure and application hierarchies • Notions of time-variant resource ownership, ranking, sharing ratios, lend & borrow limits • Application/LOB tenants can borrow resources from others (subject to policy), improving performance and maximizing asset utilization • Introduction of prioritization for Spark and support of FCFS Priority and Fairshare policies * (Spark Summit 2018) • Flexible pre-emption policies enable tenants to quickly reclaim owned resources, ensuring that service-level objectives and business deadlines are met https://www.slideshare.net/databricks/dynamic-priorities-for-apache-spark-applications-resource-allocations-with-michael-feiman-and-shinnosuke-okada
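
The lend and borrow limits mentioned on this slide can be illustrated with a toy allocator. This is only a sketch of the general idea, not Spectrum Conductor's algorithm or API: the `Tenant` fields, the `allocate` function, and the first-come lending order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    owned: int         # slots the tenant owns
    demand: int        # slots the tenant currently wants
    lend_limit: int    # most owned slots it is willing to lend out
    borrow_limit: int  # most slots it is allowed to borrow


def allocate(tenants):
    """Give each tenant min(owned, demand), then lend idle slots to tenants with unmet demand."""
    alloc = {t.name: min(t.owned, t.demand) for t in tenants}
    # Idle slots each tenant will lend, capped by its lend limit.
    pool = sum(min(t.owned - alloc[t.name], t.lend_limit) for t in tenants)
    for t in tenants:  # list order here; a real policy would use ranking or share ratios
        unmet = t.demand - alloc[t.name]
        grant = min(unmet, t.borrow_limit, pool)
        if grant > 0:
            alloc[t.name] += grant
            pool -= grant
    return alloc


tenants = [
    Tenant("risk-analytics", owned=100, demand=40, lend_limit=50, borrow_limit=0),
    Tenant("spark-test", owned=20, demand=60, lend_limit=0, borrow_limit=30),
]
result = allocate(tenants)
```

In this example the test tenant wants 40 slots more than it owns but is capped by its borrow limit of 30. Pre-emption (reclaiming lent slots when the owner's demand rises) is left out of the sketch.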
  • 16. Deliver faster results: superior Spark performance and predictability • 30 to 224% faster than YARN • 25 to 88% faster than Apache Mesos • Consistent and predictable, delivering 71% relative standard deviation (RSD) in interactive workloads • YARN & Mesos are relatively unpredictable: 304% and 777% RSD respectively. Throughput of Spark SMB-2 benchmark workload on various resource managers (jobs/hour, higher is better), listed as IBM Spectrum Conductor / Apache YARN v2.7.3 / Apache Mesos v1.0.1: Case 1, sync interactive multi-user: 2607 / 1673 / 1660; Case 2, asynchronous batch multi-user: 327 / 253 / 202; Case 3, mixed multi-user: 899 / 582 / 478; Case 4, mixed multi-tenant: 574 / 177 / 458. Audited benchmark results: https://stacresearch.com/news/2017/05/19/IBM170405
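
The headline percentages on this slide can be recomputed from the chart's jobs/hour values. The sketch below assumes the twelve chart values are grouped by resource manager (Conductor first, then YARN, then Mesos, four cases each); that grouping is an assumption, though it does reproduce the slide's 224% and 88% figures.

```python
# Jobs/hour for Cases 1-4, read from the slide's chart (grouping by series is assumed).
throughput = {
    "conductor": [2607, 327, 899, 574],
    "yarn": [1673, 253, 582, 177],
    "mesos": [1660, 202, 478, 458],
}


def percent_gain(ours, theirs):
    """Percentage by which `ours` exceeds `theirs`, case by case, rounded to whole percent."""
    return [round((a / b - 1) * 100) for a, b in zip(ours, theirs)]


vs_yarn = percent_gain(throughput["conductor"], throughput["yarn"])
vs_mesos = percent_gain(throughput["conductor"], throughput["mesos"])
```

Under that grouping, the smallest gain over YARN comes out near 29%, which the slide appears to round to 30%.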
  • 17. Faster, more predictable analytics https://www.slideshare.net/databricks/deep-dive-into-apache-spark-multiuser-performance-michael-feiman-mikhail-genkin-and-peter-lankford
  • 18. Faster Time to Results – GPU Support Accelerate Spark applications with GPUs Presented at Spark Summit San Francisco June 2016 Conductor scheduler interfaces with Spark to ensure that GPU resources are assigned to the applications that can use them [Diagram labels: Workload Management, Spark Application, Session Scheduler, GPU resources, CPU resources] https://www.slideshare.net/sparktc/gpu-support-in-spark-and-gpucpu-mixed-resource-scheduling-at-production-scale
  • 19. To provide enhanced user experiences. Broader use, better business value • Logical view - each group has its own Spark cluster • Right resources dynamically allocated to each user based on workload characteristics and SLA priority • Multitenancy ensures data and application isolation [Diagram: users (Data scientist, LOB, Researcher, IT or Data Warehouse, Risk Analytics LOB, Administrator) with workloads such as customer behavior, trend analysis, marketing, fraud detection, HPC, and ETL / reporting, served by Instances #1 to #5 on shared Linux compute nodes]
  • 20. Embrace cloud, transparently leveraging cloud providers. End-to-end hybrid cloud, policy-based bursting to multiple clouds with automated data staging and cloud staging, auto-deploy containerized workloads on IaaS or cloud-native (K8s) cloud services. [Diagram labels: Google, IBM Cloud, Microsoft, Amazon, User applications & workloads] Spectrum Computing forwards specific workloads to multiple clouds based on site defined policies and multi-cluster. Spectrum Computing works with Spectrum Scale to pre-stage necessary data between on-premises and cloud-based Spectrum Scale environments before cloud hosts are auto-provisioned. Spectrum Computing Resource Connector dynamically resizes the pool of hosts in the cloud based on workload demand and policies. IBM Aspera can be used as the underlying transport.
  • 21. Customer example: Augment Data Lake with a Spark Analytics Grid. Business Problem • Mounting data lake costs and risks • Data lake not well suited to in-memory workloads • Low compute utilization and limited flexibility • Need to introduce new analytics applications. Solution • Provide optimized high performance, multi-user, shared environment for Spark, Python, H2O, AI • Connect to HDFS, EDW, Spectrum Scale, and Object Stores • Drive up utilization with SLA policy driven sharing of compute resources • Provide the foundation for AI with GPU acceleration [Diagram: Spectrum Conductor optimized compute cluster (Linux on Power or x86) reads/writes data through an HDFS connector (HDFS, POSIX) to the x86 ‘Data Lake’ with legacy Hadoop apps and to the Enterprise Data Warehouse]
  • 22. Customer example: Risk Aggregation, Reporting and Analytics Grid. Business Problem • 500+ users • Large volumes of portfolio, trade and market data • Diverse user groups and analytics requirements • Interactive queries and batch reports needed • Sub-second response times necessary • Current data warehouse and Hadoop architectures limited on price-performance, scalability and SLAs to users. Solution • Performance-optimized Spark-based environment • Shared in-memory cache and shared Spark context • Intelligent workload management to meet user SLAs • Multi-tenancy to enable security and isolation • High performance resource management to increase infrastructure efficiency. User groups: Traders • Need visibility via custom desktop application • Analyze risk and trade impact to determine trading strategy; Risk Group / CRO • Understand risk across the entire portfolio; Finance / CFO • P&L and financial impact for the firm; Business Analysts & Data Scientists • Analytics [Diagram labels: Spark Cache, HDFS / Spectrum Scale]
  • 23. Customer example: Consolidate and Optimize ETL. Business Problem • Multiple ETL silos unable to meet the business SLA • Multiple tools required for different data stores • Mounting costs trying to keep up with the LOB demand for high quality data. Solution • High performance data analytics & warehouse • Storage agnostic • Distributed, in-memory ETL engine • Faster time to results • Easy implementation & administration • Improved resource utilization • Life cycle management: multiple concurrent instances & versions of Spark; deploy new versions in minutes [Diagram labels: Distributed Parallel ETL Engine, Data / Storage Agnostic]
  • 24. How to Scale Spark Workloads?
  • 25. The challenges of independent scaling for data-driven workloads: Data Locality, Data Accessibility, Data Abstraction. Data is no longer local to compute, and workload processing time will increase, particularly in hybrid cloud deployments. Data is in multiple storage systems in multiple locations; it is highly complex when all compute frameworks talk to all storage systems. Data can still only be accessed using the specific storage system APIs.
  • 26. STORAGE COMPUTE Truly independent scaling of the data stack Data Locality Data Accessibility Data Abstraction A new layer emerges between Compute & Storage
  • 27. The Alluxio Story. Originated as the Tachyon project at UC Berkeley AMPLab by then Ph.D. student & now Alluxio CTO, Haoyuan (H.Y.) Li. Open Source project established & company to commercialize Alluxio founded. Goal: Orchestrate Data at Memory Speed for the Cloud for data driven apps such as Big Data Analytics, ML and AI. [Timeline markers: 2013, 2015, 2018, 2019] [Badges: 2019 Top 10 Big Data, 2019 Top 10 Cloud Software]
  • 28. Fast-growing Open Source Community 4000+ GitHub Stars 1000+ Contributors Join the community on Slack alluxio.io/slack Apache 2.0 Licensed Contribute to source code github.com/alluxio/alluxio
  • 29. Virtual Unified File System Java File API HDFS Interface S3 Interface REST API FUSE Interface HDFS Driver Swift Driver S3 Driver NFS Driver
  • 30. Flexible APIs to Interact with data in Alluxio. Spark: > rdd = sc.textFile("alluxio://master:port/myInput") Presto: CREATE SCHEMA hive.web WITH (location = 'alluxio://master:port/my-table/') POSIX: $ cat /mnt/alluxio/myInput Java: FileSystem fs = FileSystem.Factory.get(); FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
  • 31. Data Accessibility via popular APIs Convert from Client-side Interface to native Storage Interface Java File API HDFS Interface S3 Interface REST API FUSE Interface HDFS Driver Swift Driver S3 Driver NFS Driver
  • 32. Data Abstraction via Unified Namespace Enables effective data management across different Under Stores - Uses Mounting with Transparent Naming
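
Mounting with transparent naming can be pictured as a mount table that maps paths in one namespace onto several under stores. The sketch below is a toy model of that idea only; the `resolve` function, the mount table layout, and the example URIs are hypothetical and are not Alluxio's implementation.

```python
def resolve(mount_table, alluxio_path):
    """Map a path in the unified namespace to an under-store URI via the longest matching mount."""
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/")  # the root mount "/" becomes the empty prefix
        if alluxio_path == prefix or alluxio_path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best.rstrip("/")):
                best = mount_point
    if best is None:
        raise KeyError(f"no mount covers {alluxio_path}")
    suffix = alluxio_path[len(best.rstrip("/")):]
    return mount_table[best].rstrip("/") + suffix


# Hypothetical mounts: the root lives in HDFS, one subtree lives in an S3 bucket.
mounts = {
    "/": "hdfs://namenode:8020/alluxio",
    "/data/s3": "s3://example-bucket/warehouse",
}
s3_uri = resolve(mounts, "/data/s3/events/part-0")
hdfs_uri = resolve(mounts, "/logs/a.txt")
```

The application only ever sees paths such as `/data/s3/events/part-0`; which storage system backs a path is decided by the mount table.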
  • 33. Data Locality via Intelligent Multi-tiering Local performance from remote data using multi-tier storage: Hot (RAM), Warm (SSD), Cold (HDD) Read & Write Buffering Transparent to App Policies for pinning, promotion/demotion, TTL
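
A toy model can make the pinning and promotion/demotion policies concrete. The sketch below assumes three tiers ordered fastest first with least-recently-used demotion; the `TieredCache` class is hypothetical, omits TTL, and is not how Alluxio implements tiered storage.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier block cache: tiers are ordered fastest first, each kept in LRU order."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), e.g. RAM, then SSD, then HDD
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.pinned = set()  # pinned blocks are never chosen for demotion

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: dropped from the cache
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data
        blocks.move_to_end(block_id)
        while len(blocks) > cap:
            # Demote the least recently used unpinned block to the next tier.
            victim = next((b for b in blocks if b not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; this toy model tolerates the overflow
            self._insert(level + 1, victim, blocks.pop(victim))

    def put(self, block_id, data):
        for _, _, blocks in self.tiers:
            blocks.pop(block_id, None)  # avoid keeping stale copies in lower tiers
        self._insert(0, block_id, data)

    def get(self, block_id):
        for _, _, blocks in self.tiers:
            if block_id in blocks:
                data = blocks.pop(block_id)
                self._insert(0, block_id, data)  # promote to the fastest tier on read
                return data
        return None  # miss: a real system would fetch from the under store

    def tier_of(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None


cache = TieredCache([("RAM", 1), ("SSD", 1), ("HDD", 1)])
for block in ("a", "b", "c"):
    cache.put(block, block.upper())
```

After the three writes, the newest block sits in RAM and the oldest in HDD; reading the oldest block promotes it back to RAM and pushes the others down one tier.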
  • 34. Alluxio Reference Architecture Alluxio Master Zookeeper / RAFT Standby Master Alluxio Worker Alluxio Worker Alluxio Client RAM / SSD / HDD RAM / SSD / HDD Under Store 1 Under Store 2 Application WAN Alluxio Client Application
  • 35. Data Flow In Alluxio 1. Applications Read/Write data via the Alluxio Client 2. Read Scenarios • Data not in Alluxio (i.e. first time, or no cache) • Data on same node as client • Data on different node from client 3. Write Scenarios • Write only to Alluxio • Write only to Under Store • Write synchronously to Alluxio and Under Store • Write to Alluxio and asynchronously write to Under Store
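
The four write scenarios and the first-read scenario can be modeled in a few lines. The enum labels below follow Alluxio's write-type names as I understand them (MUST_CACHE, THROUGH, CACHE_THROUGH, ASYNC_THROUGH); the slide does not name them, so treat the labels as an assumption. `ToyStore` itself is hypothetical and ignores the local versus remote worker distinction.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "write only to Alluxio"
    THROUGH = "write only to the under store"
    CACHE_THROUGH = "write synchronously to Alluxio and the under store"
    ASYNC_THROUGH = "write to Alluxio now, to the under store later"


class ToyStore:
    def __init__(self):
        self.cache = {}        # stands in for Alluxio worker storage
        self.under_store = {}  # stands in for HDFS, S3, and so on
        self.pending = []      # paths waiting for asynchronous persistence

    def write(self, path, data, write_type):
        if write_type is not WriteType.THROUGH:
            self.cache[path] = data
        if write_type in (WriteType.THROUGH, WriteType.CACHE_THROUGH):
            self.under_store[path] = data
        if write_type is WriteType.ASYNC_THROUGH:
            self.pending.append(path)

    def flush(self):
        """Persist queued paths; models the background step of an asynchronous write."""
        while self.pending:
            path = self.pending.pop(0)
            self.under_store[path] = self.cache[path]

    def read(self, path):
        if path in self.cache:
            return self.cache[path], "cache hit"
        data = self.under_store[path]  # first read: fetch from the under store
        self.cache[path] = data        # keep a copy for later reads
        return data, "loaded from under store"


store = ToyStore()
store.write("/a", b"1", WriteType.THROUGH)
first_read = store.read("/a")
second_read = store.read("/a")
```

The trade-off the scenarios express is durability versus write latency: a cache-only write is fastest but is lost if the cache is lost, while a synchronous write pays under-store latency up front.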
  • 36. Accessing Alluxio Data From Spark Writing Data Write to an Alluxio file Reading Data Read from an Alluxio file
  • 37. Code Example for Spark RDDs. Writing RDD to Alluxio: rdd.saveAsTextFile(alluxioPath) rdd.saveAsObjectFile(alluxioPath) Reading RDD from Alluxio: rdd = sc.textFile(alluxioPath) rdd = sc.objectFile(alluxioPath)
  • 38. Code Example for Spark DataFrames. Writing to Alluxio: df.write.parquet(alluxioPath) Reading from Alluxio: df = spark.read.parquet(alluxioPath)
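
The `alluxioPath` in these examples is an `alluxio://` URI. A minimal sketch of building one and using it for a Parquet round trip follows; both helper functions are hypothetical, the host name is made up, and port 19998 is assumed to be the master's RPC port in a default deployment. The second function expects an existing `SparkSession` and DataFrame to be passed in, and assumes the Alluxio client jar is already on Spark's classpath.

```python
def alluxio_uri(master_host, path, port=19998):
    """Build an alluxio:// URI; 19998 is assumed to be the master RPC port."""
    if not path.startswith("/"):
        path = "/" + path
    return f"alluxio://{master_host}:{port}{path}"


def roundtrip_parquet(spark, df, uri):
    """Write a DataFrame to Alluxio as Parquet, then read it back through the same URI."""
    df.write.mode("overwrite").parquet(uri)
    return spark.read.parquet(uri)


events_uri = alluxio_uri("alluxio-master.example.com", "data/events")
# Usage with a live session (not executed here):
#   df2 = roundtrip_parquet(spark, df, events_uri)
```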
  • 39. Sharing Data via Memory. Storage Engine & Execution Engine Same Process • Two copies of data in memory – double the memory used • Sharing Slowed Down by Network / Disk I/O [Diagram: two Spark jobs, each with its own Spark Compute and Spark Storage holding block 1 and block 3; HDFS / Amazon S3 holds blocks 1 to 4]
  • 40. Sharing Data via Memory. Storage Engine & Execution Engine Different Process • Half the memory used • Sharing Data at Memory Speed [Diagram: two Spark jobs (Spark Compute, Spark Storage) sharing Alluxio, which holds block 1, block 3 and block 4; HDFS / Amazon S3 and HDFS disk hold blocks 1 to 4]
  • 41. Data Resilience During Crash. Storage Engine & Execution Engine Same Process [Diagram: Spark Compute with Spark Storage holding block 1 and block 3; HDFS / Amazon S3 holds blocks 1 to 4]
  • 42. Data Resilience During Crash. Storage Engine & Execution Engine Same Process • Process Crash Requires Network and/or Disk I/O to Re-read Data [Diagram: CRASH in the Spark process; Spark Storage holds block 1 and block 3; HDFS / Amazon S3 holds blocks 1 to 4]
  • 43. Data Resilience During Crash. Storage Engine & Execution Engine Same Process • Process Crash Requires Network and/or Disk I/O to Re-read Data [Diagram: CRASH; only HDFS / Amazon S3 still holds blocks 1 to 4]
  • 44. Data Resilience During Crash. Storage Engine & Execution Engine Different Process [Diagram: Spark Compute and Spark Storage alongside Alluxio holding block 1, block 3 and block 4; HDFS / Amazon S3 and HDFS disk hold blocks 1 to 4]
  • 45. Data Resilience During Crash. Storage Engine & Execution Engine Different Process • Process Crash – Data is Re-read at Memory Speed [Diagram: CRASH in the Spark process; Alluxio still holds block 1, block 3 and block 4; HDFS / Amazon S3 and HDFS disk hold blocks 1 to 4]
  • 46. Performance Tuning Tips • Data Locality: Ensure Spark Executor Locality; Ensure Spark Task Locality; Prioritize Locality • Load Balancing: Smaller Block Size; Tune Executor Number; DeterministicHashPolicy to load data from UFS. Read more: https://dzone.com/articles/top-10-tips-for-making-the-spark-alluxio-stack-bla
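
The idea behind a deterministic hashing policy for cold reads is that every client maps the same block to the same worker (or small set of workers), so a block is loaded from the under file system (UFS) once instead of by many workers at the same time. The sketch below shows that idea only; `pick_workers`, its `shards` parameter, and the choice of MD5 are hypothetical and do not reproduce Alluxio's policy class.

```python
import hashlib


def pick_workers(block_id, workers, shards=1):
    """Return the candidate workers for a block, chosen deterministically from its id."""
    if not workers:
        raise ValueError("at least one worker is required")
    ordered = sorted(workers)  # same order on every client, whatever order it was given
    digest = hashlib.md5(str(block_id).encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(ordered)
    count = min(shards, len(ordered))
    return [ordered[(start + i) % len(ordered)] for i in range(count)]


workers = ["worker-1", "worker-2", "worker-3"]
choice_a = pick_workers(42, workers)
choice_b = pick_workers(42, list(reversed(workers)))  # a client that saw another order
```

Raising `shards` spreads a hot block over more workers at the cost of more UFS reads, which is the load-balancing knob this tip refers to.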
  • 47. Real-world Use Cases
  • 48. Solution: Analytic Data Lake with ETL and Reporting for Large Telco • Business Requirement – Centralize the IT infrastructure to headquarters and collect all data from regions to headquarters – 3G/4G BOSS System, critical in telecom, is centralized in headquarters with >100 million users. 2G BOSS System is still hosted by regions – Siloed environments cause low utilization and missed SLAs – Need data orchestration capability for agility and performance • Key Values – Performance improvements (compared to HDP) come from Session Scheduler (fine-grain scheduling) – Stability of HA support and enterprise capability in scale-out (original Greenplum solution is 2 duplicated clusters with 20 machines storing the same data, with significant wasted hardware) – Multiple tenants: the revenue sharing application uses the cluster only 6 days in a month, so the customer would like to build a sharable multi-tenant Spark cluster which can be used by other departments – Strong technical and commercial support [Architecture diagram: HDFS from HDP, Alluxio (orchestration layer), IBM Spectrum Conductor (multiple tenants), x86 Linux, Hive/Tez on YARN, Hive client, data import and data analysis, GPFS FPO (2 disks/host); management hosts, info hosts and compute hosts running Alluxio, HDFS, Spark and CwS components on a 10Gb network]
  • 49. Data Orchestration for Agility. Leading Telco serving 300+ million subscribers • Single namespace to access & address all data • Data local to compute accelerates workloads [Diagram: Spark and Spark ETL jobs on a Data Orchestration layer over multiple HDFS clusters]
  • 50. Open Source Started From UC Berkeley AMPLab 1000+ contributors & growing 4000+ Git Stars Apache 2.0 Licensed GitHub’s Top 100 Most Valuable Repositories Out of 96 Million Join the conversation on Slack slackin.alluxio.io
  • 51. DATA ORCHESTRATION SUMMIT November 7, 2019 | Computer History Museum | Mountain View, CA
  • 52. Thank You. Questions? Email me: dipti@alluxio.com Join the Alluxio Community www.alluxio.org | www.alluxio.com | Twitter: @Alluxio | Slack