Powering Data Science and AI with Apache Spark, Alluxio, and IBM

Achieving Separation of Compute and Storage
in a Cloud World
Yonggang Hu | Chief Architect at Spectrum Computing | IBM
Bin Fan | Founding Engineer and Vice President of Open Source | Alluxio

Yonggang Hu
Distinguished Engineer, Chief Architect at Spectrum
Computing, IBM
Email: yhu@ca.ibm.com
Yonggang has been working on distributed computing, grid,
cloud and big data for the past 20 years. Before joining
Spectrum Computing, Yonggang was Vice President and
Application Architect at JPMorgan Chase focusing on
computational analytics and application infrastructure.
Yonggang holds an MS in Computer Science from Peking
University and an MBA from Cornell University.
Today’s Speaker

Bin Fan
VP of Open Source at Alluxio
Email: binfan@alluxio.com
Bin Fan is the founding engineer and VP of Open Source at
Alluxio, Inc. Prior to Alluxio, he worked for Google to build the
next-generation storage infrastructure. Bin received his Ph.D.
in Computer Science from Carnegie Mellon University on the
design and implementation of distributed systems.
Today’s Speaker

Agenda
Why storage-independent compute? And Spectrum Conductor
Alluxio Technology Overview
Real-world Use Cases

[Figure: data volume over time; the gap between Available Data and Understood Data is labeled "Enterprise Amnesia"]
80 million wearable health devices will be available by 2017.
2.5 quintillion bytes of data generated daily by connected machines.
There will be 28 times more sensor-enabled devices than people by the year 2020.
25 gigabytes of data per hour is generated by a connected car. 90% of cars will be connected by 2020.
153 exabytes of healthcare data generated by devices in 2013. Increasing to 2,314 exabytes in 2020.
1.7 megabytes of data per second generated by every human being on the planet by 2020.

“We have many useless 2 to 10TB Hadoop clusters”
“Our Hadoop cluster is too big to upgrade – still running Spark 1.4”
“Why there is no data locality in my HDFS?”
“Data scientists cannot run analytics in the lake”
“I just want to run ETL. Why I have to copy to lake first?”
Data & Analytics Challenges
“Spark and AI workload are compute intensive”
“Too complex and costly to run and manage”
“Need independent scaling of compute and storage”
“Big batch job taking over the cluster. On-line query is too slow”
Cluster Sprawl – The Elephant in the Room

Apache Spark is revolutionizing big data
Benefits
• 10-100x the performance of MapReduce
• In-memory, batch/streaming ETL, ML
• Simplicity
• Faster application development
• Enables de-coupled compute and storage
The Apache Spark big data processing framework will
account for more than 37% of all big data spending by
2022, according to research by Wikibon.
Last 3 Years
(July 2016 – July 2019)


IBM contributions to Apache Spark
IBM Developer / © 2019 IBM Corporation
Total Spark contribution:
• Total: 1,195 Jira issues merged; covered areas: SQL, ML, PySpark, Spark core
• 148 commits merged in the last 12 months
• Currently we have 11 contributors with 4 committers
• Total LoC: 90K+ in main source and test code
• More than 65% are major issues; more than 95% are in the areas SQL, ML, PySpark, and Spark core;
and more than 50% are improvements and features.
In the latest release of Apache Spark (v2.4.0):
• IBM contributed about 100 commits and 6,586 LoC in source and test.
• Blog post highlighting the Spark team's contributions in this release:
https://developer.ibm.com/blogs/2018/11/29/ibm-continues-commitment-to-apache-spark/

Scaling Spark is challenging
• Inefficient workload management results in poor server utilization rates and throughput.
• Lack of security with open-source frameworks and applications introduces risk.
• Multiple Spark teams each with dedicated servers = wasted capacity and high administrative overhead.

IBM Spectrum Conductor unlocks the power in your infrastructure to build data pipelines at scale
Deliver faster results
• Maximize Spark performance with intelligent workload scheduling.
• 30 to 224% greater throughput than Spark with YARN, 25 to 88% greater than Spark with Mesos*
• Workload-aware, dynamic bursting to external clouds
Drive down TCO
• Eliminate cluster sprawl with a multitenant shared infrastructure.
• Get more out of current hardware by increasing utilization and throughput.
• Simultaneously run multiple versions and instances of Spark.
Simplify administration
• Streamline deployment and management
• Integrated monitoring, alerting, reporting, and diagnostics
• Quickly onboard users and Spark instances and versions.

Dedicated Resources
First, optimize the environment
Better performance at lower cost
• Dynamic resource sharing with guaranteed SLAs
• Ability to borrow CPU cores dynamically from other
tenants to meet job needs
• Enterprise-proven at scale:
• 5K hosts, 150K cores, >1B tasks/day
Secure Multitenant Shared Resources
Shared Resources
Dedicated Resource Silos
IBM Cloud
x86

Spectrum Conductor: flexible resource sharing and SLA control
[Figure: example resource plan. Consumers shown include Analytics-clusters, Risk analytics, CECL-prod, Credit-Bond, Model review, Spark-test, Big-Data-Env, R-Analytics, Notebook-Service, TensorFlow, and Docker-containers, each with numeric resource allocations.]
Unique “Resource Plan” capability to express flexible
resource sharing policies among applications (consumers)
and multiple resource pools
Hierarchical structure can reflect organizational structure and
application hierarchies
Notions of time-variant resource ownership, ranking, sharing
ratios, lend & borrow limits
Application/LOB tenants can borrow resources from others
(subject to policy) improving performance and maximizing
asset utilization
Introduction of prioritization for Spark and support of FCFS
Priority and Fairshare policies * (Spark Summit 2018)
Flexible pre-emption policies enable tenants to quickly reclaim
owned resources ensuring that service-level objectives and
business deadlines are met
Maximize utilization while ensuring tenant / LOB SLAs with unique resource sharing model
https://www.slideshare.net/databricks/dynamic-priorities-for-apache-spark-applications-resource-allocations-with-michael-feiman-and-shinnosuke-okada
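The ownership, lend, and borrow notions on this slide can be illustrated with a small allocation sketch. This is a hypothetical model written for illustration, not Spectrum Conductor's scheduler or API: the `Tenant` fields, the `allocate` function, and the tenant names and numbers are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    owned: int         # slots the tenant owns
    lend_limit: int    # most owned slots it lets others borrow
    borrow_limit: int  # most slots it may borrow from others
    demand: int        # slots it currently wants


def allocate(tenants):
    """Return {tenant name: slots granted} for one scheduling round."""
    # Step 1: every tenant first uses its own slots, up to its demand.
    grant = {t.name: min(t.demand, t.owned) for t in tenants}
    # Step 2: idle owned slots, capped by each lend limit, form a shared pool.
    pool = sum(min(t.owned - grant[t.name], t.lend_limit) for t in tenants)
    # Step 3: tenants with unmet demand borrow from the pool,
    # capped by their own borrow limit.
    for t in tenants:
        unmet = t.demand - grant[t.name]
        extra = min(unmet, t.borrow_limit, pool)
        grant[t.name] += extra
        pool -= extra
    return grant


tenants = [
    Tenant("risk-analytics", owned=100, lend_limit=50, borrow_limit=0, demand=20),
    Tenant("spark-test", owned=20, lend_limit=0, borrow_limit=40, demand=70),
]
print(allocate(tenants))
```

In this toy round the busy tenant is capped by its borrow limit rather than by the pool size. Time-variant ownership, ranking, and pre-emption (reclaiming lent slots) are deliberately left out.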

Superior Spark performance
and predictability
• 30 to 224% faster than YARN
• 25 to 88% faster than Apache Mesos
• Consistent and predictable, delivering 71%
relative standard deviation (RSD) in
interactive workloads
• YARN and Mesos are relatively unpredictable:
304% and 777% RSD, respectively
Throughput of Spark SMB-2 benchmark workload on various resource managers
(Jobs/hour - higher is better)
Case                                    IBM Spectrum Conductor   Apache YARN v2.7.3   Apache Mesos v1.0.1
Case 1: Sync interactive multi-user     2607                     1673                 1660
Case 2: Asynchronous batch multi-user   327                      253                  202
Case 3: Mixed multi-user                899                      582                  478
Case 4: Mixed multi-tenant              574                      177                  458
Audited benchmark results: https://stacresearch.com/news/2017/05/19/IBM170405
Deliver faster results
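Relative standard deviation (RSD), the predictability metric quoted on this slide, is the standard deviation expressed as a percentage of the mean. A minimal sketch follows; the sample durations are made-up values, and the use of the population (rather than sample) standard deviation is an assumption, since the slide does not say which the audited benchmark used.

```python
import statistics


def relative_standard_deviation(samples):
    """RSD as a percentage: population standard deviation divided by the mean."""
    mean = statistics.mean(samples)
    return 100.0 * statistics.pstdev(samples) / mean


# Hypothetical job durations in seconds (not benchmark data).
durations = [2, 4, 4, 4, 5, 5, 7, 9]
print(relative_standard_deviation(durations))
```

A lower RSD means individual job times cluster more tightly around the average, which is what the slide means by "predictable".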

Faster, more predictable analytics
https://www.slideshare.net/databricks/deep-dive-into-apache-spark-multiuser-performance-michael-feiman-mikhail-genkin-and-peter-lankford

Faster Time to Results – GPU Support
Accelerate Spark applications with GPUs
Presented at Spark Summit San Francisco June 2016
Conductor scheduler interfaces with Spark to ensure that GPU resources are assigned
to the applications that can use them
Workload Management
Spark Application
Session Scheduler
GPU resources
CPU resources
https://www.slideshare.net/sparktc/gpu-support-in-spark-and-gpucpu-mixed-resource-scheduling-at-production-scale

To provide enhanced user experiences
Researcher
HPC
Instance #3
IT or Data
Warehouse
ETL / Reporting
Instance #4
Risk Analytics
LOB
Instance #5
Administrator
Compute Nodes (Linux hosts)
Data scientist
Customer behavior...
Trend analysis...
Instance #2
Instance #1
LOB
Marketing…
Fraud Detection…
Broader use, better business value
• Logical view - each group has its own Spark cluster
• Right resources dynamically allocated to each user
based on workload characteristics and SLA priority
• Multitenancy ensures data and application isolation

Embrace cloud, transparently leveraging cloud providers
End-to-end hybrid cloud, policy-based bursting to multiple clouds with automated data staging and cloud
staging, auto-deploy containerized workloads on IaaS or cloud-native (K8s) cloud services
Google
IBM Cloud
Microsoft
Amazon
User applications & workloads
Spectrum Computing forwards
specific workloads to multiple clouds
based on site defined policies and
multi-cluster
Spectrum Computing works with
Spectrum Scale to pre-stage
necessary data between on-premises and
cloud-based Spectrum Scale
environments before cloud hosts are
auto-provisioned
Spectrum Computing Resource
Connector dynamically resizes the
pool of hosts in the cloud based on
workload demand and policies
IBM Aspera can be used as the underlying transport

Customer example:
Augment Data Lake with a Spark
Analytics Grid
…
Spectrum Conductor
Optimized
Compute Cluster Read/Write Data
X86 ‘Data Lake’ and
Legacy Hadoop Apps
HDFS Connector
Linux on Power or x86
HDFS
POSIX
Solution
✓ Provide optimized high performance, multi-user,
shared environment for Spark, Python, H2O, AI
✓ Connect to HDFS, EDW, Spectrum Scale, and
Object Stores
✓ Drive up utilization with SLA policy driven sharing
of compute resources
✓ Provide the foundation for AI with
GPU acceleration
Business Problem
• Mounting data lake costs and risks
• Data lake not well suited to in-memory workloads
• Low compute utilization and limited flexibility
• Need to introduce new analytics applications
Enterprise Data
Warehouse

Business Problem
• 500+ users
• Large volumes of portfolio, trade and
market data
• Diverse user groups and analytics requirements
• Interactive queries and batch reports needed
• Sub-second response times necessary
• Current data warehouse and Hadoop
architectures limited on price-performance,
scalability and SLAs to users
Solution
✓ Performance-optimized Spark-based
environment
✓ Shared in-memory cache and Shared Spark
context
✓ Intelligent workload management to meet
user SLAs
✓ Multi-tenancy to enable security and isolation
✓ High performance resource management to
increase infrastructure efficiency
…
Spark
Cache
HDFS/ Spectrum Scale
Traders
• Need visibility via custom
desktop application
• Analyze risk and trade
impact to determine
trading strategy
Risk Group/ CRO
• Understand risk across
the entire portfolio
Finance/ CFO
• P&L and financial
impact for the firm
Business Analysts &
Data Scientists
• Analytics
Customer example:
Risk Aggregation, Reporting and
Analytics Grid

Business Problem
• Multiple ETL silos unable to meet
the business SLA
• Multiple tools required for different data stores
• Mounting costs trying to keep up with the LOB
demand for high quality data
Solution
✓ High performance data analytics & warehouse
✓ Storage agnostic
✓ Distributed, in-memory ETL Engine
✓ Faster time to results
✓ Easy implementation & administration
✓ Improved resource utilization
✓ Life cycle management
• Multiple concurrent instances & versions of Spark
• Deploy new versions in minutes
Customer example:
Consolidate and Optimize ETL
Distributed
Parallel
ETL Engine
Data / Storage Agnostic

The challenges of independent scaling for data-driven workloads
Data Locality
Data Accessibility
Data Abstraction
Data is no longer local to compute, so
workload processing time increases,
particularly in hybrid cloud deployments
Data is in multiple storage systems in multiple
locations. Highly complex when all compute
frameworks talk to all storage systems
Data can still only be accessed using the
specific storage system APIs

STORAGE
COMPUTE
Truly independent scaling of the data stack
Data Locality, Data Accessibility, Data Abstraction
A new layer emerges between Compute & Storage

The Alluxio Story
2013: Originated as the Tachyon project at UC Berkeley AMPLab by
then Ph.D. student and now Alluxio CTO, Haoyuan (H.Y.) Li.
2015: Open source project established and company founded to
commercialize Alluxio
Goal: Orchestrate Data at Memory Speed for the Cloud
for data-driven apps such as Big Data Analytics, ML and AI.
2018
2019
2019 Top 10 Big Data
2019 Top 10 Cloud Software

Fast-growing Open Source Community
4000+ GitHub Stars
1000+ Contributors
Join the community on Slack
alluxio.io/slack
Apache 2.0 Licensed
Contribute to source code
github.com/alluxio/alluxio

Virtual Unified File System
Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface
HDFS Driver, Swift Driver, S3 Driver, NFS Driver

Flexible APIs to Interact with data in Alluxio
Spark
> rdd = sc.textFile("alluxio://master:port/myInput")
Presto
CREATE SCHEMA hive.web
WITH (location = 'alluxio://master:port/my-table/')
POSIX
$ cat /mnt/alluxio/myInput
Java
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));

Data Accessibility via popular APIs
Convert from Client-side Interface to native Storage Interface
Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface
HDFS Driver, Swift Driver, S3 Driver, NFS Driver

Data Abstraction via Unified Namespace
Enables effective data management across different Under Store
- Uses Mounting with Transparent Naming
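The idea of mounting several under stores into one namespace can be sketched as a longest-prefix lookup. This is a simplified illustration, not Alluxio's implementation: the `MountTable` class, the mount points, and the under-store URIs are hypothetical.

```python
class MountTable:
    """Maps paths in one logical namespace onto URIs in different under stores."""

    def __init__(self):
        self._mounts = {}  # logical mount point -> under-store URI prefix

    def mount(self, mount_point, ufs_uri):
        # Normalize so "/" stays "/" and other points lose a trailing slash.
        self._mounts[mount_point.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Translate a logical path using the longest matching mount point."""
        best = None
        for point in self._mounts:
            prefix = point if point.endswith("/") else point + "/"
            if path == point or path.startswith(prefix):
                if best is None or len(point) > len(best):
                    best = point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        suffix = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{suffix}" if suffix else base


table = MountTable()
table.mount("/", "hdfs://namenode:8020/alluxio")     # hypothetical root under store
table.mount("/s3/logs", "s3://example-bucket/logs")  # hypothetical nested mount
print(table.resolve("/data/a.csv"))
print(table.resolve("/s3/logs/2019/part-0"))
```

Applications keep using the logical path; only the mount table knows which storage system actually holds the bytes.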

Data Locality via Intelligent Multi-tiering
Local performance from remote data using multi-tier storage
Hot Warm Cold
RAM SSD HDD
Read & Write Buffering
Transparent to App
Policies for pinning,
promotion/demotion, TTL
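Promotion and demotion across tiers can be sketched with a small least-recently-used model. This is an illustrative toy, not Alluxio's tiered-storage code: the `TieredCache` class and its capacities are hypothetical, and pinning and TTL are not modeled.

```python
from collections import OrderedDict


class TieredCache:
    """Blocks live in ordered tiers (fastest first); a full tier demotes its LRU block."""

    def __init__(self, capacities):
        # capacities: list of (tier name, max blocks), fastest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # evicted from the slowest tier; the under store still has it
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data
        if len(blocks) > cap:
            victim, victim_data = blocks.popitem(last=False)  # least recently used
            self._insert(level + 1, victim, victim_data)      # demote one tier down

    def write(self, block_id, data):
        for _, _, blocks in self.tiers:
            blocks.pop(block_id, None)  # drop any stale copy
        self._insert(0, block_id, data)

    def read(self, block_id):
        """Return cached data and promote the block to the top tier, or None on a miss."""
        for _, _, blocks in self.tiers:
            if block_id in blocks:
                data = blocks.pop(block_id)
                self._insert(0, block_id, data)  # promote
                return data
        return None

    def tier_of(self, block_id):
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None


cache = TieredCache([("RAM", 1), ("SSD", 1), ("HDD", 1)])
cache.write("block1", b"a")
cache.write("block2", b"b")     # block1 is demoted to SSD
print(cache.tier_of("block1"))
cache.read("block1")            # block1 is promoted, block2 is demoted
print(cache.tier_of("block1"), cache.tier_of("block2"))
```

The application only ever calls `read` and `write`; which tier serves the block is invisible to it, which is the "transparent to app" property on the slide.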

Alluxio Reference Architecture
Alluxio
Master
Zookeeper /
RAFT
Standby
Master
Alluxio
Worker
Alluxio
Worker
Alluxio
Client
RAM / SSD / HDD
RAM / SSD / HDD
Under Store 1
Under Store 2
Application
WAN
Alluxio
Client
Application

Data Flow In Alluxio
1. Applications Read/Write data via the Alluxio Client
2. Read Scenarios
• Data not in Alluxio (i.e. first time, or no cache)
• Data on same node as client
• Data on different node from client
3. Write Scenarios
• Write only to Alluxio
• Write only to Under Store
• Write synchronously to Alluxio and Under Store
• Write to Alluxio and asynchronously write to Under Store
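The read and write scenarios above can be sketched with two dictionaries standing in for Alluxio storage and the under store. The enum names follow Alluxio's write-type options (`MUST_CACHE`, `THROUGH`, `CACHE_THROUGH`, `ASYNC_THROUGH`); the `Store` class and everything inside it are hypothetical stand-ins for illustration, not Alluxio client code.

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "write only to Alluxio"
    THROUGH = "write only to the under store"
    CACHE_THROUGH = "write synchronously to Alluxio and the under store"
    ASYNC_THROUGH = "write to Alluxio, persist to the under store later"


class Store:
    def __init__(self):
        self.cache = {}    # stands in for Alluxio worker storage
        self.under = {}    # stands in for the under store
        self.pending = []  # paths waiting for asynchronous persistence

    def write(self, path, data, write_type):
        if write_type is not WriteType.THROUGH:
            self.cache[path] = data
        if write_type in (WriteType.THROUGH, WriteType.CACHE_THROUGH):
            self.under[path] = data
        if write_type is WriteType.ASYNC_THROUGH:
            self.pending.append(path)

    def flush(self):
        """Persist queued ASYNC_THROUGH writes; a real system does this in the background."""
        while self.pending:
            path = self.pending.pop(0)
            self.under[path] = self.cache[path]

    def read(self, path):
        """Serve from cache when possible; otherwise fetch from the under store and cache it."""
        if path in self.cache:
            return self.cache[path]
        data = self.under[path]  # first read, or data not cached
        self.cache[path] = data
        return data


store = Store()
store.write("/report", b"q3", WriteType.ASYNC_THROUGH)
print("/report" in store.under)  # not yet persisted
store.flush()
print("/report" in store.under)  # persisted after the background step
```

The sketch collapses the "same node" and "different node" read cases into a single cache hit; in a real cluster the difference is whether the hit is served locally or over the network from another worker.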

Accessing Alluxio Data From Spark
Writing Data: Write to an Alluxio file
Reading Data: Read from an Alluxio file

Code Example for Spark RDDs
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)

Code Example for Spark DataFrames
Writing to Alluxio: df.write.parquet(alluxioPath)
Reading from Alluxio: df = spark.read.parquet(alluxioPath)
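The DataFrame round trip can be wrapped as a small PySpark helper. This is a sketch under stated assumptions: `spark` is an existing `SparkSession`, the Alluxio client jar is already on the Spark driver and executor classpaths, and the host name `master` is a placeholder. The helper names `alluxio_uri` and `parquet_round_trip` are hypothetical; 19998 is Alluxio's default master RPC port.

```python
def alluxio_uri(path, host="master", port=19998):
    """Build an alluxio:// URI; the host is a placeholder for your Alluxio master."""
    return f"alluxio://{host}:{port}/{path.lstrip('/')}"


def parquet_round_trip(spark, df, path):
    """Write a DataFrame to Alluxio as Parquet, then read it back.

    `spark` is an existing pyspark.sql.SparkSession and `df` a DataFrame
    created from it; neither is constructed here.
    """
    uri = alluxio_uri(path)
    df.write.mode("overwrite").parquet(uri)  # overwrite so reruns do not fail
    return spark.read.parquet(uri)


print(alluxio_uri("/myInput"))  # alluxio://master:19998/myInput
```

Only the URI scheme changes compared with reading from HDFS or S3 directly; the rest of the Spark job stays the same.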

Sharing Data via Memory
Storage Engine &
Execution Engine
Same Process
• Two copies of data in memory – double the memory used
• Sharing Slowed Down by Network / Disk I/O
Spark Compute
Spark
Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Spark Compute
Spark
Storage
block 1
block 3

Sharing Data via Memory
Storage Engine &
Execution Engine
Different process
• Half the memory used
• Sharing Data at Memory Speed
Spark Compute
Spark Storage
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
block 1
block 3 block 4
Spark Compute
Spark Storage

Data Resilience During Crash
Spark Compute
Spark Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Storage Engine &
Execution Engine
Same Process

CRASH
Spark Storage
block 1
block 3
HDFS / Amazon S3
block 1
block 3
block 2
block 4
• Process Crash Requires Network and/or Disk I/O to Re-read Data
Storage Engine &
Execution Engine
Same Process

CRASH
HDFS / Amazon S3
block 1
block 3
block 2
block 4
Storage Engine &
Execution Engine
Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data

Spark Compute
Spark Storage
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
block 1
block 3 block 4
Storage Engine &
Execution Engine
Different process

• Process Crash – Data is Re-read at Memory Speed
HDFS / Amazon S3
block 1
block 3
block 2
block 4
HDFS
disk
block 1
block 3
block 2
block 4
Alluxio
block 1
block 3 block 4
CRASH Storage Engine &
Execution Engine
Different process

Performance Tuning Tips
§ Data Locality
§ Ensure Spark Executor Locality
§ Ensure Spark Task Locality
§ Prioritize Locality
§ Load Balancing
§ Smaller Block Size
§ Tune Executor Number
§ DeterministicHashPolicy to load data from UFS
Read more: https://dzone.com/articles/top-10-tips-for-making-the-spark-alluxio-stack-bla
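The idea behind a deterministic hash policy is that every client maps a given block to the same small set of workers, so a cold block is fetched from the UFS by those workers rather than by whichever worker each client happens to pick. The sketch below illustrates that idea only; it is not Alluxio's `DeterministicHashPolicy` code, and the `candidate_workers` function, the `shards` parameter name, and the worker names are assumptions.

```python
import hashlib


def candidate_workers(block_id, workers, shards=1):
    """Pick `shards` workers for a block, determined only by the block ID."""
    ordered = sorted(workers)  # same order on every client
    digest = hashlib.sha256(str(block_id).encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(ordered)
    count = min(shards, len(ordered))
    return [ordered[(start + i) % len(ordered)] for i in range(count)]


workers = ["worker-a", "worker-b", "worker-c", "worker-d"]  # hypothetical hosts
print(candidate_workers(42, workers, shards=2))
```

Raising `shards` spreads a hot block over more workers at the cost of more UFS reads, which is the load-balancing trade-off the tip refers to.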

Solution: Analytic Data Lake with ETL and Reporting for Large Telco
• Business Requirement
– Centralize the IT infrastructure to headquarters and
collect all data from regions to headquarters
– 3G/4G BOSS System, critical in telecom, is centralized
in headquarters with >100 million users. 2G Boss
System is still hosted by regions
– Siloed environments cause low utilization and missed SLAs
– Need data orchestration capability for agility and
performance
• Key Values
– Performance improvements (compared to HDP) come
from Session Scheduler (fine-grained scheduling)
– Stability of HA support and Enterprise capability in
scale out (original Greenplum solution is 2 duplicated
clusters with 20 machines storing same data with
significant wasted hardware)
– Multiple tenants: the revenue-sharing application uses the
cluster only 6 days a month, so the customer wants a
shareable multi-tenant Spark cluster that other
departments can also use
– Strong technical and commercial support
HDFS from HDP
Alluxio (Orchestration layer)
IBM Spectrum Conductor (multiple tenants)
x86 Linux
HiveTez on Yarn
Hive client
Data import Data analysis
GPFS FPO (2 disks/host)
Alluxio M
HDFS M
CwS MCwS MC
Spark M
Hive Meta
mysql
Spark M
Spark M
Alluxio
MC
Spark D
Spark E
Alluxio w
HDFS D
Spark D
Spark E
Alluxio w
HDFS D
Spark D
Spark E
Alluxio w
HDFS D
Spark E
Alluxio w
HDFS D
Spark E
Alluxio w
HDFS D
…
…
Management hosts Info hosts
10Gb
Alluxio
MC
HDFS MC
Hive
client
Compute hosts
CwS C CwS C CwS C CwS C

DATA ORCHESTRATION
SPARK
HDFS
SPARK
ETLSPARK
§ Single namespace to access & address all data
§ Data local to compute accelerates workloads
Data Orchestration for Agility
Leading Telco serving 300+ million subscribers
HDFS HDFS HDFS

Open Source Started From UC Berkeley AMPLab
1000+ contributors &
growing
4000+ GitHub Stars
Apache 2.0 Licensed
GitHub’s Top 100 Most
Valuable Repositories
Out of 96 Million
Join the
conversation
on Slack
slackin.alluxio.io

DATA ORCHESTRATION SUMMIT
November 7, 2019 | Computer History Museum | Mountain View, CA
Organized by
Register Here!

Thank You
Questions? Email me: dipti@alluxio.com
Join the Alluxio Community
www.alluxio.org | www.alluxio.com | Twitter: @Alluxio | Slack
