Key trends in Big Data and new reference architecture from Hewlett Packard Enterprise / Gilles Noisette (Hewlett Packard)

A modern, flexible approach to Hadoop implementation
HPE Big Data Reference Architecture
Gilles Noisette
HPE EMEA Big Data Center Of Excellence
November 2015

Agenda
• Big Data IT infrastructure trends
• Hadoop Evolution & Architecture trends
• Hadoop YARN Labelling
• Hadoop Storage Tiering
• New HPE Architecture approach to Big Data
• HPE Big Data Reference Architecture
• Scaling Hadoop more efficiently
• HPE BDRA Components
• HPE BDRA in a virtualized context
• HPE Big Data Architecture long term view

IT infrastructures must evolve to handle Big Data demands
Challenges
• Multiple silos with multiple copies of the same data
• Difficult to standardize on a consistent server architecture
• Less elastic than other virtualized or converged infrastructure
• Large scale makes density, cost and power problematic

The Pace of Change
And how people buy Hadoop is changing too…

Hadoop YARN Labelling
Running applications on a particular set of nodes
YARN Labelling (Node-labels / Hadoop 2.6 / jira YARN-796)
Ability to create groups of similar nodes so that different types of
applications, with different workloads, each run on the most appropriate
group of nodes
• Admin tags nodes with labels (e.g.: GPU, Storm)
− One node can have more than one label (e.g.: GPU, m710)
• Applications can include labels in container requests
Enabling the next Generation of Hadoop Applications . . .
[Diagram: an Application Master requests "I want a GPU"; the NodeManagers are labelled [Storm], [GPU, m710] (HPE Moonshot cartridge) and [Analytic, XL170r] (HPE Apollo blades)]
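
The label matching described on this slide can be illustrated with a toy model. This is a minimal sketch in plain Python, not the YARN scheduler or its API: the node names and the `eligible_nodes` helper are hypothetical, and only the labels are taken from the slide. It shows that a container request carrying a label is restricted to the nodes the admin tagged with that label.

```python
# Toy model of label-based placement; this is NOT the YARN API.
# Node names are hypothetical; the labels are the ones shown on the slide.
NODE_LABELS = {
    "node-1": {"Storm"},
    "node-2": {"GPU", "m710"},         # HPE Moonshot cartridge
    "node-3": {"Analytic", "XL170r"},  # HPE Apollo blade
}


def eligible_nodes(requested_label, node_labels=NODE_LABELS):
    """Return the sorted names of nodes tagged with the requested label."""
    return sorted(
        name
        for name, labels in node_labels.items()
        if requested_label in labels
    )
```

Calling `eligible_nodes("GPU")` keeps only the node tagged GPU, which mirrors the "I want a GPU" request in the diagram; a label that no node carries yields an empty list.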

YARN Labels are used in production
YARN Labelling case studies
Vinod Vavilapalli – @Tshooter
Yahoo! uses machines with GPUs on #Hadoop clusters (#YARN) to model
'beautiful' images on Flickr. #hadoopsummit
1:43 AM - 16 Apr 2015
Vinod Vavilapalli – @Tshooter
.@pcnudde talking about #Yahoo using custom #Hadoop #YARN apps together
with Node labels / High CPU machines for learning. #hadoopsummit
1:49 AM - 16 Apr 2015
Yahoo uses YARN labels
eBay clusters use YARN labels to
• Separate Machine Learning workloads from regular workloads
• Restrict licensed software to specific machines
• Enable GPU workloads
• Separate organizational workloads
Mayank Bansal, eBay

Hadoop Storage tiering
Hadoop Architecture trends
HDFS Tiering / Heterogeneous Storage Tiers (HDFS-2832)
Allows a single cluster to have multiple storage tiers such as ARCHIVE, DISK,
SSD, RAM-disk.
Awareness of storage media allows HDFS to make better decisions about the
placement of block data with input from applications. The distribution of
replicas can be based on the data's performance and durability requirements.
• Phase 2:
–HDFS-5682 - Application APIs for heterogeneous storage
–HDFS-7228 - SSD storage tier
–HDFS-5851 - Memory as a storage tier
HDFS Archival Storage Design (HDFS-6584)
– Introduces a new concept of storage policies. For accommodating future storage
technology and different cluster characteristics, cluster administrators will be able to
modify the predefined storage policies and/or define custom storage policies.
– Data policy names: Very Hot, Hot, Warm, Lukewarm, Cold

eBay uses tiered storage for its Hadoop cluster
HDFS Tiering case study
• The 40 PB / 2000-node cluster was getting full
HDFS Tiering features
• Data resides on the same cluster in standard HDFS
• Data can easily move to and from the archive tier
• Tiered storage is operated using storage types and
storage policies
• Archival policy is based on access pattern
– Antony Benoy, eBay
[Diagram: a single HDFS cluster with a DISK tier (40 PB / 2000 nodes) and an ARCHIVAL tier (10 PB / 48 nodes)]

Hadoop gets asymmetric
but I thought we were taking the work to the data…
YARN Labels
Allows applications running in YARN containers to be constrained to
designated nodes in the cluster
[Diagram: node labels (L1) isolating applications A and B on groups of nodes]
HDFS Tiering
Allows the creation of pools of storage for SSD, HDD, Archive and RAM-disk,
leveraging different server configurations
• Hot: all replicas on DISK
• Warm: 1 replica on DISK, others on ARCHIVE
• Cold: all replicas on ARCHIVE
[Diagram: Hadoop cluster with a pool of DISK storage and a pool of ARCHIVE storage]
What about data locality?
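
The Hot, Warm and Cold placement rules on this slide can be written down as a small function. This is a sketch of the rules as the slide states them, not HDFS code: `replica_storage_types` is a hypothetical helper, and the default replication factor of 3 is an assumption made for illustration.

```python
# Sketch of the Hot / Warm / Cold rules from the slide; this is NOT HDFS code.
# replica_storage_types is a hypothetical helper; replication=3 is an assumption.
def replica_storage_types(policy, replication=3):
    """Return the storage type used for each replica under a policy name."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    name = policy.lower()
    if name == "hot":
        # All replicas on DISK
        return ["DISK"] * replication
    if name == "warm":
        # 1 replica on DISK, the others on ARCHIVE
        return ["DISK"] + ["ARCHIVE"] * (replication - 1)
    if name == "cold":
        # All replicas on ARCHIVE
        return ["ARCHIVE"] * replication
    raise ValueError(f"unknown policy: {policy}")
```

With three replicas, a Warm block keeps one copy on DISK and two on ARCHIVE, so a compute task reading it finds at most one replica on the DISK tier, which is where the data locality question comes from.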

New complementary approach to address Big Data demands
Storage Optimized Servers

Benefits of HPE Big Data Reference Architecture
HPE Moonshot and Apollo servers address a variety of enterprise big data needs
• Cluster consolidation: multiple big data environments can directly access a shared pool of data
• Flexibility to scale: scale compute and storage independently
• Maximum elasticity: rapidly provision compute without affecting storage
• Breakthrough economics: significantly better density, cost and power through workload optimized components

DFSIO testing on Big Data Reference architecture
Better numbers with optimized IO Servers for HDFS

Hadoop and its ecosystem take advantage of the BDRA
[Diagram: Hadoop ecosystem components, including Impala, connected through Ethernet network switches (East-West networking)]

HPE Hadoop Traditional vs HPE Big Data Reference Architecture
Traditional architecture versus Big Data Reference Architecture
• 2x Hadoop MapReduce performance with the same footprint
• 2.5x HBase performance with the same footprint
• 2x higher density
• 2.4x memory density
• 46% less power (Watts)
Note: Comparison configuration is ProLiant DL380 Gen9 servers

1.5PB configuration example
Comparable Hadoop performance and raw compute (SpecInt) power
BDRA compared to 2U rackmount servers:
• Acquisition cost: 3% lower
• Power: 54% lower
• Density (total rack U): 2x density
• 5-year power/cooling savings (assuming $0.20/kWh): $472K
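
The $472K figure can be turned back into an average power reduction as a sanity check. The sketch below is back-of-the-envelope arithmetic under stated assumptions: continuous operation, a flat $0.20/kWh rate over the whole five years, and 365-day years. The helper name is hypothetical and the result is derived here, not quoted from the slides.

```python
# Back-of-the-envelope check; the result is derived, not stated in the slides.
HOURS_PER_YEAR = 24 * 365  # assumption: leap days ignored


def implied_power_reduction_kw(savings_usd, usd_per_kwh, years):
    """Average power reduction (kW) implied by an energy cost saving."""
    energy_kwh = savings_usd / usd_per_kwh          # total energy saved
    return energy_kwh / (years * HOURS_PER_YEAR)    # spread over all hours


reduction_kw = implied_power_reduction_kw(472_000, 0.20, 5)
```

Under these assumptions the saving corresponds to roughly 54 kW of combined power and cooling draw avoided around the clock for the 1.5 PB configuration.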

Independent scaling of compute and storage
[ HPE ProLiant DL380 Gen9 ] vs [ HPE Moonshot for Computing + HPE ProLiant Apollo 4200 for Storage ]
Three BDRA configurations, from HOT to COLD, relative to the Traditional Architecture:
• 2.8x compute, 97% of the storage capacity, 4x the memory
• 1.6x compute, 1.5x the storage capacity, 2.5x the memory
• 90% of the compute, 2.1x the storage capacity, 1.5x the memory

Hadoop performance density more than 2 times better, power consumption 0.5x
Scale-Out Building blocks
Compute optimized servers
• HPE Moonshot System with 45 x m710 Compute nodes
• HPE Apollo 2200 with 4 x XL170r Gen9 High Compute nodes
Storage optimized servers
• HPE Apollo Scalable System: cost-effective industry standard storage server, purpose built for big data, with converged infrastructure that offers high density energy-efficient storage
HPE Network Switches
• East-West Networking

HPE Moonshot 1500
45 hot-plug cartridges
• 1-node cartridges = 45 servers in a chassis
• 4-node cartridges = 180 servers in a chassis
2 internal switches
• HP Moonshot-45G (45 x 1Gb ports)
• HP Moonshot-180G (180 x 1Gb ports)
• HP Moonshot-45XG (45 x 10Gb ports)
Cartridges and target workloads
• m300: full web infrastructure in a single chassis, dedicated hosting
• m350: web hosting, 180 servers in 4.3U
• m400: web cache, 64-bit ARM
• m700: remote PCs, XenDesktop
• m710p: Big Data, Hadoop, video transcoding
• m800: real-time analytics, telecom, finance
45 low-power Hadoop compute nodes per enclosure!
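
The chassis figures on this slide follow from simple multiplication. The sketch below restates them, assuming the 4.3U height quoted for the m350 applies to the whole Moonshot 1500 chassis; the function names are hypothetical and the servers-per-U value is derived here, not quoted from the slides.

```python
# Density arithmetic from the slide's figures; function names are hypothetical.
CARTRIDGES_PER_CHASSIS = 45
CHASSIS_HEIGHT_U = 4.3  # assumption: the "4.3U" quoted for the m350 is the chassis height


def servers_per_chassis(nodes_per_cartridge):
    """Servers in one chassis for a given cartridge type (1-node or 4-node)."""
    return CARTRIDGES_PER_CHASSIS * nodes_per_cartridge


def servers_per_rack_unit(nodes_per_cartridge):
    """Server density per rack unit for a fully populated chassis."""
    return servers_per_chassis(nodes_per_cartridge) / CHASSIS_HEIGHT_U
```

A chassis of 1-node cartridges such as the m710 gives 45 servers, and 4-node cartridges give 180, or close to 42 servers per rack unit.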

Big data Storage Node
HPE Apollo 4200 - Bringing Big Data storage server density to enterprise

Big data Storage Node for Backup or Archival
HPE Apollo 4510 - Very High density Big Data storage server
Scalable density, lower TCO, workload optimized
Rack-scale storage server density
• Up to 5.44 PB in a 42U rack
Cost effective
• 68 LFF HDDs/SSDs in a 4U server chassis for low-cost, power- and space-efficient solutions
Configuration flexibility
• Balance capacity, cost and throughput with flexible options for disks, CPUs, I/O and interconnects
Rack-scale extreme density: 5.44 PB per rack!
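
The rack-level capacity can be cross-checked against the per-chassis figures. This is a sketch under stated assumptions: the rack is filled only with 4U chassis, every bay holds a drive, and 1 PB is counted as 1000 TB. The per-drive capacity it produces is inferred here, not stated in the slides.

```python
# Cross-check of the rack capacity; the per-drive size is inferred, not from the slides.
RACK_HEIGHT_U = 42
CHASSIS_HEIGHT_U = 4
DRIVES_PER_CHASSIS = 68
RACK_CAPACITY_TB = 5440  # 5.44 PB, assuming 1 PB = 1000 TB

# Assumption: the rack holds only storage chassis, with no switches or spare units.
chassis_per_rack = RACK_HEIGHT_U // CHASSIS_HEIGHT_U
drives_per_rack = chassis_per_rack * DRIVES_PER_CHASSIS
implied_tb_per_drive = RACK_CAPACITY_TB / drives_per_rack
```

Ten chassis fit in 42U, giving 680 drive bays, so the 5.44 PB figure is consistent with 8 TB drives in every bay.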

HPE BDRA in a Virtualized context
Usage example

HPE BDRA used for multi-tenancy or Hadoop as a Service
Multi-tenancy and Hadoop as a Service are made easier by separating the
data processing service from the storage management service, as this brings:
– Better workload isolation between YARN applications
– More flexibility by scaling compute and storage independently
– Full elasticity on the computing side
– Rapid provisioning and decommissioning of compute without affecting storage
Often based on a virtualized environment

HPE BDRA used in a fully Elastic Virtualized environment
Compute and Storage nodes are virtualized in a different manner
[Diagram: virtualization hosts running Hadoop compute node VMs (BDRA compute nodes) and a host VM running the Hadoop DataNode (BDRA storage node); labels include VMDK, Ext4 and 3PAR F400]

Summary and
HPE Big Data Architecture long term view

– The HPE BDRA is a complementary Hadoop reference architecture that brings
• Elasticity: extreme elasticity brought to Hadoop
• Flexibility: adaptive architecture that makes IT more responsive
• Efficiency: scale compute and storage independently
– It takes advantage of new Hadoop trends and features like
• Hadoop YARN Labels
• Hadoop HDFS Tiering
– The target customers are
• Mature Hadoop customers who want to consolidate clusters
• People who need virtualization, multi-tenancy or elasticity, or who want to
build a smart Data Lake
• People who want to optimize density and power consumption
(breakthrough economics)
– The BDRA works with fully standard Hadoop stacks (no patches, nothing proprietary)
• Cloudera Enterprise 5
• Hortonworks Data Platform 2
• MapR M5

HPE BDRA Optimized Compute & Storage nodes
Support multiple compute and storage blocks

Converged Infrastructure benefits for Big Data
Hadoop Node Labels feature (jira YARN-796)
• Combined with the HPE Big Data Reference Architecture, compute nodes
can be dynamically assigned as there is no need for data repartitioning
• HPE contributed IP into the Hadoop trunk, working with Hortonworks
• Allows scheduling of YARN containers to specific pools of nodes

HPE BDRA CI for Big Data long term view
Evolve to support multiple compute and storage blocks
Multi-temperature storage using HDFS Tiering and ObjectStores
Workload Optimized compute nodes to accelerate various big data software

Thank you!
Learn more about how your organization can benefit from
HPE Big Data Reference Architecture: Overview
HPE Big Data Reference Architecture: Hortonworks implementation
HPE Big Data Reference Architecture: Cloudera implementation
HPE Big Data Reference Architecture: MapR implementation
Running HBase on the HPE Big Data Reference Architecture
http://www.hpe.com/go/hadoop
