© Copyright 2014 EMC Corporation. All rights reserved.
EMC ViPR HDFS Data Service Technical Overview
VIRTUALIZE EVERYTHING, COMPROMISE NOTHING
Disruptive / Opportunistic IT Trends
Mobile Cloud Big Data Social
T R U S T
Mainframe, Mini Computer (Terminals): MILLIONS OF USERS, THOUSANDS OF APPS
LAN/Internet, Client/Server (PC): HUNDREDS OF MILLIONS OF USERS, TENS OF THOUSANDS OF APPS
Mobile, Cloud, Big Data, Social (Mobile Devices): BILLIONS OF USERS, MILLIONS OF APPS
Source: IDC, 2013
The Big Data Economy
More data sources, richer content, longer utility
40 ZB
Source: IDC 2012 Digital Universe Study
The Big Data Potential
Significant financial value across many verticals
US Retail
• 60+% increase in net margin possible
• 0.5-1% annual productivity growth
US Healthcare
• $300 billion value per year
• 0.7% annual productivity growth
Manufacturing
• Up to 50% decrease in product development, assembly costs
• Up to 7% reduction in working capital
Global personal location data
• $100 billion+ revenue for service providers
• Up to $700 billion value to end users
Source: “Big Data: The Next Frontier for Innovation, Competition, and Productivity”, McKinsey Global Institute
The Challenges to Widespread Adoption
Supporting 3rd platform apps with 2nd platform infrastructure
• How to move from the lab to production?
– Trusting an open source Hadoop distribution
– HDFS not enterprise grade
– Analytics on existing data?
• What’s the risk?
– Dedicated cluster requires significant investment
– ROI? Does the data have value?
• What are the costs?
– Costs increase as my dedicated analytics cluster scales
– Bandwidth and network costs of moving data to the cluster
Big Data Storage Requirements
In-place analytics and protection of all data types
• Data Unification:
– Big Data storage must support structured, semi-structured, and unstructured data types.
• In-Place Analytics:
– Analytics, compute workloads need to execute where the data live.
• Data Compliance:
– More sources of data, more volume, velocity, etc. exacerbate compliance and long-term retention requirements
ViPR Data Services Overview
ViPR Data Services
Data Services that Span Arrays and Support Hybrid Data Types
• Storage services at cloud scale
– Built in software
– Layered over both traditional and new storage devices
• Object and HDFS data services
– Many more to follow, at regular intervals
– Open API for 3rd party development
• Unified platform
– Data services can be used as different semantic views on the same data, e.g. Object on File, HDFS on Object
EMC ViPR - Software-Defined Storage
EMC ViPR Platform: ViPR Data Services, ViPR Controller
Provisioning, Self-Service, Reporting, Automation
Underlying storage: Third-Party, Isilon, Atmos, VMAX, VNX, VPLEX, Commodity, XtremIO, Centera
ViPR Data Services: Architecture
ViPR
Data Path
ViPR
Control Path
• Distributed Infrastructure
• Device Drivers
• Elastic Volumes
• Migration
GEO-SCALE INDEX, METADATA, TRANSACTIONS
OBJECT, HDFS, KEY-VALUE, 3rd PARTY, …
Commodity
VNX Isilon
3rd Party
ViPR Data Services Address Big Data Storage Requirements
• Data Unification
– Transform existing storage infrastructure into a data lake
– Structured, semi-structured, and unstructured content
• In-place Analytics
– Run queries against data on existing arrays
– Flexible software model supports future colocation of compute and storage
• Data Compliance
– Choice and flexibility of persistence layer
– Support cloud-scale and consumer-grade applications on enterprise-grade infrastructure
ViPR HDFS Data Service Overview
ViPR HDFS Service Overview
• HDFS is becoming the de facto file system for distributed applications
• ViPR is a great platform for HDFS
– Addresses limitations of off-the-shelf HDFS
– Brings HDFS to existing storage hardware
– Enables HDFS/Object/File scenarios
– Flexible software model
ViPR HDFS Service Overview
• API head
– Custom client/server protocol optimized for high scale
– Uses the same unstructured storage engine as ViPR Object data service
• Client library over the HDFS API
– Provides a viprhdfs:// drop-in replacement for HDFS 2.0
– Can be seamlessly added to existing Hadoop distributions
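The "drop-in replacement" idea can be pictured as a change of URI scheme: a job keeps its file paths and only the scheme and authority of the file system URI change. The following is a minimal sketch of that rewrite, not ViPR's client library. The helper name `to_vipr_uri`, the authority value `bucket1.example`, and the host names are hypothetical; only the `viprhdfs://` scheme comes from the slide.

```python
from urllib.parse import urlsplit


def to_vipr_uri(hdfs_uri: str, vipr_authority: str) -> str:
    """Rewrite an hdfs:// URI to the viprhdfs:// scheme, keeping the path.

    `vipr_authority` is a hypothetical placeholder; the real authority
    format must be taken from the ViPR documentation.
    """
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    # Only the scheme and authority change; the path is carried over as-is.
    return f"viprhdfs://{vipr_authority}{parts.path}"


if __name__ == "__main__":
    print(to_vipr_uri("hdfs://namenode.example.com:8020/data/logs/2014/01",
                      "bucket1.example"))
```

In a real deployment the scheme would be registered through the Hadoop configuration together with the client library; the exact property names are not given in this deck and are deliberately left out here.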
EMC ViPR Data Services
ViPR
Data Services
ViPR
Controller
EMC ViPR Platform
Provisioning Self-Service Reporting Automation
Third-Party
Isilon VNX
How ViPR HDFS Data Service Helps
Accelerate Big Data initiatives
• Quickly move from lab to production
– Utilize existing infrastructure as a big data repository or “data lake”
– Eliminate the NameNode as a single point of failure
• Reduce risk
– Run queries against data on existing arrays
– Leverage existing investments
• Reduce costs
– Reduce the growth in dedicated analytics infrastructure
– Reduce bandwidth, storage and network costs
ViPR HDFS Data Service Technical Deep Dive
HDFS ARCHITECTURE
Client, Name Node, JOB TRACKER
Commodity Compute & Storage: three nodes, each with TASK TRACKER, MapReduce Task, Data Store
ViPR HDFS ARCHITECTURE
Client, JOB TRACKER
Compute: three nodes, each with TASK TRACKER, MapReduce Task
Storage: VNX, Isilon, 3rd Party, VMAX, Commodity
• No single point of failure
• Leverage existing storage
• Compatible with existing Hadoop distribution
• Mixed workload across HDFS and Object
MapReduce Job Flow
Client: Submit Job
Master Node: Job Tracker (Split into tasks), Name Node, Secondary NameNode
Commodity Compute & Storage (Rack 1, Rack 2): Data Node 1, Data Node 2, Data Node 3, each with Task Tracker, MapReduce Task, Data Store
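The job flow in this diagram (submit a job, split it into tasks, run the tasks where the data sits) can be illustrated with a small in-process word count. This is a teaching sketch only, assuming a single Python process stands in for the Job Tracker and Task Trackers; the function names are illustrative and are not Hadoop APIs.

```python
from collections import defaultdict


def split_input(lines, num_tasks):
    """Job Tracker role: split the input into one chunk per map task."""
    return [lines[i::num_tasks] for i in range(num_tasks)]


def map_task(chunk):
    """Task Tracker role: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_outputs):
    """Group all emitted values by key before the reduce phase."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def run_job(lines, num_tasks=3):
    """Submit a job: split, map, shuffle, reduce."""
    chunks = split_input(lines, num_tasks)
    mapped = [map_task(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data", "big storage", "data lake data"]))
```

In a real cluster each map task would run on a separate node and read its split from the file system; the sketch keeps everything in memory to show only the control flow.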
ViPR HDFS - Under The Hood
Customer’s Hadoop Compute Cluster
ViPR Controller
ViPR Data Node(s) running outside the ViPR managed arrays: S3 Head, HDFS Head, Blob Engine
Customer AD, Kerberos KDC: Trust Relationship
Data Read/Write
VNX, Isilon, 3rd Party
HDFS uses ViPR Object Storage Engine
ViPR Data Services create a unified pool (bucket) of data
• Buckets of data span file shares
– Grow and shrink on demand
• Data is distributed and intermingled across the storage
• Provides an HDFS interface
• ViPR makes HDFS enterprise grade
– ViPR HDFS replaces namenodes, no single point of failure
VIRTUAL ARRAY: Isilon, 3rd Party, VNX 5500
Support Mixed Workloads
Object, File and HDFS operations on the same data
• ViPR Data Services offer three bucket options:
– Object
– HDFS
– ObjectandHDFS
• ObjectandHDFS gives the user access through either the S3 or the HDFS interface
– Full compatibility with existing object-based APIs
▪ Amazon S3, OpenStack Swift, Atmos
VIRTUAL ARRAY: Isilon, 3rd Party, VNX 5500
Buckets shown: Object, HDFS, Object & HDFS
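The three bucket options can be read as an access matrix: each bucket type determines which interface heads may operate on its data. The sketch below models that matrix, assuming the names `BucketType` and `can_access`, which are illustrative and not part of any ViPR API; only the three option names come from the slide.

```python
from enum import Enum


class BucketType(Enum):
    """The three bucket options named on the slide, mapped to the
    interfaces assumed to be allowed on each."""
    OBJECT = frozenset({"object"})
    HDFS = frozenset({"hdfs"})
    OBJECT_AND_HDFS = frozenset({"object", "hdfs"})


def can_access(bucket_type: BucketType, interface: str) -> bool:
    """Return True if the given interface may operate on the bucket."""
    return interface in bucket_type.value


if __name__ == "__main__":
    for bucket in BucketType:
        allowed = sorted(bucket.value)
        print(f"{bucket.name}: {', '.join(allowed)}")
```

Only an ObjectandHDFS bucket supports the mixed workload the slide describes, where an object application and an analytics job work on the same data.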
ViPR HDFS Data Value Proposition
Instantly Deploy a Big Data Repository
Use existing arrays as a big data store
• Reduce risk
– Reduce CAPEX investment required to perform analytics
– Maintain data protection, compliance at array level
• Reduce cost and complexity of dedicated clusters
– Reduce need for new vendor nodes and storage capacity
• Reduce data transfer time and bandwidth costs
– 10 TB takes 25 hours via 10 GbE
– 10 TB takes 3 days via dedicated WAN
VIRTUAL ARRAY: Isilon, 3rd Party, VNX 5500
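The transfer-time figures above can be sanity-checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits per second); the deck does not state how its numbers were derived, so the implied throughputs are an inference and not EMC's published assumptions. At full 10 GbE line rate, 10 TB would move in a little over two hours, so the 25-hour figure corresponds to an effective throughput of roughly 0.9 Gb/s.

```python
def transfer_hours(size_tb: float, throughput_gbps: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained throughput."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (throughput_gbps * 1e9)
    return seconds / 3600


def implied_gbps(size_tb: float, hours: float) -> float:
    """Sustained throughput (Gb/s) implied by moving size_tb in `hours`."""
    bits = size_tb * 1e12 * 8
    return bits / (hours * 3600) / 1e9


if __name__ == "__main__":
    print(f"10 TB at 10 Gb/s line rate: {transfer_hours(10, 10):.1f} h")
    print(f"Throughput implied by 25 h: {implied_gbps(10, 25):.2f} Gb/s")
    print(f"Throughput implied by 3 days: {implied_gbps(10, 72):.2f} Gb/s")
```

Whatever the exact assumptions, the point of the slide stands: the cost of moving data grows linearly with its size, which is the argument for running analytics in place.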
Expand the Reach of Big Data Queries
Expand analytics to ViPR-managed data stores
• Extend big data queries to run on existing file arrays as well as existing Hadoop deployments
• Opens new opportunities and analytics scenarios
– Faster, easier business insights
VIRTUAL ARRAY: Isilon, 3rd Party, VNX 5500
Leverage and Extend Existing Investments
Utilize existing Hadoop infrastructure
• ViPR HDFS data service can be the data source for Pig/Hive queries
– Fully compatible with existing Hive/Pig query engines
• Can use an existing infrastructure to query ViPR-managed data stores
– Add data stores via ViPR without having to re-write queries
VIRTUAL ARRAY: Isilon, 3rd Party, VNX 5500
Support Mixed Workloads
Provide multiple semantic views of the same data
• Eliminate expensive data movement
– Object-based workloads and analytics applications can manipulate the same data
• Increase developer productivity
– Different applications can target the same data without re-writes
– IT can serve different developer and business groups with the same infrastructure
• Increase data value
– Extract more insight from file and object data (unstructured, semi-structured)
• Reduce infrastructure costs
– Eliminate dedicated data silos
Summary
• ViPR provides storage services at cloud scale
– Built in software
– Layered over both traditional and new storage devices
• ViPR creates a unified platform
– Data services can be used as different semantic views on the same data, e.g. Object, File, HDFS interfaces for same data
• ViPR HDFS accelerates journey to 3rd Platform
– Extend Big Data queries to existing storage
– Reduces complexity and cost of dedicated analytics infrastructure
– Leverages existing investments

Editor's Notes

  • #3: There are megatrends transforming our industry that are predicated on a platform of trust. According to leading industry analysts, four major trends are shaping IT and the business: Mobility, Cloud, Big Data, and Social.
  • #4: These trends are forming what is being called the third platform: a platform architected for these trends and built to support billions of users and millions of applications. Looking back, the first platform was the mainframe, with thousands of applications and millions of users, and proprietary terminals as the end user device of choice. The second platform was, and still is, the internet and client/server computing, with the PC as the end user device. This platform continues to support tens of thousands of applications and hundreds of millions of users. However, current architectures are being pushed to their limits, and scaling this type of environment can be costly and ineffective. The third platform is architected with web scale in mind, supporting millions of applications and billions of users, and is built on the technology pillars of mobility, cloud services, big data and analytics, and social networking. When we talk about the third platform in an enterprise setting, we are really talking about the convergence of these forces and their powerful combination as a foundational architecture for IT organizations. Beyond the individual trends, their seamless combination is becoming critical, since it collectively represents an agile new IT fabric for applications, data centers and, most importantly, the user experience. According to IDC, the third platform will serve as the primary growth driver of the IT industry over the next decade, responsible for 75% of new growth as worldwide IT spending moves from $3.7 trillion in 2013 to more than $5 trillion in 2020.
  • #5: Unstructured data is no longer just files from office productivity applications. The real growth and storage management problem is coming from: new media such as videos and podcasts; machine-generated data from devices such as sensors, i.e. telemetry data (a transatlantic flight from NYC to London can generate 20-30 TB of it); communities (social interactions); mobile devices (pictures, music, etc.); and imaging equipment (imaging, imaging studies, health records). The intelligent economy produces a constant stream of data that is being monitored and analyzed. IDC estimates that the digital universe will be 40 ZB by 2020. That’s a 40 followed by 21 zeroes. Social interactions, mobile devices, facilities, equipment, R&D, simulations, and physical infrastructure all contribute to the flow of information. In aggregate, this is what is called Big Data. The Big Data economy is characterized by: more sources of data (communities, mobile devices, sensors, imaging equipment); richer content (pictures, videos, data streams); and longer utility. Durable value: information, and information about information (metadata), has value for a long time after its creation. All this data can have business value. Regulatory burdens are always a contributor to the need to retain data for longer and longer periods of time, often indefinitely.
  • #6: Data has value well beyond the context of the application that created it. Information-based applications and services will have tremendous financial impact across many market segments. Evolving to the 3rd platform and exploiting information will have a quantifiable impact on profit margins, revenues, productivity metrics and operating costs. The potential is obvious and has been validated by early adopters. Big Web companies, oil and gas companies, pharmaceutical firms, large retailers and many more have used Big Data analytics for deep business insights that target and retain customers and build competitive advantage. The early and late majority, however, are moving more cautiously. Enterprise customers are not starting with a blank canvas, and while they want all the benefits that the 3rd platform offers, they have invested millions if not billions of dollars in an infrastructure that they must continue to maintain and grow. The cost, risk and value of moving to a 3rd platform are still uncertain. They have questions about how to gain the value of the 3rd platform while leveraging their current IT infrastructure.
  • #7: Big Data and HDFS are disruptive. According to 451 Research, the market for Hadoop/NoSQL software and services will be $3.5 billion by 2017 (45% CAGR). It’s more than analytics, though that’s a huge part of it. The disruptive change is that data has value beyond its initial application. Information about the information provides insights that are critical to understanding and predicting the business. Everyone sees the potential, but adoption has still been somewhat cautious. Hadoop represents a 3rd platform infrastructure that co-locates compute and storage. But, for most enterprises, a Hadoop cluster only contains a fraction of their enterprise data. Customers need the confidence to move from the lab to production. Can they leverage their existing infrastructure and data? Which Hadoop distribution should they use? There are also concerns about HDFS not being enterprise grade. The namenode still represents a single point of failure, which can be a non-starter for some data and uses. Customers are still calculating their risk. A dedicated cluster can be very cheap (free) to get started but requires significant investment as it scales. It’s also hard to calculate ROI when it’s unclear which data has value. Other costs that need to be factored in are the bandwidth and network costs of moving data to the cluster and back to primary storage. Customers see the potential and the necessity of Big Data and 3rd platform applications and services. But their 2nd platform infrastructure is not built for this new model. Yet existing infrastructure, data and applications are not going away. Organizations need a way to “mind the gap”: leveraging their existing infrastructure and data today while building a platform for the future.
  • #8: The era of Big Data places new demands on data storage. Storage must contend with varying data types, all of which need to be stored securely for long periods of time and be available for analysis. Data Unification: There is an increasing focus on data unification, meaning that the storage infrastructure for Big Data has to cater to structured, semi-structured, and unstructured data types. In-Place Analytics: There is a growing emphasis on in-place analytics, in which compute workloads such as Hadoop MapReduce operations are run right where the data lives. Data Compliance: This market is fraught with challenges stemming from regulatory and compliance requirements. As the platform that hosts data the instant it is created, storage is not immune to these challenges, which also affect how data gets stored in the long term.
  • #10: ViPR aggregates multi-vendor heterogeneous storage into a unified storage platform. That platform can, in turn, be leveraged as a logical scale-out layer that serves as the underlying infrastructure for hosting a range of data services to support collecting, managing and utilizing unstructured content at massive scale. ViPR Data Services are implemented in software and feature a simple, lightweight, low-touch, scale-out design. Data services are storage abstractions that reflect the combination of a data type (file, object or block of data), access protocols (iSCSI, NFS, REST, etc.), and durability, availability, and security characteristics (snapshots, replication, etc.). In ViPR, block, file, object, and HDFS are all data services, though ViPR is not in the data path for file and block (these can be thought of as “control services”). Object and HDFS are available, with more to follow. Data services can be used to provide different semantic views of the same data. You can manipulate a file as a file or as an object without having to move the data to a different platform that features that semantic.
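The idea of two semantic views over one copy of the data can be illustrated with a small sketch. This is only a toy model, not ViPR's implementation or API: the `Bucket` class, its method names, and the example keys are all hypothetical, and an in-memory dictionary stands in for the storage engine.

```python
class Bucket:
    """Toy bucket (hypothetical): one blob store exposed through two views."""

    def __init__(self):
        self._blobs = {}  # key -> bytes; the single copy of the data

    # Object-style view: flat keys
    def put_object(self, key, data):
        self._blobs[key] = data

    def get_object(self, key):
        return self._blobs[key]

    # File-style view: the same keys interpreted as paths
    def read_file(self, path):
        return self._blobs[path.lstrip("/")]

    def list_dir(self, path):
        stripped = path.strip("/")
        prefix = stripped + "/" if stripped else ""
        children = set()
        for key in self._blobs:
            if key.startswith(prefix):
                # Keep only the first path component below the directory
                children.add(key[len(prefix):].split("/", 1)[0])
        return sorted(children)


bucket = Bucket()
bucket.put_object("logs/2014/a.txt", b"alpha")
bucket.put_object("logs/readme", b"notes")

# The object written above is readable as a file without being moved
print(bucket.read_file("/logs/readme"))
print(bucket.list_dir("/logs"))
```

Both views read from the same dictionary, which is the point of the note: no copy to a second platform is needed to change how the data is addressed.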
  • #11: The immediate benefit of ViPR is its ability to automate storage management and provisioning and make storage available as a self-service, consumable resource within a software-defined data center (SDDC). But ViPR also transforms how enterprises deliver data services. With storage arrays and storage services defined in software and managed by policy, ViPR enables organizations to deploy unique Data Services that cloud-enable existing infrastructure and extend the use cases for their data and the value of their storage investments. ViPR aggregates multi-vendor heterogeneous storage into a unified storage platform that can be leveraged as a logical scale-out layer, serving as the underlying infrastructure for hosting a range of data services to support collecting, managing and utilizing unstructured content at massive scale.
  • #12: This depicts the architecture for ViPR and highlights the data services functionality. At the bottom are the physical arrays that ViPR can manage. Above the arrays is the ViPR Controller, which has features that enable a distributed infrastructure (Cassandra, a distributed database, and ZooKeeper, which manages the status of the different nodes in the system) and device drivers that hook into the arrays' APIs so the Controller can automate provisioning, management, etc. On top of that are the ViPR data services. The Object Data Service was released at the same time as the ViPR Controller, in October 2013. HDFS was released in December 2013. HDFS uses the same unstructured storage engine as the Object data service.
  • #13: The era of Big Data places new demands on data storage. Storage must contend with varying data types, all of which need to be stored securely for long periods of time and be available for analysis. Data Unification: There is an increasing focus on data unification, meaning that the storage infrastructure for Big Data has to cater to structured, semi-structured, and unstructured data types. In-Place Analytics: There is a growing emphasis on in-place analytics, in which compute workloads such as Hadoop MapReduce operations are run right where the data lives. Data Compliance: This market is fraught with challenges stemming from regulatory and compliance requirements. As the platform that hosts data the instant it is created, storage is not immune to these challenges, which also affect how data gets stored in the long term.
  • #15: The ViPR HDFS data service is the second data service to be released by EMC. It will be available by the end of 2013. The HDFS service gives organizations the ability to run analytics using well-known industry Hadoop distributions on existing data stored across heterogeneous systems such as VNX, Isilon and NetApp arrays. Hadoop has become a de facto standard for companies that are investigating novel strategies for addressing their Big Data challenges. HDFS is the core distributed file system used by Hadoop. Many organizations have an HDFS project in their labs. However, many of these companies have found Hadoop to be difficult to deploy and manage at scale. The ViPR approach to HDFS takes advantage of proven storage hardware to overcome this challenge. Instead of building a discrete analytics silo with dedicated infrastructure, the ViPR HDFS data service leverages the existing ViPR virtualized storage environment and the backend storage platforms it utilizes.
  • #16: The ViPR HDFS data service is the second data service to be released by EMC. It will be available by the end of 2013. The HDFS service gives organizations the ability to run analytics using well-known industry Hadoop distributions on existing data stored across heterogeneous systems such as VNX, Isilon and NetApp arrays. Hadoop has become a de facto standard for companies that are investigating novel strategies for addressing their Big Data challenges. HDFS is the core distributed file system used by Hadoop. Many organizations have an HDFS project in their labs. However, many of these companies have found Hadoop to be difficult to deploy and manage at scale. The ViPR approach to HDFS takes advantage of proven storage hardware to overcome this challenge. Instead of building a discrete analytics silo with dedicated infrastructure, the ViPR HDFS data service leverages the existing ViPR virtualized storage environment and the backend storage platforms it utilizes.
  • #17: HDFS is becoming increasingly popular as a file system layer for distributed applications, and this goes beyond Hadoop. The ViPR HDFS data service is a Hadoop-compatible file system and supports any Hadoop 2.0 implementation, including existing distros such as Cloudera and PivotalHD. HDFS supports high aggregate throughput access to data, e.g. for MapReduce. In some cases, it provides low-latency access. However, concerns for enterprises include scale, durability, cost, and management.
  • #18: The era of Big Data places new demands on data storage. Storage must contend with varying data types, all of which need to be stored securely for long periods of time and be available for analysis. Data Unification: There is an increasing focus on data unification, meaning that the storage infrastructure for Big Data has to cater to structured, semi-structured, and unstructured data types. In-Place Analytics: There is a growing emphasis on in-place analytics, in which compute workloads such as Hadoop MapReduce operations are run right where the data lives. Data Compliance: This market is fraught with challenges stemming from regulatory and compliance requirements. As the platform that hosts data the instant it is created, storage is not immune to these challenges, which also affect how data gets stored in the long term.
  • #20: Task trackers are processes on data (slave) nodes that accept tasks from a Job Tracker. The tasks are map, reduce and shuffle operations. Task trackers monitor the tasks running on a node and communicate with the Job Tracker. Every task tracker has a specified number of slots that corresponds to how many tasks it can accept. When scheduling a task, the Job Tracker first looks for an empty task slot on the same node where the data block resides, thus achieving data locality. Next, it looks for a node with an empty slot on the same rack.
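The two-step slot search described in this note can be sketched as follows. This is a simplified illustration rather than Hadoop's actual scheduler code: the `TaskTracker` record and `pick_tracker` function are hypothetical names, and the final fallback to any tracker with a free slot is an assumption that the note does not state.

```python
from dataclasses import dataclass


@dataclass
class TaskTracker:
    node: str        # host the tracker runs on
    rack: str        # rack that host belongs to
    free_slots: int  # task slots still available


def pick_tracker(trackers, block_nodes):
    """Choose a tracker for a task whose input block lives on block_nodes."""
    candidates = [t for t in trackers if t.free_slots > 0]

    # 1. Node-local: a free slot on a node that holds the data block
    for t in candidates:
        if t.node in block_nodes:
            return t

    # 2. Rack-local: a free slot on the same rack as a node holding the block
    rack_of = {t.node: t.rack for t in trackers}
    block_racks = {rack_of[n] for n in block_nodes if n in rack_of}
    for t in candidates:
        if t.rack in block_racks:
            return t

    # 3. Assumed fallback (not in the note): any tracker with a free slot
    return candidates[0] if candidates else None


trackers = [
    TaskTracker("n1", "rack1", 0),  # holds the block but has no free slot
    TaskTracker("n2", "rack1", 1),
    TaskTracker("n3", "rack2", 2),
]
chosen = pick_tracker(trackers, block_nodes={"n1"})
print(chosen.node)  # n2: same rack as n1, which is full
```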
  • #22: ViPR HDFS provides an HDFS-compatible file system. In this way, the compute portion of an existing Hadoop cluster communicates with ViPR HDFS. Existing storage arrays managed by ViPR can now be made accessible via HDFS.
  • #23: Task trackers are processes on data (slave) nodes that accept tasks from a Job Tracker. The tasks are map, reduce and shuffle operations. Task trackers monitor the tasks running on a node and communicate with the Job Tracker. Every task tracker has a specified number of slots that corresponds to how many tasks it can accept. When scheduling a task, the Job Tracker first looks for an empty task slot on the same node where the data block resides, thus achieving data locality. Next, it looks for a node with an empty slot on the same rack.
  • #24: The HDFS data service uses the same unstructured storage engine as the ViPR Object data service. ViPR data services create a unified pool (bucket) of data. As with the Object data service, users create buckets, which can span file shares and can grow and shrink on demand. The data is distributed across the arrays according to how the virtual storage pool is configured. The bucket provides an HDFS interface or, optionally, both an Object (S3) and an HDFS interface. In this way, the compute portion of an existing Hadoop cluster communicates with ViPR HDFS, which uses existing data (added to the HDFS bucket) as the target for Big Data applications and queries. The diagram illustrates the system architecture: how a ViPR customer can expose existing data in a ViPR-managed array to their Hadoop cluster and run MapReduce jobs on this data. The Object data service and the HDFS data service run on the same set of ViPR Data Service VMs. These VMs can be scaled as storage capacity increases. ViPR 1.1 will make available a client library (the ViPR-HDFS client) that needs to be installed on all the nodes that run MapReduce jobs on the customer’s Hadoop cluster. When a task running on a node needs to read a file, the request goes to the ViPR-HDFS client (because the customer points to viprfs:// as the data source), and the client communicates with the HDFS head on the ViPR data node. The client passes in an authN token that identifies the user to the HDFS head. The HDFS head on the ViPR data node receives requests from the ViPR-HDFS client and verifies the user’s identity by authenticating against the KDC. Once authN and authZ succeed, it talks to the blob engine and the controller process running on the node to fetch the requested data.
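The read path in this note (token passed by the client, identity check, authorization, then fetch from the blob engine) can be modeled in a few lines. Everything in this sketch is a stand-in: `HdfsHead`, `AccessDenied`, the token table that plays the role of the KDC, the ACL table, and the example `viprfs://` path layout are hypothetical and do not reflect the real ViPR-HDFS client or server interfaces.

```python
class AccessDenied(Exception):
    """Raised when authentication or authorization fails."""


class HdfsHead:
    """Toy model of the HDFS head on a data node (hypothetical)."""

    def __init__(self, kdc_tokens, acls, blobs):
        self.kdc_tokens = kdc_tokens  # token -> user; stands in for the KDC
        self.acls = acls              # path -> set of users allowed to read
        self.blobs = blobs            # path -> bytes; stands in for the blob engine

    def read(self, token, path):
        # authN: resolve the token to a user identity
        user = self.kdc_tokens.get(token)
        if user is None:
            raise AccessDenied("authentication failed")
        # authZ: check that the user may read this path
        if user not in self.acls.get(path, set()):
            raise AccessDenied("user is not authorized for this path")
        # Only after both checks succeed is the data fetched
        return self.blobs[path]


path = "viprfs://example-bucket/logs/part-0"  # illustrative path, layout assumed
head = HdfsHead(
    kdc_tokens={"token-alice": "alice"},
    acls={path: {"alice"}},
    blobs={path: b"record-1\nrecord-2\n"},
)
print(head.read("token-alice", path))
```

The ordering is the part worth noting: the fetch from the blob store is reached only when both checks pass, which mirrors the sequence the note describes.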
  • #25: In addition to physical segregation, buckets provide logical segregation within the object store. Just like in S3, a user can create buckets that logically segregate applications or sets of data. These buckets can grow and shrink on demand. The actual data objects are distributed and intermingled across the physical devices that comprise the virtual storage array.
  • #26: In addition to physical segregation, buckets provide logical segregation within the object store. Just like in S3, a user can create buckets that logically segregate applications or sets of data. These buckets can grow and shrink on demand. The actual data objects are distributed and intermingled across the physical devices that comprise the virtual storage array.
  • #28: Use Case: The customer sets up ViPR across multiple Isilon and VNX arrays and ingests data into ViPR. ViPR data services create a unified pool (bucket) of data across file shares and provide the user with an HDFS interface. The customer installs the ViPR HDFS client on an existing PivotalHD cluster. The customer starts writing Hive queries referencing ViPR HDFS as the data source.
  • #29: Use Case: The customer has an existing PivotalHD cluster with data stored in HDFS within the cluster and has also installed the ViPR HDFS client on this PivotalHD cluster. The customer also sets up ViPR across multiple Isilon and VNX arrays and ingests data into ViPR. The customer starts writing MapReduce jobs that reference data in HDFS within the PivotalHD cluster as well as data in ViPR HDFS, thereby opening up new analytics scenarios. The spanning use case is meant to explain that ViPR HDFS and HDFS can coexist. ViPR HDFS will not entirely replace HDFS.
  • #30: Use Case: An environment with Cloudera infrastructure installs the ViPR HDFS client. The customer sets up ViPR across multiple Isilon and VNX arrays. The customer starts writing Hive queries referencing ViPR HDFS as the data source and is able to use the existing environment against ViPR HDFS.
  • #31: Use Case: An environment with multiple VNX and Isilon arrays installs ViPR data services. ViPR data services create a unified pool (bucket) of data across file shares and provide the user with access to either an S3 or an HDFS interface. Object-based applications as well as analytics workloads are able to use the same set of data without having to move it around.