1© Copyright 2011 EMC Corporation. All rights reserved.
EMC Isilon HDFS – Enterprise Storage for Hadoop
Featuring EMC Isilon Scale-Out NAS Storage
Shai Harmelin
EMC System Engineer – Isilon Specialist
May 21, 2013
Today’s Agenda
• EMC Isilon Background
• HDFS Architectural Challenges
• Isilon HDFS Benefits
• Performance Comparison
• Customer Case Study
• Q+A
EMC Isilon
Setting the standard for scale-out NAS
• Founded in 2000; named the leader in scale-out NAS (Gartner 2010)
• Broad adoption across many markets
– High Performance Computing (HPC): Life Sciences, Oil & Gas, Electronic
Design Automation, Media & Entertainment, Financial Services
– Enterprise IT: Archive, Home Directories, File Shares, Virtualization,
Business Analytics
• Acquired by EMC in 2011 for $2.5B
• Over 3,500 global customers
• Isilon OneFS: Seventh generation, industry-proven, innovative
scale-out operating environment
• 2012 – EMC Isilon is Industry’s First Scale-Out NAS System with Native
HDFS Support
Isilon Growing Momentum
3,500+ customers
Why Hadoop is Important to EMC Isilon Customers
Pragmatic approach to analytics on a very large scale
– Opens up new ways of gaining insights and identifying
opportunities for businesses
Designed to address the rise of unstructured data
– Enterprise data to grow by 650% over next 5 years
– More than 80% of this growth will be unstructured data
Hadoop is only ONE component of
Enterprise Big Data Analytics PIPELINE
Isilon Scale-Out NAS Architecture
[Diagram: servers in the client/application layer connect through the Ethernet layer to a single file system/volume over CIFS, NFS, FTP, HTTP, and HDFS (for Hadoop); the OneFS operating environment spans the nodes, which are linked by an intra-cluster communication layer]
Isilon Core Innovation
OneFS scale-out operating system
Single File System
Simplicity
Leadership Efficiency
High Performance
Easy Growth
Automated Tiering
Linear Scalability
Largest and Most Scalable File System
500X More Scalable than Traditional Storage Systems
OneFS™ can scale from 18TB to over 20,000 TB in a
single file system
AutoBalance
Automated data balancing across nodes reduces costs,
complexity and risks for scaling storage
“Using Software to do Work Unfit for Humans”
• AutoBalance migrates content to new storage nodes while the system is online and in production
• Requires NO manual intervention, NO reconfiguration, NO server or client mount point or application changes
• Eliminates “Hot Spots”
[Diagram: EMPTY and FULL nodes become BALANCED nodes after AutoBalance runs]
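The effect of AutoBalance can be illustrated with a small sketch. This is not OneFS's actual algorithm, which the slides do not describe; the function name `rebalance` and all node figures are hypothetical. It only shows the idea of evening out used capacity after empty nodes join a cluster of full ones.

```python
def rebalance(used_tb, capacity_tb):
    """Return per-node used capacity after spreading data evenly.

    Hypothetical illustration only: every node ends at the same
    utilization fraction, which corresponds to the BALANCED state.
    """
    total_used = sum(used_tb)
    total_capacity = sum(capacity_tb)
    utilization = total_used / total_capacity
    return [round(cap * utilization, 2) for cap in capacity_tb]


# Four full nodes plus five newly added empty nodes (made-up numbers).
capacity = [36.0] * 9
before = [36.0] * 4 + [0.0] * 5
after = rebalance(before, capacity)
print(after)  # every node now holds the same amount
```

The total amount of data is unchanged; only its placement differs, which is why no client or mount point changes are needed in this model.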
Isilon, Scale-Out NAS for Big Data
Single File System, Single Volume Simplicity For Active, Persistent, And Archive Data
• Load balancing
• Seamless failover
• Performance zones
• Quota management
• Thin provisioning
• High speed replication
• Disaster recovery
• Business continuance
• Instant recovery
• Data protection
• File immutability
• Protection from deletion/change
• Automated storage tiering
[Diagram labels: Client/Application Layer (clients, virtualized servers); Network; WAN/LAN; Primary & Nearline Storage; Local/Remote Archive; S-series, X-series, NL-series, Backup Accelerator]
Easiest Storage System to Manage
Single Level of Management
Manage an 18TB to 10PB single file system from one intuitive console
“Isilon has made some very bold claims with respect to its clustered storage products - not least the idea of genuinely revolutionizing the ease and speed with which mass storage - over 500 Terabytes - can be added and managed thereafter. We have conducted rigorous testing and unanimously agree with their assertions. This stuff is almost frighteningly simple to use.”
Steve Broadhead, Founder, Broadband-Testing Laboratories
HDFS Overview
Core Hadoop Components
• NameNode
• Secondary NameNode
• Job Tracker
• DataNode / Task Tracker
Compute Components
Job Tracker
• Manages all jobs submitted to the cluster
• Tracks and reports the status of jobs and tasks
• Provides job queuing functionality
• Communicates with the NameNode and tries to align TaskTrackers to DataNodes
Task Tracker
• The compute workhorse
• Serves read/write requests from the clients
• Executes Map/Reduce tasks
• Typically performs I/O against local or remote DataNodes
File System Components
NameNode
• Manages the file system namespace
• Stores all the metadata in RAM, which limits file system size
– Filenames, owners, group, access info
– Knows associated blocks
• Manages block replication across DataNodes
Secondary NameNode
• Manages the edit log and checkpointing of NameNode metadata
• Does not provide NameNode hot failover
– CDH4 has a solution for this, but it is not in full-scale production in most environments
DataNode
• Stores blocks of files on top of the native host OS file system (e.g. EXT3, XFS, ZFS)
• The same block is stored on multiple DataNodes for redundancy
• Has no “awareness” of data blocks living elsewhere (only the NameNode does)
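The division of labor between NameNode and DataNodes can be sketched as a toy in-memory model. This is an illustration under stated assumptions, not Hadoop code: the class `NameNode`, its methods, and the rotating placement are hypothetical, and real HDFS placement is rack-aware. The point is that the whole path-to-block-to-location map lives in one process's memory.

```python
from dataclasses import dataclass, field


@dataclass
class NameNode:
    """Toy namespace: file path -> block IDs -> DataNode names."""
    replication: int = 3
    files: dict = field(default_factory=dict)      # path -> [block_id, ...]
    locations: dict = field(default_factory=dict)  # block_id -> [datanode, ...]

    def add_file(self, path, num_blocks, datanodes):
        block_ids = []
        for i in range(num_blocks):
            block_id = f"{path}#blk{i}"
            # Put each replica on a different DataNode, rotating the start.
            replicas = [datanodes[(i + r) % len(datanodes)]
                        for r in range(self.replication)]
            self.locations[block_id] = replicas
            block_ids.append(block_id)
        self.files[path] = block_ids

    def lookup(self, path):
        """Return (block_id, replica locations) pairs for a file."""
        return [(b, self.locations[b]) for b in self.files[path]]


nn = NameNode()
nn.add_file("/logs/day1", num_blocks=2, datanodes=["dn1", "dn2", "dn3", "dn4"])
for block, nodes in nn.lookup("/logs/day1"):
    print(block, nodes)
```

Because every file and block adds entries to these dictionaries, the number of files the model can track is bounded by the memory of the single process holding them.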
Enterprise Challenges of Hadoop
Hadoop DAS Environment
1. Dedicated Storage Infrastructure
– One-off for Hadoop only
2. Single Point of Failure
– NameNode
3. Lacking Enterprise Data Protection
– No snapshots, replication, backup
4. Poor Storage Efficiency
– 3X mirroring
5. Fixed Scalability
– Rigid compute to storage ratio
6. Manual Import/Export
– No protocol interoperability support
[Diagram: the same Hadoop DAS environment and challenge list, with blocks labeled 1x, 2x, and 3x spread across DataNodes under a single NameNode]
Isilon HDFS Support
• Isilon supports the HDFS interfaces for the NameNode and DataNode to host both metadata and data
• Underlying file system is OneFS
• As simple as pointing the Hadoop nodes to the DNS name of the Isilon cluster!
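On the Hadoop side, "pointing the nodes at the cluster" usually means setting the default file system URI in core-site.xml. The sketch below generates such a fragment. The host name `isilon.example.com` is hypothetical, port 8020 is an assumption (the conventional NameNode RPC port), and `fs.defaultFS` is the Hadoop 2.x property name (Hadoop 1.x used `fs.default.name`); the slides do not show the actual configuration.

```python
import xml.etree.ElementTree as ET


def core_site_xml(namenode_host, port=8020):
    """Build a minimal core-site.xml pointing Hadoop at one HDFS endpoint.

    Host and port are placeholders supplied by the caller.
    """
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = f"hdfs://{namenode_host}:{port}"
    return ET.tostring(configuration, encoding="unicode")


# Hypothetical DNS name for the Isilon cluster.
xml_text = core_site_xml("isilon.example.com")
print(xml_text)
```

Using a DNS name rather than a single node's IP address is what lets any node in the cluster answer the request.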
HDFS is a protocol!
• Each Isilon node now “speaks” the HDFS NameNode and DataNode protocol
• We eliminate the need to run these services on the Hadoop compute cluster
• Every Isilon node acts as both a NameNode and a DataNode (isi_hdfs_d)
• Data is laid out within OneFS exactly the same as for NFS, SMB, etc.
• Data is protected just like any other data in the Isilon file system. No mirroring, only parity = 80% utilization
• All Isilon enterprise features are applied to Hadoop data: Snapshots, Replication, SmartCache, SmartLock, etc.
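The efficiency difference between 3X mirroring and the 80% utilization quoted above is simple arithmetic. The sketch below works it through for a made-up raw capacity of 300 TB; the function name `usable_tb` and the capacity figure are illustrative assumptions.

```python
def usable_tb(raw_tb, utilization):
    """Usable capacity given raw capacity and a utilization fraction."""
    return raw_tb * utilization


raw = 300.0  # hypothetical raw capacity in TB

mirrored = usable_tb(raw, 1 / 3)  # three full copies of every block
parity = usable_tb(raw, 0.80)     # the 80% figure quoted in the slide

print(f"3x mirroring: {mirrored:.0f} TB usable")
print(f"parity:       {parity:.0f} TB usable")
print(f"ratio:        {parity / mirrored:.1f}x")
```

Under these assumptions the same raw capacity holds 2.4 times as much data with parity protection as with three-way mirroring.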
HDFS Writes on Isilon
• The JobTracker asks the Isilon NameNode (isi_hdfs_d): “tell me where to place /path/file”
• OneFS isi_hdfs_d hands the JobTracker a list of 3 “DataNode” addresses for each block (aligned to the block size defined on the Hadoop cluster)
• The JobTracker assigns a TaskTracker to communicate with the DataNode (isi_hdfs_d) to write each data block (an abstraction in our case)
• When complete, isi_hdfs_d responds that the block is replicated (a lie), because the data is striped like any other file written over any protocol
• HDFS files are laid out on the Isilon file system (IFS) similarly to files written over any other protocol (NFS, CIFS, FTP)
• A file can be written over NFS (nfsd) or CIFS (lwiod) and accessed over HDFS (isi_hdfs_d)
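The write-side handshake can be sketched as follows. This is a hypothetical model, not the isi_hdfs_d implementation: the function `plan_write`, the round-robin choice of addresses, the node IPs, and the 64 MB block size are all assumptions for illustration. It shows only that the file is cut into protocol-level blocks and that three addresses are returned per block.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, an assumed block size


def plan_write(file_size, node_ips, block_size=BLOCK_SIZE, addresses_per_block=3):
    """Return, for each logical block, the addresses handed to the client.

    Hypothetical model: addresses are picked round-robin across nodes.
    The blocks are a protocol-level abstraction; nothing here implies
    that three physical copies are stored.
    """
    num_blocks = max(1, math.ceil(file_size / block_size))
    plan = []
    for i in range(num_blocks):
        addrs = [node_ips[(i + k) % len(node_ips)]
                 for k in range(addresses_per_block)]
        plan.append(addrs)
    return plan


nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]  # made-up addresses
plan = plan_write(200 * 1024 * 1024, nodes)  # a 200 MB file
for i, addrs in enumerate(plan):
    print(f"block {i}: {addrs}")
```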
HDFS Reads on Isilon
• The JobTracker asks the Isilon NameNode (isi_hdfs_d): “tell me where /path/file lives”
• isi_hdfs_d responds with a list of block addresses (3 DataNode IPs per block). Note that the block size in this case is configurable on Isilon (default 64MB)
• The JobTracker assigns TaskTrackers to read each block (the first address out of 3 for each)
• Tasks within each TaskTracker ask the NameNode (again) for block locations, then initiate I/O transactions to read the data over the network
• The concept of locality is eliminated except for rack awareness
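Since each task reads from the first of the three addresses listed for its block, the order of addresses in the NameNode's answer determines how reads spread across nodes. The sketch below illustrates this with a made-up answer; the rotation scheme, the node IPs, and the function `assign_reads` are assumptions, not documented isi_hdfs_d behavior.

```python
from collections import Counter


def assign_reads(block_locations):
    """Pick the first listed address for each block."""
    return [addrs[0] for addrs in block_locations]


# Hypothetical answer for a 6-block file on a 4-node cluster: three
# addresses per block, rotated so first choices differ between blocks.
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
locations = [[nodes[(i + k) % len(nodes)] for k in range(3)] for i in range(6)]

reads = assign_reads(locations)
print(Counter(reads))
```

In this model all four nodes serve reads, even though no task runs on the node that holds its data.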
Isilon HDFS Settings
How EMC Isilon Addresses the Hadoop Challenge
1. Challenge: Dedicated Storage Infrastructure – One-off for Hadoop only
   Isilon: Scale-Out Storage Platform – Multiple applications & workflows
2. Challenge: Single Point of Failure – NameNode
   Isilon: No Single Point of Failure – Distributed NameNode
3. Challenge: Lacking Enterprise Data Protection – No snapshots, replication, backup
   Isilon: End-to-End Data Protection – SnapshotIQ, SyncIQ, NDMP Backup
4. Challenge: Poor Storage Efficiency – 3X mirroring
   Isilon: Industry-Leading Storage Efficiency – >80% storage utilization
5. Challenge: Fixed Scalability – Rigid compute to storage ratio
   Isilon: Independent Scalability – Add compute & storage separately
6. Challenge: Manual Import/Export – No protocol support
   Isilon: Multi-Protocol – Industry standard protocols (NFS, CIFS, FTP, HTTP, HDFS)
Distributed (Clustered) Name Node When Using Isilon
• MTTDL = 5,000 years
• Metadata stored across systems the same way as standard file metadata
• Built-in clustered redundancy across many nodes
• Clustering the NameNode on Isilon allows for the failure protection level Isilon already provides
[Diagram: single Name Node compared with a clustered NameNode]
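To put an MTTDL (mean time to data loss) of 5,000 years in perspective, it can be converted into a probability of loss over a planning horizon. The sketch below assumes a constant failure rate (an exponential model), which is a simplification the slides do not state; the function name `loss_probability` is hypothetical.

```python
import math


def loss_probability(years, mttdl_years=5000.0):
    """Probability of at least one data-loss event within `years`,
    assuming a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-years / mttdl_years)


for horizon in (1, 5, 10):
    print(f"{horizon:>2} years: {loss_probability(horizon):.4%}")
```

Under this assumption the one-year probability is roughly 1 in 5,000.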
Fixed Scaling / Independent Scaling
Hadoop
• Storage to compute ratio is fixed
• Scaling compute means scaling capacity
• Difficult to provide QoS
• Compute upgrade is a forklift
Isilon
• Scale compute independent of storage
• Achieve optimal performance balance even as workloads evolve
• No data migrations, ever!
• Add new performance as hardware evolves
[Diagram: storage and compute compared against desired performance/capacity]
Protocol Support
Before
• HDFS is not visible to Windows, Unix, Linux, Apple, or any other file system natively
• Big Data is only used for Big Data
After
• Inherent Multi-Protocol Support in Isilon allows ubiquitous access to all file systems including Hadoop
• Big Data is actual data!
Time-to-Results
Data Copy Analysis
• Have you ever copied 100TB from primary storage to a Hadoop system?
• How long does it take to copy 100TB from one place to another over a 10Gb link? More than 24 hours
[Diagram: existing primary storage, data center network, Hadoop on a Stick]
In-Place Analysis
• Read only the relevant data for analysis
[Diagram: existing primary storage, data center network, Hadoop processing nodes]
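The copy-time estimate can be checked with a short calculation. The sketch assumes the link is 10 gigabits per second and uses decimal units; the `efficiency` parameter and its 0.8 value are illustrative assumptions about protocol and disk overhead, not figures from the deck.

```python
def copy_hours(data_tb, link_gbps, efficiency=1.0):
    """Hours needed to move `data_tb` terabytes over a `link_gbps` link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Gb/s = 10**9 bits/s).
    `efficiency` is the fraction of line rate actually achieved.
    """
    bits = data_tb * 10**12 * 8
    seconds = bits / (link_gbps * 10**9 * efficiency)
    return seconds / 3600


print(f"line rate:      {copy_hours(100, 10):.1f} h")
print(f"80% efficiency: {copy_hours(100, 10, 0.8):.1f} h")
```

Even at full line rate the copy takes about 22 hours, so with any realistic overhead it passes the 24-hour mark.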
Snapshot/Version Control
Before
• Traditional HDFS does not have replication
• No snapshotting of data
• Loss of version control
• Not designed for mission-critical use
After
• Full SnapshotIQ™ integration identifies changes
• Multi-threaded, multi-node scale-out replication
• Improved RPO/RTO for business continuity
• Geo-replicated Hadoop!
Hadoop Distributions Support on Isilon HDFS
• Available now in 7.0.1.5
• Multiple hdfs:// namespaces
– hdfs://DAS + hdfs://isilon
– Potential for archive/tiering
– Hadoop cluster version mixing
• Distributions:
– Cloudera CDH4.x
– Hortonworks HDP-2
– PivotalHD 1.0 (aka GPHD 2.0)
– Apache 0.23 / Apache 2.0
[Diagram labels: HDFS v1, HDFS v2]
Performance
Test Used HiBench
Developed by Intel and Open Sourced
– Collection of standard Hadoop jobs
– Our tests focused on TeraSort and TestDFSIO
All results normalized as throughput per node to allow comparison of differing
configs
TestDFSIO tests were uncompressed, which shows actual I/O efficiency
– Compressed gives much higher performance, but is not actual I/O
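Normalizing to throughput per node is a single division, sketched below. The measurements are made up for illustration and are not results from the deck; which node count the authors divided by (compute or storage) is not stated, so `node_count` is left to the caller.

```python
def throughput_per_node(total_mb_per_s, node_count):
    """Normalize aggregate throughput by node count for comparison."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return total_mb_per_s / node_count


# Made-up measurements, for illustration only.
configs = {
    "config A (8 nodes)": (1600.0, 8),
    "config B (4 nodes)": (900.0, 4),
}
for name, (total, n) in configs.items():
    print(f"{name}: {throughput_per_node(total, n):.1f} MB/s per node")
```

In this made-up case the smaller configuration has lower aggregate throughput but higher throughput per node, which is the kind of comparison the normalization makes visible.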
GPHD-Isilon is Highly Competitive
Terasort Performance is Comparable Between Configurations
I/O Performance Scales As Isilon Nodes Are Added
For Typical Workloads, 1.5 Compute Nodes Per Isilon X400 Node is Good
(4) Isilon X400 nodes tested
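The 1.5:1 rule of thumb translates into a simple sizing calculation. The helper below is a hypothetical sketch; rounding up to a whole node is an assumption, and the ratio applies only to the typical workloads and X400 nodes the slide refers to.

```python
import math


def compute_nodes_needed(isilon_nodes, ratio=1.5):
    """Compute nodes suggested by the rule of thumb, rounded up."""
    return math.ceil(isilon_nodes * ratio)


for n in (4, 5, 8):
    print(f"{n} Isilon nodes -> {compute_nodes_needed(n)} compute nodes")
```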
Return Path
http://www.emc.com/collateral/customer-profiles/h11528-return-path-cp.pdf
Challenges
• Limited performance and capacity to support intensive Hadoop analytics
• NFS and Hadoop environments struggled to handle unique data sets comprised of hundreds of millions of small email files, and large analytics files, which hindered analytics and delivery of customer solutions
• 25 different DAS and NAS storage systems lacked performance and capacity
• Storage projected to increase from 150TB to 2PB over the next 5 years
Company background:
• Return Path is the worldwide leader in email intelligence, serving Internet service providers (ISPs), businesses, and individuals.
• The company’s email intelligence solutions process and analyze massive volumes of data to maximize email performance, ensure email delivery, and protect users from spam and other abuse.
• Developed Hadoop-based email intelligence solutions combined with NAS-based data access
Return Path
Isilon Solution and Benefits
Solution
• Isilon X400 scale-out NAS, approx. 200TB capacity
• SmartConnect, SmartQuotas, InsightIQ software suite
• NFS and HDFS data access protocols
Results
• Return Path now has a single repository for all its Big Data, accessible to email analysts, product development teams and external customers.
• Isilon delivers real-time data to Return Path’s end-user applications while providing seamless integration with Hadoop for back-end data analytics
• Reduces shared storage data center footprint by 30 percent
• Shortens weekly administration time by more than 35 percent
• Improves availability and reliability for Hadoop analytics
• Savings of $350,000 from lower power, cooling, and maintenance
Return Path
Customer Quotes
“To have all this data being generated by our email intelligence products, but no way to access it directly by Hadoop, was a major hindrance.”
“Isilon serves NFS data across multiple product suites and makes it easily accessible to our Hadoop analytics team. That’s a significant business enabler, allowing Return Path to develop customer solutions much faster.”
“Isilon InsightIQ software has been invaluable, providing visibility into our infrastructure and managing our space efficiently as we grow.”
Diz Carter, VP Infrastructure Operations
Questions?
Thank You!
7. emc isilon hdfs enterprise storage for hadoop

  • 1. 1© Copyright 2011 EMC Corporation. All rights reserved. EMC Isilon HDFS – Enterprise Storage for Hadoop Featuring EMC Isilon Scale-Out NAS Storage Shai Harmelin EMC System Enginer – Isilon Specialist May 21, 2013
  • 2. 2© Copyright 2011 EMC Corporation. All rights reserved. Today’s Agenda • EMC Isilon Background • HDFS Architectural Challenges • Isilon HDFS Benefits • Performance Comparison • Customer Case Study • Q+A
  • 3. 3© Copyright 2011 EMC Corporation. All rights reserved. EMC Isilon Setting the standard for scale-out NAS • Founded in 2000 as the leader in Scaleout NAS (Gartner 2010) • Broad adoption across many markets – High Performance Computing (HPC): Life Sciences, Oil & Gas, Electronic Design Automation, Media & Entertainment, Financial Services – Enterprise IT: Archive, Home Directories, File Shares, Virtualization, Business Analytics • Acquired by EMC in 2011 for $2.5B • Over 3,500 global customers • Isilon OneFS: Seventh generation, industry-proven, innovative scale-out operating environment • 2012 – EMC Isilon is Industry’s First Scale-Out NAS System with Native HDFS Support
  • 4. 4© Copyright 2011 EMC Corporation. All rights reserved. Isilon Growing Momentum 3,500+ customers
  • 5. 5© Copyright 2011 EMC Corporation. All rights reserved. Why Hadoop is Important to EMC Isilon Customers Pragmatic approach to analytics on a very large scale – Opens up new ways of gaining insights and identifying opportunities for businesses Designed to address the rise of unstructured data – Enterprise data to grow by 650% over next 5 years – More than 80% of this growth will be unstructured data Hadoop is only ONE component of Enterprise Big Data Analytics PIPELINE
  • 6. 6© Copyright 2011 EMC Corporation. All rights reserved. Isilon Scale-Out NAS Architecture OneFS Operating Environment Intra-cluster Communication Layer Servers Client/Application Layer Ethernet Layer Servers Servers SingleFS/Volume CIFSNFS FTPHTTP HDFS for Hadoop
  • 7. 7© Copyright 2011 EMC Corporation. All rights reserved. Isilon Core Innovation OneFS scale-out operating system Single File System Simplicity Leadership Efficiency High Performance Easy Growth Automated Tiering Linear Scalability
  • 8. 8© Copyright 2011 EMC Corporation. All rights reserved. Largest and Most Scalable File System 500X More Scalable than Traditional Storage Systems OneFS™ can scale from 18TB to over 20,000 TB in a single file system • • •
  • 9. 9© Copyright 2011 EMC Corporation. All rights reserved. AutoBalance Automated data balancing across nodes reduces costs, complexity and risks for scaling storage “Using Software to do Work Unfit for Humans” • AutoBalance migrates content to new storage nodes while system is online and in production • Requires NO manual intervention, NO reconfiguration, NO server or client mount point or application changes • Eliminate “Hot Spots” EMPTY EMPTY EMPTY EMPTY EMPTY FULL FULL FULL FULL BALANCED BALANCED BALANCED BALANCED BALANCED
  • 10. 10© Copyright 2011 EMC Corporation. All rights reserved. Back to Navigation
  • 11. 11© Copyright 2011 EMC Corporation. All rights reserved. • Load balancing • Seamless failover • Performance zones • Quota management • Thin provisioning • High speed replication • Disaster recovery • Business continuance • Instant recovery • Data protection Isilon, Scale-Out NAS for Big Data Single File System, Single Volume Simplicity For Active, Persistent, And Archive Data WAN/LAN Primary & Nearline Storage Local/Remote Archive Client/Application Layer Virtualized Servers Virtualized Servers Clients X-series Network NL-series • File immutability • Protection from deletion/change NL-series Backup Accelerator S-series • Automated storage tiering
  • 12. 12© Copyright 2011 EMC Corporation. All rights reserved. Back to Navigation Easiest Storage System to Manage Single-level of Management Manage a 18TB to 10PB single file system from one intuitive console "Isilon has made some very bold claims with respect to its clustered storage products - not least the idea of genuinely revolutionizing the ease and speed with which mass storage - over 500 Terabytes - can be added and managed thereafter. We have conducted rigorous testing and unanimously agree with their assertions. This stuff is almost frighteningly simple to use.” Steve Broadhead, Founder, Broadband-Testing Laboratories
  • 13. 14© Copyright 2011 EMC Corporation. All rights reserved. HDFS Overview
  • 14. 15© Copyright 2011 EMC Corporation. All rights reserved. Secondary NameNode DataNode / Task TrackerJob Tracker NameNode Core Hadoop Components
  • 15. 16© Copyright 2011 EMC Corporation. All rights reserved. Job Tracker Manages all the jobs to the cluster Tracks and reports the status of jobs and tasks Provides job queuing functionality Communicates with NameNode and tries to align TaskTracker to Data Nodes The compute workhorse Serves read/write requests from the clients Executes Map/Reduce tasks Typically performs I/O against local or remote DataNodes Task Tracker Compute Components
  • 16. 17© Copyright 2011 EMC Corporation. All rights reserved. NameNode Manages the file system namespace Stores all the Metadata in the RAM – a limitation on file system size Filenames, owners, group, access info Knows associated blocks Manages block replication across DataNodes Manages edit log and check- pointing of name node metadata Does not provide name node hot failover CDH4 has a solution for this, but is not in full scale production in most environments Secondary NameNode Stores blocks of files on top of native host OS file system (e.g. EXT3, XFS, ZFS) Same block is stored on multiple DataNodes for redundancy Has no “awareness” of data blocks living elsewhere (only the namenode does) DataNode File System Components
  • 17. 18© Copyright 2011 EMC Corporation. All rights reserved. Enterprise Challenges of Hadoop Hadoop DAS Environment 1 Dedicated Storage Infrastructure – One-off for Hadoop only 2 Single Point of Failure – Namenode 3 Lacking Enterprise Data Protection – No Snapshots, replication, backup 4 Poor Storage Efficiency – 3X mirroring 5 Fixed Scalability – Rigid compute to storage ratio 6 Manual Import/Export – No protocol interoperability support Name node
  • 18. Enterprise Challenges of Hadoop (Hadoop DAS environment). 1. Dedicated storage infrastructure: one-off for Hadoop only. 2. Single point of failure: NameNode. 3. Lacking enterprise data protection: no snapshots, replication, or backup. 4. Poor storage efficiency: 3X mirroring. 5. Fixed scalability: rigid compute-to-storage ratio. 6. Manual import/export: no protocol support. [Diagram: NameNode and DataNodes holding block replicas labeled 1x, 2x, 3x]
  • 19. Isilon HDFS Support. Isilon supports the HDFS interfaces for the NameNode and DataNode to host both metadata and data. The underlying file system is OneFS. As simple as pointing the Hadoop nodes to the DNS name of the Isilon cluster!
  • 20. HDFS is a protocol! Each Isilon node now “speaks” the HDFS NameNode and DataNode protocols. This eliminates the need to run these services on the Hadoop compute cluster. Every Isilon node acts as both a NameNode and a DataNode (isi_hdfs_d). Data is laid out within OneFS exactly the same as for NFS, SMB, etc. Data is protected just like any other data in the Isilon file system: no mirroring, only parity = 80% utilization. All Isilon enterprise features are applied to Hadoop data: Snapshots, Replication, SmartCache, SmartLock, etc.
  • 21. HDFS Writes on Isilon. The JobTracker asks the Isilon NameNode (isi_hdfs_d): “tell me where to place /path/file”. OneFS isi_hdfs_d hands the JobTracker a list of 3 “DataNode” addresses for each block (aligned to the block size defined on the Hadoop cluster). The JobTracker assigns a TaskTracker to communicate with the DataNode (isi_hdfs_d) to write each data block (an abstraction in our case). When complete, isi_hdfs_d responds that the block is replicated (a lie), because the data is striped like any other file written over any protocol. HDFS files are laid out on the Isilon file system (IFS) the same way as files written over any other protocol (NFS, CIFS, FTP). A file can be written over NFS (nfsd) or CIFS (lwiod) and accessed over HDFS (isi_hdfs_d).
  • 22. HDFS Reads on Isilon. The JobTracker asks the Isilon NameNode (isi_hdfs_d): “tell me where /path/file lives”. isi_hdfs_d responds with a list of block addresses (3 DataNode IPs per block). Note that the block size in this case is configurable on Isilon (default 64 MB). The JobTracker assigns TaskTrackers to read each block (the first address out of 3 for each). Tasks within each TaskTracker ask the NameNode (again) for block locations, then initiate I/O transactions to read the data over the network. The concept of locality is eliminated except for rack awareness.
  • 23. Isilon HDFS Settings
  • 24. How EMC Isilon Addresses the Hadoop Challenge. 1. Dedicated storage infrastructure (one-off for Hadoop only) → scale-out storage platform (multiple applications and workflows). 2. Single point of failure (NameNode) → no single point of failure (distributed NameNode). 3. Lacking enterprise data protection (no snapshots, replication, or backup) → end-to-end data protection (SnapshotIQ, SyncIQ, NDMP backup). 4. Poor storage efficiency (3X mirroring) → industry-leading storage efficiency (>80% storage utilization). 5. Fixed scalability (rigid compute-to-storage ratio) → independent scalability (add compute and storage separately). 6. Manual import/export (no protocol support) → multi-protocol, industry-standard protocols (NFS, CIFS, FTP, HTTP, HDFS).
  • 25. Distributed (Clustered) NameNode. When using Isilon, MTTDL = 5,000 years. Metadata is stored across systems the same way as standard file metadata. Built-in clustered redundancy across many nodes. Clustering the NameNode on Isilon gives it the same failure protection level Isilon already provides. [Diagram labels: NameNode, Clustered NameNode]
  • 26. Fixed Scaling / Independent Scaling. Hadoop: storage-to-compute ratio is fixed; scaling compute means scaling capacity; difficult to provide QoS; compute upgrade is a forklift. Isilon: scale compute independent of storage; achieve optimal performance balance even as workloads evolve; no data migrations, ever!; add new performance as hardware evolves. [Diagram: storage and compute versus desired performance/capacity]
  • 27. Protocol Support. Before: HDFS is not visible to Windows, Unix, Linux, Apple, or any other file system natively; Big Data is only used for Big Data. After: inherent multi-protocol support in Isilon allows ubiquitous access to all file systems, including Hadoop; Big Data is actual data! [Diagram labels: Servers]
  • 28. Time-to-Results: Data Copy Analysis vs. In-Place Analysis. Hadoop on a Stick: have you ever copied 100 TB from primary storage to a Hadoop system? How long does it take to copy 100 TB from one place to another over a 10 Gb/s link? ≈ >24 hours. [Diagram labels: Data Center Network, Existing Primary Storage, Hadoop Processing Nodes, reading relevant data for analysis]
  • 29. Snapshot/Version Control. Before: traditional HDFS does not have replication; no snapshotting of data; loss of version control; not designed for mission-critical use. After: full SnapshotIQ™ integration identifies changes; multi-threaded, multi-node scale-out replication; improved RPO/RTO for business continuity; geo-replicated Hadoop!
  • 30. Hadoop Distributions Support on Isilon HDFS • Available now in 7.0.1.5 • Multiple hdfs:// namespaces – hdfs://DAS + hdfs://isilon – Potential for archive/tiering – Hadoop cluster version mixing • Distributions: – Cloudera CDH4.x – Hortonworks HDP-2 – PivotalHD 1.0 (aka GPHD 2.0) – Apache 0.23 / Apache 2.0 [Diagram labels: HDFS v1, HDFS v2]
  • 31. Performance
  • 32. Test Used: HiBench, developed by Intel and open sourced – Collection of standard Hadoop jobs – Our tests focused on TeraSort and TestDFSIO. All results are normalized as throughput per node to allow comparison of differing configurations. TestDFSIO tests were uncompressed, which shows actual I/O efficiency – Compressed gives much higher performance, but is not actual I/O
  • 33. GPHD-Isilon is Highly Competitive
  • 34. TeraSort Performance is Comparable Between Configurations
  • 35. I/O Performance Scales As Isilon Nodes Are Added
  • 36. For Typical Workloads, 1.5 Compute Nodes Per Isilon X400 Node is Good. (4) Isilon X400 nodes tested.
  • 37. Return Path: http://www.emc.com/collateral/customer-profiles/h11528-return-path-cp.pdf Company background: • Return Path is the worldwide leader in email intelligence, serving Internet service providers (ISPs), businesses, and individuals. • The company’s email intelligence solutions process and analyze massive volumes of data to maximize email performance, ensure email delivery, and protect users from spam and other abuse. • Developed Hadoop-based email intelligence solutions combined with NAS-based data access. Challenges: • Limited performance and capacity to support intensive Hadoop analytics. • NFS and Hadoop environments struggled to handle unique data sets comprising hundreds of millions of small email files and large analytics files, which hindered analytics and delivery of customer solutions. • 25 different DAS and NAS storage systems lacked performance and capacity. • Storage projected to increase from 150 TB to 2 PB over the next 5 years.
  • 38. Return Path: Isilon Solution and Benefits. Solution: Isilon X400 scale-out NAS, approx. 200 TB capacity; SmartConnect, SmartQuotas, InsightIQ software suite; NFS and HDFS data access protocols. Results: Return Path now has a single repository for all its Big Data, accessible to email analysts, product development teams, and external customers. Isilon delivers real-time data to Return Path’s end-user applications while providing seamless integration with Hadoop for back-end data analytics. Reduces shared storage data center footprint by 30 percent. Shortens weekly administration time by more than 35 percent. Improves availability and reliability for Hadoop analytics. Savings of $350,000 from lower power, cooling, and maintenance.
  • 39. Return Path: Customer Quotes. “To have all this data being generated by our email intelligence products, but no way to access it directly by Hadoop, was a major hindrance.” “Isilon serves NFS data across multiple product suites and makes it easily accessible to our Hadoop analytics team. That’s a significant business enabler, allowing Return Path to develop customer solutions much faster.” “Isilon InsightIQ software has been invaluable, providing visibility into our infrastructure and managing our space efficiently as we grow.” Diz Carter, VP Infrastructure Operations
  • 40. Questions?
  • 41. Thank You!
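
The storage-efficiency contrast drawn on the slides (3X mirroring on DAS versus roughly 80% utilization with parity protection on Isilon) can be made concrete with a little arithmetic. What follows is a minimal Python sketch, not a sizing tool: the function names are hypothetical, the 80% figure is simply taken from the slides, and the actual OneFS parity layout is not modeled.

```python
def usable_fraction_mirrored(copies: int) -> float:
    """Fraction of raw capacity that holds unique data when every
    block is stored `copies` times (e.g. 3 for 3X mirroring)."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 / copies


def raw_tb_needed(usable_tb: float, usable_fraction: float) -> float:
    """Raw capacity in TB required to hold `usable_tb` of unique data."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return usable_tb / usable_fraction


if __name__ == "__main__":
    dataset_tb = 100.0  # illustrative dataset size, an assumption

    das_fraction = usable_fraction_mirrored(3)  # 3X mirroring from the slides
    isilon_fraction = 0.80                      # utilization figure quoted on the slides

    print(f"3X mirroring: {das_fraction:.0%} usable, "
          f"{raw_tb_needed(dataset_tb, das_fraction):.0f} TB raw for {dataset_tb:.0f} TB of data")
    print(f"Parity (80%): {isilon_fraction:.0%} usable, "
          f"{raw_tb_needed(dataset_tb, isilon_fraction):.0f} TB raw for {dataset_tb:.0f} TB of data")
```

Under these assumptions a 100 TB dataset needs about 300 TB of raw disk with 3X mirroring and about 125 TB at 80% utilization.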

Editor's Notes

  • #2: <Note to speakers: The EMC Isilon presenter will cover the 1st half of the presentation, through slide 24. The EMC Greenplum presenter will cover the 2nd half of the presentation, slides 25–37. Both presenters will participate in the Q+A (with backup from other EMC team members attending the event).> <To kick off the presentation>: Welcome the audience and thank them for joining us. Introduce yourself and the EMC Greenplum presenter.
  • #3: Here’s what we’re going to cover in today’s session: walk through the agenda.
  • #4: Isilon has been a leading innovator in scale-out NAS for more than 10 years. Isilon scale-out storage is being used today across a wide range of organizations. Data-intensive, high performance computing (HPC) environments: Life Sciences, Electronic Design Automation, and Media & Entertainment, to name a few examples. Traditional enterprise IT environments: Isilon’s storage systems are used to support a variety of large-scale use cases including archiving, home directories, and file shares; virtualization (Tier 3 and Tier 4); and business analytics (Hadoop). In total, Isilon’s scale-out storage solutions are being used by over 3,000 organizations around the world today and, thanks to the success that customers have enjoyed, the business is growing rapidly: about 100 percent last year. The key engine of customers’ success is the Isilon OneFS operating system. It is instrumental in providing customers with an innovative, scale-out data environment. Note to Presenter: Here are some additional facts that you may want to point out about Isilon. Isilon was founded more than 10 years ago (as Isilon Systems) and is now recognized as the industry leader in scale-out NAS storage solutions. Isilon joined the EMC team in December 2010 (when EMC acquired Isilon Systems). Since then, Isilon’s scale-out storage solutions business has continued to grow rapidly, being adopted in large enterprises across a wide range of industries. The Gartner report can be found here: http://www.gartner.com/id=1960515 (abstract only)
  • #5: This slide shows just a sampling of customers who are benefiting from Isilon scale-out storage.
  • #6: One reason Hadoop has emerged as an important technology is that it is an innovative Big Data analytics engine designed specifically for massively large data volumes. With it, organizations can greatly reduce the time required to derive valuable insight from an enterprise’s dataset. By adopting Hadoop to store and analyze massive data volumes, enterprises are gaining an agile new platform to deliver new insights and identify new opportunities to accelerate their business. Hadoop has also been designed to tackle analytics for unstructured data. This is significant because this is the dominant area of data growth projected for the foreseeable future. Now let’s look at how the adoption of Hadoop is evolving.
  • #8: The Isilon OneFS operating system provides the intelligence behind all Isilon scale-out storage systems. It combines the three layers of traditional storage architectures (file system, volume manager, and data protection) into one unified software layer, creating a single intelligent file system that spans all nodes within an Isilon cluster. Note to Presenter: Click now in Slide Show mode for animation. OneFS provides a number of important advantages: a single file system for great ease of management; unmatched efficiency with over 80 percent storage utilization, plus automated storage tiering to gain additional efficiencies; high-performance NAS; easy, “grow as you go” flexibility; and linear scalability that lets you scale performance and capacity to over 15 PB.
  • #11: Putting It All Together. The Isilon IQ X-Series, powered by the OneFS® operating system, uses Isilon's scale-out storage architecture to speed access to massive amounts of critical data, while dramatically reducing cost and complexity. Isilon delivers a flexible solution to accelerate your high-concurrency and sequential-throughput applications. With SSD technology for file-system metadata, the Isilon X-Series significantly accelerates namespace-intensive operations. S-Series nodes provide balanced throughput and performance, and the NL nodes form the foundation for nearline and archive. Isilon’s modular architecture and intelligent software make deployment and management simple. You can have an Isilon cluster online in less than 10 minutes, without time-consuming, expensive integration services. Scale a cluster in performance and capacity in about one minute, all within a single pool of storage with a global namespace, eliminating the need to support multiple volumes and file systems. Isilon’s suite of applications then works together to provide the data management and protection capabilities required by corporate IT, from the front-end intelligence that eliminates client and data migration to quota management for file shares. SnapshotIQ and SyncIQ work in concert to protect and replicate important data for local and remote archive, while SmartLock provides for the immutability of data. And finally, Backup Accelerator speeds file replication to tape with a scalable, parallel infrastructure that ensures backup windows and recovery time objectives are always met.
  • #14: In this section, we’re going to identify and describe the key technology challenges of Hadoop, especially when deployed using direct-attached storage (DAS).
  • #15: There are 5 basic roles in every Hadoop environment: HDFS is made up of the NameNode, Secondary NameNode, and DataNode roles. MapReduce comprises the JobTracker and TaskTracker.
  • #16: The JobTracker is effectively the queue master of a Hadoop MapReduce environment. It schedules jobs, distributes tasks across available TaskTrackers, and allows administrators to get a glimpse into the overall activity of a Hadoop environment.
  • #17: To go into more detail, the NameNode is effectively the metadata server for all HDFS data and data blocks. In large Hadoop clusters, this role is run on a dedicated host, typically with a large amount of DRAM. This is because all metadata for the entire HDFS namespace is stored in local DRAM on this host. As such, traditional Hadoop architectures have limitations on the number of objects that can be stored within each HDFS namespace. The NameNode is contacted for every block request, both for reads and writes, and is responsible for making sure data blocks are mirrored to multiple DataNodes, spanning multiple racks.
  • #18: One challenge associated with traditional deployments of Hadoop is that they have largely been done on a dedicated infrastructure, not integrated with or connected to any other applications: in effect, a siloed environment, often outside the realm of the IT team. This poses a number of inefficiencies and risks. <click> A well-recognized issue with traditional Hadoop deployments is the “single-point-of-failure” problem with the Hadoop NameNode. In a Hadoop environment, a single NameNode manages the Hadoop file system. If it goes down, the Hadoop environment will immediately go offline. If the NameNode does not come back online, the data stored within all of HDFS is lost and cannot be reconstructed. <Click to next build slide>
  • #19: Another issue with traditional Hadoop environments is the lack of enterprise-level data protection. Typical Hadoop deployments do not have rigorous data protection, backup, and recovery capabilities such as snapshots or data replication for disaster recovery (DR) purposes. <click> Traditional Hadoop deployments on direct-attached storage (DAS) are also extremely inefficient. It’s not unusual for a DAS environment to operate with a 30-35% storage utilization rate (or less). Compounding this inefficiency is the fact that data is often mirrored (the default is 3 times). In addition to storage inefficiency, this type of infrastructure is very management-intensive. <click> Another issue with Hadoop running on direct-attached storage is that “server” and “storage” resources must be increased together in lock-step. For example, if more storage resources are required, a new server must be deployed (and vice versa). This rigidity adds further inefficiencies. Another issue is the manual import/export of data that is required in a traditional Hadoop environment. In addition to being time- and resource- (bandwidth-) consuming, the Hadoop data in typical environments cannot be accessed or shared with other enterprise applications due to the lack of industry-standard protocol support. To address these challenges and to enable enterprises to begin realizing the benefits of Hadoop quickly and easily, EMC has recently introduced an exciting new Hadoop solution. <click to advance to next slide>
  • #20: Isilon is able to “pretend” to be an HDFS cluster: it mimics the NameNode and DataNode protocols to host data. The underlying system is OneFS and does not follow the traditional HDFS scheme. Point HDFS clients (MapReduce, command line, etc.) to the DNS name of the Isilon cluster.
  • #24: The new EMC solution also eliminates the “single-point-of-failure” issue. We do this by enabling all nodes in an EMC Isilon storage cluster to become, in effect, NameNodes. This greatly improves the resiliency of your Hadoop environment. The EMC solution for Hadoop also provides reliable, end-to-end data protection for Hadoop data, including snapshotting for backup and recovery and data replication (with SyncIQ) for disaster recovery. Our new Hadoop solution also takes advantage of the outstanding efficiency of EMC Isilon storage systems. With our solutions, customers can achieve 80% or more storage utilization. EMC Hadoop solutions can also scale easily and independently. This means that if you need to add more storage capacity, you don’t need to add another server (and vice versa). With EMC Isilon, you also get the added benefit of linear increases in performance as the scale increases. EMC also recently announced that we are the 1st vendor to integrate HDFS (the Hadoop Distributed File System) into our storage solutions. This means that with EMC Isilon storage, you can readily use your Hadoop data with other enterprise applications and workloads while eliminating the need to manually move data around as you would with direct-attached storage.
  • #28: Math logic on 28 hours: 100 TB = 100,000,000 MB. A 10 Gb/s link can transfer approx. 1 GB per second (not including spindle speeds in the calculation). So 100 TB ÷ 1 GB/s = 100,000 seconds; dividing by 60 seconds and then by 60 minutes gives about 28 hours.
  • #34: Customer Profile: http://www.emc.com/collateral/customer-profiles/h11528-return-path-cp.pdf Company background: www.returnpath.com. Return Path is the worldwide leader in email intelligence, serving Internet service providers (ISPs), businesses, and individuals. The company’s email intelligence solutions process and analyze massive volumes of data to maximize email performance, ensure email delivery, and protect users from spam and other abuse. Previous Environment & Existing Applications: previously a hodge-podge of more than 25 different storage systems, including server-attached storage, shared Oracle appliances, as well as NetApp and Hewlett-Packard systems. Company Challenges: data growing 25–50 terabytes per year; limited performance and capacity to support intensive Hadoop analytics; disparate systems lacked performance and capacity. EMC Solution & Important Benefits to Customer: EMC Isilon X-Series; Hadoop, internally developed email intelligence solutions; SmartPools, SmartConnect, SmartQuotas, InsightIQ. Results: enables unconstrained access to email data for analysis; reduces shared storage data center footprint by 30 percent; improves availability and reliability for Hadoop analytics; achieves faster development and time to market of new products; estimates five-year cost savings of $350,000 from lower power, cooling, and maintenance; shortens weekly administration time by more than 35 percent. Quotes: “Isilon serves NFS data across multiple product suites and makes it easily accessible to our Hadoop analytics team. That’s a significant business enabler, allowing Return Path to develop customer solutions much faster.” Diz Carter, Vice President of Infrastructure Operations, Return Path. “Considering our projected growth, we were able to make a strong business case for Isilon,” says Carter.
“Looking out over five years, we estimate greater than $350,000 in savings from lower power, cooling, and maintenance requirements.” “We went from having boxes on the dock to serving up 180 terabytes in just over three hours,” says Carter. “I’ve never come across another solution as easy to implement as Isilon.”
  • #35: With Isilon, Return Path now has a single repository for all its Big Data, accessible to email analysts, product development teams, and external customers. Previously, performing analytics on email data residing in shared storage required making a separate copy of the data set and manually moving it to the Hadoop environment. Today, Isilon delivers real-time data to Return Path’s end-user applications while providing seamless integration with Hadoop for back-end data analytics, boosting customer satisfaction and business productivity. “To have all this data being generated by our email intelligence products, but no way to access it directly by Hadoop, was a major hindrance,” Carter remarks. “Now, Isilon serves NFS data across multiple product suites and makes it easily accessible to our Hadoop analytics team. That’s a huge business enabler because we're able to develop products much faster.” Pam, please add a placeholder for time savings from the old process of manually creating multiple copies to now with Isilon.
  • #39: Thank you
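
The back-of-envelope arithmetic in note #28 (copying 100 TB over a 10 Gb/s link) can be reproduced in a few lines. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, decimal units are used (1 TB = 1,000 GB), and the 80% link efficiency is an assumption chosen only so that a 10 Gb/s link yields the note's "approx. 1 GB per second"; disk (spindle) speed is ignored, as in the note.

```python
def link_throughput_gb_per_s(link_gbit_per_s: float, efficiency: float = 0.8) -> float:
    """Approximate payload throughput in GB/s for a link rated in Gb/s.

    `efficiency` is an assumed allowance for protocol overhead;
    it is not a measured value.
    """
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    return link_gbit_per_s / 8.0 * efficiency


def transfer_hours(data_tb: float, throughput_gb_per_s: float) -> float:
    """Hours needed to move `data_tb` terabytes at a sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    data_gb = data_tb * 1000.0          # decimal units: 1 TB = 1,000 GB
    seconds = data_gb / throughput_gb_per_s
    return seconds / 3600.0             # 60 seconds x 60 minutes


if __name__ == "__main__":
    throughput = link_throughput_gb_per_s(10.0)   # roughly 1 GB/s under the assumption above
    hours = transfer_hours(100.0, throughput)
    print(f"Copying 100 TB at {throughput:.2f} GB/s takes about {hours:.1f} hours")
```

With these inputs the result is about 27.8 hours, in line with the note's "28 hours (ish)" and the slide's ">24 hours".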