LET THE TRANSFORMATION BEGIN
The Active/Active Data Lake with Elastic Cloud Storage
© Copyright 2016 Dell Inc.
Multiple Hadoop Clusters - Challenges
[Diagram: three separate Hadoop clusters, each with its own MR/Hive/Pig, YARN, and HDFS stack]
ACTIVE/ACTIVE GLOBAL HADOOP STORAGE
But we use DistCp…
• Active/Passive access
• Very high disk usage (3x replication on each site)
• Periodic transfers (once every few hours)
• Must carefully update Hive Metastore manually at each site
• Hard to build for 3 or more clusters
• Consumes YARN CPU/memory
• No Hive concurrency controls on target cluster
Active/Active Hadoop with ECS
Global Namespace For Active/Active Hadoop
• Storage efficiency
• Geo caching
• Strong consistency
• Active/active with failover
[Diagram: sites in Denver, Beijing, and Paris]
Active/Active Hive with ECS: Solution Overview
• 2 to 8 sites
• A shared, common:
– Hadoop-Compatible File System – global namespace, readable and writable from all sites
– Hive Metastore DB
• Strong consistency
• Asynchronous replication (low latency updates)
• Fully recoverable from the failure of a single site and 4 drives in each site
• Very high storage efficiency (4.5 times better than HDFS with 3 sites)
• Hive concurrency and ACID transactions (insert, update, delete), even across sites
Active/Active Hive with ECS: Architecture
[Diagram: three sites (Site 1, Site 2, Site 3). Each site has four ECS Nodes, a Hadoop Master, a Metastore DB, and three Node Managers. ECS replication between sites is asynchronous; Metastore DB replication between sites is synchronous.]
Hadoop installations at different sites are independent except for the Hive Metastore DB and the common file system provided by ECS.
Active/Active Hive Demonstration
hive> create table demotab1…
hive> insert into demotab1 partition (site=1)
values (11, 'ant'), (12, 'bear');
Site 1 Site 2
hive> select * from demotab1;
11 ant 1
12 bear 1
hive> insert into demotab1 partition (site=2)
values (21, 'cat'), (22,'dog');
hive> select * from demotab1;
11 ant 1
12 bear 1
21 cat 2
22 dog 2
hive> delete from demotab1 where site=2 and id=21;
hive> select * from demotab1;
11 ant 1
12 bear 1
22 dog 2
hive> create table demotab1
(
id int,
s string
)
partitioned by (site int)
clustered by (id) into 4 buckets
stored as orc tblproperties ('transactional'='true');
Use Case: Enterprise Data Warehouse Offload
• Tables can be exported from an EDW to Hive on ECS with Apache Sqoop or similar tools
• Data will be efficiently distributed and protected across multiple sites
• If desired, delete exported records from the EDW
• Data can be queried in place using Hive SQL from any site
Storage Efficiency Comparison
Number of Sites    2     3     4     5     6     7     8
ECS Efficiency     38%   50%   56%   60%   63%   64%   66%
HDFS Efficiency    17%   11%   8%    7%    6%    5%    4%
Storage efficiency is the effective % of raw disk bytes that are usable by your data
4.5x Better Efficiency!
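
The percentages in this comparison can be approximated with a short calculation. The sketch below is not taken from the slides. It assumes that HDFS keeps a full 3x-replicated copy at every site, and that ECS combines 12+4 erasure coding within each site with a geo-protection overhead of N/(N-1) across N sites. Both ECS parameters are assumptions chosen because they agree with the chart to within rounding, and the function names are hypothetical.

```python
# Sketch of the storage-efficiency arithmetic behind the comparison.
# Every parameter below is an assumption, not a value stated on the slide.

HDFS_REPLICATION = 3        # assumed default HDFS replication count at each site
EC_DATA_FRAGMENTS = 12      # assumed ECS erasure-coding data fragments per chunk
EC_CODING_FRAGMENTS = 4     # assumed ECS erasure-coding parity fragments per chunk


def hdfs_efficiency(sites: int) -> float:
    """Usable fraction of raw disk when every site holds a full replicated copy."""
    return 1.0 / (HDFS_REPLICATION * sites)


def ecs_efficiency(sites: int) -> float:
    """Usable fraction of raw disk under the assumed ECS protection scheme."""
    # Fraction of local raw capacity left after erasure coding (12 of 16 fragments).
    local = EC_DATA_FRAGMENTS / (EC_DATA_FRAGMENTS + EC_CODING_FRAGMENTS)
    # Assumed geo protection: the equivalent of one site's capacity is redundancy.
    geo = (sites - 1) / sites
    return local * geo


if __name__ == "__main__":
    for n in range(2, 9):
        ratio = ecs_efficiency(n) / hdfs_efficiency(n)
        print(f"{n} sites: ECS {ecs_efficiency(n):.1%}, "
              f"HDFS {hdfs_efficiency(n):.1%}, ratio {ratio:.2f}x")
```

Under these assumptions the ratio for three sites is 0.5 divided by 1/9, which is 4.5 and matches the headline figure. A different HDFS replication count or a different ECS protection scheme would change every number.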

Editor's Notes

  • #4: Hadoop customers who run multiple clusters across geographies face many challenges today. Their environments contain a lot of isolated “DAS clusters”, so utilization and efficiency are very low, which drives up the overall TCO. Moving data from one cluster to another is hard and time-consuming. Customers use DistCp, but it is not ideal: DistCp provides only active/passive access and consumes a lot of resources (CPU and memory). It is also very difficult to keep the data strongly consistent between the clusters, which is a big problem because users in different locations will see different versions of the same data.
  • #5: List as many cons of DistCp as possible.
  • #6: EMC Elastic Cloud Storage (ECS) solves those problems with its unique and industry-leading active/active Hadoop solution. With ECS, data, metadata, and indexes are replicated across multiple geographic sites, and the same bucket can be accessed from both sites simultaneously, so Hadoop compute running at both sites can access the same data. Moreover, ECS presents a single global namespace, which means that any data can be accessed from anywhere.
  • #9: Updates also work. If queries overlap with inserts, etc., Hive concurrency controls (locks) work across the sites to ensure that consistent views are maintained. Locks are maintained in the shared Metastore DB.
  • #11: Assumes that the default HDFS replication count of 3 is used at all sites.