1
DATA PROTECTION FOR
HADOOP ENVIRONMENTS
PETER MARELAS
PRINCIPAL SYSTEMS ENGINEER
DATA PROTECTION SOLUTIONS
EMC
2
• How to protect Data in Hadoop environments?
• Do we need Data Protection for Hadoop?
• What motivates people to question whether they need
to protect Hadoop?
HOW DID I GET HERE?
3
• Major backup vendors don’t have solutions
• Hadoop size and scale is a challenge
• Hadoop has inbuilt Data Protection properties
WHAT I FOUND
4
Are Hadoop’s inbuilt Data Protection
properties good enough?
QUESTION TO EXPLORE
5
ARCHITECTURE CONSTRAINTS
Traditional Enterprise Application Infrastructure
6
ARCHITECTURE CONSTRAINTS
Enterprise Hadoop Infrastructure
7
Efficient
Server-Centric
Data Protection
for
traditional
Hadoop architecture
8
Are Hadoop’s inbuilt
Data Protection
properties
good enough?
9
• Onboard Data Protection methods
– Built into HDFS
– Captive
• Offboard Data Protection methods
– Getting copies of data out of Hadoop
HADOOP INBUILT DATA PROTECTION
10
ONBOARD DATA PROTECTION
Access Layer
Redundancy
NameNode HA
Redundant
Storage Controllers
Persistence Layer
Redundancy
N-way Replication
RAID/EC Schemes
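As a brief sketch of how the persistence-layer redundancy is controlled (the path below is illustrative), the replication factor has a cluster-wide default in hdfs-site.xml and can be overridden per path from the HDFS client:

    <!-- hdfs-site.xml: default number of block replicas -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    # Per-path override from the client; -w waits for re-replication to finish
    hdfs dfs -setrep -w 3 /data/warehouse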
11
• Proactive Data Protection
• HDFS does not assume data stays correct
• Protects against data corruption
• Verify integrity and repair from replica copies
ONBOARD DATA PROTECTION
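The background scanner does this automatically, but block health can also be inspected on demand; a quick sketch with fsck (the path is illustrative):

    # Report block-level health and replication for a path
    hdfs fsck /data/warehouse -files -blocks

    # List any files in the namespace that currently have corrupt blocks
    hdfs fsck / -list-corruptfileblocks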
12
• HDFS Snapshots
• Read only
• Directory level
• Not consistent at the time of the snapshot
• Consistent only for closed files (beware open files!)
• Data owner controls the snapshot
ONBOARD DATA PROTECTION
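A minimal sketch of the snapshot workflow, assuming an illustrative /data/warehouse directory. An administrator must first mark the directory snapshottable; from then on the data owner controls creation and, per the caveat above, deletion:

    # Admin (superuser): make the directory snapshottable
    hdfs dfsadmin -allowSnapshot /data/warehouse

    # Data owner: create a named, read-only snapshot
    hdfs dfs -createSnapshot /data/warehouse nightly-20150610

    # Snapshots are exposed under the hidden .snapshot directory
    hdfs dfs -ls /data/warehouse/.snapshot

    # The owner can also delete it, which is exactly the risk noted above
    hdfs dfs -deleteSnapshot /data/warehouse nightly-20150610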
13
• HDFS Trash (recycle bin)
• Moves deleted files to user trash bin
• Deleted after predefined time
• Implemented in HDFS client
• Can be overridden by user
• Trash bin can be accessed or moved back
ONBOARD DATA PROTECTION
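A sketch of the configuration behind this and the client-side behaviour it produces (paths and the user name are illustrative):

    <!-- core-site.xml: keep deleted files in trash for 1440 minutes (24h);
         a value of 0 disables trash entirely -->
    <property>
      <name>fs.trash.interval</name>
      <value>1440</value>
    </property>

    # A client-side delete moves the file into the user's trash
    hdfs dfs -rm /data/warehouse/part-00000

    # ...but the user can bypass trash entirely
    hdfs dfs -rm -skipTrash /data/warehouse/part-00000

    # Restore by moving the file back out of trash
    hdfs dfs -mv /user/alice/.Trash/Current/data/warehouse/part-00000 /data/warehouse/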
14
• Distributed Copy
• HDFS, S3, OpenStack Swift, FTP, Azure (2.7.0)
• Single-file copy performance is bound to one DataNode
• 10 TB file @ 1 GbE ≈ 22 hours
OFFBOARD DATA PROTECTION
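The 22-hour figure is simple line-rate arithmetic: 10 TB × 8 bits/byte ÷ 1 Gb/s = 80,000 s ≈ 22 hours, because a single file's copy stream terminates on one DataNode no matter how large the cluster is. Typical invocations, as a sketch (cluster addresses and the bucket name are assumptions):

    # Cluster-to-cluster copy; -update makes repeat runs incremental
    hadoop distcp -update hdfs://nn1:8020/data/warehouse hdfs://nn2:8020/backup/warehouse

    # Copy out to an object store
    hadoop distcp /data/warehouse s3a://backup-bucket/warehouse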
15
To answer the question..
is Hadoop's inbuilt data
protection good enough?
..we need to understand
what we are protecting
against..
16
DATA LOSS EVENT MATRIX
17
There is no such thing as software
that does not unexpectedly fail
18
In 2009 Hortonworks examined
HDFS’s data integrity at Yahoo!
HDFS lost 650 blocks out of
329 million blocks on 10 clusters
with 20,000 nodes
85% due to software bugs
15% due to a single block replica (operator error)
19
Condition that causes
blocks to be lost
HDFS-5042
20
HDFS now supports truncate()
No longer immutable
or write-once
HDFS-3107
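For illustration only (the path and length are arbitrary), truncate is exposed through the HDFS shell as of 2.7:

    # Shrink an existing file in place to 1024 bytes; -w waits for completion
    hdfs dfs -truncate -w 1024 /data/warehouse/events.log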
21
Plan for software failures..
THE MORAL OF THE STORY
Plan for human failures..
22
Not all data is equal..
Protect what is valuable..
Protect what can’t be derived
in a reasonable timeframe..
THE MORAL OF THE STORY
23
DATA PROTECTION GUIDING PRINCIPLES
24
Diversify
Loose Coupling
DATA PROTECTION GUIDING PRINCIPLES
25
Logical Isolation
Physical Isolation
Separation of Concerns
DATA PROTECTION GUIDING PRINCIPLES
26
Frequently Verified
DATA PROTECTION GUIDING PRINCIPLES
27
DATA PROTECTION STRATEGY

Data Value   Protection Method (cumulative, top to bottom)
Desirable    HDFS Trash
Necessary    + HDFS Snapshot
Essential    + Copy/Repl (HDFS->HDFS)
Critical     + Versioned Copies (HDFS->Other)
28
LIVE DEMO
Hadoop Data Protection
~
Scalable Versioned Copies
~
Data Domain Protection Storage
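The demo itself is not reproduced here, but a hedged sketch of the general shape of the approach, a snapshot for consistency followed by an incremental distributed copy to external protection storage, might look like this. The Data Domain NFS mount point and all paths are assumptions, and the demoed integration does more than these two commands:

    # 1. Snapshot first so the copy is taken from a consistent image
    hdfs dfs -createSnapshot /data/warehouse s20150610

    # 2. Incrementally copy the snapshot to protection storage; the file://
    #    target assumes the Data Domain share is NFS-mounted at the same
    #    path on every node, and -skipcrccheck is needed because source and
    #    target checksum types differ
    hadoop distcp -update -skipcrccheck \
        /data/warehouse/.snapshot/s20150610 \
        file:///mnt/datadomain/warehouse/current

Turning each incremental pass into an independent, versioned recovery point is the job of the protection storage (deduplication and fast clones), which is product-specific and not shown.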
29
(b) www.beebotech.com.au
(t) @pmarelas
(e) peter.marelas@emc.com
THANK YOU
Editor's Notes
  • #2: Welcome. I'm Peter Marelas, Principal Systems Engineer for EMC's Data Protection Solutions division. Today we will be learning about data protection for Hadoop environments.
  • #3: Share the story of how I got here; I didn't know Hadoop 12 months ago. My day job involves architecting solutions for enterprise customers, most of the time for mission-critical workloads like EDW, CRM, and ERP. More recently, customers have been asking us how to protect data in Hadoop environments, and some have even asked whether they need to protect Hadoop environments at all. I didn't have all the answers.
  • #4: I spent about a month researching Hadoop data protection. Here is what I found: no major backup vendor had a solution for Hadoop; Hadoop's size and scale are so daunting that most customers don't even know where to start; and Hadoop has some interesting inbuilt data protection properties. I figured the first two points we could investigate and probably solve, but I wanted to understand the last point before doing anything else.
  • #5: As part of my research I wanted to answer the question: are Hadoop's inbuilt data protection properties good enough? Before we explore that question, I want to take you through some of the constraints of traditional Hadoop architectures relative to enterprise architectures in the context of data protection.
  • #6: This is a typical enterprise application architecture. Blue boxes are the servers; green boxes are the application storage. There are two ways to create data protection copies. One is to stream data via the app servers to heterogeneous storage (grey boxes); that's what most backup solutions do today for enterprise apps, assuming sufficient time and resources. We call this a server-centric protection strategy. The other option is to use versioned storage replication to create our copies and recovery points. We call this a storage-centric protection strategy.
  • #7: Contrast this with a standard Hadoop architecture, where storage and compute are combined. We cannot use storage-centric methods to protect the data: it is plain disk with no intelligence. So the constraint is that we have to drive the process using a server-centric approach.
  • #8: Given this constraint, another goal was to find an efficient method to protect Hadoop. I am going to demo this approach at the end of the presentation.
  • #9: So let's go back and answer this question.
  • #10: Hadoop has two types of data protection properties, which I have classified into onboard and offboard methods. Onboard is concerned with protecting data without it leaving the cluster; offboard is about getting copies of data out of the cluster.
  • #11: Looking at onboard protection first: Hadoop provides redundancy at the data access layer using a highly available NameNode, which is like having redundant storage controllers in a storage system. At the persistence layer, HDFS implements N-way replication across nodes and racks, which is equivalent to a RAID scheme in a storage system.
  • #12: Hadoop also provides proactive data protection. HDFS does not trust disk storage; it assumes disks will degrade and return the wrong data. To protect against this it generates checksums, regularly verifies them, and repairs corruption from replica copies.
  • #13: HDFS also supports read-only snapshots. There are two caveats. First, they do not behave like storage system snapshots: storage snapshots are consistent for open and closed files, while HDFS snapshots are consistent for closed files only, so if you want consistent recovery points you need to ensure files are closed before taking a snapshot. Second, keep in mind that snapshots can be deleted by data owners.
  • #14: HDFS has a trash feature that operates like a recycle bin. Files move into trash once deleted and are then removed after a predetermined time. Keep in mind that trash is implemented in the HDFS client, can be emptied at any time by the file owner, and can be overridden by the file owner. If you're deleting files some other way, there is no trash.
  • #15: Those were the onboard data protection properties that come with Hadoop. Offboard data protection is provided by Hadoop distributed copy (distcp), which lets you create copies of files to various targets: HDFS, S3, OpenStack Swift, FTP, Azure, etc. Distributed copy is great because it distributes the work among nodes, so it can scale with your cluster. However, each file copy is mapped to one node, so single-file copy performance is bound by the network performance of one DataNode. Keep this in mind.
  • #16: Now we know what Hadoop provides out of the box with respect to data protection. We need to ask: what are we protecting against, and how do Hadoop's inbuilt methods fare?
  • #17: This is a data loss event matrix I use to assess data protection strategies. On the left are the events that can lead to data loss; to the right is the rating; then the features and properties applicable to the event; and at the far right, concerns about those features relative to the event. My conclusion: Hadoop fares well when it comes to data corruption, component failures, and infrastructure software failures (firmware). It carries risk when it comes to operational failures, site failure, user accidents, application software failures, malicious user events, and malware.
  • #18: I am a big believer that software is not immune to failure. Some examples follow.
  • #19: Data integrity study at Yahoo!: 650 blocks lost out of 329 million. That's a phenomenal achievement, but look at the causes: 85% due to software bugs, 15% due to files with a single block replica (operator error). The last one is interesting. What I found is that it's difficult to enforce data protection standards in Hadoop: you can set a default, but data owners can define their own and change them retrospectively.
  • #20: Although very rare, I did find one known open condition in the Apache codebase that can cause blocks to be lost.
  • #21: Another thing to keep in mind: as of the 2.7 release, HDFS supports truncate operations. In the past we assumed immutability equals protection; that assumption is no longer valid.
  • #22: Moral of the story: plan for software failures, and plan for human failures.
  • #23: But be sensible in your approach. Not all data is equal: only protect the data that is valuable, and only protect what can't be derived again in a reasonable timeframe.
  • #24: Here are my guiding principles for data protection strategies.
  • #25: Diversify your protection copies. The analogy here is investments: we hedge our risk by spreading investments across asset classes, and we should do the same with data copies. Message: keep your protection copies on a system that is diverse and different from the source system.
  • #26: We want to maintain logical and physical isolation so that problems that impact the source system do not propagate to the target system. For this to succeed we need separation of concerns. Rule: the system protecting the source data should not trust the source system.
  • #27: We want our protection copies to be frequently verified. We don't want to assume data is written correctly or stays correct; we need to verify regularly, in an automated way.
  • #28: Here is a sample strategy that aligns the protection method with the value of the data. Data that is desirable: protect it with HDFS trash only. Data that is necessary: add snapshots. Data that is essential: add distributed copy to another HDFS target. Data that is critical: all of the above plus versioned copies to a diverse and different storage target.
  • #29: Demo of how you can use Hadoop distributed copy to create versioned copies on Data Domain, which is diverse and different from a Hadoop cluster. Data Domain is our protection storage platform. It has a few unique properties: inline deduplication, strong data integrity properties, and very fast ingest of streaming data, which suits distributed copy. What's unique about this approach is that we use distributed copy with an efficient incremental-forever technique to maintain versioned copies.