Security Threats to Hadoop: Data Leakage Attacks
and Investigation
Presented By
KIRAN GAJBHIYE
Contents
• Introduction
• Hadoop security
• Hadoop Components
• Data Leakage Attack
• Investigating Data Leakage
• Disadvantages of Data Leakage
• Overcoming the Threat
• Conclusion
Introduction
• Hadoop is an open-source framework for storing and processing big data
• It is one of the most popular platforms for big data storage and analysis,
used in, e.g., manufacturing, healthcare, insurance, and retail
• Hadoop consists of a set of core components:
• HDFS, a distributed file system that replicates data to reduce data loss
• YARN for resource management and application scheduling
• MapReduce for data processing
• Key strengths: powerful processing capacity, huge storage capacity, scalability
Hadoop Security
• Hadoop contains sensitive data
As Hadoop adoption grows, so do the types of data organizations look to store.
Often the data is proprietary or personal, and it must be protected
• Add Kerberos-based authentication to servers
– Establishes identity for clients, hosts, and services
– Prevents impersonation; passwords are never sent over the wire
• Data stored in encrypted form
• MySQL is often used
Hadoop Components
➢ MapReduce (version 1.2.1)
• Scheduling
• MapReduce is a framework for writing applications that process
large amounts of data on clusters of commodity hardware
• The MapReduce framework divides an input file into many chunks; a
mapper for each chunk reads the data, performs computations, and emits
its output as key/value pairs
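The chunk-per-mapper flow described above can be illustrated with a minimal, single-process sketch. It is not Hadoop's API: the function names (`split_into_chunks`, `mapper`, `reducer`, `word_count`) and the word-count task are assumptions chosen only to show how chunks turn into key/value pairs.

```python
from collections import defaultdict


def split_into_chunks(lines, chunk_size):
    """Divide the input into fixed-size chunks (a stand-in for input splits)."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def mapper(chunk):
    """Read one chunk and emit (word, 1) key/value pairs."""
    for line in chunk:
        for word in line.split():
            yield (word.lower(), 1)


def reducer(pairs):
    """Sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def word_count(lines, chunk_size=2):
    """Run one mapper per chunk, then reduce all emitted pairs."""
    pairs = []
    for chunk in split_into_chunks(lines, chunk_size):
        pairs.extend(mapper(chunk))
    return reducer(pairs)


if __name__ == "__main__":
    print(word_count(["big data", "big cluster", "data data"]))
```

In a real cluster the mappers run in parallel on different nodes and the framework shuffles pairs to reducers by key; here everything runs sequentially in one process for clarity.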
Hadoop Components (continued)
➢ HDFS (Hadoop Distributed File System)
• HDFS is a portable file system written in Java that provides scalable,
reliable, distributed storage within the Hadoop framework
• It stores large data sets and copes with hardware failure
• NameNode
An HDFS cluster consists of a single NameNode, a master server that manages the
file system namespace and regulates access to files by clients
• DataNode
Responsible for serving read and write requests from file system clients
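The division of labor between the NameNode (metadata) and the DataNodes (block reads and writes) can be sketched as a toy in-memory model. Everything here is hypothetical: the class and function names, the round-robin placement, and the tiny block size are assumptions for illustration, and replication is omitted.

```python
class DataNode:
    """Stores block contents and serves read/write requests."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanodes, block_size=4):
        self.datanodes = datanodes
        self.block_size = block_size
        self.namespace = {}  # path -> list of (block_id, DataNode)
        self._next_block_id = 0

    def allocate(self, path, size):
        """Assign block ids and DataNodes (round-robin) for a new file."""
        num_blocks = (size + self.block_size - 1) // self.block_size
        placements = []
        for _ in range(num_blocks):
            block_id = self._next_block_id
            self._next_block_id += 1
            node = self.datanodes[block_id % len(self.datanodes)]
            placements.append((block_id, node))
        self.namespace[path] = placements
        return placements

    def locate(self, path):
        """Return the block placements recorded for a file."""
        return self.namespace[path]


def put(namenode, path, data):
    """Client write: ask the NameNode for placements, then write to DataNodes."""
    size = namenode.block_size
    for i, (block_id, node) in enumerate(namenode.allocate(path, len(data))):
        node.write_block(block_id, data[i * size:(i + 1) * size])


def get(namenode, path):
    """Client read: look up placements, then read each block from its DataNode."""
    return b"".join(node.read_block(bid) for bid, node in namenode.locate(path))
```

The point of the sketch is that file contents never pass through the NameNode; the client only consults it for block locations and then talks to the DataNodes directly.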
Fig 1 Architecture of HDFS (Hadoop Distributed File System)
Data Leakage Attack
➢ Application-layer data leakage
Attackers can obtain private data by exploiting an application-layer vulnerability
or by using malware
Ex.: a weakness of the current Hadoop audit mechanism is that it records only
the operation type, time, and content, but not who performed the
operation
Data Leakage Attack (continued)
➢ Operating-system-layer data leakage
• If attackers have permission to log in to the host operating
system of a Hadoop node,
• they can bypass Hadoop's monitoring and steal data directly
at the OS layer
Investigating Data Leakage
• An investigation framework aimed at data leakage attacks in Hadoop has been
proposed
• The framework is composed of many data collectors and one data analyzer
• Each data collector is located in the kernel of the host operating system on a
Hadoop node
• It actively monitors accesses to important data on its node and transmits
these behavior logs to the data analyzer
• The data it collects includes Hadoop logs, the Fsimage file, logs from the
framework's own monitor (HProgger), and images or logs of files, processes,
networks, and the system
Fig 2 Data Collector
Fig 3 Data Analyzer
Investigating Data Leakage (continued)
➢ For OS-layer data leakage attacks, the detection algorithm works in
four dimensions:
• Abnormal directory
If the collected logs contain any directory outside the expected range, an attack
may have happened.
• Abnormal user
If any user other than the Hadoop superuser appears in the HProgger logs, an
attack may have happened.
Investigating Data Leakage (continued)
• Abnormal operation
To steal a block from the physical machine, an attacker needs to copy or move it
to another directory and may also rename the block, so if any of these
operations is found, an attack may have happened.
• Block proportion
SusBlockNum(FileID) is the number of suspicious blocks in a file dataset, and
AllBlockNumber(FileID) is the total number of blocks in that file dataset.
The block proportion (BP) is defined as
BP(FileID) = SusBlockNum(FileID) / AllBlockNumber(FileID).
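The four dimensions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the framework's actual algorithm: the `LogRecord` fields, the allowed-directory list, the superuser name, the operation names, and the rule that a block counts as suspicious when any record touching it trips at least one of the first three checks are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Assumed operation names; the slides mention copy, move, and rename.
SUSPICIOUS_OPERATIONS = {"copy", "move", "rename"}


@dataclass(frozen=True)
class LogRecord:
    """One hypothetical monitor log entry."""
    user: str
    operation: str
    path: str
    block_id: str


def in_allowed_dirs(path, allowed_dirs):
    """True if path is one of the allowed directories or lies beneath one."""
    p = PurePosixPath(path)
    ancestors = (p, *p.parents)
    return any(PurePosixPath(d) in ancestors for d in allowed_dirs)


def flag_record(record, allowed_dirs, super_user):
    """Return the dimensions (if any) in which a record looks abnormal."""
    reasons = []
    if not in_allowed_dirs(record.path, allowed_dirs):
        reasons.append("abnormal directory")
    if record.user != super_user:
        reasons.append("abnormal user")
    if record.operation in SUSPICIOUS_OPERATIONS:
        reasons.append("abnormal operation")
    return reasons


def block_proportion(records, file_blocks, allowed_dirs, super_user):
    """BP = SusBlockNum(FileID) / AllBlockNumber(FileID) for one file."""
    if not file_blocks:
        return 0.0
    suspicious = {
        r.block_id
        for r in records
        if r.block_id in file_blocks and flag_record(r, allowed_dirs, super_user)
    }
    return len(suspicious) / len(file_blocks)
```

For example, with `allowed_dirs=["/data/hdfs/dn"]` and `super_user="hdfs"`, a record of user `mallory` copying `blk_2` to `/tmp/blk_2` is flagged in all three per-record dimensions, and if the file has four blocks in total its BP is 1/4.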
Disadvantages of Data Leakage
• Loss of important and confidential documents
• Compromise of the whole system
Overcoming the Threat
• Transport-level encryption
• Kerberos for strong authentication
• Apache Knox for perimeter security
• Data-centric security
• Data-at-Rest Encryption
• Data-in-Motion Encryption
Conclusion
The framework includes an on-demand data collection method and an automatic analysis
method for data leakage attacks in Hadoop. It collects data from the machines in the
Hadoop cluster onto a forensic server and then analyzes it.
With the automatic detection algorithm, it can determine whether suspicious
data leakage behaviors exist and give warnings and evidence to users. The
collected evidence can be used to find the attackers and reconstruct the attack
scenarios.
References
• D. Das and O. O’Malley, “Adding Security to Apache Hadoop,” Hortonworks
Technical Report, 2011.
• S. Park and Y. Lee, “Secure Hadoop with Encrypted HDFS,” GPC 2013, Seoul,
Korea, 2013, May 9–11.
• R. K. L. Ko and M. A. Will, “Progger: An Efficient, Tamper-Evident Kernel-Space
Logger for Cloud Data Provenance Tracking,” CLOUD 2014, Anchorage,
AK, 2014, June 27–July 2.
THANK YOU !!!

