Hadoop storage
M. Sandhiya, M.Sc. (IT)
Department of CS&IT
Nadar Saraswathi College of Arts and Science
Theni
Apache Hadoop
• Open source software framework designed for
storage and processing of large scale data on
clusters of commodity hardware
• Created by Doug Cutting and Mike Cafarella in
2005.
• Cutting named the program after his son’s toy
elephant.
Uses for Hadoop
• Data-intensive text processing
• Assembly of large genomes
• Graph mining
• Machine learning and data mining
• Large scale social network analysis
Overview
• HDFS is responsible for storing data on the cluster
• Data files are split into blocks and distributed
across the nodes in the cluster
• Each block is replicated multiple times
HDFS Basic Concepts
• HDFS is a file system written in Java, based on
Google's GFS (the Google File System)
• Provides redundant storage for massive
amounts of data
How are Files Stored
• Files are split into blocks
• Blocks are distributed across many machines at load
time
– Different blocks from the same file will be stored
on different machines
• Blocks are replicated across multiple machines
• The NameNode keeps track of which blocks
make up a file and where they are stored
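
To make the block and replica bookkeeping concrete, here is a minimal Java sketch using the standard Hadoop FileSystem API: it writes a small file to a hypothetical path and then asks the NameNode which blocks make up the file and on which DataNodes their replicas live. The path and payload are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/events.txt");     // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("example payload");                // the file is split into blocks as it is written
        }

        // Ask the NameNode for the block map of the file: one entry per block,
        // each listing the DataNodes that hold a replica of that block.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " length=" + loc.getLength()
                    + " replicas on: " + String.join(",", loc.getHosts()));
        }
    }
}
```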
Storage efficiency
• Storage efficiency – with Parquet or Kudu and Snappy compression, the total volume of the data can be reduced by a factor of 10 compared to an uncompressed simple serialization format.
• Data ingestion speed – all tested file-based solutions provide faster ingestion rates (between 2x and 10x) than specialized storage engines or MapFiles (sorted sequence files).
• Random data access time – using HBase or Kudu, typical random data lookup time is below 500 ms. With smart HDFS namespace partitioning, Parquet can deliver random lookups on the level of a second, but consumes more resources.
• Data analytics – with Parquet or Kudu it is possible to perform fast and scalable data aggregation, filtering and reporting (typically more than 300k records per second per CPU core).
• Support of in-place data mutation – HBase and Kudu can modify records (schema and values) in place, which is not possible with data stored directly in HDFS files.
Approaches for Core Storage
• The data access and ingestion tests were performed on a cluster composed of 14 physical machines, each equipped with 2 CPUs with 8 physical cores and a clock speed of 2.60 GHz, 64 GB of RAM, and 48 SAS drives of 4 TB each.
• Hadoop was installed from the Cloudera Data Hub (CDH) distribution, version 5.7.0, which includes Hadoop core 2.6.0, Impala 2.5.0, Hive 1.1.0, HBase 1.2.0 (configured JVM heap size for region servers = 30 GB) and Kudu 1.0 (configured memory limit = 30 GB).
• Apache Impala (incubating) was used as the data ingestion and data access framework in all the tests presented later in this report.
Evaluated formats and technologies
• Apache Avro is a data serialization standard for a compact binary format, widely used for storing persistent data in HDFS as well as for communication protocols. One of the advantages of using Avro is its lightweight and fast data serialization and deserialization, which can deliver very good ingestion performance.
• Even though Avro does not have any internal index (as MapFiles do), the HDFS directory-based partitioning technique can be applied to quickly navigate to the collections of interest when fast random data access is needed. In the tests, a tuple of runnumber, project and streamname was used as the partitioning key. This gave a good balance between the number of partitions (a few thousand) and the average partition size (hundreds of megabytes).
• Two compression algorithms supported by Apache Avro were used in the tests: Snappy and DEFLATE (see the write sketch below).
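
As an illustration of Avro's lightweight serialization and its Snappy support, the following is a minimal Java sketch that writes one record to an Avro container file with the Snappy codec. The schema, field names and values are hypothetical stand-ins, not the actual ATLAS EventIndex schema, and snappy-java must be on the classpath.

```java
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import java.io.File;

public class AvroWriteDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative record schema (not the real EventIndex schema)
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"runnumber\",\"type\":\"long\"},"
          + "{\"name\":\"project\",\"type\":\"string\"},"
          + "{\"name\":\"streamname\",\"type\":\"string\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("runnumber", 1234L);
        rec.put("project", "projectA");
        rec.put("streamname", "streamA");

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.snappyCodec());    // block-level Snappy compression
            writer.create(schema, new File("events.avro")); // could also target an HDFS output stream
            writer.append(rec);
        }
    }
}
```

The same DataFileWriter can be pointed at an HDFS output stream instead of a local file when writing directly into a partition directory.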
Apache Parquet
• Apache Parquet is a columnar storage format: values are pre-encoded (with techniques such as Dictionary and Bit Packing), and the compression applied to series of values from the same column gives very good compaction ratios.
• When storing data in HDFS in Parquet format, the same partitioning strategy was used as in the Avro case. Two compression algorithms supported by Apache Parquet were used to compress the data (see the sketch below).
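
A comparable sketch for Parquet, using the parquet-avro bridge so that the same kind of Avro schema can be reused; the output path, schema and Snappy compression choice are illustrative assumptions, not the exact configuration used in the tests.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteDemo {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"runnumber\",\"type\":\"long\"},"
          + "{\"name\":\"project\",\"type\":\"string\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("runnumber", 1234L);
        rec.put("project", "projectA");

        // Column chunks are encoded (dictionary, bit packing, ...) and then Snappy-compressed
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("events.parquet"))
                     .withSchema(schema)
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     .build()) {
            writer.write(rec);
        }
    }
}
```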
Ingestion speed
• Measuring record ingestion speed into a single data partition should reflect the performance of writing to the ATLAS EventIndex Core Storage system that can be expected when using different storage techniques. The results of this test are presented in Figure 2.
• In general, it is difficult to make a valid performance comparison between writing data to files and writing data to a storage engine. However, because Apache Impala performs writes into a single HDFS directory (Hive partition) serially, the results obtained for the HDFS formats and for HBase or Kudu can be directly compared for single-partition ingestion efficiency.
• Writing to HDFS files encoded with Avro or Parquet delivered much better results (by at least a factor of 5) than storage engines like HBase and Kudu. Since Avro has the most lightweight encoder, it achieved the best ingestion performance.
• At the other end of the spectrum, HBase was very slow in this test (worse than Kudu). This was most likely caused by the length of the row key (6 concatenated columns), which on average was around 60 bytes. HBase has to encode the key for each of the columns in a row separately, which for long records (with many columns) can be suboptimal (illustrated in the sketch below).
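
The row-key observation above can be made concrete with a small, hypothetical HBase write: the row key is a concatenation of six fields, and because HBase stores the full row key alongside every cell, a long key is paid for once per column written. Table, column-family and qualifier names below are invented for the example.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseIngestDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) {
            // Six concatenated fields -> a long row key (~60 bytes in the tests described above)
            String rowKey = String.join("|",
                "1234", "projectA", "streamA", "prodStepA", "dataTypeA", "0001234");
            Put put = new Put(Bytes.toBytes(rowKey));
            // Each cell written below internally carries the full row key again
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("guid"), Bytes.toBytes("ABCD-0001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("lumiblock"), Bytes.toBytes(17L));
            table.put(put);
        }
    }
}
```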
Random data lookup
• According to the measured results (Figure 3), when accessing data by a record key, Kudu and HBase were the fastest, thanks to their built-in indexing. The values on the plot were measured with cold caches.
• Using Apache Impala for the random lookup test is suboptimal for Kudu and HBase, as a significant amount of time is spent setting up the query (planning, code generation, etc.) before it actually gets executed (a direct client-API lookup is sketched below).
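
For comparison with the Impala-driven test, a direct random lookup through the HBase client API looks roughly like the sketch below: it goes straight through HBase's built-in row-key index, without the query planning and code generation overhead mentioned above. Table, family and qualifier names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookupDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {
            // Single-row lookup by record key: served via HBase's index, no full scan
            Get get = new Get(Bytes.toBytes("1234|projectA|streamA|prodStepA|dataTypeA|0001234"));
            Result result = table.get(get);
            byte[] guid = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("guid"));
            System.out.println(guid == null ? "not found" : Bytes.toString(guid));
        }
    }
}
```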
Schema-less tables
• Apache Avro has proven to be a fast universal encoder for structured data. Thanks to very efficient serialization and deserialization, this format can guarantee very good performance whenever access to all the attributes of a record is required at the same time – data transportation, staging areas, etc.
• On the other hand, Apache HBase delivers very good random data access performance and the greatest flexibility in structuring the stored data (schema-less tables). The performance of batch processing of HBase data depends heavily on the chosen data model and typically cannot compete in this field with the other tested technologies. Therefore any analytics on HBase data should be performed only rarely.
• Notably, compression …
Fault Tolerance
• Events are indexed by event number and run number in an HBase database. In this approach the indexing key resolves to a GUID and pointers to the complete records stored on HDFS (sketched below).
• So far both systems have proven to deliver very good event-picking performance, on the level of tens of milliseconds – two orders of magnitude faster than the original approach using MapFiles alone.
• The only concern when running a hybrid approach, in both cases, is the system size and internal coherence: robust procedures for handling updates to the raw HDFS data sets and propagating them to the indexing databases with low latency have to be maintained and monitored.
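
A hedged sketch of that hybrid lookup path, under assumed table, column and path names: the event key is resolved in an HBase index table to a GUID and an HDFS pointer, and the complete record is then read from HDFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HybridLookupDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table index = conn.getTable(TableName.valueOf("event_index"))) {

            // Step 1: the indexing key (run number + event number) resolves to a GUID
            // and a pointer to the complete record stored on HDFS.
            Result r = index.get(new Get(Bytes.toBytes("1234:000123456")));
            String guid = Bytes.toString(r.getValue(Bytes.toBytes("i"), Bytes.toBytes("guid")));
            String hdfsPath = Bytes.toString(r.getValue(Bytes.toBytes("i"), Bytes.toBytes("path")));

            // Step 2: follow the pointer and read the full record from HDFS.
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataInputStream in = fs.open(new Path(hdfsPath))) {
                System.out.println("GUID " + guid + ", first byte of record: " + in.read());
            }
        }
    }
}
```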
Core Hadoop Concepts
• Applications are written in a high-level
programming language
– No network programming or temporal dependency
• Nodes should communicate as little as possible
– A “shared nothing” architecture
• Data is spread among the machines in advance
– Perform computation where the data is already stored
as often as possible
