Apache Hadoop
Sheetal Sharma
Intern At IBM Innovation Centre
What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for
the distributed processing of large data sets across clusters of
computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than relying on
hardware to deliver high availability, the library itself is designed
to detect and handle failures at the application layer, thereby delivering a
highly available service on top of a cluster of computers, each of
which may be prone to failure.
The project includes these modules:
● Hadoop Common: The common utilities that support the other
Hadoop modules.
● Hadoop Distributed File System (HDFS™): A distributed file
system that provides high-throughput access to application
data.
● Hadoop YARN: A framework for job scheduling and cluster
resource management.
● Hadoop MapReduce: A YARN-based system for parallel
processing of large data sets.
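To make the MapReduce module concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are ordinary Python scripts reading stdin and writing stdout. The file names, HDFS paths, and the streaming jar location in the comments are placeholders, not something from the original slides.

```python
#!/usr/bin/env python
# wordcount_streaming.py -- minimal Hadoop Streaming word count (illustrative sketch).
# A typical launch (paths are placeholders) would look roughly like:
#   hadoop jar hadoop-streaming.jar \
#       -files wordcount_streaming.py \
#       -input /data/in -output /data/out \
#       -mapper "python wordcount_streaming.py map" \
#       -reducer "python wordcount_streaming.py reduce"
import sys

def mapper():
    # Emit "<word>\t1" for every word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so counts for a word arrive contiguously.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```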
Other Hadoop-related projects at Apache
include:
● Ambari™: A web-based tool for provisioning, managing, and monitoring
Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce,
Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides
a dashboard for viewing cluster health (such as heatmaps) and the ability to view
MapReduce, Pig, and Hive applications visually, along with features to diagnose
their performance characteristics in a user-friendly manner.
● Avro™: A data serialization system.
● Cassandra™: A scalable multi-master database with no single points of
failure.
● Chukwa™: A data collection system for managing large distributed
systems.
● HBase™: A scalable, distributed database that supports structured data
storage for large tables.
Other Hadoop-related projects at Apache
include:
● Hive™: A data warehouse infrastructure that provides data summarization and ad hoc
querying.
● Mahout™: A scalable machine learning and data mining library.
● Pig™: A high-level data-flow language and execution framework for parallel
computation.
● Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple
and expressive programming model that supports a wide range of applications, including
ETL, machine learning, stream processing, and graph computation.
● Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which
provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process
data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and
other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g.
ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
● ZooKeeper™: A high-performance coordination service for distributed applications.
Introduction
● The Apache Ambari project is aimed at making Hadoop management simpler by
developing software for provisioning, managing, and monitoring Apache
Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management
web UI backed by its RESTful APIs.
Ambari enables System Administrators to:
● Provision a Hadoop Cluster
Ambari provides a step-by-step wizard for installing Hadoop services across
any number of hosts.
Ambari handles configuration of Hadoop services for the cluster.
● Manage a Hadoop Cluster
Ambari provides central management for starting, stopping, and reconfiguring
Hadoop services across the entire cluster.
● Monitor a Hadoop Cluster
➢ Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
➢ Ambari leverages Ganglia for metrics collection.
➢ Ambari leverages Nagios for system alerting and will send emails when your attention is
needed (e.g., a node goes down, remaining disk space is low, etc.).
● Ambari enables Application Developers and System Integrators to:
➢ Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own
applications with the Ambari REST APIs.
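As a hedged illustration of that last point, the sketch below queries an Ambari server's REST API with Python's requests library; the host name and the default admin/admin credentials are assumptions for the example.

```python
# Illustrative sketch: list the clusters an Ambari server manages over its REST API.
# Host, port, and credentials are placeholders -- adjust for your environment.
import requests

AMBARI = "http://ambari-host.example.com:8080/api/v1"
AUTH = ("admin", "admin")                   # Ambari's default credentials
HEADERS = {"X-Requested-By": "ambari"}      # header Ambari expects on API calls

resp = requests.get(f"{AMBARI}/clusters", auth=AUTH, headers=HEADERS)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["Clusters"]["cluster_name"])
```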
Getting Started with Ambari
● Follow the installation guide for Ambari 1.7.0.
● Note: Ambari currently supports the 64-bit version
of the following Operating Systems:
● RHEL (Red Hat Enterprise Linux) 5 and 6
● CentOS 5 and 6
● OEL (Oracle Enterprise Linux) 5 and 6
● SLES (SuSE Linux Enterprise Server) 11
● Ubuntu 12
Apache Avro
Introduction
● Apache Avro™ is a data serialization system.
Avro provides:
● Rich data structures.
● A compact, fast, binary data format.
● A container file, to store persistent data.
● Remote procedure call (RPC).
● Simple integration with dynamic languages. Code generation is not
required to read or write data files, nor to use or implement RPC
protocols. Code generation is an optional optimization, only worth
implementing for statically typed languages.
Apache Avro
Schemas
● Avro relies on schemas. When Avro data is read, the schema used when writing
it is always present. This permits each datum to be written with no per-value
overheads, making serialization both fast and small. This also facilitates use with
dynamic, scripting languages, since data, together with its schema, is fully self-
describing.
● When Avro data is stored in a file, its schema is stored with it, so that files may
be processed later by any program. If the program reading the data expects a
different schema this can be easily resolved, since both schemas are present.
● When Avro is used in RPC, the client and server exchange schemas in the
connection handshake. (This can be optimized so that, for most calls, no schemas
are actually transmitted.) Since the client and server both have the other's full
schema, correspondence between same-named fields, missing fields, extra fields,
etc. can all be easily resolved.
● Avro schemas are defined with JSON. This facilitates implementation in
languages that already have JSON libraries.
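A brief, hedged sketch of the ideas above, using the avro package for Python (assumed installed via pip): a schema is declared in JSON, written into a container file alongside the data, and read back without any generated code. The record name and fields are invented for the example, and depending on the package version the parse helper may be spelled parse or Parse.

```python
# Illustrative sketch: define an Avro schema in JSON, write a data file, read it back.
# Requires the 'avro' package; the record and fields are made up for the example.
import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.parse(json.dumps({   # may be avro.schema.Parse in some versions
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
    ],
}))

# The schema travels inside the container file, so any later reader can process it.
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": None})
writer.close()

reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
```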
Apache Avro
Comparison with other systems
● Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro
differs from these systems in the following fundamental aspects.
● Dynamic typing: Avro does not require that code be generated. Data is always
accompanied by a schema that permits full processing of that data without code
generation, static datatypes, etc. This facilitates construction of generic data-processing
systems and languages.
● Untagged data: Since the schema is present when data is read, considerably less type
information need be encoded with data, resulting in smaller serialization size.
● No manually-assigned field IDs: When a schema changes, both the old and new schema
are always present when processing data, so differences may be resolved symbolically,
using field names.
● Apache Avro, Avro, Apache, and the Avro and Apache logos are trademarks of The
Apache Software Foundation.
Apache Cassandra
The Apache Cassandra database is the right choice when you need
scalability and high availability without compromising performance.
Linear scalability and proven fault-tolerance on commodity hardware or
cloud infrastructure make it the perfect platform for mission-critical data.
Cassandra's support for replicating across multiple data centers is best-in-
class, providing lower latency for your users and the peace of mind of
knowing that you can survive regional outages.
Cassandra's data model offers the convenience of column indexes with the
performance of log-structured updates, strong support for denormalization
and materialized views, and powerful built-in caching.
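To give a feel for that data model, here is a minimal sketch using the DataStax cassandra-driver package and CQL; the contact point, keyspace, and table are placeholders, and the replication factor is kept at 1 only so the example runs on a single local node (production clusters typically use 3).

```python
# Illustrative sketch: create a keyspace and table, then write and read a row with CQL.
# Requires the 'cassandra-driver' package; contact point and names are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])      # contact point(s) of the Cassandra cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        user_id text PRIMARY KEY,
        name    text,
        email   text
    )
""")

session.execute(
    "INSERT INTO demo.users (user_id, name, email) VALUES (%s, %s, %s)",
    ("u1", "Ada", "ada@example.com"),
)
for row in session.execute("SELECT user_id, name, email FROM demo.users"):
    print(row.user_id, row.name, row.email)

cluster.shutdown()
```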
Apache Cassandra Overview
● Proven
Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub,
GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and
over 1500 more companies that have large, active data sets.
One of the largest production deployments is Apple's, with over 75,000 nodes
storing over 10 PB of data. Other large Cassandra installations include Netflix
(2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine
Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over
100 nodes, 250 TB).
● Fault Tolerant
Data is automatically replicated to multiple nodes for fault-tolerance.
Replication across multiple data centers is supported. Failed nodes can be
replaced with no downtime.
Apache Cassandra Overview
Performance
Cassandra consistently outperforms popular NoSQL alternatives in benchmarks
and real applications, primarily because of fundamental architectural choices.
Decentralized
There are no single points of failure. There are no network bottlenecks. Every
node in the cluster is identical.
Durable
Cassandra is suitable for applications that can't afford to lose data, even when an
entire data center goes down.
Apache Cassandra Overview
● You're in Control
Choose between synchronous or asynchronous replication for each update.
Highly available asynchronous operations are optimized with features like
Hinted Handoff and Read Repair.
● Elastic
Read and write throughput both increase linearly as new machines are added,
with no downtime or interruption to applications.
● Professionally Supported
Cassandra support contracts and services are available from third parties.
Chukwa
● Chukwa is an open source data collection system
for monitoring large distributed systems. Chukwa
is built on top of the Hadoop Distributed File
System (HDFS) and Map/Reduce framework and
inherits Hadoop’s scalability and robustness.
Chukwa also includes a flexible and powerful
toolkit for displaying, monitoring and analyzing
results to make the best use of the collected data.
Apache HBase
● Apache HBase™ is the Hadoop database, a distributed,
scalable, big data store.
● Use Apache HBase™ when you need random, real-time
read/write access to your Big Data. This project's goal is the
hosting of very large tables -- billions of rows X millions of
columns -- atop clusters of commodity hardware. Apache
HBase is an open-source, distributed, versioned, non-
relational database modeled after Google's Bigtable: A
Distributed Storage System for Structured Data by Chang et
al. Just as Bigtable leverages the distributed data storage
provided by the Google File System, Apache HBase provides
Bigtable-like capabilities on top of Hadoop and HDFS.
Features of Apache HBase
● Linear and modular scalability.
● Strictly consistent reads and writes.
● Automatic and configurable sharding of tables.
● Automatic failover support between RegionServers.
● Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
● Easy-to-use Java API for client access.
● Block cache and Bloom filters for real-time queries.
● Query predicate push-down via server-side Filters.
● Thrift gateway and a RESTful web service that supports XML, Protobuf, and binary data
encoding options (see the client sketch after this list).
● Extensible JRuby-based (JIRB) shell.
● Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
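The sketch below exercises the random read/write access and the Thrift gateway listed above, using the third-party happybase Python client; the host, table, and column-family names are assumptions, and an HBase Thrift server must be running for it to work.

```python
# Illustrative sketch: random read/write access through the HBase Thrift gateway,
# using the third-party 'happybase' package. Host, table, and column-family names
# are placeholders.
import happybase

connection = happybase.Connection("hbase-thrift-host.example.com")

# Create a table with one column family if it does not exist yet.
if b"users" not in connection.tables():
    connection.create_table("users", {"info": dict()})

table = connection.table("users")

# Put and get a single row by key -- HBase's core access pattern.
table.put(b"row-1", {b"info:name": b"Ada", b"info:email": b"ada@example.com"})
print(table.row(b"row-1"))

connection.close()
```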
Apache Hive
● The Apache Hive™ data warehouse software facilitates querying
and managing large data sets residing in distributed storage. Hive
provides a mechanism to project structure onto this data and to query
the data using a SQL-like language called HiveQL. At the same
time, this language also allows traditional map/reduce programmers
to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL (see the sketch after this slide).
● Hive is an open source volunteer project under the Apache
Software Foundation. Previously it was a subproject of Apache
Hadoop, but has now graduated to become a top-level project of its
own.
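As a hedged illustration of HiveQL, the sketch below submits a table definition and an ad hoc query through HiveServer2 using the third-party PyHive package; the host, port, table, and columns are invented for the example.

```python
# Illustrative sketch: run HiveQL through HiveServer2 with the 'pyhive' package.
# Host, port, and the table/column names are placeholders.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="hive")
cursor = conn.cursor()

# Project structure onto data already sitting in distributed storage (HDFS).
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
        user_id STRING,
        url     STRING,
        ts      BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/page_views'
""")

# An ad hoc SQL-like query; Hive compiles it into distributed jobs behind the scenes.
cursor.execute(
    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url ORDER BY hits DESC LIMIT 10"
)
for url, hits in cursor.fetchall():
    print(url, hits)

conn.close()
```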
Apache Mahout
● The Apache Mahout™ project's goal is to build a
scalable machine learning library.
By scalable we mean:
● Scalable to large data sets. Our core algorithms for clustering,
classification and collaborative filtering are implemented on
top of scalable, distributed systems. However, contributions
that run on a single machine are welcome as well.
● Scalable to support your business case. Mahout is distributed
under a commercially friendly Apache Software license.
Apache Mahout
● Scalable community. The goal of Mahout is to build a vibrant,
responsive, diverse community to facilitate discussions not only on
the project itself but also on potential use cases. Come to the
mailing lists to find out more.
● Currently, Mahout mainly supports three use cases:
recommendation mining takes users' behavior and from that tries
to find items users might like; clustering takes, for example, text documents
and groups them into sets of topically related documents;
classification learns from existing categorized documents what
documents of a specific category look like and can assign
unlabeled documents to the (hopefully) correct category.
Apache Pig
Apache Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating these
programs. The salient property of Pig programs is that their
structure is amenable to substantial parallelization, which in
turn enables them to handle very large data sets.
● At the present time, Pig's infrastructure layer consists of a
compiler that produces sequences of Map-Reduce programs,
for which large-scale parallel implementations already exist
(e.g., the Hadoop subproject).
Apache Pig
● Pig's language layer currently consists of a textual language called
Pig Latin, which has the following key properties:
● Ease of programming. It is trivial to achieve parallel execution of
simple, "embarrassingly parallel" data analysis tasks. Complex
tasks comprised of multiple interrelated data transformations are
explicitly encoded as data flow sequences, making them easy to
write, understand, and maintain.
● Optimization opportunities. The way in which tasks are encoded
permits the system to optimize their execution automatically,
allowing the user to focus on semantics rather than efficiency.
● Extensibility. Users can create their own functions to do special-
purpose processing.
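A minimal sketch of Pig Latin's data-flow style: the classic word count, written as a small script and launched in Pig's local mode from Python. It assumes the pig command is on the PATH and that an input.txt file exists; both are placeholders.

```python
# Illustrative sketch: a tiny Pig Latin script (word count) run in Pig's local mode.
# Assumes the 'pig' command is installed and on PATH; file paths are placeholders.
import pathlib
import subprocess

pig_latin = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount_out';
"""

pathlib.Path("wordcount.pig").write_text(pig_latin)
# '-x local' runs against the local filesystem instead of a Hadoop cluster.
subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```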
Apache Spark
● Apache Spark™ is a fast and general engine for large-scale data
processing.
● Ease of Use
Write applications quickly in Java, Scala or Python.
Spark offers over 80 high-level operators that make it easy to build parallel
apps, and you can use it interactively from the Scala and Python shells (see the sketch after this slide).
● Generality
Combine SQL, streaming, and complex analytics.
Spark powers a stack of high-level tools including Spark SQL, MLlib for
machine learning, GraphX, and Spark Streaming. You can combine these
libraries seamlessly in the same application.
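For the Python API mentioned under "Ease of Use", here is a minimal PySpark sketch expressing the same word count with a few high-level operators; the input path is a placeholder and a local pyspark installation is assumed.

```python
# Illustrative sketch: word count with PySpark's high-level operators.
# Assumes the 'pyspark' package is available and an 'input.txt' file exists.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

counts = (
    spark.read.text("input.txt").rdd              # one Row(value=...) per line
         .flatMap(lambda row: row.value.split())  # split lines into words
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
for word, n in counts.take(10):
    print(word, n)

spark.stop()
```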
Apache Spark
● Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It
can access diverse data sources including HDFS, Cassandra,
HBase, and S3.
You can run Spark readily using its standalone cluster mode, on
EC2, or run it on Hadoop YARN or Apache Mesos. It can read
from HDFS, HBase, Cassandra, and any Hadoop data source.
Apache Tez
Introduction
● The Apache Tez project is aimed at building an application framework
which allows for a complex directed acyclic graph (DAG) of tasks for processing
data. It is currently built atop Apache Hadoop YARN.
The two main design themes for Tez are:
● Empowering end users by:
➢ Expressive data-flow definition APIs
➢ Flexible Input-Processor-Output runtime model
➢ Data type agnostic
➢ Simplifying deployment
Apache Tez
● Execution Performance:
➢ Performance gains over MapReduce
➢ Optimal resource management
➢ Plan reconfiguration at run time
➢ Dynamic physical data-flow decisions
By allowing projects like Apache Hive and Apache Pig to run a complex
DAG of tasks, Tez can process data that previously required multiple
MapReduce jobs in a single Tez job.
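Tez itself exposes a Java API, so a simple way to see it at work is to switch Hive's execution engine, as sketched below using the same PyHive connection style as the earlier Hive example; hive.execution.engine is a standard Hive property, while the host and table are placeholders carried over from that example.

```python
# Illustrative sketch: run a Hive query on the Tez engine instead of MapReduce.
# Host/port and the table are placeholders; Tez must be installed on the cluster.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="hive")
cursor = conn.cursor()

# hive.execution.engine selects the engine that executes the compiled query plan.
cursor.execute("SET hive.execution.engine=tez")

# The same HiveQL now runs as a single Tez DAG rather than a chain of MapReduce jobs.
cursor.execute("SELECT url, COUNT(*) FROM page_views GROUP BY url")
print(cursor.fetchall())

conn.close()
```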
Apache ZooKeeper
● Apache ZooKeeper is an effort to develop and maintain
an open-source server which enables highly reliable
distributed coordination.
● ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services. All of these kinds of services are used in some
form or another by distributed applications. Each time they are
implemented there is a lot of work that goes into fixing the bugs and race
conditions that are inevitable. Because of the difficulty of implementing
these kinds of services, applications initially usually skimp on
them, which makes them brittle in the presence of change and difficult to
manage. Even when done correctly, different implementations of these
services lead to management complexity when the applications are
deployed.
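A minimal sketch of such coordination using the third-party kazoo client for Python: shared configuration lives in a znode, and group membership is tracked with an ephemeral node that vanishes if the client dies. The connection string and paths are placeholders.

```python
# Illustrative sketch: basic coordination primitives with the 'kazoo' ZooKeeper client.
# The connection string and znode paths are placeholders for the example.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

# Store a small piece of shared configuration under a znode.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_flag=on")

# Register this process as an ephemeral member of a group; the node disappears
# automatically if the client dies, which is how membership and failures are tracked.
zk.create("/app/workers/worker-1", b"", ephemeral=True, makepath=True)

data, stat = zk.get("/app/config")
print(data.decode(), stat.version)

zk.stop()
```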
Thank You!