Apache Hadoop
Sheetal Sharma
Intern At IBM Innovation Centre
What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for
the distributed processing of large data sets across clusters of
computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than relying on
hardware to deliver high availability, the library itself is designed
to detect and handle failures at the application layer, thereby delivering a
highly available service on top of a cluster of computers, each of
which may be prone to failure.
The project includes these modules:
● Hadoop Common: The common utilities that support the other
Hadoop modules.
● Hadoop Distributed File System (HDFS™): A distributed file
system that provides high-throughput access to application
data.
● Hadoop YARN: A framework for job scheduling and cluster
resource management.
● Hadoop MapReduce: A YARN-based system for parallel
processing of large data sets.
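To make the MapReduce module concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are ordinary Python scripts reading stdin and writing stdout. The file names, HDFS paths, and the streaming jar location in the comments are placeholders, not something from the original slides.

```python
#!/usr/bin/env python
# wordcount_streaming.py -- minimal Hadoop Streaming word count (illustrative sketch).
# A typical launch (paths are placeholders) would look roughly like:
#   hadoop jar hadoop-streaming.jar \
#       -files wordcount_streaming.py \
#       -input /data/in -output /data/out \
#       -mapper "python wordcount_streaming.py map" \
#       -reducer "python wordcount_streaming.py reduce"
import sys

def mapper():
    # Emit "<word>\t1" for every word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so counts for a word arrive contiguously.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```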
Other Hadoop-related projects at Apache
include:
● Ambari™: A web-based tool for provisioning, managing, and monitoring
Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce,
Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides
a dashboard for viewing cluster health (such as heatmaps) and the ability to view
MapReduce, Pig, and Hive applications visually, along with features to diagnose
their performance characteristics in a user-friendly manner.
● Avro™: A data serialization system.
● Cassandra™: A scalable multi-master database with no single points of
failure.
● Chukwa™: A data collection system for managing large distributed
systems.
● HBase™: A scalable, distributed database that supports structured data
storage for large tables.
Other Hadoop-related projects at Apache
include:
● Hive™: A data warehouse infrastructure that provides data summarization and ad hoc
querying.
● Mahout™: A scalable machine learning and data mining library.
● Pig™: A high-level data-flow language and execution framework for parallel
computation.
● Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple
and expressive programming model that supports a wide range of applications, including
ETL, machine learning, stream processing, and graph computation.
● Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which
provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process
data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and
other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g.
ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
● ZooKeeper™: A high-performance coordination service for distributed applications.
Introduction
● The Apache Ambari project is aimed at making Hadoop management simpler by
developing software for provisioning, managing, and monitoring Apache
Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management
web UI backed by its RESTful APIs.
Ambari enables System Administrators to:
● Provision a Hadoop Cluster
Ambari provides a step-by-step wizard for installing Hadoop services across
any number of hosts.
Ambari handles configuration of Hadoop services for the cluster.
● Manage a Hadoop Cluster
Ambari provides central management for starting, stopping, and reconfiguring
Hadoop services across the entire cluster.
● Monitor a Hadoop Cluster
➢ Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
➢ Ambari leverages Ganglia for metrics collection.
➢ Ambari leverages Nagios for system alerting and will send emails when your attention is
needed (e.g., a node goes down, remaining disk space is low, etc.).
● Ambari enables Application Developers and System Integrators to:
➢ Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own
applications with the Ambari REST APIs.
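As a hedged illustration of that last point, the sketch below queries an Ambari server's REST API with Python's requests library; the host name and the default admin/admin credentials are assumptions for the example.

```python
# Illustrative sketch: list the clusters an Ambari server manages over its REST API.
# Host, port, and credentials are placeholders -- adjust for your environment.
import requests

AMBARI = "http://ambari-host.example.com:8080/api/v1"
AUTH = ("admin", "admin")                   # Ambari's default credentials
HEADERS = {"X-Requested-By": "ambari"}      # header Ambari expects on API calls

resp = requests.get(f"{AMBARI}/clusters", auth=AUTH, headers=HEADERS)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["Clusters"]["cluster_name"])
```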
Getting Started with Ambari
● Follow the installation guide for Ambari 1.7.0.
● Note: Ambari currently supports the 64-bit version
of the following Operating Systems:
● RHEL (Red Hat Enterprise Linux) 5 and 6
● CentOS 5 and 6
● OEL (Oracle Enterprise Linux) 5 and 6
● SLES (SuSE Linux Enterprise Server) 11
● Ubuntu 12
Apache Avro
Introduction
● Apache Avro™ is a data serialization system.
Avro provides:
● Rich data structures.
● A compact, fast, binary data format.
● A container file, to store persistent data.
● Remote procedure call (RPC).
● Simple integration with dynamic languages. Code generation is not
required to read or write data files, nor to use or implement RPC
protocols. Code generation is an optional optimization, only worth
implementing for statically typed languages.
Apache Avro
Schemas
● Avro relies on schemas. When Avro data is read, the schema used when writing
it is always present. This permits each datum to be written with no per-value
overheads, making serialization both fast and small. This also facilitates use with
dynamic, scripting languages, since data, together with its schema, is fully self-
describing.
● When Avro data is stored in a file, its schema is stored with it, so that files may
be processed later by any program. If the program reading the data expects a
different schema this can be easily resolved, since both schemas are present.
● When Avro is used in RPC, the client and server exchange schemas in the
connection handshake. (This can be optimized so that, for most calls, no schemas
are actually transmitted.) Since the client and server both have the other's full
schema, correspondence between same-named fields, missing fields, extra fields,
etc. can all be easily resolved.
● Avro schemas are defined with JSON. This facilitates implementation in
languages that already have JSON libraries.
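A brief, hedged sketch of the ideas above, using the avro package for Python (assumed installed via pip): a schema is declared in JSON, written into a container file alongside the data, and read back without any generated code. The record name and fields are invented for the example, and depending on the package version the parse helper may be spelled parse or Parse.

```python
# Illustrative sketch: define an Avro schema in JSON, write a data file, read it back.
# Requires the 'avro' package; the record and fields are made up for the example.
import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.parse(json.dumps({   # may be avro.schema.Parse in some versions
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
    ],
}))

# The schema travels inside the container file, so any later reader can process it.
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": None})
writer.close()

reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
```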
Apache Avro
Comparison with other systems
● Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro
differs from these systems in the following fundamental aspects.
● Dynamic typing: Avro does not require that code be generated. Data is always
accompanied by a schema that permits full processing of that data without code
generation, static datatypes, etc. This facilitates construction of generic data-processing
systems and languages.
● Untagged data: Since the schema is present when data is read, considerably less type
information need be encoded with data, resulting in smaller serialization size.
● No manually-assigned field IDs: When a schema changes, both the old and new schema
are always present when processing data, so differences may be resolved symbolically,
using field names.
● Apache Avro, Avro, Apache, and the Avro and Apache logos are trademarks of The
Apache Software Foundation.
Apache Cassandra
The Apache Cassandra database is the right choice when you need
scalability and high availability without compromising performance.
Linear scalability and proven fault-tolerance on commodity hardware or
cloud infrastructure make it the perfect platform for mission-critical data.
Cassandra's support for replicating across multiple data centers is best-in-
class, providing lower latency for your users and the peace of mind of
knowing that you can survive regional outages.
Cassandra's data model offers the convenience of column indexes with the
performance of log-structured updates, strong support for denormalization
and materialized views, and powerful built-in caching.
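To give a feel for that data model, here is a minimal sketch using the DataStax cassandra-driver package and CQL; the contact point, keyspace, and table are placeholders, and the replication factor is kept at 1 only so the example runs on a single local node (production clusters typically use 3).

```python
# Illustrative sketch: create a keyspace and table, then write and read a row with CQL.
# Requires the 'cassandra-driver' package; contact point and names are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])      # contact point(s) of the Cassandra cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        user_id text PRIMARY KEY,
        name    text,
        email   text
    )
""")

session.execute(
    "INSERT INTO demo.users (user_id, name, email) VALUES (%s, %s, %s)",
    ("u1", "Ada", "ada@example.com"),
)
for row in session.execute("SELECT user_id, name, email FROM demo.users"):
    print(row.user_id, row.name, row.email)

cluster.shutdown()
```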
Apache Cassandra Overview
● Proven
Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub,
GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and
over 1500 more companies that have large, active data sets.
One of the largest production deployments is Apple's, with over 75,000 nodes
storing over 10 PB of data. Other large Cassandra installations include Netflix
(2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine
Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over
100 nodes, 250 TB).
● Fault Tolerant
Data is automatically replicated to multiple nodes for fault-tolerance.
Replication across multiple data centers is supported. Failed nodes can be
replaced with no downtime.
Apache Cassandra Overview
Performance
Cassandra consistently outperforms popular NoSQL alternatives in benchmarks
and real applications, primarily because of fundamental architectural choices.
Decentralized
There are no single points of failure. There are no network bottlenecks. Every
node in the cluster is identical.
Durable
Cassandra is suitable for applications that can't afford to lose data, even when an
entire data center goes down.
Apache Cassandra Overview
● You're in Control
Choose between synchronous or asynchronous replication for each update.
Highly available asynchronous operations are optimized with features like
Hinted Handoff and Read Repair.
● Elastic
Read and write throughput both increase linearly as new machines are added,
with no downtime or interruption to applications.
● Professionally Supported
Cassandra support contracts and services are available from third parties.
Chukwa
● Chukwa is an open source data collection system
for monitoring large distributed systems. Chukwa
is built on top of the Hadoop Distributed File
System (HDFS) and Map/Reduce framework and
inherits Hadoop’s scalability and robustness.
Chukwa also includes a flexible and powerful
toolkit for displaying, monitoring and analyzing
results to make the best use of the collected data.
Apache HBase
● Apache HBase™ is the Hadoop database, a distributed,
scalable, big data store.
● Use Apache HBase™ when you need random, real-time
read/write access to your Big Data. This project's goal is the
hosting of very large tables -- billions of rows X millions of
columns -- atop clusters of commodity hardware. Apache
HBase is an open-source, distributed, versioned, non-
relational database modeled after Google's Bigtable: A
Distributed Storage System for Structured Data by Chang et
al. Just as Bigtable leverages the distributed data storage
provided by the Google File System, Apache HBase provides
Bigtable-like capabilities on top of Hadoop and HDFS.
Features of Apache HBase
● Linear and modular scalability.
● Strictly consistent reads and writes.
● Automatic and configurable sharding of tables.
● Automatic failover support between RegionServers.
● Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
● Easy-to-use Java API for client access.
● Block cache and Bloom filters for real-time queries.
● Query predicate push-down via server-side Filters.
● Thrift gateway and a RESTful web service that supports XML, Protobuf, and binary data
encoding options (see the client sketch after this list).
● Extensible JRuby-based (JIRB) shell.
● Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
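The sketch below exercises the random read/write access and the Thrift gateway listed above, using the third-party happybase Python client; the host, table, and column-family names are assumptions, and an HBase Thrift server must be running for it to work.

```python
# Illustrative sketch: random read/write access through the HBase Thrift gateway,
# using the third-party 'happybase' package. Host, table, and column-family names
# are placeholders.
import happybase

connection = happybase.Connection("hbase-thrift-host.example.com")

# Create a table with one column family if it does not exist yet.
if b"users" not in connection.tables():
    connection.create_table("users", {"info": dict()})

table = connection.table("users")

# Put and get a single row by key -- HBase's core access pattern.
table.put(b"row-1", {b"info:name": b"Ada", b"info:email": b"ada@example.com"})
print(table.row(b"row-1"))

connection.close()
```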
Apache Hive
● The Apache Hive™ data warehouse software facilitates querying
and managing large data sets residing in distributed storage. Hive
provides a mechanism to project structure onto this data and to query
the data using a SQL-like language called HiveQL. At the same
time, this language also allows traditional map/reduce programmers
to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL (see the sketch after this slide).
● Hive is an open source volunteer project under the Apache
Software Foundation. Previously it was a subproject of Apache
Hadoop, but has now graduated to become a top-level project of its
own.
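As a hedged illustration of HiveQL, the sketch below submits a table definition and an ad hoc query through HiveServer2 using the third-party PyHive package; the host, port, table, and columns are invented for the example.

```python
# Illustrative sketch: run HiveQL through HiveServer2 with the 'pyhive' package.
# Host, port, and the table/column names are placeholders.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="hive")
cursor = conn.cursor()

# Project structure onto data already sitting in distributed storage (HDFS).
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
        user_id STRING,
        url     STRING,
        ts      BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/page_views'
""")

# An ad hoc SQL-like query; Hive compiles it into distributed jobs behind the scenes.
cursor.execute(
    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url ORDER BY hits DESC LIMIT 10"
)
for url, hits in cursor.fetchall():
    print(url, hits)

conn.close()
```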
Apache Mahout
● The Apache Mahout™ project's goal is to build a
scalable machine learning library.
By scalable we mean:
● Scalable to large data sets. Our core algorithms for clustering,
classification and collaborative filtering are implemented on
top of scalable, distributed systems. However, contributions
that run on a single machine are welcome as well.
● Scalable to support your business case. Mahout is distributed
under a commercially friendly Apache Software license.
Apache Mahout
● Scalable community. The goal of Mahout is to build a vibrant,
responsive, diverse community to facilitate discussions not only on
the project itself but also on potential use cases. Come to the
mailing lists to find out more.
● Currently, Mahout mainly supports three use cases:
recommendation mining takes users' behavior and from that tries
to find items users might like; clustering takes, for example, text documents
and groups them into sets of topically related documents;
classification learns from existing categorized documents what
documents of a specific category look like and can assign
unlabeled documents to the (hopefully) correct category.
Apache Pig
Apache Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating these
programs. The salient property of Pig programs is that their
structure is amenable to substantial parallelization, which in
turn enables them to handle very large data sets.
● At the present time, Pig's infrastructure layer consists of a
compiler that produces sequences of Map-Reduce programs,
for which large-scale parallel implementations already exist
(e.g., the Hadoop subproject).
Apache Pig
● Pig's language layer currently consists of a textual language called
Pig Latin, which has the following key properties:
● Ease of programming. It is trivial to achieve parallel execution of
simple, "embarrassingly parallel" data analysis tasks. Complex
tasks comprised of multiple interrelated data transformations are
explicitly encoded as data flow sequences, making them easy to
write, understand, and maintain.
● Optimization opportunities. The way in which tasks are encoded
permits the system to optimize their execution automatically,
allowing the user to focus on semantics rather than efficiency.
● Extensibility. Users can create their own functions to do special-
purpose processing.
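A minimal sketch of Pig Latin's data-flow style: the classic word count, written as a small script and launched in Pig's local mode from Python. It assumes the pig command is on the PATH and that an input.txt file exists; both are placeholders.

```python
# Illustrative sketch: a tiny Pig Latin script (word count) run in Pig's local mode.
# Assumes the 'pig' command is installed and on PATH; file paths are placeholders.
import pathlib
import subprocess

pig_latin = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount_out';
"""

pathlib.Path("wordcount.pig").write_text(pig_latin)
# '-x local' runs against the local filesystem instead of a Hadoop cluster.
subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```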
Apache Spark
● Apache Spark™ is a fast and general engine for large-scale data
processing.
● Ease of Use
Write applications quickly in Java, Scala or Python.
Spark offers over 80 high-level operators that make it easy to build parallel
apps, and you can use it interactively from the Scala and Python shells (see the sketch after this slide).
● Generality
Combine SQL, streaming, and complex analytics.
Spark powers a stack of high-level tools including Spark SQL, MLlib for
machine learning, GraphX, and Spark Streaming. You can combine these
libraries seamlessly in the same application.
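For the Python API mentioned under "Ease of Use", here is a minimal PySpark sketch expressing the same word count with a few high-level operators; the input path is a placeholder and a local pyspark installation is assumed.

```python
# Illustrative sketch: word count with PySpark's high-level operators.
# Assumes the 'pyspark' package is available and an 'input.txt' file exists.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

counts = (
    spark.read.text("input.txt").rdd              # one Row(value=...) per line
         .flatMap(lambda row: row.value.split())  # split lines into words
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
for word, n in counts.take(10):
    print(word, n)

spark.stop()
```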
Apache Spark
● Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It
can access diverse data sources including HDFS, Cassandra,
HBase, and S3.
You can run Spark readily using its standalone cluster mode, on
EC2, or run it on Hadoop YARN or Apache Mesos. It can read
from HDFS, HBase, Cassandra, and any Hadoop data source.
Apache Tez
Introduction
● The Apache Tez project is aimed at building an application framework
which allows for a complex directed acyclic graph (DAG) of tasks for processing
data. It is currently built atop Apache Hadoop YARN.
The two main design themes for Tez are:
● Empowering end users by:
➢ Expressive data-flow definition APIs
➢ Flexible Input-Processor-Output runtime model
➢ Data type agnostic
➢ Simplifying deployment
Apache Tez
● Execution Performance:
➢ Performance gains over MapReduce
➢ Optimal resource management
➢ Plan reconfiguration at run time
➢ Dynamic physical data-flow decisions
By allowing projects like Apache Hive and Apache Pig to run a complex
DAG of tasks, Tez can process data that previously required multiple
MapReduce jobs in a single Tez job.
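Tez itself exposes a Java API, so a simple way to see it at work is to switch Hive's execution engine, as sketched below using the same PyHive connection style as the earlier Hive example; hive.execution.engine is a standard Hive property, while the host and table are placeholders carried over from that example.

```python
# Illustrative sketch: run a Hive query on the Tez engine instead of MapReduce.
# Host/port and the table are placeholders; Tez must be installed on the cluster.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="hive")
cursor = conn.cursor()

# hive.execution.engine selects the engine that executes the compiled query plan.
cursor.execute("SET hive.execution.engine=tez")

# The same HiveQL now runs as a single Tez DAG rather than a chain of MapReduce jobs.
cursor.execute("SELECT url, COUNT(*) FROM page_views GROUP BY url")
print(cursor.fetchall())

conn.close()
```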
Apache ZooKeeper
● Apache ZooKeeper is an effort to develop and maintain
an open-source server which enables highly reliable
distributed coordination.
● ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services. All of these kinds of services are used in some
form or another by distributed applications. Each time they are
implemented there is a lot of work that goes into fixing the bugs and race
conditions that are inevitable. Because of the difficulty of implementing
these kinds of services, applications initially usually skimp on
them, which makes them brittle in the presence of change and difficult to
manage. Even when done correctly, different implementations of these
services lead to management complexity when the applications are
deployed.
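A minimal sketch of such coordination using the third-party kazoo client for Python: shared configuration lives in a znode, and group membership is tracked with an ephemeral node that vanishes if the client dies. The connection string and paths are placeholders.

```python
# Illustrative sketch: basic coordination primitives with the 'kazoo' ZooKeeper client.
# The connection string and znode paths are placeholders for the example.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

# Store a small piece of shared configuration under a znode.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_flag=on")

# Register this process as an ephemeral member of a group; the node disappears
# automatically if the client dies, which is how membership and failures are tracked.
zk.create("/app/workers/worker-1", b"", ephemeral=True, makepath=True)

data, stat = zk.get("/app/config")
print(data.decode(), stat.version)

zk.stop()
```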
Thank You!