Big data, Hadoop and Hive

Big Data – Srinath & Arjun
Agenda
• The BIG-DATA
• Hadoop
• Hadoop Components
• Hadoop Eco Systems

The BIG-DATA

The Context
• Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
• Google processed about 270PB of data a month (2007) and about 20PB (20,000TB) a day (2008)
• 2010 census data is expected to be a huge gold mine of information
• Data mining huge amounts of data collected in a wide range of domains
from astronomy to healthcare has become essential for planning and
performance.

• We are in a knowledge economy.
– Data is an important asset to any organization
– Discovery of knowledge; Enabling discovery; annotation of data
• We are looking at newer
– programming models, and
– Supporting algorithms and data structures.

The Myth about Big Data
• Big Data is New
• Big Data is only about Massive Data Volume
• Big data means Hadoop
• Big data needs a Data Warehouse
• Big data means Unstructured Data
• Big data is for Social Media and Data mining Analyses

Big Data is…
It is all about better analytics on a broader spectrum of data, and
therefore represents an opportunity to create even more differentiation
among industries.

Where is the data coming from?
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 2+ billion people on the Web by end 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009… 200M by 2014

Example of Big Data Generation
Facebook
• 4.5 billion Facebook likes every day
• 350 million photos uploaded on a daily basis
• 250 billion photos stored by Facebook
• 10 billion messages sent every day
• 1 trillion posts in Facebook’s graph search database
• 500 TB of data processed daily
• 100 PB of data stored in Facebook’s Hadoop disk cluster (1 PB = 1,000 TB = 1,000,000 GB)

Example of Big Data Generation
Flights
• 1 Boeing plane engine generates 20TB of data for every hour of flying
• How much data do all the flights in the world generate every year if
there are 100,000 two-engine flights daily?
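A rough worked estimate for the flight question, as a Python sketch. The 20 TB per engine-hour, the two engines, and the 100,000 daily flights come from the slide; the average flight duration of 2 hours is an assumption added here purely for illustration, and the result scales linearly with whatever duration is chosen.

```python
# Back-of-the-envelope estimate of yearly flight-engine data.
TB_PER_ENGINE_HOUR = 20        # from the slide
ENGINES_PER_FLIGHT = 2         # from the slide
FLIGHTS_PER_DAY = 100_000      # from the slide
AVG_FLIGHT_HOURS = 2           # ASSUMPTION: not given in the slide
DAYS_PER_YEAR = 365

def yearly_flight_data_tb(avg_flight_hours=AVG_FLIGHT_HOURS):
    """Return the estimated data volume per year, in terabytes."""
    tb_per_flight = TB_PER_ENGINE_HOUR * ENGINES_PER_FLIGHT * avg_flight_hours
    return tb_per_flight * FLIGHTS_PER_DAY * DAYS_PER_YEAR

total_tb = yearly_flight_data_tb()
total_pb = total_tb / 1000     # using 1 PB = 1000 TB
print(f"{total_tb:,} TB per year = {total_pb:,.0f} PB per year")
```

Under the 2-hour assumption this comes to 2,920,000,000 TB, roughly 2.92 million PB per year.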

What comes under Big data?
• Black Box Data
• Social Media Data
• Stock Exchange Data
• Power Grid Data
• Transport Data
• Search Engine Data

Big Data Challenges
• Capturing Data
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation

Characteristics of Big Data
• Volume: 12+ terabytes of Tweets created daily.
• Variety: 100’s of different types of data.
• Veracity: Only 1 in 3 decision makers trust their information.
• Velocity: 5+ million trade events per second.

Types of Data
• Structured data : Relational Data
• Semi-structured data : XML data
• Unstructured data : Word, PDF, Text, Media, Logs
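To make the three categories concrete, here is a small Python sketch that reads one record of each kind. The sample records and field names are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, as in a relational table (sample data is hypothetical).
structured = "id,name,city\n1,Asha,Chennai\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing tags, but no rigid schema.
semi_structured = "<user id='1'><name>Asha</name><city>Chennai</city></user>"
user = ET.fromstring(semi_structured)

# Unstructured: free text; any structure has to be inferred by the program.
unstructured = "Asha from Chennai logged in at 10:42 and uploaded a photo."
words = unstructured.split()

print(rows[0]["city"])          # field looked up by column name
print(user.find("city").text)   # field looked up by tag name
print(len(words))               # only generic operations are possible up front
```

The first two lookups return the same city by name, while the free-text record offers nothing to query until some parsing or mining step extracts it.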

The Data Explosion
• 2.5 quintillion bytes of data created each day
• 90 % of data in the world was created in the last two years

Hadoop

Hadoop
• Open Source Software Framework
• Inspired by Google’s MapReduce programming model and the Google File System (GFS)
• Originally written for the Nutch search engine project
• Written in Java
• Efficiently processes large volumes of Data
• Breaks up Big data into multiple parts
• Two key parts
– HDFS
– MapReduce

History of Hadoop

Hadoop Architecture

Hadoop Components

HDFS – Hadoop Distributed File System
• A file system designed for storing very large files, running on clusters of
commodity hardware
• Highly fault-tolerant, distributed, reliable, scalable file system for data
storage
• Splits files into blocks (default block size 64MB) and stores multiple copies
of each block on different nodes
• Typically has a single namenode and a number of datanodes forming the HDFS
cluster
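A toy Python sketch of the storage idea: a file is cut into fixed-size blocks and each block is copied to several datanodes. The 64 MB block size is the default given on the slide; the replication factor of 3 is a common HDFS default but is an assumption here, and the round-robin placement and node names are simplifications for illustration, not the real HDFS placement policy.

```python
import math

BLOCK_SIZE_MB = 64      # default block size from the slide
REPLICATION = 3         # ASSUMPTION: common HDFS default, not stated in the slide

def place_blocks(file_size_mb, datanodes, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION):
    """Return {block_index: [datanode, ...]} using simple round-robin placement."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Pick `replication` distinct nodes, starting at a rotating offset.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement

# Hypothetical 4-node cluster storing a 200 MB file.
nodes = ["dn1", "dn2", "dn3", "dn4"]
layout = place_blocks(200, nodes)
for block, replicas in layout.items():
    print(f"block {block}: {replicas}")
```

In HDFS the namenode keeps a mapping like `layout` as metadata while the datanodes hold the block contents; in this sketch, losing any single node still leaves two copies of every block.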

HDFS Architecture
• Two types of Nodes
– Master or Namenode
– Slave or Datanode

HDFS Architecture

Read a File

Write a File

Hadoop Cluster Modes
• Standalone Mode
• Pseudo-Distributed Mode
• Fully-Distributed Mode

MapReduce
A programming model designed for processing large volumes of data in
parallel by dividing the work into a set of independent tasks.
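The classic illustration is word count. The sketch below imitates the map, shuffle, and reduce phases in plain Python inside a single process; it is not Hadoop's API, and the function names are chosen here for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Each line is an independent map task; each key is an independent reduce task.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count(["big data is big", "hadoop processes big data"])
print(counts)
```

Because no map call depends on another line and no reduce call depends on another key, a cluster can run these tasks on different nodes at the same time.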

Terminology
• Job
• Task
• Task Attempt
• NameNode
• MasterNode
• SlaveNode
• Clusters
• Commodity Hardware

Components
• Master Nodes
• Slave Nodes

Workflow

Example

Closer Look

Input Formats
• Text input format (TextInputFormat)
• Sequence file input format (SequenceFileInputFormat)
• Key-value text input format (KeyValueTextInputFormat)
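As a rough illustration of how two of these formats turn a file into (key, value) records: Hadoop's text input format uses the byte offset of each line as the key and the line itself as the value, while the key-value text format splits each line at a separator (a tab by default). The Python sketch below mimics that behaviour; the helper names are hypothetical, and the real classes are Java.

```python
def text_input_records(data: bytes):
    """Mimic the text input format: key = byte offset of the line, value = line."""
    records, offset = [], 0
    for raw in data.splitlines(keepends=True):
        records.append((offset, raw.rstrip(b"\r\n").decode("utf-8")))
        offset += len(raw)
    return records

def key_value_records(data: bytes, separator="\t"):
    """Mimic the key-value text format: split each line at the first separator."""
    records = []
    for line in data.decode("utf-8").splitlines():
        key, _, value = line.partition(separator)  # no separator -> empty value
        records.append((key, value))
    return records

# Hypothetical two-line, tab-separated input file.
sample = b"user1\tlogin\nuser2\tupload\n"
print(text_input_records(sample))
print(key_value_records(sample))
```

The same bytes therefore reach the mapper with different keys depending on the input format chosen for the job.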

NoSQL
• NoSQL means “not only SQL”
• This includes key-value stores, document-oriented databases, graph
databases, BigTable-style structures, and caching data stores
• E.g. MongoDB, Cassandra

Hadoop Eco Systems

What is HIVE?
• Data Warehousing Infrastructure
• Data Summarization, ad-hoc querying and analysis of large
volumes of data

HiveQL
• HiveQL is the Hive query language.
• Hive is not designed for online transaction processing (early versions do not support transactions).

Hive Application
• Log Processing
• Text Mining
• Document indexing
• Customer-facing business intelligence (e.g. Google Analytics)
• Predictive modelling, hypothesis testing

Thank You….
