Big Data and
Hadoop
PARKHE KISHOR B.
M. TECH. (INDUSTRIAL MATHEMATICS AND COMPUTER APPLICATIONS)
! Prerequisites
Java
! OOP concepts
! Serialization
! Data structures (HashMap, lists)
! File I/O
! UNIX commands (mv, cp, ls, mkdir, ps, vi, etc.)
Development Environment
! Install JDK 1.6, JRE 6, and Eclipse
! What is Big Data?
! Typically we work on Excel sheets, PowerPoint presentations, Word documents, and code files, which are on the order of 1–2 MB. Even a movie is only 1–2 GB in size.
! The Big Data we want to deal with is on the order of petabytes.
! A petabyte is roughly 10^9 times the size of an ordinary megabyte-scale file.
What Happens in An Internet Minute?
Where is this Data ?
! This data is generated from multiple sources. The logs of Google, Facebook, LinkedIn, and Yahoo servers record the activity of billions of users around the world.
! What users access, how long they stay on a site, the metadata of sites visited, friends lists, status updates, torrent downloads. At a five-minute granularity, Google receives petabytes of information in its server logs. The same goes for Facebook, Yahoo, AT&T, and Airtel.
Why do we need to understand data?
1. Analytics
2. Why do we need analytics?
Use Cases of Analytics
1. Measuring how effectively you do business
2. Cost cutting and improved productivity
3. Google, Amazon, and eBay analyze logs so that ads and products can be recommended to customers.
e.g., public transport
Data Categories
1. Structured Data
2. Unstructured Data
Challenges
The problem is not getting this big data. The problem is how to store, process, and analyze it.








Case Study
Telecom Company
1. Airmobile (50 million subscribers) wants to sell its expensive $500 monthly plan to its customers. For this, it wants to find its top subscribers and the total bytes of Internet data they have downloaded using its services in the last month.
2. It also wants to advertise its $100 roaming plan so that subscribers don't switch to other networks when travelling to other cities. For this, it wants to find the minutes of usage (i.e., call duration) of the top 10,000 subscribers who have roamed in the last month.
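The first query reduces to a group-by-and-sum over log records. A minimal single-machine sketch in plain Java, assuming hypothetical (subscriberId, bytesDownloaded) records (on a real cluster this would run as a MapReduce job, but the aggregation logic is the same):

```java
import java.util.*;
import java.util.stream.*;

public class TopSubscribers {
    // Aggregate bytes per subscriber and return the k heaviest users.
    static List<Map.Entry<String, Long>> topK(List<String[]> records, int k) {
        Map<String, Long> totals = new HashMap<>();
        for (String[] r : records) {          // r[0] = subscriber id, r[1] = bytes
            totals.merge(r[0], Long.parseLong(r[1]), Long::sum);
        }
        return totals.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical log fragment: subscriber id, bytes downloaded.
        List<String[]> logs = Arrays.asList(
                new String[]{"9876500001", "1200"},
                new String[]{"9876500002", "300"},
                new String[]{"9876500001", "800"},
                new String[]{"9876500003", "2500"});
        System.out.println(topK(logs, 2)); // prints [9876500003=2500, 9876500001=2000]
    }
}
```

The roaming query in point 2 is the same pattern with call duration as the value instead of bytes.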
Issues
1. Different subscribers in different cities have different data plans, and almost all subscribers are active every day of the month.
2. Data collection is huge: every minute almost 1 million people visit 5–6 sites.
3. Every day, terabytes of information are collected on Airmobile's servers in each city, and much of it gets discarded because storage is unavailable.
Solution
1. Google introduced GFS (the Google File System) and MapReduce.
2. Hadoop, an open-source implementation of these ideas, is now maintained by Apache.
3. Hadoop is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and Rackspace.
How is HADOOP the Solution?
1. Storage -> HDFS: a distributed file system where commodity hardware forms clusters that store huge data in a distributed fashion. There is no need for high-end hardware.
2. Processing -> the MapReduce paradigm
3. Analysis -> Hive, Pig, MapReduce
4. It can easily scale to many nodes (1,500–2,000 nodes in a cluster) with just a configuration change.
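The MapReduce paradigm in point 2 can be sketched in plain Java with no Hadoop dependency: map emits (key, value) pairs, a shuffle step groups the pairs by key, and reduce folds each group into a result. This is a toy single-machine model (and it targets a modern JDK, not the JDK 1.6 the deck installs):

```java
import java.util.*;
import java.util.function.*;

public class MiniMapReduce {
    // Toy model of MapReduce: map -> shuffle (group by key) -> reduce.
    static <I, K, V, R> Map<K, R> run(List<I> input,
                                      Function<I, Map.Entry<K, V>> mapper,
                                      BiFunction<K, List<V>, R> reducer) {
        Map<K, List<V>> shuffled = new HashMap<>();
        for (I item : input) {                           // map phase
            Map.Entry<K, V> kv = mapper.apply(item);
            shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                    .add(kv.getValue());
        }
        Map<K, R> out = new HashMap<>();                 // reduce phase
        shuffled.forEach((k, vs) -> out.put(k, reducer.apply(k, vs)));
        return out;
    }

    public static void main(String[] args) {
        // Word count, the canonical MapReduce example.
        List<String> words = Arrays.asList("big", "data", "big", "hadoop");
        Map<String, Integer> counts = run(words,
                w -> new AbstractMap.SimpleEntry<>(w, 1),
                (k, vs) -> vs.size());
        System.out.println(counts); // {big=2, data=1, hadoop=1}, in some order
    }
}
```

On a real cluster, Hadoop runs the map and reduce phases on many nodes in parallel and moves the shuffle data over the network; the programming model stays this simple.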
Applications of HADOOP
1. Telecommunications -> to find top subscribers for advertising and peak traffic rates so that routers are installed in the right places, for cost cutting.
2. Recommendation systems -> Google Ads customized for each user.
3. Data warehousing -> to store and analyze data, e.g., categorize traffic into web or mobile HTTP so that ISP services can be customized accordingly.
4. Market research and forecasting -> forecast subscribers and traffic based on past trends.
5. Finance and social networking -> to predict trends and gain profit.
What it is Not
1. It should be noted that Hadoop is not OLAP (online analytical processing); it is batch/offline oriented.
2. It is not a database.
Challenges
Can this data be stored on one machine? Hard drives are approximately 500 GB in size. Even if you add external hard drives, you can't store petabytes of data. And even if you could, you wouldn't be able to open or process such a file because of insufficient RAM, and processing it would take months.
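A back-of-the-envelope check of the storage claim, assuming 500 GB drives and 1 PB of data:

```java
public class StorageMath {
    public static void main(String[] args) {
        long petabyte = 1_000_000_000_000_000L; // 10^15 bytes
        long drive = 500_000_000_000L;          // one 500 GB drive
        // You would need 2,000 drives just to hold a single petabyte.
        System.out.println(petabyte / drive + " drives per petabyte"); // prints 2000
    }
}
```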
HDFS Features
1. Data is distributed over several machines and replicated, ensuring durability against failure and high availability to parallel applications
2. Designed for very large files (GBs, TBs)
3. Block oriented
4. Unix-like command interface
5. Write once, read many times
6. Runs on commodity hardware
7. Fault tolerant when nodes fail
8. Scalable by adding new nodes
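Features 1–3 can be pictured with a toy placement calculation, assuming a 64 MB block size and replication factor 3 (the Hadoop 1.x defaults). Real HDFS placement is rack-aware; this sketch uses a naive round-robin instead:

```java
import java.util.*;

public class BlockPlacement {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, Hadoop 1.x default
    static final int REPLICATION = 3;                 // default replication factor

    // Split a file into blocks and assign each replica to a node round-robin.
    static Map<Integer, List<String>> place(long fileBytes, List<String> nodes) {
        int blocks = (int) ((fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE); // ceiling
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        for (int b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add(nodes.get((b + r) % nodes.size())); // naive, not rack-aware
            }
            placement.put(b, replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        // A 200 MB file needs ceil(200/64) = 4 blocks, each on 3 of 5 nodes.
        Map<Integer, List<String>> p = place(200L * 1024 * 1024,
                Arrays.asList("node1", "node2", "node3", "node4", "node5"));
        System.out.println(p.size() + " blocks: " + p);
    }
}
```

Losing any single node still leaves two replicas of every block, which is why commodity (failure-prone) hardware is acceptable.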
It is Not Designed for
1. Small files.
2. Multiple writers or arbitrary file modification -> writes are only supported at the end of a file; modifications cannot be made at random offsets.
3. Low-latency data access -> since we are accessing huge amounts of data, throughput comes at the expense of the time taken to access it.
Architecture
! Production
! Hadoop Component
Function of Name Node
1. The Name Node is the controller and manager of HDFS. It knows the status and metadata of every file in HDFS.
2. Metadata -> file names, permissions, and the locations of each block of each file
3. An HDFS cluster can be accessed concurrently by multiple clients, yet this metadata must never become desynchronized. Hence, all of it is handled by a single machine.
4. Since metadata is typically small, it is all stored in the Name Node's main memory, allowing fast access.
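Point 4 can be pictured as a simple in-memory map from file path to its attributes and block locations. This is a deliberately simplified, hypothetical model of the Name Node, not its real implementation:

```java
import java.util.*;

public class ToyNameNode {
    // Per-file metadata: permissions plus the node locations of each block.
    static class FileMeta {
        final String permissions;
        final List<List<String>> blockLocations; // get(i) = nodes holding block i
        FileMeta(String permissions, List<List<String>> blockLocations) {
            this.permissions = permissions;
            this.blockLocations = blockLocations;
        }
    }

    // The whole namespace lives in main memory, keyed by file path.
    private final Map<String, FileMeta> namespace = new HashMap<>();

    void addFile(String path, String permissions, List<List<String>> blocks) {
        namespace.put(path, new FileMeta(permissions, blocks));
    }

    // A client read asks the Name Node which nodes hold a given block,
    // then fetches the data from those Data Nodes directly.
    List<String> locateBlock(String path, int blockIndex) {
        return namespace.get(path).blockLocations.get(blockIndex);
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.addFile("/logs/2014-01.log", "rw-r--r--",
                Arrays.asList(Arrays.asList("node1", "node2", "node3"),
                              Arrays.asList("node2", "node4", "node5")));
        System.out.println(nn.locateBlock("/logs/2014-01.log", 1));
    }
}
```

Because each entry is just a few strings, even millions of files fit comfortably in RAM, which is what makes metadata lookups fast.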
Purpose of Secondary Name Node
1. It is not a backup of the Name Node, and Data Nodes do not connect to it. It is just a helper of the Name Node.
2. It only performs periodic checkpoints.
3. It communicates with the Name Node to take snapshots of HDFS metadata.
4. These snapshots help minimize downtime and data loss.
HDFS Read
HDFS Write
Replica Placement
