Big Data and
Hadoop
PARKHE KISHOR B.
M. TECH. (INDUSTRIAL MATHEMATICS AND COMPUTER APPLICATIONS)
! Prerequisites
Java
! OOP concepts
! Serialization
! Data structures (HashMap, lists)
! File I/O
! UNIX commands (mv, cp, ls, mkdir, ps, vi, etc.)
Development Environment
! Install JDK 1.6, JRE 6, and Eclipse
! What is Big Data?
! Typically we work on Excel sheets, PowerPoint presentations, Word documents, and code files, which are on the order of 1–2 MB. Even a movie is only 1–2 GB in size.
! The Big Data we want to deal with is on the order of petabytes.
! A petabyte is roughly 10^9 times the size of an ordinary megabyte-scale file.
What Happens in An Internet Minute?
Where is this Data ?
! This data is generated from multiple sources. The logs of Google, Facebook, LinkedIn, and Yahoo servers record the activity of billions of users around the world.
! What users access, how long they stay on a site, the metadata of sites visited, friends lists, status updates, torrent downloads. At a five-minute granularity, Google receives petabytes of information in its server logs. The same goes for Facebook, Yahoo, AT&T, and Airtel.
Why do we need to understand data?
1. Analytics
2. Why do we need analytics?
Use Cases of Analytics
1. Measuring how effectively you do business
2. Cost cutting and improved productivity
3. Google, Amazon, and eBay analyze logs so that ads and products can be recommended to customers.
e.g., public transport
Data Categories
1. Structured Data
2. Unstructured Data
Challenges
The problem is not getting this big data. The problem is how to store, process, and analyze it.








Case Study
Telecom Company
1. Airmobile (50 million subscribers) wants to sell its expensive $500 monthly plan to its customers. For this, it wants to find its top subscribers and the total bytes of Internet data they have downloaded using its services in the last month.
2. It also wants to advertise its $100 roaming plan so that subscribers don't switch to other networks when travelling to other cities. For this, it wants to find the minutes of usage (i.e., call duration) of the top 10,000 subscribers who have roamed in the last month.
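The first query reduces to a group-by-and-sum over log records. A minimal single-machine sketch in plain Java, assuming hypothetical (subscriberId, bytesDownloaded) records (on a real cluster this would run as a MapReduce job, but the aggregation logic is the same):

```java
import java.util.*;
import java.util.stream.*;

public class TopSubscribers {
    // Aggregate bytes per subscriber and return the k heaviest users.
    static List<Map.Entry<String, Long>> topK(List<String[]> records, int k) {
        Map<String, Long> totals = new HashMap<>();
        for (String[] r : records) {          // r[0] = subscriber id, r[1] = bytes
            totals.merge(r[0], Long.parseLong(r[1]), Long::sum);
        }
        return totals.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical log fragment: subscriber id, bytes downloaded.
        List<String[]> logs = Arrays.asList(
                new String[]{"9876500001", "1200"},
                new String[]{"9876500002", "300"},
                new String[]{"9876500001", "800"},
                new String[]{"9876500003", "2500"});
        System.out.println(topK(logs, 2)); // prints [9876500003=2500, 9876500001=2000]
    }
}
```

The roaming query in point 2 is the same pattern with call duration as the value instead of bytes.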
Issues
1. Different subscribers in different cities have different data plans, and almost all subscribers are active every day of the month.
2. Data collection is huge: every minute almost 1 million people visit 5–6 sites.
3. Every day, terabytes of information are collected on Airmobile's servers in each city, and much of it gets discarded because storage is unavailable.
Solution
1. Google introduced GFS (the Google File System) and MapReduce.
2. Hadoop, an open-source implementation of these ideas, is now maintained by Apache.
3. Hadoop is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and Rackspace.
How is HADOOP the Solution?
1. Storage -> HDFS: a distributed file system where commodity hardware forms clusters that store huge data in a distributed fashion. There is no need for high-end hardware.
2. Processing -> the MapReduce paradigm
3. Analysis -> Hive, Pig, MapReduce
4. It can easily scale to many nodes (1,500–2,000 nodes in a cluster) with just a configuration change.
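The MapReduce paradigm in point 2 can be sketched in plain Java with no Hadoop dependency: map emits (key, value) pairs, a shuffle step groups the pairs by key, and reduce folds each group into a result. This is a toy single-machine model (and it targets a modern JDK, not the JDK 1.6 the deck installs):

```java
import java.util.*;
import java.util.function.*;

public class MiniMapReduce {
    // Toy model of MapReduce: map -> shuffle (group by key) -> reduce.
    static <I, K, V, R> Map<K, R> run(List<I> input,
                                      Function<I, Map.Entry<K, V>> mapper,
                                      BiFunction<K, List<V>, R> reducer) {
        Map<K, List<V>> shuffled = new HashMap<>();
        for (I item : input) {                           // map phase
            Map.Entry<K, V> kv = mapper.apply(item);
            shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                    .add(kv.getValue());
        }
        Map<K, R> out = new HashMap<>();                 // reduce phase
        shuffled.forEach((k, vs) -> out.put(k, reducer.apply(k, vs)));
        return out;
    }

    public static void main(String[] args) {
        // Word count, the canonical MapReduce example.
        List<String> words = Arrays.asList("big", "data", "big", "hadoop");
        Map<String, Integer> counts = run(words,
                w -> new AbstractMap.SimpleEntry<>(w, 1),
                (k, vs) -> vs.size());
        System.out.println(counts); // {big=2, data=1, hadoop=1}, in some order
    }
}
```

On a real cluster, Hadoop runs the map and reduce phases on many nodes in parallel and moves the shuffle data over the network; the programming model stays this simple.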
Applications of HADOOP
1. Telecommunications -> to find top subscribers for advertising and peak traffic rates so that routers are installed in the right places, for cost cutting.
2. Recommendation systems -> Google Ads customized for each user.
3. Data warehousing -> to store and analyze data, e.g., categorize traffic into web or mobile HTTP so that ISP services can be customized accordingly.
4. Market research and forecasting -> forecast subscribers and traffic based on past trends.
5. Finance and social networking -> to predict trends and gain profit.
What it is Not
1. It should be noted that Hadoop is not OLAP (online analytical processing); it is batch/offline oriented.
2. It is not a database.
Challenges
Can this data be stored on one machine? Hard drives are approximately 500 GB in size. Even if you add external hard drives, you can't store petabytes of data. And even if you could, you wouldn't be able to open or process such a file because of insufficient RAM, and processing it would take months.
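A back-of-the-envelope check of the storage claim, assuming 500 GB drives and 1 PB of data:

```java
public class StorageMath {
    public static void main(String[] args) {
        long petabyte = 1_000_000_000_000_000L; // 10^15 bytes
        long drive = 500_000_000_000L;          // one 500 GB drive
        // You would need 2,000 drives just to hold a single petabyte.
        System.out.println(petabyte / drive + " drives per petabyte"); // prints 2000
    }
}
```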
HDFS Features
1. Data is distributed over several machines and replicated, ensuring durability against failure and high availability to parallel applications
2. Designed for very large files (GBs, TBs)
3. Block oriented
4. Unix-like command interface
5. Write once, read many times
6. Runs on commodity hardware
7. Fault tolerant when nodes fail
8. Scalable by adding new nodes
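Features 1–3 can be pictured with a toy placement calculation, assuming a 64 MB block size and replication factor 3 (the Hadoop 1.x defaults). Real HDFS placement is rack-aware; this sketch uses a naive round-robin instead:

```java
import java.util.*;

public class BlockPlacement {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, Hadoop 1.x default
    static final int REPLICATION = 3;                 // default replication factor

    // Split a file into blocks and assign each replica to a node round-robin.
    static Map<Integer, List<String>> place(long fileBytes, List<String> nodes) {
        int blocks = (int) ((fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE); // ceiling
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        for (int b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add(nodes.get((b + r) % nodes.size())); // naive, not rack-aware
            }
            placement.put(b, replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        // A 200 MB file needs ceil(200/64) = 4 blocks, each on 3 of 5 nodes.
        Map<Integer, List<String>> p = place(200L * 1024 * 1024,
                Arrays.asList("node1", "node2", "node3", "node4", "node5"));
        System.out.println(p.size() + " blocks: " + p);
    }
}
```

Losing any single node still leaves two replicas of every block, which is why commodity (failure-prone) hardware is acceptable.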
It is Not Designed for
1. Small files.
2. Multiple writers or arbitrary file modification -> writes are only supported at the end of a file; modifications cannot be made at random offsets.
3. Low-latency data access -> since we are accessing huge amounts of data, throughput comes at the expense of the time taken to access it.
Architecture
! Production
! Hadoop Component
Function of Name Node
1. The Name Node is the controller and manager of HDFS. It knows the status and metadata of every file in HDFS.
2. Metadata -> file names, permissions, and the locations of each block of each file
3. An HDFS cluster can be accessed concurrently by multiple clients, yet this metadata must never become desynchronized. Hence, all of it is handled by a single machine.
4. Since metadata is typically small, it is all stored in the Name Node's main memory, allowing fast access.
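Point 4 can be pictured as a simple in-memory map from file path to its attributes and block locations. This is a deliberately simplified, hypothetical model of the Name Node, not its real implementation:

```java
import java.util.*;

public class ToyNameNode {
    // Per-file metadata: permissions plus the node locations of each block.
    static class FileMeta {
        final String permissions;
        final List<List<String>> blockLocations; // get(i) = nodes holding block i
        FileMeta(String permissions, List<List<String>> blockLocations) {
            this.permissions = permissions;
            this.blockLocations = blockLocations;
        }
    }

    // The whole namespace lives in main memory, keyed by file path.
    private final Map<String, FileMeta> namespace = new HashMap<>();

    void addFile(String path, String permissions, List<List<String>> blocks) {
        namespace.put(path, new FileMeta(permissions, blocks));
    }

    // A client read asks the Name Node which nodes hold a given block,
    // then fetches the data from those Data Nodes directly.
    List<String> locateBlock(String path, int blockIndex) {
        return namespace.get(path).blockLocations.get(blockIndex);
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.addFile("/logs/2014-01.log", "rw-r--r--",
                Arrays.asList(Arrays.asList("node1", "node2", "node3"),
                              Arrays.asList("node2", "node4", "node5")));
        System.out.println(nn.locateBlock("/logs/2014-01.log", 1));
    }
}
```

Because each entry is just a few strings, even millions of files fit comfortably in RAM, which is what makes metadata lookups fast.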
Purpose of Secondary Name Node
1. It is not a backup of the Name Node, and Data Nodes do not connect to it. It is just a helper of the Name Node.
2. It only performs periodic checkpoints.
3. It communicates with the Name Node to take snapshots of HDFS metadata.
4. These snapshots help minimize downtime and data loss.
HDFS Read
HDFS Write
Replica Placement
