Hadoop Jon 
By HumoyunJon Lee
90% OF THE WORLD’S DATA HAS BEEN GENERATED IN THE LAST 
THREE YEARS ALONE, AND IT IS GROWING 
AT EVEN A MORE RAPID RATE. 
BIG DATA 
The world has seen exponential data growth, driven by social media, 
mobility, e-commerce, and other factors. 
• Volume 
• Variety 
• Velocity
“Big Data is like teenage sex; 
everyone talks about it, 
nobody really knows how to do it, 
everyone thinks everyone else is doing it, 
so everyone claims they are doing it” 
Dan Ariely, Duke University
Big Data Ecosystem
To Address This Issue 
We need Hadoop
A Shared-Nothing Architecture: 
What is Hadoop?
The Apache Hadoop software library is a framework that allows for the 
distributed processing of large data sets across clusters of computers using 
simple programming models. It is designed to scale up from single servers to 
thousands of machines, each offering local computation and storage. Rather 
than rely on hardware to deliver high-availability, the library itself is designed 
to detect and handle failures at the application layer, so delivering a highly-available 
service on top of a cluster of computers, each of which may be prone 
to failures.
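The "simple programming models" the description refers to are chiefly MapReduce. As a rough illustration of the model (plain Python, no Hadoop required), a word count breaks into a map phase, a shuffle/group phase that the framework performs, and a reduce phase:

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, as a Hadoop mapper would
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "hadoop processes big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In a real cluster the map and reduce functions run on many machines in parallel, next to the data; only the shape of the computation is the same here.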
Prerequisites: 
• Installing Java v1.5+ 
• Adding dedicated Hadoop system user. 
• Configuring SSH access. 
• Disabling IPv6. 
Installing Hadoop
Configuring Hadoop: 
a. hadoop-env.sh 
b. core-site.xml 
c. mapred-site.xml 
d. hdfs-site.xml
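As a sketch of what goes into these files (values are illustrative for a single-node setup of this era of Hadoop, not prescriptive for your cluster), a minimal core-site.xml might look like:

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>URI of the default file system (the NameNode).</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>Base directory for Hadoop's temporary files.</description>
  </property>
</configuration>
```

hadoop-env.sh sets environment variables such as JAVA_HOME, mapred-site.xml configures the JobTracker address, and hdfs-site.xml holds HDFS settings such as the replication factor.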
Hadoop comes with several web interfaces which are by 
default available at these locations: 
• http://localhost:50070/ – web UI of the NameNode daemon 
• http://localhost:50030/ – web UI of the JobTracker daemon 
• http://localhost:50060/ – web UI of the TaskTracker daemon 
Hadoop Web Interfaces
Hadoop Key Characteristics: 
• Scalable 
• Economical 
• Flexible 
• Reliable
• Scalable – New nodes can be added as needed, without changing 
data formats, how data is loaded, how jobs are written, or the 
applications on top. 
• Economical – Hadoop brings massively parallel computing to 
commodity servers. The result is a sizeable decrease in the cost per 
terabyte of storage, which in turn makes it affordable to model all 
your data.
• Flexible – Hadoop is schema-less, and can absorb any type of data, 
structured or not, from any number of sources. Data from multiple 
sources can be joined and aggregated in arbitrary ways enabling 
deeper analyses than any one system can provide. 
• Reliable – When you lose a node, the system redirects work to 
another copy of the data and continues processing without missing 
a beat.
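The reliability point can be sketched with a toy replica model (not HDFS's actual rack-aware placement policy): each block is written to several nodes, so losing one node still leaves readable copies.

```python
import random

REPLICATION = 3  # HDFS's default replication factor

def place_block(block_id, nodes, replication=REPLICATION):
    # Toy placement: pick `replication` distinct nodes for this block
    return random.sample(nodes, replication)

def read_block(replicas, live_nodes):
    # Read from any replica whose node is still alive
    for node in replicas:
        if node in live_nodes:
            return f"read from {node}"
    raise IOError("all replicas lost")

nodes = [f"node{i}" for i in range(10)]
replicas = place_block("blk_0001", nodes)

# Simulate the first replica's node failing: the block is still readable
live = set(nodes) - {replicas[0]}
print(read_block(replicas, live))
```

In real HDFS the NameNode additionally notices the lost replica and schedules re-replication so the block returns to full replication.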
Hadoop Ecosystem
HDFS Architecture
• HDFS is designed to store a very large amount of information 
(terabytes or petabytes). This requires spreading the data across a 
large number of machines. 
• HDFS stores data reliably. If individual machines in the cluster fail, 
the data remains available thanks to redundancy. 
Hadoop Distributed File 
System (HDFS):
• HDFS provides fast, scalable access to the information loaded on the 
clusters. It is possible to serve a larger number of clients by simply 
adding more machines to the cluster. 
• HDFS integrates well with Hadoop MapReduce, allowing data to be 
read and computed upon locally whenever needed. 
• HDFS was originally built as infrastructure for the Apache Nutch 
web search engine project.
Hadoop does not require expensive, highly reliable hardware. It is 
designed to run on clusters of commodity hardware; an HDFS instance 
may consist of hundreds or thousands of server machines, each storing 
part of the file system’s data. The fact that there are a huge number of 
components and that each component has a non-trivial probability of 
failure means that some component of HDFS is always non-functional. 
Therefore, detection of faults and quick, automatic recovery from them 
is a core architectural goal of HDFS. 
Commodity Hardware Failure:
Applications that run on HDFS need continuous access to their data 
sets. HDFS is designed more for batch processing rather than interactive 
use by users. The emphasis is on high throughput of data access rather 
than low latency of data access. 
Continuous Data Access:
Applications that run on HDFS have large data sets. A typical file in 
HDFS is gigabytes to terabytes in size. So, HDFS is tuned to support 
large files. 
It is also worth examining the 
applications for which using HDFS 
does not work so well. While this 
may change in the future, these are 
areas where HDFS is not a good fit 
today: 
Very Large Data Files:
• Low-latency data access 
• Lots of small files 
• Multiple writers, arbitrary file modifications
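The "large files, not lots of small files" point comes down to NameNode bookkeeping: the NameNode keeps metadata for every file and block in memory, so it is the block count, not the byte count, that hurts. A back-of-the-envelope sketch, assuming the classic 64 MB default block size:

```python
import math

BLOCK_SIZE = 64 * 1024**2  # 64 MB, the classic HDFS default

def num_blocks(file_size_bytes):
    # Every file occupies at least one block's worth of metadata
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

one_tb = 1024**4

# One 1 TB file: 16,384 blocks of metadata
print(num_blocks(one_tb))  # 16384

# The same terabyte as a million 1 MB files: a million blocks
# (plus a million file entries) for the NameNode to track
print(1_000_000 * num_blocks(1024**2))  # 1000000
```

Sixty times more metadata for the same bytes is why many small files are a poor fit for HDFS.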
• Pig is an open-source high-level dataflow 
system. 
• It provides a simple language for queries and 
data manipulation, Pig Latin, which is compiled 
into MapReduce jobs that run on Hadoop. 
• Why is it important? 
- Companies like Yahoo, Google and Microsoft 
are collecting vast data sets in the form of click 
streams, search logs, and web crawls. 
- Some form of ad-hoc processing and analysis 
of all of this information is required. 
What is Pig
• An ad-hoc way of creating and executing MapReduce jobs on very 
large data sets 
• Rapid Development 
• No Java is required 
• Developed by Yahoo! 
Why was Pig created?
Hadoop jon
• Pig is a data flow language. It sits on top of Hadoop and makes it 
possible to create complex jobs to process large volumes of data 
quickly and efficiently. 
• It will consume any data that you feed it: Structured, semi-structured, 
or unstructured. 
• Pig provides the common data operations (filters, joins, ordering) and 
nested data types (tuple, bags, and maps) which are missing in 
MapReduce. 
• Pig scripts are easier and faster to write than standard Java Hadoop 
jobs, and Pig has a lot of clever optimizations, like multi-query 
execution, which can make complex queries execute quicker. 
Where Should I Use Pig?
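The filter/group/count dataflow a Pig Latin script expresses can be sketched in plain Python (illustration only; in Pig this would be a few LOAD/FILTER/GROUP/FOREACH statements that the runtime compiles to MapReduce jobs):

```python
from itertools import groupby
from operator import itemgetter

# Toy click-stream records: (user, url)
clicks = [
    ("alice", "/home"), ("bob", "/home"),
    ("alice", "/search"), ("carol", "/home"),
]

# FILTER: keep only hits on /home
home_hits = [c for c in clicks if c[1] == "/home"]

# GROUP ... BY url, then FOREACH ... GENERATE COUNT(*)
home_hits.sort(key=itemgetter(1))
counts = {url: len(list(group))
          for url, group in groupby(home_hits, key=itemgetter(1))}

print(counts)  # {'/home': 3}
```

Each step (filter, group, aggregate) maps onto one statement of a Pig script, which is why ad-hoc analysis of click streams and logs is so much shorter in Pig than in hand-written MapReduce.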
• Hive is a data warehouse infrastructure built 
on top of Hadoop. 
• It facilitates querying large datasets residing 
in distributed storage. 
• It provides a mechanism to project structure 
onto the data and query the data using a 
SQL-like query language called “HiveQL”. 
What is Hive
• Hive was developed by Facebook and was open-sourced in 2008. 
• Data stored in Hadoop is inaccessible to business users. 
• High-level languages like Pig, Cascading, etc. are geared towards 
developers. 
• SQL is a common language that is known to many. Hive was 
developed to give access to data stored in Hadoop, translating 
SQL-like queries into MapReduce jobs. 
Why was Hive developed?
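HiveQL reads much like standard SQL; the difference is that Hive compiles the query into MapReduce jobs over HDFS files rather than running it against a database engine. As a rough illustration of the query style (using SQLite here purely as a stand-in, since plain SQL runs the same shape of query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("alice", "/home"), ("bob", "/home"), ("alice", "/search")],
)

# The same SELECT/GROUP BY shape a HiveQL query would use;
# Hive would translate this into one or more MapReduce jobs.
rows = conn.execute(
    "SELECT url, COUNT(*) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
print(rows)  # [('/home', 2), ('/search', 1)]
```

This familiarity is exactly the point of the last bullet: an analyst who knows SQL can query Hadoop data without writing any Java.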
Thank you, everyone. 
This morning's presentation 
is over.
