Hadoop introduction
Background
● Big data challenge - (link)
Use Case - Retail
● "Know the enemy and know yourself, and in a hundred battles you will never be in peril": Retail and The Big Data Evolution
Use Case - Retail
● Data is providing gains in three main ways: opening new channels,
tailoring services, and driving revenue.
Pain Point
● Big data is the term for a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management tools or
traditional data processing applications.
So what is Hadoop?
● Hadoop was created by Doug Cutting and Mike Cafarella.
● Hadoop provides a reliable shared storage and analysis system. (HDFS for storage, MapReduce for analysis)
● It is designed to scale up from a single server to thousands of machines, with a
high degree of fault tolerance. (HDFS)
● It provides a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed
algorithm on a cluster. (MapReduce)
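The MapReduce model described above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "hadoop handles big data"]))
```

Because each call to map_phase depends only on its own line, and each call to reduce_phase only on its own key, both steps can be spread across machines without coordination.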
History
Hadoop Distributed FileSystem (HDFS)
● A file is split into blocks (default 64 MB in Hadoop 1.x; 128 MB from Hadoop 2.x), and each
block is replicated across the cluster (default replication factor 3).
● There is no built-in limit on the number or size of files; in practice, NameNode memory bounds the file count.
● HDFS lets you put, get, and delete files, but not update them in place.
● Follows the philosophy: "Write Once, Read Many Times!"
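The block and replication defaults above translate into simple arithmetic. The sketch below assumes the 64 MB block size and replication factor of 3 quoted on this slide; hdfs_footprint is a hypothetical helper, not part of any Hadoop library.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (logical blocks, stored block replicas, raw bytes stored).

    The last block of a file only occupies as much space as the data it
    holds, so raw storage is the file size times the replication factor.
    """
    blocks = -(-file_size_bytes // block_size_bytes)  # integer ceiling division
    replicas = blocks * replication
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


if __name__ == "__main__":
    blocks, replicas, raw_bytes = hdfs_footprint(200 * MB)
    print(blocks, replicas, raw_bytes // MB)  # 4 blocks, 12 replicas, 600 MB raw
```

For a 200 MB file this gives three full 64 MB blocks plus one 8 MB block, each stored three times.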
MapReduce Flow
Hadoop Ecosystem
● The Hadoop Ecosystem Table
Flume
● Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data.
● Each Flume agent has a source, a channel, and a sink: the source receives events, the channel buffers them, and the sink delivers them onward.
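The source, channel, and sink roles can be pictured with a toy model. This is only an illustration in plain Python, not Flume's API or configuration format; ToyAgent and its attributes are hypothetical names.

```python
from collections import deque


class ToyAgent:
    """Toy model of one agent: source -> channel -> sink."""

    def __init__(self):
        self.channel = deque()  # buffers events between source and sink
        self.delivered = []     # stands in for the destination (e.g. HDFS)

    def source(self, lines):
        """Source: accept incoming log lines and place them on the channel."""
        for line in lines:
            self.channel.append(line)

    def sink(self):
        """Sink: drain the channel in arrival order and deliver each event."""
        while self.channel:
            self.delivered.append(self.channel.popleft())


if __name__ == "__main__":
    agent = ToyAgent()
    agent.source(["GET /index.html", "GET /cart"])
    agent.sink()
    print(agent.delivered)
```

The channel decouples the two ends: the source can keep accepting events while the sink is slow or temporarily unavailable.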
Sqoop
● Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured datastores such as relational databases.
Hive
● Apache Hive is a high-level abstraction on top of MapReduce
● Uses an SQL-like language called HiveQL
● Generates MapReduce jobs that run on the Hadoop cluster
● Originally developed by Facebook for data warehousing; now an open-source
Apache project.
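To see how an SQL-like query can become a MapReduce job, consider a GROUP BY with a SUM. The sketch below is a conceptual model in plain Python, not Hive's actual query planner; the sales table, its columns, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a `sales` table: (store, amount).
SALES = [("taipei", 120), ("tokyo", 80), ("taipei", 30), ("tokyo", 20), ("osaka", 50)]

# The query being modelled (illustrative HiveQL):
#   SELECT store, SUM(amount) FROM sales GROUP BY store;


def map_row(row):
    """Map: the GROUP BY column becomes the key, the aggregated column the value."""
    store, amount = row
    return store, amount


def run_group_by_sum(rows):
    """Shuffle values by key, then reduce each group with SUM."""
    groups = defaultdict(list)
    for key, value in map(map_row, rows):
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


if __name__ == "__main__":
    print(run_group_by_sum(SALES))
```

The user writes only the query; choosing the key, grouping, and aggregation steps is what the generated job takes care of.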
HBase
● Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
● Linear scalability, capable of storing hundreds of terabytes of data
● Automatic and configurable sharding of tables
● Automatic failover support
● Block cache and Bloom Filters for real-time queries
● Provides realtime random read/write access to data stored in HDFS.
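A Bloom filter helps real-time reads by answering "this key is definitely not here" without touching the data on disk. The following is a minimal, generic Bloom filter in Python, not HBase's implementation; the bit-array size, hash count, and use of SHA-256 are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        """Derive num_hashes bit positions from the key by salting one hash."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        """Set every bit position derived from the key."""
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        """False means the key was never added; True means it may have been."""
        return all(self.bits[position] for position in self._positions(key))


if __name__ == "__main__":
    row_keys = BloomFilter()
    row_keys.add("row-0001")
    print(row_keys.might_contain("row-0001"))  # True: added keys are always found
```

A store would keep one such filter per data file and skip any file whose filter returns False for the requested row key.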
Pig
● Apache Pig is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure for
evaluating these programs.
● The data flow language (Pig Latin)
● The interactive shell where you can type Pig Latin statements (Grunt)
● The Pig interpreter and execution engine
Oozie
● Oozie is a workflow engine that runs on a server, typically outside the
cluster. It runs workflows of Hadoop jobs, including Pig, Hive, and Sqoop jobs, and
submits those jobs to the cluster based on a workflow definition.
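A workflow definition is essentially a set of actions plus the order in which they may run. The sketch below models that ordering with Python's standard graphlib module (Python 3.9+); it is not Oozie's XML workflow format, and the action names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
WORKFLOW = {
    "sqoop_import": set(),             # pull data from a relational database
    "pig_clean": {"sqoop_import"},     # clean the imported data
    "hive_report": {"pig_clean"},      # aggregate the cleaned data
}


def execution_order(workflow):
    """Return the actions in an order that respects every dependency."""
    return list(TopologicalSorter(workflow).static_order())


if __name__ == "__main__":
    for action in execution_order(WORKFLOW):
        print("submit", action)
```

A workflow engine would submit each action to the cluster only after the actions it depends on have finished successfully.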
Why?
Why?
Appendix
● What is Hadoop?
● https://www.youtube.com/watch?v=4DgTLaFNQq0
● https://www.youtube.com/watch?v=9s-vSeWej1U
● Intro to MapReduce
● https://www.youtube.com/watch?v=HFplUBeBhcM
● What is Big Data? Big Data Explained (Hadoop & MapReduce)
● Big Data University - Hadoop Fundamentals I - v2
● Big Data Challenges
Appendix
● Hadoop Tutorial 1 - What is Hadoop?
● Hadoop Tutorial 2 - Challenges Created by Big Data
● Hadoop Tutorial 3 - History Behind Creation of Hadoop (Google, Yahoo, and Apac
● Hadoop Tutorial 4 - Overview of Hadoop Projects
● Hadoop Tutorial 5 - Steps to Install Hadoop on a Personal Computer (Windows/OS
● Hadoop Tutorial 6 - Downloading and Installing Oracle VirtualBox
● Hadoop Tutorial 7 - Downloading Hadoop Appliance for Oracle VirtualBox
