Hadoop In Action
When: Tuesday 06-12-2016, 06:00 PM - 08:00 PM
Where: Badir Program for Technology Incubators
#DataRiyadh DataGeeks DataGeeksarabia
Enough talking about Big Data and Hadoop; let’s see how Hadoop works in action.
We will locate a real dataset, ingest it into our cluster, connect it to a database, apply some queries and data transformations to it, save the result, and show it via a BI tool.
Presented by Mahmoud Yassin
Hadoop:
-Hadoop quick definition.
-Why Hadoop?
-Hadoop ecosystem.
-Tools to be used.
Practical part:
-What’s the current setup?
-Ambari look.
-Current installed systems.
-Use case high-level description.
-Steps to develop the use case.
Use case:
-Locating the data.
-Ingest the data into HDFS.
-See how the files got created in HDFS.
-Feed other data from DB.
-Data querying via Hive and MapReduce.
-Hive table creation.
-Running transformation job via Pig.
-Check the Hive metastore.
-Connect BI to Hadoop.
-Sqoop basic commands.
-End-to-end look at the solution.
By Mahmoud Yassin
Hadoop Hands On session
What is Hadoop
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models.
Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware. It provides massive storage for any
kind of data, enormous processing power and the ability to handle virtually limitless
concurrent tasks or jobs.
Why is Hadoop important?
Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.
Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computation does not fail. Multiple copies of all data are stored automatically.
Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
Scalability
Horizontal scaling means that you scale by adding more machines to your pool of resources.
Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine.
Hadoop ecosystem
Cluster monitoring, provisioning and management
Hadoop | Data Ingestion
Apache Sqoop is a tool designed for efficiently transferring bulk data between
Apache Hadoop and structured data stores such as relational databases.
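To make the Sqoop description concrete, here is a minimal Python sketch that only assembles a `sqoop import` command line and prints it; it does not run Sqoop. The JDBC URL, user name, table name and target directory are hypothetical placeholders, not values from the session. The flags used (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop import arguments.

```python
import shlex


def sqoop_import_command(jdbc_url, username, table, target_dir, num_mappers=1):
    """Build (but do not run) a `sqoop import` command as an argv list."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,        # JDBC connection string of the source database
        "--username", username,
        "-P",                         # ask for the password interactively
        "--table", table,             # source table in the relational database
        "--target-dir", target_dir,   # destination directory in HDFS
        "--num-mappers", str(num_mappers),
    ]


# All values below are hypothetical placeholders for illustration only.
sqoop_argv = sqoop_import_command(
    jdbc_url="jdbc:mysql://localhost:3306/demo_db",
    username="demo_user",
    table="customers",
    target_dir="/user/demo/customers",
)
print(" ".join(shlex.quote(arg) for arg in sqoop_argv))
```

Keeping the command as a list of arguments makes it easy to pass to `subprocess.run` on a machine where Sqoop is actually installed.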
Hadoop | Data Storage Layer
Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. The design of HDFS was derived from the Google File System (GFS) paper.
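The idea of storing a large file across multiple machines can be sketched in a few lines of Python. This is a simplified model, not HDFS code: it assumes the Hadoop 2.x defaults of a 128 MiB block size and a replication factor of 3, and it uses a naive round-robin placement, whereas real HDFS placement is rack-aware. The node names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the default block size in Hadoop 2.x
REPLICATION = 3                 # default HDFS replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (naive round-robin)."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


nodes = ["node1", "node2", "node3", "node4"]         # hypothetical data nodes
blocks = split_into_blocks(300 * 1024 * 1024)        # a 300 MiB file
placement = place_replicas(len(blocks), nodes)

# Simulate losing one node: every block should still have a live replica.
failed = "node3"
survivors = {b: [n for n in ns if n != failed] for b, ns in placement.items()}

print([size // (1024 * 1024) for size in blocks])    # block sizes in MiB
print(placement)
print(all(survivors.values()))
```

With three copies of every block on different nodes, the loss of a single node leaves at least two replicas of each block available, which is the basis of the fault tolerance described earlier.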
Hadoop | Data Processing Layer
MapReduce is the heart of Hadoop. It is this programming paradigm that
allows for massive scalability across hundreds or thousands of servers in a
Hadoop cluster with a parallel, distributed algorithm.
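The MapReduce paradigm can be illustrated with the classic word count, written here as a single-process Python sketch. It only mimics the three phases (map, shuffle, reduce) in memory; on a real cluster the map and reduce calls would run in parallel on many nodes. The function names and the sample lines are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


sample_lines = ["big data and Hadoop", "Hadoop in action", "big data"]
counts = word_count(sample_lines)
print(counts)
```

Because each map call looks at one line only and each reduce call looks at one key only, the work can be spread over as many machines as there are input splits and keys.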
Hadoop | Data Processing Layer
Apache Pig is a high-level scripting language and execution environment for creating complex MapReduce transformations. Scripts are written in Pig Latin (the language) and translated into executable MapReduce jobs. Pig also allows the user to create user-defined functions (UDFs) using Java.
Hadoop | Data Querying Layer
Apache Hive is a distributed data warehouse built on top of HDFS to manage and organize large amounts of data. Hive provides a query language based on SQL semantics (HiveQL), which the runtime engine translates into MapReduce jobs for querying the data.
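To give a feel for how a SQL-style query can become a MapReduce job, the Python sketch below evaluates a GROUP BY aggregation as a map step and a reduce step. It is a conceptual illustration only, not how Hive's planner actually works internally. The `sales` table, its columns and its rows are hypothetical.

```python
from collections import defaultdict

# Conceptual equivalent of the (hypothetical) HiveQL query:
#   SELECT city, SUM(amount) FROM sales GROUP BY city;
sales = [
    {"city": "Riyadh", "amount": 120.0},
    {"city": "Jeddah", "amount": 75.5},
    {"city": "Riyadh", "amount": 30.0},
]


def map_row(row):
    """Map: the GROUP BY column becomes the key, the aggregated column the value."""
    return row["city"], row["amount"]


def reduce_group(city, amounts):
    """Reduce: apply the aggregate function (SUM) to all values of one key."""
    return city, sum(amounts)


def run_query(rows):
    grouped = defaultdict(list)
    for key, value in map(map_row, rows):
        grouped[key].append(value)
    return dict(reduce_group(city, amounts) for city, amounts in grouped.items())


totals = run_query(sales)
print(totals)
```

The user writes only the query; choosing the key, grouping the rows and running the aggregation across the cluster is left to the engine.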
Hadoop | Management Layer
Apache Ambari is an intuitive, easy-to-use Hadoop management web UI. It was donated by the Hortonworks team. It is a powerful interface for managing Hadoop and other typical applications from the Hadoop ecosystem.
Hadoop | Management Layer
Hue is an open-source web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license.
Big data existing solutions:
Current Setup
VMware is a subsidiary of Dell Technologies that provides cloud and virtualization software and services.
http://www.vmware.com/
Current Setup
The Cloudera QuickStart VM makes it easy to quickly get hands-on with CDH for testing, demo, and self-learning purposes, and includes Cloudera Manager for managing your cluster. It also includes a tutorial, sample data, and scripts for getting started.
http://www.cloudera.com/downloads/quickstart_vms/5-8.html
Inside the VM:
Our RDBMS Hadoop Storage
Use Case
The case:
Data sources: File, Video, RDBMS
Storage: HDFS
Pig: A platform for manipulating data stored in HDFS via a high-level language called Pig Latin. It does data extraction, transformation and loading, and basic analysis in batch mode.
Hive: A data warehouse with a SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming.
Impala: An open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
Connect BI tools to the Hadoop cluster
Cloudera CDH cluster
Basic Linux Commands
cat [filename]                     Display file’s contents to the standard output device (usually your monitor).
cd /directorypath                  Change to directory.
chmod [options] mode filename      Change a file’s permissions.
clear                              Clear a command line screen/window for a fresh start.
cp [options] source destination    Copy files and directories.
ls [options]                       List directory contents.
mkdir [options] directory          Create a new directory.
mv [options] source destination    Rename or move file(s) or directories.
pwd                                Display the pathname for the current directory.
touch filename                     Create an empty file with the specified name.
who [options]                      Display who is logged on.
Demo
Questions


Editor's Notes

  • #10: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
  • #11: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf