BIG DATA
The following topics will be covered in our Big Data Online Training:
Copyright @ 2015 Learntek. All Rights Reserved.
What is Hadoop?
Hadoop is a free, open-source, Java-based programming framework that supports
the processing of large data sets in a distributed computing environment. It is
an Apache project sponsored by the Apache Software Foundation. Hadoop makes it
possible to run applications on clusters with thousands of nodes and thousands
of terabytes of storage capacity. Its distributed file system (HDFS) facilitates
rapid data transfer among nodes and allows the system to keep operating
uninterrupted when a node fails. This approach lowers the risk of catastrophic
system failure, even if a significant number of nodes become inoperative.
Why Hadoop?
• Large Volumes of Data: Hadoop can quickly store and process huge amounts of varied data
(structured, unstructured and semi-structured). With data volumes and varieties
constantly increasing, especially from social media and the Internet of Things (IoT), that’s a
key consideration.
• Computing Power: Hadoop’s distributed computing model processes big data fast. The more
computing nodes you use, the more processing power you have.
• Fault Tolerance: Data and application processing are protected against hardware failure. If a
node goes down, jobs are automatically redirected to other nodes so that the
distributed computation does not fail. Multiple copies of all data are stored automatically.
• Flexibility: Unlike with a traditional relational database, you don’t have to preprocess data
before storing it. You can store as much data as you want and decide how to use it later. That
includes unstructured data such as text, images and videos.
• Low Cost: The open-source framework is free and uses commodity hardware to store large
quantities of data.
• Scalability: You can grow your system to handle more data simply by adding nodes.
Little administration is required.
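The fault-tolerance point can be illustrated with a small, self-contained Python sketch. It is not Hadoop code: the function names, node names and round-robin placement are hypothetical, and real HDFS placement also takes rack topology into account. It only shows why keeping several copies of each block on distinct nodes lets data survive node failures.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin).

    Hypothetical placement rule for illustration; HDFS itself uses a
    rack-aware policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive offsets modulo len(nodes) are distinct as long as
        # replication <= len(nodes).
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3"]
placement = place_replicas(blocks, nodes, replication=3)

# With three replicas on distinct nodes, losing any two nodes leaves
# at least one copy of every block.
survivors = readable_blocks(placement, failed_nodes=["node2", "node3"])
print(survivors == blocks)
```

A block becomes unreadable only when every node holding one of its replicas is down, which is why a higher replication factor tolerates more simultaneous failures at the cost of more raw storage.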
Hadoop Introduction
• Introduction to Data and System
• Types of Data
• Traditional way of dealing with large
data and its problems
• Types of Systems & Scaling
• What is Big Data
• Challenges in Big Data
• Challenges in Traditional
Application
• New Requirements
• What is Hadoop? Why Hadoop?
• Brief history of Hadoop
• Features of Hadoop
• Hadoop and RDBMS
• Overview of the Hadoop ecosystem
Hadoop Installation
• Installation in detail
• Creating Ubuntu image in VMware
• Downloading Hadoop
• Installing SSH
• Configuring Hadoop, HDFS &
MapReduce
• Download, Installation &
Configuration of Hive
• Download, Installation &
Configuration of Pig
• Download, Installation &
Configuration of Sqoop
• Configuring Hadoop in Different
Modes
Hadoop Distributed File System (HDFS)
• File System – Concepts
• Blocks
• Replication Factor
• Version File
• Safe mode
• Namespace IDs
• Purpose of NameNode
• Purpose of DataNode
• Purpose of Secondary
NameNode
• Purpose of JobTracker
• Purpose of TaskTracker
• HDFS Shell Commands –
copy, delete, create
directories etc.
• Reading and Writing in HDFS
• Differences between Unix
commands and HDFS
commands
• Hadoop Admin Commands
• Hands-on exercise with Unix
and HDFS commands
• Read / Write in HDFS –
Internal Process between
Client, NameNode &
DataNodes.
• Accessing HDFS using Java
API
• Various Ways of Accessing
HDFS
• Understanding HDFS Java
classes and methods
• Admin: Commissioning /
Decommissioning DataNodes
• Balancer
• Replication Policy
• Network Distance / Topology
Script
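To make the "Blocks" and "Replication Factor" topics concrete, here is a rough Python sketch of the arithmetic involved. The function name is hypothetical, and the defaults (128 MB blocks, replication factor 3) are assumed to match common Hadoop 2 settings; both are configurable per cluster and per file.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how many blocks a file occupies and its raw storage use.

    Assumed defaults: 128 MB block size and replication factor 3.
    The last block only occupies as much space as the data it holds,
    so raw storage is the file size times the replication factor.
    """
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication
    return num_blocks, raw_storage_mb


num_blocks, raw_mb = hdfs_footprint(1000)
print(num_blocks, raw_mb)  # 8 blocks, 3000 MB of raw cluster storage
```

Each of those blocks is stored on several DataNodes, while the NameNode keeps only the metadata describing which blocks make up the file and where their replicas live.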
MapReduce Programming
• About MapReduce
• Understanding blocks and
input splits
• MapReduce Data types
• Understanding Writable
• Data Flow in MapReduce
Application
• Understanding MapReduce
problem on datasets
• MapReduce and Functional
Programming
• Writing MapReduce
Application
• Understanding Mapper
function
• Understanding Reducer
Function
• Understanding Driver
• Usage of Combiner
• Understanding Partitioner
• Usage of Distributed Cache
• Passing the parameters to
mapper and reducer
• Analysing the Results
• Log files
• Input Formats and Output
Formats
• Counters, Skipping Bad and
unwanted Records
• Writing Joins in MapReduce
with 2 Input files. Join Types.
• Execute MapReduce Job –
Insights.
• Exercises on MapReduce.
• Job Scheduling: Types of
Schedulers.
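The mapper, shuffle and reducer stages listed above can be sketched in plain Python as a word count. This is a single-process illustration of the programming model only, not the Hadoop Java API; the function names are hypothetical and no cluster is involved.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Driver: wire map, shuffle and reduce together for a list of lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    # Keys are processed in sorted order, as reducers see them in MapReduce.
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


counts = run_job(["big data hadoop", "hadoop stores big data", "hadoop"])
print(counts)
```

In a real job the framework runs many mapper and reducer tasks in parallel on different nodes, and a partitioner decides which reducer receives each key; the data flow between the stages is the same as in this sketch.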
Hive
• Hive concepts
• Schema on Read vs Schema on
Write
• Hive architecture
• Install and configure Hive on a
cluster
• Metastore – Purpose & Types of
Configuration
• Different types of tables in Hive
• Buckets
• Partitions
• Joins in Hive
• Hive Query Language
• Hive Data Types
• Data Loading into Hive Tables
• Hive Query Execution
• Hive library functions
• Hive UDF
• Hive Limitations
Pig
• Pig basics
• Install and configure Pig on a cluster
• Pig library functions
• Pig vs Hive
• Write sample Pig Latin scripts
• Modes of running Pig
• Running in Grunt shell
• Running as Java program
• Pig UDFs
HBase
• HBase concepts
• HBase architecture
• Region server architecture
• File storage architecture
• HBase basics
• Column access
• Scans
• HBase use cases
• Install and configure HBase on a
multi-node cluster
• Create database, develop and
run sample applications
• Access data stored in HBase
using Java API
Sqoop
• Install and configure Sqoop on a cluster
• Connecting to RDBMS
• Installing MySQL
• Import data from MySQL to Hive
• Export data to MySQL
• Internal mechanism of import/export
Oozie
• Introduction to Oozie
• Oozie architecture
• XML file specifications
• Specifying Workflow
• Control nodes
• Oozie job coordinator
Flume
• Introduction to Flume
• Configuration and Setup
• Flume Sink with example
• Channel
• Flume Source with example
• Complex Flume architecture
ZooKeeper
• Introduction to ZooKeeper
• Challenges in Distributed Applications
• Coordination
• ZooKeeper: Design Goals
• Data Model and Hierarchical Namespace
• Client APIs
YARN
• Hadoop 1.0 Limitations
• MapReduce Limitations
• History of Hadoop 2.0
• HDFS 2: Architecture
• HDFS 2: Quorum-based storage
• HDFS 2: High availability
• HDFS 2: Federation
• YARN Architecture
• Classic vs YARN
• YARN Apps
• YARN multitenancy
• YARN Capacity Scheduler
Prerequisites:
• Knowledge of any programming language, databases and the
Linux operating system. Core Java or Python knowledge is helpful.
