Hadoop Introduction
   Background, Hello World, Installation, and Related
Outline

•   Background
•   Hello world
•   Installation
•   Related




12/20/12           2
Background
• Why Hadoop?
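   • Accessible: runs on cloud services such as AWS
   • Robust: gracefully handles most hardware failures
   • Scalable: scales linearly as nodes are added
   • Simple: code written for 1 machine works the same on 10,000 (1w)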
• Key Points:
   • Scale-out
   • Moving code to data

Background: History
• Apache top-level project, created by Doug Cutting
• Lucene -> Nutch -> Hadoop (2004)
   • Yahoo (1w, i.e. 10,000)
   • Facebook (Hive, HBase, …)
   • Hulu (HBase)
   • Baidu (3000 TB, one week)
   • Twitter (tweet data)


Background
• Comparing SQL databases and Hadoop
   • Structure:
      • SQL: structured data with a fixed schema
      • Hadoop: key-value pairs, suited to less-structured data such as text and images
   • Scale-out (Hadoop) instead of scale-up (SQL)
   • Key-value pairs instead of relational tables
   • Functional programming (MapReduce) instead of declarative queries (SQL)
   • Offline batch processing (write once, read many times) instead of online processing
Background – Understanding
• Word Count
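     • As files grow, an in-memory table of counts runs out of memory
     • A disk-based hash table works, but is more complex
     • Distributed:
         • Phase 1: each machine processes its own part of the input
         • Phase 2: merge the results
            • Shuffle the partitions to the appropriate machines (e.g., split alphabetically)

     • With these pieces, we have already built a minimal Hadoop.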
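
The two-phase idea on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not Hadoop code: the "machines" are ordinary lists, and the helper names (`count_part`, `partition_for`, `shuffle`, `merge`) and the first-character routing rule are hypothetical choices made for the example.

```python
from collections import Counter

def count_part(lines):
    """Phase 1: count the words in one part of the input, independently."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def partition_for(word, num_machines):
    """Hypothetical shuffle rule: route a word by its first character."""
    return ord(word[0]) % num_machines

def shuffle(partial_counts, num_machines):
    """Send every (word, count) pair to the machine that owns that word."""
    inboxes = [[] for _ in range(num_machines)]
    for counts in partial_counts:
        for word, n in counts.items():
            inboxes[partition_for(word, num_machines)].append((word, n))
    return inboxes

def merge(inbox):
    """Phase 2: merge the partial counts that arrived at one machine."""
    total = Counter()
    for word, n in inbox:
        total[word] += n
    return total

def distributed_word_count(parts, num_machines=2):
    """Run phase 1 on each part, shuffle, then run phase 2 per machine."""
    partial = [count_part(part) for part in parts]
    merged = [merge(inbox) for inbox in shuffle(partial, num_machines)]
    result = Counter()
    for machine_counts in merged:
        result.update(machine_counts)
    return dict(result)
```

Because each word is routed to exactly one machine, the per-machine results never overlap and can simply be concatenated; that is the property the shuffle step is meant to provide.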



Hello World: Word Count
• Two phases:
     • Mapping: read the input data and load it into the mappers
     • Reducing: process all of the mappers' output and produce the final result

•   1.1    list(filename, file content)
•   1.2    list(word, 1)
•   2.1    list(word, list(1))
•   2.2    list(word, count)
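
Steps 1.1 through 2.2 can be traced with a small in-memory sketch. It assumes everything fits in one process and uses sorting to stand in for the grouping that the framework performs between the two phases; the function names (`map_phase`, `group_by_key`, `reduce_phase`) are illustrative, not part of any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(files):
    """1.1 -> 1.2: turn (filename, content) pairs into (word, 1) pairs."""
    for _filename, content in files:
        for word in content.split():
            yield (word, 1)

def group_by_key(pairs):
    """1.2 -> 2.1: collect the values emitted for each word."""
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield (word, [value for _, value in group])

def reduce_phase(grouped):
    """2.1 -> 2.2: sum each word's list of values to get its count."""
    for word, values in grouped:
        yield (word, sum(values))

def word_count(files):
    """Chain the three steps and return a list of (word, count) pairs."""
    return list(reduce_phase(group_by_key(map_phase(files))))
```

Calling `word_count([("a.txt", "hello world"), ("b.txt", "hello hadoop")])` yields one `(word, count)` pair per distinct word, ordered by word because of the sort.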



Hello World
• mapper.py
• Reducer.py
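
The contents of the two scripts are not reproduced in the slides. As a hedged sketch of what such scripts conventionally do under Hadoop Streaming, the logic might look like the following. It assumes tab-separated `word<TAB>count` lines and reducer input already sorted by key; `run_mapper` and `run_reducer` are hypothetical names chosen for this example.

```python
from itertools import groupby

def run_mapper(lines):
    """Mapper logic: emit one 'word<TAB>1' line for every word read."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def run_reducer(sorted_lines):
    """Reducer logic: input is sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

def simulate(lines):
    """Local stand-in for the map -> sort -> reduce pipeline."""
    return list(run_reducer(sorted(run_mapper(lines))))
```

In a real streaming job each script would read `sys.stdin` and print its results, for example `for out in run_mapper(sys.stdin): print(out)`. A common way to try such scripts without a cluster is a shell pipeline along the lines of `cat input.txt | python mapper.py | sort | python reducer.py`.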




Installation
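• Modes:
   • Standalone (local) mode (default)
   • Pseudo-distributed mode (recommended for development and debugging)
   • Fully distributed mode
• Configuration:
   • Basic configuration
   • SSH configuration
   • Ubuntu configuration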

Hadoop Framework
• HDFS:
   • NameNode: tracking, directing, bookkeeping
   • DataNode: low-level I/O operations
   • Secondary NameNode
• MapReduce:
   • JobTracker
   • TaskTracker


Related
• Programming:
   • Java
   • Python
      • Jython (translates Python to run on the JVM)
      • Hadoop Streaming (stdin, stdout)
      • Dumbo
      • Happy


Related
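•   Pig: high-level dataflow language
•   Hive: SQL-style data warehouse
•   HBase: column-oriented database modeled on Google Bigtable
•   ZooKeeper: coordination system for shared state
•   Chukwa: data collection system
•   Mahout: data mining and machine learning
•   Hama: matrix computation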


Resources
• Books:
   • Hadoop in Action
   • Hadoop 实战 (2nd edition)
• Videos and Google course
• URL:
   • Bookmarked resources




Thanks






