K-means in Hadoop
K-means && Spark && Plan
Outline

• K-means
• Spark
• Plan




2012-12-20   2
K-means in Hadoop
• Programs:
   • Kmeans.py: the k-means core algorithm
   • Wrapper.py: local controller for the k-means
     iterations
   • Generator.py: generate random data within a
     given range
   • Graph.py: plot the data




Flowchart




Kmeans.py
• Uses the “in-mapper combining” pattern to implement
  combiner functionality within every map task. Note: this
  is not the combiner phase.
• It makes a discrete Combine step between Map and Reduce
  unnecessary. Normally it is not guaranteed that a combiner
  function will be called on every mapper, or that, if called,
  it will be called only once.
• With the in-mapper combiner design pattern, we guarantee that
  combiner-like key aggregation occurs in every mapper,
  instead of optionally in some mappers.

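The deck does not show the mapper itself, so here is a minimal sketch of what in-mapper combining could look like for Kmeans.py under Hadoop Streaming (the centroid list, record layout, and function names are assumptions, not the actual code):

```python
import sys
from collections import defaultdict

# Hypothetical centroids; in practice Wrapper.py ships these as a side file.
CENTROIDS = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]

def nearest(point):
    """Index of the closest centroid by squared Euclidean distance."""
    return min(range(len(CENTROIDS)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, CENTROIDS[i])))

def mapper(lines):
    """In-mapper combining: aggregate (sum_x, sum_y, count) per centroid
    in memory and emit one record per centroid after all input is read,
    instead of one record per input point."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for line in lines:
        x, y = map(float, line.split())
        k = nearest((x, y))
        acc = sums[k]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {k: tuple(v) for k, v in sums.items()}

# Under Hadoop Streaming the driver would be:
#   for k, (sx, sy, n) in mapper(sys.stdin).items():
#       print("%d\t%f\t%f\t%d" % (k, sx, sy, n))
```

With this shape the reducer only has to add up a handful of partial sums per centroid, which is consistent with the 1-2 second reduce phase reported on the next slide.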
Kmeans.py
• The aggregation is done entirely in memory, without
  touching disk, and it happens before any emission code
  is called.
• However, it cannot rule out memory exhaustion on its own:
  the in-memory state grows with the number of distinct keys,
  so we bound it in the Python code.
• Results (3.6 GB test dataset)
   • Old: 30+ min
   • Current: 9+ min; the reduce phase now takes only
     1~2 seconds, a significant saving.
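The memory concern above can be handled by capping the in-memory buffer and flushing partial aggregates whenever it grows too large; correctness is preserved because partial (sum, count) pairs still add up in the reducer. A generic sketch (the cap, the `emit` callback, and the record format are assumptions, not from the slides):

```python
from collections import defaultdict

MAX_KEYS = 100_000  # hypothetical cap on in-memory state

def bounded_mapper(records, emit, max_keys=MAX_KEYS):
    """In-mapper combining with a size cap: when the buffer holds too many
    distinct keys, flush partial aggregates and start over. The reducer
    still sees correct totals because (sum, count) pairs are additive."""
    buf = defaultdict(lambda: [0.0, 0])   # key -> [partial_sum, count]
    for key, value in records:
        acc = buf[key]
        acc[0] += value
        acc[1] += 1
        if len(buf) >= max_keys:
            for k, (s, n) in buf.items():
                emit(k, s, n)
            buf.clear()
    for k, (s, n) in buf.items():         # final flush
        emit(k, s, n)
```

For k-means the key space is just the k centroid ids, so the cap is rarely hit, but the same pattern keeps the mapper safe for jobs with many distinct keys.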
Generator.py




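The Generator.py slide is an image in the original deck. A plausible reconstruction of "generate random data within a given range" for testing k-means might look like this (the function name, parameters, and the Gaussian spread are all assumptions):

```python
import random

def generate(n_points, n_clusters, low=0.0, high=100.0, spread=2.0, seed=None):
    """Generate n_points 2-D points scattered around n_clusters random
    centers, each center drawn uniformly from [low, high] per axis."""
    rng = random.Random(seed)
    centers = [(rng.uniform(low, high), rng.uniform(low, high))
               for _ in range(n_clusters)]
    points = []
    for _ in range(n_points):
        cx, cy = rng.choice(centers)              # pick a cluster
        points.append((rng.gauss(cx, spread),     # jitter around its center
                       rng.gauss(cy, spread)))
    return centers, points
```

The known centers make it easy to check afterwards (e.g. with Graph.py) whether the clustering recovered the planted structure.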
Wrapper.py
• Main controller for the k-means iterations
• Functions:
   • Start the map-reduce job
   • Ship the base data and program to the map phase
   • Check whether the run has converged
• Results:
   • Source: 13 clusters
   • Target: 10 clusters -> 180+ iterations
   • Target: 13 clusters -> 7-8 iterations

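The iteration logic Wrapper.py implements can be sketched as a small driver loop: run one MapReduce pass, measure how far the centroids moved, and stop once the movement falls below a threshold. Here `run_iteration` is a stand-in for submitting the Hadoop Streaming job and reading the new centroids back from HDFS (the names and the epsilon are assumptions):

```python
def total_shift(old, new):
    """Sum of Euclidean distances between matching old and new centroids."""
    return sum(sum((a - b) ** 2 for a, b in zip(o, n)) ** 0.5
               for o, n in zip(old, new))

def kmeans_driver(centroids, run_iteration, eps=1e-6, max_iters=200):
    """Local controller: iterate MapReduce passes until the centroids
    stop moving (total shift <= eps) or max_iters is reached."""
    for it in range(1, max_iters + 1):
        new_centroids = run_iteration(centroids)
        shift = total_shift(centroids, new_centroids)
        centroids = new_centroids
        if shift <= eps:
            return centroids, it
    return centroids, max_iters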
Processing (13 clusters)

•   110331.286264 -> 43648.070121
•   43648.070121 -> 22167.351291
•   22167.351291 -> 5853.008014
•   5853.008014 -> 552.292067
•   552.292067 -> 8.202320
•   8.202320 -> 0.000000
•   0.000000 -> 0.000000



Spark
• In-memory, high performance, written in Scala
• Spark introduces in-memory distributed datasets; besides
  supporting interactive queries, it also optimizes iterative
  workloads.
• Spark and Scala integrate tightly: Scala can manipulate
  distributed datasets as easily as local collection objects.
• Although Spark was created to support iterative jobs on
  distributed datasets, in practice it complements Hadoop and
  can run in parallel on the Hadoop file system.
• Scala is a multi-paradigm language that supports imperative,
  functional, and object-oriented language features in a fluent,
  comfortable way.



Spark
• Spark is designed for a specific type of cluster-computing
  workload: workloads that reuse a working dataset across
  parallel operations, such as machine learning algorithms.
• Spark introduces the concept of in-memory cluster computing:
  datasets are cached in memory to reduce access latency.




Other big-data analysis frameworks
• GraphLab: focuses on parallel implementations of machine
  learning algorithms
• Storm: "the Hadoop of real-time processing"; it focuses
  mainly on stream processing and continuous computation
  (delivering results as data arrives). Storm is written in
  Clojure (a dialect of Lisp) but supports applications
  written in any language, such as Ruby and Python.




Plan
• Get 27 PCs running properly in Hadoop
• Remote management: write shell scripts for power saving,
  task submission by everyone, etc.
• Deploy Mesos, Spark, ZooKeeper, and HBase on our platform.




Thanks




