K-means in Hadoop
K-means && Spark && Plan
Outline

• K-means
• Spark
• Plan




2012-12-20   2
K-means in Hadoop
• Programs:
   • Kmeans.py: the k-means core algorithm
   • Wrapper.py: local controller for the k-means
     iterations
   • Generator.py: generate random data within a
     given range
   • Graph.py: plot the data




Flowchart




Kmeans.py
• Uses the “in-mapper combining” pattern to implement
  combiner functionality within every map task. Note: this
  is not the combiner phase.
• It makes a discrete Combine step between Map and Reduce
  unnecessary. Normally it is not guaranteed that a combiner
  function will be called on every mapper, or that, if called,
  it will be called only once.
• With the in-mapper combiner design pattern, we guarantee that
  combiner-like key aggregation occurs in every mapper,
  instead of optionally in some mappers.

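The deck does not show the mapper itself, so here is a minimal sketch of what in-mapper combining could look like for Kmeans.py under Hadoop Streaming (the centroid list, record layout, and function names are assumptions, not the actual code):

```python
import sys
from collections import defaultdict

# Hypothetical centroids; in practice Wrapper.py ships these as a side file.
CENTROIDS = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]

def nearest(point):
    """Index of the closest centroid by squared Euclidean distance."""
    return min(range(len(CENTROIDS)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, CENTROIDS[i])))

def mapper(lines):
    """In-mapper combining: aggregate (sum_x, sum_y, count) per centroid
    in memory and emit one record per centroid after all input is read,
    instead of one record per input point."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for line in lines:
        x, y = map(float, line.split())
        k = nearest((x, y))
        acc = sums[k]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {k: tuple(v) for k, v in sums.items()}

# Under Hadoop Streaming the driver would be:
#   for k, (sx, sy, n) in mapper(sys.stdin).items():
#       print("%d\t%f\t%f\t%d" % (k, sx, sy, n))
```

With this shape the reducer only has to add up a handful of partial sums per centroid, which is consistent with the 1-2 second reduce phase reported on the next slide.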
Kmeans.py
• The aggregation is done entirely in memory, without
  touching disk, and it happens before any emission code
  is called.
• However, it cannot rule out memory exhaustion on its own:
  the in-memory state grows with the number of distinct keys,
  so we bound it in the Python code.
• Results (3.6 GB test dataset)
   • Old: 30+ min
   • Current: 9+ min; the reduce phase now takes only
     1~2 seconds, a significant saving.
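The memory concern above can be handled by capping the in-memory buffer and flushing partial aggregates whenever it grows too large; correctness is preserved because partial (sum, count) pairs still add up in the reducer. A generic sketch (the cap, the `emit` callback, and the record format are assumptions, not from the slides):

```python
from collections import defaultdict

MAX_KEYS = 100_000  # hypothetical cap on in-memory state

def bounded_mapper(records, emit, max_keys=MAX_KEYS):
    """In-mapper combining with a size cap: when the buffer holds too many
    distinct keys, flush partial aggregates and start over. The reducer
    still sees correct totals because (sum, count) pairs are additive."""
    buf = defaultdict(lambda: [0.0, 0])   # key -> [partial_sum, count]
    for key, value in records:
        acc = buf[key]
        acc[0] += value
        acc[1] += 1
        if len(buf) >= max_keys:
            for k, (s, n) in buf.items():
                emit(k, s, n)
            buf.clear()
    for k, (s, n) in buf.items():         # final flush
        emit(k, s, n)
```

For k-means the key space is just the k centroid ids, so the cap is rarely hit, but the same pattern keeps the mapper safe for jobs with many distinct keys.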
Generator.py




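The Generator.py slide is an image in the original deck. A plausible reconstruction of "generate random data within a given range" for testing k-means might look like this (the function name, parameters, and the Gaussian spread are all assumptions):

```python
import random

def generate(n_points, n_clusters, low=0.0, high=100.0, spread=2.0, seed=None):
    """Generate n_points 2-D points scattered around n_clusters random
    centers, each center drawn uniformly from [low, high] per axis."""
    rng = random.Random(seed)
    centers = [(rng.uniform(low, high), rng.uniform(low, high))
               for _ in range(n_clusters)]
    points = []
    for _ in range(n_points):
        cx, cy = rng.choice(centers)              # pick a cluster
        points.append((rng.gauss(cx, spread),     # jitter around its center
                       rng.gauss(cy, spread)))
    return centers, points
```

The known centers make it easy to check afterwards (e.g. with Graph.py) whether the clustering recovered the planted structure.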
Wrapper.py
• Main controller for the k-means iterations
• Functions:
   • Start the map-reduce job
   • Ship the base data and program to the map phase
   • Check whether the run has converged
• Results:
   • Source: 13 clusters
   • Target: 10 clusters -> 180+ iterations
   • Target: 13 clusters -> 7-8 iterations

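The iteration logic Wrapper.py implements can be sketched as a small driver loop: run one MapReduce pass, measure how far the centroids moved, and stop once the movement falls below a threshold. Here `run_iteration` is a stand-in for submitting the Hadoop Streaming job and reading the new centroids back from HDFS (the names and the epsilon are assumptions):

```python
def total_shift(old, new):
    """Sum of Euclidean distances between matching old and new centroids."""
    return sum(sum((a - b) ** 2 for a, b in zip(o, n)) ** 0.5
               for o, n in zip(old, new))

def kmeans_driver(centroids, run_iteration, eps=1e-6, max_iters=200):
    """Local controller: iterate MapReduce passes until the centroids
    stop moving (total shift <= eps) or max_iters is reached."""
    for it in range(1, max_iters + 1):
        new_centroids = run_iteration(centroids)
        shift = total_shift(centroids, new_centroids)
        centroids = new_centroids
        if shift <= eps:
            return centroids, it
    return centroids, max_iters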
Processing (13 clusters)

•   110331.286264 -> 43648.070121
•   43648.070121 -> 22167.351291
•   22167.351291 -> 5853.008014
•   5853.008014 -> 552.292067
•   552.292067 -> 8.202320
•   8.202320 -> 0.000000
•   0.000000 -> 0.000000



Spark
• In-memory, high performance, written in Scala
• Spark introduces in-memory distributed datasets; besides
  supporting interactive queries, it also optimizes iterative
  workloads.
• Spark and Scala integrate tightly: Scala can manipulate
  distributed datasets as easily as local collection objects.
• Although Spark was created to support iterative jobs on
  distributed datasets, in practice it complements Hadoop and
  can run in parallel on the Hadoop file system.
• Scala is a multi-paradigm language that supports imperative,
  functional, and object-oriented language features in a fluent,
  comfortable way.



Spark
• Spark is designed for a specific type of cluster-computing
  workload: workloads that reuse a working dataset across
  parallel operations, such as machine learning algorithms.
• Spark introduces the concept of in-memory cluster computing:
  datasets are cached in memory to reduce access latency.




Other big-data analysis frameworks
• GraphLab: focuses on parallel implementations of machine
  learning algorithms
• Storm: "the Hadoop of real-time processing"; it focuses
  mainly on stream processing and continuous computation
  (delivering results as data arrives). Storm is written in
  Clojure (a dialect of Lisp) but supports applications
  written in any language, such as Ruby and Python.




Plan
• Get 27 PCs running properly in Hadoop
• Remote management: write shell scripts for power saving,
  task submission by everyone, etc.
• Deploy Mesos, Spark, ZooKeeper, and HBase on our platform.




Thanks




