Device status anomaly detection

Device anomaly detection using Spark k-means
9/3/2014

Introduction
Detect device anomalies from device status information,
with each device represented as a feature vector:
● battery %
● CPU %
● RAM %
● Wi-Fi strength
● build number (as a numeric value)
● exception
● charging
● GPS longitude
● GPS latitude
● bundle version
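As a minimal sketch of how such a feature vector could be assembled, the snippet below flattens one device record into a fixed-order array of doubles. The `DeviceStatus` type, its field names, and the sample values are hypothetical, and every property is assumed to be numeric already.

```scala
// Hypothetical record type: field names follow the slide's feature list,
// and every field is assumed to be already numeric.
final case class DeviceStatus(
    battery: Double, cpu: Double, ram: Double, wifi: Double,
    buildNumber: Double, exception: Double, charging: Double,
    gpsLong: Double, gpsLat: Double, bundleVersion: Double)

object DeviceFeatures {
  // Fixed column order, so index i means the same property for every device.
  def toFeatures(d: DeviceStatus): Array[Double] = Array(
    d.battery, d.cpu, d.ram, d.wifi, d.buildNumber,
    d.exception, d.charging, d.gpsLong, d.gpsLat, d.bundleVersion)

  def main(args: Array[String]): Unit = {
    // Illustrative values only; they are not taken from a real device.
    val sample = DeviceStatus(70.0, 15.0, 70.0, 89.0, 1024.0, 3.0, 1.0, -122.4, 37.8, 12.0)
    println(toFeatures(sample).mkString(" "))
  }
}
```

Keeping the column order fixed matters because k-means compares vectors position by position.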

K-means clustering
Clustering is an unsupervised learning problem in which we aim to group
subsets of entities with one another based on some notion of similarity.
It is often used for exploratory analysis or as a component of a
hierarchical supervised learning pipeline, in which distinct classifiers or
regression models are trained for each cluster.
MLlib supports k-means clustering, one of the most commonly used clustering
algorithms, which partitions the data points into a predefined number of clusters.
For anomaly detection, points that lie far from every cluster center, or that
fall into very small clusters, are candidates for anomalies.
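To make the algorithm concrete, here is a small plain-Scala sketch of the basic k-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. It is not how MLlib implements k-means; the object name, the helper names, and the choice of the first k points as initial centroids are assumptions made for a short, deterministic example.

```scala
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to p.
  def nearest(centroids: Seq[Point], p: Point): Int =
    centroids.indices.minBy(i => sqDist(centroids(i), p))

  // Alternate assignment and centroid update until nothing changes
  // or maxIterations is reached. Initial centroids: the first k points (assumption).
  def kmeans(points: Seq[Point], k: Int, maxIterations: Int): Seq[Point] = {
    var centroids: Seq[Point] = points.take(k)
    var changed = true
    var iter = 0
    while (changed && iter < maxIterations) {
      val groups = points.groupBy(p => nearest(centroids, p))
      val updated: Seq[Point] = centroids.indices.map { i =>
        groups.get(i) match {
          case Some(members) =>
            // New centroid = per-column mean of the cluster's members.
            Array.tabulate(members.head.length)(j => members.map(_(j)).sum / members.size)
          case None => centroids(i) // an empty cluster keeps its centroid
        }
      }
      changed = updated.zip(centroids).exists { case (u, c) => !java.util.Arrays.equals(u, c) }
      centroids = updated
      iter += 1
    }
    centroids
  }

  def main(args: Array[String]): Unit = {
    // Toy 2-D points, not device data.
    val points = Seq(Array(1.0, 1.0), Array(1.0, 2.0), Array(10.0, 10.0), Array(10.0, 11.0))
    kmeans(points, k = 2, maxIterations = 20).foreach(c => println(c.mkString(", ")))
  }
}
```

Production implementations typically use smarter initialization and run in parallel; this sketch only shows the two alternating steps.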

Example Data
battery, cpu, RAM, wifi, exception, charging count
70.00 15.00 70.00 89.00 3.00, 3
75.00 16.00 68.00 90.00 4.00, 0
60.00 19.00 67.00 90.00 3.00, 0
65.00 19.00 67.00 90.00 3.00, 0
67.00 17.00 67.00 90.00 3.00, 0
68.00 19.00 69.00 90.00 3.00, 0
68.00 19.00 69.00 90.00 3.00, 0
68.00 19.00 69.00 90.00 3.00, 0
68.00 19.00 89.00 80.00 4.00, 0
33.00 49.00 79.00 90.00 3.00, 0
33.00 49.00 79.00 90.00 3.00, 0
33.00 49.00 79.00 98.00 3.00, 0
43.00 49.00 79.00 90.00 3.00, 0
53.00 49.00 78.00 90.00 3.00, 0
38.00 49.00 79.00 90.00 3.00, 0
38.00 49.00 89.00 90.00 3.00, 0
68.00 19.00 69.00 90.00 3.00, 0

Example Scala code
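import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line into a dense feature vector. Only the first five columns
// are used, matching the 5-dimensional vectors shown in the result.
val data = sc.textFile("data/device_anomaly.txt")
  .map(line => Vectors.dense(line.trim.split("[ ,]+").take(5).map(_.toDouble)))
  .cache()

val K = 3              // number of clusters
val maxIteration = 20  // upper bound on k-means iterations
val runs = 20          // number of runs (Spark 1.x MLlib API)
val clusters = KMeans.train(data, K, maxIteration, runs)

// Pair each point with the index of the cluster it is assigned to.
val vectorsAndClusterIdx = data.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}
// Collect to the driver before printing; foreach on an RDD runs on the executors.
vectorsAndClusterIdx.collect().foreach(println)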
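Cluster indices alone do not say which devices are anomalous. One common approach, sketched below in plain Scala, scores each point by its distance to the nearest cluster center and reports the highest-scoring points. With MLlib the centers would come from `clusters.clusterCenters`; here they are passed in as plain arrays, and the object and function names are hypothetical.

```scala
object AnomalyScore {
  type Point = Array[Double]

  // Euclidean distance between two points of equal dimension.
  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Anomaly score: distance from the point to its closest cluster center.
  def score(centers: Seq[Point], p: Point): Double =
    centers.map(c => distance(c, p)).min

  // The n highest-scoring points, most anomalous first.
  def topAnomalies(centers: Seq[Point], points: Seq[Point], n: Int): Seq[(Point, Double)] =
    points.map(p => (p, score(centers, p))).sortBy(-_._2).take(n)

  def main(args: Array[String]): Unit = {
    // Toy centers and points, not taken from the device data.
    val centers = Seq(Array(0.0, 0.0), Array(10.0, 10.0))
    val points = Seq(Array(0.0, 1.0), Array(10.0, 10.0), Array(5.0, 5.0))
    topAnomalies(centers, points, 1).foreach { case (p, s) =>
      println(p.mkString("[", ",", "]") + " score=" + s)
    }
  }
}
```

A fixed distance threshold could replace the top-n cut; which one fits better depends on how many alerts are acceptable.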

Normalize
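// Standardize each feature to zero mean and unit variance before clustering.
data.unpersist(true)
val dataArray = data.map(_.toArray)
val numCols = dataArray.first().length
val n = dataArray.count()
val sums = dataArray.reduce((a, b) => a.zip(b).map(t => t._1 + t._2))
// aggregate keeps "add a squared value" separate from "merge partial sums";
// fold would also square the per-partition partial sums.
val sumSquares = dataArray.aggregate(new Array[Double](numCols))(
  (acc, row) => acc.zip(row).map(t => t._1 + t._2 * t._2),
  (a, b) => a.zip(b).map(t => t._1 + t._2))
val stdevs = sumSquares.zip(sums).map { case (sumSq, sum) => math.sqrt(n * sumSq - sum * sum) / n }
val means = sums.map(_ / n)
// Constant columns (stdev 0) are only centered, to avoid dividing by zero.
val normalizedData = dataArray.map(row =>
  Vectors.dense((row, means, stdevs).zipped.map((value, mean, stdev) =>
    if (stdev <= 0) (value - mean) else (value - mean) / stdev))).cache()
// clusteringScore(data, k) is a helper that is not shown on the slide.
val kScores = (50 to 120 by 10).par.map(k => (k, clusteringScore(normalizedData, k)))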
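The same standardization can be checked locally without a cluster. The sketch below is plain Scala over an in-memory sequence of rows, using the same sqrt(n * sumSq - sum^2) / n form for the standard deviation; the object and function names are hypothetical, and the three rows in `main` are the first three rows of the example data.

```scala
object ZScore {
  // Column-wise mean and population standard deviation.
  def stats(rows: Seq[Array[Double]]): (Array[Double], Array[Double]) = {
    val n = rows.size.toDouble
    val numCols = rows.head.length
    val sums = Array.tabulate(numCols)(j => rows.map(_(j)).sum)
    val sumSquares = Array.tabulate(numCols)(j => rows.map(r => r(j) * r(j)).sum)
    val means = sums.map(_ / n)
    val stdevs = sums.zip(sumSquares).map { case (s, sq) => math.sqrt(n * sq - s * s) / n }
    (means, stdevs)
  }

  // Subtract the mean and divide by the standard deviation;
  // constant columns (stdev 0) are only centered.
  def normalize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    val (means, stdevs) = stats(rows)
    rows.map { r =>
      Array.tabulate(r.length) { j =>
        if (stdevs(j) <= 0) r(j) - means(j) else (r(j) - means(j)) / stdevs(j)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Array(70.0, 15.0, 70.0, 89.0, 3.0),
      Array(75.0, 16.0, 68.0, 90.0, 4.0),
      Array(60.0, 19.0, 67.0, 90.0, 3.0))
    normalize(rows).foreach(r => println(r.mkString(" ")))
  }
}
```

Without this step, features with a wide numeric range, such as battery %, would dominate the Euclidean distance over features with a narrow range, such as the exception value.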

Result
([70.0,15.0,70.0,89.0,3.0],2)
([33.0,49.0,79.0,90.0,3.0],1)
([75.0,16.0,68.0,90.0,4.0],2)
([33.0,49.0,79.0,90.0,3.0],1)
([60.0,19.0,67.0,90.0,3.0],2)
([33.0,49.0,79.0,98.0,3.0],1)
([65.0,19.0,67.0,90.0,3.0],2)
([43.0,49.0,79.0,90.0,3.0],1)
([67.0,17.0,67.0,90.0,3.0],2)
([53.0,49.0,78.0,90.0,3.0],1)
([68.0,19.0,69.0,90.0,3.0],2)
([38.0,49.0,79.0,90.0,3.0],1)
([68.0,19.0,69.0,90.0,3.0],2)
([38.0,49.0,89.0,90.0,3.0],1)
([68.0,19.0,69.0,90.0,3.0],2)
([68.0,19.0,69.0,90.0,3.0],2)
([68.0,19.0,89.0,80.0,4.0],0)
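In the result above, cluster 0 holds a single point while clusters 1 and 2 hold the rest. One simple way to read such output is to treat members of very small clusters as anomaly candidates. The plain-Scala sketch below does this for the listed assignments; the object name, the function name, and the `maxSize` cutoff are assumptions for illustration.

```scala
object SmallClusters {
  // (battery, cpu, RAM, wifi, exception) -> cluster index, copied from the result listing.
  val assignments: Seq[(Seq[Double], Int)] = Seq(
    (Seq(70.0, 15.0, 70.0, 89.0, 3.0), 2),
    (Seq(33.0, 49.0, 79.0, 90.0, 3.0), 1),
    (Seq(75.0, 16.0, 68.0, 90.0, 4.0), 2),
    (Seq(33.0, 49.0, 79.0, 90.0, 3.0), 1),
    (Seq(60.0, 19.0, 67.0, 90.0, 3.0), 2),
    (Seq(33.0, 49.0, 79.0, 98.0, 3.0), 1),
    (Seq(65.0, 19.0, 67.0, 90.0, 3.0), 2),
    (Seq(43.0, 49.0, 79.0, 90.0, 3.0), 1),
    (Seq(67.0, 17.0, 67.0, 90.0, 3.0), 2),
    (Seq(53.0, 49.0, 78.0, 90.0, 3.0), 1),
    (Seq(68.0, 19.0, 69.0, 90.0, 3.0), 2),
    (Seq(38.0, 49.0, 79.0, 90.0, 3.0), 1),
    (Seq(68.0, 19.0, 69.0, 90.0, 3.0), 2),
    (Seq(38.0, 49.0, 89.0, 90.0, 3.0), 1),
    (Seq(68.0, 19.0, 69.0, 90.0, 3.0), 2),
    (Seq(68.0, 19.0, 69.0, 90.0, 3.0), 2),
    (Seq(68.0, 19.0, 89.0, 80.0, 4.0), 0))

  // Points whose cluster has at most maxSize members are anomaly candidates.
  def suspects(assigned: Seq[(Seq[Double], Int)], maxSize: Int): Seq[(Seq[Double], Int)] = {
    val sizes = assigned.groupBy(_._2).map { case (c, members) => c -> members.size }
    assigned.filter { case (_, c) => sizes(c) <= maxSize }
  }

  def main(args: Array[String]): Unit =
    suspects(assignments, maxSize = 1).foreach(println)
}
```

With only 17 points this is a rough heuristic; on larger data the cutoff would need tuning.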

Heatmap ( sample )
Venue      MAC             Time               CPU  Battery
Sneakers   0A-94-05-F7-93  9/1/2014 7:30:20   89%  13%
McCoverys  0A-94-05-F7-76  9/3/2014 5:30:20   73%  10%
...

