Device anomaly detection using Spark k-means
9/3/2014
Introduction 
Detect device anomalies from device information, represented as a
feature vector of properties:
● battery % 
● cpu % 
● RAM % 
● wifi strength 
● build number (numeric)
● exception 
● charging 
● gps long 
● gps lat 
● bundle version
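
As a rough illustration of the feature vector described above, the properties could be flattened into a fixed-order numeric array. The `DeviceStatus` type, its field names, and the encoding of `charging` as 0/1 are assumptions made for this sketch; the slides do not specify them.

```scala
// Hypothetical record type; field names mirror the property list above.
final case class DeviceStatus(
  batteryPct: Double,
  cpuPct: Double,
  ramPct: Double,
  wifiStrength: Double,
  buildNumber: Int,
  exceptionCount: Int, // assumption: "exception" is a count
  charging: Boolean,
  gpsLong: Double,
  gpsLat: Double,
  bundleVersion: Int
)

object DeviceFeatures {
  // Flatten a record into a numeric feature vector in a fixed column order.
  def toFeatures(d: DeviceStatus): Array[Double] = Array(
    d.batteryPct, d.cpuPct, d.ramPct, d.wifiStrength,
    d.buildNumber.toDouble, d.exceptionCount.toDouble,
    if (d.charging) 1.0 else 0.0, // assumption: boolean encoded as 0/1
    d.gpsLong, d.gpsLat, d.bundleVersion.toDouble
  )
}
```

An array in this shape can be wrapped with `Vectors.dense` before being handed to MLlib.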
K-means clustering 
Clustering is an unsupervised learning problem whereby we aim to group 
subsets of entities with one another based on some notion of similarity. 
Clustering is often used for exploratory analysis and/or as a component of a 
hierarchical supervised learning pipeline (in which distinct classifiers or 
regression models are trained for each cluster). 
MLlib supports k-means clustering, one of the most commonly used clustering
algorithms, which groups the data points into a predefined number of clusters.
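
To make the idea concrete, here is a minimal plain-Scala sketch of the basic k-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its members. It is not MLlib's implementation; `SimpleKMeans` and its methods are names invented for this sketch, and the initial centroids are passed in by the caller rather than chosen by an initialization strategy.

```scala
object SimpleKMeans {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid nearest to p.
  def closest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => sqDist(p, centroids(i)))

  // Alternate the assignment step and the centroid update step.
  def fit(points: Seq[Point], initial: Seq[Point], maxIterations: Int): Seq[Point] = {
    var centroids = initial
    for (_ <- 1 to maxIterations) {
      val groups = points.groupBy(p => closest(p, centroids))
      centroids = centroids.indices.map { i =>
        groups.get(i) match {
          case Some(members) =>
            // New centroid is the per-column mean of the cluster members.
            val sums = members.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
            sums.map(_ / members.size)
          case None =>
            centroids(i) // an empty cluster keeps its previous centroid
        }
      }
    }
    centroids
  }
}
```

For example, fitting the points (1,1), (1,2), (10,10), (10,11) with initial centroids (0,0) and (5,5) settles on the centroids (1, 1.5) and (10, 10.5).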
Example Data 
battery  cpu    RAM    wifi   exception  charging count
70.00    15.00  70.00  89.00  3.00       3
75.00    16.00  68.00  90.00  4.00       0
60.00    19.00  67.00  90.00  3.00       0
65.00    19.00  67.00  90.00  3.00       0
67.00    17.00  67.00  90.00  3.00       0
68.00    19.00  69.00  90.00  3.00       0
68.00    19.00  69.00  90.00  3.00       0
68.00    19.00  69.00  90.00  3.00       0
68.00    19.00  89.00  80.00  4.00       0
33.00    49.00  79.00  90.00  3.00       0
33.00    49.00  79.00  90.00  3.00       0
33.00    49.00  79.00  98.00  3.00       0
43.00    49.00  79.00  90.00  3.00       0
53.00    49.00  78.00  90.00  3.00       0
38.00    49.00  79.00  90.00  3.00       0
38.00    49.00  89.00  90.00  3.00       0
68.00    19.00  69.00  90.00  3.00       0
Example Scala code 
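import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each space-separated line into a dense feature vector.
val data = sc.textFile("data/device_anomaly.txt")
  .map { line => Vectors.dense(line.trim.split(' ').map(_.toDouble)) }
  .cache()

val K = 3
val maxIteration = 20
val runs = 20 // number of runs, as accepted by the MLlib API at the time of these slides
val clusters = KMeans.train(data, K, maxIteration, runs)

// Pair every point with the index of its assigned cluster.
val vectorsAndClusterIdx = data.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}
// Collect to the driver so the output appears there; println avoids treating the text as a format string.
vectorsAndClusterIdx.collect().foreach(println)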
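
The example above only assigns points to clusters. One common way to turn cluster assignments into an anomaly signal, sketched here as an assumption rather than as the method used in these slides, is to score each point by its distance to the nearest centroid and flag points above a threshold. `CentroidDistance`, `anomalyScore`, `flag`, and the threshold are hypothetical names and parameters.

```scala
object CentroidDistance {
  // Euclidean distance between two feature vectors of equal length.
  def distance(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Distance from a point to its nearest centroid; larger means more unusual.
  def anomalyScore(p: Array[Double], centroids: Seq[Array[Double]]): Double =
    centroids.map(c => distance(p, c)).min

  // Points whose score exceeds a caller-chosen threshold.
  def flag(points: Seq[Array[Double]],
           centroids: Seq[Array[Double]],
           threshold: Double): Seq[Array[Double]] =
    points.filter(p => anomalyScore(p, centroids) > threshold)
}
```

With MLlib, the centroids would come from `clusters.clusterCenters` (converted with `toArray`); choosing the threshold is left to the caller.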
Normalize 
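data.unpersist(true)
// Work on plain arrays so that columns can be zipped element-wise.
val dataAsArray = data.map(_.toArray)
val numCols = dataAsArray.first().length
val n = dataAsArray.count()
val sums = dataAsArray.reduce((a, b) => a.zip(b).map(t => t._1 + t._2))
// aggregate squares each value once per row; fold would also square partial sums when merging partitions.
val sumSquares = dataAsArray.aggregate(new Array[Double](numCols))(
  (acc, row) => acc.zip(row).map(t => t._1 + t._2 * t._2),
  (a, b) => a.zip(b).map(t => t._1 + t._2))
// Population standard deviation per column: sqrt(n * sumSq - sum^2) / n
val stdevs = sumSquares.zip(sums).map { case (sumSq, sum) => math.sqrt(n * sumSq - sum * sum) / n }
val means = sums.map(_ / n)
val normalizedData = dataAsArray.map { row =>
  Vectors.dense((row, means, stdevs).zipped.map((value, mean, stdev) =>
    if (stdev <= 0) (value - mean) else (value - mean) / stdev))
}.cache()
// clusteringScore is a helper that is not defined on these slides.
val kScores = (50 to 120 by 10).par.map(k => (k, clusteringScore(normalizedData, k)))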
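
The normalization above is a per-column z-score: subtract the column mean and divide by the column's population standard deviation, leaving constant columns merely centered. A small plain-Scala sketch of the same arithmetic, without Spark, may make it easier to check by hand; `ZScore`, `columnStats`, and `normalize` are names made up for this sketch.

```scala
object ZScore {
  // Per-column mean and population standard deviation.
  def columnStats(rows: Seq[Array[Double]]): (Array[Double], Array[Double]) = {
    val n = rows.size.toDouble
    val sums = rows.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    val sumSquares = rows
      .map(_.map(v => v * v))
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    val means = sums.map(_ / n)
    // Same formula as the Spark version: sqrt(n * sumSq - sum^2) / n
    val stdevs = sumSquares.zip(sums).map { case (sq, s) => math.sqrt(n * sq - s * s) / n }
    (means, stdevs)
  }

  // Center every column; scale it too unless its standard deviation is zero.
  def normalize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    val (means, stdevs) = columnStats(rows)
    rows.map { row =>
      row.indices.map { i =>
        val centered = row(i) - means(i)
        if (stdevs(i) <= 0) centered else centered / stdevs(i)
      }.toArray
    }
  }
}
```

For the rows (2, 5), (4, 5), (6, 5), the first column has mean 4 and standard deviation sqrt(8/3), so it normalizes to roughly -1.22, 0, 1.22; the second column is constant and becomes all zeros.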
Result 
([70.0,15.0,70.0,89.0,3.0],2) 
([33.0,49.0,79.0,90.0,3.0],1) 
([75.0,16.0,68.0,90.0,4.0],2) 
([33.0,49.0,79.0,90.0,3.0],1) 
([60.0,19.0,67.0,90.0,3.0],2) 
([33.0,49.0,79.0,98.0,3.0],1) 
([65.0,19.0,67.0,90.0,3.0],2) 
([43.0,49.0,79.0,90.0,3.0],1) 
([67.0,17.0,67.0,90.0,3.0],2) 
([53.0,49.0,78.0,90.0,3.0],1) 
([68.0,19.0,69.0,90.0,3.0],2) 
([38.0,49.0,79.0,90.0,3.0],1) 
([68.0,19.0,69.0,90.0,3.0],2) 
([38.0,49.0,89.0,90.0,3.0],1) 
([68.0,19.0,69.0,90.0,3.0],2) 
([68.0,19.0,69.0,90.0,3.0],2) 
([68.0,19.0,89.0,80.0,4.0],0)
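
In the result above, cluster 0 contains a single point, ([68.0,19.0,89.0,80.0,4.0],0), while clusters 1 and 2 hold the rest. One simple way to surface such sparsely populated clusters is to count the assignments per cluster, sketched below in plain Scala. `ClusterSizes` and the `minSize` cutoff are assumptions for illustration, not part of the original slides.

```scala
object ClusterSizes {
  // Count how many points were assigned to each cluster index.
  def sizes(assignments: Seq[Int]): Map[Int, Int] =
    assignments.groupBy(identity).map { case (idx, members) => idx -> members.size }

  // Clusters holding fewer than minSize points; minSize is a caller-chosen cutoff.
  def sparse(assignments: Seq[Int], minSize: Int): Set[Int] =
    sizes(assignments).collect { case (idx, size) if size < minSize => idx }.toSet
}

object ClusterSizesDemo {
  // Cluster indices copied from the Result slide, in the order listed.
  val assignments: Seq[Int] = Seq(2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 0)

  def main(args: Array[String]): Unit = {
    println(ClusterSizes.sizes(assignments))     // per-cluster counts
    println(ClusterSizes.sparse(assignments, 2)) // clusters with fewer than 2 points
  }
}
```

With a cutoff of 2, only cluster 0 is reported, which matches the lone point in the result.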
Heatmap ( sample ) 
Venue      MAC             Time               CPU  Battery
Sneakers   0A-94-05-F7-93  9/1/2014 7:30:20   89%  13%
McCoverys  0A-94-05-F7-76  9/3/2014 5:30:20   73%  10%
...
References 
● https://www.youtube.com/watch?v=TC5cKYBZAeI
● https://www.youtube.com/watch?v=FjhRkfAuU7I
● http://www.ebaytechblog.com/2014/05/28/using-spark-to-ignite-data-analytics/#.VAc0PWRdXCw
● http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf
