A new clustering tool for Data Mining
RAPID MINER
Introduction To Clustering
• Clustering is unsupervised learning: it applies when old data with
  class labels is not available, e.g. when introducing a new product.
• Example: group (cluster) existing customers by the time series of
  their payment history, so that similar customers fall in the same
  cluster.
• Key requirement: a good measure of similarity between instances.
• Application: identify micro-markets and develop policies for each.
About The Project
• The aim of this project is to devise a new clustering algorithm
  for data mining.
• The system implements two main functions: preprocessing and
  clustering.
• In preprocessing, an .xls file is chosen as input. Null values
  present in the input file are removed so that they do not cause
  faulty results in the output data sets. Redundant (duplicate)
  data sets are removed as well.
• In clustering, the data is distributed into groups so that the
  degree of association is strong between members of the same
  cluster and weak between members of different clusters.
Present Tool: Weka
• Weka (Waikato Environment for Knowledge Analysis) is a popular
  suite of machine learning software written in Java, developed at
  the University of Waikato, New Zealand.
• The Explorer interface features several panels providing access to the
  main components of the workbench:
• The Preprocess panel has facilities for importing data from a database,
  a CSV file, etc., and for preprocessing this data using a so-called
  filtering algorithm. These filters can be used to transform the data (e.g.,
  turning numeric attributes into discrete ones) and make it possible to
  delete instances and attributes according to specific criteria.
• The Cluster panel gives access to the clustering techniques in Weka,
  e.g., the simple k-means algorithm. There is also an implementation of
  the expectation maximization algorithm for learning a mixture
  of normal distributions.
Our tool:
• In the data preprocessing phase, an MS-Excel file is taken as
  input; CSV or ARFF files are not required. Excel was chosen because
  it is well known and comfortably handled by non-technical people,
  whereas CSV and ARFF files require some familiarity with those
  formats. Excel input is read by importing a new library, the
  ‘jxl.jar’ library, into the project.
• The input file is first cleaned by removing null data sets (records).
  A null data set is one that contains no information, or fewer
  values than a threshold (the minimum number of values of required
  attributes). The number of null data sets is reported to the user.
  Next, redundant (duplicate) data sets are removed. A data set is a
  duplicate if all of its attribute values equal those of some other
  data set. These data sets are excluded from the rest of the data
  mining process, and their number is also reported to the user.
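The cleaning pass described above can be sketched in a few lines. This is only an illustration, not the project's code: it is written in Python rather than with ‘jxl.jar’, it assumes the rows have already been read into lists with None or "" marking an empty cell, and the names clean_rows and min_values are hypothetical.

```python
def clean_rows(rows, min_values):
    """Drop null rows, then exact duplicates.

    A row counts as null when it has fewer than `min_values` filled cells
    (assumed threshold). Returns (kept_rows, n_null, n_duplicate) so both
    counts can be reported to the user.
    """
    kept = []
    seen = set()
    n_null = 0
    n_dup = 0
    for row in rows:
        filled = sum(1 for v in row if v is not None and v != "")
        if filled < min_values:
            n_null += 1          # too little information: treat as null
            continue
        key = tuple(row)
        if key in seen:
            n_dup += 1           # every attribute equals an earlier row
            continue
        seen.add(key)
        kept.append(row)
    return kept, n_null, n_dup


rows = [
    ["sunny", 85, "no"],
    ["sunny", 85, "no"],      # duplicate of the first row
    [None, None, None],       # empty row
    ["rainy", None, "yes"],   # only 2 of 3 values present
    ["rainy", 70, "yes"],
]
kept, n_null, n_dup = clean_rows(rows, min_values=3)
print(len(kept), n_null, n_dup)
```

Keeping the two counters separate mirrors the slide: null records and duplicate records are reported independently.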
KD Trees
• K-dimensional trees
• Space-partitioning data structure
• Splitting planes perpendicular to the
  coordinate axes
• Reduces the average cost of a nearest-neighbor query over
  n points to O(log n) (balanced tree, low dimension)
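A minimal sketch of such a tree, assuming points are tuples of equal length and Euclidean distance is used. The names Node, build_kdtree and nearest are illustrative, and this is a textbook KD tree rather than the project's implementation.

```python
import math


class Node:
    def __init__(self, point, axis, left, right):
        self.point = point    # point stored at this node
        self.axis = axis      # coordinate axis the splitting plane is perpendicular to
        self.left = left
        self.right = right


def build_kdtree(points, depth=0):
    """Split on the median, cycling through the axes level by level."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build_kdtree(ordered[:mid], depth + 1),
                build_kdtree(ordered[mid + 1:], depth + 1))


def nearest(node, target, best=None):
    """Return the stored point closest to `target`."""
    if node is None:
        return best
    if best is None or math.dist(node.point, target) < math.dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))  # prints (8, 1)
```

The pruning test in nearest is what saves work: a whole subtree is skipped whenever its side of the splitting plane cannot contain a closer point.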
Clustering
• Our clustering algorithm uses a KD tree extensively to reduce
  its running time.
• Our algorithm differs from the existing approach in how the
  nearest centers are computed.
• Efficiency is achieved because the data points do not change
  during the computation, so the KD tree built over them does
  not need to be recomputed at each stage.
K-means Clustering
• Complexity is O(n * K * I * d), where
  n = number of points, K = number of clusters,
  I = number of iterations, d = number of attributes
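As a rough illustration of this bound, with purely hypothetical values (none of these figures come from the project): n = 10,000 points, K = 5 clusters, I = 20 iterations and d = 3 attributes give

```latex
n \cdot K \cdot I \cdot d = 10\,000 \cdot 5 \cdot 20 \cdot 3 = 3\,000\,000
```

so a plain run performs on the order of three million coordinate-level operations, and the cost grows linearly in each of the four factors.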
K means
• K-Means is a commonly used clustering technique. The user starts
  with a collection of samples and attempts to group them into ‘k’
  clusters based on a chosen distance measure. The main steps of the
  K-Means clustering algorithm are given below.
  1. The algorithm starts by creating ‘k’ clusters. The given
  sample set is first randomly distributed among these ‘k’ clusters.
  2. Next, the centroid of each cluster is computed, and the distance
  from each sample to each cluster centroid is calculated.
  3. Each sample is then moved to the cluster k′ whose centroid is
  nearest to it.
  4. Steps 2 and 3 are repeated until no sample changes cluster.
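The steps above can be sketched as plain K-Means without the KD-tree speed-up. This is a minimal sketch under a few assumptions: samples are numeric tuples, distance is Euclidean, k does not exceed the number of samples, and a cluster that becomes empty simply keeps its previous centroid. The names kmeans and centroid are illustrative.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dims))


def kmeans(samples, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: distribute the samples randomly among k non-empty clusters.
    order = list(range(len(samples)))
    rng.shuffle(order)
    labels = [0] * len(samples)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k
    centers = [None] * k
    for _ in range(max_iter):
        # Step 2: recompute the centroid of every non-empty cluster.
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
        # Step 3: move each sample to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(s, centers[c]))
                      for s in samples]
        # Step 4: stop once no sample changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(samples, k=2)
print(labels, centers)
```

Step 3 is the part a KD tree is meant to accelerate: here every sample is compared against every centroid, which is where the n * K * d cost per iteration comes from.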
• As a first step of the cluster analysis, the user decides
  on the number of clusters ‘k’. This parameter takes
  integer values with a lower bound of 1 (in practice,
  2 is the smallest relevant number of clusters) and an
  upper bound equal to the total number of samples.

• The K-Means algorithm is repeated a number of times,
  each time starting from a random set of initial clusters,
  and the best result is kept. A single run is not
  guaranteed to reach the optimal clustering.
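One way to organise the repeated runs is sketched below. It assumes the quality of a run is measured by the sum of squared distances from each sample to its cluster centre (a common choice, not stated on the slides), and that `run` is any function taking (samples, k, seed) and returning (labels, centers). The names sse and best_of_restarts are illustrative.

```python
import math


def sse(samples, labels, centers):
    """Sum of squared distances from each sample to its assigned centre."""
    return sum(math.dist(s, centers[lab]) ** 2 for s, lab in zip(samples, labels))


def best_of_restarts(samples, k, run, restarts=10):
    """Call `run` once per seed and keep the result with the lowest SSE."""
    best = None
    for seed in range(restarts):
        labels, centers = run(samples, k, seed)
        score = sse(samples, labels, centers)
        if best is None or score < best[0]:
            best = (score, labels, centers)
    return best


def toy_run(samples, k, seed):
    # Stand-in for a real clustering routine: only seed 3 finds the good split.
    if seed == 3:
        return [0, 0, 1, 1], [(0.5,), (10.5,)]
    return [0, 1, 0, 1], [(5.0,), (6.0,)]


data = [(0.0,), (1.0,), (10.0,), (11.0,)]
score, labels, centers = best_of_restarts(data, 2, toy_run)
print(score, labels)
```

Passing the clustering routine in as `run` keeps the restart logic independent of how a single run is implemented.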
COMPARISON OF OUR TOOL WITH WEKA

  A data set with the following statistics was run on
  both WEKA and our tool:

• Relation = weather
• No. of attributes = 3
• No. of instances (including redundant/duplicate and
  null instances) = 17
[Slides 12-17, each titled "Clustering": image-only slides, content not recovered]
Limitations:
This tool does not provide protection from:
• Shared storage failures.
• Network service failures.
• Operational errors.
• Site disasters (unless a geographically dispersed
  clustering solution has been implemented).
In the near future…
• Market analysis
   - Marketing strategies
   - Advertisement
• Risk analysis and management
   - Finance and financial investments
   - Manufacturing and production
• Fraud detection and detection of unusual patterns (outliers)
   - Telecommunication
   - Financial transactions
   - Anti-terrorism (!!!)
CONCLUSION
   We devised a new clustering algorithm with the following features:

• MS-Excel files are successfully read, handled and processed by the system with the help
  of the ‘jxl.jar’ library. Using this library also exposed new features and functionality
  for working with Excel documents.

• Null data sets are removed, along with redundant and duplicate data sets.

• The algorithm chooses better starting clusters, i.e. better initial values (or “seeds”)
  for the clustering algorithm.

• A filtering algorithm that uses KD trees is included to speed up each k-means
  step.

• The initial centers are chosen by this algorithm; plain K-Means does not specify how
  they are to be selected.

• An inappropriate choice of the number of clusters can yield poor results, so the
  number of clusters is determined with care for the data set.
Questions/comments…?
