Classification Technique KNN in Data Mining
---on dataset “Iris”

Comp722 Data Mining
Kaiwen Qi, UNC
Spring 2012
Outline
• Dataset introduction
• Data processing
• Data analysis
• KNN & Implementation
• Testing
Dataset
• Raw dataset: Iris (http://archive.ics.uci.edu/ml/datasets/Iris)
  (a) Raw data: 150 total records
  (b) Data organization:
      - 50 records Iris Setosa
      - 50 records Iris Versicolour
      - 50 records Iris Virginica
  (c) 5 Attributes:
      - Sepal length in cm (continuous number)
      - Sepal width in cm (continuous number)
      - Petal length in cm (continuous number)
      - Petal width in cm (continuous number)
      - Class (nominal data: Iris Setosa, Iris Versicolour, Iris Virginica)
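For reference, a minimal sketch of loading this dataset in Python. It assumes the UCI file has been downloaded locally as iris.data (comma-separated, one record per line); that file name and the (features, label) layout are assumptions, not part of the original slides.

import csv

# Assumes the UCI "iris.data" file sits next to this script; each row holds
# sepal length, sepal width, petal length, petal width, and the class label.
records = []
with open("iris.data", newline="") as f:
    for row in csv.reader(f):
        if not row:                      # the UCI file ends with a blank line
            continue
        features = [float(v) for v in row[:4]]
        records.append((features, row[4]))

print(len(records))                      # expected: 150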
Classification Goal
• Task: given the four measured attributes of an unlabeled iris record, predict its class (Iris Setosa, Iris Versicolour, or Iris Virginica)
Data Processing
• Original data
Data Processing
• Balanced distribution
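The slides do not show the exact processing step; one plausible way to obtain a balanced distribution from the class-sorted raw file is to shuffle the records before splitting them, as in the sketch below (the fixed seed and the records list from the loading sketch are assumptions).

import random

# Hedged sketch: shuffle the class-sorted records so any contiguous subset
# (e.g. a 60% training slice) has a roughly balanced class distribution.
random.seed(2012)                        # fixed seed only for reproducibility
shuffled = records[:]                    # 'records' from the loading sketch
random.shuffle(shuffled)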
Data Analysis
• Statistics
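The statistics table itself is not reproduced here; a minimal sketch of how per-attribute summary statistics could be computed from the loaded records (the attribute names and the records list are carried over from the loading sketch):

from statistics import mean, stdev

# Per-attribute mean and standard deviation over all 150 records.
names = ["sepal length", "sepal width", "petal length", "petal width"]
for i, name in enumerate(names):
    column = [feats[i] for feats, _ in records]
    print(f"{name}: mean = {mean(column):.2f} cm, stdev = {stdev(column):.2f} cm")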
Data Analysis
• Histogram
Data Analysis
• Histogram
KNN
• KNN algorithm




The unknown data point, the green circle, is classified as a square when K is 5. The distance between two points is calculated with the Euclidean distance d(p, q) = √( Σᵢ (pᵢ − qᵢ)² ). In this example, squares form the majority among the 5 nearest neighbors.
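A small worked example of that distance, using two hypothetical iris measurements (sepal length, sepal width, petal length, petal width); the values are illustrative, not taken from the slides.

import math

# Euclidean distance between two 4-dimensional iris measurements.
p = (5.1, 3.5, 1.4, 0.2)    # a Setosa-like record (illustrative values)
q = (6.3, 3.3, 6.0, 2.5)    # a Virginica-like record (illustrative values)
d = math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
print(round(d, 3))          # 5.285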
KNN
• Advantages
  - Simplicity of implementation; it is good at dealing with numeric attributes.
  - Builds no model up front; it just stores the dataset, with very low computational overhead.
  - Does not need to compute a useful attribute subset. Compared with naïve Bayesian, we do not need to worry about a lack of available probability data.
Implementation of KNN
• Algorithm
  Algorithm: KNN. Assigns a classification label from the training data to an unlabeled tuple.
  Input: K, the number of neighbors, and a dataset that includes the training data.
  Output: A string that indicates the unknown tuple's classification.

  Method:
  (1) Create a distance array whose size is K.
  (2) Initialize the array with the distances between the unlabeled tuple and the first K records in the dataset.
  (3) Let i = K + 1.
  (4) Calculate the distance between the unlabeled tuple and the i-th record in the dataset; if that distance is smaller than the largest distance in the array, replace the old maximum with the new distance (and its class label); set i = i + 1.
  (5) Repeat step (4) until i is greater than the dataset size (150).
  (6) Count the class labels kept in the array; the majority class is the mining result.
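A minimal Python sketch of the pseudocode above; the function name knn_classify and the (features, label) dataset layout are illustrative assumptions, not the original implementation.

import math

def euclidean(p, q):
    # Straight-line distance between two equal-length feature vectors.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(unknown, dataset, k=7):
    # dataset: list of (features, label) pairs; unknown: a feature vector.
    # Steps (1)-(2): seed the neighbor array with the first K records.
    neighbors = [(euclidean(unknown, feats), label) for feats, label in dataset[:k]]
    # Steps (3)-(5): scan the rest, keeping the K smallest distances seen so far.
    for feats, label in dataset[k:]:
        d = euclidean(unknown, feats)
        worst = max(range(k), key=lambda j: neighbors[j][0])
        if d < neighbors[worst][0]:
            neighbors[worst] = (d, label)
    # Step (6): majority vote among the K nearest neighbors.
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)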
Implementation of KNN
• UML
Testing
• Testing (K = 7, total of 150 tuples)
Testing
• Testing (K = 7, 60% of the data as training data)
Testing
• Input: randomly ordered dataset
  (Figure: random dataset)
  (Figure: accuracy test results)
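A minimal sketch of this accuracy test under the setup described above (K = 7, 60% of the shuffled records for training, the remaining 40% for testing); it reuses shuffled and knn_classify from the earlier sketches and is an assumption, not the original test harness.

# Hedged sketch of the accuracy test: train on 60% of the shuffled records,
# classify the remaining 40% with K = 7, and compare against the true labels.
split = int(0.6 * len(shuffled))
train, test = shuffled[:split], shuffled[split:]

correct = sum(1 for feats, label in test
              if knn_classify(feats, train, k=7) == label)
print(f"accuracy: {correct / len(test):.1%}")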
Performance
• Comparison

  Decision tree
    Advantages:
    - Comprehensibility
    - Constructs a decision tree without any domain knowledge
    - Handles high-dimensional data
    - By eliminating unrelated attributes and pruning the tree, it simplifies the classification calculation
    Disadvantages:
    - Requires good-quality training data
    - Usually runs in memory
    - Not good at handling continuous numeric features

  Naïve Bayesian
    Advantages:
    - Relatively simple
    - Classifies by simply calculating attribute frequencies from the training data, without any other operations (e.g. sort, search)
    Disadvantages:
    - The independence assumption is often not right
    - There may be no available probability data with which to calculate a probability
Conclusion
• KNN is a simple algorithm with high classification accuracy for datasets with continuous attributes.
• It shows high performance when training data with a balanced distribution is used as input.
Thanks
Questions?
