K-nearest neighbor algorithm
                                                              Glenn M. Bernstein
                                                     Computer Science and Engineering
                                                                 Department
                                                      University of California, Riverside
                                                           Riverside, CA. 92521
                                                               (951) 892-8682
                                                              gbern@cs.ucr.edu

ABSTRACT
This paper examines the K-nearest neighbor algorithm. Different values of K produce different accuracies, and this paper determines the optimal value of K for two data sets. The first is the Challenger USA Space Shuttle O-Ring data set from the UCI Machine Learning Repository. The second is the El Nino data set from the UCI KDD Archive.

Categories and Subject Descriptors
I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search – heuristic methods.

General Terms
Algorithms, Management, Experimentation.

Keywords
K-nearest neighbor, cross validation.

1. INTRODUCTION
K-nearest neighbor is a useful supervised learning algorithm. Used for classification, it assigns a new instance query to the majority category among its K nearest neighbors. The purpose is to classify a new object based on its attributes and a set of training samples. Given a query point, we find the K objects (training points) closest to the query point, and the classification is decided by a majority vote among the classes of those K objects. The K-nearest neighbor algorithm thus uses neighborhood classification to predict the value of the new query instance [1].

2. EXPERIMENTAL DESIGN
The first data set is from the UCI Machine Learning Repository. It consists of 23 instances of 4 attributes each, with no missing values. The task is to predict the number of O-rings that experience thermal distress on a flight at 31 degrees F, given data on the previous 23 shuttle flights. There are strong engineering reasons, based on the composition of O-rings, to support the judgment that failure probability may rise monotonically as temperature drops. No previous liftoff temperature was under 53 degrees F. The attribute information:
    1. Number of O-rings at risk on a given flight
    2. Number experiencing thermal distress
    3. Launch temperature (degrees F)
    4. Leak-check pressure (psi)
    5. Temporal order of flight [2]
The second data set is significantly larger. It is from the UCI Knowledge Discovery Database and contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. The data is expected to aid in the understanding and prediction of El Nino/Southern Oscillation (ENSO) cycles. It consists of 782 instances measured from May 23rd, 1998 to June 5th, 1998, with the following variables: date, latitude, longitude, zonal winds (west < 0, east > 0), meridional winds (south < 0, north > 0), relative humidity, air temperature, and sea surface temperature down to a depth of 500 meters. Missing values exist in the data because not all buoys are able to measure currents, rainfall, and solar radiation, so which values are missing depends on the individual buoy. The amount of data available also depends on the buoy, as certain buoys were commissioned earlier than others [3]. Any missing value was replaced with the average of that attribute; the median could also be used, because the distributions are approximately symmetric, so these two measures of central tendency are approximately the same. The attributes that contain missing values are zonal and meridional winds, humidity, and air and sea surface temperatures; their missing values were replaced with -3.90, -0.60, 84.46, 27.57, and 28.29, respectively. Cleaning the data requires converting the file to a .csv (comma-separated values) file, then removing the periods that denote missing values and replacing them, once the averages have been calculated, with the average value for that attribute. The task is to predict the buoy number given the remaining attributes.
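The classification rule described above can be sketched in a few lines of Python. This is a minimal, self-contained illustration; the function names and the toy data are invented for this sketch, not taken from the code used in the experiments below:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Distance between two equal-length feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training_points, labels, k):
    # Sort training points by distance to the query and take the k closest.
    neighbors = sorted(range(len(training_points)),
                       key=lambda i: euclidean(query, training_points[i]))[:k]
    # Majority vote among the labels of the k nearest neighbors.
    votes = Counter(labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]

# Toy example: two classes in 2-D.
train = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
y = ["low", "low", "high", "high"]
print(knn_classify((1.1, 0.9), train, y, k=3))  # -> low
```

With K = 1 the prediction is simply the label of the single closest training point, which is the setting that performed best in the experiments below.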
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Conference’0X, Month X–X, 200X, City, State, Country.
Copyright 200X ACM 1-58113-000-0/00/0004…$5.00.

3. EXPERIMENTS
3.1 Challenger USA Space Shuttle O-Ring Data Set
Sixty percent of the data is randomly partitioned as the training set and forty percent as the validation set. The data is normalized so that every attribute is expressed in standard deviations, so that the distance measure is not dominated by variables with a large scale [4]. When the K-nearest neighbor algorithm is run with all features, the best value of K is 1, with a 22.22 percent validation error: two misclassifications out of nine validation set queries.

Table 1

The K-nearest neighbor algorithm is then run without the launch temperature attribute, again with a 60/40 training/validation partition. The best K is again 1, with a 22.22 percent error: two misclassifications out of nine validation set queries.

The K-nearest neighbor algorithm is also run without the leak-check pressure (psi) attribute, again with a 60/40 partition. The best K is again 1, with a 22.22 percent error: two misclassifications out of nine validation set queries.

Table 2

So, by eliminating one attribute we achieve equal accuracy with reduced computation time, because all K values achieve the accuracy reported above. The attribute "number of O-rings at risk on a given flight" is not relevant because it has the same value for every flight.

Figure 1

3.2 El Nino Data Set
Fifty-two percent of the data is randomly partitioned as the training set and forty-eight percent as the validation set. The data is normalized so that every attribute is expressed in standard deviations, so that the distance measure is not dominated by variables with a large scale [4]. When the K-nearest neighbor algorithm is run with all features, the best value of K is 1, with a 31.55 percent validation error: fifty-seven misclassifications out of one hundred eighty-three validation set queries.
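The normalization and partitioning steps used in both experiments can be sketched as follows. This is a minimal illustration with made-up numbers; the reported results were produced with an XLMiner-style procedure [4], so the helper names and data here are assumptions of this sketch:

```python
import random
import statistics

def zscore_normalize(rows):
    # Express every attribute in standard deviations so that no
    # large-scale variable dominates the distance measure.
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stdevs = [statistics.stdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
            for row in rows]

def split(rows, train_fraction, seed=0):
    # Randomly partition rows into training and validation sets.
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = round(train_fraction * len(rows))
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]

# Made-up rows with one small-scale and one large-scale attribute.
data = [(66.0, 50.0), (70.0, 50.0), (69.0, 50.0), (68.0, 50.0), (67.0, 100.0),
        (72.0, 100.0), (73.0, 100.0), (70.0, 100.0), (57.0, 200.0), (63.0, 200.0)]
train, valid = split(zscore_normalize(data), train_fraction=0.6)
print(len(train), len(valid))  # -> 6 4
```

Normalizing before the split-and-classify step matters here because the raw attributes (e.g. pressure in psi versus temperature in degrees F) differ in scale by an order of magnitude.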
The K-nearest neighbor algorithm is run without the latitude and longitude attributes, with 52.21 percent of the data partitioned as the training set and 47.79 percent as the validation set. The best K is again 1, with a 51.37 percent error: ninety-four misclassifications out of one hundred eighty-three validation set queries.

The K-nearest neighbor algorithm is run without the zonal winds and meridional winds attributes, with the same 52.21/47.79 partition. The best K is again 1, with a 27.87 percent error: fifty-one misclassifications out of one hundred eighty-three validation set queries. Eliminating the wind attributes improved the accuracy.

The K-nearest neighbor algorithm is run without the humidity attribute, with the same 52.21/47.79 partition. The best K is again 1, with a 28.96 percent error: fifty-three misclassifications out of one hundred eighty-three validation set queries.
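The attribute-elimination experiments follow one pattern: drop a subset of attributes, rerun the classifier, and compare validation error. That search can be sketched as below; the code is illustrative and was not the program used to produce the reported numbers:

```python
from collections import Counter

def validation_error(train_X, train_y, valid_X, valid_y, k, keep):
    # Classify each validation query using only the attribute indices
    # in `keep`, and return the fraction misclassified.
    def dist(a, b):
        return sum((a[i] - b[i]) ** 2 for i in keep)
    errors = 0
    for q, truth in zip(valid_X, valid_y):
        nearest = sorted(range(len(train_X)),
                         key=lambda j: dist(q, train_X[j]))[:k]
        vote = Counter(train_y[j] for j in nearest).most_common(1)[0][0]
        errors += vote != truth
    return errors / len(valid_X)

def drop_one_search(train_X, train_y, valid_X, valid_y, k):
    # Baseline error with all attributes, then the error after
    # dropping each attribute in turn.
    n = len(train_X[0])
    results = {None: validation_error(train_X, train_y,
                                      valid_X, valid_y, k, range(n))}
    for drop in range(n):
        keep = [i for i in range(n) if i != drop]
        results[drop] = validation_error(train_X, train_y,
                                         valid_X, valid_y, k, keep)
    return results
```

Running this with K = 1 over a small toy set returns a dictionary mapping each dropped attribute (and `None` for the full set) to its validation error, which is exactly the comparison made paragraph by paragraph above.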
The K-nearest neighbor algorithm is run without the air and sea surface temperature attributes, with the same 52.21/47.79 partition. The best K is again 1, with a 40.98 percent error: seventy-five misclassifications out of one hundred eighty-three validation set queries. Eliminating the temperature attributes decreases the accuracy.

Finally, the K-nearest neighbor algorithm is run without the zonal winds, meridional winds, and humidity attributes, with the same 52.21/47.79 partition. The best K is again 1, with a 20.77 percent error: thirty-eight misclassifications out of one hundred eighty-three validation set queries. So, by using five attributes we achieve better accuracy.

Table 3

4. ACKNOWLEDGMENTS
Thanks to Professor Eamonn Keogh for his advice and guidance. Thanks to Paul Lammertsma for his open source code, which was adapted for the Challenger USA Space Shuttle O-Ring data set.

5. CONCLUSIONS
Search provides a tool to reduce the number of attributes. In the Shuttle data set we eliminated one attribute; in the El Nino data set we eliminated more than one. Adjustment of the K value can also improve accuracy.

6. REFERENCES
[1] http://people.revoledu.com/kardi/tutorial/KNN/What-is-K-Nearest-Neighbor-Algorithm.html
[2] http://archive.ics.uci.edu/ml/datasets/Challenger+USA+Space+Shuttle+O-Ring
[3] http://kdd.ics.uci.edu/databases/el_nino/el_nino.html
[4] http://www.resample.com/xlminer/help/k-NN/knn_ex.htm
