K Nearest Neighbors (KNN)

K Nearest Neighbors (KNN) is one of the most popular and intuitive supervised machine learning algorithms. It is available in Excel using the XLSTAT software.

What is K Nearest Neighbors (KNN) machine learning?

The K Nearest Neighbors (KNN) algorithm is a non-parametric method used in both classification and regression that assumes that similar objects are in close proximity. Objects that are close (in terms of a certain distance metrics) are thus supposed to belong to the same class, or share similar properties.

The K Nearest Neighbors method (KNN) aims to categorize query points whose class is unknown given their respective distances to points in a learning set (i.e. whose class is known a priori). It is one of the most popular supervised machine learning tools that is both used as a regression method and a classification method.

What is K Nearest Neighbors (KNN) Classification?

A simple version of KNN classification algorithm can be regarded as an extension of the nearest neighbor method (NN method is a special case of KNN, k = 1). The nearest neighbor method consists in assigning to an object the class of its nearest neighbor.

The KNN classification approach assumes that each example in the learning set is a random vector in Rn. Each point is described as x =< a1(x), a2(x), a3(x),.., an(x) > where ar(x) denotes the value I of the rth attribute. ar(x) can be either a quantitative or a qualitative variable.

To determine the class of the query point xq, each of the k nearest points x1,…,xk to xq proceed to voting. The class of xq corresponds to the majority class.

What is K Nearest Neighbors (KNN) Regression?

The goal of the K Nearest neighbors (KNN) regressionalgorithm, on the other hand, is to predict a numerical dependant variable for a query point xq, based on the mean or the median of its value for the k nearest points x1,...xk.

K Nearest Neighbors in XLSTAT: options

Distances: Several distance metrics can be used in XLSTAT to compute similarities in the K Nearest Neighbors algorithm. Options vary according to the type of variables characterizing the observations (qualitative or quantitative).

Distances available for quantitative data (metrics): Euclidian, Minkowski, Manhatan, Tchebychev, Canberra
Distances available for quantitative data (kernels): linear, sigmoid, logarithmic, power, Gaussian, Laplacian
Distances available for qualitative data: Overlap Metric (OM), Value Difference Metric (VDM)

Validation: XLSTAT proposes a K-fold cross validation technique to quantify the quality of the classifier. Data is partitioned into k equally sub samples of equal size. Among the k subsamples, a single subsample is retained as the validation data to test the model, and the remaining k − 1 subsamples are used as training data. This technique can be used to find the best number of neighbors to use so that the algorithm performs the best on the considered dataset.

Other options available in the XLSTAT K Nearest Neighbors feature include observation tracking as well as vote weighing.

K Nearest Neighbors in XLSTAT: results

The K Nearest Neighbors (KNN) feature in XLSTAT includes displaying results by class or by object (observation).

Although the K Nearest Neighbors (KNN) algorithm is one of the most popular machine learning techniques in classification, other algorithms such as classification and regression random forests might be considered instead when it comes to working with a bigger dataset.

View all tutorials