The K-nearest neighbor algorithm

Given an n-dimensional partitioned space and a new, unseen value, we want to associate the new value with a partition. A simple solution is to look at the nearest known data point and use its partition. To get a more representative result, we can check the k nearest neighbors.

More precisely, to classify an example \(d\):

  • find the \(k\) nearest neighbors of \(d\)
  • choose as the label the majority label among the \(k\) nearest neighbors (see the sketch after this list)
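
A minimal sketch of this procedure in plain Python, using Euclidean distance (defined first; all names are illustrative):

    import math
    from collections import Counter

    def euclidean(a, b):
        # straight-line distance between two feature vectors
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(train, d, k):
        # train: list of (vector, label) pairs; d: the example to classify
        neighbors = sorted(train, key=lambda pair: euclidean(pair[0], d))[:k]
        # majority vote among the k nearest neighbors
        labels = [label for _, label in neighbors]
        return Counter(labels).most_common(1)[0][0]

For example, knn_classify([((0, 0), "a"), ((1, 0), "a"), ((5, 5), "b")], (1, 1), k=3) returns "a".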

How do we measure "nearest"?

Measuring distance / similarity is a domain-specific problem and there are many different variations.

  • similarity: numerical measure of how alike two data objects are
  • dissimilarity: numerical measure of how different two data objects are

Euclidean distance

\[D(a, b) = \sqrt{\sum_{k=1}^{p} (a_k - b_k)^2}\]

However, this assumes that all the features are comparable, i.e., measured on the same scale. Therefore, the data should be standardized first.

Standardization or Z-score normalization

Rescale the data so that the mean is 0 and the standard deviation is 1:

\[x_{norm}=\frac{x-\mu}{\sigma}\]

Min-Max scaling

Scale the data to a fixed range, between 0 and 1:

\[x_{norm}=\frac{x-x_{min}}{x_{max}-x_{min}}\]
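
A minimal NumPy sketch of both rescalings, applied column-wise (function names are illustrative; this assumes no constant feature, which would divide by zero):

    import numpy as np

    def z_score(X):
        # rescale every column to mean 0 and standard deviation 1
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def min_max(X):
        # rescale every column to the [0, 1] range
        mins, maxs = X.min(axis=0), X.max(axis=0)
        return (X - mins) / (maxs - mins)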

Minkowski Distance

Generalization of the Euclidean distance:

\[D(a, b) = \left(\sum_{k=1}^{p} |a_k - b_k|^r\right)^{\frac{1}{r}}\]
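
A one-function sketch in pure Python (name illustrative): \(r = 1\) gives the Manhattan distance, \(r = 2\) the Euclidean distance.

    def minkowski(a, b, r):
        # r = 1: Manhattan distance; r = 2: Euclidean distance
        return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1 / r)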

Cosine Similarity

Given two vectors \(d_1\) and \(d_2\):

\[\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}\]

  • \(\cdot\) is the dot product and \(\|d\|\) is the length (norm) of vector \(d\)
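
A minimal sketch in plain Python (name illustrative):

    import math

    def cosine_similarity(d1, d2):
        # dot product divided by the product of the vector lengths
        dot = sum(x * y for x, y in zip(d1, d2))
        norm1 = math.sqrt(sum(x * x for x in d1))
        norm2 = math.sqrt(sum(x * x for x in d2))
        return dot / (norm1 * norm2)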

Decision Boundaries

Decision boundaries are the places in feature space where the classification of a point / example changes.

Voronoi Diagram / Partitioning

Given a set of data points, it describes the areas that are nearest to each point

  • each line segment is equidistant between two points

However, k-NN does not explicitly compute decision boundaries; the boundaries form a subset of the Voronoi diagram of the training data.
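
For intuition, SciPy can compute the Voronoi partitioning of a point set directly (a sketch; the point data is made up):

    import numpy as np
    from scipy.spatial import Voronoi

    points = np.random.rand(10, 2)  # 10 example points in the plane
    vor = Voronoi(points)
    # each ridge is the segment equidistant between the two points it separates
    print(vor.ridge_points[:3])     # index pairs of points sharing a boundary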

How to pick K

Some techniques to pick \(k\) include:

  • common heuristics
  • use a validation set (see the sketch after this list)
  • use cross-validation
  • a rule of thumb is \(k < \sqrt{n}\), where \(n\) is the number of training examples
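
A minimal sketch of validation-set selection, reusing the knn_classify function from the first sketch (the split into train and validation is assumed to be given):

    def pick_k(train, validation, candidates):
        # validation: held-out (vector, label) pairs not used as neighbors
        def accuracy(k):
            hits = sum(knn_classify(train, x, k) == y for x, y in validation)
            return hits / len(validation)
        return max(candidates, key=accuracy)

    # odd candidates avoid ties in binary classification:
    # best_k = pick_k(train, validation, candidates=range(1, 20, 2))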

A common k-NN variation is weighted k-NN, where closer neighbors count more toward the vote.
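
A hypothetical sketch using inverse-distance weights, one common choice (the euclidean helper is from the first sketch):

    def weighted_knn_classify(train, d, k, eps=1e-9):
        neighbors = sorted(train, key=lambda pair: euclidean(pair[0], d))[:k]
        votes = {}
        for vec, label in neighbors:
            # closer neighbors contribute more; eps avoids division by zero
            votes[label] = votes.get(label, 0.0) + 1.0 / (euclidean(vec, d) + eps)
        return max(votes, key=votes.get)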

Lazy Learner vs Eager Learner

k-NN belongs to the class of lazy learning algorithms

  • lazy learning: simply stores the training data and defers all computation until a test example is given
  • eager learning: given a training set, constructs a classification model before receiving new test data to classify

This means that k-NN is relatively slow at inference time, but requires no training.

Curse of Dimensionality

In high dimensions, almost all points are far away from each other (demonstrated in the sketch after this list)

  • the size of the data space grows exponentially with the number of dimensions
  • this means that the size of the data set must also grow exponentially in order to keep the same density
    • the success of k-NN is very dependent on having a dense data set
    • k-NN requires a point to be close in every single dimension
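
A small NumPy experiment illustrating the claim: sample uniform points in the unit cube and watch the minimum pairwise distance approach the mean as the dimension grows.

    import numpy as np

    rng = np.random.default_rng(0)
    for dim in (2, 10, 100, 1000):
        points = rng.random((100, dim))  # 100 uniform points in the unit cube
        diffs = points[:, None, :] - points[None, :, :]
        dists = np.linalg.norm(diffs, axis=-1)
        pairwise = dists[np.triu_indices(100, k=1)]
        # the minimum approaches the mean: no point is meaningfully "near" another
        print(dim, round(pairwise.min(), 2), round(pairwise.mean(), 2))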

Computational Cost

  • the linear algorithm (no preprocessing) is \(O(kN)\): compute the distance to all \(N\) data points and select the \(k\) nearest
  • tree-based data structures: preprocessing, often using a K-D tree
    • divide the space into regions; to check which region a point belongs to, simply traverse a binary tree (see the sketch below)
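
A minimal sketch using SciPy's cKDTree: build the tree once as preprocessing, then each query is roughly logarithmic on average in low dimensions.

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(1000, 3)              # 1000 training points, 3 dimensions
    tree = cKDTree(points)                        # preprocessing: build the tree once
    dist, idx = tree.query([0.5, 0.5, 0.5], k=5)  # 5 nearest neighbors of a query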

Parametric vs Non-parametric Models

  • Parametric models have a finite number of parameters
    • linear regression, logistic regression, and linear support vector machines
  • Nonparametric models: the number of parameters is (potentially) infinite
    • k-nearest neighbor, decision trees, RBF kernel SVMs
