
Unsupervised Learning

In unsupervised learning we observe data from an unknown distribution without class labels. It is quite useful for the following tasks:

  • dimensionality reduction: find \(f: x\rightarrow y\) such that \(dim(y)\ll dim(x)\).
  • clustering: find \(f: x\rightarrow \mathbb{N}\) that maps an input to a cluster index.
  • density estimation: find the probability distribution that best fits the data.

A technique that is widely used for dimensionality reduction, lossy data compression, feature extraction and data visualization is called Principal Component Analysis, which we will now discuss.

Principal Component Analysis

In Principal Component Analysis, or PCA for short, the data is linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified. To reduce dimensionality, we keep the \(k\) directions with the largest variation and project the points onto them.

The idea of the algorithm is the following:

  1. Find orthogonal directions of maximum variance (eigenvectors)
  2. Change coordinate system
  3. Drop dimensions of least variance

To understand this, first we need to revise some linear algebra theory.

Recap: Linear algebra

The variance is a measure of how much a random variable deviates from its mean, and is calculated as:

\[Var(X)= \mathbb{E}[(X-\mu)^2] = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2\]

The covariance is a measure of the linear relationship between two variables and is calculated as:

\[Cov(X, Y) = \mathbb{E}[(X- \mathbb{E}[X])(Y - \mathbb{E}[Y])]= \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)(y_i -\bar y)\]

You could think of variance as the covariance of a random variable with itself:

\[Var(X)=Cov(X, X)\]
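
As a quick numerical check of these formulas, here is a minimal NumPy sketch (my own illustration, using the \(\frac{1}{n}\) convention above; np.cov defaults to \(\frac{1}{n-1}\), so bias=True is passed):

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=1000)
  y = 2 * x + rng.normal(size=1000)

  # Variance and covariance straight from the definitions (1/n normalization).
  var_x = np.mean((x - x.mean()) ** 2)
  cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

  # np.cov uses 1/(n-1) by default; bias=True switches it to 1/n.
  C = np.cov(x, y, bias=True)

  print(np.isclose(var_x, C[0, 0]))    # Var(X) = Cov(X, X)
  print(np.isclose(cov_xy, C[0, 1]))   # Cov(X, Y)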

An eigenvector or characteristic vector is a vector that has its direction unchanged (or reversed) by a given linear transformation. More precisely, an eigenvector \(v\) of a linear transformation \(T\) is scaled by a constant factor \(\lambda\) when the linear transformation is applied to it: \(Tv = \lambda v\). The corresponding eigenvalue or characteristic value is the multiplying factor \(\lambda\).

Eigenvalues can be found by solving:

\[det(A-\lambda I) = 0\]

and eigenvectors by solving:

\[(A-\lambda I)v =0\]
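
For instance, NumPy's np.linalg.eig solves both equations numerically (the 2x2 matrix below is an arbitrary example of mine):

  import numpy as np

  A = np.array([[2.0, 1.0],
                [1.0, 2.0]])

  # Columns of V are the eigenvectors, w holds the corresponding eigenvalues.
  w, V = np.linalg.eig(A)

  for lam, v in zip(w, V.T):
      # Check the defining property A v = lambda v.
      print(lam, np.allclose(A @ v, lam * v))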

Now suppose you have a matrix \(A\) with \(n\) linearly independent eigenvectors \(v_1, v_2, ..., v_n\) with associated eigenvalues \(\lambda _1, \lambda _2, ..., \lambda _n\). Define a square matrix \(Q\) whose columns are these eigenvectors, \(Q=[v_1\ v_2\ ...\ v_n]\). Since each column of \(Q\) is an eigenvector of \(A\), right-multiplying \(A\) by \(Q\) scales each column of \(Q\) by its associated eigenvalue: \(AQ = [\lambda_1 v_1\ \lambda_2 v_2\ ... \ \lambda _n v_n]\). We define a diagonal matrix \(\Lambda\) where each diagonal element \(\Lambda _{ii}\) is the eigenvalue associated with the i-th column of \(Q\). Then:

\[AQ=Q\Lambda\]

therefore: \[A = Q\Lambda Q^{-1}\]

Moreover, if \(Q\) is an orthogonal matrix, meaning \(QQ^T = Q^T Q = I\), then \(A = Q\Lambda Q^T\). This is referred to as the eigenvalue decomposition; for a real symmetric matrix \(A\) (such as a covariance matrix) such a \(Q\) always exists.
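
A small sketch of the decomposition for a symmetric matrix (covariance matrices are symmetric, which is why np.linalg.eigh is used here and in the PCA code below; the matrix itself is just an example):

  import numpy as np

  # A symmetric matrix: its eigenvectors can be chosen orthonormal.
  A = np.array([[4.0, 2.0, 0.0],
                [2.0, 3.0, 1.0],
                [0.0, 1.0, 2.0]])

  lam, Q = np.linalg.eigh(A)               # eigh is specialized for symmetric matrices
  Lambda = np.diag(lam)

  print(np.allclose(Q @ Q.T, np.eye(3)))   # Q is orthogonal: Q Q^T = I
  print(np.allclose(A, Q @ Lambda @ Q.T))  # A = Q Lambda Q^T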

Back to PCA

Armed with this knowledge, we can finally proceed to discuss principal component analysis. The i-th principal component can be taken as a direction, orthogonal to the first \(i-1\) principal components, that maximizes the variance of the projected data. The principal components can also be thought of as a sequence of \(p\) unit vectors, where the i-th vector is the direction of a line that best fits the data while being orthogonal to the first \(i-1\) vectors. A best-fitting line is defined as the one that minimizes the average squared perpendicular distance from the points to the line. We will use the first definition here.

Let's take the projection \(t_i\) of a point \(x_i\), centered at \(c\), onto the unit vector \(w\):

\[t_i = (x_i-c)^Tw \]

The variance of such points is then:

\[Var(t)=\frac{1}{n}\sum_{i=1}^n (t_i - \mathbb{E}[t])^2 = \frac{1}{n}\sum_{i=1}^n \left[(x_i - \bar x )^T w\right]^2\]

and, writing \(\bar x_i \overset{def}{=} x_i - \bar x\) for the centered points,

\[Var(t) = \frac{1}{n}\sum_{i=1}^n (\bar x_i^T w)^2 = w^T\left[\frac{1}{n}\bar X \bar X^T\right]w = w^TCw\]
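
A quick numerical check of the identity \(Var(t) = w^T C w\) (a sketch with synthetic data; columns of X are data points, as in the rest of these notes):

  import numpy as np

  rng = np.random.default_rng(1)
  X = rng.normal(size=(2, 200))              # columns are data points
  Xc = X - X.mean(axis=1, keepdims=True)     # centered data \bar X

  w = np.array([3.0, 4.0])
  w = w / np.linalg.norm(w)                  # unit direction

  t = Xc.T @ w                               # projections t_i = \bar x_i^T w
  C = (Xc @ Xc.T) / X.shape[1]               # C = (1/n) \bar X \bar X^T

  print(np.isclose(t.var(), w @ C @ w))      # Var(t) = w^T C w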

We are trying to maximize the variance of the projection:

\[w_1 \in argmax\{ w^T C w: w^Tw=1 \}\]

It can be shown that the principal components are eigenvectors of the data's covariance matrix.

  • the first principal component \(w_1\) is the eigenvector of \(C\) corresponding to the largest eigenvalue
  • the second principal component maximizes the variance among the directions orthogonal to \(w_1\):

\[w_2 \in argmax \{ w^T C w: w^T w =1, w \bot w_1 \}\]

  • this generalizes to the i-th component:

\[w_i \in argmax \{ w^T C w: w^T w =1, w \bot w_j\ \text{for}\ 1 \le j < i \}\]

  • the i-th largest eigenvalue of C is the variance along the i-th principal component
  • the i-th principal component is the corresponding eigenvector

Principal Component Analysis using eigenvalue decomposition

Algorithm:

  • Input: Data points \(X=[x_1, ..., x_n]\)
  • Centering: \(\bar X = X - \frac{1}{n}X1_n1_n^T\)
  • Compute covariance matrix: \(C=\frac{1}{n}\bar X \bar X^T\)
  • Eigenvalue decomposition: \(Q, \lambda = eig(C)\)
  • Output: Principal components \(W = Q = [q_1, ..., q_m]\) and variances \(\lambda = (\lambda _1, ..., \lambda _m)\)

Now that we can calculate the principal components of some data points, we can take the first \(k\) and project the data onto those, with \(k\) smaller than the original number of dimensions. We can treat \(k\) as a parameter in our training process to find the best value. We could also compute the cumulative proportion of explained variance, which is given by \(\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^m C_{jj}}\), to estimate the amount of information loss.
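
Here is a minimal NumPy sketch of the algorithm above, including the explained-variance ratio (function and variable names are my own):

  import numpy as np

  def pca_eig(X, k):
      """PCA via eigenvalue decomposition; columns of X are data points."""
      n = X.shape[1]
      Xc = X - X.mean(axis=1, keepdims=True)   # centering
      C = (Xc @ Xc.T) / n                      # covariance matrix
      lam, Q = np.linalg.eigh(C)               # eigenvalues in ascending order
      order = np.argsort(lam)[::-1]            # sort by decreasing variance
      lam, Q = lam[order], Q[:, order]
      W = Q[:, :k]                             # first k principal components
      T = W.T @ Xc                             # data projected onto k dimensions
      explained = lam[:k].sum() / lam.sum()    # proportion of explained variance
      return W, lam[:k], T, explained

  # Example: 3-D points lying close to a 2-D subspace.
  rng = np.random.default_rng(0)
  X = rng.normal(size=(3, 2)) @ rng.normal(size=(2, 300)) + 0.05 * rng.normal(size=(3, 300))

  W, variances, T, explained = pca_eig(X, k=2)
  print(variances, explained)                  # explained should be close to 1

Here lam.sum() equals \(\sum_j C_{jj}\) (the trace of \(C\)), which matches the denominator of the explained-variance formula above.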

PCA using Singular Value Decomposition

Here is an alternative way to compute PCA. Singular Value Decomposition is a factorization of a real or complex matrix into a rotation, followed by a rescaling, followed by another rotation. It generalizes the eigenvalue decomposition.

\[A = USV^T\]

Algorithm:

  • Input: Data points \(X=[x_1, ..., x_n]\)
  • Centering: \(\bar X = X - \frac{1}{n}X1_n1_n^T\)
  • SVD decomposition: \(U, S, V = SVD(\bar X)\)
  • Output: Principal components \(U = [u_1, ..., u_k]\) and variances \((\frac{s_1^2}{n}, ..., \frac{s_k^2}{n})\), since \(C = \frac{1}{n} \bar X \bar X^T = \frac{1}{n}USV^TVSU^T= U\frac{S^2}{n}U^T\)
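
A matching sketch via SVD of the centered data (again illustrative; note that np.linalg.svd returns \(V^T\) rather than \(V\)):

  import numpy as np

  def pca_svd(X, k):
      """PCA via SVD of the centered data matrix; columns of X are data points."""
      n = X.shape[1]
      Xc = X - X.mean(axis=1, keepdims=True)        # centering
      U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
      W = U[:, :k]                                  # principal components
      variances = s[:k] ** 2 / n                    # variances s_i^2 / n
      T = W.T @ Xc                                  # projected data
      return W, variances, T

  rng = np.random.default_rng(0)
  X = rng.normal(size=(3, 300))
  W, variances, T = pca_svd(X, k=2)
  print(variances)   # agrees (up to the sign of the components) with pca_eig above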

Kernel PCA (KPCA)

By using the kernel trick, one can implicitly apply PCA in a higher-dimensional feature space, which yields a non-linear transformation in the original space.
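
A rough sketch of the idea with an RBF kernel (the kernel choice, the bandwidth gamma, and the kernel-centering step are standard ingredients of kernel PCA but are not spelled out in these notes):

  import numpy as np

  def kernel_pca(X, k, gamma=1.0):
      """Kernel PCA with an RBF kernel; columns of X are data points."""
      n = X.shape[1]
      # Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
      sq = np.sum(X ** 2, axis=0)
      K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X.T @ X))
      # Center the kernel matrix in feature space.
      one = np.full((n, n), 1.0 / n)
      Kc = K - one @ K - K @ one + one @ K @ one
      # Keep the k largest eigenvalues/eigenvectors of the centered Gram matrix.
      lam, V = np.linalg.eigh(Kc)
      lam, V = lam[::-1][:k], V[:, ::-1][:, :k]
      # Projection of point i onto component a is sqrt(lam_a) * V[i, a].
      return V * np.sqrt(np.maximum(lam, 0))

  rng = np.random.default_rng(0)
  X = rng.normal(size=(2, 50))
  T = kernel_pca(X, k=2, gamma=0.5)
  print(T.shape)   # (50, 2): a non-linear 2-D embedding of the 50 points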

Other dimensionality reduction techniques

  • PCA: find the projection that maximizes the variance
  • Multidimensional Scaling: find the projection that best preserves inter-point distances
  • LDA (Linear Discriminant Analysis): find the component axes that maximize class separation.

Limitations

  • PCA is a linear method: it assumes the data lies close to a linear subspace and captures only linear relationships (which is a strong assumption)

K-means clustering

Given the data without the labels, the task is to divide it into clusters. Here we present a technique called k-means clustering.

We are assuming that we know the number of clusters. NOTE: This is a strong assumption to make!

Find a partition of the data points into \(k\) sets minimizing the variation within each set:

\[min_{C_1, ..., C_k}\sum_{j=1}^kV(C_j)\]

  • the variation is typically given by \(V(C_j) = \sum_{i\in C_j} ||x_i-u_j||^2\)
  • the centroid is computed with \(u_j = \frac{1}{|C_j|}\sum_{i\in C_j} x_i\)

The algorithm

Initialization: Use some initialization strategy to get
some initial cluster centroids u1, ..., uk

while clusters change, do
  Assign each data point to the closest centroid, forming
  new clusters

    Cj = { i : j = argmin_l ||xi - ul|| }

  Recompute the cluster centroids u1, ..., uk

The algorithm is guaranteed to converge, but not to find the global minimum.
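
A compact NumPy version of this loop (random initialization, rows of X as data points; a sketch, not an optimized implementation):

  import numpy as np

  def kmeans(X, k, max_iter=100, seed=0):
      """Lloyd's k-means algorithm; rows of X are data points."""
      rng = np.random.default_rng(seed)
      # Initialization: pick k distinct data points as initial centroids.
      centroids = X[rng.choice(len(X), size=k, replace=False)]
      labels = None
      for _ in range(max_iter):
          # Assignment step: each point goes to its closest centroid.
          dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          new_labels = dists.argmin(axis=1)
          if labels is not None and np.array_equal(new_labels, labels):
              break                                  # clusters no longer change
          labels = new_labels
          # Update step: each centroid becomes the mean of its cluster.
          for j in range(k):
              if np.any(labels == j):
                  centroids[j] = X[labels == j].mean(axis=0)
      return labels, centroids

  # Three well-separated blobs as a toy example.
  rng = np.random.default_rng(1)
  X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in [(0, 0), (5, 5), (0, 5)]])
  labels, centroids = kmeans(X, k=3)
  print(centroids)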

Initial Centroid Selection

  • random selection
  • points least similar to any existing center
  • try multiple starting points

Running time

  • assignment: \(O(kn)\) time
  • centroid computation: \(O(n)\) time

In some cases Euclidean distance is not the best measure of similarity, so cosine similarity is used instead:

\[sim(x, y)=\frac{x\cdot y}{\|x\|\,\|y\|}\]
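
For reference, a direct implementation of this formula (purely illustrative):

  import numpy as np

  def cosine_similarity(x, y):
      # sim(x, y) = (x . y) / (||x|| ||y||)
      return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

  print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707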

