Unsupervised Learning
In unsupervised learning we observe data from an unknown distribution without class labels. It is quite useful for the following tasks:
- dimensionality reduction: find \(f: x\rightarrow y\) such that \(dim(y) < dim(x)\).
- clustering: find \(f: x\rightarrow \mathbb{N}\) that maps an input to a cluster index.
- density estimation: find the probability distribution that best fits the data.
A technique that is widely used for dimensionality reduction, lossy data compression, feature extraction and data visualization is called Principal Component Analysis, which we will now discuss.
Principal Component Analysis
In Principal Component Analysis, or PCA for short, the data is linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified. To reduce dimensionality, we keep the \(k\) directions with the largest variation and project the points onto them.
The idea of the algorithm is the following:
- Find orthogonal directions of maximum variance (eigenvectors)
- Change the coordinate system
- Drop the dimensions of least variance
To understand this, first we need to revise some linear algebra theory.
Recap: Linear algebra
The variance measures the spread of a random variable around its mean, and is calculated as: \[Var(X)= \mathbb{E}[(X-\mu)^2] = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2\] The covariance measures the linear relationship between two variables and is calculated as: \[Cov(X, Y) = \mathbb{E}[(X- \mathbb{E}[X])(Y - \mathbb{E}[Y])]= \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)(y_i -\bar y)^T\]
You could think of variance as the covariance of a random variable with itself: \[Var(X)=Cov(X, X)\]
An eigenvector or characteristic vector is a vector whose direction is unchanged (or reversed) by a given linear transformation. More precisely, an eigenvector \(v\) of a linear transformation \(T\) is scaled by a constant factor \(\lambda\) when the linear transformation is applied to it: \(Tv = \lambda v\). The corresponding eigenvalue or characteristic value is the multiplying factor \(\lambda\). Eigenvalues can be found by solving: \[det(A-\lambda I) = 0\] and eigenvectors by solving: \[(A-\lambda I)v =0\]
Now suppose you have a matrix \(A\) with \(n\) linearly independent eigenvectors \(v_1, v_2, ..., v_n\) with associated eigenvalues \(\lambda _1, \lambda _2, ..., \lambda _n\). Define a square matrix \(Q\) whose columns are these eigenvectors, \(Q=[v_1\ v_2\ ...\ v_n]\). Since each column of \(Q\) is an eigenvector of \(A\), right-multiplying \(A\) by \(Q\) scales each column of \(Q\) by its associated eigenvalue: \(AQ = [\lambda_1 v_1\ \lambda_2 v_2\ ... \ \lambda _n v_n]\). We define a diagonal matrix \(\Lambda\) where each diagonal element \(\Lambda _{ii}\) is the eigenvalue associated with the i-th column of \(Q\). Then: \[AQ=Q\Lambda\] and therefore: \[A = Q\Lambda Q^{-1}\] Moreover, if \(Q\) is orthogonal, meaning \(QQ^T = Q^T Q = I\), then \(A = Q\Lambda Q^T\). This is referred to as eigenvalue decomposition.
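To make this concrete, here is a minimal NumPy sketch (the matrix is just a made-up example) verifying \(Av=\lambda v\) and \(A = Q\Lambda Q^T\) for a symmetric matrix:

```python
import numpy as np

# A small symmetric matrix, so its eigenvectors form an orthogonal Q
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

eigvals, Q = np.linalg.eigh(A)   # eigenvalues (ascending) and orthonormal eigenvectors
Lambda = np.diag(eigvals)

print(np.allclose(A, Q @ Lambda @ Q.T))                 # A = Q Lambda Q^T
print(np.allclose(A @ Q[:, 0], eigvals[0] * Q[:, 0]))   # A v = lambda v
```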
Back to PCA
Armed with this knowledge, we can finally proceed to discuss principal component analysis. The i-th principal component can be taken as a direction orthogonal to the first \(i-1\) principal components that maximizes the variance of the projected data. They can also be thought of as a sequence of \(p\) unit vectors, where the i-th vector is the direction of a line that best fits the data while being orthogonal to the first \(i-1\) vectors. A best-fitting line is defined as the one that minimizes the average squared perpendicular distance from the points to the line. We will use the first definition here.
Let's take the projection \(t_i\) of a point \(x_i\) onto the unit vector \(w\) centered at \(c\): \[t_i = (x_i-c)^Tw \] Noting that:
\[\mathbb{E}[t]= \frac{1}{n}\sum_{i=1}^n t_i = \frac{1}{n}\sum_{i=1}^n (x_i -c)^Tw = \bar{x}^Tw - c^Tw \quad (1)\]
The variance of such points is then: \[Var(t)=\frac{1}{n}\sum_{i=1}^n (t_i - \mathbb{E}[t])^2 \overset{(1)}{=} \frac{1}{n}\sum_{i=1}^n [(x_i - \bar x )^T w]^2\] Defining \(\bar x_i := x_i - \bar x\): \[Var(t) = \frac{1}{n}\sum_{i=1}^n (\bar x_i^T w)^2 = \frac{1}{n} \sum_{i=1}^n (w^T\bar{x}_i)^2 = \frac{1}{n} \sum_{i=1}^n (w^T\bar{x}_i)(\bar{x}_i^Tw) = \frac{1}{n} \sum_{i=1}^n w^T\bar{x}_i\bar{x}_i^T w = w^T\Big[\frac{1}{n}\bar X \bar X^T\Big]w = w^TCw\]
where \(C\) is the covariance matrix. We are trying to find the direction \(w\) that maximizes the variance along it: \[\arg\max_w\ w^T C w\] subject to: \[w^Tw=1\]
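Before solving this, a quick numerical sanity check of the identity \(Var(t)=w^TCw\) on made-up data (all variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))              # columns are the data points x_i
Xbar = X - X.mean(axis=1, keepdims=True)   # centered data
C = Xbar @ Xbar.T / X.shape[1]             # covariance matrix C = (1/n) Xbar Xbar^T

w = rng.normal(size=2)
w /= np.linalg.norm(w)                     # random unit direction

t = Xbar.T @ w                             # projections t_i = (x_i - xbar)^T w
print(np.isclose(t.var(), w @ C @ w))      # True: Var(t) = w^T C w
```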
Let's try to solve this maximization problem using the Lagrangian function:
\[L(w, \lambda)=w^TCw+\lambda (1-w^Tw)\]
Let's differentiate with respect to \(w\) and set the derivative equal to \(0\), noting that \(\frac{d}{dw}w^Tw=\frac{d}{dw}\|w\|^2=2w\) (component-wise, \(\frac{\partial}{\partial w_j}\sum_{i}w_i^2=2w_j\)):
\[\frac{d}{dw}L(w, \lambda)=2Cw-2\lambda w=0\]
Then:
\[Cw=\lambda w\]
We notice that the Lagrange multipliers are eigenvalues, and the \(w\) are eigenvectors of the covariance matrix (!!!) Left-multiplying \(Cw=\lambda w\) by \(w^T\) and using \(w^Tw=1\), it is also true that:
\[w^TCw=\lambda\]
This means that the variance is maximized when we set \(w\) to the eigenvector with the largest eigenvalue \(\lambda\). This eigenvector is known as the first principal component.
- This generalizes to the \(i\)-th principal component:
\[w_i \in \arg\max \{ w^T C w: w^T w =1,\ w \perp w_j\ \text{for}\ 1 \le j < i \}\]
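As a rough check of this result on toy data (again, names and data are my own), the eigenvector with the largest eigenvalue indeed maximizes \(w^TCw\) over random unit directions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200)) * np.array([[3.0], [1.0], [0.3]])   # anisotropic toy data
Xbar = X - X.mean(axis=1, keepdims=True)
C = Xbar @ Xbar.T / X.shape[1]

eigvals, Q = np.linalg.eigh(C)   # eigenvalues in ascending order
w1 = Q[:, -1]                    # eigenvector with the largest eigenvalue

# variance along 1000 random unit directions never beats lambda_max
W = rng.normal(size=(1000, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)
random_vars = np.einsum('ij,jk,ik->i', W, C, W)   # w^T C w for each direction

print(np.isclose(w1 @ C @ w1, eigvals[-1]))   # True: w1^T C w1 = lambda_max
print(random_vars.max() <= w1 @ C @ w1)       # True: no random direction does better
```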
Principal Component Analysis using eigenvalue decomposition
Algorithm:
- Input: Data points \(X=[x_1, ...., x_n]\)
- Centering: \(\bar X = X - \frac{1}{n}X1_n1_n^T\)
- Compute covariance matrix: \(C=\frac{1}{n}\bar X \bar X^T\)
- Eigenvalue decomposition: \(Q, \lambda = eig(C)\)
- Output: Principal components \(W = Q = [q_1, ..., q_m]\) and variances \(\lambda = (\lambda _1, ..., \lambda _m)\)
Now that we can calculate the principal components of some data points, we can take the first \(k\) and project the data onto them, with \(k\) smaller than the original number of dimensions. We can treat \(k\) as a hyperparameter and tune it to find the best value. We can also compute the cumulative proportion of explained variance, given by \(\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^m C_{jj}}\), to estimate how much information is lost.
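Below is a minimal NumPy sketch of the algorithm above, including the explained-variance ratio; the function name, interface, and example data are my own:

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigenvalue decomposition; X holds one data point per column."""
    n = X.shape[1]
    Xbar = X - X.mean(axis=1, keepdims=True)       # centering
    C = Xbar @ Xbar.T / n                          # covariance matrix
    eigvals, Q = np.linalg.eigh(C)                 # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]          # indices of the k largest
    W, variances = Q[:, order], eigvals[order]     # principal components and variances
    explained = variances.sum() / eigvals.sum()    # cumulative explained variance
    T = W.T @ Xbar                                 # data projected onto the first k components
    return W, variances, T, explained

# Example: reduce 5-dimensional points to 2 dimensions
X = np.random.default_rng(2).normal(size=(5, 100))
W, variances, T, explained = pca_eig(X, k=2)
print(T.shape, explained)                          # (2, 100) and fraction of variance kept
```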
PCA using Singular Value Decomposition
Here we present an alternative way to compute PCA. Singular Value Decomposition (SVD) is a factorization of a real or complex matrix into a rotation, followed by a rescaling, followed by another rotation:
\[A = USV^T\]
It generalizes eigenvalue decomposition, since the SVD exists for every \(m\times n\) matrix, whereas eigenvalue decomposition only applies to square diagonalizable matrices. Applying it to the centered data matrix, \(\bar X = USV^T\), in fact:
\[\bar X \bar X^T=(USV^T)(VS^TU^T)=U(SS^T)U^T = nC\]
So \(C = U\frac{S^2}{n}U^T\): the columns of \(U\) are the eigenvectors of the covariance matrix, with eigenvalues \(\frac{s_i^2}{n}\).
Algorithm:
- Input: Data points \(X=[x_1, ..., x_n]\)
- Centering: \(\bar X = X - \frac{1}{n}X1_n1_n^T\)
- SVD decomposition: \(U, S, V = SVD(\bar X)\)
- Output: Principal components \(U = [u_1, ..., u_k]\) and variances \((\frac{s_1^2}{n}, ..., \frac{s_k^2}{n})\), since \(C = \frac{1}{n} \bar X \bar X^T = \frac{1}{n}USV^TVS^TU^T= U\frac{S^2}{n}U^T\)
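A corresponding sketch using NumPy's SVD (again, the function name and data are made up):

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD of the centered data matrix; X holds one data point per column."""
    n = X.shape[1]
    Xbar = X - X.mean(axis=1, keepdims=True)             # centering
    U, s, Vt = np.linalg.svd(Xbar, full_matrices=False)  # singular values in descending order
    W = U[:, :k]                                         # principal components
    variances = s[:k] ** 2 / n                           # eigenvalues of C are s_i^2 / n
    T = W.T @ Xbar                                       # projected data
    return W, variances, T

X = np.random.default_rng(3).normal(size=(5, 100))
W, variances, T = pca_svd(X, k=2)
print(T.shape, variances)
```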
Kernel PCA (KPCA)
By using the kernel trick, one can apply PCA in a higher-dimensional feature space, yielding a non-linear transformation in the original space.
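A rough sketch of what this looks like with an RBF (Gaussian) kernel, computed by hand with NumPy; the kernel choice, \(\gamma\) value, and all names are assumptions on my part:

```python
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """Kernel PCA with an RBF kernel; X holds one data point per column."""
    n = X.shape[1]
    # Kernel matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq_dists = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    K = np.exp(-gamma * sq_dists)
    # Center the kernel matrix (centering happens implicitly in feature space)
    one_n = np.ones((n, n)) / n
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigendecomposition of the centered kernel, largest eigenvalues first
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Projections of the training points onto the first k non-linear components
    T = eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k], 0.0))
    return T.T                                    # shape (k, n)

X = np.random.default_rng(4).normal(size=(2, 50))
print(kernel_pca(X, k=2).shape)                   # (2, 50)
```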
Other dimensionality reduction techniques
- PCA: find the projection that maximizes the variance
- Multidimensional Scaling: find the projection that best preserves inter-point distances
- LDA (Linear Discriminant Analysis): find the projection axes that maximize class separation (supervised)
Limitations
- PCA is a linear method: it assumes the relevant structure in the data is linear (which is a strong assumption); capturing non-linear structure requires variants such as Kernel PCA