Machine Learning Basics
In machine learning we want the computer to automatically detect patterns in data and use them to predict future, unseen data. To do so, we perform a series of steps which we call a pipeline: data acquisition, preprocessing, dimensionality reduction, learning and testing.
In particular:
- Data acquisition: we collect the data relevant to the problem at hand.
- Preprocessing: cleaning and preparing the data for analysis (missing values, formatting…). Techniques include normalization, feature scaling, and handling categorical variables.
- Dimensionality reduction: feature selection or extraction methods can be applied to reduce the number of features while preserving the most important information.
- Model learning: a model is trained on the preprocessed data.
- Model testing: the model is evaluated on a held-out test set (a minimal end-to-end sketch follows this list).
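As a rough illustration of the whole pipeline, here is a minimal sketch using scikit-learn's built-in iris dataset; the choice of PCA and logistic regression is arbitrary, just one possible instantiation of each step.

```python
# A minimal sketch of the full pipeline (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Data acquisition: here we simply load a toy dataset.
X, y = load_iris(return_X_y=True)

# Model testing needs data the model has never seen, so hold out a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),               # preprocessing: feature scaling
    ("pca", PCA(n_components=2)),               # dimensionality reduction
    ("clf", LogisticRegression(max_iter=1000)), # model learning
])

pipeline.fit(X_train, y_train)                            # learning
print("test accuracy:", pipeline.score(X_test, y_test))  # testing
```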
Types of learning
Below are the types of learning that will be discussed in this course:
Supervised Learning
Given labeled examples, learn a model that predicts the label, and test whether the predicted labels are correct.
- Classification: there is a finite set of labels to predict. Given a training set \(T = \{(x_1, y_1), \dots, (x_m, y_m)\}\) of size \(m\), learn a function to predict \(y\) given \(x\): \(f: \mathbb{R}^d \to \{1, 2, \dots, k\}\). \(x\) is generally multidimensional (multiple features). Applications include:
- Face recognition
- Character recognition
- Spam detection (classical problem)
- Medical diagnosis
- Biometrics
- Regression: the label is a real value (see the sketch after this list):
\[f: \mathbb{R}^d \to \mathbb{R}\]
- Ranking: the label indicates an order.
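As a contrast with classification, here is a minimal regression sketch; the data and coefficients are made up for illustration, and the sketch assumes NumPy and scikit-learn are available.

```python
# Regression sketch: learn f: R^d -> R from noisy synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 3))          # 100 examples, d = 3 features
true_w = np.array([2.0, -1.0, 0.5])            # made-up ground truth
y = X @ true_w + rng.normal(0, 0.1, size=100)  # real-valued labels

model = LinearRegression().fit(X, y)
print(model.predict(X[:2]))  # real-valued predictions, not class labels
```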
Unsupervised Learning
The input is given with no labels. The main problems include:
- Clustering: given an input \(T = \{x_1, \dots, x_m\}\), output the hidden structure behind the \(x\)'s, which represents the clusters (see the sketch after this list). Possible applications are:
- social network analysis
- genomics
- image segmentation
- anomaly detection
- Dimensionality Reduction: reduce the number of features under consideration by mapping the data into a lower-dimensional space.
\[f: \mathbb{R}^d \to \mathbb{R}^m, \quad m \ll d\]
- Density estimation: find a probability distribution that fits the data.
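A minimal clustering sketch, assuming scikit-learn is available; the two Gaussian blobs stand in for data with hidden group structure:

```python
# Clustering sketch: recover hidden group structure without labels.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unlabeled input T = {x_1, ..., x_m}: two Gaussian blobs.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # cluster index per point
```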
Reinforcement Learning
An agent learns from the environment by interacting with it and receiving rewards for performing actions.
Other learning variations
- semi-supervised learning: some data has labels and some does not.
- active learning: the model learns from a labeled dataset and interactively queries a human user to label new data points.
- batch (offline) learning: the model learns from the entire dataset in one go and is updated only after processing all the data.
  - Once the model is trained, it does not change.
  - Typically used when the dataset fits into memory and can be processed efficiently as a whole.
- online learning: the model learns incrementally from each new data point or small batch of data (see the sketch after this list).
  - Allows the model to adapt to changes in the data distribution over time.
  - Suitable for scenarios where data arrives sequentially and needs to be processed in real time, or where computational resources are limited.
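A minimal online-learning sketch using scikit-learn's SGDClassifier, whose partial_fit method updates the model incrementally; the stream of batches here is synthetic:

```python
# Online learning sketch: the model is updated one small batch at a time.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])
model = SGDClassifier()

# Data arrives sequentially; the model never sees the whole dataset at once.
for step in range(20):
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch.sum(axis=1) > 0).astype(int)
    # classes must be passed on the first call so the model knows all labels
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.normal(size=(3, 5))))
```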
Features
Features are the questions we can ask about the examples. They are generally represented as vectors.
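For instance, a spam-detection example could be described by a handful of such questions, each answer becoming one entry of the feature vector (the specific features below are made up):

```python
# Features sketch: each "question about the example" becomes one vector entry.
import numpy as np

# An email described by three hypothetical features.
num_links = 7        # how many links does it contain?
length_chars = 1240  # how long is it?
contains_offer = 1   # does the word "offer" appear? (1 = yes)

x = np.array([num_links, length_chars, contains_offer], dtype=float)
print(x)  # one example as a point in R^3
```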
Generalization
"Machine learning is about generalization of data".
Generalization in machine learning refers to the ability of a trained model to perform well on new, unseen data that was not used during the training process.
This can be done only if there is a correlation between inputs and outputs. More technically, we are going to use the probabilistic model of learning.
- there is some probability distribution over example / label pairs called the data generating distribution
- both the training data and the test set are generated based on the distribution
A data generating distribution is the underlying probability distribution that generates the observed data points in a dataset. Understanding this distribution is crucial for building accurate and generalizable machine learning models, because it enables us to make informed assumptions about the data and to make predictions or decisions based on probabilistic reasoning. This is valid for every kind of machine learning.
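A small sketch of the idea, with a known data-generating distribution chosen for illustration; in practice \(p_{data}\) is unknown and we only observe samples from it:

```python
# Data-generating distribution sketch: training and test data are both
# drawn from the same (here, known and made-up) distribution p_data.
import numpy as np

rng = np.random.default_rng(0)

def sample_pdata(n):
    """Draw n (x, y) pairs: x ~ Uniform(-1, 1), y = sin(pi * x) + noise."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(np.pi * x) + rng.normal(0, 0.1, size=n)
    return x, y

x_train, y_train = sample_pdata(100)  # training set
x_test, y_test = sample_pdata(100)    # test set, same distribution
```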
Learning process
The steps to take in order to learn a model are:
- Collect (annotated) data
- Define a family of models for the classification task
- Define an error function to measure how well a model fits the data
- Find the model that minimizes the error, aka TRAIN or LEARN a model
"Ingredients":
- task: a task represents the type of prediction being made to solve a problem on some data: \(f: x \to y\)
- For example, in the classification case, \(f: x \to \{c_1, \dots, c_k\}\).
- Similarly for clustering, where the output is a cluster index.
- Regression: \(f: \mathbb{R}^d \to \mathbb{R}\)
- Dimensionality reduction: \(f: x \to y\), with \(\dim(y) \ll \dim(x)\)
- Density estimation: \(f: x \to \Delta(x)\), where \(\Delta(x)\) denotes a probability distribution over the data.
- data: information about the problem to solve, in the form of a distribution \(p\) which is typically unknown.
- training set: the failure of a machine learning algorithm is often caused by a bad selection of training samples.
- validation set: used to tune hyperparameters and compare models during training.
- test set: used only for the final evaluation of the learned model.
- model hypotheses: a model is an implementation of a function \(f\):
\[f \in F_{task}\]
A set of models forms a hypothesis space:
\[\mathcal{H} \subseteq F_{task}\]
We use a hypothesis space to reduce the number of possible models, in order to make our life easier.
- learning algorithm
- objective: we want to minimize a (generalization) error function \(E(f, p)\):
\[f^* \in \arg\min_{f \in F_{task}} E(f, p)\]
\(F_{task}\) is too large a function space to search directly, so we define a hypothesis space \(\mathcal{H} \subseteq F_{task}\) and seek a solution within it. Moreover, since \(p\) is unknown, we minimize the empirical error on the training data instead:
\[f^*_{\mathcal{H}}(D) \in \arg\min_{f \in \mathcal{H}} E(f, D)\]
with \(D = \{z_1, \dots, z_n\}\) being the training data.
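A minimal sketch of this minimization, assuming the hypothesis space \(\mathcal{H}\) is the set of linear functions \(f(x) = w \cdot x\), for which the empirical squared error has a closed-form minimizer (least squares):

```python
# ERM sketch: with H = linear functions f(x) = w . x, the empirical
# squared error E(f, D) is minimized in closed form by least squares.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # made-up training inputs
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=200)

# f*_H(D): the minimizer of (1/n) sum_i (f(x_i) - y_i)^2 over w in R^3
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print("learned w:", w_star)
```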
Error function
Let \(l(f, z)\) be a pointwise loss. The error can be computed with respect to the true distribution, or estimated from a training set:
\[E(f, p) = \mathbb{E}_{z \sim p_{data}}[l(f, z)]\]
\[E(f, D) = \frac{1}{n}\sum_{i=1}^{n} l(f, z_i)\]
We want to minimize this error.
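A direct translation of \(E(f, D)\) into code, with a squared pointwise loss and a made-up hypothesis and dataset:

```python
# Empirical error sketch: E(f, D) = (1/n) * sum_i l(f, z_i) with a
# squared pointwise loss l(f, (x, y)) = (f(x) - y)^2.
import numpy as np

def f(x):
    return 2.0 * x          # some hypothesis f

def loss(f, z):
    x, y = z
    return (f(x) - y) ** 2  # pointwise loss l(f, z)

D = [(0.0, 0.1), (1.0, 2.2), (2.0, 3.9)]  # training data {z_1, ..., z_n}
E = np.mean([loss(f, z) for z in D])      # empirical error E(f, D)
print(E)
```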
Underfitting and Overfitting
- Underfitting: the training error is very large, i.e., the model is far from what we want.
- Overfitting: there is a large gap between the generalization (validation) error and the training error.
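A small sketch of both regimes, fitting polynomials of increasing degree to noisy synthetic data; the degrees and noise level are arbitrary:

```python
# Under/overfitting sketch: compare training and validation error as
# model capacity (polynomial degree) grows.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 30)
y_train = np.sin(np.pi * x_train) + rng.normal(0, 0.2, 30)
x_val = rng.uniform(-1, 1, 30)
y_val = np.sin(np.pi * x_val) + rng.normal(0, 0.2, 30)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    # degree 1 underfits (both errors large); degree 9 overfits
    # (train error small, validation error much larger)
    print(degree, round(train_err, 3), round(val_err, 3))
```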
How to improve generalization
Common techniques to improve generalization include:
- avoid attaining the minimum on training error.
- reduce model capacity.
- change the objective by adding a regularization term (see the sketch after this list):
\[E_{reg}(f, D_n) = E(f, D_n)+\lambda \Omega (f)\]
- \(\lambda\) is the trade-off parameter
- For example:
\[E_{reg}(f, D_n) = \frac{1}{n} \sum_{i=1}^{n} [f(x_i)-y_i]^2 + \frac{\lambda}{n} \lVert w \rVert^2\]
- inject noise in the learning algorithm.
- stop the learning algorithm before convergence.
- increase the amount of data:
\[E(f, D_n) \rightarrow E(f, p_{data}), \quad n \rightarrow \infty\]
- augmenting the training set with transformations (rotate the image, change brightness…).
- combine predictions from multiple, decorrelated models (ensembling).
- train different models on different subsets of the data, and average their predictions.
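A minimal sketch of the regularized objective above, using scikit-learn's Ridge, whose alpha parameter plays the role of \(\lambda\); the data is synthetic:

```python
# Regularization sketch: ridge regression adds a lambda * ||w||^2 penalty
# to the squared error; lambda trades training fit against model capacity.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] + rng.normal(0, 0.5, size=50)  # only one feature matters

for lam in (0.01, 1.0, 100.0):
    model = Ridge(alpha=lam).fit(X, y)
    # larger lambda shrinks the weight vector, reducing capacity
    print(lam, round(np.linalg.norm(model.coef_), 3))
```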