Regularization
A recap of model-based machine learning:
- Pick a model
\[0=b+\sum_{j=1}^m w_jf_j\]
- Pick a criterion to optimize
\[\sum_{i=1}^n\mathbb{1}[y_i(wx_i+b)\le 0]\]
- Develop a learning algorithm
\[argmin_{w, b}\sum_{i=1}^n\mathbb{1}[y_i(wx_i+b)\le 0]\]
- repeat: pick a dimension and move a small amount in the direction opposite to the derivative (a minimal sketch of this loop follows below)
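A minimal sketch of this generic loop, assuming a toy convex loss and made-up numbers (the derivative is estimated numerically, so any differentiable loss works):

def coordinate_gradient_descent(loss, w, steps=1000, eta=0.01, eps=1e-6):
    """Minimize loss(w) by repeatedly picking a dimension and stepping
    a small amount opposite to an estimate of the derivative."""
    w = list(w)
    for t in range(steps):
        j = t % len(w)                              # pick a dimension
        w_plus = w.copy()
        w_plus[j] += eps
        deriv = (loss(w_plus) - loss(w)) / eps      # approximate d loss / d w_j
        w[j] -= eta * deriv                         # move opposite to the derivative
    return w

# Toy usage: minimize (w0 - 3)^2 + (w1 + 1)^2, which has its minimum at (3, -1)
print(coordinate_gradient_descent(lambda w: (w[0] - 3) ** 2 + (w[1] + 1) ** 2, [0.0, 0.0]))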
Regularizer
A regularizer is an additional criterion added to the loss function to make sure that we do not overfit: \[argmin_{w, b}\sum_{i=1}^n loss(yy')+\lambda\, regularizer(w, b)\] where the loss is written as a function of \(yy'\), the product of the true label and the prediction.
- we want to bias the model so that it prefers certain types of weights over others
- note that if we sum two convex functions, the result is also convex. We want a convex regularizer: if the objective is convex, there is usually an easy solution.
Generally, we do not want huge weights: if the weights are large, a small change in a feature can result in a large change in the prediction.
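For example (with made-up numbers): if \(w_j=100\), changing the feature \(f_j\) by just \(0.01\) moves the prediction \(b+\sum_j w_jf_j\) by \(100\cdot 0.01=1\), while with \(w_j=1\) the same change moves it by only \(0.01\).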
So, how do we encourage small weights or penalize large weights?
Common regularizers
Sum of the absolute values of the weights (1-norm): \[r(w, b)=\sum |w_j|\]
Square root of the sum of the squared weights (2-norm): \[r(w, b)=\sqrt{\sum |w_j|^2}\]
- this penalizes large weight values more than the sum of absolute values does
In general, we call this the p-norm, or \(L_p\) norm (\(L_1, L_2, \dots\)): \[r(w, b)=\sqrt[p]{\sum |w_j|^p}=||w||_p\]
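A quick numeric sketch of these regularizers (numpy and the example weight vectors are my own assumptions, just for illustration):

import numpy as np

def r_1(w):                # 1-norm: sum of absolute values
    return np.sum(np.abs(w))

def r_2(w):                # 2-norm: square root of the sum of squares
    return np.sqrt(np.sum(w ** 2))

def r_p(w, p):             # general p-norm
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

w_spread = np.array([1.0, 1.0, 1.0, 1.0])   # small, spread-out weights
w_peaked = np.array([4.0, 0.0, 0.0, 0.0])   # one large weight, same 1-norm

print(r_1(w_spread), r_1(w_peaked))         # 4.0 4.0 -> the 1-norm cannot tell them apart
print(r_2(w_spread), r_2(w_peaked))         # 2.0 4.0 -> the 2-norm penalizes the large weight more
print(r_p(w_spread, 3), r_p(w_peaked, 3))   # the gap grows as p grows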
Using Gradient Descent
Example using the 2-norm: \[argmin_{w, b}\sum_{i=1}^n \exp(-y_i(wx_i+b))+\frac{\lambda}{2}||w||^2\] Taking the derivative of this objective with respect to \(w_j\): \[\frac{d}{dw_j}\,\text{objective} = -\sum_{i=1}^n y_ix_{ij}\exp(-y_i(wx_i+b))+\lambda w_j\] In the gradient descent update we move in the direction opposite to this derivative, scaled by a constant (the learning rate).
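A minimal sketch of this update in code (numpy, the toy data, and the hyperparameters are my own assumptions, not part of the notes):

import numpy as np

def fit(X, y, lam=0.1, eta=0.01, steps=1000):
    """Gradient descent on the 2-norm-regularized exponential loss.
    X: (n, m) feature matrix, y: (n,) labels in {-1, +1}."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(steps):
        coeff = np.exp(-y * (X @ w + b))          # exp(-y_i (w x_i + b))
        grad_w = -(X.T @ (y * coeff)) + lam * w   # the derivative from the notes, per w_j
        grad_b = -np.sum(y * coeff)
        w -= eta * grad_w                         # eta is the learning-rate constant
        b -= eta * grad_b
    return w, b

# Toy usage on a small separable dataset
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = fit(X, y)
print(np.sign(X @ w + b))                         # should match y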
Note that gradient descent is not the only minimization method.
Logistic Regression
Log loss or binary cross-entropy loss \[L(w) = -\sum_{i=1}^n [y_i\log(\hat{y_i})+(1-y_i)\log(1-\hat{y_i})]\] where \[\hat{y_i} = \sigma (x_i^Tw)\] and \[\sigma (z)=\frac{1}{1+e^{-z}}\]
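As a side note (a standard result, not derived in these notes): with \(\hat{y}_i=\sigma(x_i^Tw)\), the gradient of the log loss has the simple form \[\nabla_w L(w) = \sum_{i=1}^n(\hat{y}_i - y_i)x_i\] and adding the 2-norm regularizer \(\frac{\lambda}{2}||w||^2\) again just adds \(\lambda w\) to the gradient, exactly as in the exponential loss example above.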