Lecture 3: Linear Classifiers
1. Linear classifier
There are three ways to think about linear classifiers; a sketch of the algebraic viewpoint follows the list:
- Algebraic viewpoint
- Visual viewpoint
- Geometric viewpoint
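For the algebraic viewpoint, a linear classifier computes a vector of class scores as \(f(x, W) = Wx + b\). Below is a minimal NumPy sketch; the CIFAR-10-style shapes (3072-dimensional images, 10 classes) are assumptions for illustration, not taken from the notes above.

import numpy as np

def linear_classifier(x, W, b):
    # Algebraic viewpoint: class scores are an affine function of the input pixels.
    # x: (D,) flattened image, W: (C, D) weight matrix, b: (C,) bias vector.
    return W @ x + b

# Assumed CIFAR-10-like shapes: D = 3072 pixels, C = 10 classes.
D, C = 3072, 10
x = np.random.rand(D)
W = np.random.randn(C, D) * 0.01
b = np.zeros(C)
scores = linear_classifier(x, W, b)  # shape (C,), one score per class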
2. Loss function
A loss function tells us how good our classifier is:
- Low loss \(\rightarrow\) A good classifier
- High loss \(\rightarrow\) A bad classifier
Given a dataset of examples \(\{(x_i, y_i)\}^N_{i=1}\), where \(x_i\) is an image and \(y_i\) is an integer label.
The loss for a single example is
$$
L_i(f(x_i, W), y_i)
$$
The loss for the dataset is the average of the per-example losses:
$$
L = \frac{1}{N} \sum_{i}{L_i(f(x_i, W), y_i)}
$$
Next, we introduce some commonly used loss functions.
2.1 Cross-Entropy Loss
The Cross Entropy Loss is also called Multinomial Logistic Regression. Its main idea is to interpret the raw classifier scores as probabilities. The formula is as follows:
$$
L_i = -\log P(Y=y_i|X=x_i)
$$
where \(P(Y = y_i|X=x_i)\) is given by the softmax function, whose formula is as follows:
$$
P(Y = k| X= x_i) = \frac{\text{exp}(s_k)}{\sum_j{\text{exp}(s_j)}}
$$
Q: What is the min / max possible loss \(L_i\) ?
The minimum \(L_i\) is \(0\) (when the correct class has probability \(1\)) and the maximum \(L_i\) is \(+\infty\) (as the probability of the correct class approaches \(0\)).
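Below is a minimal sketch of the cross-entropy loss for a single example, assuming a raw score vector s and an integer label y; the shift by max(s) is a standard numerical-stability trick and is not part of the formula above.

import numpy as np

def cross_entropy_loss(s, y):
    # s: (C,) raw class scores, y: index of the correct class.
    s = s - np.max(s)                        # stability shift; does not change the softmax
    probs = np.exp(s) / np.sum(np.exp(s))    # softmax probabilities
    return -np.log(probs[y])                 # L_i = -log P(Y = y_i | X = x_i)

scores = np.array([3.2, 5.1, -1.7])
print(cross_entropy_loss(scores, y=0))       # approaches 0 only as probs[y] approaches 1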
2.2 Multi-class SVM Loss
The core idea of the multi-class SVM loss is that the score of the correct class should be higher than all the other scores by at least a fixed margin.
Given an example \((x_i, y_i)\), where \(x_i\) is an image and \(y_i\) is its label, let \(s = f(x_i, W)\) be the score vector. Then the SVM loss has the following form:
$$
L_i = \sum_{j \ne y_i} \max({0, s_j - s_{y_i} + 1})
$$
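Below is a minimal sketch of the multi-class SVM loss for one example, assuming a score vector s and label y; the margin of 1 matches the formula above.

import numpy as np

def svm_loss(s, y, margin=1.0):
    # s: (C,) class scores, y: index of the correct class.
    margins = np.maximum(0, s - s[y] + margin)   # hinge term for every class
    margins[y] = 0                               # drop the j == y_i term from the sum
    return np.sum(margins)

scores = np.array([3.2, 5.1, -1.7])
print(svm_loss(scores, y=0))   # 0 only when s[y] beats every other score by the margin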
Lecture 4: Regularization & Optimization
1. Regularization
What is overfitting?
A model is overfitting when it performs very well on the training data but poorly on unseen data.
To mitigate overfitting, we add a regularization term that discourages the model from fitting the training data too closely. The new loss function is as follows:
$$
L(W) = \frac{1}{N} \sum_{i=1}^{N}L_i(f(x_i, W), y_i) + \lambda R(W)
$$
where \(\lambda\) is a hyper-parameter giving regularization strength.
The regularization term increases the loss for overly complex models (for example, ones that fit the training data with sharp cliffs), expressing a preference for simpler models that are more likely to generalize. Some commonly used regularization terms are listed below, with a code sketch after the list:
- \(L2\) regularization: \(R(W) = \sum_{k,l}W^2_{k,l}\)
- \(L1\) regularization: \(R(W) = \sum_{k,l}|W_{k,l}|\)
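Below is a minimal sketch of the regularized loss, assuming the per-example losses have already been computed; the regularization strength \(\lambda = 10^{-4}\) is a placeholder value, not taken from the notes.

import numpy as np

def l2_regularization(W):
    return np.sum(W ** 2)       # R(W) = sum over k, l of W[k, l]^2

def l1_regularization(W):
    return np.sum(np.abs(W))    # R(W) = sum over k, l of |W[k, l]|

def full_loss(per_example_losses, W, lam=1e-4):
    # L(W) = data loss (average over examples) + lambda * R(W)
    data_loss = np.mean(per_example_losses)
    return data_loss + lam * l2_regularization(W)   # swap in l1_regularization if desired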
Other regularization methods:
- Dropout
- Batch normalization
- Cutout, Mixup, Stochastic depth, etc…
2. Optimization
The core idea of optimization is to find the weight matrix that minimizes the loss function:
$$
W^{*} = \arg\min_{W} L(W)
$$
In order to optimize the weight matrix, we need an effective strategy. A simple yet effective approach is to follow the slope. In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.
How to compute gradient?
- Numeric gradient: approximate, slow, easy to write
- Analytic gradient: exact, fast, error-prone
In practice, we always use the analytic gradient, but check the implementation against the numeric gradient. This procedure is called a gradient check.
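Below is a minimal sketch of a gradient check, where the numeric gradient is computed with central differences and compared against the analytic gradient; loss_fn, analytic_grad_fn, and the tolerance are illustrative assumptions.

import numpy as np

def numeric_gradient(loss_fn, w, h=1e-5):
    # Approximate dL/dw one coordinate at a time with central differences.
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += h
        w_minus.flat[i] -= h
        grad.flat[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * h)
    return grad

def gradient_check(loss_fn, analytic_grad_fn, w, tol=1e-5):
    # Compare the slow-but-approximate numeric gradient with the fast analytic one.
    num = numeric_gradient(loss_fn, w)
    ana = analytic_grad_fn(w)
    rel_error = np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana))
    return np.max(rel_error) < tol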
3. Gradient Descent Strategies
3.1 Gradient Descent
Iteratively step in the direction of the negative gradient (direction of local steepest descent).
w = initialize_weights()
for t in range(num_steps):
    dw = compute_gradient(loss_fn, data, w)  # gradient of the loss w.r.t. w over the full dataset
    w -= learning_rate * dw                  # step in the direction of the negative gradient
This code block has several hyperparameters:
- Weight initialization method
- Number of steps
- Learning rate
3.2 Stochastic Gradient Descent
However, computing the sum over all \(N\) examples is too expensive when \(N\) is large.
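Instead, stochastic gradient descent estimates the loss and the gradient on a small minibatch of examples at each step. Below is a sketch in the same pseudocode style as the gradient descent loop above; sample_batch and batch_size are hypothetical names, not defined in the notes.

w = initialize_weights()
for t in range(num_steps):
    batch = sample_batch(data, batch_size)     # small random subset, e.g. 32 / 64 / 128 examples
    dw = compute_gradient(loss_fn, batch, w)   # gradient estimated on the minibatch only
    w -= learning_rate * dw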