
Deep Learning for Computer Vision


Lecture 3: Linear Classifiers

1. Linear classifier

There are three ways to think about linear classifiers:

  • Algebraic viewpoint
  • Visual viewpoint
  • Geometric viewpoint
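
For example, under the algebraic viewpoint the classifier simply computes a score vector \(f(x, W) = Wx + b\). Below is a minimal NumPy sketch of that computation; the shapes (3072-dimensional flattened images, 10 classes) are illustrative assumptions in the style of CIFAR-10, not fixed by the lecture.

import numpy as np

# Illustrative shapes: 10 classes, 3072 = 32 * 32 * 3 flattened pixel values
num_classes, input_dim = 10, 3072

W = np.random.randn(num_classes, input_dim) * 0.01  # weight matrix
b = np.zeros(num_classes)                           # bias vector
x = np.random.randn(input_dim)                      # one flattened image

scores = W @ x + b                        # algebraic viewpoint: f(x, W) = Wx + b
predicted_class = int(np.argmax(scores))  # predict the class with the highest score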

2. Loss function

A loss function tells us how good our classifier is:

  • Low loss \(\rightarrow\) A good classifier
  • High loss \(\rightarrow\) A bad classifier

Given a dataset of examples \(\{(x_i, y_i)\}_{i=1}^{N}\), where \(x_i\) is an image and \(y_i\) is an integer label.

The loss for a single example is

$$
L_i(f(x_i, W), y_i)
$$
The loss for the dataset is the average of the per-example losses:

$$
L = \frac{1}{N} \sum_{i}{L_i(f(x_i, W), y_i)}
$$

Next we introduce some commonly used loss functions.

2.1 Cross-Entropy Loss

The cross-entropy loss is also called multinomial logistic regression. Its main idea is to interpret the raw classifier scores as probabilities. The per-example loss is:

$$
L_i = -\log P(Y = y_i \mid X = x_i)
$$

where \(P(Y = y_i \mid X = x_i)\) is given by the softmax function applied to the score vector \(s = f(x_i, W)\):

$$
P(Y = k \mid X = x_i) = \frac{\exp(s_k)}{\sum_j \exp(s_j)}
$$
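
A minimal NumPy sketch of this loss for a single example is shown below; the max-subtraction is a standard trick for numerical stability (it does not change the softmax output), and the example scores are made up for illustration.

import numpy as np

def cross_entropy_loss(s, y_i):
    # s: raw class scores f(x_i, W); y_i: integer index of the correct class
    s = s - np.max(s)                      # shift scores for numerical stability
    probs = np.exp(s) / np.sum(np.exp(s))  # softmax: P(Y = k | X = x_i)
    return -np.log(probs[y_i])             # L_i = -log P(Y = y_i | X = x_i)

scores = np.array([3.2, 5.1, -1.7])
print(cross_entropy_loss(scores, y_i=0))   # ~2.04 when class 0 is correct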

Q: What is the min / max possible loss \(L_i\) ?

The minimum \(L_i\) is \(0\) (approached as the probability of the correct class goes to \(1\)) and the maximum is \(+\infty\) (approached as that probability goes to \(0\)).

2.2 Multi-class SVM Loss

The core idea of the multi-class SVM loss is that the score of the correct class should be higher than all other scores by at least a fixed margin (taken to be \(1\) below).

Given an example \((x_i, y_i)\), where \(x_i\) is the image and \(y_i\) is the label, let \(s = f(x_i, W)\) be the vector of class scores. Then the SVM loss has the following form:

$$
L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + 1)
$$
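
A minimal NumPy sketch of this loss for a single example, using the same made-up scores as above; the constant \(1\) is the margin from the formula.

import numpy as np

def svm_loss(s, y_i):
    # s: class scores f(x_i, W); y_i: integer index of the correct class
    margins = np.maximum(0, s - s[y_i] + 1)  # hinge term for every class
    margins[y_i] = 0                         # skip j = y_i, as in the sum
    return np.sum(margins)

scores = np.array([3.2, 5.1, -1.7])
print(svm_loss(scores, y_i=0))  # 2.9: only the second class violates the margin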

Lecture 4: Regularization & Optimization

1. Regularization

What is overfitting?

A model is overfitting when it fits the training data very well but performs poorly on unseen data.

To alleviate overfitting, we add a regularization term that prevents the model from fitting the training data too closely. The new loss function is:

$$
L(W) = \frac{1}{N} \sum_{i=1}^{N}L_i(f(x_i, W), y_i) + \lambda R(W)
$$

where \(\lambda\) is a hyper-parameter giving regularization strength.

The regularization term increases the loss for overly complex models (for example, those that fit the training data with sharply varying functions), expressing a preference for simpler models. Some commonly used regularization terms are:

  • \(L2\) regularization: \(R(W) = \sum_{k,l}W^2_{k,l}\)
  • \(L1\) regularization: \(R(W) = \sum_{k,l}|W_{k,l}|\)
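
As a sketch, the full regularized loss can be assembled as below; per_example_losses, W, and lam are assumed placeholders for the quantities defined above.

import numpy as np

def total_loss(per_example_losses, W, lam):
    data_loss = np.mean(per_example_losses)   # (1/N) * sum of the L_i
    reg_loss = lam * np.sum(W * W)            # lambda * R(W) with L2: sum of W_{k,l}^2
    # For L1 regularization, use instead: lam * np.sum(np.abs(W))
    return data_loss + reg_loss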

Other regularization methods

  • Dropout
  • Batch normalization
  • Cutout, Mixup, Stochastic depth, etc…

2. Optimization

The goal of optimization is to find the weight matrix \(W^*\) that minimizes the loss function:

$$
W^{*} = \arg \min_{W} L(W)
$$

To optimize the weight matrix, we need an effective strategy. A simple yet effective approach is to follow the slope. In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.

How to compute gradient?

  • Numeric gradient: approximate, slow, easy to write
  • Analytic gradient: exact, fast, error-prone

In practice, we always use the analytic gradient, but check the implementation against the numeric gradient. This is called a gradient check.
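
A minimal sketch of such a gradient check, using a centered-difference approximation for the numeric gradient; loss_fn is assumed to take the weights and return the scalar loss on some fixed data.

import numpy as np

def numeric_gradient(loss_fn, w, h=1e-5):
    # Approximate dL/dw with centered finite differences: slow but easy to get right
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + h
        loss_plus = loss_fn(w)
        w[idx] = old - h
        loss_minus = loss_fn(w)
        w[idx] = old                      # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
    return grad

# Gradient check: compare against the analytic gradient and expect a tiny relative error
# num = numeric_gradient(loss_fn, w)
# ana = analytic_gradient(loss_fn, w)    # assumed to exist
# rel_error = np.max(np.abs(num - ana) / np.maximum(np.abs(num) + np.abs(ana), 1e-8))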

3. Gradient Descent Strategies

3.1 Gradient Descent

Iteratively step in the direction of the negative gradient (direction of local steepest descent).

w = initialize_weights()
for t in range(num_steps):
	dw = compute_gradient(loss_fn, data, w)
	w -= learning_rate * dw

This code block involves several hyper-parameters:

  • Weight initialization method
  • Number of steps
  • Learning rate

3.2 Stochastic Gradient Descent

However, computing the sum over all \(N\) examples is too expensive when \(N\) is large. Instead, stochastic gradient descent (SGD) approximates the full loss and its gradient at each step using a small minibatch of examples (typically 32, 64, or 128).
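
A minimal sketch of the resulting minibatch SGD loop, in the same pseudocode style as the gradient descent loop above; sample_minibatch is an assumed placeholder that draws a small random subset of the data.

w = initialize_weights()
for t in range(num_steps):
    minibatch = sample_minibatch(data, batch_size=64)  # e.g. 32 / 64 / 128 examples
    dw = compute_gradient(loss_fn, minibatch, w)       # gradient estimated on the minibatch only
    w -= learning_rate * dw

Compared with full-batch gradient descent, this adds the batch size (and the way minibatches are sampled) as additional hyper-parameters.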


