Deep Learning Basics Lecture 3: Optimization


Introduction

  • "Language is the source of misunderstandings." - Antoine de Saint-Exupéry (1900-1944)
  • Gradient Descent
    • First-order iterative optimization algorithm for finding a local minimum of a differentiable function.
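
As a rough illustration (my own toy example, not from the lecture), gradient descent repeatedly steps opposite the gradient:

```python
# Minimal sketch: minimize f(x) = (x - 3)^2 with the first-order update x <- x - lr * f'(x).
def gradient_descent(lr=0.1, steps=100):
    x = 0.0                  # starting point
    for _ in range(steps):
        grad = 2 * (x - 3)   # f'(x)
        x -= lr * grad       # step against the gradient
    return x                 # approaches the local (here global) minimum at x = 3
```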

Important Concepts in Optimization

  • Generalization
  • Under-fitting vs. over-fitting
  • Cross validation
  • Bias-variance tradeoff
  • Bootstrapping
  • Bagging and boosting

Generalization

  • How well the learned model will behave on unseen data.

Underfitting vs. Overfitting

 

Cross-validation

  • Cross-validation is a model validation technique for assessing how the model will generalize to an independent (test) data set.
    • Validation data: used to tune hyperparameters.
    • Cross-validation: cycle through the choice of which fold is the validation fold, average results.

Using the test data at all is already cheating, so only the validation and cross-validation splits should be used for tuning.
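
A minimal k-fold cross-validation sketch. `train` and `evaluate` are placeholder helpers I'm assuming, not functions from the lecture:

```python
import numpy as np

def k_fold_cv(X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]                                      # i-th fold is the validation fold
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])   # remaining folds are training data
        model = train(X[train_idx], y[train_idx])               # assumed helper
        scores.append(evaluate(model, X[val_idx], y[val_idx]))  # assumed helper
    return np.mean(scores)                                      # average over the k validation folds
```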

 

Bias and Variance 

 

We can show that what we are minimizing (the expected cost) decomposes into three parts: bias², variance, and noise.
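
A sketch of that decomposition under squared-error loss, assuming targets t = f(x) + ε with zero-mean noise of variance σ²:

```latex
\mathbb{E}\big[(t - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```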

 

Bootstrapping

  • Bootstrapping is any test or metric that uses random sampling with replacement.

Because the models trained this way can each produce different predictions, bootstrapping is used to see how consistent those predictions are: with the training data fixed, we draw several resampled sets, train a model on each, and then do something with the resulting collection of models.
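
A minimal bootstrapping sketch. `train` is an assumed placeholder for whatever fitting routine is used:

```python
import numpy as np

def bootstrap_models(X, y, n_models=10, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # sample indices with replacement
        models.append(train(X[idx], y[idx]))        # one model per bootstrap resample (assumed helper)
    return models  # compare their predictions to gauge how much they agree
```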

Bagging vs. Boosting

  • Bagging (Bootstrapping aggregating)
    • Multiple models are being trained with bootstrapping
    • ex) Base classifiers are fitted on random subsets of the data, and their individual predictions are aggregated (voting or averaging). See the sketch after this list.
  • Boosting
    • It focuses on those specific training samples that are hard to classify.
    • A strong model is built by combining weak learners in sequence, where each learner learns from the mistakes of the previous weak learner.
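
A rough sketch of both, assuming scikit-learn is available; the toy data and estimator settings are illustrative, not from the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Bagging: base classifiers fitted on bootstrap samples, predictions aggregated by voting.
bagging = BaggingClassifier(n_estimators=10).fit(X, y)

# Boosting: weak learners trained in sequence, each focusing on the previous one's mistakes.
boosting = AdaBoostClassifier(n_estimators=50).fit(X, y)
```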

Gradient Descent Methods

  • Stochastic gradient descent
    • Update with the gradient computed from a single sample.
  • Mini-batch gradient descent
    • Update with the gradient computed from a subset of the data (most commonly used; see the sketch after this list).
  • Batch gradient descent
    • Update with the gradient computed from the whole data.
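
A mini-batch gradient descent sketch for linear least squares (my own illustrative setup, assuming NumPy arrays). Setting batch_size=1 recovers stochastic gradient descent and batch_size=len(X) recovers batch gradient descent:

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))                       # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]               # one mini-batch of indices
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient of squared error on the batch
            w -= lr * grad                                  # update with the batch gradient
    return w
```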

Batch-size Matters

  • "It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize."
  • "We... present numerical evidence that supports the view that large batch methods tend to converge to sharp minimizers of the training and testing functions. In contract, small-batch methods consistently converge to flat minimizers... this is due to the inherent noise in the gradient estimation."

Gradient Descent Methods

  • Stochastic gradient descent
  • Momentum
  • Nesterov accelerated gradient
  • Adagrad
  • Adadelta
  • RMSprop
  • Adam
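
For reference, a sketch of how these optimizers are typically instantiated in PyTorch (assuming PyTorch; the model and hyperparameters are placeholders of my choosing):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)                                # plain SGD
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)                # Momentum
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True) # NAG
# optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
# optimizer = torch.optim.Adadelta(model.parameters())
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```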

Adagrad

  • Adagrad adapts the learning rate, performing larger updates for infrequent and smaller updates for frequent parameters.
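
A minimal Adagrad update sketch (variable names are my own): the running sum of squared gradients shrinks the step for frequently updated parameters.

```python
import numpy as np

def adagrad_step(w, grad, g_sq_sum, lr=0.01, eps=1e-8):
    g_sq_sum = g_sq_sum + grad ** 2                # accumulate squared gradients
    w = w - lr * grad / (np.sqrt(g_sq_sum) + eps)  # rarely updated params get larger steps
    return w, g_sq_sum
```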

Adadelta

  • Adadelta extends Adagrad to reduce its monotonically decreasing learning rate by restricting the accumulation window of past gradients.

There is no learning rate in Adadelta.
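
A sketch of the Adadelta update as I understand Zeiler's formulation (names assumed): the ratio of two running RMS terms replaces an explicit learning rate.

```python
import numpy as np

def adadelta_step(w, grad, g_acc, dx_acc, rho=0.95, eps=1e-6):
    g_acc = rho * g_acc + (1 - rho) * grad ** 2                # EMA of squared gradients (bounded window)
    dx = -np.sqrt(dx_acc + eps) / np.sqrt(g_acc + eps) * grad  # step size comes from past updates, not a learning rate
    dx_acc = rho * dx_acc + (1 - rho) * dx ** 2                # EMA of squared updates
    return w + dx, g_acc, dx_acc
```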

 

RMSprop

  • RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his lecture.
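
A minimal RMSprop sketch (names assumed): an exponential moving average of squared gradients replaces Adagrad's unbounded sum.

```python
import numpy as np

def rmsprop_step(w, grad, g_acc, lr=0.001, rho=0.9, eps=1e-8):
    g_acc = rho * g_acc + (1 - rho) * grad ** 2  # decaying average of squared gradients
    w = w - lr * grad / (np.sqrt(g_acc) + eps)   # adaptive per-parameter step
    return w, g_acc
```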

Adam

  • Adaptive Moment Estimation (Adam) leverages both past gradients and squared gradients.

Adam effectively combines momentum with an adaptive learning-rate approach.
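
A minimal Adam update sketch (names and default hyperparameters assumed): the first moment acts as momentum, the second moment gives the adaptive learning rate, and both are bias-corrected.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # t is the 1-based step count, used for bias correction.
    m = b1 * m + (1 - b1) * grad                 # first moment: EMA of gradients (momentum)
    v = b2 * v + (1 - b2) * grad ** 2            # second moment: EMA of squared gradients
    m_hat = m / (1 - b1 ** t)                    # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)                    # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive, momentum-like update
    return w, m, v
```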

 

Regularization

  • Early stopping
  • Parameter norm penalty
  • Data augmentation
  • Noise robustness
  • Label smoothing
  • Dropout
  • Batch normalization

Early stopping

  • Note that we need additional validation data to do early stopping.
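
A minimal early-stopping loop sketch. `train_one_epoch` and `validate` are caller-supplied placeholders, not lecture code:

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=100, patience=5):
    # validate() must return a loss measured on held-out validation data.
    best_loss, wait = float("inf"), 0
    for _ in range(max_epochs):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, wait = val_loss, 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:           # no improvement for `patience` epochs in a row
                break                      # stop training early
    return best_loss
```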

Parameter Norm Penalty

  • It adds smoothness to the function space: within the space of functions, the penalty favors smoother functions.
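
A minimal L2 (weight-decay) penalty sketch, assuming NumPy weight arrays and an illustrative coefficient alpha:

```python
import numpy as np

def penalized_loss(loss, weights, alpha=1e-4):
    # Add the squared parameter norm to the data loss; larger alpha pushes toward smoother functions.
    return loss + alpha * sum(np.sum(w ** 2) for w in weights)
```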

Data Augmentation

  • More data are always welcome.
  • However, in most cases, the training data are given in advance.
  • In such cases, we need data augmentation.
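
A typical augmentation pipeline sketch, assuming torchvision is available; the specific transforms are illustrative, not from the lecture:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),      # random left-right flip
    transforms.RandomCrop(32, padding=4),   # random crop with padding
    transforms.ColorJitter(0.2, 0.2, 0.2),  # small brightness/contrast/saturation changes
    transforms.ToTensor(),
])
# Each epoch then sees a slightly different version of every training image.
```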

Noise Robustness

  • Add random noise to inputs or weights.

Label Smoothing

  • Mixup constructs augmented training examples by mixing both the inputs and outputs of two randomly selected training examples.
  • CutMix constructs augmented training examples by mixing inputs with cut-and-paste and outputs with soft labels of two randomly selected training examples.
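
A minimal Mixup sketch, assuming NumPy inputs and one-hot labels (CutMix mixes the labels the same way but pastes a rectangular patch instead of blending whole inputs):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, seed=None):
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)   # mixing coefficient in [0, 1]
    x = lam * x1 + (1 - lam) * x2  # blend the two inputs
    y = lam * y1 + (1 - lam) * y2  # blend the two (one-hot) labels into a soft label
    return x, y
```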

Dropout

  • In each forward pass, randomly set some neurons to zero.
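
A minimal inverted-dropout sketch (names assumed): surviving activations are rescaled so nothing changes at test time.

```python
import numpy as np

def dropout(x, p=0.5, training=True, seed=None):
    if not training or p == 0:
        return x                                              # no dropout at test time
    mask = np.random.default_rng(seed).random(x.shape) >= p   # keep each unit with probability 1 - p
    return x * mask / (1 - p)                                 # rescale to preserve the expected activation
```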

Batch Normalization

  • Batch normalization computes the empirical mean and variance independently for each dimension (layer) and normalizes.
  • There are different variants of normalization.

Batch Norm, Layer Norm, Instance Norm, Group Norm
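
A minimal batch-normalization sketch for a (batch, features) array, assuming learnable scale gamma and shift beta are provided; the other normalizations differ only in which axes the statistics are computed over:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta              # learnable scale and shift
```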