Deep Learning Basics Lecture 3: Optimization


Introduction

  • "Language is the source of misunderstandings." - Antoine de Saint-Exupéry (1900-1944)
  • Gradient Descent
    • First-order iterative optimization algorithm for finding a local minimum of a differentiable function.
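
As a rough illustration (my own toy example, not from the lecture), gradient descent repeatedly steps opposite the gradient:

```python
# Minimal sketch: minimize f(x) = (x - 3)^2 with the first-order update x <- x - lr * f'(x).
def gradient_descent(lr=0.1, steps=100):
    x = 0.0                  # starting point
    for _ in range(steps):
        grad = 2 * (x - 3)   # f'(x)
        x -= lr * grad       # step against the gradient
    return x                 # approaches the local (here global) minimum at x = 3
```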

Important Concepts in Optimization

  • Generalization
  • Under-fitting vs. over-fitting
  • Cross validation
  • Bias-variance tradeoff
  • Bootstrapping
  • Bagging and boosting

Generalization

  • How well the learned model will behave on unseen data.

Underfitting vs. Overfitting

 

Cross-validation

  • Cross-validation is a model validation technique for assessing how the model will generalize to an independent (test) data set.
    • Validation data: used to tune hyperparameters.
    • Cross-validation: cycle through the choice of which fold is the validation fold, average results.

Using the test data at all is already cheating, so only the validation and cross-validation splits should be used for tuning.
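
A minimal k-fold cross-validation sketch. `train` and `evaluate` are placeholder helpers I'm assuming, not functions from the lecture:

```python
import numpy as np

def k_fold_cv(X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]                                      # i-th fold is the validation fold
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])   # remaining folds are training data
        model = train(X[train_idx], y[train_idx])               # assumed helper
        scores.append(evaluate(model, X[val_idx], y[val_idx]))  # assumed helper
    return np.mean(scores)                                      # average over the k validation folds
```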

 

Bias and Variance 

 

We can show that what we are minimizing (the expected cost) decomposes into three parts: bias², variance, and noise.
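
A sketch of that decomposition under squared-error loss, assuming targets t = f(x) + ε with zero-mean noise of variance σ²:

```latex
\mathbb{E}\big[(t - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```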

 

Bootstrapping

  • Bootstrapping is any test or metric that uses random sampling with replacement.

Because the models trained this way can each produce different predictions, bootstrapping is used to see how consistent those predictions are: with the training data fixed, we draw several resampled sets, train a model on each, and then do something with the resulting collection of models.
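
A minimal bootstrapping sketch. `train` is an assumed placeholder for whatever fitting routine is used:

```python
import numpy as np

def bootstrap_models(X, y, n_models=10, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # sample indices with replacement
        models.append(train(X[idx], y[idx]))        # one model per bootstrap resample (assumed helper)
    return models  # compare their predictions to gauge how much they agree
```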

Bagging vs. Boosting

  • Bagging (Bootstrapping aggregating)
    • Multiple models are being trained with bootstrapping
    • ex) Base classifiers are fitted on random subsets of the data, and their individual predictions are aggregated (voting or averaging). See the sketch after this list.
  • Boosting
    • It focuses on those specific training samples that are hard to classify.
    • A strong model is built by combining weak learners in sequence, where each learner learns from the mistakes of the previous weak learner.
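
A rough sketch of both, assuming scikit-learn is available; the toy data and estimator settings are illustrative, not from the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Bagging: base classifiers fitted on bootstrap samples, predictions aggregated by voting.
bagging = BaggingClassifier(n_estimators=10).fit(X, y)

# Boosting: weak learners trained in sequence, each focusing on the previous one's mistakes.
boosting = AdaBoostClassifier(n_estimators=50).fit(X, y)
```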

Gradient Descent Methods

  • Stochastic gradient descent
    • Update with the gradient computed from a single sample.
  • Mini-batch gradient descent
    • Update with the gradient computed from a subset of the data (most commonly used; see the sketch after this list).
  • Batch gradient descent
    • Update with the gradient computed from the whole data.
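
A mini-batch gradient descent sketch for linear least squares (my own illustrative setup, assuming NumPy arrays). Setting batch_size=1 recovers stochastic gradient descent and batch_size=len(X) recovers batch gradient descent:

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))                       # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]               # one mini-batch of indices
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient of squared error on the batch
            w -= lr * grad                                  # update with the batch gradient
    return w
```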

Batch-size Matters

  • "It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize."
  • "We... present numerical evidence that supports the view that large batch methods tend to converge to sharp minimizers of the training and testing functions. In contract, small-batch methods consistently converge to flat minimizers... this is due to the inherent noise in the gradient estimation."

Gradient Descent Methods

  • Stochastic gradient descent
  • Momentum
  • Nesterov accelerated gradient
  • Adagrad
  • Adadelta
  • RMSprop
  • Adam
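
For reference, a sketch of how these optimizers are typically instantiated in PyTorch (assuming PyTorch; the model and hyperparameters are placeholders of my choosing):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)                                # plain SGD
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)                # Momentum
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True) # NAG
# optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
# optimizer = torch.optim.Adadelta(model.parameters())
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```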

Adagrad

  • Adagrad adapts the learning rate, performing larger updates for infrequent and smaller updates for frequent parameters.
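
A minimal Adagrad update sketch (variable names are my own): the running sum of squared gradients shrinks the step for frequently updated parameters.

```python
import numpy as np

def adagrad_step(w, grad, g_sq_sum, lr=0.01, eps=1e-8):
    g_sq_sum = g_sq_sum + grad ** 2                # accumulate squared gradients
    w = w - lr * grad / (np.sqrt(g_sq_sum) + eps)  # rarely updated params get larger steps
    return w, g_sq_sum
```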

Adadelta

  • Adadelta extends Adagrad to reduce its monotonically decreasing learning rate by restricting the accumulation window of past gradients.

There is no learning rate in Adadelta.
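
A sketch of the Adadelta update as I understand Zeiler's formulation (names assumed): the ratio of two running RMS terms replaces an explicit learning rate.

```python
import numpy as np

def adadelta_step(w, grad, g_acc, dx_acc, rho=0.95, eps=1e-6):
    g_acc = rho * g_acc + (1 - rho) * grad ** 2                # EMA of squared gradients (bounded window)
    dx = -np.sqrt(dx_acc + eps) / np.sqrt(g_acc + eps) * grad  # step size comes from past updates, not a learning rate
    dx_acc = rho * dx_acc + (1 - rho) * dx ** 2                # EMA of squared updates
    return w + dx, g_acc, dx_acc
```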

 

RMSprop

  • RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his lecture.
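
A minimal RMSprop sketch (names assumed): an exponential moving average of squared gradients replaces Adagrad's unbounded sum.

```python
import numpy as np

def rmsprop_step(w, grad, g_acc, lr=0.001, rho=0.9, eps=1e-8):
    g_acc = rho * g_acc + (1 - rho) * grad ** 2  # decaying average of squared gradients
    w = w - lr * grad / (np.sqrt(g_acc) + eps)   # adaptive per-parameter step
    return w, g_acc
```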

Adam

  • Adaptive Moment Estimation (Adam) leverages both past gradients and squared gradients.

Adam effectively combines momentum with an adaptive learning-rate approach.
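
A minimal Adam update sketch (names and default hyperparameters assumed): the first moment acts as momentum, the second moment gives the adaptive learning rate, and both are bias-corrected.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # t is the 1-based step count, used for bias correction.
    m = b1 * m + (1 - b1) * grad                 # first moment: EMA of gradients (momentum)
    v = b2 * v + (1 - b2) * grad ** 2            # second moment: EMA of squared gradients
    m_hat = m / (1 - b1 ** t)                    # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)                    # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive, momentum-like update
    return w, m, v
```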

 

Regularization

  • Early stopping
  • Parameter norm penalty
  • Data augmentation
  • Noise robustness
  • Label smoothing
  • Dropout
  • Batch normalization

Early stopping

  • Note that we need additional validation data to do early stopping.
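
A minimal early-stopping loop sketch. `train_one_epoch` and `validate` are caller-supplied placeholders, not lecture code:

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=100, patience=5):
    # validate() must return a loss measured on held-out validation data.
    best_loss, wait = float("inf"), 0
    for _ in range(max_epochs):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, wait = val_loss, 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:           # no improvement for `patience` epochs in a row
                break                      # stop training early
    return best_loss
```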

Parameter Norm Penalty

  • It adds smoothness to the function space: within the space of functions, the penalty favors smoother functions.
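
A minimal L2 (weight-decay) penalty sketch, assuming NumPy weight arrays and an illustrative coefficient alpha:

```python
import numpy as np

def penalized_loss(loss, weights, alpha=1e-4):
    # Add the squared parameter norm to the data loss; larger alpha pushes toward smoother functions.
    return loss + alpha * sum(np.sum(w ** 2) for w in weights)
```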

Data Augmentation

  • More data are always welcome.
  • However, in most cases, the training data are given in advance.
  • In such cases, we need data augmentation.
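
A typical augmentation pipeline sketch, assuming torchvision is available; the specific transforms are illustrative, not from the lecture:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),      # random left-right flip
    transforms.RandomCrop(32, padding=4),   # random crop with padding
    transforms.ColorJitter(0.2, 0.2, 0.2),  # small brightness/contrast/saturation changes
    transforms.ToTensor(),
])
# Each epoch then sees a slightly different version of every training image.
```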

Noise Robustness

  • Add random noise to inputs or weights.

Label Smoothing

  • Mixup constructs augmented training examples by mixing both the inputs and outputs of two randomly selected training examples.
  • CutMix constructs augmented training examples by mixing inputs with cut-and-paste and outputs with soft labels of two randomly selected training examples.
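
A minimal Mixup sketch, assuming NumPy inputs and one-hot labels (CutMix mixes the labels the same way but pastes a rectangular patch instead of blending whole inputs):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, seed=None):
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)   # mixing coefficient in [0, 1]
    x = lam * x1 + (1 - lam) * x2  # blend the two inputs
    y = lam * y1 + (1 - lam) * y2  # blend the two (one-hot) labels into a soft label
    return x, y
```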

Dropout

  • In each forward pass, randomly set some neurons to zero.
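
A minimal inverted-dropout sketch (names assumed): surviving activations are rescaled so nothing changes at test time.

```python
import numpy as np

def dropout(x, p=0.5, training=True, seed=None):
    if not training or p == 0:
        return x                                              # no dropout at test time
    mask = np.random.default_rng(seed).random(x.shape) >= p   # keep each unit with probability 1 - p
    return x * mask / (1 - p)                                 # rescale to preserve the expected activation
```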

Batch Normalization

  • Batch normalization computes the empirical mean and variance independently for each dimension (layer) and normalizes.
  • There are different variants of normalization.

Batch Norm, Layer Norm, Instance Norm, Group Norm
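
A minimal batch-normalization sketch for a (batch, features) array, assuming learnable scale gamma and shift beta are provided; the other normalizations differ only in which axes the statistics are computed over:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta              # learnable scale and shift
```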