Introduction
- "Language is the source of misunderstandings." Antoine de Saint-Exupéry (1900-1944)
- Gradient Descent
- First-order iterative optimization algorithm for finding a local minimum of a differentiable function.
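As a minimal sketch, the update rule w ← w − η∇f(w) on a toy quadratic (the objective, step size, and step count below are made up purely for illustration):

```python
# Toy objective f(w) = (w - 3)^2 and its gradient -- made up for illustration.
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0          # initial parameter
lr = 0.1         # learning rate (step size)
for _ in range(100):
    w = w - lr * grad(w)   # step against the gradient
print(w)         # approaches the minimum at w = 3
```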
Important Concepts in Optimization
- Generalization
- Under-fitting vs. over-fitting
- Cross validation
- Bias-variance tradeoff
- Bootstrapping
- Bagging and boosting
Generalization
- How well the learned model will behave on unseen data.
Underfitting vs. Overfitting
Cross-validation
- Cross-validation is a model validation technique for assessing how the model will generalize to an independent (test) data set.
- Validation data: used to tune hyperparameters
- Cross-validation: cycle through the choice of which fold is the validation fold, average results.
- Using the test data for tuning is itself cheating, so hyperparameters should be tuned only with the validation (cross-validation) data.
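A rough k-fold cross-validation loop; `train_model` and `evaluate` are hypothetical placeholders standing in for a real training setup, not functions from the lecture:

```python
import numpy as np

def k_fold_cv(X, y, k, train_model, evaluate, seed=0):
    """Cycle through folds: each fold serves once as the validation set; average the scores."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        trn_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_model(X[trn_idx], y[trn_idx])
        scores.append(evaluate(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores))   # averaged validation score
```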
Bias and Variance
- We can derive that what we are minimizing (the expected cost) decomposes into three parts: bias², variance, and noise.
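In the standard textbook form (assuming targets t = f(x) + ε with noise variance σ²; this notation is not taken from the slide), the decomposition reads:

```latex
\mathbb{E}\!\left[(t - \hat{f})^{2}\right]
  = \underbrace{\left(f - \mathbb{E}[\hat{f}]\right)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f} - \mathbb{E}[\hat{f}]\right)^{2}\right]}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{noise}}
```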
Bootstrapping
- Bootstrapping is any test or metric that uses random sampling with replacement.
- Because the models built this way can each produce different predictions, bootstrapping is used to see how consistent those predictions are: with the training data fixed, draw several resampled datasets, train a model on each, and then do something with the resulting collection of models.
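A minimal sketch of drawing one bootstrap resample with NumPy (array shapes and the seed are assumptions for illustration):

```python
import numpy as np

def bootstrap_sample(X, y, seed=0):
    """Draw len(X) indices with replacement: one bootstrapped training set."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]
```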
Bagging vs. Boosting
- Bagging (Bootstrap aggregating)
- Multiple models are being trained with bootstrapping
- e.g., base classifiers are fitted on random subsets of the data, and their individual predictions are aggregated by voting or averaging (see the sketch after this list).
- Boosting
- It focuses on those specific training samples that are hard to classify.
- A strong model is built by combining weak learners in sequence, where each learner learns from the mistakes of the previous one.
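A sketch of bagging by averaging; `train_model` is a hypothetical function returning an object with a `predict` method, so only the resample-and-aggregate structure is the point here:

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, train_model, n_models=10, seed=0):
    """Train models on bootstrapped resamples and aggregate their predictions by averaging."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # sample with replacement
        model = train_model(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)   # aggregation by averaging (use voting for class labels)
```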
Gradient Descent Methods
- Stochastic gradient descent
- Update with the gradient computed from a single sample.
- Mini-batch gradient descent
- Update with the gradient computed from a subset of the data (the most commonly used option in practice; a sketch follows the list).
- Batch gradient descent
- Update with the gradient computed from the whole data.
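A sketch of the mini-batch variant; `grad_fn` is a hypothetical function returning the average gradient over a batch. Setting `batch_size=1` recovers stochastic gradient descent and `batch_size=len(X)` recovers batch gradient descent:

```python
import numpy as np

def minibatch_sgd(w, X, y, grad_fn, lr=0.01, batch_size=64, epochs=10):
    """grad_fn(w, X_batch, y_batch) is assumed to return the average gradient for the batch."""
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)             # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - lr * grad_fn(w, X[batch], y[batch])
    return w
```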
Batch-size Matters
- "It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize."
- "We... present numerical evidence that supports the view that large batch methods tend to converge to sharp minimizers of the training and testing functions. In contract, small-batch methods consistently converge to flat minimizers... this is due to the inherent noise in the gradient estimation."
Gradient Descent Methods
- Stochastic gradient descent
- Momentum
- Nesterov accelerated gradient
- Adagrad
- Adadelta
- RMSprop
- Adam
Adagrad
- Adagrad adapts the learning rate, performing larger updates for infrequent and smaller updates for frequent parameters.
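A sketch of one Adagrad step (g is the current gradient; the squared-gradient accumulator G and the stability constant ε are standard ingredients, and the default values are assumed):

```python
import numpy as np

def adagrad_update(w, g, G, lr=0.01, eps=1e-8):
    """G accumulates squared gradients; frequently updated coordinates get smaller steps."""
    G = G + g * g
    w = w - lr * g / (np.sqrt(G) + eps)
    return w, G
```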
Adadelta
- Adadelta extends Adagrad to counteract its monotonically decreasing learning rate by restricting the accumulation window.
- There is no explicit learning rate in Adadelta.
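A sketch of one Adadelta step following Zeiler's formulation, keeping exponential running averages of squared gradients and squared updates (ρ and ε below are conventional defaults, not values from the lecture):

```python
import numpy as np

def adadelta_update(w, g, Eg2, Edx2, rho=0.9, eps=1e-6):
    """No global learning rate: the step size comes from the ratio of running RMS values."""
    Eg2 = rho * Eg2 + (1 - rho) * g * g                   # running avg of squared gradients
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g    # parameter update
    Edx2 = rho * Edx2 + (1 - rho) * dx * dx               # running avg of squared updates
    return w + dx, Eg2, Edx2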
RMSprop
- RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his lecture.
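A sketch of one RMSprop step, which replaces Adagrad's ever-growing sum with an exponential moving average (the defaults shown are common choices, not values from the lecture):

```python
import numpy as np

def rmsprop_update(w, g, Eg2, lr=0.001, gamma=0.9, eps=1e-8):
    """Exponential moving average of squared gradients scales the step per coordinate."""
    Eg2 = gamma * Eg2 + (1 - gamma) * g * g
    w = w - lr * g / (np.sqrt(Eg2) + eps)
    return w, Eg2
```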
Adam
- Adaptive Moment Estimation (Adam) leverages both past gradients and squared gradients.
- Adam effectively combines momentum with the adaptive learning rate approach.
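A sketch of one Adam step (t is the 1-based step count; the β₁, β₂, ε defaults follow the original paper):

```python
import numpy as np

def adam_update(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum (m) on the gradient plus an adaptive step from the squared gradient (v)."""
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)             # bias correction, t >= 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```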
Regularization
- Early stopping
- Parameter norm penalty
- Data augmentation
- Noise robustness
- Label smoothing
- Dropout
- Batch normalization
Early stopping
- Note that we need additional validation data to do early stopping
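A rough early-stopping loop; `train_one_epoch`, `validation_loss`, and `snapshot` are hypothetical helpers standing in for a real training setup:

```python
def train_with_early_stopping(model, train_one_epoch, validation_loss, snapshot,
                              patience=5, max_epochs=100):
    """Stop when the validation loss has not improved for `patience` consecutive epochs."""
    best_loss, best_state, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)               # one pass over the training data
        val_loss = validation_loss(model)    # loss on the held-out validation set
        if val_loss < best_loss:
            best_loss, best_state, wait = val_loss, snapshot(model), 0
        else:
            wait += 1
            if wait >= patience:             # no improvement for `patience` epochs
                break
    return best_state, best_loss
```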
Parameter Norm Penalty
- It adds smoothness to the function space: among the functions that fit the data, prefer a smoother one.
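A sketch of adding an L2 (weight decay) penalty, assuming the data loss and its gradient have already been computed elsewhere:

```python
import numpy as np

def penalized_loss_and_grad(w, data_loss, data_grad, lam=1e-4):
    """Total loss = data loss + (lambda/2)*||w||^2; the gradient gains a lambda*w term."""
    loss = data_loss + 0.5 * lam * np.sum(w * w)
    grad = data_grad + lam * w
    return loss, grad
```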
Data Augmentation
- More data are always welcome.
- However, in most cases, training data are given in advance.
- In such cases, we need data augmentation.
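A tiny label-preserving augmentation example (random horizontal flips), assuming an image batch shaped (N, H, W, C); the flip probability is an arbitrary choice:

```python
import numpy as np

def random_hflip(images, p=0.5, seed=0):
    """Flip each image left-right with probability p; labels stay unchanged."""
    rng = np.random.default_rng(seed)
    flip = rng.random(len(images)) < p
    out = images.copy()
    out[flip] = out[flip][:, :, ::-1, :]   # reverse the width axis
    return out
```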
Noise Robustness
- Add random noise to inputs or weights.
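A minimal sketch of noise injection applied to an input (or weight) array; the noise scale is an arbitrary choice:

```python
import numpy as np

def add_gaussian_noise(x, std=0.1, seed=0):
    """Perturb inputs (or weights) with small Gaussian noise during training."""
    rng = np.random.default_rng(seed)
    return x + rng.normal(0.0, std, size=x.shape)
```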
Label Smoothing
- Mixup constructs augmented training examples by mixing both the inputs and the outputs of two randomly selected training examples.
- CutMix constructs augmented training examples by cutting and pasting patches between inputs and mixing the outputs as soft labels of two randomly selected training examples.
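Sketches of label smoothing and mixup on one-hot labels; the smoothing factor and the Beta(α, α) mixing ratio are conventional choices, not values taken from the lecture:

```python
import numpy as np

def smooth_labels(onehot, eps=0.1):
    """Move eps of the probability mass from the true class to all classes uniformly."""
    k = onehot.shape[-1]
    return onehot * (1 - eps) + eps / k

def mixup(x1, y1, x2, y2, alpha=0.2, seed=0):
    """Mix both the inputs and the (one-hot) labels of two randomly paired examples."""
    lam = np.random.default_rng(seed).beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```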
Dropout
- In each forward pass, randomly set some neurons to zero.
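A sketch of inverted dropout for one layer's activations (the drop probability and the rescaling convention are the usual ones):

```python
import numpy as np

def dropout_forward(x, p=0.5, training=True, seed=0):
    """Zero activations with probability p and rescale so the expected value is unchanged."""
    if not training:
        return x                                   # no-op at test time
    rng = np.random.default_rng(seed)
    mask = (rng.random(x.shape) >= p) / (1.0 - p)  # keep mask with inverted scaling
    return x * mask
```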
Batch Normalization
- Batch normalization computes the empirical mean and variance independently for each dimension and normalizes with them.
- There are different variants of normalization:
- Batch Norm, Layer Norm, Instance Norm, Group Norm
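A sketch of the batch-norm forward pass for a 2-D activation shaped (batch, features); γ and β are the learnable scale and shift:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature with the batch mean/variance, then scale and shift."""
    mean = x.mean(axis=0)                    # statistics computed over the batch axis
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

Layer norm computes the same statistics over the feature axis of each sample instead of over the batch, which is the essential difference among the listed variants.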