
| Name | Advantages | Disadvantages |
|---|---|---|
| BGD | The principle of gradient descent is simple | (1) Computation is very slow (2) Difficult to handle large datasets (3) New data cannot be added to update the model |
| SGD | (1) Trains faster than BGD (2) New data can be added to update the model | Frequent updates may cause severe oscillations in the loss function |
| Momentum | (1) Considers the velocity of the previous step together with the new gradient (2) Speeds up convergence and suppresses oscillation | |
| AdaGrad | (1) Adds an adaptive denominator compared with SGD (2) Handles parameters whose gradients are updated infrequently | If gradient updates are frequent, subsequent updates may become slow or vanish |
| RMSprop | (1) Similar to momentum, it reduces fluctuations (2) Overcomes the sharp decrease or disappearance of the gradient in AdaGrad (3) Performs better than SGD, momentum, and AdaGrad on nonstationary objective functions | |
| Adam | (1) Combines momentum and RMSprop (2) Integrates gradient descent, momentum, AdaGrad, and RMSprop with further improvements (3) Easy to use, insensitive to gradient scaling, suitable for large and sparse data, and has easily tuned hyperparameters | |

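To make the differences in the table concrete, the following is a minimal NumPy sketch of the per-parameter update rules behind these optimizers on a simple quadratic objective. The test function, hyperparameter values, and step counts are illustrative assumptions, not values from the table; with a deterministic gradient, BGD and SGD coincide, since SGD's stochasticity comes from minibatching, which is omitted here.

```python
import numpy as np

# Illustrative quadratic objective f(w) = 0.5 * w^T A w, with gradient A w.
A = np.diag([10.0, 1.0])              # deliberately ill-conditioned
grad = lambda w: A @ w

def sgd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)          # plain gradient step
    return w

def momentum(w, lr=0.1, beta=0.9, steps=100):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)        # velocity accumulates previous steps
        w = w - lr * v
    return w

def adagrad(w, lr=0.5, eps=1e-8, steps=100):
    s = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        s += g * g                    # running sum of squared gradients (the added denominator)
        w = w - lr * g / (np.sqrt(s) + eps)
    return w

def rmsprop(w, lr=0.05, beta=0.9, eps=1e-8, steps=100):
    s = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        s = beta * s + (1 - beta) * g * g   # decaying average avoids AdaGrad's vanishing steps
        w = w - lr * g / (np.sqrt(s) + eps)
    return w

def adam(w, lr=0.05, b1=0.9, b2=0.999, eps=1e-8, steps=100):
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g          # momentum-style first moment
        v = b2 * v + (1 - b2) * g * g      # RMSprop-style second moment
        m_hat = m / (1 - b1 ** t)          # bias correction for the zero initialization
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([5.0, 5.0])
for name, opt in [("SGD", sgd), ("Momentum", momentum),
                  ("AdaGrad", adagrad), ("RMSprop", rmsprop), ("Adam", adam)]:
    print(name, opt(w0.copy()))
```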