Learning rate scheduler

When training a model with a method such as gradient descent, the learning rate is usually chosen to balance training speed against the final loss. However, if a fixed learning rate is used throughout training, the training loss stops declining once it has fallen to a certain level and instead 'jumps' within a certain range. The figure below shows why. As the loss approaches a local minimum, a learning rate that is too large makes each update step too big, so the parameter updates repeatedly jump over the local minimum and the loss oscillates.

../../../_images/learning_rate_scheduler.png
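
The jumping behaviour described above can be reproduced with a few lines of plain Python. This is only a minimal sketch under stated assumptions: the toy loss f(x) = |x|, the starting point, and the helper names `sign` and `descend_fixed` are illustrative and are not part of the library.

```python
def sign(x):
    """Gradient of the toy loss f(x) = |x| (taken as 0 at x == 0)."""
    return (x > 0) - (x < 0)


def descend_fixed(x0, lr, num_steps):
    """Run gradient descent on f(x) = |x| with a constant learning rate."""
    x = x0
    trajectory = [x]
    for _ in range(num_steps):
        x = x - lr * sign(x)  # every update moves by exactly lr
        trajectory.append(x)
    return trajectory


trajectory = descend_fixed(x0=1.0, lr=0.3, num_steps=10)
losses = [abs(x) for x in trajectory]

# After the first few steps the parameter keeps crossing the minimum at 0,
# and the loss alternates between roughly 0.1 and 0.2 instead of shrinking.
for step, (x, loss) in enumerate(zip(trajectory, losses)):
    print(f"step {step:2d}  x = {x:+.3f}  loss = {loss:.3f}")
```

Because every update has the same size, the parameter can never get closer to the minimum than a fraction of the step size, which is the oscillation shown in the figure.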

The learning rate scheduler defines commonly used learning rate decay strategies that generate the learning rate dynamically. A learning rate decay function takes the epoch or step as its argument and returns a learning rate that gradually decreases as training proceeds. A large learning rate early in training shortens the training time, while a small learning rate later on lets the parameters settle at the local minimum.
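
As a rough illustration of a decay function that takes the step and returns a learning rate, the sketch below applies one to gradient descent on the toy loss f(x) = |x|. The function names, the base rate of 0.3, and the decay factor of 0.8 are assumptions chosen for the example; they do not come from the library.

```python
def sign(x):
    """Gradient of the toy loss f(x) = |x| (taken as 0 at x == 0)."""
    return (x > 0) - (x < 0)


def decayed_lr(step, base_lr=0.3, decay_rate=0.8):
    """Hypothetical schedule: shrink the learning rate by decay_rate each step."""
    return base_lr * decay_rate ** step


def descend_scheduled(x0, num_steps, schedule):
    """Gradient descent on f(x) = |x|, asking the schedule for each step's rate."""
    x = x0
    losses = [abs(x)]
    for step in range(num_steps):
        x = x - schedule(step) * sign(x)
        losses.append(abs(x))
    return losses


scheduled_losses = descend_scheduled(x0=1.0, num_steps=40, schedule=decayed_lr)
print(f"initial loss = {scheduled_losses[0]:.4f}")
print(f"final loss   = {scheduled_losses[-1]:.6f}")
```

The early steps are large and cover most of the distance quickly. The later steps are small, so the overshoot around the minimum shrinks together with the learning rate. A decay rate that is too aggressive could stall before reaching the minimum, so in practice the rate is a tuning choice.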

The following content describes the APIs related to the learning rate scheduler:


  • noam_decay: Noam decay. For the algorithm, please refer to Attention Is All You Need. For the API reference, please refer to noam_decay
  • exponential_decay: Exponential decay. At each decay, the current learning rate is multiplied by the given decay rate to get the next learning rate. For the API reference, please refer to exponential_decay
  • natural_exp_decay: Natural exponential decay. At each decay, the current learning rate is multiplied by the natural exponential of the negative decay rate, exp(-decay_rate), to get the next learning rate. For the API reference, please refer to natural_exp_decay
  • inverse_time_decay: Inverse time decay. The decayed learning rate decreases in inverse proportion to the number of decays performed so far. For the API reference, please refer to inverse_time_decay
  • polynomial_decay: Polynomial decay. The decayed learning rate is computed by a polynomial function that moves from the initial learning rate to the end learning rate. For the API reference, please refer to polynomial_decay
  • piecewise_decay: Piecewise decay. The learning rate decays in a stair-like fashion at given step boundaries and stays constant within each interval. For the API reference, please refer to piecewise_decay
  • append_LARS: The learning rate is obtained by the Layer-wise Adaptive Rate Scaling (LARS) algorithm. For the algorithm, please refer to Train Feed forward Neural Network with Layerwise Adaptive Rate via Approximating Back-matching Propagation. For the API reference, please refer to append_LARS
  • cosine_decay: Cosine decay. The learning rate follows a cosine function of the number of steps. For the API reference, please refer to cosine_decay
  • linear_lr_warmup: Linear warmup. The learning rate increases linearly with the number of steps until it reaches a specified rate. For the API reference, please refer to linear_lr_warmup
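
To make the strategies listed above more concrete, here is a plain-Python sketch of how each schedule might map a step to a learning rate. It does not call the library. The function signatures, parameter names, and exact formulas are assumptions based on the descriptions above and on the commonly used definitions of these schedules, so please check the API reference of each function for the authoritative form. append_LARS is left out because it depends on per-layer parameter and gradient norms rather than on the step alone.

```python
import bisect
import math

# NOTE: illustrative sketches only; signatures and formulas are assumptions,
# not the library's implementation.


def exponential_decay(lr, step, decay_steps, decay_rate, staircase=False):
    """Multiply lr by decay_rate once every decay_steps steps."""
    ratio = step / decay_steps
    if staircase:
        ratio = math.floor(ratio)  # decay in discrete jumps
    return lr * decay_rate ** ratio


def natural_exp_decay(lr, step, decay_steps, decay_rate):
    """Multiply lr by exp(-decay_rate) once every decay_steps steps."""
    return lr * math.exp(-decay_rate * step / decay_steps)


def inverse_time_decay(lr, step, decay_steps, decay_rate):
    """Divide lr by a factor that grows linearly with the step."""
    return lr / (1.0 + decay_rate * step / decay_steps)


def polynomial_decay(lr, step, decay_steps, end_lr=0.0001, power=1.0):
    """Move from lr to end_lr along a polynomial of the given power."""
    progress = min(step, decay_steps) / decay_steps
    return (lr - end_lr) * (1.0 - progress) ** power + end_lr


def piecewise_decay(step, boundaries, values):
    """Constant rate per interval; needs len(values) == len(boundaries) + 1."""
    return values[bisect.bisect_right(boundaries, step)]


def cosine_decay(lr, step, steps_per_epoch, epochs):
    """Follow half a cosine wave from lr down to 0 over the given epochs."""
    epoch = step // steps_per_epoch
    return lr * 0.5 * (math.cos(epoch * math.pi / epochs) + 1.0)


def noam_decay(d_model, warmup_steps, step):
    """Linear warmup followed by inverse square root decay (step starts at 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def linear_lr_warmup(lr, step, warmup_steps, start_lr, end_lr):
    """Rise linearly from start_lr to end_lr, then hand over to lr."""
    if step < warmup_steps:
        return start_lr + (end_lr - start_lr) * step / warmup_steps
    return lr


# Compare a few schedules at the same steps, starting from lr = 0.1.
for step in (0, 100, 200, 400):
    print(step,
          exponential_decay(0.1, step, 100, 0.5),
          natural_exp_decay(0.1, step, 100, 0.5),
          inverse_time_decay(0.1, step, 100, 0.5),
          piecewise_decay(step, [100, 200], [0.1, 0.05, 0.01]))
```

Writing each schedule as a pure function of the step makes the shapes easy to compare and plot before committing to one for training.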