fluid.optimizer

Adadelta

paddle.fluid.optimizer.Adadelta

alias of AdadeltaOptimizer

Adagrad

paddle.fluid.optimizer.Adagrad

alias of AdagradOptimizer

AdagradOptimizer

class paddle.fluid.optimizer.AdagradOptimizer(learning_rate, epsilon=1e-06, regularization=None, name=None, initial_accumulator_value=0.0)

Adaptive Gradient Algorithm (Adagrad)

The update is done as follows:

\[ \begin{align}\begin{aligned}moment\_out &= moment + grad * grad\\param\_out &= param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}\end{aligned}\end{align} \]

The original paper (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have the epsilon attribute. It is added in this implementation, as also proposed at http://cs231n.github.io/neural-networks-3/#ada, for numerical stability: it prevents division by zero when the accumulated moment is zero.
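The update rule above can be traced by hand with a minimal scalar sketch in plain Python. This is an illustration of the formulas only, not the library's kernel; the function name adagrad_step and the sample values are assumptions made for this example.

```python
import math

def adagrad_step(param, grad, moment, learning_rate, epsilon=1e-6):
    """One scalar Adagrad update following the two formulas above."""
    moment_out = moment + grad * grad
    param_out = param - learning_rate * grad / (math.sqrt(moment_out) + epsilon)
    return param_out, moment_out

# Two steps with a constant gradient: the second step is smaller because
# the accumulated squared gradient has grown.
p0, m0 = 1.0, 0.0
p1, m1 = adagrad_step(p0, grad=0.5, moment=m0, learning_rate=0.2)
p2, m2 = adagrad_step(p1, grad=0.5, moment=m1, learning_rate=0.2)

# With a zero gradient and a zero accumulator, epsilon keeps the division defined.
p_same, _ = adagrad_step(1.0, grad=0.0, moment=0.0, learning_rate=0.2)
```

Because the accumulator only grows, the effective step size shrinks over time for a parameter that keeps receiving gradients.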

Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • epsilon (float) – a small float value for numerical stability.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.
  • initial_accumulator_value (float) – Initial value for moment accumulator.

Examples

import paddle.fluid as fluid
import numpy as np

np_inp = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
inp = fluid.layers.data(
    name="inp", shape=[2, 2], append_batch_size=False)
out = fluid.layers.fc(inp, size=3)
out = fluid.layers.reduce_sum(out)
optimizer = fluid.optimizer.Adagrad(learning_rate=0.2)
optimizer.minimize(out)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
exe.run(
    feed={"inp": np_inp},
    fetch_list=[out.name])
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:params_grads (list) – list of (param, grad) pair to do optimization.
Returns:A list of operators appended to the current program.
Return type:list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • params_grads (list) – list of (param, grad) pair to do optimization.
Returns:

A list of operators appended to the current program.

Return type:

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

First part of minimize, do auto-diff to append backward ops for the current program.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
Returns:

list of (param, grad) pair, grad is the output of backward.

Return type:

list

Examples

See examples in apply_gradients.

load(stat_dict)

Load an optimizer that uses learning rate decay, in dygraph mode.

Parameters:stat_dict – the dict loaded by the load_persistables method.
Returns:None

Examples:

from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)

        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
        learning_rate=0.1,
        decay_steps=10000,
        decay_rate=0.5,
        staircase=True))

    train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
                [x[0].reshape(1, 28, 28) for x in data]).astype('float32')

        y_data = np.array([x[1] for x in data]).astype('int64').reshape(
                128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        # optimizer2 is the only optimizer defined in this example
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
                mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
            break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
            learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables(
        "save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
# The restored learning-rate schedule should match the one that was saved.
assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

This method combines interface backward() and apply_gradients() into one.
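The reason for exposing the two phases separately is that the (param, grad) pairs can be modified between them. The sketch below shows the pattern with a toy optimizer in plain Python. ToySGD, its finite-difference backward, and the clipping range are all hypothetical and are not part of paddle.fluid; only the split into backward, apply_gradients and minimize mirrors the interface described here.

```python
class ToySGD:
    """Hypothetical stand-in for an optimizer; not part of paddle.fluid."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def backward(self, loss_fn, params, h=1e-6):
        # First phase: produce (param_name, grad) pairs, here by central differences.
        pairs = []
        for name in params:
            up = dict(params, **{name: params[name] + h})
            down = dict(params, **{name: params[name] - h})
            pairs.append((name, (loss_fn(up) - loss_fn(down)) / (2 * h)))
        return pairs

    def apply_gradients(self, params, params_grads):
        # Second phase: consume the pairs and update the parameters in place.
        for name, grad in params_grads:
            params[name] -= self.learning_rate * grad

    def minimize(self, loss_fn, params):
        # Both phases in one call, with nothing in between.
        params_grads = self.backward(loss_fn, params)
        self.apply_gradients(params, params_grads)
        return params_grads


def loss_fn(params):
    return (params["w"] - 3.0) ** 2

opt = ToySGD(learning_rate=0.1)

# One-call path.
a = {"w": 0.0}
opt.minimize(loss_fn, a)

# Two-phase path, clipping each gradient to [-1, 1] in between.
b = {"w": 0.0}
pairs = opt.backward(loss_fn, b)
clipped = [(name, max(-1.0, min(1.0, g))) for name, g in pairs]
opt.apply_gradients(b, clipped)
```

The one-call path takes the full gradient step, while the two-phase path takes the clipped one, so the two dictionaries end up with different values of w.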

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • grad_clip (GradClipBase|None) – gradient clipping strategy.
Returns:

(optimize_ops, params_grads): the list of operators appended, and the list of (param, grad) Variable pairs used for optimization.

Return type:

tuple

Adam

paddle.fluid.optimizer.Adam

alias of AdamOptimizer

Adamax

paddle.fluid.optimizer.Adamax

alias of AdamaxOptimizer

AdamaxOptimizer

class paddle.fluid.optimizer.AdamaxOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, regularization=None, name=None)

We implement the Adamax optimizer from Section 7 of the Adam paper: https://arxiv.org/abs/1412.6980. Adamax is a variant of the Adam algorithm based on the infinity norm.

Adamax updates:

\[ \begin{align}\begin{aligned}t & = t + 1\\moment\_out & = {\beta}_1 * moment + (1 - {\beta}_1) * grad\\inf\_norm\_out & = max({\beta}_2 * inf\_norm + \epsilon, |grad|)\\learning\_rate & = \frac{learning\_rate}{1 - {\beta}_1^t}\\param\_out & = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out}\end{aligned}\end{align} \]

The original paper does not have an epsilon attribute. However, it is added here for numerical stability to prevent the division by 0 error.
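A minimal scalar sketch of the Adamax formulas above, in plain Python. The function name adamax_step and the local name lr_t (the bias-corrected learning rate, which the formulas write back into learning_rate) are assumptions made for this illustration; this is not the library's kernel.

```python
def adamax_step(param, grad, moment, inf_norm, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One scalar Adamax update following the formulas above."""
    t_out = t + 1
    moment_out = beta1 * moment + (1 - beta1) * grad
    # epsilon keeps inf_norm_out strictly positive, so the division below is safe
    inf_norm_out = max(beta2 * inf_norm + epsilon, abs(grad))
    lr_t = learning_rate / (1 - beta1 ** t_out)
    param_out = param - lr_t * moment_out / inf_norm_out
    return param_out, moment_out, inf_norm_out, t_out

# First step from a fresh state.
p1, moment1, inf_norm1, t1 = adamax_step(
    param=1.0, grad=0.5, moment=0.0, inf_norm=0.0, t=0)

# A zero gradient on a fresh state leaves the parameter unchanged.
p_zero, _, inf_norm_zero, _ = adamax_step(
    param=1.0, grad=0.0, moment=0.0, inf_norm=0.0, t=0)
```

On the first step the bias correction cancels the (1 - beta1) factor, so the parameter moves by roughly learning_rate in the direction opposite to the gradient.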

Examples

import paddle.fluid as fluid
import numpy

# First create the Executor.
place = fluid.CPUPlace() # fluid.CUDAPlace(0)
exe = fluid.Executor(place)

train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
    data = fluid.layers.data(name='X', shape=[1], dtype='float32')
    hidden = fluid.layers.fc(input=data, size=10)
    loss = fluid.layers.mean(hidden)
    adamax = fluid.optimizer.Adamax(learning_rate=0.2)
    adamax.minimize(loss)

# Run the startup program once and only once.
exe.run(startup_program)

x = numpy.random.random(size=(10, 1)).astype('float32')
outs = exe.run(program=train_program,
               feed={'X': x},
               fetch_list=[loss.name])
Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • beta1 (float) – The exponential decay rate for the 1st moment estimates.
  • beta2 (float) – The exponential decay rate for the 2nd moment estimates.
  • epsilon (float) – a small float value for numerical stability.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.

Notes

Currently, AdamaxOptimizer doesn’t support sparse parameter optimization.

apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:params_grads (list) – list of (param, grad) pair to do optimization.
Returns:A list of operators appended to the current program.
Return type:list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • params_grads (list) – list of (param, grad) pair to do optimization.
Returns:

A list of operators appended to the current program.

Return type:

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

First part of minimize, do auto-diff to append backward ops for the current program.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
Returns:

list of (param, grad) pair, grad is the output of backward.

Return type:

list

Examples

See examples in apply_gradients.

load(stat_dict)

Load an optimizer that uses learning rate decay, in dygraph mode.

Parameters:stat_dict – the dict loaded by the load_persistables method.
Returns:None

Examples:

from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)

        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
        learning_rate=0.1,
        decay_steps=10000,
        decay_rate=0.5,
        staircase=True))

    train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
                [x[0].reshape(1, 28, 28) for x in data]).astype('float32')

        y_data = np.array([x[1] for x in data]).astype('int64').reshape(
                128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        # optimizer2 is the only optimizer defined in this example
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
                mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
            break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
            learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables(
        "save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
# The restored learning-rate schedule should match the one that was saved.
assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

This method combines interface backward() and apply_gradients() into one.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • grad_clip (GradClipBase|None) – gradient clipping strategy.
Returns:

(optimize_ops, params_grads): the list of operators appended, and the list of (param, grad) Variable pairs used for optimization.

Return type:

tuple

AdamOptimizer

class paddle.fluid.optimizer.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, regularization=None, name=None, lazy_mode=False)

This implements the Adam optimizer from Section 2 of the Adam paper: https://arxiv.org/abs/1412.6980. Adam is a first-order gradient-based optimization method based on adaptive estimates of lower-order moments.

Adam updates:

\[ \begin{align}\begin{aligned}t & = t + 1\\moment\_1\_out & = {\beta}_1 * moment\_1 + (1 - {\beta}_1) * grad\\moment\_2\_out & = {\beta}_2 * moment\_2 + (1 - {\beta}_2) * grad * grad\\learning\_rate & = learning\_rate * \frac{\sqrt{1 - {\beta}_2^t}}{1 - {\beta}_1^t}\\param\_out & = param - learning\_rate * \frac{moment\_1\_out}{\sqrt{moment\_2\_out} + \epsilon}\end{aligned}\end{align} \]
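A minimal scalar sketch of the Adam update in plain Python. It assumes the parameter update uses the freshly updated moments (moment_1_out, moment_2_out), and it keeps the bias-corrected rate in a local name, lr_t, rather than overwriting learning_rate. The function name adam_step is hypothetical; this is an illustration, not the library's kernel.

```python
import math

def adam_step(param, grad, moment_1, moment_2, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One scalar Adam update following the formulas above."""
    t_out = t + 1
    moment_1_out = beta1 * moment_1 + (1 - beta1) * grad
    moment_2_out = beta2 * moment_2 + (1 - beta2) * grad * grad
    # bias correction folded into the step size
    lr_t = learning_rate * math.sqrt(1 - beta2 ** t_out) / (1 - beta1 ** t_out)
    param_out = param - lr_t * moment_1_out / (math.sqrt(moment_2_out) + epsilon)
    return param_out, moment_1_out, moment_2_out, t_out

# First step from a fresh state.
p1, m1_1, m2_1, t1 = adam_step(
    param=1.0, grad=0.5, moment_1=0.0, moment_2=0.0, t=0)
```

On the first step the two bias-correction factors cancel the (1 - beta) terms, so the parameter moves by approximately learning_rate regardless of the gradient's magnitude.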
Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • beta1 (float) – The exponential decay rate for the 1st moment estimates.
  • beta2 (float) – The exponential decay rate for the 2nd moment estimates.
  • epsilon (float) – a small float value for numerical stability.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.
  • lazy_mode (bool, default False) – The official Adam algorithm has two moving-average accumulators, which are updated at every step. In both dense mode and sparse mode, every element of the two moving averages is updated. If the parameter is very large, the update may be very slow. Lazy mode only updates the elements that have a gradient in the current mini-batch, so it is much faster. However, this mode has different semantics from the original Adam algorithm and may lead to different results.
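The difference between the default behaviour and lazy_mode can be pictured with a per-row sketch in plain Python. It assumes a sparse gradient is a mapping from row index to gradient value, and that without lazy mode a row missing from that mapping is treated as having a zero gradient. The names adam_rows and run are hypothetical; this illustrates the semantics described above and is not the library's implementation.

```python
import math

def adam_rows(params, sparse_grad, m1, m2, t, lazy_mode,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Update per-row Adam state in place; sparse_grad maps row index -> gradient."""
    lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    for row in range(len(params)):
        if lazy_mode and row not in sparse_grad:
            continue  # lazy mode: rows without a gradient are left untouched
        g = sparse_grad.get(row, 0.0)  # otherwise a missing gradient counts as 0
        m1[row] = beta1 * m1[row] + (1 - beta1) * g
        m2[row] = beta2 * m2[row] + (1 - beta2) * g * g
        params[row] -= lr_t * m1[row] / (math.sqrt(m2[row]) + epsilon)

def run(lazy_mode):
    params, m1, m2 = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
    # First mini-batch touches both rows.
    adam_rows(params, {0: 0.5, 1: 0.5}, m1, m2, t=1, lazy_mode=lazy_mode)
    # Second mini-batch touches only row 0.
    adam_rows(params, {0: 0.5}, m1, m2, t=2, lazy_mode=lazy_mode)
    return params, m1

dense_params, dense_m1 = run(lazy_mode=False)
lazy_params, lazy_m1 = run(lazy_mode=True)
```

Row 0 ends up identical in both runs. Row 1 differs: without lazy mode its moments decay and the parameter keeps moving on the second step, while with lazy mode it is skipped entirely, which is why the two modes can produce different results.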

Examples

import paddle
import paddle.fluid as fluid

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    adam_optimizer = fluid.optimizer.AdamOptimizer(0.01)
    adam_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:params_grads (list) – list of (param, grad) pair to do optimization.
Returns:A list of operators appended to the current program.
Return type:list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • params_grads (list) – list of (param, grad) pair to do optimization.
Returns:

A list of operators appended to the current program.

Return type:

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

First part of minimize, do auto-diff to append backward ops for the current program.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
Returns:

list of (param, grad) pair, grad is the output of backward.

Return type:

list

Examples

See examples in apply_gradients.

load(stat_dict)

Load an optimizer that uses learning rate decay, in dygraph mode.

Parameters:stat_dict – the dict loaded by the load_persistables method.
Returns:None

Examples:

from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)

        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
        learning_rate=0.1,
        decay_steps=10000,
        decay_rate=0.5,
        staircase=True))

    train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
                [x[0].reshape(1, 28, 28) for x in data]).astype('float32')

        y_data = np.array([x[1] for x in data]).astype('int64').reshape(
                128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        # optimizer2 is the only optimizer defined in this example
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
                mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
            break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
            learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables(
        "save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
# The restored learning-rate schedule should match the one that was saved.
assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

This method combines interface backward() and apply_gradients() into one.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • grad_clip (GradClipBase|None) – gradient clipping strategy.
Returns:

(optimize_ops, params_grads): the list of operators appended, and the list of (param, grad) Variable pairs used for optimization.

Return type:

tuple

DecayedAdagrad

paddle.fluid.optimizer.DecayedAdagrad

alias of DecayedAdagradOptimizer

DecayedAdagradOptimizer

class paddle.fluid.optimizer.DecayedAdagradOptimizer(learning_rate, decay=0.95, epsilon=1e-06, regularization=None, name=None)

Decayed Adagrad Optimizer

Original paper: http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

The update is done as follows:

\[ \begin{align}\begin{aligned}moment\_out & = decay * moment + (1 - decay) * grad * grad\\param\_out & = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}\end{aligned}\end{align} \]

The original paper (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) does not have an epsilon attribute. It is added here for numerical stability, to avoid division by zero.
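A minimal scalar sketch of the decayed update in plain Python, showing how it differs from plain Adagrad: the accumulator is an exponential moving average, so under a constant gradient it settles near grad * grad instead of growing without bound. The function name decayed_adagrad_step and the sample values are assumptions made for this illustration.

```python
import math

def decayed_adagrad_step(param, grad, moment, learning_rate,
                         decay=0.95, epsilon=1e-6):
    """One scalar Decayed Adagrad update following the formulas above."""
    moment_out = decay * moment + (1 - decay) * grad * grad
    param_out = param - learning_rate * grad / (math.sqrt(moment_out) + epsilon)
    return param_out, moment_out

# A single step from a zero accumulator.
_, first_moment = decayed_adagrad_step(1.0, grad=0.5, moment=0.0, learning_rate=0.01)

# Many steps with a constant gradient: the accumulator converges to grad * grad.
param, moment = 1.0, 0.0
for _ in range(500):
    param, moment = decayed_adagrad_step(param, grad=0.5, moment=moment,
                                         learning_rate=0.01)
```

Once the accumulator has converged, each step has a magnitude close to learning_rate, so the step size does not vanish over long training runs.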

Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • decay (float) – decay rate.
  • epsilon (float) – a small float value for numerical stability.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.

Examples

import paddle.fluid as fluid
import paddle.fluid.layers as layers

x = layers.data(name='x', shape=[-1, 10], dtype='float32')
trans = layers.fc(x, 100)
cost = layers.reduce_mean(trans)
optimizer = fluid.optimizer.DecayedAdagrad(learning_rate=0.2)
optimizer.minimize(cost)

Notes

Currently, DecayedAdagradOptimizer doesn’t support sparse parameter optimization.

apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:params_grads (list) – list of (param, grad) pair to do optimization.
Returns:A list of operators appended to the current program.
Return type:list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • params_grads (list) – list of (param, grad) pair to do optimization.
Returns:

A list of operators appended to the current program.

Return type:

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

First part of minimize, do auto-diff to append backward ops for the current program.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
Returns:

list of (param, grad) pair, grad is the output of backward.

Return type:

list

Examples

See examples in apply_gradients.

load(stat_dict)

Load an optimizer that uses learning rate decay, in dygraph mode.

Parameters:stat_dict – the dict loaded by the load_persistables method.
Returns:None

Examples:

from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)

        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
        learning_rate=0.1,
        decay_steps=10000,
        decay_rate=0.5,
        staircase=True))

    train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
                [x[0].reshape(1, 28, 28) for x in data]).astype('float32')

        y_data = np.array([x[1] for x in data]).astype('int64').reshape(
                128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        # optimizer2 is the only optimizer defined in this example
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
                mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
            break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
            learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables(
        "save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
# The restored learning-rate schedule should match the one that was saved.
assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__

minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

This method combines interface backward() and apply_gradients() into one.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • grad_clip (GradClipBase|None) – gradient clipping strategy.
Returns:

(optimize_ops, params_grads): the list of operators appended, and the list of (param, grad) Variable pairs used for optimization.

Return type:

tuple

DGCMomentumOptimizer

class paddle.fluid.optimizer.DGCMomentumOptimizer(learning_rate, momentum, rampup_begin_step, rampup_step=1, sparsity=[0.999], use_nesterov=False, local_grad_clip_norm=None, num_trainers=None, regularization=None, name=None)

Original paper is https://arxiv.org/abs/1712.01887

DGC reduces the communication bandwidth by sending only the important gradients (sparse update): only gradients larger than a threshold are transmitted.

To avoid losing information, DGC accumulates the rest of the gradients locally.

Eventually, these gradients become large enough to be transmitted.

Thus, DGC sends the large gradients immediately and eventually sends all of the gradients over time.

To ensure no loss of accuracy, DGC employs momentum correction and local gradient clipping on top of the gradient sparsification to maintain model performance.

DGC also uses momentum factor masking and warmup training to overcome the staleness problem caused by reduced communication.

This optimizer will do two things:

  1. Compress the gradient by taking the top-k important values from the tensor and using them for allreduce, which reduces network bandwidth.
  2. Call momentum to optimize on the cost.
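The compression in step 1 can be sketched in plain Python: accumulate the gradient into a local residual, send only the k entries with the largest magnitude, and keep the rest for later steps. The sketch omits momentum correction, gradient clipping and the allreduce itself. The function name dgc_sparsify and the rule k = round(n * (1 - sparsity)), with a minimum of 1, are assumptions made for this illustration.

```python
def dgc_sparsify(grad, residual, sparsity):
    """Return (values_to_send, new_residual) for one step."""
    # Add this step's gradient to what was held back earlier.
    accumulated = [r + g for r, g in zip(residual, grad)]
    # Number of entries to transmit: the ratio kept is (1 - sparsity).
    k = max(1, int(round(len(accumulated) * (1 - sparsity))))
    top = sorted(range(len(accumulated)),
                 key=lambda i: abs(accumulated[i]), reverse=True)[:k]
    send = {i: accumulated[i] for i in top}
    # Transmitted entries are cleared locally; the rest keep accumulating.
    new_residual = [0.0 if i in send else v for i, v in enumerate(accumulated)]
    return send, new_residual

# Step 1: only the largest entry (index 1) is sent.
send1, residual = dgc_sparsify([0.1, -2.0, 0.3, 0.05], [0.0] * 4, sparsity=0.75)

# Step 2: index 2 has accumulated enough to become the largest and is sent.
send2, residual2 = dgc_sparsify([0.1, 0.0, 0.3, 0.05], residual, sparsity=0.75)
```

Small gradients are never discarded: they stay in the residual until their accumulated value is large enough to be selected.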
Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • momentum (float) – Momentum factor.
  • rampup_begin_step (int) – The beginning step from which gradient compression is implemented.
  • rampup_step (int) – The number of steps over which the sparsity schedule is applied. Default is 1. For example, if sparsity is [0.75, 0.9375, 0.984375, 0.996, 0.999] and rampup_step is 5, it uses 0.75 at step 0, 0.9375 at step 1, and so on. Once the end of the sparsity array is reached, it keeps using 0.999 from then on.
  • sparsity (list[float]) – Keep only the top important elements of the gradient tensor; the ratio kept is (1 - current sparsity).
  • use_nesterov (bool) – Enables Nesterov momentum. True means use nesterov.
  • local_grad_clip_norm (float) – Clip norm value if needed.
  • num_trainers – The number of training nodes.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.
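How rampup_begin_step, rampup_step and sparsity interact can be sketched as a lookup function in plain Python. The sketch assumes that steps are counted from rampup_begin_step and that the rampup_step window is divided evenly among the sparsity entries, which reproduces the example given for rampup_step above; the library's exact indexing may differ. The function name current_sparsity is hypothetical.

```python
def current_sparsity(step, sparsity, rampup_begin_step, rampup_step):
    """Sparsity in effect at a global step, or None before compression starts."""
    if step < rampup_begin_step:
        return None  # gradients are not compressed yet
    elapsed = step - rampup_begin_step
    # Spread the sparsity entries evenly over the ramp-up window.
    idx = elapsed * len(sparsity) // rampup_step
    # After the window ends, stay on the last entry.
    return sparsity[min(idx, len(sparsity) - 1)]

schedule = [0.75, 0.9375, 0.984375, 0.996, 0.999]
values = [current_sparsity(s, schedule, rampup_begin_step=0, rampup_step=5)
          for s in range(7)]
```

With a longer window, for example rampup_step=10 under the same assumption, each entry would be used for two consecutive steps before moving to the next one.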

Examples

import paddle.fluid as fluid
optimizer = fluid.optimizer.DGCMomentumOptimizer(
            learning_rate=0.0001,
            momentum=0.9,
            rampup_step=1000,
            rampup_begin_step=1252,
            sparsity=[0.999, 0.999])
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:params_grads (list) – list of (param, grad) pair to do optimization.
Returns:A list of operators appended to the current program.
Return type:list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • params_grads (list) – list of (param, grad) pair to do optimization.
Returns:

A list of operators appended to the current program.

Return type:

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

First part of minimize, do auto-diff to append backward ops for the current program.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • callbacks (list|None) – list of callables to run when appending backward operator for one parameter.
Returns:

list of (param, grad) pair, grad is the output of backward.

Return type:

list

Examples

See examples in apply_gradients.

load(stat_dict)

Load an optimizer that uses learning rate decay, in dygraph mode.

Parameters:stat_dict – the dict loaded by the load_persistables method.
Returns:None

Examples:

from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)

        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
        learning_rate=0.1,
        decay_steps=10000,
        decay_rate=0.5,
        staircase=True))

    train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
                [x[0].reshape(1, 28, 28) for x in data]).astype('float32')

        y_data = np.array([x[1] for x in data]).astype('int64').reshape(
                128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
                mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
                break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
            learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables(
        "save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__
minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

This method combines interface backward() and apply_gradients() into one.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • grad_clip (GradClipBase|None) – gradient clipping strategy.
Returns:

A tuple (optimize_ops, params_grads): the list of operators appended, and the list of (param, grad) Variable pairs used for optimization.

Return type:

tuple

ExponentialMovingAverage

class paddle.fluid.optimizer.ExponentialMovingAverage(decay=0.999, thres_steps=None, name=None)

Compute the moving average of parameters with exponential decay. Given a parameter \(\theta\), its exponential moving average (EMA) will be

\[ \begin{align}\begin{aligned}\text{EMA}_0 & = 0\\\text{EMA}_t & = \text{decay} * \text{EMA}_{t-1} + (1 - \text{decay}) * \theta_t\end{aligned}\end{align} \]

The averages computed by the update() method are stored in temporary variables that the object creates and maintains. Calling apply() copies them into the parameters of the current model, and restore() puts the original parameter values back.

Bias correction. All EMAs are initialized to \(0\), so they are biased toward zero. The bias is corrected by dividing by the factor \((1 - \text{decay}^t)\); that is, the EMAs actually applied to the parameters by apply() are

\[\widehat{\text{EMA}}_t = \frac{\text{EMA}_t}{1 - \text{decay}^t}\]

Decay rate scheduling. A decay rate very close to 1 makes the averages move very slowly, so a better strategy is to use a relatively small decay rate at the very beginning of training. The thres_steps argument lets users pass a Variable that schedules the decay rate; in that case the actual decay rate becomes

\[\min(\text{decay}, \frac{1 + \text{thres\_steps}}{10 + \text{thres\_steps}})\]

Usually thres_steps can be the global training steps.
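
The formulas above can be sketched in plain NumPy. This is only an illustration of the arithmetic, not the library's implementation: the helper names ema_decay, ema_update and ema_bias_corrected are hypothetical, and the bias correction shown assumes a fixed decay rate.

```python
import numpy as np

def ema_decay(decay, thres_steps=None):
    """Scheduled decay rate: min(decay, (1 + thres_steps) / (10 + thres_steps))."""
    if thres_steps is None:
        return decay
    return min(decay, (1.0 + thres_steps) / (10.0 + thres_steps))

def ema_update(ema, theta, decay):
    """One step: EMA_t = decay * EMA_{t-1} + (1 - decay) * theta_t."""
    return decay * ema + (1.0 - decay) * theta

def ema_bias_corrected(ema, decay, t):
    """Divide by (1 - decay^t) to undo the bias caused by EMA_0 = 0."""
    return ema / (1.0 - decay ** t)

# Hypothetical data: a parameter held constant for three steps.
theta = np.array([2.0, -1.0])
ema = np.zeros_like(theta)
decay = 0.9
for t in range(1, 4):
    ema = ema_update(ema, theta, decay)

# With a constant parameter, the corrected value recovers theta
# (up to floating-point rounding), while the raw EMA is still far from it.
corrected = ema_bias_corrected(ema, decay, t=3)
```

Early in training the schedule dominates: ema_decay(0.999, thres_steps=0) gives a small decay rate, and the configured decay takes over once thres_steps is large.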

Parameters:
  • decay (float) – The exponential decay rate, usually close to 1, such as 0.999 or 0.9999.
  • thres_steps (Variable|None) – If not None, schedule the decay rate.
  • name (str|None) – An optional name prefix.

Examples

import numpy
import paddle
import paddle.fluid as fluid

data = fluid.layers.data(name='x', shape=[5], dtype='float32')
hidden = fluid.layers.fc(input=data, size=10)
cost = fluid.layers.mean(hidden)

test_program = fluid.default_main_program().clone(for_test=True)

optimizer = fluid.optimizer.Adam(learning_rate=0.001)
optimizer.minimize(cost)

global_steps = fluid.layers.learning_rate_scheduler._decay_step_counter()
ema = fluid.optimizer.ExponentialMovingAverage(0.999, thres_steps=global_steps)
ema.update()

place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

for pass_id in range(3):
    for batch_id in range(6):
        data = numpy.random.random(size=(10, 5)).astype('float32')
        exe.run(program=fluid.default_main_program(),
            feed={'x': data},
            fetch_list=[cost.name])

    # usage 1
    with ema.apply(exe):
        data = numpy.random.random(size=(10, 5)).astype('float32')
        exe.run(program=test_program,
                feed={'x': data},
                fetch_list=[hidden.name])


    # usage 2
    with ema.apply(exe, need_restore=False):
        data = numpy.random.random(size=(10, 5)).astype('float32')
        exe.run(program=test_program,
                feed={'x': data},
                fetch_list=[hidden.name])
    ema.restore(exe)
update()

Update the exponential moving average. Call this method only in the training program.

apply(executor, need_restore=True)

Apply moving average to parameters for evaluation.

Parameters:
  • executor (Executor) – The Executor to execute applying.
  • need_restore (bool) – Whether to restore parameters after applying.
restore(executor)

Restore parameters.

Parameters:executor (Executor) – The Executor to execute restoring.

Ftrl

paddle.fluid.optimizer.Ftrl

alias of FtrlOptimizer

FtrlOptimizer

class paddle.fluid.optimizer.FtrlOptimizer(learning_rate, l1=0.0, l2=0.0, lr_power=-0.5, regularization=None, name=None)

FTRL (Follow The Regularized Leader) Optimizer.

The paper that proposed Follow The Regularized Leader (FTRL): (https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf)

\[ \begin{align}\begin{aligned}&new\_accum = squared\_accum + grad^2\\&if (lr\_power == -0.5):\\&\quad linear\_accum += grad - \frac{\sqrt{new\_accum} - \sqrt{squared\_accum}}{learning\_rate} * param\\&else:\\&\quad linear\_accum += grad - \frac{new\_accum^{-lr\_power} - squared\_accum^{-lr\_power}}{learning\_rate} * param\\&x = l1 * sign(linear\_accum) - linear\_accum\\&if (lr\_power == -0.5):\\&\quad y = \frac{\sqrt{new\_accum}}{learning\_rate} + (2 * l2)\\&\quad pre\_shrink = \frac{x}{y}\\&\quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0)\\&else:\\&\quad y = \frac{new\_accum^{-lr\_power}}{learning\_rate} + (2 * l2)\\&\quad pre\_shrink = \frac{x}{y}\\&\quad param = (abs(linear\_accum) > l1).select(pre\_shrink, 0.0)\\&squared\_accum += grad^2\end{aligned}\end{align} \]
Parameters:
  • learning_rate (float|Variable) – global learning rate.
  • l1 (float) – L1 regularization strength.
  • l2 (float) – L2 regularization strength.
  • lr_power (float) – Learning Rate Power.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.
Raises:

ValueError – If learning_rate is None.
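
The update rule can be traced step by step with a small NumPy sketch. It is an illustration under stated assumptions rather than the operator's actual code: the function name ftrl_step is hypothetical, and the correction term is read as the accumulator difference divided by learning_rate and then multiplied by param, as in the usual FTRL-Proximal formulation.

```python
import numpy as np

def ftrl_step(param, grad, squared_accum, linear_accum,
              learning_rate, l1=0.0, l2=0.0, lr_power=-0.5):
    """One FTRL step on NumPy arrays; returns the updated state."""
    new_accum = squared_accum + grad ** 2
    if lr_power == -0.5:
        # Special case: the power reduces to a square root.
        sigma = (np.sqrt(new_accum) - np.sqrt(squared_accum)) / learning_rate
        y = np.sqrt(new_accum) / learning_rate + 2.0 * l2
    else:
        sigma = (new_accum ** (-lr_power)
                 - squared_accum ** (-lr_power)) / learning_rate
        y = new_accum ** (-lr_power) / learning_rate + 2.0 * l2
    # Assumption: the correction term multiplies the current parameter.
    linear_accum = linear_accum + grad - sigma * param
    x = l1 * np.sign(linear_accum) - linear_accum
    # Coordinates whose |linear_accum| does not exceed l1 are set to zero.
    new_param = np.where(np.abs(linear_accum) > l1, x / y, 0.0)
    return new_param, new_accum, linear_accum

# Hypothetical single-coordinate example starting from zero state.
p, acc, lin = ftrl_step(np.array([0.0]), np.array([1.0]),
                        np.array([0.0]), np.array([0.0]),
                        learning_rate=0.1)
```

Raising l1 above the magnitude of linear_accum zeroes the coordinate, which is how the L1 term produces sparse parameters.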

Examples

import paddle
import paddle.fluid as fluid
import numpy as np

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    ftrl_optimizer = fluid.optimizer.Ftrl(learning_rate=0.1)
    ftrl_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)

Notes

Currently, FtrlOptimizer doesn’t support sparse parameter optimization.

apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:params_grads (list) – list of (param, grad) pair to do optimization.
Returns:A list of operators appended to the current program.
Return type:list

Examples

import paddle.fluid as fluid
# network() is a placeholder for user code that builds the model and returns the loss Variable
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • params_grads (list) – list of (param, grad) pair to do optimization.
Returns:

A list of operators appended to the current program.

Return type:

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

First part of minimize, do auto-diff to append backward ops for the current program.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • callbacks (list|None) – list of callables to run when appending the backward operator for one parameter.
Returns:

A list of (param, grad) pairs, where grad is the output of backward.

Return type:

list

Examples

See examples in apply_gradients.

load(stat_dict)

Load an optimizer with learning rate decay in dygraph mode. Returns None.

Parameters:stat_dict – the dict loaded by the load_persistables method

Examples:

from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)

        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
        learning_rate=0.1,
        decay_steps=10000,
        decay_rate=0.5,
        staircase=True))

    train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
                [x[0].reshape(1, 28, 28) for x in data]).astype('float32')

        y_data = np.array([x[1] for x in data]).astype('int64').reshape(
                128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
                mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
                break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
            learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables(
        "save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__
minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

This method combines interface backward() and apply_gradients() into one.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • grad_clip (GradClipBase|None) – gradient clipping strategy.
Returns:

A tuple (optimize_ops, params_grads): the list of operators appended, and the list of (param, grad) Variable pairs used for optimization.

Return type:

tuple

LambOptimizer

class paddle.fluid.optimizer.LambOptimizer(learning_rate=0.001, lamb_weight_decay=0.01, beta1=0.9, beta2=0.999, epsilon=1e-06, regularization=None, exclude_from_weight_decay_fn=None, name=None)

LAMB (Layer-wise Adaptive Moments optimizer for Batching training) Optimizer.

The LAMB optimizer is designed to scale up the training batch size without losing accuracy. It supports adaptive element-wise updating and accurate layer-wise correction. For more information, see the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes".

The updating of parameters follows:

\[ \begin{align}\begin{aligned}m_t &= \beta_1 m_{t - 1}+ (1 - \beta_1)g_t\\v_t &= \beta_2 v_{t - 1} + (1 - \beta_2)g_t^2\\r_t &= \frac{m_t}{\sqrt{v_t}+\epsilon}\\w_t &= w_{t-1} -\eta_t \frac{\left \| w_{t-1}\right \|}{\left \| r_t + \lambda w_{t-1}\right \|} (r_t + \lambda w_{t-1})\end{aligned}\end{align} \]

where \(m\) is the 1st moment, and \(v\) the 2nd moment, \(\eta\) the learning rate, \(\lambda\) the LAMB weight decay rate.

Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • lamb_weight_decay (float) – The LAMB weight decay rate.
  • beta1 (float) – The exponential decay rate for the 1st moment estimates.
  • beta2 (float) – The exponential decay rate for the 2nd moment estimates.
  • epsilon (float) – A small float value for numerical stability.
  • regularization (Regularizer) – A Regularizer, such as fluid.regularizer.L1DecayRegularizer.
  • exclude_from_weight_decay_fn (function) – Exclude a parameter from weight decay when exclude_from_weight_decay_fn(parameter) returns true.
  • name (str|None) – An optional name prefix.
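
As a rough illustration of the update equations, the following NumPy sketch performs one LAMB step for a single layer. The function name lamb_step is hypothetical, the code follows the formulas exactly as written (no bias correction of the moments), and it does not guard against zero norms, which a production implementation would need to handle.

```python
import numpy as np

def lamb_step(w, g, m, v, lr=0.001, lamb_weight_decay=0.01,
              beta1=0.9, beta2=0.999, epsilon=1e-6):
    """One LAMB step for one layer; returns updated (w, m, v)."""
    m = beta1 * m + (1.0 - beta1) * g          # 1st moment
    v = beta2 * v + (1.0 - beta2) * g ** 2     # 2nd moment
    r = m / (np.sqrt(v) + epsilon)             # element-wise adaptive update
    update = r + lamb_weight_decay * w         # add decoupled weight decay
    # Layer-wise trust ratio: ||w|| / ||r + lambda * w||.
    trust_ratio = np.linalg.norm(w) / np.linalg.norm(update)
    w = w - lr * trust_ratio * update
    return w, m, v

# Hypothetical layer with two weights, starting from zero moments.
w0 = np.array([3.0, 4.0])
g0 = np.array([1.0, 1.0])
w1, m1, v1 = lamb_step(w0, g0, np.zeros(2), np.zeros(2), lr=0.001)
```

Because the update direction is rescaled by the trust ratio, the length of each step equals the learning rate times the norm of the layer's weights, regardless of the gradient's magnitude.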

Examples

import paddle.fluid as fluid

data = fluid.layers.data(name='x', shape=[5], dtype='float32')
hidden = fluid.layers.fc(input=data, size=10)
cost = fluid.layers.mean(hidden)

def exclude_fn(param):
    return param.name.endswith('.b_0')

optimizer = fluid.optimizer.Lamb(learning_rate=0.002,
                                 exclude_from_weight_decay_fn=exclude_fn)
optimizer.minimize(cost)
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:params_grads (list) – list of (param, grad) pair to do optimization.
Returns:A list of operators appended to the current program.
Return type:list

Examples

import paddle.fluid as fluid
# network() is a placeholder for user code that builds the model and returns the loss Variable
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • params_grads (list) – list of (param, grad) pair to do optimization.
Returns:

A list of operators appended to the current program.

Return type:

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

First part of minimize, do auto-diff to append backward ops for the current program.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • callbacks (list|None) – list of callables to run when appending the backward operator for one parameter.
Returns:

A list of (param, grad) pairs, where grad is the output of backward.

Return type:

list

Examples

See examples in apply_gradients.

load(stat_dict)

Load an optimizer with learning rate decay in dygraph mode. Returns None.

Parameters:stat_dict – the dict loaded by the load_persistables method

Examples:

from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)

        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
        learning_rate=0.1,
        decay_steps=10000,
        decay_rate=0.5,
        staircase=True))

    train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
                [x[0].reshape(1, 28, 28) for x in data]).astype('float32')

        y_data = np.array([x[1] for x in data]).astype('int64').reshape(
                128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
                mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
                break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
            learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables(
        "save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__
minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

This method combines interface backward() and apply_gradients() into one.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • grad_clip (GradClipBase|None) – gradient clipping strategy.
Returns:

A tuple (optimize_ops, params_grads): the list of operators appended, and the list of (param, grad) Variable pairs used for optimization.

Return type:

tuple

LarsMomentum

paddle.fluid.optimizer.LarsMomentum

alias of LarsMomentumOptimizer

LarsMomentumOptimizer

class paddle.fluid.optimizer.LarsMomentumOptimizer(learning_rate, momentum, lars_coeff=0.001, lars_weight_decay=0.0005, regularization=None, name=None)

Momentum optimizer with LARS support

The update equations are as follows:

\[ \begin{align}\begin{aligned}& local\_learning\_rate = learning\_rate * lars\_coeff * \frac{||param||}{||gradient|| + lars\_weight\_decay * ||param||}\\& velocity = mu * velocity + local\_learning\_rate * (gradient + lars\_weight\_decay * param)\\& param = param - velocity\end{aligned}\end{align} \]
Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • momentum (float) – momentum factor
  • lars_coeff (float) – defines how much we trust the layer to change its weights.
  • lars_weight_decay (float) – weight decay coefficient for decaying using LARS.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.
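
The update equations can be checked by hand with a short NumPy sketch. This is an illustration only: the function name lars_momentum_step is hypothetical, mu in the equations is taken to be the momentum argument, and the norms are assumed to be L2 norms over the whole parameter tensor.

```python
import numpy as np

def lars_momentum_step(param, grad, velocity, learning_rate, momentum,
                       lars_coeff=0.001, lars_weight_decay=0.0005):
    """One LARS momentum step for a single layer; returns (param, velocity)."""
    p_norm = np.linalg.norm(param)
    g_norm = np.linalg.norm(grad)
    # Layer-wise learning rate scaled by the ratio of weight and gradient norms.
    local_lr = (learning_rate * lars_coeff * p_norm
                / (g_norm + lars_weight_decay * p_norm))
    velocity = momentum * velocity + local_lr * (grad + lars_weight_decay * param)
    param = param - velocity
    return param, velocity

# Hypothetical layer; weight decay is disabled to keep the numbers simple.
p_new, v_new = lars_momentum_step(
    np.array([3.0, 4.0]), np.array([0.0, 2.0]), np.zeros(2),
    learning_rate=0.2, momentum=0.9, lars_coeff=0.001, lars_weight_decay=0.0)
```

In this example the weight norm is 5 and the gradient norm is 2, so the local learning rate is 0.2 * 0.001 * 5 / 2 = 0.0005.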

Examples

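import paddle.fluid as fluid
# `cost` is assumed to be a loss Variable built elsewhere, e.g. with fluid.layers.mean(...)
optimizer = fluid.optimizer.LarsMomentum(learning_rate=0.2, momentum=0.1, lars_weight_decay=0.001)
optimizer.minimize(cost)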
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:params_grads (list) – list of (param, grad) pair to do optimization.
Returns:A list of operators appended to the current program.
Return type:list

Examples

import paddle.fluid as fluid
# network() is a placeholder for user code that builds the model and returns the loss Variable
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • params_grads (list) – list of (param, grad) pair to do optimization.
Returns:

A list of operators appended to the current program.

Return type:

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

First part of minimize, do auto-diff to append backward ops for the current program.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • callbacks (list|None) – list of callables to run when appending the backward operator for one parameter.
Returns:

A list of (param, grad) pairs, where grad is the output of backward.

Return type:

list

Examples

See examples in apply_gradients.

load(stat_dict)

Load an optimizer with learning rate decay in dygraph mode. Returns None.

Parameters:stat_dict – the dict loaded by the load_persistables method

Examples:

from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)

        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
        learning_rate=0.1,
        decay_steps=10000,
        decay_rate=0.5,
        staircase=True))

    train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
                [x[0].reshape(1, 28, 28) for x in data]).astype('float32')

        y_data = np.array([x[1] for x in data]).astype('int64').reshape(
                128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
                mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
                break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
            learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables(
        "save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__
minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

This method combines interface backward() and apply_gradients() into one.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • grad_clip (GradClipBase|None) – gradient clipping strategy.
Returns:

A tuple (optimize_ops, params_grads): the list of operators appended, and the list of (param, grad) Variable pairs used for optimization.

Return type:

tuple

ModelAverage

class paddle.fluid.optimizer.ModelAverage(average_window_rate, min_average_window=10000, max_average_window=10000, regularization=None, name=None)

Accumulate the average of parameters within a sliding window. The averages are saved in temporary variables and can be applied to the parameter variables of the current model by calling the apply() method; the restore() method restores the original parameter values of the current model.

The size of the average window is determined by average_window_rate, min_average_window, max_average_window, and the current number of updates.

Parameters:
  • average_window_rate – The rate of average window.
  • min_average_window – The minimum size of average window.
  • max_average_window – The maximum size of average window.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.

Examples

import paddle.fluid as fluid
import numpy

# First create the Executor.
place = fluid.CPUPlace()  # fluid.CUDAPlace(0)
exe = fluid.Executor(place)

train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
    # build net
    data = fluid.layers.data(name='X', shape=[1], dtype='float32')
    hidden = fluid.layers.fc(input=data, size=10)
    loss = fluid.layers.mean(hidden)
    optimizer = fluid.optimizer.Momentum(learning_rate=0.2, momentum=0.1)
    optimizer.minimize(loss)

    # build ModelAverage optimizer
    model_average = fluid.optimizer.ModelAverage(0.15,
                                                 min_average_window=10000,
                                                 max_average_window=20000)

    exe.run(startup_program)
    x = numpy.random.random(size=(10, 1)).astype('float32')
    outs = exe.run(program=train_program,
                   feed={'X': x},
                   fetch_list=[loss.name])

    # apply ModelAverage
    with model_average.apply(exe):
        x = numpy.random.random(size=(10, 1)).astype('float32')
        exe.run(program=train_program,
                feed={'X': x},
                fetch_list=[loss.name])
apply(executor, need_restore=True)

Apply average values to parameters of current model.

Parameters:
  • executor (fluid.Executor) – current executor.
  • need_restore (bool) – If the parameter values need to be restored after applying, set it to True. Default is True.
restore(executor)

Restore parameter values of current model.

Parameters:executor (fluid.Executor) – current executor.
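
The apply() and restore() contract described above can be illustrated with a toy class in plain Python. Everything here is hypothetical: ToyAverager is not part of the library, it keeps a simple running mean instead of a sliding window, and it works on NumPy arrays without an Executor. It only shows how need_restore decides whether the original values come back when the with block ends.

```python
from contextlib import contextmanager
import numpy as np

class ToyAverager:
    """Hypothetical stand-in that mimics the apply()/restore() contract."""

    def __init__(self, params):
        self.params = params  # dict: name -> np.ndarray of live parameters
        self.sums = {k: np.zeros_like(v) for k, v in params.items()}
        self.count = 0
        self._backup = None

    def accumulate(self):
        # Record the current parameter values (plain running sum).
        for k, v in self.params.items():
            self.sums[k] += v
        self.count += 1

    @contextmanager
    def apply(self, need_restore=True):
        # Back up the live values, then overwrite them with the averages.
        self._backup = {k: v.copy() for k, v in self.params.items()}
        for k in self.params:
            self.params[k][...] = self.sums[k] / self.count
        try:
            yield
        finally:
            if need_restore:
                self.restore()

    def restore(self):
        # Put the backed-up values back into the live parameters.
        for k, v in self._backup.items():
            self.params[k][...] = v

params = {"w": np.array([1.0, 1.0])}
avg = ToyAverager(params)
avg.accumulate()
params["w"][...] = [3.0, 3.0]
avg.accumulate()

with avg.apply():                      # averaged values inside the block
    inside = params["w"].copy()
after = params["w"].copy()             # originals restored on exit

with avg.apply(need_restore=False):    # averaged values are kept on exit
    pass
kept = params["w"].copy()
avg.restore()                          # explicit restore
restored = params["w"].copy()
```
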
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:params_grads (list) – list of (param, grad) pair to do optimization.
Returns:A list of operators appended to the current program.
Return type:list

Examples

import paddle.fluid as fluid
# network() is a placeholder for user code that builds the model and returns the loss Variable
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • params_grads (list) – list of (param, grad) pair to do optimization.
Returns:

A list of operators appended to the current program.

Return type:

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

First part of minimize, do auto-diff to append backward ops for the current program.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • callbacks (list|None) – list of callables to run when appending the backward operator for one parameter.
Returns:

A list of (param, grad) pairs, where grad is the output of backward.

Return type:

list

Examples

See examples in apply_gradients.

load(stat_dict)

Load an optimizer with learning rate decay in dygraph mode. Returns None.

Parameters:stat_dict – the dict loaded by the load_persistables method

Examples:

from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)

        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
        learning_rate=0.1,
        decay_steps=10000,
        decay_rate=0.5,
        staircase=True))

    train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
                [x[0].reshape(1, 28, 28) for x in data]).astype('float32')

        y_data = np.array([x[1] for x in data]).astype('int64').reshape(
                128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
                mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
                break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
            learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables(
        "save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__
minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

This method combines interface backward() and apply_gradients() into one.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • grad_clip (GradClipBase|None) – Gradient clip strategy.
Returns:

(optimize_ops, params_grads): the list of operators appended, and the list of (param, grad) Variable pairs for optimization.

Return type:

tuple

Momentum

paddle.fluid.optimizer.Momentum

alias of MomentumOptimizer

MomentumOptimizer

class paddle.fluid.optimizer.MomentumOptimizer(learning_rate, momentum, use_nesterov=False, regularization=None, name=None)[source]

Simple Momentum optimizer with velocity state

This optimizer has a flag for Nesterov momentum.

The update equations are as follows:

\[ \begin{align}\begin{aligned}& velocity = mu * velocity + gradient\\& if (use\_nesterov):\\&\quad param = param - (gradient + mu * velocity) * learning\_rate\\& else:\\&\quad param = param - learning\_rate * velocity\end{aligned}\end{align} \]
Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • momentum (float) – momentum factor
  • use_nesterov (bool) – enables Nesterov momentum
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.
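
The update equations above can be sketched in plain Python for a single scalar parameter. This is only an illustration of the arithmetic, not Paddle's implementation; the function name momentum_step and the toy objective f(w) = w**2 are assumptions made for this example.

```python
def momentum_step(param, grad, velocity, learning_rate, mu, use_nesterov=False):
    """Apply one momentum update to a scalar parameter.

    Hypothetical helper that mirrors the equations above.
    Returns the updated (param, velocity) pair.
    """
    # velocity = mu * velocity + gradient
    velocity = mu * velocity + grad
    if use_nesterov:
        # param = param - (gradient + mu * velocity) * learning_rate
        param = param - (grad + mu * velocity) * learning_rate
    else:
        # param = param - learning_rate * velocity
        param = param - learning_rate * velocity
    return param, velocity


# Toy problem: minimize f(w) = w**2, whose gradient is 2 * w.
w, v = 1.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, 2.0 * w, v, learning_rate=0.01, mu=0.9)
print("w after 200 steps:", w)
```

Passing use_nesterov=True switches to the first branch of the equations; the velocity update is the same in both cases.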

Examples

import paddle
import paddle.fluid as fluid
import numpy as np

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    moment_optimizer = fluid.optimizer.MomentumOptimizer(learning_rate=0.001, momentum=0.9)
    moment_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:params_grads (list) – list of (param, grad) pairs to optimize.
Returns:A list of operators appended to the current program.
Return type:list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • params_grads (list) – list of (param, grad) pairs to optimize.
Returns:

A list of operators appended to the current program.

Return type:

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

First part of minimize, do auto-diff to append backward ops for the current program.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • callbacks (list|None) – list of callables to run when appending the backward operator for one parameter.
Returns:

list of (param, grad) pairs, where grad is the output of backward.

Return type:

list

Examples

See examples in apply_gradients.

load(stat_dict)

Loads an optimizer with learning rate decay in dygraph mode. Returns None.

Parameters:stat_dict – the dict loaded by the load_persistables method

Examples:

from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)

        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
        learning_rate=0.1,
        decay_steps=10000,
        decay_rate=0.5,
        staircase=True))

    train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
                [x[0].reshape(1, 28, 28) for x in data]).astype('float32')

        y_data = np.array([x[1] for x in data]).astype('int64').reshape(
                128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        # optimizer2 is the only optimizer defined in this example.
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
                mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
                break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
            learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables(
        "save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__
minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

This method combines interface backward() and apply_gradients() into one.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • grad_clip (GradClipBase|None) – Gradient clip strategy.
Returns:

(optimize_ops, params_grads): the list of operators appended, and the list of (param, grad) Variable pairs for optimization.

Return type:

tuple

PipelineOptimizer

class paddle.fluid.optimizer.PipelineOptimizer(optimizer, cut_list=None, place_list=None, concurrency_list=None, queue_size=30, sync_steps=1, start_cpu_core_id=0)[source]

Pipeline Optimizer

Train in pipeline mode. The program will be split by cut_list.

If the length of cut_list is k, then the whole program (including the backward part) will be split into 2*k-1 sections.

So the lengths of place_list and concurrency_list must also be 2*k-1.
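
The length rule can be checked before constructing the optimizer. The sketch below is plain Python with hypothetical helper names (expected_section_count, check_pipeline_lists) that are not part of Paddle; strings stand in for Variables and Places, and it assumes cut_list contains at least one entry.

```python
def expected_section_count(cut_list):
    """Return the number of sections for k cuts: 2*k - 1 (assumes k >= 1)."""
    return 2 * len(cut_list) - 1


def check_pipeline_lists(cut_list, place_list, concurrency_list):
    """Raise ValueError unless both per-section lists have 2*k - 1 entries."""
    n = expected_section_count(cut_list)
    if len(place_list) != n:
        raise ValueError(
            "place_list has %d entries, expected %d" % (len(place_list), n))
    if len(concurrency_list) != n:
        raise ValueError(
            "concurrency_list has %d entries, expected %d"
            % (len(concurrency_list), n))
    return n


# Two cuts give 2*2 - 1 = 3 sections, so three places and three
# concurrency degrees are required.
cut_list = [["emb_x", "emb_y"], ["loss"]]
sections = check_pipeline_lists(cut_list, ["cpu", "gpu:0", "cpu"], [1, 1, 4])
print("sections:", sections)
```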

Note: Although asynchronous mode is applied in pipeline training to speed it up, the final performance depends heavily on the training progress of each pipeline.

We will try the synchronous mode in the future.

Parameters:
  • optimizer (Optimizer) – The base optimizer, such as SGD.
  • cut_list (list of Variable list) – The cut variables of the main_program.
  • place_list (list of Place) – The places where the sections will run.
  • concurrency_list (list of int) – The concurrency degree of each section.
  • queue_size (int) – Each section consumes scopes from its in-scope queue and produces scopes to its out-scope queue. This parameter specifies the scope queue size. [Optional. Default: 30].
  • sync_steps (int) – The synchronization steps between different cards. [Optional. Default: 1].
  • start_cpu_core_id (int) – Specifies the first CPU core id. [Optional. Default: 0].

Examples

import paddle.fluid as fluid
import paddle.fluid.layers as layers

x = fluid.layers.data(name='x', shape=[1], dtype='int64', lod_level=0)
y = fluid.layers.data(name='y', shape=[1], dtype='int64', lod_level=0)
emb_x = layers.embedding(input=x, param_attr=fluid.ParamAttr(name="embx"), size=[10,2], is_sparse=False)
emb_y = layers.embedding(input=y, param_attr=fluid.ParamAttr(name="emby",learning_rate=0.9), size=[10,2], is_sparse=False)
concat = layers.concat([emb_x, emb_y], axis=1)
fc = layers.fc(input=concat, name="fc", size=1, num_flatten_dims=1, bias_attr=False)
loss = layers.reduce_mean(fc)
optimizer = fluid.optimizer.SGD(learning_rate=0.5)
optimizer = fluid.optimizer.PipelineOptimizer(optimizer,
        cut_list=[[emb_x, emb_y], [loss]],
        place_list=[fluid.CPUPlace(), fluid.CUDAPlace(0), fluid.CPUPlace()],
        concurrency_list=[1, 1, 4],
        queue_size=2,
        sync_steps=1,
        )
optimizer.minimize(loss)
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
filelist = [] # you should set your own filelist, e.g. filelist = ["dataA.txt"]
dataset = fluid.DatasetFactory().create_dataset("FileInstantDataset")
dataset.set_use_var([x,y])
batch_size = 1  # illustrative value; set your own batch size
dataset.set_batch_size(batch_size)
dataset.set_filelist(filelist)
exe.train_from_dataset(
            fluid.default_main_program(),
            dataset,
            thread=2,
            debug=False,
            fetch_list=[],
            fetch_info=[],
            print_period=1)

RMSPropOptimizer

class paddle.fluid.optimizer.RMSPropOptimizer(learning_rate, rho=0.95, epsilon=1e-06, momentum=0.0, centered=False, regularization=None, name=None)[source]

Root Mean Squared Propagation (RMSProp) is an unpublished, adaptive learning rate method. It was originally proposed on slide 29 of http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf .

The original equation is as follows:

\[ \begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\w & = w - \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\end{aligned}\end{align} \]

The first equation calculates the moving average of the squared gradient for each weight. The gradient is then divided by \(\sqrt{r(w,t) + \epsilon}\).

In some cases, adding a momentum term \(\beta\) is beneficial. In our implementation, Nesterov momentum is used:

\[ \begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\v(w, t) & = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) + \epsilon}} \nabla Q_{i}(w)\\w & = w - v(w, t)\end{aligned}\end{align} \]

If centered is True:

\[ \begin{align}\begin{aligned}r(w, t) & = \rho r(w, t-1) + (1 - \rho)(\nabla Q_{i}(w))^2\\g(w, t) & = \rho g(w, t-1) + (1 - \rho)\nabla Q_{i}(w)\\v(w, t) & = \beta v(w, t-1) + \frac{\eta} {\sqrt{r(w,t) - (g(w, t))^2 + \epsilon}} \nabla Q_{i}(w)\\w & = w - v(w, t)\end{aligned}\end{align} \]

where \(\rho\) is a hyperparameter with typical values such as 0.9 and 0.95, \(\beta\) is the momentum term, and \(\epsilon\) is a smoothing term to avoid division by zero, usually set somewhere in the range from 1e-4 to 1e-8.
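
The three variants above differ only in the denominator and the momentum term, which a scalar sketch in plain Python makes explicit. This is an illustration of the equations, not Paddle's implementation; the function name rmsprop_step, the state dictionary, and the zero-initialized accumulators are assumptions made for this example.

```python
import math


def rmsprop_step(w, grad, state, lr, rho=0.95, epsilon=1e-6, beta=0.0,
                 centered=False):
    """Apply one RMSProp update to a scalar weight.

    Hypothetical helper. `state` holds the accumulators r, g and v from
    the equations above. Returns the new weight and the new state.
    """
    # r(w, t) = rho * r(w, t-1) + (1 - rho) * grad^2
    r = rho * state["r"] + (1.0 - rho) * grad * grad
    g = state["g"]
    denom = r
    if centered:
        # g(w, t) = rho * g(w, t-1) + (1 - rho) * grad
        g = rho * g + (1.0 - rho) * grad
        denom = r - g * g
    # v(w, t) = beta * v(w, t-1) + lr / sqrt(denom + epsilon) * grad
    v = beta * state["v"] + lr / math.sqrt(denom + epsilon) * grad
    return w - v, {"r": r, "g": g, "v": v}


# Toy problem: a few steps on f(w) = w**2, whose gradient is 2 * w.
w = 1.0
state = {"r": 0.0, "g": 0.0, "v": 0.0}
for _ in range(5):
    w, state = rmsprop_step(w, 2.0 * w, state, lr=0.01)
print("w after 5 steps:", w)
```

With beta=0.0 and centered=False the function reduces to the original equation; a nonzero beta adds the momentum term, and centered=True subtracts the squared mean gradient inside the square root.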

Parameters:
  • learning_rate (float) – global learning rate.
  • rho (float) – \(\rho\) in the equation, set to 0.95 by default.
  • epsilon (float) – \(\epsilon\) in the equation, a smoothing term to avoid division by zero, set to 1e-6 by default.
  • momentum (float) – \(\beta\) in the equation, the momentum term, set to 0.0 by default.
  • centered (bool) – If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.
Raises:

ValueError – If learning_rate, rho, epsilon, momentum are None.

Examples

import paddle
import paddle.fluid as fluid
import numpy as np

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    rms_optimizer = fluid.optimizer.RMSProp(learning_rate=0.1)
    rms_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:params_grads (list) – list of (param, grad) pairs to optimize.
Returns:A list of operators appended to the current program.
Return type:list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • params_grads (list) – list of (param, grad) pairs to optimize.
Returns:

A list of operators appended to the current program.

Return type:

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

First part of minimize, do auto-diff to append backward ops for the current program.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • callbacks (list|None) – list of callables to run when appending the backward operator for one parameter.
Returns:

list of (param, grad) pairs, where grad is the output of backward.

Return type:

list

Examples

See examples in apply_gradients.

load(stat_dict)

Loads an optimizer with learning rate decay in dygraph mode. Returns None.

Parameters:stat_dict – the dict loaded by the load_persistables method

Examples:

from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)

        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
        learning_rate=0.1,
        decay_steps=10000,
        decay_rate=0.5,
        staircase=True))

    train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
                [x[0].reshape(1, 28, 28) for x in data]).astype('float32')

        y_data = np.array([x[1] for x in data]).astype('int64').reshape(
                128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        # optimizer2 is the only optimizer defined in this example.
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
                mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
                break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
            learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables(
        "save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__
minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

This method combines interface backward() and apply_gradients() into one.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • grad_clip (GradClipBase|None) – Gradient clip strategy.
Returns:

(optimize_ops, params_grads): the list of operators appended, and the list of (param, grad) Variable pairs for optimization.

Return type:

tuple

SGD

paddle.fluid.optimizer.SGD

alias of SGDOptimizer

SGDOptimizer

class paddle.fluid.optimizer.SGDOptimizer(learning_rate, regularization=None, name=None)[source]

Optimizer of the stochastic gradient descent algorithm.

\[param\_out = param - learning\_rate * grad\]
Parameters:
  • learning_rate (float|Variable) – the learning rate used to update parameters. Can be a float value or a Variable with one float value as data element.
  • regularization – A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
  • name – An optional name prefix.
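
The update rule above amounts to one subtraction per parameter. A minimal plain-Python sketch follows, assuming a list of scalar parameters; the function name sgd_step and the toy objective are hypothetical and not part of Paddle.

```python
def sgd_step(params, grads, learning_rate):
    """Return updated parameters: param_out = param - learning_rate * grad."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Toy problem: minimize f(p) = sum(p_i ** 2); the gradient of each term is 2 * p_i.
params = [1.0, -2.0]
for _ in range(100):
    grads = [2.0 * p for p in params]
    params = sgd_step(params, grads, learning_rate=0.1)
print("params after 100 steps:", params)
```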

Examples

import paddle
import paddle.fluid as fluid
import numpy as np

place = fluid.CPUPlace()
main = fluid.Program()
with fluid.program_guard(main):
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)
    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_cost = fluid.layers.mean(cost)

    sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
    sgd_optimizer.minimize(avg_cost)

    fetch_list = [avg_cost]
    train_reader = paddle.batch(
        paddle.dataset.uci_housing.train(), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    for data in train_reader():
        exe.run(main, feed=feeder.feed(data), fetch_list=fetch_list)
apply_gradients(params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:params_grads (list) – list of (param, grad) pairs to optimize.
Returns:A list of operators appended to the current program.
Return type:list

Examples

import paddle.fluid as fluid
loss = network()
optimizer = fluid.optimizer.SGD(learning_rate=0.1)
params_grads = optimizer.backward(loss)
# you may append operations for params_grads here
# ...
optimizer.apply_gradients(params_grads)
apply_optimize(loss, startup_program, params_grads)

Second part of minimize, appending optimization operators for given params_grads pairs.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • params_grads (list) – list of (param, grad) pairs to optimize.
Returns:

A list of operators appended to the current program.

Return type:

list

backward(loss, startup_program=None, parameter_list=None, no_grad_set=None, callbacks=None)

First part of minimize, do auto-diff to append backward ops for the current program.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • callbacks (list|None) – list of callables to run when appending the backward operator for one parameter.
Returns:

list of (param, grad) pairs, where grad is the output of backward.

Return type:

list

Examples

See examples in apply_gradients.

load(stat_dict)

Loads an optimizer with learning rate decay in dygraph mode. Returns None.

Parameters:stat_dict – the dict loaded by the load_persistables method

Examples:

from __future__ import print_function
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.dygraph.nn import FC
from paddle.fluid.dygraph.base import to_variable

class MLP(fluid.Layer):
    def __init__(self, name_scope):
        super(MLP, self).__init__(name_scope)

        self._fc1 = FC(self.full_name(), 10)
        self._fc2 = FC(self.full_name(), 10)

    def forward(self, inputs):
        y = self._fc1(inputs)
        y = self._fc2(y)
        return y

with fluid.dygraph.guard():
    mlp = MLP('mlp')
    optimizer2 = SGDOptimizer(
        learning_rate=fluid.layers.natural_exp_decay(
        learning_rate=0.1,
        decay_steps=10000,
        decay_rate=0.5,
        staircase=True))

    train_reader = paddle.batch(
            paddle.dataset.mnist.train(), batch_size=128, drop_last=True)

    for batch_id, data in enumerate(train_reader()):
        dy_x_data = np.array(
                [x[0].reshape(1, 28, 28) for x in data]).astype('float32')

        y_data = np.array([x[1] for x in data]).astype('int64').reshape(
                128, 1)

        img = to_variable(dy_x_data)
        label = to_variable(y_data)
        label._stop_gradient = True
        cost = mlp(img)
        avg_loss = fluid.layers.reduce_mean(cost)
        avg_loss.backward()
        # optimizer2 is the only optimizer defined in this example.
        optimizer2.minimize(avg_loss)
        mlp.clear_gradients()
        fluid.dygraph.save_persistables(
                mlp.state_dict(), [optimizer2], "save_dir_2")
        if batch_id == 2:
                break

with fluid.dygraph.guard():
    mlp_load = MLP('mlp')
    optimizer_load2 = SGDOptimizer(
            learning_rate=fluid.layers.natural_exp_decay(
            learning_rate=0.1,
            decay_steps=10000,
            decay_rate=0.5,
            staircase=True))
    parameters, optimizers = fluid.dygraph.load_persistables(
        "save_dir_2")
    mlp_load.load_dict(parameters)
    optimizer_load2.load(optimizers)
assert optimizer2._learning_rate.__dict__ == optimizer_load2._learning_rate.__dict__
minimize(loss, startup_program=None, parameter_list=None, no_grad_set=None, grad_clip=None)

Add operations to minimize loss by updating parameter_list.

This method combines interface backward() and apply_gradients() into one.

Parameters:
  • loss (Variable) – loss variable to run optimizations.
  • startup_program (Program) – startup_program for initializing parameters in parameter_list.
  • parameter_list (list) – list of Variables to update.
  • no_grad_set (set|None) – set of Variables that should be ignored.
  • grad_clip (GradClipBase|None) – Gradient clip strategy.
Returns:

(optimize_ops, params_grads): the list of operators appended, and the list of (param, grad) Variable pairs for optimization.

Return type:

tuple