AsyncExecutor

class paddle.fluid.AsyncExecutor(place=None, run_mode='')[source]

An asynchronous Executor in Python. By exploiting the power of multi-core processors and data queueing, AsyncExecutor decouples data reading and data consuming, each running in multiple threads in parallel.

Instead of reading data on the Python side, AsyncExecutor accepts a training file list, which will be retrieved in C++; training inputs are then read, parsed and fed to the training network within C++ code.

AsyncExecutor is in active development and the API might change in the near future.

Example

>>> data_feed = fluid.DataFeedDesc('data.proto')
>>> startup_program = fluid.default_startup_program()
>>> main_program = fluid.default_main_program()
>>> filelist = ["train_data/part-%d" % i for i in range(100)]
>>> thread_num = len(filelist) // 4
>>>
>>> place = fluid.CPUPlace()
>>> async_executor = fluid.AsyncExecutor(place)
>>>
>>> async_executor.run_startup_program(startup_program)
>>>
>>> epoch = 10
>>> for i in range(epoch):
>>>     async_executor.run(main_program,
>>>                        data_feed,
>>>                        filelist,
>>>                        thread_num,
>>>                        [acc],
>>>                        debug=False)
Parameters:place (fluid.CPUPlace|None) – indicates on which device the executor runs. Only CPUPlace is supported.

Note

For debugging a complicated network on parallel GPUs, you can test it on the executor. They have exactly the same arguments and are expected to produce the same results.

Note: Only running on CPUPlace is supported.

run(program, data_feed, filelist, thread_num, fetch, mode='', debug=False)[source]

Run the program by this AsyncExecutor. The training dataset will be in filelist. Users can also inspect certain variables by naming them in the fetch parameter, like in fluid.Executor. Unlike fluid.Executor, however, AsyncExecutor does not return the fetched variables; instead, it dumps the values of each fetched variable to standard output.

The dataset will be consumed by multiple threads; within each thread, a thread-local scope is created and all OPs are created in that scope. Parameters are updated by all the OPs simultaneously.

Parameters:
  • program (Program) – the program that needs to run; if not provided, default_main_program will be used.
  • data_feed (DataFeedDesc) – A DataFeedDesc object
  • filelist (str) – a file containing the training dataset file list
  • thread_num (int) – number of concurrent training threads. See Note for how to set this properly
  • fetch (str|list) – the var name or a list of var names to inspect
  • mode (str) – run mode of this interface
  • debug (bool) – When set to True, fetch vars will be printed to standard output after each minibatch

Note

The executor will run all operators in the program, not only the operators that the fetch list depends on.

Note

AsyncExecutor runs on multiple threads, each bound to a CPU core. To achieve the best performance, it is suggested to set the thread number equal to, or slightly less than, the number of CPU cores.

download_data(afs_path, local_path, fs_default_name, ugi, file_cnt, hadoop_home='$HADOOP_HOME', process_num=12)[source]

download_data is a default download method for distributed training; a user may also download data without this method.

Example

>>> exe = fluid.AsyncExecutor()
>>> exe.download_data("/xxx/xxx/xx/",
>>>                   "./data",
>>>                   "afs://xxx.xxx.xxx.xxx:9901",
>>>                   "xxx,yyy")
Parameters:
  • afs_path (str) – afs_path defined by users
  • local_path (str) – download data path
  • fs_default_name (str) – file system server address
  • ugi (str) – hadoop ugi
  • file_cnt (int) – a user can specify file number for debugging
  • hadoop_home (str) – hadoop home path
  • process_num (int) – download process num
get_instance()[source]

Get the current node's instance so that the user can perform operations in a distributed setting.

config_distributed_nodes()[source]

If a user needs to run the distributed AsyncExecutor, he or she needs to do a global configuration so that information about the current process can be obtained.

stop()[source]

At the end of the process, users should call stop to stop the servers and barrier all workers.

init_server(dist_desc)[source]

Initialize server of current node if current process is a server.

Parameters:dist_desc (str) – a protobuf string that describes how to init a worker and a server
init_worker(dist_desc, startup_program)[source]

Initialize worker of current node if current process is a worker.

Parameters:
  • dist_desc (str) – a protobuf string that describes how to init a worker and a server
  • startup_program (fluid.Program) – startup program of current process
init_model()[source]

init_model command that can be invoked from one of the workers; model parameters are initialized on the servers.

save_model(save_path)[source]

save_model command that can be invoked from one of the workers; model parameters are saved on the servers and uploaded to save_path on the file system.

Parameters:save_path (str) – save path to file system

BuildStrategy

class paddle.fluid.BuildStrategy

BuildStrategy allows the user to more precisely control how the SSA Graph is built in ParallelExecutor by setting its properties.

Examples

build_strategy = fluid.BuildStrategy()
build_strategy.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.Reduce

train_exe = fluid.ParallelExecutor(use_cuda=True,
                                   loss_name=loss.name,
                                   build_strategy=build_strategy)

train_loss, = train_exe.run([loss.name], feed=feed_dict)
debug_graphviz_path

The type is STR. debug_graphviz_path indicates the path to which the SSA Graph is written as a graphviz file. It is useful for debugging. Default "".

enable_sequential_execution

The type is BOOL. If set True, the execution order of ops would be the same as what is in the program. Default False.

fuse_elewise_add_act_ops

The type is BOOL. fuse_elewise_add_act_ops indicates whether to fuse elementwise_add_op and activation_op; it may make the execution faster. Default False.

fuse_relu_depthwise_conv

The type is BOOL. fuse_relu_depthwise_conv indicates whether to fuse relu and depthwise_conv2d; it will save GPU memory and may make the execution faster. This option is only available on GPU devices. Default False.

gradient_scale_strategy

The type is STR. There are three ways of defining \(loss@grad\) in ParallelExecutor: 'CoeffNumDevice', 'One' and 'Customized'. By default, ParallelExecutor sets \(loss@grad\) according to the number of devices. If you want to customize \(loss@grad\), you can choose 'Customized'. Default 'CoeffNumDevice'.

reduce_strategy

The type is STR. There are two reduce strategies in ParallelExecutor: 'AllReduce' and 'Reduce'. If you want the optimization of all parameters to be done on all devices independently, you should choose 'AllReduce'; if you choose 'Reduce', the optimization of the parameters will be evenly distributed across different devices, and each optimized parameter is then broadcast to the other devices. In some models, Reduce is faster. Default 'AllReduce'.

remove_unnecessary_lock

The type is BOOL. If set True, some locks in GPU ops would be released and ParallelExecutor would run faster. Default True.

sync_batch_norm

The type is BOOL. sync_batch_norm indicates whether to use synchronous batch normalization, which synchronizes the mean and variance across multiple devices during the training phase.

The current implementation does not support FP16 training or CPU, and synchronization only happens within one machine, not across machines.

Default False

CompiledProgram

class paddle.fluid.CompiledProgram(program_or_graph)[source]

Compiles to Graph for execution.

  1. Users first create the program with layers.
  2. Optionally, users use CompiledProgram to optimize the program before run.
  3. The original program or CompiledProgram is run by executor.

The CompiledProgram is used to transform a program for various optimizations, for example:

  • Pre-compute some logic once so that each run is faster.
  • Transform the program so that it can run in multiple devices.
  • TODO: transform the program for optimized inference or distributed
    training.

Example
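A minimal sketch of typical usage (assuming a main program with a loss variable named loss has already been built, and that use_cuda and feed_dict are defined elsewhere):

place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

# Compile the default main program for data-parallel execution.
compiled_prog = fluid.CompiledProgram(
    fluid.default_main_program()).with_data_parallel(loss_name=loss.name)

loss_value, = exe.run(compiled_prog,
                      feed=feed_dict,
                      fetch_list=[loss.name])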

Parameters:program_or_graph (Graph|Program) – If it’s Program, it will be first lowered to a graph for further optimizations. If it’s a graph (potentially optimized before), it will be directly used for further optimizations. Note: graph is only supported when compiled with with_data_parallel option.
with_data_parallel(loss_name=None, build_strategy=None, exec_strategy=None, share_vars_from=None, places=None)[source]

Configures the program to run in a data-parallel way.

Parameters:
  • loss_name (str) – The loss name, which must be set when training. Default None.
  • build_strategy (BuildStrategy) – build_strategy is used to build the graph so it can run on multiple devices/cores with optimized topology. For more information, please refer to fluid.BuildStrategy. Default None.
  • exec_strategy (ExecutionStrategy) – exec_strategy is used to select a way to execute the graph, for example how many threads are used and after how many iterations to clean up the temp variables. For more information, please refer to fluid.ExecutionStrategy. Default None.
  • share_vars_from (CompiledProgram) – If provided, this CompiledProgram will share variables from share_vars_from. share_vars_from must be run by the executor before this CompiledProgram so that vars are ready.
  • places (list(CUDAPlace)|list(CPUPlace)|None) – If provided, only compile program in the given places. Otherwise, the places used when compiled is determined by the Executor, and the places used are controlled by environment variables: FLAGS_selected_gpus or CUDA_VISIBLE_DEVICES if using GPU; or CPU_NUM if using CPU. For example, if you want to run on GPU 0 and 1, set places=[fluid.CUDAPlace(0), fluid.CUDAPlace(1)]. If you want to run on 2 CPU cores, set places=[fluid.CPUPlace()]*2.
Returns:

self

with_inference_optimize(config)[source]

Add inference optimization.

Parameters:config – instance of NativeConfig or AnalysisConfig to create predictor
Returns:self

cpu_places

paddle.fluid.cpu_places(device_count=None)[source]

Create a list of fluid.CPUPlace objects.

If device_count is None, the device count would be determined by environment variable CPU_NUM. If CPU_NUM is not set, the device count would be determined by multiprocessing.cpu_count().

Parameters:device_count (None|int) – device number.
Returns:cpu place list.
Return type:out (list(fluid.CPUPlace))
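A brief illustration (the device count 4 is arbitrary):

places = fluid.cpu_places(4)   # [fluid.CPUPlace(), fluid.CPUPlace(), fluid.CPUPlace(), fluid.CPUPlace()]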

CPUPlace

class paddle.fluid.CPUPlace

CPUPlace is a descriptor of a device. It represents a CPU device, and the memory of a CPUPlace can be accessed by the CPU.

create_lod_tensor

paddle.fluid.create_lod_tensor(data, recursive_seq_lens, place)[source]

Create a lod tensor from a numpy array, a list, or an existing lod tensor.

Create a lod tensor by doing the following:

  1. Check that the length-based level of detail (LoD) also known as recursive_sequence_lengths of the input is valid.
  2. Convert recursive_sequence_lengths to an offset-based LoD.
  3. Copy the data from a numpy array, a list or an existing lod tensor to the CPU or GPU device (based on the input place).
  4. Set the level of detail (LoD) using the offset-based LoD.

Examples

Suppose we want a LoDTensor to hold data for sequences of words, where each word is represented by an integer, and we want to create a LoDTensor to represent two sentences, one of 2 words and one of 3 words.

Then data can be a numpy array of integers with shape (5, 1). recursive_seq_lens will be [[2, 3]], indicating the length (number of words) of each sentence. This length-based recursive_seq_lens [[2, 3]] will be converted to the offset-based LoD [[0, 2, 5]] inside the function call.

Please reference api_guide_low_level_lod_tensor for more details regarding LoD.

Parameters:
  • data (numpy.ndarray|list|LoDTensor) – a numpy array or a LoDTensor or a list holding the data to be copied.
  • recursive_seq_lens (list) – a list of lists indicating the length-based level of detail info specified by the user.
  • place (Place) – CPU or GPU place indicating where the data in the new LoDTensor will be stored.
Returns:

A fluid LoDTensor object with tensor data and recursive_seq_lens info.
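A minimal sketch of the two-sentence example above (the word ids are made up):

import numpy as np
import paddle.fluid as fluid

data = np.array([[1], [2], [3], [4], [5]]).astype('int64')   # 5 words in total, shape (5, 1)
t = fluid.create_lod_tensor(data, recursive_seq_lens=[[2, 3]], place=fluid.CPUPlace())
print(t.recursive_sequence_lengths())   # [[2, 3]]
print(t.lod())                          # [[0, 2, 5]], the offset-based LoD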

create_random_int_lodtensor

paddle.fluid.create_random_int_lodtensor(recursive_seq_lens, base_shape, place, low, high)[source]

Create a LoDTensor containing random integers.

This function is frequently used in the book examples. So we revised it based on the new create_lod_tensor API and put it here in the lod_tensor module to simplify the code.

The function does the following:

  1. Calculate the overall shape of the LoDTensor based on the length-based recursive_seq_lens input and the shape of the basic element in base_shape.
  2. Create a numpy array of this shape.
  3. Create the LoDTensor using create_lod_tensor API.

Suppose we want a LoDTensor to hold data for sequences of words, where each word is represented by an integer, and we want to create a LoDTensor to represent two sentences, one of 2 words and one of 3 words. Then base_shape is [1] and the input length-based recursive_seq_lens is [[2, 3]], so the overall shape of the LoDTensor would be [5, 1], holding 5 words for the two sentences.

Parameters:
  • recursive_seq_lens (list) – a list of lists indicating the length-based level of detail info specified by the user.
  • base_shape (list) – the shape of the basic element to be held by the LoDTensor.
  • place (Place) – CPU or GPU place indicating where the data in the new LoDTensor will be stored.
  • low (int) – the lower bound of the random integers.
  • high (int) – the upper bound of the random integers.
Returns:

A fluid LoDTensor object with tensor data and recursive_seq_lens info.
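A minimal sketch matching the description above:

import numpy as np
import paddle.fluid as fluid

t = fluid.create_random_int_lodtensor(recursive_seq_lens=[[2, 3]],
                                      base_shape=[1],
                                      place=fluid.CPUPlace(),
                                      low=0, high=9)
print(np.array(t).shape)   # (5, 1): 5 words for the two sentences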

cuda_pinned_places

paddle.fluid.cuda_pinned_places(device_count=None)[source]

Create a list of fluid.CUDAPinnedPlace objects.

If device_count is None, the device count would be determined by environment variable CPU_NUM. If CPU_NUM is not set, the device count would be determined by multiprocessing.cpu_count().

Parameters:device_count (None|int) – device number.
Returns:cuda pinned place list.
Return type:out (list(fluid.CUDAPinnedPlace))

cuda_places

paddle.fluid.cuda_places(device_ids=None)[source]

Create a list of fluid.CUDAPlace objects.

If device_ids is None, environment variable of FLAGS_selected_gpus would be checked first. If FLAGS_selected_gpus=0,1,2, the returned list would be [fluid.CUDAPlace(0), fluid.CUDAPlace(1), fluid.CUDAPlace(2)]. If FLAGS_selected_gpus is not set, all visible gpu places would be returned.

If device_ids is not None, it should be the device ids of gpus. For example, if device_ids=[0,1,2], the returned list would be [fluid.CUDAPlace(0), fluid.CUDAPlace(1), fluid.CUDAPlace(2)].

Parameters:device_ids (None|list(int)|tuple(int)) – gpu device id list.
Returns:gpu place list.
Return type:out (list(fluid.CUDAPlace))
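A brief illustration (assumes at least two visible GPUs):

gpu_places = fluid.cuda_places([0, 1])   # [fluid.CUDAPlace(0), fluid.CUDAPlace(1)]
all_visible = fluid.cuda_places()        # honors FLAGS_selected_gpus / CUDA_VISIBLE_DEVICES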

CUDAPinnedPlace

class paddle.fluid.CUDAPinnedPlace

CUDAPinnedPlace is a descriptor of a device. The memory of CUDAPinnedPlace can be accessed by GPU and CPU.

CUDAPlace

class paddle.fluid.CUDAPlace

CUDAPlace is a descriptor of a device. It represents a GPU, and each CUDAPlace has a dev_id to indicate which GPU card is represented by the current CUDAPlace. The memory of CUDAPlaces with different dev_id values is not mutually accessible.

DataFeedDesc

class paddle.fluid.DataFeedDesc(proto_file)[source]

Datafeed descriptor, describing input training data format. This class is currently only used for AsyncExecutor (See comments for class AsyncExecutor for a brief introduction)

DataFeedDesc shall be initialized from a valid protobuf message from disk:

>>> data_feed = fluid.DataFeedDesc('data.proto')

See paddle/fluid/framework/data_feed.proto for message definition. A typical message might look like:

>>> name: "MultiSlotDataFeed"
>>> batch_size: 2
>>> multi_slot_desc {
>>>     slots {
>>>         name: "words"
>>>         type: "uint64"
>>>         is_dense: false
>>>         is_used: true
>>>     }
>>>     slots {
>>>         name: "label"
>>>         type: "uint64"
>>>         is_dense: false
>>>         is_used: true
>>>     }
>>> }

However, users usually shouldn’t care about the message format; instead, they are encouraged to use the Data Generator as a tool to generate a valid data description, in the process of converting their raw log files to training files acceptable to AsyncExecutor.

DataFeedDesc can also be changed at runtime. Once you are familiar with what each field means, you can modify it to better suit your needs. E.g.:

>>> data_feed.set_batch_size(128)
>>> data_feed.set_dense_slots(['wd'])    # The slot named 'wd' will be dense
>>> data_feed.set_use_slots(['wd'])      # The slot named 'wd' will be used

Finally, the content can be dumped out for debugging purposes:

>>> print(data_feed.desc())

Parameters:proto_file (string) – Disk file containing a data feed description.
set_batch_size(batch_size)[source]

Set batch size. Will be effective during training

Example

>>> data_feed = fluid.DataFeedDesc('data.proto')
>>> data_feed.set_batch_size(128)
Parameters:batch_size – batch size
set_dense_slots(dense_slots_name)[source]

Set whether a specific slot will be dense. This takes effect during training. Features for a dense slot will be fed into a Tensor, while those for a sparse slot will be fed into a LoDTensor.

Example

>>> data_feed = fluid.DataFeedDesc('data.proto')
>>> data_feed.set_dense_slots(['words'])
Parameters:dense_slots_name – a list of slot names which will be set dense

Note

By default, all slots are sparse.

set_use_slots(use_slots_name)[source]

Set whether a specific slot will be used for training. A dataset may contain a lot of features; through this function one can select which ones will be used for a specific model.

Example

>>> data_feed = fluid.DataFeedDesc('data.proto')
>>> data_feed.set_use_slots(['words'])
Parameters:use_slots_name – a list of slot names which will be used in training

Note

By default, no slot is used.

desc()[source]

Returns a protobuf message for this DataFeedDesc

Example

>>> data_feed = fluid.DataFeedDesc('data.proto')
>>> print(data_feed.desc())
Returns:A string message

DataFeeder

class paddle.fluid.DataFeeder(feed_list, place, program=None)[source]

DataFeeder converts the data returned by a reader into a data structure that can be fed into Executor and ParallelExecutor. The reader usually returns a list of mini-batch data entries. Each data entry in the list is one sample. Each sample is a list or a tuple with one feature or multiple features.

Simple usage is shown below:

place = fluid.CPUPlace()
img = fluid.layers.data(name='image', shape=[1, 28, 28])
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
feeder = fluid.DataFeeder([img, label], fluid.CPUPlace())
result = feeder.feed([([0] * 784, [9]), ([1] * 784, [1])])

If you want to feed data to the GPU side separately in advance when you use multiple GPUs to train a model, you can use the decorate_reader function.

place=fluid.CUDAPlace(0)
feeder = fluid.DataFeeder(place=place, feed_list=[data, label])
reader = feeder.decorate_reader(
    paddle.batch(flowers.train(), batch_size=16))
Parameters:
  • feed_list (list) – The Variables or the names of Variables that will be fed into the model.
  • place (Place) – place indicates whether to feed data to the CPU or the GPU. If you want to feed data to the GPU, please use fluid.CUDAPlace(i) (i represents the GPU id); if you want to feed data to the CPU, please use fluid.CPUPlace().
  • program (Program) – The Program that the data will be fed into; if program is None, default_main_program() will be used. Default None.
Raises:

ValueError – If some Variable is not in this Program.

Examples

# ...
place = fluid.CPUPlace()
feed_list = [
    main_program.global_block().var(var_name) for var_name in feed_vars_name
] # feed_vars_name is a list of variables' name.
feeder = fluid.DataFeeder(feed_list, place)
for data in reader():
    outs = exe.run(program=main_program,
                   feed=feeder.feed(data))
feed(iterable)[source]

According to feed_list and iterable, converts the input into a data structure that can be fed into Executor and ParallelExecutor.

Parameters:iterable (list|tuple) – the input data.
Returns:the result of conversion.
Return type:dict
feed_parallel(iterable, num_places=None)[source]

Takes multiple mini-batches. Each mini-batch will be fed to one device in advance.

Parameters:
  • iterable (list|tuple) – the input data.
  • num_places (int) – the number of devices. Default None.
Returns:

the result of conversion.

Return type:

dict

Notes

The number of devices and the number of mini-batches must be the same.

decorate_reader(reader, multi_devices, num_places=None, drop_last=True)[source]

Convert the data returned by reader into multiple mini-batches. Each mini-batch will be fed to one device.

Parameters:
  • reader (function) – the reader is the function which can generate data.
  • multi_devices (bool) – whether to use multiple devices or not.
  • num_places (int) – if multi_devices is True, you can specify the number of GPUs to use; if num_places is None, the function will use all the GPUs of the current machine. Default None.
  • drop_last (bool) – whether to drop the last batch if the size of the last batch is less than batch_size. Default True.
Returns:

the result of conversion.

Return type:

dict

Raises:

ValueError – If drop_last is False and the data batch cannot be evenly split across the devices.

default_main_program

paddle.fluid.default_main_program()[source]

Get default/global main program. The main program is used for training or testing.

All layer functions in fluid.layers append operators and variables to the default_main_program.

The default_main_program is the default program for many APIs. For example, Executor.run() will execute the default_main_program when the program is not specified.

Returns:main program
Return type:Program
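A brief illustration; the layers built here only serve to show that they are appended to the default main program:

import paddle.fluid as fluid

img = fluid.layers.data(name='image', shape=[1, 28, 28])
fc = fluid.layers.fc(input=img, size=10)
print(fluid.default_main_program().num_blocks)   # at least 1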

default_startup_program

paddle.fluid.default_startup_program()[source]

Get default/global startup program.

The layer functions in fluid.layers create parameters, readers, and NCCL handles as global variables. The startup_program initializes them with the operators in the startup program. The layer functions append these initialization operators to the startup program.

This method will return the default or the current startup program. Users can use fluid.program_guard to switch program.

Returns:startup program
Return type:Program

DistributeTranspiler

class paddle.fluid.DistributeTranspiler(config=None)[source]

DistributeTranspiler

Convert the fluid program to distributed data-parallelism programs. Supports two modes: pserver mode and nccl2 mode.

In pserver mode, the main_program will be transformed to use a remote parameter server to do parameter optimization. And the optimization graph will be put into a parameter server program.

In nccl2 mode, the transpiler will append a NCCL_ID broadcasting op in startup_program to share the NCCL_ID across the job nodes. After transpile() is called in nccl2 mode, you *must* pass the trainer_id and num_trainers arguments to ParallelExecutor to enable NCCL2 distributed mode.

Examples

# for pserver mode
pserver_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
trainer_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
current_endpoint = "192.168.0.1:6174"
trainer_id = 0
trainers = 4
role = os.getenv("PADDLE_TRAINING_ROLE")
t = fluid.DistributeTranspiler()
t.transpile(
     trainer_id, pservers=pserver_endpoints, trainers=trainers)
if role == "PSERVER":
     pserver_program = t.get_pserver_program(current_endpoint)
     pserver_startup_program = t.get_startup_program(current_endpoint,
                                                    pserver_program)
elif role == "TRAINER":
     trainer_program = t.get_trainer_program()

# for nccl2 mode
config = fluid.DistributeTranspilerConfig()
config.mode = "nccl2"
t = fluid.DistributeTranspiler(config=config)
t.transpile(trainer_id, workers=workers, current_endpoint=curr_ep)
exe = fluid.ParallelExecutor(
    use_cuda,
    loss_name=loss_var.name,
    num_trainers=len(trainers.split(",")),
    trainer_id=trainer_id
)
transpile(trainer_id, program=None, pservers='127.0.0.1:6174', trainers=1, sync_mode=True, startup_program=None, current_endpoint='127.0.0.1:6174')[source]

Run the transpiler.

Parameters:
  • trainer_id (int) – id of the current trainer worker; if you have n workers, the id ranges from 0 to n-1.
  • program (Program|None) – program to transpile, default is fluid.default_main_program().
  • startup_program (Program|None) – startup_program to transpile, default is fluid.default_startup_program().
  • pservers (str) – comma separated ip:port string for the pserver list.
  • trainers (int|str) – in pserver mode this is the number of trainers, in nccl2 mode this is a string of trainer endpoints.
  • sync_mode (bool) – Do sync training or not, default is True.
  • current_endpoint (str) – need pass current endpoint when transpile as nccl2 distributed mode. In pserver mode this argument is not used.
get_trainer_program(wait_port=True)[source]

Get transpiled trainer side program.

Returns:trainer side program.
Return type:Program
get_pserver_program(endpoint)[source]

Get parameter server side program.

Parameters:endpoint (str) – current parameter server endpoint.
Returns:the program for current parameter server to run.
Return type:Program
get_pserver_programs(endpoint)[source]

Get pserver side main program and startup program for distributed training.

Parameters:endpoint (str) – current pserver endpoint.
Returns:(main_program, startup_program), of type “Program”
Return type:tuple
get_startup_program(endpoint, pserver_program=None, startup_program=None)[source]

Deprecated

Get startup program for current parameter server. Modify operator input variables if there are variables that were split to several blocks.

Parameters:
  • endpoint (str) – current pserver endpoint.
  • pserver_program (Program) – deprecated, call get_pserver_program first.
  • startup_program (Program) – deprecated, should pass startup_program when initializing.
Returns:

parameter server side startup program.

Return type:

Program

DistributeTranspilerConfig

class paddle.fluid.DistributeTranspilerConfig[source]
slice_var_up(bool)

Whether to do Tensor slicing for pservers. Default is True.

split_method(PSDispatcher)

RoundRobin or HashName can be used. Try to choose the best method to balance loads for pservers.

min_block_size(int)

Minimum number of split elements in a block.

According to https://github.com/PaddlePaddle/Paddle/issues/8638#issuecomment-369912156, we can use bandwidth efficiently when the data size is larger than 2MB. If you want to change it, please make sure you have read the slice_variable function.
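A minimal sketch of adjusting the configuration (the values are illustrative only):

config = fluid.DistributeTranspilerConfig()
config.slice_var_up = True        # slice variables across pservers
config.min_block_size = 8192      # minimum number of elements in a split block
t = fluid.DistributeTranspiler(config=config)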

ExecutionStrategy

class paddle.fluid.ExecutionStrategy

ExecutionStrategy allows the user to more precisely control how the program runs in ParallelExecutor by setting its properties.

Examples

exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = 4

train_exe = fluid.ParallelExecutor(use_cuda=True,
                                   loss_name=loss.name,
                                   exec_strategy=exec_strategy)

train_loss, = train_exe.run([loss.name], feed=feed_dict)
allow_op_delay

The type is BOOL. allow_op_delay indicates whether to delay running the communication operators; it may make the execution faster. Note that in some models, allow_op_delay may cause the program to hang. Default False.

num_iteration_per_drop_scope

The type is INT. num_iteration_per_drop_scope indicates after how many iterations to clean up the temp variables which are generated during execution. It may make the execution faster, because the temp variables' shapes may be the same between two iterations. Default 100.

Notes

  1. If you fetch data when calling 'run', the ParallelExecutor will clean up the temp variables at the end of the current iteration.
  2. In some NLP models, it may cause GPU memory to be insufficient; in this case, you should reduce num_iteration_per_drop_scope.
num_threads

The type is INT. num_threads represents the size of the thread pool used to run the operators of the current program in ParallelExecutor. If \(num\_threads=1\), all the operators will execute one by one, but the order may differ between iterations. If it is not set, it will be set in ParallelExecutor according to the device type and device count: for GPU, \(num\_threads=device\_count*4\); for CPU, \(num\_threads=CPU\_NUM*4\); the explanation of \(CPU\_NUM\) is in ParallelExecutor. If CPU_NUM is not set, ParallelExecutor will get the CPU count by calling multiprocessing.cpu_count(). Default 0.
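A brief sketch of tuning these properties (the values are illustrative only):

exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = 4                    # size of the operator thread pool
exec_strategy.num_iteration_per_drop_scope = 10  # clean up temp variables every 10 iterations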

Executor

class paddle.fluid.Executor(place)[source]

An Executor in Python supports single/multiple-GPU running and single/multiple-CPU running. The Python executor takes a program, adds feed operators and fetch operators to this program according to the feed map and fetch_list. The feed map provides input data for the program. fetch_list provides the variables (or names) that the user wants to get after the program runs. Note: the executor will run all operators in the program, not only the operators that the fetch list depends on. It stores the global variables in the global scope, and creates a local scope for the temporary variables. The contents of the local scope may be discarded after every minibatch forward/backward finishes, but the global scope variables persist across different runs.

Example

# First create the Executor.
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)

# Run the startup program once and only once.
# No need to optimize/compile the startup program.
exe.run(fluid.default_startup_program())

# Run the main program directly without compile.
loss, = exe.run(fluid.default_main_program(),
                feed=feed_dict,
                fetch_list=[loss.name])
# Or, compile the program and then run it. See `CompiledProgram` for more detail.
compiled_prog = compiler.CompiledProgram(
    fluid.default_main_program()).with_data_parallel(
    loss_name=loss.name)
loss, = exe.run(compiled_prog,
                feed=feed_dict,
                fetch_list=[loss.name])
Parameters:place (core.CPUPlace|core.CUDAPlace(n)) – indicate the executor run on which device
close()[source]

Close this executor.

You can no longer use this executor after calling this method. For the distributed training, this method would free the resource on PServers related to the current Trainer.

Example

>>> cpu = core.CPUPlace()
>>> exe = Executor(cpu)
>>> ...
>>> exe.close()
run(program=None, feed=None, fetch_list=None, feed_var_name='feed', fetch_var_name='fetch', scope=None, return_numpy=True, use_program_cache=False)[source]

Run the program by this Executor. Feed data by the feed map, fetch results by fetch_list. The Python executor takes a program, adds feed operators and fetch operators to this program according to the feed map and fetch_list. The feed map provides input data for the program. fetch_list provides the variables (or names) that the user wants to get after the program runs.

Note: the executor will run all operators in the program, not only the operators that the fetch list depends on.

Parameters:
  • program (Program|CompiledProgram) – the program that need to run, if not provided, then default_main_program (not compiled) will be used.
  • feed (dict) – feed variable map, e.g. {"image": ImageData, "label": LabelData}
  • fetch_list (list) – a list of variable or variable names that user wants to get, this method will return them according to this list.
  • feed_var_name (str) – the name for the input variable of feed Operator.
  • fetch_var_name (str) – the name for the output variable of fetch Operator.
  • scope (Scope) – the scope used to run this program; you can switch it to a different scope. Default is global_scope.
  • return_numpy (bool) – whether to convert the fetched tensors to numpy arrays
  • use_program_cache (bool) – whether to use the cached program settings across batches. Setting it to True is faster only when (1) the program is not compiled with data parallel, and (2) the program, the feed variable names and the fetch_list variable names have not changed compared to the last step.
Returns:

fetch result according to fetch_list.

Return type:

list(numpy.array)

Examples

>>> data = fluid.layers.data(name='X', shape=[1], dtype='float32')
>>> out = fluid.layers.create_tensor(dtype='float32')
>>> hidden = fluid.layers.fc(input=data, size=10)
>>> fluid.layers.assign(hidden, out)
>>> loss = fluid.layers.mean(out)
>>> adam = fluid.optimizer.Adam()
>>> adam.minimize(loss)
>>> cpu = core.CPUPlace()
>>> exe = fluid.Executor(cpu)
>>> exe.run(fluid.default_startup_program())
>>> x = numpy.random.random(size=(10, 1)).astype('float32')
>>> outs = exe.run(
>>>     feed={'X': x},
>>>     fetch_list=[loss.name])
infer_from_dataset(program=None, dataset=None, scope=None, thread=0, debug=False, fetch_list=None, fetch_info=None, print_period=100)[source]

The documentation of infer_from_dataset is almost the same as train_from_dataset, except that in distributed training, pushing gradients is disabled in infer_from_dataset. infer_from_dataset() can easily be used for multi-threaded evaluation.

Parameters:
  • program (Program|CompiledProgram) – the program that needs to be run, if not provided, then default_main_program (not compiled) will be used.
  • dataset (paddle.fluid.Dataset) – dataset created outside this function, a user should provide a well-defined dataset before calling this function. Please check the document of Dataset if needed. default is None
  • scope (Scope) – the scope used to run this program, you can switch it to different scope for each run. default is global_scope
  • thread (int) – number of threads a user wants to run in this function. The actual number of threads will be min(Dataset.thread_num, thread) if thread > 0. Default is 0.
  • debug (bool) – whether to run infer_from_dataset in debug mode. Default is False.
  • fetch_list (Variable List) – fetch variable list, each variable will be printed during training, default is None
  • fetch_info (String List) – print information for each variable, default is None
  • print_period (int) – the number of mini-batches for each print, default is 100
Returns:

None

Examples

import paddle.fluid as fluid
place = fluid.CPUPlace()
exe = fluid.Executor(place)
x = fluid.layers.data(name="x", type="int64")
y = fluid.layers.data(name="y", type="int64")
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_use_var([x, y])
filelist = ["dataA.txt", "dataB.txt"]
dataset.set_filelist(filelist)
exe.run(fluid.default_startup_program())
exe.infer_from_dataset(program=fluid.default_main_program(),
                       dataset=dataset)
train_from_dataset(program=None, dataset=None, scope=None, thread=0, debug=False, fetch_list=None, fetch_info=None, print_period=100)[source]

Train from a pre-defined Dataset. Dataset is defined in paddle.fluid.dataset. Given a program (either a plain program or a compiled program), train_from_dataset will consume all data samples in the dataset. The input scope can be given by users; by default, the scope is global_scope(). The total number of threads run in training is thread; the thread number used in training will be the minimum of Dataset.thread_num and the value of thread in this interface. debug can be set so that the executor displays the run time of all operators and the throughput of the current training task.

Note: train_from_dataset will destroy all resources created within executor for each run.

Parameters:
  • program (Program|CompiledProgram) – the program that needs to be run, if not provided, then default_main_program (not compiled) will be used.
  • dataset (paddle.fluid.Dataset) – dataset created outside this function, a user should provide a well-defined dataset before calling this function. Please check the document of Dataset if needed.
  • scope (Scope) – the scope used to run this program, you can switch it to different scope for each run. default is global_scope
  • thread (int) – number of threads a user wants to run in this function. The actual number of threads will be min(Dataset.thread_num, thread)
  • debug (bool) – whether to run train_from_dataset in debug mode
  • fetch_list (Variable List) – fetch variable list, each variable will be printed during training
  • fetch_info (String List) – print information for each variable
  • print_period (int) – the number of mini-batches for each print
Returns:

None

Examples

import paddle.fluid as fluid
place = fluid.CPUPlace()
exe = fluid.Executor(place)
x = fluid.layers.data(name="x", type="int64")
y = fluid.layers.data(name="y", type="int64")
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_use_var([x, y])
dataset.set_thread(2)
filelist = ["dataA.txt", "dataB.txt"]
dataset.set_filelist(filelist)
exe.run(fluid.default_startup_program())
exe.train_from_dataset(program=fluid.default_main_program(),
                       dataset=dataset)

global_scope

paddle.fluid.global_scope()[source]

Get the global/default scope instance. Many APIs use global_scope as their default value, e.g., Executor.run.

Returns:The global/default scope instance.
Return type:Scope
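A brief sketch of writing and reading a variable in the global scope (the variable name 'data' is made up):

import numpy as np
import paddle.fluid as fluid

fluid.global_scope().var("data").get_tensor().set(
    np.ones((2, 2), dtype='float32'), fluid.CPUPlace())
value = np.array(fluid.global_scope().find_var("data").get_tensor())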

in_dygraph_mode

paddle.fluid.in_dygraph_mode()[source]

Returns:True if the program is running in dynamic graph mode.
Return type:bool
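A brief illustration (assumes a Paddle version in which fluid.dygraph.guard is available):

import paddle.fluid as fluid

print(fluid.in_dygraph_mode())       # False in the default static-graph mode
with fluid.dygraph.guard():
    print(fluid.in_dygraph_mode())   # True inside the dygraph guard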

LoDTensor

class paddle.fluid.LoDTensor

LoDTensor is a Tensor with optional LoD information.

np.array(lod_tensor) can convert LoDTensor to numpy array. lod_tensor.lod() can retrieve the LoD information.

LoD is short for Level of Details and is usually used for varied sequence length. You can skip the following comment if you don’t need optional LoD.

For example:

A LoDTensor X can look like the example below. It contains 2 sequences. The first has length 2 and the second has length 3, as described by x.lod.

The first tensor dimension 5=2+3 is calculated from the LoD if it is available. It means the total number of sequence elements. In X, each element has 2 columns, hence [5, 2].

x.lod  = [[2, 3]]
x.data = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
x.shape = [5, 2]

LoD can have multiple levels (for example, a paragraph can have multiple sentences and a sentence can have multiple words). In the following LoDTensor Y, the lod_level is 2. It means there are 2 sequences; the first sequence's length is 2 (it has 2 sub-sequences) and the second one's length is 1. The first sequence's 2 sub-sequences have lengths 2 and 2, respectively, and the second sequence's single sub-sequence has length 3.

y.lod = [[2, 1], [2, 2, 3]]
y.shape = [2+2+3, ...]

Note

In the above description, LoD is length-based. In the Paddle internal implementation, lod is offset-based. Hence, internally, y.lod is represented as [[0, 2, 3], [0, 2, 4, 7]] (the length-based equivalent would be [[2-0, 3-2], [2-0, 4-2, 7-4]]).

Sometimes LoD is called recursive_sequence_lengths to be more self-explanatory. In this case, it must be length-based. Due to historical reasons, when LoD is called lod in the public API, it might be offset-based. Users should be careful about this.

has_valid_recursive_sequence_lengths(self: paddle.fluid.core.LoDTensor) → bool

Check whether the lod of the LoDTensor is valid.

Returns:whether the lod is valid.
Return type:out (bool)
lod(self: paddle.fluid.core.LoDTensor) → List[List[int]]

Return the LoD of the LoDTensor.

Returns:the lod of the LoDTensor.
Return type:out (List[List[int]])
recursive_sequence_lengths(self: paddle.fluid.core.LoDTensor) → List[List[int]]

Return the sequence length of the LoDTensor corresponding to LoD.

Returns:the sequence lengths.
Return type:out (List[List[int])
set_lod(self: paddle.fluid.core.LoDTensor, lod: List[List[int]]) → None

Set LoD of the LoDTensor.

Parameters:lod (List[List[int]]) – the lod to be set.
set_recursive_sequence_lengths(self: paddle.fluid.core.LoDTensor, recursive_sequence_lengths: List[List[int]]) → None

Set LoD of the LoDTensor according to recursive sequence length.

For example, if recursive_sequence_lengths=[[2, 3]], meaning that there are two sequences with length 2 and 3 respectively, the corresponding lod would be [[0, 2, 2+3]], i.e, [[0, 2, 5]].

Parameters:recursive_sequence_lengths (List[List[int]]) – sequence lengths.
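A minimal sketch tying these methods together:

import numpy as np
import paddle.fluid as fluid

t = fluid.LoDTensor()
t.set(np.random.random((5, 2)).astype('float32'), fluid.CPUPlace())
t.set_recursive_sequence_lengths([[2, 3]])
print(t.recursive_sequence_lengths())             # [[2, 3]]
print(t.lod())                                    # [[0, 2, 5]], offset-based
print(t.has_valid_recursive_sequence_lengths())   # True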

LoDTensorArray

class paddle.fluid.LoDTensorArray
append(self: paddle.fluid.core.LoDTensorArray, tensor: paddle.fluid.core.LoDTensor) → None

Append a LoDTensor to the LoDTensorArray.
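A brief sketch (the tensor contents are arbitrary):

import numpy as np
import paddle.fluid as fluid

arr = fluid.LoDTensorArray()
t = fluid.LoDTensor()
t.set(np.array([[1.0]], dtype='float32'), fluid.CPUPlace())
arr.append(t)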

memory_optimize

paddle.fluid.memory_optimize(input_program, skip_opt_set=None, print_log=False, level=0, skip_grads=False)[source]

Optimize memory by reusing var memory.

Note: it does not support a subblock nested within another subblock.
Parameters:
  • input_program (Program) – Input Program
  • skip_opt_set (set) – vars that will be skipped in memory optimization
  • print_log (bool) – whether to print the debug log.
  • level (int) – If level=0, reuse only if the shape is completely equal.
Returns:

None
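A brief sketch; memory_optimize is usually applied after the whole network (including the backward pass) has been built:

import paddle.fluid as fluid

# ... build the network and call optimizer.minimize(loss) first ...
fluid.memory_optimize(fluid.default_main_program(), print_log=False)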

name_scope

paddle.fluid.name_scope(prefix=None)[source]

Generate hierarchical name prefix for the operators.

Note: This should only be used for debugging and visualization purposes. Don't use it for serious analysis such as graph/program transformations.

Parameters:prefix (str) – prefix.

Examples

with name_scope("encoder"):
   ...
with name_scope("decoder"):
   ...
with name_scope("attention"):
   ...

ParallelExecutor

class paddle.fluid.ParallelExecutor(use_cuda, loss_name=None, main_program=None, share_vars_from=None, exec_strategy=None, build_strategy=None, num_trainers=1, trainer_id=0, scope=None)[source]

ParallelExecutor is designed for data parallelism, which focuses on distributing the data across different nodes while every node operates on the data in parallel. If you use ParallelExecutor to run the current program on GPUs, the node means a GPU device, and ParallelExecutor will automatically get the available GPU devices on the current machine. If you use ParallelExecutor to run the current program on CPUs, the node means a CPU device, and you can specify the number of CPU devices by setting the 'CPU_NUM' environment variable, for example 'CPU_NUM=4'; if the environment variable is not found, ParallelExecutor will call multiprocessing.cpu_count to get the number of CPUs in the system.

Parameters:
  • use_cuda (bool) – Whether to use CUDA or not.
  • loss_name (str) – The loss name, which must be set when training. Default None.
  • main_program (Program) – The program that need to run, if not provided, then default_main_program will be used. Default None.
  • share_vars_from (ParallelExecutor) – If provided, it will share variables from the specified ParallelExecutor. Default None.
  • exec_strategy (ExecutionStrategy) – exec_strategy is used to control how to run the program in ParallelExecutor, for example how many threads are used to execute the program, how many iterations to clean up the temp variables which is generated during execution. For more information, please refer to fluid.ExecutionStrategy. Default None.
  • build_strategy (BuildStrategy) – build_strategy is used to control how to build the SSA Graph in ParallelExecutor by setting the property, for example reduce_strategy, gradient_scale_strategy. For more information, please refer to fluid.BuildStrategy. Default None.
  • num_trainers (int) – If greater than 1, NCCL will be initialized with multiple ranks of nodes; each node should have the same number of GPUs. Distributed training will then be enabled. Default 1.
  • trainer_id (int) – Must be used together with num_trainers. trainer_id is the "rank" of the current node, starting from 0. Default 0.
  • scope (Scope) – scope to run with, default use fluid.global_scope().
Returns:

The initialized ParallelExecutor object.

Return type:

ParallelExecutor

Raises:

TypeError – If share_vars_from is provided but is not a ParallelExecutor object.

Examples

train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=loss.name)
test_exe = fluid.ParallelExecutor(use_cuda=True,
                                  main_program=test_program,
                                  share_vars_from=train_exe)

train_loss, = train_exe.run([loss.name], feed=feed_dict)
test_loss, = test_exe.run([loss.name], feed=feed_dict)
run(fetch_list, feed=None, feed_dict=None, return_numpy=True)[source]

Run a parallel executor with fetch_list.

The feed parameter can be a dict or a list. If feed is a dict, the feed data will be split across multiple devices. If feed is a list, we assume the data has already been split for multiple devices, and each element in the list will be copied to one device directly.

For example, if the feed is a dict:

>>> exe = ParallelExecutor()
>>> # the image will be split across devices. If there are two devices,
>>> # each device will process an image with shape (24, 1, 28, 28)
>>> exe.run(feed={'image': numpy.random.random(size=(48, 1, 28, 28))})

For example, if the feed is a list:

>>> exe = ParallelExecutor()
>>> # each device will process each element in the list.
>>> # the 1st device will process an image with shape (48, 1, 28, 28)
>>> # the 2nd device will process an image with shape (32, 1, 28, 28)
>>> #
>>> # you can use exe.device_count to get the device number.
>>> exe.run(feed=[{"image": numpy.random.random(size=(48, 1, 28, 28))},
>>>               {"image": numpy.random.random(size=(32, 1, 28, 28))},
>>>              ])
Parameters:
  • fetch_list (list) – The fetched variable names
  • feed (list|dict|None) – The feed variables. If the feed is a dict, the tensors in that dict will be split across the devices. If the feed is a list, each element of the list will be copied to one device. Default None.
  • feed_dict – Alias for feed parameter, for backward compatibility. This parameter has been deprecated. Default None.
  • return_numpy (bool) – Whether to convert the fetched tensors to numpy arrays. Default: True.
Returns:

The fetched result list.

Return type:

List

Raises:

ValueError – If the feed is a list but its length is not equal to the number of active places, or one of its elements is not a dict.

Notes

  1. If the feed's type is dict, the number of data samples fed to ParallelExecutor must be greater than the number of active places; otherwise, an exception will be thrown from the C++ side. Special attention should be paid to checking whether the last batch of the dataset is bigger than the number of active places.
  2. If there is more than one active place, the fetch result for each variable is a list, and each element of this list is the variable from the respective active place.

Examples

pe = fluid.ParallelExecutor(use_cuda=use_cuda,
                            loss_name=avg_cost.name,
                            main_program=fluid.default_main_program())
loss = pe.run(feed=feeder.feed(cur_batch),
              fetch_list=[avg_cost.name])

ParamAttr

class paddle.fluid.ParamAttr(name=None, initializer=None, learning_rate=1.0, regularizer=None, trainable=True, gradient_clip=None, do_model_average=False)[source]

Parameter attributes object. To fine-tune the network training process, users can set a parameter's attributes to control training details, such as the learning rate, regularization, trainability, do_model_average and the method used to initialize the parameter.

Parameters:
  • name (str) – The parameter’s name. Default None.
  • initializer (Initializer) – The method to initialize this parameter. Default None.
  • learning_rate (float) – The parameter’s learning rate. The learning rate when optimize is \(global\_lr * parameter\_lr * scheduler\_factor\). Default 1.0.
  • regularizer (WeightDecayRegularizer) – Regularization factor. Default None.
  • trainable (bool) – Whether this parameter is trainable. Default True.
  • gradient_clip (BaseGradientClipAttr) – The method to clip this parameter’s gradient. Default None.
  • do_model_average (bool) – Whether this parameter should do model average. Default False.

Examples

w_param_attrs = fluid.ParamAttr(name="fc_weight",
                                learning_rate=0.5,
                                regularizer=fluid.regularizer.L2Decay(1.0),
                                trainable=True)
x = fluid.layers.data(name='X', shape=[1], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=10, param_attr=w_param_attrs)

Program

class paddle.fluid.Program[source]

Python Program. Beneath it is a ProgramDesc, which is used to create the C++ Program. A Program is a self-contained, programming-language-like container. It has at least one Block; when a control flow op like conditional_block or while_op is included, it will contain nested blocks. Please refer to framework.proto for details.

Notes: we have default_startup_program and default_main_program by default, and the pair shares the parameters. The default_startup_program is run only once to initialize the parameters, while default_main_program is run in every mini-batch and adjusts the weights.

Returns:An empty program.

Examples

>>> main_program = fluid.Program()
>>> startup_program = fluid.Program()
>>> with fluid.program_guard(main_program=main_program, startup_program=startup_program):
>>>     fluid.layers.data(name="x", shape=[-1, 784], dtype='float32')
>>>     fluid.layers.data(name="y", shape=[-1, 1], dtype='int32')
>>>     fluid.layers.fc(name="fc", shape=[10], dtype='float32', act="relu")
op_role

The operator role. It is an enum {Forward, Backward, Optimize}.

Notes: this is a low level API. It is used only for ParallelExecutor to duplicate or schedule operator to devices.

For example, the forward operator should be executed on every device. The backward operator should be executed on every device and the parameter gradient of backward (use op_role_var to get this variable) operator should be merged to one device. The optimization operators should be executed on only one device and broadcast the optimization result, i.e., the new parameter, to every other device.

op_role_var

The auxiliary variables for op_role property.

See Also: Program.op_role's documentation for details.

Notes: This is a very low-level API. Users should not use it directly.

set_op_role_var

The auxiliary variables for op_role property.

See Also: Program.op_role's documentation for details.

Notes: This is a very low-level API. Users should not use it directly.

to_string(throw_on_error, with_details=False)[source]

To debug string.

Parameters:
  • throw_on_error (bool) – raise Value error when any of required fields is not set.
  • with_details (bool) – True if more details about variables and parameters, e.g., trainable, optimize_attr, need to print.
Returns:

The debug string.

Return type:

str

Raises:

ValueError – If any of required fields is not set and throw_on_error is True.
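A brief illustration:

prog = fluid.default_main_program()
print(prog.to_string(throw_on_error=True, with_details=False))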

clone(for_test=False)[source]

Create a new, duplicated program.

Some operators, e.g., batch_norm, behave differently between training and testing. They have an attribute, is_test, to control this behaviour. This method will change the is_test attribute of them to True when for_test=True.

  • Set for_test to False when we want to clone the program for training.
  • Set for_test to True when we want to clone the program for testing.

Notes: This API DOES NOT prune any operator. Please use clone(for_test=True) before backward and optimization, e.g.:

>>> test_program = fluid.default_main_program().clone(for_test=True)
>>> optimizer = fluid.optimizer.Momentum(learning_rate=0.01, momentum=0.9)
>>> optimizer.minimize()
Parameters:for_test (bool) – True if change the is_test attribute of operators to True.
Returns:The new, duplicated Program object.
Return type:Program

Examples

  1. To clone a test program, the sample code is:
>>> import paddle.fluid as fluid
>>> train_program = fluid.Program()
>>> startup_program = fluid.Program()
>>> with fluid.program_guard(train_program, startup_program):
>>>     img = fluid.layers.data(name='image', shape=[784])
>>>     hidden = fluid.layers.fc(input=img, size=200, act='relu')
>>>     hidden = fluid.layers.dropout(hidden, dropout_prob=0.5)
>>>     loss = fluid.layers.cross_entropy(
>>>                 input=fluid.layers.fc(hidden, size=10, act='softmax'),
>>>                 label=fluid.layers.data(name='label', shape=[1], dtype='int64'))
>>>
>>> test_program = train_program.clone(for_test=True)
>>>
>>> sgd = fluid.optimizer.SGD(learning_rate=1e-3)
>>> with fluid.program_guard(train_program, startup_program):
>>>     sgd.minimize(loss)

2. The clone method can be avoided if you create the program for training and the program for testing individually.

>>> import paddle.fluid as fluid
>>>
>>> def network(is_test):
>>>     img = fluid.layers.data(name='image', shape=[784])
>>>     hidden = fluid.layers.fc(input=img, size=200, act='relu')
>>>     hidden = fluid.layers.dropout(hidden, dropout_prob=0.5, is_test=is_test)
>>>     loss = fluid.layers.cross_entropy(
>>>                 input=fluid.layers.fc(hidden, size=10, act='softmax'),
>>>                 label=fluid.layers.data(name='label', shape=[1], dtype='int64'))
>>>     return loss
>>>
>>> train_program = fluid.Program()
>>> startup_program = fluid.Program()
>>> test_program = fluid.Program()
>>>
>>> with fluid.program_guard(train_program, startup_program):
>>>     with fluid.unique_name.guard():
>>>         loss = network(is_test=False)
>>>         sgd = fluid.optimizer.SGD(learning_rate=1e-3)
>>>         sgd.minimize(loss)
>>>
>>> # the test startup program is not used.
>>> with fluid.program_guard(test_program, fluid.Program()):
>>>     with fluid.unique_name.guard():
>>>         loss = network(is_test=True)

The two code snippets above will generate the same programs.

static parse_from_string(binary_str)[source]

Deserialize a program desc from protobuf binary string.

Notes: All information about parameters will be lost after serialization and deserialization.

Parameters:binary_str (str) – The binary protobuf string.
Returns:A deserialized program desc.
Return type:Program
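A brief sketch of a serialize/deserialize round trip (assumes Program.desc.serialize_to_string() is available for producing the binary string):

prog = fluid.default_main_program()
binary_str = prog.desc.serialize_to_string()
restored = fluid.Program.parse_from_string(binary_str)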
num_blocks

The number of blocks in this program.

random_seed

The default random seed for random operators in Program. Zero means get the random seed from random device.

Notes: It must be set before the operators have been added.

global_block()[source]

Get the first block of this program.

block(index)[source]

Get the index-th block of this program.

Parameters:index (int) – The index of the block to get.

Returns:The index block
Return type:Block
current_block()[source]

Get the current block. The current block is the block to which operators are appended.

list_vars()[source]

Get all variables from this Program. An iterable object is returned.

Returns:The generator will yield every variable in this program.
Return type:iterable
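A brief sketch of walking a program's blocks and variables:

prog = fluid.default_main_program()
first_block = prog.block(0)          # the same block as prog.global_block()
for var in prog.list_vars():
    print(var.name)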

program_guard

paddle.fluid.program_guard(main_program, startup_program=None)[source]

Change the global main program and startup program within a with statement. Layer functions in the Python with block will append operators and variables to the new main program.

Examples

>>> import paddle.fluid as fluid
>>> main_program = fluid.Program()
>>> startup_program = fluid.Program()
>>> with fluid.program_guard(main_program, startup_program):
>>>     data = fluid.layers.data(...)
>>>     hidden = fluid.layers.fc(...)

Notes: A temporary Program can be used if the user does not need to construct either the startup program or the main program.

Examples

>>> import paddle.fluid as fluid
>>> main_program = fluid.Program()
>>> # does not care about startup program. Just pass a temporary value.
>>> with fluid.program_guard(main_program, fluid.Program()):
>>>     data = ...
Parameters:
  • main_program (Program) – New main program inside with statement.
  • startup_program (Program) – New startup program inside with statement. None means do not change startup program.

release_memory

paddle.fluid.release_memory(input_program, skip_opt_set=None)[source]

Modify the input program and insert delete_op to drop unused variables early. The modification will be performed in place.

Notes: This is an experimental API and could be removed in the next few releases. Users should not use this API.

Parameters:
  • input_program (Program) – The program into which delete_op will be inserted.
  • skip_opt_set (set) – vars that will be skipped in memory optimization
Returns:

None

scope_guard

paddle.fluid.scope_guard(scope)[source]

Change the global/default scope instance via a Python with statement. All variables at runtime will be assigned to the new scope.

Examples

>>> import paddle.fluid as fluid
>>> new_scope = fluid.Scope()
>>> with fluid.scope_guard(new_scope):
>>>     ...
Parameters:scope – The new global/default scope.

Tensor

paddle.fluid.Tensor

alias of LoDTensor

WeightNormParamAttr

class paddle.fluid.WeightNormParamAttr(dim=None, name=None, initializer=None, learning_rate=1.0, regularizer=None, trainable=True, gradient_clip=None, do_model_average=False)[source]

Used for weight Norm. Weight Norm is a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. Weight Norm has been implemented as discussed in this paper: Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks.

Parameters:
  • dim (int) – The dimension over which to compute the weight norm. Default None.
  • name (str) – The parameter’s name. Default None.
  • initializer (Initializer) – The method to initialize this parameter. Default None.
  • learning_rate (float) – The parameter’s learning rate. The learning rate when optimize is \(global\_lr * parameter\_lr * scheduler\_factor\). Default 1.0.
  • regularizer (WeightDecayRegularizer) – Regularization factor. Default None.
  • trainable (bool) – Whether this parameter is trainable. Default True.
  • gradient_clip (BaseGradientClipAttr) – The method to clip this parameter’s gradient. Default None.
  • do_model_average (bool) – Whether this parameter should do model average. Default False.

Examples

data = fluid.layers.data(name="data", shape=[3, 32, 32], dtype="float32")
fc = fluid.layers.fc(input=data,
                     size=1000,
                     param_attr=WeightNormParamAttr(
                          dim=None,
                          name='weight_norm_param'))