Best practices of distributed training on CPU

To speed up distributed training on CPU, consider two aspects:

  1. Improve computation speed, mainly by raising CPU utilization;
  2. Improve communication speed, mainly by reducing the amount of data transmitted.

Improve CPU utilization

CPU utilization depends mainly on ParallelExecutor, which can use the computing power of multiple CPUs to speed up computation.

For detailed API usage, please refer to the ParallelExecutor documentation. A simple example:

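import os
import paddle.fluid as fluid

# Configure the execution strategy, mainly to set the number of threads
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = 8

# Configure the build strategy; for CPU training, use the Reduce mode.
build_strategy = fluid.BuildStrategy()
if int(os.getenv("CPU_NUM", "1")) > 1:
    build_strategy.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.Reduce

# avg_cost is assumed to be the loss variable defined by your model
pe = fluid.ParallelExecutor(
    use_cuda=False,
    loss_name=avg_cost.name,
    build_strategy=build_strategy,
    exec_strategy=exec_strategy)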

Among the parameters above:

  • num_threads: the number of threads used for model training. It should preferably be close to the number of physical CPU cores on the machine where training runs.
  • reduce_strategy: for CPU training, choose fluid.BuildStrategy.ReduceStrategy.Reduce.

Configuration of general environment variables:

  • CPU_NUM: the number of replicas of the model, preferably the same as num_threads.
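
As a rough sketch of how these two settings can be kept in sync, the helper below derives a thread count from the machine's core count and exports it as CPU_NUM before any training code reads it. The function name configure_cpu_parallelism and its max_threads parameter are hypothetical and not part of PaddlePaddle. Note also that os.cpu_count() reports logical cores, which may exceed the number of physical cores on machines with hyper-threading, so treat the result as a starting point rather than a tuned value.

```python
import os


def configure_cpu_parallelism(max_threads=None):
    """Pick a thread count and export it as CPU_NUM (hypothetical helper).

    os.cpu_count() returns the number of logical cores and may return None
    when the count cannot be determined, so fall back to 1 in that case.
    """
    num_threads = os.cpu_count() or 1
    if max_threads is not None:
        # Optionally cap the value, e.g. to leave cores free for data loading.
        num_threads = max(1, min(num_threads, max_threads))
    # Environment variables are always strings; readers convert with int().
    os.environ["CPU_NUM"] = str(num_threads)
    return num_threads


num_threads = configure_cpu_parallelism(max_threads=8)
print("num_threads =", num_threads, "CPU_NUM =", os.environ["CPU_NUM"])
```

The returned value could then be assigned to exec_strategy.num_threads, so that the thread count and the number of model replicas stay equal.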

Improve communication speed

Reducing the amount of data communicated, and thereby improving communication speed, is achieved mainly through sparse updates. Currently, sparse updates are supported mainly for embedding.

# dict_size is the size of the embedding dictionary, defined elsewhere
data = fluid.layers.data(name='ids', shape=[1], dtype='int64')
fc = fluid.layers.embedding(input=data, size=[dict_size, 16], is_sparse=True)

Among the parameters above:

  • is_sparse: configures the embedding to use sparse updates. If the dict_size of the embedding is large but each batch touches only a small number of entries, sparse updates are recommended.
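
To make the saving concrete, the sketch below estimates how many gradient bytes one update would carry for a dense versus a sparse embedding gradient. It is back-of-the-envelope arithmetic in plain Python, not PaddlePaddle's actual wire format: the function names, the 4-byte float32 and 8-byte int64 sizes, the deduplication of ids, and the example dict_size and batch are all illustrative assumptions.

```python
def dense_grad_bytes(dict_size, emb_dim, float_bytes=4):
    """A dense gradient has one row per dictionary entry, touched or not."""
    return dict_size * emb_dim * float_bytes


def sparse_grad_bytes(ids, emb_dim, float_bytes=4, id_bytes=8):
    """A sparse gradient carries only the distinct rows that were looked up,
    plus the row ids needed to apply them (assumed to be int64)."""
    touched_rows = len(set(ids))
    return touched_rows * (emb_dim * float_bytes + id_bytes)


dict_size = 1_000_000                     # assumed dictionary size
emb_dim = 16                              # matches size=[dict_size, 16]
batch_ids = [3, 17, 17, 42, 99_999, 3]    # assumed ids seen in one batch

dense = dense_grad_bytes(dict_size, emb_dim)
sparse = sparse_grad_bytes(batch_ids, emb_dim)
print("dense :", dense, "bytes")    # 64000000
print("sparse:", sparse, "bytes")   # 288
```

Under these assumptions the sparse gradient is several orders of magnitude smaller, which is why the gap widens as dict_size grows while the number of ids per batch stays small.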