1. Launching parallel training on a single machine with multiple GPUs

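PaddlePaddle 2.0 adds the paddle.distributed.spawn function for launching single-machine multi-GPU training. The existing paddle.distributed.launch method is still supported.

paddle.distributed.launch takes a program file and starts multiple processes from that file to run synchronous multi-GPU training. The AI Studio script-task instructions used to recommend this method for multi-GPU jobs. launch places higher demands on process management.
paddle.distributed.spawn starts multiple processes from a function rather than a file. It gives better control over the processes and behaves more gracefully when printing logs and when training exits. This is the currently recommended approach.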
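The file-level idea behind launch can be illustrated with the standard library alone: a launcher starts several copies of the same program and tells each copy which rank it is through environment variables. The sketch below is a toy illustration, not how paddle.distributed.launch is implemented; `toy_launch` and `WORKER_SOURCE` are hypothetical names, and the two variable names are the ones that appear in the launch logs later in this article.

```python
import os
import subprocess
import sys

# Stand-in for a training script: each copy reports the rank it was given
WORKER_SOURCE = (
    "import os\n"
    "print('rank', os.environ['PADDLE_TRAINER_ID'],"
    " 'of', os.environ['PADDLE_TRAINERS_NUM'])\n"
)

def toy_launch(nprocs):
    """Start nprocs copies of the same program, each with its own rank."""
    procs = []
    for rank in range(nprocs):
        # Every copy inherits the environment plus its own rank settings
        env = dict(os.environ,
                   PADDLE_TRAINER_ID=str(rank),
                   PADDLE_TRAINERS_NUM=str(nprocs))
        procs.append(subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            env=env, stdout=subprocess.PIPE, universal_newlines=True))
    # Wait for every worker and collect what it printed
    return [proc.communicate()[0].strip() for proc in procs]

if __name__ == "__main__":
    print(toy_launch(2))  # ['rank 0 of 2', 'rank 1 of 2']
```

spawn differs in the unit that is replicated: it is handed a Python function instead of a program file, so the parent process keeps direct handles on its workers.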

The two methods are described in turn below.

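1.1 Single-machine multi-GPU launch method 1: launch

1.1.1 Using the high-level API

When training is implemented with the paddle.Model high-level API, starting single-machine multi-GPU training is very simple: the code needs no changes, and you only add -m paddle.distributed.launch to the launch command.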

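    # Single machine, single card: uses card 0 by default
    $ python train.py
    # Single machine, multiple cards: uses all currently visible cards by default
    $ python -m paddle.distributed.launch train.py
    # Single machine, multiple cards: use cards 0 and 1
    $ python -m paddle.distributed.launch --selected_gpus='0,1' train.py
    # Single machine, multiple cards: use cards 0 and 1, set through the environment
    $ export CUDA_VISIBLE_DEVICES='0,1'
    $ python -m paddle.distributed.launch train.py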
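
The environment-variable form only takes effect when the name is spelled exactly CUDA_VISIBLE_DEVICES; a misspelled name is silently ignored by whatever reads it. The following is a minimal sketch of how a program could read that variable. `visible_gpus` is a hypothetical helper for illustration, not part of Paddle, and it assumes the cards are listed as plain integer ids.

```python
import os

def visible_gpus(environ=None):
    """Return the card ids listed in CUDA_VISIBLE_DEVICES, or None if unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # variable not set: no restriction on visible cards
    # Assumption: ids are plain integers separated by commas
    return [int(part) for part in raw.split(",") if part.strip()]

print(visible_gpus({"CUDA_VISIBLE_DEVICES": "0,1"}))   # [0, 1]
print(visible_gpus({"CUDA_VISIABLE_DEVICES": "0,1"}))  # None: the misspelled name is never read
```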

Below is example code that uses the high-level API. Running the cell writes hapitrain.py to the root directory, and the training can then be started with python.

%%writefile hapitrain.py
import paddle
from paddle.vision.transforms import ToTensor

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())
lenet = paddle.vision.models.LeNet()

# LeNet inherits from paddle.nn.Layer and is the network; model adds the training functionality
model = paddle.Model(lenet)

# Set the optimizer, loss and metric used to train the model
model.prepare(
    paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()),
    paddle.nn.CrossEntropyLoss(),
    paddle.metric.Accuracy(topk=(1, 2)))

# Start training
model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)

# Start evaluation
model.evaluate(test_dataset, log_freq=100, batch_size=64)

Single machine, single card, using card 0 by default:

# Single machine, single card: uses card 0 by default
!python hapitrain.py

Begin to download
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/train-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/train-labels-idx1-ubyte.gz 
Begin to download
........
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-images-idx3-ubyte.gz 
Begin to download
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-labels-idx1-ubyte.gz 
Begin to download
..
Download finished
W0628 15:25:11.488023   114 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:25:11.614305   114 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0555 - acc_top1: 0.9217 - acc_top2: 0.9649 - 50ms/step
step 800/938 - loss: 0.0300 - acc_top1: 0.9454 - acc_top2: 0.9782 - 39ms/step
step 938/938 - loss: 0.0213 - acc_top1: 0.9498 - acc_top2: 0.9803 - 38ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0057 - acc_top1: 0.9731 - acc_top2: 0.9927 - 28ms/step
step 157/157 - loss: 0.0013 - acc_top1: 0.9785 - acc_top2: 0.9945 - 28ms/step
Eval samples: 10000
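
The step counts in this log follow from the dataset size and the batch size: MNIST has 60000 training images and the log reports 10000 evaluation samples, so batch_size=64 gives 938 and 157 steps. The sketch below reproduces that arithmetic. The per-card figure assumes the samples are divided evenly between cards, which is an assumption about how the data is sharded rather than something this log shows; `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_samples, batch_size, num_cards=1):
    """Steps each card runs per epoch when samples are split evenly across cards."""
    samples_per_card = math.ceil(num_samples / num_cards)
    return math.ceil(samples_per_card / batch_size)

print(steps_per_epoch(60000, 64))               # 938, matches "step 938/938"
print(steps_per_epoch(10000, 64))               # 157, matches "step 157/157"
print(steps_per_epoch(60000, 64, num_cards=2))  # 469 under the even-split assumption
```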

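Single machine, multiple cards, using all currently visible cards by default:

# Single machine, multiple cards: uses all currently visible cards by default
!python -m paddle.distributed.launch hapitrain.py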

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
-----------  Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers: 
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers: 
training_script: hapitrain.py
training_script_args: []
worker_num: None
workers: 
------------------------------------------------
WARNING 2021-06-28 15:26:17,473 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:26:17,475 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
|                        Distributed Envs                      Value                    |
+---------------------------------------------------------------------------------------+
|                       PADDLE_TRAINER_ID                        0                      |
|                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:35079               |
|                     PADDLE_TRAINERS_NUM                        1                      |
|                PADDLE_TRAINER_ENDPOINTS                 127.0.0.1:35079               |
|                     FLAGS_selected_gpus                        0                      |
+=======================================================================================+
INFO 2021-06-28 15:26:17,475 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
W0628 15:26:24.305920   285 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:26:24.311555   285 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0586 - acc_top1: 0.9130 - acc_top2: 0.9611 - 38ms/step
step 800/938 - loss: 0.0288 - acc_top1: 0.9397 - acc_top2: 0.9759 - 39ms/step
step 938/938 - loss: 0.0545 - acc_top1: 0.9448 - acc_top2: 0.9785 - 40ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0035 - acc_top1: 0.9677 - acc_top2: 0.9911 - 36ms/step
step 157/157 - loss: 0.0057 - acc_top1: 0.9723 - acc_top2: 0.9929 - 36ms/step
Eval samples: 10000
INFO 2021-06-28 15:27:26,569 launch.py:240] Local processes completed.
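
The "Distributed Envs" table in the log lists environment variables that launch sets for each worker process. A training script can read them to find out its own rank and the total number of workers. The following is a minimal sketch that relies only on the variable names printed in the log above; `read_dist_env` is a hypothetical helper, not a Paddle API, and it assumes that multiple endpoints are separated by commas.

```python
import os

def read_dist_env(environ=None):
    """Collect the per-process settings that the launcher exports."""
    environ = os.environ if environ is None else environ
    # Assumption: several endpoints would be joined with commas
    endpoints = environ.get("PADDLE_TRAINER_ENDPOINTS", "")
    return {
        "rank": int(environ.get("PADDLE_TRAINER_ID", "0")),
        "world_size": int(environ.get("PADDLE_TRAINERS_NUM", "1")),
        "current_endpoint": environ.get("PADDLE_CURRENT_ENDPOINT"),
        "endpoints": [e for e in endpoints.split(",") if e],
    }

# Values copied from the single-card log above
example = {
    "PADDLE_TRAINER_ID": "0",
    "PADDLE_CURRENT_ENDPOINT": "127.0.0.1:35079",
    "PADDLE_TRAINERS_NUM": "1",
    "PADDLE_TRAINER_ENDPOINTS": "127.0.0.1:35079",
}
info = read_dist_env(example)
print(info["rank"], info["world_size"], info["endpoints"])  # 0 1 ['127.0.0.1:35079']
```

Inside a Paddle program, paddle.distributed.get_rank() and paddle.distributed.get_world_size() are the usual way to obtain the rank and the number of workers.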

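Single machine, multiple cards, restricted to cards 0 and 1. This also runs in AI Studio's single-card environment, which suggests that launch is fairly tolerant of such settings:

# Single machine, multiple cards: use cards 0 and 1. Also runs on AI Studio with a single card
# (VAR=value placed directly before the command passes it to that process; the name must be spelled CUDA_VISIBLE_DEVICES)
!CUDA_VISIBLE_DEVICES='0,1' python -m paddle.distributed.launch hapitrain.py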

-----------  Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers: 
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers: 
training_script: hapitrain.py
training_script_args: []
worker_num: None
workers: 
------------------------------------------------
WARNING 2021-06-28 15:28:10,632 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:28:10,637 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
|                        Distributed Envs                      Value                    |
+---------------------------------------------------------------------------------------+
|                       PADDLE_TRAINER_ID                        0                      |
|                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:46909               |
|                     PADDLE_TRAINERS_NUM                        1                      |
|                PADDLE_TRAINER_ENDPOINTS                 127.0.0.1:46909               |
|                     FLAGS_selected_gpus                        0                      |
+=======================================================================================+
INFO 2021-06-28 15:28:10,637 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
W0628 15:28:19.819196   448 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:28:19.905493   448 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0376 - acc_top1: 0.9136 - acc_top2: 0.9610 - 37ms/step
step 800/938 - loss: 0.0159 - acc_top1: 0.9423 - acc_top2: 0.9764 - 35ms/step
step 938/938 - loss: 0.0444 - acc_top1: 0.9479 - acc_top2: 0.9791 - 35ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0039 - acc_top1: 0.9767 - acc_top2: 0.9939 - 36ms/step
step 157/157 - loss: 0.0029 - acc_top1: 0.9815 - acc_top2: 0.9952 - 35ms/step
Eval samples: 10000
INFO 2021-06-28 15:29:19,766 launch.py:240] Local processes completed.

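1.1.2 Using the basic API

To start single-machine multi-GPU training from code written with the basic API, the single-card code needs three changes. Compare the unmodified and modified versions below.

The three changes:

Change 1: import the library

import paddle.distributed as dist

Change 2: initialize the parallel environment

dist.init_parallel_env()

Change 3: wrap the network in paddle.DataParallel

net = paddle.DataParallel(paddle.vision.models.LeNet())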

import paddle  # unmodified version
from paddle.vision.transforms import ToTensor

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())
lenet = paddle.vision.models.LeNet()

# Load the training set with batch_size set to 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train():
    epochs = 1
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=lenet.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data, y_data = data
            predicts = lenet(x_data)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            acc = paddle.metric.accuracy(predicts, y_data, k=1)
            avg_acc = paddle.mean(acc)
            loss.backward()
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))
            adam.step()
            adam.clear_grad()

# Start training
train()

> epoch: 0, batch_id: 0, loss is: [2.7922328], acc is: [0.15625]
> epoch: 0, batch_id: 400, loss is: [0.10373791], acc is: [0.96875]
> epoch: 0, batch_id: 800, loss is: [0.01435608], acc is: [1.]

This is the basic-API version with the three changes.
As before, first save it to the root directory with the %%writefile normaltrain.py command.

%%writefile normaltrain.py
import paddle  # this is the version with the three changes
from paddle.vision.transforms import ToTensor
import paddle.distributed as dist  # change 1: import the library

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# Load the training set with batch_size set to 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train():
    # Change 2: initialize the parallel environment
    dist.init_parallel_env()

    # Change 3: wrap the network in paddle.DataParallel
    net = paddle.DataParallel(paddle.vision.models.LeNet())  # the manual omits the full module path of LeNet here
    epochs = 1
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the manual mistakenly writes avg_loss here
            if batch_id % 400 == 0:
                # the manual mistakenly writes avg_loss here
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))
            adam.step()
            adam.clear_grad()

# Start training
train()

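# Single machine, single card: uses card 0 by default. Running the modified code this way raises an error
# !python normaltrain.py
# Single machine, multiple cards: uses all currently visible cards by default
!python -m paddle.distributed.launch normaltrain.py
# Single machine, multiple cards: use cards 0 and 1. launch uses whatever cards are available, and a single-card machine does not raise an error
!CUDA_VISIBLE_DEVICES='0,1' python -m paddle.distributed.launch normaltrain.py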
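
Note that paddle.DataParallel takes care of synchronizing gradients, but the plain DataLoader in normaltrain.py still lets every process iterate over the full training set. In data-parallel training each process normally reads its own slice of the data, and Paddle provides paddle.io.DistributedBatchSampler for that purpose. The sketch below only illustrates the idea in plain Python under simplified assumptions (no shuffling, padding by wrapping around, at least as many samples as ranks); `shard_indices` is a hypothetical helper, not Paddle's implementation.

```python
def shard_indices(num_samples, num_ranks, rank):
    """Return the sample indices that one rank reads in an epoch."""
    indices = list(range(num_samples))
    # Pad by wrapping around so every rank gets the same number of samples
    per_rank = -(-num_samples // num_ranks)  # ceiling division
    total = per_rank * num_ranks
    indices += indices[: total - num_samples]
    # Interleave: rank r takes positions r, r + num_ranks, r + 2 * num_ranks, ...
    return indices[rank:total:num_ranks]

print(shard_indices(10, 2, 0))  # [0, 2, 4, 6, 8]
print(shard_indices(10, 2, 1))  # [1, 3, 5, 7, 9]
print(shard_indices(5, 2, 1))   # [1, 3, 0] (the last index is padding)
```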

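1.2 Single-machine multi-GPU launch method 2: spawn (recommended)

Much like putting an item into a box to ship it, you only need to hand the train function that should run in parallel to paddle.distributed.spawn. The calls look like this:

import paddle.distributed as dist

# Start multi-process training of train, using all visible GPUs by default
if __name__ == '__main__':
    dist.spawn(train)

# Start train in 2 processes, using the first 2 currently visible cards by default
if __name__ == '__main__':
    dist.spawn(train, nprocs=2)

# Start train in 2 processes on cards 4 and 5
if __name__ == '__main__':
    dist.spawn(train, nprocs=2, selected_gpus='4,5')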

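Basic API (whether or not the code is changed as in the launch section): this raises an error in an AI Studio notebook, but works in a real multi-GPU environment and from the AI Studio command line.
High-level API: this raises an error in an AI Studio notebook, but works from the AI Studio command line.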

%%writefile normal3spawn.py
import paddle  # this is the version with the three changes
from paddle.vision.transforms import ToTensor
import paddle.distributed as dist  # change 1: import the library

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# Load the training set with batch_size set to 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train():
    # Change 2: initialize the parallel environment
    dist.init_parallel_env()

    # Change 3: wrap the network in paddle.DataParallel
    net = paddle.DataParallel(paddle.vision.models.LeNet())  # the manual omits the full module path of LeNet here
    epochs = 1
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the manual mistakenly writes avg_loss here
            if batch_id % 400 == 0:
                # the manual mistakenly writes avg_loss here
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))
            adam.step()
            adam.clear_grad()

# Start multi-process training of train, using all visible GPUs by default
if __name__ == '__main__':
    dist.spawn(train)

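1.3 Brief summary of single-machine multi-GPU training

The errors that spawn raises inside a notebook are probably caused by the notebook's restrictions on process management (this is a guess). Running from the command line, or from a cell with a leading exclamation mark, works without problems.

With spawn the body of the training code does not need to change; you only add the dist.spawn(train) call, which in effect wraps the training code in a multi-process shell. It is simple and convenient, and it is the recommended way to set up single-machine multi-GPU training.

Where spawn is not supported, fall back to launch for single-machine multi-GPU training.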

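PaddlePaddle's full set of parallel modes:

Data parallelism: for data parallelism, the mode most widely used in industry, PaddlePaddle has refined several techniques around real business needs. It offers both a collective-communication architecture and a parameter-server architecture, supports the synchronous and asynchronous training mechanisms common in industrial practice, and provides distributed optimization algorithms whose convergence is assured.
Pipeline parallelism: aimed at heterogeneous hardware, pipeline parallelism splits the model computation across different devices and pipelines it fully, which greatly improves the overall utilization of heterogeneous hardware.
Model parallelism: for very large-scale classification problems, PaddlePaddle provides model parallelism in which both computation and storage are parallelized, handling problems that a single GPU cannot.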
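
In synchronous data-parallel training with collective communication, each card computes gradients on its own mini-batch, and the gradients are then averaged across cards so that every replica applies the same update. The following is a minimal plain-Python sketch of that averaging step with made-up gradient values. Real frameworks perform it with an all-reduce across the GPUs; `allreduce_mean` is a hypothetical name used only here.

```python
def allreduce_mean(per_card_grads):
    """Average gradients element-wise across cards and return one copy per card."""
    num_cards = len(per_card_grads)
    # zip(*...) groups the values that belong to the same parameter
    summed = [sum(values) for values in zip(*per_card_grads)]
    mean = [value / num_cards for value in summed]
    return [list(mean) for _ in range(num_cards)]

# Two cards, each holding gradients for the same three parameters
grads = [[0.5, -1.0, 2.0], [1.5, 1.0, 0.0]]
print(allreduce_mean(grads))  # [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
```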

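1.4 Launching distributed jobs with fleetrun

1.4.1 Launching distributed jobs with fleetrun

Paddle provides the command-line launcher fleetrun. Together with Paddle's high-level distributed API, paddle.distributed.fleet, it makes it easy to start distributed jobs in either collective-communication mode or parameter-server mode. fleetrun can be used with both static graphs and dynamic graphs.

Note: at present, dynamic-graph distributed training started through paddle.distributed.fleet supports only the collective communication (Collective Communication) mode, not the parameter server (Parameter-Server) mode.

GPU single-machine multi-GPU training

To start a job on 4 cards of a single machine, just specify 4 idle cards with --gpus.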

    fleetrun --gpus=0,1,2,3 train.py

Note: if export CUDA_VISIBLE_DEVICES=0,1,2,3 has been set, you can simply use:

    export CUDA_VISIBLE_DEVICES=0,1,2,3
    fleetrun train.py

GPU multi-machine multi-GPU training

[Example 1] 2 machines, 8 cards (4 cards per node)

    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus=0,1,2,3 train.py

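Note: if export CUDA_VISIBLE_DEVICES=0,1,2,3 has been set on every machine, you can launch directly on each node:

    export CUDA_VISIBLE_DEVICES=0,1,2,3
    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" train.py

[Example 2] 2 machines, 16 cards (8 cards per node, assuming all 8 cards on each machine are available)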

    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" train.py
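
The fleetrun commands above differ only in their --ips and --gpus options. As a minimal sketch, the helper below assembles such a command and computes how many worker processes it would start in total, assuming one process per card. `build_fleetrun_cmd` and `total_workers` are hypothetical helpers for illustration, and the IP addresses are placeholders.

```python
def build_fleetrun_cmd(script, ips=None, gpus=None):
    """Assemble a fleetrun command line from optional node IPs and card ids."""
    cmd = ["fleetrun"]
    if ips:
        cmd.append("--ips=" + ",".join(ips))
    if gpus:
        cmd.append("--gpus=" + ",".join(str(g) for g in gpus))
    cmd.append(script)
    return cmd

def total_workers(num_nodes, gpus_per_node):
    """Assumption: one worker process per card on every node."""
    return num_nodes * gpus_per_node

cmd = build_fleetrun_cmd("train.py", ips=["10.0.0.1", "10.0.0.2"], gpus=[0, 1, 2, 3])
print(" ".join(cmd))        # fleetrun --ips=10.0.0.1,10.0.0.2 --gpus=0,1,2,3 train.py
print(total_workers(2, 4))  # 8, the 2-machine 8-card case
print(total_workers(2, 8))  # 16, the 2-machine 16-card case
```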

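1.4.2 Single-machine multi-GPU training with Fleet

Dynamic-graph distributed training with the Fleet interface is quite simple. Code written with the basic API needs only three changes:

Import the paddle.distributed.fleet package
```
  from paddle.distributed import fleet
```
Initialize the fleet environment
```
  fleet.init(is_collective=True)
```

Obtain the distributed optimizer and the distributed model through fleet

  strategy = fleet.DistributedStrategy()
  adam = fleet.distributed_optimizer(adam, strategy=strategy)
  dp_layer = fleet.distributed_model(layer)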

Example from the Fleet manual

%%writefile train_fleet.py
# -*- coding: UTF-8 -*-
import paddle
import paddle.nn as nn
# Distributed step 1: import the paddle.distributed.fleet package
from paddle.distributed import fleet

# Define a fully connected network; it must inherit from nn.Layer
class LinearNet(nn.Layer):
    def __init__(self):
        super(LinearNet, self).__init__()
        self._linear1 = nn.Linear(10, 10)
        self._linear2 = nn.Linear(10, 1)

    def forward(self, x):
        return self._linear2(self._linear1(x))

# 1. Enable dynamic-graph mode
paddle.disable_static()

# Distributed step 2: initialize fleet
fleet.init(is_collective=True)

# 2. Define the network object, loss function and optimizer
layer = LinearNet()
loss_fn = nn.MSELoss()
adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=layer.parameters())

# Distributed step 3: obtain the distributed optimizer and distributed model through fleet
strategy = fleet.DistributedStrategy()
adam = fleet.distributed_optimizer(adam, strategy=strategy)
dp_layer = fleet.distributed_model(layer)

for step in range(20):
    # 3. Run the forward pass
    inputs = paddle.randn([10, 10], 'float32')
    outputs = dp_layer(inputs)
    labels = paddle.randn([10, 1], 'float32')
    loss = loss_fn(outputs, labels)
    print("step:{}\tloss:{}".format(step, loss.numpy()))
    # 4. Run the backward pass and update the parameters
    loss.backward()
    adam.step()
    adam.clear_grad()

!fleetrun --gpus=0 train_fleet.py

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
-----------  Configuration Arguments -----------
gpus: 0
heter_worker_num: None
heter_workers: 
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers: 
training_script: train_fleet.py
training_script_args: []
worker_num: None
workers: 
------------------------------------------------
WARNING 2021-06-28 15:56:16,986 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:56:16,990 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
|                        Distributed Envs                      Value                    |
+---------------------------------------------------------------------------------------+
|                       PADDLE_TRAINER_ID                        0                      |
|                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:47263               |
|                     PADDLE_TRAINERS_NUM                        1                      |
|                PADDLE_TRAINER_ENDPOINTS                 127.0.0.1:47263               |
|                     FLAGS_selected_gpus                        0                      |
+=======================================================================================+
INFO 2021-06-28 15:56:16,991 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
W0628 15:56:18.760403  1539 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:56:18.826562  1539 device_context.cc:372] device: 0, cuDNN Version: 7.6.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py:633: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
  "It is recommended to use DistributedStrategy "
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py:423: UserWarning: The program will return to single-card operation. Please check 1, whether you use spawn or fleetrun to start the program. 2, Whether it is a multi-card program. 3, Is the current environment multi-card.
  warnings.warn("The program will return to single-card operation. "
step:0	loss:[2.747072]
step:1	loss:[3.9464068]
step:2	loss:[3.3363562]
step:3	loss:[1.7597802]
step:4	loss:[2.4984336]
step:5	loss:[1.3766874]
step:6	loss:[3.3678422]
step:7	loss:[1.8410085]
step:8	loss:[1.6417965]
step:9	loss:[4.009201]
step:10	loss:[1.7387416]
step:11	loss:[1.6013482]
step:12	loss:[1.6388085]
step:13	loss:[3.7573469]
step:14	loss:[0.9461777]
step:15	loss:[2.4906065]
step:16	loss:[2.613153]
step:17	loss:[2.8367076]
step:18	loss:[2.170548]
step:19	loss:[2.2705061]
INFO 2021-06-28 15:56:35,049 launch.py:240] Local processes completed.

2. Handwritten digit recognition with Fleet: several API versions

2.1 Handwritten digit recognition: basic API, Fleet version

%%writefile normal_fleet.py
import paddle  # this is the version with the three changes
from paddle.vision.transforms import ToTensor
# Distributed step 1: import the paddle.distributed.fleet package
from paddle.distributed import fleet

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# Load the training set with batch_size set to 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

# Distributed step 2: initialize fleet
fleet.init(is_collective=True)

def train():
    epochs = 1
    net = paddle.vision.models.LeNet()
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    # Distributed step 3: obtain the distributed optimizer and distributed model through fleet
    strategy = fleet.DistributedStrategy()
    adam = fleet.distributed_optimizer(adam, strategy=strategy)
    net = fleet.distributed_model(net)
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the manual mistakenly writes avg_loss here
            if batch_id % 400 == 0:
                # the manual mistakenly writes avg_loss here
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))
            adam.step()
            adam.clear_grad()

if __name__ == '__main__':
    train()

!fleetrun --gpus=0 normal_fleet.py

+=======================================================================================+
|                        Distributed Envs                      Value                    |
+---------------------------------------------------------------------------------------+
|                       PADDLE_TRAINER_ID                        0                      |
|                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:42501               |
|                     PADDLE_TRAINERS_NUM                        1                      |
|                PADDLE_TRAINER_ENDPOINTS                 127.0.0.1:42501               |
|                     FLAGS_selected_gpus                        0                      |
+=======================================================================================+
epoch: 0, batch_id: 0, loss is: [2.5425684], acc is: [0.234375]
epoch: 0, batch_id: 400, loss is: [0.05207598], acc is: [1.]
epoch: 0, batch_id: 800, loss is: [0.04818164], acc is: [1.]

2.2 Handwritten digit recognition: high-level API, Fleet version

%%writefile hapi_fleet.py
import paddle
from paddle.vision.transforms import ToTensor
import paddle.distributed as dist

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())
lenet = paddle.vision.models.LeNet()

# LeNet inherits from paddle.nn.Layer and is the network; model adds the training functionality
model = paddle.Model(lenet)

# Set the optimizer, loss and metric used to train the model
model.prepare(
    paddle.optimizer.Adam(learning_rate=0.1, parameters=model.parameters()),
    paddle.nn.CrossEntropyLoss(),
    paddle.metric.Accuracy(topk=(1, 2)))

def train():
    # Start training
    # With VisualDL visualization
    callback = paddle.callbacks.VisualDL(log_dir='visualdl_log')
    model.fit(train_dataset, epochs=1, batch_size=64, callbacks=callback, log_freq=400)
    # Without VisualDL visualization
    # model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)
    # Start evaluation
    # model.evaluate(test_dataset, log_freq=20, batch_size=64)

if __name__ == '__main__':
    train()

!fleetrun hapi_fleet.py

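2.3 Multi-machine multi-GPU handwritten digit recognition

Going from single-machine multi-GPU to multi-machine multi-GPU training needs no code changes at all; only the launch command changes. Taking 2 machines with 4 cards in total as an example:

    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus=0,1 dygraph_fleet.py

Run the command above on each of the 2 machines. fleetrun then starts a multi-process job in the background on each of them and runs the distributed multi-machine training. The log output appears in the console of the machine whose IP is xx.xx.xx.xx.

AI Studio is again used below to demonstrate the multi-machine command form. With only one machine available, the command is pointed at 127.0.0.1. Run it directly: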

!fleetrun --ips="127.0.0.1" --gpus=0 normal_fleet.py
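
For a 2-machine, 4-card command of the form shown in this section, one worker process runs per card on each machine. The sketch below enumerates those workers. The numbering used here (machines in the order given to --ips, cards in the order given to --gpus) is an assumption made for illustration; the ids actually assigned are printed in each worker's "Distributed Envs" table. `enumerate_workers` is a hypothetical helper and the IP addresses are placeholders.

```python
def enumerate_workers(ips, gpus):
    """List (trainer_id, ip, gpu) for every worker, numbering node by node."""
    workers = []
    for ip in ips:
        for gpu in gpus:
            # The next free id is simply the number of workers listed so far
            workers.append((len(workers), ip, gpu))
    return workers

for worker in enumerate_workers(["10.0.0.1", "10.0.0.2"], [0, 1]):
    print(worker)
# (0, '10.0.0.1', 0)
# (1, '10.0.0.1', 1)
# (2, '10.0.0.2', 0)
# (3, '10.0.0.2', 1)
```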

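3. Summary of parallel computing in PaddlePaddle 2.0

PaddlePaddle 2.0 offers a complete solution for parallel computing, and it is a training framework that has been validated on very large-scale business data. Parallel computing really is this simple!

3.1 For single-machine multi-GPU training, spawn is the first recommendation

The advantage of spawn is that the code needs almost no changes: import paddle.distributed and call the training function through spawn at the end. spawn also gives better control over the processes and behaves more gracefully when printing logs and when training exits.

Only these two statements need to be added to the program:

    import paddle.distributed as dist
    if __name__ == '__main__':
        dist.spawn(train)

Then start training directly with python train.py.

3.2 For multi-machine multi-GPU training, use fleet

A program written with the basic API needs three changes:

Import the paddle.distributed.fleet package
```
  from paddle.distributed import fleet
```
Initialize the fleet environment
```
  fleet.init(is_collective=True)
```

Obtain the distributed optimizer and the distributed model through fleet

  strategy = fleet.DistributedStrategy()
  adam = fleet.distributed_optimizer(adam, strategy=strategy)
  dp_layer = fleet.distributed_model(layer)

Then run the command:
fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus=0,1 train.py

3.3 With high-level API code, the program needs no changes; just run the fleetrun command.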

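4. Visualizing parallel training with VisualDL

VisualDL is a visualization tool designed for deep learning tasks. It uses a rich set of charts to present data, so users can see the characteristics and trends of the data more directly and clearly, which helps in analysing the data, spotting errors early, and improving the design of the neural network model. If you like it, consider giving the project a star.

AI Studio Notebook projects (Paddle 1.8.0 and later) already have VisualDL integrated for convenience; the VisualDL service can be started from the visualization tab.

4.1 Visualization with VisualDL

In a high-level API program, add the statement callback = paddle.callbacks.VisualDL(log_dir='visualdl_log') and pass callbacks=callback to model.fit, that is: model.fit(train_dataset, epochs=1, batch_size=64, callbacks=callback, log_freq=400)

The hapi_fleet.py code above already includes the VisualDL statements, and an earlier cell has already run !fleetrun hapi_fleet.py, so the visualization can now be opened directly in AI Studio:

Open the left sidebar: Visualization -> set logdir -> click Add -> choose visualdl_log/ -> click Start VisualDL Service -> click Open VisualDL. The web page that opens shows the training loss, accuracy and other statistics.

4.2 Sharing visualization results with VisualDL-Service

This feature was added in VisualDL 2.0.4 and requires VisualDL 2.0.4 or later. A single command, visualdl service upload, uploads your log files to the remote service.
This feature is highly recommended: once the files are uploaded there is no need to keep them locally, and you can simply visit the generated link.
If VisualDL 2.0.4 or later is not installed, install it with pip install visualdl==2.0.5.
After running the code below, anyone can open the generated link to view and analyse the training process.

!pip install -U visualdl -q # ==2.0.5
!visualdl service upload --logdir visualdl_log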

