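[Part 3] Distributed training: single-machine multi-GPU and multi-machine multi-GPU setups (PaddlePaddle 2.0+), with spawn as the recommended approach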

news/2025/1/13 8:04:39/

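1. Launching parallel training on a single machine with multiple GPUs

PaddlePaddle 2.0 adds the paddle.distributed.spawn function for starting single-machine multi-GPU training, while the existing paddle.distributed.launch approach is still supported.

  • paddle.distributed.launch starts multiple processes at the granularity of a program file: you specify the script to launch, and it runs synchronous multi-GPU training. This is the method the AI Studio script-task instructions used to recommend for multi-GPU jobs. The launch approach places higher demands on process management.
  • paddle.distributed.spawn starts multiple processes at the granularity of a function to run synchronous multi-GPU training. It gives better control over the processes and behaves more gracefully when printing logs and when training exits. This is the currently recommended usage.

Both methods are described below.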

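1.1 Single-machine multi-GPU launch method 1: launch

1.1.1 Using the high-level API

  • When training is implemented with the paddle.Model high-level API, starting single-machine multi-GPU training is very simple: the code needs no changes at all, and you only add -m paddle.distributed.launch to the launch command.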

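      # Single machine, single GPU: uses GPU 0 by default
      $ python train.py

      # Single machine, multiple GPUs: uses all currently visible GPUs by default
      $ python -m paddle.distributed.launch train.py

      # Single machine, multiple GPUs: use GPU 0 and GPU 1
      $ python -m paddle.distributed.launch --selected_gpus='0,1' train.py

      # Single machine, multiple GPUs: restrict the visible GPUs to 0 and 1
      $ export CUDA_VISIBLE_DEVICES='0,1'
      $ python -m paddle.distributed.launch train.py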
    
  • Below is an example that uses the high-level API. Running the cell writes hapitrain.py to the root directory, after which the training can be started with python.

%%writefile hapitrain.py
import paddle
from paddle.vision.transforms import ToTensor

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())
lenet = paddle.vision.models.LeNet()

# LeNet is a paddle.nn.Layer network; paddle.Model wraps it and adds the training features
model = paddle.Model(lenet)

# Configure the optimizer, loss and metric used for training
model.prepare(
    paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()),
    paddle.nn.CrossEntropyLoss(),
    paddle.metric.Accuracy(topk=(1, 2)))

# Start training
model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)

# Start evaluation
model.evaluate(test_dataset, log_freq=100, batch_size=64)

Single machine, single GPU, using GPU 0 by default:

# Single machine, single GPU: uses GPU 0 by default
!python hapitrain.py
Begin to download
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/train-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/train-labels-idx1-ubyte.gz 
Begin to download
........
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-images-idx3-ubyte.gz 
Begin to download
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-labels-idx1-ubyte.gz 
Begin to download
..
Download finished
W0628 15:25:11.488023   114 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:25:11.614305   114 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0555 - acc_top1: 0.9217 - acc_top2: 0.9649 - 50ms/step
step 800/938 - loss: 0.0300 - acc_top1: 0.9454 - acc_top2: 0.9782 - 39ms/step
step 938/938 - loss: 0.0213 - acc_top1: 0.9498 - acc_top2: 0.9803 - 38ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0057 - acc_top1: 0.9731 - acc_top2: 0.9927 - 28ms/step
step 157/157 - loss: 0.0013 - acc_top1: 0.9785 - acc_top2: 0.9945 - 28ms/step
Eval samples: 10000
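
The step counts in the log above follow from the dataset size and the batch size. As a rough sketch, the arithmetic can be checked as below. The helper name `steps_per_epoch` is hypothetical, and the even per-process split is an assumption about how a distributed batch sampler typically shards data, not something this single-process log confirms.

```python
import math

def steps_per_epoch(num_samples, batch_size, num_procs=1):
    """Steps each process runs per epoch (hypothetical helper).

    Assumes the dataset is split evenly across processes, with each
    shard rounded up, and that the last partial batch is kept.
    """
    samples_per_proc = math.ceil(num_samples / num_procs)
    return math.ceil(samples_per_proc / batch_size)

# MNIST training set: 60000 samples, batch_size=64, one process
print(steps_per_epoch(60000, 64))               # 938, matching "step 938/938"
# MNIST test set: 10000 samples ("Eval samples: 10000")
print(steps_per_epoch(10000, 64))               # 157, matching "step 157/157"
# Under the even-split assumption, two processes would each run fewer steps
print(steps_per_epoch(60000, 64, num_procs=2))  # 469
```

If a multi-GPU run still reports 938 steps per epoch, that is a hint that only one process was started or that the data is not being sharded.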

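Single machine, multiple GPUs, using all currently visible GPUs by default:

# Single machine, multiple GPUs: uses all currently visible GPUs by default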
!python -m paddle.distributed.launch hapitrain.py
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
-----------  Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers: 
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers: 
training_script: hapitrain.py
training_script_args: []
worker_num: None
workers: 
------------------------------------------------
WARNING 2021-06-28 15:26:17,473 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:26:17,475 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug): 
+=======================================================================================+
|                        Distributed Envs                      Value                    |
+---------------------------------------------------------------------------------------+
|                       PADDLE_TRAINER_ID                        0                      |
|                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:35079               |
|                     PADDLE_TRAINERS_NUM                        1                      |
|                PADDLE_TRAINER_ENDPOINTS                 127.0.0.1:35079               |
|                     FLAGS_selected_gpus                        0                      |
+=======================================================================================+
INFO 2021-06-28 15:26:17,475 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
W0628 15:26:24.305920   285 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:26:24.311555   285 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0586 - acc_top1: 0.9130 - acc_top2: 0.9611 - 38ms/step
step 800/938 - loss: 0.0288 - acc_top1: 0.9397 - acc_top2: 0.9759 - 39ms/step
step 938/938 - loss: 0.0545 - acc_top1: 0.9448 - acc_top2: 0.9785 - 40ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0035 - acc_top1: 0.9677 - acc_top2: 0.9911 - 36ms/step
step 157/157 - loss: 0.0057 - acc_top1: 0.9723 - acc_top2: 0.9929 - 36ms/step
Eval samples: 10000
INFO 2021-06-28 15:27:26,569 launch.py:240] Local processes completed.
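
The environment table printed by launch lists the variables that each worker process receives. A minimal sketch of reading them inside a training script follows. The helper `read_trainer_env` is hypothetical, the variable names are taken from the log above, and treating PADDLE_TRAINER_ENDPOINTS as a comma-separated list is an assumption.

```python
import os

def read_trainer_env(env=None):
    """Return (rank, world_size, endpoints) from launch-style variables.

    Hypothetical helper: falls back to single-process defaults when the
    variables are absent, e.g. when run with plain `python train.py`.
    """
    env = os.environ if env is None else env
    rank = int(env.get("PADDLE_TRAINER_ID", "0"))
    world_size = int(env.get("PADDLE_TRAINERS_NUM", "1"))
    # Assumption: multiple endpoints are joined with commas
    raw = env.get("PADDLE_TRAINER_ENDPOINTS", "")
    endpoints = [e for e in raw.split(",") if e]
    return rank, world_size, endpoints

# Values copied from the single-process log above
logged = {
    "PADDLE_TRAINER_ID": "0",
    "PADDLE_TRAINERS_NUM": "1",
    "PADDLE_TRAINER_ENDPOINTS": "127.0.0.1:35079",
}
rank, world_size, endpoints = read_trainer_env(logged)
print(rank, world_size, endpoints)
```

A script could use the returned rank, for example, to print logs or save checkpoints from process 0 only.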

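Single machine, multiple GPUs, restricted to GPU 0 and GPU 1. The log below was recorded on a single-GPU AI Studio machine, and the author took the successful run as a sign that launch is fairly fault tolerant. Note, however, that the command was originally run with the variable misspelled as CUDA_VISIABLE_DEVICES (and not exported), so the setting had no effect: launch started one process on GPU 0, as the log shows. This run therefore does not show how launch behaves when a requested GPU is missing.

# Single machine, multiple GPUs: restrict the visible GPUs to 0 and 1
!CUDA_VISIBLE_DEVICES='0,1' python -m paddle.distributed.launch hapitrain.py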
-----------  Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers: 
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers: 
training_script: hapitrain.py
training_script_args: []
worker_num: None
workers: 
------------------------------------------------
WARNING 2021-06-28 15:28:10,632 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:28:10,637 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug): 
+=======================================================================================+
|                        Distributed Envs                      Value                    |
+---------------------------------------------------------------------------------------+
|                       PADDLE_TRAINER_ID                        0                      |
|                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:46909               |
|                     PADDLE_TRAINERS_NUM                        1                      |
|                PADDLE_TRAINER_ENDPOINTS                 127.0.0.1:46909               |
|                     FLAGS_selected_gpus                        0                      |
+=======================================================================================+
INFO 2021-06-28 15:28:10,637 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
W0628 15:28:19.819196   448 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:28:19.905493   448 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0376 - acc_top1: 0.9136 - acc_top2: 0.9610 - 37ms/step
step 800/938 - loss: 0.0159 - acc_top1: 0.9423 - acc_top2: 0.9764 - 35ms/step
step 938/938 - loss: 0.0444 - acc_top1: 0.9479 - acc_top2: 0.9791 - 35ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0039 - acc_top1: 0.9767 - acc_top2: 0.9939 - 36ms/step
step 157/157 - loss: 0.0029 - acc_top1: 0.9815 - acc_top2: 0.9952 - 35ms/step
Eval samples: 10000
INFO 2021-06-28 15:29:19,766 launch.py:240] Local processes completed.

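1.1.2 Using the basic API

  • To start single-machine multi-GPU training for code written with the basic API, the single-GPU code needs 3 changes. Compare the unmodified and modified versions below:

The three changes:

  • Change 1: import the library

import paddle.distributed as dist

  • Change 2: initialize the parallel environment

dist.init_parallel_env()

  • Change 3: wrap the network in paddle.DataParallel

net = paddle.DataParallel(paddle.vision.models.LeNet())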

import paddle  # unmodified (single-GPU) version
from paddle.vision.transforms import ToTensor

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())
lenet = paddle.vision.models.LeNet()

# Load the training set with batch_size set to 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train():
    epochs = 1
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=lenet.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data, y_data = data
            predicts = lenet(x_data)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            acc = paddle.metric.accuracy(predicts, y_data, k=1)
            avg_acc = paddle.mean(acc)
            loss.backward()
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(
                    epoch, batch_id, loss.numpy(), avg_acc.numpy()))
            adam.step()
            adam.clear_grad()

# Start training
train()
> epoch: 0, batch_id: 0, loss is: [2.7922328], acc is: [0.15625]
> epoch: 0, batch_id: 400, loss is: [0.10373791], acc is: [0.96875]
> epoch: 0, batch_id: 800, loss is: [0.01435608], acc is: [1.]

This is the basic-API version with the 3 changes.
As before, first save the file to the root directory with the %%writefile normaltrain.py command.

%%writefile normaltrain.py
import paddle  # version with the 3 changes
from paddle.vision.transforms import ToTensor
import paddle.distributed as dist  # Change 1: import the library

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# Load the training set with batch_size set to 64.
# Note: a plain DataLoader gives every process the full dataset;
# paddle.io.DistributedBatchSampler can be used to shard it across processes.
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train():
    # Change 2: initialize the parallel environment
    dist.init_parallel_env()
    # Change 3: wrap the network in paddle.DataParallel
    net = paddle.DataParallel(paddle.vision.models.LeNet())  # the manual omits the full module path for LeNet here
    epochs = 1
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the manual mistakenly wrote avg_loss here
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(
                    epoch, batch_id, loss.numpy(), avg_acc.numpy()))  # the manual mistakenly wrote avg_loss here
            adam.step()
            adam.clear_grad()

# Start training
train()
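# Single machine, single GPU (GPU 0 by default). The author found that running the modified code this way raises an error
# !python normaltrain.py

# Single machine, multiple GPUs: uses all currently visible GPUs by default
!python -m paddle.distributed.launch normaltrain.py

# Single machine, multiple GPUs: restrict the run to GPU 0 and GPU 1.
# As originally written (variable misspelled as CUDA_VISIABLE_DEVICES and not exported),
# the setting was ignored, so launch used all visible GPUs and also ran on a single-GPU machine.
!CUDA_VISIBLE_DEVICES='0,1' python -m paddle.distributed.launch normaltrain.py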
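
Why wrapping the network in paddle.DataParallel keeps multi-GPU training equivalent to single-GPU training can be illustrated without any GPU. The sketch below is a toy, pure-Python illustration of synchronous data parallelism, not Paddle's implementation: each simulated worker computes the gradient of a mean-squared-error loss on its own equal-sized shard, and the average of those gradients matches the full-batch gradient. The model `y = w * x`, the data, and all function names are assumptions chosen for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_grad(w, xs, ys, num_workers):
    """Average of per-worker gradients over equal-sized shards.

    Assumes len(xs) is divisible by num_workers.
    """
    shard = len(xs) // num_workers
    grads = []
    for rank in range(num_workers):
        lo, hi = rank * shard, (rank + 1) * shard
        grads.append(mse_grad(w, xs[lo:hi], ys[lo:hi]))
    # Stand-in for the all-reduce step that averages gradients across workers
    return sum(grads) / num_workers

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = mse_grad(w, xs, ys)
averaged = data_parallel_grad(w, xs, ys, num_workers=2)
print(full, averaged)  # both are -22.5
```

The equality relies on the shards having equal size; with unequal shards a weighted average would be needed.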

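1.2 Single-machine multi-GPU launch method 2: spawn (recommended)

Like putting an item in a box to ship it by courier, you only need to hand the train function that should run in parallel to paddle.distributed.spawn. The calls look like this:

import paddle.distributed as dist

# Start multi-process training of train, using all visible GPUs by default
if __name__ == '__main__':
    dist.spawn(train)

# Start 2 processes running train, using the first 2 visible GPUs by default
if __name__ == '__main__':
    dist.spawn(train, nprocs=2)

# Start 2 processes running train on GPU 4 and GPU 5
# (the option name may differ between Paddle releases; check the spawn documentation for your version)
if __name__ == '__main__':
    dist.spawn(train, nprocs=2, selected_gpus='4,5')

  • Basic-API scenario (whether or not the code is modified as it was for launch): raises an error inside an AI Studio notebook, but works in a real multi-GPU environment and from the AI Studio command line.
  • High-level API scenario: raises an error inside an AI Studio notebook, but works from the AI Studio command line.
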
%%writefile normal3spawn.py
import paddle  # version with the 3 changes
from paddle.vision.transforms import ToTensor
import paddle.distributed as dist  # Change 1: import the library

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# Load the training set with batch_size set to 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train():
    # Change 2: initialize the parallel environment
    dist.init_parallel_env()
    # Change 3: wrap the network in paddle.DataParallel
    net = paddle.DataParallel(paddle.vision.models.LeNet())  # the manual omits the full module path for LeNet here
    epochs = 1
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the manual mistakenly wrote avg_loss here
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(
                    epoch, batch_id, loss.numpy(), avg_acc.numpy()))  # the manual mistakenly wrote avg_loss here
            adam.step()
            adam.clear_grad()

# Start multi-process training of train, using all visible GPUs by default
if __name__ == '__main__':
    dist.spawn(train)
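
The three spawn calls shown earlier differ only in how many processes start and which GPUs they use. The following toy sketch models that mapping as described in the comments of those calls. `assign_gpus` is a hypothetical helper for illustration and is not part of Paddle, whose actual selection logic may differ.

```python
def assign_gpus(visible_gpus, nprocs=-1, selected_gpus=None):
    """Map process rank -> GPU id (hypothetical model of spawn's choices).

    visible_gpus: list of GPU ids the machine exposes
    nprocs: number of processes; -1 means one per candidate GPU
    selected_gpus: optional comma-separated string such as '4,5'
    """
    if selected_gpus is not None:
        candidates = [int(g) for g in selected_gpus.split(",")]
    else:
        candidates = list(visible_gpus)
    if nprocs == -1:
        nprocs = len(candidates)
    if nprocs > len(candidates):
        raise ValueError("more processes requested than GPUs available")
    return {rank: candidates[rank] for rank in range(nprocs)}

visible = [0, 1, 2, 3, 4, 5, 6, 7]
print(assign_gpus(visible))                                 # one process per visible GPU
print(assign_gpus(visible, nprocs=2))                       # {0: 0, 1: 1}
print(assign_gpus(visible, nprocs=2, selected_gpus="4,5"))  # {0: 4, 1: 5}
```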

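1.3 Brief summary of single-machine multi-GPU training:

The errors seen with spawn inside a notebook are presumably caused by the notebook's restrictions on process management. Running from the command line, or from a cell with a leading exclamation mark, works without problems.

With spawn there is no need to modify the internals of the code. Adding the single call dist.spawn(train) wraps the training code in a multi-process shell. It is simple and convenient, and is the recommended way to set up single-machine multi-GPU training.

Consider the launch approach for single-machine multi-GPU training only when spawn is not supported.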

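PaddlePaddle's full set of parallel modes:

  • Data parallelism: for data parallelism, the mode most commonly used in industry, PaddlePaddle has refined several techniques around real business needs. It offers two architectures, collective communication and parameter server, supports the synchronous and asynchronous training mechanisms common in industrial practice, and provides distributed optimization algorithms with reliable convergence.
  • Pipeline parallelism: aimed at heterogeneous hardware, pipeline parallelism splits the model computation across different devices and pipelines it fully, which greatly improves the overall utilization of heterogeneous hardware.
  • Model parallelism: for very large-scale classification problems, PaddlePaddle provides model parallelism that parallelizes both computation and storage, solving problems that a single GPU cannot handle.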
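
As a toy illustration of the model-parallel idea for very large classifiers (a sketch in plain Python under simplifying assumptions, not Paddle's implementation), the weight matrix of the classification layer can be split by class across devices. Each device computes the logits for its own classes, and concatenating the pieces reproduces the single-device result. All names here are hypothetical.

```python
def linear_logits(features, weights):
    """Logits for one sample; weights holds one weight vector per class."""
    return [sum(f * w for f, w in zip(features, row)) for row in weights]

def sharded_logits(features, weights, num_devices):
    """Split the classes across devices and concatenate the partial logits."""
    per_device = (len(weights) + num_devices - 1) // num_devices
    logits = []
    for dev in range(num_devices):
        shard = weights[dev * per_device:(dev + 1) * per_device]
        # Each simulated device stores and computes only its own shard
        logits.extend(linear_logits(features, shard))
    return logits

features = [1.0, 2.0]
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]  # 5 classes

single = linear_logits(features, weights)
sharded = sharded_logits(features, weights, num_devices=2)
assert sharded == single
print(sharded)
```

The point of the split is that no single device has to hold the full weight matrix, which is what makes very large class counts feasible.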

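1.4 Starting distributed jobs with fleetrun

1.4.1 Starting distributed jobs with fleetrun

Paddle provides the command-line launcher fleetrun. Together with Paddle's high-level distributed API paddle.distributed.fleet, it makes it easy to start distributed jobs in either collective communication mode or parameter server mode. fleetrun works in both static graph and dynamic graph scenarios.

Note: paddle.distributed.fleet currently supports only the Collective Communication mode for dynamic graph distributed training, not the Parameter-Server mode.

  • Single-machine multi-GPU training

To start a job on one machine with 4 GPUs, just specify 4 idle GPUs with --gpus.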

    fleetrun --gpus=0,1,2,3 train.py

Note: if export CUDA_VISIBLE_DEVICES=0,1,2,3 has been set, you can simply run:

    export CUDA_VISIBLE_DEVICES=0,1,2,3
    fleetrun train.py

  • Multi-machine multi-GPU training

[Example 1] 2 machines, 8 GPUs (4 GPUs per node)

    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus=0,1,2,3 train.py

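Note: if every machine has set export CUDA_VISIBLE_DEVICES=0,1,2,3, the job can be started directly on each node with:

    export CUDA_VISIBLE_DEVICES=0,1,2,3
    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" train.py

[Example 2] 2 machines, 16 GPUs (8 GPUs per node, assuming all 8 GPUs on each machine are available)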

    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" train.py
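
The launch commands above follow a simple pattern: the number of training processes is the number of machines multiplied by the GPUs used on each. A small sketch that assembles the command and computes that total follows. `build_fleetrun_cmd` and `total_processes` are hypothetical helpers, and only the `--ips` and `--gpus` flags shown in this section are used.

```python
import shlex

def build_fleetrun_cmd(script, ips=None, gpus=None):
    """Assemble a fleetrun command line as a list of arguments."""
    cmd = ["fleetrun"]
    if ips:
        cmd.append("--ips=" + ",".join(ips))
    if gpus:
        cmd.append("--gpus=" + ",".join(str(g) for g in gpus))
    cmd.append(script)
    return cmd

def total_processes(num_machines, gpus_per_machine):
    """One training process per GPU on every machine."""
    return num_machines * gpus_per_machine

# Example 1 from above: 2 machines, 4 GPUs each (placeholder IPs)
ips = ["xx.xx.xx.xx", "yy.yy.yy.yy"]
cmd = build_fleetrun_cmd("train.py", ips=ips, gpus=[0, 1, 2, 3])
print(shlex.join(cmd))
print(total_processes(len(ips), 4))  # 8
```

Building the command as a list keeps it safe to pass to a process runner without shell quoting issues; the same command has to be run on every machine listed in `--ips`.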

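1.4.2 Single-machine multi-GPU training with Fleet

Using the Fleet interface for dynamic graph distributed training is quite simple. Basic-API code needs only 3 changes:

  • Import the paddle.distributed.fleet package

      from paddle.distributed import fleet
    
  • Initialize the fleet environment

      fleet.init(is_collective=True)
    
  • Obtain the distributed optimizer and the distributed model through fleet

      strategy = fleet.DistributedStrategy()
      adam = fleet.distributed_optimizer(adam, strategy=strategy)
      dp_layer = fleet.distributed_model(layer)
    

The example from the Fleet manual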

%%writefile train_fleet.py
# -*- coding: UTF-8 -*-
import paddle
import paddle.nn as nn
# Distributed step 1: import the paddle.distributed.fleet package
from paddle.distributed import fleet

# Define a fully connected network; it must inherit from nn.Layer
class LinearNet(nn.Layer):
    def __init__(self):
        super(LinearNet, self).__init__()
        self._linear1 = nn.Linear(10, 10)
        self._linear2 = nn.Linear(10, 1)

    def forward(self, x):
        return self._linear2(self._linear1(x))

# 1. Enable dynamic graph mode
paddle.disable_static()

# Distributed step 2: initialize fleet
fleet.init(is_collective=True)

# 2. Define the network object, loss function and optimizer
layer = LinearNet()
loss_fn = nn.MSELoss()
adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=layer.parameters())

# Distributed step 3: obtain the distributed optimizer and distributed model through fleet
strategy = fleet.DistributedStrategy()
adam = fleet.distributed_optimizer(adam, strategy=strategy)
dp_layer = fleet.distributed_model(layer)

for step in range(20):
    # 3. Run the forward pass
    inputs = paddle.randn([10, 10], 'float32')
    outputs = dp_layer(inputs)
    labels = paddle.randn([10, 1], 'float32')
    loss = loss_fn(outputs, labels)
    print("step:{}\tloss:{}".format(step, loss.numpy()))
    # 4. Run the backward pass and update the parameters
    loss.backward()
    adam.step()
    adam.clear_grad()

!fleetrun --gpus=0 train_fleet.py
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
-----------  Configuration Arguments -----------
gpus: 0
heter_worker_num: None
heter_workers: 
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers: 
training_script: train_fleet.py
training_script_args: []
worker_num: None
workers: 
------------------------------------------------
WARNING 2021-06-28 15:56:16,986 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:56:16,990 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug): 
+=======================================================================================+
|                        Distributed Envs                      Value                    |
+---------------------------------------------------------------------------------------+
|                       PADDLE_TRAINER_ID                        0                      |
|                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:47263               |
|                     PADDLE_TRAINERS_NUM                        1                      |
|                PADDLE_TRAINER_ENDPOINTS                 127.0.0.1:47263               |
|                     FLAGS_selected_gpus                        0                      |
+=======================================================================================+
INFO 2021-06-28 15:56:16,991 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
W0628 15:56:18.760403  1539 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:56:18.826562  1539 device_context.cc:372] device: 0, cuDNN Version: 7.6.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py:633: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
  "It is recommended to use DistributedStrategy "
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py:423: UserWarning: The program will return to single-card operation. Please check 1, whether you use spawn or fleetrun to start the program. 2, Whether it is a multi-card program. 3, Is the current environment multi-card.
  warnings.warn("The program will return to single-card operation. "
step:0	loss:[2.747072]
step:1	loss:[3.9464068]
step:2	loss:[3.3363562]
step:3	loss:[1.7597802]
step:4	loss:[2.4984336]
step:5	loss:[1.3766874]
step:6	loss:[3.3678422]
step:7	loss:[1.8410085]
step:8	loss:[1.6417965]
step:9	loss:[4.009201]
step:10	loss:[1.7387416]
step:11	loss:[1.6013482]
step:12	loss:[1.6388085]
step:13	loss:[3.7573469]
step:14	loss:[0.9461777]
step:15	loss:[2.4906065]
step:16	loss:[2.613153]
step:17	loss:[2.8367076]
step:18	loss:[2.170548]
step:19	loss:[2.2705061]
INFO 2021-06-28 15:56:35,049 launch.py:240] Local processes completed.

2. Fleet versions of handwritten digit recognition

2.1 Handwritten digit recognition: basic-API Fleet version

%%writefile normal_fleet.py
import paddle  # version with the 3 changes
from paddle.vision.transforms import ToTensor
# Distributed step 1: import the paddle.distributed.fleet package
from paddle.distributed import fleet

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

# Load the training set with batch_size set to 64
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

# Distributed step 2: initialize fleet
fleet.init(is_collective=True)

def train():
    epochs = 1
    net = paddle.vision.models.LeNet()
    # Use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    # Distributed step 3: obtain the distributed optimizer and distributed model through fleet
    strategy = fleet.DistributedStrategy()
    adam = fleet.distributed_optimizer(adam, strategy=strategy)
    net = fleet.distributed_model(net)
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the manual mistakenly wrote avg_loss here
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(
                    epoch, batch_id, loss.numpy(), avg_acc.numpy()))  # the manual mistakenly wrote avg_loss here
            adam.step()
            adam.clear_grad()

if __name__ == '__main__':
    train()

!fleetrun --gpus=0 normal_fleet.py
+=======================================================================================+
|                        Distributed Envs                      Value                    |
+---------------------------------------------------------------------------------------+
|                       PADDLE_TRAINER_ID                        0                      |
|                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:42501               |
|                     PADDLE_TRAINERS_NUM                        1                      |
|                PADDLE_TRAINER_ENDPOINTS                 127.0.0.1:42501               |
|                     FLAGS_selected_gpus                        0                      |
+=======================================================================================+
epoch: 0, batch_id: 0, loss is: [2.5425684], acc is: [0.234375]
epoch: 0, batch_id: 400, loss is: [0.05207598], acc is: [1.]
epoch: 0, batch_id: 800, loss is: [0.04818164], acc is: [1.]

2.2 Handwritten digit recognition: high-level API Fleet version

%%writefile hapi_fleet.py
import paddle
from paddle.vision.transforms import ToTensor
import paddle.distributed as dist

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())
lenet = paddle.vision.models.LeNet()

# LeNet is a paddle.nn.Layer network; paddle.Model wraps it and adds the training features
model = paddle.Model(lenet)

# Configure the optimizer, loss and metric used for training
model.prepare(
    paddle.optimizer.Adam(learning_rate=0.1, parameters=model.parameters()),
    paddle.nn.CrossEntropyLoss(),
    paddle.metric.Accuracy(topk=(1, 2)))

def train():
    # Start training
    # With VisualDL visualization
    callback = paddle.callbacks.VisualDL(log_dir='visualdl_log')
    model.fit(train_dataset, epochs=1, batch_size=64, callbacks=callback, log_freq=400)
    # Without VisualDL visualization
    # model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)
    # Start evaluation
    # model.evaluate(test_dataset, log_freq=20, batch_size=64)

if __name__ == '__main__':
    train()

!fleetrun hapi_fleet.py

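2.3 Multi-machine multi-GPU handwritten digit recognition

Moving from single-machine multi-GPU to multi-machine multi-GPU training requires no code changes at all; only the launch command changes. Taking 2 machines with 4 GPUs in total as an example: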

    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus=0,1 dygraph_fleet.py

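Run the launch command above on each of the 2 machines. fleetrun starts a multi-process job in the background on each machine and carries out the distributed multi-machine training. The console log output appears on the machine whose IP is xx.xx.xx.xx.

The following still uses AI Studio to demonstrate the multi-machine command form (here with a single local IP). Run it directly: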

!fleetrun --ips="127.0.0.1" --gpus=0 normal_fleet.py

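3. Summary of parallel computing in PaddlePaddle 2.0:

PaddlePaddle 2.0 offers a complete solution for parallel computing, and it is a training framework that has been validated on very large-scale business data. Parallel computing really is this simple!

3.1 For single-machine multi-GPU training, spawn is the preferred approach

The advantage of spawn is that the code needs almost no changes: import the library and call the training function through spawn at the end. spawn also gives better control over the processes and behaves more gracefully when printing logs and when training exits.

Only these two statements need to be added to the program:

    import paddle.distributed as dist
    if __name__ == '__main__':
        dist.spawn(train)

Then start training directly with python train.py.

3.2 For multi-machine multi-GPU training, use fleet.

A basic-API program needs the following 3 changes:

  • Import the paddle.distributed.fleet package

      from paddle.distributed import fleet
    
  • Initialize the fleet environment

      fleet.init(is_collective=True)
    
  • Obtain the distributed optimizer and the distributed model through fleet

      strategy = fleet.DistributedStrategy()
      adam = fleet.distributed_optimizer(adam, strategy=strategy)
      dp_layer = fleet.distributed_model(layer)
    
  • Then run the command:
    fleetrun --ips="xx.xx.xx.xx,yy.yy.yy.yy" --gpus=0,1 train.py

3.3 With high-level API code, the program needs no changes; just run the fleetrun command.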

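4. Visualization with VisualDL under parallel training

VisualDL is a visualization tool designed for deep learning tasks. It uses a rich set of charts to present data, so users can inspect the characteristics and trends of their data more intuitively and clearly. This helps with analyzing data, spotting errors early, and improving the design of neural network models. If you like it, feel free to star the project.

AI Studio Notebook projects (Paddle 1.8.0 and later) have VisualDL built in; the VisualDL service can be started from the Visualization tab.

4.1 Visualizing with VisualDL

In a high-level API program, just add callback = paddle.callbacks.VisualDL(log_dir='visualdl_log') and pass callbacks=callback to model.fit, that is: model.fit(train_dataset, epochs=1, batch_size=64, callbacks=callback, log_freq=400)

The hapi_fleet.py code above already includes the VisualDL callback, and an earlier cell has already run !fleetrun hapi_fleet.py, so the visualization can now be opened directly in AI Studio:

Open the left sidebar: Visualization -> set logdir -> click Add -> select visualdl_log/ -> click Start VisualDL service -> click Open VisualDL. The page that opens shows training statistics such as loss and acc.

4.2 Sharing visualization results with VisualDL-Service

  • This feature was added in VisualDL 2.0.4 and requires VisualDL 2.0.4 or later. A single command, visualdl service upload, uploads your log files to the remote service.

  • This feature is highly recommended: once the files are uploaded, there is no need to keep them locally, and you can simply open the generated link.

  • If VisualDL 2.0.4+ is not installed, install it with pip install visualdl==2.0.5.

  • After running the code below, open the generated link; anyone can then view and analyze the training process.

!pip install -U visualdl -q  # ==2.0.5
!visualdl service upload --logdir visualdl_log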

