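Original: Mastering TensorFlow 1.x

License: CC BY-NC-SA 4.0

Translator: 飞龙

This text comes from the ApacheCN deep learning translation collection and was produced with a machine translation post-editing (MTPE) workflow to improve efficiency.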

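11. TF Serving: TensorFlow Models in Production

TensorFlow models are trained and validated in a development environment. Once released, they need to be hosted somewhere so that application engineers and software engineers can integrate them into various applications. TensorFlow provides a high-performance server for this purpose, known as TensorFlow Serving.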

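To serve TensorFlow models in production, you save them after offline training and then restore the trained models in the production environment. A saved TensorFlow model consists of the following files:

Meta-graph: The meta-graph is the protocol buffer definition of the graph. It is saved in a file with the .meta extension.
Checkpoint: The checkpoint holds the values of the variables. It is saved in two files: one with the .index extension and one with the .data-00000-of-00001 extension.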
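
As a rough illustration of the file layout described above, the following sketch builds the file names one would expect next to a checkpoint prefix. The helper expected_checkpoint_files and its num_shards parameter are hypothetical names introduced here for illustration; the suffixes are the ones named in the text, and the number of data shards can differ in practice.

```python
def expected_checkpoint_files(prefix, num_shards=1):
    """Return the file names expected for a checkpoint saved under `prefix`.

    Hypothetical helper for illustration only: it builds names and does not
    touch the file system.
    """
    files = [prefix + ".meta", prefix + ".index"]
    for shard in range(num_shards):
        # Data files are numbered as <shard>-of-<total>, zero-padded to 5 digits.
        files.append("{}.data-{:05d}-of-{:05d}".format(prefix, shard, num_shards))
    return files


if __name__ == "__main__":
    for name in expected_checkpoint_files("saved-models/model.ckpt"):
        print(name)
```

Note that the prefix itself (model.ckpt here) is what you pass to save() and restore(), not any one of the individual files.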

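In this chapter, we learn various ways to save and restore models and how to serve them with TF Serving. We use the MNIST example to keep things simple and cover the following topics:

Saving and restoring models in TensorFlow with the Saver class
Saving and restoring Keras models
TensorFlow Serving
Installing TF Serving
Saving models for TF Serving
Serving models with TF Serving
TF Serving in a Docker container
TF Serving on Kubernetes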

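Saving and restoring models in TensorFlow

You can save and restore models and variables in TensorFlow with one of the following two methods:

A saver object created from the tf.train.Saver class
A SavedModel format-based object created from the tf.saved_model.builder.SavedModelBuilder class

Let's see both methods in action.

You can follow along with the code in the Jupyter notebook ch-11a_Saving_and_Restoring_TF_Models.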

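Saving and restoring all graph variables with the Saver class

We proceed as follows:

To use the Saver class, first create an object of this class:

saver = tf.train.Saver()

The simplest way to save all the variables in a graph is to call the save() method with two arguments: the session object and the path of the file on disk where the variables are saved:

with tf.Session() as tfs:
    ...
    saver.save(tfs, "saved-models/model.ckpt")

To restore the variables, call the restore() method:

with tf.Session() as tfs:
    saver.restore(tfs, "saved-models/model.ckpt")
    ...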

Let's revisit the example from Chapter 1, TensorFlow 101. The code to save the variables in that simple example is as follows:

import tensorflow as tf

# Assume Linear Model y = w * x + b
# Define model parameters
w = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Define model input and output
x = tf.placeholder(tf.float32)
y = w * x + b
output = 0

# create saver object
saver = tf.train.Saver()

with tf.Session() as tfs:
    # initialize and print the variable y
    tfs.run(tf.global_variables_initializer())
    output = tfs.run(y, {x: [1, 2, 3, 4]})
    saved_model_file = saver.save(tfs, 'saved-models/full-graph-save-example.ckpt')
    print('Model saved in {}'.format(saved_model_file))
    print('Values of variables w,b: {}{}'.format(w.eval(), b.eval()))
    print('output={}'.format(output))

We get the following output:

Model saved in saved-models/full-graph-save-example.ckpt
Values of variables w,b: [ 0.30000001][-0.30000001]
output=[ 0.          0.30000001  0.60000002  0.90000004]

Now let's restore the variables from the checkpoint file we just created:

# Assume Linear Model y = w * x + b
# Define model parameters
w = tf.Variable([0], dtype=tf.float32)
b = tf.Variable([0], dtype=tf.float32)
# Define model input and output
x = tf.placeholder(dtype=tf.float32)
y = w * x + b
output = 0

# create saver object
saver = tf.train.Saver()

with tf.Session() as tfs:
    saved_model_file = saver.restore(tfs, 'saved-models/full-graph-save-example.ckpt')
    print('Values of variables w,b: {}{}'.format(w.eval(), b.eval()))
    output = tfs.run(y, {x: [1, 2, 3, 4]})
    print('output={}'.format(output))

Notice that the restore code does not call tf.global_variables_initializer(): the variables do not need to be initialized because they are restored from the file. We get the following output, which is computed from the restored variables:

INFO:tensorflow:Restoring parameters from saved-models/full-graph-save-example.ckpt
Values of variables w,b: [ 0.30000001][-0.30000001]
output=[ 0.          0.30000001  0.60000002  0.90000004]

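Saving and restoring selected variables with the Saver class

By default, a Saver() object saves all the variables in the graph, but you can choose which variables to save by passing them to the constructor of the Saver() class:

# create saver object
saver = tf.train.Saver({'weights': w})

The variables can be passed as a list or as a dictionary. If they are passed as a list, each variable is saved under its own name. If they are passed as a dictionary of key-value pairs, each key is the name used in the checkpoint and each value is the variable to save.

The following is the code for the example we just saw, but this time we save only the w variable, under the name weights: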

# Saving selected variables in a graph in TensorFlow
# Assume Linear Model y = w * x + b
# Define model parameters
w = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Define model input and output
x = tf.placeholder(tf.float32)
y = w * x + b
output = 0

# create saver object
saver = tf.train.Saver({'weights': w})

with tf.Session() as tfs:
    # initialize and print the variable y
    tfs.run(tf.global_variables_initializer())
    output = tfs.run(y, {x: [1, 2, 3, 4]})
    saved_model_file = saver.save(tfs, 'saved-models/weights-save-example.ckpt')
    print('Model saved in {}'.format(saved_model_file))
    print('Values of variables w,b: {}{}'.format(w.eval(), b.eval()))
    print('output={}'.format(output))

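We get the following output:

Model saved in saved-models/weights-save-example.ckpt
Values of variables w,b: [ 0.30000001][-0.30000001]
output=[ 0.          0.30000001  0.60000002  0.90000004]

The checkpoint file contains only the weights and not the bias. Now let's initialize both the bias and the weights to zero and restore the weights. The code for this example is given here: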

# Restoring selected variables in a graph in TensorFlow
tf.reset_default_graph()
# Assume Linear Model y = w * x + b
# Define model parameters
w = tf.Variable([0], dtype=tf.float32)
b = tf.Variable([0], dtype=tf.float32)
# Define model input and output
x = tf.placeholder(dtype=tf.float32)
y = w * x + b
output = 0

# create saver object
saver = tf.train.Saver({'weights': w})

with tf.Session() as tfs:
    b.initializer.run()
    saved_model_file = saver.restore(tfs, 'saved-models/weights-save-example.ckpt')
    print('Values of variables w,b: {}{}'.format(w.eval(), b.eval()))
    output = tfs.run(y, {x: [1, 2, 3, 4]})
    print('output={}'.format(output))

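As you can see, this time we had to initialize the bias with b.initializer.run(). We do not use tfs.run(tf.global_variables_initializer()) because it would initialize all the variables, and the weights do not need to be initialized since they are restored from the checkpoint file.

We get the following output, because the computation uses the restored weights while the bias remains zero: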

INFO:tensorflow:Restoring parameters from saved-models/weights-save-example.ckpt
Values of variables w,b: [ 0.30000001][ 0.]
output=[ 0.30000001  0.60000002  0.90000004  1.20000005]
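
The two outputs can be checked by hand with plain arithmetic. The sketch below evaluates y = w * x + b without TensorFlow, once with both restored parameters and once with only the restored weight; linear_model is a hypothetical helper name used only for this illustration.

```python
def linear_model(w, b, xs):
    """Evaluate y = w * x + b for every x in xs."""
    return [w * x + b for x in xs]


xs = [1, 2, 3, 4]

# Full restore: w = 0.3 and b = -0.3 both come from the checkpoint.
full_restore = linear_model(0.3, -0.3, xs)
# Partial restore: only w = 0.3 is restored, b stays at its initial value 0.
weights_only = linear_model(0.3, 0.0, xs)

print([round(v, 2) for v in full_restore])   # [0.0, 0.3, 0.6, 0.9]
print([round(v, 2) for v in weights_only])   # [0.3, 0.6, 0.9, 1.2]
```

The second list is the first one shifted up by 0.3, which is exactly the missing bias.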

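Saving and restoring Keras models

Saving and restoring models in Keras is very simple. Keras provides three options:

Save the complete model with its network architecture, weights (parameters), training configuration, and optimizer state.
Save only the architecture.
Save only the weights.

To save the complete model, use the model.save(filepath) function. This saves the complete model in an HDF5 file. The saved model can be loaded back with the keras.models.load_model(filepath) function, which loads everything back and then also compiles the model.

To save the architecture of a model, use either the model.to_json() or the model.to_yaml() function. These functions return a string that can be written to a file on disk. To restore the architecture, read the string back and pass it to keras.models.model_from_json(json_string) or keras.models.model_from_yaml(yaml_string). Both functions return a model instance.

To save the weights of a model, use the model.save_weights(path_to_h5_file) function. The weights can be restored with the model.load_weights(path_to_h5_file) function.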

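TensorFlow Serving

TensorFlow Serving (TFS) is a high-performance server architecture for serving machine learning models in production. It offers out-of-the-box integration with models built using TensorFlow.

In TFS, a model is composed of one or more servables. A servable is used to perform computation, for example:

A lookup table for embedding lookups
A single model returning predictions
A set of models returning a tuple of predictions
A shard of lookup tables or models

The manager component manages the full lifecycle of the servables, including loading and unloading a servable and serving it.

The internal architecture and workflow of TensorFlow Serving are described in the official TensorFlow Serving documentation.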

Installing TF Serving

Follow the instructions in this section to install the TensorFlow ModelServer on Ubuntu using apt-get.

First, add the TensorFlow Serving distribution URI as a package source (one-time setup) with the following commands at the shell prompt:

$ echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
$ curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -

Install and update the TensorFlow ModelServer with the following command at the shell prompt:

$ sudo apt-get update && sudo apt-get install tensorflow-model-server

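This installs the version of ModelServer that uses platform-specific compiler optimizations, such as SSE4 and AVX instructions. If the optimized version does not work on an older machine, you can install the universal version instead:

$ sudo apt-get remove tensorflow-model-server
$ sudo apt-get update && sudo apt-get install tensorflow-model-server-universal

For other operating systems and for installing from source, see the TensorFlow Serving installation documentation.

When a new version of ModelServer is released, you can upgrade to it with the following command:

$ sudo apt-get update && sudo apt-get upgrade tensorflow-model-server

Now that ModelServer is installed, run the server with the following command:

$ tensorflow_model_server

To connect to tensorflow_model_server, install the Python client package with pip:

$ sudo pip2 install tensorflow-serving-api

At the time of writing, the TF Serving API is available only for Python 2 and not yet for Python 3.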

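Saving models for TF Serving

In order to serve models, they need to be saved first. In this section, we walk through a slightly modified version of the MNIST example from the official TensorFlow documentation.

The TensorFlow team recommends using SavedModel to save and restore models built and trained in TensorFlow. According to the TensorFlow documentation:

SavedModel is a language-neutral, recoverable, hermetic serialization format. SavedModel enables higher-level systems and tools to produce, consume, and transform TensorFlow models.

You can follow along with the code in the Jupyter notebook ch-11b_Saving_TF_Models_with_SavedModel_for_TF_Serving. We proceed to save the model as follows:

Define the model variables:

import os

# models_root is assumed to be defined earlier in the notebook as the
# directory that holds all exported models
model_name = 'mnist'
model_version = '1'
model_dir = os.path.join(models_root,model_name,model_version)

Get the MNIST data as we did in Chapter 4, for the MLP model: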

from tensorflow.examples.tutorials.mnist import input_data
dataset_home = os.path.join('.','mnist')
mnist = input_data.read_data_sets(dataset_home, one_hot=True)
x_train = mnist.train.images
x_test = mnist.test.images
y_train = mnist.train.labels
y_test = mnist.test.labels
pixel_size = 28 
num_outputs = 10 # 0-9 digits
num_inputs = 784 # total pixels

Define the MLP function that builds and returns the model:

def mlp(x, num_inputs, num_outputs, num_layers, num_neurons):
    w = []
    b = []
    for i in range(num_layers):
        w.append(tf.Variable(tf.random_normal(
            [num_inputs if i == 0 else num_neurons[i - 1], num_neurons[i]]),
            name="w_{0:04d}".format(i)))
        b.append(tf.Variable(tf.random_normal([num_neurons[i]]),
                             name="b_{0:04d}".format(i)))
    w.append(tf.Variable(tf.random_normal(
        [num_neurons[num_layers - 1] if num_layers > 0 else num_inputs,
         num_outputs]), name="w_out"))
    b.append(tf.Variable(tf.random_normal([num_outputs]), name="b_out"))

    # x is input layer
    layer = x
    # add hidden layers
    for i in range(num_layers):
        layer = tf.nn.relu(tf.matmul(layer, w[i]) + b[i])
    # add output layer
    layer = tf.matmul(layer, w[num_layers]) + b[num_layers]
    model = layer
    probs = tf.nn.softmax(model)
    return model, probs

The mlp() function returns the model and the probabilities. The model is the raw output of the last layer (the logits), and the probabilities are the softmax activation applied to it.
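
To make the relationship between the two return values concrete, here is a minimal pure-Python sketch of the softmax function, written without TensorFlow. It assumes a single example with a list of raw scores; the name softmax and the sample values are illustrative only and are not taken from the notebook.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing and does not
    # change the result.
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    probs = softmax([1.0, 2.0, 3.0])
    print(probs)
    print(sum(probs))
```

Softmax preserves the ordering of the scores, so the index of the largest logit is also the index of the largest probability.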

Define the x_p and y_p placeholders for the image input and the target output:

# input images
serialized_tf_example = tf.placeholder(tf.string, name='tf_example')
feature_configs = {'x': tf.FixedLenFeature(shape=[784], dtype=tf.float32),}
tf_example = tf.parse_example(serialized_tf_example, feature_configs)
# use tf.identity() to assign name
x_p = tf.identity(tf_example['x'], name='x_p') 
# target output
y_p = tf.placeholder(dtype=tf.float32, name="y_p", shape=[None, num_outputs])

Create the model, along with the loss, optimizer, accuracy, and training functions:

num_layers = 2
num_neurons = []
for i in range(num_layers):
    num_neurons.append(256)

learning_rate = 0.01
n_epochs = 50
batch_size = 100
n_batches = mnist.train.num_examples // batch_size

model, probs = mlp(x=x_p, num_inputs=num_inputs, num_outputs=num_outputs,
                   num_layers=num_layers, num_neurons=num_neurons)

loss_op = tf.nn.softmax_cross_entropy_with_logits
loss = tf.reduce_mean(loss_op(logits=model, labels=y_p))
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(loss)

pred_check = tf.equal(tf.argmax(probs, 1), tf.argmax(y_p, 1))
accuracy_op = tf.reduce_mean(tf.cast(pred_check, tf.float32))

values, indices = tf.nn.top_k(probs, 10)
table = tf.contrib.lookup.index_to_string_table_from_tensor(
    tf.constant([str(i) for i in range(10)]))
prediction_classes = table.lookup(tf.to_int64(indices))

In a TensorFlow session, train the model as before, but use the builder object to save it:

from tensorflow.python.saved_model.signature_constants import (
    CLASSIFY_INPUTS,
    CLASSIFY_OUTPUT_CLASSES,
    CLASSIFY_OUTPUT_SCORES,
    CLASSIFY_METHOD_NAME,
    PREDICT_METHOD_NAME,
    DEFAULT_SERVING_SIGNATURE_DEF_KEY,
)

with tf.Session() as tfs:
    tfs.run(tf.global_variables_initializer())
    for epoch in range(n_epochs):
        epoch_loss = 0.0
        for batch in range(n_batches):
            x_batch, y_batch = mnist.train.next_batch(batch_size)
            feed_dict = {x_p: x_batch, y_p: y_batch}
            _, batch_loss = tfs.run([train_op, loss], feed_dict=feed_dict)
            epoch_loss += batch_loss
        average_loss = epoch_loss / n_batches
        print("epoch: {0:04d}   loss = {1:0.6f}".format(epoch, average_loss))
    feed_dict = {x_p: x_test, y_p: y_test}
    accuracy_score = tfs.run(accuracy_op, feed_dict=feed_dict)
    print("accuracy={0:.8f}".format(accuracy_score))

    # save the model
    # definitions for saving the models
    builder = tf.saved_model.builder.SavedModelBuilder(model_dir)

    # build signature_def_map
    bti_op = tf.saved_model.utils.build_tensor_info
    bsd_op = tf.saved_model.utils.build_signature_def

    classification_inputs = bti_op(serialized_tf_example)
    classification_outputs_classes = bti_op(prediction_classes)
    classification_outputs_scores = bti_op(values)
    classification_signature = (bsd_op(
        inputs={CLASSIFY_INPUTS: classification_inputs},
        outputs={CLASSIFY_OUTPUT_CLASSES: classification_outputs_classes,
                 CLASSIFY_OUTPUT_SCORES: classification_outputs_scores},
        method_name=CLASSIFY_METHOD_NAME))

    tensor_info_x = bti_op(x_p)
    tensor_info_y = bti_op(probs)
    prediction_signature = (bsd_op(
        inputs={'inputs': tensor_info_x},
        outputs={'outputs': tensor_info_y},
        method_name=PREDICT_METHOD_NAME))

    legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
    builder.add_meta_graph_and_variables(
        tfs, [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            'predict_images': prediction_signature,
            DEFAULT_SERVING_SIGNATURE_DEF_KEY: classification_signature,
        },
        legacy_init_op=legacy_init_op)
    builder.save()

The model is saved once we see the following output:

accuracy=0.92979997
INFO:tensorflow:No assets to save.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: b'/home/armando/models/mnist/1/saved_model.pb'

Next, we run the ModelServer and serve the model we just saved.

Serving models with TF Serving

To run the ModelServer, execute the following command:

$ tensorflow_model_server --model_name=mnist --model_base_path=/home/armando/models/mnist

The server starts serving the model on port 8500:

I tensorflow_serving/model_servers/main.cc:147] Building single TensorFlow model file config: model_name: mnist model_base_path: /home/armando/models/mnist
I tensorflow_serving/model_servers/server_core.cc:441] Adding/updating models.
I tensorflow_serving/model_servers/server_core.cc:492] (Re-)adding model: mnist
I tensorflow_serving/core/basic_manager.cc:705] Successfully reserved resources to load servable {name: mnist version: 1}
I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: mnist version: 1}
I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: mnist version: 1}
I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /home/armando/models/mnist/1
I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:236] Loading SavedModel from: /home/armando/models/mnist/1
I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:155] Restoring SavedModel bundle.
I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running LegacyInitOp on SavedModel bundle.
I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:284] Loading SavedModel: success. Took 29853 microseconds.
I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: mnist version: 1}
E1121 ev_epoll1_linux.c:1051] grpc epoll fd: 3
I tensorflow_serving/model_servers/main.cc:288] Running ModelServer at 0.0.0.0:8500 ...
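
The log shows the server loading version 1 from the numbered subdirectory /home/armando/models/mnist/1 under the base path. As far as we understand the default behavior, the server picks the highest-numbered version directory it finds. The sketch below mimics that selection in plain Python; latest_version_dir is a hypothetical helper written for this illustration and is not part of TF Serving.

```python
import os
import tempfile


def latest_version_dir(model_base_path):
    """Return the numeric subdirectory with the highest version, or None.

    Hypothetical helper: it imitates the selection of the newest version
    under a model base path and ignores non-numeric entries.
    """
    versions = [
        name for name in os.listdir(model_base_path)
        if name.isdigit() and os.path.isdir(os.path.join(model_base_path, name))
    ]
    if not versions:
        return None
    # Compare as integers so that "10" is newer than "2".
    return os.path.join(model_base_path, max(versions, key=int))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        for version in ("1", "2", "10"):
            os.makedirs(os.path.join(base, version))
        print(os.path.basename(latest_version_dir(base)))  # 10
```

This is why the model is exported to models_root/mnist/1 rather than directly into the mnist folder: exporting a retrained model to a higher-numbered sibling directory makes it available as a new version.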

To test the server by calling the model to classify images, follow the notebook ch-11c_TF_Serving_MNIST.

The first two cells of the notebook provide the test client function from the official TensorFlow example in the serving repository. We modified the example to send 'input' and receive 'output' in the function signature when calling the ModelServer.

Call the test client function in the third cell of the notebook with the following code:

error_rate = do_inference(hostport='0.0.0.0:8500', work_dir='/home/armando/datasets/mnist',concurrency=1, num_tests=100)
print('\nInference error rate: %s%%' % (error_rate * 100))

We get an error rate of almost 7% (you may get a different value):

Extracting /home/armando/datasets/mnist/train-images-idx3-ubyte.gz
Extracting /home/armando/datasets/mnist/train-labels-idx1-ubyte.gz
Extracting /home/armando/datasets/mnist/t10k-images-idx3-ubyte.gz
Extracting /home/armando/datasets/mnist/t10k-labels-idx1-ubyte.gz
..................................................
..................................................
Inference error rate: 7.0%
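
The reported number is simply the fraction of test images whose predicted digit differs from the true label. A minimal sketch of that calculation follows; inference_error_rate and the sample data are hypothetical and are not taken from the notebook's do_inference() function.

```python
def inference_error_rate(predicted, expected):
    """Return the fraction of predictions that differ from the labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one example is required")
    errors = sum(1 for p, e in zip(predicted, expected) if p != e)
    return errors / float(len(expected))


if __name__ == "__main__":
    # 100 made-up labels; the first 7 predictions are deliberately wrong.
    labels = [i % 10 for i in range(100)]
    predictions = [(d + 1) % 10 if i < 7 else d for i, d in enumerate(labels)]
    rate = inference_error_rate(predictions, labels)
    print('Inference error rate: %s%%' % (rate * 100))
```

With only 100 test images, each wrong prediction moves the result by a full percentage point, which is one reason your value may differ from run to run.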

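TF Serving in a Docker container

Docker is a platform for packaging and deploying applications in containers. If you are not yet familiar with Docker containers, the tutorials and information on the Docker website are a good place to start.

We can also install and run TensorFlow Serving in a Docker container. The instructions for Ubuntu 16.04 in this section are derived from the following pages on the official TensorFlow website:

https://www.tensorflow.org/serving/serving_inception
https://www.tensorflow.org/serving/serving_basic

Let's dive right in!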

Installing Docker

We install Docker as follows:

First, remove any previous installations of Docker:

$ sudo apt-get remove docker docker-engine docker.io

Install the prerequisite software:

$ sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common

Add the GPG key of the Docker repository:

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Add the Docker repository:

$ sudo add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
    $(lsb_release -cs) \
    stable"

Install the Docker community edition:

$ sudo apt-get update && sudo apt-get install docker-ce

Once the installation completes successfully, enable Docker as a system service:

$ sudo systemctl enable docker

To run Docker as a non-root user, without sudo, add the docker group:

$ sudo groupadd docker

Add your user to the docker group:

$ sudo usermod -aG docker $USER

Now log out and log in again so that the group membership takes effect. Once logged in, run the following command to test the Docker installation:

$ docker run --name hello-world-container hello-world

You should see output similar to the following:

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
ca4f61b1923c: Already exists
Digest: sha256:be0cd392e45be79ffeffa6b05338b98ebb16c87b255f48e297ec7f98e123905c
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://cloud.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/engine/userguide/

Docker is installed successfully. Now let's build a Docker image for TensorFlow Serving.

Building a Docker image for TF Serving

We proceed with the Docker image as follows:

Create a file named dockerfile with the following content:

FROM ubuntu:16.04
MAINTAINER Armando Fandango <armando@geekysalsero.com>

RUN apt-get update && apt-get install -y \
        build-essential \
        curl \
        git \
        libfreetype6-dev \
        libpng12-dev \
        libzmq3-dev \
        mlocate \
        pkg-config \
        python-dev \
        python-numpy \
        python-pip \
        software-properties-common \
        swig \
        zip \
        zlib1g-dev \
        libcurl3-dev \
        openjdk-8-jdk \
        openjdk-8-jre-headless \
        wget \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" \
    | tee /etc/apt/sources.list.d/tensorflow-serving.list

RUN curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg \
    | apt-key add -

RUN apt-get update && apt-get install -y \
    tensorflow-model-server

RUN pip install --upgrade pip
RUN pip install mock grpcio tensorflow tensorflow-serving-api

CMD ["/bin/bash"]

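Run the following command to build the Docker image from this dockerfile:

$ docker build --pull -t $USER/tensorflow_serving -f dockerfile .

It takes a while to create the image. Once you see something like the following, the image has been built: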

Removing intermediate container 1d8e757d96e0
Successfully built 0f95ddba4362
Successfully tagged armando/tensorflow_serving:latest

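Run the following command to start the container:

$ docker run --name=mnist_container -it $USER/tensorflow_serving

You are logged in to the container when you see the following prompt:

root@244ea14efb8f:/#

Go to the home folder with the cd command.
In the home folder, give the following command to check out the TensorFlow Serving code. We will use the examples from this code to demonstrate, but you can check out your own Git repository to run your own models:

$ git clone --recurse-submodules https://github.com/tensorflow/serving

Once the repository is cloned, we are ready to build, train, and save the MNIST model.

Remove the temporary folder, if it has not already been removed, with the following command:

$ rm -rf /tmp/mnist_model

Run the following command to build, train, and save the MNIST model:

$ python serving/tensorflow_serving/example/mnist_saved_model.py /tmp/mnist_model

You will see something similar to the following: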

Training model...
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/t10k-labels-idx1-ubyte.gz
2017-11-22 01:09:38.165391: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
training accuracy 0.9092
Done training!
Exporting trained model to /tmp/mnist_model/1
Done exporting!

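Detach from the Docker container by pressing Ctrl + P followed by Ctrl + Q.
Commit the changes to a new image and stop the container with the following commands:

$ docker commit mnist_container $USER/mnist_serving
$ docker stop mnist_container

Now you can run this container at any time by giving the following command:

$ docker run --name=mnist_container -it $USER/mnist_serving

Remove the temporary MNIST container that we used to build the image:

$ docker rm mnist_container

Serving the model in the Docker container

To serve the model in the container, start a container from the image built in the previous section:

$ docker run --name=mnist_container -it $USER/mnist_serving

Go to the home folder with the cd command.
Run the ModelServer with the following command:

$ tensorflow_model_server  --model_name=mnist --model_base_path=/tmp/mnist_model/ &> mnist_log &

Check the predictions from the model with the sample client:

$ python serving/tensorflow_serving/example/mnist_client.py --num_tests=100 --server=localhost:8500

We see an error rate of 7%, just like in our earlier notebook example: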

Extracting /tmp/train-images-idx3-ubyte.gz
Extracting /tmp/train-labels-idx1-ubyte.gz
Extracting /tmp/t10k-images-idx3-ubyte.gz
Extracting /tmp/t10k-labels-idx1-ubyte.gz
....................................................................................................
Inference error rate: 7.0%

That's it! You have built a Docker image and served your model from a Docker container. Issue the exit command to leave the container.

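TensorFlow Serving on Kubernetes

According to Kubernetes (https://kubernetes.io):

Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications.

TensorFlow models can be scaled out in production to be served from hundreds or thousands of TF Serving services running on a Kubernetes cluster. Kubernetes clusters can run on all the popular public clouds, such as GCP, AWS, and Azure, as well as in your on-premises private cloud. So let's go ahead and install Kubernetes, and then deploy the MNIST model on a Kubernetes cluster.

Installing Kubernetes

We installed Kubernetes on Ubuntu 16.04 in single-node local cluster mode with the following steps:

Install LXD and Docker, which are prerequisites for installing Kubernetes locally. LXD is a container manager that works with Linux containers. We already learned how to install Docker in the previous section. To install LXD, run the following command: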

$ sudo snap install lxd
lxd 2.19 from 'canonical'  installed

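Initialize lxd and create the virtual network:

$ sudo /snap/bin/lxd init --auto
LXD has been successfully configured.
$ sudo /snap/bin/lxc network create lxdbr0 ipv4.address=auto ipv4.nat=true ipv6.address=none ipv6.nat=false
If this is your first time using LXD, you should also run: lxd init
To start your first container, try: lxc launch ubuntu:16.04
Network lxdbr0 created

Add your user to the lxd group:

$ sudo usermod -a -G lxd $(whoami)

Install conjure-up and restart the machine:

$ sudo snap install conjure-up --classic
conjure-up 2.3.1 from 'canonical'  installed

Start conjure-up to install Kubernetes:

$ conjure-up kubernetes

From the list of spells, select Kubernetes Core.
From the list of available clouds, select localhost.
From the list of networks, select the lxdbr0 bridge.
Provide the sudo password for the option: Download the kubectl and kubefed client programs to your local host.
On the next screen, it asks you to select the applications to install. Install all five remaining applications.

You know the Kubernetes cluster is ready to use when the final screen during the installation looks like this: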

[Figure: final screen of the conjure-up Kubernetes installation]

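If you run into problems with the installation, search for help on the internet, starting with the documentation at the following links: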

https://kubernetes.io/docs/getting-started-guides/ubuntu/local/

https://kubernetes.io/docs/getting-started-guides/ubuntu/

https://tutorials.ubuntu.com/tutorial/install-kubernetes-with-conjure-up

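Uploading the Docker image to Docker Hub

The steps to upload the Docker image to Docker Hub are as follows:

Create an account on Docker Hub if you do not already have one.
Log in to your Docker Hub account with the following command:

$ docker login --username=<username>

Tag the MNIST image with the repository that you created on Docker Hub. For example, we created neurasights/mnist-serving:

$ docker tag $USER/mnist_serving neurasights/mnist-serving

Push the tagged image to your Docker Hub account: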

$ docker push neurasights/mnist-serving

Deploying in Kubernetes

We proceed with the deployment in Kubernetes as follows:

Create the mnist.yaml file with the following content:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: mnist-deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: mnist-server
    spec:
      containers:
      - name: mnist-container
        image: neurasights/mnist-serving
        command:
        - /bin/sh
        args:
        - -c
        - tensorflow_model_server --model_name=mnist --model_base_path=/tmp/mnist_model
        ports:
        - containerPort: 8500
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: mnist-service
  name: mnist-service
spec:
  ports:
  - port: 8500
    targetPort: 8500
  selector:
    app: mnist-server
#  type: LoadBalancer

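If you are running it in the AWS or GCP cloud, uncomment the LoadBalancer line in the preceding file. Since we are running the whole cluster locally on a single node, we do not have an external load balancer.

Create the Kubernetes deployment and service:

$ kubectl create -f mnist.yaml
deployment "mnist-deployment" created
service "mnist-service" created

Check the deployments, pods, and services:

$ kubectl get deployments
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
mnist-deployment   3         3         3            0           1m

$ kubectl get pods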
NAME                               READY  STATUS              RESTARTS  AGE
default-http-backend-bbchw         1/1 Running             3          9d
mnist-deployment-554f4b674b-pwk8z  0/1 ContainerCreating   0          1m
mnist-deployment-554f4b674b-vn6sd  0/1 ContainerCreating   0          1m
mnist-deployment-554f4b674b-zt4xt  0/1 ContainerCreating   0          1m
nginx-ingress-controller-724n5     1/1 Running             2          9d

$ kubectl get services
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
default-http-backend   ClusterIP      10.152.183.223 <none>        80/TCP           9d
kubernetes             ClusterIP      10.152.183.1 <none>        443/TCP          9d
mnist-service          LoadBalancer   10.152.183.66 <pending>     8500:32414/TCP   1m

$ kubectl describe service mnist-service
Name:                     mnist-service
Namespace:                default
Labels:                   run=mnist-service
Annotations:              <none>
Selector:                 app=mnist-server
Type:                     LoadBalancer
IP:                       10.152.183.66
Port:                     <unset>  8500/TCP
TargetPort:               8500/TCP
NodePort:                 <unset>  32414/TCP
Endpoints:                10.1.43.122:8500,10.1.43.123:8500,10.1.43.124:8500
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

Wait until the status of all the pods is Running:

$ kubectl get pods
NAME                                READY     STATUS    RESTARTS   AGE
default-http-backend-bbchw          1/1 Running   3          9d
mnist-deployment-554f4b674b-pwk8z   1/1 Running   0          3m
mnist-deployment-554f4b674b-vn6sd   1/1 Running   0          3m
mnist-deployment-554f4b674b-zt4xt   1/1 Running   0          3m
nginx-ingress-controller-724n5      1/1 Running   2          9d

Check the logs of one of the pods. You should see something like the following:

$ kubectl logs mnist-deployment-59dfc5df64-g7prf
I tensorflow_serving/model_servers/main.cc:147] Building single TensorFlow model file config: model_name: mnist model_base_path: /tmp/mnist_model
I tensorflow_serving/model_servers/server_core.cc:441] Adding/updating models.
I tensorflow_serving/model_servers/server_core.cc:492] (Re-)adding model: mnist
I tensorflow_serving/core/basic_manager.cc:705] Successfully reserved resources to load servable {name: mnist version: 1}
I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: mnist version: 1}
I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: mnist version: 1}
I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /tmp/mnist_model/1
I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:236] Loading SavedModel from: /tmp/mnist_model/1
I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:155] Restoring SavedModel bundle.
I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running LegacyInitOp on SavedModel bundle.
I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:284] Loading SavedModel: success. Took 45319 microseconds.
I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: mnist version: 1}
E1122 12:18:04.566415410 6 ev_epoll1_linux.c:1051] grpc epoll fd: 3
I tensorflow_serving/model_servers/main.cc:288] Running ModelServer at 0.0.0.0:8500 ...

You can also look at the UI console with the following commands:

$ kubectl proxy
$ xdg-open http://localhost:8001/ui

The Kubernetes UI console looks like the following figures:

[Figures: two screenshots of the Kubernetes UI console]

Since we are running the cluster locally on a single node, our service is exposed only within the cluster and cannot be accessed from outside. Log in to one of the three pods that we just instantiated:

$ kubectl exec -it mnist-deployment-59dfc5df64-bb24q -- /bin/bash

Change to the home directory and run the MNIST client to test the service:

$ kubectl exec -it mnist-deployment-59dfc5df64-bb24q -- /bin/bash
root@mnist-deployment-59dfc5df64-bb24q:/# cd 
root@mnist-deployment-59dfc5df64-bb24q:~# python serving/tensorflow_serving/example/mnist_client.py --num_tests=100 --server=10.152.183.67:8500
Extracting /tmp/train-images-idx3-ubyte.gz
Extracting /tmp/train-labels-idx1-ubyte.gz
Extracting /tmp/t10k-images-idx3-ubyte.gz
Extracting /tmp/t10k-labels-idx1-ubyte.gz
....................................................................................................
Inference error rate: 7.0%
root@mnist-deployment-59dfc5df64-bb24q:~#

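We have learned how to deploy TensorFlow Serving on a Kubernetes cluster running locally on a single node. You can use the same concepts to deploy the service on public clouds or on private clouds on your premises.

Summary

In this chapter, we learned how to use TensorFlow Serving to serve models in production environments. We also learned how to save and restore complete models or selected variables with TensorFlow and Keras. We built a Docker container and served the sample MNIST code from the official TensorFlow Serving repository inside it. We also installed a local Kubernetes cluster and deployed the MNIST model for serving from TensorFlow Serving running in Kubernetes pods. We encourage you to build on these examples and try serving different models. The TF Serving documentation describes the various options and provides additional information for exploring this topic further.

In the following chapters, we continue with advanced models using transfer learning. The pre-trained models available in the TensorFlow repository are good candidates for practicing how to serve TF models with TF Serving. We installed TF Serving from Ubuntu packages, but you may want to try building it from source to optimize the TF Serving installation for your production environment.

12. Transfer Learning and Pre-Trained Models

Put simply, transfer learning means taking a model that has been pre-trained to predict one kind of class and then either using it directly or re-training only a small part of it to predict another kind of class. For example, you can take a model pre-trained to identify types of cats, re-train only a small part of it on types of dogs, and then use it to predict the types of dogs.

Without transfer learning, training a huge model on a large dataset can take days or even months. With transfer learning, by taking a pre-trained model and training only the last few layers, we save much of the time that training the model from scratch would require.

Transfer learning is also useful when no large dataset is available. A model trained on a small dataset may not be able to detect the features that a model trained on a large dataset can. Thus, with transfer learning, you can get better models even with smaller datasets.

In this chapter, we take pre-trained models and train them on new objects. We show examples of pre-trained models for images and apply them to image classification problems. You should try to find other pre-trained models and apply them to different problems, such as object detection, text generation, or machine translation. The following topics are covered in this chapter: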

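The ImageNet dataset
Retraining or fine-tuning models
The COCO animals dataset and preprocessing
Image classification using pre-trained VGG16 in TensorFlow
Image preprocessing in TensorFlow for pre-trained VGG16
Image classification using retrained VGG16 in TensorFlow
Image classification using pre-trained VGG16 in Keras
Image classification using retrained VGG16 in Keras
Image classification using InceptionV3 in TensorFlow
Image classification using retrained InceptionV3 in TensorFlow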

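The ImageNet dataset

According to ImageNet:

ImageNet is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a synonym set or synset.

ImageNet has about 100K synsets, with on average about 1,000 human-annotated images per synset. ImageNet stores only references to the images, while the images themselves remain at their original locations on the internet. In deep learning papers, ImageNet-1K refers to the dataset released as part of ImageNet's Large Scale Visual Recognition Challenge (ILSVRC), which classifies images into 1,000 categories.

The 1,000 challenge categories can be found at the following URLs: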

http://image-net.org/challenges/LSVRC/2017/browse-synsets
http://image-net.org/challenges/LSVRC/2016/browse-synsets
http://image-net.org/challenges/LSVRC/2015/browse-synsets
http://image-net.org/challenges/LSVRC/2014/browse-synsets
http://image-net.org/challenges/LSVRC/2013/browse-synsets
http://image-net.org/challenges/LSVRC/2012/browse-synsets
http://image-net.org/challenges/LSVRC/2011/browse-synsets
http://image-net.org/challenges/LSVRC/2010/browse-synsets

We wrote a custom function to download the ImageNet labels from Google:

import urllib.request

def build_id2label(self):
    base_url = 'https://raw.githubusercontent.com/tensorflow/models/master/research/inception/inception/data/'
    synset_url = '{}/imagenet_lsvrc_2015_synsets.txt'.format(base_url)
    synset_to_human_url = '{}/imagenet_metadata.txt'.format(base_url)

    # the 1,000 synsets used in the ILSVRC challenge
    filename, _ = urllib.request.urlretrieve(synset_url)
    synset_list = [s.strip() for s in open(filename).readlines()]
    num_synsets_in_ilsvrc = len(synset_list)
    assert num_synsets_in_ilsvrc == 1000

    # human-readable names for every synset in ImageNet
    filename, _ = urllib.request.urlretrieve(synset_to_human_url)
    synset_to_human_list = open(filename).readlines()
    num_synsets_in_all_imagenet = len(synset_to_human_list)
    assert num_synsets_in_all_imagenet == 21842

    synset2name = {}
    for s in synset_to_human_list:
        parts = s.strip().split('\t')
        assert len(parts) == 2
        synset = parts[0]
        name = parts[1]
        synset2name[synset] = name

    # models with 1,001 classes reserve id 0 for an 'empty' class
    if self.n_classes == 1001:
        id2label = {0: 'empty'}
        id = 1
    else:
        id2label = {}
        id = 0

    for synset in synset_list:
        label = synset2name[synset]
        id2label[id] = label
        id += 1

    return id2label

We load these labels into our Jupyter notebook as follows:

### Load ImageNet dataset for labels
from datasetslib.imagenet import imageNet
inet = imageNet()
inet.load_data(n_classes=1000)  
#n_classes is 1001 for Inception models and 1000 for VGG models
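
The difference between the two n_classes settings is only an offset of one in the label ids. The following sketch reproduces that indexing on a toy two-synset subset held in memory, so it runs without downloading anything; build_id2label_from_lists is a hypothetical stand-alone variant of the method above, not part of datasetslib.

```python
def build_id2label_from_lists(synset_list, synset2name, n_classes):
    """Map integer class ids to human-readable labels.

    Hypothetical in-memory variant for illustration: with n_classes == 1001,
    id 0 is reserved for an 'empty' class and the real labels start at 1.
    """
    if n_classes == 1001:
        id2label = {0: 'empty'}
        next_id = 1
    else:
        id2label = {}
        next_id = 0
    for synset in synset_list:
        id2label[next_id] = synset2name[synset]
        next_id += 1
    return id2label


if __name__ == "__main__":
    # Toy subset: two synsets only, for demonstration.
    synsets = ['n01440764', 'n01443537']
    names = {'n01440764': 'tench, Tinca tinca',
             'n01443537': 'goldfish, Carassius auratus'}
    print(build_id2label_from_lists(synsets, names, n_classes=1000))
    print(build_id2label_from_lists(synsets, names, n_classes=1001))
```

Using the wrong setting for a model shifts every prediction to the neighboring label, so the value has to match the number of output classes of the checkpoint being loaded.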

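Popular pre-trained image classification models that have been trained on the ImageNet-1K dataset are listed in the following table:

Model name	Top-1 accuracy	Top-5 accuracy	Top-5 error rate	Link to original paper
AlexNet			15.3%	https://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
Inception, also known as InceptionV1	69.8	89.6	6.67%	https://arxiv.org/abs/1409.4842
BN-Inception-V2, also known as InceptionV2	73.9	91.8	4.9%	https://arxiv.org/abs/1502.03167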
InceptionV3	78.0	93.9	3.46%	https://arxiv.org/abs/1512.00567
InceptionV4	80.2	95.2		http://arxiv.org/abs/1602.07261
Inception-Resnet-V2	80.4	95.2		http://arxiv.org/abs/1602.07261
VGG16	71.5	89.8	7.4%	https://arxiv.org/abs/1409.1556
VGG19	71.1	89.8	7.3%	https://arxiv.org/abs/1409.1556
ResNetV1 50	75.2	92.2	7.24%	https://arxiv.org/abs/1512.03385
ResNetV1 101	76.4	92.9		https://arxiv.org/abs/1512.03385
ResNetV1 152	76.8	93.2		https://arxiv.org/abs/1512.03385
ResNetV2 50	75.6	92.8		https://arxiv.org/abs/1603.05027
ResNetV2 101	77.0	93.7		https://arxiv.org/abs/1603.05027
ResNetV2 152	77.8	94.1		https://arxiv.org/abs/1603.05027
ResNetV2 200	79.9	95.2		https://arxiv.org/abs/1603.05027
Xception	79.0	94.5		https://arxiv.org/abs/1610.02357
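MobileNet V1	41.3 to 70.7	66.2 to 89.5		https://arxiv.org/pdf/1704.04861.pdf

In the preceding table, the Top-1 and Top-5 metrics refer to the model's performance on the ImageNet validation dataset.

Google Research recently released a new family of models called MobileNets. MobileNets were developed with a mobile-first strategy, trading some accuracy for low resource usage. They are designed to consume little power and provide low latency, giving a better experience on mobile and embedded devices. Google provides 16 pre-trained checkpoint files for the MobileNet models, each with a different number of parameters and multiply-accumulates (MACs). The higher the MACs and parameters, the higher the resource usage and latency. You can therefore trade higher accuracy against higher resource usage and latency.

Model checkpoint	Million MACs	Million parameters	Top-1 accuracy (%)	Top-5 accuracy (%)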
MobileNet_v1_1.0_224	569	4.24	70.7	89.5
MobileNet_v1_1.0_192	418	4.24	69.3	88.9
MobileNet_v1_1.0_160	291	4.24	67.2	87.5
MobileNet_v1_1.0_128	186	4.24	64.1	85.3
MobileNet_v1_0.75_224	317	2.59	68.4	88.2
MobileNet_v1_0.75_192	233	2.59	67.4	87.3
MobileNet_v1_0.75_160	162	2.59	65.2	86.1
MobileNet_v1_0.75_128	104	2.59	61.8	83.6
MobileNet_v1_0.50_224	150	1.34	64.0	85.4
MobileNet_v1_0.50_192	110	1.34	62.1	84.0
MobileNet_v1_0.50_160	77	1.34	59.9	82.5
MobileNet_v1_0.50_128	49	1.34	56.2	79.6
MobileNet_v1_0.25_224	41	0.47	50.6	75.0
MobileNet_v1_0.25_192	34	0.47	49.0	73.6
MobileNet_v1_0.25_160	21	0.47	46.0	70.7
MobileNet_v1_0.25_128	14	0.47	41.3	66.2
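
One way to read the table is as a lookup: given a compute budget, pick the most accurate checkpoint that fits. The following is a minimal sketch of that idea; the helper name best_under_budget and the budget values are assumptions made for this illustration, while the numbers are copied from the table above.

```python
# (checkpoint, million MACs, million parameters, top-1 %, top-5 %)
MOBILENET_V1 = [
    ("MobileNet_v1_1.0_224", 569, 4.24, 70.7, 89.5),
    ("MobileNet_v1_1.0_192", 418, 4.24, 69.3, 88.9),
    ("MobileNet_v1_1.0_160", 291, 4.24, 67.2, 87.5),
    ("MobileNet_v1_1.0_128", 186, 4.24, 64.1, 85.3),
    ("MobileNet_v1_0.75_224", 317, 2.59, 68.4, 88.2),
    ("MobileNet_v1_0.75_192", 233, 2.59, 67.4, 87.3),
    ("MobileNet_v1_0.75_160", 162, 2.59, 65.2, 86.1),
    ("MobileNet_v1_0.75_128", 104, 2.59, 61.8, 83.6),
    ("MobileNet_v1_0.50_224", 150, 1.34, 64.0, 85.4),
    ("MobileNet_v1_0.50_192", 110, 1.34, 62.1, 84.0),
    ("MobileNet_v1_0.50_160", 77, 1.34, 59.9, 82.5),
    ("MobileNet_v1_0.50_128", 49, 1.34, 56.2, 79.6),
    ("MobileNet_v1_0.25_224", 41, 0.47, 50.6, 75.0),
    ("MobileNet_v1_0.25_192", 34, 0.47, 49.0, 73.6),
    ("MobileNet_v1_0.25_160", 21, 0.47, 46.0, 70.7),
    ("MobileNet_v1_0.25_128", 14, 0.47, 41.3, 66.2),
]


def best_under_budget(checkpoints, max_macs):
    """Return the row with the highest top-1 accuracy within the MAC budget."""
    candidates = [row for row in checkpoints if row[1] <= max_macs]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[3])


# Hypothetical budget of 150 million MACs per inference.
choice = best_under_budget(MOBILENET_V1, max_macs=150)
print(choice[0])
```

A real selection would also weigh memory, latency on the target device, and the input resolution the application can supply.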

More information about MobileNets is available from the following resources:

https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html

https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md

https://arxiv.org/pdf/1704.04861.pdf

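Retraining or fine-tuning models

Models trained on large and diverse datasets such as ImageNet learn to detect and capture generic features such as curves, edges, and shapes. Some of these features carry over readily to other kinds of datasets. In transfer learning, we therefore take such a general-purpose model and use some of the following techniques to fine-tune or retrain it on our own dataset:

Remove and replace the last layer: The most common practice is to remove the last layer and add a new classification layer that matches our dataset. For example, the ImageNet models are trained on 1,000 classes, but our COCO animals dataset has only 8 classes, so we remove the softmax layer that generates probabilities for 1,000 classes and replace it with a softmax layer that generates probabilities for 8 classes. This technique is generally used when the new dataset is very similar to the dataset the model was trained on, so that only the last layer needs to be retrained.
Freeze the first few layers: Another common practice is to freeze the first few layers so that only the weights of the last, unfrozen layers are updated with the new dataset. We will see an example in which we freeze the first 15 layers and retrain only the remaining layers. This technique is generally used when the new dataset is quite different from the dataset the model was trained on, so that more than just the last layer needs to be trained.
Tune the hyperparameters: You can also tune the hyperparameters before retraining, for example by changing the learning rate or trying a different loss function or a different optimizer.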
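
The effect of freezing can be shown without any deep learning framework. The following is a toy NumPy sketch under stated assumptions: the parameter names, shapes, and the constant gradients are made up for this illustration, and sgd_step is a hypothetical helper rather than a TensorFlow or Keras function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: one "pre-trained" feature layer and one new classification layer.
params = {
    "features/w": rng.normal(size=(4, 3)),    # pretend these came from a checkpoint
    "classifier/w": rng.normal(size=(3, 2)),  # new last layer for our own classes
}
trainable = {"features/w": False, "classifier/w": True}


def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one gradient descent update, skipping every frozen parameter."""
    return {
        name: (value - lr * grads[name]) if trainable[name] else value
        for name, value in params.items()
    }


# Constant gradients keep the example deterministic.
grads = {name: np.ones_like(value) for name, value in params.items()}
updated = sgd_step(params, grads, trainable)
```

After the step, the frozen feature weights are unchanged and only the classifier weights have moved, which is the behavior the freezing technique relies on.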

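Both TensorFlow and Keras provide pre-trained models.

We will demonstrate our examples with TensorFlow Slim, which at the time of writing has several pre-trained models available in the folder tensorflow/models/research/slim/nets. We will use TensorFlow Slim to instantiate the pre-trained models and then load the weights from the downloaded checkpoint files. The loaded models are then used to make predictions on the new dataset. After that, we retrain the models to fine-tune the predictions.

We will also demonstrate transfer learning with the Keras pre-trained models available in the keras.applications module. While TensorFlow has more than 20 pre-trained models, keras.applications has only the following seven: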

Xception
VGG16
VGG19
ResNet50
InceptionV3
InceptionResNetV2
MobileNet

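COCO animals dataset and preprocessing images

For our examples, we will use the COCO animals dataset, a small subset of the COCO dataset made available by researchers at Stanford University. The COCO animals dataset has 800 training images and 200 test images of 8 classes of animals: bear, bird, cat, dog, giraffe, horse, sheep, and zebra. The images are downloaded and preprocessed for the VGG16 and Inception models.

For the VGG model, the image size is 224 x 224, and the preprocessing steps are as follows:

The image is resized to 224 x 224 with a function similar to the tf.image.resize_image_with_crop_or_pad function from TensorFlow. We implemented this function as follows:

def resize_image(self,in_image:PIL.Image, new_width, new_height, crop_or_pad=True):
    img = in_image
    if crop_or_pad:
        # crop a box of the target size around the center of the image
        half_width = img.size[0] // 2
        half_height = img.size[1] // 2
        half_new_width = new_width // 2
        half_new_height = new_height // 2
        img = img.crop((half_width-half_new_width,
                        half_height-half_new_height,
                        half_width+half_new_width,
                        half_height+half_new_height))
    img = img.resize(size=(new_width, new_height))
    return img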
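
The crop box arithmetic can be checked on its own. The following is a minimal sketch that repeats only the box computation from the function above; center_crop_box is a name chosen for this illustration, and the image sizes are examples.

```python
def center_crop_box(width, height, new_width, new_height):
    """Return the (left, upper, right, lower) box of a centered crop."""
    half_width = width // 2
    half_height = height // 2
    half_new_width = new_width // 2
    half_new_height = new_height // 2
    return (half_width - half_new_width,
            half_height - half_new_height,
            half_width + half_new_width,
            half_height + half_new_height)


# An image 500 pixels wide and 373 pixels high, cropped to 224 x 224.
box = center_crop_box(500, 373, 224, 224)
print(box)

# An image smaller than the target: the box extends past the image borders.
small_box = center_crop_box(200, 150, 224, 224)
print(small_box)
```

When the box extends past the image, as in the second call, the crop covers an area outside the picture, which Pillow typically fills with zeros; that is what provides the padding half of the crop-or-pad behavior.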

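After resizing, convert the image from PIL.Image to a NumPy array and check whether it has depth channels, because some of the images in the dataset are greyscale only.

img = self.pil_to_nparray(img)
if len(img.shape)==2:   # greyscale or no channels then add three channels
    h=img.shape[0]
    w=img.shape[1]
    img = np.dstack([img]*3)

Then we subtract the VGG dataset means from the image to center the data. We center the data of our new training images so that the features have a range similar to that of the data originally used to train the model. Keeping the features in a similar range ensures that the gradients during retraining do not become too large or too small. Centering the data also makes learning faster, because the gradients become more uniform when every channel is centered around a zero mean.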

means = np.array([[[123.68, 116.78, 103.94]]]) #shape=[1, 1, 3]
img = img - means
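
The two array operations used so far, stacking a greyscale image into three channels and subtracting per-channel means, can be tried on a tiny array. The following is a minimal sketch; the 2 x 2 pixel values are made up for this illustration.

```python
import numpy as np

# A hypothetical 2 x 2 greyscale image with no channel axis.
grey = np.array([[0, 128], [255, 64]], dtype=np.uint8)

# Repeat the single channel three times to get shape (2, 2, 3).
rgb = np.dstack([grey] * 3)

# The means have shape (1, 1, 3), so they broadcast over height and width.
means = np.array([[[123.68, 116.78, 103.94]]])
centered = rgb - means

print(rgb.shape, centered.dtype)
```

The subtraction promotes the unsigned 8-bit pixels to floating point, so negative values are represented correctly instead of wrapping around.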

The complete preprocessing function is as follows:

def preprocess_for_vgg(self,incoming, height=None, width=None):
    # height and width are kept for compatibility; the VGG16 default size is always used
    if isinstance(incoming, six.string_types):
        img = self.load_image(incoming)
    else:
        img=incoming
    img_size = vgg.vgg_16.default_image_size
    height = img_size
    width = img_size
    img = self.resize_image(img,height,width)
    img = self.pil_to_nparray(img)
    if len(img.shape)==2:   # greyscale or no channels then add three channels
        h=img.shape[0]
        w=img.shape[1]
        img = np.dstack([img]*3)
    means = np.array([[[123.68, 116.78, 103.94]]]) #shape=[1, 1, 3]
    try:
        img = img - means
    except Exception as ex:
        print('Error preprocessing ',incoming)
        print(ex)
    return img

For the Inception model, the image size is 299 x 299, and the preprocessing steps are as follows:

The image is resized to 299 x 299 with a function similar to the tf.image.resize_image_with_crop_or_pad function from TensorFlow. We reuse the resize function defined earlier in the VGG preprocessing steps.
The image is then scaled to the range (-1, +1) with the following code:

img = ((img/255.0) - 0.5) * 2.0

The complete preprocessing function is as follows:

def preprocess_for_inception(self,incoming):
    img_size = inception.inception_v3.default_image_size
    height = img_size
    width = img_size
    if isinstance(incoming, six.string_types):
        img = self.load_image(incoming)
    else:
        img=incoming
    img = self.resize_image(img,height,width)
    img = self.pil_to_nparray(img)
    if len(img.shape)==2:   # greyscale or no channels then add three channels
        h=img.shape[0]
        w=img.shape[1]
        img = np.dstack([img]*3)
    img = ((img/255.0) - 0.5) * 2.0
    return img

Let us load the COCO animals dataset:

from datasetslib.coco import coco_animals
coco = coco_animals()
x_train_files, y_train, x_val_files, x_val = coco.load_data()

We take one image from each class in the validation set to build the list x_test, and preprocess these images to build the list images_test:

x_test = [x_val_files[25*x] for x in range(8)]
images_test=np.array([coco.preprocess_for_vgg(x) for x in x_test])

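We use this helper function to display the images along with the top five classes and their probabilities for each image:

# helper function
def disp(images,id2label=None,probs=None,n_top=5,scale=False):
    if scale:
        # undo the VGG mean subtraction and rescale to [0, 1] for display
        imgs = np.abs(images + np.array([[[[123.68, 116.78, 103.94]]]]))/255.0
    else:
        imgs = images
    ids={}
    for j in range(len(images)):
        if scale:
            plt.figure(figsize=(5,5))
            plt.imshow(imgs[j])
        else:
            plt.imshow(imgs[j].astype(np.uint8) )
        plt.show()
        if probs is not None:
            # class ids sorted by descending probability
            ids[j] = [i[0] for i in sorted(enumerate(-probs[j]), key=lambda x:x[1])]
            for k in range(n_top):
                id = ids[j][k]
                print('Probability {0:1.2f}% of [{1:}]'.format(100*probs[j,id],id2label[id]))

The following code in the preceding function reverses the effect of the preprocessing, so that the original image is displayed instead of the preprocessed one:

imgs = np.abs(images + np.array([[[[123.68, 116.78, 103.94]]]]))/255.0

For the Inception model, the code to reverse the preprocessing is as follows: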

imgs = (images / 2.0) + 0.5
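
A quick round trip confirms that both reversal formulas bring the data back to the [0, 1] range that is suitable for display. The following is a minimal sketch; the small batch of pixel values is made up for this illustration.

```python
import numpy as np

vgg_means = np.array([[[[123.68, 116.78, 103.94]]]])  # shape (1, 1, 1, 3)

# A hypothetical batch of two 2 x 2 RGB images with pixel values in [0, 255].
batch = np.arange(24, dtype=np.float64).reshape(2, 2, 2, 3) * 10.0

# VGG: subtract the channel means, then add them back and rescale to [0, 1].
vgg_input = batch - vgg_means
vgg_restored = np.abs(vgg_input + vgg_means) / 255.0

# Inception: scale to [-1, 1], then map back to [0, 1].
inception_input = ((batch / 255.0) - 0.5) * 2.0
inception_restored = (inception_input / 2.0) + 0.5

print(np.allclose(vgg_restored, batch / 255.0))
print(np.allclose(inception_restored, batch / 255.0))
```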

You can look at the test images with the following code:

images=np.array([mpimg.imread(x) for x in x_test])
disp(images)

Follow the code in the Jupyter notebook to see the images. They all appear to have different dimensions, so let us print their original dimensions:

print([x.shape for x in images])

The dimensions are:

[(640, 425, 3), (373, 500, 3), (367, 640, 3), (427, 640, 3), (428, 640, 3), (426, 640, 3), (480, 640, 3), (612, 612, 3)]

Let us preprocess the test images and look at the dimensions:

images_test=np.array([coco.preprocess_for_vgg(x) for x in x_test])
print(images_test.shape)

The dimensions are:

(8, 224, 224, 3)

For Inception, the dimensions are:

(8, 299, 299, 3)

The preprocessed images for Inception are not viewable, but let us print the preprocessed images for VGG to see what they look like:

disp(images_test)


[Figure: the eight test images after VGG preprocessing, shown in a grid of four rows and two columns]

The images are in fact cropped; we can see what they look like when we reverse the preprocessing while keeping the crop:


[Figure: the eight cropped test images with the preprocessing reversed, shown in a grid of four rows and two columns]

Now that we have the labels from ImageNet as well as the images and labels from the COCO animals dataset, let us try the transfer learning examples.

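VGG16 in TensorFlow

You can follow along with the code in the Jupyter notebook ch-12a_VGG16_TensorFlow.

For all the VGG16 examples in TensorFlow, we first download the checkpoint file from http://download.tensorflow.org/models/vgg_16_2016_08_28.tar.gz and initialize the variables with the following code: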

model_name='vgg_16'
model_url='http://download.tensorflow.org/models/'
model_files=['vgg_16_2016_08_28.tar.gz']
model_home=os.path.join(models_root,model_name)
dsu.download_dataset(source_url=model_url,
                     source_files=model_files,
                     dest_dir = model_home,
                     force=False,
                     extract=True)

We also define some common imports and variables:

from tensorflow.contrib import slim
from tensorflow.contrib.slim.nets import vgg
image_height=vgg.vgg_16.default_image_size
image_width=vgg.vgg_16.default_image_size

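Image classification with pre-trained VGG16 in TensorFlow

Let us first try to predict the classes of the test images without any retraining. First, we clear the default graph and define a placeholder for the images: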

tf.reset_default_graph()
x_p = tf.placeholder(shape=(None,image_height, image_width,3),dtype=tf.float32,name='x_p')

The shape of the placeholder x_p is (?, 224, 224, 3). Next, load the vgg16 model:

with slim.arg_scope(vgg.vgg_arg_scope()):
    logits,_ = vgg.vgg_16(x_p,
                          num_classes=inet.n_classes,
                          is_training=False)

Add a softmax layer to produce the class probabilities:

probabilities = tf.nn.softmax(logits)
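
What the softmax layer computes, and how the top classes are later picked from its output, can be sketched in plain NumPy. The functions softmax and top_k below are written for this illustration only and are not the TensorFlow implementations; the logits are made-up values.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def top_k(probs, k):
    """Return the class ids of the k largest probabilities per row, best first."""
    return np.argsort(-probs, axis=1)[:, :k]


# Hypothetical logits for two images and four classes.
logits = np.array([[2.0, 1.0, 0.1, -1.0],
                   [0.5, 0.0, 5.0, -0.5]])
probs = softmax(logits)
best = top_k(probs, k=2)

print(probs.sum(axis=1))
print(best)
```

Each row of probs sums to 1, and the ranking of the classes is the same as the ranking of the logits, because softmax is monotonic.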

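Define an initialization function that restores the variables, such as weights and biases, from the checkpoint file:

init = slim.assign_from_checkpoint_fn(
    os.path.join(model_home, '{}.ckpt'.format(model_name)),
    slim.get_variables_to_restore())

In a TensorFlow session, initialize the variables and run the probabilities tensor to get the probabilities for each image:

with tf.Session() as tfs:
    init(tfs)
    probs = tfs.run([probabilities],feed_dict={x_p:images_test})
    probs=probs[0]

Let us see what classes we get: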

disp(images_test,id2label=inet.id2label,probs=probs,scale=True)

[Image: test image 1]

Probability 99.15% of [zebra]
Probability 0.37% of [tiger cat]
Probability 0.33% of [tiger, Panthera tigris]
Probability 0.04% of [goose]
Probability 0.02% of [tabby, tabby cat]

[Image: test image 2]

Probability 99.50% of [horse cart, horse-cart]
Probability 0.37% of [plow, plough]
Probability 0.06% of [Arabian camel, dromedary, Camelus dromedarius]
Probability 0.05% of [sorrel]
Probability 0.01% of [barrel, cask]

[Image: test image 3]

Probability 19.32% of [Cardigan, Cardigan Welsh corgi]
Probability 11.78% of [papillon]
Probability 9.01% of [Shetland sheepdog, Shetland sheep dog, Shetland]
Probability 7.09% of [Siamese cat, Siamese]
Probability 6.27% of [Pembroke, Pembroke Welsh corgi]

[Image: test image 4]

Probability 97.09% of [chickadee]
Probability 2.52% of [water ouzel, dipper]
Probability 0.23% of [junco, snowbird]
Probability 0.09% of [hummingbird]
Probability 0.04% of [bulbul]

[Image: test image 5]

Probability 24.98% of [whippet]
Probability 16.48% of [lion, king of beasts, Panthera leo]
Probability 5.54% of [Saluki, gazelle hound]
Probability 4.99% of [brown bear, bruin, Ursus arctos]
Probability 4.11% of [wire-haired fox terrier]

[Image: test image 6]

Probability 98.56% of [brown bear, bruin, Ursus arctos]
Probability 1.40% of [American black bear, black bear, Ursus americanus, Euarctos americanus]
Probability 0.03% of [sloth bear, Melursus ursinus, Ursus ursinus]
Probability 0.00% of [wombat]
Probability 0.00% of [beaver]

[Image: test image 7]

Probability 20.84% of [leopard, Panthera pardus]
Probability 12.81% of [cheetah, chetah, Acinonyx jubatus]
Probability 12.26% of [banded gecko]
Probability 10.28% of [jaguar, panther, Panthera onca, Felis onca]
Probability 5.30% of [gazelle]

[Image: test image 8]

Probability 8.09% of [shower curtain]
Probability 3.59% of [binder, ring-binder]
Probability 3.32% of [accordion, piano accordion, squeeze box]
Probability 3.12% of [radiator]
Probability 1.81% of [abaya]

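The pre-trained model, which has never seen the images in our dataset and knows nothing about its classes, correctly identified the zebra, the horse cart, the bird, and the bear. It failed to recognize the giraffe because it has never seen a giraffe before. We will retrain this model on our dataset with far less effort and a much smaller dataset of 800 images. Before we do that, let us look at doing the same image preprocessing in TensorFlow.

Preprocessing images for pre-trained VGG16 in TensorFlow

We define a function for the preprocessing steps in TensorFlow as follows:

def tf_preprocess(filelist):
    images=[]
    for filename in filelist:
        image_string = tf.read_file(filename)
        image_decoded = tf.image.decode_jpeg(image_string, channels=3)
        image_float = tf.cast(image_decoded, tf.float32)
        resize_fn = tf.image.resize_image_with_crop_or_pad
        image_resized = resize_fn(image_float, image_height, image_width)
        means = tf.reshape(tf.constant([123.68, 116.78, 103.94]), [1, 1, 3])
        image = image_resized - means
        images.append(image)
    images = tf.stack(images)
    return images

Here, we create the images variable instead of a placeholder: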

images=tf_preprocess([x for x in x_test])

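We follow the same process as before to define the VGG16 model, restore the variables, and run the predictions:

with slim.arg_scope(vgg.vgg_arg_scope()):
    logits,_ = vgg.vgg_16(images,
                          num_classes=inet.n_classes,
                          is_training=False)
probabilities = tf.nn.softmax(logits)
init = slim.assign_from_checkpoint_fn(
    os.path.join(model_home, '{}.ckpt'.format(model_name)),
    slim.get_variables_to_restore())

We get the same class probabilities as before. We only wanted to demonstrate that the preprocessing can also be done in TensorFlow. However, preprocessing in TensorFlow is limited to the functions that TensorFlow provides and ties you closely to the framework.

We recommend that you keep the preprocessing pipeline separate from the TensorFlow model training and prediction code. Keeping it separate makes it modular and has other advantages, such as being able to save the preprocessed data for reuse across multiple models.

Image classification with retrained VGG16 in TensorFlow

Now, we will retrain the VGG16 model for the COCO animals dataset. Let us start by defining three placeholders:

The is_training placeholder specifies whether we are using the model for training or for prediction
x_p is the input placeholder, with shape (None, image_height, image_width, 3)
y_p is the output placeholder, with shape (None, 1)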

is_training = tf.placeholder(tf.bool,name='is_training')
x_p = tf.placeholder(shape=(None,image_height, image_width,3),dtype=tf.float32,name='x_p')
y_p = tf.placeholder(shape=(None,1),dtype=tf.int32,name='y_p')

As we explained in the strategy section, we restore all layers from the checkpoint file except the last one, which is called the vgg_16/fc8 layer:

with slim.arg_scope(vgg.vgg_arg_scope()):
    logits, _ = vgg.vgg_16(x_p,
                           num_classes=coco.n_classes,
                           is_training=is_training)
probabilities = tf.nn.softmax(logits)
# restore all layers except the last layer, fc8
fc7_variables=tf.contrib.framework.get_variables_to_restore(exclude=['vgg_16/fc8'])
fc7_init = tf.contrib.framework.assign_from_checkpoint_fn(
    os.path.join(model_home, '{}.ckpt'.format(model_name)),
    fc7_variables)

Next, define the variables of the last layer, which are initialized rather than restored:

# fc8 layer
fc8_variables = tf.contrib.framework.get_variables('vgg_16/fc8')
fc8_init = tf.variables_initializer(fc8_variables)

As we learned in the earlier chapters, define the loss function with tf.losses.sparse_softmax_cross_entropy():

tf.losses.sparse_softmax_cross_entropy(labels=y_p, logits=logits)
loss = tf.losses.get_total_loss()

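Train the last layer for a few epochs, and then train the entire network for a few epochs. To do so, define two separate optimizers and training operations:

learning_rate = 0.001
fc8_optimizer = tf.train.GradientDescentOptimizer(learning_rate)
fc8_train_op = fc8_optimizer.minimize(loss, var_list=fc8_variables)
full_optimizer = tf.train.GradientDescentOptimizer(learning_rate)
full_train_op = full_optimizer.minimize(loss)

We decided to use the same learning rate for both optimizers, but you can define separate learning rates if you decide to tune the hyperparameters further.

Define the accuracy function as usual: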

y_pred = tf.to_int32(tf.argmax(logits, 1))
# y_p has shape (None, 1); squeeze it so the comparison is element-wise
n_correct_pred = tf.equal(y_pred, tf.squeeze(y_p, axis=1))
accuracy = tf.reduce_mean(tf.cast(n_correct_pred, tf.float32))
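
Shapes matter in the accuracy computation: the predictions have shape (N,), while labels fed through a (None, 1) placeholder have shape (N, 1). The following NumPy sketch shows what happens when the two are compared, on the assumption that the TensorFlow comparison op follows the same broadcasting rules as NumPy; the label values are made up for this illustration.

```python
import numpy as np

y_pred = np.array([0, 1, 2, 1])          # shape (4,), like argmax over the logits
y_true = np.array([[0], [1], [1], [1]])  # shape (4, 1), like a (None, 1) placeholder

# Comparing (4,) with (4, 1) broadcasts to a (4, 4) matrix of all pairings.
broadcast_match = np.equal(y_pred, y_true)

# Dropping the trailing axis gives the intended element-wise comparison.
elementwise_match = np.equal(y_pred, np.squeeze(y_true, axis=1))
accuracy = elementwise_match.astype(np.float32).mean()

print(broadcast_match.shape, elementwise_match.shape, accuracy)
```

Averaging the broadcast matrix would mix in comparisons between unrelated samples, so the labels need to be flattened to shape (N,) before the comparison.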

Finally, we train the last layer for 10 epochs and then the full network for 10 epochs, with a batch size of 32. We also use the same session to predict the classes:

fc8_epochs = 10
full_epochs = 10
coco.y_onehot = False
coco.batch_size = 32
coco.batch_shuffle = True
total_images = len(x_train_files)
n_batches = total_images // coco.batch_size

with tf.Session() as tfs:
    fc7_init(tfs)
    tfs.run(fc8_init)

    # first train only the new fc8 layer
    for epoch in range(fc8_epochs):
        print('Starting fc8 epoch ',epoch)
        coco.reset_index()
        epoch_accuracy=0
        for batch in range(n_batches):
            x_batch, y_batch = coco.next_batch()
            images=np.array([coco.preprocess_for_vgg(x) for x in x_batch])
            feed_dict={x_p:images,y_p:y_batch,is_training:True}
            tfs.run(fc8_train_op, feed_dict = feed_dict)
            feed_dict={x_p:images,y_p:y_batch,is_training:False}
            batch_accuracy = tfs.run(accuracy,feed_dict=feed_dict)
            epoch_accuracy += batch_accuracy
        epoch_accuracy /= n_batches
        print('Train accuracy in epoch {}:{}'.format(epoch,epoch_accuracy))

    # then train the full network
    for epoch in range(full_epochs):
        print('Starting full epoch ',epoch)
        coco.reset_index()
        epoch_accuracy=0
        for batch in range(n_batches):
            x_batch, y_batch = coco.next_batch()
            images=np.array([coco.preprocess_for_vgg(x) for x in x_batch])
            feed_dict={x_p:images,y_p:y_batch,is_training:True}
            tfs.run(full_train_op, feed_dict = feed_dict )
            feed_dict={x_p:images,y_p:y_batch,is_training:False}
            batch_accuracy = tfs.run(accuracy,feed_dict=feed_dict)
            epoch_accuracy += batch_accuracy
        epoch_accuracy /= n_batches
        print('Train accuracy in epoch {}:{}'.format(epoch,epoch_accuracy))

    # now run the predictions
    feed_dict={x_p:images_test,is_training: False}
    probs = tfs.run([probabilities],feed_dict=feed_dict)
    probs=probs[0]

Let us print the results of our predictions:

disp(images_test,id2label=coco.id2label,probs=probs,scale=True)

[Image: test image 1]

Probability 100.00% of [zebra]

[Image: test image 2]

Probability 100.00% of [horse]

[Image: test image 3]

Probability 98.88% of [cat]

[Image: test image 4]

Probability 100.00% of [bird]

[Image: test image 5]

Probability 68.88% of [bear]
Probability 31.06% of [sheep]
Probability 0.02% of [dog]
Probability 0.02% of [bird]
Probability 0.01% of [horse]

[Image: test image 6]

Probability 100.00% of [bear]
Probability 0.00% of [dog]
Probability 0.00% of [bird]
Probability 0.00% of [sheep]
Probability 0.00% of [cat]

[Image: test image 7]

Probability 100.00% of [giraffe]

[Image: test image 8]

Probability 61.36% of [cat]
Probability 16.70% of [dog]
Probability 7.46% of [bird]
Probability 5.34% of [bear]
Probability 3.65% of [giraffe]

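It correctly identified the cat and the giraffe, and raised the probabilities for several of the other images to 100%. It still makes some mistakes: the last picture is classified as a cat, although after cropping it is really just a noisy image. We leave it to you to improve on these results.

VGG16 in Keras

You can follow along with the code in the Jupyter notebook ch-12a_VGG16_Keras.

Now let us do the same classification and retraining with Keras. You will see how easily we can use the VGG16 pre-trained model in Keras with much less code.

Image classification with pre-trained VGG16 in Keras

Loading the model is a one-line operation: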

from keras.applications import VGG16
model=VGG16(weights='imagenet')

We can use this model to predict the class probabilities:

probs = model.predict(images_test)

Here are the results of this classification:

[Image: test image 1]

Probability 99.41% of [zebra]
Probability 0.19% of [tiger cat]
Probability 0.13% of [goose]
Probability 0.09% of [tiger, Panthera tigris]
Probability 0.02% of [mushroom]

[Image: test image 2]

Probability 87.50% of [horse cart, horse-cart]
Probability 5.58% of [Arabian camel, dromedary, Camelus dromedarius]
Probability 4.72% of [plow, plough]
Probability 1.03% of [dogsled, dog sled, dog sleigh]
Probability 0.31% of [wreck]

[Image: test image 3]

Probability 34.96% of [Siamese cat, Siamese]
Probability 12.71% of [toy terrier]
Probability 10.15% of [Boston bull, Boston terrier]
Probability 6.53% of [Italian greyhound]
Probability 6.01% of [Cardigan, Cardigan Welsh corgi]

[Image: test image 4]

Probability 56.41% of [junco, snowbird]
Probability 38.08% of [chickadee]
Probability 1.93% of [bulbul]
Probability 1.35% of [hummingbird]
Probability 1.09% of [house finch, linnet, Carpodacus mexicanus]

[Image: test image 5]

Probability 54.19% of [brown bear, bruin, Ursus arctos]
Probability 28.07% of [lion, king of beasts, Panthera leo]
Probability 0.87% of [Norwich terrier]
Probability 0.82% of [Lakeland terrier]
Probability 0.73% of [wild boar, boar, Sus scrofa]

[Image: test image 6]

Probability 88.64% of [brown bear, bruin, Ursus arctos]
Probability 7.22% of [American black bear, black bear, Ursus americanus, Euarctos americanus]
Probability 4.13% of [sloth bear, Melursus ursinus, Ursus ursinus]
Probability 0.00% of [badger]
Probability 0.00% of [wombat]

[Image: test image 7]

Probability 38.70% of [jaguar, panther, Panthera onca, Felis onca]
Probability 33.78% of [leopard, Panthera pardus]
Probability 14.22% of [cheetah, chetah, Acinonyx jubatus]
Probability 6.15% of [banded gecko]
Probability 1.53% of [snow leopard, ounce, Panthera uncia]

[Image: test image 8]

Probability 12.54% of [shower curtain] 
Probability 2.82% of [binder, ring-binder] 
Probability 2.28% of [toilet tissue, toilet paper, bathroom tissue] 
Probability 2.12% of [accordion, piano accordion, squeeze box] 
Probability 2.05% of [bath towel]

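It could not identify the sheep, the giraffe, or the last noisy image in which the dog was cropped out. Now, let us retrain the model in Keras with our dataset.

Image classification with retrained VGG16 in Keras

Let us use the COCO animals dataset to retrain the model and fine-tune it for our classification task. We will remove the last layer of the Keras model and add our own fully connected layer with softmax activation for 8 classes. We will also demonstrate freezing the first few layers by setting the trainable attribute of the first 15 layers to False.

First, import the VGG16 model without the top layers by setting include_top to False: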

# load the vgg model
from keras.applications import VGG16
base_model=VGG16(weights='imagenet',include_top=False, input_shape=(224,224,3))

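We also specified input_shape in the preceding code; otherwise Keras throws an exception later.

Now we build the classifier model to put on top of the imported VGG model:

from keras.models import Sequential, Model
from keras.layers import Flatten, Dense, Dropout

top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dropout(0.5))
top_model.add(Dense(coco.n_classes, activation='softmax'))

Next, add the model on top of the VGG base:

model=Model(inputs=base_model.input, outputs=top_model(base_model.output))

Freeze the first 15 layers:

for layer in model.layers[:15]:
    layer.trainable = False

We picked 15 layers to freeze arbitrarily; you may want to experiment with this number. Let us compile the model and print the model summary:

from keras import optimizers

model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])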
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
_________________________________________________________________
sequential_1 (Sequential)    (None, 8)                 6424840   
=================================================================
Total params: 21,139,528
Trainable params: 13,504,264
Non-trainable params: 7,635,264

We see that about 36% of the parameters (7,635,264 of 21,139,528) are frozen and non-trainable.
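
The parameter counts in the summary can be reproduced by hand, which also shows which layers the first 15 entries of model.layers cover. The following is a minimal sketch; conv2d_params and dense_params are helper names chosen for this illustration, and the channel sizes are read off the summary above.

```python
def conv2d_params(kernel, in_channels, out_channels):
    """Weights plus biases of a convolution layer with a square kernel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# (input channels, output channels) of the 13 VGG16 convolutions, in order.
conv_channels = [(3, 64), (64, 64),
                 (64, 128), (128, 128),
                 (128, 256), (256, 256), (256, 256),
                 (256, 512), (512, 512), (512, 512),
                 (512, 512), (512, 512), (512, 512)]
conv_counts = [conv2d_params(3, c_in, c_out) for c_in, c_out in conv_channels]

# model.layers[:15] covers the input layer and blocks 1 to 4 with their pooling
# layers, that is, the first 10 convolutions; pooling layers have no parameters.
frozen = sum(conv_counts[:10])

# Classifier on top: Flatten(7 x 7 x 512) -> Dense(256) -> Dense(8).
top = dense_params(7 * 7 * 512, 256) + dense_params(256, 8)

total = sum(conv_counts) + top
frozen_fraction = frozen / total

print(frozen, total, round(frozen_fraction, 3))
```

The frozen count matches the non-trainable total in the summary, and the three convolutions of block 5 together with the new classifier account for all the trainable parameters.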

Next, train the Keras model for 20 epochs with a batch size of 32:

from keras.utils import np_utils

batch_size=32
n_epochs=20
total_images = len(x_train_files)
n_batches = total_images // batch_size
for epoch in range(n_epochs):
    print('Starting epoch ',epoch)
    coco.reset_index_in_epoch()
    for batch in range(n_batches):
        try:
            x_batch, y_batch = coco.next_batch(batch_size=batch_size)
            images=np.array([coco.preprocess_image(x) for x in x_batch])
            # categorical_crossentropy expects one-hot encoded labels
            y_onehot = np_utils.to_categorical(y_batch,
                                               num_classes=coco.n_classes)
            model.fit(x=images,y=y_onehot,verbose=0)
        except Exception as ex:
            print('error in epoch {} batch {}'.format(epoch,batch))
            print(ex)
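
The training loop converts the integer labels to one-hot vectors because the model is compiled with the categorical_crossentropy loss. The following NumPy sketch shows the same kind of conversion; to_one_hot is a name made up for this illustration and is not a Keras function, and the labels are example values.

```python
import numpy as np


def to_one_hot(labels, num_classes):
    """Turn integer class ids of shape (N,) or (N, 1) into an (N, num_classes) matrix."""
    labels = np.asarray(labels, dtype=int).reshape(-1)
    one_hot = np.zeros((labels.size, num_classes), dtype=np.float32)
    one_hot[np.arange(labels.size), labels] = 1.0
    return one_hot


# Hypothetical labels for a batch of three images and 8 classes.
y_batch = np.array([[0], [3], [7]])
y_onehot = to_one_hot(y_batch, num_classes=8)

print(y_onehot)
```

Each row contains a single 1 at the position of the class id, so taking the argmax of a row recovers the original label.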

Let us perform the image classification with the retrained model:

probs = model.predict(images_test)

Here are the classification results:

[Image: test image 1]

Probability 100.00% of [zebra]
Probability 0.00% of [dog]
Probability 0.00% of [horse]
Probability 0.00% of [giraffe]
Probability 0.00% of [bear]

[Image: test image 2]

Probability 96.11% of [horse]
Probability 1.85% of [cat]
Probability 0.77% of [bird]
Probability 0.43% of [giraffe]
Probability 0.40% of [sheep]

[Image: test image 3]

Probability 99.75% of [dog]
Probability 0.22% of [cat]
Probability 0.03% of [horse]
Probability 0.00% of [bear]
Probability 0.00% of [zebra]

[Image: test image 4]

Probability 99.88% of [bird]
Probability 0.11% of [horse]
Probability 0.00% of [giraffe]
Probability 0.00% of [bear]
Probability 0.00% of [cat]

[Image: test image 5]

Probability 65.28% of [bear]
Probability 27.09% of [sheep]
Probability 4.34% of [bird]
Probability 1.71% of [giraffe]
Probability 0.63% of [dog]

[Image: test image 6]

Probability 100.00% of [bear]
Probability 0.00% of [sheep]
Probability 0.00% of [dog]
Probability 0.00% of [cat]
Probability 0.00% of [giraffe]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/b89ae721-3a80-458a-ae17-aea4fbf493ef.png]

Probability 100.00% of [giraffe]
Probability 0.00% of [bird]
Probability 0.00% of [bear]
Probability 0.00% of [sheep]
Probability 0.00% of [zebra]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/ddb79a37-37df-4f86-bc7a-2ad4b25f8d45.png]

Probability 81.05% of [cat] 
Probability 15.68% of [dog] 
Probability 1.64% of [bird] 
Probability 0.90% of [horse] 
Probability 0.43% of [bear]

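All classes are identified correctly except for the noisy image at the end. Proper hyperparameter tuning could improve this further.

So far you have seen examples of classifying with a pretrained model and of fine-tuning a pretrained model. Next, we show a classification example using the InceptionV3 model.

InceptionV3 in TensorFlow

You can follow along with the code in the Jupyter notebook ch-12c_InceptionV3_TensorFlow.

TensorFlow's InceptionV3 is trained on 1,001 labels instead of 1,000. In addition, the images used for training are preprocessed differently. We showed the preprocessing code in earlier sections. Let us dive directly into restoring the InceptionV3 model using TensorFlow.

Let us download the checkpoint file for InceptionV3: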

# load the InceptionV3 model
model_name='inception_v3'
model_url='http://download.tensorflow.org/models/'
model_files=['inception_v3_2016_08_28.tar.gz']
model_home=os.path.join(models_root,model_name)
dsu.download_dataset(source_url=model_url,
                     source_files=model_files,
                     dest_dir=model_home,
                     force=False,
                     extract=True)

Define the common imports and variables for Inception:

### define common imports and variables
from tensorflow.contrib.slim.nets import inception
image_height=inception.inception_v3.default_image_size
image_width=inception.inception_v3.default_image_size

Image classification with InceptionV3 in TensorFlow

Image classification works the same way as explained in the previous section for the VGG16 model. The complete code for the InceptionV3 model is as follows:

x_p = tf.placeholder(shape=(None, image_height, image_width, 3),
                     dtype=tf.float32,
                     name='x_p')
with slim.arg_scope(inception.inception_v3_arg_scope()):
    logits, _ = inception.inception_v3(x_p,
                                       num_classes=inet.n_classes,
                                       is_training=False)
probabilities = tf.nn.softmax(logits)

init = slim.assign_from_checkpoint_fn(
    os.path.join(model_home, '{}.ckpt'.format(model_name)),
    slim.get_variables_to_restore())

with tf.Session() as tfs:
    init(tfs)
    probs = tfs.run([probabilities], feed_dict={x_p: images_test})
    probs = probs[0]

Let us see how our model does on the test images:

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/8f8448d0-148c-4f6f-afbc-7dea9073f49f.png]

Probability 95.15% of [zebra]
Probability 0.07% of [ostrich, Struthio camelus]
Probability 0.07% of [hartebeest]
Probability 0.03% of [sock]
Probability 0.03% of [warthog]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/6dc13766-92b1-4544-88f1-32f26ebb5e0f.png]

Probability 93.09% of [horse cart, horse-cart]
Probability 0.47% of [plow, plough]
Probability 0.07% of [oxcart]
Probability 0.07% of [seashore, coast, seacoast, sea-coast]
Probability 0.06% of [military uniform]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/1b39121e-4240-462e-8884-8d77594919d3.png]

Probability 18.94% of [Cardigan, Cardigan Welsh corgi]
Probability 8.19% of [Pembroke, Pembroke Welsh corgi]
Probability 7.86% of [studio couch, day bed]
Probability 5.36% of [English springer, English springer spaniel]
Probability 4.16% of [Border collie]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/a3bf50cb-2c6b-43fa-82d9-26c12a2bef2c.png]

Probability 27.18% of [water ouzel, dipper]
Probability 24.38% of [junco, snowbird]
Probability 6.91% of [chickadee]
Probability 0.99% of [magpie]
Probability 0.73% of [brambling, Fringilla montifringilla]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/e669f2b4-5d08-4f8a-bfce-09bf10bff2e5.png]

Probability 93.00% of [hog, pig, grunter, squealer, Sus scrofa]
Probability 2.23% of [wild boar, boar, Sus scrofa]
Probability 0.65% of [ram, tup]
Probability 0.43% of [ox]
Probability 0.23% of [marmot]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/a35351da-d301-404f-922a-cb48941c706a.png]

Probability 84.27% of [brown bear, bruin, Ursus arctos]
Probability 1.57% of [American black bear, black bear, Ursus americanus, Euarctos americanus]
Probability 1.34% of [sloth bear, Melursus ursinus, Ursus ursinus]
Probability 0.13% of [lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens]
Probability 0.12% of [ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/f1a56a59-9a04-4607-b7d2-7a0a8c9c9b2e.png]

Probability 20.20% of [honeycomb]
Probability 6.52% of [gazelle]
Probability 5.14% of [sorrel]
Probability 3.72% of [impala, Aepyceros melampus]
Probability 2.44% of [Saluki, gazelle hound]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/a5bae2ea-89db-4c7e-9e31-1eb86bfc8e47.png]

Probability 41.17% of [harp]
Probability 13.64% of [accordion, piano accordion, squeeze box]
Probability 2.97% of [window shade]
Probability 1.59% of [chain]
Probability 1.55% of [pay-phone, pay-station]

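Although it fails in almost the same places as the VGG model, the result is not too bad. Now let us retrain this model with the COCO animal images and labels.

Image classification with retrained InceptionV3 in TensorFlow

Retraining InceptionV3 differs from VGG16 because we use a softmax activation layer as the output and tf.losses.softmax_cross_entropy() as the loss function.

First, define the placeholders: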

is_training = tf.placeholder(tf.bool,name='is_training')
x_p = tf.placeholder(shape=(None,image_height, image_width,3),dtype=tf.float32,name='x_p')
y_p = tf.placeholder(shape=(None,coco.n_classes),dtype=tf.int32,name='y_p')

Next, load the model:

with slim.arg_scope(inception.inception_v3_arg_scope()):
    # is_training is hard-coded to True here, so the is_training
    # placeholder defined above is not consumed by the model
    logits, _ = inception.inception_v3(x_p,
                                       num_classes=coco.n_classes,
                                       is_training=True)
probabilities = tf.nn.softmax(logits)

Next, define the function that restores the variables, except those of the last layer:

# restore except last layer
checkpoint_exclude_scopes = ["InceptionV3/Logits", "InceptionV3/AuxLogits"]
exclusions = [scope.strip() for scope in checkpoint_exclude_scopes]
variables_to_restore = []
for var in slim.get_model_variables():
    excluded = False
    for exclusion in exclusions:
        if var.op.name.startswith(exclusion):
            excluded = True
            break
    if not excluded:
        variables_to_restore.append(var)

init_fn = slim.assign_from_checkpoint_fn(
    os.path.join(model_home, '{}.ckpt'.format(model_name)),
    variables_to_restore)
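The scope-prefix filtering used above can be tried without TensorFlow. The following is a minimal sketch, assuming a hypothetical list of variable names in place of slim.get_model_variables(); filter_by_scope is an illustrative helper and not part of slim.

```python
def filter_by_scope(var_names, exclude_scopes):
    """Return the names that do not start with any excluded scope prefix."""
    exclusions = [scope.strip() for scope in exclude_scopes]
    return [name for name in var_names
            if not any(name.startswith(ex) for ex in exclusions)]

# hypothetical variable names, for illustration only
names = [
    'InceptionV3/Conv2d_1a_3x3/weights',
    'InceptionV3/Mixed_5b/Branch_0/Conv2d_0a_1x1/weights',
    'InceptionV3/Logits/Conv2d_1c_1x1/weights',
    'InceptionV3/AuxLogits/Conv2d_1b_1x1/weights',
]
to_restore = filter_by_scope(names, ['InceptionV3/Logits', 'InceptionV3/AuxLogits'])
print(to_restore)
```

Only the two convolutional entries survive the filter, which mirrors the intent of restoring everything except the final classification layers so that they can be trained from scratch for the new label set.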

Define the loss, the optimizer, and the training operation:

tf.losses.softmax_cross_entropy(onehot_labels=y_p, logits=logits)
loss = tf.losses.get_total_loss()
learning_rate = 0.001
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(loss)
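To make the loss concrete, here is a minimal NumPy sketch of a softmax cross-entropy averaged over the batch. It is an illustration of the formula under that averaging assumption, not the TensorFlow implementation; the function names and the sample logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    # subtract the row-wise max for numerical stability
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

def softmax_cross_entropy(onehot_labels, logits):
    # mean over the batch of -sum(labels * log(softmax(logits)))
    probs = softmax(logits)
    per_example = -np.sum(onehot_labels * np.log(probs), axis=1)
    return float(np.mean(per_example))

logits = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])   # illustrative values
labels = np.array([[1, 0, 0], [0, 1, 0]])
loss_value = softmax_cross_entropy(labels, logits)
print(loss_value)
```

For the second example the logits are all equal, so each class gets probability 1/3 and its contribution to the loss is log(3).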

Train the model and, once training completes, run the prediction in the same session:

n_epochs = 10
coco.y_onehot = True
coco.batch_size = 32
coco.batch_shuffle = True
total_images = len(x_train_files)
n_batches = total_images // coco.batch_size

with tf.Session() as tfs:
    tfs.run(tf.global_variables_initializer())
    init_fn(tfs)
    for epoch in range(n_epochs):
        print('Starting epoch ', epoch)
        coco.reset_index()
        epoch_accuracy = 0
        epoch_loss = 0
        for batch in range(n_batches):
            x_batch, y_batch = coco.next_batch()
            images = np.array([coco.preprocess_for_inception(x)
                               for x in x_batch])
            feed_dict = {x_p: images, y_p: y_batch, is_training: True}
            batch_loss, _ = tfs.run([loss, train_op], feed_dict=feed_dict)
            epoch_loss += batch_loss
        epoch_loss /= n_batches
        print('Train loss in epoch {}:{}'.format(epoch, epoch_loss))

    # now run the predictions
    feed_dict = {x_p: images_test, is_training: False}
    probs = tfs.run([probabilities], feed_dict=feed_dict)
    probs = probs[0]

We see the loss decreasing with each epoch:

INFO:tensorflow:Restoring parameters from /home/armando/models/inception_v3/inception_v3.ckpt
Starting epoch  0
Train loss in epoch 0:2.7896385192871094
Starting epoch  1
Train loss in epoch 1:1.6651896286010741
Starting epoch  2
Train loss in epoch 2:1.2332031989097596
Starting epoch  3
Train loss in epoch 3:0.9912329530715942
Starting epoch  4
Train loss in epoch 4:0.8110128355026245
Starting epoch  5
Train loss in epoch 5:0.7177265572547913
Starting epoch  6
Train loss in epoch 6:0.6175705575942994
Starting epoch  7
Train loss in epoch 7:0.5542363750934601
Starting epoch  8
Train loss in epoch 8:0.523461252450943
Starting epoch  9
Train loss in epoch 9:0.4923107647895813

This time the results correctly identify the sheep but wrongly identify the cat picture as a dog:

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/701ce0d5-b399-4f4b-b1e9-84e399ae0af9.png]

Probability 98.84% of [zebra]
Probability 0.84% of [giraffe]
Probability 0.11% of [sheep]
Probability 0.07% of [cat]
Probability 0.06% of [dog]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/729c8e7b-f573-4d38-ae0b-1b9eb3ac96b2.png]

Probability 95.77% of [horse]
Probability 1.34% of [dog]
Probability 0.89% of [zebra]
Probability 0.68% of [bird]
Probability 0.61% of [sheep]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/db0b9246-ccc2-4c1b-904a-b3dfc799da24.png]

Probability 94.83% of [dog] 
Probability 4.53% of [cat] 
Probability 0.56% of [sheep] 
Probability 0.04% of [bear] 
Probability 0.02% of [zebra]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/0511bc01-7f4d-48f9-a88a-ebb5eca0aab4.png]

Probability 42.80% of [bird]
Probability 25.64% of [cat]
Probability 15.56% of [bear]
Probability 8.77% of [giraffe]
Probability 3.39% of [sheep]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/6e000739-cba7-4e27-8336-cef9f931629c.png]

Probability 72.58% of [sheep] 
Probability 8.40% of [bear] 
Probability 7.64% of [giraffe] 
Probability 4.02% of [horse] 
Probability 3.65% of [bird]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/8c50e317-4436-4256-a8d3-f4f48411643b.png]

Probability 98.03% of [bear] 
Probability 0.74% of [cat] 
Probability 0.54% of [sheep] 
Probability 0.28% of [bird] 
Probability 0.17% of [horse]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/8dfafd80-1a4f-4fe7-b806-a18738ff7059.png]

Probability 96.43% of [giraffe] 
Probability 1.78% of [bird] 
Probability 1.10% of [sheep] 
Probability 0.32% of [zebra] 
Probability 0.14% of [bear]

[Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/a344f73f-c858-4134-82ef-541924b1c64a.png]

Probability 34.43% of [horse] 
Probability 23.53% of [dog] 
Probability 16.03% of [zebra] 
Probability 9.76% of [cat] 
Probability 9.02% of [giraffe]

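Summary

Transfer learning is a great discovery that allows us to save time by applying models trained on bigger datasets to different datasets. Transfer learning also helps warm start the training process when the dataset is small. In this chapter, we learned how to use pretrained models, such as VGG16 and InceptionV3, to classify images from a dataset different from the one they were trained on. We also learned how to retrain the pretrained models, with examples in both TensorFlow and Keras, and how to preprocess the images for feeding into both models.

We also learned that there are several models trained on the ImageNet dataset. Try to find other models trained on different datasets, such as video datasets, speech datasets, or text/NLP datasets. Try retraining these models and using them for your own deep learning problems on your own datasets.

13. Deep Reinforcement Learning

Reinforcement learning is a form of learning in which a software agent observes the environment and takes actions so as to maximize its reward from the environment, as depicted in the following diagram: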

[Figure: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/bdd263e7-4be5-4a46-8b91-f5560793473b.png]

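This metaphor can be used to represent real-life situations such as the following:

A stock trading agent observes trade information, news, analysis, and other forms of information, and takes actions to buy or sell trades so as to maximize the reward in the form of short-term or long-term profit.
An insurance agent observes information about the customer and then takes action to determine the amount of the insurance premium, so as to maximize profit and minimize risk.
A humanoid robot observes the environment and then takes action, such as walking, running, or picking up objects, so as to maximize the reward in terms of the goal achieved.

Reinforcement learning has been successfully applied to many applications, such as advertising optimization, stock market trading, self-driving vehicles, robotics, and games, to name a few.

Reinforcement learning differs from supervised learning in that there are no labels in advance to tune the parameters of the model. The model learns from the rewards received from its runs. Although short-term rewards are available instantly, long-term rewards become available only after several steps. This phenomenon is also known as delayed feedback.

Reinforcement learning also differs from unsupervised learning: in unsupervised learning no labels are available at all, whereas in reinforcement learning feedback is available in the form of rewards.

In this chapter, we shall learn about reinforcement learning and its implementation in TensorFlow and Keras by covering the following topics:

OpenAI Gym 101
Applying a simple policy to the CartPole game
Reinforcement Learning 101
- Q function
- Exploration and exploitation
- V function
- RL techniques
Simple neural network policy for RL
Implementing Q-Learning
- Initializing and discretizing for Q-Learning
- Q-Learning with Q-Table
- Deep Q networks: Q-Learning with Q-Network

We shall demonstrate our examples in OpenAI Gym, so let us first learn about OpenAI Gym.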

OpenAI Gym 101

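OpenAI Gym is a Python-based toolkit for the research and development of reinforcement learning algorithms. At the time of writing, OpenAI Gym provided more than 700 open source contributed environments. With OpenAI Gym, you can also create your own environments. The biggest advantage is that OpenAI Gym provides a unified interface for working with these environments and takes care of running the simulation while you focus on the reinforcement learning algorithms.

The research paper describing OpenAI Gym is available at this link.

You can install OpenAI Gym with the following command:

pip3 install gym

If the above command does not work, you can find further help with installation at this link.

Let us print the number of available environments in OpenAI Gym:

You can follow along with the code in the Jupyter notebook ch-13a_Reinforcement_Learning_NN in the code bundle of this book.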

import gym

all_env = list(gym.envs.registry.all())
print('Total Environments in Gym version {} : {}'.format(gym.__version__, len(all_env)))

Total Environments in Gym version 0.9.4 : 777

Let us print the list of all environments:

for e in list(all_env):
    print(e)

The partial list from the output is as follows:

EnvSpec(Carnival-ramNoFrameskip-v0)
EnvSpec(EnduroDeterministic-v0)
EnvSpec(FrostbiteNoFrameskip-v4)
EnvSpec(Taxi-v2)
EnvSpec(Pooyan-ram-v0)
EnvSpec(Solaris-ram-v4)
EnvSpec(Breakout-ramDeterministic-v0)
EnvSpec(Kangaroo-ram-v4)
EnvSpec(StarGunner-ram-v4)
EnvSpec(Enduro-ramNoFrameskip-v4)
EnvSpec(DemonAttack-ramDeterministic-v0)
EnvSpec(TimePilot-ramNoFrameskip-v0)
EnvSpec(Amidar-v4)

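Each environment, represented by the env object, has a standardized interface, for example:

An env object can be created with the gym.make(<game-id-string>) function by passing the ID string.
Each env object contains the following main functions:
- The step() function takes an action object as an argument and returns four objects:
  - observation: an object implemented by the environment, representing the observation of the environment.
  - reward: a signed floating-point value indicating the gain (or loss) from the previous action.
  - done: a Boolean value indicating whether the episode is finished.
  - info: a Python dictionary object representing diagnostic information.
- The render() function creates a visual representation of the environment.
- The reset() function resets the environment to its original state.
Each env object comes with well-defined actions and observations, represented by action_space and observation_space.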
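The reset() and step() contract described above can be illustrated without installing Gym. The following is a minimal sketch of a hypothetical toy environment, CoinFlipEnv, written only to mirror the four-value return of step(); it is not a Gym class and does not implement render(), action_space, or observation_space.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment exposing reset() and step() in the Gym style."""

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0
        self.coin = 0

    def reset(self):
        # start a new episode and return the first observation
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        # reward the agent when its action matches the coin it observed
        reward = 1.0 if action == self.coin else -1.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)
        done = self.t >= self.max_steps
        info = {'t': self.t}
        return self.coin, reward, done, info

env = CoinFlipEnv()
obs = env.reset()
total, done = 0.0, False
while not done:
    action = obs                      # repeat the coin that was just observed
    obs, reward, done, info = env.step(action)
    total += reward
print(total, info)
```

The loop has the same shape as the one used with real Gym environments: reset once per episode, then call step() until done is True.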

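CartPole is one of the most popular games in Gym for learning reinforcement learning. In this game, a pole attached to a cart has to be balanced so that it does not fall. The game is over if the pole tilts by more than 15 degrees or if the cart moves more than 2.4 units from the center. The home page of OpenAI.com emphasizes the game in these words:

The small size and simplicity of this environment make it possible to run very quick experiments, which is essential when learning the basics.

The game has only four observations and two actions. The actions are to move the cart by applying a force of +1 or -1. The observations are the position of the cart, the velocity of the cart, the angle of the pole, and the rotation rate of the pole. However, knowledge of the semantics of the observations is not necessary to learn to maximize the rewards of the game.

Now let us load a popular game environment, CartPole-v0, and play it with random controls:

Create the env object with the standard make function:

env = gym.make('CartPole-v0')

The number of episodes is the number of game plays. We set it to one for now, indicating that we want to play the game just once. Since every episode is random, in actual production runs you would run several episodes and calculate the average of the rewards. Additionally, we initialize an array to store the visualization of the environment at every timestep: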

n_episodes = 1
env_vis = []

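Run two nested loops: an outer loop for the number of episodes and an inner loop for the number of timesteps you would like to simulate. You can either keep running the inner loop until the episode is done or set the number of steps to a higher value.
- At the beginning of every episode, reset the environment using env.reset().
- At the beginning of every timestep, capture the visualization using env.render().

for i_episode in range(n_episodes):
    observation = env.reset()
    for t in range(100):
        env_vis.append(env.render(mode='rgb_array'))
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished at t{}".format(t + 1))
            break

Render the environment using the helper function:

env_render(env_vis)

The code for the helper function is as follows:

import matplotlib.pyplot as plt
import matplotlib.animation as anm
# display and display_animation are assumed here to come from
# IPython.display and the JSAnimation package, respectively
from IPython.display import display
from JSAnimation.IPython_display import display_animation

def env_render(env_vis):
    plt.figure()
    plot = plt.imshow(env_vis[0])
    plt.axis('off')

    def animate(i):
        plot.set_data(env_vis[i])

    anim = anm.FuncAnimation(plt.gcf(),
                             animate,
                             frames=len(env_vis),
                             interval=20,
                             repeat=True,
                             repeat_delay=20)
    display(display_animation(anim, default_mode='loop'))

We get the following output when we run this example: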

[-0.00666995 -0.03699492 -0.00972623  0.00287713]
[-0.00740985  0.15826516 -0.00966868 -0.29285861]
[-0.00424454 -0.03671761 -0.01552586 -0.00324067]
[-0.0049789  -0.2316135  -0.01559067  0.28450351]
[-0.00961117 -0.42650966 -0.0099006   0.57222875]
[-0.01814136 -0.23125029  0.00154398  0.27644332]
[-0.02276636 -0.0361504   0.00707284 -0.01575223]
[-0.02348937  0.1588694   0.0067578  -0.30619523]
[-0.02031198 -0.03634819  0.00063389 -0.01138875]
[-0.02103895  0.15876466  0.00040612 -0.3038716 ]
[-0.01786366  0.35388083 -0.00567131 -0.59642642]
[-0.01078604  0.54908168 -0.01759984 -0.89089036]
[  1.95594914e-04   7.44437934e-01  -3.54176495e-02  -1.18905344e+00]
[ 0.01508435  0.54979251 -0.05919872 -0.90767902]
[ 0.0260802   0.35551978 -0.0773523  -0.63417465]
[ 0.0331906   0.55163065 -0.09003579 -0.95018025]
[ 0.04422321  0.74784161 -0.1090394  -1.26973934]
[ 0.05918004  0.55426764 -0.13443418 -1.01309691]
[ 0.0702654   0.36117014 -0.15469612 -0.76546874]
[ 0.0774888   0.16847818 -0.1700055  -0.52518186]
[ 0.08085836  0.3655333  -0.18050913 -0.86624457]
[ 0.08816903  0.56259197 -0.19783403 -1.20981195]
Episode finished at t22

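It took 22 timesteps for the pole to become unbalanced. We get a different number of timesteps on every run because we pick the action randomly using env.action_space.sample().

Since the game ends in a loss so quickly, randomly picking an action and applying it is probably not the best strategy. There are many algorithms for finding solutions that keep the pole upright for a larger number of timesteps, such as hill climbing, random search, and policy gradient.

Some of the algorithms for solving the CartPole game are available at the following links: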

https://openai.com/requests-for-research/#cartpole

http://kvfrans.com/simple-algoritms-for-solving-cartpole/

https://github.com/kvfrans/openai-cartpole

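Applying a simple policy to the CartPole game

So far, we have picked an action at random and applied it. Now let us apply some logic to picking the action instead of random chance. The third observation refers to the angle. If the angle is greater than zero, the pole is tilting to the right, so we move the cart to the right (1). Otherwise, we move the cart to the left (0). Let us look at an example:

We define two policy functions as follows:

def policy_logic(env, obs):
    return 1 if obs[2] > 0 else 0

def policy_random(env, obs):
    return env.action_space.sample()

Next, we define an experiment function that runs for a specific number of episodes; each episode runs until the game is lost, that is, until done is True. We use rewards_max to indicate when to break out of the loop, since we do not want to run the experiment forever:

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = policy(env, obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            if episode_reward > rewards_max:
                break
        rewards[i] = episode_reward
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

We run the experiment for 100 episodes; each episode ends when the game is lost or when its reward exceeds rewards_max, which is set to 10,000: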

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)

We can see that the logically selected actions do better than the randomly selected ones, but not by much:

Policy:policy_random, Min reward:9.0, Max reward:63.0, Average reward:20.26
Policy:policy_logic, Min reward:24.0, Max reward:66.0, Average reward:42.81

Now let us modify the process of selecting the action further, to be based on parameters. The parameters are multiplied by the observations, and the action is chosen according to whether the result of the multiplication is negative (action 0) or not (action 1). Let us turn the random policy into a random search, in which we initialize the parameters randomly. The code is as follows:

def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max):
    obs = env.reset()
    done = False
    episode_reward = 0
    if policy.__name__ in ['policy_random']:
        theta = np.random.rand(4) * 2 - 1
    else:
        theta = None
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max)
        # print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)

We can see that random search does improve the results:

Policy:policy_random, Min reward:8.0, Max reward:200.0, Average reward:40.04
Policy:policy_logic, Min reward:25.0, Max reward:62.0, Average reward:43.03

With random search, we have improved our results to reach a maximum reward of 200. On average, the reward for random search is lower, because random search tries various bad parameters that bring the overall result down. However, we can select the best parameters from all the runs and then use those in production. Let us modify the code to train the parameters first:

def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max, theta):
    obs = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def train(policy, n_episodes, rewards_max):
    env = gym.make('CartPole-v0')
    theta_best = np.empty(shape=[4])
    reward_best = 0
    for i in range(n_episodes):
        if policy.__name__ in ['policy_random']:
            theta = np.random.rand(4) * 2 - 1
        else:
            theta = None
        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            # theta is None for policies that ignore it
            theta_best = None if theta is None else theta.copy()
    return reward_best, theta_best

def experiment(policy, n_episodes, rewards_max, theta=None):
    rewards = np.empty(shape=[n_episodes])
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max, theta)
        # print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

We train for 100 episodes and then run the experiment for the random search policy using the best parameters:

n_episodes = 100
rewards_max = 10000

reward, theta = train(policy_random, n_episodes, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta, reward))
experiment(policy_random, n_episodes, rewards_max, theta)
experiment(policy_logic, n_episodes, rewards_max)

We find that the trained parameters give us the best result of 200:

trained theta: [-0.14779543  0.93269603  0.70896423  0.84632461], rewards: 200.0
Policy:policy_random, Min reward:200.0, Max reward:200.0, Average reward:200.0
Policy:policy_logic, Min reward:24.0, Max reward:63.0, Average reward:41.94

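We can optimize the training code to keep training until we reach the maximum reward. The code for this optimization is provided in the notebook ch-13a_Reinforcement_Learning_NN.

Now that we have learned the basics of OpenAI Gym, let us learn about reinforcement learning.

Reinforcement Learning 101

Reinforcement learning is described by an agent that takes the observation and the reward from the previous timestep as input and produces an action as output, with the goal of maximizing the cumulative reward.

The agent has a policy, a value function, and a model:

The algorithm the agent uses to pick the next action is known as the policy. In the previous section, we wrote a policy that takes a set of parameters θ and returns the next action based on the multiplication between the observation and the parameters. The policy is represented by the following equation:

[Equation image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/73c06697-d661-4b7a-82ad-d4bfb078d386.png]

S is the set of states and A is the set of actions.

A policy is either deterministic or stochastic.
- A deterministic policy returns the same action for the same state on every run:

  [Equation image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/40b98256-188a-44e6-9aaf-d3b105a28424.png]
- A stochastic policy returns a probability distribution over actions for a given state, so the action chosen for the same state can differ between runs:

  [Equation image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/11dc9c49-c717-4115-b9a5-2015befdef3a.png]
The value function predicts the amount of long-term reward based on the selected action in the current state. The value function is therefore specific to the policy used by the agent. The reward indicates the immediate gain from an action, while the value function indicates the cumulative or long-term future gain from an action. The reward is returned by the environment, and the value function is estimated by the agent at every timestep.
The model is the representation of the environment that the agent keeps internally. The model may be an imperfect representation of the environment. The agent uses the model to estimate the reward and the next state resulting from the selected action.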
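The difference between the two kinds of policy can be sketched in a few lines. This is an illustration only: the logistic form of the stochastic policy and the parameter values are assumptions chosen for the example, not something the chapter prescribes.

```python
import numpy as np

def deterministic_policy(theta, obs):
    # the same observation always yields the same action
    return 0 if np.dot(theta, obs) < 0 else 1

def stochastic_policy(theta, obs, rng):
    # probability of action 1 from a logistic function of theta . obs
    p_right = 1.0 / (1.0 + np.exp(-np.dot(theta, obs)))
    action = int(rng.random() < p_right)
    return action, p_right

theta = np.array([0.1, 0.2, 0.7, 0.5])     # illustrative parameters
obs = np.array([0.0, 0.0, 0.05, 0.0])      # pole tilting slightly to the right
rng = np.random.default_rng(0)

print(deterministic_policy(theta, obs))
print([stochastic_policy(theta, obs, rng)[0] for _ in range(10)])
```

The deterministic policy prints the same action however often it is called, while the stochastic policy samples from a distribution and can return either action for the same observation.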

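The goal of an agent can also be to find the optimal policy for a Markov decision process (MDP). An MDP is a mathematical representation of the observations, actions, rewards, and transitions from one state to another. We omit the discussion of MDPs for the sake of brevity and suggest that the curious reader search the internet for resources that go deeper into MDPs.

Q function (learning to optimize when the model is not available)

If the model is not available, the agent learns the model and the optimal policy by trial and error. When the model is not available, the agent uses a Q function, which is defined as follows: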

[Equation image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/7bdea2f4-cb5e-43ab-b490-5c114303c6cf.png]

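The Q function essentially maps a state-action pair to a real number that denotes the expected total reward if the agent in state s selects action a.

Exploration and exploitation in RL algorithms

In the absence of a model, the agent either explores or exploits at every step. Exploration means that the agent selects an unknown action to find out the reward and the model. Exploitation means that the agent selects the best-known action to get the maximum reward. If the agent always decides to exploit, it might get stuck in a local optimum. Hence, the agent sometimes bypasses the learned policy to explore unknown actions. Likewise, if the agent always decides to explore, it may fail to find an optimal policy. Thus, it is important to strike a balance between exploration and exploitation. In our code, we implement this by selecting a random action with probability p and the optimal action with probability 1-p.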
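A minimal sketch of this balance follows, assuming a hypothetical array of value estimates for two actions; choose_action is an illustrative name. With probability p the action is random, otherwise the best-known action is taken.

```python
import numpy as np

def choose_action(q_values, p, rng):
    """Explore with probability p, otherwise exploit the best-known action."""
    if rng.random() < p:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: highest estimate

rng = np.random.default_rng(42)
q_values = np.array([0.2, 0.8])     # illustrative value estimates
actions = [choose_action(q_values, 0.2, rng) for _ in range(1000)]
share_best = actions.count(1) / len(actions)
print(share_best)
```

With p = 0.2 and two actions, the best action is expected about 90% of the time: 80% from exploitation plus half of the 20% spent exploring.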

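V function (learning to optimize when the model is available)

If the model is known beforehand, the agent can perform a policy search to find the optimal policy that maximizes the value function. When the model is available, the agent uses a value function, which can be naively defined as the sum of the rewards of the future states:

[Equation image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/be0e1650-917d-4ff5-b5ca-31ce30f05ed8.png]

Thus, the value at timestep t for selecting actions using policy p would be:

[Equation image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/1431616e-143f-4d1d-87b8-98d6000e8f31.png]

V is the value and R is the reward; the value function is estimated only up to n timesteps into the future.

When the agent estimates the reward with this approach, it treats all actions as equally responsible for the reward. In the cart-pole example, if the pole falls at step 50, then it treats all the steps up to the 50th as equally responsible for the fall. Hence, instead of adding up the future rewards, a weighted sum of the future rewards is estimated. Usually, the weights are a discount rate raised to the power of the timestep. If the discount rate is one, the value function becomes the naive function discussed above; if the discount rate is somewhat below one, such as 0.9 or 0.92, future rewards have less effect than current rewards; and if the discount rate is zero, only the immediate reward counts.

Thus, the value at timestep t for action a would now be:

[Equation image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/607cbf22-522c-4ffd-adc7-8f4ba95fe4ea.png]

V is the value, R is the reward, and r is the discount rate.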
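The effect of the discount rate can be checked numerically. The sketch below assumes a reward of 1 at each of 50 steps, as in a cart-pole episode that lasts 50 timesteps; discounted_value is an illustrative helper name.

```python
def discounted_value(rewards, r):
    """Sum of r**i * rewards[i] over future steps i = 0, 1, 2, ..."""
    return sum((r ** i) * reward for i, reward in enumerate(rewards))

rewards = [1.0] * 50                          # one reward per step of the episode
naive = discounted_value(rewards, 1.0)        # rate 1: plain sum of rewards
discounted = discounted_value(rewards, 0.9)   # rate 0.9: later rewards count less
immediate = discounted_value(rewards, 0.0)    # rate 0: only the first reward
print(naive, discounted, immediate)
```

With a rate of 0.9 the total stays below 10 however long the episode runs, because the weights form a geometric series; this is what keeps distant rewards from dominating the estimate.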

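Relationship between the V function and the Q function:

V*(s) is the optimal value function in state s, which gives the maximum reward, and Q*(s,a) is the optimal Q function in state s, which gives the maximum expected reward by selecting action a. Thus, V*(s) is the maximum of the optimal Q function Q*(s,a) over all possible actions: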

[Equation image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/bce4bf13-c9cc-416f-a901-56121b7f631c.png]

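Reinforcement learning techniques

Reinforcement learning techniques can be categorized on the basis of the availability of the model as follows:

Model is available: if the model is available, the agent can plan offline by iterating over policies or value functions to find the optimal policy that gives the maximum reward.
- Value-iteration learning: in the value-iteration approach, the agent starts by initializing V(s) to random values and then repeatedly updates V(s) until the maximum reward is found.
- Policy-iteration learning: in the policy-iteration approach, the agent starts by initializing a random policy p and then repeatedly updates the policy until the maximum reward is found.
Model is not available: if the model is not available, the agent can learn only by observing the results of its actions. Thus, from the history of observations, actions, and rewards, the agent either tries to estimate the model or tries to derive the optimal policy directly:
- Model-based learning: in model-based learning, the agent first estimates the model from the history and then uses a policy-based or value-based method to find the optimal policy.
- Model-free learning: in model-free learning, the agent does not estimate the model but instead estimates the optimal policy directly from the history. Q-Learning is an example of model-free learning.

As an example, the algorithm for value-iteration learning is as follows:

initialize V(s) to random values for all states
Repeat
    for s in states
        for a in actions
            compute Q[s,a]
        V(s) = max(Q[s])   # maximum of Q for all actions for that state
Until optimal value of V(s) is found for all states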
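The pseudocode above can be made runnable on a very small problem. The sketch below assumes a hypothetical three-state chain in which action 1 moves right, action 0 moves left, and entering the terminal state 2 pays a reward of 1; the model, the discount of 0.9, and the zero initialization (instead of random values, for reproducibility) are all illustrative choices.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
terminal = 2

def step_model(s, a):
    """Known model: action 0 moves left, action 1 moves right; reaching state 2 pays 1."""
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s_next == terminal else 0.0
    return s_next, reward

V = np.zeros(n_states)
Q = np.zeros((n_states, n_actions))
for sweep in range(100):
    V_old = V.copy()
    for s in range(n_states):
        if s == terminal:
            continue                       # no further reward after the terminal state
        for a in range(n_actions):
            s_next, reward = step_model(s, a)
            Q[s, a] = reward + gamma * V[s_next]
        V[s] = np.max(Q[s])                # maximum of Q over all actions for that state
    if np.max(np.abs(V - V_old)) < 1e-8:   # stop once the values no longer change
        break

print(V, np.argmax(Q, axis=1))
```

The values settle after a few sweeps, and the greedy action in both non-terminal states is to move right, toward the rewarding state.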

The algorithm for policy-iteration learning is as follows:

initialize a policy P_new to random sequence of actions for all states
Repeat
    P = P_new
    for s in states
        compute V(s) with P[s]
        P_new[s] = policy of optimal V(s)
Until P == P_new
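A runnable counterpart of the policy-iteration pseudocode is sketched below on a hypothetical three-state chain (action 1 moves right, action 0 moves left, entering state 2 pays 1). The model, the discount of 0.9, and the "always left" starting policy are assumptions made for the example.

```python
import numpy as np

n_states, gamma, terminal = 3, 0.9, 2

def step_model(s, a):
    """Known model: action 0 moves left, action 1 moves right; reaching state 2 pays 1."""
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return s_next, (1.0 if s_next == terminal else 0.0)

def evaluate(policy, sweeps=100):
    """Compute V(s) for a fixed policy by repeated backups."""
    V = np.zeros(n_states)
    for _ in range(sweeps):
        for s in range(n_states):
            if s != terminal:
                s_next, reward = step_model(s, policy[s])
                V[s] = reward + gamma * V[s_next]
    return V

P_new = np.zeros(n_states, dtype=int)      # start with "always move left"
while True:
    P = P_new.copy()
    V = evaluate(P)                        # compute V(s) with P[s]
    for s in range(n_states):
        if s != terminal:
            returns = []
            for a in (0, 1):
                s_next, reward = step_model(s, a)
                returns.append(reward + gamma * V[s_next])
            P_new[s] = int(np.argmax(returns))   # greedy action under current V
    if np.array_equal(P, P_new):           # stop when the policy is stable
        break

print(P_new, V)
```

The policy changes one state at a time, first next to the goal and then further away, until two consecutive policies agree.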

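Naive neural network policy for reinforcement learning

We proceed with the policy as follows:

Let us implement a naive neural network-based policy. Define a new policy that uses the neural network's predictions to return the action:

def policy_naive_nn(nn, obs):
    return np.argmax(nn.predict(np.array([obs])))

Define nn as a simple MLP network with one hidden layer that takes the four-dimensional observation as input and produces the probabilities of the two actions: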

from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(8,input_dim=4, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam')
model.summary()

This is what the model looks like:

Layer (type)                 Output Shape              Param #   
=================================================================
dense_16 (Dense)             (None, 8)                 40        
_________________________________________________________________
dense_17 (Dense)             (None, 2)                 18        
=================================================================
Total params: 58
Trainable params: 58
Non-trainable params: 0

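This model needs to be trained. Run the simulation for 100 episodes and collect the training data only for those episodes where the score is more than 100. If the score is less than 100, those states and actions are not worth recording, since they are not examples of good play:

# create training data
env = gym.make('CartPole-v0')
n_obs = 4
n_actions = 2
theta = np.random.rand(4) * 2 - 1
n_episodes = 100
r_max = 0
t_max = 0

x_train, y_train = experiment(env,
                              policy_random,
                              n_episodes,
                              theta,
                              r_max,
                              t_max,
                              return_hist_reward=100)
y_train = np.eye(n_actions)[y_train]
print(x_train.shape, y_train.shape)

We are able to collect 5732 samples for training:

(5732, 4) (5732, 2)

Next, train the model:

model.fit(x_train, y_train, epochs=50, batch_size=10)

The trained model can be used to play the game. However, the model will not learn from further plays of the game until we incorporate a loop that updates the training data: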

n_episodes = 200
r_max = 0
t_max = 0

_ = experiment(env,
               policy_naive_nn,
               n_episodes,
               theta=model,
               r_max=r_max,
               t_max=t_max,
               return_hist_reward=0)

_ = experiment(env,
               policy_random,
               n_episodes,
               theta,
               r_max,
               t_max,
               return_hist_reward=0)

We can see that this naive policy performs in almost the same way as the random policy, albeit slightly better:

Policy:policy_naive_nn, Min reward:37.0, Max reward:200.0, Average reward:71.05
Policy:policy_random, Min reward:36.0, Max reward:200.0, Average reward:68.755

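We can improve the results further with network tuning and hyperparameter tuning, or by learning from more gameplay. However, there are better algorithms, such as Q-Learning.

In the rest of this chapter, we shall focus on the Q-Learning algorithm, since most real-life problems involve model-free learning.

Implementing Q-Learning

Q-Learning is a model-free method of finding the optimal policy that maximizes the reward of an agent. During the initial gameplay, the agent learns the Q value for each (state, action) pair, also known as the exploration strategy, as explained in previous sections. Once the Q values are learned, the optimal policy is to select the action with the largest Q value in every state, also known as the exploitation strategy. The learning algorithm may end up in a locally optimal solution, so we keep using the exploration strategy by setting an explore_rate parameter.

The Q-Learning algorithm is as follows:

initialize Q(shape=[#s,#a]) to random values or zeroes
Repeat (for each episode)
    observe current state s
    Repeat
        select an action a (apply explore or exploit strategy)
        observe state s_next as a result of action a
        update the Q-Table using bellman's equation
        set current state s = s_next
    until the episode ends or a max reward / max steps condition is reached
Until a number of episodes or a condition is reached (such as max consecutive wins)

Q(s, a) in the preceding algorithm represents the Q function that we described in the previous sections. The values of this function, rather than the rewards, are used for selecting the action; thus, this function represents the reward or the discounted reward. The values of the Q function are updated using the values of the Q function in future states. The well-known Bellman equation captures this update:

[Equation image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/64daa9fe-499c-4067-81fc-9e8d555740e2.png]

This basically means that at timestep t, in state s, for action a, the maximum future reward (Q) is equal to the reward from the current state plus the maximum future reward from the next state.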
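In symbols, the statement above is commonly written as follows. The notation is an assumption chosen to match the symbols used earlier in this chapter: R_t is the reward at timestep t and r is the discount rate (many texts write it as \gamma), which weights the contribution of the next state.

```latex
Q(s_t, a_t) = R_t + r \, \max_{a'} Q(s_{t+1}, a')
```

With r = 1 the next state's best value is added in full, which matches the verbal description; with r < 1 it is discounted.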

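Q(s, a) can be implemented as a Q-Table or as a neural network known as a Q-Network. In both cases, the task of the Q-Table or the Q-Network is to provide the best possible action based on the Q values for the given input. The Q-Table-based approach generally becomes intractable as the Q-Table grows large, which makes neural networks the best candidate for approximating the Q function through a Q-Network. Let us look at both approaches in action.

You can follow along with the code in the Jupyter notebook ch-13b_Reinforcement_Learning_DQN in the code bundle of this book.

Initializing and discretizing for Q-Learning

The observations returned by the cart-pole environment describe the state of the environment. The state of the cart-pole is represented by continuous values, which we need to discretize.

If we discretize these values into a small state space, the agent trains faster, with the caveat that it may fail to converge to the optimal policy.

We use the following helper functions to discretize the state space of the cart-pole environment:

# discretize the value to a state space
def discretize(val, bounds, n_states):
    discrete_val = 0
    if val <= bounds[0]:
        discrete_val = 0
    elif val >= bounds[1]:
        discrete_val = n_states - 1
    else:
        discrete_val = int(round((n_states - 1) *
                                 ((val - bounds[0]) /
                                  (bounds[1] - bounds[0]))))
    return discrete_val

def discretize_state(vals, s_bounds, n_s):
    discrete_vals = []
    for i in range(len(n_s)):
        discrete_vals.append(discretize(vals[i], s_bounds[i], n_s[i]))
    # the built-in int replaces the np.int alias, which recent NumPy releases removed
    return np.array(discrete_vals, dtype=int)

We discretize the space into 10 units for each observation dimension. You may want to try out different discrete spaces. After the discretization, we find the upper and lower bounds of the observations, and change the bounds of the velocity and the angular velocity to lie between -1 and +1 instead of -Inf and +Inf. The code is as follows: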

env = gym.make('CartPole-v0')
n_a = env.action_space.n
# number of discrete states for each observation dimension
n_s = np.array([10,10,10,10])   # position, velocity, angle, angular velocity
s_bounds = np.array(list(zip(env.observation_space.low, env.observation_space.high)))
# the velocity and angular velocity bounds are 
# too high so we bound between -1, +1
s_bounds[1] = (-1.0,1.0) 
s_bounds[3] = (-1.0,1.0)
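To see how a continuous value lands in a bin, here is a small standalone sketch of the same arithmetic, using the clipped velocity bounds of (-1.0, 1.0) and 10 bins. The sample values are illustrative.

```python
def discretize(val, bounds, n_states):
    """Map a continuous value onto one of n_states integer bins."""
    low, high = bounds
    if val <= low:
        return 0
    if val >= high:
        return n_states - 1
    return int(round((n_states - 1) * (val - low) / (high - low)))

bounds = (-1.0, 1.0)    # clipped velocity range used in this section
samples = (-5.0, -1.0, 0.0, 0.5, 2.0)
bins = [discretize(v, bounds, 10) for v in samples]
print(bins)
```

Values outside the bounds collapse into the first or last bin, which is why the infinite velocity bounds reported by the environment have to be replaced with finite ones before discretizing.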

Q-Learning with Q-Table

You can follow along with the code for this section in ch-13b.ipynb. Since our discretized space has dimensions [10,10,10,10], our Q-Table has dimensions [10,10,10,10,2]:

# create a Q-Table of shape (10,10,10,10, 2) representing S X A -> R
q_table = np.zeros(shape=np.append(n_s, n_a))

We define a Q-Table policy that exploits or explores depending on explore_rate:

def policy_q_table(state, env):
    # Exploration strategy - Select a random action
    if np.random.random() < explore_rate:
        action = env.action_space.sample()
    # Exploitation strategy - Select the action with the highest q
    else:
        action = np.argmax(q_table[tuple(state)])
    return action

Define the episode() function that runs a single episode, as follows:

First, initialize the variables and the first state:

obs = env.reset()
state_prev = discretize_state(obs, s_bounds, n_s)

episode_reward = 0
done = False
t = 0

Select the action and observe the next state:

action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_new = discretize_state(obs,s_bounds,n_s)

Update the Q-Table:

best_q = np.amax(q_table[tuple(state_new)])
bellman_q = reward + discount_rate * best_q
indices = tuple(np.append(state_prev,action))
q_table[indices] += learning_rate*( bellman_q - q_table[indices])
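A worked example of this update with concrete numbers may help. The sketch below isolates the arithmetic in a standalone function; q_update and the sample values are illustrative, while the learning rate of 0.8 and discount rate of 0.9 are the values this section uses when running the experiment.

```python
def q_update(q_sa, reward, best_q_next, learning_rate, discount_rate):
    """Move Q(s, a) toward the Bellman target by a fraction learning_rate."""
    bellman_q = reward + discount_rate * best_q_next
    return q_sa + learning_rate * (bellman_q - q_sa)

# current estimate 0.0, reward 1.0, best Q value in the next state 2.0
q_new = q_update(q_sa=0.0, reward=1.0, best_q_next=2.0,
                 learning_rate=0.8, discount_rate=0.9)
print(q_new)
```

The Bellman target here is 1.0 + 0.9 * 2.0 = 2.8, and the estimate moves 80% of the way from 0.0 toward it. A learning rate of 1 would overwrite the old estimate with the target, and a rate of 0 would leave it unchanged.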

Set the next state as the previous state and add the reward to the episode's reward:

state_prev = state_new
episode_reward += reward

The experiment() function calls the episode function and accumulates the rewards for reporting. You may want to modify the function to check for consecutive wins and other logic specific to your play or game:

# collect observations and rewards for each episode
def experiment(env, policy, n_episodes, r_max=0, t_max=0):
    rewards = np.empty(shape=[n_episodes])
    for i in range(n_episodes):
        val = episode(env, policy, r_max, t_max)
        rewards[i] = val
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

Now all we have to do is define the parameters, such as learning_rate, discount_rate, and explore_rate, and run the experiment() function as follows:

learning_rate = 0.8
discount_rate = 0.9
explore_rate = 0.2
n_episodes = 1000
experiment(env, policy_q_table, n_episodes)

For 1000 episodes, the Q-Table-based policy reaches a maximum reward of 180 with our simple implementation:

Policy:policy_q_table, Min reward:8.0, Max reward:180.0, Average reward:17.592

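Our implementation of the algorithm is simple to explain. However, you can modify the code to set the exploration rate high initially and then decay it as the timesteps pass. Similarly, you can also implement decay logic for the learning and discount rates. Let us see whether we can get a higher reward with fewer episodes, since our Q function would learn faster.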
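One possible shape for such decay logic is sketched below. The exponential schedule, the per-episode granularity, and the specific numbers are assumptions for illustration; decayed_rate is a hypothetical helper and not part of the chapter's notebook.

```python
def decayed_rate(initial, minimum, decay, episode):
    """Exponentially decay a rate per episode, never going below minimum."""
    return max(minimum, initial * (decay ** episode))

# hypothetical schedule: start at 1.0, multiply by 0.99 each episode, floor at 0.01
schedule = [decayed_rate(1.0, 0.01, 0.99, ep) for ep in range(1000)]
print(schedule[0], schedule[100], schedule[-1])
```

To use it, the explore_rate would be recomputed at the start of each episode, so that early episodes explore almost always and later ones mostly exploit. The floor keeps a small amount of exploration alive throughout.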

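Q-Learning with Q-Network or deep Q network (DQN)

In the DQN, we replace the Q-Table with a neural network (Q-Network) that learns to respond with the optimal action as we train it continuously with the explored states and their Q values. Thus, for training the network, we need a place to store the game memory:

Implement the game memory using a deque of size 1000:

from collections import deque

memory = deque(maxlen=1000)

Next, build a simple neural network model with one hidden layer, q_nn: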

from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(8,input_dim=4, activation='relu'))
model.add(Dense(2, activation='linear'))
model.compile(loss='mse',optimizer='adam')
model.summary()
q_nn = model

This is what the Q-Network looks like:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 8)                 40        
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 18        
=================================================================
Total params: 58
Trainable params: 58
Non-trainable params: 0
_________________________________________________________________

The episode() function that plays one episode of the game incorporates the following changes for the Q-Network-based algorithm:

After generating the next state, add the states, action, and reward to the game memory:

action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_next = discretize_state(obs,s_bounds,n_s)
# add the state_prev, action, reward, state_next, done to memory
memory.append([state_prev,action,reward,state_next,done])

Generate and update the q_values with the maximum future rewards using the Bellman function:

states = np.array([x[0] for x in memory])
states_next = np.array([np.zeros(4) if x[4] else x[3] for x in memory])
q_values = q_nn.predict(states)
q_values_next = q_nn.predict(states_next)
for i in range(len(memory)):
    state_prev,action,reward,state_next,done = memory[i]
    if done:
        # terminal step: there is no future reward to add
        q_values[i,action] = reward
    else:
        best_q = np.amax(q_values_next[i])
        bellman_q = reward + discount_rate * best_q
        q_values[i,action] = bellman_q

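Train q_nn with the states and the q_values we obtained from memory:

q_nn.fit(states,q_values,epochs=1,batch_size=50,verbose=0)

The process of saving gameplay in memory and using it to train the model is also known as memory replay in the deep reinforcement learning literature. Let us run our DQN-based game as follows: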

learning_rate = 0.8
discount_rate = 0.9
explore_rate = 0.2
n_episodes = 100
experiment(env, policy_q_nn, n_episodes)

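We get a maximum reward of 150, which you can improve with hyperparameter tuning, network tuning, and rate decay for the discount rate and the exploration rate: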

Policy:policy_q_nn, Min reward:8.0, Max reward:150.0, Average reward:41.27

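We compute the targets and train the model at every step; you may want to explore changing this to train after each episode. You can also change the code to discard the memory replay and retrain the model for the episodes that return smaller rewards. However, implement this option with caution, as it may slow down learning, since the initial gameplay generates smaller rewards more often.

Summary

In this chapter, we learned how to implement reinforcement learning algorithms in Keras. We used Keras to keep the examples simple; you can implement the same networks and models with TensorFlow as well. We only used a one-layer MLP, as our example game was very simple, but for complex examples you may end up using complex CNN, RNN, or sequence-to-sequence models.

We also learned about OpenAI Gym, a framework that provides environments simulating many popular games, so that reinforcement learning algorithms can be implemented and practiced. We touched on deep reinforcement learning concepts, and we encourage you to explore books written specifically about reinforcement learning to learn the theory and concepts in depth.

Reinforcement learning is an advanced technique that you will often find used to solve complex problems. In the next chapter, we shall learn another family of advanced deep learning techniques: generative adversarial networks.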

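14. Generative Adversarial Networks

Generative models are trained to generate more data similar to the data they were trained on, and adversarial models are trained to distinguish real data from fake data by being provided with adversarial examples.

Generative adversarial networks (GANs) combine the features of both kinds of model. A GAN has two components:

A generative model that learns how to generate similar data
A discriminative model that learns how to distinguish between real data and generated data (from the generative model)

GANs have been successfully applied to various complex problems, such as:

Generating photorealistic high-resolution images from low-resolution images
Synthesizing images from text
Style transfer
Completing incomplete images and videos

In this chapter, we shall study the following topics to learn how to implement GANs in TensorFlow and Keras:

Generative adversarial networks
Simple GAN in TensorFlow
Simple GAN in Keras
Deep convolutional GAN with TensorFlow and Keras

Generative Adversarial Networks 101

As shown in the following diagram, a generative adversarial network (popularly known as a GAN) has two models working in sync to learn and train on complex data such as images, videos, or audio files:

[Figure: the generator and discriminator models of a GAN. Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/51604132-6071-4e6b-80d6-4515efac6085.png]

Intuitively, the generator model starts by generating data from random noise, but slowly learns how to generate more realistic data. The generator output and the real data are fed into the discriminator, which learns how to distinguish fake data from real data.

Thus, the generator and the discriminator play an adversarial game in which the generator tries to fool the discriminator by generating data that is as real as possible, and the discriminator tries not to be fooled by identifying the fake data among the real data; the discriminator therefore tries to minimize its classification loss. Both models are trained in lockstep.

Mathematically, the generative model G(z) learns the probability distribution p(z) such that the discriminator D(G(z), x) is unable to distinguish between the probability distributions p(z) and p(x). The objective function of the GAN can be described by the following equation for the value function V: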

[Equation image: the GAN value function V. Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/db26da3f-bcb8-4519-99e4-ef18c9edfa57.png]
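
For reference, the value function is commonly written as follows in the GAN literature. This is the standard form from Goodfellow et al.; the notation in the original figure may differ slightly, and the symbols p_data and p_z for the data and noise distributions are assumptions of this sketch.

```latex
\min_G \max_D V(D, G) =
    \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminator D maximizes V by assigning high probability to real samples x and low probability to generated samples G(z), while the generator G minimizes V by making D(G(z)) as large as possible.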

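Ian Goodfellow presented a seminal tutorial on GANs at NIPS 2016.

This description represents a simple GAN (also known as a vanilla GAN in the literature), first introduced by Goodfellow in his seminal paper. Since then, there has been a tremendous amount of research on deriving different architectures from GANs and applying them to different application areas.

For example, in conditional GANs the generator and the discriminator networks are provided with labels, so that the objective function of the conditional GAN can be described by the following equation for the value function V:

[Equation image: the conditional GAN value function V. Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/fb6f5df6-c343-405f-bd0c-28f5a7b5051c.png]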
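
For reference, the conditional objective is usually written as follows, with y denoting the label given to both networks. This is the common form from the conditional GAN literature; the notation in the original figure may differ, and p_data and p_z are assumed names for the data and noise distributions.

```latex
\min_G \max_D V(D, G) =
    \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \log D(x \mid y) \right]
  + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D(G(z \mid y)) \right) \right]
```

The only change from the unconditional value function is that both D and G are conditioned on y.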

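The original paper describing conditional GANs is listed in the table that follows.

Several other derivatives used in applications such as text-to-image, image synthesis, image tagging, style transfer, and image transfer are listed in the following table, along with their original papers:

GAN derivative	Originating paper	Example application
StackGAN	https://arxiv.org/abs/1612.03242	Text to image
StackGAN++	https://arxiv.org/abs/1710.10916	Photorealistic image synthesis
DCGAN	https://arxiv.org/abs/1511.06434	Image synthesis
HR-DCGAN	https://arxiv.org/abs/1711.06491	High-resolution image synthesis
Conditional GAN	https://arxiv.org/abs/1411.1784	Image tagging
InfoGAN	https://arxiv.org/abs/1606.03657	Style identification
Wasserstein GAN	https://arxiv.org/abs/1701.07875 https://arxiv.org/abs/1704.00028	Image generation
Coupled GAN	https://arxiv.org/abs/1606.07536	Image transformation, domain adaptation
BEGAN	https://arxiv.org/abs/1703.10717	Image generation
DiscoGAN	https://arxiv.org/abs/1703.05192	Style transfer
CycleGAN	https://arxiv.org/abs/1703.10593	Style transfer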

Let us practice creating a simple GAN using the MNIST dataset. For this exercise, we shall use the following function to normalize the MNIST dataset to lie in [-1, +1]:

def norm(x):
    return (x-0.5)/0.5
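
As a quick check of the normalization, the sketch below applies norm() to a few pixel values, assuming the pixel values are already scaled to [0, 1]. The inverse function denorm() is hypothetical and not part of the original notebook; it is the kind of helper you could use to map generator outputs back to [0, 1] for display.

```python
def norm(x):
    return (x - 0.5) / 0.5

def denorm(x):
    # Hypothetical inverse of norm(), not part of the original notebook
    return x * 0.5 + 0.5

pixels = [0.0, 0.25, 0.5, 1.0]
scaled = [norm(p) for p in pixels]
restored = [denorm(s) for s in scaled]
print(scaled)    # values now lie in [-1, +1]
print(restored)  # back in [0, 1]
```

The [-1, +1] range matches the tanh output of the generator, so real and generated images share the same scale when they reach the discriminator.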

We also define random noise with 256 dimensions that is used to test the generator models:

n_z = 256
z_test = np.random.uniform(-1.0,1.0,size=[8,n_z])

The following function displays the generated images and is used in all the examples in this chapter:

def display_images(images):
    for i in range(images.shape[0]):
        plt.subplot(1, 8, i + 1)
        plt.imshow(images[i])
        plt.axis('off')
    plt.tight_layout()
    plt.show()

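Best practices for building and training GANs

For the dataset we selected for this demonstration, the discriminator became very good at classifying real and fake images, and therefore did not provide much feedback in terms of gradients to the generator. Hence we had to make the discriminator weak with the following best practices:

The learning rate of the discriminator is kept much higher than the learning rate of the generator.
The optimizer for the discriminator is GradientDescent and the optimizer for the generator is Adam.
The discriminator has dropout regularization while the generator does not.
The discriminator has fewer layers and fewer neurons than the generator.
The output of the generator is tanh while the output of the discriminator is sigmoid.
In the Keras model, we use a value of 0.9 instead of 1.0 for the labels of real data and 0.1 instead of 0.0 for the labels of fake data, in order to introduce a little noise in the labels.

You are welcome to explore and try other best practices.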

Simple GAN with TensorFlow

You can follow along with the code in the Jupyter notebook ch-14a_SimpleGAN.

To build the GAN with TensorFlow, we build three networks, two discriminator models and one generator model, using the following steps:

Start by adding the hyperparameters that define the networks:

# graph hyperparameters
g_learning_rate = 0.00001
d_learning_rate = 0.01
n_x = 784  # number of pixels in the MNIST image

# number of hidden layers for generator and discriminator
g_n_layers = 3
d_n_layers = 1
# neurons in each hidden layer
g_n_neurons = [256, 512, 1024]
d_n_neurons = [256]

# define parameter dictionary
d_params = {}
g_params = {}

activation = tf.nn.leaky_relu
w_initializer = tf.glorot_uniform_initializer
b_initializer = tf.zeros_initializer

Next, define the generator network:

z_p = tf.placeholder(dtype=tf.float32, name='z_p', shape=[None, n_z])
layer = z_p

# add generator network weights, biases and layers
with tf.variable_scope('g'):
    for i in range(0, g_n_layers):
        w_name = 'w_{0:04d}'.format(i)
        g_params[w_name] = tf.get_variable(
            name=w_name,
            shape=[n_z if i == 0 else g_n_neurons[i - 1], g_n_neurons[i]],
            initializer=w_initializer())
        b_name = 'b_{0:04d}'.format(i)
        g_params[b_name] = tf.get_variable(
            name=b_name, shape=[g_n_neurons[i]], initializer=b_initializer())
        layer = activation(tf.matmul(layer, g_params[w_name]) + g_params[b_name])
    # output (logit) layer
    i = g_n_layers
    w_name = 'w_{0:04d}'.format(i)
    g_params[w_name] = tf.get_variable(
        name=w_name,
        shape=[g_n_neurons[i - 1], n_x],
        initializer=w_initializer())
    b_name = 'b_{0:04d}'.format(i)
    g_params[b_name] = tf.get_variable(
        name=b_name, shape=[n_x], initializer=b_initializer())
    g_logit = tf.matmul(layer, g_params[w_name]) + g_params[b_name]
    g_model = tf.nn.tanh(g_logit)

Next, define the weights and biases shared by the two discriminator networks that we shall build:

with tf.variable_scope('d'):
    for i in range(0, d_n_layers):
        w_name = 'w_{0:04d}'.format(i)
        d_params[w_name] = tf.get_variable(
            name=w_name,
            shape=[n_x if i == 0 else d_n_neurons[i - 1], d_n_neurons[i]],
            initializer=w_initializer())
        b_name = 'b_{0:04d}'.format(i)
        d_params[b_name] = tf.get_variable(
            name=b_name, shape=[d_n_neurons[i]], initializer=b_initializer())
    # output (logit) layer
    i = d_n_layers
    w_name = 'w_{0:04d}'.format(i)
    d_params[w_name] = tf.get_variable(
        name=w_name, shape=[d_n_neurons[i - 1], 1], initializer=w_initializer())
    b_name = 'b_{0:04d}'.format(i)
    d_params[b_name] = tf.get_variable(
        name=b_name, shape=[1], initializer=b_initializer())

Now, using these parameters, build the discriminator that takes the real images as input and outputs the classification:

# define discriminator_real

# input real images
x_p = tf.placeholder(dtype=tf.float32, name='x_p', shape=[None, n_x])
layer = x_p

with tf.variable_scope('d'):
    for i in range(0, d_n_layers):
        w_name = 'w_{0:04d}'.format(i)
        b_name = 'b_{0:04d}'.format(i)
        layer = activation(tf.matmul(layer, d_params[w_name]) + d_params[b_name])
        # in TensorFlow 1.x the second argument is keep_prob
        layer = tf.nn.dropout(layer,0.7)
    # output (logit) layer
    i = d_n_layers
    w_name = 'w_{0:04d}'.format(i)
    b_name = 'b_{0:04d}'.format(i)
    d_logit_real = tf.matmul(layer, d_params[w_name]) + d_params[b_name]
    d_model_real = tf.nn.sigmoid(d_logit_real)

Next, build another discriminator network with the same parameters, but providing the output of the generator as input:

# define discriminator_fake

# input generated fake images
z = g_model
layer = z

with tf.variable_scope('d'):
    for i in range(0, d_n_layers):
        w_name = 'w_{0:04d}'.format(i)
        b_name = 'b_{0:04d}'.format(i)
        layer = activation(tf.matmul(layer, d_params[w_name]) + d_params[b_name])
        # in TensorFlow 1.x the second argument is keep_prob
        layer = tf.nn.dropout(layer,0.7)
    # output (logit) layer
    i = d_n_layers
    w_name = 'w_{0:04d}'.format(i)
    b_name = 'b_{0:04d}'.format(i)
    d_logit_fake = tf.matmul(layer, d_params[w_name]) + d_params[b_name]
    d_model_fake = tf.nn.sigmoid(d_logit_fake)

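Now that we have the three networks built, the connection between them is made using the loss, optimizer, and training functions. While training the generator, we only train the generator's parameters, and while training the discriminator, we only train the discriminator's parameters. We specify this using the var_list parameter of the optimizer's minimize() function. Here is the complete code for defining the loss, optimizer, and training function for both kinds of network:

g_loss = -tf.reduce_mean(tf.log(d_model_fake))
d_loss = -tf.reduce_mean(tf.log(d_model_real) + tf.log(1 - d_model_fake))

g_optimizer = tf.train.AdamOptimizer(g_learning_rate)
d_optimizer = tf.train.GradientDescentOptimizer(d_learning_rate)

g_train_op = g_optimizer.minimize(g_loss, var_list=list(g_params.values()))
d_train_op = d_optimizer.minimize(d_loss, var_list=list(d_params.values()))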
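
To get a feel for how these two losses behave, the following sketch evaluates the same formulas in plain Python on hand-picked discriminator outputs. The helper gan_losses and the sample probabilities are hypothetical; only the two loss expressions are taken from the code above. Note that g_loss uses -log(D(G(z))), the commonly used non-saturating form, rather than log(1 - D(G(z))).

```python
import math

def mean(values):
    return sum(values) / len(values)

def gan_losses(d_real, d_fake):
    # d_real, d_fake: discriminator sigmoid outputs for real and generated samples
    g_loss = -mean([math.log(p) for p in d_fake])
    d_loss = -mean([math.log(r) + math.log(1.0 - f)
                    for r, f in zip(d_real, d_fake)])
    return g_loss, d_loss

# Discriminator cannot tell real from fake: every output is 0.5
g_even, d_even = gan_losses([0.5, 0.5], [0.5, 0.5])
# Discriminator is confident and correct: the generator loss grows
g_strong, d_strong = gan_losses([0.9, 0.9], [0.1, 0.1])
print(g_even, d_even)
print(g_strong, d_strong)
```

When the discriminator is undecided, g_loss equals log 2 and d_loss equals 2 log 2; as the discriminator gets better, its own loss falls while the generator loss rises.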

Now that we have defined the models, we have to train them. The training is done as per the following algorithm:

For each epoch:
  For each batch:
    get real images x_batch
    generate noise z_batch
    train discriminator using z_batch and x_batch
    generate noise z_batch
    train generator using z_batch

The complete training code from the notebook is as follows:

n_epochs = 400
batch_size = 100
n_batches = int(mnist.train.num_examples / batch_size)
n_epochs_print = 50
pixel_size = 28  # MNIST images are 28 x 28 pixels (28 * 28 = n_x)

with tf.Session() as tfs:
    tfs.run(tf.global_variables_initializer())
    for epoch in range(n_epochs):
        epoch_d_loss = 0.0
        epoch_g_loss = 0.0
        for batch in range(n_batches):
            x_batch, _ = mnist.train.next_batch(batch_size)
            x_batch = norm(x_batch)
            z_batch = np.random.uniform(-1.0,1.0,size=[batch_size,n_z])
            feed_dict = {x_p: x_batch,z_p: z_batch}
            _,batch_d_loss = tfs.run([d_train_op,d_loss], feed_dict=feed_dict)
            z_batch = np.random.uniform(-1.0,1.0,size=[batch_size,n_z])
            feed_dict={z_p: z_batch}
            _,batch_g_loss = tfs.run([g_train_op,g_loss], feed_dict=feed_dict)
            epoch_d_loss += batch_d_loss
            epoch_g_loss += batch_g_loss
        if epoch%n_epochs_print == 0:
            average_d_loss = epoch_d_loss / n_batches
            average_g_loss = epoch_g_loss / n_batches
            print('epoch: {0:04d}   d_loss = {1:0.6f}  g_loss = {2:0.6f}'
                  .format(epoch,average_d_loss,average_g_loss))
            # predict images using generator model trained
            x_pred = tfs.run(g_model,feed_dict={z_p:z_test})
            display_images(x_pred.reshape(-1,pixel_size,pixel_size))

We print the generated images every 50 epochs:

[Figure: images generated by the TensorFlow simple GAN every 50 epochs. Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/bd3dca6c-f38a-4de9-b6a2-6dd6c1722938.png]

As we can see, the generator produced only noise at epoch 0, but by epoch 350 it had been trained to produce much better shapes of handwritten digits. You can experiment with the number of epochs, regularization, network architecture, and other hyperparameters to see whether you can produce results faster and better.

Simple GAN with Keras

You can follow along with the code in the Jupyter notebook ch-14a_SimpleGAN.

Now let us implement the same model in Keras:

# graph hyperparameters
g_learning_rate = 0.00001
d_learning_rate = 0.01
n_x = 784  # number of pixels in the MNIST image 
# number of hidden layers for generator and discriminator
g_n_layers = 3
d_n_layers = 1
# neurons in each hidden layer
g_n_neurons = [256, 512, 1024]
d_n_neurons = [256]

Next, define the generator network:

# define generator
g_model = Sequential()
g_model.add(Dense(units=g_n_neurons[0], input_shape=(n_z,),name='g_0'))
g_model.add(LeakyReLU())
for i in range(1,g_n_layers):
    g_model.add(Dense(units=g_n_neurons[i],name='g_{}'.format(i)))
    g_model.add(LeakyReLU())
g_model.add(Dense(units=n_x, activation='tanh',name='g_out'))
print('Generator:')
g_model.summary()
g_model.compile(loss='binary_crossentropy',optimizer=keras.optimizers.Adam(lr=g_learning_rate))

This is what the generator model looks like:

Generator:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
g_0 (Dense)                  (None, 256)               65792     
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 256)               0         
_________________________________________________________________
g_1 (Dense)                  (None, 512)               131584    
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 512)               0         
_________________________________________________________________
g_2 (Dense)                  (None, 1024)              525312    
_________________________________________________________________
leaky_re_lu_3 (LeakyReLU)    (None, 1024)              0         
_________________________________________________________________
g_out (Dense)                (None, 784)               803600    
=================================================================
Total params: 1,526,288
Trainable params: 1,526,288
Non-trainable params: 0
_________________________________________________________________
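
The parameter counts in this summary can be verified by hand. The sketch below recomputes them from the layer sizes; the helper dense_params is hypothetical, and the sizes are taken from the hyperparameters defined earlier (n_z = 256, n_x = 784).

```python
def dense_params(n_inputs, n_units):
    # One weight per input-unit pair plus one bias per unit
    return n_inputs * n_units + n_units

n_z, n_x = 256, 784
g_n_neurons = [256, 512, 1024]

# Consecutive pairs of sizes give the input and output width of each Dense layer
layer_sizes = [n_z] + g_n_neurons + [n_x]
counts = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

The LeakyReLU layers contribute no parameters, so the four Dense layers account for the whole total.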

In the Keras example, we do not define two discriminator networks as we did in the TensorFlow example. Instead, we define one discriminator network and then stitch the generator and discriminator networks into the GAN network. The GAN network is then used to train only the generator parameters, and the discriminator network is used to train the discriminator parameters:

# define discriminator
d_model = Sequential()
d_model.add(Dense(units=d_n_neurons[0],  input_shape=(n_x,),name='d_0'))
d_model.add(LeakyReLU())
d_model.add(Dropout(0.3))
for i in range(1,d_n_layers):
    d_model.add(Dense(units=d_n_neurons[i], name='d_{}'.format(i)))
    d_model.add(LeakyReLU())
    d_model.add(Dropout(0.3))
d_model.add(Dense(units=1, activation='sigmoid',name='d_out'))
print('Discriminator:')
d_model.summary()
d_model.compile(loss='binary_crossentropy',optimizer=keras.optimizers.SGD(lr=d_learning_rate))

This is what the discriminator model looks like:

Discriminator:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
d_0 (Dense)                  (None, 256)               200960    
_________________________________________________________________
leaky_re_lu_4 (LeakyReLU)    (None, 256)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
d_out (Dense)                (None, 1)                 257       
=================================================================
Total params: 201,217
Trainable params: 201,217
Non-trainable params: 0
_________________________________________________________________

Next, define the GAN network, and set the trainable property of the discriminator model to False, since the GAN is only used to train the generator:

# define GAN network
d_model.trainable=False
z_in = Input(shape=(n_z,),name='z_in')
x_in = g_model(z_in)
gan_out = d_model(x_in)

gan_model = Model(inputs=z_in,outputs=gan_out,name='gan')
print('GAN:')
gan_model.summary()

gan_model.compile(loss='binary_crossentropy',optimizer=keras.optimizers.Adam(lr=g_learning_rate))

This is what the GAN model looks like:

GAN:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
z_in (InputLayer)            (None, 256)               0         
_________________________________________________________________
sequential_1 (Sequential)    (None, 784)               1526288   
_________________________________________________________________
sequential_2 (Sequential)    (None, 1)                 201217    
=================================================================
Total params: 1,727,505
Trainable params: 1,526,288
Non-trainable params: 201,217
_________________________________________________________________

Now that we have defined the three models, we have to train them. The training is done as per the following algorithm:

For each epoch:
  For each batch:
    get real images x_batch
    generate noise z_batch
    generate images g_batch using generator model
    combine g_batch and x_batch into x_in and create labels y_out

    set discriminator model as trainable
    train discriminator using x_in and y_out

    generate noise z_batch
    set x_in = z_batch and labels y_out = 1

    set discriminator model as non-trainable
    train gan model using x_in and y_out
        (effectively training generator model)

To set the labels, we apply the labels 0.9 and 0.1 to the real and fake images, respectively. In general, it is recommended to use label smoothing, picking a random value from 0.0 to 0.3 for fake data and from 0.8 to 1.0 for real data.
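
The random label smoothing described above could be sketched as follows. The helper smoothed_labels, the fixed seed, and the small batch size are assumptions made for illustration; the notebook itself uses the constant labels 0.9 and 0.1.

```python
import random

def smoothed_labels(batch_size, low, high, rng):
    # Draw one soft label per sample uniformly from [low, high]
    return [rng.uniform(low, high) for _ in range(batch_size)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
batch_size = 4
real_labels = smoothed_labels(batch_size, 0.8, 1.0, rng)
fake_labels = smoothed_labels(batch_size, 0.0, 0.3, rng)
# Real images come first and generated images second, as in the discriminator input
y_out = real_labels + fake_labels
print(y_out)
```

In the training loop, a list like y_out would replace the constant 0.9 and 0.1 labels passed to the discriminator.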

Here is the complete training code from the notebook:

n_epochs = 400
batch_size = 100
n_batches = int(mnist.train.num_examples / batch_size)
n_epochs_print = 50
pixel_size = 28  # MNIST images are 28 x 28 pixels (28 * 28 = n_x)

for epoch in range(n_epochs+1):
    epoch_d_loss = 0.0
    epoch_g_loss = 0.0
    for batch in range(n_batches):
        x_batch, _ = mnist.train.next_batch(batch_size)
        x_batch = norm(x_batch)
        z_batch = np.random.uniform(-1.0,1.0,size=[batch_size,n_z])
        g_batch = g_model.predict(z_batch)

        x_in = np.concatenate([x_batch,g_batch])

        y_out = np.ones(batch_size*2)
        y_out[:batch_size]=0.9
        y_out[batch_size:]=0.1

        d_model.trainable=True
        batch_d_loss = d_model.train_on_batch(x_in,y_out)

        z_batch = np.random.uniform(-1.0,1.0,size=[batch_size,n_z])
        x_in=z_batch

        y_out = np.ones(batch_size)

        d_model.trainable=False
        batch_g_loss = gan_model.train_on_batch(x_in,y_out)

        epoch_d_loss += batch_d_loss
        epoch_g_loss += batch_g_loss
    if epoch%n_epochs_print == 0:
        average_d_loss = epoch_d_loss / n_batches
        average_g_loss = epoch_g_loss / n_batches
        print('epoch: {0:04d}   d_loss = {1:0.6f}  g_loss = {2:0.6f}'
              .format(epoch,average_d_loss,average_g_loss))
        # predict images using generator model trained
        x_pred = g_model.predict(z_test)
        display_images(x_pred.reshape(-1,pixel_size,pixel_size))

We print the results every 50 epochs, up to 350 epochs:

[Figure: images generated by the Keras simple GAN every 50 epochs. Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/207381b6-d535-4692-89b6-45bb1501c664.png]

The model slowly learns to generate good-quality images of handwritten digits from random noise.

There are so many variations of GANs that it would take another book to cover all the different kinds. However, the implementation techniques are almost the same as those shown here.

Deep Convolutional GAN with TensorFlow and Keras

You can follow along with the code in the Jupyter notebook ch-14b_DCGAN.

In a DCGAN, both the discriminator and the generator are implemented using deep convolutional networks:

In this example, we decided to implement the generator as the following network:

Generator:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
g_in (Dense)                 (None, 3200)              822400    
_________________________________________________________________
g_in_act (Activation)        (None, 3200)              0         
_________________________________________________________________
g_in_reshape (Reshape)       (None, 5, 5, 128)         0         
_________________________________________________________________
g_0_up2d (UpSampling2D)      (None, 10, 10, 128)       0         
_________________________________________________________________
g_0_conv2d (Conv2D)          (None, 10, 10, 64)        204864    
_________________________________________________________________
g_0_act (Activation)         (None, 10, 10, 64)        0         
_________________________________________________________________
g_1_up2d (UpSampling2D)      (None, 20, 20, 64)        0         
_________________________________________________________________
g_1_conv2d (Conv2D)          (None, 20, 20, 32)        51232     
_________________________________________________________________
g_1_act (Activation)         (None, 20, 20, 32)        0         
_________________________________________________________________
g_2_up2d (UpSampling2D)      (None, 40, 40, 32)        0         
_________________________________________________________________
g_2_conv2d (Conv2D)          (None, 40, 40, 16)        12816     
_________________________________________________________________
g_2_act (Activation)         (None, 40, 40, 16)        0         
_________________________________________________________________
g_out_flatten (Flatten)      (None, 25600)             0         
_________________________________________________________________
g_out (Dense)                (None, 784)               20071184  
=================================================================
Total params: 21,162,496
Trainable params: 21,162,496
Non-trainable params: 0
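
The parameter counts in this summary are consistent with 5 x 5 convolution kernels. The kernel size is not stated in the text; it is inferred here from the counts, so treat it as an assumption. The helpers dense_params and conv2d_params are hypothetical.

```python
def dense_params(n_inputs, n_units):
    # One weight per input-unit pair plus one bias per unit
    return n_inputs * n_units + n_units

def conv2d_params(kernel, in_channels, filters):
    # kernel x kernel weights per input channel per filter, plus one bias per filter
    return kernel * kernel * in_channels * filters + filters

counts = {
    'g_in': dense_params(256, 3200),
    'g_0_conv2d': conv2d_params(5, 128, 64),
    'g_1_conv2d': conv2d_params(5, 64, 32),
    'g_2_conv2d': conv2d_params(5, 32, 16),
    'g_out': dense_params(40 * 40 * 16, 784),
}
total = sum(counts.values())
print(counts, total)
```

Almost all of the parameters sit in the final Dense layer, which maps the flattened 40 x 40 x 16 feature map to the 784 output pixels.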

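The generator is a stronger network with three convolutional layers followed by a tanh activation. We define the discriminator network as follows: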

Discriminator:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
d_0_reshape (Reshape)        (None, 28, 28, 1)         0         
_________________________________________________________________
d_0_conv2d (Conv2D)          (None, 28, 28, 64)        1664      
_________________________________________________________________
d_0_act (Activation)         (None, 28, 28, 64)        0         
_________________________________________________________________
d_0_maxpool (MaxPooling2D)   (None, 14, 14, 64)        0         
_________________________________________________________________
d_out_flatten (Flatten)      (None, 12544)             0         
_________________________________________________________________
d_out (Dense)                (None, 1)                 12545     
=================================================================
Total params: 14,209
Trainable params: 14,209
Non-trainable params: 0
_________________________________________________________________

The GAN network is composed of the discriminator and the generator, as demonstrated previously:

GAN:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
z_in (InputLayer)            (None, 256)               0         
_________________________________________________________________
g (Sequential)               (None, 784)               21162496  
_________________________________________________________________
d (Sequential)               (None, 1)                 14209     
=================================================================
Total params: 21,176,705
Trainable params: 21,162,496
Non-trainable params: 14,209
_________________________________________________________________

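When we run this model for 400 epochs, we get the following output:

[Figure: images generated by the DCGAN over 400 epochs. Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/63ecc6fc-d64d-48ed-868a-92a5b28e1b06.png]

As you can see, the DCGAN is able to generate high-quality digits starting from as early as epoch 100. DCGANs have been used for style transfer, generation of images and titles, and image algebra, namely taking parts of one image and adding them to parts of another image. The complete code for the MNIST DCGAN is provided in the notebook ch-14b_DCGAN.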

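Summary

In this chapter, we learned about generative adversarial networks. We built a simple GAN in TensorFlow and Keras and applied it to generate images from the MNIST dataset. We also learned that many different derivatives of GANs are continuously being introduced, such as DCGAN, SRGAN, StackGAN, and CycleGAN, to name a few. We also built a DCGAN in which the generator and the discriminator consist of convolutional networks. We encourage you to read about and experiment with the different derivatives to see which models fit the problems you are trying to solve.

In the next chapter, we shall learn how to build and deploy models in distributed clusters using TensorFlow clusters and multiple compute devices such as multiple GPUs.

15. Distributed Models with TensorFlow Clusters

Previously we learned how to run TensorFlow models at scale in production using Kubernetes, Docker, and TensorFlow Serving. TensorFlow Serving is not the only way to run TensorFlow models at scale. TensorFlow provides another mechanism to not only run but also train models on multiple nodes, or on different devices on the same node. In Chapter 1, TensorFlow 101, we also learned how to place variables and operations on different devices. In this chapter, we shall learn how to distribute TensorFlow models to run on multiple devices across multiple nodes.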

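In this chapter, we shall cover the following topics:

Strategies for distributed execution
TensorFlow clusters
Data parallel models
Asynchronous and synchronous updates to distributed models

Strategies for distributed execution

To distribute the training of a single model across multiple devices or nodes, there are the following strategies:

Model parallel: Divide the model into multiple subgraphs and place the separate graphs on different nodes or devices. The subgraphs perform their computation and exchange variables as required.
Data parallel: Divide the data into batches and run the same model on multiple nodes or devices, combining the parameters on a master node. Thus, the worker nodes train the model on batches of data and send the parameter updates to the master node, also known as the parameter server.

[Figure: the data parallel approach with model replicas and parameter servers. Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/997df543-69ec-44f2-afb4-1b0da5d0dca6.png]

The preceding diagram shows the data parallel approach, in which the model replicas read partitions of the data in batches and send the parameter updates to the parameter servers, and the parameter servers send the updated parameters back to the model replicas for the next batched computation of updates.

In TensorFlow, there are two ways to implement replication of the model on multiple nodes/devices under the data parallel strategy:

In-graph replication: In this approach, there is a single client task that owns the model parameters and assigns the model calculations to multiple worker tasks.
Between-graph replication: In this approach, each client task connects to its own worker to assign the model calculation, but all workers update the same shared model. In this model, TensorFlow automatically designates one worker as the chief worker so that the model parameters are initialized only once, by the chief worker.

Within both of these approaches, the parameters on the parameter server(s) can be updated in two different ways:

Synchronous update: In a synchronous update, the parameter server waits to receive the updates from all the workers before applying them. The parameter server aggregates the updates, for example by calculating the mean of all the received updates, and applies the result to the parameters. After the update, the parameters are sent to all the workers simultaneously. The disadvantage of this method is that one slow worker may slow down the updates for everyone.
Asynchronous update: In an asynchronous update, the workers send their updates to the parameter server as they become ready, and the parameter server applies each update as it receives it and sends the parameters back. The disadvantage of this method is that by the time a worker has computed its update and sent it back, the parameters may already have been updated several times by other workers. This problem can be alleviated by several methods, such as lowering the batch size or lowering the learning rate. It is a surprise that the asynchronous method works at all, but in practice it does.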
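
The difference between the two update modes can be illustrated on a toy problem with a single parameter. This is a simplified sketch, not how TensorFlow implements the updates: the quadratic loss, the function names, and the assumption that all asynchronous workers read the parameter before any update arrives are made up for illustration.

```python
def grad(w, target):
    # Gradient of the toy loss (w - target)**2 with respect to w
    return 2.0 * (w - target)

def sync_step(w, targets, lr):
    # Wait for every worker, average the gradients, then apply one update
    grads = [grad(w, t) for t in targets]
    return w - lr * sum(grads) / len(grads)

def async_step(w, targets, lr):
    # Every worker computed its gradient from the same stale copy of w,
    # and each gradient is applied as soon as it arrives
    stale_w = w
    for t in targets:
        w = w - lr * grad(stale_w, t)
    return w

targets = [1.0, 2.0, 3.0]  # one data point per worker
w_sync = sync_step(0.0, targets, lr=0.1)
w_async = async_step(0.0, targets, lr=0.1)
print(w_sync, w_async)
```

In this sketch the asynchronous variant moves the parameter three times as far in one round, because three stale gradients are applied one after another. This is one way to see why a lower learning rate helps in the asynchronous setting.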

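TensorFlow clusters

A TensorFlow (TF) cluster is one mechanism that implements the distributed strategies we have just discussed. At the logical level, a TF cluster runs one or more jobs, and each job consists of one or more tasks. Thus a job is just a logical grouping of tasks. At the process level, each task runs as a TF server. At the machine level, each physical machine or node can run more than one task by running more than one server, one server per task. The client creates the graph on different servers and starts the execution of the graph on one server by calling a remote session.

As an example, the following diagram depicts two clients connected to two jobs named m1:

[Figure: two clients connected to jobs in a TF cluster running on two nodes. Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/ee641e15-ca7f-4213-a2f6-ded4a09769de.png]

The two nodes run three tasks each; the job w1 is spread across the two nodes, while the other jobs are each contained within a single node.

A TF server is implemented as two processes: a master and a worker. The master coordinates the computation with other tasks, and the worker is the one that actually runs the computation. At a higher level, you do not have to worry about the internals of the TF server. For the purposes of our explanation and examples, we shall only refer to TF tasks.

To create and train a model in data parallel fashion, use the following steps:

Define the cluster specification
Create a server to host a task
Define the variable nodes to be assigned to the parameter server tasks
Define the operation nodes to be replicated on all the worker tasks
Create a remote session
Train the model in the remote session
Use the model for prediction

Defining the cluster specification

To create a cluster, first define a cluster specification. The cluster specification generally consists of two jobs: ps to create the parameter server tasks and worker to create the worker tasks. The worker and ps jobs contain the list of physical nodes where their respective tasks run. As an example: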

clusterSpec = tf.train.ClusterSpec({
    'ps': [
        'master0.neurasights.com:2222',  # /job:ps/task:0
        'master1.neurasights.com:2222'   # /job:ps/task:1
    ],
    'worker': [
        'worker0.neurasights.com:2222',  # /job:worker/task:0
        'worker1.neurasights.com:2222',  # /job:worker/task:1
        'worker0.neurasights.com:2223',  # /job:worker/task:2
        'worker1.neurasights.com:2223'   # /job:worker/task:3
    ]
})

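This specification creates two jobs, with two tasks in job ps spread across two physical nodes and four tasks in job worker spread across two physical nodes.

In our example code, we create all the tasks on localhost, on different ports:

ps = [
    'localhost:9001',  # /job:ps/task:0
]
workers = [
    'localhost:9002',  # /job:worker/task:0
    'localhost:9003',  # /job:worker/task:1
    'localhost:9004',  # /job:worker/task:2
]
clusterSpec = tf.train.ClusterSpec({'ps': ps, 'worker': workers})

As you can see in the comments in the code, the tasks are identified with /job:<job name>/task:<task index>.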
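
The naming scheme can be reproduced with a few lines of plain Python. The helper task_names is hypothetical and does not call TensorFlow; it only shows how the position of an address in a job's list determines the task index.

```python
def task_names(cluster):
    # cluster maps job name -> list of 'host:port' addresses, one per task
    names = {}
    for job_name, addresses in cluster.items():
        for task_index, address in enumerate(addresses):
            names['/job:{}/task:{}'.format(job_name, task_index)] = address
    return names

cluster = {
    'ps': ['localhost:9001'],
    'worker': ['localhost:9002', 'localhost:9003', 'localhost:9004'],
}
names = task_names(cluster)
for name, address in names.items():
    print(name, address)
```

Because the index is positional, reordering the addresses in the specification changes which server each task name refers to, so every process must use the same ordering.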

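Creating the server instances

Since the cluster contains one server instance per task, start the servers on every physical node by passing them the cluster specification, their own job name, and their task index. The servers use the cluster specification to figure out which other nodes are involved in the computation.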

server = tf.train.Server(clusterSpec, job_name="ps", task_index=0)
server = tf.train.Server(clusterSpec, job_name="worker", task_index=0)
server = tf.train.Server(clusterSpec, job_name="worker", task_index=1)
server = tf.train.Server(clusterSpec, job_name="worker", task_index=2)

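In our example code, we have a single Python file that runs on all the physical machines, containing the following:

server = tf.train.Server(clusterSpec,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index,
                         config=config)

In this code, the job_name and the task_index are taken from the parameters passed on the command line. The package tf.flags is a fancy parser that gives you access to the command-line arguments. The Python file is executed as follows on every physical node (or in separate terminals on the same node if you are using localhost only):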

# the model should be run in each physical node 
# using the appropriate arguments
$ python3 model.py --job_name='ps' --task_index=0
$ python3 model.py --job_name='worker' --task_index=0
$ python3 model.py --job_name='worker' --task_index=1
$ python3 model.py --job_name='worker' --task_index=2

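For greater flexibility in running the code on any cluster, you can also pass the list of machines running the parameter servers and workers through the command line: --ps='localhost:9001' --worker='localhost:9002,localhost:9003,localhost:9004'. You will need to parse them and set them appropriately in the cluster specification dictionary.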
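
A minimal sketch of such parsing is shown below. It uses argparse from the standard library rather than tf.flags, and the flag names, the defaults, and the helper parse_cluster are assumptions modeled on the example above, not the book's code.

```python
import argparse

def parse_cluster(argv):
    # --ps and --worker are hypothetical flag names modeled on the example above
    parser = argparse.ArgumentParser()
    parser.add_argument('--ps', default='localhost:9001')
    parser.add_argument('--worker', default='localhost:9002')
    args = parser.parse_args(argv)
    # Split the comma-separated host lists into a job -> addresses dictionary
    return {
        'ps': [h.strip() for h in args.ps.split(',') if h.strip()],
        'worker': [h.strip() for h in args.worker.split(',') if h.strip()],
    }

cluster = parse_cluster(['--ps=localhost:9001',
                         '--worker=localhost:9002,localhost:9003,localhost:9004'])
print(cluster)
```

The resulting dictionary has the same shape as the one passed to tf.train.ClusterSpec earlier, so it could be handed to the cluster specification directly.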

To ensure that our parameter server only uses the CPU and our worker tasks use the GPU, we use the configuration object:

config = tf.ConfigProto()
config.allow_soft_placement = True

if FLAGS.job_name=='ps':
    #print(config.device_count['GPU'])
    config.device_count['GPU']=0
    server = tf.train.Server(clusterSpec,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index,
                             config=config)
    server.join()
    sys.exit('0')
elif FLAGS.job_name=='worker':
    config.gpu_options.per_process_gpu_memory_fraction = 0.2
    server = tf.train.Server(clusterSpec,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index,
                             config=config)

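The parameter server is made to wait with server.join() while the workers execute the training of the model and exit.

This is what our GPU looks like when all four servers are running:

[Figure: GPU usage with all four servers running. Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/0ccdf601-3c4d-4046-9de9-69789da8f6b0.png]

Defining the parameters and operations across servers and devices

You can use the tf.device() function, which we used in Chapter 1, to place the parameters on the ps tasks and the computation nodes of the graph on the worker tasks.

Note that you can also place the graph nodes on specific devices by adding the device string to the task string, as follows: /job:<job name>/task:<task index>/device:<device type>:<device index>.

For our demonstration example, we use the TensorFlow function tf.train.replica_device_setter() to place the variables and ops.

First, we define the worker device to be the current worker: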

worker_device='/job:worker/task:{}'.format(FLAGS.task_index)

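Next, define a device function using replica_device_setter, passing the cluster specification and the current worker device. The replica_device_setter function figures out the parameter servers from the cluster specification and, if there is more than one parameter server, distributes the parameters among them in round-robin fashion by default. The parameter placement strategy can be changed to a user-defined function or to one of the prebuilt strategies in the tf.contrib package.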

device_func = tf.train.replica_device_setter(worker_device=worker_device,cluster=clusterSpec)
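
The round-robin idea can be sketched without TensorFlow. The helper below is hypothetical and only mimics the placement order; the real function also decides where ops go and handles further details, so read this as an illustration of the strategy, not of the library's implementation.

```python
def round_robin_placement(variable_names, n_ps_tasks):
    # Assign variables to parameter server tasks in creation order,
    # cycling through the available tasks
    placement = {}
    for i, name in enumerate(variable_names):
        placement[name] = '/job:ps/task:{}'.format(i % n_ps_tasks)
    return placement

placement = round_robin_placement(['global_step', 'w', 'b'], n_ps_tasks=2)
for name, device in placement.items():
    print(name, device)
```

With a single parameter server, as in our localhost example, every variable ends up on /job:ps/task:0.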

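Finally, we create the graph inside the tf.device(device_func) block and train it. Creating and training the graph differs between synchronous and asynchronous updates, so we cover these in two separate subsections.

Defining and training the graph for asynchronous updates

As discussed previously, and shown in the diagram here, in asynchronous updates all the worker tasks send their parameter updates when they are ready, and the parameter server updates the parameters and sends them back. There is no synchronization, waiting, or aggregation of the parameter updates:

[Figure: asynchronous parameter updates between the worker tasks and the parameter server. Image: https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/7602cad7-0f34-44d0-9aba-179a4cf3c0be.png]

The full code for this example is in ch-15_mnist_dist_async.py. You are encouraged to modify and explore the code with your own datasets.

For asynchronous updates, the graph is created and trained with the following steps:

The definition of the graph is done within the with block:

with tf.device(device_func):

Create the global step variable using the built-in TensorFlow function:

global_step = tf.train.get_or_create_global_step()

This variable could also be defined as:

tf.Variable(0,name='global_step',trainable=False)

Define the datasets, parameters, and hyperparameters as usual: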

x_test = mnist.test.images
y_test = mnist.test.labels
n_outputs = 10  # 0-9 digits
n_inputs = 784  # total pixels
learning_rate = 0.01
n_epochs = 50
batch_size = 100
n_batches = int(mnist.train.num_examples/batch_size)
n_epochs_print=10

Define the placeholders, weights, biases, logits, cross-entropy, loss op, train op, and accuracy as usual:

# input images
x_p = tf.placeholder(dtype=tf.float32,name='x_p',shape=[None, n_inputs])
# target output
y_p = tf.placeholder(dtype=tf.float32,name='y_p',shape=[None, n_outputs])
w = tf.Variable(tf.random_normal([n_inputs, n_outputs],name='w'))
b = tf.Variable(tf.random_normal([n_outputs],name='b'))
logits = tf.matmul(x_p,w) + b

entropy_op = tf.nn.softmax_cross_entropy_with_logits(labels=y_p,logits=logits)
loss_op = tf.reduce_mean(entropy_op)

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(loss_op,global_step=global_step)

correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y_p, 1))
accuracy_op = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

These definitions will change when we learn how to build the synchronous update.

TensorFlow provides a supervisor class that helps in creating sessions for training and is very useful in a distributed training setting. Create a supervisor object as follows:

# is_chief is expected to be defined earlier in the script
init_op = tf.global_variables_initializer
sv = tf.train.Supervisor(is_chief=is_chief,
                         init_op=init_op(),
                         global_step=global_step)

Use the supervisor object to create a session and run the training under this session block as usual:

with sv.prepare_or_wait_for_session(server.target) as mts:
    lstep = 0
    for epoch in range(n_epochs):
        for batch in range(n_batches):
            x_batch, y_batch = mnist.train.next_batch(batch_size)
            feed_dict={x_p:x_batch,y_p:y_batch}
            _,loss,gstep=mts.run([train_op,loss_op,global_step],feed_dict=feed_dict)
            lstep +=1
        if (epoch+1)%n_epochs_print==0:
            print('worker={},epoch={},global_step={}, local_step={},loss={}'
                  .format(FLAGS.task_index,epoch,gstep,lstep,loss))
    feed_dict={x_p:x_test,y_p:y_test}
    accuracy = mts.run(accuracy_op, feed_dict=feed_dict)
    print('worker={}, final accuracy = {}'.format(FLAGS.task_index,accuracy))

On starting the parameter server, we get the following output:

$ python3 ch-15_mnist_dist_async.py --job_name='ps' --task_index=0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Quadro P5000 major: 6 minor: 1 memoryClockRate(GHz): 1.506
pciBusID: 0000:01:00.0
totalMemory: 15.89GiB freeMemory: 15.79GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro P5000, pci bus id: 0000:01:00.0, compute capability: 6.1)
E1213 16:50:14.023235178   27224 ev_epoll1_linux.c:1051]     grpc epoll fd: 23
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9001}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:9002, 1 -> localhost:9003, 2 -> localhost:9004}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:9001

On starting the worker tasks, we get the following three outputs:

The output from worker 1:

$ python3 ch-15_mnist_dist_async.py --job_name='worker' --task_index=0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Quadro P5000 major: 6 minor: 1 memoryClockRate(GHz): 1.506
pciBusID: 0000:01:00.0
totalMemory: 15.89GiB freeMemory: 9.16GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro P5000, pci bus id: 0000:01:00.0, compute capability: 6.1)
E1213 16:50:37.516609689   27507 ev_epoll1_linux.c:1051]     grpc epoll fd: 23
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9001}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:9002, 1 -> localhost:9003, 2 -> localhost:9004}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:9002
I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 1421824c3df413b5 with config: gpu_options { per_process_gpu_memory_fraction: 0.2 } allow_soft_placement: true
worker=0,epoch=9,global_step=10896, local_step=5500, loss = 1.2575616836547852
worker=0,epoch=19,global_step=22453, local_step=11000, loss = 0.7158586382865906
worker=0,epoch=29,global_step=39019, local_step=16500, loss = 0.43712112307548523
worker=0,epoch=39,global_step=55513, local_step=22000, loss = 0.3935799300670624
worker=0,epoch=49,global_step=72002, local_step=27500, loss = 0.3877961337566376
worker=0, final accuracy = 0.8865000009536743

Output from worker 2:

$ python3 ch-15_mnist_dist_async.py --job_name='worker' --task_index=1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:name: Quadro P5000 major: 6 minor: 1 memoryClockRate(GHz): 1.506
pciBusID: 0000:01:00.0
totalMemory: 15.89GiB freeMemory: 12.43GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro P5000, pci bus id: 0000:01:00.0, compute capability: 6.1)
E1213 16:50:36.684334877   27461 ev_epoll1_linux.c:1051]     grpc epoll fd: 23
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9001}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:9002, 1 -> localhost:9003, 2 -> localhost:9004}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:9003
I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 2bd8a136213a1fce with config: gpu_options { per_process_gpu_memory_fraction: 0.2 } allow_soft_placement: true
worker=1,epoch=9,global_step=11085, local_step=5500, loss = 0.6955764889717102
worker=1,epoch=19,global_step=22728, local_step=11000, loss = 0.5891970992088318
worker=1,epoch=29,global_step=39074, local_step=16500, loss = 0.4183048903942108
worker=1,epoch=39,global_step=55599, local_step=22000, loss = 0.32243454456329346
worker=1,epoch=49,global_step=72105, local_step=27500, loss = 0.5384714007377625
worker=1, final accuracy = 0.8866000175476074

Output from worker 3:

$ python3 ch-15_mnist_dist_async.py --job_name='worker' --task_index=2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:name: Quadro P5000 major: 6 minor: 1 memoryClockRate(GHz): 1.506
pciBusID: 0000:01:00.0
totalMemory: 15.89GiB freeMemory: 15.70GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Quadro P5000, pci bus id: 0000:01:00.0, compute capability: 6.1)
E1213 16:50:35.568349791   27449 ev_epoll1_linux.c:1051]     grpc epoll fd: 23
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9001}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:9002, 1 -> localhost:9003, 2 -> localhost:9004}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:9004
I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session cb0749c9f5fc163e with config: gpu_options { per_process_gpu_memory_fraction: 0.2 } allow_soft_placement: true
I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 55bf9a2b9718a571 with config: gpu_options { per_process_gpu_memory_fraction: 0.2 } allow_soft_placement: true
worker=2,epoch=9,global_step=37367, local_step=5500, loss = 0.8077645301818848
worker=2,epoch=19,global_step=53859, local_step=11000, loss = 0.26333487033843994
worker=2,epoch=29,global_step=70299, local_step=16500, loss = 0.6506651043891907
worker=2,epoch=39,global_step=76999, local_step=22000, loss = 0.20321622490882874
worker=2,epoch=49,global_step=82499, local_step=27500, loss = 0.4170967936515808
worker=2, final accuracy = 0.8894000053405762

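We printed both the global step and the local step. The global step counts the steps taken by all worker tasks combined, while the local step counts only the steps taken within one worker task. That is why the local step goes up to 27,500 and is the same for every worker at the same epoch. The global step, in contrast, shows no symmetry or pattern across epochs or workers, because each worker advances it at its own pace. We also see that the final accuracy differs from worker to worker, because each worker evaluates the final accuracy at a different time, when the parameters hold different values.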
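The relationship between the two counters can be illustrated without TensorFlow. The following is a minimal sketch, assuming three worker threads that share one lock-protected counter standing in for `global_step`. The names `SharedStep` and `run_worker`, and the small epoch and batch counts, are hypothetical and not part of the chapter's code.

```python
import threading

N_WORKERS = 3
N_EPOCHS = 5
N_BATCHES = 10


class SharedStep:
    """Stand-in for the global_step variable held by the parameter server."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Every update from any worker advances the shared counter by one.
        with self._lock:
            self._value += 1
            return self._value


def run_worker(task_index, shared_step, report):
    local_step = 0
    for epoch in range(N_EPOCHS):
        for _ in range(N_BATCHES):
            gstep = shared_step.increment()  # asynchronous: no waiting on peers
            local_step += 1
        # Record what this worker observed at the end of each epoch.
        report[task_index].append((epoch, gstep, local_step))


shared_step = SharedStep()
report = {i: [] for i in range(N_WORKERS)}
threads = [
    threading.Thread(target=run_worker, args=(i, shared_step, report))
    for i in range(N_WORKERS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

for task_index, rows in report.items():
    for epoch, gstep, local_step in rows:
        print('worker={},epoch={},global_step={},local_step={}'.format(
            task_index, epoch, gstep, local_step))
```

Every worker ends with a local step of `N_EPOCHS * N_BATCHES`, while the global step values it observes along the way depend on how the threads happen to interleave, which mirrors the pattern in the logs above.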

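Defining and training the graph for synchronous updates

As discussed earlier, and as depicted in the diagram here, in synchronous updates the worker tasks send their updates to the parameter server, and the ps tasks wait until all updates have been received, aggregate them, and then update the parameters. The worker tasks wait for the updated parameters before proceeding to the next iteration of computing parameter updates: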

[Figure: synchronous parameter updates, https://gitcode.net/apachecn/apachecn-dl-zh/-/raw/master/docs/mastering-tf-1x-zh/img/bf31ed94-ada0-462d-a4ed-4ae08cc18fb6.png]

The complete code for this example is in ch-15_mnist_dist_sync.py. You are encouraged to modify and explore the code with your own datasets.
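The wait, aggregate, and update cycle described above can be mimicked in plain Python. The sketch below only illustrates the idea and is not how TensorFlow implements it. It assumes a one-parameter model y = w * x and three hypothetical workers that each hold one shard of data. The helper `ps_update` is invented here to play the role of the ps task.

```python
# Each worker holds its own shard of (x, y) pairs generated from y = 2 * x.
SHARDS = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
    [(5.0, 10.0), (6.0, 12.0)],
]
LEARNING_RATE = 0.01


def worker_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def ps_update(w, gradients):
    """Apply one update only after every worker's gradient has arrived."""
    assert len(gradients) == len(SHARDS)
    mean_gradient = sum(gradients) / len(gradients)
    return w - LEARNING_RATE * mean_gradient


w = 0.0
for global_step in range(200):
    # All workers compute their gradients from the same parameter value ...
    gradients = [worker_gradient(w, shard) for shard in SHARDS]
    # ... and the parameter is updated once per round, from their average.
    w = ps_update(w, gradients)

print('w after synchronous training: {:.4f}'.format(w))
```

Because every round uses gradients computed from the same value of `w`, no worker ever works with stale parameters. The cost is that each round is only as fast as the slowest worker.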

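For synchronous updates, the following modifications need to be made to the code:

The optimizer needs to be wrapped in SyncReplicasOptimizer. Thus, after defining the optimizer, add the following code: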

# SYNC: next line added for making it sync update
optimizer = tf.train.SyncReplicasOptimizer(
    optimizer,
    replicas_to_aggregate=len(workers),
    total_num_replicas=len(workers),
)

After this, add the training operation as before:

train_op = optimizer.minimize(loss_op,global_step=global_step)

Next, add the definitions of the initialization operations that are specific to the synchronous update method:

# chief_init_op and local_step_init_op are attributes holding ops, not methods
if is_chief:
    local_init_op = optimizer.chief_init_op
else:
    local_init_op = optimizer.local_step_init_op
chief_queue_runner = optimizer.get_chief_queue_runner()
init_token_op = optimizer.get_init_tokens_op()

The supervisor object is also created differently, with two additional initialization arguments:

# SYNC: sv is initialized differently for sync update
sv = tf.train.Supervisor(is_chief=is_chief,
                         init_op=tf.global_variables_initializer(),
                         local_init_op=local_init_op,
                         ready_for_local_init_op=optimizer.ready_for_local_init_op,
                         global_step=global_step)

Finally, in the session block for training, we initialize the sync tokens and start the queue runners if the task is the chief worker task:

# SYNC: if block added to make it sync update
if is_chief:
    mts.run(init_token_op)
    sv.start_queue_runners(mts, [chief_queue_runner])

The rest of the code remains the same as for asynchronous updates.
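The `replicas_to_aggregate` argument passed to `SyncReplicasOptimizer` determines how many gradients go into each update. As a rough illustration, the sketch below models an accumulator that averages the first `replicas_to_aggregate` gradients for the current step and ignores gradients tagged with an older step. The `ToyAccumulator` class and its methods are hypothetical simplifications written for this example. They are not TensorFlow APIs, and the real optimizer involves additional machinery such as the token queue.

```python
class ToyAccumulator:
    """Hypothetical, simplified model of a gradient accumulator."""

    def __init__(self, replicas_to_aggregate):
        self.replicas_to_aggregate = replicas_to_aggregate
        self.global_step = 0
        self._pending = []

    def apply_grad(self, grad, local_step):
        """Accept a gradient unless it was computed for an older step."""
        if local_step < self.global_step:
            return False  # stale gradient: dropped
        self._pending.append(grad)
        return True

    def ready(self):
        """True once enough gradients have been collected for one update."""
        return len(self._pending) >= self.replicas_to_aggregate

    def take_grad(self):
        """Average the collected gradients and move on to the next step."""
        if not self.ready():
            raise RuntimeError('not enough gradients yet')
        batch = self._pending[:self.replicas_to_aggregate]
        self._pending = []
        self.global_step += 1
        return sum(batch) / len(batch)


acc = ToyAccumulator(replicas_to_aggregate=3)
acc.apply_grad(0.3, local_step=0)
acc.apply_grad(0.6, local_step=0)
print(acc.ready())                            # only two of three gradients so far
acc.apply_grad(0.9, local_step=0)
mean_grad = acc.take_grad()                   # averages the three gradients
accepted = acc.apply_grad(0.5, local_step=0)  # computed before the update
print(mean_grad, accepted, acc.global_step)
```

In the chapter's code both `replicas_to_aggregate` and `total_num_replicas` are set to `len(workers)`, so each update waits for one gradient from every worker.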

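The TensorFlow libraries and functions that support distributed training are under continuous development. Hence, watch for new functions being added and for changes to function signatures. At the time of writing this book, we used TensorFlow 1.4.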

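Summary

In this chapter, we learned how to distribute the training of our models across multiple machines and devices using TensorFlow clusters. We also learned the model-parallel and data-parallel strategies for distributed execution of TensorFlow code.

Parameter updates can be shared with the parameter servers through either synchronous or asynchronous updates. We learned how to implement the code for both synchronous and asynchronous parameter updates. With the skills learned in this chapter, you will be able to build and train very large models with very large datasets.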
