Training on the MNIST Dataset with Fully Connected Layers in numpy

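PS: This post is not finished yet. The complete code is at the end of the post; I will fill in the detailed explanations bit by bit as time allows.
Aside: I used to build networks by calling library packages, then found that studying newer, more complex network architectures was a struggle, so I am going back to shore up the basics. Later I plan to add convolution, pooling, dropout, normalization layers and so on, on top of the fully connected layers. If you have the same need, feel free to follow along. So, on to the main topic.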

Network structure

Code implementation

Parameter initialization

Forward propagation:

L_model_forward

linear_activation_forward

linear_forward

activation

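MNIST dataset netdisk link: will post later
ps: a quick Baidu search will also turn it up

Network structure

ps: The fully connected layer helper functions are based on the programming assignments from Andrew Ng's course. Someone kindly compiled a Chinese version, which I recommend (it is really detailed): deeplearning_目录
The network uses a simple 3-layer fully connected design: the two hidden layers have 256 and 64 neurons, and the output layer performs 10-way classification. (ps: my skills are limited (or more likely I was too lazy to tidy up), so the functions are not wrapped into a class.) Next, let's implement the small pieces of the fully connected layer one by one.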
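To make the 784 -> 256 -> 64 -> 10 structure concrete, here is a minimal sketch that lists the weight and bias shapes and the parameter count of each layer. The names `layer_dims` and `describe_layers` are hypothetical helpers introduced only for this illustration; they are not part of the code in this post.

```python
# Layer sizes from the text: 784 inputs -> 256 -> 64 -> 10 outputs.
layer_dims = [28 * 28, 256, 64, 10]


def describe_layers(dims):
    """Return (W shape, b shape, parameter count) for each layer."""
    summary = []
    for n_in, n_out in zip(dims[:-1], dims[1:]):
        w_shape = (n_out, n_in)  # W maps a (n_in, 1) column to (n_out, 1)
        b_shape = (n_out, 1)
        summary.append((w_shape, b_shape, n_out * n_in + n_out))
    return summary


layers = describe_layers(layer_dims)
for i, (w_shape, b_shape, count) in enumerate(layers, start=1):
    print(f"layer {i}: W{w_shape} b{b_shape} -> {count} parameters")

total_params = sum(count for _, _, count in layers)
print("total parameters:", total_params)
```

The shape convention (rows = neurons of the current layer, columns = neurons of the previous layer) matches the way W1, W2, and W3 are created in the initialization function of this post.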

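Code implementation

Here I start from the top-level function; I think this is more organized than starting from the small helpers. If it is hard to follow, leave me a comment and I will revise further.

Parameter initialization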

def linear_initialize_parameters(n_x, n_h1, n_h2, n_y):
    W1 = np.random.randn(n_h1, n_x) * 0.1
    b1 = np.zeros((n_h1, 1))
    # He initialization
    # W2 = np.random.randn(n_h2, n_h1) * 0.1
    W2 = he_init(n_h1, n_h2)
    b2 = np.zeros((n_h2, 1))
    # output layer, followed by the softmax activation
    W3 = np.random.randn(n_y, n_h2) * 0.1
    b3 = np.zeros((n_y, 1))
    parameters = {"W1": W1, "b1": b1,
                  "W2": W2, "b2": b2,
                  "W3": W3, "b3": b3}
    return parameters

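Inputs:
n_x ---------- number of input neurons
n_h1 --------- number of neurons in hidden layer I
n_h2 --------- number of neurons in hidden layer II
n_y ---------- number of output neurons (the number of classes)

Output:
parameters ----- a dictionary holding all weight matrices and biases, used in the propagation steps that follow

In this code W2 uses He initialization (he_init is defined in the full code at the end). The output that W2 produces is passed through the relu activation, and for relu-activated layers He initialization converges faster and lowers the chance of exploding gradients (I won't go into the theory here). In my tests random.randn initialization also works, but the final accuracy is about 0.4% lower (the effect of model randomness cannot be ruled out).
PS: To add or remove layers, this function has to be changed. For example, to drop to 2 layers, the inputs become n_x, n_h, n_y; only W1, b1, W2, b2 need to be initialized, and parameters holds 4 entries.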
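One way to avoid editing the function every time the depth changes is to pass the layer sizes as a list. The sketch below is an assumption-laden variant, not the function used in this post: `initialize_parameters` and its `seed` argument are hypothetical, and it applies He initialization (standard deviation sqrt(2 / fan_in)) to every layer, whereas the post's version uses it only for W2.

```python
import numpy as np


def initialize_parameters(layer_dims, seed=None):
    """Build W1..WL and b1..bL for an arbitrary list of layer sizes.

    layer_dims[0] is the input size, layer_dims[-1] the output size.
    """
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        n_in, n_out = layer_dims[l - 1], layer_dims[l]
        # He initialization: zero mean, std = sqrt(2 / fan_in)
        parameters["W" + str(l)] = rng.normal(
            0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)
        )
        parameters["b" + str(l)] = np.zeros((n_out, 1))
    return parameters


# The 2-layer example from the text: n_x = 784, n_h = 128 (arbitrary), n_y = 10.
params2 = initialize_parameters([784, 128, 10], seed=0)
print(sorted(params2))
```

Because the dictionary keys follow the same "W" + str(l) / "b" + str(l) pattern, a dictionary built this way should be usable with L_model_forward, which infers the number of layers from len(parameters) // 2.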

Forward propagation:

L_model_forward

Forward propagation through the L-layer model

def L_model_forward(X, parameters):
    # parameters stores two entries (W, b) per layer
    caches = []
    A = X
    L = len(parameters) // 2  # number of layers in the neural network
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)],
                                             parameters['b' + str(l)],
                                             activation="relu")
        caches.append(cache)
    # output layer
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)],
                                          parameters['b' + str(L)],
                                          activation="softmax")
    # AL has shape (10, 1)
    caches.append(cache)
    return AL, caches

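Inputs:
X ------------------ the flattened image (the original 28*28 image reshaped into a (784,1) array)
parameters ------ stores all W (weight matrices) and b (biases). ps: for example, with 3 layers, parameters stores 6 entries: $W_{1},b_{1},W_{2},b_{2},W_{3},b_{3}$, so L = len(parameters) // 2 gives the number of fully connected layers.
Outputs:
AL ------------------- the output after forward propagation through L layers; in this example, the predicted value for each digit 0-9
caches ------------- values stored during forward propagation that backpropagation will use
Once the number of layers is known, we loop over the layers (all except the output layer). PS: because W is numbered from 1, the loop range is (1, L).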

A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)], parameters['b' + str(l)],activation="relu")

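This line calls linear_activation_forward with the relu activation (ps: relu is generally used for hidden layers). The function computes the following, where $A_{i-1}$ is the previous layer's output (A_prev in the code):
$\mathrm{ReLU}(W_{i} A_{i-1} + b_{i})$
For the output layer, truncating with relu would be a poor choice. For multi-class tasks there is a common combination: softmax activation + cross-entropy loss (ps: this combination evaluates the model well and also has a very simple derivative, which makes backpropagation easy), so the last layer, layer L, uses the softmax activation.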
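The "very simple derivative" mentioned above is dL/dZ = A - Y, where A is the softmax output and Y the one-hot label; this is the expression softmax_backward returns in the full code. The sketch below checks it numerically with central finite differences. The helpers `softmax_col` and `cross_entropy` are hypothetical names written just for this check.

```python
import numpy as np


def softmax_col(z):
    """Softmax of a single (n, 1) column, shifted by the max for stability."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)


def cross_entropy(z, y):
    """Cross-entropy loss of softmax(z) against a one-hot column y."""
    return -np.sum(y * np.log(softmax_col(z)))


rng = np.random.default_rng(0)
z = rng.normal(size=(10, 1))
y = np.zeros((10, 1))
y[3, 0] = 1.0  # pretend the true digit is 3

# Analytic gradient of the loss with respect to z.
analytic = softmax_col(z) - y

# Numerical gradient via central differences, one component at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.shape[0]):
    step = np.zeros_like(z)
    step[k, 0] = eps
    numeric[k, 0] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print("max |analytic - numeric| =", max_diff)
```

The two gradients should agree up to finite-difference error, which is why the backward pass for the output layer needs no separate derivative of the loss with respect to AL.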

linear_activation_forward

Fully connected layer + activation function

def linear_activation_forward(A_prev, W, b, activation):
    if activation == "softmax":
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = softmax(Z)
    elif activation == "relu":
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)
    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)
    return A, cache

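Inputs:
A_prev ---------- output of the previous layer (for the first layer, A_prev is the original image)
W ----------------- weight matrix
b ------------------ bias
activation ------- type of activation function

Outputs:
A ----------------- output matrix
cache ----------- (linear_cache, activation_cache), where linear_cache holds the values produced in linear_forward that backpropagation needs, and activation_cache holds the values produced in the activation function that backpropagation needs

The purpose of this function is clear: after the forward pass through the fully connected layer, it applies the nonlinearity selected by activation. Only relu and softmax are implemented here; more activation functions, such as the various relu variants, can be added if needed.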
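As an example of adding a relu variant, here is a sketch of leaky ReLU that follows the same (A, cache) return convention as relu in this post. The function names `leaky_relu` and `leaky_relu_backward` and the slope `alpha=0.01` are assumptions for illustration; wiring them in would also need a new branch in linear_activation_forward and in the backward function.

```python
import numpy as np


def leaky_relu(Z, alpha=0.01):
    """Leaky ReLU: Z where Z > 0, alpha * Z elsewhere. Caches Z like relu does."""
    A = np.where(Z > 0, Z, alpha * Z)
    cache = Z
    return A, cache


def leaky_relu_backward(dA, cache, alpha=0.01):
    """Gradient with respect to Z: dA where Z > 0, alpha * dA elsewhere."""
    Z = cache
    return dA * np.where(Z > 0, 1.0, alpha)


Z_demo = np.array([[-2.0], [3.0]])
A_demo, cache_demo = leaky_relu(Z_demo)
dZ_demo = leaky_relu_backward(np.ones_like(Z_demo), cache_demo)
print(A_demo.ravel(), dZ_demo.ravel())
```

Unlike plain relu, negative inputs keep a small gradient, so a neuron whose pre-activation stays negative can still receive updates.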

So, let's continue. This function involves 2 (3) new functions; let's take a look at them.

linear_forward

Forward propagation through a single fully connected layer

def linear_forward(A, W, b):
    # e.g. W (64, 784) times A (784, 1)
    Z = np.dot(W, A) + b
    cache = (A, W, b)
    return Z, cache

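Inputs:
A, W, b ---------- no need to repeat these...

Outputs:
Z --------------------- output matrix
cache --------------- (A, W, b); note that this is the linear_cache inside the cache returned by linear_activation_forward
The function speaks for itself, so I won't say more: $W_{i} A_{i-1} + b_{i}$, where $A_{i-1}$ is the layer's input (the argument A).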
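A quick shape check may help here. The sketch below, using random arrays as stand-ins for real data, shows that the same expression np.dot(W, A) + b handles a single (784, 1) column as well as a batch of columns, because the (64, 1) bias is broadcast across columns. The batch size 32 and the 64-neuron layer are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 784))   # 64 neurons, 784 inputs
b = np.zeros((64, 1))            # one bias per neuron

A_single = rng.random((784, 1))  # one flattened image
A_batch = rng.random((784, 32))  # 32 flattened images, one per column

Z_single = np.dot(W, A_single) + b  # shape (64, 1)
Z_batch = np.dot(W, A_batch) + b    # shape (64, 32); b broadcasts over columns

print(Z_single.shape, Z_batch.shape)
```

Each column of Z_batch equals the result of pushing that column through the layer on its own, which is why the per-sample code in this post could be extended to mini-batches without changing linear_forward.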

activation

Two activation functions are involved, relu and softmax. They are simple, so they are shown together.

def relu(Z):
    A = np.maximum(0, Z)
    cache = Z
    return A, cache

def softmax(Z):
    m = Z.shape[-1]
    A = np.zeros_like(Z)
    for i in range(m):
        a = np.exp(Z[:, i]) / np.sum(np.exp(Z[:, i]))
        A[:, i] = a
    cache = A
    return A, cache

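It is worth noting that the backpropagation-related value each one returns, i.e. activation_cache, differs slightly: relu returns its input, while softmax returns its output (returning A alone would be enough, but a cache object is still returned to keep the format uniform). In softmax, Z has shape (number of classes, batch size); however, I do not use mini-batch gradient descent for backpropagation, so m is 1.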
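The softmax in this post exponentiates Z directly, which can overflow for large values. A common alternative, sketched here as an assumption rather than as the post's implementation, subtracts the column-wise maximum before exponentiating and handles all m columns at once. `softmax_stable` is a hypothetical name; it keeps the same (A, cache) return convention.

```python
import numpy as np


def softmax_stable(Z):
    """Column-wise softmax for Z of shape (classes, m).

    Subtracting each column's max leaves the result unchanged
    mathematically but keeps np.exp from overflowing.
    """
    shifted = Z - np.max(Z, axis=0, keepdims=True)
    exp_Z = np.exp(shifted)
    A = exp_Z / np.sum(exp_Z, axis=0, keepdims=True)
    cache = A
    return A, cache


# Two columns that differ only by a constant offset of 999.
Z_big = np.array([[1000.0, 1.0],
                  [1001.0, 2.0],
                  [1002.0, 3.0]])
A_big, _ = softmax_stable(Z_big)
print(A_big)
```

Both columns give the same probabilities because softmax is invariant to adding a constant to every entry of a column; without the shift, np.exp(1000.0) overflows to inf in the first column.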

OK, that completes forward propagation. Here is the top-level function from the beginning once more; it should look clearer now:

def L_model_forward(X, parameters):
    # parameters stores two entries (W, b) per layer
    caches = []
    A = X
    L = len(parameters) // 2  # number of layers in the neural network
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)],
                                             parameters['b' + str(l)],
                                             activation="relu")
        caches.append(cache)
    # output layer
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)],
                                          parameters['b' + str(L)],
                                          activation="softmax")
    # AL has shape (10, 1)
    caches.append(cache)
    return AL, caches

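to do:
Loss computation + backpropagation. Emmm, I'll write that tomorrow.

Full code:
Ps: To run this locally, make sure numpy, pandas, and sklearn are installed (sklearn is only used to randomly split the training and validation sets; I was lazy and did not implement that by hand). You also need to change the paths used for train and X_test to your MNIST dataset files.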

import numpy as np
import pandas as pd


def he_init(input_dim, output_dim):
    stddev = np.sqrt(2.0 / input_dim)
    W = np.random.normal(0, stddev, size=(output_dim, input_dim))
    return W


def linear_initialize_parameters(n_x, n_h1, n_h2, n_y):
    W1 = np.random.randn(n_h1, n_x) * 0.1
    b1 = np.zeros((n_h1, 1))
    # He initialization
    # W2 = np.random.randn(n_h2, n_h1) * 0.1
    W2 = he_init(n_h1, n_h2)
    b2 = np.zeros((n_h2, 1))
    # output layer, followed by the softmax activation
    W3 = np.random.randn(n_y, n_h2) * 0.1
    b3 = np.zeros((n_y, 1))
    parameters = {"W1": W1, "b1": b1,
                  "W2": W2, "b2": b2,
                  "W3": W3, "b3": b3}
    return parameters


def softmax(Z):
    # Z has shape (10, 1)
    m = Z.shape[-1]
    A = np.zeros_like(Z)
    for i in range(m):
        a = np.exp(Z[:, i]) / np.sum(np.exp(Z[:, i]))
        A[:, i] = a
    cache = A
    return A, cache


def relu(Z):
    A = np.maximum(0, Z)
    cache = Z
    return A, cache


def relu_backward(dA, cache):
    Z = cache
    dZ = np.array(dA, copy=True)  # copy so that dA itself is not modified
    dZ[Z <= 0] = 0
    return dZ


def linear_forward(A, W, b):
    # e.g. W (64, 784) times A (784, 1)
    Z = np.dot(W, A) + b
    cache = (A, W, b)
    return Z, cache


def linear_activation_forward(A_prev, W, b, activation):
    if activation == "softmax":
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = softmax(Z)
    elif activation == "relu":
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)
    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)
    return A, cache


def L_model_forward(X, parameters):
    # parameters stores two entries (W, b) per layer
    caches = []
    A = X
    L = len(parameters) // 2  # number of layers in the neural network
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)],
                                             parameters['b' + str(l)],
                                             activation="relu")
        caches.append(cache)
    # output layer
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)],
                                          parameters['b' + str(L)],
                                          activation="softmax")
    # AL has shape (10, 1)
    caches.append(cache)
    return AL, caches


# Binary-classification cost (marked "has problems" by the author; not used below)
def compute_cost(AL, Y):
    # AL: predictions (10, 1); Y: labels (10, 1)
    m = Y.shape[1]
    NEAR_0 = 1e-10
    cost = -np.sum(np.multiply(np.log(AL + NEAR_0), Y)
                   + np.multiply(np.log(1 - AL + NEAR_0), 1 - Y)) / m
    cost = np.squeeze(cost)
    assert (cost.shape == ())
    return cost


def softmax_backward(Y, activation_cache):
    """Y is the true label; activation_cache is the prediction after softmax."""
    dZ = activation_cache - Y
    return dZ


# Multi-class cost
def compute_cost_multi(AL, Y):
    """AL: predictions (10, m); Y: one-hot labels with the same number of elements."""
    m = AL.shape[-1]  # m = 1 in this post
    Y = np.asarray(Y).reshape(AL.shape)
    # Sum over all classes and samples (the original loop kept only the last column)
    cost = -(1 / m) * np.sum(np.multiply(np.log(AL), Y))
    return cost


# e.g. dZ (10, 1), A_prev (100, 1), W (10, 100)
def linear_backward(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db


def linear_activation_backward(dA, cache, Y, activation="relu"):
    # dZ = dL/dA * dA/dZ (derivative of the activation function)
    linear_cache, activation_cache = cache
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    elif activation == "softmax":
        # activation_cache holds the prediction after softmax
        dZ = softmax_backward(Y, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    return dA_prev, dW, db


def L_model_backward(AL, Y, caches):
    grads = {}
    L = len(caches)  # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)  # after this line, Y is the same shape as AL
    epsilon = 1e-10  # or pick another suitable value
    AL = np.clip(AL, epsilon, 1 - epsilon)
    # Output layer (SOFTMAX -> LINEAR). dA is not used on the softmax path,
    # so None is passed instead of relying on a global placeholder.
    current_cache = caches[L - 1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = \
        linear_activation_backward(None, current_cache, Y, activation="softmax")
    for l in reversed(range(L - 1)):
        # lth layer: (RELU -> LINEAR) gradients; Y is not used on the relu path
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(
            grads["dA" + str(l + 2)], current_cache, None, activation="relu")
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
    return grads


def update_parameters(parameters, grads, learning_rate):
    L = len(parameters) // 2  # number of layers in the neural network
    for l in range(L):
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]
    return parameters


################## data cleaning
from sklearn.model_selection import train_test_split

# One-hot encoding for each digit
digital_map = {
    "0": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "1": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "2": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    "3": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "4": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    "5": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    "6": [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    "7": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    "8": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    "9": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
}
digital_map = {int(key): value for key, value in digital_map.items()}
train = pd.read_csv(r'CNN_Minst/data/train.csv')
X_test = pd.read_csv(r'CNN_Minst/data/test.csv')
X = train.drop('label', axis=1)
y = train['label']
X = X / 255.0   # scale pixel values to 0-1
X = X.values.reshape(-1, 1, 28, 28)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)
_, _, H, W = X_train.shape
learning_rate = 0.05
train_images = X_train[:30000]
# shape (30000, 1, 28, 28)
train_labels = y_train[:30000]
# list of 30000 one-hot lists of length 10
train_labels = [digital_map[num] for num in train_labels]
test_images = X_val[:1000]
# shape (1000, 1, 28, 28)
test_labels = y_val[:1000]
test_labels = test_labels.map(digital_map)
para = linear_initialize_parameters(28 * 28, 256, 64, 10)
################ train
for epoch in range(3):
    print('--- Epoch %d ---' % (epoch + 1))
    permutation = np.random.permutation(len(train_images))
    train_images = train_images[permutation]
    train_labels = np.array(train_labels)[permutation]
    loss = 0
    num_correct = 0
    for i, (im, label) in enumerate(zip(train_images, train_labels)):
        label = label[:, np.newaxis]  # (10,) -> (10, 1)
        if i > 0 and i % 1000 == 999:
            print('[Step %d] Past 1000 steps: Average Loss %.3f | Accuracy: %d%%'
                  % (i + 1, loss / 1000, num_correct / 10))
            loss = 0
            num_correct = 0
        # flatten the image for the fully connected layers
        im = im.reshape(28 * 28, 1)
        # forward propagation; AL holds the probabilities after softmax
        AL, caches = L_model_forward(im, parameters=para)
        # accuracy
        acc = 1 if np.argmax(AL) == np.argmax(label) else 0
        num_correct += acc
        # loss
        cost = compute_cost_multi(AL, label)
        loss += cost
        # backpropagation to obtain the gradients
        grads = L_model_backward(AL, label, caches)
        # update the weights (note: the literal 0.01 is used here, not learning_rate)
        para = update_parameters(para, grads, 0.01)

####### predict
permutation = np.random.permutation(len(test_images))
test_images = test_images[permutation]
test_labels = np.array(test_labels)[permutation]
num_correct = 0
loss = 0
for i, (im, label) in enumerate(zip(test_images, test_labels)):
    label = np.array(label).reshape(-1, 1)  # one-hot column of shape (10, 1)
    im = im.reshape(28 * 28, 1)
    AL, caches = L_model_forward(im, parameters=para)
    acc = 1 if np.argmax(AL) == np.argmax(label) else 0
    num_correct += acc
    cost = compute_cost_multi(AL, label)
    loss += cost
print("test data accuracy:", num_correct / 10, " | loss:", loss / 1000)
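As noted in the Ps before the full code, sklearn is only used for the random train/validation split. If you would rather avoid that dependency, a split can be sketched with numpy alone. The function `split_train_val` below is hypothetical and is not a drop-in replica of train_test_split: it expects numpy arrays (so a pandas Series of labels would need .values first), and the same seed will not reproduce sklearn's split.

```python
import numpy as np


def split_train_val(X, y, val_fraction=0.1, seed=42):
    """Shuffle indices once, then slice off the first val_fraction as validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(round(len(X) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]


# Toy data: 100 "images" whose single pixel equals their label.
X_toy = np.arange(100).reshape(100, 1)
y_toy = np.arange(100)
X_tr, X_va, y_tr, y_va = split_train_val(X_toy, y_toy)
print(X_tr.shape, X_va.shape)
```

Indexing X and y with the same index arrays keeps every image paired with its label, which is the property that matters for the training loop above.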