Preface
The experiment inevitably contains flaws and mistakes; comments and corrections are welcome!
Dataset and experiment file download
File shared via Baidu Netdisk: 预测二手房房价(1).zip (second-hand housing price prediction)
Link: https://pan.baidu.com/s/11Me9CHCys-No9eoKBIrDoQ?pwd=yicj
Extraction code: yicj
Note: the code in the experiment files is not exactly identical to the code in this article.
Related article
For detailed notes on linear regression, see the article below; everything central to this experiment is explained there in detail:
机器学习之监督学习(一)线性回归、多项式回归、算法优化
Experiment procedure
Import the required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from LinearRegressor import run_gradient_descent
Notes:
1. pandas is used to read the data file into a DataFrame and to preprocess the dataset, including data cleaning, feature encoding, and so on.
2. scikit-learn is used for feature scaling and dataset splitting; we also use its stochastic-gradient-descent regressor (SGDRegressor) and several regression evaluation functions.
3. LinearRegressor is a module written by myself. It refines the earlier multivariate linear regression gradient-descent code: it adds regularization and trains the model with mini-batch gradient descent (chosen for efficiency, since the dataset is large and full-batch gradient descent would take too long), and it returns several evaluation metrics of the trained model, namely RMSE (root mean squared error), MAE (mean absolute error) and the R² score (coefficient of determination). All three measure how well a linear regression model predicts; their expressions are given below, followed by a small NumPy sketch of the same metrics.
RMSE=\sqrt{\frac{1}{m}\sum_{i=1}^{m}{(y_i-\hat{y}_i)^2}}
MAE=\frac{1}{m}\sum_{i=1}^{m}{|y_i-\hat{y}_i|}
R^2=1-\frac{\sum_{i=1}^{m}{(y_i-\hat{y}_i)^2}}{\sum_{i=1}^{m}{(y_i-\overline{y})^2}}
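For concreteness, here is a minimal NumPy sketch of the three metrics, written only to mirror the formulas above (it computes the same values as sklearn's mean_squared_error, mean_absolute_error and r2_score):
def regression_metrics(y, y_hat):
    # RMSE, MAE and R² exactly as in the formulas above
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))      # root mean squared error
    mae = np.mean(np.abs(y - y_hat))               # mean absolute error
    ss_res = np.sum((y - y_hat) ** 2)              # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
    r2 = 1 - ss_res / ss_tot                       # coefficient of determination
    return rmse, mae, r2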
LinearRegressor.py is as follows:
import numpy as np
import math
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# cost function
def get_cost(x, y, w, b, lamb):
    # size of the dataset
    m = x.shape[0]
    total_cost = 0
    for i in range(m):
        error = np.dot(x[i, :], w) + b - y[i]
        total_cost = total_cost + error ** 2
    cost = total_cost / (2 * m)
    # add the L2 regularization term
    cost = cost + (lamb / (2 * m)) * np.sum(np.square(w))
    return cost

# gradient function
def get_gradient(x, y, w, b, lamb):
    # number of samples and number of features
    m = x.shape[0]
    n = x.shape[1]
    dj_dw = np.zeros((n,))
    dj_db = 0
    for i in range(m):
        error = np.dot(x[i, :], w) + b - y[i]
        dj_db += error
        for j in range(n):
            dj_dw[j] += (error * x[i, j])
    dj_db = dj_db / m
    dj_dw = dj_dw / m
    # gradient of the regularization term
    dj_dw = dj_dw + (lamb / m) * w
    return dj_dw, dj_db

# gradient descent
def run_gradient_descent(x, y, w_in, b_in, alpha, lamb, num_iters, batch_size):
    '''
    x: input matrix, numpy.ndarray
    y: target vector, numpy.ndarray
    w_in: initial w vector
    b_in: initial b
    alpha: learning rate
    lamb: regularization coefficient
    num_iters: number of iterations
    batch_size: mini-batch size
    '''
    m, n = x.shape
    J_history = []  # record the cost at every iteration
    b = b_in
    w = w_in
    for i in range(int(num_iters)):
        # randomly pick a mini-batch
        indices = np.random.choice(m, batch_size, replace=False)
        x_batch = x[indices]
        y_batch = y[indices]
        # compute the gradients and update the parameters w, b
        dj_dw, dj_db = get_gradient(x_batch, y_batch, w, b, lamb)
        w = w - dj_dw * alpha
        b = b - dj_db * alpha
        # store the current cost J (useful for later visualization)
        J_history.append(get_cost(x, y, w, b, lamb))
        # print ten progress messages over the whole run
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i}: Cost {J_history[-1]} ")
    print(f'final w:{w},b:{b} Cost{J_history[-1]}')
    # evaluation metrics of the trained model
    y_hat = np.dot(x, w) + b
    mse = mean_squared_error(y, y_hat)    # mean squared error
    rmse = np.sqrt(mse)                   # root mean squared error
    mae = mean_absolute_error(y, y_hat)   # mean absolute error
    r2 = r2_score(y, y_hat)               # coefficient of determination
    print(f'RMSE (train): {rmse:.4f}')
    print(f'MAE (train): {mae:.4f}')
    print(f'R² (train): {r2:.4f}')
    return w, b, J_history, [rmse, mae, r2]
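The nested loops in get_gradient are slow on a dataset of this size. As a design note, a vectorized version (a sketch of an alternative, not part of the original experiment files) computes the same gradients with matrix operations and is typically much faster:
def get_gradient_vectorized(x, y, w, b, lamb):
    # same result as get_gradient above, but using matrix operations
    m = x.shape[0]
    error = x @ w + b - y                        # residuals, shape (m,)
    dj_dw = (x.T @ error) / m + (lamb / m) * w   # gradient w.r.t. w, with L2 term
    dj_db = np.sum(error) / m                    # gradient w.r.t. b
    return dj_dw, dj_db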
Data preprocessing
read_csv reads the CSV file and converts it into a pandas DataFrame object.
data=pd.read_csv('data/house.csv',encoding='gbk')
Inspect the dataset information: there are no missing values, so no data cleaning is needed (a hypothetical cleaning sketch is shown after the output below).
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9899 entries, 0 to 9898
Data columns (total 8 columns):
 #  Column     Non-Null Count  Dtype
 0  区域         9899 non-null   object
 1  卧室数        9899 non-null   int64
 2  客厅数        9899 non-null   int64
 3  房屋面积       9899 non-null   float64
 4  楼层高低       9899 non-null   object
 5  是否是地铁房     9899 non-null   int64
 6  是否是学区房     9899 non-null   int64
 7  价格(万/m2)   9899 non-null   float64
dtypes: float64(2), int64(4), object(2)
memory usage: 618.8+ KB
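Had missing values been found, a typical cleaning step would look like the sketch below (hypothetical here, since this dataset is complete):
print(data.isnull().sum())   # count missing values per column
data = data.dropna()         # drop rows with any missing value
# or fill numeric gaps instead, e.g.
# data['房屋面积'] = data['房屋面积'].fillna(data['房屋面积'].median())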
Preview part of the data
head() shows the first five rows, tail() shows the last five.
data.head()
The features are 区域 (district), 卧室数 (number of bedrooms), 客厅数 (number of living rooms), 房屋面积 (floor area), 楼层高低 (floor level), 是否是地铁房 (near a metro station or not) and 是否是学区房 (in a school district or not); the target is 价格 (price, 万/m²).
区域 is textual and must be encoded; since there is no ordering between districts, label encoding is inappropriate and one-hot encoding should be used. 楼层高低 is also textual and must be encoded; since floor levels do have an ordering, label encoding is appropriate. A toy sketch of both encodings follows.
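Here is a minimal sketch of the two encodings on toy data (the district names are purely hypothetical and only illustrate the idea):
toy = pd.DataFrame({'区域': ['闵行', '浦东', '闵行'], '楼层高低': ['高', '低', '中']})
toy = pd.get_dummies(toy, columns=['区域'])                        # one-hot: one 0/1 column per district
toy['楼层高低'] = toy['楼层高低'].map({'高': 2, '中': 1, '低': 0})   # label encoding keeps the ordering
print(toy)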
Split the dataset into X and y
X_=data.iloc[:,:-1]
y=data.iloc[:,-1].values
Feature encoding: 区域 → one-hot encoding
get_dummies one-hot encodes a given column of the dataset.
The 区域 column contains a number of district names. The first idea is to assign each district a numeric label, but districts have no ordering, so label encoding is not suitable; one-hot encoding is the right choice. What is one-hot encoding? Looking at the output, every district becomes a new feature column: False means the listing is not in that district, True means it is.
X_1=pd.get_dummies(X_,columns=['区域'])
X_1.head()
The boolean columns still need to be converted to integers: False → 0, True → 1.
# find the boolean columns and convert them to integer type
bool_columns = X_1.select_dtypes(include=['bool']).columns
X_1[bool_columns] = X_1[bool_columns].astype(int)
X_1.head()
Feature encoding: 楼层高低 → label encoding
The map method takes a conversion dictionary and turns the text into numeric labels; since floor level has a meaningful order, label encoding is used.
X_1['楼层高低']=X_1['楼层高低'].map({'高':2,'中':1,'低':0})
X_1.head()
Feature scaling (excluding the one-hot and label-encoded columns)
In general, one-hot and label-encoded features do not need scaling, because these encodings deal with categorical features, whereas feature scaling addresses the range and distribution of numeric features. The remaining numeric columns are standardized below.
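StandardScaler performs z-score standardization: for every numeric feature it learns the mean \mu and standard deviation \sigma on the data it is fitted on and then applies
x' = \frac{x-\mu}{\sigma}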
scaled_colums=['卧室数','客厅数','房屋面积']
scaler=StandardScaler()
X_1[scaled_colums]=scaler.fit_transform(X_1[scaled_colums])
X_1.head()
After preprocessing, take the value matrix of X_1 to obtain the final feature matrix X.
X=X_1.values
# get the dataset dimensions
m,n=X.shape
print(f'm:{m},n:{n}')
m:9899,n:12
Split the dataset into training, validation and test sets in a 6:2:2 ratio
Call train_test_split from sklearn.model_selection twice:
X_train,X_,y_train,y_=train_test_split(X,y,test_size=0.4,random_state=0)
X_cv,X_test,y_cv,y_test=train_test_split(X_,y_,test_size=0.5,random_state=0)
Check the sizes of the training, validation and test sets:
print(f'训练集:{X_train.shape}')
print(f'验证集:{X_cv.shape}')
print(f'测试集:{X_test.shape}')
训练集:(5939, 12)
验证集:(1980, 12)
测试集:(1980, 12)
Training with handwritten gradient descent
Initialize the weights and bias
# initialize the weights and bias
w_in=np.zeros(n)
b_in=0
Choosing the regularization coefficient
For each candidate regularization coefficient, train the model and compare the resulting mean squared error of the predictions on the validation set; the coefficient with the lowest validation error is selected, here 0.01.
Note: the soundness of this step is debatable. Because mini-batch stochastic gradient descent is used, every run gives slightly different results, so it is hard to claim that a single best regularization coefficient has been found. Moreover, neighbouring coefficients perform almost identically, so there is no need to agonize over the choice; a reasonable value is good enough. One way to make the comparison more repeatable is sketched below.
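A minimal sketch (an assumption, not in the original experiment files) of reducing the run-to-run variance: fix NumPy's random seed for each run and average the validation RMSE over several runs per coefficient.
def cv_rmse(lamb, n_runs=3):
    # average validation RMSE over several runs to smooth out mini-batch noise
    rmses = []
    for seed in range(n_runs):
        np.random.seed(seed)   # makes the mini-batch sampling reproducible per run
        w, b, _, _ = run_gradient_descent(X_train, y_train, w_in, b_in,
                                          alpha=0.01, lamb=lamb,
                                          num_iters=1000, batch_size=32)
        rmses.append(np.sqrt(mean_squared_error(y_cv, np.dot(X_cv, w) + b)))
    return np.mean(rmses)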
lamb_list=[0,0.003,0.01,0.03,0.1]
w_all=[]
b_all=[]
rmse_all=[]
for i in range(len(lamb_list)):
    lamb = lamb_list[i]
    print(f'lamb:{lamb}')
    w, b, history, scores = run_gradient_descent(X_train, y_train, w_in, b_in,
                                                 alpha=0.01, lamb=lamb,
                                                 num_iters=1000, batch_size=32)
    y_cv_hat = np.dot(X_cv, w) + b
    mse = mean_squared_error(y_cv, y_cv_hat)
    rmse = np.sqrt(mse)
    print(f'验证集rmse:{rmse}')
    w_all.append(w)
    b_all.append(b)
    rmse_all.append(rmse)
rmse_all=np.array(rmse_all)
print(f'最佳正则化系数:{lamb_list[rmse_all.argmin()]}')
Reference run output:
lamb:0
Iteration 0: Cost 22.23059601198786
Iteration 100: Cost 2.061076611083008
Iteration 200: Cost 1.3561087252998258
Iteration 300: Cost 1.0280998358159494
Iteration 400: Cost 0.8569524011307861
Iteration 500: Cost 0.7593091461261117
Iteration 600: Cost 0.694745542900074
Iteration 700: Cost 0.6492185319330726
Iteration 800: Cost 0.614560487525777
Iteration 900: Cost 0.5869805356544532
final w:[ 0.02143975 0.17038763 -0.17398796 0.06084088 1.75171081 1.6378431 1.20447934 -0.39465444 0.01006027 1.12995972 -0.29795183 2.17772605],b:3.829619111941914 Cost0.5622034394345841
RMSE (train): 1.0604
MAE (train): 0.8645
R² (train): 0.6958
验证集rmse:1.0571995565292818
lamb:0.003
Iteration 0: Cost 22.351385773755982
Iteration 100: Cost 1.9828997211762693
Iteration 200: Cost 1.3543807706617699
Iteration 300: Cost 1.0399094264386297
Iteration 400: Cost 0.8628443705610528
Iteration 500: Cost 0.764474592689213
Iteration 600: Cost 0.6972510396164013
Iteration 700: Cost 0.6476368062690389
Iteration 800: Cost 0.6104113644232597
Iteration 900: Cost 0.5846282720631463
final w:[ 0.03136469 0.16048989 -0.18776667 0.0651586 1.73883466 1.64132601 1.22904597 -0.39989929 0.00270852 1.14713962 -0.30977836 2.15427881],b:3.826412132321924 Cost0.5619165309100926
RMSE (train): 1.0601
MAE (train): 0.8639
R² (train): 0.6960
验证集rmse:1.0573511497962558
lamb:0.01
Iteration 0: Cost 22.345809951332445
Iteration 100: Cost 2.0179093767467084
Iteration 200: Cost 1.3540412996107423
Iteration 300: Cost 1.0322567480968006
Iteration 400: Cost 0.8544782275887487
Iteration 500: Cost 0.7594217089430233
Iteration 600: Cost 0.6946625903607831
Iteration 700: Cost 0.64501118903113
Iteration 800: Cost 0.6097704275279966
Iteration 900: Cost 0.580857638244589
final w:[ 0.02375557 0.18864764 -0.17627059 0.05236308 1.7464483 1.62361033 1.24307867 -0.41702115 0.0100025 1.13727691 -0.31859057 2.18478831],b:3.8492936378141356 Cost0.556517929151675
RMSE (train): 1.0550
MAE (train): 0.8598
R² (train): 0.6989
验证集rmse:1.0524199082008658
lamb:0.03
Iteration 0: Cost 22.13760766414618
Iteration 100: Cost 1.9792383674599943
Iteration 200: Cost 1.3309707044480739
Iteration 300: Cost 1.0314380656185427
Iteration 400: Cost 0.8626182048908188
Iteration 500: Cost 0.7628534227818494
Iteration 600: Cost 0.6961525014198667
Iteration 700: Cost 0.6491098135586656
Iteration 800: Cost 0.6137308220314517
Iteration 900: Cost 0.5868599520844399
final w:[ 0.01335199 0.15733817 -0.14890463 0.07036306 1.73883984 1.65243961 1.22793312 -0.3934939 0.0139159 1.12679856 -0.3277127 2.15003744],b:3.826606724886083 Cost0.5628590817275239
RMSE (train): 1.0610
MAE (train): 0.8639
R² (train): 0.6955
验证集rmse:1.0581639060644847
lamb:0.1
Iteration 0: Cost 22.51671544759003
Iteration 100: Cost 2.0398728505176593
Iteration 200: Cost 1.3628865416278861
Iteration 300: Cost 1.0296313493619456
Iteration 400: Cost 0.8606342241606357
Iteration 500: Cost 0.7617998432747694
Iteration 600: Cost 0.6927332059016614
Iteration 700: Cost 0.6485004821584673
Iteration 800: Cost 0.6109500914665773
Iteration 900: Cost 0.5836713728604442
final w:[ 0.00995542 0.13994331 -0.15864095 0.05401658 1.73817419 1.62021387 1.22191713 -0.3945572 0.01987262 1.10185906 -0.32871132 2.15077073],b:3.867383745613026 Cost0.559817615021792
RMSE (train): 1.0580
MAE (train): 0.8622
R² (train): 0.6971
验证集rmse:1.056415833868402
最佳正则化系数:0.01
Choosing the learning rate
Learning rate selection works in the same way as regularization coefficient selection, only over a different range of values; the ranges used here for both hyperparameters are only for reference. The best learning rate selected here is 0.1.
alpha_list=[0.003,0.01,0.03,0.1,0.3]
w_all_=[]
b_all=[]
rmse_all_=[]
for i in range(len(alpha_list)):
    alpha = alpha_list[i]
    print(f'alpha:{alpha}')
    w, b, history, scores = run_gradient_descent(X_train, y_train, w_in, b_in,
                                                 alpha, 0.01, 1000, 32)
    y_cv_hat = np.dot(X_cv, w) + b
    mse = mean_squared_error(y_cv, y_cv_hat)
    rmse = np.sqrt(mse)
    print(f'验证集rmse:{rmse}')
    w_all_.append(w)
    b_all.append(b)
    rmse_all_.append(rmse)
rmse_all_=np.array(rmse_all_)
print(f'最佳学习率:{alpha_list[rmse_all_.argmin()]}')
Reference run output:
alpha:0.003
Iteration 0: Cost 23.253100300920302
Iteration 100: Cost 5.781796554717344
Iteration 200: Cost 2.8908550574645138
Iteration 300: Cost 2.1782169802111473
Iteration 400: Cost 1.8655282509644668
Iteration 500: Cost 1.6320514704981137
Iteration 600: Cost 1.453725374767599
Iteration 700: Cost 1.3100475883354463
Iteration 800: Cost 1.200433605322448
Iteration 900: Cost 1.1076688363423737
final w:[ 0.00582021 0.07079598 -0.11993019 0.66561711 2.14452232 1.48428592 0.82638515 -0.07419383 0.21743776 0.77370104 -0.04226368 1.23513901],b:2.938344003070145 Cost1.0330393655533985
RMSE (train): 1.4374
MAE (train): 1.1565
R² (train): 0.4410
验证集rmse:1.43557463281523
alpha:0.01
Iteration 0: Cost 22.288236805699807
Iteration 100: Cost 2.033996669824914
Iteration 200: Cost 1.3503161741421037
Iteration 300: Cost 1.0398051820488934
Iteration 400: Cost 0.8684076055530864
Iteration 500: Cost 0.7634785033747918
Iteration 600: Cost 0.6980292714838968
Iteration 700: Cost 0.6503829982463116
Iteration 800: Cost 0.6137008193570105
Iteration 900: Cost 0.5843130784956765
final w:[ 0.01444502 0.15800244 -0.16551005 0.04960028 1.73509097 1.63352703 1.24986869 -0.39301773 -0.01190119 1.1155374 -0.30946838 2.1621154 ],b:3.8228287139242716 Cost0.5618271447271908
RMSE (train): 1.0600
MAE (train): 0.8636
R² (train): 0.6960
验证集rmse:1.0588126169723184
alpha:0.03
Iteration 0: Cost 19.338694305426777
Iteration 100: Cost 1.0109210195804499
Iteration 200: Cost 0.701999318363482
Iteration 300: Cost 0.5868193617909866
Iteration 400: Cost 0.5239756304843721
Iteration 500: Cost 0.48955134363438374
Iteration 600: Cost 0.46759876225718744
Iteration 700: Cost 0.4486287351969675
Iteration 800: Cost 0.4404540073363661
Iteration 900: Cost 0.4341016365903772
final w:[ 0.09791409 0.18659969 -0.16593792 -0.05763644 0.92842167 1.2867809 1.65094703 -0.58368349 0.02566331 1.35554508 -0.58187152 2.76678769],b:4.670336775408304 Cost0.4333772684492536
RMSE (train): 0.9310
MAE (train): 0.7558
R² (train): 0.7655
验证集rmse:0.9328138860617473
alpha:0.1
Iteration 0: Cost 11.351727702767866
Iteration 100: Cost 0.5558313567896654
Iteration 200: Cost 0.46885025330217456
Iteration 300: Cost 0.44335012903716364
Iteration 400: Cost 0.4313798692115776
Iteration 500: Cost 0.4355631365543866
Iteration 600: Cost 0.4309577528710226
Iteration 700: Cost 0.4307936399988437
Iteration 800: Cost 0.43521994314479706
Iteration 900: Cost 0.42515728254626284
final w:[ 0.06566415 0.17321172 -0.21987795 -0.07900453 0.68852837 1.1809865 1.78782584 -0.6183192 0.09440103 1.32792047 -0.72961745 2.92264108],b:4.926293929041702 Cost0.4242597480520984
RMSE (train): 0.9211
MAE (train): 0.7482
R² (train): 0.7704
验证集rmse:0.9244478848259442
alpha:0.3
Iteration 0: Cost 3.5461776348761567
Iteration 100: Cost 0.4694446691999329
Iteration 200: Cost 0.43273702563394617
Iteration 300: Cost 0.4467992554218754
Iteration 400: Cost 0.43921649466346924
Iteration 500: Cost 0.44547703626122276
Iteration 600: Cost 0.4322157901113187
Iteration 700: Cost 0.48817439906319376
Iteration 800: Cost 0.4483529095197104
Iteration 900: Cost 0.43502632253114654
final w:[ 0.05609692 0.15534287 -0.16358383 -0.10601799 0.72187948 1.19075972 1.68234615 -0.62368935 0.01328925 1.37229405 -0.72398518 2.85003004],b:5.0029658904496115 Cost0.42638259927878613
RMSE (train): 0.9234
MAE (train): 0.7503
R² (train): 0.7693
验证集rmse:0.9265010393861376
最佳学习率:0.1
With the best hyperparameters alpha=0.1 and lambda=0.01, train the final model on the training set, this time trying 5000 iterations.
w_final_1,b_final_1,history_1,scores_1=run_gradient_descent(X_train,y_train,w_in,b_in,0.1,0.01,5000,32)
Iteration 0: Cost 12.388595343057906
Iteration 500: Cost 0.4334083417824843
Iteration 1000: Cost 0.45092509167104144
Iteration 1500: Cost 0.43616663738541933
Iteration 2000: Cost 0.43250984039210616
Iteration 2500: Cost 0.42945876039876385
Iteration 3000: Cost 0.43352571114166116
Iteration 3500: Cost 0.43104419890243933
Iteration 4000: Cost 0.43044929727826253
Iteration 4500: Cost 0.4400853653507515
final w:[ 0.13163098 0.22833865 -0.19156223 -0.05097081 0.64825028 1.30102838 1.59946697 -0.75801881 -0.05103348 1.35646213 -0.76514126 2.85316972],b:4.931440929635891 Cost0.43813670168359764
RMSE (train): 0.9361
MAE (train): 0.7587
R² (train): 0.7793
Plot the cost curve recorded during training
By zooming into parts of the curve, determine a suitable number of iterations.
plt.plot(history_1)
plt.show()
plt.plot(np.arange(0,2000),history_1[0:2000])
plt.show()
plt.plot(np.arange(2000,3000),history_1[2000:3000])
plt.show()
After roughly 2000 iterations the cost starts to oscillate and no longer trends downward, so the recommended minimum number of iterations is 2000. A simple way to read this off J_history automatically is sketched below.
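A minimal sketch (an assumption, not part of the original code) that picks the iteration count from the recorded cost history by detecting when the cost stops improving:
def find_plateau(J_history, window=100, tol=1e-3):
    # return the first iteration after which the mean cost over `window` iterations
    # improves by less than `tol` compared with the previous window
    J = np.asarray(J_history)
    for i in range(window, len(J) - window, window):
        if J[i - window:i].mean() - J[i:i + window].mean() < tol:
            return i
    return len(J)

print(f'suggested number of iterations: {find_plateau(history_1)}')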
Check the evaluation metrics
print('手动梯度下降线性回归模型预测效果各个指标:')
print(f'rmse:{scores_1[0]:.4f}')
print(f'mae:{scores_1[1]:.4f}')
print(f'r2_score:{scores_1[2]:.4f}')
手动梯度下降线性回归模型预测效果各个指标:
rmse:0.9361
mae:0.7587
r2_score:0.7793
The predictions are fairly good: the mean absolute error is around 0.75 万元 (10,000 CNY) per square meter and the coefficient of determination is close to 0.78. A sketch of using the trained parameters on a new sample follows.
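For completeness, a hedged sketch of predicting a single listing with the trained w_final_1 and b_final_1. Here an already-preprocessed row of X_1 stands in for the "new" listing; genuinely new raw data would first need the same one-hot/label encoding and the same scaler.transform:
x_new = X_1.iloc[0].values                         # 12 preprocessed feature values
price_pred = np.dot(x_new, w_final_1) + b_final_1  # linear model prediction
print(f'predicted price: {price_pred:.2f} 万/m²')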
Using scikit-learn's stochastic gradient descent
We now use sklearn's SGDRegressor, which differs quite a bit from the handwritten regressor: it performs single-sample stochastic gradient descent, its learning rate is adapted during training, and training stops once the convergence criterion is met. After trying many parameter settings, the best predictions came from not setting many parameters at all: simply set learning_rate to 'adaptive' and keep the defaults for everything else. The exact reason is unclear and deserves further study.
Sgdr = SGDRegressor(learning_rate='adaptive')
Sgdr.fit(X_train,y_train)
print(f'实际迭代次数:{Sgdr.n_iter_}')
y_test_hat_sgdr=Sgdr.predict(X_test)
print('sklearn线性回归模型预测效果各个指标:')
print(f'rmse:{np.sqrt(mean_squared_error(y_test,y_test_hat_sgdr)):.4f}')
print(f'mae:{mean_absolute_error(y_test,y_test_hat_sgdr):.4f}')
print(f'r2_score:{r2_score(y_test,y_test_hat_sgdr):.4f}')
实际迭代次数:40
sklearn线性回归模型预测效果各个指标:
rmse:0.9288
mae:0.7557
r2_score:0.7827
Compared with the handwritten model, the predictions are slightly better: all three metrics improve a little, but the overall difference is small. A sketch for experimenting with SGDRegressor's parameters explicitly follows.
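For readers who want to play with the parameters mentioned above, here is a sketch with a few of them set explicitly (the values are illustrative assumptions, not tuned results; parameter names assume a recent scikit-learn version):
sgdr_tuned = SGDRegressor(
    loss='squared_error',                 # ordinary least-squares loss
    penalty='l2', alpha=0.01,             # L2 regularization strength
    learning_rate='adaptive', eta0=0.1,   # initial learning rate, reduced when progress stalls
    max_iter=2000, tol=1e-4,              # stopping criteria
    random_state=0)
sgdr_tuned.fit(X_train, y_train)
print(f"test R²: {r2_score(y_test, sgdr_tuned.predict(X_test)):.4f}")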
The experiment files also include neural-network models; one of them predicts considerably better than the two models above, and interested readers can study it in detail.
Evaluation metrics of the neural network predictions:
rmse:0.8579
mae:0.6963
r2_score:0.8146