Machine Learning in Practice: Predicting Second-Hand House Prices (Linear Regression)

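Preface
Dataset and experiment file download
Related articles
Experiment
- Importing the modules
- Data preprocessing
- Training with hand-written gradient descent
- Stochastic gradient descent with scikit-learn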

Preface

This experiment inevitably contains flaws and mistakes; corrections are welcome!

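Dataset and experiment file download

File shared via Baidu Netdisk: 预测二手房房价(1).zip
Link: https://pan.baidu.com/s/11Me9CHCys-No9eoKBIrDoQ?pwd=yicj
Extraction code: yicj
Note: the code in the experiment files is not exactly the same as the code in this article.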

Experiment

Importing the modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from LinearRegressor import run_gradient_descent

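Notes:
1. pandas reads the data file into a DataFrame and handles preprocessing, including data cleaning and feature encoding.
2. sklearn provides feature scaling, dataset splitting, the stochastic gradient descent linear regressor, and several evaluation functions for regression models.
3. LinearRegressor is a module I wrote myself. It extends the earlier multivariate linear regression gradient descent code with regularization and trains the model with mini-batch gradient descent (chosen for speed: the dataset is large and full-batch gradient descent takes too long to run). It also returns the evaluation metrics of the trained model: RMSE (root mean squared error), MAE (mean absolute error) and the R² score (coefficient of determination). All three measure how well a linear regression model predicts, and are defined as:
$RMSE=\sqrt{\frac{1}{m}\sum_{i=1}^{m}{(y_i-\hat{y_i})^2}}$
$MAE=\frac{1}{m}\sum_{i=1}^{m}{|y_i - \hat{y_i}|}$
$R^2 = 1 - \frac{\sum_{i=1}^{m}{(y_i - \hat{y_i})^2}}{\sum_{i=1}^{m}{(y_i - \overline{y})^2}}$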
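To make the three formulas concrete, here is a minimal sketch that evaluates them directly with NumPy. The function name `regression_metrics` and the toy values are illustrative assumptions, not part of the experiment files; on real data the results should agree with the `sklearn.metrics` functions imported above.

```python
import numpy as np

def regression_metrics(y, y_hat):
    """Return (rmse, mae, r2) computed directly from the three definitions."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residual = y - y_hat
    rmse = np.sqrt(np.mean(residual ** 2))
    mae = np.mean(np.abs(residual))
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)
    return rmse, mae, r2

# Toy values chosen for easy arithmetic, not taken from the housing data
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])
rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(rmse, mae, r2)
```

With these toy values the residuals are 1, 0, -1, 0, so the MAE is 0.5, the RMSE is the square root of 0.5, and R² is 1 - 2/20 = 0.9.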
LinearRegressor.py is as follows:

import numpy as np
import math
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


# Compute the regularized cost
def get_cost(x, y, w, b, lamb):
    # Number of samples
    m = x.shape[0]
    total_cost = 0
    for i in range(m):
        error = np.dot(x[i, :], w) + b - y[i]
        total_cost = total_cost + error ** 2
    cost = total_cost / (2 * m)
    # L2 penalty on the weights (the bias is not regularized)
    cost = cost + (lamb / (2 * m)) * np.sum(np.square(w))
    return cost


# Compute the gradient
def get_gradient(x, y, w, b, lamb):
    # Number of samples and number of features
    m = x.shape[0]
    n = x.shape[1]
    dj_dw = np.zeros((n,))
    dj_db = 0
    for i in range(m):
        error = np.dot(x[i, :], w) + b - y[i]
        dj_db += error
        for j in range(n):
            dj_dw[j] += (error * x[i, j])
    dj_db = dj_db / m
    dj_dw = dj_dw / m
    dj_dw = dj_dw + (lamb / m) * w
    return dj_dw, dj_db


# Mini-batch gradient descent
def run_gradient_descent(x, y, w_in, b_in, alpha, lamb, num_iters, batch_size):
    '''
    x: feature matrix, numpy.ndarray
    y: target vector, numpy.ndarray
    w_in: initial weight vector
    b_in: initial bias
    alpha: learning rate
    lamb: regularization coefficient
    num_iters: number of iterations
    batch_size: number of samples drawn for each iteration
    '''
    m, n = x.shape
    J_history = []  # cost recorded after every iteration
    b = b_in
    w = w_in
    for i in range(int(num_iters)):
        # Randomly draw a mini-batch
        indices = np.random.choice(m, batch_size, replace=False)
        x_batch = x[indices]
        y_batch = y[indices]
        # Compute the partial derivatives and update w, b
        dj_dw, dj_db = get_gradient(x_batch, y_batch, w, b, lamb)
        w = w - dj_dw * alpha
        b = b - dj_db * alpha
        # Record the current cost on the full data passed in (used for plotting later)
        J_history.append(get_cost(x, y, w, b, lamb))
        # Print progress ten times during training
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i}: Cost {J_history[-1]} ")
    print(f'final w:{w},b:{b} Cost{J_history[-1]}')
    # Evaluate the fit on the data passed in
    y_hat = np.dot(x, w) + b
    mse = mean_squared_error(y, y_hat)  # mean squared error
    rmse = np.sqrt(mse)  # root mean squared error
    mae = mean_absolute_error(y, y_hat)  # mean absolute error
    r2 = r2_score(y, y_hat)  # coefficient of determination
    print(f'RMSE (train): {rmse:.4f}')
    print(f'MAE (train): {mae:.4f}')
    print(f'R² (train): {r2:.4f}')
    return w, b, J_history, [rmse, mae, r2]
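The explicit Python loops in `get_cost` and `get_gradient` are easy to follow but slow on several thousand rows. As a sketch of one possible speed-up (the names `get_cost_vec` and `get_gradient_vec` are hypothetical and not part of the experiment files), the same quantities can be written with matrix operations:

```python
import numpy as np

def get_cost_vec(x, y, w, b, lamb):
    """Regularized squared-error cost, same formula as get_cost but vectorized."""
    m = x.shape[0]
    error = x @ w + b - y
    return (error @ error) / (2 * m) + (lamb / (2 * m)) * np.sum(w ** 2)

def get_gradient_vec(x, y, w, b, lamb):
    """Gradient of the regularized cost, same formula as get_gradient but vectorized."""
    m = x.shape[0]
    error = x @ w + b - y
    dj_dw = x.T @ error / m + (lamb / m) * w
    dj_db = error.mean()
    return dj_dw, dj_db

# Synthetic, noise-free data: y is an exact linear function of x
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_demo = x_demo @ w_true + 4.0

# At the true parameters with lamb=0, both the cost and the gradient vanish
cost_at_truth = get_cost_vec(x_demo, y_demo, w_true, 4.0, 0.0)
grad_w, grad_b = get_gradient_vec(x_demo, y_demo, w_true, 4.0, 0.0)
print(cost_at_truth, grad_w, grad_b)
```

Since the formulas are unchanged, swapping these in should only affect running time, not the trained parameters (up to floating-point rounding).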

Data preprocessing

read_csv reads a CSV file and converts it into a pandas DataFrame object.

data=pd.read_csv('data/house.csv',encoding='gbk')

Inspect the dataset: there are no missing values, so no cleaning is needed.

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9899 entries, 0 to 9898
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   区域        9899 non-null   object
 1   卧室数       9899 non-null   int64
 2   客厅数       9899 non-null   int64
 3   房屋面积      9899 non-null   float64
 4   楼层高低      9899 non-null   object
 5   是否是地铁房    9899 non-null   int64
 6   是否是学区房    9899 non-null   int64
 7   价格（万/m2）  9899 non-null   float64
dtypes: float64(2), int64(4), object(2)
memory usage: 618.8+ KB

View part of the data
head shows the first five rows; tail shows the last five.

data.head()

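The features are: region (区域), number of bedrooms (卧室数), number of living rooms (客厅数), floor area (房屋面积), floor level (楼层高低), whether the home is near a subway (是否是地铁房) and whether it is in a school district (是否是学区房). The target is the price (价格, in units of 10,000 yuan per m²).
Region is text and must be encoded. Regions have no ordering, so label encoding is unsuitable and one-hot encoding should be used. Floor level is also text and must be encoded; because it reflects an ordering, label encoding is appropriate.

Splitting the dataset into X and y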

X_=data.iloc[:,:-1]
y=data.iloc[:,-1].values

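Feature encoding: region -> one-hot encoding
get_dummies one-hot encodes a given column of the dataset.
The region feature contains a number of locations. The first idea would be to give each location a numeric ID, but locations have no ordering, so label encoding does not fit; one-hot encoding does. What is one-hot encoding? As the output shows, each location becomes a new feature, where False means the home is not in that location and True means it is.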

X_1=pd.get_dummies(X_,columns=['区域'])
X_1.head()

The bool columns still need to be converted to int: False -> 0, True -> 1.

# Find the boolean columns and convert them to integers
bool_columns = X_1.select_dtypes(include=['bool']).columns
X_1[bool_columns] = X_1[bool_columns].astype(int)
X_1.head()
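As a small illustration of what the one-hot step produces, the sketch below runs `pd.get_dummies` on a toy table. The column names and region labels are made up for the example. It also passes the `dtype=int` argument of `get_dummies`, which, as far as I know, yields 0/1 columns directly and would make the separate bool-to-int conversion unnecessary.

```python
import pandas as pd

# Toy table with hypothetical region labels, not the housing data
toy = pd.DataFrame({
    'region': ['A', 'B', 'A', 'C'],
    'rooms': [2, 3, 1, 2],
})

# One new 0/1 column per distinct region; the original 'region' column is dropped
encoded = pd.get_dummies(toy, columns=['region'], dtype=int)
print(encoded)
```

Every row has exactly one of the `region_*` columns set to 1, which is what "one-hot" means.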

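Feature encoding: floor level -> label encoding
The map method takes a conversion dictionary and turns the text into numeric labels. Floor level has a meaningful numeric order, so label encoding is used.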

X_1['楼层高低']=X_1['楼层高低'].map({'高':2,'中':1,'低':0})
X_1.head()
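One thing worth checking after a `map` call is that every value was covered by the dictionary, because `Series.map` returns NaN for keys it does not find. The following sketch uses a toy series; the label '顶层' is a made-up unseen category added only to show the effect.

```python
import pandas as pd

floor_map = {'高': 2, '中': 1, '低': 0}

# Toy series; '顶层' is a hypothetical label that is not in the dictionary
toy_floor = pd.Series(['高', '低', '中', '顶层'])
codes = toy_floor.map(floor_map)

# Values missing from the dictionary become NaN, so list them explicitly
unmapped = toy_floor[codes.isna()].unique()
print(codes.tolist())
print(unmapped)
```

On the housing data, an empty `unmapped` array would confirm that '高', '中' and '低' are the only floor levels present.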

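Feature scaling (excluding the one-hot and label-encoded columns)
In general, one-hot and label-encoded columns do not need feature scaling: both encodings represent categorical features, while feature scaling addresses the range and distribution of numerical features.

scaled_columns=['卧室数','客厅数','房屋面积']
scaler=StandardScaler()
X_1[scaled_columns]=scaler.fit_transform(X_1[scaled_columns])
X_1.head()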
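For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The article scales the whole table before splitting; a common alternative, sketched here on made-up numbers, is to fit the scaler on the training rows only and reuse it for the remaining rows. The variable names and the 80/20 cut are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up floor areas standing in for the '房屋面积' column
rng = np.random.default_rng(0)
area = rng.uniform(30, 200, size=(100, 1))
train, held_out = area[:80], area[80:]

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
held_out_scaled = scaler.transform(held_out)

# The same z-score computed by hand (np.std uses ddof=0, as StandardScaler does)
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(train_scaled, manual))
```

Fitting on the training rows alone keeps the statistics of the validation and test rows out of the preprocessing step.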

After preprocessing, take the numeric matrix of X_1 as the final feature matrix X.

X=X_1.values
# Get the dataset size
m,n=X.shape
print(f'm:{m},n:{n}')

m:9899,n:12

Split the dataset into training, validation and test sets in a 6:2:2 ratio
by calling train_test_split from sklearn.model_selection twice.

X_train,X_,y_train,y_=train_test_split(X,y,test_size=0.4,random_state=0)
X_cv,X_test,y_cv,y_test=train_test_split(X_,y_,test_size=0.5,random_state=0)

Check the sizes of the training, validation and test sets

print(f'Training set:{X_train.shape}')
print(f'Validation set:{X_cv.shape}')
print(f'Test set:{X_test.shape}')

Training set:(5939, 12)
Validation set:(1980, 12)
Test set:(1980, 12)
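The two-step split works because the second call halves the 40% that the first call set aside. A minimal sketch on 100 synthetic rows (the `*_demo` and short variable names are illustrative only) makes the 6:2:2 arithmetic visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows so that the split sizes are easy to read
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# First call: 60% training, 40% set aside
X_tr, X_rest, y_tr, y_rest = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)
# Second call: split the 40% evenly into validation and test
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_tr), len(X_val), len(X_te))
```

Fixing `random_state` makes the split reproducible between runs, which matters when validation scores are compared across hyperparameters.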

Training with hand-written gradient descent

Initialize the weights and bias

# Initialize the weights and bias
w_in=np.zeros(n)
b_in=0

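Choosing the regularization coefficient
For each candidate regularization coefficient, train the model and compare the root mean squared error on the validation set; the coefficient with the lowest validation RMSE is selected, here 0.01.
Note: the soundness of this step is debatable. Mini-batch gradient descent samples its batches at random, so results differ from run to run and it is hard to claim that the best coefficient has really been found. The neighbouring coefficients also perform about the same (validation RMSE between roughly 1.052 and 1.058 in the sample run), so there is no need to agonize over the choice; any reasonable value will do.

lamb_list=[0,0.003,0.01,0.03,0.1]
w_all=[]
b_all=[]
rmse_all=[]
for i in range(len(lamb_list)):
    lamb=lamb_list[i]
    print(f'lamb:{lamb}')
    w,b,history,scores=run_gradient_descent(X_train,y_train,w_in,b_in,alpha=0.01,lamb=lamb,num_iters=1000,batch_size=32)
    # Score the trained model on the validation set
    y_cv_hat=np.dot(X_cv,w)+b
    mse=mean_squared_error(y_cv,y_cv_hat)
    rmse=np.sqrt(mse)
    print(f'Validation RMSE:{rmse}')
    w_all.append(w)
    b_all.append(b)
    rmse_all.append(rmse)
rmse_all=np.array(rmse_all)
print(f'Best regularization coefficient:{lamb_list[rmse_all.argmin()]}')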

Sample output:

lamb:0
Iteration 0: Cost 22.23059601198786 
Iteration 100: Cost 2.061076611083008 
Iteration 200: Cost 1.3561087252998258 
Iteration 300: Cost 1.0280998358159494 
Iteration 400: Cost 0.8569524011307861 
Iteration 500: Cost 0.7593091461261117 
Iteration 600: Cost 0.694745542900074 
Iteration 700: Cost 0.6492185319330726 
Iteration 800: Cost 0.614560487525777 
Iteration 900: Cost 0.5869805356544532 
final w:[ 0.02143975  0.17038763 -0.17398796  0.06084088  1.75171081  1.6378431
  1.20447934 -0.39465444  0.01006027  1.12995972 -0.29795183  2.17772605],b:3.829619111941914 Cost0.5622034394345841
RMSE (train): 1.0604
MAE (train): 0.8645
R² (train): 0.6958
Validation RMSE:1.0571995565292818
lamb:0.003
Iteration 0: Cost 22.351385773755982 
Iteration 100: Cost 1.9828997211762693 
Iteration 200: Cost 1.3543807706617699 
Iteration 300: Cost 1.0399094264386297 
Iteration 400: Cost 0.8628443705610528 
Iteration 500: Cost 0.764474592689213 
Iteration 600: Cost 0.6972510396164013 
Iteration 700: Cost 0.6476368062690389 
Iteration 800: Cost 0.6104113644232597 
Iteration 900: Cost 0.5846282720631463 
final w:[ 0.03136469  0.16048989 -0.18776667  0.0651586   1.73883466  1.64132601
  1.22904597 -0.39989929  0.00270852  1.14713962 -0.30977836  2.15427881],b:3.826412132321924 Cost0.5619165309100926
RMSE (train): 1.0601
MAE (train): 0.8639
R² (train): 0.6960
Validation RMSE:1.0573511497962558
lamb:0.01
Iteration 0: Cost 22.345809951332445 
Iteration 100: Cost 2.0179093767467084 
Iteration 200: Cost 1.3540412996107423 
Iteration 300: Cost 1.0322567480968006 
Iteration 400: Cost 0.8544782275887487 
Iteration 500: Cost 0.7594217089430233 
Iteration 600: Cost 0.6946625903607831 
Iteration 700: Cost 0.64501118903113 
Iteration 800: Cost 0.6097704275279966 
Iteration 900: Cost 0.580857638244589 
final w:[ 0.02375557  0.18864764 -0.17627059  0.05236308  1.7464483   1.62361033
  1.24307867 -0.41702115  0.0100025   1.13727691 -0.31859057  2.18478831],b:3.8492936378141356 Cost0.556517929151675
RMSE (train): 1.0550
MAE (train): 0.8598
R² (train): 0.6989
Validation RMSE:1.0524199082008658
lamb:0.03
Iteration 0: Cost 22.13760766414618 
Iteration 100: Cost 1.9792383674599943 
Iteration 200: Cost 1.3309707044480739 
Iteration 300: Cost 1.0314380656185427 
Iteration 400: Cost 0.8626182048908188 
Iteration 500: Cost 0.7628534227818494 
Iteration 600: Cost 0.6961525014198667 
Iteration 700: Cost 0.6491098135586656 
Iteration 800: Cost 0.6137308220314517 
Iteration 900: Cost 0.5868599520844399 
final w:[ 0.01335199  0.15733817 -0.14890463  0.07036306  1.73883984  1.65243961
  1.22793312 -0.3934939   0.0139159   1.12679856 -0.3277127   2.15003744],b:3.826606724886083 Cost0.5628590817275239
RMSE (train): 1.0610
MAE (train): 0.8639
R² (train): 0.6955
Validation RMSE:1.0581639060644847
lamb:0.1
Iteration 0: Cost 22.51671544759003 
Iteration 100: Cost 2.0398728505176593 
Iteration 200: Cost 1.3628865416278861 
Iteration 300: Cost 1.0296313493619456 
Iteration 400: Cost 0.8606342241606357 
Iteration 500: Cost 0.7617998432747694 
Iteration 600: Cost 0.6927332059016614 
Iteration 700: Cost 0.6485004821584673 
Iteration 800: Cost 0.6109500914665773 
Iteration 900: Cost 0.5836713728604442 
final w:[ 0.00995542  0.13994331 -0.15864095  0.05401658  1.73817419  1.62021387
  1.22191713 -0.3945572   0.01987262  1.10185906 -0.32871132  2.15077073],b:3.867383745613026 Cost0.559817615021792
RMSE (train): 1.0580
MAE (train): 0.8622
R² (train): 0.6971
Validation RMSE:1.056415833868402
Best regularization coefficient:0.01
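Because the batches are drawn at random, a single run per coefficient can rank the candidates differently each time. One way to make the comparison steadier, sketched below on synthetic data, is to repeat each setting with several seeds and compare the mean validation RMSE together with its spread. The helper names `minibatch_gd` and `mean_val_rmse`, the seeds, and the data are all assumptions for illustration; `minibatch_gd` is a compact vectorized stand-in for `run_gradient_descent`, not the article's function.

```python
import numpy as np

def minibatch_gd(x, y, alpha, lamb, num_iters, batch_size, rng):
    """Compact mini-batch gradient descent with an L2 penalty on w."""
    m, n = x.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        idx = rng.choice(m, batch_size, replace=False)
        err = x[idx] @ w + b - y[idx]
        w = w - alpha * (x[idx].T @ err / batch_size + (lamb / batch_size) * w)
        b = b - alpha * err.mean()
    return w, b

def mean_val_rmse(x_tr, y_tr, x_cv, y_cv, lamb, seeds=(0, 1, 2, 3, 4)):
    """Mean and standard deviation of the validation RMSE over several seeds."""
    scores = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        w, b = minibatch_gd(x_tr, y_tr, 0.05, lamb, 500, 32, rng)
        scores.append(np.sqrt(np.mean((x_cv @ w + b - y_cv) ** 2)))
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic stand-in for the training and validation splits
data_rng = np.random.default_rng(42)
x_all = data_rng.normal(size=(400, 4))
y_all = x_all @ np.array([1.5, -0.7, 0.0, 2.0]) + 3.0 + data_rng.normal(scale=0.5, size=400)
x_tr, y_tr, x_cv, y_cv = x_all[:300], y_all[:300], x_all[300:], y_all[300:]

for lamb in [0, 0.003, 0.01, 0.03, 0.1]:
    mean_rmse, std_rmse = mean_val_rmse(x_tr, y_tr, x_cv, y_cv, lamb)
    print(f'lamb:{lamb} mean RMSE:{mean_rmse:.4f} std:{std_rmse:.4f}')
```

If the differences between the means are smaller than the standard deviations, the candidates are effectively tied, which matches the observation that neighbouring coefficients perform about the same.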

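Choosing the learning rate
The learning rate is chosen in the same way as the regularization coefficient, only over a different range of values; the ranges used for both hyperparameters are for reference only. The best learning rate found here is 0.1.

alpha_list=[0.003,0.01,0.03,0.1,0.3]
batch=32  # mini-batch size, the same value used in the regularization search
w_all_=[]
b_all_=[]
rmse_all_=[]
for i in range(len(alpha_list)):
    alpha=alpha_list[i]
    print(f'alpha:{alpha}')
    w,b,history,scores=run_gradient_descent(X_train,y_train,w_in,b_in,alpha,0.01,1000,batch)
    # Score the trained model on the validation set
    y_cv_hat=np.dot(X_cv,w)+b
    mse=mean_squared_error(y_cv,y_cv_hat)
    rmse=np.sqrt(mse)
    print(f'Validation RMSE:{rmse}')
    w_all_.append(w)
    b_all_.append(b)
    rmse_all_.append(rmse)
rmse_all_=np.array(rmse_all_)
print(f'Best learning rate:{alpha_list[rmse_all_.argmin()]}')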

Sample output:

alpha:0.003
Iteration 0: Cost 23.253100300920302 
Iteration 100: Cost 5.781796554717344 
Iteration 200: Cost 2.8908550574645138 
Iteration 300: Cost 2.1782169802111473 
Iteration 400: Cost 1.8655282509644668 
Iteration 500: Cost 1.6320514704981137 
Iteration 600: Cost 1.453725374767599 
Iteration 700: Cost 1.3100475883354463 
Iteration 800: Cost 1.200433605322448 
Iteration 900: Cost 1.1076688363423737 
final w:[ 0.00582021  0.07079598 -0.11993019  0.66561711  2.14452232  1.48428592
  0.82638515 -0.07419383  0.21743776  0.77370104 -0.04226368  1.23513901],b:2.938344003070145 Cost1.0330393655533985
RMSE (train): 1.4374
MAE (train): 1.1565
R² (train): 0.4410
Validation RMSE:1.43557463281523
alpha:0.01
Iteration 0: Cost 22.288236805699807 
Iteration 100: Cost 2.033996669824914 
Iteration 200: Cost 1.3503161741421037 
Iteration 300: Cost 1.0398051820488934 
Iteration 400: Cost 0.8684076055530864 
Iteration 500: Cost 0.7634785033747918 
Iteration 600: Cost 0.6980292714838968 
Iteration 700: Cost 0.6503829982463116 
Iteration 800: Cost 0.6137008193570105 
Iteration 900: Cost 0.5843130784956765 
final w:[ 0.01444502  0.15800244 -0.16551005  0.04960028  1.73509097  1.63352703
  1.24986869 -0.39301773 -0.01190119  1.1155374  -0.30946838  2.1621154 ],b:3.8228287139242716 Cost0.5618271447271908
RMSE (train): 1.0600
MAE (train): 0.8636
R² (train): 0.6960
Validation RMSE:1.0588126169723184
alpha:0.03
Iteration 0: Cost 19.338694305426777 
Iteration 100: Cost 1.0109210195804499 
Iteration 200: Cost 0.701999318363482 
Iteration 300: Cost 0.5868193617909866 
Iteration 400: Cost 0.5239756304843721 
Iteration 500: Cost 0.48955134363438374 
Iteration 600: Cost 0.46759876225718744 
Iteration 700: Cost 0.4486287351969675 
Iteration 800: Cost 0.4404540073363661 
Iteration 900: Cost 0.4341016365903772 
final w:[ 0.09791409  0.18659969 -0.16593792 -0.05763644  0.92842167  1.2867809
  1.65094703 -0.58368349  0.02566331  1.35554508 -0.58187152  2.76678769],b:4.670336775408304 Cost0.4333772684492536
RMSE (train): 0.9310
MAE (train): 0.7558
R² (train): 0.7655
Validation RMSE:0.9328138860617473
alpha:0.1
Iteration 0: Cost 11.351727702767866 
Iteration 100: Cost 0.5558313567896654 
Iteration 200: Cost 0.46885025330217456 
Iteration 300: Cost 0.44335012903716364 
Iteration 400: Cost 0.4313798692115776 
Iteration 500: Cost 0.4355631365543866 
Iteration 600: Cost 0.4309577528710226 
Iteration 700: Cost 0.4307936399988437 
Iteration 800: Cost 0.43521994314479706 
Iteration 900: Cost 0.42515728254626284 
final w:[ 0.06566415  0.17321172 -0.21987795 -0.07900453  0.68852837  1.1809865
  1.78782584 -0.6183192   0.09440103  1.32792047 -0.72961745  2.92264108],b:4.926293929041702 Cost0.4242597480520984
RMSE (train): 0.9211
MAE (train): 0.7482
R² (train): 0.7704
Validation RMSE:0.9244478848259442
alpha:0.3
Iteration 0: Cost 3.5461776348761567 
Iteration 100: Cost 0.4694446691999329 
Iteration 200: Cost 0.43273702563394617 
Iteration 300: Cost 0.4467992554218754 
Iteration 400: Cost 0.43921649466346924 
Iteration 500: Cost 0.44547703626122276 
Iteration 600: Cost 0.4322157901113187 
Iteration 700: Cost 0.48817439906319376 
Iteration 800: Cost 0.4483529095197104 
Iteration 900: Cost 0.43502632253114654 
final w:[ 0.05609692  0.15534287 -0.16358383 -0.10601799  0.72187948  1.19075972
  1.68234615 -0.62368935  0.01328925  1.37229405 -0.72398518  2.85003004],b:5.0029658904496115 Cost0.42638259927878613
RMSE (train): 0.9234
MAE (train): 0.7503
R² (train): 0.7693
Validation RMSE:0.9265010393861376
Best learning rate:0.1

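With the best hyperparameters alpha=0.1 and lambda=0.01, train the model for real, this time with 5000 iterations. Note that the call passes X_test and y_test, so the model is fitted on the test split and the scores printed after it are measured on the same data it was trained on; they are not a held-out estimate.

w_final_1,b_final_1,history_1,scores_1=run_gradient_descent(X_test,y_test,w_in,b_in,0.1,0.01,5000,batch_size=32)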

Iteration 0: Cost 12.388595343057906 
Iteration 500: Cost 0.4334083417824843 
Iteration 1000: Cost 0.45092509167104144 
Iteration 1500: Cost 0.43616663738541933 
Iteration 2000: Cost 0.43250984039210616 
Iteration 2500: Cost 0.42945876039876385 
Iteration 3000: Cost 0.43352571114166116 
Iteration 3500: Cost 0.43104419890243933 
Iteration 4000: Cost 0.43044929727826253 
Iteration 4500: Cost 0.4400853653507515 
final w:[ 0.13163098  0.22833865 -0.19156223 -0.05097081  0.64825028  1.30102838
  1.59946697 -0.75801881 -0.05103348  1.35646213 -0.76514126  2.85316972],b:4.931440929635891 Cost0.43813670168359764
RMSE (train): 0.9361
MAE (train): 0.7587
R² (train): 0.7793
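The scores returned by `run_gradient_descent` are always computed on the data passed to it. To score a model on data it has not seen, the parameters can be fitted on the training split and then evaluated separately. The sketch below shows such a helper on synthetic data; the name `evaluate` and the `*_demo` arrays are hypothetical, and ordinary least squares stands in for the gradient descent training so that the block runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate(x, y, w, b):
    """Return [rmse, mae, r2] of the linear model (w, b) on the data (x, y)."""
    y_hat = np.dot(x, w) + b
    rmse = np.sqrt(mean_squared_error(y, y_hat))
    return [rmse, mean_absolute_error(y, y_hat), r2_score(y, y_hat)]

# Synthetic stand-in for the training and test splits
rng = np.random.default_rng(0)
x_train_demo = rng.normal(size=(200, 3))
x_test_demo = rng.normal(size=(50, 3))
w_true, b_true = np.array([1.0, -2.0, 0.5]), 4.0
y_train_demo = x_train_demo @ w_true + b_true + rng.normal(scale=0.1, size=200)
y_test_demo = x_test_demo @ w_true + b_true + rng.normal(scale=0.1, size=50)

# Least squares stands in for run_gradient_descent; it sees the training split only
design = np.column_stack([x_train_demo, np.ones(len(x_train_demo))])
coef, *_ = np.linalg.lstsq(design, y_train_demo, rcond=None)
w_fit, b_fit = coef[:-1], coef[-1]

# Score on the split that played no part in fitting
test_scores = evaluate(x_test_demo, y_test_demo, w_fit, b_fit)
print(test_scores)
```

With the article's names, this would mean calling `run_gradient_descent(X_train, y_train, ...)` to obtain `w` and `b`, then `evaluate(X_test, y_test, w, b)`.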

Plot how the cost changes during training
Inspect parts of the curve to decide on a suitable number of iterations

plt.plot(history_1)
plt.show()
plt.plot(np.arange(0,2000),history_1[0:2000])
plt.show()
plt.plot(np.arange(2000,3000),history_1[2000:3000])
plt.show()

After roughly 2000 iterations the cost starts to fluctuate and no longer trends downward, so the recommended minimum number of iterations is 2000.
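Reading the plateau off a noisy curve by eye is somewhat subjective. As a rough complement, the cost history can be smoothed with a moving average and searched for the first point that is within a small tolerance of the smoothed minimum. This is only a sketch: the helper names, the window of 100 and the 1% tolerance are arbitrary assumptions, and the curve below is synthetic rather than the real `history_1`.

```python
import numpy as np

def smooth(history, window=100):
    """Moving average of a cost history; returns len(history) - window + 1 points."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(history, dtype=float), kernel, mode='valid')

def first_plateau(history, window=100, tol=0.01):
    """Index of the first smoothed point within a relative `tol` of the smoothed minimum."""
    s = smooth(history, window)
    return int(np.argmax(s <= s.min() * (1 + tol)))

# Synthetic noisy curve standing in for history_1: fast decay, then fluctuation
rng = np.random.default_rng(0)
steps = np.arange(5000)
demo_history = 0.43 + 12 * np.exp(-steps / 50) + rng.normal(scale=0.005, size=5000)

plateau_start = first_plateau(demo_history)
print(plateau_start)
```

The returned index depends on the window and tolerance, so it is best treated as a sanity check on the visual estimate rather than a precise answer.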
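View the evaluation metrics

print('Metrics of the hand-written gradient descent linear regression model:')
print(f'rmse:{scores_1[0]:.4f}')
print(f'mae:{scores_1[1]:.4f}')
print(f'r2_score:{scores_1[2]:.4f}')

Metrics of the hand-written gradient descent linear regression model:
rmse:0.9361
mae:0.7587
r2_score:0.7793

The predictions are fairly good: the mean absolute error is about 0.76 (in units of 10,000 yuan per m²) and the coefficient of determination is close to 0.78.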

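Stochastic gradient descent with scikit-learn

sklearn's SGDRegressor differs noticeably from our hand-written regressor: it updates the parameters one sample at a time, the learning rate is adjusted as training proceeds, and training stops once the loss is no longer improving. After trying many parameter settings, the best predictions came from setting very few of them: only learning_rate is set to 'adaptive' and everything else keeps its default value. The exact reason is not yet clear and needs further investigation.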

Sgdr = SGDRegressor(learning_rate='adaptive')
Sgdr.fit(X_train,y_train)
print(f'Actual number of iterations:{Sgdr.n_iter_}')
y_test_hat_sgdr=Sgdr.predict(X_test)
print('Metrics of the sklearn linear regression model:')
print(f'rmse:{np.sqrt(mean_squared_error(y_test,y_test_hat_sgdr)):.4f}')
print(f'mae:{mean_absolute_error(y_test,y_test_hat_sgdr):.4f}')
print(f'r2_score:{r2_score(y_test,y_test_hat_sgdr):.4f}')

Actual number of iterations:40
Metrics of the sklearn linear regression model:
rmse:0.9288
mae:0.7557
r2_score:0.7827
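To see which settings are in play when only `learning_rate='adaptive'` is passed, the sketch below writes out a few of the related parameters explicitly on synthetic data. The values shown for `eta0`, `max_iter`, `tol` and `n_iter_no_change` are what I understand the library defaults to be; `random_state=0` and the `make_regression` data are assumptions added so the example is reproducible. As I understand the scikit-learn documentation, the adaptive schedule keeps the step size at `eta0` while the training loss keeps decreasing and divides it by 5 after `n_iter_no_change` epochs without an improvement of at least `tol`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing data
X_demo, y_demo = make_regression(n_samples=1000, n_features=5, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)

sgd = SGDRegressor(
    learning_rate='adaptive',  # shrink the step size only when progress stalls
    eta0=0.01,                 # initial step size
    max_iter=1000,             # upper bound on the number of epochs
    tol=1e-3,                  # minimum improvement that counts as progress
    n_iter_no_change=5,        # stalled epochs tolerated before reacting
    random_state=0,            # fixed seed, added here for reproducibility
)
sgd.fit(scaler.transform(X_tr), y_tr)
demo_r2 = r2_score(y_te, sgd.predict(scaler.transform(X_te)))
print(sgd.n_iter_, round(demo_r2, 4))
```

`n_iter_` counts passes over the training data (epochs), so it is not directly comparable to the per-batch iteration count of the hand-written trainer.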

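Compared with the hand-written model, the predictions are slightly better on all three metrics, but the overall difference is small.

The experiment files also include predictions with neural networks; one of those models predicts noticeably better than the two models above, and interested readers can look at it in detail.

Metrics of the neural network model:
rmse:0.8579
mae:0.6963
r2_score:0.8146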