- 🍨 本文为🔗365天深度学习训练营 中的学习记录博客
- 🍖 原作者:K同学啊
本周任务:
任务说明:该数据集提供了来自澳大利亚许多地点的大约 10 年的每日天气观测数据。你需要做的是根据这些数据对RainTomorrow
进行一个预测,这次任务任务与以往的不同,我增加了探索式数据分析(EDA),希望这部分内容可以帮助到大家。
一、导入数据
python">from datetime import dateimport numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScalerimport torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as Ffile_path = 'D:/OneDrive/code learning(python and matlab and latex)/365camp/data/weatherAUS.csv'
data = pd.read_csv(file_path)
df = data.copy()
data.head(10)
代码输出:
python"> Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine \
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN
3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN
4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN
5 2008-12-06 Albury 14.6 29.7 0.2 NaN NaN
6 2008-12-07 Albury 14.3 25.0 0.0 NaN NaN
7 2008-12-08 Albury 7.7 26.7 0.0 NaN NaN
8 2008-12-09 Albury 9.7 31.9 0.0 NaN NaN
9 2008-12-10 Albury 13.1 30.1 1.4 NaN NaN WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm \
0 W 44.0 W ... 71.0 22.0
1 WNW 44.0 NNW ... 44.0 25.0
2 WSW 46.0 W ... 38.0 30.0
3 NE 24.0 SE ... 45.0 16.0
4 W 41.0 ENE ... 82.0 33.0
5 WNW 56.0 W ... 55.0 23.0
6 W 50.0 SW ... 49.0 19.0
7 W 35.0 SSE ... 48.0 19.0
8 NNW 80.0 SE ... 42.0 9.0
9 W 28.0 S ... 58.0 27.0 Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday \
0 1007.7 1007.1 8.0 NaN 16.9 21.8 No
1 1010.6 1007.8 NaN NaN 17.2 24.3 No
2 1007.6 1008.7 NaN 2.0 21.0 23.2 No
3 1017.6 1012.8 NaN NaN 18.1 26.5 No
4 1010.8 1006.0 7.0 8.0 17.8 29.7 No
5 1009.2 1005.4 NaN NaN 20.6 28.9 No
6 1009.6 1008.2 1.0 NaN 18.1 24.6 No
7 1013.4 1010.1 NaN NaN 16.3 25.5 No
8 1008.9 1003.6 NaN NaN 18.3 30.2 No
9 1007.0 1005.7 NaN NaN 20.1 28.2 Yes RainTomorrow
0 No
1 No
2 No
3 No
4 No
5 No
6 No
7 No
8 Yes
9 No [10 rows x 23 columns]
我们接下来查看数据的基本特征
python">data.describe()
代码输出:
python"> MinTemp MaxTemp Rainfall Evaporation \
count 143975.000000 144199.000000 142199.000000 82670.000000
mean 12.194034 23.221348 2.360918 5.468232
std 6.398495 7.119049 8.478060 4.193704
min -8.500000 -4.800000 0.000000 0.000000
25% 7.600000 17.900000 0.000000 2.600000
50% 12.000000 22.600000 0.000000 4.800000
75% 16.900000 28.200000 0.800000 7.400000
max 33.900000 48.100000 371.000000 145.000000 Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm \
count 75625.000000 135197.000000 143693.000000 142398.000000
mean 7.611178 40.035230 14.043426 18.662657
std 3.785483 13.607062 8.915375 8.809800
min 0.000000 6.000000 0.000000 0.000000
25% 4.800000 31.000000 7.000000 13.000000
50% 8.400000 39.000000 13.000000 19.000000
75% 10.600000 48.000000 19.000000 24.000000
max 14.500000 135.000000 130.000000 87.000000 Humidity9am Humidity3pm Pressure9am Pressure3pm \
count 142806.000000 140953.000000 130395.00000 130432.000000
mean 68.880831 51.539116 1017.64994 1015.255889
std 19.029164 20.795902 7.10653 7.037414
min 0.000000 0.000000 980.50000 977.100000
25% 57.000000 37.000000 1012.90000 1010.400000
50% 70.000000 52.000000 1017.60000 1015.200000
75% 83.000000 66.000000 1022.40000 1020.000000
max 100.000000 100.000000 1041.00000 1039.600000 Cloud9am Cloud3pm Temp9am Temp3pm
count 89572.000000 86102.000000 143693.000000 141851.00000
mean 4.447461 4.509930 16.990631 21.68339
std 2.887159 2.720357 6.488753 6.93665
min 0.000000 0.000000 -7.200000 -5.40000
25% 1.000000 2.000000 12.300000 16.60000
50% 5.000000 5.000000 16.700000 21.10000
75% 7.000000 7.000000 21.600000 26.40000
max 9.000000 9.000000 40.200000 46.70000
随后我们查看数据类型:
python">data.dtypes
代码输出:
python">Date object
Location object
MinTemp float64
MaxTemp float64
Rainfall float64
Evaporation float64
Sunshine float64
WindGustDir object
WindGustSpeed float64
WindDir9am object
WindDir3pm object
WindSpeed9am float64
WindSpeed3pm float64
Humidity9am float64
Humidity3pm float64
Pressure9am float64
Pressure3pm float64
Cloud9am float64
Cloud3pm float64
Temp9am float64
Temp3pm float64
RainToday object
RainTomorrow object
dtype: object
将日期转换为日期时间格式:
python">data['Date'] = pd.to_datetime(data['Date'])data['year'] = data['Date'].dt.year
data['month'] = data['Date'].dt.month
data['day'] = data['Date'].dt.daydata.head()
代码输出:
python"> Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine \
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN
3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN
4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN WindGustDir WindGustSpeed WindDir9am ... Pressure3pm Cloud9am Cloud3pm \
0 W 44.0 W ... 1007.1 8.0 NaN
1 WNW 44.0 NNW ... 1007.8 NaN NaN
2 WSW 46.0 W ... 1008.7 NaN 2.0
3 NE 24.0 SE ... 1012.8 NaN NaN
4 W 41.0 ENE ... 1006.0 7.0 8.0 Temp9am Temp3pm RainToday RainTomorrow year month day
0 16.9 21.8 No No 2008 12 1
1 17.2 24.3 No No 2008 12 2
2 21.0 23.2 No No 2008 12 3
3 18.1 26.5 No No 2008 12 4
4 17.8 29.7 No No 2008 12 5 [5 rows x 26 columns]
python">data.drop('Date', axis=1, inplace=True)
data.columns
代码输出:
python">Index(['Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine','WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm','WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm','Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am','Temp3pm', 'RainToday', 'RainTomorrow', 'year', 'month', 'day'],dtype='object')
二、探索式数据分析(EDA)
1、数据相关性探索
python">numeric_data = data.select_dtypes(include=['number'])
plt.figure(figsize=(15,13))
ax = sns.heatmap(numeric_data.corr(), square=True, annot=True, fmt='.2f')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()
代码输出:
2、是否会下雨
python"># 设置样式和调色板
sns.set(style="whitegrid", palette="Set2")# 创建一个 1 行 2 列的图像布局
fig, axes = plt.subplots(1, 2, figsize=(10, 4)) # 图形尺寸调大 (10, 4)# 图表标题样式
title_font = {'fontsize': 14, 'fontweight': 'bold', 'color': 'darkblue'}# 第一张图:RainTomorrow
sns.countplot(x='RainTomorrow', data=data, ax=axes[0], edgecolor='black',palette={'Yes': 'green', 'No': 'red'}) # 添加边框
axes[0].set_title('Rain Tomorrow', fontdict=title_font) # 设置标题
axes[0].set_xlabel('Will it Rain Tomorrow?', fontsize=12) # X轴标签
axes[0].set_ylabel('Count', fontsize=12) # Y轴标签
axes[0].tick_params(axis='x', labelsize=11) # X轴刻度字体大小
axes[0].tick_params(axis='y', labelsize=11) # Y轴刻度字体大小# 第二张图:RainToday
sns.countplot(x='RainToday', data=data, ax=axes[1], edgecolor='black',palette={'Yes': 'green', 'No': 'red'}) # 添加边框
axes[1].set_title('Rain Today', fontdict=title_font) # 设置标题
axes[1].set_xlabel('Did it Rain Today?', fontsize=12) # X轴标签
axes[1].set_ylabel('Count', fontsize=12) # Y轴标签
axes[1].tick_params(axis='x', labelsize=11) # X轴刻度字体大小
axes[1].tick_params(axis='y', labelsize=11) # Y轴刻度字体大小sns.despine() # 去除图表顶部和右侧的边框
plt.tight_layout() # 调整布局,避免图形之间的重叠
plt.show()
代码输出:
python">x=pd.crosstab(data['RainTomorrow'],data['RainToday'])
x
代码输出:
python">RainToday No Yes
RainTomorrow
No 92728 16858
Yes 16604 14597
python">y=x/x.transpose().sum().values.reshape(2,1)*100
y
代码输出:
python">RainToday No Yes
RainTomorrow
No 84.616648 15.383352
Yes 53.216243 46.783757
python">y.plot(kind="bar",figsize=(4,3),color=['#006666','#d279a6']);
代码输出:
3、地理位置与下雨的关系
python">x=pd.crosstab(data['Location'],data['RainToday'])
# 获取每个城市下雨天数和非下雨天数的百分比
y=x/x.transpose().sum().values.reshape((-1, 1))*100
# 按每个城市的雨天百分比排序
y=y.sort_values(by='Yes',ascending=True )color=['#cc6699','#006699','#006666','#862d86','#ff9966' ]
y.Yes.plot(kind="barh",figsize=(15,20),color=color)
代码输出:
4、湿度和压力对下雨的影响
python">data.columns
代码输出:
python">Index(['Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine','WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm','WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm','Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am','Temp3pm', 'RainToday', 'RainTomorrow', 'year', 'month', 'day'],dtype='object')
python">plt.figure(figsize=(8,6))
sns.scatterplot(data=data,x='Pressure9am',y='Pressure3pm',hue='RainTomorrow');
代码输出:
python">plt.figure(figsize=(8,6))
sns.scatterplot(data=data,x='Humidity9am',y='Humidity3pm',hue='RainTomorrow');
代码输出:
5、气温对下雨的影响
python">plt.figure(figsize=(8,6))
sns.scatterplot(x='MaxTemp', y='MinTemp', data=data, hue='RainTomorrow');
代码输出:
三、数据预处理
python"># 每列中缺失数据的百分比
data.isnull().sum()/data.shape[0]*100
代码输出:
python">Location 0.000000
MinTemp 1.020899
MaxTemp 0.866905
Rainfall 2.241853
Evaporation 43.166506
Sunshine 48.009762
WindGustDir 7.098859
WindGustSpeed 7.055548
WindDir9am 7.263853
WindDir3pm 2.906641
WindSpeed9am 1.214767
WindSpeed3pm 2.105046
Humidity9am 1.824557
Humidity3pm 3.098446
Pressure9am 10.356799
Pressure3pm 10.331363
Cloud9am 38.421559
Cloud3pm 40.807095
Temp9am 1.214767
Temp3pm 2.481094
RainToday 2.241853
RainTomorrow 2.245978
year 0.000000
month 0.000000
day 0.000000
dtype: float64
python"># 在该列中随机选择数进行填充
lst=['Evaporation','Sunshine','Cloud9am','Cloud3pm']
for col in lst:fill_list = data[col].dropna()data[col] = data[col].fillna(pd.Series(np.random.choice(fill_list, size=len(data.index))))s = (data.dtypes == "object")
object_cols = list(s[s].index)
object_cols
代码输出:
python">['Location','WindGustDir','WindDir9am','WindDir3pm','RainToday','RainTomorrow']
python"># inplace=True:直接修改原对象,不创建副本
# data[i].mode()[0] 返回频率出现最高的选项,众数for i in object_cols:data[i].fillna(data[i].mode()[0], inplace=True)t = (data.dtypes == "float64")
num_cols = list(t[t].index)
num_cols
代码输出:
python">['MinTemp','MaxTemp','Rainfall','Evaporation','Sunshine','WindGustSpeed','WindSpeed9am','WindSpeed3pm','Humidity9am','Humidity3pm','Pressure9am','Pressure3pm','Cloud9am','Cloud3pm','Temp9am','Temp3pm']
中位数进行填充:
python"># .median(), 中位数
for i in num_cols:data[i].fillna(data[i].median(), inplace=True)data.isnull().sum()
代码输出:
python">Date 0
Location 0
MinTemp 0
MaxTemp 0
Rainfall 0
Evaporation 0
Sunshine 0
WindGustDir 10326
WindGustSpeed 0
WindDir9am 10566
WindDir3pm 4228
WindSpeed9am 0
WindSpeed3pm 0
Humidity9am 0
Humidity3pm 0
Pressure9am 0
Pressure3pm 0
Cloud9am 0
Cloud3pm 0
Temp9am 0
Temp3pm 0
RainToday 3261
RainTomorrow 3267
dtype: int64
2、构建数据集
python">from sklearn.preprocessing import LabelEncoderlabel_encoder = LabelEncoder()
for i in object_cols:data[i] = label_encoder.fit_transform(data[i])X = data.drop(['RainTomorrow','day'],axis=1).values
y = data['RainTomorrow'].valuesX_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=101)scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
3、网络构建与训练
python">import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np# 检查 GPU 是否可用
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)# 1. 定义模型结构
class MyModel(nn.Module):def __init__(self, input_dim):super(MyModel, self).__init__()self.fc1 = nn.Linear(input_dim, 24) # 第一层:输入 -> 24self.fc2 = nn.Linear(24, 18) # 第二层:24 -> 18self.fc3 = nn.Linear(18, 23) # 第三层:18 -> 23self.dropout1 = nn.Dropout(0.5) # dropout 0.5self.fc4 = nn.Linear(23, 12) # 第四层:23 -> 12self.dropout2 = nn.Dropout(0.2) # dropout 0.2self.fc5 = nn.Linear(12, 1) # 输出层:12 -> 1def forward(self, x):x = torch.tanh(self.fc1(x)) # 使用 tanh 激活x = torch.tanh(self.fc2(x))x = torch.tanh(self.fc3(x))x = self.dropout1(x)x = torch.tanh(self.fc4(x))x = self.dropout2(x)x = torch.sigmoid(self.fc5(x)) # 输出层用 sigmoid,输出概率return x# 2. 数据准备
# 假设 X_train, y_train, X_test, y_test 是 NumPy 数组
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1) # 调整为 (N,1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)# 创建数据集和 DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_test_tensor, y_test_tensor)batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)# 3. 实例化模型,并移动到 GPU(如果可用)
input_dim = X_train.shape[1] # 特征个数
model = MyModel(input_dim).to(device)# 4. 定义优化器和损失函数
optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCELoss() # 二元交叉熵损失# 5. 设置早停参数
patience = 25 # 最多等待 25 个 epoch
min_delta = 0.001 # 验证损失改善阈值
best_val_loss = float('inf')
patience_counter = 0import matplotlib.pyplot as plt# 初始化记录指标的列表
train_loss_history = []
val_loss_history = []
train_acc_history = []
val_acc_history = []num_epochs = 10 # 或你实际的训练周期数for epoch in range(num_epochs):model.train()train_loss = 0.0correct_train = 0total_train = 0# 遍历训练集for batch_x, batch_y in train_loader:# 将数据移动到设备(GPU/CPU)batch_x = batch_x.to(device)batch_y = batch_y.to(device)optimizer.zero_grad() # 清除梯度outputs = model(batch_x) # 前向传播loss = criterion(outputs, batch_y) # 计算损失loss.backward() # 反向传播optimizer.step() # 更新参数train_loss += loss.item() * batch_x.size(0)# 二分类:将输出概率大于 0.5 的判为 1,否则为 0preds = (outputs >= 0.5).float()correct_train += (preds == batch_y).sum().item()total_train += batch_y.size(0)epoch_train_loss = train_loss / total_trainepoch_train_acc = correct_train / total_traintrain_loss_history.append(epoch_train_loss)train_acc_history.append(epoch_train_acc)# 验证阶段model.eval()val_loss = 0.0correct_val = 0total_val = 0with torch.no_grad():for batch_x, batch_y in val_loader:batch_x = batch_x.to(device)batch_y = batch_y.to(device)outputs = model(batch_x)loss = criterion(outputs, batch_y)val_loss += loss.item() * batch_x.size(0)preds = (outputs >= 0.5).float()correct_val += (preds == batch_y).sum().item()total_val += batch_y.size(0)epoch_val_loss = val_loss / total_valepoch_val_acc = correct_val / total_valval_loss_history.append(epoch_val_loss)val_acc_history.append(epoch_val_acc)print(f"Epoch {epoch+1}/{num_epochs}, Training Loss: {epoch_train_loss:.4f}, Training Acc: {epoch_train_acc:.4f}, "f"Validation Loss: {epoch_val_loss:.4f}, Validation Acc: {epoch_val_acc:.4f}")# 绘图:训练和验证的准确率与损失
epochs_range = range(1, num_epochs + 1)plt.figure(figsize=(14, 4))plt.subplot(1, 2, 1)
plt.plot(epochs_range, train_acc_history, label='Training Accuracy')
plt.plot(epochs_range, val_acc_history, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')plt.subplot(1, 2, 2)
plt.plot(epochs_range, train_loss_history, label='Training Loss')
plt.plot(epochs_range, val_loss_history, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()
代码输出:
python">Using device: cuda
Epoch 1/10, Training Loss: 0.4736, Training Acc: 0.7941, Validation Loss: 0.3953, Validation Acc: 0.8288
Epoch 2/10, Training Loss: 0.3899, Training Acc: 0.8356, Validation Loss: 0.3775, Validation Acc: 0.8381
Epoch 3/10, Training Loss: 0.3836, Training Acc: 0.8391, Validation Loss: 0.3746, Validation Acc: 0.8385
Epoch 4/10, Training Loss: 0.3812, Training Acc: 0.8400, Validation Loss: 0.3730, Validation Acc: 0.8392
Epoch 5/10, Training Loss: 0.3797, Training Acc: 0.8392, Validation Loss: 0.3719, Validation Acc: 0.8398
Epoch 6/10, Training Loss: 0.3776, Training Acc: 0.8400, Validation Loss: 0.3711, Validation Acc: 0.8404
Epoch 7/10, Training Loss: 0.3776, Training Acc: 0.8398, Validation Loss: 0.3707, Validation Acc: 0.8405
Epoch 8/10, Training Loss: 0.3771, Training Acc: 0.8400, Validation Loss: 0.3715, Validation Acc: 0.8400
Epoch 9/10, Training Loss: 0.3769, Training Acc: 0.8402, Validation Loss: 0.3708, Validation Acc: 0.8405
Epoch 10/10, Training Loss: 0.3770, Training Acc: 0.8395, Validation Loss: 0.3698, Validation Acc: 0.8398<Figure size 1400x400 with 2 Axes>