Installation:
pip install scikit-learn

Data representation:
In Scikit-learn, data is usually represented as NumPy arrays or Pandas DataFrames. The feature data (X) is typically a two-dimensional array in which each row is a sample and each column is a feature. The target data (y) is typically a one-dimensional array containing the label or target value for each sample.
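A minimal illustration of this layout (the values here are made up):

```python
import numpy as np

# Feature matrix X: 3 samples (rows), 2 features (columns)
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9]])

# Target vector y: one label per sample
y = np.array([0, 0, 1])

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)
```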
Data preprocessing tools:
- Standardization: StandardScaler
- Normalization: MinMaxScaler
- Missing-value imputation: SimpleImputer
- Encoding: OneHotEncoder or LabelEncoder
- Feature selection: SelectKBest
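The encoding and feature-selection tools in this list have no example below, so here is a brief sketch of both (the color values and the numeric data are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif

# One-hot encode a categorical column
colors = np.array([['red'], ['green'], ['blue'], ['green']])
encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense
encoded = encoder.fit_transform(colors).toarray()
print(encoded)  # one column per category (blue, green, red)

# Keep the 2 most informative of 3 features, scored by the ANOVA F-test
X = np.array([[1.0, 10.0, 0.1],
              [2.0, 20.0, 0.2],
              [3.0, 10.0, 0.1],
              [4.0, 20.0, 0.2]])
y = np.array([0, 0, 1, 1])
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (4, 2)
```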
- Data cleaning
Covers handling missing values, duplicate rows, and outliers.

import pandas as pd
import numpy as np

# Create a DataFrame containing missing values, duplicates, and outliers
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'ZZZ', 'Alice'],
        'Age': [25, 30, 35, 25, 30, 35, 25, 30, -1, 25],
        'Score': [88, 92, np.nan, 88, 92, np.nan, 88, 92, 1000, 88]}

# Create the DataFrame
df = pd.DataFrame(data)

# Show the original DataFrame
print("Original DataFrame:")
print(df)

# Handle missing values: fill NaN in the numeric column with its mean
df['Score'] = df['Score'].fillna(df['Score'].mean())

# Drop duplicate rows
df = df.drop_duplicates()

# Handle outliers: an age cannot be negative, so replace -1 with the column median
median_age = df['Age'].median()
df['Age'] = df['Age'].replace(to_replace=-1, value=median_age)

# Handle outliers: a score cannot exceed 100, so cap scores at the largest valid value
# (note: taking df['Score'].max() directly would return the outlier itself)
max_score = df.loc[df['Score'] <= 100, 'Score'].max()
df['Score'] = df['Score'].where(df['Score'] <= 100, max_score)

# Show the cleaned DataFrame
print("\nCleaned DataFrame:")
print(df)
- Feature extraction and transformation
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
text_data = ["hello world", "hello everyone", "world of programming"]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Transform the text into a term-frequency matrix
X = vectorizer.fit_transform(text_data)

# toarray() converts the sparse matrix into a regular (dense) NumPy array
print(X.toarray())
- Standardization and normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Example data
data = [[1, 2], [2, 3], [3, 4]]

# Standardization (zero mean, unit variance)
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)

# Normalization (rescale to the [0, 1] range)
min_max_scaler = MinMaxScaler()
normalized_data = min_max_scaler.fit_transform(data)

print("Standardized data:", standardized_data)
print("Normalized data:", normalized_data)
- Missing-value handling: SimpleImputer fills in missing values
from sklearn.impute import SimpleImputer
import numpy as np

# Example data
data = [[1, 2], [np.nan, 3], [7, 6]]

# Initialize SimpleImputer to fill missing values with the column mean
imputer = SimpleImputer(strategy='mean')

# Fill the missing values
imputed_data = imputer.fit_transform(data)

print(imputed_data)

- Dataset splitting
train_test_split is a commonly used function that splits a dataset into a training set and a test set.
from sklearn.model_selection import train_test_split

# Example data
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 1, 0, 1]

# Split the dataset: 75% for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print("Training data:", X_train, y_train)
print("Testing data:", X_test, y_test)
4. Supervised Learning
4.1 Linear Models
Linear models assume that the relationship between the features and the target can be represented by a straight line (in two dimensions) or a hyperplane (in higher dimensions). They mainly include linear regression (for continuous targets) and logistic regression (for classification targets).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Example dataset
# Feature matrix X
X = np.array([[800], [1200], [1600], [2000], [2400]])
# Target vector y
y = np.array([150000, 200000, 250000, 300000, 350000])

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Compute the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean squared error: {mse}")

# Print the training data
print("Training data:")
print("Feature matrix X_train:")
print(X_train)
print("Target vector y_train:")
print(y_train)

# Print the test data
print("Test data:")
print("Feature matrix X_test:")
print(X_test)
print("Target vector y_test:")
print(y_test)
print("Predictions y_pred:")
print(y_pred)
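The section above also mentions logistic regression for classification targets; here is a minimal sketch on made-up one-dimensional data where the label flips around x = 5:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up binary classification data: one feature, label flips around x = 5
X = np.array([[1], [2], [3], [4], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

Unlike LinearRegression, the model outputs class labels (and class probabilities via predict_proba) rather than continuous values.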
4.2 Support Vector Machines
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Generate a synthetic dataset with the desired number of features
X, y = make_classification(n_samples=100, n_features=4, n_classes=2,
                           n_informative=2, n_redundant=0, n_repeated=0,
                           random_state=42)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create an SVM classifier with a linear kernel
model = SVC(kernel='linear')

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Print the classification report
print(classification_report(y_test, y_pred))
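The example above uses a linear kernel; SVC also supports nonlinear kernels such as 'rbf'. A brief sketch on the same kind of generated data (the C and gamma values here are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Generate the same style of synthetic data as above
X, y = make_classification(n_samples=100, n_features=4, n_classes=2,
                           n_informative=2, n_redundant=0, n_repeated=0,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features (especially important for RBF kernels)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# RBF kernel; C and gamma would normally be tuned via cross-validation
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
print("Test accuracy:", acc)
```

In practice, GridSearchCV is often used to search over C, gamma, and the kernel type together.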