Note: this article is reproduced from Venus AI, a professional AI community.
For more AI material, see the original site ([www.aideeplearning.cn])
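Project Background
With the rapid growth of digital communication, spam SMS has become a widespread nuisance. These unsolicited messages disrupt daily life and may also carry the risk of scams and fraud, so identifying and filtering them effectively is essential.
Project Goals
The main goal of this project is to build a machine learning model that automatically and accurately separates spam SMS messages from legitimate ones. By training the model to recognize typical spam features, we can greatly reduce how often spam disturbs users and improve the safety and efficiency of communication.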
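Applications
- Email and messaging service providers: filter spam messages automatically, protecting users from unwanted interruptions and potential fraud.
- Enterprise communication: deploy the filter in internal messaging systems so that employees are not distracted by spam, improving productivity.
- Individual users: offer a tool or app that automatically identifies and filters spam messages in everyday use.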
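Dataset Details
The notion of "spam" covers many things: product and website advertisements, get-rich-quick schemes, chain letters, pornographic content, and so on.
The SMS Spam Collection is a set of labeled SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each labeled as either ham (legitimate) or spam.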
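Before training, it can help to check how the two labels are distributed, since spam is usually the minority class. The snippet below is a minimal sketch on a hypothetical five-row sample; the messages and the helper name `label_share` are made up for illustration and are not part of the original project.

```python
import pandas as pd

# Hypothetical toy sample; the real data is loaded from spam.csv later on.
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham", "ham"],
    "messages": [
        "see you at lunch",
        "ok thanks",
        "win a free prize now",
        "call me later",
        "on my way home",
    ],
})

def label_share(frame, column="label"):
    """Return the fraction of rows that carry each label."""
    return frame[column].value_counts(normalize=True)

shares = label_share(toy)
print(shares)
```

On the real dataset the same call would show how strongly ham outnumbers spam, which is worth knowing before interpreting accuracy figures.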
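Model Selection
To identify spam messages effectively, we considered the following machine learning algorithms:
- Logistic Regression: fast and effective classification; a suitable baseline model.
- Naive Bayes: performs well on text classification, particularly when messages are short.
- Support Vector Machine (SVC): suited to complex text data and able to handle high-dimensional feature spaces.
- Random Forest: a powerful ensemble method that can deliver accurate classifications.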
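The four candidates above can all be plugged into the same text pipeline, which makes a side-by-side comparison straightforward. The following is a rough sketch on a hypothetical eight-message corpus; the texts, labels, and the `train_scores` dictionary are illustrative assumptions, and the scores are training accuracy only, so they say nothing about real performance.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus, for illustration only.
texts = [
    "see you at lunch tomorrow",
    "can you call me later",
    "ok i will be home soon",
    "thanks for the help today",
    "win a free prize now",
    "free entry claim your cash prize",
    "urgent call now to claim free reward",
    "you have won cash text now",
]
labels = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]

candidates = {
    "LogisticRegression": LogisticRegression(),
    "MultinomialNB": MultinomialNB(),
    "SVC": SVC(C=3),
    "RandomForest": RandomForestClassifier(random_state=42),
}

train_scores = {}
for name, model in candidates.items():
    # Same feature extraction for every candidate; only the classifier changes.
    pipe = Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", model),
    ])
    pipe.fit(texts, labels)
    # Training accuracy on the toy corpus, not a real evaluation.
    train_scores[name] = pipe.score(texts, labels)

for name, score in train_scores.items():
    print(name, score)
```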
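Dependencies
During development we used the following Python libraries:
- pandas: data processing and analysis.
- numpy: numerical computation.
- nltk: natural language processing.
- re: regular expressions, used to clean the text data.
- sklearn (scikit-learn): machine learning algorithms and data preprocessing tools.
Implementation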
import pandas as pd
import re
from nltk.corpus import stopwords
Loading the Data
# If this raises UnicodeDecodeError, try pd.read_csv('spam.csv', encoding='latin-1')
df = pd.read_csv('spam.csv')
df.head()
| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |
# Keep only the useful columns (the first two)
df = df[['v2', 'v1']]
df = df.rename(columns={'v2': 'messages', 'v1': 'label'})
df.head()
| | messages | label |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | ham |
| 1 | Ok lar... Joking wif u oni... | ham |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 3 | U dun say so early hor... U c already then say... | ham |
| 4 | Nah I don't think he goes to usf, he lives aro... | ham |
Data Preprocessing
# Check for missing values
df.isnull().sum()
messages    0
label       0
dtype: int64
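# Requires the NLTK stop-word corpus; run nltk.download('stopwords') once beforehand
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Replace special characters with spaces
    text = re.sub(r'[^0-9a-zA-Z]', ' ', text)
    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text)
    # Remove stop words
    text = " ".join(word for word in text.split() if word not in STOPWORDS)
    return text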
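To see what each cleaning step does, here is a small standalone sketch of the same logic. It assumes a tiny hand-written stop-word set (`DEMO_STOPWORDS`, hypothetical) instead of the NLTK corpus so that it runs without any download; the function name `clean_text_demo` is likewise only for illustration.

```python
import re

# Hypothetical, tiny stop-word set; the article itself uses NLTK's English list.
DEMO_STOPWORDS = {"i", "he", "to", "don", "t", "here", "so", "a", "in"}

def clean_text_demo(text, stopwords=DEMO_STOPWORDS):
    # Convert to lowercase
    text = text.lower()
    # Replace every non-alphanumeric character with a space
    text = re.sub(r'[^0-9a-zA-Z]', ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Drop stop words and rejoin the remaining tokens
    return " ".join(word for word in text.split() if word not in stopwords)

cleaned = clean_text_demo("Ok lar... Joking wif u oni...")
print(cleaned)  # ok lar joking wif u oni

cleaned_apostrophe = clean_text_demo("I don't WIN a prize!")
print(cleaned_apostrophe)  # win prize
```

Note how the apostrophe in "don't" is replaced by a space first, leaving the fragments "don" and "t", which are then removed as stop words.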
# Clean the messages
df['clean_text'] = df['messages'].apply(clean_text)
df.head()
| | messages | label | clean_text |
|---|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | ham | go jurong point crazy available bugis n great ... |
| 1 | Ok lar... Joking wif u oni... | ham | ok lar joking wif u oni |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | spam | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | U dun say so early hor... U c already then say... | ham | u dun say early hor u c already say |
| 4 | Nah I don't think he goes to usf, he lives aro... | ham | nah think goes usf lives around though |
Separating Features and Labels
X = df['clean_text']
y = df['label']
Model Training
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def classify(model, X, y):
    # Stratified train/test split: 75% train, 25% test
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True, stratify=y)
    # Build and train the pipeline: token counts -> TF-IDF weights -> classifier
    pipeline_model = Pipeline([('vect', CountVectorizer()),
                               ('tfidf', TfidfTransformer()),
                               ('clf', model)])
    pipeline_model.fit(x_train, y_train)

    # Accuracy on the held-out test set, as a percentage
    print('Accuracy:', pipeline_model.score(x_test, y_test)*100)

    # cv_score = cross_val_score(model, X, y, cv=5)
    # print("CV Score:", np.mean(cv_score)*100)
    # Per-class precision, recall, and F1 on the test set
    y_pred = pipeline_model.predict(x_test)
    print(classification_report(y_test, y_pred))
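The cross-validation lines in classify are commented out. If they were enabled as written, the bare classifier would receive raw strings, which it cannot consume; one way around this is to cross-validate the whole pipeline so that vectorization happens inside each fold. The sketch below illustrates that idea on a hypothetical ten-message corpus with MultinomialNB; the texts and labels are invented, and the resulting score is not meaningful beyond showing the mechanics.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: five ham and five spam messages.
texts = [
    "see you at lunch tomorrow",
    "can you call me later",
    "ok i will be home soon",
    "thanks for the help today",
    "are we still meeting tonight",
    "win a free prize now",
    "free entry claim your cash prize",
    "urgent call now to claim free reward",
    "you have won cash text now",
    "claim your free ringtone now",
]
labels = ["ham"] * 5 + ["spam"] * 5

pipeline_model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Each fold refits the vectorizer on its own training part, so no
# vocabulary or IDF statistics leak in from the held-out messages.
cv_scores = cross_val_score(pipeline_model, texts, labels, cv=5)
print("CV Score:", np.mean(cv_scores) * 100)
```

Passing the pipeline rather than the classifier also keeps the evaluation consistent with how the model is trained in classify.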
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
classify(model, X, y)
Accuracy: 96.8413496051687
              precision    recall  f1-score   support
         ham       0.97      1.00      0.98      1206
        spam       0.99      0.77      0.87       187
    accuracy                           0.97      1393
   macro avg       0.98      0.88      0.92      1393
weighted avg       0.97      0.97      0.97      1393
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
classify(model, X, y)
Accuracy: 96.69777458722182
              precision    recall  f1-score   support
         ham       0.96      1.00      0.98      1206
        spam       1.00      0.75      0.86       187
    accuracy                           0.97      1393
   macro avg       0.98      0.88      0.92      1393
weighted avg       0.97      0.97      0.96      1393
from sklearn.svm import SVC
model = SVC(C=3)
classify(model, X, y)
Accuracy: 98.27709978463747
              precision    recall  f1-score   support
         ham       0.98      1.00      0.99      1206
        spam       1.00      0.87      0.93       187
    accuracy                           0.98      1393
   macro avg       0.99      0.94      0.96      1393
weighted avg       0.98      0.98      0.98      1393
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
classify(model, X, y)
Accuracy: 97.4156496769562
              precision    recall  f1-score   support
         ham       0.97      1.00      0.99      1206
        spam       1.00      0.81      0.89       187
    accuracy                           0.97      1393
   macro avg       0.99      0.90      0.94      1393
weighted avg       0.97      0.97      0.97      1393
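Across all four reports, spam precision is at or near 1.00 while spam recall ranges from 0.75 to 0.87: almost nothing flagged as spam is a mistake, but some spam slips through as ham. The sketch below shows how those two numbers come out of a confusion matrix, using hypothetical labels and predictions (ten messages, not the article's test set).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical ground truth and predictions: one spam message is missed.
y_true = ["ham"] * 6 + ["spam"] * 4
y_pred = ["ham"] * 6 + ["spam", "spam", "spam", "ham"]

# Rows are true classes, columns are predicted classes, in the given label order.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
tn, fp, fn, tp = cm.ravel()

spam_precision = tp / (tp + fp)  # of the messages flagged as spam, how many really are
spam_recall = tp / (tp + fn)     # of the real spam messages, how many were caught

print(cm)
print("spam precision:", spam_precision)
print("spam recall:", spam_recall)

# Cross-check against scikit-learn's own recall computation.
sk_recall = recall_score(y_true, y_pred, pos_label="spam")
```

In this toy case precision is 1.0 and recall is 0.75, the same pattern as in the reports above, where accuracy alone hides the missed spam because ham dominates the test set.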
Code and Dataset Download
For details, see the SMS spam detection project on VenusAI (aideeplearning.cn)