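## One Script for a Competition: SMP WEIBO 2016 ##

Preface: Building accurate user profiles is a fundamental problem in social network analysis. This post shares a few small ideas on extracting features from the Weibo user network; criticism is welcome.

Data source: SMP WEIBO 2016

Task: analyze the relationships between users and the content of their posts, and cluster users with unsupervised and supervised methods.

----------

Part 1: Filter by source, that is, decide whether the content a user posts is spam.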

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from time import time
%matplotlib inline

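Training data fields:

uid: unique user identifier, made up of digits

retweet count: number of retweets, numeric

review count: number of comments, numeric

source: where the post came from, text

time: creation time, timestamp text (currently in two formats, yyyy-MM-dd HH:mm:ss and yyyy-MM-dd HH:mm)

content: text of the post (may contain @-mentions, emoticon markup, and so on)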
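Because the time field comes in two formats, any later use of it has to try both. A minimal sketch using only the standard library; the helper name `parse_weibo_time` is made up for illustration and is not part of the original script:

```python
from datetime import datetime

# The two timestamp layouts described for the `time` field.
TIME_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M")

def parse_weibo_time(text):
    """Hypothetical helper: return a datetime, or None if neither format fits."""
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next layout
    return None

print(parse_weibo_time("2015-11-10 09:13:35"))
print(parse_weibo_time("2016-01-07 13:14"))
```

Returning None instead of raising keeps a single malformed row from stopping a pass over the whole file; whether that is the right trade-off depends on how clean the data turns out to be.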

with open('train/train/train_status.txt', 'r') as f:
    lines = f.readlines()
# Split at most 5 times so that commas inside the post content are not cut off
# (assumes the first five fields never contain a comma).
status = [line.strip().split(',', 5) for line in lines]
tr_status = pd.DataFrame(status).loc[:, :5]
tr_status.columns = ['uid', 'retweet', 'review', 'source', 'time', 'content']
tr_status.to_csv('train_status.csv', index=False)
display(tr_status.head())

with open('valid/valid_status.txt', 'r') as f:
    lines = f.readlines()
status = [line.strip().split(',', 5) for line in lines]
v_status = pd.DataFrame(status).loc[:, :5]
v_status.columns = ['uid', 'retweet', 'review', 'source', 'time', 'content']
v_status.to_csv('valid_status.csv', index=False)
display(v_status.head())


	uid	review	source	time	content
0	1103763581	0	Arduino中文社区	2016-01-07 13:14	我用微博在 Arduino 中文社区上登录啦！ Arduino 中文社区 …
1	1103763581	2	荣耀6 Plus	2015-11-10 09:13:35	很长时间没有上微博看看了，估计都快被忘记了吧！无锡·新安 …
2	1103763581	0	荣耀6 Plus	2015-07-26 20:07:57	# 农村现状 # 20 年前还是个小孩，一到瓜果成熟的季节，三五…
3	1103763581	0	荣耀6 Plus	2015-06-22 18:39:47	我分享了 @环球时报的文章社评：法国出租与专车司机冲突的启示
4	1103763581	6	荣耀6 Plus	2015-06-10 07:37:22	好久没上微博了，不知道大家还记得我不？梁家巷显示地图


	uid	source	time	content
0	1753249671	iPhone客户端	2016-05-06 10:01	扑通扑通我的心跳！久久不能平 …… 深呼吸、深呼吸、深呼吸！
1	1753249671	iPhone客户端	2016-04-15 01:19	失眠的夜晚，夜慢慢慢慢原图
2	1753249671	iPhone客户端	2016-03-29 19:15	贱人就是矫情、奇葩朵朵开的一天极品领导同事，人生不如意之…
3	1753249671	iPhone客户端	2016-01-25 22:53	# 买家反馈语录 # 来自小伙伴们对牛板筋的好评，还在等待观望的…
4	1753249671	iPhone客户端	2016-01-06 19:51	童言无忌：朋友女儿今年小学三年级，看到她妈妈朋友圈里我发的 …

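Labeled user fields:

uid: unique user identifier, made up of digits

gender: user gender; m means male, f means female, None means the value is missing

birthday: user's year of birth; None means the value is missing

location: user's region; some users have both province and city, some have only the province, and None means the value is missing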

with open('train/train/train_labels.txt','r') as f:labels = f.readlines()

labels[0]

'1832205887||m||1990||四川 None\n'
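The sample line shows the raw layout: four fields separated by `||`, with the literal text `None` standing in for missing values. A small sketch of parsing one such line by hand; `parse_label_line` is a hypothetical helper, and a location such as `四川 None` (province known, city missing) is deliberately left as a single string rather than guessed apart:

```python
def parse_label_line(line):
    """Hypothetical helper: split one '||'-delimited label line into a dict."""
    uid, gender, birthday, location = line.rstrip("\n").split("||", 3)

    def to_missing(value):
        # The literal string 'None' marks a missing value in the label file.
        return None if value == "None" else value

    return {
        "uid": int(uid),
        "gender": to_missing(gender),
        "birthday": int(birthday) if birthday != "None" else None,
        "location": to_missing(location),
    }

record = parse_label_line("1832205887||m||1990||四川 None\n")
print(record)
```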

# uids of the labeled training users (read from labels, not from the status lines)
userHasLabel = [x.split("||")[0].strip() for x in labels]

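import pandas as pd
# '||' is a multi-character separator, so pandas treats it as a regex;
# engine='python' selects the parser that supports it and avoids the ParserWarning
t_labels = pd.read_csv('train/train/train_labels.txt', sep="\\|\\|", header=None, engine='python')
t_labels.columns = ['uid','gender','birthday','location']
v_labels = pd.read_csv('valid/valid_labels.txt', sep="\\|\\|", header=None, engine='python')
v_labels.columns = ['uid','gender','birthday','location']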
print(t_labels.head())
print(v_labels.head())
labeled_nodes = pd.concat([t_labels,v_labels])
labeled_nodes.to_csv('labeled_nodes.csv',index=False,encoding='gbk')

	uid	gender	birthday	location
0	1832205887	m	1990	四川 None
1	1737245804	m	1982	吉林长春
2	2157991124	m	1976	四川成都
3	2758890931	f	1983	黑龙江哈尔滨
4	1802646764	m	1981	湖南长沙

	uid	gender	birthday	location
0	1743152063	m	1984	广东广州
1	1073390982	m	1983	北京朝阳区
2	2137599524	m	1990	湖北黄石
3	2279196033	f	1987	江苏南京
4	1039584863	m	1985	广东深圳

df = pd.concat([tr_status, v_status])
# Cast uid to int so that it matches the integer uid column in labeled_nodes
df['uid'] = df['uid'].astype(int)
df = df.merge(labeled_nodes, on='uid')
display(df.head(10))
display(df.shape)


	uid	retweet	review	source	time	content	gender	birthday	location
0	1103763581	0	0	Arduino中文社区	2016-01-07 13:14	我用微博在 Arduino 中文社区上登录啦！ Arduino 中文社区 …	m	1986	四川成都
1	1103763581	0	2	荣耀6 Plus	2015-11-10 09:13:35	很长时间没有上微博看看了，估计都快被忘记了吧！无锡·新安 …	m	1986	四川成都
2	1103763581	0	0	荣耀6 Plus	2015-07-26 20:07:57	# 农村现状 # 20 年前还是个小孩，一到瓜果成熟的季节，三五…	m	1986	四川成都
3	1103763581	0	0	荣耀6 Plus	2015-06-22 18:39:47	我分享了 @环球时报的文章社评：法国出租与专车司机冲突的启示	m	1986	四川成都
4	1103763581	0	6	荣耀6 Plus	2015-06-10 07:37:22	好久没上微博了，不知道大家还记得我不？梁家巷显示地图	m	1986	四川成都
5	1103763581	0	0	世界3D打印	2015-06-05 08:08:00	【太尔时代助力 “ 太空制造 ” ，挑战微重力环境下 3D 打印】【分…	m	1986	四川成都
6	1103763581	0	0	荣耀6 Plus	2015-05-27 21:29:57	愤怒的小鸟存钱 [ 钱 ] 罐 http://t.cn/z8dS7zS 显示地…	m	1986	四川成都
7	1103763581	3	1	荣耀6 Plus	2015-05-22 08:45:56	成都科技爱好者的盛宴。太尔时代 UP 系列机器将在两场活动中展出…	m	1986	四川成都
8	1103763581	0	0	荣耀6 Plus	2015-05-05 19:10:32	最近身体不适，准备适当休整。如工作事宜请拨打办公电话 028-6…	m	1986	四川成都
9	1103763581	0	0	百度分享	2015-04-30 10:43:52	3D 打印机公司太尔时代上榜福布斯中国潜力企业 100 强 -3D 打印资…	m	1986	四川成都

(331634, 9)

labeled_id = list(t_labels['uid']) + list(v_labels['uid'])

print(len(labeled_id),len(set(labeled_id)))

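4467 4467

In each line of the links files, every user from the second to the last is a follower of the first user. The next step keeps only the labeled users that appear in these links. @output: nodelist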
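To make the link format concrete, here is a sketch that turns such lines into explicit (follower, followee) pairs. It assumes space-separated uids, as the script's own `split(' ')` does; the helper name `parse_links` and the sample uids are made up:

```python
def parse_links(lines):
    """Hypothetical helper: turn link lines into (follower, followee) pairs."""
    edges = []
    for line in lines:
        ids = line.split()
        if len(ids) < 2:
            continue  # a user listed without followers contributes no edge
        followee, followers = ids[0], ids[1:]
        edges.extend((follower, followee) for follower in followers)
    return edges

sample = ["100 200 300\n", "200 300\n", "400\n"]  # made-up uids
edges = parse_links(sample)
print(edges)
```

The script itself only needs the flat list of uids to test membership, so it never builds these pairs; an edge list like this becomes useful once graph features are wanted.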

with open('train/train/train_links.txt','r') as f:t_links = f.readlines()
with open('valid/valid_links.txt','r') as f:v_links = f.readlines()
with open('test/test/test_links.txt','r') as f:te_links = f.readlines()

linklist=[]
for link in t_links:linklist += [str(x) for x in link.strip().split(' ')]
for link in v_links:linklist += [str(x) for x in link.strip().split(' ')]
for link in te_links:linklist += [str(x) for x in link.strip().split(' ')]

print(len(linklist))
print(len(set(linklist)))

721388
308787

# Set membership is O(1); list.index would scan the whole list for every uid
linkset = set(linklist)
nodelist = [node for node in labeled_id if str(node) in linkset]
print(len(nodelist))

2476

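# Keep only rows whose uid is in nodelist; unlike .loc with a list, isin() does not
# raise when a labeled user has no posts in df
df = df[df['uid'].isin(nodelist)].set_index('uid')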
display(df.shape)
df.to_csv('labeled_linked_fulltable.csv')

(191119, 8)

At this point the usable nodes have been selected: those that both have labels and appear in the network.

display(df.shape)
import re 
patt = re.compile('努比亚')
res = filter(patt.match, list(df['source'].drop_duplicates()))
list(res)
# re.findall(patt,list(df['source'].drop_duplicates()))
# list(df['source'].drop_duplicates()).index('努比亚Android')

(191119, 9)
['努比亚智能手机']

import pandas as pd
#df= pd.read_csv('labeled_linked_fulltable.csv')
# Take a copy so that later assignments do not trigger SettingWithCopyWarning
df1 = df[['source','location']].copy()
grouped = df1.groupby('source')['location']
# diff: number of distinct locations per source
diffcount = grouped.nunique().to_frame('diff')
# count: number of rows (with a non-null location) per source
count = grouped.count().to_frame('count')
source = diffcount.join(count)
display(source.head())



	diff	count
source
努比亚Android	1	1
0519微趣测试	1	3
0元赢荣耀畅玩5C	2	2
100+V6手机	3	8
100+个性定制手机	1	2

plt.subplot(211)
source.sort_values('count',ascending=False).head(100)['count'].plot(kind='line',figsize=(30,10),title='count long tail distribution' )
plt.subplot(212)
source.sort_values('count',ascending=False).head(100)['diff'].plot(kind='line',figsize=(30,10))

def hebing(group):
    # "hebing" means "merge": join all posts from one source into a single document
    # display(list(group.content.values))
    return ' '.join([str(x) for x in list(group.content.values)])

content_merge = df.groupby('source')[['content']].apply(hebing).to_frame('content')
# Optionally filter out rare sources to see how the topics change
content_merge = content_merge.join(source,how='left').reset_index()
#content_merge = content_merge.join(source)[content_merge['count']>10]
display(content_merge.head())

tokenizer = lambda s:s.split(' ')
tfv = TfidfVectorizer(tokenizer=tokenizer)
data = tfv.fit_transform(content_merge.content)

# Term-frequency counts
#max_df=1, min_df=1
tfc = CountVectorizer(max_features=10000,tokenizer=tokenizer)
# Fit on the content column; iterating over the DataFrame itself would yield column names
tf = tfc.fit_transform(content_merge.content)


	source	content	diff	count
0	努比亚Android	# 苏宁入股努比亚 # 我用的就是努比亚祝努比亚能越办越好越办 …	1	1
1	0519微趣测试	/ 偷笑 OMG ，原来有 209 人暗恋着我，太不可思议了！快来看…	1	3
2	0元赢荣耀畅玩5C	免费狂抽 55 台 # 荣耀畅玩 5C # ？！有这好事还不让微博 …	2	2
3	100+V6手机	牛刀说的有点道理，不过现在很多放高利贷的，特别是县城或者是…	3	8
4	100+个性定制手机	新版微博客户端，好听，好看，更好玩！自定义个人封面；音、…	1	2
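The table shows the idea behind this step: every source becomes one long document, and TF-IDF is computed over those documents rather than over individual posts. A toy sketch of the same pattern with made-up data, assuming pandas and scikit-learn are installed; the source names and posts are invented:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up posts; the real script reads them from the status files.
posts = pd.DataFrame({
    "source": ["phone", "phone", "quiz app"],
    "content": ["good morning", "good night", "who likes me quiz"],
})

# One document per source: concatenate all posts from the same source.
docs = posts.groupby("source")["content"].apply(" ".join)

# Split on single spaces, as the script does. token_pattern=None silences the
# warning scikit-learn emits when a custom tokenizer is supplied.
tfv = TfidfVectorizer(tokenizer=lambda s: s.split(" "), token_pattern=None)
matrix = tfv.fit_transform(docs)

print(docs.to_dict())
print(matrix.shape)  # (number of sources, vocabulary size)
```

Splitting on spaces only works as a tokenizer when the text is already space-delimited, which is an assumption about how the competition data was prepared.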

import pickle
with open('tfidf_vec.pkl','wb') as f:pickle.dump(data,f)

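from time import time
import pickle
with open('tfidf_vec.pkl','rb') as f:data = pickle.load(f)
#print(data[0])
print('Dimensionality reduction: automatically reduce to a suitable number of dimensions')
print('Extract the important features')
from sklearn.decomposition import NMF, LatentDirichletAllocation
t0 = time()
n_components = 2
nmf = NMF(n_components=n_components).fit(data)
print("done in %0.3fs." % (time() - t0))

Dimensionality reduction: automatically reduce to a suitable number of dimensions
Extract the important features
done in 1.477s.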

with open('nmf_model.pkl','wb') as f:pickle.dump(nmf,f)
print(nmf.reconstruction_err_)
topic = nmf.transform(data)
print(topic)
# if topic.shape[1] ==2 :
#     plt.figure(figsize=(30,30))
#     plt.scatter(x=topic[:,0],y=topic[:,1],c=content_merge['count'])

57.0115343243
[[ 0.02043113  0.        ]
 [ 0.09281784  0.        ]
 [ 0.03913675  0.        ]
 ...,
 [ 0.05743546  0.03386496]
 [ 0.07525523  0.        ]
 [ 0.01523509  0.        ]]
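The printed array has one row per source and one column per NMF component. As a rough illustration of what the two factors mean, here is a sketch on a tiny made-up matrix; the values, `init`, `random_state`, and `max_iter` are choices made for this example, not settings taken from the script:

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up non-negative "document x term" matrix with two groups of terms.
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

nmf_demo = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = nmf_demo.fit_transform(X)  # per-document component weights, like `topic` above
H = nmf_demo.components_       # per-component term weights

print(W.shape, H.shape)
print(np.round(W @ H, 1))            # approximate reconstruction of X
print(nmf_demo.reconstruction_err_)  # Frobenius norm of X - W @ H by default
```

With only two components the reconstruction is necessarily lossy, which is why `reconstruction_err_` is worth checking before reading much into a 2-D scatter plot of the weights.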

from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=2)
gmm.fit(topic)
pred = gmm.predict(topic)
content_merge.loc[:,'pred'] = pred

from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(data)
pred = km.predict(data)
content_merge.loc[:,'pred'] = pred
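The two cells above try two clusterers in turn, and the second assignment to `pred` overwrites the first. The practical difference is that `GaussianMixture` also yields soft membership probabilities, while `KMeans` only gives hard labels. A sketch on made-up 2-D points standing in for the NMF weights; the blob positions and `random_state` values are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two well-separated made-up blobs of 50 points each.
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.05, size=(50, 2)),
    rng.normal(loc=[1.0, 1.0], scale=0.05, size=(50, 2)),
])

gmm_demo = GaussianMixture(n_components=2, random_state=0).fit(points)
gmm_pred = gmm_demo.predict(points)
gmm_prob = gmm_demo.predict_proba(points)  # one probability per component

km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
km_pred = km_demo.predict(points)

print(np.bincount(gmm_pred), np.bincount(km_pred))
print(gmm_prob[:2].round(3))
```

Cluster ids are arbitrary in both models, so cluster 0 from one run or one model need not correspond to cluster 0 from another.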

label = ['荣耀6 Plus','iPhone客户端','虾米音乐移动版','爱相机','华住酒店App','分享按钮']
# The hand-picked list above is immediately replaced by the first 100 sources in cluster 0
label = content_merge[content_merge.pred==0].head(100).source
# Font for rendering Chinese text
import matplotlib.font_manager
zhfont = matplotlib.font_manager.FontProperties(fname='/home/ll/.fonts/NotoSansMonoCJKsc-Regular.otf')
# plt.rcParams['font.sans-serif'] = ['Source Han Sans TW', 'sans-serif']
# plt.rc('font', family='Noto Sans Mono CJK SC', size=13)
plt.figure(figsize=(20,20))
plt.scatter(x=topic[:,0],y=topic[:,1],c=pred)
for l in label:
    # content_merge has a default integer index, so the index label is also the row position in topic
    pindex = content_merge[content_merge['source']==l].index[0]
    plt.annotate(l,xy=(topic[pindex,0],topic[pindex,1]),fontproperties=zhfont)


Distribution of sources