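ML/NB: Text classification and prediction on the news text dataset using a pure statistical method, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), the perceptron, and other algorithms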

news/2024/10/30 15:25:05/


 

 

 

 

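Contents

Text classification and prediction on the news text dataset using a pure statistical method, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), the perceptron, and other algorithms

Design approach

Output

Core code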


 

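Related articles
ML/NB: Text classification and prediction on the news text dataset using a pure statistical method, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), the perceptron, and other algorithms
ML/NB: Text classification and prediction on the news text dataset using a pure statistical method, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), the perceptron, and other algorithms (implementation)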

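Text classification and prediction on the news text dataset using a pure statistical method, kNN, naive Bayes (Gaussian / multivariate Bernoulli / multinomial), linear discriminant analysis (LDA), the perceptron, and other algorithms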

Design approach

 

Output

Dataset used in the code: https://download.csdn.net/download/qq_41185868/13757777

F:\Program Files\Python\Python36\lib\site-packages\gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1293 entries, 0 to 1292
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  1293 non-null   int64 
 1   content     1292 non-null   object
 2   id          1293 non-null   int64 
 3   tags        1293 non-null   object
 4   time        1293 non-null   object
 5   title       1293 non-null   object
dtypes: int64(2), object(4)
memory usage: 60.7+ KB
None
   Unnamed: 0                                            content  \
0           0   牵动人心的雄安新区规划细节内容和出台时间表敲定。日前,北京商报记者从业内获悉,京津冀协同发...   
1           1  去年以来,多个城市先后发布了多项楼市调控政策。在限购、限贷甚至限售的政策“组合拳”下,房地产...   
2           2  在今年中国国际自行车展上,上海凤凰自行车总裁王朝阳表示,共享单车的到来把我们打懵了,影响更是...   
3           3  25家上市银行迎来了一年一度的“分红季”,21世纪经济报道记者根据公开信息梳理发现,25家银...   
4           4  说起卷饼,大家其实并不陌生,这个来自中原的传统美食,发展至今也衍生出各种各样的种类,卷边的制...   

                    id                                  tags  \
0  6428905748545732865   ['财经', '白洋淀', '城市规划', '徐匡迪', '太行山']   
1  6428954136200855810   ['财经', '碧桂园', '万科集团', '投资', '广州恒大']   
2  6420576443738784002    ['财经', '自行车', '凤凰', '王朝阳', '汽车展览']   
3  6429007290541031681  ['财经', '银行', '工商银行', '兴业银行', '交通银行']   
4  6397481672254619905     ['财经', '小吃', '装修', '市场营销', '手工艺']   

                  time                   title  
0  2017-06-07 22:52:55  雄安新区规划“骨架”敲定,方案有望9月底出炉  
1  2017-06-08 08:01:13       “红五月”不红 房企资金链压力攀升  
2  2017-05-16 12:03:00      凤凰自行车总裁:共享单车把我们打懵了  
3  2017-06-08 07:00:00    25家银行分红季派出3536亿“大红包”  
4  2017-03-15 07:03:22      五万以下的小本餐饮项目,卷饼赚钱最稳  
chinese_pattern re.compile('[\\u4e00-\\u9fff]+')
Building prefix dict from F:\File_Jupyter\实用代码\naive_bayes(简单贝叶斯)\jieba_dict\dict.txt.big ...
Loading model from cache C:\Users\niu\AppData\Local\Temp\jieba.ue3752d4e13420d2dc6b66831a5a4ab13.cache
Loading model cost 1.326 seconds.
Prefix dict has been built succesfully.
dictionary
<class 'gensim.corpora.dictionary.Dictionary'> Dictionary(46351 unique tokens: ['一个', '一个个', '一举一动', '一些', '一体']...)
<class 'method'> <bound method Dictionary.doc2bow of <gensim.corpora.dictionary.Dictionary object at 0x000001BDC62291D0>>
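The two lines above show a gensim `Dictionary` holding 46351 unique tokens and its `doc2bow` method. As a rough illustration of what those two steps produce, here is a minimal pure-Python sketch that assigns each distinct token an integer id and converts a token list into sorted `(token_id, count)` pairs. The helper names `build_vocab` and `to_bow` are hypothetical, the id order (alphabetical) is an assumption, and the toy English tokens merely stand in for the jieba-segmented words; this is not gensim's implementation.

```python
from collections import Counter

def build_vocab(docs):
    """Map every distinct token to an integer id (sorted order is an assumption)."""
    tokens = sorted({tok for doc in docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens missing from vocab."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

# Toy token lists standing in for the segmented documents.
docs = [
    ["planning", "district", "planning", "details"],
    ["housing", "policy", "regulation", "policy"],
]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows)
```

Each document ends up as a short sparse list rather than a vector with one slot per vocabulary entry, which matters when the vocabulary has tens of thousands of tokens.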
F:\Program Files\Python\Python36\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)
   Unnamed: 0                                            content  \
0           0   牵动人心的雄安新区规划细节内容和出台时间表敲定。日前,北京商报记者从业内获悉,京津冀协同发...   
1           1  去年以来,多个城市先后发布了多项楼市调控政策。在限购、限贷甚至限售的政策“组合拳”下,房地产...   
2           2  在今年中国国际自行车展上,上海凤凰自行车总裁王朝阳表示,共享单车的到来把我们打懵了,影响更是...   

                    id                                 tags  \
0  6428905748545732865  ['财经', '白洋淀', '城市规划', '徐匡迪', '太行山']   
1  6428954136200855810  ['财经', '碧桂园', '万科集团', '投资', '广州恒大']   
2  6420576443738784002   ['财经', '自行车', '凤凰', '王朝阳', '汽车展览']   

                  time                   title  \
0  2017-06-07 22:52:55  雄安新区规划“骨架”敲定,方案有望9月底出炉   
1  2017-06-08 08:01:13       “红五月”不红 房企资金链压力攀升   
2  2017-05-16 12:03:00      凤凰自行车总裁:共享单车把我们打懵了   

                                           doc_words  \
0  [牵动人心, 雄安, 新区, 规划, 细节, 内容, 出台, 时间表, 敲定, 日前, 北京...   
1  [去年, 以来, 多个, 城市, 先后, 发布, 多项, 楼市, 调控, 政策, 限购, 限...   
2  [今年, 中国, 国际, 自行车, 展上, 上海, 凤凰, 自行车, 总裁, 王, 朝阳, ...   

                                              corpus  \
0  [(0, 6), (1, 1), (2, 1), (3, 3), (4, 2), (5, 2...   
1  [(0, 1), (3, 3), (13, 1), (17, 1), (41, 1), (5...   
2  [(15, 1), (53, 1), (167, 1), (262, 1), (396, 1...   

                                               tfidf  
0  [(0, 0.005554342859788116), (1, 0.007470250835...  
1  [(0, 0.002081356679198299), (3, 0.012288034179...  
2  [(15, 0.057457146244872616), (53, 0.0543395377...  
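The `tfidf` column above holds `(token_id, weight)` pairs derived from the `corpus` counts. The sketch below shows one common tf-idf formulation: term count times `log2(N / df)`, followed by L2 normalisation of each document. Whether this matches the exact weighting used by the original script is an assumption, and the function name `tfidf` and the toy input are illustrative only.

```python
import math

def tfidf(bows):
    """Weight (token_id, count) pairs by count * log2(N / df), then L2-normalise."""
    n_docs = len(bows)
    df = {}  # document frequency per token id
    for bow in bows:
        for token_id, _ in bow:
            df[token_id] = df.get(token_id, 0) + 1
    weighted = []
    for bow in bows:
        vec = [(tid, cnt * math.log2(n_docs / df[tid])) for tid, cnt in bow]
        # Tokens that occur in every document get weight 0 and are dropped.
        vec = [(tid, w) for tid, w in vec if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

bows = [[(0, 2), (1, 1)], [(0, 1), (2, 3)], [(1, 1), (2, 1)]]
vectors = tfidf(bows)
for vec in vectors:
    print(vec)
```

Normalising each document to unit length keeps long articles from dominating similarity scores simply because they contain more words.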
after abs 4.7683716e-07
foo: (1293, 1293)
dis2TSNE_Visual:  (1293, 2)
{'养生': 0, '科技': 1, '财经': 2, '游戏': 3, '育儿': 4, '汽车': 5}
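The dictionary above maps the six news categories to integer class labels. In the sample rows, the first entry of each `tags` list is the category (for example `'财经'`, which maps to 2), so a plausible way to derive `keyword_index` is to parse the `tags` string and look up its first element. The sketch below does that with the standard library; the helper name `keyword_index` and the "first tag is the category" rule are assumptions inferred from the printed rows, not confirmed by the original code.

```python
import ast

# Category-to-index mapping copied from the output above.
label_index = {'养生': 0, '科技': 1, '财经': 2, '游戏': 3, '育儿': 4, '汽车': 5}

def keyword_index(tags_field):
    """Parse a tags string such as "['财经', ...]" and return the index of its first tag."""
    tags = ast.literal_eval(tags_field)  # the column stores a list as text
    return label_index.get(tags[0]) if tags else None

sample = "['财经', '白洋淀', '城市规划', '徐匡迪', '太行山']"
print(keyword_index(sample))  # 2
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals, which is safer for text read from a file.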
data_frame.keyword_index: 1    379
2    287
5    283
4    148
3    141
0     55
Name: keyword_index, dtype: int64
   Unnamed: 0                                            content  \
0           0   牵动人心的雄安新区规划细节内容和出台时间表敲定。日前,北京商报记者从业内获悉,京津冀协同发...   
1           1  去年以来,多个城市先后发布了多项楼市调控政策。在限购、限贷甚至限售的政策“组合拳”下,房地产...   
2           2  在今年中国国际自行车展上,上海凤凰自行车总裁王朝阳表示,共享单车的到来把我们打懵了,影响更是...   

                    id                                 tags  \
0  6428905748545732865  ['财经', '白洋淀', '城市规划', '徐匡迪', '太行山']   
1  6428954136200855810  ['财经', '碧桂园', '万科集团', '投资', '广州恒大']   
2  6420576443738784002   ['财经', '自行车', '凤凰', '王朝阳', '汽车展览']   

                  time                   title  \
0  2017-06-07 22:52:55  雄安新区规划“骨架”敲定,方案有望9月底出炉   
1  2017-06-08 08:01:13       “红五月”不红 房企资金链压力攀升   
2  2017-05-16 12:03:00      凤凰自行车总裁:共享单车把我们打懵了   

                                           doc_words  \
0  [牵动人心, 雄安, 新区, 规划, 细节, 内容, 出台, 时间表, 敲定, 日前, 北京...   
1  [去年, 以来, 多个, 城市, 先后, 发布, 多项, 楼市, 调控, 政策, 限购, 限...   
2  [今年, 中国, 国际, 自行车, 展上, 上海, 凤凰, 自行车, 总裁, 王, 朝阳, ...   

                                              corpus  \
0  [(0, 6), (1, 1), (2, 1), (3, 3), (4, 2), (5, 2...   
1  [(0, 1), (3, 3), (13, 1), (17, 1), (41, 1), (5...   
2  [(15, 1), (53, 1), (167, 1), (262, 1), (396, 1...   

                                               tfidf   visual01   visual02  \
0  [(0, 0.005554342859788116), (1, 0.007470250835... -65.903542 -14.433964   
1  [(0, 0.002081356679198299), (3, 0.012288034179... -29.659267 -14.811647   
2  [(15, 0.057457146244872616), (53, 0.0543395377... -22.118195 -48.148167   

   keyword_index  
0              2  
1              2  
2              2  
Childcare,label_category_ID_pos.tfidf)[:20]: ['孩子', '家长', '教育', '学习', '男孩子', '成绩', '爸爸', '分享', '帮助', '方法', '小学', '数学', '交流', '男孩', '妈妈', '成长', '父母', '懂', '免费', '翼航']
Childcare,label_category_ID_neg.tfidf)[:20]: []
train_index MatrixSimilarity<646 docs, 46329 features>
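The `MatrixSimilarity` index above covers 646 training documents, which suggests the kNN step ranks training documents by similarity to a query document. As a hedged illustration of that idea, the sketch below computes cosine similarity between sparse vectors and takes a majority vote among the `k` nearest neighbours. The names `cosine` and `knn_predict`, the value `k=3`, and the toy vectors are all assumptions for illustration; the original script's kNN code is not shown in this section.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {token_id: weight}."""
    dot = sum(w * b.get(tid, 0.0) for tid, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote among the k training documents most similar to the query."""
    sims = [(cosine(query, vec), label)
            for vec, label in zip(train_vectors, train_labels)]
    top = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    return Counter(label for _, label in top).most_common(1)[0][0]

# Toy tf-idf style vectors; labels 2 and 5 stand for two of the six classes.
train_vectors = [{0: 1.0, 1: 0.5}, {0: 0.8, 2: 0.2}, {3: 1.0}, {3: 0.7, 4: 0.7}]
train_labels = [2, 2, 5, 5]
print(knn_predict({0: 0.9, 1: 0.1}, train_vectors, train_labels, k=3))
```

With unit-normalised tf-idf vectors the cosine reduces to a dot product, which is why a precomputed similarity index can answer such queries quickly.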
hot_words shape: 6 300
{0: {1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 18009, 7258, 4697, 7260, 16989, 3674, 91, 87, 16993, 18020, 616, 4714, 5228, 40044, 1646, 4720, 3185, 15986, 34928, 5236, 113, 34936, 6777, 126, 15999, 127, 4737, 40067, 5252, 643, 4739, 13444, 8840, 1157, 133, 4749, 3219, 10388, 17562, 5278, 46239, 5287, 3751, 167, 680, 6827, 4784, 16048, 16050, 180, 46260, 16054, 6839, 4792, 2743, 4789, 17083, 16060, 4790, 16062, 43200, 5315, 46276, 46279, 17098, 6860, 5836, 16081, 43219, 1237, 1750, 15575, 8921, 2266, 6877, 12511, 12512, 21216, 226, 4834, 6884, 16101, 4838, 742, 2280, 2281, 227, 7915, 6886, 6893, 2798, 6894, 5870, 4849, 242, 1779, 4852, 21215, 44791, 4864, 3329, 258, 4865, 4866, 44805, 4877, 21264, 4882, 274, 8986, 8987, 796, 32029, 4382, 21277, 4896, 1825, 801, 3363, 36644, 1830, 4393, 36138, 303, 815, 4401, 12594, 21299, 7986, 820, 310, 1337, 21307, 4411, 317, 33598, 5953, 17730, 5954, 10050, 17733, 17734, 25927, 21320, 17739, 4939, 21324, 4942, 33615, 6885, 16210, 6071, 18261, 5976, 860, 16740, 16745, 2922, 4969, 17263, 6512, 33649, 16242, 2419, 17775, 373, 1398, 880, 1916, 17276, 16255, 1920, 43394, 3974, 4999, 396, 8080, 16788, 18325, 1942, 16279, 1433, 43418, 36252, 17311, 43425, 16802, 7585, 15959, 7594, 36268, 4525, 7597, 5551, 6063, 36272, 36275, 4533, 16309, 18358, 36280, 1465, 441, 7611, 16825, 16829, 4538, 2488, 2495, 8129, 4545, 4547, 16836, 4549, 7621, 1484, 1997, 11214, 1999, 16846, 16847, 4563, 7636, 14293, 7638, 4567, 16855, 17369, 16861, 478, 16351, 18400, 17377, 993, 9699, 5085, 6111, 7645, 6119, 6124, 17903, 1011, 4597, 6646, 16376, 6138, 16891, 16892, 7165, 4606}, 1: {0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 
12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 12391, 28267, 12396, 109, 9836, 12399, 11884, 12401, 12400, 12403, 627, 117, 629, 9847, 628, 17020, 637, 9855, 639, 12418, 643, 1668, 133, 3715, 14470, 1160, 12424, 11912, 9867, 33420, 10376, 655, 12433, 148, 150, 3735, 1176, 12440, 154, 21659, 1180, 3742, 10399, 11936, 1185, 31904, 675, 13472, 167, 1704, 7337, 11946, 171, 172, 8876, 8878, 2734, 1200, 1709, 2226, 8877, 180, 1155, 697, 12475, 189, 8894, 1215, 1218, 4291, 708, 709, 3271, 2760, 6354, 2771, 1748, 213, 3798, 727, 730, 20187, 44767, 225, 2786, 2787, 13028, 1765, 1254, 13543, 26344, 740, 11497, 1771, 3819, 13549, 11502, 751, 1775, 752, 242, 21743, 12524, 759, 11511, 2809, 2812, 35581, 257, 8962, 771, 259, 15623, 1288, 3849, 12048, 1810, 786, 788, 3862, 793, 7450, 798, 24862, 7458, 12579, 31524, 31523, 7459, 1322, 810, 25391, 12081, 1329, 820, 3386, 1850, 9023, 319, 835, 9029, 325, 4424, 330, 12107, 13134, 846, 3409, 3924, 1878, 854, 344, 11609, 5978, 1883, 11612, 343, 11615, 358, 4457, 362, 875, 1385, 1900, 4462, 3439, 12144, 369, 3438, 1396, 38773, 28025, 2428, 13305, 13183, 12161, 12674, 1922, 34690, 2438, 1926, 13193, 907, 9100, 911, 13204, 1431, 10135, 2456, 44956, 925, 413, 32670, 1952, 928, 23455, 5540, 1956, 1447, 12200, 1448, 1452, 8109, 12205, 1965, 9651, 2486, 5559, 1464, 956, 1982, 959, 3522, 12235, 976, 3025, 10194, 1491, 12244, 465, 30675, 5585, 472, 470, 10714, 475, 3027, 478, 1503, 479, 5089, 483, 2532, 995, 9190, 5607, 1512, 1513, 9703, 10728, 494, 1518, 1520, 2545, 1007, 1524, 501, 503, 1017, 1534}, 2: {0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 3146, 1100, 
26701, 1614, 1102, 592, 3577, 35410, 2639, 2644, 3159, 25688, 1626, 91, 3162, 1119, 608, 21089, 1634, 102, 2662, 31848, 2665, 11881, 27242, 12907, 1131, 1132, 15388, 2672, 3185, 1138, 627, 43124, 2675, 113, 1657, 2682, 3194, 127, 3715, 1668, 133, 3717, 135, 2696, 3209, 1162, 1158, 1676, 2701, 11916, 1167, 138, 1169, 148, 2710, 1174, 152, 1177, 22167, 26779, 21659, 157, 158, 1183, 30880, 1185, 26784, 2209, 2724, 3232, 672, 167, 4256, 8876, 685, 4269, 1202, 2226, 691, 1205, 3253, 1207, 2231, 2242, 4291, 14026, 27340, 1740, 1231, 14032, 24273, 3284, 1749, 213, 727, 217, 730, 2266, 14044, 1246, 1248, 225, 1254, 742, 745, 3819, 14060, 12013, 750, 1775, 242, 1780, 1268, 759, 760, 249, 33536, 1281, 261, 262, 2311, 1290, 267, 37132, 5902, 1810, 7958, 39191, 280, 793, 43813, 1318, 807, 295, 45354, 1324, 28461, 1838, 28462, 815, 1329, 820, 1333, 317, 2366, 39743, 832, 2365, 45378, 835, 330, 1356, 845, 334, 1359, 4433, 4438, 854, 14168, 1370, 1883, 1372, 1371, 860, 863, 3935, 3937, 1378, 11618, 3426, 870, 358, 3942, 361, 874, 362, 875, 28010, 3438, 2416, 369, 880, 14196, 886, 4472, 1403, 894, 895, 2432, 385, 904, 905, 27528, 907, 909, 911, 1431, 409, 1433, 925, 1950, 415, 928, 413, 13731, 3494, 20902, 937, 1452, 942, 1968, 1973, 1464, 1977, 956, 34240, 3009, 32706, 14278, 3015, 456, 1993, 973, 975, 976, 465, 466, 1491, 14290, 2512, 1494, 472, 475, 480, 3554, 995, 2532, 3048, 1513, 23529, 3564, 494, 498, 500, 501, 503, 1017, 3070}, 3: {1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 6731, 9293, 31823, 2133, 9303, 601, 91, 43615, 608, 9314, 10338, 25709, 1646, 10349, 6257, 7794, 27763, 11381, 9337, 7801, 637, 3709, 639, 11391, 9345, 7299, 3715, 1668, 41606, 11401, 11402, 4233, 9868, 10893, 142, 5259, 9872, 25744, 25741, 148, 
10389, 34455, 3735, 8345, 8857, 154, 10396, 1178, 7839, 10399, 8554, 1704, 10409, 9900, 10412, 2734, 14512, 10416, 7858, 9394, 9904, 6325, 2232, 1721, 38589, 8894, 6336, 1220, 9925, 11461, 3271, 9420, 719, 14544, 2773, 3286, 3287, 214, 20187, 9438, 26335, 6048, 13534, 226, 3811, 19172, 1766, 2280, 36585, 14575, 2801, 9457, 10993, 10485, 23797, 759, 27896, 5882, 8443, 23803, 1790, 767, 8962, 9476, 7433, 6924, 2316, 2318, 3853, 14608, 4371, 9494, 8983, 6425, 793, 362, 6433, 7458, 2339, 810, 1835, 8493, 6447, 1329, 28466, 44855, 9527, 1338, 10044, 317, 3390, 10047, 41280, 31554, 2372, 9029, 11592, 9547, 3916, 9042, 10066, 3925, 343, 10072, 5978, 860, 8030, 10079, 10593, 9572, 2916, 9061, 3430, 6501, 4969, 10089, 30571, 10603, 11117, 9582, 10607, 6505, 14193, 28529, 14707, 7197, 369, 11639, 23929, 894, 1919, 3459, 11652, 2438, 10631, 907, 10642, 9109, 2454, 14743, 2456, 29594, 11164, 6559, 9631, 3999, 1951, 14754, 14756, 31653, 9638, 31654, 33704, 45984, 3500, 31661, 1453, 1455, 9645, 9649, 41394, 9651, 9652, 10165, 30718, 2999, 31672, 1982, 9662, 44483, 11205, 2505, 5581, 10704, 465, 977, 31699, 9172, 4053, 9174, 31703, 4567, 470, 10714, 475, 5076, 478, 480, 23008, 9186, 30692, 9190, 9703, 10216, 491, 30699, 1005, 2542, 31726, 1007, 494, 25586, 10222, 18417, 10736, 8178, 3064, 1529, 509, 1534}, 4: {0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 601, 7258, 91, 5722, 5214, 4703, 608, 3679, 2143, 101, 6758, 5224, 616, 7277, 2158, 4723, 5236, 6267, 1660, 637, 639, 4737, 4739, 5252, 133, 1668, 4606, 23688, 5768, 17035, 2188, 5772, 38034, 5779, 3220, 6805, 2199, 1688, 5273, 154, 155, 1694, 4767, 5280, 5278, 5284, 1191, 1704, 167, 3754, 5802, 5290, 3751, 3247, 5296, 3257, 5818, 5823, 3265, 708, 5318, 5830, 4294, 1738, 5841, 5330, 4825, 
4316, 734, 6369, 5349, 4838, 4326, 2280, 4329, 46315, 6380, 29660, 44269, 5871, 5873, 242, 7927, 759, 760, 2812, 1277, 8448, 3329, 4866, 2304, 4869, 5382, 7430, 3848, 3339, 2318, 782, 3857, 5906, 26513, 788, 2841, 7450, 4382, 1825, 7458, 801, 37156, 4393, 810, 7979, 3886, 815, 4911, 4401, 7986, 1329, 820, 5942, 3896, 8506, 2874, 317, 5441, 835, 5445, 5958, 6578, 5964, 5965, 4942, 8016, 8024, 344, 4952, 860, 1884, 29533, 8545, 8037, 3430, 6504, 7017, 2922, 4457, 362, 5998, 2928, 373, 374, 2935, 1398, 8057, 6011, 6015, 32127, 384, 4994, 8579, 4996, 8072, 396, 6541, 5006, 6540, 5009, 1938, 1427, 7571, 2965, 1942, 6039, 1940, 7574, 2970, 409, 7068, 7575, 8606, 5014, 5018, 7585, 5017, 6561, 7588, 1447, 3497, 6058, 5547, 1965, 6065, 4529, 21939, 4531, 6069, 5043, 5559, 7096, 1465, 6074, 3515, 4533, 6077, 5054, 7103, 448, 6080, 6076, 4547, 8132, 4552, 4555, 1484, 39372, 39374, 4561, 6611, 5078, 470, 1496, 5081, 472, 7131, 4572, 7133, 5598, 5086, 4576, 4577, 6111, 478, 4580, 1508, 480, 1503, 5096, 1506, 4584, 23019, 493, 494, 498, 5108, 18935, 1529, 6138, 7163, 10238, 5119}, 5: {0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91, 14940, 9308, 14937, 14943, 608, 6755, 1124, 13924, 14950, 5219, 14947, 9325, 3697, 14961, 11893, 14968, 12408, 15485, 637, 5247, 1668, 1157, 23172, 647, 15492, 15498, 5773, 19087, 13969, 9362, 15506, 1681, 148, 11926, 1176, 2713, 155, 1180, 15517, 1692, 20124, 10401, 19105, 675, 674, 19109, 167, 1704, 11946, 15019, 12458, 1709, 682, 9091, 2224, 15025, 20656, 176, 180, 7858, 12982, 15031, 15543, 41136, 14013, 2239, 1729, 708, 9413, 21700, 712, 15562, 15051, 2765, 15057, 15061, 9942, 15063, 21718, 22747, 15068, 15069, 32475, 13535, 15583, 15074, 227, 19683, 2789, 1766, 13542, 13036, 2799, 
752, 3312, 13552, 242, 26867, 1268, 15618, 759, 2809, 763, 28924, 2812, 10495, 2817, 2818, 14083, 769, 259, 15622, 2823, 1288, 8962, 15109, 19720, 15629, 19213, 3345, 786, 788, 280, 25375, 2337, 15650, 804, 15653, 3366, 807, 2349, 15151, 7984, 1329, 21810, 820, 12602, 1338, 317, 11582, 5953, 2370, 835, 323, 15688, 1864, 15693, 854, 13142, 344, 15705, 4955, 860, 23899, 11615, 863, 15199, 15711, 13155, 15205, 872, 4457, 15722, 362, 15724, 875, 3438, 15215, 369, 883, 19828, 24437, 374, 29179, 9593, 19834, 15227, 894, 19326, 13186, 35203, 2436, 15749, 389, 19847, 15750, 19849, 2438, 1922, 6028, 909, 15752, 2446, 13200, 2448, 409, 21923, 9644, 14766, 22959, 14771, 23989, 12728, 9145, 14778, 14779, 3000, 12733, 7102, 3007, 9665, 14786, 12226, 2498, 14789, 8645, 15301, 15305, 15818, 461, 976, 5585, 977, 1489, 15358, 472, 1496, 42457, 2524, 478, 19422, 480, 15330, 15843, 20452, 26084, 6631, 14827, 492, 15343, 3571, 14836, 15348, 19446, 14839, 11765, 1017, 14843, 14844, 14846}}
word_bagNum shape: 6 50
{0: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960], 1: [0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613], 2: [0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651], 3: [1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284], 4: [0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740], 5: [0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91]}
after all_words, word_bag shape: 6 300
{0: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 1: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 
7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 2: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 
4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 3: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 
91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 4: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 
2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91], 5: [1536, 7681, 17410, 17411, 17415, 6664, 17420, 15886, 4623, 17935, 4625, 5139, 4631, 17916, 17437, 544, 16422, 5671, 1065, 4650, 4651, 4653, 4690, 16943, 4657, 17458, 15921, 51, 7222, 17464, 17465, 10299, 15932, 64, 6209, 66, 17474, 4680, 8264, 8266, 40008, 6730, 8273, 6738, 5203, 5206, 18005, 15958, 597, 15960, 0, 3, 11785, 2569, 32779, 9227, 526, 21519, 530, 4116, 533, 11805, 2590, 2591, 3105, 7203, 1571, 8740, 1574, 12836, 1062, 1577, 2553, 4654, 1071, 2094, 30257, 51, 30260, 53, 28213, 24633, 1082, 1087, 68, 8779, 78, 12367, 11859, 2647, 91, 13916, 13917, 15455, 608, 9825, 1634, 12387, 13412, 613, 0, 3, 520, 1547, 12300, 2062, 3599, 1040, 26641, 18, 25616, 2577, 13846, 2583, 4121, 25114, 1051, 1052, 25629, 1054, 1567, 2591, 3105, 3616, 4126, 1060, 4125, 1062, 1063, 26663, 1577, 13863, 1066, 1580, 45, 1071, 51, 3123, 53, 2614, 3125, 1082, 2622, 66, 2627, 11843, 1093, 1606, 1605, 3651, 1536, 0, 10242, 
3, 37889, 1029, 10248, 2569, 9740, 9745, 10770, 17938, 2577, 10257, 9238, 3094, 9752, 9751, 9754, 30235, 9243, 18425, 9246, 2590, 24096, 9249, 9250, 9251, 4643, 10272, 9252, 5666, 3616, 3625, 4133, 4136, 1071, 9264, 4657, 51, 9267, 22583, 10808, 40504, 10304, 6210, 3650, 37444, 68, 9284, 0, 5121, 4098, 3, 3078, 7175, 1543, 1545, 22027, 5131, 14, 4623, 4625, 22547, 533, 2588, 2590, 1570, 4643, 2597, 5669, 5159, 6183, 2602, 45, 6702, 18937, 5168, 5169, 48, 4657, 3063, 51, 1590, 12343, 5686, 5689, 2105, 1586, 5175, 5694, 6721, 68, 2630, 29767, 29778, 4692, 2133, 5204, 6740, 0, 14849, 512, 3, 11266, 14853, 2053, 23047, 1527, 2569, 15370, 14861, 13, 19471, 2577, 11793, 14867, 18423, 533, 15384, 14875, 15388, 11807, 15396, 4132, 1574, 14890, 14893, 14896, 14897, 1586, 51, 1590, 14911, 1088, 15429, 14406, 23111, 16968, 14921, 14925, 16461, 14929, 15442, 8789, 14934, 2647, 3161, 7770, 91]}
features_data_frame.shape: (6, 255)
0 30
1 185
2 139
3 66
4 69
5 157
class_Proportion: [0.04643962848297214, 0.28637770897832815, 0.21517027863777088, 0.1021671826625387, 0.10681114551083591, 0.24303405572755418]
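The `class_Proportion` values are consistent with dividing each per-class count printed just above by their total of 646 training documents; these proportions serve as the class priors for naive Bayes. A minimal sketch of that arithmetic, with the counts copied from the output and the variable names chosen here for illustration:

```python
# Per-class training document counts, copied from the output above.
class_counts = {0: 30, 1: 185, 2: 139, 3: 66, 4: 69, 5: 157}

total = sum(class_counts.values())  # 646 training documents
class_proportion = [class_counts[c] / total for c in sorted(class_counts)]
print(class_proportion)
```

Class 0 (`'养生'`) has a prior below 5%, so a classifier that leans heavily on the prior will rarely predict it.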
test_data_frame.head(2) Unnamed: 0                                            content  \
854         854  据Mobileexpose报道,华硕已经正式向媒体发出邀请,定于6月14日在台湾举办记者会,...   
101         101   6月6日,王者荣耀猴三棍重做引起王者峡谷一阵轩然大波,毕竟这个强势的猴子已经陪伴我们好几个...   id                                   tags  \
854  6429089676803440897  ['科技', '华硕', '华硕ZenFone', '台湾', '手机']   
101  6429098400347586818       ['游戏', '猴子', '王者荣耀', '黄忠', '游戏']   time                     title  \
854  2017-06-07 10:11:00        华硕ZenFone AR宣布本月发售   
101  2017-06-07 10:39:20  猴子重做之后是加强还是削弱?狂到站对面泉水拿双杀   doc_words  \
854  [报道, 华硕, 已经, 正式, 媒体, 发出, 邀请, 定于, 月, 日, 台湾, 举办,...   
101  [月, 日, 王者, 荣耀, 猴三棍, 重, 做, 引起, 王者, 峡谷, 一阵, 轩然大波...   corpus  \
854  [(142, 1), (362, 1), (472, 1), (475, 1), (494,...   
101  [(0, 2), (68, 3), (133, 1), (184, 1), (226, 1)...   tfidf   visual01   visual02  \
854  [(142, 0.13953435619531032), (362, 0.046441336...  21.684397 -30.567736   
101  [(0, 0.012838015508020575), (68, 0.04742284222...  67.188065  21.183245   keyword_index  
854              1  
101              3  
print the first sample Unnamed: 0                                                     854
content          据Mobileexpose报道,华硕已经正式向媒体发出邀请,定于6月14日在台湾举办记者会,...
id                                             6429089676803440897
tags                         ['科技', '华硕', '华硕ZenFone', '台湾', '手机']
time                                           2017-06-07 10:11:00
title                                           华硕ZenFone AR宣布本月发售
doc_words        [报道, 华硕, 已经, 正式, 媒体, 发出, 邀请, 定于, 月, 日, 台湾, 举办,...
corpus           [(142, 1), (362, 1), (472, 1), (475, 1), (494,...
tfidf            [(142, 0.13953435619531032), (362, 0.046441336...
visual01                                                   21.6844
visual02                                                  -30.5677
keyword_index                                                    1
Name: 854, dtype: object
test_data_frame.iloc[0].corpus:  [(142, 1), (362, 1), (472, 1), (475, 1), (494, 1), (530, 1), (872, 1), (909, 1), (1254, 1), (1312, 1), (1878, 1), (2577, 1), (2783, 1), (2979, 1), (3697, 1), (5508, 1), (9052, 1), (12204, 1), (12256, 1), (12591, 1), (12936, 1), (12991, 1), (13128, 1), (13194, 1), (13244, 1), (13317, 1), (31670, 1), (31683, 1), (33417, 1)]
[1.45708072e-43 1.78656934e-66 7.12148875e-63 1.71090490e-53
 4.71385662e-54 2.08405934e-64]
[-35.34436300647761, -16.431856044032266, -20.267559000416433, -22.405433968586664, -27.97121661401147, -18.05089965903481]
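The second list above appears to hold one log-score per class for the first test document. Its largest entry sits at index 1, which matches the predicted_class reported for document 854 further down. The sketch below shows one common way such scores are built and compared: log prior plus count-weighted log word likelihoods, then an argmax. The helper names, the floor value for unseen words, and the toy model are assumptions for illustration, not the original `predict_text_ByMax`.

```python
import math

def log_scores(bow, log_prior, log_likelihood):
    """Return one log-score per class for a bag-of-words document.

    bow            : list of (word_id, count) pairs, as in the corpus column
    log_prior      : list of log P(class)
    log_likelihood : per class, a dict word_id -> log P(word | class)
    Hypothetical helper; unseen words fall back to a small floor probability.
    """
    floor = math.log(1e-6)
    scores = []
    for c, prior in enumerate(log_prior):
        s = prior
        for word_id, count in bow:
            s += count * log_likelihood[c].get(word_id, floor)
        scores.append(s)
    return scores

def predict_by_max(scores):
    """Index of the class with the largest log-score."""
    return max(range(len(scores)), key=scores.__getitem__)

# Toy two-class model (made-up numbers, for illustration only).
toy_prior = [math.log(0.5), math.log(0.5)]
toy_likelihood = [{142: math.log(0.2)},
                  {142: math.log(0.01), 362: math.log(0.3)}]
toy_scores = log_scores([(142, 1), (362, 1)], toy_prior, toy_likelihood)

# The six log-scores printed in the log above.
printed = [-35.34436300647761, -16.431856044032266, -20.267559000416433,
           -22.405433968586664, -27.97121661401147, -18.05089965903481]
print(predict_by_max(toy_scores), predict_by_max(printed))
```

Summing logarithms instead of multiplying raw probabilities avoids floating-point underflow when a document contains many words.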
F:\File_Jupyter\实用代码\naive_bayes(简单贝叶斯)\TextClassPrediction_kNN_NB_LDA_P.py:346: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_data_frame['predicted_class'] = test_data_frame['corpus'].apply(predict_text_ByMax)       # predict all test documents
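The warning above is raised because a new column is assigned to a frame that pandas considers a slice of another frame. One common way to avoid it is to take an explicit copy when the test frame is created. The sketch below uses a tiny made-up frame and a placeholder `predict_stub` function; both are assumptions for illustration, not the original data or classifier.

```python
import pandas as pd

df = pd.DataFrame({"corpus": [[(0, 2)], [(1, 1)], [(0, 1), (1, 3)]],
                   "keyword_index": [1, 3, 1]})

# Taking an explicit copy makes the test frame independent of `df`,
# so adding a column no longer triggers SettingWithCopyWarning.
test_df = df.iloc[1:].copy()

def predict_stub(bow):
    """Placeholder for a real classifier: returns the most frequent word id."""
    return max(bow, key=lambda pair: pair[1])[0]

test_df["predicted_class"] = test_df["corpus"].apply(predict_stub)
print(test_df)
```

The original frame `df` is left untouched, which is usually what is wanted after a train/test split.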
test_data_frame:
      Unnamed: 0                                            content  \
854          854  据Mobileexpose报道,华硕已经正式向媒体发出邀请,定于6月14日在台湾举办记者会,...   
101          101   6月6日,王者荣耀猴三棍重做引起王者峡谷一阵轩然大波,毕竟这个强势的猴子已经陪伴我们好几个...   
738          738  骗子往往都很会讲故事,比如以下这些硅谷骗局:验血公司Theranos,号称只要从指尖抽几滴血...   
511          511  专访 Whyd 创始人 孟崨在学校,他是最调皮,却又成绩最好的学生,让老师头疼不已。在公司,...   
725          725  据介绍,喜马拉雅FM会员月费为18元,年度会员188元,价格与视频网站会员价格相仿。在会员福...   
...          ...                                                ...   
805          805  每经记者 王海慜 每经编辑 叶峰今日盘中,昨日领涨的中小创出现休整,而昨日暂时休整的一批龙头...   
448          448  中国人买什么都喜欢大的,房子要买面积大的、手机要买屏大的,买车自然也是要挑选空间大的。抛开拉...   
782          782  中证网讯 (记者 徐金忠)6月7日,国能电动汽车瑞典有限公司(NEVS)亮相CES亚洲消费电...   
1264        1264  目前日系豪华品牌讴歌已经开启了国产之路,在推出CDX车型后,讴歌在国内的知名度一度飙升。CD...   
1195        1195  近日有爆料称,乐视位于北京达美中心的办公地因未及时缴纳办公地费用已被停止物业一切服务;物业公...   

                       id                                   tags  \
854   6429089676803440897  ['科技', '华硕', '华硕ZenFone', '台湾', '手机']   
101   6429098400347586818       ['游戏', '猴子', '王者荣耀', '黄忠', '游戏']   
738   6413133652368982274     ['科技', '厨卫电器', '榨汁机', '小家电', '硅谷']   
511   6428827159980867842     ['科技', '智能家居', '音箱', '苹果公司', '法国']   
725   6428841852455354625                  ['科技', '喜马拉雅山', '科技']   
...                   ...                                    ...   
805   6429151552733069569                           ['财经', '财经']   
448   6415852634885341441    ['汽车', 'SUV', '国产车', '概念车', '汽车用品']   
782   6428858665063383297   ['科技', '新能源汽车', '电动汽车', '新能源', '经济']   
1264  6427822755417194753    ['汽车', '日本汽车', '讴歌汽车', 'SUV', '空调']   
1195  6429093420292210945                     ['科技', '乐视', '科技']   

                     time                        title  \
854   2017-06-07 10:11:00           华硕ZenFone AR宣布本月发售   
101   2017-06-07 10:39:20     猴子重做之后是加强还是削弱?狂到站对面泉水拿双杀   
738   2017-04-26 10:41:39                绝!他用一台榨汁机骗了8亿   
511   2017-06-08 11:06:00    他的智能音箱一上市,苹果公司就推出了HomePod   
725   2017-06-07 18:37:00  喜马拉雅FM推出“付费会员”,当天召集超221万名会员   
...                   ...                          ...   
805   2017-06-08 14:30:00          盘中近20家龙头白马股集体创下历史新高   
448   2017-05-03 18:37:20      别瞎找了!10万左右尺寸最大的SUV都在这里了   
782   2017-06-07 19:12:00      倡导移动出行新概念 NEVS两款概念量产车亮相   
1264  2017-06-08 09:54:40        居然还有一款车,最低配和中高配看不出差别?   
1195  2017-06-08 10:45:00     乐视被爆未及时缴物业费,员工或将被阻止进大楼办公   

                                              doc_words  \
854   [报道, 华硕, 已经, 正式, 媒体, 发出, 邀请, 定于, 月, 日, 台湾, 举办,...   
101   [月, 日, 王者, 荣耀, 猴三棍, 重, 做, 引起, 王者, 峡谷, 一阵, 轩然大波...   
738   [骗子, 往往, 很会, 讲故事, 以下, 硅谷, 骗局, 验血, 公司, 号称, 指尖, ...   
511   [专访, 创始人, 孟, 崨, 学校, 最, 调皮, 却, 成绩, 最好, 学生, 老师, ...   
725   [据介绍, 喜马拉雅, 会员, 月费, 元, 年度, 会员, 元, 价格, 视频, 网站, ...   
...                                                 ...   
805   [每经, 记者, 王海, 慜, 每经, 编辑, 叶峰, 今日, 盘中, 昨日, 领涨, 中小...   
448   [中国, 人买, 喜欢, 房子, 买, 面积, 手机, 买, 屏大, 买车, 自然, 挑选,...   
782   [中证网, 讯, 记者, 徐金忠, 月, 日, 国, 电动汽车, 瑞典, 有限公司, 亮相,...   
1264  [目前, 日系, 豪华, 品牌, 讴歌, 已经, 开启, 国产, 路, 推出, 车型, 后,...   
1195  [近日, 爆料, 称, 乐视, 位于, 北京, 达美, 中心, 办公地, 因未, 及时, 缴...   

                                                 corpus  \
854   [(142, 1), (362, 1), (472, 1), (475, 1), (494,...   
101   [(0, 2), (68, 3), (133, 1), (184, 1), (226, 1)...   
738   [(0, 2), (45, 1), (48, 1), (133, 2), (155, 1),...   
511   [(0, 10), (13, 2), (14, 2), (20, 1), (45, 1), ...   
725   [(30, 1), (102, 1), (142, 1), (154, 1), (189, ...   
...                                                 ...   
805   [(113, 1), (167, 1), (169, 1), (214, 1), (258,...   
448   [(4, 2), (8, 1), (14, 1), (51, 6), (53, 2), (6...   
782   [(15, 2), (30, 1), (53, 7), (93, 1), (143, 1),...   
1264  [(0, 1), (20, 1), (51, 1), (176, 1), (225, 1),...   
1195  [(57, 1), (111, 1), (191, 1), (361, 1), (476, ...   

                                                  tfidf   visual01   visual02  \
854   [(142, 0.13953435619531032), (362, 0.046441336...  21.684397 -30.567736   
101   [(0, 0.012838015508020575), (68, 0.04742284222...  67.188065  21.183245   
738   [(0, 0.008984009118453712), (45, 0.01791359767... -22.855194 -11.270862   
511   [(0, 0.04361196171462796), (13, 0.028607388065... -22.198786  12.217076   
725   [(30, 0.05815947983270004), (102, 0.0450585853...  26.268911  21.240065   
...                                                 ...        ...        ...   
805   [(113, 0.030899018921031703), (167, 0.02103003... -66.232071   0.221611   
448   [(4, 0.04071064284477513), (8, 0.0235138776022...  41.836094 -44.539528   
782   [(15, 0.03392075672049564), (30, 0.03003603467... -26.810091 -29.602842   
1264  [(0, 0.009883726180653873), (20, 0.04080153677...  36.279522 -52.474297   
1195  [(57, 0.09668298763559263), (111, 0.1255406499...  -6.373239  16.101738   

      keyword_index  predicted_class  
854               1                1  
101               3                3  
738               1                1  
511               1                2  
725               1                1  
...             ...              ...  
805               2                2  
448               5                5  
782               1                1  
1264              5                5  
1195              1                1  

[647 rows x 13 columns]
SModel_CS_acc_score: 0.7047913446676971
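The accuracy score above is the fraction of test rows where predicted_class equals keyword_index. The printed value is consistent with 456 correct predictions out of 647 rows. A small sketch using only the ten rows visible in the truncated table (the function name is chosen here for illustration):

```python
# keyword_index (true label) and predicted_class for the ten rows visible above.
y_true = [1, 3, 1, 1, 1, 2, 5, 1, 5, 1]
y_pred = [1, 3, 1, 2, 1, 2, 5, 1, 5, 1]

def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)

print(accuracy(y_true, y_pred))  # 9 of the 10 visible rows agree
```

Only row 511 is misclassified among the visible rows, so this small sample scores higher than the full test set.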
300
label_category_ID 2
一个
一些
概念
经营
补贴
股市
增持
成本
乳业
万吨
train_data_frame.corpus[0] [(0, 6), (1, 1), (2, 1), (3, 3), (4, 2), (5, 2), (6, 1), (7, 1), (8, 2), (9, 1), (10, 3), (11, 1), (12, 2), (13, 2), (14, 2), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 2), (21, 1), (22, 2), (23, 2), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 2), (30, 3), (31, 4), (32, 3), (33, 1), (34, 1), (35, 1), (36, 7), (37, 1), (38, 1), (39, 2), (40, 3), (41, 1), (42, 1), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 2), (50, 4), (51, 21), (52, 3), (53, 7), (54, 1), (55, 2), (56, 1), (57, 4), (58, 2), (59, 1), (60, 5), (61, 1), (62, 1), (63, 1), (64, 2), (65, 1), (66, 3), (67, 1), (68, 2), (69, 2), (70, 1), (71, 1), (72, 1), (73, 1), (74, 2), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 1), (83, 4), (84, 7), (85, 2), (86, 3), (87, 1), (88, 9), (89, 1), (90, 1), (91, 8), (92, 3), (93, 1), (94, 4), (95, 1), (96, 2), (97, 1), (98, 7), (99, 1), (100, 2), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 1), (109, 2), (110, 1), (111, 2), (112, 1), (113, 1), (114, 1), (115, 1), (116, 1), (117, 1), (118, 1), (119, 1), (120, 1), (121, 2), (122, 1), (123, 1), (124, 1), (125, 1), (126, 5), (127, 1), (128, 4), (129, 1), (130, 1), (131, 1), (132, 2), (133, 2), (134, 1), (135, 5), (136, 1), (137, 1), (138, 3), (139, 1), (140, 1), (141, 1), (142, 1), (143, 1), (144, 1), (145, 2), (146, 1), (147, 1), (148, 2), (149, 4), (150, 1), (151, 1), (152, 2), (153, 2), (154, 1), (155, 3), (156, 1), (157, 1), (158, 1), (159, 1), (160, 1), (161, 2), (162, 1), (163, 1), (164, 1), (165, 2), (166, 1), (167, 3), (168, 1), (169, 1), (170, 3), (171, 3), (172, 1), (173, 2), (174, 1), (175, 1), (176, 2), (177, 5), (178, 1), (179, 1), (180, 1), (181, 1), (182, 1), (183, 1), (184, 4), (185, 1), (186, 1), (187, 1), (188, 1), (189, 3), (190, 1), (191, 14), (192, 2), (193, 2), (194, 2), (195, 1), (196, 3), (197, 1), (198, 1), (199, 11), (200, 6), (201, 1), (202, 1), (203, 2), (204, 1), (205, 8), (206, 2), (207, 
2), (208, 2), (209, 1), (210, 1), (211, 1), (212, 1), (213, 1), (214, 1), (215, 1), (216, 3), (217, 1), (218, 1), (219, 2), (220, 2), (221, 1), (222, 1), (223, 1), (224, 1), (225, 17), (226, 1), (227, 1), (228, 1), (229, 1), (230, 1), (231, 1), (232, 2), (233, 1), (234, 1), (235, 3), (236, 1), (237, 1), (238, 2), (239, 1), (240, 1), (241, 1), (242, 1), (243, 2), (244, 2), (245, 1), (246, 1), (247, 2), (248, 2), (249, 2), (250, 1), (251, 1), (252, 2), (253, 1), (254, 1), (255, 1), (256, 1), (257, 1), (258, 3), (259, 3), (260, 1), (261, 3), (262, 2), (263, 1), (264, 1), (265, 6), (266, 1), (267, 3), (268, 1), (269, 1), (270, 3), (271, 2), (272, 1), (273, 2), (274, 1), (275, 1), (276, 5), (277, 1), (278, 4), (279, 4), (280, 25), (281, 2), (282, 2), (283, 2), (284, 7), (285, 1), (286, 1), (287, 2), (288, 2), (289, 1), (290, 1), (291, 1), (292, 1), (293, 3), (294, 2), (295, 1), (296, 3), (297, 1), (298, 3), (299, 2), (300, 1), (301, 1), (302, 1), (303, 2), (304, 1), (305, 1), (306, 1), (307, 2), (308, 2), (309, 1), (310, 1), (311, 1), (312, 1), (313, 1), (314, 1), (315, 1), (316, 7), (317, 2), (318, 2), (319, 1), (320, 1), (321, 1), (322, 1), (323, 1), (324, 1), (325, 4), (326, 1), (327, 2), (328, 1), (329, 1), (330, 3), (331, 3), (332, 1), (333, 2), (334, 2), (335, 1), (336, 1), (337, 2), (338, 1), (339, 1), (340, 1), (341, 1), (342, 1), (343, 1), (344, 2), (345, 1), (346, 1), (347, 2), (348, 1), (349, 2), (350, 5), (351, 2), (352, 3), (353, 1), (354, 4), (355, 1), (356, 1), (357, 2), (358, 4), (359, 2), (360, 2), (361, 1), (362, 9), (363, 2), (364, 2), (365, 1), (366, 1), (367, 7), (368, 1), (369, 4), (370, 2), (371, 1), (372, 1), (373, 1), (374, 1), (375, 1), (376, 1), (377, 1), (378, 2), (379, 1), (380, 3), (381, 1), (382, 2), (383, 1), (384, 3), (385, 26), (386, 1), (387, 1), (388, 1), (389, 3), (390, 1), (391, 2), (392, 1), (393, 4), (394, 4), (395, 4), (396, 2), (397, 1), (398, 40), (399, 2), (400, 4), (401, 1), (402, 1), (403, 2), (404, 1), (405, 1), (406, 2), 
(407, 1), (408, 1), (409, 3), (410, 1), (411, 1), (412, 2), (413, 7), (414, 4), (415, 2), (416, 1), (417, 1), (418, 1), (419, 3), (420, 1), (421, 1), (422, 1), (423, 1), (424, 1), (425, 1), (426, 1), (427, 2), (428, 1), (429, 1), (430, 1), (431, 1), (432, 5), (433, 1), (434, 1), (435, 1), (436, 1), (437, 1), (438, 1), (439, 1), (440, 1), (441, 1), (442, 1), (443, 3), (444, 3), (445, 2), (446, 5), (447, 1), (448, 1), (449, 1), (450, 4), (451, 1), (452, 2), (453, 2), (454, 1), (455, 4), (456, 1), (457, 1), (458, 1), (459, 2), (460, 1), (461, 1), (462, 5), (463, 2), (464, 1), (465, 5), (466, 74), (467, 2), (468, 1), (469, 1), (470, 2), (471, 22), (472, 2), (473, 1), (474, 1), (475, 2), (476, 2), (477, 2), (478, 2), (479, 1), (480, 1), (481, 1), (482, 1), (483, 2), (484, 1), (485, 1), (486, 2), (487, 1), (488, 2), (489, 1), (490, 1), (491, 1), (492, 4), (493, 1), (494, 2), (495, 4), (496, 2), (497, 1), (498, 1), (499, 1), (500, 1), (501, 5), (502, 1), (503, 13), (504, 4), (505, 3), (506, 1), (507, 7), (508, 1), (509, 1), (510, 1), (511, 1), (512, 1), (513, 1), (514, 2), (515, 1), (516, 3), (517, 4), (518, 1), (519, 1), (520, 1), (521, 1), (522, 1), (523, 1), (524, 1), (525, 1), (526, 2), (527, 2), (528, 1), (529, 1), (530, 1), (531, 1), (532, 1), (533, 1), (534, 1), (535, 2), (536, 5), (537, 2), (538, 1), (539, 1), (540, 1), (541, 7), (542, 1), (543, 1), (544, 1), (545, 2), (546, 1), (547, 3), (548, 2), (549, 1), (550, 1), (551, 2), (552, 1), (553, 2), (554, 1), (555, 1), (556, 2), (557, 1), (558, 2), (559, 5), (560, 2), (561, 1), (562, 1), (563, 1), (564, 1), (565, 1), (566, 1), (567, 7), (568, 2), (569, 1), (570, 2), (571, 1), (572, 1), (573, 1), (574, 4), (575, 1), (576, 2), (577, 2), (578, 1), (579, 2), (580, 1), (581, 1), (582, 1), (583, 2), (584, 1), (585, 1), (586, 1), (587, 4), (588, 1), (589, 4), (590, 2), (591, 1), (592, 1), (593, 1), (594, 2), (595, 1), (596, 1), (597, 1), (598, 1), (599, 1), (600, 1), (601, 1), (602, 1), (603, 1), (604, 1), (605, 1), (606, 
1), (607, 1), (608, 2), (609, 1), (610, 2), (611, 1), (612, 1), (613, 11), (614, 1), (615, 1), (616, 3), (617, 1), (618, 1), (619, 1), (620, 1), (621, 1), (622, 1), (623, 1), (624, 32), (625, 2), (626, 1), (627, 8), (628, 1), (629, 3), (630, 3), (631, 1), (632, 1), (633, 4), (634, 1), (635, 1), (636, 2), (637, 1), (638, 3), (639, 2), (640, 1), (641, 1), (642, 1), (643, 3), (644, 5), (645, 4), (646, 1), (647, 1), (648, 3), (649, 1), (650, 1), (651, 1), (652, 1), (653, 1), (654, 1), (655, 2), (656, 1), (657, 7), (658, 1), (659, 2), (660, 1), (661, 2), (662, 1), (663, 1), (664, 1), (665, 1), (666, 1), (667, 1), (668, 4), (669, 1), (670, 1), (671, 3), (672, 1), (673, 1), (674, 2), (675, 1), (676, 1), (677, 1), (678, 1), (679, 1), (680, 2), (681, 2), (682, 1), (683, 1), (684, 1), (685, 3), (686, 1), (687, 1), (688, 1), (689, 1), (690, 4), (691, 1), (692, 2), (693, 3), (694, 1), (695, 2), (696, 1), (697, 1), (698, 2), (699, 1), (700, 1), (701, 4), (702, 1), (703, 1), (704, 2), (705, 1), (706, 1), (707, 1), (708, 1), (709, 2), (710, 1), (711, 3), (712, 1), (713, 1), (714, 4), (715, 1), (716, 1), (717, 1), (718, 2), (719, 1), (720, 1), (721, 2), (722, 1), (723, 1), (724, 4), (725, 1), (726, 1), (727, 1), (728, 1), (729, 2), (730, 12), (731, 2), (732, 1), (733, 2), (734, 3), (735, 1), (736, 26), (737, 1), (738, 5), (739, 1), (740, 2), (741, 5), (742, 2), (743, 3), (744, 3), (745, 2), (746, 1), (747, 3), (748, 2), (749, 2), (750, 2), (751, 1), (752, 1), (753, 2), (754, 1), (755, 1), (756, 1), (757, 1), (758, 1), (759, 4), (760, 1), (761, 1), (762, 1), (763, 1), (764, 1), (765, 2), (766, 1), (767, 1), (768, 1), (769, 2), (770, 8), (771, 2), (772, 4), (773, 1), (774, 8), (775, 3), (776, 1), (777, 1), (778, 3), (779, 1), (780, 1), (781, 1), (782, 5), (783, 2), (784, 2), (785, 1), (786, 4), (787, 1), (788, 1), (789, 1), (790, 1), (791, 1), (792, 1), (793, 4), (794, 1), (795, 1), (796, 1), (797, 5), (798, 3), (799, 5), (800, 3), (801, 1), (802, 1), (803, 1), (804, 1), (805, 2), 
(806, 2), (807, 2), (808, 1), (809, 1), (810, 1), (811, 1), (812, 1), (813, 1), (814, 1), (815, 3), (816, 1), (817, 2), (818, 1), (819, 1), (820, 11), (821, 1), (822, 1), (823, 2), (824, 3), (825, 1), (826, 1), (827, 1), (828, 1), (829, 1), (830, 3), (831, 4), (832, 46), (833, 1), (834, 1), (835, 2), (836, 2), (837, 1), (838, 1), (839, 2), (840, 2), (841, 1), (842, 1), (843, 2), (844, 2), (845, 2), (846, 1), (847, 1), (848, 2), (849, 1), (850, 1), (851, 1), (852, 3), (853, 1), (854, 1), (855, 6), (856, 1), (857, 1), (858, 1)]
[33. 74. 73. 31. 47. 48.]
<class 'numpy.ndarray'>
SModel_acc_score: 0.8114374034003091
kNNC_acc_score: 0.8160741885625966
GNBC_acc_score: 0.6352395672333848
MNBC_acc_score: 0.6352395672333848
BNBC_acc_score: 0.29675425038639874
LDAC_acc_score: 0.8238021638330757
PerceptronC_acc_score: 0.8222565687789799
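The scores above compare several classifiers on the same train/test split. The original script and its hyperparameters are not shown, so the sketch below only illustrates the general pattern with scikit-learn on a synthetic word-count matrix. The data, the split, and every parameter value are assumptions, and the resulting numbers are not comparable to those in the log.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic word-count matrix: 3 classes, 20 "words";
# each class favours its own block of 5 words.
rng = np.random.RandomState(0)
n_per_class, n_words = 60, 20
X_parts, y_parts = [], []
for label in range(3):
    rates = np.full(n_words, 0.5)
    rates[label * 5:(label + 1) * 5] = 4.0   # class-specific vocabulary
    X_parts.append(rng.poisson(rates, size=(n_per_class, n_words)))
    y_parts.append(np.full(n_per_class, label))
X = np.vstack(X_parts)
y = np.concatenate(y_parts)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

models = {
    "kNNC": KNeighborsClassifier(n_neighbors=5),
    "GNBC": GaussianNB(),
    "MNBC": MultinomialNB(),
    "BNBC": BernoulliNB(),
    "LDAC": LinearDiscriminantAnalysis(),
    "PerceptronC": Perceptron(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)   # mean accuracy on held-out rows
    print(f"{name}_acc_score: {scores[name]}")
```

Fitting every model on the same split keeps the scores comparable with each other, which is the point of a side-by-side listing like the one above.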

 

 

Core code

class GaussianNB Found at: sklearn.naive_bayes

class GaussianNB(_BaseNB):
    """
    Gaussian Naive Bayes (GaussianNB)

    Can perform online updates to model parameters via :meth:`partial_fit`.
    For details on algorithm used to update feature means and variance online,
    see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:

        http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

    Read more in the :ref:`User Guide <gaussian_naive_bayes>`.

    Parameters
    ----------
    priors : array-like of shape (n_classes,)
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.

    var_smoothing : float, default=1e-9
        Portion of the largest variance of all features that is added to
        variances for calculation stability.

        .. versionadded:: 0.20

    Attributes
    ----------
    class_count_ : ndarray of shape (n_classes,)
        number of training samples observed in each class.

    class_prior_ : ndarray of shape (n_classes,)
        probability of each class.

    classes_ : ndarray of shape (n_classes,)
        class labels known to the classifier

    epsilon_ : float
        absolute additive value to variances

    sigma_ : ndarray of shape (n_classes, n_features)
        variance of each feature per class

    theta_ : ndarray of shape (n_classes, n_features)
        mean of each feature per class

    Examples
    --------
    >>> import numpy as np
    >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    >>> Y = np.array([1, 1, 1, 2, 2, 2])
    >>> from sklearn.naive_bayes import GaussianNB
    >>> clf = GaussianNB()
    >>> clf.fit(X, Y)
    GaussianNB()
    >>> print(clf.predict([[-0.8, -1]]))
    [1]
    >>> clf_pf = GaussianNB()
    >>> clf_pf.partial_fit(X, Y, np.unique(Y))
    GaussianNB()
    >>> print(clf_pf.predict([[-0.8, -1]]))
    [1]
    """

    @_deprecate_positional_args
    def __init__(self, *, priors=None, var_smoothing=1e-9):
        self.priors = priors
        self.var_smoothing = var_smoothing

    def fit(self, X, y, sample_weight=None):
        """Fit Gaussian Naive Bayes according to X, y

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.

        y : array-like of shape (n_samples,)
            Target values.

        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).

            .. versionadded:: 0.17
               Gaussian Naive Bayes supports fitting with *sample_weight*.

        Returns
        -------
        self : object
        """
        X, y = self._validate_data(X, y)
        y = column_or_1d(y, warn=True)
        return self._partial_fit(X, y, np.unique(y), _refit=True,
                                 sample_weight=sample_weight)

    def _check_X(self, X):
        return check_array(X)

    @staticmethod
    def _update_mean_variance(n_past, mu, var, X, sample_weight=None):
        """Compute online update of Gaussian mean and variance.

        Given starting sample count, mean, and variance, a new set of
        points X, and optionally sample weights, return the updated mean and
        variance. (NB - each dimension (column) in X is treated as independent
        -- you get variance, not covariance).

        Can take scalar mean and variance, or vector mean and variance to
        simultaneously update a number of independent Gaussians.

        See Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:

        http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

        Parameters
        ----------
        n_past : int
            Number of samples represented in old mean and variance. If sample
            weights were given, this should contain the sum of sample
            weights represented in old mean and variance.

        mu : array-like of shape (number of Gaussians,)
            Means for Gaussians in original set.

        var : array-like of shape (number of Gaussians,)
            Variances for Gaussians in original set.

        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).

        Returns
        -------
        total_mu : array-like of shape (number of Gaussians,)
            Updated mean for each Gaussian over the combined set.

        total_var : array-like of shape (number of Gaussians,)
            Updated variance for each Gaussian over the combined set.
        """
        if X.shape[0] == 0:
            return mu, var

        # Compute (potentially weighted) mean and variance of new datapoints
        if sample_weight is not None:
            n_new = float(sample_weight.sum())
            new_mu = np.average(X, axis=0, weights=sample_weight)
            new_var = np.average((X - new_mu) ** 2, axis=0,
                                 weights=sample_weight)
        else:
            n_new = X.shape[0]
            new_var = np.var(X, axis=0)
            new_mu = np.mean(X, axis=0)

        if n_past == 0:
            return new_mu, new_var

        n_total = float(n_past + n_new)

        # Combine mean of old and new data, taking into consideration
        # (weighted) number of observations
        total_mu = (n_new * new_mu + n_past * mu) / n_total

        # Combine variance of old and new data, taking into consideration
        # (weighted) number of observations. This is achieved by combining
        # the sum-of-squared-differences (ssd)
        old_ssd = n_past * var
        new_ssd = n_new * new_var
        total_ssd = (old_ssd + new_ssd +
                     (n_new * n_past / n_total) * (mu - new_mu) ** 2)
        total_var = total_ssd / n_total

        return total_mu, total_var

    def partial_fit(self, X, y, classes=None, sample_weight=None):
        """Incremental fit on a batch of samples.

        This method is expected to be called several times consecutively
        on different chunks of a dataset so as to implement out-of-core
        or online learning.

        This is especially useful when the whole dataset is too big to fit in
        memory at once.

        This method has some performance and numerical stability overhead,
        hence it is better to call partial_fit on chunks of data that are
        as large as possible (as long as fitting in the memory budget) to
        hide the overhead.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        y : array-like of shape (n_samples,)
            Target values.

        classes : array-like of shape (n_classes,), default=None
            List of all the classes that can possibly appear in the y vector.

            Must be provided at the first call to partial_fit, can be omitted
            in subsequent calls.

        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).

            .. versionadded:: 0.17

        Returns
        -------
        self : object
        """
        return self._partial_fit(X, y, classes, _refit=False,
                                 sample_weight=sample_weight)

    def _partial_fit(self, X, y, classes=None, _refit=False,
                     sample_weight=None):
        """Actual implementation of Gaussian NB fitting.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        y : array-like of shape (n_samples,)
            Target values.

        classes : array-like of shape (n_classes,), default=None
            List of all the classes that can possibly appear in the y vector.

            Must be provided at the first call to partial_fit, can be omitted
            in subsequent calls.

        _refit : bool, default=False
            If true, act as though this were the first time we called
            _partial_fit (ie, throw away any past fitting and start over).

        sample_weight : array-like of shape (n_samples,), default=None
            Weights applied to individual samples (1. for unweighted).

        Returns
        -------
        self : object
        """
        X, y = check_X_y(X, y)
        if sample_weight is not None:
            sample_weight = _check_sample_weight(sample_weight, X)

        # If the ratio of data variance between dimensions is too small, it
        # will cause numerical errors. To address this, we artificially
        # boost the variance by epsilon, a small fraction of the standard
        # deviation of the largest dimension.
        self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()

        if _refit:
            self.classes_ = None

        if _check_partial_fit_first_call(self, classes):
            # This is the first call to partial_fit:
            # initialize various cumulative counters
            n_features = X.shape[1]
            n_classes = len(self.classes_)
            self.theta_ = np.zeros((n_classes, n_features))
            self.sigma_ = np.zeros((n_classes, n_features))

            self.class_count_ = np.zeros(n_classes, dtype=np.float64)

            # Initialise the class prior
            # Take into account the priors
            if self.priors is not None:
                priors = np.asarray(self.priors)
                # Check that the provide prior match the number of classes
                if len(priors) != n_classes:
                    raise ValueError('Number of priors must match number of'
                                     ' classes.')
                # Check that the sum is 1
                if not np.isclose(priors.sum(), 1.0):
                    raise ValueError('The sum of the priors should be 1.')
                # Check that the prior are non-negative
                if (priors < 0).any():
                    raise ValueError('Priors must be non-negative.')
                self.class_prior_ = priors
            else:
                # Initialize the priors to zeros for each class
                self.class_prior_ = np.zeros(len(self.classes_),
                                             dtype=np.float64)
        else:
            if X.shape[1] != self.theta_.shape[1]:
                msg = "Number of features %d does not match previous data %d."
                raise ValueError(msg % (X.shape[1], self.theta_.shape[1]))
            # Put epsilon back in each time
            self.sigma_[:, :] -= self.epsilon_

        classes = self.classes_

        unique_y = np.unique(y)
        unique_y_in_classes = np.in1d(unique_y, classes)

        if not np.all(unique_y_in_classes):
            raise ValueError("The target label(s) %s in y do not exist in the "
                             "initial classes %s" %
                             (unique_y[~unique_y_in_classes], classes))

        for y_i in unique_y:
            i = classes.searchsorted(y_i)
            X_i = X[y == y_i, :]

            if sample_weight is not None:
                sw_i = sample_weight[y == y_i]
                N_i = sw_i.sum()
            else:
                sw_i = None
                N_i = X_i.shape[0]

            new_theta, new_sigma = self._update_mean_variance(
                self.class_count_[i], self.theta_[i, :], self.sigma_[i, :],
                X_i, sw_i)

            self.theta_[i, :] = new_theta
            self.sigma_[i, :] = new_sigma
            self.class_count_[i] += N_i

        self.sigma_[:, :] += self.epsilon_

        # Update if only no priors is provided
        if self.priors is None:
            # Empirical prior, with sample_weight taken into account
            self.class_prior_ = self.class_count_ / self.class_count_.sum()

        return self

    def _joint_log_likelihood(self, X):
        joint_log_likelihood = []
        for i in range(np.size(self.classes_)):
            jointi = np.log(self.class_prior_[i])
            n_ij = -0.5 * np.sum(np.log(2. * np.pi * self.sigma_[i, :]))
            n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
                                 (self.sigma_[i, :]), 1)
            joint_log_likelihood.append(jointi + n_ij)

        joint_log_likelihood = np.array(joint_log_likelihood).T
        return joint_log_likelihood


class MultinomialNB Found at: sklearn.naive_bayes

class MultinomialNB(_BaseDiscreteNB):
    """
    Naive Bayes classifier for multinomial models

    The multinomial Naive Bayes classifier is suitable for classification with
    discrete features (e.g., word counts for text classification). The
    multinomial distribution normally requires integer feature counts. However,
    in practice, fractional counts such as tf-idf may also work.

    Read more in the :ref:`User Guide <multinomial_naive_bayes>`.

    Parameters
    ----------
    alpha : float, default=1.0
        Additive (Laplace/Lidstone) smoothing parameter
        (0 for no smoothing).

    fit_prior : bool, default=True
        Whether to learn class prior probabilities or not.
        If false, a uniform prior will be used.

    class_prior : array-like of shape (n_classes,), default=None
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.

    Attributes
    ----------
    class_count_ : ndarray of shape (n_classes,)
        Number of samples encountered for each class during fitting. This
        value is weighted by the sample weight when provided.

    class_log_prior_ : ndarray of shape (n_classes, )
        Smoothed empirical log probability for each class.

    classes_ : ndarray of shape (n_classes,)
        Class labels known to the classifier

    coef_ : ndarray of shape (n_classes, n_features)
        Mirrors ``feature_log_prob_`` for interpreting MultinomialNB
        as a linear model.

    feature_count_ : ndarray of shape (n_classes, n_features)
        Number of samples encountered for each (class, feature)
        during fitting. This value is weighted by the sample weight when
        provided.

    feature_log_prob_ : ndarray of shape (n_classes, n_features)
        Empirical log probability of features
        given a class, ``P(x_i|y)``.

    intercept_ : ndarray of shape (n_classes, )
        Mirrors ``class_log_prior_`` for interpreting MultinomialNB
        as a linear model.

    n_features_ : int
        Number of features of each sample.

    Examples
    --------
    >>> import numpy as np
    >>> rng = np.random.RandomState(1)
    >>> X = rng.randint(5, size=(6, 100))
    >>> y = np.array([1, 2, 3, 4, 5, 6])
    >>> from sklearn.naive_bayes import MultinomialNB
    >>> clf = MultinomialNB()
    >>> clf.fit(X, y)
    MultinomialNB()
    >>> print(clf.predict(X[2:3]))
    [3]

    Notes
    -----
    For the rationale behind the names `coef_` and `intercept_`, i.e.
    naive Bayes as a linear classifier, see J. Rennie et al. (2003),
    Tackling the poor assumptions of naive Bayes text classifiers, ICML.

    References
    ----------
    C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
    Information Retrieval. Cambridge University Press, pp. 234-265.
    https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
    """

    @_deprecate_positional_args
    def __init__(self, *, alpha=1.0, fit_prior=True, class_prior=None):
        self.alpha = alpha
        self.fit_prior = fit_prior
        self.class_prior = class_prior

    def _more_tags(self):
        return {'requires_positive_X': True}

    def _count(self, X, Y):
        """Count and smooth feature occurrences."""
        check_non_negative(X, "MultinomialNB (input X)")
        self.feature_count_ += safe_sparse_dot(Y.T, X)
        self.class_count_ += Y.sum(axis=0)

    def _update_feature_log_prob(self, alpha):
        """Apply smoothing to raw counts and recompute log probabilities"""
        smoothed_fc = self.feature_count_ + alpha
        smoothed_cc = smoothed_fc.sum(axis=1)

        self.feature_log_prob_ = (np.log(smoothed_fc) -
                                  np.log(smoothed_cc.reshape(-1, 1)))

    def _joint_log_likelihood(self, X):
        """Calculate the posterior log probability of the samples X"""
        return (safe_sparse_dot(X, self.feature_log_prob_.T) +
                self.class_log_prior_)


class BernoulliNB Found at: sklearn.naive_bayes

class BernoulliNB(_BaseDiscreteNB):
    """Naive Bayes classifier for multivariate Bernoulli models.

    Like MultinomialNB, this classifier is suitable for discrete data. The
    difference is that while MultinomialNB works with occurrence counts,
    BernoulliNB is designed for binary/boolean features.

    Read more in the :ref:`User Guide <bernoulli_naive_bayes>`.

    Parameters
    ----------
    alpha : float, default=1.0
        Additive (Laplace/Lidstone) smoothing parameter
        (0 for no smoothing).

    binarize : float or None, default=0.0
        Threshold for binarizing (mapping to booleans) of sample features.
        If None, input is presumed to already consist of binary vectors.

    fit_prior : bool, default=True
        Whether to learn class prior probabilities or not.
        If false, a uniform prior will be used.

    class_prior : array-like of shape (n_classes,), default=None
        Prior probabilities of the classes. If specified the priors are not
        adjusted according to the data.

    Attributes
    ----------
    class_count_ : ndarray of shape (n_classes)
        Number of samples encountered for each class during fitting. This
        value is weighted by the sample weight when provided.

    class_log_prior_ : ndarray of shape (n_classes)
        Log probability of each class (smoothed).

    classes_ : ndarray of shape (n_classes,)
        Class labels known to the classifier

    feature_count_ : ndarray of shape (n_classes, n_features)
        Number of samples encountered for each (class, feature)
        during fitting. This value is weighted by the sample weight when
        provided.

    feature_log_prob_ : ndarray of shape (n_classes, n_features)
        Empirical log probability of features given a class, P(x_i|y).

    n_features_ : int
        Number of features of each sample.

    Examples
    --------
    >>> import numpy as np
    >>> rng = np.random.RandomState(1)
    >>> X = rng.randint(5, size=(6, 100))
    >>> Y = np.array([1, 2, 3, 4, 4, 5])
    >>> from sklearn.naive_bayes import BernoulliNB
    >>> clf = BernoulliNB()
    >>> clf.fit(X, Y)
    BernoulliNB()
    >>> print(clf.predict(X[2:3]))
    [3]

    References
    ----------
    C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
    Information Retrieval. Cambridge University Press, pp. 234-265.
    https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

    A. McCallum and K. Nigam (1998). A comparison of event models for naive
    Bayes text classification. Proc. AAAI/ICML-98 Workshop on Learning for
    Text Categorization, pp. 41-48.

    V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with
    naive Bayes -- Which naive Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).
    """

    @_deprecate_positional_args
    def __init__(self, *, alpha=1.0, binarize=.0, fit_prior=True,
                 class_prior=None):
        self.alpha = alpha
        self.binarize = binarize
        self.fit_prior = fit_prior
        self.class_prior = class_prior

    def _check_X(self, X):
        X = super()._check_X(X)
        if self.binarize is not None:
            X = binarize(X, threshold=self.binarize)
        return X

    def _check_X_y(self, X, y):
        X, y = super()._check_X_y(X, y)
        if self.binarize is not None:
            X = binarize(X, threshold=self.binarize)
        return X, y

    def _count(self, X, Y):
        """Count and smooth feature occurrences."""
        self.feature_count_ += safe_sparse_dot(Y.T, X)
        self.class_count_ += Y.sum(axis=0)

    def _update_feature_log_prob(self, alpha):
        """Apply smoothing to raw counts and recompute log probabilities"""
        smoothed_fc = self.feature_count_ + alpha
        smoothed_cc = self.class_count_ + alpha * 2

        self.feature_log_prob_ = (np.log(smoothed_fc) -
                                  np.log(smoothed_cc.reshape(-1, 1)))

    def _joint_log_likelihood(self, X):
        """Calculate the posterior log probability of the samples X"""
        n_classes, n_features = self.feature_log_prob_.shape
        n_samples, n_features_X = X.shape

        if n_features_X != n_features:
            raise ValueError("Expected input with %d features, got %d instead"
                             % (n_features, n_features_X))

        neg_prob = np.log(1 - np.exp(self.feature_log_prob_))
        # Compute  neg_prob · (1 - X).T  as  ∑neg_prob - X · neg_prob
        jll = safe_sparse_dot(X, (self.feature_log_prob_ - neg_prob).T)
        jll += self.class_log_prior_ + neg_prob.sum(axis=1)

        return jll
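The online update in `_update_mean_variance` merges an existing (count, mean, variance) summary with a new batch by combining sums of squared differences. The sketch below restates the unweighted case in plain NumPy and checks it against a single-pass computation. The function name `merge_mean_variance` and the random data are assumptions for illustration; this is not a replacement for the library routine.

```python
import numpy as np

def merge_mean_variance(n_past, mu, var, X):
    """Combine an old (count, mean, variance) summary with a new batch X.

    Unweighted restatement of the update in _update_mean_variance:
    variances are population variances (ddof=0), one per column.
    """
    if X.shape[0] == 0:
        return mu, var
    n_new = X.shape[0]
    new_mu = X.mean(axis=0)
    new_var = X.var(axis=0)
    if n_past == 0:
        return new_mu, new_var
    n_total = float(n_past + n_new)
    # Weighted average of the two means.
    total_mu = (n_new * new_mu + n_past * mu) / n_total
    # Sum of squared differences of both parts plus a between-batch term.
    total_ssd = (n_past * var + n_new * new_var
                 + (n_new * n_past / n_total) * (mu - new_mu) ** 2)
    return total_mu, total_ssd / n_total

rng = np.random.RandomState(0)
data = rng.normal(size=(100, 3))
first, second = data[:40], data[40:]

mu, var = merge_mean_variance(0, None, None, first)
mu, var = merge_mean_variance(len(first), mu, var, second)

print(np.allclose(mu, data.mean(axis=0)), np.allclose(var, data.var(axis=0)))
```

Because only the count, mean, and variance are carried between calls, the data can be processed in chunks, which is what `partial_fit` relies on.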

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

