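Preface
Our earlier crawlers simply imitated a browser and fetched pages directly, with no dynamic IP proxy and no User-Agent rotation, so the server could easily ban our IP. We therefore need IP proxies, but I don't want to pay for them. Free IP proxies can be found online, but most of them don't work and the rest are unstable, so we have to scrape and validate them ourselves.
This post records how to maintain a free IP proxy pool on a schedule, how to wrap general-purpose crawler utility classes that pick a random IP proxy and a random User-Agent for every request, and how to build a simple traffic crawler to verify the IP proxy pool and the User-Agent pool.
Main knowledge used: crawling and Spring Boot. The project combines several topics from earlier posts: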
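httpclient + jsoup: scraping and reading novels online
With htmlUnit on board, the super evolution of the little web spider
SpringBoot series: scheduled tasks
SpringBoot series: @Async for elegant asynchronous calls
SpringBoot series: Spring-Data-JPA
SpringBoot series: WebSocket
SpringBoot series: Thymeleaf templates
SpringBoot series: Logback logging, output to file and in real time to a web page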
common-spider
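Project structure
The pom declares the parent POM and pulls in the dependencies needed for basic crawling, plus the MySQL and JPA dependencies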
<!-- crawler -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.4</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpcore</artifactId>
    <version>4.4.9</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
<dependency>
    <groupId>net.sf.json-lib</groupId>
    <artifactId>json-lib</artifactId>
    <version>2.4</version>
    <classifier>jdk15</classifier>
</dependency>
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.32</version>
</dependency>

<!-- Spring Data JPA -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>

<!-- MySQL driver -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
</dependency>
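PS: the actual database connection settings must be configured in each concrete crawler project.
common-spider can then serve as a shared module that concrete crawler projects pull in through their pom.
Unified response object
The response object of an HttpClient request differs from WebClient's. To keep things consistent, we define a unified response object.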
import lombok.Data;

/**
 * Unified response object
 */
@Data
public class ResultVo<E> {

    private ResultVo(Integer statusCode, String statusMessage, E page) {
        this.statusCode = statusCode;
        this.statusMessage = statusMessage;
        this.page = page;
    }

    // response status code
    private Integer statusCode;

    // response message
    private String statusMessage;

    // response payload (the page)
    private E page;

    /**
     * Obtain an instance through the static factory method
     */
    public static <E> ResultVo<E> of(Integer statusCode, String statusMessage, E page) {
        return new ResultVo<>(statusCode, statusMessage, page);
    }
}
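As a usage sketch, here is a Lombok-free stand-in that compiles on its own. The class name `SimpleResultVo`, the `isSuccess()` helper and the sample values are my own illustrative assumptions, not part of the project; the point is only that callers see the same envelope whichever client produced the page.

```java
// Lombok-free stand-in for ResultVo so that this sketch compiles on its own.
// SimpleResultVo and isSuccess() are hypothetical names used for illustration.
final class SimpleResultVo<E> {
    private final Integer statusCode;
    private final String statusMessage;
    private final E page;

    private SimpleResultVo(Integer statusCode, String statusMessage, E page) {
        this.statusCode = statusCode;
        this.statusMessage = statusMessage;
        this.page = page;
    }

    // Static factory, mirroring ResultVo.of(...)
    static <E> SimpleResultVo<E> of(Integer statusCode, String statusMessage, E page) {
        return new SimpleResultVo<>(statusCode, statusMessage, page);
    }

    Integer getStatusCode() { return statusCode; }
    String getStatusMessage() { return statusMessage; }
    E getPage() { return page; }

    // Convenience check: true for any 2xx status code.
    boolean isSuccess() {
        return statusCode != null && statusCode >= 200 && statusCode < 300;
    }
}
```

A caller would then write something like `SimpleResultVo<String> vo = SimpleResultVo.of(200, "OK", html);` and branch on `vo.isSuccess()` before parsing `vo.getPage()`.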
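IP proxy pool
There are quite a few free IP proxies around, but most are unstable, so we have to scrape and validate them ourselves. This post mainly scrapes the free proxies listed on 89ip (http://www.89ip.cn/index_1.html): the first ten pages, 150 entries, of which roughly 50 pass validation. Two scheduled asynchronous tasks maintain the pool: one refreshes the IP proxy pool, currently once an hour; the other checks the IP proxy pool, currently every half hour. (Xici's free proxies have too few usable entries, so that source is commented out for now.)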
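To make the numbers above concrete, here is a minimal sketch of the inputs to the two tasks. It assumes that pages 2 to 10 follow the same `index_N.html` pattern as the first page; the class and constant names are hypothetical, and the real project schedules the work with Spring rather than with these constants.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the list-page URLs to scrape and records the
// two intervals mentioned in the text.
final class ProxySourcePages {
    // URL pattern inferred from the first page, http://www.89ip.cn/index_1.html (assumption).
    static final String PAGE_PATTERN = "http://www.89ip.cn/index_%d.html";

    // Intervals described in the text: refresh every hour, check every half hour.
    static final long REFRESH_INTERVAL_MINUTES = 60;
    static final long CHECK_INTERVAL_MINUTES = 30;

    // Returns the URLs of pages 1..pages, in order.
    static List<String> pageUrls(int pages) {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= pages; i++) {
            urls.add(String.format(PAGE_PATTERN, i));
        }
        return urls;
    }
}
```

`ProxySourcePages.pageUrls(10)` yields the ten list pages, which at 15 entries per page matches the 150 candidates mentioned above.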
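Freshly scraped IP proxies are stored in the database with the IP address as the primary key, so an entry that already exists is replaced and a new one is inserted. To check whether a proxy works, we use it to visit a site that reports the caller's external IP address (http://pv.sohu.com/cityjson). If the request succeeds and the external IP it returns equals the proxy's IP, the proxy works. Proxies that fail are removed from the database, and the in-memory IP proxy pool is refreshed once the check finishes.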
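The comparison step can be sketched as follows. This assumes the endpoint answers with a small JavaScript snippet of the form `var returnCitySN = {"cip": "1.2.3.4", ...};`; the class and method names are hypothetical, and the network call itself (made through the proxy) is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the "does the proxy really work" decision.
final class ProxyCheck {
    // Assumed response shape: var returnCitySN = {"cip": "1.2.3.4", "cid": "...", "cname": "..."};
    private static final Pattern CIP = Pattern.compile("\"cip\"\\s*:\\s*\"([^\"]+)\"");

    // Extracts the external IP reported by the site, or null if it cannot be found.
    static String extractExternalIp(String body) {
        if (body == null) {
            return null;
        }
        Matcher m = CIP.matcher(body);
        return m.find() ? m.group(1) : null;
    }

    // A proxy counts as working only if the request succeeded (HTTP 200)
    // and the reported external IP equals the proxy's own IP.
    static boolean isProxyWorking(int statusCode, String body, String proxyIp) {
        return statusCode == 200
                && proxyIp != null
                && proxyIp.equals(extractExternalIp(body));
    }
}
```

Comparing the reported IP, rather than only checking for a 200, filters out transparent proxies and proxies that silently fall back to a direct connection.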
IP proxy table structure (SQL)
/*
 Navicat Premium Data Transfer

 Source Server         : localhost
 Source Server Type    : MySQL
 Source Server Version : 50528
 Source Host           : localhost:3306
 Source Schema         : test

 Target Server Type    : MySQL
 Target Server Version : 50528
 File Encoding         : 65001

 Date: 13/08/2019 15:55:59
*/

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for spider_ip_proxy
-- ----------------------------
DROP TABLE IF EXISTS `spider_ip_proxy`;
CREATE TABLE `spider_ip_proxy` (
  `ip` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL COMMENT 'IP address',
  `port` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'port',
  `city` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'city',
  `operator` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'network operator',
  PRIMARY KEY (`ip`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;

SET FOREIGN_KEY_CHECKS = 1;
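JPA entity mapping
import javax.persistence.Entity;
import javax.persistence.Id;

import lombok.Data;

/**
 * Entity for the crawler's IP proxy pool
 */
@Data
@Entity(name = "spider_ip_proxy")
public class IpProxy {

    // IP address (primary key)
    @Id
    private String ip;

    // port
    private String port;

    // city
    private String city;

    // network operator
    private String operator;
}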
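User-Agent pool
I couldn't find a site that provides a User-Agent pool, so I collected a batch of User-Agent strings and stored them in the database to act as the pool. I feel that this many should be enough, so there is no scheduled task to refresh them.
User-Agent table structure and data (SQL)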
/*
 Navicat Premium Data Transfer

 Source Server         : localhost
 Source Server Type    : MySQL
 Source Server Version : 50528
 Source Host           : localhost:3306
 Source Schema         : test

 Target Server Type    : MySQL
 Target Server Version : 50528
 File Encoding         : 65001

 Date: 13/08/2019 15:58:07
*/

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for spider_user_agent
-- ----------------------------
DROP TABLE IF EXISTS `spider_user_agent`;
CREATE TABLE `spider_user_agent` (
  `user_agent` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL COMMENT 'User Agent',
  PRIMARY KEY (`user_agent`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;

-- ----------------------------
-- Records of spider_user_agent
-- ----------------------------
INSERT INTO `spider_user_agent` VALUES ('Chrome/10.0.648.133 Safari/534.16');
INSERT INTO `spider_user_agent` VALUES ('Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; 360SE) ');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; ');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER) ');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.3 Mobile/14E277 Safari/603.1.30');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.109 Mobile Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.70 Mobile Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Mobile Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; LGMS323 Build/KOT49I.MS32310c) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/67.0.3396.87 Mobile Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.109 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.70 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) ');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6)');
INSERT INTO `spider_user_agent` VALUES ('MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1');
INSERT INTO `spider_user_agent` VALUES ('NOKIA5700/ UCWEB7.0.2.37/28/999');
INSERT INTO `spider_user_agent` VALUES ('Openwave/ UCWEB7.0.2.37/28/999');
INSERT INTO `spider_user_agent` VALUES ('Opera/8.0 (Windows NT 5.1; U; en)');
INSERT INTO `spider_user_agent` VALUES ('Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10');
INSERT INTO `spider_user_agent` VALUES ('Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52');
INSERT INTO `spider_user_agent` VALUES ('Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11');
INSERT INTO `spider_user_agent` VALUES ('UCWEB7.0.2.37/28/999');

SET FOREIGN_KEY_CHECKS = 1;
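JPA entity mapping
import javax.persistence.Entity;
import javax.persistence.Id;

import lombok.Data;

/**
 * Entity for the crawler's User-Agent pool
 */
@Data
@Entity(name = "spider_user_agent")
public class UserAgent {

    // User-Agent string (primary key)
    @Id
    private String userAgent;
}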
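HttpClientUtil
HttpClient comes from the Apache HttpComponents (org.apache.http) packages. It can issue requests and fetch data with little effort, but it does not parse the DOM or execute JS and CSS, so we rely on Jsoup to parse the HTML document. The utility class holds the IP proxy pool and the User-Agent pool: every request picks a random IP proxy from the proxy pool and a random User-Agent string from the User-Agent pool. The IP proxy pool is refreshed by the scheduled task.
It exposes a static method that returns an HttpClient object and supports bypassing SSL certificate validation.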
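Bypassing SSL validation usually means installing a trust manager that accepts every certificate. The sketch below shows that idea with the JDK's own `javax.net.ssl` classes only; it is not the project's HttpClientUtil code, and the class name `TrustAllSsl` is hypothetical.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper: builds an SSLContext that skips certificate validation.
final class TrustAllSsl {

    static SSLContext trustAllContext() {
        // A trust manager whose checks never throw, so every certificate chain is accepted.
        TrustManager[] trustAll = {
            new X509TrustManager() {
                @Override
                public void checkClientTrusted(X509Certificate[] chain, String authType) {
                    // accept everything
                }

                @Override
                public void checkServerTrusted(X509Certificate[] chain, String authType) {
                    // accept everything
                }

                @Override
                public X509Certificate[] getAcceptedIssuers() {
                    return new X509Certificate[0];
                }
            }
        };
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, trustAll, new SecureRandom());
            return context;
        } catch (GeneralSecurityException e) {
            // "TLS" is available on every standard JDK, so this is not expected to happen.
            throw new IllegalStateException("Could not create trust-all SSLContext", e);
        }
    }
}
```

With HttpClient 4.5 such a context can, as far as I know, be handed to the builder via `HttpClients.custom().setSSLContext(...)`. Since this turns off certificate checks entirely, it only makes sense for throwaway crawling against sites with broken certificates.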
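WebClientUtil
WebClient comes from htmlunit. It simulates a browser: it parses the DOM and handles JS and CSS, so you can work with the HTML document much as you would manipulate DOM objects with jQuery. Like HttpClientUtil, the utility class holds the IP proxy pool and the User-Agent pool: every request picks a random IP proxy and a random User-Agent string, and the IP proxy pool is refreshed by the scheduled task.
It exposes a static method that returns a WebClient object with some features switched on.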
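Both utility classes depend on the same small step, picking a random element from a pool on every request. A minimal sketch of that step, assuming each pool is held in memory as a `List` (the class name `RandomPool` is hypothetical):

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: uniform random choice from an in-memory pool.
final class RandomPool {

    // Returns a random element, or null when the pool is null or empty
    // (for example before the first scheduled refresh has finished).
    static <T> T pick(List<T> pool) {
        if (pool == null || pool.isEmpty()) {
            return null;
        }
        int index = ThreadLocalRandom.current().nextInt(pool.size());
        return pool.get(index);
    }
}
```

Returning `null` for an empty pool lets the caller decide whether to go without a proxy or to skip the request.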
flow-spider
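The traffic crawler currently covers the following projects:
Boosting cnblogs view counts
We pull in common-spider and start writing the traffic crawler. The idea is to use WebClient to visit posts on cnblogs (博客园), switching the IP proxy and the User-Agent string and enabling JS execution. Everything is randomized: random proxy IP, random User-Agent, random interval between visits, random post. We could even send random cookies (this takes careful analysis of which cookies are actually sent and what rules their values follow; I suggest using Firefox for that). The goal is to imitate real user visits so that the post's view count goes up, commonly known as boosting (刷) the view count.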
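The random interval between visits can be sketched as below. The bounds are parameters because the text does not give concrete values, and the class name `RandomDelay` is hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: picks a random pause between two visits.
final class RandomDelay {

    // Returns a delay in milliseconds, uniformly chosen between
    // minSeconds and maxSeconds (both inclusive).
    static long nextDelayMillis(int minSeconds, int maxSeconds) {
        if (minSeconds < 0 || maxSeconds < minSeconds) {
            throw new IllegalArgumentException("require 0 <= minSeconds <= maxSeconds");
        }
        long lower = minSeconds * 1000L;
        long upperExclusive = maxSeconds * 1000L + 1;
        return ThreadLocalRandom.current().nextLong(lower, upperExclusive);
    }
}
```

A worker loop would call `Thread.sleep(RandomDelay.nextDelayMillis(min, max))` between requests; generous bounds keep the load on the target site low.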
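To watch the live log conveniently, we bring back the trick from an earlier post (SpringBoot series: Logback logging, output to file and in real time to a web page) and start building the project.
Project structure
In the pom, declare the parent POM and pull in common-spider, plus thymeleaf and websocket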
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <artifactId>flow-spider</artifactId>
    <version>0.0.1</version>
    <name>flow-spider</name>
    <description>Traffic crawler</description>

    <!-- parent POM -->
    <parent>
        <groupId>cn.huanzi.qch</groupId>
        <artifactId>parent</artifactId>
        <version>1.0.0</version>
    </parent>

    <dependencies>
        <dependency>
            <groupId>cn.huanzi.qch</groupId>
            <artifactId>common-spider</artifactId>
            <version>0.0.1</version>
        </dependency>

        <!-- Spring Boot websocket -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-websocket</artifactId>
        </dependency>

        <!-- thymeleaf templates -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-thymeleaf</artifactId>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>
The configuration file holds the database settings
# Database settings
spring.datasource.url=jdbc:mysql://localhost:3306/test?serverTimezone=GMT%2B8&characterEncoding=utf-8
spring.datasource.username=root
spring.datasource.password=123456
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
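The steps needed for the live log are not repeated here; see the earlier post.
Blog entity
For convenience, we scrape the list of blog posts and store it in the database.
Table structure (SQL)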
/*
 Navicat Premium Data Transfer

 Source Server         : localhost
 Source Server Type    : MySQL
 Source Server Version : 50528
 Source Host           : localhost:3306
 Source Schema         : test

 Target Server Type    : MySQL
 Target Server Version : 50528
 File Encoding         : 65001

 Date: 13/08/2019 16:48:14
*/

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for spider_blog
-- ----------------------------
DROP TABLE IF EXISTS `spider_blog`;
CREATE TABLE `spider_blog` (
  `blog_url` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL COMMENT 'blog post URL',
  `blog_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'blog post title',
  PRIMARY KEY (`blog_url`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;

SET FOREIGN_KEY_CHECKS = 1;
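JPA entity mapping
import javax.persistence.Entity;
import javax.persistence.Id;

import lombok.Data;

/**
 * Entity for a cnblogs blog post
 */
@Data
@Entity(name = "spider_blog")
public class Blog {

    // blog post URL (primary key)
    @Id
    private String blogUrl;

    // blog post title
    private String blogName;
}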
controller
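To save effort, we don't even write a service layer; the business logic goes straight into the controller.
Startup class
The startup class also needs some annotation configuration. By default Spring Boot only scans the current package and its sub-packages, so we have to add annotations that specify the scan paths before Spring can pick up the annotated classes in common-spider.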
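import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.autoconfigure.domain.EntityScan;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.data.jpa.repository.config.EnableJpaRepositories;
import org.springframework.scheduling.annotation.EnableScheduling;

// Lombok's @Slf4j creates the Logger object for us, equivalent to obtaining the logger by hand below
@Slf4j
// by default only the current package and its sub-packages are scanned
@SpringBootApplication
// scan for repositories
@EnableJpaRepositories(basePackages = {"cn.huanzi.qch.commonspider.repository", "cn.huanzi.qch.flowspider.cnblogs.repository"})
// scan for @Entity classes
@EntityScan(basePackages = {"cn.huanzi.qch.commonspider.pojo", "cn.huanzi.qch.flowspider.cnblogs.pojo"})
// scan for @Component-based annotations such as @Controller and @Service
@ComponentScan(basePackages = {"cn.huanzi.qch.commonspider.**", "cn.huanzi.qch.flowspider.**"})
// enable scheduled tasks
@EnableScheduling
public class FlowSpiderApplication {
    // part of the code omitted...
}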
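Because we specify the scan paths explicitly through annotations, Spring Boot's default scan path no longer applies, so every path that needs scanning has to be listed.
The project is now mostly configured. For convenience, we add a few buttons to the live-log page to trigger these functions by hand.
Result
The page looks roughly like this
So how well does this traffic crawler actually work? Here is the result of leaving it running from a little after 6 p.m. until a little after 9 a.m. the next day. The blog list held a single post (all the others were deleted), and that post's view count rose from 34 to 890.
More than a thousand successful requests, yet only about eight hundred extra views? And more than three thousand failures on top of that? Isn't that rather inefficient?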
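1. Free IP proxies are plentiful, but few of them really work, and those that do are unstable. A proxy may pass validation and then fail a few minutes later when you actually use it. For stable IP proxies, paying for them is the more reliable option.
2. The error "400 The plain HTTP request was sent to HTTPS port" shows up often, and I don't yet know how to fix it.
3. With small probability the same IP proxy is picked several times within a short period, and cnblogs does not count those visits.
4. Unknown reasons why some visits do not increase the view count...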
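PS:
As the saying goes, why should programmers make life hard for other programmers... Please don't set the random visit interval too short. We are doing this to learn, not to inflate traffic, and we should spare a thought for the cnblogs operations staff!
(Just quietly: you could write a scheduled task that refreshes the blog list, and the traffic bot would then run fully automatically. At the current rate it could contribute roughly 2,000 views a day; package it, deploy it to a cloud server, and it runs 24 hours a day without stopping [sly grin hidden].)
Also, after you check out the code, please don't all test against my blog; I'm afraid my account will get banned...
Addendum:
Well, this is awkward...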
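Boosting WeChat votes
For now this only works for votes that do not require WeChat login authorization, like the example below. The reason is discussed later.
Let's briefly analyze this type of WeChat vote and do some preparation first. Find the page link of a WeChat vote that is currently running (http://www.dzmshd.com/Home/index.php?m=Index&a=content&id=42&fid=8130&subscribe=1), right-click to view the source, and find the request link that casting a vote triggers.
PS: WeChat is crafty; the page can only be opened in WeChat's built-in browser...
Open the page in the WeChat desktop client, right-click it, and view the source.
WeChat generates a TXT file at this location and opens it for us automatically. We then search it by keyword for this js method and find the request link. With the parameters appended it is: http://www.dzmshd.com/Home/index.php?m=Index&a=vote&vid=8130&id=42&tp=
With the link found, we start writing code. As before, we simply put it in a controller to keep things simple.
Note that the User-Agent string has to be a WeChat one, so the User-Agent pool from earlier can't be used here. I found a few online.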
// WeChat User-Agent strings
String[] webKitUserAgent = {
        "Mozilla/5.0 (Linux; Android 7.1.1; MI 6 Build/NMF26X; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/6.2 TBS/043807 Mobile Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/WIFI Language/zh_CN",
        "Mozilla/5.0 (Linux; Android 7.1.1; OD103 Build/NMF26F; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043632 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/4G Language/zh_CN",
        "Mozilla/5.0 (Linux; Android 6.0.1; SM919 Build/MXB48T; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043632 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/WIFI Language/zh_CN",
        "Mozilla/5.0 (Linux; Android 5.1.1; vivo X6S A Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043632 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/WIFI Language/zh_CN",
        "Mozilla/5.0 (Linux; Android 5.1; HUAWEI TAG-AL00 Build/HUAWEITAG-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043622 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/4G Language/zh_CN",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 MicroMessenger/6.6.1 NetType/4G Language/zh_CN",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 11_2_2 like Mac OS X) AppleWebKit/604.4.7 (KHTML, like Gecko) Mobile/15C202 MicroMessenger/6.6.1 NetType/4G Language/zh_CN",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 11_1_1 like Mac OS X) AppleWebKit/604.3.5 (KHTML, like Gecko) Mobile/15B150 MicroMessenger/6.6.1 NetType/WIFI Language/zh_CN",
        "Mozilla/5.0 (iphone x Build/MXB48T; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043632 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/WIFI Language/zh_CN",
};
controller
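This link is fairly simple, a plain GET request, so HttpClientUtil is enough. Start the project and visit: http://localhost:10087/weChatVote/start
Result
As before, we send requests at random intervals and switch IP proxies; the User-Agent has to be one of the WeChat strings. After running for a short while:
The log shows 13 successes. Checking the page, the count has gone from 2983 to 2997; the one extra vote was probably cast by a real person at the same moment...
To make verification easier, let's find an entry with zero votes and try again. Everyone else has thousands of votes and this one has none, poor thing, so let's give it a little popularity (hehe~).
First find the request link: http://www.dzmshd.com/Home/index.php?m=Index&a=vote&vid=8679&id=42&tp=
Start the project and visit: http://localhost:10087/weChatVote/start
Result
After running for a short while it had gained 137 votes and jumped straight to 12th place (facepalm).
PS: many of the failures had this cause. We have too few proxy IPs, and some of them had already been used earlier to vote for the second-place entry, so those votes failed. Afterwards I refreshed the IP proxy pool, re-validated it, and carried on.
Votes that require login authorization are more troublesome. First, take a look at the general flow of WeChat web page authorization: (WeChat Official Accounts Platform: https://mp.weixin.qq.com/wiki?t=resource/res_main&id=mp1421140842)
An ordinary browser can't be used to debug and inspect WeChat's links; you need packet-capture software such as Fiddler to analyze them.
If the parameters are set wrong, you can't even reach the authorization page.
Forcing a request to the link found in the source returns this error page: because parameters are missing, it can't even redirect to the authorization page.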
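Afterword
A scheduled task refreshes the free IP proxies automatically, and every request uses a random interval, a random IP and a random User-Agent, and could even carry random cookies, imitating requests that a real user would send from a browser.
That's all for this post. To be clear, these techniques are for study and research only; please don't use them in ways that break the law. Discussion is welcome.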
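Upgrade
The two utility classes originally supported only GET requests; they now support POST requests as well. One thing to watch out for: in my tests, POST parameters have to be set in one of two ways, depending on the server, before the backend receives them:
1. If the server side uses @RequestBody, the request header must be Content-type=application/json; charset=UTF-8, and the parameters go in the request body.
2. If the server side does not use @RequestBody, the request header must be Content-type=application/x-www-form-urlencoded; charset=UTF-8, and the parameters go in the URL parameters.
Both variants are currently written into the code, with one of them commented out by default. Adjust and extend them as needed when you use them.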
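For the second case, the parameters have to be URL-encoded before they are appended to the request URL. Here is a minimal sketch using only the JDK; the class and method names are hypothetical, and the two constants simply repeat the header values listed above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for preparing POST parameters in the two cases.
final class PostParams {
    // Case 1: server uses @RequestBody, parameters travel as a JSON body.
    static final String JSON_TYPE = "application/json; charset=UTF-8";
    // Case 2: server has no @RequestBody, parameters travel as URL parameters.
    static final String FORM_TYPE = "application/x-www-form-urlencoded; charset=UTF-8";

    // Encodes the parameters as key=value pairs joined with '&'.
    static String toFormEncoded(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            joiner.add(encode(entry.getKey()) + "=" + encode(entry.getValue()));
        }
        return joiner.toString();
    }

    // Appends the encoded parameters to the URL, using '?' or '&' as appropriate.
    static String appendToUrl(String url, Map<String, String> params) {
        String query = toFormEncoded(params);
        if (query.isEmpty()) {
            return url;
        }
        return url + (url.contains("?") ? "&" : "?") + query;
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

For the first case the same map would instead be serialized to JSON (for example with the json-lib dependency already in the pom) and sent as the body together with `JSON_TYPE`.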
Open source
The code is open source and hosted on my GitHub and Gitee (码云):
GitHub: https://github.com/huanzi-qch/spider
Gitee: https://gitee.com/huanzi-qch/spider