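Web crawlers have been popular lately, so I spent some spare time learning about them. This is my first crawler experiment.
The goal is to crawl the 影驰币 (影驰 coin) leaderboard from 影驰's World Cup themed event page.
The original web page looks like this:
Two packages are mainly used: jsoup (for parsing HTML) and fastjson (for parsing JSON data).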
<!-- HTML parsing tool: jsoup begin -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
</dependency>
<!-- HTML parsing tool: end -->
<!-- alibaba fastjson package -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.46</version>
</dependency>
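Crawling the link from the address bar directly (on Windows you can press Ctrl+U beforehand to view the page source) gives the result below. It does not contain the data we want.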
<!doctype html>
<html lang="zh-CN"><head> <title>竞猜排行</title> <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> <meta name="renderer" content="webkit"> <meta name="force-rendering" content="webkit"> <meta charset="utf-8"> <link href="/Content/style.css?v=1.2" rel="stylesheet"> <!--[if lt IE 10]><script type="text/javascript" src="/Script/util/PIE.js"></script>
<![endif]--> <!--[if lt IE 8]>
<script src="/Script/util/json2.js"></script>
<![endif]--> </head> <body class="guess1"> <div class="head head2"> <div class="w center fzero relative"> <a href="http://www.szgalaxy.com">影驰官网</a> <div class="absolute right top"> <a href="/">社区首页</a> <!--<a href="/nvideo/?id=1">MOD</a>--> <a href="/nvideo/?id=2">超频</a> <a href="/nvideo/?id=3">游戏</a> <!--<a href="/nvideo/?id=4">新奇特</a>--> <a href="/topic/">话题</a> <a href="/active/">活动中心</a> <a href="/tryout/">0元试用</a> <a href="/integralshop/">积分商城</a><em>|</em> <span class="text-center" id="UserForm"> <a href="javascript:">登录</a><em></em><a href="/register/" class="no-ml">注册</a> </span> <a href="javascript:" data-score="true">签到</a> <a href="/topic/create/">发表</a> </div> </div> </div> <div class="guess_focus"> <img src="/Content/Images/guess/focus.png" id="JFocus" width="100%"> <ul class="guess_block" id="SaiBlock"></ul> <ul class="guess_time" id="SaiTime"></ul> </div> <div class="guess_main"> <div class="guess_main_right"> <div class="full fzero">排行榜谁是预言帝<div></div></div> </div> <marquee direction="left" onmouseout="this.start()" onmouseover="this.stop()">注意:小组赛阶段中奖名单已公布!淘汰赛阶段竞猜已经开始,所有玩家影驰币将回到同样的起点(1000影驰币),搏一搏,单车变摩托,万元主机等着你! </marquee> <table cellpadding="0" cellspacing="0" border="0"> <colgroup> <col width="68"> <col width="230"> <col width="652"> </colgroup> <tbody><tr><th>排名</th><th>影驰币</th><th class="text-left">用户</th></tr> </tbody><tbody id="GuessTop"></tbody> </table> <div class="page relative"> <div class="absolute right fzero" id="Page"></div> </div> </div> <p class="text-center">版权所有 影驰科技 粤ICP备14038543号</p> <script src="/Script/Config.js"></script> <script src="/Script/public.min.js?v=1"></script> <script src="/Script/logic/active.guess.top.js"></script> </body>
</html>
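Looking closer, the ranking data is filled in after the page loads: in the HTML above, <tbody id="GuessTop"></tbody> is empty, so the rows are presumably added by one of the scripts. After inspecting the page with F12, I found the URL that actually returns the data: String url2 = "https://bbs.szgalaxy.com/api/PcGuess/GetPredictionRankList?groupKind=0&oid=bd95f39e-c222-467d-88cd-102012d4315f&pageNum=0&pageSize=20&UserToken=00000000-0000-0000-0000-000000000000". The final result is below. (Since this is only practice, I did not crawl everything, just the first 20 records.)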
<html><head></head><body>{"code":"1","msg":"成功","data":"[{\"Rank\":1,\"AwardRank\":1,\"WinQty\":131571,\"Nickname\":\"小北\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(0398bc8b-d24b-4899-875c-31e8bd157f1e)%26ver%3d3%26prop%3dPhoto\"},{\"Rank\":2,\"AwardRank\":2,\"WinQty\":118845,\"Nickname\":\"小北\",\"MemberPhoto\":null},{\"Rank\":3,\"AwardRank\":2,\"WinQty\":112564,\"Nickname\":\"技飞狗跳\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(9139e694-daa9-4c26-b5b3-1d11750a8f4a)%26ver%3d2%26prop%3dPhoto\"},{\"Rank\":4,\"AwardRank\":2,\"WinQty\":90038,\"Nickname\":\"西风\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(91dcfc25-4048-4630-a4a5-c7942445a51a)%26ver%3d4%26prop%3dPhoto\"},{\"Rank\":5,\"AwardRank\":3,\"WinQty\":88413,\"Nickname\":\"90后大叔\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(feaa5a67-95cb-44da-ab16-0ec14404aa48)%26ver%3d3%26prop%3dPhoto\"},{\"Rank\":6,\"AwardRank\":3,\"WinQty\":77571,\"Nickname\":\"云飞\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(bb149ecc-d1f2-4a43-b4ba-59d139dc3eba)%26ver%3d5%26prop%3dPhoto\"},{\"Rank\":7,\"AwardRank\":3,\"WinQty\":51719,\"Nickname\":\"张平\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(d4f2f8f7-8ff8-4b80-8de7-65ab4af76851)%26ver%3d4%26prop%3dPhoto\"},{\"Rank\":8,\"AwardRank\":4,\"WinQty\":46345,\"Nickname\":\"S&J\",\"MemberPhoto\":null},{\"Rank\":9,\"AwardRank\":4,\"WinQty\":43914,\"Nickname\":\"蔡卓桁. 
Aaron\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(af82e341-2af3-4127-ada1-1dd5874d72f0)%26ver%3d3%26prop%3dPhoto\"},{\"Rank\":10,\"AwardRank\":4,\"WinQty\":28279,\"Nickname\":\"北极星\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(a14fd68f-e7de-4a12-bdc1-d57082812935)%26ver%3d4%26prop%3dPhoto\"},{\"Rank\":11,\"AwardRank\":4,\"WinQty\":27711,\"Nickname\":\"宋颖\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(457563af-44f5-4a0c-9edc-a86f5cc0553e)%26ver%3d5%26prop%3dPhoto\"},{\"Rank\":12,\"AwardRank\":4,\"WinQty\":22714,\"Nickname\":\"春之声42\",\"MemberPhoto\":null},{\"Rank\":13,\"AwardRank\":4,\"WinQty\":20219,\"Nickname\":\"陳傑\",\"MemberPhoto\":null},{\"Rank\":14,\"AwardRank\":4,\"WinQty\":18926,\"Nickname\":\"qeqe\",\"MemberPhoto\":null},{\"Rank\":15,\"AwardRank\":4,\"WinQty\":16370,\"Nickname\":\"Alex\",\"MemberPhoto\":null},{\"Rank\":16,\"AwardRank\":4,\"WinQty\":15836,\"Nickname\":\"Mr.王\",\"MemberPhoto\":null},{\"Rank\":17,\"AwardRank\":4,\"WinQty\":13776,\"Nickname\":\"高攀\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(d8a0356a-1c8f-4f05-8ee7-570d5df8f46b)%26ver%3d2%26prop%3dPhoto\"},{\"Rank\":18,\"AwardRank\":5,\"WinQty\":10839,\"Nickname\":\"寳_爺\",\"MemberPhoto\":null},{\"Rank\":19,\"AwardRank\":5,\"WinQty\":10780,\"Nickname\":\"习惯一个人\",\"MemberPhoto\":null},{\"Rank\":20,\"AwardRank\":5,\"WinQty\":10616,\"Nickname\":\"星之所在\",\"MemberPhoto\":null}]","tag":"{\"IsDemo\":false,\"Timestamp\":0,\"ReGet\":false,\"Dict\":{\"records\":\"1511\"}}"}</body>
</html>
处理数据后:----------------------------------
Rank[排名]: 1
AwardRank: 1
WinQty: 131571
Nickname: 小北
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(0398bc8b-d24b-4899-875c-31e8bd157f1e)%26ver%3d3%26prop%3dPhoto
-------------------------------------
Rank[排名]: 2
AwardRank: 2
WinQty: 118845
Nickname: 小北
MemberPhoto: null
-------------------------------------
Rank[排名]: 3
AwardRank: 2
WinQty: 112564
Nickname: 技飞狗跳
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(9139e694-daa9-4c26-b5b3-1d11750a8f4a)%26ver%3d2%26prop%3dPhoto
-------------------------------------
Rank[排名]: 4
AwardRank: 2
WinQty: 90038
Nickname: 西风
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(91dcfc25-4048-4630-a4a5-c7942445a51a)%26ver%3d4%26prop%3dPhoto
-------------------------------------
Rank[排名]: 5
AwardRank: 3
WinQty: 88413
Nickname: 90后大叔
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(feaa5a67-95cb-44da-ab16-0ec14404aa48)%26ver%3d3%26prop%3dPhoto
-------------------------------------
Rank[排名]: 6
AwardRank: 3
WinQty: 77571
Nickname: 云飞
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(bb149ecc-d1f2-4a43-b4ba-59d139dc3eba)%26ver%3d5%26prop%3dPhoto
-------------------------------------
Rank[排名]: 7
AwardRank: 3
WinQty: 51719
Nickname: 张平
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(d4f2f8f7-8ff8-4b80-8de7-65ab4af76851)%26ver%3d4%26prop%3dPhoto
-------------------------------------
Rank[排名]: 8
AwardRank: 4
WinQty: 46345
Nickname: S&J
MemberPhoto: null
-------------------------------------
Rank[排名]: 9
AwardRank: 4
WinQty: 43914
Nickname: 蔡卓桁. Aaron
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(af82e341-2af3-4127-ada1-1dd5874d72f0)%26ver%3d3%26prop%3dPhoto
-------------------------------------
Rank[排名]: 10
AwardRank: 4
WinQty: 28279
Nickname: 北极星
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(a14fd68f-e7de-4a12-bdc1-d57082812935)%26ver%3d4%26prop%3dPhoto
-------------------------------------
Rank[排名]: 11
AwardRank: 4
WinQty: 27711
Nickname: 宋颖
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(457563af-44f5-4a0c-9edc-a86f5cc0553e)%26ver%3d5%26prop%3dPhoto
-------------------------------------
Rank[排名]: 12
AwardRank: 4
WinQty: 22714
Nickname: 春之声42
MemberPhoto: null
-------------------------------------
Rank[排名]: 13
AwardRank: 4
WinQty: 20219
Nickname: 陳傑
MemberPhoto: null
-------------------------------------
Rank[排名]: 14
AwardRank: 4
WinQty: 18926
Nickname: qeqe
MemberPhoto: null
-------------------------------------
Rank[排名]: 15
AwardRank: 4
WinQty: 16370
Nickname: Alex
MemberPhoto: null
-------------------------------------
Rank[排名]: 16
AwardRank: 4
WinQty: 15836
Nickname: Mr.王
MemberPhoto: null
-------------------------------------
Rank[排名]: 17
AwardRank: 4
WinQty: 13776
Nickname: 高攀
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(d8a0356a-1c8f-4f05-8ee7-570d5df8f46b)%26ver%3d2%26prop%3dPhoto
-------------------------------------
Rank[排名]: 18
AwardRank: 5
WinQty: 10839
Nickname: 寳_爺
MemberPhoto: null
-------------------------------------
Rank[排名]: 19
AwardRank: 5
WinQty: 10780
Nickname: 习惯一个人
MemberPhoto: null
-------------------------------------
Rank[排名]: 20
AwardRank: 5
WinQty: 10616
Nickname: 星之所在
MemberPhoto: null
-------------------------------------
[{"AwardRank":1,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(0398bc8b-d24b-4899-875c-31e8bd157f1e)%26ver%3d3%26prop%3dPhoto","Rank":1,"WinQty":131571,"Nickname":"小北"},{"AwardRank":2,"Rank":2,"WinQty":118845,"Nickname":"小北"},{"AwardRank":2,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(9139e694-daa9-4c26-b5b3-1d11750a8f4a)%26ver%3d2%26prop%3dPhoto","Rank":3,"WinQty":112564,"Nickname":"技飞狗跳"},{"AwardRank":2,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(91dcfc25-4048-4630-a4a5-c7942445a51a)%26ver%3d4%26prop%3dPhoto","Rank":4,"WinQty":90038,"Nickname":"西风"},{"AwardRank":3,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(feaa5a67-95cb-44da-ab16-0ec14404aa48)%26ver%3d3%26prop%3dPhoto","Rank":5,"WinQty":88413,"Nickname":"90后大叔"},{"AwardRank":3,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(bb149ecc-d1f2-4a43-b4ba-59d139dc3eba)%26ver%3d5%26prop%3dPhoto","Rank":6,"WinQty":77571,"Nickname":"云飞"},{"AwardRank":3,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(d4f2f8f7-8ff8-4b80-8de7-65ab4af76851)%26ver%3d4%26prop%3dPhoto","Rank":7,"WinQty":51719,"Nickname":"张平"},{"AwardRank":4,"Rank":8,"WinQty":46345,"Nickname":"S&J"},{"AwardRank":4,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(af82e341-2af3-4127-ada1-1dd5874d72f0)%26ver%3d3%26prop%3dPhoto","Rank":9,"WinQty":43914,"Nickname":"蔡卓桁. 
Aaron"},{"AwardRank":4,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(a14fd68f-e7de-4a12-bdc1-d57082812935)%26ver%3d4%26prop%3dPhoto","Rank":10,"WinQty":28279,"Nickname":"北极星"},{"AwardRank":4,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(457563af-44f5-4a0c-9edc-a86f5cc0553e)%26ver%3d5%26prop%3dPhoto","Rank":11,"WinQty":27711,"Nickname":"宋颖"},{"AwardRank":4,"Rank":12,"WinQty":22714,"Nickname":"春之声42"},{"AwardRank":4,"Rank":13,"WinQty":20219,"Nickname":"陳傑"},{"AwardRank":4,"Rank":14,"WinQty":18926,"Nickname":"qeqe"},{"AwardRank":4,"Rank":15,"WinQty":16370,"Nickname":"Alex"},{"AwardRank":4,"Rank":16,"WinQty":15836,"Nickname":"Mr.王"},{"AwardRank":4,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(d8a0356a-1c8f-4f05-8ee7-570d5df8f46b)%26ver%3d2%26prop%3dPhoto","Rank":17,"WinQty":13776,"Nickname":"高攀"},{"AwardRank":5,"Rank":18,"WinQty":10839,"Nickname":"寳_爺"},{"AwardRank":5,"Rank":19,"WinQty":10780,"Nickname":"习惯一个人"},{"AwardRank":5,"Rank":20,"WinQty":10616,"Nickname":"星之所在"}]
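The `tag` field of the raw response reports `records` as 1511, while the request only asks for `pageSize=20` starting at `pageNum=0`. A minimal sketch of how the remaining pages could be addressed follows. It assumes that `pageNum` simply counts up from 0 and that the other query parameters stay fixed; neither has been verified against the site. The names `RankListPages`, `pageCount`, and `pageUrl` are made up for illustration.

```java
/** Sketch: page arithmetic and URL building for the ranking endpoint. */
final class RankListPages {
    // Endpoint and parameter values are copied from url2 in this post.
    static final String ENDPOINT =
            "https://bbs.szgalaxy.com/api/PcGuess/GetPredictionRankList";
    static final String OID = "bd95f39e-c222-467d-88cd-102012d4315f";
    static final String ANONYMOUS_TOKEN = "00000000-0000-0000-0000-000000000000";

    /** Number of requests needed to cover totalRecords rows (ceiling division). */
    static int pageCount(int totalRecords, int pageSize) {
        if (totalRecords < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("totalRecords >= 0 and pageSize > 0 required");
        }
        return (totalRecords + pageSize - 1) / pageSize;
    }

    /** URL for one page. ASSUMPTION: pages are numbered 0, 1, 2, ... */
    static String pageUrl(int pageNum, int pageSize) {
        return ENDPOINT
                + "?groupKind=0"
                + "&oid=" + OID
                + "&pageNum=" + pageNum
                + "&pageSize=" + pageSize
                + "&UserToken=" + ANONYMOUS_TOKEN;
    }

    public static void main(String[] args) {
        int pages = pageCount(1511, 20);
        System.out.println(pages + " pages");
        System.out.println(pageUrl(0, 20));
    }
}
```

With 1511 records and 20 per page, `pageCount` gives 76 requests, and `pageUrl(0, 20)` reproduces `url2` exactly.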
The simple code is as follows:
package com.hill.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

public class JsoupDemo {

    public static void main(String[] args) {
        // 影驰基友会 PC site, 影驰币 leaderboard page (its static HTML has no ranking rows; kept for reference)
        String url = "https://bbs.szgalaxy.com/active/guess/top/?oid=bd95f39e-c222-467d-88cd-102012d4315f&gid=c852015c-33e7-4a88-afe5-901266c0e3f6&gkind=1";
        // API address found with F12; this one returns the ranking data as JSON
        String url2 = "https://bbs.szgalaxy.com/api/PcGuess/GetPredictionRankList?groupKind=0&oid=bd95f39e-c222-467d-88cd-102012d4315f&pageNum=0&pageSize=20&UserToken=00000000-0000-0000-0000-000000000000";
        try {
            // ignoreContentType(true): without it, jsoup may throw an error because the response is not HTML.
            Document htmlPage = Jsoup.connect(url2)
                    .ignoreContentType(true)
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36")
                    .get();
            System.out.println(htmlPage);
            System.out.println("处理数据后:----------------------------------");

            // Parse the data: the body text is a JSON object, and its "data" field holds the ranking rows
            JSONObject json = (JSONObject) JSON.parse(htmlPage.text());
            JSONArray jsonArray = json.getJSONArray("data");
            for (int i = 0; i < jsonArray.size(); i++) {
                JSONObject row = jsonArray.getJSONObject(i);
                System.out.println("Rank[排名]: " + row.get("Rank"));
                System.out.println("AwardRank: " + row.get("AwardRank"));
                System.out.println("WinQty: " + row.get("WinQty"));
                System.out.println("Nickname: " + row.get("Nickname"));
                System.out.println("MemberPhoto: " + row.get("MemberPhoto"));
                System.out.println("-------------------------------------");
            }
            System.out.println(jsonArray);
        } catch (IOException e) {
            // Network or HTTP failure: just print the stack trace in this demo
            e.printStackTrace();
        }
    }
}
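To go beyond the first 20 rows, the single request in `main` could be repeated once per page. The sketch below shows only that loop. It takes the fetch step as a `Function<String, String>` so it runs offline; a real fetcher would have to wrap the jsoup call and handle its `IOException`. The names `RankListCrawler` and `fetchAll`, the `%d` URL template, and the pause between requests are assumptions for illustration, as is the idea that pages are numbered 0, 1, 2, and so on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Sketch: fetch several pages in order with a pluggable fetch step. */
final class RankListCrawler {

    /**
     * Requests pages 0..pageCount-1 and returns the raw response bodies in order.
     * urlTemplate must contain exactly one %d placeholder for the page number.
     */
    static List<String> fetchAll(String urlTemplate, int pageCount,
                                 Function<String, String> fetcher, long delayMillis) {
        List<String> bodies = new ArrayList<>();
        for (int page = 0; page < pageCount; page++) {
            bodies.add(fetcher.apply(String.format(urlTemplate, page)));
            // Pause between requests (not after the last one) to go easy on the server.
            if (delayMillis > 0 && page < pageCount - 1) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag
                    break;                              // stop early, return what we have
                }
            }
        }
        return bodies;
    }

    public static void main(String[] args) {
        String template = "https://bbs.szgalaxy.com/api/PcGuess/GetPredictionRankList"
                + "?groupKind=0&oid=bd95f39e-c222-467d-88cd-102012d4315f"
                + "&pageNum=%d&pageSize=20"
                + "&UserToken=00000000-0000-0000-0000-000000000000";
        // Stand-in fetcher: echoes the URL instead of touching the network.
        List<String> bodies = fetchAll(template, 3, url -> "fetched " + url, 0);
        bodies.forEach(System.out::println);
    }
}
```

Passing the fetcher in keeps the paging logic separate from the HTTP library, so the loop can be tried out with a fake fetcher before pointing it at the real site.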