Notes: a jsoup-based crawler (beginner level)

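Crawlers have been popular lately, so I spent some spare time learning about them. This is my first crawler experiment:
scraping the 影驰币 leaderboard from the 影驰 World Cup themed event.
The original page looks like this:
[Screenshot of the original page]
Two libraries do most of the work: jsoup (for parsing HTML) and fastjson (for parsing JSON data).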
    <!-- HTML parsing library: jsoup (begin) -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.2</version>
    </dependency>
    <!-- HTML parsing library (end) -->
    <!-- alibaba fastjson -->
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.46</version>
    </dependency>
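Fetching the URL from the address bar directly (on Windows you can press Ctrl+U beforehand to view the page source) gives the result below. The data we want is not in it.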
<!doctype html>
<html lang="zh-CN">
 <head> 
  <title>竞猜排行</title> 
  <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> 
  <meta name="renderer" content="webkit"> 
  <meta name="force-rendering" content="webkit"> 
  <meta charset="utf-8"> 
  <link href="/Content/style.css?v=1.2" rel="stylesheet"> 
  <!--[if lt IE 10]>
    <script type="text/javascript" src="/Script/util/PIE.js"></script>
<![endif]--> 
  <!--[if lt IE 8]>
<script src="/Script/util/json2.js"></script>
<![endif]--> 
 </head> 
 <body class="guess1"> 
  <div class="head head2"> 
   <div class="w center fzero relative"> 
    <a href="http://www.szgalaxy.com">影驰官网</a> 
    <div class="absolute right top"> 
     <a href="/">社区首页</a> 
     <!--<a href="/nvideo/?id=1">MOD</a>--> 
     <a href="/nvideo/?id=2">超频</a> 
     <a href="/nvideo/?id=3">游戏</a> 
     <!--<a href="/nvideo/?id=4">新奇特</a>--> 
     <a href="/topic/">话题</a> 
     <a href="/active/">活动中心</a> 
     <a href="/tryout/">0元试用</a> 
     <a href="/integralshop/">积分商城</a>
     <em>|</em> 
     <span class="text-center" id="UserForm"> <a href="javascript:">登录</a><em></em><a href="/register/" class="no-ml">注册</a> </span> 
     <a href="javascript:" data-score="true">签到</a> 
     <a href="/topic/create/">发表</a> 
    </div> 
   </div> 
  </div> 
  <div class="guess_focus"> 
   <img src="/Content/Images/guess/focus.png" id="JFocus" width="100%"> 
   <ul class="guess_block" id="SaiBlock"></ul> 
   <ul class="guess_time" id="SaiTime"></ul> 
  </div> 
  <div class="guess_main"> 
   <div class="guess_main_right"> 
    <div class="full fzero">
     排行榜谁是预言帝
     <div></div>
    </div> 
   </div> 
   <marquee direction="left" onmouseout="this.start()" onmouseover="this.stop()">
    注意:小组赛阶段中奖名单已公布!淘汰赛阶段竞猜已经开始,所有玩家影驰币将回到同样的起点(1000影驰币),搏一搏,单车变摩托,万元主机等着你! 
   </marquee> 
   <table cellpadding="0" cellspacing="0" border="0"> 
    <colgroup> 
     <col width="68"> 
     <col width="230"> 
     <col width="652"> 
    </colgroup> 
    <tbody>
     <tr>
      <th>排名</th>
      <th>影驰币</th>
      <th class="text-left">用户</th>
     </tr> 
    </tbody>
    <tbody id="GuessTop"></tbody> 
   </table> 
   <div class="page relative"> 
    <div class="absolute right fzero" id="Page"></div> 
   </div> 
  </div> 
  <p class="text-center">版权所有 影驰科技 粤ICP备14038543号</p> 
  <script src="/Script/Config.js"></script> 
  <script src="/Script/public.min.js?v=1"></script> 
  <script src="/Script/logic/active.guess.top.js"></script>  
 </body>
</html>
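One way to confirm this in code rather than by eye is to check whether the `GuessTop` table body has any child rows in the fetched HTML. The sketch below is only an illustration: it parses a shortened inline copy of the table markup instead of making a request, and the class name `EmptyTableCheck` and the helper `hasRows` are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class EmptyTableCheck {

    // Shortened copy of the table markup from the page source above.
    static final String STATIC_HTML =
        "<table><tbody><tr><th>排名</th><th>影驰币</th><th>用户</th></tr></tbody>"
        + "<tbody id=\"GuessTop\"></tbody></table>";

    // Returns true if the element with the given id exists and has child elements.
    static boolean hasRows(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element target = doc.getElementById(id);
        return target != null && !target.children().isEmpty();
    }

    public static void main(String[] args) {
        // The static page carries an empty <tbody id="GuessTop">, so no rows are found.
        System.out.println("GuessTop has rows: " + hasRows(STATIC_HTML, "GuessTop"));
    }
}
```

If a check like this reports no rows, the data is probably loaded by a script after the page is served, and the request to look for is in the browser's network panel.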

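Analysis shows that the data is filled in afterwards by a script. Opening the developer tools with F12 reveals the URL that actually serves the data: String url2 = "https://bbs.szgalaxy.com/api/PcGuess/GetPredictionRankList?groupKind=0&oid=bd95f39e-c222-467d-88cd-102012d4315f&pageNum=0&pageSize=20&UserToken=00000000-0000-0000-0000-000000000000". The final result is shown below. (Since this is only practice, I did not crawl everything, just the first 20 rows.)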
<html>
 <head></head>
 <body>
  {"code":"1","msg":"成功","data":"[{\"Rank\":1,\"AwardRank\":1,\"WinQty\":131571,\"Nickname\":\"小北\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(0398bc8b-d24b-4899-875c-31e8bd157f1e)%26ver%3d3%26prop%3dPhoto\"},{\"Rank\":2,\"AwardRank\":2,\"WinQty\":118845,\"Nickname\":\"小北\",\"MemberPhoto\":null},{\"Rank\":3,\"AwardRank\":2,\"WinQty\":112564,\"Nickname\":\"技飞狗跳\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(9139e694-daa9-4c26-b5b3-1d11750a8f4a)%26ver%3d2%26prop%3dPhoto\"},{\"Rank\":4,\"AwardRank\":2,\"WinQty\":90038,\"Nickname\":\"西风\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(91dcfc25-4048-4630-a4a5-c7942445a51a)%26ver%3d4%26prop%3dPhoto\"},{\"Rank\":5,\"AwardRank\":3,\"WinQty\":88413,\"Nickname\":\"90后大叔\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(feaa5a67-95cb-44da-ab16-0ec14404aa48)%26ver%3d3%26prop%3dPhoto\"},{\"Rank\":6,\"AwardRank\":3,\"WinQty\":77571,\"Nickname\":\"云飞\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(bb149ecc-d1f2-4a43-b4ba-59d139dc3eba)%26ver%3d5%26prop%3dPhoto\"},{\"Rank\":7,\"AwardRank\":3,\"WinQty\":51719,\"Nickname\":\"张平\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(d4f2f8f7-8ff8-4b80-8de7-65ab4af76851)%26ver%3d4%26prop%3dPhoto\"},{\"Rank\":8,\"AwardRank\":4,\"WinQty\":46345,\"Nickname\":\"S&J\",\"MemberPhoto\":null},{\"Rank\":9,\"AwardRank\":4,\"WinQty\":43914,\"Nickname\":\"蔡卓桁. 
Aaron\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(af82e341-2af3-4127-ada1-1dd5874d72f0)%26ver%3d3%26prop%3dPhoto\"},{\"Rank\":10,\"AwardRank\":4,\"WinQty\":28279,\"Nickname\":\"北极星\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(a14fd68f-e7de-4a12-bdc1-d57082812935)%26ver%3d4%26prop%3dPhoto\"},{\"Rank\":11,\"AwardRank\":4,\"WinQty\":27711,\"Nickname\":\"宋颖\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(457563af-44f5-4a0c-9edc-a86f5cc0553e)%26ver%3d5%26prop%3dPhoto\"},{\"Rank\":12,\"AwardRank\":4,\"WinQty\":22714,\"Nickname\":\"春之声42\",\"MemberPhoto\":null},{\"Rank\":13,\"AwardRank\":4,\"WinQty\":20219,\"Nickname\":\"陳傑\",\"MemberPhoto\":null},{\"Rank\":14,\"AwardRank\":4,\"WinQty\":18926,\"Nickname\":\"qeqe\",\"MemberPhoto\":null},{\"Rank\":15,\"AwardRank\":4,\"WinQty\":16370,\"Nickname\":\"Alex\",\"MemberPhoto\":null},{\"Rank\":16,\"AwardRank\":4,\"WinQty\":15836,\"Nickname\":\"Mr.王\",\"MemberPhoto\":null},{\"Rank\":17,\"AwardRank\":4,\"WinQty\":13776,\"Nickname\":\"高攀\",\"MemberPhoto\":\"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(d8a0356a-1c8f-4f05-8ee7-570d5df8f46b)%26ver%3d2%26prop%3dPhoto\"},{\"Rank\":18,\"AwardRank\":5,\"WinQty\":10839,\"Nickname\":\"寳_爺\",\"MemberPhoto\":null},{\"Rank\":19,\"AwardRank\":5,\"WinQty\":10780,\"Nickname\":\"习惯一个人\",\"MemberPhoto\":null},{\"Rank\":20,\"AwardRank\":5,\"WinQty\":10616,\"Nickname\":\"星之所在\",\"MemberPhoto\":null}]","tag":"{\"IsDemo\":false,\"Timestamp\":0,\"ReGet\":false,\"Dict\":{\"records\":\"1511\"}}"}
 </body>
</html>
处理数据后:----------------------------------
Rank[排名]: 1
AwardRank: 1
WinQty: 131571
Nickname: 小北
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(0398bc8b-d24b-4899-875c-31e8bd157f1e)%26ver%3d3%26prop%3dPhoto
-------------------------------------
Rank[排名]: 2
AwardRank: 2
WinQty: 118845
Nickname: 小北
MemberPhoto: null
-------------------------------------
Rank[排名]: 3
AwardRank: 2
WinQty: 112564
Nickname: 技飞狗跳
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(9139e694-daa9-4c26-b5b3-1d11750a8f4a)%26ver%3d2%26prop%3dPhoto
-------------------------------------
Rank[排名]: 4
AwardRank: 2
WinQty: 90038
Nickname: 西风
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(91dcfc25-4048-4630-a4a5-c7942445a51a)%26ver%3d4%26prop%3dPhoto
-------------------------------------
Rank[排名]: 5
AwardRank: 3
WinQty: 88413
Nickname: 90后大叔
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(feaa5a67-95cb-44da-ab16-0ec14404aa48)%26ver%3d3%26prop%3dPhoto
-------------------------------------
Rank[排名]: 6
AwardRank: 3
WinQty: 77571
Nickname: 云飞
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(bb149ecc-d1f2-4a43-b4ba-59d139dc3eba)%26ver%3d5%26prop%3dPhoto
-------------------------------------
Rank[排名]: 7
AwardRank: 3
WinQty: 51719
Nickname: 张平
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(d4f2f8f7-8ff8-4b80-8de7-65ab4af76851)%26ver%3d4%26prop%3dPhoto
-------------------------------------
Rank[排名]: 8
AwardRank: 4
WinQty: 46345
Nickname: S&J
MemberPhoto: null
-------------------------------------
Rank[排名]: 9
AwardRank: 4
WinQty: 43914
Nickname: 蔡卓桁. Aaron
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(af82e341-2af3-4127-ada1-1dd5874d72f0)%26ver%3d3%26prop%3dPhoto
-------------------------------------
Rank[排名]: 10
AwardRank: 4
WinQty: 28279
Nickname: 北极星
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(a14fd68f-e7de-4a12-bdc1-d57082812935)%26ver%3d4%26prop%3dPhoto
-------------------------------------
Rank[排名]: 11
AwardRank: 4
WinQty: 27711
Nickname: 宋颖
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(457563af-44f5-4a0c-9edc-a86f5cc0553e)%26ver%3d5%26prop%3dPhoto
-------------------------------------
Rank[排名]: 12
AwardRank: 4
WinQty: 22714
Nickname: 春之声42
MemberPhoto: null
-------------------------------------
Rank[排名]: 13
AwardRank: 4
WinQty: 20219
Nickname: 陳傑
MemberPhoto: null
-------------------------------------
Rank[排名]: 14
AwardRank: 4
WinQty: 18926
Nickname: qeqe
MemberPhoto: null
-------------------------------------
Rank[排名]: 15
AwardRank: 4
WinQty: 16370
Nickname: Alex
MemberPhoto: null
-------------------------------------
Rank[排名]: 16
AwardRank: 4
WinQty: 15836
Nickname: Mr.王
MemberPhoto: null
-------------------------------------
Rank[排名]: 17
AwardRank: 4
WinQty: 13776
Nickname: 高攀
MemberPhoto: https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(d8a0356a-1c8f-4f05-8ee7-570d5df8f46b)%26ver%3d2%26prop%3dPhoto
-------------------------------------
Rank[排名]: 18
AwardRank: 5
WinQty: 10839
Nickname: 寳_爺
MemberPhoto: null
-------------------------------------
Rank[排名]: 19
AwardRank: 5
WinQty: 10780
Nickname: 习惯一个人
MemberPhoto: null
-------------------------------------
Rank[排名]: 20
AwardRank: 5
WinQty: 10616
Nickname: 星之所在
MemberPhoto: null
-------------------------------------
[{"AwardRank":1,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(0398bc8b-d24b-4899-875c-31e8bd157f1e)%26ver%3d3%26prop%3dPhoto","Rank":1,"WinQty":131571,"Nickname":"小北"},{"AwardRank":2,"Rank":2,"WinQty":118845,"Nickname":"小北"},{"AwardRank":2,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(9139e694-daa9-4c26-b5b3-1d11750a8f4a)%26ver%3d2%26prop%3dPhoto","Rank":3,"WinQty":112564,"Nickname":"技飞狗跳"},{"AwardRank":2,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(91dcfc25-4048-4630-a4a5-c7942445a51a)%26ver%3d4%26prop%3dPhoto","Rank":4,"WinQty":90038,"Nickname":"西风"},{"AwardRank":3,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(feaa5a67-95cb-44da-ab16-0ec14404aa48)%26ver%3d3%26prop%3dPhoto","Rank":5,"WinQty":88413,"Nickname":"90后大叔"},{"AwardRank":3,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(bb149ecc-d1f2-4a43-b4ba-59d139dc3eba)%26ver%3d5%26prop%3dPhoto","Rank":6,"WinQty":77571,"Nickname":"云飞"},{"AwardRank":3,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(d4f2f8f7-8ff8-4b80-8de7-65ab4af76851)%26ver%3d4%26prop%3dPhoto","Rank":7,"WinQty":51719,"Nickname":"张平"},{"AwardRank":4,"Rank":8,"WinQty":46345,"Nickname":"S&J"},{"AwardRank":4,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(af82e341-2af3-4127-ada1-1dd5874d72f0)%26ver%3d3%26prop%3dPhoto","Rank":9,"WinQty":43914,"Nickname":"蔡卓桁. 
Aaron"},{"AwardRank":4,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(a14fd68f-e7de-4a12-bdc1-d57082812935)%26ver%3d4%26prop%3dPhoto","Rank":10,"WinQty":28279,"Nickname":"北极星"},{"AwardRank":4,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(457563af-44f5-4a0c-9edc-a86f5cc0553e)%26ver%3d5%26prop%3dPhoto","Rank":11,"WinQty":27711,"Nickname":"宋颖"},{"AwardRank":4,"Rank":12,"WinQty":22714,"Nickname":"春之声42"},{"AwardRank":4,"Rank":13,"WinQty":20219,"Nickname":"陳傑"},{"AwardRank":4,"Rank":14,"WinQty":18926,"Nickname":"qeqe"},{"AwardRank":4,"Rank":15,"WinQty":16370,"Nickname":"Alex"},{"AwardRank":4,"Rank":16,"WinQty":15836,"Nickname":"Mr.王"},{"AwardRank":4,"MemberPhoto":"https://wxapi.szgalaxy.com/HLW.axd?hlr=bph&p=BDPId%3dBPPror%26obj%3dR.M.Common.BusinessObjects.IMicroMember(d8a0356a-1c8f-4f05-8ee7-570d5df8f46b)%26ver%3d2%26prop%3dPhoto","Rank":17,"WinQty":13776,"Nickname":"高攀"},{"AwardRank":5,"Rank":18,"WinQty":10839,"Nickname":"寳_爺"},{"AwardRank":5,"Rank":19,"WinQty":10780,"Nickname":"习惯一个人"},{"AwardRank":5,"Rank":20,"WinQty":10616,"Nickname":"星之所在"}]
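In the raw response above, `data` and `tag` are not nested JSON values but JSON-encoded strings, so they have to be decoded a second time. The following is a minimal sketch of doing that explicitly with fastjson, run against a trimmed-down inline sample of the same shape. The class name `NestedJsonSketch` and its helper methods are hypothetical, and reading `records` as the total row count is an assumption based on the field name.

```java
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

final class NestedJsonSketch {

    // Trimmed-down sample with the same shape as the API response above.
    static final String SAMPLE =
        "{\"code\":\"1\",\"msg\":\"成功\","
        + "\"data\":\"[{\\\"Rank\\\":1,\\\"WinQty\\\":131571,"
        + "\\\"Nickname\\\":\\\"小北\\\",\\\"MemberPhoto\\\":null}]\","
        + "\"tag\":\"{\\\"Dict\\\":{\\\"records\\\":\\\"1511\\\"}}\"}";

    // First decode the envelope, then decode the string stored under "data".
    static JSONArray rows(String body) {
        JSONObject root = JSON.parseObject(body);
        return JSON.parseArray(root.getString("data"));
    }

    // "tag" is also a JSON-encoded string; "records" is assumed to be the total row count.
    static int totalRecords(String body) {
        JSONObject root = JSON.parseObject(body);
        JSONObject tag = JSON.parseObject(root.getString("tag"));
        return tag.getJSONObject("Dict").getIntValue("records");
    }

    public static void main(String[] args) {
        JSONArray rows = rows(SAMPLE);
        System.out.println("rows: " + rows.size());
        System.out.println("first nickname: " + rows.getJSONObject(0).getString("Nickname"));
        System.out.println("total records: " + totalRecords(SAMPLE));
    }
}
```

Spelling out both decoding steps makes the double encoding visible to the reader and gives a natural place to read `records` for paging.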
The simple code is as follows:
package com.hill.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

public class JsoupDemo {

	public static void main(String[] args) {

		// 影驰 community site (PC version), 影驰币 leaderboard page.
		// Kept for reference only: this page is an empty shell whose table is filled in by script.
		String url = "https://bbs.szgalaxy.com/active/guess/top/?oid=bd95f39e-c222-467d-88cd-102012d4315f&gid=c852015c-33e7-4a88-afe5-901266c0e3f6&gkind=1";
		// API endpoint that actually returns the leaderboard data as JSON.
		String url2 = "https://bbs.szgalaxy.com/api/PcGuess/GetPredictionRankList?groupKind=0&oid=bd95f39e-c222-467d-88cd-102012d4315f&pageNum=0&pageSize=20&UserToken=00000000-0000-0000-0000-000000000000";
		try {
			// ignoreContentType(true) is needed because the response is JSON, not HTML;
			// without it jsoup may throw an exception.
			Document htmlPage = Jsoup.connect(url2)
					.ignoreContentType(true)
					.userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36")
					.get();
			System.out.println(htmlPage);
			System.out.println("处理数据后:----------------------------------");
			// Parse the response. jsoup wraps the JSON in <html><body>, so text() gives the JSON back.
			JSONObject json = JSON.parseObject(htmlPage.text());
			// "data" arrives as a JSON-encoded string; getJSONArray still yields the array here.
			JSONArray jsonArray = json.getJSONArray("data");
			for (int i = 0; i < jsonArray.size(); i++) {
				JSONObject row = jsonArray.getJSONObject(i);
				System.out.println("Rank[排名]: " + row.get("Rank"));
				System.out.println("AwardRank: " + row.get("AwardRank"));
				System.out.println("WinQty: " + row.get("WinQty"));
				System.out.println("Nickname: " + row.get("Nickname"));
				System.out.println("MemberPhoto: " + row.get("MemberPhoto"));
				System.out.println("-------------------------------------");
			}

			System.out.println(jsonArray);

		} catch (IOException e) {
			// Network or HTTP failure: print the stack trace and exit.
			e.printStackTrace();
		}
	}
}
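The run above stops at the first 20 rows. To go further, the `pageNum` and `pageSize` query parameters of the API URL could be varied. The sketch below only builds the list of page URLs and performs no requests. It assumes that `pageNum` is zero-based and contiguous, and that the `records` value of 1511 in the response's `tag` field is the total row count; neither is confirmed by the site. The class `RankPager` and its methods are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class RankPager {

    static final String BASE =
        "https://bbs.szgalaxy.com/api/PcGuess/GetPredictionRankList";
    static final String OID = "bd95f39e-c222-467d-88cd-102012d4315f";
    static final String ANON_TOKEN = "00000000-0000-0000-0000-000000000000";

    // Number of pages needed to cover totalRecords rows (ceiling division).
    static int pageCount(int totalRecords, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        return (totalRecords + pageSize - 1) / pageSize;
    }

    // Rebuilds the API URL from the post with a different page number and size.
    static String pageUrl(int pageNum, int pageSize) {
        return BASE + "?groupKind=0&oid=" + OID
            + "&pageNum=" + pageNum
            + "&pageSize=" + pageSize
            + "&UserToken=" + ANON_TOKEN;
    }

    // URLs for every page, assuming pages are numbered 0, 1, 2, ...
    static List<String> allPageUrls(int totalRecords, int pageSize) {
        List<String> urls = new ArrayList<>();
        int pages = pageCount(totalRecords, pageSize);
        for (int page = 0; page < pages; page++) {
            urls.add(pageUrl(page, pageSize));
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = allPageUrls(1511, 20);
        System.out.println("pages: " + urls.size());
        System.out.println("first: " + urls.get(0));
        System.out.println("last:  " + urls.get(urls.size() - 1));
    }
}
```

Each URL could then be fetched the same way as `url2` in the demo, ideally with a short pause between requests to avoid putting load on the site.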







Reposted from blog.csdn.net/superMe_1994/article/details/80860102