Crawler Study Notes 5: Analyzing and Parsing JSON Data

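What does JSON data look like, and how do we parse it in a Java web crawler? The libraries commonly used to work with JSON in Java are org.json, Gson, and Fastjson. This post looks at how to handle a crawler response whose data comes back in JSON format.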

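Web crawlers frequently run into JSON data. When the requested page wraps its JSON in something else, the response has to be preprocessed into standard JSON first. For example, a response may look like this: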

jQuery18305886476962892728_1531402823026({
    "id":"07",
    "language": "C++",
    "edition": "second",
    "author": "E.Balagurusamy"
})
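A string like this, where the JSON sits inside a JSONP callback, has to be preprocessed by trimming the head and the tail. In Java, the string above can be handled as follows: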

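// Build the JSONP string
String json = "jQuery18305886476962892728_1531402823026({\"id\":\"07\",\"language\": \"C++\",\"edition\": \"second\",\"author\": \"E.Balagurusamy\"})";
// Trim the head and tail: keep what lies between the first "(" and the last ")".
// (Splitting on "(" would cut the payload short if a value itself contained "(".)
String arr = json.substring(json.indexOf('(') + 1, json.lastIndexOf(')'));
System.out.println(arr);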
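The head-and-tail trimming described above can also be wrapped in a small reusable helper. The following is only a sketch: the class name `JsonpUtil` and the method `stripJsonp` are made up for illustration, and the regular expression assumes the wrapper has the form `callbackName(...)`, optionally followed by a semicolon. Input that does not look like JSONP is returned unchanged apart from trimming.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class JsonpUtil {
    // A callback identifier, "(", the payload, the last ")", and an optional ";".
    private static final Pattern JSONP =
            Pattern.compile("^\\s*[\\w$.]+\\s*\\((.*)\\)\\s*;?\\s*$", Pattern.DOTALL);

    /** Returns the payload inside callback(...), or the trimmed input if there is no wrapper. */
    static String stripJsonp(String body) {
        Matcher m = JSONP.matcher(body);
        return m.matches() ? m.group(1).trim() : body.trim();
    }

    public static void main(String[] args) {
        String raw = "jQuery18305886476962892728_1531402823026({\"id\":\"07\",\"language\": \"C++\"})";
        System.out.println(stripJsonp(raw));
    }
}
```

Because the group is greedy and the pattern is anchored at both ends, the payload runs up to the last closing parenthesis, so parentheses inside string values do not truncate it.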

The result can be checked with an online JSON validator.

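Converting Java objects to JSON, JSON objects to Java objects, JSON strings to Java objects, and JSON strings to JSON objects are basics that are well covered elsewhere online, so they are not repeated here.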

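Crawler case study

Here is an example against a real website:

Site URL: http://www.haodou.com/recipe/853171/

Step 1: capture the network traffic and find the real URL behind the comments

Open the browser developer tools (F12):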

The real URL is: http://www.haodou.com/comment.php?do=list&callback=jQuery18304706379730622201_1542510303429&channel=recipe&item=853171&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common&_=1542510303816
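To see what the captured URL is made of, it helps to split its query string into individual parameters. The sketch below does this with the standard library only; the class name `QueryParams` is made up for illustration, and no percent-decoding is performed because this URL does not need it.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for inspecting a captured URL.
final class QueryParams {
    /** Splits the query string into an ordered name -> value map (no percent-decoding). */
    static Map<String, String> parse(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return params; // the URL has no query string
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(name, value);
        }
        return params;
    }

    public static void main(String[] args) {
        String url = "http://www.haodou.com/comment.php?do=list&callback=jQuery18304706379730622201_1542510303429"
                + "&channel=recipe&item=853171&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common&_=1542510303816";
        // Prints one "name = value" line per parameter, in URL order.
        parse(url).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Listed this way, `item` is clearly the recipe ID from the page URL, and `page` and `size` control paging.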

Step 2: trim the head and tail, then validate the JSON online at http://www.bejson.com/

{
	"status": 200,
	"data": {
		"total": 7,
		"data": {
			"_30376977": {
				"CommentId": 30376977,
				"ItemId": 853171,
				"UserId": 4003739,
				"ReplyId": 0,
				"Type": 0,
				"AtUserId": 0,
				"Content": "漂亮美味",
				"ImageNum": 0,
				"Platform": "iPhone客户端",
				"Status": 1,
				"SubCommentCnt": 1,
				"OpenDataId": "",
				"OpenUserName": "yxeg5",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-4003739\/",
				"OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/9b\/17\/4003739_70.jpg",
				"CreateTime": "2016-02-15 12:22",
				"Vip": "<a href=\"http:\/\/www.haodou.com\/recipe\/expert\/apply\" target=\"_blank\"><i class=\"ico12 mod_v\"><\/i><\/a> ",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-513793.html\" target=\"_blank\">【第119期】好问豆答:蜜三刀的制作技巧<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_29589112": {
				"CommentId": 29589112,
				"ItemId": 853171,
				"UserId": 9235790,
				"ReplyId": 0,
				"Type": 0,
				"AtUserId": 0,
				"Content": "紫菜是干的还是",
				"ImageNum": 0,
				"Platform": "Android客户端",
				"Status": 1,
				"SubCommentCnt": 1,
				"OpenDataId": "",
				"OpenUserName": "喻平凶",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-9235790\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/4e\/ed\/9235790_70.jpg",
				"CreateTime": "2015-12-26 09:36",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_0\"><\/span> 新手<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近发布了菜谱专辑:<\/span> <a href=\"http:\/\/www.haodou.com\/recipe\/album\/9061657\/\" target=\"_blank\">炒饭<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_29407043": {
				"CommentId": 29407043,
				"ItemId": 853171,
				"UserId": 3342562,
				"ReplyId": 0,
				"Type": 0,
				"AtUserId": 0,
				"Content": "超市有干贝和海蛎卖?",
				"ImageNum": 0,
				"Platform": "好豆网",
				"Status": 1,
				"SubCommentCnt": 1,
				"OpenDataId": "",
				"OpenUserName": "秋玉的美",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-3342562\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e2\/00\/3342562_70.jpg",
				"CreateTime": "2015-12-05 15:54",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_1\"><\/span> 豆芽<\/span>",
				"LastAct": "",
				"PlatformUrl": "http:\/\/www.haodou.com\/",
				"Admin": "non"
			},
			"_28188378": {
				"CommentId": 28188378,
				"ItemId": 853171,
				"UserId": 8008371,
				"ReplyId": 0,
				"Type": 0,
				"AtUserId": 0,
				"Content": "干贝虾米一般都是咸的,要用水多泡会,泡软",
				"ImageNum": 0,
				"Platform": "Android客户端",
				"Status": 1,
				"SubCommentCnt": 1,
				"OpenDataId": "",
				"OpenUserName": "月上荒城6",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-8008371\/",
				"OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/b3\/32\/8008371_70.jpg",
				"CreateTime": "2015-07-09 12:51",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_1\"><\/span> 豆芽<\/span>",
				"LastAct": "",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_27165505": {
				"CommentId": 27165505,
				"ItemId": 853171,
				"UserId": 3837,
				"ReplyId": 0,
				"Type": 0,
				"AtUserId": 0,
				"Content": "食材丰富--口感也丰富!",
				"ImageNum": 0,
				"Platform": "好豆网",
				"Status": 1,
				"SubCommentCnt": 3,
				"OpenDataId": "",
				"OpenUserName": "爱跳舞的老太",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-3837\/",
				"OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/fd\/0e\/3837_70.jpg",
				"CreateTime": "2015-02-26 09:42",
				"Vip": "<a href=\"http:\/\/www.haodou.com\/recipe\/expert\/apply\" target=\"_blank\"><i class=\"ico12 mod_v\"><\/i><\/a> ",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-556709.html\" target=\"_blank\">【深秋食语】在朋友单位吃午餐<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/",
				"Admin": "non"
			},
			"_30383571": {
				"CommentId": 30383571,
				"ItemId": 853171,
				"UserId": 489704,
				"ReplyId": 30376977,
				"Type": 0,
				"AtUserId": 4003739,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-4003739\/\" target=\"_blank\">yxeg5<\/a> 感谢你的分享。",
				"ImageNum": 0,
				"Platform": "Android客户端",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "挪红",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
				"CreateTime": "2016-02-15 21:39",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【寻找温暖】港仔后请客,品沙县小吃<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_29596058": {
				"CommentId": 29596058,
				"ItemId": 853171,
				"UserId": 489704,
				"ReplyId": 29589112,
				"Type": 0,
				"AtUserId": 9235790,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-9235790\/\" target=\"_blank\">喻平凶<\/a> 是干的,要冲洗一下。",
				"ImageNum": 0,
				"Platform": "Android客户端",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "挪红",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
				"CreateTime": "2015-12-26 23:15",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【寻找温暖】港仔后请客,品沙县小吃<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_29407675": {
				"CommentId": 29407675,
				"ItemId": 853171,
				"UserId": 489704,
				"ReplyId": 29407043,
				"Type": 0,
				"AtUserId": 3342562,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-3342562\/\" target=\"_blank\">秋玉的美<\/a> 商店里有网上也有。",
				"ImageNum": 0,
				"Platform": "Android客户端",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "挪红",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
				"CreateTime": "2015-12-05 17:11",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【寻找温暖】港仔后请客,品沙县小吃<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_28189130": {
				"CommentId": 28189130,
				"ItemId": 853171,
				"UserId": 489704,
				"ReplyId": 28188378,
				"Type": 0,
				"AtUserId": 8008371,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-8008371\/\" target=\"_blank\">月上荒城6<\/a> 我买的这种不是那种很硬的,很多盐的,要根据情况而定。",
				"ImageNum": 0,
				"Platform": "Android客户端",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "挪红",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
				"CreateTime": "2015-07-09 15:19",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【寻找温暖】港仔后请客,品沙县小吃<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_27729797": {
				"CommentId": 27729797,
				"ItemId": 853171,
				"UserId": 489704,
				"ReplyId": 27165505,
				"Type": 0,
				"AtUserId": 7566907,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-7566907\/\" target=\"_blank\">haodou8704818142<\/a> 我在厦门,漳州吃的,每一次都不是不一样的。都有紫菜",
				"ImageNum": 0,
				"Platform": "Android客户端",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "挪红",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
				"CreateTime": "2015-05-07 01:53",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【寻找温暖】港仔后请客,品沙县小吃<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_27727527": {
				"CommentId": 27727527,
				"ItemId": 853171,
				"UserId": 7566907,
				"ReplyId": 27165505,
				"Type": 0,
				"AtUserId": 489704,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-489704\/\" target=\"_blank\">挪红<\/a> 和我们的配料不一样",
				"ImageNum": 0,
				"Platform": "Android客户端",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "haodou8704818142",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-7566907\/",
				"OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/3b\/76\/7566907_70.jpg",
				"CreateTime": "2015-05-06 19:23",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_0\"><\/span> 新手<\/span>",
				"LastAct": "",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_27166153": {
				"CommentId": 27166153,
				"ItemId": 853171,
				"UserId": 489704,
				"ReplyId": 27165505,
				"Type": 0,
				"AtUserId": 3837,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-3837\/\" target=\"_blank\">爱跳舞的老太<\/a> 姐是这儿的人,不知我这样做对吗?",
				"ImageNum": 0,
				"Platform": "好豆网",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "挪红",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
				"CreateTime": "2015-02-26 11:26",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近发表了话题:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【寻找温暖】港仔后请客,品沙县小吃<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/",
				"Admin": "non"
			}
		},
		"avatar": "",
		"page_nav": "<a href='javaScript:;' page='1' id='' class='cur'>1<\/a><a href='javaScript:;' page='2' id=''>2<\/a><span class='next'><a href='javaScript:;' page='2' id='' class='next'>下一页<\/a><\/span>",
		"more": null,
		"offset": 0
	},
	"message": ""
}

Step 3: pick the fields you need from the response and wrap them in a JavaBean

package com.jack.spiderone.entity;

import lombok.Data;

/**
 * create by jack 2018/11/18
 *
 * @author jack
 * @date: 2018/11/18 11:26
 * @Description:
 */
@Data
public class CommentModel {

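    // Field names mirror the keys in the JSON response.

    /**
     * ID of the comment
     */
    private String CommentId;
    // ID of the recipe being commented on
    private String ItemId;
    // Text of the comment
    private String Content;
    // Time the comment was posted
    private String CreateTime;
    // Name of the comment's author
    private String OpenUserName;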
}

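Step 4:

Use HttpClient or another HTTP request tool to fetch the string behind the real URL. Then trim the head and tail in code so that it becomes an easy-to-parse JSON string (regular expressions are often used for this).

Code: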

package com.jack.spiderone.service;

import com.alibaba.fastjson.JSONObject;
import com.jack.spiderone.entity.CommentModel;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.List;

/**
 * create by jack 2018/11/18
 *
 * @author jack
 * @date: 2018/11/18 11:35
 * @Description:
 */
public class CookBookSpider {

    /**
     * Fetch the response body for a URL as a string
     * @param url the address to request
     * @return the raw response body
     */
    public static String getJson(String url) throws IOException {
        // Create the HttpClient
        HttpClient httpClient = HttpClients.custom().build();
        // Request method: GET
        HttpGet httpget = new HttpGet(url);
        // Send the GET request
        HttpResponse response = httpClient.execute(httpget);
        // Get the response entity
        HttpEntity httpEntity = response.getEntity();
        // Read it as a string (the charset must be specified)
        String entity = EntityUtils.toString(httpEntity, "gbk");
        // Release the content stream
        EntityUtils.consume(httpEntity);
        // Return the JSON string
        return entity;
    }


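    /**
     * Parse the JSON string into a list of comment objects
     * @param jsonStr raw response body (compact, with no whitespace between tokens)
     * @return the parsed comments
     */
    public static List<CommentModel> parseData(String jsonStr){
        // Convert Unicode escapes into Chinese characters
        jsonStr = decode(jsonStr);
        // Keep the inner "data" object (everything before "avatar") and strip the "_<id>": keys,
        // so that only the comment objects remain
        String jsondata  = "{"+jsonStr.split("data\":\\{")[2].split("\"avatar")[0].replaceAll("\"_\\d+\":", "");
        // Drop the trailing "}," left over from the inner "data" object
        jsonStr = jsondata.substring(0, jsondata.length()-2);
        // Swap the leading "{" for "[...]" and parse the JSON array into a list of objects
        List<CommentModel>  datalis = JSONObject.parseArray("["+jsonStr.substring(1)+"]", CommentModel.class);
        return datalis;
    }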

   public static void spiderCookBook() throws IOException {
       // URL to fetch and parse
       String url = "http://www.haodou.com/comment.php?do=list&callback=jQuery18304706379730622201_1542510303429&channel=recipe&item=853171&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common&_=1542510303816";
       // Fetch the JSON data
       String jsonstring = getJson(url);
       // Parse the JSON data
       List<CommentModel> datalist = parseData(jsonstring);
       // Print the result
       for (CommentModel comm : datalist) {
           System.out.println(comm.getCommentId() + "\t" + comm.getItemId() + "\t" + comm.getContent());
       }
   }



    /**
     * Convert Unicode escape sequences into the characters they represent
     * @param unicodeStr text that may contain Unicode escape sequences
     * @return the decoded text, or null if the input is null
     */
    public static String decode(String unicodeStr) {
        if (unicodeStr == null) {
            return null;
        }
        StringBuilder retBuf = new StringBuilder();
        int maxLoop = unicodeStr.length();
        for (int i = 0; i < maxLoop; i++) {
            if (unicodeStr.charAt(i) == '\\') {
                // A backslash followed by 'u' or 'U' and four more characters may be an escape
                if ((i < maxLoop - 5) && ((unicodeStr.charAt(i + 1) == 'u') || (unicodeStr
                        .charAt(i + 1) == 'U'))) {
                    try {
                        retBuf.append((char) Integer.parseInt(
                                unicodeStr.substring(i + 2, i + 6), 16));
                        // Skip the remaining five characters of the escape
                        i += 5;
                    } catch (NumberFormatException localNumberFormatException) {
                        // Not valid hex: keep the backslash as it is
                        retBuf.append(unicodeStr.charAt(i));
                    }
                } else {
                    retBuf.append(unicodeStr.charAt(i));
                }
            } else {
                retBuf.append(unicodeStr.charAt(i));
            }
        }
        return retBuf.toString();
    }

    public static void main(String[] args) throws IOException {
        spiderCookBook();
    }

}

Running the program prints:

30376977	853171	漂亮美味
29589112	853171	紫菜是干的还是
29407043	853171	超市有干贝和海蛎卖?
28188378	853171	干贝虾米一般都是咸的,要用水多泡会,泡软
27165505	853171	食材丰富--口感也丰富!
30383571	853171	@<a href="http://www.haodou.com/cook-4003739/" target="_blank">yxeg5</a> 感谢你的分享。
29596058	853171	@<a href="http://www.haodou.com/cook-9235790/" target="_blank">喻平凶</a> 是干的,要冲洗一下。
29407675	853171	@<a href="http://www.haodou.com/cook-3342562/" target="_blank">秋玉的美</a> 商店里有网上也有。
28189130	853171	@<a href="http://www.haodou.com/cook-8008371/" target="_blank">月上荒城6</a> 我买的这种不是那种很硬的,很多盐的,要根据情况而定。
27729797	853171	@<a href="http://www.haodou.com/cook-7566907/" target="_blank">haodou8704818142</a> 我在厦门,漳州吃的,每一次都不是不一样的。都有紫菜
27727527	853171	@<a href="http://www.haodou.com/cook-489704/" target="_blank">挪红</a> 和我们的配料不一样
27166153	853171	@<a href="http://www.haodou.com/cook-3837/" target="_blank">爱跳舞的老太</a> 姐是这儿的人,不知我这样做对吗?
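The `decode` method in the program walks the string one character at a time. As a sketch of the same idea in fewer lines, the version below uses a regular expression to find each escape and replace it with the character it stands for. The class name `UnicodeEscapes` is made up for illustration. Like the original, it only handles four-digit escapes and leaves other backslash sequences alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical alternative to the hand-written decode loop.
final class UnicodeEscapes {
    // A backslash, 'u' or 'U', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\[uU]([0-9a-fA-F]{4})");

    /** Replaces every four-digit Unicode escape with the character it encodes. */
    static String decode(String s) {
        if (s == null) {
            return null;
        }
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer(); // StringBuffer keeps this compatible with Java 8
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or '\' being read as replacement syntax
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("\\u6f02\\u4eae\\u7f8e\\u5473"));
    }
}
```

A JSON parser decodes these escapes on its own when it reads a string value, so decoding by hand matters mainly when the raw response is inspected or cut up as text before parsing, as it is in this program.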

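Note that the Chinese text in this response is encoded as Unicode escapes (\uXXXX), so it has to be converted to Chinese characters before further processing. The reader may also wonder: in general we only know a recipe's ID, here 853171 from http://www.haodou.com/recipe/853171/, so how do we proceed from that?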

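The captured URL contains a parameter such as &callback=jQuery183016721538977115902_1531563599327. How would we build that string? And where does the other one, &_=1531563599599, come from? Capturing the traffic several times shows that both values change from request to request, which is related to what the front-end JavaScript does. However, both can simply be removed from the captured URL, which leaves: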

http://www.haodou.com/comment.php?do=list&channel=recipe&item=853171&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common
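Removing the two dynamic parameters can also be done in code rather than by hand. The sketch below drops named query parameters from a URL and keeps the rest in order; the class name `UrlCleaner` and the method `dropParams` are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical helper for cleaning a captured URL.
final class UrlCleaner {
    /** Returns the URL without the named query parameters; the others keep their order. */
    static String dropParams(String url, String... names) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to drop
        }
        Set<String> drop = new HashSet<>(Arrays.asList(names));
        StringJoiner kept = new StringJoiner("&");
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            if (!drop.contains(name)) {
                kept.add(pair);
            }
        }
        String base = url.substring(0, q);
        return kept.length() == 0 ? base : base + "?" + kept;
    }

    public static void main(String[] args) {
        String captured = "http://www.haodou.com/comment.php?do=list&callback=jQuery18304706379730622201_1542510303429"
                + "&channel=recipe&item=853171&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common&_=1542510303816";
        System.out.println(dropParams(captured, "callback", "_"));
    }
}
```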

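Requesting this address also succeeds, and the response is standard JSON with no callback wrapper. Given another recipe ID, for example 344953 from http://www.haodou.com/recipe/344953/, the URL for its comments can be assembled by following the same pattern: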

http://www.haodou.com/comment.php?do=list&channel=recipe&item=344953&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common

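Furthermore, when the comments span several pages, we can loop over the page parameter in the URL above to fetch every page. For example, the second page of comments for recipe 344953 is at: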

http://www.haodou.com/comment.php?do=list&channel=recipe&item=344953&sort=desc&page=2&size=5&comment_id=0&cate=0&purify=common
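The URL pattern above can be turned into a small builder that takes a recipe ID and a page number. This is a sketch only: the class and method names are made up for illustration, and `pageCount` assumes that the `total` field in the response counts top-level comments, which matches the sample (a total of 7 at 5 per page, with two pages in `page_nav`) but is not documented anywhere.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical URL builder following the pattern shown in the article.
final class CommentUrls {
    private static final String BASE = "http://www.haodou.com/comment.php";

    /** Builds the comment URL for one recipe and one page (5 comments per page). */
    static String commentUrl(int recipeId, int page) {
        return BASE + "?do=list&channel=recipe&item=" + recipeId
                + "&sort=desc&page=" + page
                + "&size=5&comment_id=0&cate=0&purify=common";
    }

    /** Number of pages needed for `total` items at `size` per page (assumption: total = top-level comments). */
    static int pageCount(int total, int size) {
        return (total + size - 1) / size;
    }

    /** Builds the URLs for pages 1..pages of a recipe's comments. */
    static List<String> allPageUrls(int recipeId, int pages) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pages; page++) {
            urls.add(commentUrl(recipeId, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : allPageUrls(344953, pageCount(7, 5))) {
            System.out.println(url);
        }
    }
}
```

Each URL produced this way can be passed to `getJson` in turn. Since these responses carry no callback wrapper, the string-splitting in `parseData` still applies as long as the structure stays the same.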


Reposted from blog.csdn.net/wj903829182/article/details/84196605