Java Crawler in Practice, Part 1: A Weibo Crawler

Copyright notice: This is an original post by the author, released under the CC 4.0 BY-SA license. Please include the original link and this notice when reposting.
Original link: https://blog.csdn.net/qq_31122833/article/details/91570539

The two key points: 1. having a large set of Weibo uids; 2. handling Weibo's anti-crawling measures.

Part 1: Preparation

1. Get the cookie used to access the Weibo site

Open https://m.weibo.cn/ in Chrome, press F12 to open the developer tools, and copy the Cookie value from the request headers (shown in a screenshot in the original post). That string is the cookie we need.

2. With the cookie in hand, the next step is to write code that imitates a browser and fetches the page content.

/**
     * Generic GET based on HttpClient 4.3, sending a Weibo cookie
     * @param url    the URL to request
     * @param cookie the Weibo cookie; falls back to a hard-coded value if empty
     * @return the response body
     */
    public static String get_byCookie(String url, String cookie) {
        if (CheckUtil.checkNull(cookie)) {
            cookie = "SCF=AjGxj6fuG*****00174"; // the (very long) cookie you just copied from DevTools
        }
        CloseableHttpClient client = HttpClients.createDefault();
        String responseText = "";
        CloseableHttpResponse response = null;
        try {
            HttpGet method = new HttpGet(url);
            method.addHeader(new BasicHeader("Cookie", cookie));
            RequestConfig config = RequestConfig.custom()
                    .setConnectTimeout(20 * 1000) // connection timeout, in milliseconds
                    .build();
            method.setConfig(config);
            response = client.execute(method);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                responseText = EntityUtils.toString(entity, ENCODING); // ENCODING is a class constant, e.g. "UTF-8"
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) { // guard against an NPE when execute() itself threw
                    response.close();
                }
                client.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return responseText;
    }

3. Some readers will be itching to try the method above right away, but on its own it isn't enough: Weibo adds another anti-crawling layer here. The post content comes back inside <script></script> blocks, wrapped in FM.view(...) calls, so it needs extra processing. Note that this step is the core of the crawler; it is the essence of what I worked out.

/**
     * Crawl the Weibo content hidden inside the page's <script> blocks
     * @param uid    the Weibo uid to crawl
     * @param cookie the Weibo cookie
     * @return posts from the last six months, excluding plain reposts
     */
    public static Result<List<MvcWeiboReptile>> get_js_html_byuid(String uid, String cookie) {
        // default: no data
        Result<List<MvcWeiboReptile>> result = new Result<List<MvcWeiboReptile>>();
        result.setType(TypeEnum.FAIL.getCode());
        result.setMessage("no data");
        List<MvcWeiboReptile> weiboReptileList = new ArrayList<>();
        StringBuffer stringBuffer = new StringBuffer();
        stringBuffer.append(smsUtil.get_byCookie("https://weibo.com/u/" + uid, cookie));
        if (stringBuffer.length() > 0) { // a StringBuffer is never null; check its content instead
            Document document = Jsoup.parse(stringBuffer.toString());
            Elements scripts = document.select("script");
            for (Element script : scripts) {
                String[] ss = script.html().split("<script>FM.view");
                stringBuffer = new StringBuffer();
                for (String x : ss) {
                    if (x.contains("\"html\":\"")) {
                        // extract and unescape the "html" field of the FM.view argument
                        stringBuffer.append(getHtml(x));
                    }
                }
                document = Jsoup.parse(stringBuffer.toString());
                Elements WB_details = document.getElementsByClass("WB_detail");
                for (Element WB_detail : WB_details) {
                    Elements WB_infos = WB_detail.getElementsByClass("WB_info");
                    if (WB_infos.size() == 1) {
                        for (Element WB_info : WB_infos) {
                            // only keep posts authored by the uid we are crawling
                            if (WB_info.html().contains(uid)) {
                                Elements WB_text = WB_detail.getElementsByClass("WB_text");
                                Elements WB_from = WB_detail.getElementsByClass("WB_from S_txt2");
                                String text = WB_text.html();
                                String time = WB_from.get(0).getElementsByTag("a").attr("title");
                                Date time_date = DateUtils.parseTimesTampDate(time + ":00");
                                // only keep posts from the last six months
                                if (time_date.after(DateUtils.getBeginDayOfLastSixMonth())) {
                                    // skip plain reposts ("转发微博" means "repost")
                                    if (!StringUtils.equals(text, "转发微博")) {
                                        MvcWeiboReptile weiboReptile = new MvcWeiboReptile();
                                        weiboReptile.setContext(filterEmoji(text));
                                        weiboReptile.setCreateTime(time_date);
                                        weiboReptileList.add(weiboReptile);
                                        result.setType(TypeEnum.SUCCESS.getCode());
                                        result.setMessage("data found");
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
        result.setData(weiboReptileList);
        return result;
    }

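The getHtml(x) helper used above isn't listed in the post; judging from the call site, it pulls the escaped "html" field out of the FM.view(...) argument and unescapes it. Here is a minimal sketch, assuming the payload is a JSON-string-escaped HTML fragment (the regex and the set of escapes handled are my assumptions, not the original code):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FmViewHtml {
    // Matches the "html":"..." field inside an FM.view({...}) call.
    private static final Pattern HTML_FIELD =
            Pattern.compile("\"html\":\"(.*)\"\\}\\)", Pattern.DOTALL);

    /** Extract the escaped HTML payload from an FM.view argument string. */
    public static String getHtml(String script) {
        Matcher m = HTML_FIELD.matcher(script);
        if (!m.find()) {
            return "";
        }
        // The payload is JSON-string-escaped: undo the common escapes.
        return m.group(1)
                .replace("\\/", "/")
                .replace("\\\"", "\"")
                .replace("\\n", "\n")
                .replace("\\t", "\t")
                .replace("\\r", "");
    }
}
```

Once unescaped, the fragment can be fed straight into Jsoup.parse as the method in step 3 does.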
4. After step 3 we have successfully extracted the data we need into MvcWeiboReptile objects, each holding a post's text and its publish time.
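The filterEmoji(text) call in step 3 is also not shown. Weibo text often contains emoji outside the Basic Multilingual Plane, which break 3-byte utf8 MySQL columns; a minimal sketch that strips supplementary-plane code points (this implementation is my assumption about the helper's intent, not the author's code):

```java
public class EmojiFilter {
    /** Remove code points outside the BMP (e.g. most emoji), keeping CJK text intact. */
    public static String filterEmoji(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> cp <= 0xFFFF)   // keep only Basic Multilingual Plane characters
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```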

Part 2: Saving the data: write a scheduled job and call the method above

/**
     * Runs every day at 5 a.m. and crawls the Weibo data
     * (pseudocode)
     */
    @Scheduled(cron = "0 0 5 * * ?")
    public synchronized void work1() {
        try {
            // 1. The cookie lives in Redis because it expires from time to time,
            //    which makes it easy to swap out without redeploying
            String cookie = redisService.get("cookie_for_weibo");
            // 2. Fetch the Weibo content for each uid with the cookie
            Result<List<MvcWeiboReptile>> listResult = weiboUtils.get_js_html_byuid(uid(), cookie);
            // 3. ...then persist the results
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
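Since the cookie is kept in Redis precisely because it expires, it helps to detect expiry before parsing: when weibo.com redirects to its login page, the response no longer contains the FM.view(...) blocks that step 3 relies on. A minimal heuristic sketch (the marker is my assumption based on the page structure described above, not something the original post defines):

```java
public class CookieCheck {
    /**
     * Heuristic: a logged-in weibo.com profile page carries FM.view(...) script
     * blocks; a login/redirect page (expired cookie) does not.
     */
    public static boolean looksLoggedOut(String html) {
        if (html == null || html.isEmpty()) {
            return true;
        }
        return !html.contains("FM.view");
    }
}
```

The scheduled job could call this on the raw response right after get_byCookie and alert or skip the run when it returns true, instead of silently parsing empty pages.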

Result screenshot: (the screenshot appears in the original post)
