The latest version of the Weibo web crawler (login version)

Introduction and a brief description of the Weibo sites

After the thesis plagiarism check and re-formatting review were done, I went back to the graduation project. Debugging the Weibo crawler took a whole month, starting with cracking the login. Comparing weibo.com and weibo.cn, I found weibo.com too hard to work with: it involves a lot of encryption plus a timestamp and so on. On the other hand, the information on weibo.com is genuinely user-friendly; the layout, the aesthetics and the data we want are all presented very clearly, while weibo.cn is all plain text with a confusing layout. weibo.cn is the old version left over from the 2G/3G era, so there are basically no videos, text is the main content, and the images are rather small. Below are two screenshots comparing the two sites.

                

[Screenshots comparing the two sites: weibo.cn (left) vs weibo.com (right)]

 

I wanted to start with weibo.com, using the browser's F12 tools and the packet-capture tool Charles; after trying for a while I finally gave up. If you really want to use weibo.com, I recommend a Bilibili uploader's video on cracking weibo.com for reference, click here: Portal. But I noticed that most mainstream Weibo crawler blogs use weibo.cn; maybe everyone finds weibo.com too painful to deal with, and that is certainly true for me.

Take note: although weibo.cn is simpler, the trade-off is that the login is not quite as difficult (it is not huge, but there are still plenty of issues to handle), and if you plan to extract the data with XPath or JSON, don't even try. Many earlier articles parse the data with XPath or JSON, but when I tried that here, the parsed result came back empty no matter what I used. Eventually I found that the data only shows up after converting the response into plain text, so this post uses the Scrapy framework plus regular expressions on the text. One more thing: if you want to request an article's detail page for more information, that is hard too. I tried both Scrapy and requests with the same headers, the same endpoint and the same form data, and requests kept failing, with Weibo returning no data. A quick analysis of why: when you hit the weibo.cn domain, a script runs before login, and if you dig into it you will find a pre_login request hidden in the JS code, so to use requests you would first have to analyze that. Scrapy works here because it seems to do by default some of the things a browser does before the request goes out, so Scrapy it is.
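To make this concrete, here is a minimal sketch (trimmed down from the full spider shown later) of converting the response to text and pulling fields out with regular expressions. The spider name is only illustrative, and in practice the login shown later is needed before any data comes back.

import re
import scrapy


class MiniWeiboSpider(scrapy.Spider):
    # minimal illustrative spider, not the full crawler from this post
    name = 'weibo_mini'
    start_urls = ['https://weibo.cn']

    def parse(self, response):
        # xpath/json come back empty on these pages, so work on the raw text
        res = response.text
        # grab every post block between its opening div and the status div
        posts = re.findall(r'<div class="c" id=".*?">(.*?)<div class="s">', res)
        for post in posts:
            self.logger.info(post[:80])  # show the first 80 characters of each block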

Description: because the program I wrote is not finished yet and is still in progress, I can only publish the current login source code and the simple article-information extraction (the database part is not handled yet). Note: because Weibo loads data dynamically and each post can be very long, only part of the text is shown; if you need to crawl the full content you have to extract the article's full-text URL and request it again to get the complete article. An animated GIF of the results is shown below.

 

 

Even though the common Scrapy commands should be familiar by now, let's still walk through the whole process once ~

One: start a crawler project: scrapy startproject + project name

 eg:scrapy startproject weibo_info

 It will automatically generate the following directory:
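The screenshot of the generated directory is not reproduced here; a newly created Scrapy project normally has this layout:

weibo_info/
    scrapy.cfg
    weibo_info/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py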

 

 

Two: create the crawler spider

1. After the project is created, change directory into weibo_info

     eg:cd weibo_info

2. Create the Weibo spider: scrapy genspider + spider name + domain to crawl

eg: scrapy genspider weibo weibo.cn — after it runs, a weibo spider is automatically generated in the spiders directory
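The generated weibo.py is roughly the following empty skeleton, which step five fills in:

# -*- coding: utf-8 -*-
import scrapy


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    allowed_domains = ['weibo.cn']
    start_urls = ['http://weibo.cn/']

    def parse(self, response):
        pass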

 

Three: in settings.py, change the crawler from obeying the robots protocol to not obeying it

Change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False, and set a download delay of 1 second while you are at it.
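In settings.py that amounts to roughly these two lines:

# settings.py
ROBOTSTXT_OBEY = False   # do not obey robots.txt
DOWNLOAD_DELAY = 1       # wait 1 second between requests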

 

Uncomment DEFAULT_REQUEST_HEADERS and add your own browser's user-agent to disguise the crawler as a browser. For the login-related header information, you can log in to weibo.cn, use F12 or a packet-capture tool to inspect the request headers, and then copy the corresponding values over.

To get the logged-in request headers, go to weibo.cn, click log in and press the login button; a login request will appear, and opening it shows the request headers and parameters. Grab the request headers and copy them into the headers in settings.py.
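A sketch of what this part of settings.py can end up looking like; the User-Agent and Cookie values below are placeholders, paste your own from the browser or capture tool:

# settings.py -- placeholder values, copy the real ones from your own login request
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Referer': 'https://weibo.cn/',
    'Cookie': 'xxx',
}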

 

Four: open the items.py file and define the item containers, which makes it convenient to store the data and map it to database tables

The code is as follows:


import scrapy


class WeiboInfoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text_url = scrapy.Field()               # news URL
    text_publisher = scrapy.Field()         # news author
    text_start_time = scrapy.Field()        # news publish time
    text_title = scrapy.Field()             # news title
    text_content = scrapy.Field()           # news content
    text_like = scrapy.Field()              # number of likes
    text_transfers = scrapy.Field()         # number of reposts
    text_comments_numbers = scrapy.Field()  # number of comments


class WeiboInfo_hot_commentsItem(scrapy.Item):
    hot_comment = scrapy.Field()  # hot comment
    hot_like = scrapy.Field()     # number of likes on the comment

Five: complete the Weibo spider — find the weibo.py file in the spiders directory and write the crawling logic

The extraction method is regular expressions. Why not use XPath or JSON, which are so convenient? The reason was given earlier: those two methods return no data here, and the response has to be converted to a text type first; extracting from the text with regular expressions is fast and convenient.

Here is a regular-expression tutorial from the Rookie Tutorial (runoob) site for learning: click into the regex extraction tutorial. The article is not long, and a quick look is enough to understand the common regex extraction methods.
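For readers new to regular expressions, the only piece really needed in this post is re.findall with a non-greedy capture group (.*?), for example:

import re

html = '<a href="https://weibo.cn/some_post">赞[1024]</a>'
print(re.findall(r'<a href="(.*?)">', html))   # ['https://weibo.cn/some_post']
print(re.findall(r'赞\[(.*?)\]', html))        # ['1024']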

The code is as follows:

A note about the commented-out code in weibo.py: it covers features that are harder to implement for my own needs and is not perfected yet; anyone who needs them can take the source code and modify it.

# -*- coding: utf-8 -*-
import scrapy, json
from ..items import *
import time, re


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    allowed_domains = ['weibo.cn', 'weibo.com']
    start_urls = ['https://weibo.cn']

    # Log in before crawling any information
    def start_requests(self):
        login_url = "https://passport.weibo.cn/sso/login"
        fromdata = {
            'username': 'xxxx',  # your account
            'password': 'xxx',   # your password
            'savestate': '1',
            'r': 'https://weibo.cn/',
            'ec': '0',
            'pagerefer': 'https://weibo.cn/pub/',
            'entry': 'mweibo',
            'wentry': '',
            'loginfrom': '',
            'client_id': '',
            'code': '',
            'qq': '',
            'mainpageflag': '1',
            'hff': '',
            'hfp': '',
        }
        yield scrapy.FormRequest(url=login_url, formdata=fromdata, callback=self.parse_login)

    def parse_login(self, response):
        # Check whether the Weibo login succeeded
        json_res = json.loads(response.text)
        if json_res['retcode'] == 20000000:
            info_url = "https://weibo.cn/"
            yield scrapy.Request(url=info_url, callback=self.parse_info)
        else:
            print("*" * 100 + '\n' + "登录失败")  # login failed

    def parse_info(self, response):
        item = WeiboInfoItem()  # item container
        # Comments are tied to an article; the article id auto-increments from 1
        # in the database, so this flag also starts at 1
        text_comment_flag = 1
        print("+" * 100)
        # print(response.text)
        # Convert to plain text: Weibo loads data dynamically, so xpath extracts nothing
        res = response.text
        # print(res)
        weibo_info = re.findall(r'<div class="c" id=".*?">(.*?)<div class="s">', res)
        for info in weibo_info:
            # print(info)
            try:
                # There are too many posts and new ones keep arriving,
                # so non-hot posts can be filtered out
                text_like = re.findall(r'<a href=.*?add.*?>赞\[(.*?)\]</a>', info)[0]
                text_comments_numbers = re.findall(r'<a href=.*?comment.*?>评论\[(.*?)\]</a>', info)[0]
                # Filter rule: skip posts with fewer than 2000 likes or fewer than 200 comments
                # if int(text_like) < 2000 or int(text_comments_numbers) < 200:
                #     continue

                item['text_url'] = re.findall(r'<span class="ctt">.*?<a href="(.*?)".*?', info)[0]  # post URL extracted with a regex
                item['text_publisher'] = re.findall(r'<a class="nk" href=.*?>(.*?)</a>', info)[0]
                text_start_time = re.findall(r'<span class="ct">(.*?)&nbsp', info)[0]
                # item['text_start_time'] =
                item['text_title'] = re.findall(r'【.*?>(.*?)</a>', info)[0]
                item['text_content'] = re.findall(r'】(.*?)</span>', info)[0]
                ### The displayed content is truncated; request the original page to get the full text
                # text_content = re.findall(r'】(.*?)</span>', info)[0]
                # if '>全文<' in text_content:
                #     # extract the full-text URL
                #     detaile_content_url = re.findall(r"<a href='(.*?)'>全文</a>", text_content)[0]
                #     # request the full-text URL again
                #     print("全文地址:" + self.start_urls[0] + detaile_content_url)
                #
                #     yield scrapy.Request(url=self.start_urls[0] + detaile_content_url, callback=self.parse_detaile_content)

                    # filter out articles whose content is too short
                    # if len(detaile_content['text_content']) < 200:
                    #     continue
                    # item['text_content'] = detaile_content
                    # print("@" * 100)
                    # print(detaile_content)
                # else:
                #     print("else " * 50)
                #     item['text_content'] = text_content
                item['text_like'] = text_like
                item['text_transfers'] = re.findall(r'<a href=.*?repost.*?>转发\[(.*?)\]</a>', info)[0]
                item['text_comments_numbers'] = text_comments_numbers

                print("文章地址:" + item['text_url'])
                print("文章发表人:" + item['text_publisher'])
                print("文章时间:" + text_start_time)
                print("文章标题:" + item['text_title'])
                print("文章内容:" + item['text_content'])
                print("文章点赞:" + item['text_like'])
                print("文章转发:" + item['text_transfers'])
                print("文章评论:" + item['text_comments_numbers'])
                print("文章内容长度" + str(len(item['text_content'])))

                # ********** extract comments ********** #
                # comments_url = re.findall(r'转发\[.*?\]</a>&nbsp;<a href="(.*?)" class="cc">评论.*?</a>', info)[0]
                # print("文章评论url:" + comments_url)
                # yield scrapy.Request(url=comments_url, meta={'flag': text_comment_flag}, callback=self.parse_comments)
                # text_comment_flag += 1
                # separator line between posts
                print("*" * 100)
            except:
                continue

        # Extract the URL of the next page
        try:
            next_url = re.findall(r'<form action="/\?.*?<a href="(.*?)">下页</a>', res)[0].replace("amp;", "")
            print("=" * 100)
        except:
            next_url = None
        # If a next page exists, join it with the domain, request it and call back into this method
        if next_url:
            print(self.start_urls[0] + next_url)
            yield scrapy.Request(url=self.start_urls[0] + next_url, callback=self.parse_info)

    # Extract comments
    def parse_comments(self, response):
        item = WeiboInfo_hot_commentsItem()  # hot-comment item container
        print("6" * 100)
        flag = response.meta['flag']
        print("flag:" + str(flag))
        res = response.text
        print(res)
        # hot_comments = re.findall(r'<span class="ctt">(.*?)</span>', res)
        # print(hot_comments)
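One more note on the commented-out block above: it yields a request to the full-text page with callback=self.parse_detaile_content, but that callback is not defined in this version. A minimal sketch of what such a method could look like is below; the regex is an assumption about the full-text page markup and has not been tested against it.

# hypothetical method to add inside class WeiboSpider; the pattern below is a guess
def parse_detaile_content(self, response):
    res = response.text
    try:
        # assumed markup of the full-text page; adjust the pattern if it differs
        full_text = re.findall(r'<span class="ctt">(.*?)</span>', res)[0]
        print("全文内容:" + full_text)
    except IndexError:
        pass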


The part below should be the essence of this article ~

Explanation and analysis of the code above:

def start_requests(self) is a method of the Scrapy spider class that runs before the crawl starts; before extracting any information we need to log in with an account and password.

On weibo.cn, click log in (here I deliberately entered a wrong password). In the browser's F12 panel, under All or XHR, or in a capture tool, you can find the login ajax request, which sends the form data containing our account and password to the server. If you enter a wrong account nothing gets displayed, but if you enter the correct one the page jumps away in a very short time and you may not catch the request; in that case you need a capture tool or some other way to stop the page at the login step so you can observe it more easily. By deliberately making a mistake, as I did here, the page does not jump, and it is easy to see the request URL, form data and other parameters.

Here we summarize:

  1. The login URL is: https://passport.weibo.cn/sso/login
  2. The form data that must be submitted:
    
                'username': 'xxx',  # your account
                'password': 'xxx',  # your password
                'savestate': '1',
                'r': 'https://weibo.cn/',
                'ec': '0',
                'pagerefer': 'https://weibo.cn/pub/',
                'entry': 'mweibo',
                'wentry': '',
                'loginfrom': '',
                'client_id': '',
                'code': '',
                'qq': '',
                'mainpageflag': '1',
                'hff': '',
                'hfp': '',
            
  3. The login request is a POST, not a GET, so use Scrapy's FormRequest() and pass the form data along with it.
  4. The login returns JSON data; a retcode of 20000000 means the login succeeded, anything else means it failed.

A few more things to note about the code above:

  1. All data crawling runs only after logging in.
  2. The information crawled is what is published by the accounts that your logged-in account follows; for example, the account I use here follows China Daily, People's Daily and so on.
  3. The database and other follow-up features need to be completed by yourself; you can also refer to the crawler column on my blog, which has examples of storing data in MySQL after crawling with Scrapy (a rough sketch is also given after this list).
  4. I am still revising the code according to my own needs, so the current code may still have many issues; if you run into other problems, I may not be able to solve them either, so please try on your own first. What I provide here is the basic idea and base code that guarantees basic data extraction.
  5. If you have questions, please leave a message; if any expert knows the answer, I hope they can spare a little time to help us out. Wish everyone a safe and happy life.
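For point 3, here is a rough sketch of what a MySQL pipeline could look like; it is not part of the published code, the table and column names are made up for illustration, and it assumes pymysql is installed and the pipeline is enabled in ITEM_PIPELINES in settings.py.

# pipelines.py -- illustrative sketch only, not part of the released code
import pymysql


class WeiboInfoPipeline:
    def open_spider(self, spider):
        # connection parameters are placeholders
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='xxx', db='weibo', charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # hypothetical table weibo_info with one column per item field
        sql = ("INSERT INTO weibo_info (text_url, text_publisher, text_title, text_content, "
               "text_like, text_transfers, text_comments_numbers) "
               "VALUES (%s, %s, %s, %s, %s, %s, %s)")
        self.cursor.execute(sql, (item['text_url'], item['text_publisher'], item['text_title'],
                                  item['text_content'], item['text_like'],
                                  item['text_transfers'], item['text_comments_numbers']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()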

 

 


Origin blog.csdn.net/memory_qianxiao/article/details/105395947