Scrapy from Entry to Abandon 2: Simulated Login

Scrapy simulated login

Learning objectives:
  1. Learn to use the cookies parameter of the Request object
  2. Understand the role of the start_requests function
  3. Learn to construct and send POST requests

1. Review of previous simulated login methods

1.1 How does the requests module implement simulated login?

  1. Carry cookies directly when requesting the page
  2. Find the login URL and send a POST request; the session stores the returned cookie (a minimal sketch of both follows this list)
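
As a quick refresher, here is a minimal sketch of both approaches with the requests module; the URLs, cookie string, and form fields are placeholders, not a working login flow:

import requests

# Approach 1: carry cookies directly when requesting the page
cookies_str = 'key1=value1; key2=value2'  # captured from the browser (placeholder)
cookies_dict = {i.split('=')[0]: i.split('=')[1] for i in cookies_str.split('; ')}
response = requests.get('https://example.com/profile', cookies=cookies_dict)

# Approach 2: POST to the login URL; the Session object stores the returned cookie
session = requests.Session()
session.post('https://example.com/login', data={'username': '...', 'password': '...'})
response = session.get('https://example.com/profile')  # this request carries the stored cookie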

1.2 How does selenium simulate login?

  1. Find the corresponding input tags, enter the text, and click the login button (see the sketch below)
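
A minimal sketch with selenium, using github's login page as an example; the element locators below are assumptions for illustration:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://github.com/login')
# locate the input tags, type the credentials, then click the login button
# (the locators are assumptions for illustration)
driver.find_element(By.ID, 'login_field').send_keys('noobpythoner')
driver.find_element(By.ID, 'password').send_keys('***')
driver.find_element(By.NAME, 'commit').click()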

1.3 How does scrapy simulate login?

  1. Carry cookies directly
  2. Find the login URL and send a POST request to obtain the cookie

2. Carrying cookies in scrapy to directly request pages that require login

Application scenarios
  1. The cookie expiration time is very long, which is common on some poorly maintained websites
  2. All the data can be crawled before the cookie expires
  3. Working together with other programs, for example: use selenium to get the cookie after login and save it locally, then read the local cookie before scrapy sends its requests (see the sketch after this list)
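
A minimal sketch of scenario 3, assuming selenium and a local cookies.json file; the file name and the elided login steps are illustrative:

import json
from selenium import webdriver

# step 1 (separate program): log in with selenium, then dump the cookies to disk
driver = webdriver.Chrome()
driver.get('https://github.com/login')
# ... perform the login steps here ...
with open('cookies.json', 'w') as f:
    json.dump(driver.get_cookies(), f)  # a list of {'name': ..., 'value': ...} dicts

# step 2 (inside the spider): read the file and rebuild a cookies dict for scrapy
with open('cookies.json') as f:
    cookies_dict = {c['name']: c['value'] for c in json.load(f)}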

2.1 Implementation: rewrite scrapy's start_requests method

The URLs in start_urls are processed through the start_requests method; its implementation in the scrapy source code is as follows:

# This is the source code
def start_requests(self):
    cls = self.__class__
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)

Correspondingly, if a URL in start_urls can only be accessed after logging in, we need to rewrite the start_requests method and manually add the cookies in it.

2.2 Log in to github with cookies

Test account: noobpythoner zhoudawei123

import scrapy
import re

class Login1Spider(scrapy.Spider):
    name = 'login1'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/NoobPythoner']  # this page can only be accessed after logging in

    def start_requests(self):  # rewrite the start_requests method
        # this cookies_str is obtained by packet capture
        cookies_str = '...'  # obtained by packet capture
        # convert cookies_str into cookies_dict
        cookies_dict = {i.split('=')[0]: i.split('=')[1] for i in cookies_str.split('; ')}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies_dict
        )

    def parse(self, response):  # verify login success by matching the username with a regular expression
        # the regex matches the github username
        result_list = re.findall(r'noobpythoner|NoobPythoner', response.body.decode())
        print(result_list)
Note:
  1. In scrapy, cookies cannot be placed in headers; the Request constructor has a dedicated cookies parameter, which accepts cookies in dictionary form
  2. Set the ROBOTS protocol and USER_AGENT in settings.py (see the snippet below)
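
A minimal settings.py sketch for point 2; the USER_AGENT string is just an example value:

# settings.py
ROBOTSTXT_OBEY = False  # ignore robots.txt so login-protected pages can be requested
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'  # pose as a regular browser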

3. Sending POST requests with scrapy.Request

We know that scrapy.Request() can send POST requests by specifying the method and body parameters, but scrapy.FormRequest() is usually used to send POST requests instead; a sketch of the raw scrapy.Request approach follows.
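
A minimal sketch of the raw scrapy.Request approach, assuming a JSON login endpoint; the spider name, URLs, and payload are placeholders:

import json
import scrapy

class ApiLoginSpider(scrapy.Spider):  # hypothetical spider for illustration
    name = 'api_login'
    start_urls = ['https://example.com/login']  # placeholder

    def parse(self, response):
        # send a POST request by setting method and body explicitly
        yield scrapy.Request(
            'https://example.com/api/login',  # placeholder URL
            method='POST',
            body=json.dumps({'login': 'noobpythoner', 'password': '***'}),
            headers={'Content-Type': 'application/json'},
            callback=self.parse_login,
        )

    def parse_login(self, response):
        print(response.text)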

3.1 Sending a POST request

Note: scrapy.FormRequest() can submit forms and send ajax requests; for details, refer to https://www.jb51.net/article/146769.htm

3.1.1 Analysis of ideas
  1. Find the POST URL: click the login button and capture the packet; the URL is https://github.com/session

  2. Find the pattern of the request body: analyze the body of the POST request; the parameters it contains can be found in the previous response (the login page)

  3. Check whether the login succeeded: request the personal homepage and check whether it contains the username

3.1.2 The code implementation is as follows:

import scrapy
import re

class Login2Spider(scrapy.Spider):
    name = 'login2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # the request-body parameters are hidden input tags on the login page
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        commit = response.xpath("//input[@name='commit']/@value").extract_first()

        # build the POST request and hand it to the engine
        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata={
                "authenticity_token": authenticity_token,
                "utf8": utf8,
                "commit": commit,
                "login": "noobpythoner",
                "password": "***"
            },
            callback=self.parse_login
        )

    def parse_login(self, response):
        ret = re.findall(r"noobpythoner|NoobPythoner", response.text)
        print(ret)
Tips

By setting COOKIES_DEBUG = True in settings.py, you can see the cookie delivery process in the terminal.
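
In settings.py this is a single line:

# settings.py
COOKIES_DEBUG = True  # log the cookies sent in each request and received in each response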


Summary

  1. The URLs in start_urls are handed to start_requests for processing; when necessary, the start_requests method can be rewritten
  2. To log in by carrying cookies directly: cookies can only be passed through the cookies parameter, not in headers
  3. scrapy.Request() can send POST requests by specifying the method and body parameters

That's the end of this article. If it helped you, feel free to like and follow; your likes matter a lot to me.
