Submitting Forms with Scrapy or Requests

Here I use the GitHub login page as the example:

https://github.com/login

By inspecting the login request (the Network tab in Chrome or Firefox shows exactly what the form submits), we can see that the POST must carry an extra parameter: authenticity_token.

We also notice that authenticity_token changes on every page load, so we need to keep our requests inside one session: the GET that fetches the token and the POST that submits it must share the same cookies, otherwise the token we scraped will no longer be valid.
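To make the token extraction concrete, here is a minimal offline sketch using only the standard library. The HTML fragment is made up to mimic the shape of GitHub's login form (the real page is much larger), and TokenParser is a hypothetical helper name:

```python
from html.parser import HTMLParser

# A made-up fragment shaped like GitHub's login form
SAMPLE = '''
<form action="/session" method="post">
  <input type="hidden" name="utf8" value="&#x2713;">
  <input type="hidden" name="authenticity_token" value="abc123TOKEN==">
  <input type="text" name="login">
</form>
'''

class TokenParser(HTMLParser):
    """Collects the value of the hidden authenticity_token input."""
    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == 'input' and a.get('name') == 'authenticity_token':
            self.token = a.get('value')

p = TokenParser()
p.feed(SAMPLE)
print(p.token)  # abc123TOKEN==
```

The code below uses XPath instead (via Scrapy's selectors or lxml), but the idea is the same: pull the hidden input's value out of the GET response before building the POST.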

In short: Scrapy keeps a session via meta = {'cookiejar': i}; requests that carry the same cookiejar value share one session. Requests keeps one via s = requests.Session(), after which you call s.get() and s.post() on it.
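As a tiny offline illustration (no network needed): a Session object is essentially a shared cookie jar plus default settings that every s.get() / s.post() call made through it reuses:

```python
import requests

s = requests.Session()
# Any cookie set on the session (whether by a server response or manually,
# as here) is sent with every later request made through the same session
s.cookies.set('logged_in', 'yes')
print(s.cookies.get('logged_in'))  # yes
```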

First, the Scrapy code (this is the spider, as you have probably guessed):

Chinese documentation on submitting forms with Scrapy: https://www.rddoc.com/doc/Scrapy/1.3/zh/topics/request-response/

Official documentation: https://doc.scrapy.org/en/latest/topics/request-response.html

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request, FormRequest


class GithubSpider(scrapy.Spider):
    name = 'github'
    # allowed_domains = ['github.com']
    # start_urls = ['https://github.com/login']
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://github.com/login',
        'Content-Type': 'application/x-www-form-urlencoded',
    }

    def start_requests(self):
        urls = ['https://github.com/login']
        for i, url in enumerate(urls, 1):
            # Requests carrying the same 'cookiejar' value share one session
            yield Request(url, meta={'cookiejar': i}, callback=self.github_parse)

    def github_parse(self, response):
        # The hidden token is the second <input> of the login form
        authenticity_token = response.xpath('//*[@id="login"]/form/input[2]/@value').extract()[0]  # or use extract_first()
        self.logger.info('authenticity_token=' + authenticity_token)
        return FormRequest.from_response(
            response,
            url='https://github.com/session',
            meta={'cookiejar': response.meta['cookiejar']},
            headers=self.headers,
            formdata={
                'login': 'email',          # your email
                'password': 'password',    # your password
                'authenticity_token': authenticity_token,
                'utf8': ''
            },
            callback=self.github_login,
            # dont_click=True,
        )

    def github_login(self, response):
        # The dashboard heading only exists after a successful login
        data = response.xpath('//*[@id="dashboard"]/div[1]/div[2]/h3/text()').extract_first()
        if data:
            self.logger.info('Logged in successfully!')
            self.logger.info(data)
        else:
            self.logger.error('Login failed!')

Next, the same thing with a Requests Session (this one was written rather casually; at the time I just wanted to check my understanding of Sessions in Requests).

If you are not familiar with it, see the documentation: http://docs.python-requests.org/zh_CN/latest/user/advanced.html#advanced

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://github.com/login',
    'Content-Type': 'application/x-www-form-urlencoded',
}

session = requests.Session()
url1 = 'https://github.com/login'
r1 = session.get(url1)
s = etree.HTML(r1.text)
# xpath() returns a list, so take the first match
authenticity_token = s.xpath('//*[@id="login"]/form/input[2]/@value')[0]

url2 = 'https://github.com/session'
formdata = {
    'login': 'email',          # enter your email
    'password': 'password',    # enter your password
    'authenticity_token': authenticity_token,
    'utf8': ''
}
r2 = session.post(url2, data=formdata, headers=headers)
s2 = etree.HTML(r2.text)
# The dashboard heading only exists after a successful login
data = s2.xpath('//*[@id="dashboard"]/div[1]/div[2]/h3/text()')
if data:
    print('Success')
else:
    print('Fail')

To repeat the point once more:

Scrapy keeps a session via meta = {'cookiejar': i}; requests that carry the same cookiejar value share one session. Requests keeps one via s = requests.Session(), after which you call s.get() and s.post() on it.


Reprinted from www.cnblogs.com/ducklu/p/9026743.html