Scrapy crawling pictures

1. Summary of problems encountered

The role of the meta parameter in Request is to pass information along to the next callback function. Its use can be understood as follows:

Assign the information you want to pass to this variable called meta. Since meta only accepts dictionary-type values, the information to be passed has to be wrapped in dictionary form, i.e.:

meta={'key1': value1, 'key2': value2}

To retrieve value1 in the next function, you only need to read meta['key1'] from the previous function. Because meta travels with the Request when it is created, the Response object received by the next function carries the same meta, i.e. response.meta; the meta of the previous function and the meta of the next function are the same, so value1 is retrieved as value1 = response.meta['key1'].
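A minimal sketch of this round trip (the spider name and URLs are placeholders):

import scrapy
from scrapy import Request


class MetaDemoSpider(scrapy.Spider):
    name = 'meta_demo'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Attach a value to the next request via meta.
        yield Request('http://www.example.com/page2',
                      meta={'key1': 'value1'},
                      callback=self.parse2)

    def parse2(self, response):
        # The same meta dictionary comes back on the response.
        value1 = response.meta['key1']
        self.logger.info('received: %s', value1)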

This information can be of any type: numbers, strings, lists, dictionaries... The method is to assign the information to be passed to a key of the dictionary. For analysis, see the following code (the crawler file; imports added, and the items import path is an assumption to adjust for your project):

import scrapy
from scrapy import Request

from myproject.items import ExampleClass  # assumption: adjust to your items module


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # A URL parsed out of the start_urls page is assigned to url.
        url = response.xpath('......').extract_first()
        # ExampleClass is defined in items.py and is shown below.
        # Remember that an item itself behaves like a dictionary.
        item = ExampleClass()
        item['name'] = response.xpath('.......').extract()
        item['htmlurl'] = response.xpath('......').extract()
        # Through the meta parameter, the item dictionary is assigned to the
        # 'key' key of meta (meta itself is also a dictionary). Scrapy "puts"
        # this meta dictionary (whose 'key' entry holds the item) into the
        # Request object and delivers it to parse2().
        yield Request(url, meta={'key': item}, callback=self.parse2)

    def parse2(self, response):
        # Strictly speaking a shallow copy (see the end of the text); this
        # shares the item between the two functions, as if item = item, so
        # both operate on the same item defined in items.py. The response
        # already carries the meta dictionary from parse(), and this line
        # assigns the item back, completing the transfer.
        item = response.meta['key']
        item['text'] = response.xpath('.......').extract()
        # All three keys of the item are now filled in.
        yield item

The definition in items.py is as follows:

import scrapy


class ExampleClass(scrapy.Item):
    name = scrapy.Field()
    htmlurl = scrapy.Field()
    text = scrapy.Field()

Of course, meta can also pass cookies (the first method):

The key 'cookiejar' in start_requests below is a special key. When Scrapy sees this key in meta, it automatically carries the cookies over to the callback function (this is handled by the built-in CookiesMiddleware, so cookies must not be disabled in the settings). Since it is a key, it needs a corresponding value; the example uses the number 1, but any other value works, such as an arbitrary string.

def start_requests(self):
    yield Request(url, meta={'cookiejar': 1}, callback=self.parse)
It should be noted that assigning a value to 'cookiejar' in meta does more than indicate that the cookies should be passed on; it also labels the cookie session. One cookiejar value represents one session. If you need to crawl a website through multiple sessions, you can label the cookies 1, 2, 3, 4... and Scrapy will maintain the sessions separately (see the sketch after the next snippet).

def parse(self, response):
    # After this line, key = 1 (the value stored under 'cookiejar' above:
    # key = response.meta['cookiejar'] = 1).
    key = response.meta['cookiejar']
    yield Request(url2, meta={'cookiejar': key}, callback=self.parse2)

def parse2(self, response):
    pass
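To illustrate the labeling idea, here is a minimal sketch (the URLs and the number of sessions are made up) that opens three independent cookie sessions:

from scrapy import Request, Spider


class MultiSessionSpider(Spider):
    name = 'multi_session'

    def start_requests(self):
        # Each distinct 'cookiejar' value gets its own cookie session, so
        # the three requests below do not share cookies with one another.
        for i in range(3):
            yield Request('http://www.example.com/login',
                          meta={'cookiejar': i},
                          callback=self.parse,
                          dont_filter=True)

    def parse(self, response):
        # Keep forwarding the same label to stay in the same session.
        yield Request('http://www.example.com/account',
                      meta={'cookiejar': response.meta['cookiejar']},
                      callback=self.parse2,
                      dont_filter=True)

    def parse2(self, response):
        pass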

 

The paragraph above and the paragraph below are equivalent:

def parse(self, response):
    # The cookiejar label is still the number 1.
    yield Request(url2,
                  meta={'cookiejar': response.meta['cookiejar']},
                  callback=self.parse2)

def parse2(self, response):
    pass

The second way to pass cookies:

If the sessions do not need to be labeled, it can be written as follows:

# First import the CookieJar class
from scrapy.http.cookies import CookieJar

Then write the spider methods:

def start_requests(self):
    # The callback must be a callable such as self.parse, not a string.
    yield Request(url, callback=self.parse)

def parse(self, response):
    cj = response.meta.setdefault('cookie_jar', CookieJar())
    cj.extract_cookies(response, response.request)
    # Note: _cookies is a private attribute of CookieJar, so this relies
    # on an implementation detail rather than a public API.
    container = cj._cookies
    yield Request(url2, cookies=container, meta={'key': container},
                  callback=self.parse2)

def parse2(self, response):
    pass
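As a follow-up sketch, the container passed through meta is available again in parse2, so it can be forwarded to yet another request (url3 and parse3 are hypothetical):

def parse2(self, response):
    # Retrieve the cookie container that was attached to the request's meta.
    container = response.meta['key']
    yield Request(url3, cookies=container, meta={'key': container},
                  callback=self.parse3)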

meta is passed as a shallow copy; make a deep copy if you need the data to be independent.

It can be done like this:

import copy

# value is whatever object you want to pass along
meta = {'key': copy.deepcopy(value)}
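For instance, in a sketch like the following (url2 is a placeholder, as in the snippets above), the deep copy keeps later mutations in parse() from leaking into parse2():

import copy

from scrapy import Request


def parse(self, response):
    item = {'name': ['a']}
    # Without deepcopy, parse2 would receive a reference to this same dict,
    # so the append below would also change what parse2 sees.
    yield Request(url2, meta={'key': copy.deepcopy(item)},
                  callback=self.parse2)
    item['name'].append('b')  # does not affect the copy already sent


def parse2(self, response):
    item = response.meta['key']  # an independent copy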

Author: Urban
Link: https://www.zhihu.com/question/54773510/answer/146971644
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.
