A record of one use of Scrapy

Background:

The app needs to fetch orders nationwide and record the order volume. Since there is no nationwide option, the crawler has to go through each city's orders separately. It runs once a day and only fetches orders published within the last 48 hours; those are compared against the previous run's data in the database, orders that have been taken are updated, and nothing else is touched. (The update logic is not important here; what matters is the crawling logic.) Each order has a publish time, and the crawler judges by it: once an order falls outside the 48-hour window, it stops crawling the current city and moves on to the next one.

Here is the first version:

#spider

# Build some request parameters (omitted here)
# Read the full list of cities from the settings
cities = self.settings['CITY_CH']

# end_signal marks that the current city has been fully crawled
self.end_signal = False

for city in cities:
    # Crawl the orders of each city in turn
    post_data.update({'locationName': city})
    count = 1
    while not self.end_signal:
        post_data.update({'pageNum': str(count)})
        data = ''.join(json.dumps(post_data, ensure_ascii=False).split())
        sign = MD5Util.hex_digest(api_key + data + salt).upper()
        params = {
            'apiKey': api_key,
            'data': data,
            'system': system,
            'sign': sign
        }
        meta = {'page': count}
        yield scrapy.Request(url=url, method='POST', body=json.dumps(params, ensure_ascii=False),
                             headers=self.headers, callback=self.parse, meta=meta, dont_filter=True)
        count += 1
    self.end_signal = False

def parse(self, response):
    # omitted
# In the spiderMiddleware, judge by the order time carried in the returned item (not shown in full)

def process_spider_output(self, response, result, spider):
    result_bkp = []
    for res in result:
        if res['order_time'] < before_date(2):  # before_date is a custom date helper
            logger.info("{%s} crawl finished, starting the next city" % (res['city_name']))
            spider.end_signal = True
            break
        result_bkp.append(res.copy())
    return result_bkp
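
The post doesn't show before_date; a minimal sketch of what such a helper might look like, assuming order_time is a string in a format like '2019-09-12 18:00:00' (so that plain string comparison behaves like a date comparison), could be:

from datetime import datetime, timedelta

def before_date(days):
    # Timestamp string for `days` days ago, in the same assumed
    # '%Y-%m-%d %H:%M:%S' format as order_time, so comparing the two
    # strings is equivalent to comparing the dates.
    return (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d %H:%M:%S')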

At first glance there is no problem: iterate over each city, and after parsing, the returned items go through the spiderMiddleware, which checks whether the order is older than 48 hours. If it is, self.end_signal is set to True to break out of the while loop in the spider; note that after the while loop the flag is set back to False, and then the loop for the next city begins.
The problem:
The spider yields its requests into a queue, and when the downloader finishes a response and hands it back to the parse function there is another queue to work through. As everyone knows, if your luck is bad you will occasionally hit a bit of network trouble. A concrete example:
Example: for city A the spider enqueues order pages 1, 2 and 3 (pages 2 and 3 only contain orders older than 48 hours). When the downloader fetches them, the proxies for the first two pages may fail while page 3 comes back fine; page 3 is past the 48-hour mark, so the middleware sets self.end_signal = True and the spider moves on to the next city. City B's pages 1, 2 and 3 are enqueued (all within 48 hours), and just then city A's page 2 finishes downloading; the middleware judges it stale and flips self.end_signal = True again, so the rest of city B's orders are gone too, all gone... the spider jumps straight to the next city's orders!
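
A toy model (plain Python, not Scrapy) of why a single shared flag breaks down, assuming city B's pages are all fresh and a stale response from city A arrives late:

class Spider:
    end_signal = False

spider = Spider()

def middleware_sees_stale_order(city):
    # In the real project this fires whenever ANY response carries an
    # order older than 48 hours, no matter which city the spider is
    # currently iterating over.
    print('stale order from city %s -> end_signal = True' % city)
    spider.end_signal = True

for page in (1, 2, 3):              # city B's pages, all within 48 hours
    if spider.end_signal:
        break
    print('requesting city B, page %d' % page)
    if page == 1:
        # the delayed response for city A's stale page arrives right now
        middleware_sees_stale_order('A')
# City B stops after page 1 even though none of its orders are stale.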

Summary of the first version:

Do not use a single global variable to control the flow of an entire asynchronous program. (Not a great summary; feel free to help me phrase it better.)

The second version:

Since a global variable can't be used for control, give each city its own marker to indicate that its orders have finished crawling.
Here is the code:

#spider
self.cities = self.settings['CITY_CH']

for city in self.cities:
    # Crawl the orders of each city in turn
    post_data.update({'locationName': city})
    count = 1
    print(self.cities)  # debug prints
    print(city)
    # stop once the middleware removes this city from self.cities
    while city in self.cities:
        post_data.update({'pageNum': str(count)})
        data = ''.join(json.dumps(post_data, ensure_ascii=False).split())
        sign = MD5Util.hex_digest(api_key + data + salt).upper()
        params = {
            'apiKey': api_key,
            'data': data,
            'system': system,
            'sign': sign
        }
        meta = {'page': count}
        yield scrapy.Request(url=url, method='POST', body=json.dumps(params, ensure_ascii=False),
                             headers=self.headers, callback=self.parse, meta=meta, dont_filter=True)
        count += 1

def parse(self, response):
    # omitted
# In the spiderMiddleware, judge by the order time carried in the returned item (not shown in full)

def process_spider_output(self, response, result, spider):
    result_bkp = []
    for res in result:
        if res['order_time'] < before_date(2):  # before_date is a custom date helper
            if res['city_name'] in spider.cities:
                spider.cities.remove(res['city_name'])
                logger.info("{%s} crawl finished, starting the next city" % (res['city_name']))
            break
        result_bkp.append(res.copy())
    return result_bkp

This looks reasonable: the logic checks whether the city is still in the list; if it is, its crawl isn't finished yet, and once it finishes the city is removed from the list. OK! Run it!

Here is where it gets interesting: the first city crawled normally, but the second city was gone. The debug prints in the spider above never showed the second city; it jumped straight to the last one (three cities were configured). How did it get swallowed?
The data is sensitive, so no screenshots.

From the prints you can see that Beijing is clearly still in the city list and was never deleted, so why did the crawl jump straight to the last city?
Some of you may have spotted it already; I spent a long time with breakpoints and the debugger, even suspecting a bug inside the for loop itself.
Finally an idea struck me (embarrassing): could the problem be the city list itself? I'm looping over it with for while removing elements from it at the same time. Is that even allowed?
I wrote a quick demo to test it:

cities = ['鞍山', '北京', '昆玉',]

for city in cities:
    cities.remove('鞍山')
    print(city)
# Here comes the error! Sure enough, you cannot delete items from a list while looping over it
ValueError: list.remove(x): x not in list

As for why this error never shows up when running under Scrapy: in the middleware the remove is guarded by if res['city_name'] in spider.cities, so remove is only called for a city that is still in the list and no ValueError is raised. The underlying problem of shrinking the list while the spider's for loop iterates over it is still there, though, so let's fix it.
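
Even without the exception, the silent skipping still happens. Reproducing it with the same membership guard the middleware uses shows exactly the behaviour seen in the crawl:

cities = ['鞍山', '北京', '昆玉']

for city in cities:
    if '鞍山' in cities:        # same membership check as the middleware, so no ValueError
        cities.remove('鞍山')
    print(city)
# Prints 鞍山 and then 昆玉: 北京 is skipped without any error,
# which is exactly how the second city vanished under Scrapy.
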
Change for city in self.cities to for city in self.cities.copy() in the spider, and the problem is solved! The for loop now walks over a snapshot, while the while condition and the middleware keep working against the live self.cities list, so removing a finished city ends its while loop without disturbing the iteration.
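
Applied to the demo above, the fix looks like this:

cities = ['鞍山', '北京', '昆玉']

for city in cities.copy():      # iterate over a snapshot of the list
    if '鞍山' in cities:
        cities.remove('鞍山')    # mutate the original list freely
    print(city)
# All three cities are printed; removing from `cities` no longer
# disturbs the iteration.
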
One more small Python point: the difference between copying a value and copying a reference; you need to keep it in mind when handling items.
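
That is why the middleware appends res.copy() instead of res: appending a dict-like item only stores a reference, so mutating the item later would also change what was "saved". A quick illustration:

order = {'city_name': '北京', 'order_time': '2019-09-12 18:00:00'}

kept = [order]                      # stores a reference to the same dict
order['order_time'] = 'changed'
print(kept[0]['order_time'])        # 'changed': the saved entry changed too

kept_safe = [order.copy()]          # a shallow copy owns its own data
order['order_time'] = 'changed again'
print(kept_safe[0]['order_time'])   # still 'changed'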

Source: www.cnblogs.com/mangM/p/11515376.html