Scrapy - every page is scraped, but Scrapy wraps around and re-scrapes the first X pages

chrisHG :
from scrapy.spiders import CrawlSpider

# assuming the item class is defined in the project's items.py
from ..items import HomedepotSpiderItem


class HomedepotcrawlSpider(CrawlSpider):

    name = 'homeDepotCrawl'
    #allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&Nao=0']

    def parse(self, response):

        for item in self.parseHomeDepot(response):
            yield item

        # follow the <link rel="next"> element to the next results page
        next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)

    def parseHomeDepot(self, response):

        items = response.css('.plp-pod')
        for product in items:
            item = HomedepotSpiderItem()

            # get the SKU text
            productSKU = product.css('.pod-plp__model::text').getall()

            # get rid of the text I don't need
            productSKU = [x.strip(' ') for x in productSKU]   # whitespace
            productSKU = [x.strip('\n') for x in productSKU]
            productSKU = [x.strip('\t') for x in productSKU]
            productSKU = [x.strip(' Model# ') for x in productSKU]  # drops the "Model# " label
            productSKU = [x.strip('\xa0') for x in productSKU]      # non-breaking spaces

            item['productSKU'] = productSKU

            yield item
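
One caveat about the cleanup above: str.strip() with an argument removes any of the listed characters from both ends, not the literal substring, so strip(' Model# ') only works because these SKUs happen not to begin or end with one of those letters. A single-pass cleanup avoids that trap (a sketch, not part of the original spider; the sample SKU is made up):

def clean_sku(raw):
    # strip surrounding whitespace (spaces, \n, \t, and \xa0 all count),
    # then drop the literal "Model#" label
    return raw.strip().replace('Model#', '').strip()

print(clean_sku('\n\t Model#\xa0GR36 '))  # -> 'GR36'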

Explanation of the Problem

Here is part of the program I have been working on to scrape data; I left out the code that scrapes the other fields because I did not think it was necessary for this post. When I run the program and export the data to Excel, I get the first 240 items (10 pages), which fills the spreadsheet up to row 241 (the first row is occupied by labels). Then, starting at row 242, the first 240 items are repeated, and they repeat again starting at rows 482 and 722.

The scraper outputs the first 240 items three times.

EDIT: So I was looking through the crawl log, and it turned out that every page was getting scraped. The last page is:

https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&Nao=696&Ns=None

Afterwards, the log shows the first page getting scraped again, which is:

https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default

I assume it is because of... (screenshot not included)

The terminal command that I'm using to export to Excel is:

scrapy crawl homeDepotCrawl -t csv -o - > "(File Location)"

Edit: The reason I use this command is that when exporting with -o, Scrapy appends the scraped data to an existing file, whereas redirecting stdout erases the target file and creates it again.
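
As an aside, Scrapy 2.0 and later also have a capital -O flag that overwrites the output file instead of appending, which removes the need for the shell redirect (output.csv is a placeholder name):

scrapy crawl homeDepotCrawl -O output.csv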

The markup that I used to derive the next-page link is:

<a class="hd-pagination__link" title="Next" href="/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&amp;Nao=24&amp;Ns=None" data-pagenumber="2"></a>
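
For reference, that anchor can also be selected directly; a minimal sketch of an equivalent extraction inside parse(), using the class and title attributes from the markup above:

# alternative to the <link rel="next"> XPath: select the "Next" anchor itself
next_page_url = response.css('a.hd-pagination__link[title="Next"]::attr(href)').get()
if next_page_url:
    yield response.follow(next_page_url, callback=self.parse)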

Originally I thought the website was causing this unexpected behavior, so in settings.py I set ROBOTSTXT_OBEY = 0 and added a download delay, but that did not change anything.

So what I would like help with:

- Figuring out why the CSV output only contains the first 240 items (10 pages), repeated three times

- How to ensure the spider doesn't go back to the first page after scraping all 30 pages (see the sketch below)
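
For the second point, one possible guard (my own sketch, not from the original post) is a drop-in replacement for the spider's parse method that remembers which Nao offsets have already been followed and refuses to follow one twice, since the wrap-around simply re-serves offsets the spider has already scraped:

from urllib.parse import urlparse, parse_qs

def parse(self, response):
    for item in self.parseHomeDepot(response):
        yield item

    next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
    if next_page_url:
        # Nao is the product offset; '0' is pre-seeded because start_urls already covers Nao=0
        seen = getattr(self, 'seen_offsets', {'0'})
        offset = parse_qs(urlparse(next_page_url).query).get('Nao', ['0'])[0]
        if offset not in seen:
            seen.add(offset)
            self.seen_offsets = seen
            yield response.follow(next_page_url, callback=self.parse)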

David542 :

I would suggest doing something like this. The main difference is that I'm grabbing the info from the JSON stored on the page, and I'm paginating myself by recognizing that Nao is the product offset. The code is much shorter, too:

import json
import re

import requests

product_skus = set()
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'}
base_url = 'https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&Nao=%s'

for page_num in range(0, 1000):       # start at 0 so the first page (Nao=0) is included
    url = base_url % (page_num * 24)  # Nao is the product offset; each page holds 24 products
    res = requests.get(url, headers=headers)
    # the product data is embedded in the page as JSON assigned to digitalData.content
    json_data = json.loads(re.search(r'digitalData\.content=(.+);', res.text).group(1))
    prev_len = len(product_skus)
    for product in json_data['product']:
        product_skus.add(product['productInfo']['sku'])
    if len(product_skus) == prev_len:
        break  # no new SKUs on this page, so we've wrapped back around to repeated results
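
The collected SKUs can then be written out in one pass, which also sidesteps the append behavior mentioned in the question (skus.csv is a placeholder name):

import csv

with open('skus.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['productSKU'])  # header row
    for sku in sorted(product_skus):
        writer.writerow([sku])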

Additionally, it looks like the Home Depot results repeat every 10 pages (at least in what you sent), which is why you're seeing the 240-item limit. Here is an example from browsing it myself:

(screenshot not included)
