Quickly build your first crawler with Scrapy and Python 3

Foreword

I recently needed a crawler for an application I am building, so I started tinkering with and learning about crawlers. To help you get started faster, this post explains the basic principles of Scrapy through a quick hands-on example. The site we crawl is Bole Online (blog.jobbole.com); you may want to browse it in advance to get familiar with it.


Environment setup

Operating system: WIN10

IDE: PyCharm (part of the JetBrains suite)

1. Install Scrapy

pip install scrapy

2. Create a folder to store the project

mkdir Spider-Python3

3. Create scrapy project

scrapy startproject ArticleSpider

4. Enter the ArticleSpider project directory and use the template to create a crawler

cd ArticleSpider
scrapy genspider jobbole blog.jobbole.com
Note: the command syntax is scrapy genspider <spider name> <domain to crawl>
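
For reference, the spider file generated by this command (ArticleSpider/spiders/jobbole.py) will look roughly like the following skeleton; the exact contents can vary slightly between Scrapy versions:

import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/']

    def parse(self, response):
        # parsing logic will be added in the following steps
        pass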

5. Open the project in PyCharm and modify settings.py

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
Note: with this set to False, Scrapy no longer checks the site's robots.txt, so requests will not be filtered out by the robots rules


Writing the crawler

1. Write a debug entry point

To be able to run and debug the crawler inside PyCharm, create main.py in the project root as the program entry point.


Add the following code:

from scrapy.cmdline import execute

import sys
import os

# add the project root to the path so the scrapy command can locate the project
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'jobbole'])

Note: after adding this entry file, you can right-click main.py in PyCharm and choose Debug to debug the crawler. If your spider has a different name, just replace jobbole with your own spider name.
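
If you prefer not to go through scrapy.cmdline, another way to launch the spider from a script is Scrapy's CrawlerProcess API. A minimal sketch, assuming it is run from the project root so the project settings can be found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load settings.py from the current project and run the spider by name
process = CrawlerProcess(get_project_settings())
process.crawl('jobbole')
process.start()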

2. Open the spider file and set the URL of the listing page to crawl

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

Note: the crawler starts crawling from the URLs listed in start_urls, so point it at the all-posts listing page
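
If you later need more control over the first request (for example custom headers or cookies), the spider can override start_requests instead of relying on start_urls; when start_requests is defined, start_urls is ignored. A minimal sketch equivalent to the start_urls above:

    def start_requests(self):
        # equivalent to listing the URL in start_urls, but the Request
        # object can carry extra headers, cookies or meta data if needed
        yield scrapy.Request(url='http://blog.jobbole.com/all-posts/', callback=self.parse)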

3. Crawl the next page in a loop from the entry function

    # requires, at the top of the file:
    #     from scrapy import Request
    #     from urllib import parse
    def parse(self, response):
        # extract the URL of every article on the current listing page
        # and hand each one to parse_detail for parsing
        post_urls = response.css('.post-meta a.archive-title::attr(href)').extract()
        for post_url in post_urls:
            yield Request(parse.urljoin(response.url, post_url), callback=self.parse_detail)
        # extract the URL of the next page and recursively call parse on it
        next_href = response.css('.next.page-numbers::attr(href)').extract_first()
        if next_href:
            yield Request(url=parse.urljoin(response.url, next_href), callback=self.parse)

Note: parse is executed first once the crawler starts, so the logic that collects every article URL on the current page belongs in this function, with the actual parsing of each article delegated to parse_detail. After the current page has been processed, the URL of the next page is extracted and handed back to parse, which repeats the same process for the following page.

4. Write the page parsing logic

    def parse_detail(self, response):
        # parsing logic for a single article page
        title = response.css('.entry-header h1::text').extract_first('').strip()
        pass

Note: response.css() applies a CSS selector to the page fetched by the downloader and filters out the required content. Use your browser's developer tools to find the element you want to select and copy its selector. .extract() returns the matches as a list, so you can take the first element with [0], or call .extract_first() to get it directly; strip() removes leading and trailing whitespace.
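
Building on that, here is a sketch of how parse_detail might collect a few more fields and hand them to Scrapy as an item. The selectors other than the title one are assumptions and should be checked against the real page markup in your browser's developer tools:

    def parse_detail(self, response):
        # assumed selectors for a Bole Online article page; verify them first
        title = response.css('.entry-header h1::text').extract_first('').strip()
        create_date = response.css('.entry-meta-hide-on-mobile::text').extract_first('').strip()
        tags = response.css('.entry-meta-hide-on-mobile a::text').extract()
        # yielding a dict makes Scrapy treat it as a scraped item
        yield {
            'title': title,
            'create_date': create_date,
            'tags': tags,
            'url': response.url,
        }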



