Foreword
I recently needed a crawler for an application I am building, so I started tinkering with and learning about crawlers. To help you get up to speed quickly, this post walks through the basics of Scrapy with a hands-on example. The site we will crawl is Bole Online (blog.jobbole.com); you may want to familiarize yourself with it in advance.
Environment setup
Operating system: WIN10
IDE: JetBrains PyCharm
1. Install scrapy globally
pip install scrapy
2. Create a folder to store the project
mkdir Spider-Python3
3. Create scrapy project
scrapy startproject ArticleSpider
4. Enter the ArticleSpider project directory and use the template to create a crawler
cd ArticleSpider
scrapy genspider jobbole blog.jobbole.com
Note: the syntax is scrapy genspider <spider name> <domain to crawl>
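If the commands above succeed, the generated project should look roughly like this (the layout produced by Scrapy's startproject template; the exact file list may vary slightly between versions):

```
ArticleSpider/
    scrapy.cfg            # deployment configuration
    ArticleSpider/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            jobbole.py    # created by genspider
```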
5. Import the project into PyCharm and modify settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
Note: with this setting Scrapy no longer checks robots.txt, so URLs disallowed by the Robots protocol will not be filtered out.
Writing the crawler
1. Create a debug entry point
To run and debug the spider inside PyCharm, create main.py in the project root as the program entry point.
Add the following code:
from scrapy.cmdline import execute
import sys
import os

# Add the project root to sys.path so the scrapy command can find the project
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'jobbole'])
Note: once this entry file exists, you can right-click main.py and choose Debug to debug the spider. If your spider has a different name, just replace jobbole with your own spider's name.
2. Open the spider file and set the URL of the listing page to crawl
class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']
Note: the spider begins crawling from the URLs listed in start_urls.
3. Loop over successive pages from the entry function
# Required at the top of the spider file:
from scrapy.http import Request
from urllib import parse

def parse(self, response):
    # Extract every article URL on the current page and hand it to parse_detail
    post_urls = response.css('.post-meta a.archive-title::attr(href)').extract()
    for post_url in post_urls:
        yield Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail)
    # Extract the next-page URL and feed it back into parse recursively
    next_href = response.css('.next.page-numbers::attr(href)').extract_first()
    if next_href:
        yield Request(url=parse.urljoin(response.url, next_href), callback=self.parse)
Note: parse is the default callback that runs as soon as the spider starts, so it is the place to collect all the article URLs on the current page and hand each one to parse_detail for parsing. Once the current page has been processed, the next page's URL is extracted and passed back to parse, which repeats the process for every page.
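The parse.urljoin calls matter because the hrefs on a page may be either relative or absolute; urljoin resolves both correctly against the current page's URL. A quick standalone sketch (the article paths below are made up for illustration):

```python
from urllib.parse import urljoin

base = 'http://blog.jobbole.com/all-posts/'

# A relative href is resolved against the current page's URL
print(urljoin(base, '/110287/'))
# -> http://blog.jobbole.com/110287/

# An href that is already absolute is returned unchanged
print(urljoin(base, 'http://blog.jobbole.com/110286/'))
# -> http://blog.jobbole.com/110286/
```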
4. Write the page-parsing logic
def parse_detail(self, response):
    # Page-parsing logic
    title = response.css('.entry-header h1::text').extract_first().strip()
    pass
Note: response.css() applies a CSS selector to each page fetched by the downloader to pull out the content you need; use your browser's developer tools to inspect the element you want and work out its selector. .extract() returns the matching content as a list, so you can take the first element with [0], or call .extract_first() to get it directly. .strip() removes leading and trailing whitespace.