As the saying goes: to do good work, one must first sharpen one's tools.
First, install the Scrapy framework package. For installation steps, see https://blog.csdn.net/m0_46202060/article/details/106201764
1. Basic operation of the Scrapy framework
Using the Scrapy framework to build a crawler generally takes the following four steps:
(1) Create a new project: scrapy startproject <project name>
(2) Define the target: specify what to crawl with scrapy genspider <spider name> "<allowed domain>"
(3) Make the crawler: start crawling pages with scrapy crawl <spider name>
(4) Store the data: save the crawled content with scrapy crawl <spider name> -o <file>.json/xml/jsonl/csv
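Put together, a typical session looks like this (the project and spider names match the examples below; the export filename is illustrative):

```shell
scrapy startproject myspider01               # (1) create the project
cd myspider01                                # switch into the project directory
scrapy genspider xinwen "news.sina.com.cn"   # (2) generate the spider
scrapy crawl xinwen                          # (3) run the spider
scrapy crawl xinwen -o news.json             # (4) run and export the items
```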
(1) New project
Open cmd and create a new project:
scrapy startproject myspider01
(2) Define the target (crawl the news headline, time, and publisher)
First, switch into the project directory you just created, then run:
scrapy genspider xinwen "news.sina.com.cn"
Note: the quoted argument is the domain to allow (it fills in allowed_domains), not a full page URL.
Then open PyCharm to view the project:
Several files and directories have been generated automatically:
spiders: the directory for storing crawler code
scrapy.cfg: configuration file
items.py: defines the data items (fields) the project will scrape
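For reference, the layout that scrapy startproject generates looks roughly like this (the exact file list may differ slightly between Scrapy versions):

```
myspider01/
    scrapy.cfg            # configuration file
    myspider01/
        __init__.py
        items.py          # item (field) definitions
        middlewares.py    # spider/downloader middleware
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # the directory for storing crawler code
            __init__.py
```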
(3) Making crawlers
1. Replace the address with the URL of the web page you want to crawl
2. Extract the content of the web page with XPath
import scrapy
from myspider01.items import Myspider01Item

class XinwenSpider(scrapy.Spider):
    name = 'xinwen'
    allowed_domains = ['news.sina.com.cn']
    start_urls = ['https://news.sina.com.cn/gov/xlxw/2020-05-20/doc-iircuyvi4073201.shtml']

    def parse(self, response):
        item = Myspider01Item()
        # call extract() on the selector; it must not be written inside the XPath string
        item['title'] = response.xpath('//*[@id="top_bar"]/div/div[1]/text()').extract()[0]
        item['time'] = response.xpath('//*[@id="top_bar"]/div/div[2]/span[1]/text()').extract()[0]
        item['source'] = response.xpath('//*[@id="top_bar"]/div/div[2]/span[2]/text()').extract()[0]
        yield item
Note: the extract() method returns a list of all matching strings, and [0] takes the first one.
3. Add the attributes title, time, and source to items.py; they represent the title, time, and source of the news respectively
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class Myspider01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    time = scrapy.Field()
    source = scrapy.Field()
4. Create a main file in the myspider01 project root to run the crawl command:
from scrapy import cmdline
cmdline.execute('scrapy crawl xinwen'.split())
With that, we have successfully used the Scrapy framework to scrape the news!
Alternatively, you can run the spider directly from the cmd command line; let's take a look at the effect:
scrapy crawl xinwen
(4) Store the crawled content
This is easy: just make a small change to the command in the main file.
Similarly, you can execute commands directly in cmd:
scrapy crawl xinwen -o xinwen.xml
This automatically writes an xinwen.xml file into the project directory.
You can also run the command directly from PyCharm:
from scrapy import cmdline
cmdline.execute('scrapy crawl xinwen -o news.json'.split())
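The -o news.json export is a single JSON array of item dictionaries, so it can be read back with the standard json module. A sketch with a simulated export string (the values are placeholders, not real scraped data):

```python
import json

# Simulated contents of the news.json file written by `scrapy crawl xinwen -o news.json`
exported = '[{"title": "example title", "time": "example time", "source": "example source"}]'

# With a real file you would use: json.load(open('news.json', encoding='utf-8'))
items = json.loads(exported)
print(items[0]['title'])              # example title
```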