Scrapy in action: using Scrapy to crawl news and store the content

As the saying goes: a workman who wants to do his work well must first sharpen his tools.

First, install the Scrapy framework package; for reference, see https://blog.csdn.net/m0_46202060/article/details/106201764
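In short, Scrapy can typically be installed with pip:

pip install scrapy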

1. Basic operation of Scrapy framework

Using the Scrapy framework to build a crawler generally takes the following four steps (a full example command sequence follows the list):
(1) Create a new project: scrapy startproject <project_name>
(2) Define the target: specify the site to crawl with scrapy genspider <spider_name> "<domain>"
(3) Build the crawler: start crawling the pages with scrapy crawl <spider_name>
(4) Store the data: save the crawled content with scrapy crawl <spider_name> -o <file>.(json|xml|jsonl|csv)
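Put together for this tutorial's project (project myspider01, spider xinwen), the whole sequence looks like this:

scrapy startproject myspider01
cd myspider01
scrapy genspider xinwen "news.sina.com.cn"
scrapy crawl xinwen
scrapy crawl xinwen -o xinwen.json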

(1) New project

Open cmd and create a new project:

scrapy startproject myspider01


(2) Define the targets (crawl the news headline, time, and publisher)

First, switch into the project directory you just created, then generate the spider:

scrapy genspider xinwen "news.sina.com.cn"

Note: the content in the quotation marks should be the domain of the site to crawl (not a full URL).
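The scrapy genspider command generates a spider skeleton in the spiders directory, roughly like this (the exact template can vary with the Scrapy version):

import scrapy


class XinwenSpider(scrapy.Spider):
    name = 'xinwen'
    allowed_domains = ['news.sina.com.cn']
    start_urls = ['http://news.sina.com.cn/']

    def parse(self, response):
        pass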
Then we open the project in PyCharm to take a look; several files and directories have been generated automatically:

spiders: the directory that stores the spider code
scrapy.cfg: the project configuration file
items.py: used to define the target entities (fields) of the project
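For reference, the typical layout of a freshly created Scrapy project looks like this:

myspider01/
    scrapy.cfg            # project configuration file
    myspider01/
        __init__.py
        items.py          # target entities (fields) of the project
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # directory for spider code
            __init__.py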

(3) Build the crawler

1. Replace start_urls with the address of the page you want to crawl
2. Extract the page content with XPath in the parse() method

import scrapy

from myspider01.items import Myspider01Item


class XinwenSpider(scrapy.Spider):
    name = 'xinwen'
    allowed_domains = ['news.sina.com.cn']
    start_urls = ['https://news.sina.com.cn/gov/xlxw/2020-05-20/doc-iircuyvi4073201.shtml']

    def parse(self, response):
        item = Myspider01Item()
        # extract() returns a list of matched strings; [0] takes the first match
        item['title'] = response.xpath('//*[@id="top_bar"]/div/div[1]/text()').extract()[0]
        item['time'] = response.xpath('//*[@id="top_bar"]/div/div[2]/span[1]/text()').extract()[0]
        item['source'] = response.xpath('//*[@id="top_bar"]/div/div[2]/span[2]/text()').extract()[0]
        yield item

3. The extract() method returns a list of the matched strings
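For instance (an illustrative snippet, not from the original post), note the difference between extract() and extract_first():

# extract() returns a list of all matched strings, so [0] takes the first one
titles = response.xpath('//*[@id="top_bar"]/div/div[1]/text()').extract()
# extract_first() returns the first matched string directly (or None if nothing matched)
title = response.xpath('//*[@id="top_bar"]/div/div[1]/text()').extract_first()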
4. Add the corresponding fields title, time, and source to items.py; they represent the title, time, and source of the news respectively

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Myspider01Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()   # news title
    time = scrapy.Field()    # publication time
    source = scrapy.Field()  # news source (publisher)
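A scrapy.Item behaves like a dict, so fields are assigned and read by key (a small illustrative snippet):

item = Myspider01Item()
item['title'] = 'example headline'  # assign a field by name
print(item['title'])                # read it back like a dict
print(dict(item))                   # convert to a plain dict if needed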


5. Create a main.py file in the myspider01 project to run the crawl command

# main.py: run the crawl command as if it were typed on the command line
from scrapy import cmdline
cmdline.execute('scrapy crawl xinwen'.split())


With that, we have successfully used the Scrapy framework to scrape the news!

Alternatively, you can run the spider directly from the cmd command line; let's take a look at the effect:

scrapy crawl xinwen


(4) Store the crawled content

This is very easy: you only need to slightly modify the command in the main file (the PyCharm version is shown below).

Alternatively, you can execute the export command directly in cmd:

scrapy crawl xinwen -o xinwen.xml

This automatically writes the xinwen.xml file into the project directory.
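Scrapy infers the feed format from the file extension, so the same command can export other formats as well:

scrapy crawl xinwen -o xinwen.json
scrapy crawl xinwen -o xinwen.csv
scrapy crawl xinwen -o xinwen.jsonl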


You can also write the command in main.py and run it directly in PyCharm:

from scrapy import cmdline
cmdline.execute('scrapy crawl xinwen -o news.json'.split())

