Create a simple Scrapy project and use it to crawl data

  I have been learning Scrapy recently; these are my study notes.

1. Create a spider project

  Run scrapy startproject + project name, for example: scrapy startproject runoob. The generated project is structured as follows:

  

2. Generate a spider

  After creating the spider project runoob, enter the project directory ( cd runoob ) and generate a spider with the command: scrapy genspider + spider name + crawl scope (domain or URL)

  For example: scrapy genspider firstSpider "https://www.runoob.com/" adds a new .py file to the spiders folder, as follows:

  

  The role of each file within the project:

  

3. Data crawling

  Open the firstSpider.py file and complete the spider:

  

import scrapy


class FirstspiderSpider(scrapy.Spider):
    name = 'firstSpider'  # spider name
    allowed_domains = ['www.runoob.com']  # crawl scope: domain names only, not full URLs
    start_urls = ['https://www.runoob.com/w3cnote/scrapy-detail.html']  # URL where crawling starts

    def parse(self, response):
        pass
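
  A note on the crawl scope: allowed_domains is meant to hold bare domain names such as www.runoob.com, not full URLs. As far as I can tell, recent Scrapy versions warn about an entry that includes the scheme and ignore it. Below is a minimal sketch, using only the standard library, of deriving the domain from a start URL. The helper name domain_of is my own and is not part of Scrapy.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the host part of a URL, suitable for allowed_domains."""
    # urlparse only fills in the host when a scheme is present, so a bare
    # domain such as "www.runoob.com" falls through and is returned as-is.
    parsed = urlparse(url)
    return parsed.hostname or url.strip("/")


start_url = "https://www.runoob.com/w3cnote/scrapy-detail.html"
allowed_domains = [domain_of(start_url)]
print(allowed_domains)
```

  The resulting list can be pasted into the spider in place of the URL form.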

4. Open https://www.runoob.com/w3cnote/scrapy-detail.html and choose the following content as the crawl target, as shown below:

 

5. View the page source and find where the target content is located:

  

6. Write the crawling logic in the parse method. The code is as follows:

  

import scrapy


class FirstspiderSpider(scrapy.Spider):
    name = 'firstSpider'  # spider name
    allowed_domains = ['www.runoob.com']  # crawl scope: domain names only, not full URLs
    start_urls = ['https://www.runoob.com/w3cnote/scrapy-detail.html']  # URL where crawling starts

    def parse(self, response):
        # Use XPath to select the li elements under ul inside the div with class='article-intro'
        li_list = response.xpath("//div[@class='article-intro']//ul/li")
        # Iterate over the li elements and collect the results
        for li in li_list:
            item = {}
            title = li.xpath(".//strong/text()").extract_first()
            # Skip li elements that have no <strong> title
            if title is not None:
                item["title"] = title
                item["text"] = li.xpath(".//p/text()").extract_first()
                print(item)
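
  To try the same selection logic offline, without Scrapy or a network request, here is a rough sketch that swaps in the standard library's xml.etree.ElementTree (which supports only a subset of XPath) for Scrapy's selectors. The HTML fragment is hand-written for illustration and is an assumption; the real page markup may differ. The function name extract_items is likewise made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written fragment; not copied from the real page.
SAMPLE = """
<html><body>
<div class="article-intro">
  <ul>
    <li><strong>Engine</strong><p>Coordinates the other components.</p></li>
    <li><p>No title here, so this entry is skipped.</p></li>
    <li><strong>Spider</strong><p>Parses responses into items.</p></li>
  </ul>
</div>
</body></html>
"""


def extract_items(markup):
    """Collect a title/text dict for every li that has a <strong> title."""
    root = ET.fromstring(markup.strip())
    items = []
    # Same path as in the spider, written relative to the document root.
    for li in root.iterfind(".//div[@class='article-intro']//ul/li"):
        title = li.findtext(".//strong")  # None when there is no <strong>
        if title is None:
            continue
        items.append({"title": title, "text": li.findtext(".//p")})
    return items


for item in extract_items(SAMPLE):
    print(item)
```

  ElementTree needs well-formed markup, so this only works for checking the logic on small samples; in the spider itself, response.xpath handles real-world HTML.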


  In settings.py, set LOG_LEVEL = 'WARNING' so that only log messages at WARNING level or higher are output.
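
  Scrapy builds on Python's standard logging module, so LOG_LEVEL behaves like an ordinary logging threshold: messages at the chosen level and above are kept, and lower ones are dropped. A minimal sketch with the standard library only (the logger name demo, the ListHandler class, and the log messages are made up for illustration and are not Scrapy output):

```python
import logging


class ListHandler(logging.Handler):
    """Collect log messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


logger = logging.getLogger("demo")
logger.setLevel(logging.WARNING)  # same threshold as LOG_LEVEL = 'WARNING'
logger.propagate = False  # keep the demo output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

logger.debug("request scheduled")  # dropped: below WARNING
logger.info("spider opened")       # dropped: below WARNING
logger.warning("retrying request")
logger.error("download failed")

print(handler.messages)
```

  Only the WARNING and ERROR messages end up in the list, which is why the crawl output becomes much quieter with this setting.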

  

 

7. In the runoob directory, run the project with scrapy crawl + spider name, that is, scrapy crawl firstSpider. The results are printed as follows:

  

  At this point, a simple spider is complete.

  

  

 

 

 


Origin www.cnblogs.com/xifengmo/p/10990168.html