Scrapy crawler framework: notes on commands and code

.extract() converts the selector objects returned by xpath() into a list of Unicode strings

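scrapy startproject XXX creates a new Scrapy project named XXX

scrapy crawl XXX runs the spider named XXX inside the project

scrapy genspider xxx "http://www.baidu.com/" creates a new spider file named xxx for the given site (the argument is normally a bare domain such as www.baidu.com)

scrapy list lists the spiders that can currently be run

Writing the pipeline file (pipelines.py)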

import json


class LspidersPipeline(object):
    def __init__(self):  # constructor, runs once when the pipeline is created
        self.f = open("itcast_pipeline.txt", 'w', encoding="utf-8")  # open the output file once
        print("Pipeline initialized")

    # Items yielded by the spider arrive here
    def process_item(self, item, spider):  # called once for every item
        content = json.dumps(dict(item), ensure_ascii=False) + ',\n'  # ensure_ascii=False writes Chinese text as-is instead of \uXXXX escapes
        self.f.write(content)
        print("***********************")
        return item  # hand the item back to the engine so the next pipeline (if any) can process it

    def close_spider(self, spider):
        self.f.close()
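
The pipeline lifecycle and the effect of ensure_ascii can be tried without running a crawl. The sketch below is only an illustration under stated assumptions: JsonLinesPipeline, its path argument, and the sample item are hypothetical, and the hook methods are called by hand where Scrapy would normally call them.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes one JSON object per line."""

    def __init__(self, path="items.jl"):  # path argument is an assumption for the demo
        self.path = path
        self.f = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.f = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item; ensure_ascii=False keeps non-ASCII text readable
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.f.close()


# Drive the hooks by hand, in the order Scrapy would call them
path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "张老师", "title": "讲师"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as fh:
    lines = fh.read().splitlines()

# With the default ensure_ascii=True the same text is escaped as \uXXXX
escaped = json.dumps({"name": "张老师"})
print(lines[0])
print(escaped)
```

For Scrapy to call a pipeline at all, its class path also has to be listed in ITEM_PIPELINES in the project's settings.py.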

Writing items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


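# Item definitions
# An Item defines the structured fields that hold the scraped data. It behaves much like a Python dict, but rejects undeclared fields, which catches typos early.
# Define an Item by subclassing scrapy.Item and declaring class attributes of type scrapy.Field (loosely similar to an ORM mapping).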

class LspidersItem(scrapy.Item):
    # teacher's name
    name = scrapy.Field()
    # teacher's job title
    title = scrapy.Field()
    # teacher's profile
    info = scrapy.Field()

Writing the spider file

# -*- coding: utf-8 -*-
import scrapy
from Lspiders.items import LspidersItem


class ItcastSpider(scrapy.Spider):
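    name = 'itcast'  # spider name, passed to "scrapy crawl" to start the spider
    # Domains the spider is allowed to crawl; requests to other domains are filtered out
    allowed_domains = ['www.itcast.cn']  # optional; takes bare domain names, not URLs
    # Start URL list; the first batch of requests is built from this list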
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml#ajavaee']

    def parse(self, response):
        node_list = response.xpath("//div[@class='li_txt']")
        for node in node_list:
            item = LspidersItem()  # create an item object to hold the extracted fields
            # .extract() converts the selector objects into a list of Unicode strings
            name = node.xpath("./h3/text()").extract()
            title = node.xpath("./h4/text()").extract()
            info = node.xpath("./p/text()").extract()
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            # Hand each extracted item to the engine, which passes it on to the pipelines; execution then resumes here and the for loop continues
            yield item  # like return, but the function keeps running afterwards
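
How yield differs from return is easiest to see outside Scrapy. The sketch below uses a hypothetical parse_rows function with plain dicts standing in for items: each next() call runs the loop up to the next yield, then pauses until the consumer asks for more.

```python
def parse_rows(rows):
    """Hypothetical stand-in for a spider's parse(): yields one item per row."""
    for row in rows:
        item = {"name": row.strip()}
        # Execution pauses here and resumes on the next request for an item
        yield item


gen = parse_rows([" Alice ", "Bob"])

first = next(gen)  # runs the loop body once
rest = list(gen)   # resumes after the yield and drains the remaining rows
print(first, rest)
```

In a spider, the engine plays the role of the consumer: it pulls items from parse() one at a time and forwards each to the pipelines.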

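scrapy shell "http://www.itcast.cn/channel/teacher.shtml" sends a test request and opens an interactive shell, similar to calling get in the requests library

If a response comes back, it can be explored as follows:

Scrapy provides three ways to extract data: xpath, css, and re (regular expressions)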

The Spider class

scrapy.Spider is the most basic spider class; every spider must inherit from it.
