Creating a Scrapy project skeleton

Create the project
scrapy startproject project_name
Create a spider
cd project_name
scrapy genspider <spider_name> <domain>
scrapy genspider hangzhou www.xxxx.com
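Directory structure
├── hangzhounews -- the project's Python package
│   ├── __init__.py
│   ├── __pycache__ -- bytecode cache (.pyc files) generated when Python runs
│   │   ├── __init__.cpython-36.pyc
│   │   └── settings.cpython-36.pyc
│   ├── items.py -- defines which fields to scrape (similar to models in Django)
│   ├── middlewares.py -- middlewares
│   ├── pipelines.py -- pipelines, which process the scraped data
│   ├── settings.py -- project settings
│   └── spiders -- package that holds your spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   └── __init__.cpython-36.pyc
│       └── hangzhou.py -- a spider file
└── scrapy.cfg -- configuration file used for deployment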
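To make the role of pipelines.py more concrete, here is a minimal sketch of an item pipeline. The class name JsonLinesPipeline and the file name news.jl are made up for illustration; the hook names open_spider, close_spider and process_item are the ones Scrapy calls on a pipeline object. It assumes the spider yields dict items with headline and href keys.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write each scraped item to a JSON Lines file."""

    def __init__(self, path="news.jl"):
        # "news.jl" is an assumed output file name, not something Scrapy requires.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Scrapy calls this hook once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Scrapy calls this hook once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item; return it so later pipelines receive it too.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

A pipeline only runs once it is enabled in settings.py through the ITEM_PIPELINES setting, for example ITEM_PIPELINES = {"hangzhounews.pipelines.JsonLinesPipeline": 300}, where the number controls the order in which pipelines run.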

import scrapy


class HangzhouSpider(scrapy.Spider):
    name = 'hangzhou'
    allowed_domains = ['hznews.hangzhou.com.cn']
    start_urls = ['http://hznews.hangzhou.com.cn/']

    # If start_requests is not overridden, the parent class's implementation
    # iterates over start_urls and yields one scrapy.Request object per URL.
    def start_requests(self):
        print('1,start request')
        for url in self.start_urls:
            # Create a Request object; callback=None means the response goes to parse()
            print('2,生成Request对象')  # "2, create the Request object"
            req = scrapy.Request(url, callback=None)

            # start_requests must return an iterable. A list would also work,
            # but yielding from a generator is preferred.
            print('3,生成器')  # "3, generator"
            yield req

    def parse(self, response):
        print('4,解析')  # "4, parse"
        print(response)

        # all_news is a <class 'scrapy.selector.unified.SelectorList'>
        all_news = response.xpath('//td[@class="hzwNews_L_link"]/a')

        for news in all_news:
            # Each news is a Selector
            headline = news.xpath('.//text()').extract_first()
            href = news.xpath('.//@href').extract_first()
            # Build a new dict for each link so yielded items do not share state
            item = {'headline': headline, 'href': href}
            print(item)
            yield item
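The XPath used in parse() can be tried offline before running the spider. The sketch below uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors (it supports only a subset of XPath), and the sample markup is invented for illustration; the real page is larger and may be structured differently.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup standing in for the real page.
SAMPLE = """
<table>
  <tr>
    <td class="hzwNews_L_link"><a href="http://example.com/1.htm">First headline</a></td>
    <td class="other"><a href="http://example.com/skip.htm">Not matched</a></td>
  </tr>
  <tr>
    <td class="hzwNews_L_link"><a href="http://example.com/2.htm">Second headline</a></td>
  </tr>
</table>
"""

root = ET.fromstring(SAMPLE.strip())

# Same shape as the spider's '//td[@class="hzwNews_L_link"]/a',
# written relative to the root element as ElementTree expects.
links = root.findall(".//td[@class='hzwNews_L_link']/a")

# One dict per matched link: the link text and its href attribute.
items = [{"headline": a.text, "href": a.get("href")} for a in links]
for item in items:
    print(item)
```

Only the a elements whose parent td carries the hzwNews_L_link class are selected, which is why the link inside the td with class "other" is skipped.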

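Notes:

    SelectorList: essentially a list of Selector objects; it is iterable.
    Key methods: extract() is an alias of getall(); extract_first() is an alias of get().
    extract() calls .get() on each Selector in the SelectorList and returns the results as a list.
        def getall(self):
            return [x.get() for x in self]
        extract = getall
        
    extract_first() returns .get() of the first Selector in the SelectorList, that is, the content of the first match as a string. If the list is empty, it returns default.
        def get(self, default=None):
            for x in self:
                return x.get()
            else:
                return default
        extract_first = get
        
    Selector:
    Key methods: get() (alias: extract()) and getall().
    get() / extract() returns the content matched by the Selector as a str.
    getall() returns the same content wrapped in a list that has exactly one element.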
Run in the terminal: scrapy crawl hangzhou --nolog
Output:
1,start request
2,生成Request对象
3,生成器
4,解析
<200 http://hznews.hangzhou.com.cn/>
{'headline': '杭州构建"一心八射"交通网 1小时通勤圈来了', 'href': 'http://hznews.hangzhou.com.cn/chengshi/content/2018-10/29/content_7088441.htm'}
{'headline': '文一路隧道形成8个大小堵点如何缓解？', 'href': 'http://hznews.hangzhou.com.cn/chengshi/content/2018-10/29/content_7088393.htm'}
{'headline': '杭州开启阳光常驻模式早晚温差有点大', 'href': 'http://hznews.hangzhou.com.cn/chengshi/content/2018-10/29/content_7088317.htm'}
{'headline': '杭州发出国际级软件名城创建政策“大礼包”', 'href': 'http://hznews.hangzhou.com.cn/jingji/content/2018-10/29/content_7088345.htm'}
{'headline': '东站"乞讨奶奶"家有五层楼存款超10万', 'href': ...
