[Python 3 Web Scraping] Installing Scrapy on Windows 10 and Creating a New Scrapy Project

For detailed installation tutorials, see:

http://www.runoob.com/w3cnote/scrapy-detail.html

https://segmentfault.com/a/1190000013178839

Other tutorials:

https://oner-wv.gitbooks.io/scrapy_zh/content/%E5%9F%BA%E6%9C%AC%E6%A6%82%E5%BF%B5/%E9%80%89%E6%8B%A9%E5%99%A8.html

https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/images.html

Steps:

1. Install the framework:

pip install --user Scrapy

If the installation fails with the following error:

error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

----------------------------------------
Command ""c:\program files\python37\python.exe" -u -c "import setuptools, tokenize;__file__='C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\pip-install-vizrew_c\\Twisted\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\ADMINI~1\AppData\Local\Temp\pip-record-qka9_ywo\install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in C:\Users\ADMINI~1\AppData\Local\Temp\pip-install-vizrew_c\Twisted\

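Installing Microsoft Visual C++ 14.0 (the Visual C++ Build Tools) fixes this; afterwards, run the pip command again.

Download link 1: https://964279924.ctfile.com/fs/1445568-239446865

Download link 2: http://makeoss.oss-cn-hangzhou.aliyuncs.com/%E5%BE%AE%E8%BD%AFwin10/visualcppbuildtools_full.exe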
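Once the install finishes, a quick check from Python can confirm that the packages are importable. The following is only a minimal sketch: the helper name is_installed is made up for illustration, and it assumes you run it with the same Python interpreter that pip installed into.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "scrapy" and "twisted" are the import names of the packages installed above
    for module_name in ("scrapy", "twisted"):
        status = "found" if is_installed(module_name) else "missing"
        print(module_name, status)
```

If either module is reported as missing, the pip step most likely did not complete.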

2. Create a new project: open cmd in the directory where you want the project to live, then run the create command:

scrapy startproject mySpider

A folder named mySpider will then appear in that directory.
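To confirm that the project was generated, the sketch below checks the new folder for the files that scrapy startproject usually creates. The file list is an assumption based on recent Scrapy versions and may differ on yours; missing_project_files is a hypothetical helper, not part of Scrapy.

```python
from pathlib import Path

# Files that `scrapy startproject mySpider` typically generates.
# This list is an assumption; adjust it to match your Scrapy version.
EXPECTED_FILES = (
    "scrapy.cfg",
    "mySpider/__init__.py",
    "mySpider/items.py",
    "mySpider/pipelines.py",
    "mySpider/settings.py",
    "mySpider/spiders/__init__.py",
)


def missing_project_files(project_root, expected=EXPECTED_FILES):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the directory where `scrapy startproject mySpider` was executed
    print(missing_project_files("mySpider"))
```

An empty list means every expected file is present.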

Example: creating a spider.

The goal is to scrape the page http://www.itcast.cn/channel/teacher.shtml.

1) Add a new class to items.py:

import scrapy  # already present at the top of the generated items.py


class ItcastItem(scrapy.Item):
    # Declare the fields to scrape
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()

2) Create a new file itcast.py in the spiders directory and add the following code:

import scrapy


class ItcastSpider(scrapy.Spider):
    name = "itcast"  # Spider name, used when launching the crawl
    allowed_domains = ["itcast.cn"]  # Restrict the crawl to this domain
    start_urls = (  # Start URLs; several pages can be listed, provided their HTML structure is similar
        'http://www.itcast.cn/channel/teacher.shtml#aphp',
        'http://www.itcast.cn/channel/teacher.shtml#apython',
    )

    def parse(self, response):
        # Print the page HTML; the page encoding may be gb2312, utf-8, or GBK
        print(response.body.decode('utf-8'))

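3. Run the project:

1). Install pywin32

Download the build that matches your Python version from https://github.com/mhammond/pywin32/releases and install it.

Otherwise, starting the project fails with ModuleNotFoundError: No module named 'win32api'

2). From inside the project directory (the one containing scrapy.cfg), start the spider with python -m scrapy crawl <spider name>, where the spider name is the value of the name attribute in the spider class: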

python -m scrapy crawl itcast

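Running scrapy crawl itcast works as well.

The print call in parse prints each page's HTML; the pages fetched are the addresses listed in the start_urls tuple.

start_urls can list several different page URLs with a similar HTML structure, and all of them will be crawled.

Pay attention to the page encoding, which may be gb2312, utf-8, or GBK. You can call .decode('utf8') to decode the raw HTML bytes into a string. Scrapy itself does not require this manual decode: response.text returns the body already decoded.

That completes a simple page-fetching project!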
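When the encoding of a page is not known in advance, one option is to try the candidate encodings in turn. This is a minimal standard-library sketch, not something Scrapy requires: decode_html is a hypothetical helper, and the sample bytes are invented for illustration.

```python
def decode_html(raw_bytes, encodings=("utf-8", "gb2312", "gbk")):
    """Try each candidate encoding in order and return (text, encoding)."""
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


if __name__ == "__main__":
    # Hypothetical page body encoded as GBK
    sample = "<h3>讲师</h3>".encode("gbk")
    text, used = decode_html(sample)
    print(used, text)
```

Order matters here: utf-8 is tried first because bytes in the other encodings usually fail to decode as utf-8, so a wrong match is unlikely.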

4. Extract data from the page:

Import the class defined earlier in items.py.

The complete code in itcast.py:

import scrapy
from mySpider.items import ItcastItem


class ItcastSpider(scrapy.Spider):
    name = "itcast"  # Spider name
    allowed_domains = ["itcast.cn"]  # Restrict the crawl to this domain
    start_urls = (  # Start URLs; several pages can be listed, provided their HTML structure is similar
        'http://www.itcast.cn/channel/teacher.shtml#aphp',
        'http://www.itcast.cn/channel/teacher.shtml#apython',
    )

    def parse(self, response):

        # html = response.body.decode('utf-8')
        # print(html)

        for each in response.xpath("//div[@class='li_txt']"):

            # Wrap the extracted data in an ItcastItem object
            item = ItcastItem()

            # extract() returns a list of unicode strings
            name = each.xpath("h3/text()").extract()
            title = each.xpath("h4/text()").extract()
            info = each.xpath("p/text()").extract()

            # Each XPath here matches one node, so take the first list element
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            # Yield inside the loop so that every entry is returned,
            # not just the last one
            yield item

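The tag nodes on the page are structured as follows: each teacher entry is a div with class li_txt containing an h3 (name), an h4 (title), and a p (info).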

Running the spider shows that the text inside the HTML tags is scraped successfully.
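To see what the XPath expressions in parse select without running a crawl, the same extraction can be tried offline. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, which supports only a subset of XPath. The HTML fragment and the extract_teachers helper are made up for illustration and merely mimic the li_txt structure the spider targets.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the nodes the spider targets
SAMPLE_HTML = """
<div>
  <div class="li_txt"><h3>Teacher A</h3><h4>Senior Lecturer</h4><p>Ten years of PHP.</p></div>
  <div class="li_txt"><h3>Teacher B</h3><h4>Lecturer</h4><p>Python and data.</p></div>
</div>
"""


def extract_teachers(markup):
    """Collect name/title/info from every div whose class is li_txt."""
    root = ET.fromstring(markup)
    teachers = []
    for each in root.findall(".//div[@class='li_txt']"):
        teachers.append({
            "name": each.findtext("h3"),
            "title": each.findtext("h4"),
            "info": each.findtext("p"),
        })
    return teachers


if __name__ == "__main__":
    for teacher in extract_teachers(SAMPLE_HTML):
        print(teacher)
```

Real pages are often not well-formed XML, so this only illustrates the selection logic; in the spider itself, response.xpath handles ordinary HTML.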

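For the full set of beginner essentials, read the reference tutorials linked at the top of this post; they cover additional topics.

Personally, I find working with the DOM natively more satisfying: it goes in one pass, and a single file gets the job done. Frameworks feel like tools for the lazy. So why learn one anyway? To prove I can still learn new things, and to get a raise.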
