Writing a Crawler Project with the Scrapy Framework

Key points:

Installing the Scrapy module

Two ways to install modules.

The following two methods cover the vast majority of modules:

Network install: run pip install XX directly in the console.

Download install: network installation is convenient, but it fails from time to time. When that happens, go to https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml and download the .whl file that matches your Python version.

After downloading, use the console to move into the download folder and run pip install <wheel filename> to install it.
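For example, if Twisted (a Scrapy dependency that frequently fails to build on Windows) was downloaded as a wheel, the install could look like this; the path and filename are hypothetical, so substitute whatever you actually downloaded:

cd C:\Users\you\Downloads
pip install Twisted-20.3.0-cp38-cp38-win_amd64.whl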

Item 6, the pywin32 configuration step:

1. Copy the two .dll files under F:\编程\python\Lib\site-packages\pywin32_system32

2. Paste them into C:\Windows\System32
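The same copy can be done from the console (run it as administrator, since System32 is protected); the source path is from this machine, so adjust it to your own Python installation:

copy "F:\编程\python\Lib\site-packages\pywin32_system32\*.dll" C:\Windows\System32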

Common Scrapy commands

Syntax

1. Create a project: scrapy startproject XX

2. List the available spider templates: scrapy genspider -l
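On a standard Scrapy installation this lists the four built-in templates, roughly:

Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

The generated code for each template is shown below.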

basic:

# -*- coding: utf-8 -*-
import scrapy


class FstSpider(scrapy.Spider):
    name = 'fst'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://aliwx.com.cn/']

    def parse(self, response):
        pass
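The generated parse() is only a stub. A minimal sketch of filling it in, to drop into the FstSpider class above, might look like this; the XPath expression is an assumption for illustration, not taken from the actual aliwx.com.cn pages:

    def parse(self, response):
        # Yield one dict per link title found on the page (illustrative only).
        for title in response.xpath('//a/@title').extract():
            yield {'title': title}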
           

crawl:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SecondSpider(CrawlSpider):
    name = 'second'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://aliwx.com.cn/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
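The Rule in the generated template only follows links matching r'Items/', which will rarely match anything on a real site. A sketch of adapting it, with an assumed URL pattern and selector:

    rules = (
        # Follow links whose URL contains /book/ (hypothetical pattern).
        Rule(LinkExtractor(allow=r'/book/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Return one item per followed page; the selector is illustrative only.
        return {'title': response.xpath('//title/text()').extract_first()}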

csvfeed:

# -*- coding: utf-8 -*-
from scrapy.spiders import CSVFeedSpider


class ThirdSpider(CSVFeedSpider):
    name = 'third'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://aliwx.com.cn/feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    def parse_row(self, response, row):
        i = {}
        #i['url'] = row['url']
        #i['name'] = row['name']
        #i['description'] = row['description']
        return i

xmlfeed:

# -*- coding: utf-8 -*-
from scrapy.spiders import XMLFeedSpider


class FourthSpider(XMLFeedSpider):
    name = 'fourth'
    allowed_domains = ['aliwx.com.cn']
    start_urls = ['http://aliwx.com.cn/feed.xml']
    iterator = 'iternodes' # you can change this; see the docs
    itertag = 'item' # change it accordingly

    def parse_node(self, response, selector):
        i = {}
        #i['url'] = selector.select('url').extract()
        #i['name'] = selector.select('name').extract()
        #i['description'] = selector.select('description').extract()
        return i

3. Create a spider inside the spiders folder: scrapy genspider -t basic/crawl/csvfeed/xmlfeed <spider name> <target domain> (see the example below)

4. Run a spider: scrapy crawl <spider name> (the name attribute defined in the spider, not the file name)
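Putting steps 3 and 4 together, the basic spider shown earlier could be generated and run like this (the -o flag is optional and dumps the scraped items to a file):

scrapy genspider -t basic fst aliwx.com.cn
scrapy crawl fst -o result.json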

Project structure:

items: declares the target fields you want to scrape

spiders: holds the individual spider files

middlewares: downloader/spider middleware, hooks for customizing how requests and responses are processed

pipelines: post-processing of scraped items, e.g. cleaning, deduplication, or storage (see the sketch below)
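As a minimal sketch of how items and pipelines fit together (the field names and class names are assumptions, not taken from the original project):

# items.py - declare the fields you intend to scrape
import scrapy

class NovelItem(scrapy.Item):
    name = scrapy.Field()
    author = scrapy.Field()

# pipelines.py - every item yielded by a spider passes through here
class CleanPipeline(object):
    def process_item(self, item, spider):
        # Strip stray whitespace from the name field before it goes further.
        if item.get('name'):
            item['name'] = item['name'].strip()
        return item

Note that a pipeline only runs if it is enabled in settings.py via the ITEM_PIPELINES setting.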

Scrapy crawler project basics

Storing the data in a database

Using pymysql

To prevent garbled characters, make sure the connection uses the utf8 charset. Rather than editing connections.py inside the pymysql package, the charset can be passed directly to pymysql.connect(), as in the sketch below.
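A minimal sketch of a pipeline that writes items to MySQL with pymysql; the connection parameters, table and column names are assumptions, so adjust them to your own database:

import pymysql

class MysqlPipeline(object):
    def open_spider(self, spider):
        # Setting charset on the connection avoids mojibake without touching
        # the pymysql source code.
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='123456', db='spider',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Hypothetical table and columns; adjust to your own schema.
        self.cursor.execute(
            'INSERT INTO novels (name, author) VALUES (%s, %s)',
            (item.get('name'), item.get('author'))
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()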

Project repository:

https://github.com/ljx4471817/Scrapy-respository

Scrapy development manual


Reposted from blog.csdn.net/LJXZDN/article/details/81272974