Python crawlers: using the Scrapy framework in detail

Copyright notice: reposting is welcome, please credit the source: https://blog.csdn.net/IT_arookie/article/details/82874541

A detailed introduction to scraping with the Scrapy framework:

Scrapy: a fast, high-level screen-scraping and web-crawling framework developed in Python. It is simple, convenient, and easy to get started with.

I. The Scrapy workflow

1. The engine takes a URL from the scheduler to crawl next.
2. The engine wraps the URL in a Request and hands it to the downloader; the downloader fetches the resource and wraps it in a Response.
3. The spider parses the Response.
4. If items are extracted, they are passed to the item pipelines for further processing.
5. If links (URLs) are extracted, they are handed back to the scheduler to wait their turn to be crawled (a short sketch of this flow follows).
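As a minimal sketch of steps 3-5 (the spider name, URL, and XPath expressions here are purely illustrative, not part of the 51job project), a parse() method can yield both items and follow-up requests:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # step 4: extracted data is yielded as an item and goes to the pipelines
        yield {'title': response.xpath('//title/text()').extract_first()}
        # step 5: extracted links are yielded as new Requests and go back to the scheduler
        for href in response.xpath('//a/@href').extract():
            yield response.follow(href, callback=self.parse)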

II. What the main Scrapy files do

1. Items are the containers that will hold the scraped data (the fields to store).
2. A Spider is a user-written class that scrapes information from a site or group of sites (the spider file).
3. pipelines: the item pipeline file, which takes the Items and saves them in different file formats.
4. settings: the project configuration file, where the various components are enabled and configured.

Below, scraping 51job is used to walk through how a Scrapy project is written in detail.

Project 1: a simple crawl of 51job job listings

The scrapy package is required (installing it through Anaconda is recommended).
In cmd, from the directory where you want the code to live, run:

scrapy startproject job51
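For reference, this command generates a project skeleton roughly like the following (the exact files vary slightly between Scrapy versions):

job51/
    scrapy.cfg            # project entry point / deploy config
    job51/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py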

Then open the project in PyCharm.

1. First, edit the items.py file

Define the field containers we need:

import scrapy
#from scrapy import Item,Field
class Job51Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    jobName = scrapy.Field()
    companyName = scrapy.Field()
    address = scrapy.Field()
    money = scrapy.Field()
    ptime = scrapy.Field()

2. Create a new spider file, job51.py, inside the spiders folder

Most of this is the framework's standard boilerplate; you just fill in your own logic.

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from ..items import Job51Item  # import the Item class defined in items.py
class Job51(CrawlSpider):    # inherit from a Scrapy spider class (standard pattern)
    name = 'job51'   # spider name, used by 'scrapy crawl'
    start_urls = [   # list of start URLs
    'https://search.51job.com/list/000000,000000,0000,00,9,99,%25E4%25BA%25BA%25E5%25B7%25A5%25E6%2599%25BA%25E8%2583%25BD,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=4&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='
    ]
    def parse(self, response):   # parse() is called for every downloaded response
        selector = Selector(response)
        divs = selector.xpath('//div[@id="resultList"]//div[@class="el"]')
        for each in divs:
            item = Job51Item()   # create a fresh item for each job listing
            jobName = each.xpath('./p/span/a/@title').extract()
            companyName = each.xpath('./span[1]/a/text()').extract()
            address = each.xpath('./span[2]/text()').extract()
            money = each.xpath('./span[3]/text()').extract()
            ptime = each.xpath('./span[4]/text()').extract()
            print(jobName, address, companyName, money, ptime)
            item['jobName'] = jobName[0]
            item['companyName'] = companyName[0]
            item['address'] = address[0]
            if money:
                item['money'] = money[0]
            else:
                item['money'] = '面谈'   # some listings publish no salary
            item['ptime'] = ptime[0]
            yield item  # hand the item over to the item pipelines
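Before running the full spider, the XPath expressions can be checked interactively with scrapy shell (replace the ... with the full start URL used above):

scrapy shell "https://search.51job.com/list/..."
>>> rows = response.xpath('//div[@id="resultList"]//div[@class="el"]')
>>> len(rows)                                   # how many listing rows were matched
>>> rows[0].xpath('./p/span/a/@title').extract_first()   # first job title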

3. Edit the pipeline file, pipelines.py, to store the items in different formats (txt, Excel, csv, json, MongoDB, MySQL)

The standard skeleton:

class SomePipeline(object):          # any class name works; it is enabled later in settings.py
    def __init__(self):
        pass                         # runs once when the pipeline is created (open files, connections, ...)
    def process_item(self, item, spider):
        pass                         # called for every item the spider yields
        return item                  # return the item so later pipelines can also process it
    def close_spider(self, spider):
        pass                         # runs once when the spider finishes (close files, connections, ...)

Saving to Excel

from openpyxl import Workbook
class saveToExcel(object):
    def __init__(self):
        self.wb = Workbook()
        self.ws = self.wb.active
        self.ws.append(['职位名', '公司名', '工作地点', '薪资', '发布日期'])   # header row

    def process_item(self, item, spider):
        # item behaves like a dict
        self.ws.append(list(dict(item).values()))
        return item

    def close_spider(self, spider):
        self.wb.save('人工智能.xlsx')

Saving to csv (the default storage method): the column order is arranged automatically.

# the default generated pipeline can stay as-is, but enabling it in settings.py uses the special FEED settings shown later
class Job51Pipeline(object):
    def process_item(self, item, spider):
        return item

Another way to save to csv

import csv
import codecs   # opening the file through codecs is one way of avoiding blank lines
class saveToCsv(object):
    def __init__(self):
        with codecs.open('人工智能.csv', 'w', encoding="utf-8") as csvfile:
            self.write1 = csv.writer(csvfile)
            # write the header row
            self.write1.writerow(['职位名', '公司名', '工作地点', '薪资', '发布时间'])

    def process_item(self, item, spider):
        # item behaves like a dict
        with codecs.open('人工智能.csv', 'a', encoding="utf-8") as csvfile:
            self.write1 = csv.writer(csvfile)
            # append one row per item
            self.write1.writerow([item['jobName'], item['companyName'], item['address'],
                                  item['money'], item['ptime']])
        return item

    def close_spider(self, spider):
        pass
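As an aside, in Python 3 a common alternative to the codecs trick is to open the file with newline='', which is the documented way to stop the csv module from producing blank rows on Windows. A minimal sketch:

import csv

with open('人工智能.csv', 'a', encoding='utf-8', newline='') as f:
    # newline='' leaves line-ending handling to the csv writer
    csv.writer(f).writerow(['职位名', '公司名', '工作地点', '薪资', '发布时间'])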

Saving to json

import json
class saveToJson(object):
    def __init__(self):
        self.film = open('job51.json', 'w', encoding='utf-8')
    def process_item(self, item, spider):
        # convert each item to one JSON line
        echo = json.dumps(dict(item), ensure_ascii=False)  # keep non-ASCII (Chinese) characters readable
        self.film.write(echo + '\n')
        return item
    def close_spider(self, spider):
        self.film.close()

Saving to MongoDB

from pymongo import MongoClient
class saveToMongodb(object):
    def __init__(self):
        conn = MongoClient('localhost')   # connect to the local MongoDB server
        db = conn.newdb                   # open (or create) the database
        self.col = db.newjob51            # open (or create) the collection
        self.col.delete_many({})          # clear anything left over from a previous run
    def process_item(self, item, spider):
        self.col.insert_one(dict(item))
        return item
    def close_spider(self, spider):
        print('finished storing')
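To double-check that the data actually landed in MongoDB, a quick one-off query against the same database and collection names could look like this (a throwaway script, not part of the project):

from pymongo import MongoClient

col = MongoClient('localhost').newdb.newjob51
print(col.count_documents({}))   # number of stored job listings
print(col.find_one())            # peek at one stored document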

4. Browser headers and proxies can be set in middlewares.py (the middleware file)

First prepare a list of User-Agent headers and a list of proxy IPs.
The two lists can live in this file or in settings.py; settings.py is recommended, but then they have to be imported (or read from the settings) inside this file.

user_agent = [   # pool of User-Agent strings
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    "UCWEB7.0.2.37/28/999",
    "NOKIA5700/ UCWEB7.0.2.37/28/999",
    "Openwave/ UCWEB7.0.2.37/28/999",
    "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
    # iPhone 6:
    "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",

]
ips = [
 'HTTP://116.1.11.19:80',
 'HTTPS://140.207.50.246:51426',
 'HTTP://118.178.227.171:80',
 'HTTP://118.190.95.43:9001',
 'HTTP://61.135.217.7:80',
 'HTTP://106.75.225.83:808',
 'HTTPS://106.75.226.36:808',
 'HTTP://118.190.95.35:9001',
 'HTTPS://123.207.30.131:80',
 'HTTPS://60.12.89.218:57299',
 'HTTPS://124.235.135.74:80',
 'HTTPS://59.45.16.10:59156',
 'HTTP://218.23.124.52:59361',
 'HTTP://124.234.157.228:80',
 'HTTP://58.51.83.102:808',
 'HTTP://110.73.42.11:8123',
 'HTTPS://222.242.155.69:40919',
 'HTTP://110.73.10.32:8123',
 'HTTPS://220.172.40.190:80',
 'HTTPS://60.211.192.54:40700',
 'HTTP://182.88.135.132:8123',
 'HTTPS://122.227.182.102:33174',
 'HTTPS://219.139.35.70:42993',
 'HTTPS://222.76.204.110:808',
 'HTTPS://222.245.165.154:36522']

The proxy IP list has almost certainly expired by now, so the proxy part can simply be skipped.

Then add two lines inside the process_request() method of the downloader middleware:

    def process_request(self, request, spider):
        import random   # better placed at the top of middlewares.py
        # pick a random User-Agent header
        request.headers['User-Agent'] = random.choice(user_agent)
        # pick a random proxy (commented out because the list above is stale)
        #request.meta['proxy'] = random.choice(ips)
        return None
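If the two lists are kept in settings.py instead, a minimal sketch of a downloader middleware that reads them through from_crawler() could look like this (USER_AGENT_LIST and PROXY_LIST are setting names invented for the example):

import random

class RandomUserAgentMiddleware(object):
    def __init__(self, user_agents, proxies):
        self.user_agents = user_agents
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # read the two lists defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'),
                   crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
        #request.meta['proxy'] = random.choice(self.proxies)
        return None

Such a middleware would then be enabled in DOWNLOADER_MIDDLEWARES just like the generated one below.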

5. The settings.py file

None of the components above take effect until they are enabled in settings.py.

BOT_NAME = 'job51'

SPIDER_MODULES = ['job51.spiders']
NEWSPIDER_MODULE = 'job51.spiders'

# if you rely on the default csv feed export (pipelines.py left untouched), add these two lines
# FEED_URI = '51job.csv'
# FEED_FORMAT = 'csv'

# download delay; usually worth enabling
DOWNLOAD_DELAY = 2

# enable the item pipelines (lower number = runs earlier)
ITEM_PIPELINES = {
   'job51.pipelines.saveToCsv': 300,
   'job51.pipelines.saveToExcel': 310,
   'job51.pipelines.saveToJson': 320,
   'job51.pipelines.saveToMongodb': 330,
}

# enable the downloader middleware so the User-Agent rotation takes effect
DOWNLOADER_MIDDLEWARES = {
   'job51.middlewares.Job51DownloaderMiddleware': 543,
}
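As an alternative to FEED_URI/FEED_FORMAT, the output feed can also be requested on the command line with the standard -o option of scrapy crawl:

scrapy crawl job51 -o 51job.csv
scrapy crawl job51 -o 51job.json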

6. Finally, run the crawler

Method 1:

Open cmd in the directory that contains scrapy.cfg and run:

    scrapy crawl job51

Method 2:

Run the same command from inside PyCharm (for example, from its built-in terminal).

Method 3:

Create a main.py file in the directory that contains scrapy.cfg:

from scrapy import cmdline
cmdline.execute('scrapy crawl job51'.split())

Then just run this file.

All three methods achieve the same result.

That covers the basics; the next post will look at recursive crawling and paging through results.
