Using Scrapy_redis for distributed crawling

1. Create the project: scrapy startproject mySpider
2. Create the spider: scrapy genspider -t crawl tencent3 hr.tencent.com
3. Install the required packages (typically pip install scrapy-redis)
4. tencent3.py code

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class TencentSpider(RedisCrawlSpider):
    name = 'tencent3'
    allowed_domains = ['hr.tencent.com']
    # start_urls = ['https://hr.tencent.com/position.php']
    # Instead of start_urls, the spider reads its start URLs from this Redis list
    redis_key = 'tencent3:start_urls'

    rules = (
        # Follow the pagination links
        Rule(LinkExtractor(restrict_xpaths=('//tr[@class="f"]',)), follow=True),
        # Extract job-detail links and hand them to parse_item
        Rule(LinkExtractor(restrict_xpaths=('//tr[@class="odd"]', '//tr[@class="even"]')), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        item['name'] = response.xpath('//td[@id="sharetitle"]/text()').extract_first()
        item['address'] = response.xpath('//tr[@class="c bottomline"]/td[1]/text()').extract_first()

        print(item)
        # yield item

'''
Startup commands:
    sudo redis-server /etc/redis/redis.conf
    redis-cli
    select 15
    LPUSH tencent3:start_urls  https://hr.tencent.com/position.php
'''
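The spider above only prints a plain dict. To pass results through Scrapy's item pipelines, uncomment yield item and, optionally, declare the fields in items.py. The class name TencentItem below is an illustrative assumption, not part of the original post; a minimal sketch:

# items.py -- hypothetical item declaration matching the fields scraped in parse_item
import scrapy


class TencentItem(scrapy.Item):
    name = scrapy.Field()     # job title, taken from td#sharetitle
    address = scrapy.Field()  # work location, taken from the first cell of tr.c.bottomline

In parse_item you would then build item = TencentItem() instead of a plain dict and yield it.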

5. Configure settings.py

# Use scrapy_redis's Redis-based request de-duplication
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use scrapy_redis's scheduler so all spider instances share one request queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the request queue and fingerprint set in Redis when the spider closes
SCHEDULER_PERSIST = True

REDIS_URL = 'redis://192.168.12.189:6379/15'
# 192.168.12.189 is the local virtual machine's IP address

Of course, settings.py still needs the other basic configuration as well, which is not covered in detail here; a fuller sketch follows below.
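For reference only, a hedged settings.py sketch that combines the lines above with scrapy_redis's optional item pipeline (RedisPipeline serializes every yielded item to JSON and appends it to a <spider>:items list in Redis); everything outside the scrapy_redis keys is an assumption:

# settings.py -- minimal sketch; project names mirror step 1, other values are assumptions
BOT_NAME = 'mySpider'
SPIDER_MODULES = ['mySpider.spiders']
NEWSPIDER_MODULE = 'mySpider.spiders'

# scrapy_redis components (same as above)
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
REDIS_URL = 'redis://192.168.12.189:6379/15'

# Optional: store every yielded item in Redis as JSON (key tencent3:items)
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}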
6. Run the spider: scrapy crawl tencent3 (for a distributed crawl, run the same command on every crawling machine; each instance waits for URLs from the shared Redis queue)
7. Start Redis and push the start-URL key and value

    sudo redis-server /etc/redis/redis.conf
    redis-cli
    select 15
    LPUSH tencent3:start_urls  https://hr.tencent.com/position.php
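The start URL can also be seeded from Python with the redis client instead of redis-cli; the host, port, and database below simply mirror REDIS_URL from settings.py:

# seed_start_url.py -- hedged alternative to the redis-cli LPUSH above
import redis

# Connect to the same Redis database that REDIS_URL in settings.py points at
r = redis.Redis(host='192.168.12.189', port=6379, db=15)

# Push the first URL; every waiting tencent3 spider instance picks it up
r.lpush('tencent3:start_urls', 'https://hr.tencent.com/position.php')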

8. The crawl runs successfully
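To check the result beyond the console output, and assuming the optional RedisPipeline from the settings sketch above is enabled, the stored items can be read back with the redis client (tencent3:items is scrapy_redis's default <spider>:items key):

# check_items.py -- hedged verification sketch, only useful if RedisPipeline is enabled
import json
import redis

r = redis.Redis(host='192.168.12.189', port=6379, db=15)

# RedisPipeline appends each item as a JSON string to tencent3:items
for raw in r.lrange('tencent3:items', 0, 9):
    print(json.loads(raw))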

Reposted from blog.csdn.net/qq_34663267/article/details/84190567