Crawling the Sunshine Government Inquiry Platform (阳光问政平台)

Category: Other, 2018-06-21 22:54:47

Create the project

scrapy startproject dongguan

items.py

import scrapy


class DongguanItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()
    url = scrapy.Field()
    number = scrapy.Field()

Create a CrawlSpider using the crawl template

scrapy genspider -t crawl sun 'wz.sun0769.com'

sun.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dongguan.items import DongguanItem


class SunSpider(CrawlSpider):
    name = 'sun'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=0']

    rules = (
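        # Follow pagination links; with no callback, follow defaults to True.
        Rule(LinkExtractor(allow=r'type=4&page=\d+')),
        # Parse each question detail page; the dot before "shtml" is escaped to match literally.
        Rule(LinkExtractor(allow=r'/html/question/\d+/\d+\.shtml'), callback='parse_item'),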
    )

    def parse_item(self, response):
        item = DongguanItem()

        # Title text of the question
        item['title'] = response.xpath('//div[contains(@class, "pagecenter p3")]//strong/text()').extract()[0]
        # Question number: the part after the last colon in the last space-separated token of the title
        item['number'] = item['title'].split(' ')[-1].split(":")[-1]
        # Body text of the question
        item['content'] = response.xpath('//div[@class="c1 text14_2"]/text()').extract()[0]
        # URL of the detail page
        item['url'] = response.url

        yield item
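
The two link patterns and the number-extraction step can be checked outside Scrapy with the standard library. This is a minimal sketch, not part of the project: the helper names classify and extract_number, the sample detail URL, and the sample title are hypothetical, and the real title format on the site may differ. The dot before "shtml" is escaped here so that it matches only a literal dot.

```python
import re

# Patterns mirroring the spider's rules; the dot before "shtml" is escaped.
LIST_PAGE = re.compile(r'type=4&page=\d+')
DETAIL_PAGE = re.compile(r'/html/question/\d+/\d+\.shtml')


def classify(url):
    """Return 'detail', 'list', or None depending on which pattern the URL matches."""
    if DETAIL_PAGE.search(url):
        return 'detail'
    if LIST_PAGE.search(url):
        return 'list'
    return None


def extract_number(title):
    """Mirror parse_item: last space-separated token, then the part after the last colon."""
    return title.split(' ')[-1].split(':')[-1]


# Hypothetical inputs, for illustration only.
print(classify('http://wz.sun0769.com/index.php/question/questionType?type=4&page=30'))
print(classify('http://wz.sun0769.com/html/question/201806/123456.shtml'))
print(extract_number('Sample question title No.:123456'))
```

If a title contains no colon, extract_number returns the whole last token, so the spider's number field is only as reliable as the title format.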

pipelines.py

import json

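class DongguanPipeline(object):
    def __init__(self):
        # Python 3: open in text mode with an explicit encoding so output is written as UTF-8.
        self.filename = open("dongguan.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese characters readable.
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        # Write the str directly; encoding to bytes first raises TypeError on a text-mode file.
        self.filename.write(text)
        return item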

    def close_spider(self, spider):
        self.filename.close()
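
Because the pipeline writes one JSON object per line followed by a comma, dongguan.json as a whole is not a valid JSON document and json.load will reject it. A minimal sketch of reading it back line by line follows; the helper name parse_pipeline_lines and the sample records are hypothetical.

```python
import json


def parse_pipeline_lines(lines):
    """Parse lines of the form '{...},' into a list of dicts."""
    items = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(','):
            line = line[:-1]  # drop the trailing comma added by the pipeline
        items.append(json.loads(line))
    return items


# Hypothetical sample in the same shape the pipeline writes.
sample = [
    '{"title": "Sample title No.:1", "number": "1"},\n',
    '{"title": "Sample title No.:2", "number": "2"},\n',
]
print(parse_pipeline_lines(sample))
```

With a real output file, the same function can be given an open file object, for example open("dongguan.json", encoding="utf-8"), since iterating a text file yields its lines.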

settings.py

BOT_NAME = 'dongguan'

SPIDER_MODULES = ['dongguan.spiders']
NEWSPIDER_MODULE = 'dongguan.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'dongguan.pipelines.DongguanPipeline': 300,
}

LOG_FILE = "dg.log"
LOG_LEVEL = "DEBUG"

Run the spider

scrapy crawl sun

Reposted from www.cnblogs.com/wanglinjie/p/9211212.html