Building a small Python web-crawler project with CentOS 7 and PyCharm

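I. Installing the Python environment

Follow any of the installation tutorials available online.

After installation, start the interpreter to confirm that it works: python

Check the version: python -V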

II. Installing PyCharm

Do this after completing Step I.

Download the latest version from the PyCharm website.

Download link: https://www.jetbrains.com/pycharm/

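III. Creating the Scrapy project

1. Installing Scrapy

Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing. Its appeal is that it is a framework, so anyone can adapt it to their own needs. It also provides base classes for several kinds of spiders, such as BaseSpider and the sitemap spider, and the latest version adds support for crawling Web 2.0 sites.

Scrapy is not part of the Python standard library, so it has to be installed first.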

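Use the command:

pip install scrapy

If the installation reports errors, install or upgrade the missing modules as the error messages indicate.

Note: if a module is reported as already installed but too old, run:

pip uninstall MODULE_NAME    # remove the existing module

pip install MODULE_NAME      # install the latest version

to try to resolve the problem.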
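Before uninstalling and reinstalling, it can help to see which version of a module the interpreter actually picks up. The helper below is a minimal sketch: the name module_version is made up for this illustration, and it relies on the module exposing a __version__ attribute, which is a common convention rather than a guarantee.

```python
import importlib


def module_version(name):
    """Return the module's version string, 'unknown', or None if not installed."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        # The module is missing entirely, so `pip install` is needed
        return None
    # Many packages expose __version__; fall back when they do not
    return getattr(module, '__version__', 'unknown')
```

For example, module_version('scrapy') returns None while Scrapy is not yet installed.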

2. Creating the Scrapy project

2.1 Project creation syntax

scrapy startproject PROJECT_NAME

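2.2 Spider creation syntax

Change into the project folder first, then run:

scrapy genspider SPIDER_NAME DOMAIN

For the project in this article that would be, for example, scrapy startproject weatherSpider followed by scrapy genspider weatherspider www.weather.com.cn.

At this point the Scrapy project files have been created.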

Open the weatherSpider project created above in PyCharm to browse its directory tree.

Commonly used files in a Scrapy project:

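scrapy.cfg: the project's core configuration file. It records the path of the default settings module and the project's deployment name, among other things.

items.py: defines the attributes shared by every scraped record. It is a custom entity module in which each field of a scraped item is declared.

pipelines.py: post-processes each item the spider scrapes, for example by writing it to a txt file or inserting it into a database. Each storage method needs its own pipeline file for the corresponding processing.

settings.py: the configuration file that drives the whole project. A custom pipeline must be registered here before it takes effect; otherwise the project performs no post-processing on the scraped data.

spiders: a folder in which generated spider files are placed automatically.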
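How these files cooperate can be sketched in plain Python: the spider yields items, and every enabled pipeline processes each item in ascending order of its priority number. The snippet below is a simplified model, not Scrapy's implementation; the names fake_spider, UppercasePipeline, CollectPipeline and run are hypothetical.

```python
def fake_spider():
    """Stand-in for a spider's parse(): yields one dict per scraped record."""
    yield {'date': 'day 1', 'wea': 'sunny'}
    yield {'date': 'day 2', 'wea': 'cloudy'}


class UppercasePipeline(object):
    def process_item(self, item, spider):
        item['wea'] = item['wea'].upper()
        return item  # returning the item hands it to the next pipeline


class CollectPipeline(object):
    def __init__(self):
        self.rows = []

    def process_item(self, item, spider):
        self.rows.append((item['date'], item['wea']))
        return item


def run(spider, pipelines):
    """pipelines maps a pipeline object to its priority, like ITEM_PIPELINES."""
    ordered = sorted(pipelines, key=pipelines.get)  # lowest number first
    for item in spider():
        for pipeline in ordered:
            item = pipeline.process_item(item, spider)


collector = CollectPipeline()
run(fake_spider, {collector: 300, UppercasePipeline(): 100})
print(collector.rows)
```

Because UppercasePipeline has the lower number, it runs first, and the collector stores the already modified values.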

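IV. Crawling the data

Steps:

1. Define the data to crawl in the items.py entity file

2. Implement the main spider program, weatherspider.py

3. Process the scraped data (console, text file, JSON, MySQL database)

4. Configure settings.py

5. Run the project and check the results

Run the command: scrapy crawl weatherspider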

1. Define the data to crawl in the items.py entity file

This requires importing the Item and Field classes from Scrapy's item module.

from scrapy.item import Item, Field


class WeatherspiderItem(Item):
    """One scraped forecast; each field holds a list with one value per day."""
    # Date
    date = Field()
    # Weather conditions
    wea = Field()
    # Highest temperature
    temp_max = Field()
    # Lowest temperature
    temp_min = Field()

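2. Implement the main spider program, weatherspider.py

In the generated file, the only setting that needs changing is start_urls, which specifies the page holding the data we want.

Then override the class's parse method to extract that data.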

import sys

import scrapy

from weatherSpider.items import WeatherspiderItem

# Python 2 only: make utf-8 the process-wide default encoding so that the
# Chinese text can be written out later. On Python 3 the default is already
# utf-8, so this branch is never entered.
default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
    reload(sys)
    sys.setdefaultencoding(default_encoding)


class WeatherspiderSpider(scrapy.Spider):
    name = 'weatherspider'
    allowed_domains = ['www.weather.com.cn']
    start_urls = ['http://www.weather.com.cn/weather/101100101.shtml']

    def parse(self, response):
        # Select the li tags that hold the weather information on this page
        current_item = response.xpath('//div[@class="c7d"]/ul/li')
        # Each field receives a list with one entry per matched element
        weather = WeatherspiderItem()
        weather['date'] = current_item.xpath('h1/text()').extract()
        weather['wea'] = current_item.xpath('p[@class="wea"]/text()').extract()
        weather['temp_max'] = current_item.xpath('p[@class="tem"]/span/text()').extract()
        weather['temp_min'] = current_item.xpath('p[@class="tem"]/i/text()').extract()
        yield weather
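To see what these XPath expressions select, the sketch below applies the same paths to a small hand-written fragment using the standard library's xml.etree.ElementTree, which supports a subset of XPath. The markup is an assumption made up to mirror the class names used in the spider, not a copy of the real page, and text_of and extract_days are hypothetical helpers. It reads each li element separately, so a day with a missing element yields None instead of shifting the other values out of alignment.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; the second day has no <span> (no highest temperature)
SAMPLE = """
<html><body>
<div class="c7d"><ul>
  <li><h1>Day 1</h1><p class="wea">Sunny</p>
      <p class="tem"><span>25</span><i>14</i></p></li>
  <li><h1>Day 2</h1><p class="wea">Cloudy</p>
      <p class="tem"><i>12</i></p></li>
</ul></div>
</body></html>
"""


def text_of(node, path):
    """Return the text of the first match under node, or None if absent."""
    found = node.find(path)
    return found.text if found is not None else None


def extract_days(markup):
    """Build one dict per li element instead of four parallel lists."""
    root = ET.fromstring(markup)
    days = []
    for li in root.findall(".//div[@class='c7d']/ul/li"):
        days.append({
            'date': text_of(li, 'h1'),
            'wea': text_of(li, "p[@class='wea']"),
            'temp_max': text_of(li, "p[@class='tem']/span"),
            'temp_min': text_of(li, "p[@class='tem']/i"),
        })
    return days


for day in extract_days(SAMPLE):
    print(day)
```

Inside the spider, the same idea would be a loop over current_item that calls xpath(...).extract_first() on each li selector and yields one item per day.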

3. Processing the scraped data

3.1 Printing to the console

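class WeatherspiderPipeline(object):
    def process_item(self, item, spider):
        # Print the 7-day forecast, one block per day
        for i in range(7):
            print('Date: %s' % item['date'][i])
            print('Weather: %s' % item['wea'][i])
            print('Highest temperature: %s' % item['temp_max'][i])
            print('Lowest temperature: %s' % item['temp_min'][i])
        return item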
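The loop above assumes that every list holds exactly seven entries. A sketch of a more forgiving variant pairs the lists with zip, which stops at the shortest one instead of raising IndexError. The helper iter_rows is hypothetical and the sample item is invented.

```python
def iter_rows(item):
    """Yield (date, wea, temp_max, temp_min) tuples, one per day."""
    columns = (item['date'], item['wea'], item['temp_max'], item['temp_min'])
    # zip stops at the shortest list, so a short column cannot cause IndexError
    for row in zip(*columns):
        yield row


sample = {
    'date': ['day 1', 'day 2', 'day 3'],
    'wea': ['sunny', 'cloudy', 'rain'],
    'temp_max': ['25', '22'],          # one value missing at the end
    'temp_min': ['14', '12', '11'],
}
for date, wea, temp_max, temp_min in iter_rows(sample):
    print('%s %s %s/%s' % (date, wea, temp_max, temp_min))
```

Truncation hides the fact that a value is missing, and if the gap is not at the end the remaining values line up with the wrong day, so this is only a guard against crashes, not a fix for incomplete data.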

3.2 Writing the data to a txt file

import io
import os
import time


class WeatherspiderPipeline(object):
    def __init__(self):
        # Create the output folder if it does not exist yet
        self.folder_name = 'output'
        if not os.path.exists(self.folder_name):
            os.mkdir(self.folder_name)

    def process_item(self, item, spider):
        current_time = time.strftime('%Y-%m-%d', time.localtime())
        file_name = 'weather-7' + current_time + '.txt'
        path = os.path.join(self.folder_name, file_name)
        try:
            # io.open encodes the text as utf-8 on both Python 2 and 3; the
            # with block closes the file, so no explicit close() is needed
            with io.open(path, 'a', encoding='utf-8') as fp:
                for i in range(7):
                    fp.write(u'Date: ' + item['date'][i])
                    fp.write(u' Weather: ' + item['wea'][i])
                    fp.write(u' Lowest temperature: ' + item['temp_min'][i] + u'\n')
        except IOError as er:
            print(er)
        return item

3.3 Writing the data as JSON

import codecs
import json
import os
import time


class WeatherspiderPipeline(object):
    def __init__(self):
        # Create the output folder if it does not exist yet
        self.folder_name = 'output'
        if not os.path.exists(self.folder_name):
            os.mkdir(self.folder_name)

    def process_item(self, item, spider):
        current_time = time.strftime('%Y-%m-%d', time.localtime())
        file_name = 'weather-7' + current_time + '.json'
        path = os.path.join(self.folder_name, file_name)
        try:
            # Append one JSON object per line. ensure_ascii=False keeps the
            # Chinese text readable, so the file has to be opened as utf-8
            with codecs.open(path, 'a', encoding='utf-8') as fp:
                json_line = json.dumps(dict(item), ensure_ascii=False) + '\n'
                fp.write(json_line)
        except IOError as er:
            print(er)
        return item
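Scrapy also calls a pipeline's open_spider and close_spider methods, if they are defined, when the crawl starts and ends, which allows the file to be opened once rather than for every item. A minimal sketch of that approach, written for Python 3, follows. The class name JsonLinesPipeline and its folder_name and file_name parameters are hypothetical, and the methods are called by hand at the end only to show the sequence Scrapy would follow.

```python
import io
import json
import os
import tempfile


class JsonLinesPipeline(object):
    def __init__(self, folder_name='output', file_name='weather.json'):
        self.folder_name = folder_name
        self.file_name = file_name
        self.fp = None

    def open_spider(self, spider):
        # Called once when the crawl starts: create the folder, open the file
        if not os.path.exists(self.folder_name):
            os.makedirs(self.folder_name)
        path = os.path.join(self.folder_name, self.file_name)
        self.fp = io.open(path, 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line, non-ASCII text kept as-is
        line = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(line + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends
        self.fp.close()


# Manual walk-through in a temporary folder
folder = tempfile.mkdtemp()
pipeline = JsonLinesPipeline(folder_name=folder)
pipeline.open_spider(None)
# u'\u6674' is the Chinese character for clear weather
pipeline.process_item({'date': ['day 1'], 'wea': [u'\u6674']}, None)
pipeline.close_spider(None)

with io.open(os.path.join(folder, 'weather.json'), encoding='utf-8') as fp:
    print(fp.read())
```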

3.4 Writing the data to a MySQL database

# MySQLdb is a third-party module and must be downloaded and installed separately

import MySQLdb


class WeatherspiderPipeline(object):
    def process_item(self, item, spider):
        # charset='utf8' makes the connection send and receive utf-8 text
        conn = MySQLdb.connect(host='localhost', user='root', passwd='***',
                               db='lv', port=3306, charset='utf8')
        try:
            # Get a cursor object
            c = conn.cursor()
            # Parameterized INSERT: the driver escapes the values
            sql = 'INSERT INTO weather VALUES(%s,%s,%s,%s)'
            for i in range(7):
                c.execute(sql, (item['date'][i], item['wea'][i],
                                item['temp_max'][i], item['temp_min'][i]))
                print('----------------')
            # Commit the transaction once, after all rows are inserted
            conn.commit()
        finally:
            # Always release the connection, even if an insert fails
            conn.close()
        return item
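The schema of the weather table is not shown here, and trying the insert requires a running MySQL server. The sketch below uses the standard library's sqlite3 module as a stand-in to illustrate the same pattern: one parameterized statement, all rows sent with executemany, and a single commit. The column names and types are assumptions, save_forecast is a hypothetical helper, and sqlite3 uses ? as its placeholder where MySQLdb uses %s.

```python
import sqlite3


def save_forecast(conn, item):
    """Insert one row per day and commit once at the end."""
    rows = list(zip(item['date'], item['wea'],
                    item['temp_max'], item['temp_min']))
    conn.executemany('INSERT INTO weather VALUES (?, ?, ?, ?)', rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(':memory:')
# Hypothetical schema: four text columns matching the item fields
conn.execute('CREATE TABLE weather '
             '(date TEXT, wea TEXT, temp_max TEXT, temp_min TEXT)')

item = {
    'date': ['day 1', 'day 2'],
    'wea': ['sunny', 'cloudy'],
    'temp_max': ['25', '22'],
    'temp_min': ['14', '12'],
}
print(save_forecast(conn, item))
print(conn.execute('SELECT * FROM weather').fetchall())
```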

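4. Configuring settings.py

This file drives the configuration of the whole project. A custom pipeline must be registered here before it takes effect; otherwise the project performs no post-processing on the scraped data.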

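# Pipelines to run; lower numbers run first (valid range 0-1000)
ITEM_PIPELINES = {
    # 'weatherSpider.pipelines.WeatherspiderPipeline': 300,
    # 'weatherSpider.pipelines2txt.WeatherspiderPipeline': 1,
    'weatherSpider.pipelines2json.WeatherspiderPipeline': 1,
    'weatherSpider.pipelines2mysql.WeatherspiderPipeline': 1,
}
# Disable the built-in User-Agent middleware and enable the rotating one.
# On Scrapy versions before 1.0 the built-in module path is
# scrapy.contrib.downloadermiddleware.useragent
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'weatherSpider.rotate_useragent.RotateUserAgentMiddleware': 400,
}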
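The rotate_useragent module referenced above is not listed in this article. As a rough idea of what such a downloader middleware can look like, the sketch below picks a random User-Agent for each request in process_request, the hook Scrapy calls before a request is downloaded. The USER_AGENTS list and the FakeRequest class are invented for illustration; in a real project Scrapy passes its own Request object, whose headers attribute is likewise dict-like.

```python
import random

# Hypothetical pool of User-Agent strings; replace with real browser values
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0',
]


class RotateUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Overwrite the header so consecutive requests may look different
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to continue handling the request
        return None


class FakeRequest(object):
    """Minimal stand-in for scrapy.Request, used only for this demo."""
    def __init__(self):
        self.headers = {}


request = FakeRequest()
RotateUserAgentMiddleware().process_request(request, spider=None)
print(request.headers['User-Agent'])
```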

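5. Running the project and checking the results

Run the command: scrapy crawl weatherspider

Check the console output and the MySQL database for the results.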
