A First Dive into Python Crawlers -- My First Python Crawler Project

I spent the past two days going through Python's basic syntax and doing some simple exercises. Today I built my first Python crawler project using the Scrapy framework. Here are the steps, recorded from the Python installation onward.

I. Installing Python and PyCharm

1. Download Python 3.6.5 from the official site: https://www.python.org/downloads/. I didn't install to the default location; I chose D:\py\python3.6.5.

2. Download PyCharm from the official site: http://www.jetbrains.com/pycharm/ and choose the Professional edition. I installed it under D:\ProgramFiles; the install path must not contain spaces, or the installation fails.

3. Set the Python environment variable: add the Python install directory D:\py\python3.6.5 (the folder containing python.exe) to Path.
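
To confirm Path took effect, you can open a new command prompt, start python, and check which interpreter answers (a quick sanity check of my own, not a step from the original post):

import sys
print(sys.executable)   # expect D:\py\python3.6.5\python.exe
print(sys.version)      # expect a string starting with 3.6.5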


II. Installing the Scrapy Library

Installing Scrapy turned out to be a real slog, with a pitfall every other step; it felt a bit like configuring Spring XML, haha. But I digress. Let's go through the concrete steps.

Many online tutorials have you install everything from the command line, but that's rather tedious; when there's a tool, why not use it? So open the freshly installed PyCharm and create a new project. Note that the new-project dialog has a Project Interpreter section: expand it, choose Existing interpreter, and point it at the python.exe we installed above. Otherwise PyCharm will create a brand-new Python environment, which would be redundant since we already have one installed.

Then click File - Default Settings - Project Interpreter in turn, and under Project Interpreter select the python.exe we installed above.


Then click File - Settings - Project Interpreter; the Project Interpreter shown here should now be updated as well. Click the green plus sign on the right, search for the package you want, click Install Package, and install the following libraries in order (a command-line alternative is shown after this list):


1. lxml

2. zope.interface

3. Twisted: the install may fail with error: Microsoft Visual C++ 10.0 is required (Unable to find vcvarsall.bat). In that case, install it manually. Download the matching Twisted wheel from https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted; my Python version is 3.6.5, so I chose Twisted-18.4.0-cp36-cp36m-win32.whl and put the file in D:\py\python3.6.5\Scripts. Then open a command prompt, change into the D:\py\python3.6.5\Scripts directory, and run pip install Twisted-18.4.0-cp36-cp36m-win32.whl; a moment later it installs successfully.

4. Back in PyCharm, install pyOpenSSL.

5. Install Scrapy; this time the install succeeds.
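
For reference, the same stack can also be installed in one line from the command prompt (my addition, assuming D:\py\python3.6.5\Scripts is on Path so that pip resolves):

pip install lxml zope.interface pyOpenSSL scrapy

pip pulls in Twisted automatically as a Scrapy dependency, so the manual wheel route above is only needed when the Visual C++ build error appears.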
III. Creating a Scrapy Project
1. First add scrapy to the environment variables the same way as above (the scrapy command lives in D:\py\python3.6.5\Scripts), then run scrapy startproject scrapy_exam on the command line; you can see the scrapy project get generated.
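
If the command worked, the generated layout should look roughly like this (recalled from Scrapy's project template around version 1.5, so details may vary):

scrapy_exam/
    scrapy.cfg            # deploy/config file
    scrapy_exam/          # the project's Python package
        __init__.py
        items.py          # item definitions (edited next)
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # spider modules go here
            __init__.py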
Open this project in PyCharm and edit the items.py file:
import scrapy
from scrapy import Item, Field


class ScrapyExamItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class TestItem(Item):
    # one Field per piece of data scraped from each rental listing
    address = Field()
    price = Field()
    lease_type = Field()
    bed_amount = Field()
    suggestion = Field()
    comment_star = Field()
    comment_amount = Field()

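A quick aside of my own (not from the original post): a scrapy Item behaves much like a dict, except that only declared fields can be assigned, which catches typos early:

from scrapy_exam.items import TestItem

item = TestItem()
item['address'] = 'Beijing'    # fine: address is a declared Field
print(item['address'])         # -> Beijing
# item['adress'] = 'Beijing'   # would raise KeyError: TestItem does not support field: adress
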
2. Create a Python file spider_test.py under the spiders directory:

import scrapy

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy_exam.items import TestItem
from scrapy import cmdline

# Note: overriding parse() on CrawlSpider bypasses its rule-based crawling;
# since no rules are defined here, this behaves like a plain scrapy.Spider.
class xiaozhu(CrawlSpider):
    name = 'xiaozhu'  # used by the "scrapy crawl xiaozhu" command
    start_urls = ['http://bj.xiaozhu.com/search-duanzufang-p1-0/']

    def parse(self, response):
        item = TestItem()
        selector = Selector(response)
        # each <li> under the listing <ul> is one rental listing
        commodities = selector.xpath('//ul[@class="pic_list clearfix"]/li')

        for commodity in commodities:
            address = commodity.xpath('div[2]/div/a/span/text()').extract()[0]
            price = commodity.xpath('div[2]/span[1]/i/text()').extract()[0]
            # the <em> text looks like "lease_type / bed_amount / suggestion"
            detail = commodity.xpath('div[2]/div/em/text()').extract()[0]
            lease_type = detail.split('/')[0].strip()
            bed_amount = detail.split('/')[1].strip()
            suggestion = detail.split('/')[2].strip()
            # the <span> inside <em> holds "comment_star / comment_amount",
            # or just the comment count when there is no star rating yet
            infos = commodity.xpath('div[2]/div/em/span/text()').extract()[0].strip()
            comment_star = infos.split('/')[0] if '/' in infos else '无'
            comment_amount = infos.split('/')[1] if '/' in infos else infos

            item['address'] = address
            item['price'] = price
            item['lease_type'] = lease_type
            item['bed_amount'] = bed_amount
            item['suggestion'] = suggestion
            item['comment_star'] = comment_star
            item['comment_amount'] = comment_amount

            yield item

        # queue up result pages 1-13 and parse them with this same method;
        # scrapy's duplicate filter drops pages already requested
        urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 14)]
        for url in urls:
            yield Request(url, callback=self.parse)

# guard the cmdline trick so the crawl only starts when this file is run
# directly; merely importing the module (as scrapy does when loading
# spiders) would otherwise re-trigger the crawl
if __name__ == '__main__':
    cmdline.execute("scrapy crawl xiaozhu".split())
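
An alternative I've seen used (my suggestion, with a hypothetical file name, not from the original post) is to keep the spider module free of launcher code and put the cmdline call in a separate script next to scrapy.cfg, which PyCharm can run directly:

# begin.py -- standalone runner placed next to scrapy.cfg (hypothetical name)
from scrapy import cmdline

cmdline.execute("scrapy crawl xiaozhu".split())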

3. Right-click spider_test.py and choose Run 'spider_test'; the scraped items scroll by in the run console.
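
If you would rather inspect the data in a file than scroll through console logs, Scrapy's built-in feed export can write the items out; run this from the project root (standard Scrapy CLI, my addition):

scrapy crawl xiaozhu -o items.json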


That's it for today; tomorrow I'll walk through the program in more detail.


Reposted from blog.csdn.net/smh2208/article/details/80358702