Scrapy framework: First Program

Step 1: Create a project:

scrapy startproject maitian
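
This generates the standard Scrapy project skeleton, which looks roughly like this (the comments map each file to the steps below):

maitian/
    scrapy.cfg            # deploy configuration
    maitian/
        __init__.py
        items.py          # item field definitions (Step 2)
        middlewares.py    # not used in this project
        pipelines.py      # item pipelines (Step 5)
        settings.py       # project settings (Step 4)
        spiders/
            __init__.py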

Step 2: Define the fields to crawl in items.py:

import scrapy

class MaitianItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    area = scrapy.Field()
    district = scrapy.Field()

Step 3: Create the spider file zufang_spider.py in the spiders directory:
3.1 Define a class that inherits from scrapy.Spider
3.2 Give the spider a custom name (name = "..."); it is needed later to run the spider with the framework
3.3 Define the target URLs to crawl
3.4 Define the parse method

Here is a simple spider:

import scrapy
from maitian.items import MaitianItem

class MaitianSpider(scrapy.Spider):
    name = "zufang"
    start_urls = ['http://bj.maitian.cn/zfall/PG1']

    def parse(self, response):
        # each <div class="list_title"> block is one rental listing
        for zufang_item in response.xpath('//div[@class="list_title"]'):
            yield {
                'title': zufang_item.xpath('./h1/a/text()').extract_first().strip(),
                'price': zufang_item.xpath('./div[@class="the_price"]/ol/strong/span/text()').extract_first().strip(),
                'area': zufang_item.xpath('./p/span/text()').extract_first().replace('㎡', '').strip(),
                'district': zufang_item.xpath('./p//text()').re(r'昌平|朝阳|东城|大兴|丰台|海淀|石景山|顺义|通州|西城')[0],
            }

        next_page_url = response.xpath(
            '//div[@id="paging"]/a[@class="down_page"]/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
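
Since MaitianItem is imported at the top of the spider, the parse method could just as well fill in the item class defined in Step 2 instead of yielding a plain dict. A minimal sketch of that variant, reusing the same XPaths (the pagination code stays unchanged):

    def parse(self, response):
        for zufang_item in response.xpath('//div[@class="list_title"]'):
            item = MaitianItem()
            item['title'] = zufang_item.xpath('./h1/a/text()').extract_first().strip()
            item['price'] = zufang_item.xpath('./div[@class="the_price"]/ol/strong/span/text()').extract_first().strip()
            item['area'] = zufang_item.xpath('./p/span/text()').extract_first().replace('㎡', '').strip()
            item['district'] = zufang_item.xpath('./p//text()').re(r'昌平|朝阳|东城|大兴|丰台|海淀|石景山|顺义|通州|西城')[0]
            yield item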

Step 4: In settings.py, configure saving the data to the database:

.
.
.

ITEM_PIPELINES = {'maitian.pipelines.MaitianPipeline': 300,}

MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'maitian'
MONGODB_DOCNAME = 'zufang'

Step 5: Wire the above together through the pipeline in pipelines.py:

import pymongo
# scrapy.conf was removed in newer Scrapy versions; get_project_settings reads settings.py instead
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

class MaitianPipeline(object):
    def __init__(self):
        # read the MongoDB connection settings defined in Step 4
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db_name = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        db = client[db_name]
        self.post = db[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        # each scraped item becomes one MongoDB document
        zufang = dict(item)
        self.post.insert_one(zufang)  # insert() is deprecated/removed in recent pymongo
        return item
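
Once the crawler has been run (see the command at the end of this post), the stored documents can be checked with a short pymongo snippet. This is a sketch, assuming MongoDB is running with the host, port, and names configured in Step 4:

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
collection = client['maitian']['zufang']
print(collection.count_documents({}))   # number of listings saved
print(collection.find_one())            # one sample document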

The middlewares.py file is left untouched for now.

That completes the setup of a simple Scrapy crawler project.
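
To run it, use the spider name defined above (name = "zufang") from the project root:

scrapy crawl zufang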

Origin: www.cnblogs.com/hankleo/p/11823994.html