[Python] crawler

Install scrapy
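Scrapy can be installed with pip (or with conda inside the Anaconda environment used below):

pip install scrapy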

Create a spider project
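The code later in this post uses a project called myfirstspider and a spider called example, so assuming those names the project is created with:

scrapy startproject myfirstspider
cd myfirstspider
scrapy genspider example example.com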

Open the project in PyCharm:

Set the Python interpreter to the Anaconda virtual environment myspark.

Two tools:

1.

scrapy view https://bj.lianjia.com/ershoufang/

view downloads a web page to the local machine; the downloaded page is the page as the crawler sees it. Note: because of anti-crawler measures, the page the crawler sees can differ from the page you see when visiting the target website directly.

The web page has been downloaded locally.

2.

scrapy shell https://bj.lianjia.com/ershoufang/

The shell tool makes interactive debugging easy.

Inspect the web page elements in the browser ---> Copy XPath

xpath syntax:

Expression    Description
//            Selects nodes; / expresses nesting
.             The current element
..            The parent of the current element
@             Selects nodes by attribute

        xpath:       //*[@id="content"]/div[1]/ul/li[3]/div[1]/div[1]/a

response.xpath('//*[@id="content"]/div[1]/ul/li[3]/div[1]/div[1]/a/text()')

The output shows that the data can be obtained.
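In the same shell session, extract_first() pulls the matched text out as a plain string (it returns None when the selector matches nothing), so a quick check looks roughly like this:

>>> title = response.xpath('//*[@id="content"]/div[1]/ul/li[3]/div[1]/div[1]/a/text()').extract_first()
>>> title   # the title of the third listing as a plain string, or None if nothing matched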

Process

1. Parse the data:

Write the spider code:

# -*- coding: utf-8 -*-
import scrapy

from myfirstspider.items import MyfirstspiderItem


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['https://bj.lianjia.com/ershoufang/']

    def parse(self, response):
        
        # Crawl a single item at a time
        
        # housename=response.xpath('//*[@id="content"]/div[1]/ul/li[3]/div[1]/div[1]/a/text()').extract_first()
        # price=response.xpath('//*[@id="content"]/div[1]/ul/li[3]/div[1]/div[6]/div[1]/span/text()').extract_first()
        #
        # item=MyfirstspiderItem()
        # item["housename"]=housename
        # item["price"] = price
        # return item

        # Crawl multiple items
        
        li_list = response.xpath('//*[@id="content"]/div[1]/ul/li')
        for li in li_list:
            item = MyfirstspiderItem()
            # . refers to the current element (each li in the list)
            item["housename"] = li.xpath('./div[1]/div[1]/a/text()').extract_first()
            item["price"] = li.xpath('.//div[1]/div[6]/div[1]/span/text()').extract_first()
            yield item

2. Define the item class (items.py):

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MyfirstspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    housename = scrapy.Field()
    price = scrapy.Field()

3. Set up middleware (optional):
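The post leaves this step empty. As a sketch only, a downloader middleware that rotates the User-Agent header could be added to middlewares.py and registered in settings.py; the class name RandomUserAgentMiddleware, the agent list, and the priority 543 are made-up values for illustration:

# middlewares.py (sketch)
import random

class RandomUserAgentMiddleware(object):
    # sample User-Agent strings; replace with a real list
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    ]

    def process_request(self, request, spider):
        # set a random User-Agent on every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)

# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    'myfirstspider.middlewares.RandomUserAgentMiddleware': 543,
}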

4. Write the pipeline (pipelines.py):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class MyfirstspiderPipeline(object):
    def process_item(self, item, spider):
        # append each item as a "housename,price" line; utf-8 keeps the Chinese titles intact
        with open("house.txt", "a", encoding="utf-8") as f:
            content = "{},{}\n".format(item["housename"], item["price"])
            f.write(content)
        return item

5. Update the configuration (settings.py)

a. Set the crawler not to obey the robots.txt protocol:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

b. Enable the pipeline by uncommenting ITEM_PIPELINES:

ITEM_PIPELINES = {
   'myfirstspider.pipelines.MyfirstspiderPipeline': 300,
}

6. Start the crawler

Common shell commands

start up:
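The spider defined above is named example, so it is started from the project directory with:

scrapy crawl example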

result:

顺驰领海 东门大三居 南北通透,630
亦庄与通州交界处 润枫领尚小区两居室 看房方便,340
南北通透挂西窗 全明格局 前后不临街,940
满五唯一商品房 板楼 层高三米,430
精装修 中楼层 南北通透 满五唯一,133
None,None
四环内韩庄子四里 满五唯一 正规三居室带阳台 中间层,356
世嘉丽晶 南北两居 落地窗 满五唯一2004年商品房,386
梧桐苑南北小三居无遮挡房主诚心出售,400
西四环低密度德式花园下跃复式房源,1195
怡海恒泰园南北通透两居,产权清晰方便看房,低楼层,436
全西开间,采光视野很好,装修不错,380
南北一居,小区内有公立幼儿园、业主优先入园。,285
四环内 正规朝南一居室,楼层好,看房方便,266
郁花园一里 精装修 看房方便 落地飘窗,280
沸城南向两居室 看房方便 业主诚心出售,349
天下儒寓 一室一厅 南向 采光充足,270
西南四环科技园  精装  南北两居  诚意出售,360
户型方正,电梯房,采光好,小区东侧对面就是森林公园,1548
四环内 地铁旁 低楼层 满五唯一,329
南北通透  把边户型  有钥匙 随时看,257
南北向小两居 有钥匙看房方便 诚心出售,228
大峪南路一居室南北通透楼层适中,已满两年,143
南北通透 满五唯一 电梯高层板楼 业主诚心出售,488
全南精装小两居,阳光充足,视野开阔,拎包入住。,261
南北通透 带电梯 两居室 采光好,305
朝阳金台路道家园满五年唯一南北两居室,355
果园地铁 大两居 满五唯一 精装修 诚心卖 带电梯,460
满五唯一 南北通透 中间楼层 交通便利 配套成熟,470
大井南里,南北通透两居室,地铁14号线,看房方便,275
宋家庄地铁  政馨园一区  269万  满五年唯一商品房,269
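Note: the None,None line appears because one li element on the page does not contain the child nodes the XPath expects (likely a non-listing block such as an advertisement), so extract_first() returns None for both fields.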

ps1: Set up pagination (crawl multiple pages):

# -*- coding: utf-8 -*-
import scrapy
from crawlpage.items import BookItem

class ExampleSpider(scrapy.Spider):
    # Spider name; it corresponds to the name used when running: scrapy crawl <name>
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = []
    for page in range(1, 50):
        start_urls.append('http://www.bookschina.com/kinder/54000000_0_0_11_0_1_'+str(page)+'_0_0/')

    def parse(self, response):
        links = response.xpath('//*[@id="container"]/div/div/div/ul/li')

        for link in links:
            item = BookItem()

            item['title'] = link.xpath('div[2]/h2/a/text()').extract()[0]

            item['author'] = link.xpath('div[2]/div/a[1]/text()').extract()[0]
            yield item
            #item['category'] = link.xpath('div/div[1]/span[3]/text()').extract()[0].strip().split('|')[1]

            #url = link.xpath('div[1]/div[1]/a/@href').extract()[0]

            #yield scrapy.Request(url, callback=self.parse_detail, meta={'item': item}, dont_filter=True)


    def parse_detail(self, response):
        item = response.meta['item']
        yield item
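An alternative to building all of the start_urls up front is to follow the site's "next page" link from parse; a minimal sketch (the next-page selector below is hypothetical and has to be adapted to the real page) would look like this:

    def parse(self, response):
        # ... yield the items for the current page as above ...

        # hypothetical selector for the "next page" link
        next_href = response.xpath('//a[@class="next-page"]/@href').extract_first()
        if next_href:
            # urljoin turns a relative href into an absolute URL
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)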

ps2: Connect to the database

Pipeline file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql.cursors

class MysqlPipeline(object):
    def open_spider(self,spider):
        self.mysql_conn = pymysql.connect(host='127.0.0.1',user='root',password='123456',database='bookrecommend',charset='utf8',cursorclass=pymysql.cursors.DictCursor)

    def process_item(self, item, spider):
        # className is assumed to be an attribute defined on the item class to tell item types apart
        if 'book' == item.className:

            sql_insert = 'insert into book values (%s,%s,%s,%s)'

            with self.mysql_conn.cursor() as cursor:
                # 1 is a hard-coded id and '文学' ("Literature") a hard-coded category
                cursor.execute(sql_insert, (1, item.get('title', ''), item.get('author', ''), '文学'))

            self.mysql_conn.commit()
        return item

    def close_spider(self, spider):
        self.mysql_conn.close()
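As with the earlier pipeline, this one only runs after pymysql is installed and the class is registered in ITEM_PIPELINES; assuming the pipeline sits in the crawlpage project from ps1, and that the book table matches the four %s placeholders (the schema below is only a guess):

pip install pymysql

# settings.py
ITEM_PIPELINES = {
    'crawlpage.pipelines.MysqlPipeline': 300,
}

-- assumed table layout for insert into book values (%s,%s,%s,%s)
CREATE TABLE book (
    id       INT,
    title    VARCHAR(255),
    author   VARCHAR(255),
    category VARCHAR(64)
) DEFAULT CHARSET=utf8;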



Origin blog.csdn.net/Qmilumilu/article/details/104802591