Install Scrapy

Create a spider project

Open the project in PyCharm:

Change the Python interpreter to the Anaconda virtual environment myspark

Two tools:

scrapy view https://bj.lianjia.com/ershoufang/

view downloads a web page to the local machine; this is the page as the crawler sees it. Note: because of the site's anti-crawler handling, it may differ somewhat from the page you see when you visit the target site directly.

The page has been downloaded locally

scrapy shell https://bj.lianjia.com/ershoufang/

A shell tool for simple debugging

Inspect the page elements ---> Copy XPath

XPath syntax:

Expression	Description
//	Selects matching nodes at any depth; a single / steps down to a direct child (nesting)
.	The current element
..	The parent of the current element
@	Selects an attribute; inside a predicate such as [@id="content"] it filters nodes by attribute value
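
As a small illustration of the table above, the sketch below applies each expression to a made-up HTML fragment. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (for instance, it has no text() step), rather than Scrapy's own selectors. The fragment, the href values and the variable names are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for a listing page.
SAMPLE = """
<div id="content">
  <ul>
    <li><a href="/house/1">First listing</a><span>630</span></li>
    <li><a href="/house/2">Second listing</a><span>340</span></li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)

# "//" searches at any depth; each "/" then steps down to a direct child.
titles = [a.text for a in root.findall(".//ul/li/a")]

# "." is the current element, so this search is relative to one <li>.
second_li = root.find(".//li[2]")
second_price = second_li.find("./span").text

# "@" tests an attribute inside a predicate.
link = root.find(".//a[@href='/house/2']")

# ".." climbs to the parent of the matched element.
parent_tag = root.find(".//a[@href='/house/2']/..").tag

print(titles, second_price, link.text, parent_tag)
```

In a Scrapy spider the same ideas apply to response.xpath(...), which additionally supports steps such as text().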

XPath: //*[@id="content"]/div[1]/ul/li[3]/div[1]/div[1]/a

response.xpath('//*[@id="content"]/div[1]/ul/li[3]/div[1]/div[1]/a/text()')

This shows that the data can be retrieved

Process

1. Parse the data:

Write the spider code:

# -*- coding: utf-8 -*-
import scrapy

from myfirstspider.items import MyfirstspiderItem


class ExampleSpider(scrapy.Spider):
    name = 'example'
    # Must cover the host used in start_urls; links to other domains are filtered as off-site
    allowed_domains = ['lianjia.com']
    start_urls = ['https://bj.lianjia.com/ershoufang/']

    def parse(self, response):
        
        # Scrape a single listing
        
        # housename=response.xpath('//*[@id="content"]/div[1]/ul/li[3]/div[1]/div[1]/a/text()').extract_first()
        # price=response.xpath('//*[@id="content"]/div[1]/ul/li[3]/div[1]/div[6]/div[1]/span/text()').extract_first()
        #
        # item=MyfirstspiderItem()
        # item["housename"]=housename
        # item["price"] = price
        # return item

        # Scrape every listing on the page
        
        li_list=response.xpath('//*[@id="content"]/div[1]/ul/li')
        for li in li_list:
            item=MyfirstspiderItem()
            # "." refers to the current element (this <li>)
            item["housename"] = li.xpath('./div[1]/div[1]/a/text()').extract_first()
            item["price"] =li.xpath('.//div[1]/div[6]/div[1]/span/text()').extract_first()
            yield item

2. Define the item:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MyfirstspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    housename = scrapy.Field()
    price = scrapy.Field()

3. Configure middleware (optional):

4. Write the pipeline:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class MyfirstspiderPipeline(object):
    def process_item(self, item, spider):
        # Explicit UTF-8 so the Chinese titles are written correctly on any platform
        with open("house.txt", "a", encoding="utf-8") as f:
            content="{},{}\n".format(item["housename"],item["price"])
            f.write(content)
        return item
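
The pipeline above reopens house.txt for every item and joins the fields with a bare comma, which would become ambiguous if a title contained a comma. A possible variant, sketched here, opens the file once in open_spider, closes it in close_spider, and lets the csv module handle quoting. The class name CsvHousePipeline and the file name house.csv are assumptions; to use it, it would need its own entry in ITEM_PIPELINES.

```python
import csv


class CsvHousePipeline:
    """Hypothetical pipeline: write one CSV row per item, opening the file once."""

    def __init__(self, path="house.csv"):
        self.path = path

    def open_spider(self, spider):
        # newline="" is what the csv module expects for files it writes to.
        self.file = open(self.path, "a", encoding="utf-8", newline="")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # Works with a scrapy.Item or a plain dict; both support item[key].
        self.writer.writerow([item["housename"], item["price"]])
        return item

    def close_spider(self, spider):
        self.file.close()
```

Because the file stays open for the whole crawl, rows are only guaranteed to be on disk once close_spider has run.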

5. Update the settings

a. Configure the crawler not to obey robots.txt:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

b. Enable the pipeline: uncomment

ITEM_PIPELINES = {
   'myfirstspider.pipelines.MyfirstspiderPipeline': 300,
}

6. Start the crawler

Common shell commands

Run (the spider's name attribute is 'example'):

scrapy crawl example

Result:

顺驰领海 东门大三居 南北通透,630
亦庄与通州交界处 润枫领尚小区两居室 看房方便,340
南北通透挂西窗 全明格局 前后不临街,940
满五唯一商品房 板楼 层高三米,430
精装修 中楼层 南北通透 满五唯一,133
None,None
四环内韩庄子四里 满五唯一 正规三居室带阳台 中间层,356
世嘉丽晶 南北两居 落地窗 满五唯一2004年商品房,386
梧桐苑南北小三居无遮挡房主诚心出售,400
西四环低密度德式花园下跃复式房源,1195
怡海恒泰园南北通透两居，产权清晰方便看房，低楼层,436
全西开间，采光视野很好，装修不错,380
南北一居，小区内有公立幼儿园、业主优先入园。,285
四环内 正规朝南一居室，楼层好，看房方便,266
郁花园一里 精装修 看房方便 落地飘窗,280
沸城南向两居室 看房方便 业主诚心出售,349
天下儒寓 一室一厅 南向 采光充足,270
西南四环科技园  精装  南北两居  诚意出售,360
户型方正，电梯房，采光好，小区东侧对面就是森林公园,1548
四环内 地铁旁 低楼层 满五唯一,329
南北通透  把边户型  有钥匙 随时看,257
南北向小两居 有钥匙看房方便 诚心出售,228
大峪南路一居室南北通透楼层适中，已满两年,143
南北通透 满五唯一 电梯高层板楼 业主诚心出售,488
全南精装小两居，阳光充足，视野开阔，拎包入住。,261
南北通透 带电梯 两居室 采光好,305
朝阳金台路道家园满五年唯一南北两居室,355
果园地铁 大两居 满五唯一 精装修 诚心卖 带电梯,460
满五唯一 南北通透 中间楼层 交通便利 配套成熟,470
大井南里，南北通透两居室，地铁14号线，看房方便,275
宋家庄地铁  政馨园一区  269万  满五年唯一商品房,269
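
The None,None row in the output comes from a list item for which both XPath queries returned nothing, presumably an <li> that is not a regular listing (this is an assumption; the page markup is not shown). One way to keep such rows out of the file is to check each item before writing it. The helper name is_complete is hypothetical, and plain dicts stand in for MyfirstspiderItem so that the sketch runs without Scrapy.

```python
def is_complete(item, required=("housename", "price")):
    """Return True only if every required field holds a non-empty value."""
    return all(item.get(field) for field in required)


# Plain dicts stand in for scraped items in this sketch.
scraped = [
    {"housename": "Listing A", "price": "630"},
    {"housename": None, "price": None},  # what extract_first() gives for a missing node
    {"housename": "Listing B", "price": "340"},
]

kept = [item for item in scraped if is_complete(item)]
print(len(kept))
```

Inside a pipeline, the same check could raise scrapy.exceptions.DropItem so that Scrapy discards the incomplete item instead of passing it on.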

PS 1: Set up pagination:

# -*- coding: utf-8 -*-
import scrapy
from crawlpage.items import BookItem

class ExampleSpider(scrapy.Spider):
    # Spider name; this is the name passed to the command "scrapy crawl <name>"
    name = 'example'
    # Must cover the host used in start_urls; links to other domains are filtered as off-site
    allowed_domains = ['bookschina.com']
    start_urls = []
    for page in range(1, 50):
        start_urls.append('http://www.bookschina.com/kinder/54000000_0_0_11_0_1_'+str(page)+'_0_0/')

    def parse(self, response):
        links = response.xpath('//*[@id="container"]/div/div/div/ul/li')

        for link in links:
            item = BookItem()

            item['title'] = link.xpath('div[2]/h2/a/text()').extract()[0]

            item['author'] = link.xpath('div[2]/div/a[1]/text()').extract()[0]
            yield item
            #item['category'] = link.xpath('div/div[1]/span[3]/text()').extract()[0].strip().split('|')[1]

            #url = link.xpath('div[1]/div[1]/a/@href').extract()[0]

            #yield scrapy.Request(url,callback=self.parse_detail,meta={'item':item},dont_filter=True)


    def parse_detail(self, response):
        item = response.meta['item']
        yield item
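
The loop in the class body builds start_urls by string concatenation. Note that range(1, 50) stops at 49, so page 50 is not requested. The same list can be produced by a small helper, sketched below; the function name page_urls and the constant URL_TEMPLATE are assumptions, while the URL pattern is the one used in the spider.

```python
URL_TEMPLATE = "http://www.bookschina.com/kinder/54000000_0_0_11_0_1_{page}_0_0/"


def page_urls(first, last):
    """Build listing URLs for pages first..last, inclusive on both ends."""
    return [URL_TEMPLATE.format(page=page) for page in range(first, last + 1)]


start_urls = page_urls(1, 49)  # the same 49 pages as range(1, 50) in the spider
print(start_urls[0])
print(len(start_urls))
```

Making both ends inclusive keeps the page range visible at the call site, which avoids the off-by-one surprise of range.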

PS 2: Connect to a database

Pipeline file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql.cursors

class MysqlPipeline(object):
    def open_spider(self,spider):
        self.mysql_conn = pymysql.connect(
            host='127.0.0.1',
            user='root',
            password='123456',
            database='bookrecommend',
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
        )

    def process_item(self, item, spider):
        # Assumes the item class defines a custom className attribute; scrapy.Item does not provide one
        if 'book' == item.className:

            sql_insert = 'insert into book values (%s,%s,%s,%s)'

            with self.mysql_conn.cursor() as cursor:
                cursor.execute(sql_insert, (1, item.get('title', ''),item.get('author', ''),'文学'))
                
            self.mysql_conn.commit()
        return item

    def close_spider(self, spider):
        self.mysql_conn.close()
        pass
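
To try the open/insert/commit/close pattern without a MySQL server, the sketch below swaps pymysql for the standard library's sqlite3. It is a stand-in, not the pipeline above: sqlite3 uses ? placeholders where pymysql uses %s, and the table layout (id, title, author, category), the class name SqlitePipeline and the in-memory database are all assumptions. It also lets the database assign the id rather than passing a fixed value.

```python
import sqlite3


class SqlitePipeline:
    """Hypothetical stand-in for MysqlPipeline that stores books in SQLite."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        # Assumed schema; the real `book` table is not shown in the text.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS book ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "title TEXT, author TEXT, category TEXT)"
        )

    def process_item(self, item, spider):
        # Values are passed separately from the SQL so the driver handles escaping.
        self.conn.execute(
            "INSERT INTO book (title, author, category) VALUES (?, ?, ?)",
            (item.get("title", ""), item.get("author", ""), "文学"),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

Committing after every item mirrors the MySQL version; for large crawls, committing in batches would likely be faster.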
