Using the Scrapy crawler framework (introduction)

Learning notes on the Scrapy crawler framework.


Target site: ishuo.cn, a Chinese jokes (duanzi) site

Creating a project:

In cmd, or in the PyCharm terminal, run:

scrapy startproject tutorial

(tutorial is the project name.)
This command creates a tutorial directory with the following contents:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py  
        items.py  
        pipelines.py  
        settings.py  
        spiders/  
            __init__.py  
            ...

These files are:

  • scrapy.cfg: the project configuration file
  • tutorial/: the project's Python module; you will add your code here.
  • tutorial/items.py: the project's item definitions.
  • tutorial/pipelines.py: the project's pipelines.
  • tutorial/settings.py: the project's settings file.
  • tutorial/spiders/: the directory where spider code lives.

Item definitions

An Item is a container for scraped data. It is used much like a Python dictionary, but additionally protects against populating undeclared fields, which catches misspelled field names.

Much as you would with an ORM, you create a class that subclasses scrapy.Item and define its attributes as scrapy.Field objects. (If you are not familiar with ORMs, don't worry; you will find this step very simple.)

First, model the item for the data we want to obtain from dmoz.org: the name, URL, and description of each site. Define a field in the item for each of these by editing the items.py file in the tutorial directory:

import scrapy
class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

This may seem a bit complicated at first, but defining the item lets you use other Scrapy components that need to know your item's structure.
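To make the "protection against misspelled fields" concrete, here is a minimal plain-Python sketch of that behavior. This is not Scrapy's actual implementation (which uses a metaclass to collect the Field attributes); the `Field`/`Item` classes below are simplified stand-ins, reusing the field names declared above:

```python
# Minimal sketch of scrapy.Item's field protection, in plain Python
# (no Scrapy required).  Assigning to a key that was not declared as
# a field raises KeyError instead of silently storing a typo.
class Field(dict):
    """Stand-in for scrapy.Field; just a container for field metadata."""

class Item(dict):
    fields = {}  # declared fields, filled in by subclasses

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

class DmozItem(Item):
    fields = {"title": Field(), "link": Field(), "desc": Field()}

item = DmozItem()
item["title"] = "Example site"   # declared field: accepted
try:
    item["titel"] = "typo"       # undeclared field: rejected
except KeyError as err:
    print("rejected:", err)
```

Real Scrapy items behave the same way: a typo such as `titel` fails immediately instead of producing silently wrong data downstream.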

Creating a spider with a command

Remember to cd into the project directory first (the directory one level above the spiders folder).

scrapy genspider duanzi "ishuo.cn"

This creates a spider named duanzi (the spider name cannot be the same as the project name), and limits the pages it may crawl to the ishuo.cn domain. You will then find a new duanzi.py in the spiders directory.

Take a look at the code in duanzi.py:

# -*- coding: utf-8 -*-
import scrapy

class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    allowed_domains = ["ishuo.cn"]
    start_urls = ["https://ishuo.cn/"]

    def parse(self, response):
        pass
  • name: identifies the Spider. It must be unique; you cannot set the same name for different Spiders.
  • start_urls: a list of URLs the Spider starts crawling from. The first pages downloaded will be these; subsequent URLs are extracted from the data of those initial pages.
  • parse(): a method of the spider. When called, the Response object generated from downloading each initial URL is passed to it as the only argument. The method is responsible for parsing the response data, extracting the scraped data (generating items), and generating Request objects for further URLs to process.

Scrapy creates a scrapy.Request object for each URL in the Spider's start_urls attribute and assigns the spider's parse method to each Request as its callback function.

The Request objects are scheduled and executed, producing scrapy.http.Response objects, which are then fed back to the spider through the parse() method.
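The request/callback flow described above can be sketched with a toy "engine". The names `Request`, `Response`, `download`, and `ToySpider` below are simplified stand-ins for illustration, not Scrapy's actual internals:

```python
# Toy sketch of Scrapy's request/callback flow: each start URL
# becomes a Request carrying parse() as its callback; the downloader
# turns it into a Response, which is handed back to that callback.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Request:
    url: str
    callback: Callable

@dataclass
class Response:
    url: str
    body: str

def download(request: Request) -> Response:
    # Stand-in for Scrapy's downloader/scheduler machinery.
    return Response(request.url, f"<html>page at {request.url}</html>")

class ToySpider:
    start_urls = ["https://ishuo.cn/"]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response: Response):
        print("parse() received", response.url)

spider = ToySpider()
for request in spider.start_requests():
    response = download(request)
    request.callback(response)   # Scrapy invokes the callback for you
```

In a real project you never write this loop yourself; the Scrapy engine schedules the requests and calls your callbacks.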

Continuing with the parse() function, Selectors can be used here:

    def parse(self, response):
        for sel in response.xpath('//*[@id="list"]/ul'):
            # the joke text
            text = sel.xpath('.//div[contains(@class,"content")]/text()').extract()
            # the category link text for each joke
            text2 = sel.xpath('.//div[contains(@class,"info")]/a/text()').extract()
            for i in text:
                print(i)
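To see what those XPath expressions do, here is a self-contained sketch using the standard library's xml.etree on a made-up snippet that mimics the page structure assumed above. Scrapy itself evaluates full XPath 1.0 via lxml, which is why the real code can use contains(); ElementTree supports only exact attribute matches:

```python
# Demonstrates the XPath extraction on a tiny, hypothetical snippet
# shaped like ishuo.cn's assumed markup: div#list > ul > li, with a
# div.content (joke text) and a div.info > a (category) per item.
import xml.etree.ElementTree as ET

html = """
<div id="list">
  <ul>
    <li><div class="content">joke one</div><div class="info"><a>category A</a></div></li>
    <li><div class="content">joke two</div><div class="info"><a>category B</a></div></li>
  </ul>
</div>
"""
root = ET.fromstring(html)
texts = [d.text for d in root.findall(".//div[@class='content']")]
links = [a.text for a in root.findall(".//div[@class='info']/a")]
print(texts)  # ['joke one', 'joke two']
print(links)  # ['category A', 'category B']
```

The spider's `text` and `text2` lists correspond to `texts` and `links` here, one entry per `li` on the page.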

Run the spider with scrapy crawl duanzi, and the jokes are scraped successfully:

If you offend your boss, you lose only a job; if you offend a client, you lose only an order. Indeed, there is just one person in the world you can safely offend: you can give her dirty looks, grumble at her, talk back to her loudly, even smash a bowl right in front of her, and she will never hold it against you. The reason is simple: she is your mother.

A very pretty colleague got up late one day and rushed to the office with no time to put on makeup. As a result, she was marked absent for the day... [shocked]

Wukong and Tang Seng went on the TV dating show If You Are the One. When Wukong took the stage, all 24 lights went out. Reasons: 1. No house, no car, just a battered staff. 2. Bodyguard is a dangerous job. 3. Beats up demons at the slightest provocation and isn't gentle with girls. 4. Did time, pinned under Five Finger Mountain for 500 years. When Tang Seng took the stage, wow, every light came on. Reasons: 1. Civil servant. 2. Sworn brother of the Emperor, the best connections. 3. Fluent in Sanskrit and other foreign languages. 4. Handsome. 5. Most important of all: he has a BMW (a pun on baoma, his "precious horse")!

Save your heart for someone who cares. (For someone who cares about you, save your true heart!)


Origin blog.csdn.net/qq_43630441/article/details/104691956