Scrapy crawler framework learning notes
Target site: ishuo.cn (a Chinese jokes site)
Create a project:
Run the following command in cmd, or in PyCharm's Terminal:
scrapy startproject tutorial
(tutorial is the project name)
This command creates a tutorial directory with the following contents:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
These files are:
scrapy.cfg: the project configuration file
tutorial/: the project's Python module; you will add your code here
tutorial/items.py: the item definitions for the project
tutorial/pipelines.py: the pipelines for the project
tutorial/settings.py: the settings for the project
tutorial/spiders/: the directory where spider code lives
Defining Items
An Item is a container for the scraped data. It is used much like a Python dictionary, but it additionally protects against populating undeclared fields, which catches misspelled field names.
As you would in an ORM, you create a scrapy.Item subclass and define its attributes as scrapy.Field objects. (If you are not familiar with ORMs, don't worry; this step is very simple.)
First, model the item for the data we want from dmoz.org: the name, url, and description of each site. Define a field in the item for each of these by editing the items.py file in the tutorial directory:
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
This may seem a bit involved at first, but defining the item lets you use other Scrapy components that need to know your item's structure.
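The "protection against undeclared fields" can be illustrated without Scrapy. The sketch below is not Scrapy's actual implementation, just a minimal dict subclass that mimics the idea: declared fields are accepted, anything else raises an error instead of being silently stored.

```python
class StrictItem(dict):
    """A dict that only accepts keys declared in `fields`.

    Roughly mimics how scrapy.Item rejects misspelled field names.
    """
    fields = ("title", "link", "desc")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"StrictItem does not support field: {key}")
        super().__setitem__(key, value)


item = StrictItem()
item["title"] = "Example site"   # declared field: accepted
try:
    item["titel"] = "oops"       # typo: rejected instead of silently stored
except KeyError as e:
    print(e)
```

This is why a typo like titel fails loudly at assignment time rather than producing an item with a silently wrong key.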
Create a spider with a command
Make sure you are inside the project directory (the level that contains the spiders directory), then run:
scrapy genspider duanzi "ishuo.cn"
This creates a spider named duanzi (the spider name must not be the same as the project name) and restricts crawling to the ishuo.cn domain. You will then find a new duanzi.py in the spiders directory.
Look at the generated code in duanzi.py:
# -*- coding: utf-8 -*-
import scrapy

class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    allowed_domains = ["ishuo.cn"]
    start_urls = ["https://ishuo.cn/"]

    def parse(self, response):
        pass
name: identifies the Spider. It must be unique; you cannot give two spiders the same name.
start_urls: the list of URLs the spider starts crawling from. The first pages fetched are these; subsequent URLs are extracted from the data of those initial pages.
parse(): a method of the spider. When called, the Response object generated for each downloaded initial URL is passed in as the only argument. The method is responsible for parsing the response data, extracting items, and generating Request objects for URLs that need further processing.
Scrapy creates a scrapy.Request object for each URL in the spider's start_urls attribute and assigns the spider's parse method as the Request's callback.
Once scheduled and executed, each Request produces a scrapy.http.Response object, which is passed back to the spider's parse() method.
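The request/callback cycle described above can be sketched in plain Python. This is only an illustration of the flow, not Scrapy's internals: there is no real networking (the download step is faked), and all names here are invented for the sketch.

```python
from collections import deque

def fake_download(url):
    """Stand-in for Scrapy's downloader: returns a canned 'response'."""
    return {"url": url, "body": f"<html>content of {url}</html>"}

def crawl(start_urls, parse):
    """Toy engine loop: schedule a request per start URL, 'download' it,
    hand each response to its callback, and schedule any new requests
    the callback yields."""
    queue = deque((url, parse) for url in start_urls)  # (url, callback) pairs
    results = []
    while queue:
        url, callback = queue.popleft()
        response = fake_download(url)
        for produced in callback(response):
            if isinstance(produced, tuple):   # (url, callback): a new "request"
                queue.append(produced)
            else:                             # anything else: a scraped item
                results.append(produced)
    return results

def parse(response):
    # A callback can yield items (and, in real Scrapy, further Requests).
    yield {"seen": response["url"]}

items = crawl(["https://ishuo.cn/"], parse)
print(items)
```

The real engine adds concurrency, deduplication, and middleware, but the shape is the same: requests go in, responses come back through callbacks.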
Now write the body of the parse() function; Scrapy Selectors can be used here:
def parse(self, response):
    for sel in response.xpath('//*[@id="list"]/ul'):
        text = sel.xpath('.//div[contains(@class,"content")]/text()').extract()
        text2 = sel.xpath('.//div[contains(@class,"info")]/a/text()').extract()
        for i in text:
            print(i)
Run the spider with scrapy crawl duanzi; the jokes are extracted successfully, for example:
If you offend your boss, you lose only a job; if you offend a customer, you lose nothing but an order. Yes, there is only one person in the world you can safely offend: you can give her dirty looks, grumble at her, talk back to her loudly, even smash a bowl right in front of her, and she will never hold it against you. The reason is simple: she is your mother.
A very pretty colleague of mine got up late one day and rushed to the office with no time for makeup. As a result, she was marked absent that day... [shocked]
Wukong and Tang Seng both went on a TV dating show. When Wukong took the stage, all 24 lights went out. Reasons: 1. No house, no car, just a battered staff. 2. Bodyguard is a dangerous profession. 3. He beats up demons at the slightest provocation and is not gentle with girls. 4. He has done time, pressed under Five Finger Mountain for 500 years. When Tang Seng took the stage, wow, every light came on. Reasons: 1. Civil servant. 2. Sworn brother of the Emperor, the best connections. 3. Fluent in Sanskrit and other foreign languages. 4. Handsome. 5. Most importantly: he has a BMW (in Chinese, literally a "precious horse")!
Save your heart for someone who cares.
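If you want to experiment with XPath-style extraction offline, the standard library's xml.etree.ElementTree supports a small XPath subset (attribute tests like [@class="content"], but not contains()). The HTML snippet below is a made-up, simplified stand-in for ishuo.cn's real markup, only to show the extraction pattern:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup in the shape the spider expects.
html = """
<div id="list">
  <ul>
    <li><div class="content">joke one</div><div class="info"><a>link one</a></div></li>
    <li><div class="content">joke two</div><div class="info"><a>link two</a></div></li>
  </ul>
</div>
"""

root = ET.fromstring(html)
# Find the <ul> under the list container, then pull the text of each
# content div (analogous to .extract() on the Scrapy selector).
ul = root.find('ul')
texts = [div.text for div in ul.findall('.//div[@class="content"]')]
print(texts)
```

Note that ElementTree needs well-formed XML; for real, messy HTML you would stay with Scrapy's selectors (or lxml), which are tolerant parsers and support the full contains() predicate used above.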