Developing a Scrapy crawler project in Eclipse, with a beginner-level crawler tutorial attached

Foreword

After getting started with crawlers through self-study, I felt the process was worth tidying up, if only to leave a record.

For configuring the Scrapy environment, search Baidu yourself; it is really not difficult (this applies to Windows only; I spent two days on the CentOS configuration and still have not figured it out).

After installing Python, download and set up pip, then pip install the packages you need (pyspider is configured the same way).

 

The main reference links are attached:

Scrapy tutorial: https://www.bilibili.com/video/av13663892?t=129&p=2

Developing Scrapy in Eclipse: https://blog.csdn.net/ioiol/article/details/46745993

 

First, make sure the machine has Eclipse configured, plus a Python and pip environment.

Installing the Scrapy framework

Open a cmd window:

:: update pip
pip install --upgrade pip

:: install scrapy
pip install scrapy

Once the installation finishes, Scrapy is ready to use.

 

Creating a Scrapy demo project from cmd

 

First create a directory anywhere, enter it, and open a command prompt there; running scrapy with no arguments lists the available commands.

startproject creates a project. Format: scrapy startproject <project name>, e.g. scrapy startproject demo

genspider creates a spider; one project can contain multiple spiders. Format: scrapy genspider <spider name> <start domain>, where the spider name must differ from the project name, e.g. scrapy genspider test www.tmooc.cn

 

 

 

 

The goal is to fetch the contents of the tmooc homepage sidebar (the text of the span elements inside the a children of the li elements with class sub).

tmooc Home

 

 

 Sidebar content
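Before touching the project files, the extraction logic can be previewed with the standard library alone. The HTML fragment below is a made-up stand-in for the tmooc sidebar (the real markup differs), and xml.etree.ElementTree supports just enough XPath to mirror the selectors the spider will use:

```python
# A stdlib sketch of the extraction the spider performs: pull the text of
# each <span> under <a> inside <li class="sub"> elements. The fragment is
# invented for illustration; the real tmooc sidebar markup differs.
import xml.etree.ElementTree as ET

SIDEBAR = """
<ul>
  <li class="sub"><a href="#"><span>Java</span></a></li>
  <li class="sub"><a href="#"><span>Python</span></a></li>
  <li class="other"><a href="#"><span>Ignored</span></a></li>
</ul>
"""

def extract_names(fragment):
    root = ET.fromstring(fragment)
    # Same selection idea as //li[@class='sub'] then ./a/span/text()
    return [li.find("./a/span").text
            for li in root.findall(".//li[@class='sub']")]

print(extract_names(SIDEBAR))  # ['Java', 'Python']
```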

Edit items.py, which sits at the same directory level as the spiders folder (the code is simple, so it is not pasted here).

 

 

 

Edit test.py

 

Code section

# -*- coding: utf-8 -*-
import scrapy
# Import the item class from the project's items.py
from demo.items import DemoItem


class TestSpider(scrapy.Spider):
    # Spider name, used when running the crawl
    name = 'test'
    # Domain restriction: requests outside it are dropped; may be omitted.
    # Note this should be a bare domain, not a URL with a scheme.
    allowed_domains = ['www.tmooc.cn']
    # Initial address
    start_urls = ['http://www.tmooc.cn/']

    # Callback
    def parse(self, response):
        # A crawler essentially requests an address, parses the response,
        # then requests the next address; so the core of a spider is how
        # it works with the response object
        nodes = response.xpath("//li[@class='sub']")
        for node in nodes:
            # item comes from items.py in the same project as the spider;
            # it behaves like a dictionary (a Java map)
            item = DemoItem()
            item['name'] = node.xpath("./a/span/text()").extract()[0]
            # yield is similar to return, but emits one item at a time
            yield item
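The yield at the end deserves a word more than the comment gives it: unlike return, yield hands back one value and pauses the function, which is how parse can emit many items from a single response. A plain-Python sketch:

```python
# Generator sketch of what the spider's loop does: each iteration hands
# back one dict and pauses, like `yield item` in parse().
def emit_items(names):
    for n in names:
        yield {'name': n}

items = list(emit_items(['Java', 'Python']))
print(items)  # [{'name': 'Java'}, {'name': 'Python'}]
```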

Save test.py, then run the spider.

 

 

 

crawl runs a spider. Format: scrapy crawl <spider name> [-o filename], e.g. scrapy crawl test -o result.json

The -o parameter is optional; it writes the crawled data to a file in the directory where the command is run. Data can be saved as csv (opens as an Excel table), json, jsonl, xml, and other formats.
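As an illustration of the jsonl format mentioned above: it stores one JSON object per line, which suits appending items as they are scraped. The two sample lines below are invented, not actual crawl output:

```python
# Reading a jsonl export with the stdlib: one JSON object per line.
# The sample content is made up for illustration.
import json

sample = '{"name": "Java"}\n{"name": "Python"}\n'
rows = [json.loads(line) for line in sample.splitlines()]
print(rows)  # [{'name': 'Java'}, {'name': 'Python'}]
```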

The results:

 

Developing the Scrapy spider project in Eclipse

First, make sure Eclipse has a Python development environment configured.

Create a new Python project, accepting the default options.

 

The directory structure after creation:

 

 

 

 

Go into the local workspace and locate the project directory.

 

 

 

 

Copy over the Scrapy project directory created earlier, excluding the outer folder that startproject generated.

The demo directory

is copied into

the Eclipse project directory. Remember to delete the results file from the previous run.

 

 

 

 

Run -> Run Configurations ->

 

 

 

 

 

 

 

Run results

 

 

 

 

 


Origin www.cnblogs.com/kvii/p/11649337.html