Foreword
Having taught myself web crawling, I feel I should tidy up my notes now that I have the basics down, and leave a record of the process.
You can configure the Scrapy environment yourself with a Baidu search; it is actually not difficult (Windows only: I spent two days on the CentOS configuration and still have not fully figured it out).
After installing Python, get pip, set it up, and then install Scrapy with pip install (pyspider is configured the same way).
The main references are attached:
Scrapy tutorial: https://www.bilibili.com/video/av13663892?t=129&p=2
Developing Scrapy in Eclipse: https://blog.csdn.net/ioiol/article/details/46745993
First, make sure the machine has Eclipse, Python, and a working pip environment.
Installing the Scrapy framework
Open a cmd window:
:: update pip
pip install --upgrade pip
:: install scrapy
pip install scrapy
Once the installation completes, Scrapy is ready to use.
Creating a Scrapy demo project in the cmd environment
First create a directory anywhere, cd into it, and run scrapy on the command line to view the available commands.
startproject creates a project. Format: scrapy startproject <project name>
genspider creates a spider; a project can contain multiple spiders. Format: scrapy genspider <spider name> (must not be the same as the project name) <start domain>
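Assuming the project is named demo and the spider is named test (the names used later in this post), the two commands look like:

```shell
:: create a project named demo (generates a demo folder with scrapy.cfg inside)
scrapy startproject demo
cd demo
:: create a spider named test with www.tmooc.cn as its start domain
scrapy genspider test www.tmooc.cn
```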
The goal is to fetch the sidebar content of the tmooc home page (the text of the span element inside the a element inside each element with class sub).
The tmooc home page:
The sidebar content:
Edit items.py, which sits in the same directory as the spiders folder (the code is simple, so it is not pasted here).
Edit test.py
The code:
# -*- coding: utf-8 -*-
import scrapy
# Import DemoItem, the item class defined in items.py
from demo.items import DemoItem

class TestSpider(scrapy.Spider):
    # spider name, used when running the crawl command
    name = 'test'
    # restrict crawling to this domain; requests outside it are dropped (optional)
    allowed_domains = ['www.tmooc.cn']
    # initial URL
    start_urls = ['http://www.tmooc.cn/']

    # callback
    def parse(self, response):
        # A crawler essentially requests a URL, parses the response, and then
        # requests the next URL, so the core of a spider is how it handles
        # the response object
        nodes = response.xpath("//li[@class='sub']")
        for node in nodes:
            # item is an instance of DemoItem from items.py in the same
            # package; it behaves like a dictionary (a Java Map)
            item = DemoItem()
            item['name'] = node.xpath("./a/span/text()").extract()[0]
            # yield works like return; look it up for details
            yield item
Save test.py and run the spider.
crawl runs a spider. Format: scrapy crawl <spider name> [-o filename]
The -o parameter is optional; it saves the data the spider scrapes to a file in the directory where the command is run. Supported formats include csv (opens as an Excel table), json, jsonl, xml, and others.
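For this project the run command might look like the following (the output filenames are illustrative):

```shell
:: run the test spider and save the scraped items as csv
scrapy crawl test -o result.csv
:: or save them as json
scrapy crawl test -o result.json
```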
The result:
Developing a Scrapy spider project in Eclipse
First, make sure Eclipse has a Python development environment.
Create a new Python project, accepting the default options.
The directory structure is created:
Go into the local workspace and locate the project directory.
Copy the demo directory from the Scrapy project created earlier (not the outer folder that startproject generated) into the Eclipse project directory. Remember to delete the results file from the last run.
Run -> Run Configurations ->
The result: