Scrapy study notes. Reference: the Scrapy 1.5 Chinese documentation, http://www.scrapyd.cn/doc/
1) Create a project
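- To create a project in a specific folder, open cmd, change into that folder, and run: scrapy startproject <project name>
After the command succeeds, Scrapy generates the project directory: a scrapy.cfg file plus a package folder named after the project, containing items.py, middlewares.py, pipelines.py, settings.py, and a spiders folder.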
2) Write your first spider. Reference: http://www.scrapyd.cn/doc/140.html
import scrapy


class mingyan(scrapy.Spider):  # must inherit from scrapy.Spider
    name = "mingyan2"  # spider name (the name given after `scrapy crawl`)
    start_urls = ['http://lab.scrapyd.cn']

    def parse(self, response):
        mingyan = response.css('div.quote')
        for v in mingyan:  # loop over each quote and get its text, author, and tags
            text = v.css('.text::text').extract_first()  # extract the quote text
            author = v.css('.author::text').extract_first()  # extract the author
            tags = v.css('.tags .tag::text').extract()  # extract the tags
            tags = ','.join(tags)  # join the list into one string
            # Save: one file per author, named "<author>-语录.txt" ("语录" means "quotes")
            fileName = '%s-语录.txt' % author
            # "a+" appends, so each author's quotes accumulate in that author's file
            with open(fileName, "a+", encoding="utf-8") as f:
                f.write(text)
                f.write('\n')  # '\n' is a newline
                f.write('标签:' + tags)  # "标签" means "tags"
                f.write('\n-------\n')
            # no f.close() needed: the with block closes the file
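The save step in the spider above can be tried on its own, without crawling anything. The following is a minimal sketch using only the standard library; append_quote, demo_dir, and the sample quotes are hypothetical names and data made up for illustration, and writing into a temporary directory is an assumption so that the demo leaves no files in the project.

```python
import os
import tempfile


def append_quote(directory, author, text, tags):
    """Append one quote to '<author>-语录.txt' inside `directory`; return the path."""
    path = os.path.join(directory, '%s-语录.txt' % author)
    # Mode "a" appends, so repeated calls for one author accumulate in one file.
    # encoding="utf-8" keeps the Chinese text intact whatever the OS default is.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text + '\n')
        f.write('标签:' + ','.join(tags))
        f.write('\n-------\n')
    return path  # the with block has already closed the file here


# Usage: two quotes by the same made-up author end up in a single file.
demo_dir = tempfile.mkdtemp()
append_quote(demo_dir, 'demo', 'first quote', ['a', 'b'])
demo_path = append_quote(demo_dir, 'demo', 'second quote', ['c'])
with open(demo_path, encoding='utf-8') as f:
    demo_content = f.read()
```

Keeping the file handling in a helper like this would also make it easy to change the output format later without touching the selectors in parse.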
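3) Run a Scrapy spider project from PyCharm. Reference: https://www.cnblogs.com/llssx/p/8378832.html
Create a .py file in the project root (the folder that contains scrapy.cfg) with the following content, then run it from PyCharm: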
from scrapy import cmdline

# The third list element is the spider's name (its `name` attribute)
cmdline.execute(['scrapy', 'crawl', 'mingyan2'])
4) Extracting data with Scrapy:
1. CSS selectors
2. XPath selectors
5) Scrapy commands
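A few commonly used commands can be summarized as follows. This is a partial list written as a small Python sketch so that each command appears in the same argument-list form used with cmdline.execute earlier; "myproject", "myspider", and "example.com" are placeholder names, and SCRAPY_COMMANDS and as_command_line are hypothetical names.

```python
# Each entry pairs a purpose with the argument list that could be typed in a
# terminal (joined by spaces) or passed to scrapy.cmdline.execute().
# "myproject", "myspider", and "example.com" are placeholders.
SCRAPY_COMMANDS = [
    ('create a new project', ['scrapy', 'startproject', 'myproject']),
    ('generate a spider skeleton', ['scrapy', 'genspider', 'myspider', 'example.com']),
    ('list the spiders in a project', ['scrapy', 'list']),
    ('run a spider by name', ['scrapy', 'crawl', 'mingyan2']),
    ('open an interactive shell', ['scrapy', 'shell', 'http://lab.scrapyd.cn']),
]


def as_command_line(argv):
    """Join an argument list into the text typed at the command prompt."""
    return ' '.join(argv)


command_lines = [as_command_line(argv) for _, argv in SCRAPY_COMMANDS]
for (purpose, _), line in zip(SCRAPY_COMMANDS, command_lines):
    print('%-32s %s' % (purpose, line))
```

Commands such as crawl and list only work inside a project directory, while startproject and shell can be run from anywhere.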