-
Concept: Scrapy is a fast, high-level web crawling and web scraping framework for Python
-
Scrapy installation
pip3 install scrapy
-
Scrapy: a simple, practical crawler
3.1 Create a test Scrapy project:
scrapy startproject project_name
3.2 Define the fields to be scraped in the items.py file
import scrapy

class DemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # the scraped title
    link = scrapy.Field()   # the scraped link
3.3 First fetch the page content and take a look: create MydemoSpider.py in the spiders folder
import scrapy

class MydemoSpider(scrapy.Spider):
    # name is the spider's name, used with `scrapy crawl xx`
    name = "mydemo"
    # allowed_domains limits the crawl to certain domains (domains, not URLs)
    allowed_domains = ['www.n360.cn']
    # start_urls defines where the crawl starts
    start_urls = ['http://www.n360.cn']

    # the parse method receives the result returned by the Downloader
    def parse(self, response):
        with open('homepage', 'wb') as f:
            f.write(response.body)
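Since parse() only uses the response's raw body here, its logic can be checked without a live crawl. A minimal sketch using a hypothetical stand-in for Scrapy's Response object (FakeResponse is not part of Scrapy; it only mimics the one attribute the spider reads):

```python
import os
import tempfile

class FakeResponse:
    """Hypothetical stand-in for scrapy's Response; only .body is used here."""
    def __init__(self, body):
        self.body = body

def parse(response, path='homepage'):
    # Same logic as the spider's parse(): dump the raw page bytes to a file.
    with open(path, 'wb') as f:
        f.write(response.body)

resp = FakeResponse(b'<html><head><title>demo</title></head></html>')
out = os.path.join(tempfile.mkdtemp(), 'homepage')
parse(resp, out)
with open(out, 'rb') as f:
    print(f.read().decode())  # the saved page HTML
```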
3.4 Simple test: a file named homepage will be generated in the current directory; its content is the HTML code of the page
cd /Users/xietong/Desktop/mydemo/mydemo
scrapy crawl mydemo
3.5 Understand Scrapy's XPath syntax

Expression | Meaning
---|---
/html/head/title | Selects the title element inside the head of the HTML document
/html/head/title/text() | Selects the text of the title element above
//td | Selects all td elements
//div[@class='mine'] | Selects all div elements with a class='mine' attribute
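The expressions in the table can be tried without Scrapy: Python's standard-library xml.etree.ElementTree supports a small XPath subset covering the tag and attribute patterns above (though not text()). A sketch on a made-up, well-formed snippet:

```python
import xml.etree.ElementTree as ET

# A made-up HTML fragment (well-formed XML so ElementTree can parse it).
doc = """
<html>
  <head><title>News</title></head>
  <body>
    <table><tr><td>a</td><td>b</td></tr></table>
    <div class="mine">hello</div>
    <div class="other">bye</div>
  </body>
</html>
"""
root = ET.fromstring(doc)

# /html/head/title -> the title element (root is already <html>)
print(root.find('head/title').text)              # News

# //td -> all td elements
print(len(root.findall('.//td')))                # 2

# //div[@class='mine'] -> div elements whose class attribute is 'mine'
print(root.find(".//div[@class='mine']").text)   # hello
```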
3.6 Test Scrapy's XPath syntax
# enter scrapy shell mode from the terminal
$ scrapy shell "http://www.n360.cn"
# response                                             inspect the page information
# response.headers                                     get the response headers
# response.xpath('//title')                            get the page title; returns Selector objects
# response.xpath('//title').extract()                  extract() converts the Selector objects into a list of strings
# response.xpath('//title/text()').extract()           the text content of the title tag
# response.xpath("//ul[@class='newslist']").extract()  get the ul element whose class attribute is newslist
# get the li tags
sites = response.xpath('//ul[@class="newslist"]/li')
for site in sites:
    # get the text content of the a tag
    title = site.xpath('a/text()').extract()
    print(title)
3.7 Modify the content of MydemoSpider.py directly
import scrapy
from mydemo.items import DemoItem

class MydemoSpider(scrapy.Spider):
    # name is the spider's name, used with `scrapy crawl xx`
    name = "mydemo"
    # allowed_domains limits the crawl to certain domains (domains, not URLs)
    allowed_domains = ['www.n360.cn']
    # start_urls defines where the crawl starts
    start_urls = ['http://www.n360.cn']

    # the parse method receives the result returned by the Downloader
    def parse(self, response):
        sites = response.xpath("//ul[@class='newslist']/li")
        items = []
        for site in sites:
            item = DemoItem()
            item['title'] = site.xpath('a/text()').extract()[0]
            # urljoin builds an absolute URL from the page URL and the href
            item['link'] = response.urljoin(site.xpath('a/@href').extract()[0])
            items.append(item)
        return items
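Building each item's link by gluing a fixed prefix onto the extracted href is fragile: it breaks for root-relative, page-relative, and already-absolute hrefs. The standard library's urllib.parse.urljoin (which Scrapy's response.urljoin wraps) resolves all three cases against the page the link came from. A small sketch with made-up URLs:

```python
from urllib.parse import urljoin

base = 'http://www.n360.cn/news/index.html'  # made-up page URL

# urljoin resolves each kind of href against the page it came from.
print(urljoin(base, '/about.html'))          # root-relative -> http://www.n360.cn/about.html
print(urljoin(base, 'detail.html'))          # page-relative -> http://www.n360.cn/news/detail.html
print(urljoin(base, 'http://other.cn/x'))    # absolute stays as-is -> http://other.cn/x
```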
3.8 Export the full test results of the project to a JSON file with a specified encoding
# export the results returned by the parse function in JSON format
scrapy crawl mydemo -o items.json -s FEED_EXPORT_ENCODING=UTF-8
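The FEED_EXPORT_ENCODING=UTF-8 setting matters because JSON serializers commonly escape non-ASCII characters by default, which makes Chinese titles unreadable in the export. The same effect can be seen with the standard json module:

```python
import json

# A made-up item like the spider would return.
item = {'title': '新闻标题', 'link': 'http://www.n360.cn/news/1.html'}

# Default: non-ASCII characters are escaped as \uXXXX sequences.
print(json.dumps(item))

# ensure_ascii=False keeps the characters readable, analogous to
# setting FEED_EXPORT_ENCODING=UTF-8 for Scrapy's feed export.
print(json.dumps(item, ensure_ascii=False))
```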