Python Scrapy: A Hands-On Crawler

  1. Concept: Scrapy is a fast, high-level web crawling and web scraping framework for Python

  2. Scrapy installation

     pip3 install scrapy
    
  3. A simple, practical Scrapy crawler
    3.1 Create a test Scrapy project:

	scrapy startproject project_name 

3.2 Define the fields to be captured in the items.py file

import scrapy

class DemoItem(scrapy.Item):
  # define the fields for your item here, e.g.:
  # name = scrapy.Field()
  title = scrapy.Field()  # the scraped title
  link = scrapy.Field()   # the scraped link

3.3 First fetch the page content to take a look: create MydemoSpider.py in the spiders folder

import scrapy

class MydemoSpider(scrapy.Spider):
    # name identifies the spider; it is used in `scrapy crawl xx`
    name = "mydemo"
    # allowed_domains restricts crawling to these domains (domain only, no scheme)
    allowed_domains = ['www.n360.cn']
    # start_urls defines where crawling begins
    start_urls = ['http://www.n360.cn']

    # parse receives the result returned by the Downloader
    def parse(self, response):
        with open('homepage', 'wb') as f:
            f.write(response.body)

3.4 Simple test: a file named homepage is generated in the current directory, containing the HTML source of the page

cd /Users/xietong/Desktop/mydemo/mydemo
scrapy crawl mydemo 

3.5 Understand Scrapy's XPath syntax

Expression                 Meaning
/html/head/title           Selects the <title> element inside the <head> of the HTML document
/html/head/title/text()    Selects the text of the above <title> element
//td                       Selects all <td> elements
//div[@class='mine']       Selects all <div> elements with a class='mine' attribute

3.6 Test Scrapy's XPath syntax in the shell

# Run in a terminal to enter Scrapy's interactive shell
$ scrapy shell "http://www.n360.cn"
# response            inspect the fetched page
# response.headers    the response headers of the page


# response.xpath('//title')  gets the page title; returns Selector objects


# response.xpath('//title').extract()  extract() converts the Selector objects into a list of strings


# response.xpath('//title/text()').extract()  the text content of the title tag


# response.xpath("//ul[@class='newslist']").extract()  gets the <ul> elements whose class attribute is newslist


# Get the <li> tags
sites = response.xpath('//ul[@class="newslist"]/li')
for site in sites:
	# Get the text of each <a> tag
	title = site.xpath('a/text()').extract()
	print(title)

3.7 Modify MydemoSpider.py to extract the items

import scrapy
from mydemo.items import DemoItem

class MydemoSpider(scrapy.Spider):
    # name identifies the spider; it is used in `scrapy crawl xx`
    name = "mydemo"
    # allowed_domains restricts crawling to these domains (domain only, no scheme)
    allowed_domains = ['www.n360.cn']
    # start_urls defines where crawling begins
    start_urls = ['http://www.n360.cn']

    # parse receives the result returned by the Downloader
    def parse(self, response):
        sites = response.xpath("//ul[@class='newslist']/li")
        items = []
        for site in sites:
            item = DemoItem()
            item['title'] = site.xpath('a/text()').extract()[0]
            # urljoin resolves relative hrefs against the page URL
            item['link'] = response.urljoin(site.xpath('a/@href').extract()[0])
            items.append(item)
        return items

3.8 Full project test: export the results as a JSON file with an explicit encoding

# Export the results returned by parse in JSON format
scrapy crawl mydemo -o items.json -t json -s FEED_EXPORT_ENCODING=UTF-8
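Instead of passing `-s` on the command line every time, the encoding can be set once in the project's settings.py (FEED_EXPORT_ENCODING is a standard Scrapy setting):

```python
# settings.py
FEED_EXPORT_ENCODING = 'utf-8'
```

With this in place, `scrapy crawl mydemo -o items.json` alone produces readable UTF-8 output instead of escaped \uXXXX sequences.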



Origin blog.csdn.net/q18729096963/article/details/106080362