Web crawler: Scrapy tag selectors, downloading images, and regex tag matching

The tag selector object

HtmlXPathSelector() creates a tag selector object; its html parameter receives the response object passed to the callback.
Required import: from scrapy.selector import HtmlXPathSelector

select() is the tag selector method of HtmlXPathSelector; its parameter receives a selection rule, and it returns a list whose elements are the matched tags.

extract() filters the content out of the selection and returns it as a list of strings.

Selection rules

  //x       search any number of levels down for the specified tag, e.g. //div finds all div tags
  /x        search one level down for the specified tag
  /@x       select the specified attribute, e.g. @id or @src; can be combined with the rules above
  [@class="name"]  select tags whose attribute equals the specified value, e.g. tags whose class equals the given name; can be combined
  /text()   get the text content of the tag
  [x]       get an element from the matched set by index (XPath indices start at 1)

Getting the specified tag objects

# -*- coding: utf-8 -*-
import scrapy                                   # import the crawler module
from scrapy.selector import HtmlXPathSelector   # import the HtmlXPathSelector module
from urllib import request                      # import the request module
import os

class AdcSpider(scrapy.Spider):
    name = 'adc'                                        # set the spider name
    allowed_domains = ['www.shaimn.com']
    start_urls = ['http://www.shaimn.com/xinggan/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)               # create an HtmlXPathSelector object, passing in the page response object

        items = hxs.select('//div[@class="showlist"]/li')  # tag selector: get the li tags under every div whose class equals showlist
        print(items)                                       # prints the matched tag objects

Looping over each li sub-tag to get its attributes or text

# -*- coding: utf-8 -*-
import scrapy                                   # import the crawler module
from scrapy.selector import HtmlXPathSelector   # import the HtmlXPathSelector module
from urllib import request                      # import the request module
import os

class AdcSpider(scrapy.Spider):
    name = 'adc'                                        # set the spider name
    allowed_domains = ['www.shaimn.com']
    start_urls = ['http://www.shaimn.com/xinggan/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)               # create an HtmlXPathSelector object, passing in the page response object

        items = hxs.select('//div[@class="showlist"]/li')  # tag selector: get the li tags under every div whose class equals showlist
        # print(items)                                     # prints the matched tag objects
        for i in range(1, len(items) + 1):                 # loop once per li tag; XPath indices start at 1
            title = hxs.select('//div[@class="showlist"]/li[%d]//img/@alt' % i).extract()  # use the loop counter as the index to get the alt attribute of the img inside the current li
            src = hxs.select('//div[@class="showlist"]/li[%d]//img/@src' % i).extract()    # same index, but the src attribute of the img
            if title and src:
                print(title, src)  # prints the content lists

Downloading the acquired images to local disk

urlretrieve() saves a file to local disk: parameter 1 is the src of the file to save, parameter 2 is the save path.
urlretrieve is a method in urllib's request module; it requires: from urllib import request
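A standalone sketch of that download step (the helper name and directory here are my own, not from the spider): creating the target directory first matters, because urlretrieve raises FileNotFoundError when the folder does not exist yet:

```python
import os
from urllib import request

def save_image(src, title, out_dir='img'):
    """Save src as out_dir/<title>.jpg (hypothetical helper for illustration)."""
    os.makedirs(out_dir, exist_ok=True)          # create the folder on first run
    file_path = os.path.join(out_dir, title + '.jpg')
    request.urlretrieve(src, file_path)          # parameter 1: the src, parameter 2: the save path
    return file_path
```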

# -*- coding: utf-8 -*-
import scrapy                                   # import the crawler module
from scrapy.selector import HtmlXPathSelector   # import the HtmlXPathSelector module
from urllib import request                      # import the request module
import os

class AdcSpider(scrapy.Spider):
    name = 'adc'                                        # set the spider name
    allowed_domains = ['www.shaimn.com']
    start_urls = ['http://www.shaimn.com/xinggan/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)               # create an HtmlXPathSelector object, passing in the page response object

        items = hxs.select('//div[@class="showlist"]/li')  # tag selector: get the li tags under every div whose class equals showlist
        # print(items)                                     # prints the matched tag objects
        for i in range(1, len(items) + 1):                 # loop once per li tag; XPath indices start at 1
            title = hxs.select('//div[@class="showlist"]/li[%d]//img/@alt' % i).extract()  # use the loop counter as the index to get the alt attribute of the img inside the current li
            src = hxs.select('//div[@class="showlist"]/li[%d]//img/@src' % i).extract()    # same index, but the src attribute of the img
            if title and src:
                # print(title[0], src[0])                                       # index into the lists to get the strings
                img_dir = os.path.join(os.getcwd(), 'img')
                os.makedirs(img_dir, exist_ok=True)                             # make sure the img directory exists
                file_path = os.path.join(img_dir, title[0] + '.jpg')            # build the image save path
                request.urlretrieve(src[0], file_path)                          # save the image locally: parameter 1 is the src, parameter 2 the save path

xpath() is the tag selector method of the Selector class; its argument is a selection rule [recommended]

The selection rules are the same as above.

Selector() creates a selector object; it needs to receive the html response.
Required import: from scrapy.selector import Selector

# -*- coding: utf-8 -*-
import scrapy                                   # import the crawler module
from scrapy.selector import Selector            # import the Selector module

class AdcSpider(scrapy.Spider):
    name = 'adc'                                        # set the spider name
    allowed_domains = ['www.shaimn.com']
    start_urls = ['http://www.shaimn.com/xinggan/']

    def parse(self, response):
        items = Selector(response=response).xpath('//div[@class="showlist"]/li').extract()
        # print(items)                                   # prints the matched tag objects
        for i in range(1, len(items) + 1):               # XPath indices start at 1
            title = Selector(response=response).xpath('//div[@class="showlist"]/li[%d]//img/@alt' % i).extract()
            src = Selector(response=response).xpath('//div[@class="showlist"]/li[%d]//img/@src' % i).extract()
            print(title, src)

Applying regular expressions

Regular expressions fill the gap when selector rules alone cannot filter precisely enough.

There are two ways to use them:

  1. regex-match the results already filtered out by the selection rules

  2. apply the regex inside the selection rule itself

1. Regex-matching the results filtered out by the selection rules, taking the final content with the regex

Append .re('regex') at the end

# -*- coding: utf-8 -*-
import scrapy                                   # import the crawler module
from scrapy.selector import Selector            # import the Selector module

class AdcSpider(scrapy.Spider):
    name = 'adc'                                        # set the spider name
    allowed_domains = ['www.shaimn.com']
    start_urls = ['http://www.shaimn.com/xinggan/']

    def parse(self, response):
        items = Selector(response=response).xpath('//div[@class="showlist"]/li//img')[0].extract()
        print(items)                                     # prints the matched tag source
        items2 = Selector(response=response).xpath('//div[@class="showlist"]/li//img')[0].re(r'alt="(\w+)')
        print(items2)

# <img src="http://www.shaimn.com/uploads/170724/1-1FH4221056141.jpg" alt="人体艺术mmSunny前凸后翘性感诱惑写真">
# ['人体艺术mmSunny前凸后翘性感诱惑写真']

2. Applying the regex inside the selection rule

Syntax: [re:test(@attribute, "regex")]

# -*- coding: utf-8 -*-
import scrapy                                   # import the crawler module
from scrapy.selector import Selector            # import the Selector module

class AdcSpider(scrapy.Spider):
    name = 'adc'                                        # set the spider name
    allowed_domains = ['www.shaimn.com']
    start_urls = ['http://www.shaimn.com/xinggan/']

    def parse(self, response):
        items = Selector(response=response).xpath('//div').extract()
        # print(items)                                     # prints the matched tag objects
        items2 = Selector(response=response).xpath('//div[re:test(@class, "showlist")]').extract()  # use the regex to find the div elements whose class matches showlist
        print(items2)

Origin blog.51cto.com/14510224/2432988