[Crawler] Python: a crawler for the Mzitu street-snap (jiepai) galleries (being a gentleman, you know)

Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this statement when reproducing it.
Original link: https://blog.csdn.net/qq_43017750/article/details/90345965

Every gentleman has presumably seen the Mzitu galleries, but how do you save the pictures of the young ladies inside?
In this post the blogger explains how to write a crawler for the beautiful street-snap photos in Mzitu's jiepai section.
Gentlemen, please board as soon as possible; this train is bound for the kindergarten. O(////▽////)q, you know!!!

Prerequisites

Gentlemen, you are all aboard now! Good: the blogger has welded the doors shut; no one gets off before the kindergarten. ヾ(•ω•`)o

To digest this tutorial safely, you will need the following:
Python (Version: 3.7)
Requests (Version: 2.21.0)
lxml (Version: 4.3.3)
Firefox or Chrome browser
A Windows 7 or Windows 10 computer
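
As a quick sanity check before continuing, you can confirm the two libraries are importable and print their versions (a minimal sketch; the versions in the comments are simply the ones this post was written against):

# Environment check: both imports must succeed for the code below to run
import requests
from lxml import etree

print(requests.__version__)                    # e.g. 2.21.0
print(".".join(map(str, etree.LXML_VERSION)))  # e.g. 4.3.3.0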

Requirements analysis

Crawl all the welfare pictures in Mzitu's street-snap (jiepai) section and save them to a local folder named picture.
URL of the jiepai section: https://www.mzitu.com/jiepai/comment-page-1/#comments

Page analysis

First, open the jiepai section URL, then right-click a picture and choose "Inspect Element" to locate the image link.
[screenshot: the page with developer tools open] Gentlemen, look at the page code, not at the young lady (lll¬ω¬)!
You can see that every picture on the page is an img tag whose class attribute is lazy and whose real address sits in the data-original attribute, so it is easy to write an XPath expression that matches all the pictures on a page. Since we want to crawl every page, we also need to implement paging, so let's analyze how the URL changes when we move to the next page.
[screenshot: address bar] URL of the first page:
https://www.mzitu.com/jiepai/comment-page-1/#comments
URL of the second page:
https://www.mzitu.com/jiepai/comment-page-2/#comments
So on page N, the number after comment-page- simply becomes N.
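
That observation is all the paging logic we need. A minimal sketch of expanding such a template with .format (the same technique the code below uses):

# Build the URL for each page by filling the template with the page number
template = 'https://www.mzitu.com/jiepai/comment-page-{}/#comments'
for n in range(1, 4):
    print(template.format(n))
# https://www.mzitu.com/jiepai/comment-page-1/#comments
# https://www.mzitu.com/jiepai/comment-page-2/#comments
# https://www.mzitu.com/jiepai/comment-page-3/#comments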

Writing the code

1. Setup and parameters

# -*- coding:utf-8 -*-
# Author: 猫先生的早茶
# Date: May 19, 2019
import requests
from lxml import etree

"""
设置浏览器头部,
User-Agent用于表示浏览器的参数信息
Referer用于设置使用那个网页跳转过来的
url用于设置网址模板,可以通过.format参数补充网址
"""
header = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:66.0) Gecko/20100101 Firefox/66.0",
          "Referer":"https://www.mzitu.com/jiepai/comment-page-1/",}
url = 'https://www.mzitu.com/jiepai/comment-page-{}/#comments'
name = 0

# -*- coding:utf-8 -*- sets the file's default encoding to UTF-8.
import requests imports the requests module; if it is missing, it can be installed with pip install requests.
from lxml import etree imports lxml's etree module, which converts a page into an element tree so the image links can be matched.
header stores the request headers used to simulate a browser.
url stores the URL template, which makes it easy to build each page's address later.
name tracks the number of the picture currently being saved.

2. Downloading a page

def get_html(url):
    """获取网页代码并以返回值的形式弹出"""
    html = requests.get(url,headers=header).text
    return html

The function get_html downloads a page's source code; it takes a URL as its argument.
It calls the get function of the requests module with that URL, passing header into the keyword parameter headers as the browser headers, saves the response body as text in the variable html, and hands the page content back with return.
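
As written, the function has no timeout and silently accepts HTTP error pages. A hardened variant is sketched below; the name get_html_safe and the 10-second timeout are my own choices, not part of the original post:

def get_html_safe(url):
    """Like get_html, but fail fast on network stalls and HTTP errors."""
    resp = requests.get(url, headers=header, timeout=10)
    resp.raise_for_status()  # raise an exception on 4xx/5xx status codes
    return resp.text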

3. Downloading a picture

def get_img(url):
    """下载图片并保存到指定文件夹下"""
    global name
    name +=1
    img_name = 'picture\\{}.jpg'.format(name)
    img = requests.get(url,headers=header).content
    with open (img_name,'wb') as save_img:
        save_img.write(img)

The function get_img downloads a picture; it takes an image URL as its argument.
global name lets the function modify the value of the global variable name.
name += 1 increases name by one, i.e. it is equivalent to name = name + 1.
The variable img_name builds the picture's file name from that number. The get function of the requests module fetches the image URL, again passing header into the keyword parameter headers; the response is read as binary data via content, saved in the variable img, and written into the picture folder. We can now fetch page source and download pictures; what is still missing is collecting all the image URLs.
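
One caveat: the code assumes a folder named picture already exists next to the script, and open() will fail if it does not. A one-line safeguard (my addition, using the standard library; it is not in the original code):

import os

os.makedirs('picture', exist_ok=True)  # create the target folder if it is missing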

4. Collecting the image links

def get_url(html):
    """获取图片链接并以返回值的形式弹出"""
    etree_html = etree.HTML(html)
    img_url = etree_html.xpath('//img[@class="lazy"]/@data-original')
    return img_url

The etree.HTML method converts the page source into etree-format data.
XPath then matches all the image links on the page, and the resulting list is handed back with return.
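
To see what that XPath matches, here is a self-contained sketch run against a tiny hand-written snippet (the example.com address is invented purely for illustration):

# Demonstrate the XPath on a minimal piece of HTML
from lxml import etree

sample = '''
<div class="comment">
  <img class="lazy" src="placeholder.gif"
       data-original="https://example.com/img/001.jpg"/>
</div>
'''
tree = etree.HTML(sample)
print(tree.xpath('//img[@class="lazy"]/@data-original'))
# ['https://example.com/img/001.jpg']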

5. The main function

def main():
    """Crawl every page with a for loop"""
    for n in range(1, 24):
        print("Crawling page {}".format(n))
        html = get_html(url.format(n))
        img_list = get_url(html)
        for img in img_list:
            get_img(img)
main()

The for loop gives the variable n values starting from 1 and stopping before 24, with the default positive step of 1, so it covers pages 1 through 23.
print("Crawling page {}".format(n)) reports which page is currently being crawled.
The value of n is then substituted into the url template to build the page's URL, which is passed into the get_html function; the downloaded page source is saved in the variable html.
The page source saved in html is passed into the get_url function, which extracts all the welfare picture links on the page.
A second for loop passes each extracted image link into get_img to download the picture.
Finally, main() is called to run the whole crawl.
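
One refinement worth considering: the loop hits the server as fast as it can, which is an easy way to get blocked. A hedged variant of main() that pauses between downloads (the half-second delay is an arbitrary choice of mine):

import time

def main_polite():
    """Variant of main() that sleeps briefly between downloads."""
    for n in range(1, 24):
        print("Crawling page {}".format(n))
        html = get_html(url.format(n))
        for img in get_url(html):
            get_img(img)
            time.sleep(0.5)  # half a second between requests, to go easy on the server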

Checking the result

Now let's execute the code and see what comes back.
[screenshot: console output]
It shows pages 1 through 23 being crawled. Let's take a look at the downloaded pictures; gentlemen, please hold on O(////▽////)q
[screenshot: the downloaded pictures]

The complete code

# -*- coding:utf-8 -*-
# Author: 猫先生的早茶
# Date: May 19, 2019
import requests
from lxml import etree


"""
设置浏览器头部,
User-Agent用于表示浏览器的参数信息
Referer用于设置使用那个网页跳转过来的
url用于设置网址模板,可以通过.format参数补充网址
"""
header = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:66.0) Gecko/20100101 Firefox/66.0",
          "Referer":"https://www.mzitu.com/jiepai/comment-page-1/",}
url = 'https://www.mzitu.com/jiepai/comment-page-{}/#comments'
name = 0

def get_html(url):
    """Fetch the page source and return it."""
    html = requests.get(url, headers=header).text
    return html


def get_img(url):
    """Download a picture and save it to the target folder."""
    global name
    name += 1
    img_name = 'picture\\{}.jpg'.format(name)
    img = requests.get(url, headers=header).content
    with open(img_name, 'wb') as save_img:
        save_img.write(img)

def get_url(html):
    """获取图片链接并以返回值的形式弹出"""
    etree_html = etree.HTML(html)
    img_url = etree_html.xpath('//img[@class="lazy"]/@data-original')
    return img_url


def main():
    """Crawl every page with a for loop"""
    for n in range(1,24):
        print ("正在爬取第{}页".format(n))
        html = get_html(url.format(n))
        img_list = get_url(html)
        for img in img_list:
            get_img(img)
main()
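
A small idiomatic refinement (my suggestion, not in the original): guard the entry point so the crawl does not start if the file is ever imported as a module:

if __name__ == '__main__':
    main()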
