Have you ever spotted a street-shot photo of a pretty girl and wondered how to save the picture? In this post, the blogger explains how to write a crawler for the street-shot gallery of beautiful pictures on the Meizitu site.
Gentlemen, please board the train as soon as possible; this one is bound straight for kindergarten. O(////▽////)q, you know!!!
Prerequisites
Gentlemen, you are all on the bus now! The blogger has welded the doors shut; nobody gets off before the kindergarten. ヾ(•ω•`)o
To digest this tutorial safely, you will need the following:
Python (Version: 3.7)
Requests (Version: 2.21.0)
lxml (Version: 4.3.3)
Firefox or Chrome
A Windows 7 or Windows 10 computer
Requirements analysis
Crawl every image in the street-shot gallery module and save the downloads to a local folder named picture.
Street-shot gallery URL: [https://www.mzitu.com/jiepai/comment-page-1/#comments]
Page analysis
First, open the street-shot gallery URL, then right-click an image and choose [Inspect Element] to find where the image link lives.
Gentlemen, look at the page code, not at the girls (lll¬ω¬)!
You can see that the page uses img tags whose class attribute is lazy, and the real image URL is stored in the data-original attribute, so it is easy to write an XPath expression that matches the pictures. That extracts every image on one page. Since we want to crawl all the pages, we also have to implement paging, so let's analyze how the URL changes from one page to the next.
URL of the first page:
[https://www.mzitu.com/jiepai/comment-page-1/#comments]
URL of the second page:
[https://www.mzitu.com/jiepai/comment-page-2/#comments]
As you can see, on page N the number after comment-page- in the URL becomes N.
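The pattern above means every page URL can be generated from a single template. A minimal sketch of that idea:

```python
url = 'https://www.mzitu.com/jiepai/comment-page-{}/#comments'

# Fill the page number into the template for pages 1 through 23.
page_urls = [url.format(n) for n in range(1, 24)]
print(page_urls[0])   # https://www.mzitu.com/jiepai/comment-page-1/#comments
print(page_urls[-1])  # https://www.mzitu.com/jiepai/comment-page-23/#comments
```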
Code walkthrough
1. Setup and parameters
# -*- coding:utf-8 -*-
# Author: 猫先生的早茶
# Date: 2019-05-19
import requests
from lxml import etree
"""
Set up the browser headers.
User-Agent carries the browser's identification string.
Referer tells the server which page we navigated from.
url is a URL template; the page number is filled in with .format().
"""
header = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:66.0) Gecko/20100101 Firefox/66.0",
          "Referer":"https://www.mzitu.com/jiepai/comment-page-1/",}
url = 'https://www.mzitu.com/jiepai/comment-page-{}/#comments'
name = 0
# -*- coding:utf-8 -*- sets the default source encoding to utf-8.
import requests imports the requests module; if it is missing, you can install it with pip install requests.
from lxml import etree imports lxml's etree module, which converts a page into etree format so the image links can be matched.
header stores the headers parameter used to simulate a browser.
url stores the URL template, which makes it easy to build each page's URL later.
name tracks how many pictures have been saved so far.
2. Downloading a page
def get_html(url):
    """Fetch the page source and return it"""
    html = requests.get(url,headers=header).text
    return html
Define a function get_html for downloading page source; it takes a URL as its parameter.
It calls the requests module's get function with that URL, passing header into the keyword parameter headers to simulate a browser, saves the response as text in the variable html, and hands the page source back with return.
3. Downloading an image
def get_img(url):
    """Download an image and save it to the target folder"""
    global name
    name += 1
    img_name = 'picture\\{}.jpg'.format(name)
    img = requests.get(url,headers=header).content
    with open(img_name,'wb') as save_img:
        save_img.write(img)
Define a function get_img for downloading an image; it takes an image URL as its parameter.
global name lets the function modify the value of the global variable name.
name += 1 increments the counter by one, so each picture gets a fresh number.
The variable img_name builds the picture's file name. requests.get fetches the URL with the same header, this time reading .content to get the binary data, which is saved into the variable img and then written into the picture folder. Now we can fetch page source and download images; we still need to collect all the image URLs.
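One caveat: open() fails if the picture folder does not exist yet, so it is worth creating it once before the crawl starts. A minimal sketch using the standard library:

```python
import os

# Create the target folder if it is missing; exist_ok=True
# avoids an error when it already exists.
os.makedirs('picture', exist_ok=True)
```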
4. Extracting image links
def get_url(html):
    """Extract the image links and return them"""
    etree_html = etree.HTML(html)
    img_url = etree_html.xpath('//img[@class="lazy"]/@data-original')
    return img_url
etree.HTML converts the page source into etree-format data.
xpath then matches out all the image links, which are handed back with return.
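To see that XPath in action on a made-up HTML fragment (mimicking the structure described earlier, not the real page):

```python
from lxml import etree

# A hypothetical fragment: lazy-loaded img tags whose real URL
# sits in the data-original attribute.
sample = '''
<div class="comments">
    <img class="lazy" src="placeholder.gif" data-original="https://example.com/1.jpg"/>
    <img class="lazy" src="placeholder.gif" data-original="https://example.com/2.jpg"/>
</div>
'''

tree = etree.HTML(sample)
links = tree.xpath('//img[@class="lazy"]/@data-original')
print(links)  # ['https://example.com/1.jpg', 'https://example.com/2.jpg']
```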
5. The main function
def main():
    '''Crawl every page with a for loop'''
    for n in range(1,24):
        print("Crawling page {}".format(n))
        html = get_html(url.format(n))
        img_list = get_url(html)
        for img in img_list:
            get_img(img)
main()
The for loop variable n starts at 1 and grows up to (but not including) 24, with the default step of 1.
print("Crawling page {}".format(n)) reports which page is currently being crawled.
The value of n is then filled into the url template, the completed URL is passed into get_html, and the downloaded page source is saved into the html variable.
The page source saved in html is passed into get_url, which extracts every image link on the page.
A second for loop takes each extracted image link and passes it into get_img to download the picture.
Finally, main() is called to run the whole thing.
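The flow of main() can be exercised without touching the network by injecting stub functions for fetching, extracting, and downloading. This is only a sketch of the same loop structure (the function names and delay parameter are hypothetical, not part of the original script); the optional pause between pages is a polite touch for the server:

```python
import time

def crawl(page_count, fetch_page, extract_urls, download, delay=0.0):
    """Iterate over pages, extract image links, download each one,
    pausing `delay` seconds between pages."""
    downloaded = []
    for n in range(1, page_count + 1):
        html = fetch_page(n)
        for img_url in extract_urls(html):
            download(img_url)
            downloaded.append(img_url)
        if delay:
            time.sleep(delay)
    return downloaded

# Dry run with stubs instead of real network calls.
pages = {1: ["a.jpg", "b.jpg"], 2: ["c.jpg"]}
result = crawl(2, lambda n: pages[n], lambda h: h, lambda u: None)
print(result)  # ['a.jpg', 'b.jpg', 'c.jpg']
```

Splitting the loop this way also makes the real crawler easier to test later: swap the stubs for get_html, get_url, and get_img.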
Verify the result
Now let's run the code and see what comes back.
The output shows pages 1 through 23 being crawled; while the pictures download, gentlemen, please hold on O(////▽////)q
The complete code
# -*- coding:utf-8 -*-
# Author: 猫先生的早茶
# Date: 2019-05-19
import requests
from lxml import etree
"""
Set up the browser headers.
User-Agent carries the browser's identification string.
Referer tells the server which page we navigated from.
url is a URL template; the page number is filled in with .format().
"""
header = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:66.0) Gecko/20100101 Firefox/66.0",
          "Referer":"https://www.mzitu.com/jiepai/comment-page-1/",}
url = 'https://www.mzitu.com/jiepai/comment-page-{}/#comments'
name = 0

def get_html(url):
    """Fetch the page source and return it"""
    html = requests.get(url,headers=header).text
    return html

def get_img(url):
    """Download an image and save it to the target folder"""
    global name
    name += 1
    img_name = 'picture\\{}.jpg'.format(name)
    img = requests.get(url,headers=header).content
    with open(img_name,'wb') as save_img:
        save_img.write(img)

def get_url(html):
    """Extract the image links and return them"""
    etree_html = etree.HTML(html)
    img_url = etree_html.xpath('//img[@class="lazy"]/@data-original')
    return img_url

def main():
    '''Crawl every page with a for loop'''
    for n in range(1,24):
        print("Crawling page {}".format(n))
        html = get_html(url.format(n))
        img_list = get_url(html)
        for img in img_list:
            get_img(img)
main()