Python crawler example: grabbing beautiful pictures
- ==This crawler is for adults with a certain degree of self-control (it involves some sexy pictures and is intended only as a crawler case study)==
- Foreword
- Goal
- Approach
- Steps
- Result
This crawler is for adults with a certain degree of self-control (it involves some sexy pictures and is intended only as a crawler case study)
Foreword
Recently, thesis writing has been wrecking my mood, and I felt I needed something fun to relax. While browsing the web I came across a piece of picture-download software: [秀人网] Meitu Download v1.1 by Xiaogucheng,
but it did not work well when I tried it, so I decided to do it myself and provide for myself (mainly for the young ladies in it, which are what I want).
Well fed and warmly clothed, the mind turns to desire; do it yourself and you will have plenty
Goal
Use a crawler to fetch the young ladies on [秀人网] and give them a warm home
It is no pity that my clothes are stained, so long as my wish is not denied
Approach
- Search Baidu for Xiuren.com and get the URL https://www.xrmn5.com/XiuRen/ ; the home page is https://www.xrmn5.com/
- Press F12 to inspect the structure of the web page, then extract the links and structure with regular expressions
- Use PyCharm + Python 3.7 to grab what we want
- Once everything is fetched, we are done
A riot of blossoms begins to dazzle the eye
Steps
Step 1: Inspect the web page structure
The almighty F12
A rough look at the web page
By "inspecting" the home page, we get this:
<a href="/XiuRen/2021/20219023.html" alt="[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" title="[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P"><img src="/UploadFile/pic/9023.jpg" style="opacity:1;display:inline;">
<div class="postlist-imagenum"><span>芝芝</span></div></a>
Then, by clicking on <img src="/UploadFile/pic/9023.jpg"> with the mouse,
we can get the following information.
What do these two screenshots represent?
Corresponding web page structure
- Web page URL (i.e. the home page URL): https://www.xrmn5.com
- Key element of the home page, taking the picture of the lady above as an example: src="/UploadFile/pic/9023.jpg"
- The web page address corresponding to the picture: href="/XiuRen/2021/20219023.html"
- The title corresponding to the picture: [XiuRen秀人网]No.3825_Goddess Zhizhi Booty, Jiangsu/Zhejiang/Shanghai travel shoot, light hanging skirt, ultra-thin stockings, temptation photo set, 81P (not suitable for children)
- The person corresponding to the picture: <span>芝芝</span> (Zhizhi)
A closer look at the picture information
Looking more closely at this picture, we find that below the title there are also a date and (what appears to be) the number of viewers.
The specific web page code is as follows:
<div class="case_info" style="background-color: rgb(204, 232, 207);"><div class="meta-title">[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P</div>
<div class="meta-post"><i class="fa fa-clock-o"></i>2021.09.01<span class="cx_like"><i class="fa fa-eye"></i>101</span></div></div>
Corresponding web page structure
From this it follows that:
- Image creation date: the text between `</i>` and `<span`, i.e. 2021.09.01
- Image views: the text between `</i>` and `</span>`, i.e. 101
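As a quick check, the date and view count can be pulled out of that snippet with a small regular expression. The HTML string below is copied from the snippet above; the pattern is my own sketch of the rule used later in the post:

```python
import re

# Album-card snippet copied from the page code shown above
html = ('<div class="meta-post"><i class="fa fa-clock-o"></i>2021.09.01'
        '<span class="cx_like"><i class="fa fa-eye"></i>101</span></div>')

# Capture the date after the clock icon and the count after the eye icon
pattern = ('<i class="fa fa-clock-o"></i>(.*?)'
           '<span class="cx_like"><i class="fa fa-eye"></i>(.*?)</span>')
date, views = re.findall(pattern, html)[0]
print(date, views)  # 2021.09.01 101
```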
Step 2: Obtain the corresponding web page
Development environment:
Windows 10 64-bit Professional + PyCharm + Python 3.7
import os
import time
import requests
import re
where:
- os is used for path handling
- time is used for delays
- requests is used to fetch web page content
- re is used for parsing
Web page information
From Step 1 (inspecting the web page structure) we know:
- The home page image link is the home link + the link in src, namely https://www.xrmn5.com/UploadFile/pic/9023.jpg (the actual image URL is https://pic.xrmn5.com/Uploadfile/pic/9023.jpg, but the link above also works)
- The name of the album corresponding to the picture: [XiuRen秀人网]No.3825_Goddess Zhizhi Booty, Jiangsu/Zhejiang/Shanghai travel shoot, light hanging skirt, ultra-thin stockings, temptation photo set, 81P
- The URL of the album corresponding to the picture is the home link + the link in href, namely https://www.xrmn5.com/XiuRen/2021/20219023.html
- The person corresponding to the picture: 芝芝 (Zhizhi)
- Image creation date: 2021.09.01
Getting it with Python:
Step 1: Get the web page
Through requests.get we can fetch the page. Because it contains Chinese, set the encoding to 'utf-8', then take the text for display (headers is the request-header information of the current page, obtained via F12). With that, the home page HTML is stored in Get_html:
'''
Step 1: request the web page
'''
import requests
# request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 Edg/92.0.902.84'
}
Get_url = requests.get('https://www.xrmn5.com/',headers=headers)
Get_url.encoding = 'utf-8'
# print(Get_url.text)
# print(Get_url.request.headers)
Get_html = Get_url.text
Step 2: Parse the web page
At first I used re.findall(rule, data), but that felt clumsy, so I switched to re.compile(rule, re.S).findall(data).
The focus is on the rule part:
- urls captures, for each picture, the album page link (the link in href), the title, and the path of the picture itself
- inforName captures the person and title information
- likeNum captures the creation date and number of viewers of the album
With that, all the relevant information has been obtained.
'''
Step 2: parse the web page
'''
import re
# Notes on the regular expressions:
# (.*?) captures everything inside the parentheses
# \"(.*?)\" is used to match quoted attributes in the page
# re.findall returns the captured groups, one tuple per match
urls = re.findall('<li class="i_list list_n2"><a href=\"(.*?)\" alt=(.*?) title=.*?><img src=\"(.*?)\"',Get_html)
patren1 = '<div class="postlist-imagenum"><span>(.*?)</span></div></a><div class="case_info"><div class="meta-title">\[.*?\](.*?)</a></div>'
patren2 = '<div class="meta-post"><i class="fa fa-clock-o"></i>(.*?)<span class="cx_like"><i class="fa fa-eye"></i>(.*?)</span>'
inforName = re.compile(patren1,re.S).findall(Get_html)
likeNum = re.compile(patren2,re.S).findall(Get_html)
Step 3: Get the home page cover images
First, specify the picture storage directory (I chose it myself):
dir = r"D:/Let'sFunning/Picture/PythonGet/"
Then I added a condition: I only fetch the cover picture for albums with more than 500 viewers, and save it under
dir / person name / date / album name.
os.makedirs() is used to create the folders.
urls[i][2].split('/')[-1] takes the last segment of the picture path (e.g. 9023.jpg from /UploadFile/pic/9023.jpg)
and uses it as the file name.
Finally, splice the picture path onto the site URL to get the picture's full URL, fetch it with requests.get(), and write it to disk.
'''
Step 3: save the covers
'''
import os
import time
dir = r"D:/Let'sFunning/Picture/PythonGet/"
url = "https://pic.xrmn5.com"
# Create directory: person name + date + album name
num = len(likeNum)
for i in range(num):
    if int(likeNum[i][1]) > 500:
        getImgDir = dir+str(inforName[i][0])+'/'+str(likeNum[i][0])+'/'+str(inforName[i][1]+'/')
        # Create the corresponding directory
        if not os.path.exists(getImgDir):
            os.makedirs(getImgDir)
        imgUrl = url+urls[i][2]
        imgName = getImgDir+urls[i][2].split('/')[-1]
        print(imgName)
        time.sleep(1)
        # Fetch the cover image
        Get_Img = requests.get(imgUrl, headers=headers)
        with open(imgName,'wb') as f:
            f.write(Get_Img.content)
        # (Next: enter the specific album page)
At this point, the cover images of albums with more than 500 viewers on the home page have been saved to the corresponding directories.
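A side note on the path handling: the string concatenation above works, but os.path.join plus makedirs(exist_ok=True) is a more robust sketch of the same directory logic. The person/date/album values here are hypothetical stand-ins, and a temporary directory stands in for the D: drive path:

```python
import os
import tempfile

# Hypothetical values standing in for inforName[i][0], likeNum[i][0], inforName[i][1]
person, date, album = 'Zhizhi', '2021.09.01', 'No.3825'
base = tempfile.mkdtemp()  # stands in for dir = r"D:/Let'sFunning/Picture/PythonGet/"

# os.path.join inserts separators, so no trailing-'/' bookkeeping is needed
getImgDir = os.path.join(base, person, date, album)
os.makedirs(getImgDir, exist_ok=True)  # exist_ok replaces the os.path.exists check
print(os.path.isdir(getImgDir))  # True
```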
Step 3: Get the album pictures
Through Step 2 we successfully fetched the pictures of the young ladies on the home page, but this is not really crawling yet: we only grabbed a few cover pictures. We should go deeper and fetch the complete photo sets.
Purpose
Get the full photo sets with more than 500 viewers, not just the covers.
Approach
Step 1: Obtain the link of the web page where the photo set is located.
That is, from the web page information of Step 2:
- The URL of the album corresponding to the picture is the home link + the link in href, namely https://www.xrmn5.com/XiuRen/2021/20219023.html
Step 2: Analyze the web page corresponding to the photo set.
With the same F12 trick, we can see that there are three pictures on this page;
the corresponding code for the pictures is as follows:
<p style="text-align: center"><img onload="size(this)" alt="Xrmn.Top_[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" title="Xrmn.Top_[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" src="/uploadfile/202109/1/47201045101.jpg"><br>
<br>
<img onload="size(this)" alt="Xrmn.Top_[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" title="Xrmn.Top_[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" src="/uploadfile/202109/1/07201045631.jpg"><br>
<br>
<img onload="size(this)" alt="Xrmn.Top_[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" title="Xrmn.Top_[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" src="/uploadfile/202109/1/16201045377.jpg"><br>
<br>
</p>
The corresponding page code is as follows:
<div class="page"><a href="/XiuRen/2021/20219023.html" class="current">1</a><a href="/XiuRen/2021/20219023_1.html">2</a><a href="/XiuRen/2021/20219023_2.html">3</a><a href="/XiuRen/2021/20219023_3.html">4</a><a href="/XiuRen/2021/20219023_4.html">5</a><a href="/XiuRen/2021/20219023_5.html">6</a><a href="/XiuRen/2021/20219023_6.html">7</a><a href="/XiuRen/2021/20219023_7.html">8</a><a href="/XiuRen/2021/20219023_8.html">9</a><a href="/XiuRen/2021/20219023_9.html">10</a><a href="/XiuRen/2021/20219023_10.html">11</a><a href="/XiuRen/2021/20219023_11.html">12</a><a href="/XiuRen/2021/20219023_12.html">13</a><a href="/XiuRen/2021/20219023_13.html">14</a><a href="/XiuRen/2021/20219023_14.html">15</a><a href="/XiuRen/2021/20219023_15.html">16</a><a href="/XiuRen/2021/20219023_16.html">17</a><a href="/XiuRen/2021/20219023_17.html">18</a><a href="/XiuRen/2021/20219023_18.html">19</a><a href="/XiuRen/2021/20219023_19.html">20</a><a href="/XiuRen/2021/20219023_20.html">21</a><a href="/XiuRen/2021/20219023_21.html">22</a><a href="/XiuRen/2021/20219023_22.html">23</a><a href="/XiuRen/2021/20219023_23.html">24</a><a href="/XiuRen/2021/20219023_24.html">25</a><a href="/XiuRen/2021/20219023_25.html">26</a><a href="/XiuRen/2021/20219023_26.html">27</a><a href="/XiuRen/2021/20219023_1.html">下页</a></div>
Step 3: The information in the photo set's web page.
From the two screenshots above and the corresponding code, we know that:
- this set has 27 pages in total (the last numbered link reads >27</a>)
- each page has three images: src="/uploadfile/202109/1/47201045101.jpg", src="/uploadfile/202109/1/07201045631.jpg", src="/uploadfile/202109/1/16201045377.jpg"
- the links to the remaining pages all look like href="/XiuRen/2021/20219023_2.html"
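Those observations translate directly into regular expressions. A minimal sketch against shortened copies of the snippets above (these exact patterns are my own illustration, not the ones used in the code below):

```python
import re

# Shortened copies of the pagination and image snippets shown above
page_html = ('<div class="page"><a href="/XiuRen/2021/20219023.html" class="current">1</a>'
             '<a href="/XiuRen/2021/20219023_1.html">2</a>'
             '<a href="/XiuRen/2021/20219023_1.html">下页</a></div>')
img_html = ('<img onload="size(this)" alt="..." title="..." '
            'src="/uploadfile/202109/1/47201045101.jpg">')

# Numbered page links only (the trailing 下页 "next page" link has no digits)
pages = re.findall(r'<a href="([^"]*)"[^>]*>[0-9]+</a>', page_html)
# Image paths on one page
imgs = re.findall(r'<img onload=.*? src="(.*?)"', img_html)
print(pages)  # ['/XiuRen/2021/20219023.html', '/XiuRen/2021/20219023_1.html']
print(imgs)   # ['/uploadfile/202109/1/47201045101.jpg']
```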
Code
The idea is very simple: from the original page, go one level deeper into the corresponding album page and loop to fetch the pictures.
The first part is similar to before; I pulled the home page out into a variable, WebURL, and added a new rule, patren3, used to capture the three pictures on each page of the set.
'''
Step 1: request the web page
'''
import requests
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 Edg/92.0.902.84'
}
WebURL = "https://www.xrmn5.com/"
Get_url = requests.get(WebURL,headers=headers)
# print(Get_url.text)
# print(Get_url.request.headers)
Get_html = Get_url.text
'''
Step 2: parse the web page
'''
import re
# Notes on the regular expressions:
# (.*?) captures everything inside the parentheses
# \"(.*?)\" is used to match quoted attributes in the page
# re.findall returns the captured groups, one tuple per match
urls = re.findall('<li class="i_list list_n2"><a href=\"(.*?)\" alt=(.*?) title=.*?><img src=\"(.*?)\"',Get_html)
patren1 = '<div class="postlist-imagenum"><span>(.*?)</span></div></a><div class="case_info"><div class="meta-title">\[.*?\](.*?)</a></div>'
patren2 = '<div class="meta-post"><i class="fa fa-clock-o"></i>(.*?)<span class="cx_like"><i class="fa fa-eye"></i>(.*?)</span>'
inforName = re.compile(patren1,re.S).findall(Get_html)
likeNum = re.compile(patren2,re.S).findall(Get_html)
# New parsing rule for the image links inside a set's page, e.g.:
# <img onload="size(this)" alt=.*? title=.*? src="/uploadfile/202109/1/07201045631.jpg" />
patren3 = '<img onload=.*? alt=.*? title=.*? src=\"(.*?)\" />'
'''
Step 3: parse the web page further
'''
'''
Step 3: save the covers
'''
import os
import time
dir = r"D:/Let'sFunning/Picture/PythonGet/"
url = "https://pic.xrmn5.com"
Looping to fetch the rest
The main ideas here are as follows:
- Enter the specific album page, i.e. call requests.get() again
- Parse the page obtained this time, and store all the page paths of the set in AllPage. (Note: my regular expression has a small problem. It also captures the final link, i.e. a href="/XiuRen/2021/20219023_1.html">下页</a ("next page"), so I added a break to skip it, which means the last page of each set is never fetched and three pictures are lost at the end. The break could be placed after the for loop body instead, but I was too lazy to change it.)
- Then fetch the pictures of each page in a loop: throw each page's picture link into ==GetPageImg==, then requests.get() it and save. Done.
# Create directory: person name + date + album name
num = len(likeNum)
for i in range(num):
    if int(likeNum[i][1]) > 500:
        getImgDir = dir+str(inforName[i][0])+'/'+str(likeNum[i][0])+'/'+str(inforName[i][1]+'/')
        # Create the corresponding directory
        if not os.path.exists(getImgDir):
            os.makedirs(getImgDir)
        imgUrl = url+urls[i][2]
        imgName = getImgDir+urls[i][2].split('/')[-1]
        print(imgName)
        time.sleep(1)
        # Fetch the cover image
        Get_Img = requests.get(imgUrl, headers=headers)
        with open(imgName,'wb') as f:
            f.write(Get_Img.content)
        # Enter the specific album page
        IntoPageUrl = WebURL + urls[i][0]
        Get_InPage = requests.get(IntoPageUrl, headers=headers)
        Get_InPage.encoding = 'utf-8'
        Get_InPagehtml = Get_InPage.text
        AllPage = re.findall('</a><a href=\"(.*?)\">([0-9]*)', Get_InPagehtml)
        for k in range(len(AllPage)):
            if k == len(AllPage) - 1:
                # The last entry is the "next page" link captured by mistake; skip it
                break
            else:
                imgPageUrl = re.compile(patren3, re.S).findall(Get_InPagehtml)
                PageNum = len(imgPageUrl)
                # Loop to fetch and save the pictures of this page
                for l in range(PageNum):
                    GetPageImg = url+imgPageUrl[l]
                    print(GetPageImg)
                    PageImgeName = getImgDir+imgPageUrl[l].split('/')[-1]
                    print(PageImgeName)
                    time.sleep(1)
                    # Fetch the picture
                    Get_PImg = requests.get(GetPageImg, headers=headers)
                    with open(PageImgeName, 'wb') as f:
                        f.write(Get_PImg.content)
                # Move on to the next page of the set
                NewPaperUrl = WebURL + AllPage[k][0]
                time.sleep(1)
                Get_InPage = requests.get(NewPaperUrl, headers=headers)
                Get_InPage.encoding = 'utf-8'
                Get_InPagehtml = Get_InPage.text
At this point, the photo sets with more than 500 viewers on the home page have been saved to the corresponding directories.
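On the break flaw noted above (the 下页 "next page" link sneaks into AllPage, so the last page of each set is skipped): one possible cleanup is to filter the captured tuples by their numeric label instead of breaking early. A sketch with a made-up list shaped like the output of the AllPage regex:

```python
# (href, label) tuples as captured by '</a><a href=\"(.*?)\">([0-9]*)';
# the trailing 下页 ("next page") entry has an empty numeric label
AllPage = [('/XiuRen/2021/20219023_1.html', '2'),
           ('/XiuRen/2021/20219023_2.html', '3'),
           ('/XiuRen/2021/20219023_1.html', '')]

# Keep only links whose label is a real page number; no break needed,
# so the last page of the set is no longer dropped
pages = [href for href, label in AllPage if label.isdigit()]
print(pages)  # ['/XiuRen/2021/20219023_1.html', '/XiuRen/2021/20219023_2.html']
```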
Step 4: Obtain the album pictures of the whole site
Problem
OK, now we can fetch whole photo sets, but we noticed that we only got the sets with more than 500 viewers on the home page. The actual listing spans far more than one page, so we need to find all the sets on the site.
We find that entering the page https://www.xrmn5.com/XiuRen/ shows 128 pages in total.
This is what we are going to crawl.
Purpose
Obtain all the pictures under https://www.xrmn5.com/XiuRen/
Approach
The idea is similar, except that before processing the listing pages, we first fetch the data of this index page, then loop over the pages.
Code
Step 1: Get all the pages of https://www.xrmn5.com/XiuRen/
Specifically, add a rule patrenForPageNum, then extract the highest page number reachable from the current page. Matching out just the digits:
PageNum = "".join(list(filter(str.isdigit, temp)))
Then format and splice together all the listing-page URLs and save them in ==GetAllPage==.
import os
import time
dir = r"D:/Let'sFunning/Picture/PythonGet/"
url = "https://pic.xrmn5.com"
import requests
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 Edg/92.0.902.84'
}
URL = "https://www.xrmn5.com/XiuRen/"
WebURL = "https://www.xrmn5.com/"
Get_url = requests.get(URL,headers=headers)
Get_url.encoding = 'utf-8'
Get_html = Get_url.text
print(Get_html)
import re
patrenForPageNum = '</a><a href=\"(.*?)\">'
Get_PageNum = re.compile(patrenForPageNum,re.S).findall(Get_html)
temp = str(Get_PageNum[len(Get_PageNum)-1])
PageNum = "".join(list(filter(str.isdigit, temp)))
print(temp)
print(PageNum)
# Build all the listing-page URLs and store them in GetAllPage
AllPageTemp = []
GetAllPage = ()
for i in range(int(PageNum)):
    if i > 0:
        AllPageTemp.append(WebURL+"/XiuRen/index"+str(i+1)+".html")
GetAllPage += tuple(AllPageTemp)
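A quick check of the digit-filter trick and the URL splicing above, in isolation (the 'index128.html' link text is a hypothetical stand-in for the last pagination link on the site):

```python
# The last pagination link ends with something like 'index128.html';
# keeping only the digits yields the page count as a string
temp = 'index128.html'
PageNum = "".join(filter(str.isdigit, temp))
print(PageNum)  # 128

# Build the listing-page URLs exactly as the loop above does (pages 2..N)
WebURL = "https://www.xrmn5.com/"
pages = [WebURL + "/XiuRen/index" + str(i + 1) + ".html"
         for i in range(int(PageNum)) if i > 0]
print(pages[0])   # https://www.xrmn5.com//XiuRen/index2.html (note the double slash)
print(len(pages)) # 127 entries, one fewer than PageNum -- the flaw noted below
```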
Step 2: Loop over PageNum to get the pictures
There is actually a problem here: the data of page 128 is never fetched. It is a small logical flaw, but basically every set with more than 500 views under https://www.xrmn5.com/XiuRen/ can be crawled.
for pagenum in range(int(PageNum)):
    urls = re.findall('<li class="i_list list_n2"><a href=\"(.*?)\" alt=(.*?) title=.*?><img class="waitpic" src=\"(.*?)\"', Get_html)
    patren1 = '<div class="postlist-imagenum"><span>(.*?)</span></div></a><div class="case_info"><div class="meta-title">\[.*?\](.*?)</a></div>'
    patren2 = '<div class="meta-post"><i class="fa fa-clock-o"></i>(.*?)<span class="cx_like"><i class="fa fa-eye"></i>(.*?)</span>'
    inforName = re.compile(patren1, re.S).findall(Get_html)
    likeNum = re.compile(patren2, re.S).findall(Get_html)
    print(urls)
    print(inforName)
    print(likeNum)
    num = len(likeNum)
    patren3 = '<img onload=.*? alt=.*? title=.*? src=\"(.*?)\" />'
    for i in range(num):
        if int(likeNum[i][1]) > 500:
            getImgDir = dir + str(inforName[i][0]) + '/' + str(likeNum[i][0]) + '/' + str(inforName[i][1] + '/')
            # Create the corresponding directory
            if not os.path.exists(getImgDir):
                os.makedirs(getImgDir)
            imgUrl = url + urls[i][2]
            imgName = getImgDir + urls[i][2].split('/')[-1]
            print(imgName)
            time.sleep(1)
            # Fetch the cover image
            Get_Img = requests.get(imgUrl, headers=headers)
            with open(imgName, 'wb') as f:
                f.write(Get_Img.content)
            # Enter the specific album page
            IntoPageUrl = WebURL + urls[i][0]
            Get_InPage = requests.get(IntoPageUrl, headers=headers)
            Get_InPage.encoding = 'utf-8'
            Get_InPagehtml = Get_InPage.text
            AllPage = re.findall('</a><a href=\"(.*?)\">([0-9]*)', Get_InPagehtml)
            for k in range(len(AllPage)):
                imgPageUrl = re.compile(patren3, re.S).findall(Get_InPagehtml)
                ImgCount = len(imgPageUrl)  # renamed from PageNum to avoid shadowing the outer page count
                # Loop to fetch and save the pictures of this page
                for l in range(ImgCount):
                    GetPageImg = url + imgPageUrl[l]
                    print(GetPageImg)
                    PageImgeName = getImgDir + imgPageUrl[l].split('/')[-1]
                    print(PageImgeName)
                    time.sleep(1)
                    # Fetch the picture
                    Get_PImg = requests.get(GetPageImg, headers=headers)
                    with open(PageImgeName, 'wb') as f:
                        f.write(Get_PImg.content)
                if k == len(AllPage) - 1:
                    break
                # Move on to the next page of the set
                NewPaperUrl = WebURL + AllPage[k][0]
                time.sleep(1)
                Get_InPage = requests.get(NewPaperUrl, headers=headers)
                Get_InPage.encoding = 'utf-8'
                Get_InPagehtml = Get_InPage.text
    # Fetch the next listing page
    # (note: on the last iteration this indexes past GetAllPage -- the flaw mentioned above)
    Get_url = requests.get(GetAllPage[pagenum], headers=headers)
    Get_url.encoding = 'utf-8'
    Get_html = Get_url.text
To do a good job, one must first sharpen one's tools
Result
Hello World!