A Python crawler example: grabbing pictures of pretty ladies (1)

This crawler involves adult-oriented content, so please exercise some self-restraint (it includes a few suggestive pictures and is intended purely as a crawler case study).

Foreword

Recently, writing my thesis has left me in a terrible mood, and I felt I needed something fun to unwind. Browsing the web, I came across a picture-downloading tool: [秀人网] Meitu Download Version 1.1, by Xiaogucheng,
but the trial didn't work well, so I decided to do it myself and provide for my own needs (mainly for the girls in it, which is what I was after).

Well fed and warm, the mind wanders to desire; do it yourself and you'll want for nothing.

Goal

Use a crawler to fetch the young ladies of [秀人网] and give them a warm home.

Wet clothes are no cause for regret, so long as the wish is not betrayed.

Approach

  1. Search Baidu for Xiuren.com to get the URL https://www.xrmn5.com/XiuRen/ and the home page https://www.xrmn5.com/
  2. Press F12 to inspect the page structure, then analyze it and extract the links and layout with regular expressions
  3. PyCharm + Python 3.7 to fetch whatever we want
  4. Fetch everything successfully, and we're done (a minimal skeleton of this plan follows below)
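
For orientation, here is a minimal skeleton of that plan (a sketch only: the regex is a placeholder, and the real patterns are derived via F12 later in this post):

# Minimal sketch of the overall flow: fetch the home page, pull out
# album links with a placeholder pattern, and show what was found.
import re
import requests

BASE = "https://www.xrmn5.com/"
headers = {'user-agent': 'Mozilla/5.0'}  # the full UA string appears later

html = requests.get(BASE, headers=headers).text
# Placeholder pattern for album paths like /XiuRen/2021/20219023.html
album_links = re.findall(r'href="(/XiuRen/\d{4}/\d+\.html)"', html)
print(album_links[:5])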

A riot of blossoms gradually dazzles the eye

Steps

Step 1: Check the web page structure

The almighty F12

A rough look at the page

By "inspecting" the home page, we get the following:
(Screenshot: home page F12 results)

<a href="/XiuRen/2021/20219023.html" alt="[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" title="[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P"><img src="/UploadFile/pic/9023.jpg" style="opacity:1;display:inline;">
<div class="postlist-imagenum"><span>芝芝</span></div></a>

Then, by clicking around with the mouse on <img src="/UploadFile/pic/9023.jpg">, we get the image link (screenshot).
What do these two screenshots tell us?

Corresponding web page structure

  • Site URL (i.e. the home page URL): https://www.xrmn5.com
  • Key element on the home page, taking the lady above as an example: src="/UploadFile/pic/9023.jpg" (a relative path resolved against the site URL, as sketched below)
  • Album page corresponding to the picture: href="/XiuRen/2021/20219023.html"
  • Title corresponding to the picture: [XiuRen秀人网]No.3825_Goddess Zhizhi Booty, Jiangsu-Zhejiang-Shanghai travel shoot, sheer slip dress with ultra-thin flesh-toned stockings, 81P (not safe for work)
  • Model corresponding to the picture: <span>芝芝</span> (Zhizhi)
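
As an aside, these relative paths can be resolved against the site URL without hand-splicing strings; a minimal sketch using urllib.parse.urljoin (the paths are the ones from the snippet above):

# Sketch: resolve the relative paths above against the site root.
from urllib.parse import urljoin

base = "https://www.xrmn5.com"
print(urljoin(base, "/UploadFile/pic/9023.jpg"))    # cover image URL
print(urljoin(base, "/XiuRen/2021/20219023.html"))  # album page URL

The code later in this post splices the strings by hand, which works equally well for these root-relative paths.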

Take a closer look at the picture information

Taking a closer look at this entry, we find that under the title there are also a date and (what seems to be) a view count.
(Screenshot: entry details)
The specific web page code is as follows

<div class="case_info" style="background-color: rgb(204, 232, 207);"><div class="meta-title">[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P</div>
<div class="meta-post"><i class="fa fa-clock-o"></i>2021.09.01<span class="cx_like"><i class="fa fa-eye"></i>101</span></div></div>

Corresponding web page structure

From this it follows that:

  • Publication date: 2021.09.01, sitting between the </i> and the <span>
  • View count: 101, inside the cx_like span

Step 2: Obtain the corresponding web page

Development environment:

Windows 10 64-bit Professional + PyCharm + Python 3.7

import os
import time
import requests
import re

where:

  • os for path handling
  • time for delays
  • requests for fetching web pages
  • re for parsing with regular expressions

Web page information

From Step 1 (checking the web page structure), we know:

  • The home page cover-image link is the home link plus the path in src, i.e. https://www.xrmn5.com/UploadFile/pic/9023.jpg (the image is actually served from https://pic.xrmn5.com/Uploadfile/pic/9023.jpg, but the link above works too)

  • The album name corresponding to the picture: [XiuRen秀人网]No.3825_Goddess Zhizhi Booty, Jiangsu-Zhejiang-Shanghai travel shoot, sheer slip dress with ultra-thin flesh-toned stockings, 81P

  • The album URL corresponding to the picture is the home link plus the path in href, i.e. https://www.xrmn5.com/XiuRen/2021/20219023.html

  • Model corresponding to the picture: Zhizhi

  • Publication date of the picture: 2021.09.01

Getting it with Python:

Step 1: Get the web page

With requests.get we can fetch the page; because it contains Chinese, we set the encoding to 'utf-8', and finally take the text for display (headers is the request-header information of the current page, obtained through F12). At this point the home page HTML is stored in Get_html.

'''
Step 1: request the page
'''
import requests
# Request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 Edg/92.0.902.84'
}

Get_url = requests.get('https://www.xrmn5.com/',headers=headers)
Get_url.encoding = 'utf-8'
# print(Get_url.text)
# print(Get_url.request.headers)
Get_html = Get_url.text

Step 2: Parse the web page

At first I used re.findall(rule, data), but that felt cumbersome, so I switched to re.compile(rule, re.S).findall(data). The focus is on the rule part:
urls captures, for each entry, the album page link (the path in href), the title, and the path of the cover image itself;
inforName captures the model's name and the title;
likeNum captures the album's publication date and view count.
With that, all the relevant information has been obtained.

'''
Step 2: parse the page
'''

import re
# Notes on the regular expressions:
# (.*?) captures everything inside the parentheses (non-greedy)
# \"(.*?)\" matches a quoted attribute value in the HTML
# re.findall returns the captured groups, one tuple per match
urls = re.findall('<li class="i_list list_n2"><a  href=\"(.*?)\" alt=(.*?) title=.*?><img src=\"(.*?)\"',Get_html)
patren1 = '<div class="postlist-imagenum"><span>(.*?)</span></div></a><div class="case_info"><div class="meta-title">\[.*?\](.*?)</a></div>'
patren2 = '<div class="meta-post"><i class="fa fa-clock-o"></i>(.*?)<span class="cx_like"><i class="fa fa-eye"></i>(.*?)</span>'
inforName = re.compile(patren1,re.S).findall(Get_html)
likeNum = re.compile(patren2,re.S).findall(Get_html)

Step 3: Get the home page cover image

First, specify the image storage directory (I picked my own):
dir = r"D:/Let'sFunning/Picture/PythonGet/"
Then I added a filter: only albums with more than 500 views get their cover downloaded, saved
under dir / model name / date / album name.
os.makedirs() creates the folders.
urls[i][2].split('/')[-1] is the last segment of the cover's path, e.g. 9023.jpg from /UploadFile/pic/9023.jpg, and it is used as the file name.
Finally, splice the picture's path onto the image host to get the picture's URL, fetch it with requests.get(), and write it out.

'''
Step 3: save the covers
'''
import os
import time

dir = r"D:/Let'sFunning/Picture/PythonGet/"
url = "https://pic.xrmn5.com"
# Directory layout: model name + date + album name
num = len(likeNum)
for i in range(num):
	if (int(likeNum[i][1]) > 500):
		getImgDir=dir+str(inforName[i][0])+'/'+str(likeNum[i][0])+'/'+str(inforName[i][1]+'/')
		# Create the directory if it doesn't exist
		if not os.path.exists(getImgDir):
			os.makedirs(getImgDir)
		imgUrl = url+urls[i][2]
		imgName = getImgDir+urls[i][2].split('/')[-1]
		print(imgName)
		time.sleep(1)
		# Fetch the cover image
		Get_Img = requests.get(imgUrl, headers=headers)
		with open(imgName,'wb') as f:
			f.write(Get_Img.content)
		# Enter the album page (next section)

At this point, the cover images of every home-page album with more than 500 views have been saved to their corresponding directories.

Step 3: Get the photo-set pictures

Through Step 2 (obtaining the corresponding web page), we successfully grabbed the cover pictures of the young ladies on the home page, but that hardly counts as a crawler; we only scraped a handful of covers on the home page. We should go deeper inside and fetch the complete photo sets.

Purpose

Fetch the photo sets with more than 500 views in full, not just their covers

Approach

Step 1: Get the link of the page where the photo set lives:

That is exactly the href obtained in Step 2 (web page information).

Step 2: Analyze the photo set's page:

With F12 again, we can see that this page holds three pictures:
(Screenshot: photo-set page)
The corresponding HTML for the pictures is as follows:

<p style="text-align: center"><img onload="size(this)" alt="Xrmn.Top_[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" title="Xrmn.Top_[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" src="/uploadfile/202109/1/47201045101.jpg"><br>
<br>
<img onload="size(this)" alt="Xrmn.Top_[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" title="Xrmn.Top_[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" src="/uploadfile/202109/1/07201045631.jpg"><br>
<br>
<img onload="size(this)" alt="Xrmn.Top_[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" title="Xrmn.Top_[XiuRen秀人网]No.3825_女神芝芝Booty江浙沪旅拍撩轻薄吊裙露超薄肉丝秀翘臀诱惑写真81P" src="/uploadfile/202109/1/16201045377.jpg"><br>
<br>
</p>

(Screenshot: page navigation)
The corresponding pagination HTML is as follows:

<div class="page"><a href="/XiuRen/2021/20219023.html" class="current">1</a><a href="/XiuRen/2021/20219023_1.html">2</a><a href="/XiuRen/2021/20219023_2.html">3</a><a href="/XiuRen/2021/20219023_3.html">4</a><a href="/XiuRen/2021/20219023_4.html">5</a><a href="/XiuRen/2021/20219023_5.html">6</a><a href="/XiuRen/2021/20219023_6.html">7</a><a href="/XiuRen/2021/20219023_7.html">8</a><a href="/XiuRen/2021/20219023_8.html">9</a><a href="/XiuRen/2021/20219023_9.html">10</a><a href="/XiuRen/2021/20219023_10.html">11</a><a href="/XiuRen/2021/20219023_11.html">12</a><a href="/XiuRen/2021/20219023_12.html">13</a><a href="/XiuRen/2021/20219023_13.html">14</a><a href="/XiuRen/2021/20219023_14.html">15</a><a href="/XiuRen/2021/20219023_15.html">16</a><a href="/XiuRen/2021/20219023_16.html">17</a><a href="/XiuRen/2021/20219023_17.html">18</a><a href="/XiuRen/2021/20219023_18.html">19</a><a href="/XiuRen/2021/20219023_19.html">20</a><a href="/XiuRen/2021/20219023_20.html">21</a><a href="/XiuRen/2021/20219023_21.html">22</a><a href="/XiuRen/2021/20219023_22.html">23</a><a href="/XiuRen/2021/20219023_23.html">24</a><a href="/XiuRen/2021/20219023_24.html">25</a><a href="/XiuRen/2021/20219023_25.html">26</a><a href="/XiuRen/2021/20219023_26.html">27</a><a href="/XiuRen/2021/20219023_1.html">下页</a></div>

Step 3: What the photo set's page tells us:

From the two screenshots above and the corresponding code, we know that:

  • This set has 27 pages in total (the last numbered link reads >27</a>)
  • Each page holds three images: src="/uploadfile/202109/1/47201045101.jpg", src="/uploadfile/202109/1/07201045631.jpg", src="/uploadfile/202109/1/16201045377.jpg"
  • The remaining pages are linked with predictable relative paths, e.g. href="/XiuRen/2021/20219023_2.html", so they could also be generated directly (see the sketch below)
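
Since the page links follow an obvious pattern (20219023.html, then 20219023_1.html through 20219023_26.html for 27 pages), here is a small sketch of generating the page URLs directly, assuming that naming convention holds across albums:

# Sketch: generate all 27 page URLs of one album from its base path,
# assuming the _N.html suffix convention observed above holds.
base = "https://www.xrmn5.com"
album = "/XiuRen/2021/20219023"
total_pages = 27

page_urls = [base + album + ".html"]
page_urls += [base + album + "_" + str(n) + ".html" for n in range(1, total_pages)]
print(page_urls[0], page_urls[-1])

The code below instead scrapes the pagination links themselves, which avoids needing to know total_pages up front.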

Code

The idea is very simple: on top of what we already have, go one level deeper to each album's pages and fetch them in a loop.

Similar to the previous part

The first part is much the same as before; I pulled the home page URL into a variable, WebURL, and added a new rule, patren3, to extract the three picture links on each page of the set.


'''
Step 1: request the page
'''
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 Edg/92.0.902.84'
}

WebURL = "https://www.xrmn5.com/"

Get_url = requests.get(WebURL,headers=headers)
# print(Get_url.text)
# print(Get_url.request.headers)
Get_html = Get_url.text

'''
Step 2: parse the page
'''

import re
# Notes on the regular expressions:
# (.*?) captures everything inside the parentheses (non-greedy)
# \"(.*?)\" matches a quoted attribute value in the HTML
# re.findall returns the captured groups, one tuple per match
urls = re.findall('<li class="i_list list_n2"><a  href=\"(.*?)\" alt=(.*?) title=.*?><img src=\"(.*?)\"',Get_html)
patren1 = '<div class="postlist-imagenum"><span>(.*?)</span></div></a><div class="case_info"><div class="meta-title">\[.*?\](.*?)</a></div>'
patren2 = '<div class="meta-post"><i class="fa fa-clock-o"></i>(.*?)<span class="cx_like"><i class="fa fa-eye"></i>(.*?)</span>'
inforName = re.compile(patren1,re.S).findall(Get_html)
likeNum = re.compile(patren2,re.S).findall(Get_html)


# New parsing rule for the picture links inside a photo-set page, e.g.:
# <img onload="size(this)" alt=.*? title=.*? src="/uploadfile/202109/1/07201045631.jpg" />
patren3 = '<img onload=.*? alt=.*? title=.*? src=\"(.*?)\" />'

'''
Step 3: parse deeper and save the covers
'''
import os
import time

dir = r"D:/Let'sFunning/Picture/PythonGet/"
url = "https://pic.xrmn5.com"

Looping to fetch the rest

The main ideas here are as follows:

  • Enter the album's page, i.e. call requests.get() again
  • Parse the page obtained, and store all the page paths of the photo set in AllPage. (Note: my regular expression has a small problem; it also captures the final link, <a href="/XiuRen/2021/20219023_1.html">下页</a> ("next page"), so I break out when I reach it. As a result the last page of each set is never parsed, i.e. the final 3 pictures are missed. Moving the check after the download loop would fix it, as sketched after this list, but I was too lazy to change it here.)
  • Then fetch each page's pictures in a loop: splice each picture link into GetPageImg, requests.get() it, save it, and we're done
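
For reference, a self-contained sketch of that fix: handle the current page's images first, and only then check whether the trailing 下页 entry has been reached, so the last page is no longer skipped (the Step 4 code later orders it exactly this way):

# Sketch of the corrected loop order; 'pages' stands in for AllPage,
# whose last element is the duplicate "next page" (下页) link.
pages = ["_1.html", "_2.html", "_3.html", "_1.html"]

for k in range(len(pages)):
    print("download the images parsed from the current page here")
    if k == len(pages) - 1:
        break  # trailing entry is the "next page" link; stop after processing
    print("then fetch the next page:", pages[k])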

# Directory layout: model name + date + album name
num = len(likeNum)
for i in range(num):
	if (int(likeNum[i][1]) > 500):
		getImgDir=dir+str(inforName[i][0])+'/'+str(likeNum[i][0])+'/'+str(inforName[i][1]+'/')
		# Create the directory if it doesn't exist
		if not os.path.exists(getImgDir):
			os.makedirs(getImgDir)
		imgUrl = url+urls[i][2]
		imgName = getImgDir+urls[i][2].split('/')[-1]
		print(imgName)
		time.sleep(1)
		# Fetch the cover image
		Get_Img = requests.get(imgUrl, headers=headers)
		with open(imgName,'wb') as f:
			f.write(Get_Img.content)
		# Enter the album page
		IntoPageUrl = WebURL + urls[i][0]
		Get_InPage = requests.get(IntoPageUrl, headers=headers)
		Get_InPage.encoding = 'utf-8'
		Get_InPagehtml = Get_InPage.text

		AllPage = re.findall('</a><a href=\"(.*?)\">([0-9]*)', Get_InPagehtml)

		for k in range(len(AllPage)):
			if k == len(AllPage) - 1:
				break
			else:
				imgPageUrl = re.compile(patren3, re.S).findall(Get_InPagehtml)
				PageNum = len(imgPageUrl)
				# Loop to fetch and save the images on this page
				for l in range(PageNum):
					GetPageImg = url+imgPageUrl[l]
					print(GetPageImg)
					PageImgeName = getImgDir+imgPageUrl[l].split('/')[-1]
					print(PageImgeName)
					time.sleep(1)
					# Fetch the image on this page
					Get_PImg = requests.get(GetPageImg, headers=headers)
					with open(PageImgeName, 'wb') as f:
						f.write(Get_PImg.content)


				# Move on to the next page of the set
				NewPaperUrl = WebURL + AllPage[k][0]
				time.sleep(1)
				Get_InPage = requests.get(NewPaperUrl, headers=headers)
				Get_InPage.encoding = 'utf-8'
				Get_InPagehtml = Get_InPage.text

At this point, every home-page photo set with more than 500 views has been saved to its corresponding directory

Step 4: Obtain the photo sets for the entire site

Problem

OK, we can now fetch complete photo sets, but only those with more than 500 views on the home page, and the actual listings run to far more than one page. We need to find all the listing pages of the site:
It turns out that if we open https://www.xrmn5.com/XiuRen/ , we can see 128 pages of listings.
https://www.xrmn5.com/XiuRen/
is therefore where we are going to crawl.

Purpose

Fetch all the pictures under https://www.xrmn5.com/XiuRen/

Approach

The idea is much the same, except that before handling each listing page we first fetch that page's data, then process it inside the loop

Code

Step 1: Get all the listing pages of https://www.xrmn5.com/XiuRen/

Specifically, add a rule, patrenForPageNum, grab the last pagination link, and keep only its digits to recover the highest page number:
PageNum = "".join(list(filter(str.isdigit, temp)))
Then format-splice the URLs of all the listing pages and store them in GetAllPage

import os
import time

dir = r"D:/Let'sFunning/Picture/PythonGet/"
url = "https://pic.xrmn5.com"

import requests

headers = {
    
    
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 Edg/92.0.902.84'
}

URL = "https://www.xrmn5.com/XiuRen/"
WebURL = "https://www.xrmn5.com/"
Get_url = requests.get(URL,headers=headers)
Get_url.encoding = 'utf-8'
Get_html = Get_url.text
print(Get_html)

import re
patrenForPageNum = '</a><a href=\"(.*?)\">'
Get_PageNum = re.compile(patrenForPageNum,re.S).findall(Get_html)
temp = str(Get_PageNum[len(Get_PageNum)-1])
PageNum = "".join(list(filter(str.isdigit, temp)))
print(temp)
print(PageNum)

# Build the URLs of all the listing pages and store them in GetAllPage
AllPageTemp = []
GetAllPage = ()
for i in range(int(PageNum)):
	if i > 0:
		AllPageTemp.append(WebURL+"/XiuRen/index"+str(i+1)+".html")
GetAllPage += tuple(AllPageTemp)

Step 2: Loop over the PageNum listing pages to fetch the pictures

There is actually a problem here: the data of page 128 isn't handled cleanly, a small logical flaw, but basically every set under https://www.xrmn5.com/XiuRen/ with more than 500 views can be crawled (one way to sidestep the flaw is sketched below)
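
A sketch of that workaround (my suggestion, not the author's code), continuing from the Step 1 code above: iterate over the listing-page URLs directly, with the first listing page prepended, so the counter can never run past the end of GetAllPage:

# Sketch: loop over the listing-page URLs themselves instead of indexing
# GetAllPage with a counter; URL and GetAllPage come from Step 1 above.
listing_pages = [URL] + list(GetAllPage)
for page_url in listing_pages:
	Get_url = requests.get(page_url, headers=headers)
	Get_url.encoding = 'utf-8'
	Get_html = Get_url.text
	# ... parse Get_html and download exactly as in the full code below ...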

for pagenum in range(int(PageNum)):
	urls = re.findall('<li class="i_list list_n2"><a  href=\"(.*?)\" alt=(.*?) title=.*?><img class="waitpic" src=\"(.*?)\"', Get_html)
	patren1 = '<div class="postlist-imagenum"><span>(.*?)</span></div></a><div class="case_info"><div class="meta-title">\[.*?\](.*?)</a></div>'
	patren2 = '<div class="meta-post"><i class="fa fa-clock-o"></i>(.*?)<span class="cx_like"><i class="fa fa-eye"></i>(.*?)</span>'
	inforName = re.compile(patren1, re.S).findall(Get_html)
	likeNum = re.compile(patren2, re.S).findall(Get_html)
	print(urls)
	print(inforName)
	print(likeNum)
	num = len(likeNum)
	
	patren3 = '<img onload=.*? alt=.*? title=.*? src=\"(.*?)\" />'
	
	for i in range(num):
		if (int(likeNum[i][1]) > 500):
			getImgDir = dir + str(inforName[i][0]) + '/' + str(likeNum[i][0]) + '/' + str(inforName[i][1] + '/')
			# Create the directory if it doesn't exist
			if not os.path.exists(getImgDir):
				os.makedirs(getImgDir)
			imgUrl = url + urls[i][2]
			imgName = getImgDir + urls[i][2].split('/')[-1]
			print(imgName)
			time.sleep(1)
			# Fetch the cover image
			Get_Img = requests.get(imgUrl, headers=headers)
			with open(imgName, 'wb') as f:
				f.write(Get_Img.content)
			# Enter the album page
			IntoPageUrl = WebURL + urls[i][0]
			Get_InPage = requests.get(IntoPageUrl, headers=headers)
			Get_InPage.encoding = 'utf-8'
			Get_InPagehtml = Get_InPage.text

			AllPage = re.findall('</a><a href=\"(.*?)\">([0-9]*)', Get_InPagehtml)

			for k in range(len(AllPage)):
				imgPageUrl = re.compile(patren3, re.S).findall(Get_InPagehtml)
				PageNum = len(imgPageUrl)
				# Loop to fetch and save the images on this page
				for l in range(PageNum):
					GetPageImg = url + imgPageUrl[l]
					print(GetPageImg)
					PageImgeName = getImgDir + imgPageUrl[l].split('/')[-1]
					print(PageImgeName)
					time.sleep(1)
					# Fetch the image on this page
					Get_PImg = requests.get(GetPageImg, headers=headers)
					with open(PageImgeName, 'wb') as f:
						f.write(Get_PImg.content)

				if k == len(AllPage) - 1:
					break

				# Move on to the next page of the set
				NewPaperUrl = WebURL + AllPage[k][0]
				time.sleep(1)
				Get_InPage = requests.get(NewPaperUrl, headers=headers)
				Get_InPage.encoding = 'utf-8'
				Get_InPagehtml = Get_InPage.text
	Get_url = requests.get(GetAllPage[pagenum],headers=headers)
	Get_url.encoding = 'utf-8'
	Get_html = Get_url.text
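
One final suggestion of my own, not part of the original script: wrapping the download in a small helper that uses a shared requests.Session and basic error handling reuses connections and keeps a single failed image from aborting the whole crawl. A minimal sketch:

# Optional refactor sketch (my addition, not the original code):
# a shared Session plus error handling around each download.
import os
import time
import requests

session = requests.Session()
session.headers.update({'user-agent': 'Mozilla/5.0'})

def save_image(img_url, dest_path, delay=1.0):
	# Download img_url to dest_path; return True on success.
	os.makedirs(os.path.dirname(dest_path), exist_ok=True)
	try:
		resp = session.get(img_url, timeout=10)
		resp.raise_for_status()
	except requests.RequestException as exc:
		print("skip", img_url, ":", exc)
		return False
	with open(dest_path, 'wb') as f:
		f.write(resp.content)
	time.sleep(delay)  # stay polite, as the original code does
	return True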

To do a good job, one must first sharpen one's tools

Results

(Screenshot: the crawl in progress)

(Screenshot: saved pictures)
(Screenshot: what has been crawled so far)



Origin blog.csdn.net/jack_zj123/article/details/120082974