Python beginner practice: scraping meme images
Foreword
I'm a Python beginner.
If anything here is wrong, please forgive me.
This article shows how to scrape meme images with Python
and is aimed at beginners.
The code has plenty of room for improvement.
Libraries used this time: ① requests ② os ③ re
After checking, I found that os and re are both part of the Python standard library,
so they do not need to be installed with pip. Only requests has to be installed.
Silly me.
Note: Please credit the source when reposting; infringement will be handled according to the relevant laws.
Preparation
- Install a Python 3.x development environment
- Press Win + R to open Run, type cmd, then type python to verify that Python is installed
- Press Win + R to open Run, type cmd, then type pip install requests
- Choose a crawl target
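The checks in the list above can also be run from inside Python. This is only a minimal sketch: it assumes nothing beyond the standard library, and the variable names `is_python3` and `has_requests` are made up for illustration.

```python
import importlib.util
import sys

# The article assumes a Python 3.x interpreter.
is_python3 = sys.version_info.major == 3

# find_spec returns None when a top-level module cannot be found,
# so this tells us whether "pip install requests" has been done.
has_requests = importlib.util.find_spec("requests") is not None

print("Python version:", sys.version.split()[0])
print("Running Python 3:", is_python3)
print("requests installed:", has_requests)
```

If the last line prints False, run pip install requests again in cmd.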
Getting started:
Open the target site:
https://qq.yh31.com/zjbq/0551964.html
Once on the target site, press F12 to open the Developer Tools
and get:
- Image address (incomplete): /tp/zjbq/201903271348331856.gif
- Your own browser's UA: User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36
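The image address shown in the Developer Tools is incomplete because it is relative to the site root. As a small sketch of how such an address could be completed, the snippet below uses urljoin from the standard library; this is an alternative to the plain string concatenation used in this article's code, not the method the article itself uses.

```python
from urllib.parse import urljoin

# Page URL and relative image address as they appear in this article.
page_url = "https://qq.yh31.com/zjbq/0551964.html"
relative_src = "/tp/zjbq/201903271348331856.gif"

# A path starting with "/" replaces the whole path of the page URL.
full_url = urljoin(page_url, relative_src)
print(full_url)  # https://qq.yh31.com/tp/zjbq/201903271348331856.gif
```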
Code
'''
Author: 血饮
Purpose: scrape the meme images from a specified web page
Date: 2020.02.20
'''
import requests
import os
import re
target_url = "https://qq.yh31.com/zjbq/0551964.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36"
}
Next, get the page source.
The UA is there to get past anti-scraping checks: it makes the request look as though it was sent by our own browser.
The source that Python fetches is a bytes object (printed, it starts with b followed by a pile of escape sequences that are hard to read), so we have to decode it.
source_code = requests.get(target_url,headers=headers).content.decode("utf-8")
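To see why the decode step matters, here is a small offline sketch. The sample text is made up for illustration and no request is sent; it only shows the difference between bytes and str, and what happens when the bytes are not actually UTF-8.

```python
# Hypothetical sample standing in for response.content (always bytes).
raw = '表情包 <img src="/tp/a.gif" />'.encode("utf-8")
print(type(raw).__name__)  # bytes
print(raw[:9])             # shows b'...' with escape sequences

# Decoding turns the bytes back into readable text.
text = raw.decode("utf-8")
print(text)

# If a page used another encoding (GBK here, purely as an assumption),
# a strict utf-8 decode would raise UnicodeDecodeError.
# errors="replace" substitutes U+FFFD instead of raising.
gbk_bytes = "表情".encode("gbk")
lenient = gbk_bytes.decode("utf-8", errors="replace")
print(lenient)
```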
Then use a regular expression to get the image links.
Looking at the source, you can see that besides the meme images we want, the page also contains some gif images we do not want.
Comparing the two kinds of tag:
1
<img src="/tp/zjbq/201903271348331856.gif" />
2
<img src="/images/ontop3.gif" alt="热门图片">
The difference is whether the tag ends with a /
So the regular expression we need is:
regex_1 = r'img[\s]+src="(.*?\.gif)"[\s]+/'
xueyin = re.compile(regex_1)
get_img_url = re.findall(xueyin,source_code)
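The pattern can be tried offline against the two tags compared above. This is a minimal sketch: `sample_html` is a hand-made stand-in for the real page source, and only the pattern itself comes from the article.

```python
import re

# Hand-made sample containing the two kinds of tag from the comparison.
sample_html = '''
<img src="/tp/zjbq/201903271348331856.gif" />
<img src="/images/ontop3.gif" alt="热门图片">
'''

# Same pattern as in the article: after the closing quote it requires
# whitespace and then "/", which only the first tag has.
pattern = re.compile(r'img[\s]+src="(.*?\.gif)"[\s]+/')
matches = pattern.findall(sample_html)
print(matches)  # ['/tp/zjbq/201903271348331856.gif']
```

Because the pattern has a single capturing group, findall returns only the captured paths, not the whole tags.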
This gives a long list of incomplete image links:
['/tp/zjbq/201903271348331856.gif', '/tp/zj
(rest omitted)
What we get back is a list,
which the following code processes:
- get the current working directory with os
- loop over the list one item at a time
- turn each partial link into the complete, real image link
- get the image file name
- build the exact path where the image will be saved
- fetch the image as bytes
- open the output path and write the image
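path = os.getcwd()  # directory the script is run from
for x in get_img_url:
    # Each match already starts with "/", so the host has no trailing slash
    x = "https://qq.yh31.com" + x
    file_name = x.split("/")[-1]  # last path segment is the file name
    file_path = os.path.join(path, file_name)
    response = requests.get(x, headers=headers)  # download the image
    with open(file_path, "wb") as f:  # "wb" writes the raw bytes
        f.write(response.content)
print("Done")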
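The file-name and saving steps of the loop above can be tried without touching the network. The sketch below is only an illustration: `save_image` is a hypothetical helper, the six bytes are a fake stand-in for `response.content`, and a temporary directory is used instead of the current one.

```python
import os
import tempfile


def save_image(content: bytes, image_url: str, directory: str) -> str:
    """Write image bytes to directory, named after the last URL segment."""
    file_name = image_url.split("/")[-1]
    file_path = os.path.join(directory, file_name)
    with open(file_path, "wb") as f:  # binary mode, since content is bytes
        f.write(content)
    return file_path


# Fake bytes stand in for a downloaded image; nothing is requested here.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_image(
        b"GIF89a",
        "https://qq.yh31.com/tp/zjbq/201903271348331856.gif",
        tmp_dir,
    )
    saved_name = os.path.basename(saved)
    with open(saved, "rb") as f:
        saved_bytes = f.read()

print(saved_name)
```

Using os.path.join rather than a hard-coded backslash keeps the path correct on systems other than Windows.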
Some of the meme images are in jpg format.
I won't go through that here; the method is similar to the one above.
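One possible way to cover jpg as well is to allow either extension in the pattern. This is a sketch under assumptions, not the author's code: the file names in `sample_html` are invented, and it assumes the jpg tags on the page also end with a /.

```python
import re

# Invented sample tags; a.gif and b.jpg are hypothetical file names.
sample_html = '''
<img src="/tp/zjbq/a.gif" />
<img src="/tp/zjbq/b.jpg" />
<img src="/images/ontop3.gif" alt="x">
'''

# (?:gif|jpg) is a non-capturing group, so findall still returns
# just the one captured path per match.
pattern = re.compile(r'img[\s]+src="(.*?\.(?:gif|jpg))"[\s]+/')
matches = pattern.findall(sample_html)
print(matches)  # ['/tp/zjbq/a.gif', '/tp/zjbq/b.jpg']
```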
The final code:
'''
Author: 血饮
Purpose: scrape the meme images from a specified web page
Date: 2020.02.20
'''
import requests
import os
import re
target_url = "https://qq.yh31.com/zjbq/0551964.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
}
# Fetch the page and decode the bytes into text
source_code = requests.get(target_url, headers=headers).content.decode("utf-8")
# Only tags that end with "/" after the src attribute are meme images
regex_1 = r'img[\s]+src="(.*?\.gif)"[\s]+/'
xueyin = re.compile(regex_1)
get_img_url = re.findall(xueyin, source_code)
path = os.getcwd()
for x in get_img_url:
    # Each match already starts with "/", so the host has no trailing slash
    x = "https://qq.yh31.com" + x
    file_name = x.split("/")[-1]
    file_path = os.path.join(path, file_name)
    response = requests.get(x, headers=headers)
    with open(file_path, "wb") as f:
        f.write(response.content)
print("Done")
# Please credit the source when reposting; infringement will be handled according to the relevant laws