Python for Beginners in Practice: Scraping a Meme Pack

Foreword

I'm a Python beginner, so please forgive any mistakes.
This article walks through scraping a meme pack with Python and is aimed at beginners.
The code has plenty of room for improvement.
This time I use three libraries: ① requests ② os ③ re.
I checked and found that os and re are both part of the Python standard library, so neither needs to be installed with pip; only requests does.
Silly me.

Note: If you repost this article, please credit the source; infringement will be handled according to the relevant law.

BergeBlog(https://www.xueyin.cf/)

Preparation

  1. Install a Python 3.x development environment.
  2. Press Win + R to open the Run dialog, type cmd, then type python to verify that Python is installed.
  3. Press Win + R to open the Run dialog, type cmd, then type pip install requests.
  4. Choose the target to scrape.
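
The checks in steps 2 and 3 can also be done from inside Python. This is only a small sketch of my own, not part of the original script, and the variable name has_requests is made up for illustration.

```python
import sys
import importlib.util

# Report which Python interpreter is running (the article assumes 3.x).
print("Python version:", sys.version.split()[0])

# find_spec returns None when a package cannot be found, so this checks
# whether requests is installed without actually importing it.
has_requests = importlib.util.find_spec("requests") is not None
print("requests installed:", has_requests)
```

If the second line prints False, run pip install requests again from cmd.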

Getting started:

Open the target site:
https://qq.yh31.com/zjbq/0551964.html

Once the page has loaded, press F12 to open the developer tools and note the following:

  1. Image address (a relative path, not a full URL): /tp/zjbq/201903271348331856.gif
  2. Your own browser's UA: User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36

Code

'''
    Author: 血饮
    Purpose: scrape the meme pack from a specified web page
    Date: 2020.02.20
'''
import requests
import os
import re

target_url = "https://qq.yh31.com/zjbq/0551964.html"
# Send a browser User-Agent so the request looks like it comes from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36"
}

Next, fetch the page source.
The UA header is a countermeasure against anti-scraping checks: it makes the request look like it was sent by your own browser.
The source that Python fetches is raw bytes, which print as an unreadable string starting with b, so we have to decode it.

source_code = requests.get(target_url,headers=headers).content.decode("utf-8")
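
To see why the decode step matters without touching the network, here is a minimal sketch. The sample bytes are invented for illustration and only stand in for what .content returns.

```python
# Hypothetical sample standing in for response.content (raw bytes from the server).
raw = "<title>表情包</title>".encode("utf-8")

# Printing bytes shows the b'...' prefix with non-ASCII characters escaped.
print(raw)

# Decoding with the page's encoding turns the bytes into a normal str.
text = raw.decode("utf-8")
print(text)
```

The first print shows the escaped byte form, the second shows readable text.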

Then use a regular expression to extract the image links.
Looking at the source, besides the memes we want there are also some other gif images on the page that we do not want.
Comparing the two kinds of tag shows the difference:
1

<img src="/tp/zjbq/201903271348331856.gif" />

2

<img src="/images/ontop3.gif" alt="热门图片">

The difference is whether there is a / just before the closing bracket: the meme tags end with " />", the others do not.
So the regular expression we need is:

regex_1 = r'img[\s]+src="(.*?\.gif)"[\s]+/'
xueyin = re.compile(regex_1)
get_img_url = re.findall(xueyin,source_code)
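
The pattern can be tried offline against the two tags quoted earlier. This is just a sketch; sample_html is a made-up stand-in for the real page source.

```python
import re

# The two tags quoted above, one per line, standing in for the page source.
sample_html = (
    '<img src="/tp/zjbq/201903271348331856.gif" />\n'
    '<img src="/images/ontop3.gif" alt="热门图片">\n'
)

# Same pattern as in the article: the src must be followed by whitespace and "/".
pattern = re.compile(r'img[\s]+src="(.*?\.gif)"[\s]+/')
matches = pattern.findall(sample_html)
print(matches)  # ['/tp/zjbq/201903271348331856.gif']
```

Only the first tag matches, because the second has alt=... after the src instead of " /". One caveat: if several tags sit on the same line, .*? can run past a closing quote, and [^"]*? in its place is a stricter alternative.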

This yields a long list of incomplete (relative) image links:
['/tp/zjbq/201903271348331856.gif', '/tp/zj...
(the rest is omitted)

What we get back is a list, which the following code processes:

  1. Use os to get the current working directory.
  2. Loop over the list, handling one link at a time.
  3. Turn each relative link into the full, real image URL.
  4. Extract the image file name.
  5. Build the exact path where the image will be saved.
  6. Fetch the image as bytes.
  7. Open the output file and write the image to it.

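path = os.getcwd()
for x in get_img_url:
    # x already starts with "/", so the base URL has no trailing slash
    x = "https://qq.yh31.com" + x
    file_name = x.split("/")[-1]
    # os.path.join picks the right separator for the current OS
    file_path = os.path.join(path, file_name)
    response = requests.get(x, headers=headers)
    with open(file_path, "wb") as f:
        f.write(response.content)
print("Done")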
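
The script builds the full link by string concatenation. As an alternative sketch (my own suggestion, not what the original code does), the standard library's urljoin handles the leading slash for you, and os.path.join builds the save path. The names base_url and relative are made up for illustration.

```python
import os
from urllib.parse import urljoin

base_url = "https://qq.yh31.com/"
relative = "/tp/zjbq/201903271348331856.gif"  # one entry from the regex result

# urljoin resolves the relative path against the site root without doubling slashes.
full_url = urljoin(base_url, relative)
print(full_url)  # https://qq.yh31.com/tp/zjbq/201903271348331856.gif

# The file name is the last path segment; join it onto the current directory.
file_name = full_url.split("/")[-1]
file_path = os.path.join(os.getcwd(), file_name)
print(file_name)  # 201903271348331856.gif
```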

Some of the memes are in jpg format.
I won't go through those here; the method is the same as above.
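
For the jpg case, one possible approach is to let the same pattern accept either extension. This is a sketch under my own assumptions: the second tag in sample_html uses a hypothetical file name, not one taken from the site.

```python
import re

sample_html = (
    '<img src="/tp/zjbq/201903271348331856.gif" />\n'
    '<img src="/tp/zjbq/example.jpg" />\n'  # hypothetical jpg tag for illustration
)

# (?:gif|jpg) is a non-capturing group, so findall still returns only the path.
pattern = re.compile(r'img[\s]+src="(.*?\.(?:gif|jpg))"[\s]+/')
matches = pattern.findall(sample_html)
print(matches)
```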

The final code:

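'''
    Author: 血饮
    Purpose: scrape the meme pack from a specified web page
    Date: 2020.02.20
'''
import requests
import os
import re

target_url = "https://qq.yh31.com/zjbq/0551964.html"

# Send a browser User-Agent so the request looks like it comes from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
}

# Fetch the page and decode the raw bytes into text
source_code = requests.get(target_url, headers=headers).content.decode("utf-8")

# Match only img tags whose src ends in .gif and is followed by " /"
regex_1 = r'img[\s]+src="(.*?\.gif)"[\s]+/'

xueyin = re.compile(regex_1)

get_img_url = re.findall(xueyin, source_code)

path = os.getcwd()

for x in get_img_url:
    # x already starts with "/", so the base URL has no trailing slash
    x = "https://qq.yh31.com" + x
    file_name = x.split("/")[-1]
    # os.path.join picks the right separator for the current OS
    file_path = os.path.join(path, file_name)
    response = requests.get(x, headers=headers)
    with open(file_path, "wb") as f:
        f.write(response.content)
print("Done")
# Please credit the source when reposting; infringement will be handled according to the relevant law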
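
Since the code has room for improvement, here is one possible direction, offered as a sketch rather than the way the original script works. The helper names save_image and fetch_with_requests are hypothetical. The download step is passed in as a function so the saving logic can be tried offline, and the requests-based fetcher adds a timeout and a status check.

```python
import os
import tempfile

def save_image(url, dest_dir, fetch):
    """Fetch one image with fetch(url) -> bytes and write it into dest_dir."""
    file_name = url.split("/")[-1]
    file_path = os.path.join(dest_dir, file_name)
    with open(file_path, "wb") as f:
        f.write(fetch(url))
    return file_path

def fetch_with_requests(url, headers=None):
    # Imported here so save_image can be tried without requests installed.
    import requests
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    return response.content

# Offline demo: a stand-in fetcher returns fixed bytes, so nothing is downloaded.
demo_dir = tempfile.mkdtemp()
saved = save_image(
    "https://qq.yh31.com/tp/zjbq/201903271348331856.gif",
    demo_dir,
    lambda url: b"GIF89a",
)
print(saved)
```

In the real script the call would look like save_image(x, path, lambda u: fetch_with_requests(u, headers)) inside the for loop.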

Personal blog BergeBlog

Origin blog.csdn.net/qq_38958476/article/details/104411064