Crawling Chen Baiqiang's "I Just Like You" with Python's requests library: extracting with the re library (regular expressions), using the os module, and removing the '\' anti-crawling symbol

Today I heard a good old song called "I Just Like You", sung by Chen Baiqiang, who is really as strong as his name suggests! He is one of the Cantonese singers I admire. Crawling it, though, turned out to be hard for me. It took 3 hours, which is really embarrassing for an "old driver" like me who has been learning crawling for a month. This time, however, I did not use the Selenium library. With Selenium it would have been done quickly and without any problems. Let's experience the feeling of learning together!

Complete code:

# encoding = "utf-8"
# Author: "Mr.Pan_学狂"
# start time: 2021/2/22/22:30
# finish time: 2021/2/23/00:40
# Crawl Chen Baiqiang's "I Just Like You" with the requests library
import re
import requests
import os
# The original listing used headers without defining it. A browser-like
# User-Agent is assumed here; substitute your own browser's value.
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://haokan.baidu.com/v?vid=17139400655098661254&pd=bjh&fr=bjhauthor&type=video'
html = requests.get(url,headers=headers).text
#print(html)
# reg = r'<video class="" autoplay="" tabindex="2" mediatype="video" crossorigin="anonymous" src="(.*?)"</video>'  # The video tag is not in the source, probably because JS loads it asynchronously (the page is dynamic).
reg = '"url":(.*?),"videoBps":352'
url = re.findall(reg,html)[0]
reg2 = '<h1 class="videoinfo-title">(.*?)</h1>'
video_name = re.findall(reg2,html)[0]
print(video_name)
print(url)
reg3 = '"(.*?)"'
url2 = re.findall(reg3,url)[0]  # Strip the quotes from url; passing it to requests with the nested quotes causes an error
print(url2)  # url2 still needs processing: its backslashes make the request fail
ls = []
for i in url2:
    ls.append(i)
#print(ls)
while True:
    if '\\' in ls:
        ls.remove('\\')
    else:
        break
print(ls)
url3 = ''
for l in ls:
    url3 += l
print('url3:',url3)

os.makedirs('E:/Example',exist_ok=True)  # unlike os.mkdir, does not fail if the folder already exists
video = requests.get(url3,headers=headers)
with open(r'E:/Example/{}.mp4'.format(video_name),'wb') as f:
    f.write(video.content)

Now for the walkthrough. As usual, without further ado, we start by importing the libraries the crawler needs.
Then we open the video's URL and inspect the page elements.
If you crawl this video with the selenium library, it is very easy, because you can extract it directly by XPath. With only requests and re (regular expressions), it is a lot more trouble. By the usual routine, the src attribute of the video tag is what we need to extract. But here the page plays a little joke on us: the video tag does not exist in the source code that requests receives (I think it is loaded dynamically by JS). A regular expression written against that tag matches nothing, so the extraction can only return an empty list. At this point I believe most beginners would give up right away. Don't be discouraged, there is still hope!
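A minimal sketch of why the extraction comes back empty. The HTML string below is a made-up stand-in for what requests receives, not the real page, and the pattern is a simplified version of the commented-out regex in the complete code.

```python
import re

# Hypothetical stand-in for the HTML that requests receives: the <video>
# tag seen in the browser's element inspector is missing, because the
# browser only adds it later by running JavaScript.
static_html = '<html><body><div id="player"></div></body></html>'

# Simplified pattern that looks for the src attribute of a <video> tag.
video_tag_pattern = r'<video[^>]*src="(.*?)"'
matches = re.findall(video_tag_pattern, static_html)
print(matches)  # []
```

re.findall returns an empty list rather than raising an error when nothing matches, which is why indexing the result with [0] is only safe once the pattern is known to match.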

However, I thought it over carefully: the video plays on this page, so a link to the video must be somewhere in the page's source code. Therefore, I pressed Ctrl+F in the PyCharm editor and searched the printed source for the word mp4.
The search turns up four links containing mp4. I opened them and found that all four lead to the same video, so we only need to find a way to extract one of them. I chose the link whose videoBps value is 352 and extracted it with a regular expression (reg in the complete code). Running it prints the matched link.
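The extraction step can be sketched on a small sample. The html string here is hypothetical: it only imitates the shape of the JSON embedded in the page, and the host, paths, and second bitrate are placeholders rather than real values.

```python
import re

# Hypothetical fragment shaped like the JSON embedded in the page source.
# The host, paths, and the second bitrate are placeholders, not real values.
html = (
    '{"url":"https:\\/\\/example.com\\/video\\/sd.mp4","videoBps":352},'
    '{"url":"https:\\/\\/example.com\\/video\\/hd.mp4","videoBps":1024}'
)

# Same check as the Ctrl+F search: does "mp4" occur in the source at all?
print(html.count('mp4'))  # 2

# Same pattern as reg in the complete code.
reg = '"url":(.*?),"videoBps":352'
raw_url = re.findall(reg, html)[0]
print(raw_url)  # "https:\/\/example.com\/video\/sd.mp4"
```

Because (.*?) starts matching at the first "url": in the text, this pattern only isolates a single link when the 352 entry comes first, as it does in this sample. The captured value still carries its quotation marks and backslashes.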
Hahahaha, here I really couldn't help laughing. Happy that you crawled the link? It is a string that cannot be accessed, because it is still wrapped in quotation marks. Try to calculate the area S ∈ (-∞, +∞) of the shadow in my heart at that moment. I suspect that at this point some friends with a little crawling experience would give up on the spot and burst into tears. However, we must believe in ourselves and believe in hope!

No hurry, our re library is awesome! It gets the job done in minutes, fast and efficient: a second regular expression, "(.*?)", captures only the text between the quotation marks. In the result, the quotation marks have been removed. Are you happy now? Hahahaha, I laughed again, because it's not over yet; the scriptures are not so easy to obtain. You still can't pass this link to requests, because it contains the escape character \, which causes an error. My heart collapsed on the spot! All that time, and this is how it ends? That's it?
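A short sketch of the quote-stripping step, using a hypothetical captured value with a placeholder host and path. It shows that the quotation marks are gone afterwards but the backslashes remain.

```python
import re

# Hypothetical value as captured by the first regex (placeholder host/path).
quoted_url = '"https:\\/\\/example.com\\/video\\/sd.mp4"'

# Same pattern as reg3 in the complete code: keep what is between the quotes.
url2 = re.findall('"(.*?)"', quoted_url)[0]
print(url2)          # https:\/\/example.com\/video\/sd.mp4
print('\\' in url2)  # True
```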
Calm down! If we don't panic and look carefully, we find that every escape character \ sits right in front of a /. So we just need a way to get rid of the escape characters. I tried replacing them, and it didn't work! So I tried other methods, until I finally thought of dismantling and then reassembling (somewhat like a distributed approach): first break the problematic link into pieces, then put it back together without the escape characters. Following this train of thought, I finally found a way out and solved the problem!

In the blink of an eye, I defined a list (container), split the link into single characters, and put them into the list. When the list is printed, it looks a bit like Morse code, and each escape character \ is displayed as \\ inside the list (container). Now we have something to work with!
Next, we use the list's remove method (function) to delete every unneeded \ from the list (container).
Hahahaha, I finally got the real thing, happy! What follows is relatively simple: define an intermediate variable and concatenate the list elements (the single characters) back into one string. The result is very nice! This is the final video link we need, glowing with victory! Opening it in the browser plays the video.
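The dismantle-and-reassemble idea can be written more compactly. This is a sketch on a hypothetical escaped link (placeholder host and path), not the author's listing. It also shows two shorter routes that give the same string here, on the assumption that the backslashes are JSON-style \/ escapes, as they appear to be in this page.

```python
import json

# Hypothetical escaped link (placeholder host and path).
url2 = 'https:\\/\\/example.com\\/video\\/sd.mp4'

# Dismantle into characters, drop the backslashes, reassemble.
url3 = ''.join(ch for ch in url2 if ch != '\\')

# Shorter equivalents under the same assumption. A single backslash
# has to be written as '\\' in Python source.
via_replace = url2.replace('\\', '')
via_json = json.loads('"' + url2 + '"')  # decodes \/ as a JSON escape

print(url3)  # https://example.com/video/sd.mp4
print(url3 == via_replace == via_json)  # True
```

json.loads is the more general of the three, since it would also decode other JSON escapes such as \u sequences, whereas the character filter and str.replace simply delete every backslash.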
After that, we also need the video's name. We extract it from the original page source with a regular expression (reg2 in the complete code), and running it prints the title.
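A sketch of the title extraction on a hypothetical snippet of page source. The last step is an addition that the original does not perform: since the title becomes a file name, characters that Windows does not allow in file names are replaced with underscores.

```python
import re

# Hypothetical snippet of page source; the title text is a placeholder.
html = '<h1 class="videoinfo-title">Sample: Title?</h1>'

# Same pattern as reg2 in the complete code.
video_name = re.findall('<h1 class="videoinfo-title">(.*?)</h1>', html)[0]

# Assumption, not in the original: replace characters that are invalid
# in Windows file names (\ / : * ? " < > |) before using the title as one.
safe_name = re.sub(r'[\\/:*?"<>|]', '_', video_name)

print(video_name)  # Sample: Title?
print(safe_name)   # Sample_ Title_
```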
Now that we have the video address and the video name, we can put them to use. The os module creates an Example folder on the E drive to store the video, and then we request the video address in the usual way. Before running, my E drive had no Example folder (directory). After running, the Example folder (directory) has been created automatically and the video has been saved in it.
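One thing to watch when creating the folder: os.mkdir raises FileExistsError if the folder already exists, for example on a second run of the script, while os.makedirs with exist_ok=True does not. A small sketch follows, with a temporary directory standing in for the E drive so that it runs on any machine.

```python
import os
import tempfile

# A temporary directory stands in for the E: drive (assumption for the demo).
base = tempfile.mkdtemp()
target = os.path.join(base, 'Example')

os.mkdir(target)  # first run: creates the folder

try:
    os.mkdir(target)  # second run: the folder already exists
    second_run_failed = False
except FileExistsError:
    second_run_failed = True

# Safe on every run: does nothing if the folder is already there.
os.makedirs(target, exist_ok=True)

print(second_run_failed)      # True
print(os.path.isdir(target))  # True
```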
Clicking the file plays the video.
Why not go and crawl Chen Baiqiang's classic songs yourself? We have to believe in ourselves; every difficulty gets easier in time.

Finally, thank you all for reading my article. It may contain many shortcomings; I hope you will point them out and bear with me.

Origin blog.csdn.net/weixin_43408020/article/details/113981567