Article directory
- Foreword
- 1. Analyze the website
- 2. Crawl the information
  - 1. Import the libraries
  - 2. Analyze the web page
- Summary
Foreword
Before crawling a website, you need to understand its source code: first request the page, then analyze what comes back.
1. Analyze the website
In the browser, press F12 (Fn+F12 on most laptops) to open the developer tools, or right-click on the page and select "Inspect". The developer interface has nine tabs in total: Elements, Console, Sources, Network, Performance, Memory, Application, Security, and Lighthouse.
Click the Elements tab to see the page's source code; you can even edit it there. The content we want can be located under the Elements tab.
2. Crawl the information
Take crawling the Bilibili (B站) leaderboard as an example.
1. Import the libraries
import requests
# If the name is greyed out, the package is not installed: run `pip install requests` in the Terminal
from bs4 import BeautifulSoup
# Likewise, run `pip install bs4` if BeautifulSoup is missing
import re
2. Analyzing web pages
First come the request URL and the User-Agent spoofing:
url_Bzhan = 'https://www.bilibili.com/v/popular/rank/douga?spm_id_from=333.851.b_62696c695f7265706f72745f646f756761.39'  # the page to crawl
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'}  # User-Agent spoofing
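To confirm that the header will actually travel with the request, you can prepare the request locally without sending it. A minimal sketch (no network traffic involved; `requests.Request(...).prepare()` only builds the request object):

```python
import requests

# The User-Agent string from the article; any modern browser UA works here
ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36')
headers = {'user-agent': ua}

# prepare() assembles the request locally, so nothing is sent over the network
req = requests.Request('GET', 'https://www.bilibili.com/v/popular/rank/douga',
                       headers=headers)
prepared = req.prepare()
print(prepared.headers['User-Agent'])  # the UA header is attached, case-insensitively
```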
Here is the code that pulls out the leaderboard's a tags:
def Ranking():
    pa = requests.get(url=url_Bzhan, headers=headers)
    soup = BeautifulSoup(pa.text, 'html.parser')
    # print(soup.prettify())
    bs = soup.find_all('a', {'class': 'title'})
    for i in range(len(bs)):
        data = bs[i]
        print(data)

Ranking()
Each a tag contains both the URL and the name of the leaderboard entry; these are the two pieces we need to extract.
def Ranking():
    pa = requests.get(url=url_Bzhan, headers=headers)
    soup = BeautifulSoup(pa.text, 'html.parser')
    # print(soup.prettify())  # check the requested page's source code
    bs = soup.find_all('a', {'class': 'title'})
    for i in range(len(bs)):
        data = bs[i]
        wenzi = BeautifulSoup(bs[i].text, 'html.parser')
        # re can only match strings, and the data obtained above is a Tag, so convert it to str
        shuju = re.findall(r'<a class="title" href="(.*?)" target="_blank">', str(data))
        i = int(i) + 1  # rank number
        all_data = str(i) + "\t" + str(wenzi) + "\t" + str(shuju)
        # rank + title + URL, separated by tabs
        print(all_data)

Ranking()
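Incidentally, the regex is not strictly necessary: BeautifulSoup tags expose their attributes and text directly, which is more robust if the page's markup changes slightly. A minimal sketch on a hypothetical sample tag shaped like the ones on the leaderboard page:

```python
from bs4 import BeautifulSoup

# A made-up <a> tag in the same shape as the leaderboard entries (href is hypothetical)
sample = ('<a class="title" href="//www.bilibili.com/video/BV1xx411c7mD" '
          'target="_blank">Example video</a>')
tag = BeautifulSoup(sample, 'html.parser').find('a', {'class': 'title'})

# Tag attributes replace the regex: .get() reads href, .get_text() reads the title
href = tag.get('href')
title = tag.get_text()
print(title, href, sep='\t')
```

This avoids converting the Tag to a string and keeping the regex pattern in sync with the page's exact attribute order.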
Here is what the output looks like after parsing.
The last step is to export the content to a file.
Full code:
import requests
from bs4 import BeautifulSoup
import re

url_bzhan = 'https://www.bilibili.com/v/popular/rank/douga?spm_id_from=333.851.b_62696c695f7265706f72745f646f756761.39'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'}

def Ranking():
    pa = requests.get(url=url_bzhan, headers=headers)
    soup = BeautifulSoup(pa.text, 'html.parser')
    bs = soup.find_all('a', {'class': 'title'})
    with open('D:/爬B站/B站.txt', 'w', encoding='utf-8') as f:  # choose the save location yourself
        for i in range(len(bs)):
            data = bs[i]
            wenzi = BeautifulSoup(bs[i].text, 'html.parser')
            shuju = re.findall(r'<a class="title" href="(.*?)" target="_blank">', str(data))
            i = int(i) + 1
            all_data = str(i) + "\t" + str(wenzi) + "\t" + str(shuju)
            f.write(str(all_data) + "\n")

Ranking()
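If you prefer a spreadsheet-friendly format over a tab-separated text file, Python's standard csv module can write the same rank/title/URL rows. A minimal sketch with made-up rows; it writes to an in-memory buffer here, so swap the StringIO for `open('rank.csv', 'w', newline='', encoding='utf-8')` to write an actual file:

```python
import csv
import io

# Hypothetical rows in the same (rank, title, url) shape the script produces
rows = [
    (1, 'First video', '//www.bilibili.com/video/BV1aaaaaaaaa'),
    (2, 'Second video', '//www.bilibili.com/video/BV1bbbbbbbbb'),
]

# csv.writer handles quoting and delimiters, so titles containing tabs or commas stay intact
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['rank', 'title', 'url'])  # header row
writer.writerows(rows)
print(buf.getvalue())
```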
Summary
All in all, I wish you all good health.