Python crawler development basics #2: analyzing a website, using the Bilibili ranking list as an example

Article directory

  • Foreword
  • 1. Analyze the website
  • 2. Crawl information
    • 1. Import library
    • 2. Analyzing web pages
  • Summary





Foreword

        Before using a crawler to scrape a website, you need to understand the site's source code: request the page source first, then analyze it.




1. Analyze the website

In the browser, press F12 (on most laptops, Fn+F12) to open the developer tools, or right-click the page and choose "Inspect". The developer interface has nine tabs in total: Elements, Console, Sources, Network, Performance, Memory, Application, Security, and Lighthouse.

The Elements tab shows the page's source code, and you can even edit it in place. The content we want to extract can be located under this tab.




2. Crawl information

Take scraping the Bilibili ranking list as an example:




1. Import library

import requests
# If the import is grayed out, the package is not installed; run pip install requests in the Terminal
from bs4 import BeautifulSoup
# If the import is grayed out, the package is not installed; run pip install bs4 in the Terminal
import re
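As a quick sanity check that both libraries were installed and import correctly, you can parse a tiny hand-written snippet (a minimal sketch; the snippet below is made up, not taken from Bilibili):

```python
import requests
from bs4 import BeautifulSoup

# parse a trivial, made-up snippet to confirm BeautifulSoup works
soup = BeautifulSoup('<a class="title" href="/video/x">demo</a>', 'html.parser')
print(soup.a.text)           # demo
print(requests.__version__)  # prints the installed requests version
```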




2. Analyzing web pages

Here are the target URL and the User-Agent (UA) disguise:

url_Bzhan = 'https://www.bilibili.com/v/popular/rank/douga?spm_id_from=333.851.b_62696c695f7265706f72745f646f756761.39'  # the site to scrape

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'}  # UA disguise
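To confirm the disguise actually takes effect, you can build the request without sending it; requests lets you prepare a request offline (a small sketch, not part of the original post):

```python
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'}

# prepare (but do not send) a GET request to inspect its headers
req = requests.Request('GET', 'https://www.bilibili.com/', headers=headers).prepare()
print(req.headers['user-agent'])  # the browser-style UA string, not requests' default
```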

Here is the code that extracts the leaderboard's a tags:

def Ranking():
    pa = requests.get(url=url_Bzhan, headers=headers)
    soup = BeautifulSoup(pa.text, 'html.parser')
    # print(soup.prettify())
    bs = soup.find_all('a', {'class': 'title'})
    # print(bs)
    for i in range(len(bs)):
        data = bs[i]
        print(data)

Ranking()

The a tag contains both the URL and the title of each leaderboard entry, and these two pieces are exactly what we need to extract.

def Ranking():
    pa = requests.get(url=url_Bzhan, headers=headers)
    soup = BeautifulSoup(pa.text, 'html.parser')
    # print(soup.prettify())  # check the source code of the requested page
    bs = soup.find_all('a', {'class': 'title'})
    for i in range(len(bs)):
        data = bs[i]
        wenzi = BeautifulSoup(bs[i].text, 'html.parser')
        # re can only match strings, but the data obtained above is not one, so convert it to str
        shuju = re.findall(r'<a class="title" href="(.*?)" target="_blank">', str(data))
        i = int(i) + 1  # rank number
        all_data = str(i) + "\t" + str(wenzi) + "\t" + str(shuju)
        # rank + title + URL, separated by tabs
        print(all_data)

Ranking()
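The regex step can be exercised on a hand-written sample tag (the sample HTML below is made up to mirror the structure the code expects; it is not taken from the live page):

```python
import re
from bs4 import BeautifulSoup

sample = '<a class="title" href="//www.bilibili.com/video/BVdemo" target="_blank">demo title</a>'
data = BeautifulSoup(sample, 'html.parser').a

# the article's approach: serialize the tag back to a string, then regex out the href
shuju = re.findall(r'<a class="title" href="(.*?)" target="_blank">', str(data))
print(shuju)  # ['//www.bilibili.com/video/BVdemo']

# a simpler alternative: read the attribute and the text straight from the tag
print(data['href'])  # //www.bilibili.com/video/BVdemo
print(data.text)     # demo title
```

The attribute-based version does not depend on the exact attribute order in the serialized HTML, so it keeps working even if the site's markup no longer matches the regex pattern.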

After parsing, the output shows the rank, title, and URL for each entry.

The last step is to write the extracted content out to a file.
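The export boils down to writing one tab-separated line per entry with UTF-8 encoding so Chinese titles survive. A minimal sketch with a made-up row and a temp-file path instead of the D: drive:

```python
import os
import tempfile

rows = [(1, 'demo title', '//www.bilibili.com/video/BVdemo')]  # made-up data
path = os.path.join(tempfile.gettempdir(), 'bilibili_rank_demo.txt')

# encoding='utf-8' keeps non-ASCII titles intact on any platform
with open(path, 'w', encoding='utf-8') as f:
    for rank, title, url in rows:
        f.write(str(rank) + '\t' + title + '\t' + url + '\n')

with open(path, encoding='utf-8') as f:
    print(f.read().strip())
```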

Full code:

import requests
from bs4 import BeautifulSoup
import re

url_bzhan = 'https://www.bilibili.com/v/popular/rank/douga?spm_id_from=333.851.b_62696c695f7265706f72745f646f756761.39'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'}

def Ranking():
    pa = requests.get(url=url_bzhan, headers=headers)
    soup = BeautifulSoup(pa.text, 'html.parser')
    bs = soup.find_all('a', {'class': 'title'})
    with open('D:/爬B站/B站.txt', 'w', encoding='utf-8') as f:  # choose your own save location
        for i in range(len(bs)):
            data = bs[i]
            wenzi = BeautifulSoup(bs[i].text, 'html.parser')
            shuju = re.findall(r'<a class="title" href="(.*?)" target="_blank">', str(data))
            i = int(i) + 1
            all_data = str(i) + "\t" + str(wenzi) + "\t" + str(shuju)
            f.write(str(all_data) + "\n")

Ranking()


 




Summary

All in all, I wish you all good health.

Origin blog.csdn.net/i__saber/article/details/120229250