Python crawler: easily extract Bilibili column pictures

A simple crawler example using the requests and bs4 modules: pictures from a Bilibili column

section1: Disclaimer

1. The crawled pictures are all pictures that can be downloaded for free on the Bilibili platform.
2. These are my own study notes and will not be used commercially. After crawling about 2 GB of pictures and keeping a few as backup wallpapers, I deleted them all.
3. If there is any infringement in this article, please contact me to delete the article.

section2: Download link analysis

Bilibili has a lot of columns, so I searched and chose "Demon Slayer: Midouzi Wallpaper". The search results span 4 pages in total; this article only crawls the first page.
The final result looks like this:
[screenshots of the crawled pictures]

First, we need to find the search results page for "Demon Slayer: Midouzi Wallpaper", which is the page we will crawl.

Then analyze the page source (taking the first article as an example):
[screenshot of the article entry's <a> tag in the search results]
Find the hyperlink in the href attribute and click it to check:
[screenshot of the opened article]
It is indeed the article we want. Note that the href is protocol-relative (it starts with //), so it must be prefixed with "https:" when crawling.
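
As a small illustration (the href value below is made up, not a real article id), either plain string concatenation or urllib's urljoin turns such a protocol-relative link into a full URL:

from urllib.parse import urljoin

# A protocol-relative href as it might appear in the search results
# (this value is made up for illustration).
href = '//www.bilibili.com/read/cv1234567?from=search'

full_url = 'https:' + href                                 # what this article's code does
full_url2 = urljoin('https://search.bilibili.com/', href)  # same result via urljoin

print(full_url)   # https://www.bilibili.com/read/cv1234567?from=search
print(full_url2)  # same as above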

Then analyze the source of the pictures that appear in the article (taking the first picture as an example):
[screenshot of a figure element in the browser inspector]
Each figure tag contains the URL of one picture, so the plan is to extract the content of its data-src attribute. (In fact, that is not quite the whole story: the text fetched by the code does not match what the inspector shows here. We will come back to this in section4.)
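
To make that structure concrete, here is a minimal sketch run against a made-up fragment shaped like the markup in the inspector (the class name .img-box is the one the crawler selects on later; the image URL is invented):

import bs4

# Made-up fragment mirroring the structure seen in the inspector:
# each picture sits inside a <figure class="img-box">.
html = '''
<figure class="img-box">
    <img data-src="//i0.example.com/demo.jpg" alt="">
</figure>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')
for img in soup.select('.img-box img'):
    print(img.get('data-src'))   # //i0.example.com/demo.jpg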

section3: Code writing

1. Import section

import requests
import re
import bs4
import os

2. Add a request header

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}

3. Create a folder

if not os.path.exists('D:/鬼灭之刃'):
    os.mkdir('D:/鬼灭之刃')
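
As an aside, os.makedirs with exist_ok=True folds the existence check and the creation into one call (and creates any missing parent directories as well):

import os

# Equivalent one-liner: create the folder if it does not already exist.
os.makedirs('D:/鬼灭之刃', exist_ok=True)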

4. Writing the functions

def get_first_url(url):       # get the first-layer URLs, i.e. each article's URL (returned as a list, iterated over later)
    res_1=requests.get(url=url,headers=headers)
    html_1=res_1.text
    first_url=re.findall('<li.*?<a.*?"(//w.*?search)"',html_1,re.S)
    return first_url

This function parses the search results page we visited at the start and extracts the fragments that make up each article link
(extracted with a regular expression)

I have also written an article about regular expressions before; see that earlier post for details.
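
To see what that regular expression captures, here is a minimal sketch run against a made-up fragment shaped like one entry of the search results (the cv number is invented):

import re

# Made-up fragment shaped like one entry in the search results list.
html_1 = '<li class="result-item"><a href="//www.bilibili.com/read/cv1234567?from=search">title</a></li>'

first_url = re.findall('<li.*?<a.*?"(//w.*?search)"', html_1, re.S)
print(first_url)   # ['//www.bilibili.com/read/cv1234567?from=search']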

def get_second_url(url):   # get the second-layer URLs, i.e. the URL of each picture in an article (returned as a list, iterated over later)
    res_2 = requests.get(url=url,headers=headers)
    html_2=res_2.text
    soup=bs4.BeautifulSoup(html_2,'html.parser')
    picture_list = soup.select('.img-box img')
    return picture_list

This function parses the article content and extracts the elements that hold each picture link
(extracted using bs4)
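
As a usage sketch (the article URL below is made up), the function returns a list of img tags whose attributes are then read in main(). Note that bs4's Tag.get() returns None rather than raising when an attribute is missing, which matters for the data-src question in section4:

# Hypothetical usage: fetch one article and list its picture URLs.
article_url = 'https://www.bilibili.com/read/cv1234567'   # made-up id
for img in get_second_url(article_url):
    print(img.get('data-src'))   # protocol-relative, e.g. //i0.hdslb.com/...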

def download_picture(url,num1,i):   # download one picture
    res_3=requests.get(url=url,headers=headers)
    picture_data=res_3.content
    picture_name='img{}_{}.jpg'.format(num1,i)
    picture_path='D:/鬼灭之刃/'+picture_name
    with open(picture_path,'wb') as f:
        f.write(picture_data)
        print(picture_path,'saved successfully')

This function downloads one picture and prints its save path as a simple progress readout
(I like to write the main function first and then come back to add this one)
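
If the pictures were large, the same step could be written with streaming so that a whole file never sits in memory at once. A sketch, reusing the headers dict from above:

def download_picture_streamed(url, num1, i):
    # Same idea as download_picture, but with stream=True the body is
    # read and written in chunks instead of being loaded all at once.
    res = requests.get(url=url, headers=headers, stream=True)
    picture_path = 'D:/鬼灭之刃/img{}_{}.jpg'.format(num1, i)
    with open(picture_path, 'wb') as f:
        for chunk in res.iter_content(chunk_size=8192):
            f.write(chunk)
    print(picture_path, 'saved successfully')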

def main():
    base_url='https://search.bilibili.com/article?keyword=%E9%AC%BC%E7%81%AD%E4%B9%8B%E5%88%83%E5%BC%A5%E8%B1%86%E5%AD%90%E5%A3%81%E7%BA%B8'
    first_urls=get_first_url(base_url)
    num1=1
    for first_url in first_urls:
        first_url='https:'+first_url
        second_url=get_second_url(first_url)
        for i in range(len(second_url)):
            picture_urls=second_url[i].get('data-src')
            picture_url='https:'+picture_urls
            download_picture(picture_url,num1,i)
        num1+=1

5. Complete code

import requests
import re
import bs4
import os

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}

# create the folder
if not os.path.exists('D:/鬼灭之刃'):
    os.mkdir('D:/鬼灭之刃')

def get_first_url(url):       # get the first-layer URLs, i.e. each article's URL (returned as a list, iterated over later)
    res_1=requests.get(url=url,headers=headers)
    html_1=res_1.text
    first_url=re.findall('<li.*?<a.*?"(//w.*?search)"',html_1,re.S)
    return first_url

def get_second_url(url):   # get the second-layer URLs, i.e. the URL of each picture in an article (returned as a list, iterated over later)
    res_2 = requests.get(url=url,headers=headers)
    html_2=res_2.text
    soup=bs4.BeautifulSoup(html_2,'html.parser')
    picture_list = soup.select('.img-box img')
    return picture_list

def download_picture(url,num1,i):   # download one picture
    res_3=requests.get(url=url,headers=headers)
    picture_data=res_3.content
    picture_name='img{}_{}.jpg'.format(num1,i)
    picture_path='D:/鬼灭之刃/'+picture_name
    with open(picture_path,'wb') as f:
        f.write(picture_data)
        print(picture_path,'saved successfully')


def main():
    base_url='https://search.bilibili.com/article?keyword=%E9%AC%BC%E7%81%AD%E4%B9%8B%E5%88%83%E5%BC%A5%E8%B1%86%E5%AD%90%E5%A3%81%E7%BA%B8'
    first_urls=get_first_url(base_url)
    num1=1
    for first_url in first_urls:
        first_url='https:'+first_url
        second_url=get_second_url(first_url)
        for i in range(len(second_url)):
            picture_urls=second_url[i].get('data-src')
            picture_url='https:'+picture_urls
            download_picture(picture_url,num1,i)
        num1+=1


if __name__ == '__main__':
    main()

section4: Supplement

At this point the article is nearly finished, so let's look at the small problem left open above.

Here is the result of parsing the page with code:
[screenshot of the parsed HTML]
and here is what the browser's "inspect" view shows:
[screenshot of the browser inspector]

Take a closer look and you will see a difference between the two. When they disagree, the parsed result is the one to go by, since it is the HTML the code actually receives (the inspector shows the page after the browser's JavaScript has run).
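
If you want to verify this yourself, a small sketch (reusing the headers dict from above; the cv number is made up) prints the attributes of the first matched img, which shows exactly what requests received:

import requests
import bs4

res = requests.get('https://www.bilibili.com/read/cv1234567', headers=headers)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
img = soup.select('.img-box img')[0]   # assumes the article has at least one picture
print(img.attrs)   # shows which attributes (data-src, src, ...) are really present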


I am still new to bs4 myself, so if you see something that could be done better, please enlighten me!

Origin: blog.csdn.net/qq_44921056/article/details/113398597