Python crawler: downloading the HD wallpapers of the netbian.com desktop wallpaper site


I like to collect wallpapers. I found the wallpapers under the "beautiful" (weimei) category of the netbian.com desktop wallpaper site and liked them very much, so I wrote a crawler for that category. Later I noticed that the page structure is basically the same across the whole site, so with a little extra code the crawler can download the HD wallpapers of every category.

Part 1: Overview

On my computer, I create a folder to store the crawled wallpapers.

Under this folder there are 25 sub-folders, one for each category. Under each category folder there are several folders, one per page number, and the image files themselves are stored inside the page-number folders.

Part 2: Environment Preparation

Besides the standard library, the crawler uses three third-party packages (see their official documentation if you are interested):

  • requests : fetches pages over HTTP
  • lxml : a Python parsing library; it supports HTML and XML parsing as well as XPath, and parses very efficiently
  • Beautiful Soup 4 : extracts data from HTML and XML documents

Enter the following pip commands in the terminal to install them

python -m pip install beautifulsoup4
python -m pip install lxml
python -m pip install requests

Part 3: Analyzing the Page Structure

  • Because my screen resolution is 1920 × 1080, that is the resolution of the images I crawl.
  • The netbian desktop wallpaper site offers many categories to browse: calendar, anime, scenery, beauties, games, movies, animated, aesthetic, design... The wallpapers under the 4k category are an important revenue source for the site, and I have no need for 4k wallpapers, so that category is not crawled.

    CSS selector: #header > div.head > ul > li:nth-child(1) > div > a locates the a tags that wrap the category links.

Next, I will use the wallpapers under the weimei ("beautiful") category to explain how to crawl the pictures.

  1. There are 73 pages in total; every page except the last contains 18 pictures.

    In the code, though, it is better to obtain the total page count automatically. Fortunately, the structure of the netbian site is very consistent: the HTML of every page is essentially the same.

    CSS selector: div.page a locates the a tags that wrap the page numbers; only 6 a tags are matched.
  2. The third image on each page is always the same advertisement and needs to be filtered out in the code.
  3. The URL of each page is very regular: http://www.netbian.com/weimei/index_x.htm,
    where x is exactly the page number.
  4. Note: the pictures you see under a category are low-resolution thumbnails; to get the 1920 × 1080 image you have to jump through two more links.

Let's walk through an example.

In the category page we can directly obtain the URL of each image, but unfortunately its resolution is not satisfactory;
inspecting the page shows that every image displayed in the category page points to another hyperlink.

CSS selector: div#main div.list ul li a locates the a tag that wraps each picture.

Clicking a picture follows the first jump to a new link, which displays the following content:

CSS selector: div#main div.endpage div.pic div.pic-down a locates the a tag that wraps the picture.

Clicking the "download wallpaper (1920 × 1080)" button follows the second jump to another new link, which finally reaches the goal: the picture displayed there has a resolution of 1920 × 1080.

After these twists and turns, we have found the 1920 × 1080 high-resolution version of the picture.

CSS selector: div#main table a img locates the img tag of the image.

In my crawling tests, a very small number of pictures failed to download for miscellaneous reasons, and a few pictures come in other resolutions even though the site offers a 1920 × 1080 download button for them.

Part 4: Code Walkthrough

  • As you follow the explanation below, adjust the configurable values (shown as red bold text in the original post) to your own circumstances.

Step 1: Set up global variables

import os
import random
import time
import requests
from bs4 import BeautifulSoup
import UserAgent # the author's own helper module returning random request headers (a sketch is given below)

index = 'http://www.netbian.com' # root URL of the site
interval = 10 # interval, in seconds, between image downloads
firstDir = 'D:/zgh/Pictures/netbian' # root directory for the saved images
classificationDict = {} # info about the category sub-pages of the site
  • index : the root URL of the site; the code needs it to stitch together the complete url of each image
  • interval : when we crawl a site, we must consider the load its server can bear; grabbing a large amount of content in a short time puts great pressure on the server, so we set an interval between requests. Unit: seconds.
    Because I want to crawl the high-resolution images of the whole site, concentrated crawling would both put enormous pressure on the web server and likely get my connections forcibly closed, so I set the interval between pictures to 10 seconds. If you only crawl a few pictures, you can set a much shorter interval.
  • firstDir : the root path on your computer where the crawled pictures are stored; under it, the code generates folders per category and, below those, per page number
  • classificationDict : maps each category on the site to the URL of its sub-page and the path of its folder

Step 2: Get the filtered content list of a page

Write a function that returns the filtered element list of a page.

  • Two parameters are passed in:
    url: the URL of the web page
    select: a CSS selector (it docks seamlessly with CSS selector syntax, which I like very much; it locates the corresponding elements in the HTML)
  • Returns a list
def screen(url, select):
    html = requests.get(url = url, headers = UserAgent.get_headers()) # pick a random set of request headers
    html.encoding = 'gbk'
    html = html.text
    soup = BeautifulSoup(html, 'lxml')
    return soup.select(select)
  • headers pretends the request comes from a normal browser visiting the site; to improve the crawler's success rate, a random set of headers is chosen for every page request (UserAgent here is the author's own helper module; a sketch follows below)
  • encoding : the encoding of the website, gbk
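
UserAgent is not a pip package here but the author's own helper module, whose implementation is not shown in this post. A minimal stand-in that satisfies the get_headers() call above could look like this (the user-agent strings are illustrative placeholders, not values from the original):

# UserAgent.py - a minimal stand-in for the author's helper module (assumed implementation)
import random

# a small pool of common browser user-agent strings (example values)
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15',
]

def get_headers():
    # return a headers dict with a randomly chosen user agent
    return {'User-Agent': random.choice(user_agents)}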

Step 3: Get the URLs of all categories

# Store the category sub-page info in a dictionary
def init_classification():
    url = index
    select = '#header > div.head > ul > li:nth-child(1) > div > a'
    classifications = screen(url, select)
    for c in classifications:
        href = c.get('href') # this is a relative address
        text = c.string # the category name
        if(text == '4k壁纸'): # the 4k category cannot be crawled due to permission issues, skip it
            continue
        secondDir = firstDir + '/' + text # directory for this category
        url = index + href # URL of the category sub-page
        global classificationDict
        classificationDict[text] = {
            'path': secondDir,
            'url': url
        }
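
After init_classification runs, each entry of classificationDict holds the folder path and sub-page URL of one category. For example, the weimei category used below would produce an entry roughly like this (assuming its link text on the site is 唯美 and firstDir as configured above):

# illustrative entry, not copied verbatim from the site
# classificationDict['唯美'] == {
#     'path': 'D:/zgh/Pictures/netbian/唯美',
#     'url': 'http://www.netbian.com/weimei/'
# }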

In the code that follows, I will use the wallpapers under the weimei category to explain how to crawl the high-definition pictures through the two jumps.

Step 4: Get the URLs of all the pages under a category page

Most categories have more than 6 pages, so you could directly use the screen function defined above with select set to div.page a; the sixth element of the list returned by screen would then give the number of the last page we need.

However, some categories have fewer than 6 pages, so we write a separate filter function that reaches the page count through a sibling element.

# Get the page count
def screenPage(url, select):
    html = requests.get(url = url, headers = UserAgent.get_headers())
    html.encoding = 'gbk'
    html = html.text
    soup = BeautifulSoup(html, 'lxml')
    return soup.select(select)[0].next_sibling.text

Get the URL of every page under the category page:

url = 'http://www.netbian.com/weimei/'
select = '#main > div.page > span.slh'
pageIndex = screenPage(url, select)
lastPagenum = int(pageIndex) # the number of the last page
for i in range(lastPagenum):
    if i == 0:
        url = 'http://www.netbian.com/weimei/index.htm'
    else:
        url = 'http://www.netbian.com/weimei/index_%d.htm' %(i+1)
    # each per-page url is then consumed by Steps 5 and 6

Since the HTML structure of the site is very consistent, the code comes out short and clear.

Step 5: Get the URL of the picture under the page

Inspection shows that the obtained URL is a relative address, which needs to be converted into an absolute address.

select = 'div#main div.list ul li a'
imgUrls = screen(url, select)

The values in the list obtained by these two lines of code look like this:

<a href="/desk/21237.htm" target="_blank" title="星空 女孩 观望 唯美夜景壁纸 更新时间:2019-12-06"><img alt="星空 女孩 观望 唯美夜景壁纸" src="http://img.netbian.com/file/newc/e4f018f89fe9f825753866abafee383f.jpg"/><b>星空 女孩 观望 唯美夜景壁纸</b></a>
  • The obtained list needs further processing
  • Take the href attribute value of each a tag and convert it into an absolute address; that is the URL needed for the first jump

Step 6: Locate the 1920 × 1080 resolution picture

# Locate the 1920 × 1080 resolution picture
def handleImgs(links, path):
    for link in links:
        href = link.get('href')
        if(href == 'http://pic.netbian.com/'): # filter out the image advertisement
            continue

        # first jump
        if('http://' in href): # a very few pictures do not provide a correct relative address
            url = href
        else:
            url = index + href
        select = 'div#main div.endpage div.pic div.pic-down a'
        link = screen(url, select)
        if(link == []):
            print(url + ' no such picture, crawl failed')
            continue
        href = link[0].get('href')

        # second jump
        url = index + href

        # we have reached the picture
        select = 'div#main table a img'
        link = screen(url, select)
        if(link == []):
            print(url + " this picture requires login, crawl failed")
            continue
        name = link[0].get('alt').replace('\t', '').replace('|', '').replace(':', '').replace('\\', '').replace('/', '').replace('*', '').replace('?', '').replace('"', '').replace('<', '').replace('>', '')
        print(name) # print the file name of the picture being downloaded
        src = link[0].get('src')
        if(requests.get(src).status_code == 404):
            print(url + ' download link returned 404, crawl failed')
            print()
            continue
        print()
        download(src, name, path)
        time.sleep(interval)

Step 7: Download pictures

The download operation:

def download(src, name, path):
    if(isinstance(src, str)):
        response = requests.get(src)
        path = path + '/' + name + '.jpg'
        while(os.path.exists(path)): # if the file name is already taken, append a random digit
            base, ext = path.rsplit('.', 1) # rsplit, so a dot inside the name cannot truncate the path
            path = base + str(random.randint(2, 17)) + '.' + ext
        with open(path,'wb') as pic:
            for chunk in response.iter_content(128):
                pic.write(chunk)
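
One design note: requests.get(src) as written buffers the whole image in memory, and iter_content then merely replays it in 128-byte chunks. requests can also stream the body from the network via its stream parameter; a small variant of the same loop:

# variant: stream the response so chunks are read from the network as they are written to disk
response = requests.get(src, stream=True)
with open(path, 'wb') as pic:
    for chunk in response.iter_content(1024): # a larger chunk size also means fewer write calls
        pic.write(chunk)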

Part 5: Fault Tolerance of the Code

One: Filter out the image advertisement

if(href == 'http://pic.netbian.com/'): # filter out the image advertisement
    continue

Two: After the first jump, the page may not contain the link we need

On the netbian wallpaper site, the link for the first jump is normally given as a relative address.

However, a very few pictures give an absolute address directly, or give the category URL instead, so two checks are needed:

if('http://' in href):
    url = href
else:
    url = index + href

...

if(link == []):
    print(url + ' no such picture, crawl failed')
    continue

The following problems occur on the page reached by the second jump.

Three: The picture cannot be crawled due to permission issues

if(link == []):
    print(url + " this picture requires login, crawl failed")
    continue

Four: When the alt of the img is used as the file name of the downloaded image, the name may carry \t or special characters that are not allowed in file names:

  • in Python, '\t' is an escape sequence: the tab character
  • on Windows, a file name cannot contain any of these 9 special characters: \ / : * ? " < > |

name = link[0].get('alt').replace('\t', '').replace('|', '').replace(':', '').replace('\\', '').replace('/', '').replace('*', '').replace('?', '').replace('"', '').replace('<', '').replace('>', '')
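
The chain of replace calls works but is verbose; a more compact equivalent (my own variant, not from the original post) strips the same characters in one pass with a regular expression:

import re

# remove tab plus the Windows-forbidden characters \ / : * ? " < > | in a single pass
name = re.sub(r'[\t\\/:*?"<>|]', '', link[0].get('alt'))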

Five: When the alt of the img is used as the file name of the downloaded image, the name may repeat

path = path + '/' + name + '.jpg'
while(os.path.exists(path)): # if the file name is already taken, append a random digit
    base, ext = path.rsplit('.', 1)
    path = base + str(random.randint(2, 17)) + '.' + ext

Six: The picture's download link returns 404

if(requests.get(src).status_code == 404):
    print(url + ' download link returned 404, crawl failed')
    print()
    continue

Part 6: Complete Code
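
The full script is in the original post linked below. As a sketch of how the pieces above fit together (assuming each category URL ends with a slash and that the span.slh page-count trick works for every category):

# a minimal driver tying the steps together (a sketch, not the author's complete script)
def main():
    init_classification() # Step 3: fill classificationDict
    for classification, info in classificationDict.items():
        url = info['url']
        lastPagenum = int(screenPage(url, '#main > div.page > span.slh')) # Step 4
        for i in range(lastPagenum):
            pageUrl = url + ('index.htm' if i == 0 else 'index_%d.htm' % (i + 1))
            path = info['path'] + '/' + str(i + 1) # one folder per page number
            if not os.path.exists(path):
                os.makedirs(path)
            links = screen(pageUrl, 'div#main div.list ul li a') # Step 5
            handleImgs(links, path) # Step 6: jump twice, then download

if __name__ == '__main__':
    main()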

The original blog post: https://blog.csdn.net/Zhangguohao666/article/details/105131503

Origin: blog.csdn.net/My_daily_life/article/details/109223587