Crawling WallpaperUp wallpapers with Python and beautifulsoup4


beautifulsoup4 is commonly used to crawl web pages. I didn't want to install wallpaper software, but I still wanted my desktop wallpaper to change automatically, so I wrote this crawler.

Analyze the site to be crawled

WallpaperUp greets you with a search box; searching without a keyword evidently returns all pictures. Below the search box there are various filter options, which will help in analyzing the search result URLs next.

URL analysis

  1. Searching with no keyword and no filters gives the URL https://www.wallpaperup.com/most/popular .
  2. Add filters one at a time. Specifying only the resolution gives https://www.wallpaperup.com/resolution/2560/1440 ; adding a 16:9 aspect ratio changes the URL to https://www.wallpaperup.com/search/results/ratio:1.78+resolution:2560x1440 . From this it appears that results are sorted by popularity by default, and that the URL for multiple filters follows the pattern https://www.wallpaperup.com/search/results/{key1}:{value1}+{key2}:{value2} .
  3. Point 2 is only speculation, so confirm that the same URL format also works with a single filter: https://www.wallpaperup.com/search/results/resolution:2560x1440 is indeed accessible.
  4. The crawler must crawl more than one page. Turning to the next page appends "/2" to the URL, so the URL for page x of the search results should be ${search_url}/x.

At this point, the first stage of URL analysis comes to an end.
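The inferred URL scheme can be sketched as a small helper. This is only an illustration of the pattern deduced above; `search_url` is a hypothetical function name, and the filter values are the examples from the analysis.

```python
# Sketch of the inferred search-URL scheme; key/value pairs are examples.
BASE = "https://www.wallpaperup.com/search/results/"

def search_url(filters, page):
    """Join key:value filter pairs with '+' and append the page number."""
    pairs = "+".join(f"{k}:{v}" for k, v in filters.items())
    return f"{BASE}{pairs}/{page}"

print(search_url({"ratio": "1.78", "resolution": "2560x1440"}, 1))
# https://www.wallpaperup.com/search/results/ratio:1.78+resolution:2560x1440/1
```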

Analyze the content to be crawled

  1. The content to be crawled is the pictures in the search results, so right-click any picture and select Inspect. The complete HTML element for one picture is:
<div class="thumb-adv " data-ratio="1.7777777777778" style="width: 513px; height: 288.562px; top: 0px; left: 0px;"><figure class="black"><a href="/9024/Clouds_cityscapes_architecture_buildings_skyscrapers.html" title="View wallpaper" class="thumb-wrp" style="height:0;padding-bottom:56.325301204819%;"><img width="2560" height="1440" class="thumb black    lazy " data-src="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg" data-srcset="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-500.jpg 889w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-375.jpg 667w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-250.jpg 444w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg 332w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-125.jpg 222w" alt="Clouds cityscapes architecture buildings skyscrapers wallpaper" data-wid="9024" data-group="gallery" sizes="513px" srcset="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-500.jpg 889w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-375.jpg 667w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-250.jpg 444w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg 332w,https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-125.jpg 222w" src="https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg"></a><figcaption class="attached-bottom on-hover compact"><div class="sections center-y"><div class="section-left"><div class="button manage compact xsmall transparent" 
title="Manage"><i class="icon"></i></div></div><span class="section-center forced" title="Resolution">2560x1440</span><div class="section-right"><div class="favoriter subscriber no-remote-state no-label multiple joined no-separators center-x" data-state="null" data-state-batch-url="/favorite/get_states_batch/9024+42467+248989+15719+218610+12857+104187+52269+100670+667839+234960+172181+917952+19472+104183+156805+243384+54956+232856+9634+67838+189500+172180+675992" data-state-url="/favorite/get_state/9024" data-url="/favorite/do_toggle/9024"><div title="Add to favorites" class="toggle button bordered compact xsmall transparent" data-text-on="Favorite" data-text-off="Favorite" data-icon=""></div><div class="button bordered compact xsmall transparent remote-modal-trigger" data-url="/favorite/move_modal/9024" title="Move favorite"><i class="icon"></i></div></div></div></div></figcaption></figure></div>
The figcaption block can be ignored. Pay attention to two places. One is the href attribute */9024/Clouds_cityscapes_architecture_buildings_skyscrapers.html* in the a tag; clicking it leads to the image's detail page. The other is data-src in the img tag: its value ends in .jpg, so it may already be the URL we want. Opening [https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg](https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg) shows only a thumbnail. Looking at the file name ce58524d13c01b1f347affc343de0a91-187.jpg, a bold guess: the hexadecimal string is the real file name, and the odd "-187" suffix marks the thumbnail size. Delete the "-" and everything after it and visit [https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91.jpg](https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91.jpg); this turns out to be the original image.
  2. Analyze the image detail page. As in step 1, focus on two places on this page: the picture itself and the download button. Inspecting the download button shows that it points to a redirect rather than a direct link, so it is not worth crawling. Inspecting the picture itself shows an img tag (alt text "Clouds cityscapes architecture buildings skyscrapers wallpaper") whose data-original attribute, [https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91.jpg](https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91.jpg), is exactly the content to crawl.
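The thumbnail-to-original observation can be checked mechanically. A minimal sketch, assuming the "-number" suffix always sits directly before the extension:

```python
import re

thumb = ("https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/"
         "ce58524d13c01b1f347affc343de0a91-187.jpg")
# drop the trailing "-<number>" thumbnail marker before the extension
original = re.sub(r"-\d+(?=\.jpg$)", "", thumb)
print(original)
# https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91.jpg
```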

Develop a crawling strategy

Based on the content analysis, two crawling strategies are possible:

  1. Extract every data-src on the search results page and strip the "-number" suffix; the results are the original-image URLs.
  2. Record all the detail-page URLs first, then extract data-original from each detail page.

Code

Generate url

Use a dictionary to record the filter items and their values:

        # set the filter options
        option = {
            'cats_ids': '',  # category; set to the ID of the desired category
            # 'cats_ids': '1',  # example

            'license': '',  # license; set to the ID of the desired license
            # 'license': '1',  # example

            'ratio': '',  # aspect ratio
            # 'ratio': str(round(16/9, 2)),  # example: 16:9

            # 'resolution_mode': '',  # resolution filter mode
            'resolution_mode': ':',  # "at least"
            # 'resolution_mode': ':=:',  # "exactly"

            # 'resolution': '',  # resolution
            'resolution': '2560x1440',  # example: 2560x1440

            'color': '',
            # 'color': '#80c0e0',  # color; set to the desired color code

            'order': '',  # sort order; change as needed
        }

When generating URLs based on the dictionary, two special points should be noted:

  1. When filtering by resolution there are two modes, exactly and at least; the difference when generating URLs lies only in the operator between the resolution key and its value. Compare https://www.wallpaperup.com/search/results/resolution:2560x1440+order:date_added/1 (at least) with https://www.wallpaperup.com/search/results/resolution:=:2560x1440+order:date_added/1 (exactly).
  2. The aspect ratio is a computed value; for example, 16:10 gives 1.6.
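The computed ratio values used in the URLs can be reproduced in one line each; a trivial check of the rounding:

```python
# the ratio filter value is width/height rounded to two decimals
print(round(16 / 9, 2))   # 1.78
print(round(16 / 10, 2))  # 1.6
```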

Generate the search result URL from the dictionary:

        url = 'https://www.wallpaperup.com/search/results/'
        parts = []
        for key, value in option.items():
            if value == '' or key == 'resolution_mode':
                continue  # skip empty filters; the mode is merged into 'resolution'
            if key == 'resolution':
                # insert the mode operator (':' or ':=:') between key and value
                parts.append(key + option['resolution_mode'] + value)
            else:
                parts.append(key + ':' + value)
        url += '+'.join(parts) + '/' + str(page_num)

Crawl images

For the first strategy, simply parse out each data-src and apply a regular-expression replacement; no further detail is needed.
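A minimal sketch of the first strategy using only the standard library; the single-image HTML string is a shortened example of the markup shown earlier:

```python
import re

html = ('<img class="thumb" data-src="https://www.wallpaperup.com/uploads/'
        'wallpapers/2012/08/05/9024/ce58524d13c01b1f347affc343de0a91-187.jpg">')

# pull every data-src value, then strip the "-<number>" thumbnail suffix
thumbs = re.findall(r'data-src="([^"]+)"', html)
originals = [re.sub(r"-\d+(?=\.jpg$)", "", t) for t in thumbs]
print(originals[0])
```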
For the second strategy, the code to obtain the detail-page URL of every search result is:

        r_url = requests.get(url, headers=headers)
        soup_a = BeautifulSoup(r_url.text, 'lxml')
        r_url.close()  # close the connection so later requests are not refused

        wallpaper_pages = []
        for link in soup_a.find_all(attrs={'title': "View wallpaper"}):
            wallpaper_pages.append(
                'https://www.wallpaperup.com' + str(link.attrs['href']))

Get the data-original of each detail page:

        wallpaper_images_link = []
        for page in wallpaper_pages:
            r_page = requests.get(page, headers=headers)
            soup_page = BeautifulSoup(r_page.text, 'lxml')
            r_page.close()
            img = soup_page.find(attrs={'class': 'thumb-wrp'}).img
            wallpaper_images_link.append(str(img.attrs['data-original']))

Download the pictures locally:

            r_image_link = requests.get(image_link, headers=headers)
            # build a file name like 20120805_9024.jpg from the upload path
            filename = str(re.compile(r"\d{4}/\d{2}/\d{2}/\d+").findall(
                image_link)[0]).replace("/", "", 2).replace("/", "_") + ".jpg"
            print("filename=", filename)
            with open(wallpapers_folder + filename, 'wb+') as f:
                f.write(r_image_link.content)
            r_image_link.close()
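To see what the file-name logic produces, here is a worked example on the URL from the earlier analysis (the date path "2012/08/05/9024" loses its first two slashes and the last becomes an underscore):

```python
import re

link = ("https://www.wallpaperup.com/uploads/wallpapers/2012/08/05/9024/"
        "ce58524d13c01b1f347affc343de0a91.jpg")
# extract "2012/08/05/9024", then reshape it into "20120805_9024"
name = re.compile(r"\d{4}/\d{2}/\d{2}/\d+").findall(link)[0] \
    .replace("/", "", 2).replace("/", "_") + ".jpg"
print(name)  # 20120805_9024.jpg
```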

At this point, you can obtain all the images on one search results page; to crawl multiple pages, simply wrap the above in one outer loop over the page number.
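The outer paging loop can be sketched as a small generator; `page_urls` is a hypothetical helper, and each yielded URL would then be fed through the crawling steps above.

```python
def page_urls(search_url, pages):
    """Yield the URL of each results page, 1-indexed, per the '/x' scheme."""
    for page_num in range(1, pages + 1):
        yield f"{search_url}/{page_num}"

for u in page_urls("https://www.wallpaperup.com/most/popular", 3):
    print(u)
```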

Environment

Python version: Python 3
Dependencies: bs4, requests, lxml

Origin blog.csdn.net/u013943146/article/details/118702425