Python crawler tutorial for beginners (10): crawling 4K ultra-HD wallpapers from Netbian (彼岸, netbian.com)

Preface

The text and pictures in this article come from the Internet and are for learning and communication purposes only; they are not for commercial use. If you have any questions, please contact us so we can deal with them.

Python crawler, data analysis, website development and other case tutorial videos are free to watch online

https://space.bilibili.com/523606542

Previous articles

Python crawler tutorial for beginners (1): crawling Douban movie ranking information

Python crawler tutorial for beginners (2): crawling novels

Python crawler tutorial for beginners (3): crawling Lianjia second-hand housing data

Python crawler tutorial for beginners (4): crawling 51job.com recruitment information

Python crawler tutorial for beginners (5): crawling Bilibili video danmaku (bullet comments)

Python crawler tutorial for beginners (6): making word cloud diagrams

Python crawler tutorial for beginners (7): crawling Tencent Video danmaku (bullet comments)

Python crawler tutorial for beginners (8): crawling forum articles and saving them as PDF

Python crawler tutorial for beginners (9): a multi-threaded crawler case study

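Basic development environment

  • Python 3.6
  • PyCharm

Modules used

  • re
  • requests
  • parsel

Install Python and add it to the PATH environment variable, then install the required third-party modules (requests and parsel, both used by the code below) with pip. The re module is part of the standard library.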

1. Define the requirements

The wallpapers on Netbian (彼岸) are, in my opinion, really beautiful. Although they can be downloaded for free, readers who can afford it may still want to pay. It is not expensive: 30 yuan buys unrestricted downloads from the whole site.

 

2. Web page data analysis

Click through from a picture on the list page and you will find that the URL of the details page is composed of the picture ID and the picture resolution.
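
As a minimal sketch of that URL structure, the helper below builds a details-page URL from a picture ID and a resolution. The function name build_detail_url and its default argument are illustrative choices; the URL pattern itself is the one used in this article's get_image_url function, and 23177 is the sample ID that appears in the code comments.

```python
def build_detail_url(image_id, resolution='1920x1080'):
    """Build the details-page URL from a picture ID and a resolution string."""
    return f'http://www.netbian.com/desk/{image_id}-{resolution}.htm'


print(build_detail_url('23177'))
# http://www.netbian.com/desk/23177-1920x1080.htm
```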

 


Click again and you can see the real address of the picture. If you only want a single wallpaper, just save the picture from here and you have a high-definition wallpaper.

If you want to crawl the pictures, the link to the picture is available in the details page.

So you only need to get the picture IDs from the list page. The resolution we crawl is always 1920x1080, so that part of the URL is fixed. Just take the image ID and build the URL from it.
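
The article's own code pulls the ID out of the link with str.replace and str.split. As an alternative sketch, the same step can be done with the re module; the helper name extract_image_id is hypothetical, and it assumes list-page links look like the /desk/23177.htm sample shown in the code comments.

```python
import re


def extract_image_id(href):
    """Return the numeric ID from a link such as '/desk/23177.htm', or None."""
    match = re.search(r'/desk/(\d+)\.htm$', href)
    return match.group(1) if match else None


print(extract_image_id('/desk/23177.htm'))  # 23177
print(extract_image_id('/about.htm'))       # None (made-up non-wallpaper link)
```

Returning None for links that do not match makes it easy to skip anything that is not a wallpaper entry.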

 

There is one thing to watch out for here:

Among the entries on each list page, there are two that we don't need.

The entries we need have a title and the entries we don't need have none, so we can filter on that.
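
The title check can be tried out offline before touching the site. The sketch below uses the standard library's html.parser instead of parsel so that it runs without extra installs; the sample markup, the class name TitledLinkParser, and the helper extract_titled_links are all made up for illustration and only imitate the structure described above.

```python
from html.parser import HTMLParser

# Hypothetical, simplified list-page markup: the middle <li> has no title.
SAMPLE_HTML = '''
<div class="list"><ul>
<li><a href="/desk/23177.htm" title="Sample A"><img alt="Sample A"></a></li>
<li><a href="/other/"><img alt=""></a></li>
<li><a href="/desk/23178.htm" title="Sample B"><img alt="Sample B"></a></li>
</ul></div>
'''


class TitledLinkParser(HTMLParser):
    """Collect (title, href) pairs for <a> tags that have a non-empty title."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attributes = dict(attrs)
        title = attributes.get('title')
        if title:  # entries without a title are skipped
            self.links.append((title, attributes.get('href')))


def extract_titled_links(html):
    """Return the (title, href) pairs of all titled links in the HTML."""
    parser = TitledLinkParser()
    parser.feed(html)
    return parser.links


print(extract_titled_links(SAMPLE_HTML))
```

Only the two titled entries are returned; the untitled one in the middle is dropped, which is the same decision the parsel-based code below makes with `if image_title:`.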

3. Code implementation

Get web page data

import requests
import parsel


def get_response(html_url):
    """Fetch a web page and return the response object."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response

Get the ID of each wallpaper

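def main(html_url):
    """Get the ID of each wallpaper on a list page."""
    response = get_response(html_url)
    selector = parsel.Selector(response.text)
    image_info = selector.css('.list ul li')
    for link in image_info:
        image_title = link.css('a::attr(title)').get()
        # Simple check: only entries that have a title are wallpapers
        if image_title:
            id_info = link.css('a::attr(href)').get()
            # e.g. /desk/23177.htm
            image_id = id_info.replace('.htm', '').split('/')[-1]
            # Look up the real image URL and save it (functions defined below)
            image_url = get_image_url(image_id)
            save(image_url, image_title)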

Get the url address of each wallpaper

def get_image_url(image_id):
    """Get the real URL of the image from its details page."""
    page_url = f'http://www.netbian.com/desk/{image_id}-1920x1080.htm'
    response = get_response(page_url)
    selector = parsel.Selector(response.text)
    image_url = selector.css('#endimg a::attr(href)').get()
    return image_url

Save wallpaper

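import os


def save(image_url, title):
    """Download the image and save it as images/<title>.jpg."""
    os.makedirs('images', exist_ok=True)  # create the folder if it is missing
    filename = os.path.join('images', title + '.jpg')
    image_content = get_response(image_url).content
    with open(filename, mode='wb') as f:
        f.write(image_content)
    print('Saving:', title)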
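
Because the file name is built directly from the wallpaper title, a title containing characters such as / or ? would make open() fail on Windows. The article does not mention this case, so the helper below is only a hedged sketch; the name safe_filename and the 'untitled' fallback are assumptions.

```python
import re


def safe_filename(title, replacement='_'):
    """Replace characters that Windows does not allow in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title)
    # Fall back to a placeholder if nothing usable is left
    return cleaned.strip() or 'untitled'


print(safe_filename('a/b:c'))  # a_b_c
```

It could be applied to the title just before the file name is built in save.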

After finishing the code, I ran it and found a problem:

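The pictures are saved, but their titles are garbled, so one line needs to be added to get_response, the function that fetches the web page data, right after the requests.get call:

# General-purpose fix: decode the text with the encoding detected from the content
response.encoding = response.apparent_encoding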
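
To see why the titles come out garbled, here is a small standalone sketch. It assumes the page is GBK-encoded and that the text was decoded as ISO-8859-1, which is the default requests falls back to for text responses whose headers carry no charset. The article does not state the site's actual encoding, so treat GBK and the sample title as assumptions.

```python
title = '风景壁纸'                  # sample Chinese title (illustrative)
raw_bytes = title.encode('gbk')    # bytes as an assumed GBK page would send them

garbled = raw_bytes.decode('iso-8859-1')  # wrong codec: produces mojibake
fixed = raw_bytes.decode('gbk')           # right codec: the original text

print(repr(garbled))
print(fixed)
```

Setting response.encoding to response.apparent_encoding asks requests to decode the text with the encoding it detects from the body rather than with that default.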

After updating the code, the titles are displayed correctly in Chinese, but there is a new problem:

Each title is followed by an update time. Going back to the web page and looking at the source code shows why:

The title attribute of the a tag contains the update time, while the alt attribute of the img tag does not, so we take the title text from the alt attribute instead.

After the change:

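def main(html_url):
    """Get the ID and clean title of each wallpaper on a list page."""
    response = get_response(html_url)
    selector = parsel.Selector(response.text)
    image_info = selector.css('.list ul li')
    for link in image_info:
        image_title = link.css('a::attr(title)').get()
        # Entries without a title are not wallpapers
        if image_title:
            # The img alt text is the title without the update time
            title = link.css('a img::attr(alt)').get()
            id_info = link.css('a::attr(href)').get()
            # e.g. /desk/23177.htm
            image_id = id_info.replace('.htm', '').split('/')[-1]
            # Look up the real image URL and save it under the clean title
            image_url = get_image_url(image_id)
            save(image_url, title)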

 

 

 


The wallpaper files are fairly large. To crawl more of them, just iterate over the list pages by page number. For multi-threaded crawling, refer to the previous article.
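
Turning pages could look roughly like the sketch below. The article does not show the list-page URL pattern, so the index.htm / index_{page}.htm scheme, the base address, and the function name list_page_urls are all assumptions that should be checked against the real site before use.

```python
def list_page_urls(last_page, base='http://www.netbian.com'):
    """Yield list-page URLs for pages 1..last_page (URL pattern is assumed)."""
    for page in range(1, last_page + 1):
        if page == 1:
            yield f'{base}/index.htm'          # assumed first-page URL
        else:
            yield f'{base}/index_{page}.htm'   # assumed pattern for later pages


for url in list_page_urls(3):
    print(url)  # in the real crawler, call main(url) here instead
```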

Origin blog.csdn.net/m0_48405781/article/details/113404843