Download all the exquisite charts from NetEase Digital Reading with a hundred lines of Python!


NetEase Digital Reading is a data-journalism visualization column that aims for a lightweight reading experience. It usually ties its content to current news topics, visualizes the relevant data, and presents it as polished graphics.

Here is an example to give a feel for its style:

Image source: NetEase Digital Reading

 

Xiaobencong found their charts attractive, clear and original, and wanted to download all of them to study. Downloading them by hand one at a time is a lot of trouble. Well, life is short, I use Python!

Downloading a single picture is simple: send a GET request with the requests library, then use the content attribute of the Response object to save the picture in binary form. The following 5 lines of code do it:

import requests
url = 'http://cms-bucket.ws.126.net/2019/02/02/81b9ebced7514e66b4e969bab19af69c.png'
response = requests.get(url)
with open('2018百家姓.jpg', 'wb') as f:
    f.write(response.content)
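
One detail in the snippet above: the URL ends in .png while the file is saved as .jpg. A minimal sketch of taking the extension from the URL instead is shown here; filename_for is a hypothetical helper, not part of the original code, and it assumes the URL path carries a usable extension.

```python
import os
from urllib.parse import urlparse


def filename_for(url, stem, default_ext='.jpg'):
    """Return stem plus the extension found in the URL path (hypothetical helper)."""
    ext = os.path.splitext(urlparse(url).path)[1]
    # Fall back to default_ext when the URL path has no extension
    return stem + (ext or default_ext)


url = 'http://cms-bucket.ws.126.net/2019/02/02/81b9ebced7514e66b4e969bab19af69c.png'
name = filename_for(url, '2018百家姓')  # '2018百家姓.png'
```

Passing name to open() in the five-line snippet would then write a .png file.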

 

This is one of the first things you learn with the requests library, and by changing the url you can download any picture. But our goal is to download all the pictures from NetEase Digital Reading. How do we write that?

 

1. Get the web content with requests

The requests library is a powerful crawling tool in Python and has featured in Xiaobencong's earlier articles. For those who want to learn it quickly, here are two recommended links:

https://cuiqingcai.com/5517.html (Cui Qingcai's personal blog)

https://2.python-requests.org//zh_CN/latest/index.html (Official document)

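import requests
from requests.exceptions import RequestException

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}


def get_page_index():
    """Fetch the column's index page; return its HTML, or None on failure."""
    url = 'http://data.163.com/special/datablog/'
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # print(response.text)  # check that the page content was fetched
            return response.text
    except RequestException:
        print('Page request failed')
    # Reached on a request error or a non-200 status
    return None


def get_page_detail(item):
    """Fetch one article page from an index item; return its HTML, or None on failure."""
    url = item.get('url')
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # print(url)  # check the url
            return response.text
    except RequestException:
        print('Page request failed')
    # Reached on a request error or a non-200 status
    return None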
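
Because get_page_index and get_page_detail return None on failure, a small retry wrapper can be placed around them. This is only a sketch: with_retries, its retries and delay parameters, and their default values are assumptions and do not appear in the original code.

```python
import time


def with_retries(fetch, retries=3, delay=1.0):
    """Call fetch() until it returns something other than None, at most `retries` times."""
    for attempt in range(1, retries + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < retries:
            time.sleep(delay)  # short pause before the next attempt
    return None
```

It could be used as with_retries(get_page_index), or as with_retries(lambda: get_page_detail(item)) for an article page.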

2. Parse the web content

The functions above return the html content. The next step is to parse the html string and extract the image urls from the page. There are five common ways to parse a page and extract urls: regular expressions, XPath, BeautifulSoup, CSS selectors and PyQuery. Here Xiaobencong uses a regular expression for the index page and BeautifulSoup for the article pages.

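import re

from bs4 import BeautifulSoup  # the 'lxml' parser used below also needs the lxml package


def parse_page_index(html):
    """Yield one dict per article found in the index page."""
    pattern = re.compile(r'"url":"(.*?)".*?"title":"(.*?)".*?"img":"(.*?)".*?"time":"(.*?)".*?"comment":(.*?),', re.S)
    items = re.findall(pattern, html)
    # print(items)
    for item in items:
        yield {
            'url': item[0],
            'title': item[1],
            'img': item[2],
            'time': item[3],
            'comment': item[4][1:-1]  # drop the first and last character of the captured value
        }


def parse_page_detail2(html):
    """Yield one dict per picture found in an article page."""
    soup = BeautifulSoup(html, 'lxml')
    items = soup.select('p > a > img')
    # print(len(items))
    title = soup.h1.string
    for i, img in enumerate(items):
        pic = img.attrs['src']
        yield {
            'title': title,
            'pic': pic,
            'num': i  # number the pictures in page order
        }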
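
To see what the regular expression in parse_page_index captures, it can be run on a small string. The sample below is made up for illustration and is not copied from the real index page, whose exact layout is not shown in this article.

```python
import re

# Same pattern as in parse_page_index above
PATTERN = re.compile(
    r'"url":"(.*?)".*?"title":"(.*?)".*?"img":"(.*?)".*?"time":"(.*?)".*?"comment":(.*?),',
    re.S,
)

# Made-up sample data (an assumption, NOT the real page content)
sample = (
    '{"url":"http://data.163.com/sample.html",'
    '"title":"Sample article",'
    '"img":"http://example.com/cover.png",'
    '"time":"2019-02-02 10:00:00",'
    '"comment":"12",}'
)

rows = PATTERN.findall(sample)
first = rows[0]
# The last group still includes its quotes, hence the [1:-1] slice
comment = first[4][1:-1]
```

For this sample, rows holds one 5-tuple and comment is the string 12.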

3. Download and save the picture

Each extracted item is a dict, so its values are read with the dict get method. A folder named after the article title is then created to store the numbered pictures.

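import os
import re

import requests
from requests.exceptions import RequestException


def save_pic(pic):
    title = pic.get('title')
    # Replace characters that Windows does not allow in folder names
    title = re.sub(r'[\\/:*?"<>|]', '-', title)
    url = pic.get('pic')
    # Sequence number of the picture within the article
    num = pic.get('num')

    if not os.path.exists(title):
        os.mkdir(title)

    try:
        # Request the picture; kept inside try so request errors are caught
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # Build the path of the picture file
            file_path = os.path.join(title, '{0}.{1}'.format(num, 'jpg'))
            # Files are named by number so they can be viewed in order,
            # rather than by the hash md5(response.content).hexdigest()
            if not os.path.exists(file_path):
                # Start downloading the picture
                with open(file_path, 'wb') as f:
                    f.write(response.content)
                print('Picture downloaded:', title)
            else:
                print('Picture %s already downloaded' % title)
    except RequestException as e:
        print(e, 'Failed to get the picture')
    return None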
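
The folder and file naming used by save_pic can be tried on its own, without any network access. The two helpers below, safe_folder_name and picture_path, are hypothetical names introduced for this sketch. It assumes the goal is to replace the characters Windows forbids in names and to join paths with the separator of the current operating system.

```python
import os
import re


def safe_folder_name(title):
    """Replace characters Windows forbids in file and folder names with '-'."""
    return re.sub(r'[\\/:*?"<>|]', '-', title)


def picture_path(title, num, ext='jpg'):
    """Build '<folder><sep><num>.<ext>' using the path separator of the current OS."""
    return os.path.join(safe_folder_name(title), '{0}.{1}'.format(num, ext))
```

For instance, safe_folder_name('A/B: what?') gives 'A-B- what-', so an article title containing a slash or a question mark no longer breaks os.mkdir.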

4. Download results

Here is a quick look:

Image source: NetEase Digital Reading

The above is the whole process of crawling the pictures from NetEase Digital Reading.

Reply " NetEase digital reading " in the WeChat official account " financial learner who learns programming " to get the source code. (Commercial use is prohibited; anyone who does so bears the consequences. Pictures used in this article will be deleted if they infringe any rights.)
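
The article does not show the main function that ties the steps together. One possible way to wire them is sketched below; crawl and its parameter names are assumptions, and the five step functions are passed in as arguments so that the sketch stays self-contained.

```python
def crawl(fetch_index, parse_index, fetch_detail, parse_detail, save):
    """Run the steps in order and return how many pictures were handed to save()."""
    index_html = fetch_index()
    if index_html is None:
        return 0
    count = 0
    for item in parse_index(index_html):
        detail_html = fetch_detail(item)
        if detail_html is None:
            continue  # skip articles whose page could not be fetched
        for pic in parse_detail(detail_html):
            save(pic)
            count += 1
    return count
```

With the functions from this article, the call would be crawl(get_page_index, parse_page_index, get_page_detail, parse_page_detail2, save_pic).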

https://mp.weixin.qq.com/s?__biz=MzI1NzY0MTY3MA==&mid=2247484175&idx=1&sn=84ff6105f75169ea15b596c4a231f280&chksm=ea151ea6dd6297b0b022002c6af735cf2e5557faf8079d60372266f0f8b2366c83bb4ec77620&token=1892740331&lang=zh_CN#rd

 



Origin blog.csdn.net/weixin_39270299/article/details/90082565