Ajax Analysis: Scraping Street-Snap Photos from Toutiao

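As we know, the raw data returned by a plain requests call is sometimes useless: the page source fetched this way may be only a few lines long and is clearly not what we want. When that happens, it is worth checking whether the page relies on Ajax requests. In that case, once the initial page has loaded, it calls a server interface again to fetch the data, and only then is the data rendered on the page.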

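Ajax is not a language but a technique. It uses JavaScript to exchange data with the server and render the result into the page without refreshing it. When you scroll down in Weibo on your phone and a small loading spinner appears, that is an Ajax request at work.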
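One quick way to tell whether a page falls into this category is to check whether text you can see in the browser is present in the raw HTML at all. The sketch below runs on two hand-written HTML strings rather than on a real response, so it only illustrates the idea; both strings and the helper name `appears_in_source` are hypothetical.

```python
def appears_in_source(html, visible_text):
    """Return True if text seen in the browser is already in the raw HTML."""
    return visible_text in html


# Hypothetical page source of an Ajax-driven page: an empty shell plus a script.
shell_html = (
    '<html><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)

# Hypothetical page source of a page rendered on the server.
static_html = '<html><body><h1>街拍</h1></body></html>'

if __name__ == '__main__':
    # If the text is missing from the source, the data most likely arrives
    # later through a separate request made by JavaScript.
    print(appears_in_source(shell_html, '街拍'))
    print(appears_in_source(static_html, '街拍'))
```

With a real page, the first argument would be the `text` attribute of the response returned by `requests.get`. A missing match is a hint to look for the data request in the browser's developer tools.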

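So to scrape this kind of page, we need to understand how Ajax works. Before scraping, make sure the necessary libraries are installed (requests is the only third-party one used here), then import them.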

import requests
from urllib.parse import urlencode
from requests import codes
import os
from hashlib import md5
from multiprocessing.pool import Pool


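# Open Toutiao in the browser, open the developer tools, and find the Ajax request: its URL carries the parameters below.
# Scrolling down the page triggers further requests in which only offset changes; every other parameter stays the same.
# offset is the position of the first result to return, while count is the number of results per page, so offset is the one argument we pass in.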
def get_page(offset):
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab'
    }
    base_url = 'https://www.toutiao.com/search_content/?'
    # Use the newly built URL as the request target
    url = base_url + urlencode(params)
    try:
        resp = requests.get(url)
        if codes.ok == resp.status_code:
            return resp.json()
    except requests.ConnectionError:
        return None


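# Next we define a generator function that extracts each item's image URLs and title and yields them together.
def get_images(json):
    # get_page returns None when the request fails, so guard against that too
    if json and json.get('data'):
        data = json.get('data')
        for item in data:
            title = item.get('title')
            # item.get returns None when image_list is absent; fall back to an empty list
            images = item.get('image_list') or []
            for image in images:
                yield {
                    'image': 'https:' + image.get('url'),
                    'title': title
                }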


# Here we define a function that saves the data. It uses the os module to create a folder named after the image title, requests the image URL, and writes the binary content to disk.
# The file name is the MD5 digest of the image content (a hash, not encryption), so an identical image is never saved twice.
def save_image(item):
    if not os.path.exists(item.get('title')):
        os.makedirs(item.get('title'))
    try:
        resp = requests.get(item.get('image'))
        if codes.ok == resp.status_code:
            file_path = '{0}/{1}.{2}'.format(item.get('title'), md5(resp.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image, item %s' % item)


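# The main function handles a single offset: it fetches that page and saves every image found in it. The list of offsets is built further down, and each one is passed to this function.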
def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)


GROUP_START = 0
GROUP_END = 20

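# A process pool is used here, and its map method distributes the offsets across the worker processes. pool.close() stops new tasks from being submitted, and pool.join() waits for all child processes to finish before continuing, which marks the end of the whole crawl.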
if __name__ == '__main__':
    pool = Pool()
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()
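
The two ideas the script leans on, paging through results by changing only offset and naming each file after the MD5 digest of its bytes, can be tried offline. The following is a minimal sketch and not part of the original script: `build_url` and `save_once` are hypothetical helper names, the temporary directory and the fake image bytes are assumptions made for illustration, and no network request is sent.

```python
import os
import tempfile
from hashlib import md5
from urllib.parse import urlencode

BASE_URL = 'https://www.toutiao.com/search_content/?'


def build_url(offset, count=20):
    """Build the URL for one page of results (hypothetical helper)."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': count,
        'cur_tab': '1',
        'from': 'search_tab',
    }
    return BASE_URL + urlencode(params)


def save_once(folder, content, ext='jpg'):
    """Save bytes under their MD5 digest; return (path, newly_written)."""
    os.makedirs(folder, exist_ok=True)
    name = '{0}.{1}'.format(md5(content).hexdigest(), ext)
    path = os.path.join(folder, name)
    if os.path.exists(path):
        # Same bytes give the same digest, so this content was saved before.
        return path, False
    with open(path, 'wb') as f:
        f.write(content)
    return path, True


if __name__ == '__main__':
    # Only the offset value differs between consecutive page URLs.
    for offset in (0, 20, 40):
        print(build_url(offset))

    # Saving the same bytes twice writes the file only once.
    with tempfile.TemporaryDirectory() as tmp:
        fake_image = b'not a real jpeg'  # stand-in for resp.content
        print(save_once(tmp, fake_image))
        print(save_once(tmp, fake_image))
```

Because the file name depends only on the content, running the crawler again over pages it has already visited skips the images that are already on disk.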
