Crawling Toutiao street-photography images via Ajax, improved: solutions to various pitfalls, including the "data": none problem

This bug is teaching itself web crawling in preparation for big data work. As times change, websites keep strengthening their anti-crawling measures, so the examples in books no longer work out of the box. This bug has made some small improvements to the code along the way, but there are surely still rough spots, and pointers from everyone are very welcome.

Web Analysis:

Open the page, press F12, and switch to the XHR tab (requests loaded by Ajax show up here). Opening these requests one by one and clicking Preview, we find that the first request already carries the data we want, so we click into it and start analyzing.
The data field contains many entries, but entry 0 is clearly not what we want, so we can skip it in the code. Reading on, we find that each remaining entry's image_list holds the URLs of the pictures we are after. Dragging the scroll bar loads more and more requests; clicking into them shows that their structure matches the first one. Comparing their URLs, only the offset parameter differs, so each request is effectively one page, and chaining them together lets us extract all the data we want!
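As a minimal sketch of that offset pattern (keyword and host exactly as they appear in DevTools; the real URLs carry more parameters, as the full request code below shows):

# Successive Ajax requests differ only in their offset parameter,
# stepping by 20 (the page size) each time: 0, 20, 40, ...
base = 'https://www.toutiao.com/api/search/content/?keyword=%E8%A1%97%E6%8B%8D&count=20&offset={}'
for offset in range(0, 60, 20):
    print(base.format(offset))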

Writing the code

Some fellow bugs may have found that requesting the Request URL from code returns "data": none, so no data comes back. Have you also noticed that refreshing the page sometimes pops up an image captcha? That is the crux of it. Add a few more fields to the request headers, such as cookie, referer, and so on, and the request works fine again.
Here we use from urllib.parse import urlencode to construct the URL. To be safer, the timestamp parameter on the URL can be generated directly with int(time.time()). With that, a complete URL can be built, and the request goes through easily!
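As a quick illustration of what urlencode produces (the parameter values here are just for demonstration; the full set appears in the code below):

from urllib.parse import urlencode
import time

params = {'offset': 0, 'keyword': '街拍', 'timestamp': int(time.time())}
url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)
print(url)  # ...?offset=0&keyword=%E8%A1%97%E6%8B%8D&timestamp=<current epoch seconds>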

Note: some fellow bugs will fiddle with the path and parameters of the URL. The request may still return data, but it is no longer the data (page) you were after!

def get_page(offset):
    headers = {
        'cookie': 'tt_webid=6788065855508612621; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6788065855508612621; csrftoken=495ae3a5659fcdbdb78e255464317789; s_v_web_id=k66hcay0_qsRG7emW_x2Qj_4R3o_AeAG_iT4JWmz83jzr; __tasessionId=23dn3qk0f1580738708512',
        'referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest'
    }  # an invisible image captcha appears occasionally; the cookie is the key parameter for solving the data: none problem
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': 20,
        'en_qc': 1,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': int(time.time())  # current timestamp
    }  # building the URL from a params dict is cleaner and easier to modify later
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)
    json = requests.get(url, headers=headers).json()  # 'data' in the result is a list: 0, 1, 2, 3, ...
    return json
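Even with the headers in place, the API can still answer with "data": null once the cookie expires. A minimal retry guard on top of get_page (get_page_safe is a hypothetical helper, not part of the original code):

def get_page_safe(offset, retries=3):
    # hypothetical wrapper: retry a few times when the API answers with "data": null
    for _ in range(retries):
        result = get_page(offset)
        if result and result.get('data'):
            return result
        time.sleep(1)  # back off briefly before trying again
    return {'data': []}  # give up gracefully so get_image() still iterates safely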

This bug spent a lot of effort here and went through many rounds of debugging: several entries in data have no image_list at all, and those entries hold unrelated things such as user information. So we take only the pictures inside image_list, filter out the rest with a conditional statement, and return the results through a generator.

def get_image(json):
    for item in json['data']:
        if 'image_list' in item:  # skip entries that carry no images
            title = item['title']
            for image_urls in item['image_list']:
                image_url = image_urls['url']
                yield {
                    'image': image_url,
                    'title': title
                }
def save_image(content):
    # some titles contain special characters that cannot be used as folder names
    folder = content['title'].replace(' | ', '') if '|' in content['title'] else content['title']
    if not os.path.exists(folder):
        os.mkdir(folder)
    response = requests.get(content['image'])
    if response.status_code == 200:
        # name each file by the MD5 of its content so duplicate images are skipped
        file_path = '{0}/{1}.{2}'.format(folder, md5(response.content).hexdigest(), 'jpg')
        if not os.path.exists(file_path):
            with open(file_path, 'wb') as f:
                f.write(response.content)
        else:
            print('Already downloaded', file_path)
if __name__ == '__main__':
    for i in range(6, 8):
        offset = i * 20  # construct the offset for each Ajax page
        print(offset)
        json = get_page(offset)
        for content in get_image(json):
            try:
                save_image(content)
            except OSError:  # covers FileExistsError; raised when a title contains characters invalid in folder names
                print('Bad folder name: title contains special characters')  # for this experiment we simply skip these non-essential errors
                continue
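A more general way to sanitize titles than only stripping ' | ' would be a small helper like the hypothetical safe_title below (not part of the original code; it removes every character Windows forbids in folder names):

import re

def safe_title(title):
    # hypothetical helper: strip all characters that are invalid in Windows folder names
    return re.sub(r'[\\/:*?"<>|]', '', title).strip() or 'untitled'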

Full code:

import requests
from urllib.parse import urlencode
import time
import os
from hashlib import md5

def get_page(offset):
    headers = {
        'cookie': 'tt_webid=6788065855508612621; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6788065855508612621; csrftoken=495ae3a5659fcdbdb78e255464317789; s_v_web_id=k66hcay0_qsRG7emW_x2Qj_4R3o_AeAG_iT4JWmz83jzr; __tasessionId=23dn3qk0f1580738708512',
        'referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest'
    }  # an invisible image captcha appears occasionally; the cookie is the key parameter for solving the data: none problem
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': 20,
        'en_qc': 1,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': int(time.time())  # current timestamp
    }  # building the URL from a params dict is cleaner and easier to modify later
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)
    json = requests.get(url, headers=headers).json()  # 'data' in the result is a list: 0, 1, 2, 3, ...
    return json
def get_image(json):
    for item in json['data']:
        if 'image_list' in item:  # skip entries that carry no images
            title = item['title']
            for image_urls in item['image_list']:
                image_url = image_urls['url']
                yield {
                    'image': image_url,
                    'title': title
                }
def save_image(content):
    # some titles contain special characters that cannot be used as folder names
    folder = content['title'].replace(' | ', '') if '|' in content['title'] else content['title']
    if not os.path.exists(folder):
        os.mkdir(folder)
    response = requests.get(content['image'])
    if response.status_code == 200:
        # name each file by the MD5 of its content so duplicate images are skipped
        file_path = '{0}/{1}.{2}'.format(folder, md5(response.content).hexdigest(), 'jpg')
        if not os.path.exists(file_path):
            with open(file_path, 'wb') as f:
                f.write(response.content)
        else:
            print('Already downloaded', file_path)
if __name__ == '__main__':
    for i in range(6, 8):
        offset = i * 20  # construct the offset for each Ajax page
        print(offset)
        json = get_page(offset)
        for content in get_image(json):
            try:
                save_image(content)
            except OSError:  # covers FileExistsError; raised when a title contains characters invalid in folder names
                print('Bad folder name: title contains special characters')
                continue

Summary

This bug improved the code from "Python3 Web Crawler Development Combat"; please do point out any shortcomings, and this bug will keep working hard! I hope the epidemic situation in the country keeps getting better, and I believe we will defeat the novel coronavirus! Come on, China!

Source: blog.csdn.net/qq_44627822/article/details/104211941