Python 3 Web Crawler Development in Practice: crawling Toutiao (Today's Headlines) street-snap pictures

2020/2/17
My computer broke recently. The poor little Xiaoxin Pro 13 I bought just three months ago has a display problem, and I do not know what exactly is broken. It boots to a black screen, and the display only lights up at one particular angle (at other angles it is stuck at the darkest brightness, and only a vague outline is visible when I look at it from directly above; a mystery). Strangest of all, if I press down with something heavy on the lower-left corner of the computer, where my left hand rests, the display becomes fairly normal. I really do not understand what the problem is, but from now on I will keep my distance from Lenovo computers. Because of the epidemic, Lenovo's after-sales service is not open either, which is miserable.
Well, that was a lot of talk that has nothing to do with studying; it was purely to vent. This computer makes life hard for me.
I had just finished typing in the book's code for crawling Toutiao pictures when I found that the site has changed, so the book's code no longer crawls anything. There are two major changes: 1) the Ajax request used to crawl the pictures now needs a cookie; 2) the returned JSON has a different structure and no longer contains the image_detail field. The details are as follows.

1) Add the cookie
The cookie can be found in the first XHR request of the street-snap search page.
Before that first XHR request there is also an img request whose response contains a field that sets the cookie. The cookie has a long expiry time; once it expires, visit the page again and copy the new cookie.
With the cookie added, headers looks as follows.

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0",
    "Referer":"https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D",
    "X-Requested-With":"XMLHttpRequest",
    "Cookie":"__tasessionId=da6dc6d4q1581851848391; s_v_web_id=k6oxqyas_VGn20UCx_WXjQ_40eQ_9nhD_h0a5HUmjAsyD; csrftoken=cdcf90d6d3d490ab1326e261b2eff18a; tt_webid=6794001919121065486"
}
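
The cookie does not have to stay inside headers as one long string. As a minimal sketch, the value copied from the browser can be split into a dict and handed to requests through its cookies parameter, which makes it easier to swap a single value once it expires. The helper name parse_cookie_header is made up for this illustration, and the cookie values below are placeholders.

```python
def parse_cookie_header(raw: str) -> dict:
    """Split a 'name=value; name2=value2' Cookie string into a dict."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue
        # Split on the first '=' only, since values may contain '=' themselves
        name, value = part.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Placeholder values; paste the string copied from the browser instead
raw_cookie = "csrftoken=abc123; tt_webid=6794001919121065486"
cookies = parse_cookie_header(raw_cookie)
print(cookies)
# Usage with requests (not executed here):
#   requests.get(url, headers=headers, cookies=cookies)
```

If the dict is passed through cookies=, the "Cookie" entry can be left out of headers.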

2) The format of the returned JSON has changed
The items in the data field no longer have image_detail. There is an image_list field instead, but not every item in data has one.
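
To make the new layout concrete, here is a rough sketch that walks a response of this shape and collects the image URLs per title. The sample dict is invented and reduced to the fields mentioned above (data, title, image_list, url); the real response contains many more fields. The function name extract_image_urls is also only for illustration.

```python
def extract_image_urls(json_content: dict) -> dict:
    """Map each title to its image URLs, skipping entries without image_list."""
    result = {}
    # "data" may be missing or null, so fall back to an empty list
    for item in json_content.get("data") or []:
        title = item.get("title")
        images = item.get("image_list")
        if not title or not images:
            continue
        result[title] = [img["url"] for img in images if img.get("url")]
    return result


# Hypothetical sample, reduced to the fields discussed above
sample = {
    "data": [
        {"title": "street snap A", "image_list": [{"url": "http://example.com/a1.jpg"}]},
        {"title": "no pictures here"},
        {"image_list": [{"url": "http://example.com/x.jpg"}]},
    ]
}
print(extract_image_urls(sample))
```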

Specific code:
The code is divided into three parts. main: the main function. get_one_page: requests the JSON data for one offset range and yields the results; it is a generator, so main can iterate over it. save_img: takes the image information produced by the generator and saves the image behind each url.

import json
import os
import time
from hashlib import md5

import requests
from requests import RequestException

base_url1 = "https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset="
base_url2 = "&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp="
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0",
    "Referer":"https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D",
    "X-Requested-With":"XMLHttpRequest",
    "Cookie":"__tasessionId=da6dc6d4q1581851848391; s_v_web_id=k6oxqyas_VGn20UCx_WXjQ_40eQ_9nhD_h0a5HUmjAsyD; csrftoken=cdcf90d6d3d490ab1326e261b2eff18a; tt_webid=6794001919121065486"
}
base_save_path = "D:/toutiao/" 

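def get_one_page(base_url: str):
    '''
    Generator: request one page of results and yield the items that have images.
    @base_url: request url
    @yield: dict with "title" and "image_list" for each item that has an image_list
    '''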

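    try:
        resp = requests.get(base_url, headers=headers)
        json_content = resp.json()
    except (RequestException, ValueError):
        # The request failed or the body was not JSON: end the generator with no items
        return

    print(base_url, '\n')
    # print(json.dumps(json_content, indent=2, ensure_ascii=False))
    # "data" may be missing or null, so fall back to an empty list
    data = json_content.get("data") or []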
    for item in data:
        title = item.get("title")
        imgs = item.get("image_list")
        # Not every item has a title and an image_list; skip the ones that do not
        if title and imgs:
            yield {"title": title, "image_list": imgs}

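def save_img(title_img: dict):
    title = title_img.get("title")
    # Drop characters that are not allowed in Windows directory names
    safe_title = "".join(c for c in title if c not in '\\/:*?"<>|').strip()
    title_path = base_save_path + safe_title
    # makedirs also creates base_save_path if it does not exist yet
    os.makedirs(title_path, exist_ok=True)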
    for img in title_img.get("image_list"):
        img_url = img.get("url")
        if not img_url:
            continue
        try:
            img_resp = requests.get(img_url)
        except RequestException:
            # Skip this image if the download fails
            continue
        if img_resp.status_code == 200:
            # Name the file after the MD5 of its content so duplicates are detected
            file_path = '{0}/{1}.jpg'.format(title_path, md5(img_resp.content).hexdigest())
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as img_f:
                    img_f.write(img_resp.content)
            else:
                print("already downloaded :", file_path)
        
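def main():
    for i in range(0, 120, 20):
        # 13-digit millisecond timestamp; slicing str(time.time()) breaks
        # when the fractional part has fewer digits than expected
        timestamp = str(int(time.time() * 1000))
        base_url = base_url1 + str(i) + base_url2 + timestamp
        title_imgs = get_one_page(base_url)
        for title_img in title_imgs:
            save_img(title_img)

        time.sleep(1)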

if __name__ == "__main__":
    main()
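
The request URL above is glued together from two hand-written strings. As an alternative sketch, the same query string can be assembled with urllib.parse.urlencode, which takes care of percent-encoding the keyword. The parameter names and values are copied from base_url1 and base_url2; the function name build_search_url and the constant API_URL are made up for this illustration, and whether the site accepts the parameters in any other combination is not something I have checked.

```python
import time
from urllib.parse import urlencode

API_URL = "https://www.toutiao.com/api/search/content/"


def build_search_url(offset: int, keyword: str = "街拍") -> str:
    """Build the search URL for one offset; parameters mirror base_url1/base_url2."""
    params = {
        "aid": 24,
        "app_name": "web_search",
        "offset": offset,
        "format": "json",
        "keyword": keyword,  # urlencode percent-encodes this as UTF-8
        "autoload": "true",
        "count": 20,
        "en_qc": 1,
        "cur_tab": 1,
        "from": "search_tab",
        "pd": "synthesis",
        "timestamp": int(time.time() * 1000),  # millisecond timestamp
    }
    return API_URL + "?" + urlencode(params)


print(build_search_url(20))
```

In main, base_url1 + str(i) + base_url2 + timestamp could then be replaced by build_search_url(i).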







Origin blog.csdn.net/biziwaiwai/article/details/104366762