2020/2/17
My computer broke recently. The poor little Lenovo Pro 13 I bought only three months ago has a display problem. I don't know what exactly is broken: the screen is black on boot, and the panel is only visible at one particular angle (at every other angle it looks like the brightness is turned all the way down; leaning right over it I can just make out faint outlines — a truly inexplicable issue). Strangest of all, if I press down firmly on the lower-left corner of the chassis, where my left hand rests, the display turns mostly normal??? I really don't understand what the problem is, but from now on I'll keep Lenovo laptops at arm's length. And because of the epidemic, Lenovo's after-sales centers aren't even open. So frustrating.
Well, that was a lot of venting before the actual learning notes; this computer has been making life hard for me.
I had just finished writing the Toutiao street-photography (街拍) image crawler from the web-crawler book when I found the site has changed, so the book's code no longer crawls anything. There are two major changes: 1) the Ajax request for the images now requires a cookie; 2) the JSON returned is different, and there is no image_detail field any more. The details follow.
1) Adding the cookie
The cookie can be taken from the first XHR request that fires when you search for 街拍 (street snap) on the page. Moreover, in the response headers of that first XHR request you can see the Set-Cookie fields for the image requests, and the expiry is a long way off, so you only need to copy the cookie once and reuse it on later visits.
So, after setting the cookie, the headers are as follows.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0",
    "Referer": "https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D",
    "X-Requested-With": "XMLHttpRequest",
    "Cookie": "__tasessionId=da6dc6d4q1581851848391; s_v_web_id=k6oxqyas_VGn20UCx_WXjQ_40eQ_9nhD_h0a5HUmjAsyD; csrftoken=cdcf90d6d3d490ab1326e261b2eff18a; tt_webid=6794001919121065486"
}
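Instead of hand-copying the cookie string from the developer tools, you can let requests capture it. This is only a sketch, assuming that a plain GET of the search page triggers the Set-Cookie responses; the cookie_header helper just joins a cookie dict into the same header format used above:

```python
import requests

def cookie_header(cookies: dict) -> str:
    """Join a cookie dict into a single Cookie header string."""
    return "; ".join(f"{k}={v}" for k, v in cookies.items())

def fetch_cookie_header() -> str:
    # Assumption: GETting the search page is enough to receive the
    # Set-Cookie responses, so the session's jar holds what the API wants.
    session = requests.Session()
    session.get("https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D")
    return cookie_header(session.cookies.get_dict())
```

If the cookies really do come from that first page load, a Session would also send them automatically on subsequent requests, with no manual Cookie header at all.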
2) The format of the returned JSON has changed
There is no image_detail field in the data entries any more; there is an image_list field instead, but not every entry in data has image_list.
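So the extraction has to tolerate missing fields. A minimal sketch of the defensive pattern (the sample data below is made up, but shaped like the API's data array):

```python
def extract_items(data):
    """Keep only (title, image_list) pairs from entries that actually have images."""
    results = []
    for item in data or []:  # data itself can be missing (None)
        imgs = item.get("image_list")
        if imgs:             # skip entries without an image_list field
            results.append({"title": item.get("title"), "image_list": imgs})
    return results

# Hypothetical sample: only the first entry carries image_list.
sample = [
    {"title": "a", "image_list": [{"url": "http://x/1"}]},
    {"title": "b"},  # no image_list -> skipped
]
```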
Specific code:
The code is divided into three parts: the main function; get_one_page, which fetches the JSON data for one offset and yields the results — it is a generator, traversed in main; and save_img, which takes the image info produced by the generator and saves each image from its URL.
import requests, json, time, os
from requests import RequestException
from hashlib import md5

base_url1 = "https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset="
base_url2 = "&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp="
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0",
    "Referer": "https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D",
    "X-Requested-With": "XMLHttpRequest",
    "Cookie": "__tasessionId=da6dc6d4q1581851848391; s_v_web_id=k6oxqyas_VGn20UCx_WXjQ_40eQ_9nhD_h0a5HUmjAsyD; csrftoken=cdcf90d6d3d490ab1326e261b2eff18a; tt_webid=6794001919121065486"
}
base_save_path = "D:/toutiao/"
def get_one_page(base_url: str):
    '''
    @base_url: request url
    @yield: dict with "title" and "image_list" for each entry that has images
    '''
    try:
        resp = requests.get(base_url, headers=headers)
    except RequestException:
        return
    json_content = resp.json()
    print(base_url, '\n')
    # print(json.dumps(json_content, indent=2, ensure_ascii=False))
    data = json_content.get("data")
    if not data:  # the response may have no data field at all
        return
    for item in data:
        title = item.get("title")
        imgs = item.get("image_list")
        if imgs:  # entries without image_list are skipped
            yield {"title": title, "image_list": imgs}
def save_img(title_img: dict):
    title = title_img.get("title")
    title_path = base_save_path + title
    if not os.path.exists(title_path):
        os.makedirs(title_path)  # also creates base_save_path if it is missing
    for img in title_img.get("image_list"):
        img_url = img.get("url")
        img_resp = requests.get(img_url)
        if img_resp.status_code == 200:
            # name the file by the md5 of its bytes, so identical images collide
            file_path = '{0}/{1}.{2}'.format(title_path, md5(img_resp.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as img_f:
                    img_f.write(img_resp.content)
            else:
                print("already downloaded:", file_path)
def main():
    for i in range(0, 120, 20):
        # append a pseudo-millisecond timestamp built from time.time()
        base_url = base_url1 + str(i) + base_url2 + str(time.time()).replace('.', '')[:-4]
        title_imgs = get_one_page(base_url)
        for title_img in title_imgs:
            save_img(title_img)
        time.sleep(1)

if __name__ == "__main__":
    main()
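One detail worth pointing out: save_img names each file by the MD5 of its bytes, so the same image fetched twice maps to the same path, and the os.path.exists check then skips the rewrite. A small illustration of that content-addressed naming:

```python
from hashlib import md5

def img_filename(content: bytes) -> str:
    # Same bytes -> same digest -> same filename, so duplicates collapse.
    return md5(content).hexdigest() + ".jpg"
```

The trade-off is that the filename carries no human-readable meaning, but for a dedup-by-content download folder that is exactly the point.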