Web Scraping (Ajax): Douban Dynamic Pages

Tool: Python 3

Explanation: Ajax is a technique for building fast, dynamic web pages. It lets part of a page be updated without reloading the whole page.

Goal: scrape a Douban page that loads its content with Ajax

import urllib.parse
import urllib.request

# The url comes from a captured GET request (packet capture), not from the address shown on the web page
url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=80"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
}
# Form data taken from the WebForms tab in Fiddler
formdata = {
    "page_limit": "20",
    "page_start": "80",
    "sort": "recommend",
    "tag": "热门",
    "type": "movie",
}
data = urllib.parse.urlencode(formdata)
data = bytes(data, "utf8")
# Note: passing data= makes urllib send the request as a POST
request = urllib.request.Request(url, data=data, headers=headers)
response = urllib.request.urlopen(request).read()
# response = response.decode("utf-8")
with open("douban.json", "w") as f:
    f.write(str(response))

After running the code above, pasting the saved content into json.cn to parse it gives the following error:

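This means the file is not in the right format and could not be parsed: read() returns bytes, and str() on a bytes object writes its b'...' representation rather than JSON text. Try decoding the return value response: response = response.decode("utf-8")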
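The difference between the two forms can be checked offline. The following is only a minimal sketch: the `raw` value is a made-up stand-in for what `urlopen(...).read()` returns, and `is_valid_json` is a hypothetical helper that does locally roughly what pasting into json.cn does.

```python
import json

# Hypothetical stand-in for the bytes returned by urlopen(...).read()
raw = '{"tag": "热门"}'.encode("utf-8")

as_repr = str(raw)             # the b'...' representation of the bytes object
as_text = raw.decode("utf-8")  # the actual JSON text


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(as_repr))  # the b'...' form is not JSON
print(is_valid_json(as_text))  # the decoded text is JSON
```

Writing `as_repr` to a file is what the first version of the script does. Writing `as_text` is what the script does once the decode line is enabled.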

This gives a file in valid JSON format:

Looking more closely, the url already contains all the data in formdata, so try removing formdata:
import urllib.request

url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=80"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
}
# formdata = {
#     "page_limit": "20",
#     "page_start": "80",
#     "sort": "recommend",
#     "tag": "热门",
#     "type": "movie"
# }
# data = urllib.parse.urlencode(formdata)
# data = bytes(data, "utf8")
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request).read()
response = response.decode("utf-8")
# Write with an explicit encoding so the Chinese text is saved correctly on any platform
with open("douban.json", "w", encoding="utf-8") as f:
    f.write(response)

The result is the same as before!
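One way to see why formdata is redundant: URL-encoding the same fields reproduces the query string already in the url. The sketch below only builds strings and makes no network request. `BASE_URL` and `build_url` are names introduced here for illustration, and stepping `page_start` by `page_limit` to reach other pages is an assumption about how this endpoint pages, not something verified above.

```python
import urllib.parse

# Endpoint part of the captured url, without the query string
BASE_URL = "https://movie.douban.com/j/search_subjects"


def build_url(page_start, page_limit=20):
    """Build the request url from the same fields that formdata held."""
    params = {
        "type": "movie",
        "tag": "热门",
        "sort": "recommend",
        "page_limit": page_limit,
        "page_start": page_start,
    }
    # urlencode percent-encodes "热门" as UTF-8 and joins the pairs with "&"
    return BASE_URL + "?" + urllib.parse.urlencode(params)


captured = (
    "https://movie.douban.com/j/search_subjects"
    "?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=80"
)
print(build_url(80) == captured)

# Assumed paging scheme: advance page_start by page_limit each time
page_urls = [build_url(start) for start in range(0, 100, 20)]
```

If the paging assumption holds, each entry in `page_urls` could be passed to `urllib.request.Request(url, headers=headers)` in the same way as the single url above.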

Reposted from www.cnblogs.com/gaoquanquan/p/9102307.html