Practicing an Ajax dynamic-pagination crawler on the "Instant Recruitment" section of the Jiaxing Talent Network (jxrsrc.com)

Disclaimer: the code below is for technical learning and exchange only; do not use it for any other purpose.

Instant Recruitment listing: https://www.jxrsrc.com/Index/MoreInfo.aspx?TypeID=34
Open the page, scroll to the bottom and click the next-page button: the address in the browser does not change. On closer inspection the site is built on ASP.NET and refreshes the listing via AJAX, so the real paging request has to be found through the developer tools. Press F12 (or right-click and choose Inspect) and watch the Network panel while paging.

The Request URL underlined in red in the screenshot above is the real request address, and the request method is POST.
[Screenshot: Headers and Form Data of the paging request in the Network panel]
The parameters under Form Data are the actual paging parameters: the page number (PageIndex), the search keyword (KeyWord) and the type ID (typeID); the request headers we need to reproduce are shown above them.
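
Before writing the crawler loop, it is worth confirming that the endpoint really answers a plain POST carrying those three parameters. The sketch below sends a single request for page 1 and prints the status code plus the start of the body. The URL, Referer and form fields are the ones captured above; the User-Agent string is just an arbitrary browser value and the timeout is a guess.

import requests

# Quick check of the AJAX paging endpoint: one POST for page 1
test_url = "https://www.jxrsrc.com/Index/Ashx/MoreInfo2.ashx"
test_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Referer": "https://www.jxrsrc.com/Index/MoreInfo.aspx?TypeID=34",
}
test_data = {"PageIndex": 1, "KeyWord": "", "typeID": 34}

resp = requests.post(test_url, data=test_data, headers=test_headers, timeout=10)
print(resp.status_code)   # 200 means the endpoint accepted the request
print(resp.text[:300])    # should start with the HTML fragment holding the <dl>/<dt> job list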

import requests
from lxml import html
import time
import random

# The real AJAX endpoint found in the Network panel
url = "https://www.jxrsrc.com/Index/Ashx/MoreInfo2.ashx"

# Session cookie copied from the browser; it expires, so refresh it if requests start failing
cookie = "ASP.NET_SessionId=r2rwx4rzl4xu11e3s5131qjn; ASPSESSIONIDSGCDABQQ=GHDEIGBAJIGFIFHNICDEPCGF; Hm_lvt_8779a80c84018cd39c87c4dd911d90ba=1603981533,1604155293,1604417018,1604417059; Hm_lpvt_8779a80c84018cd39c87c4dd911d90ba=1604417066"

# A small pool of User-Agent strings to rotate through
ugList = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
]

# Build the request headers, mimicking the browser request seen in the developer tools
headers = {
    "User-Agent": random.choice(ugList),
    "Host": "www.jxrsrc.com",
    "Connection": "keep-alive",
    "Accept": "*/*",
    "Referer": "https://www.jxrsrc.com/Index/MoreInfo.aspx?TypeID=34",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cookie": cookie,
}

page = int(input("How many pages do you want to crawl: "))
try:
    for page_num in range(1, page + 1):  # page numbers start at 1
        time.sleep(1)  # forced delay; requesting too fast may trigger the site's anti-crawling measures
        # Form Data parameters for the current page
        formdata = {"PageIndex": page_num, "KeyWord": "", "typeID": 34}
        r = requests.post(url, data=formdata, headers=headers)
        preview_html = html.fromstring(r.content.decode("utf-8"))
        # Extract the detail-page links with XPath
        list_url = preview_html.xpath("//dl//dt//a/@href")
        print("-------------------- Page " + str(page_num) + " --------------------")
        print(list_url)  # each page yields a list of link addresses; try a 3-page test run
except Exception as e:
    print(e)

The output looks like this:
[Screenshot: sample run printing the list of link addresses for each page]
That's the paging part done. The next step is to request each of those links and extract the job details; I won't explain that in detail here and may write a separate post when I have time. The key takeaway is the approach.
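
As a rough starting point for that follow-up step, here is a sketch of how the collected links could be fetched. It assumes the extracted href values may be relative paths (urljoin handles both relative and absolute links), and the base URL and XPath are placeholders that have to be checked against the real detail pages.

from urllib.parse import urljoin

import requests
from lxml import html

BASE = "https://www.jxrsrc.com/Index/"   # assumed base for relative links; verify in the browser

def fetch_detail(href, headers):
    """Fetch one detail page and return its visible text (selectors are placeholders)."""
    full_url = urljoin(BASE, href)        # works whether href is relative or absolute
    resp = requests.get(full_url, headers=headers, timeout=10)
    tree = html.fromstring(resp.content.decode("utf-8"))
    # Placeholder XPath: grab all text nodes; replace with the real container of the job posting
    texts = tree.xpath("//body//text()")
    return full_url, " ".join(t.strip() for t in texts if t.strip())

# Usage: loop over the links collected above, pausing between requests
# for href in list_url:
#     time.sleep(1)
#     print(fetch_detail(href, headers))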

Reposted from blog.csdn.net/weixin_51424938/article/details/111408353