Scraping Dynamically Rendered Pages with Selenium: Crawling Python Job Postings from Lagou into a Database

Last time I tried to crawl Lagou by simulating its Ajax requests with the requests library, but that failed: the site kept blocking my crawler. This time I'll use Selenium to collect the Python job postings and store them in a database.

Open Lagou and analyze the page source

What we need from the listing page is the link to each position: every posting is an <a> tag with class position_link.

How do we reach the next page? There is a Next Page (下一页) button in the pager at the bottom of the result list.

With the analysis done, we can write code to collect the links. First, the links on the first page.

The code:

from selenium import webdriver
from lxml import etree

# Load the listing page in Edge and hand the rendered source to lxml
browser = webdriver.Edge()
browser.get("https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=")
page = browser.page_source

# Every position link is an <a class="position_link"> element
html = etree.HTML(page)
urls = html.xpath("//a[@class = 'position_link']/@href")
print(urls)

I didn't use Selenium's own element lookup here because lxml's XPath queries are faster.
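For reference, the same lookup written with Selenium's own locator API would look like the sketch below (using the Selenium 3 API, consistent with the rest of this post). It works, but each call round-trips through the WebDriver protocol, which is why parsing page_source with lxml tends to be quicker.

# Equivalent query via Selenium's locators; functionally the same,
# just slower because each call crosses the WebDriver boundary
links = browser.find_elements_by_xpath("//a[@class = 'position_link']")
urls = [link.get_attribute("href") for link in links]
print(urls)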

The result is a list of the job-detail URLs on the page.

That takes care of page one. But how do we get page two, and page three?

We have to click the Next Page button, and while doing that we need to wait on the page: otherwise the program may try to grab information before the page has finished loading and throw an exception.

The code to find the Next Page button:

while True:
    # The Next Page button is the last <span> inside the pager
    button = browser.find_element_by_xpath("//div[@class = 'pager_container']//span[last()]")
    # On the last page its class attribute gains pager_next_disabled
    if "pager_next_disabled" in button.get_attribute("class"):
        break
    button.click()

Key point: the loop's termination condition is that pager_next_disabled shows up in the Next Page button's class attribute!
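The full script below covers the waiting with implicitly_wait; if you'd rather make the wait explicit, a sketch using Selenium's WebDriverWait (my alternative, not what the original uses) looks like this:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)  # allow each page up to 10 seconds
while True:
    # Block until the pager is actually in the DOM before touching it
    button = wait.until(EC.presence_of_element_located(
        (By.XPATH, "//div[@class = 'pager_container']//span[last()]")))
    if "pager_next_disabled" in button.get_attribute("class"):
        break
    button.click()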

The full code for collecting every position's URL (this becomes the get_URLS function in the final script):

browser = webdriver.Edge()
browser.implicitly_wait(10)  # implicit wait, so elements get time to render
browser.get(url)
job_urls = []
while True:
    # Grab the links on the current page, then page forward
    page = browser.page_source
    html = etree.HTML(page)
    job_url = html.xpath("//a[@class = 'position_link']/@href")
    job_urls.extend(job_url)
    button = browser.find_element_by_xpath("//div[@class = 'pager_container']//span[last()]")
    if "pager_next_disabled" in button.get_attribute("class"):
        break
    button.click()
browser.close()

With that, we have the URL of every position.

Take the URLs and extract the information

Once we have all the URLs, we can open one detail page, work out the XPath expression for each field, and use those expressions to pull out the data we want.

The extraction code:

from lxml import etree
from selenium import webdriver

browser = webdriver.Edge()
browser.implicitly_wait(10)
browser.get("https://www.lagou.com/jobs/2108656.html")
page = browser.page_source
html = etree.HTML(page)

# Core fields of the posting; salary, location, experience, education
# and position type all live under dd.job_request
name = html.xpath("//div[@class = 'job-name']//span[@class = 'name']/text()")
salary = html.xpath("//dd[@class = 'job_request']//span[1]/text()")
didian = html.xpath("//dd[@class = 'job_request']//span[2]/text()")    # location
jingyan = html.xpath("//dd[@class = 'job_request']//span[3]/text()")   # experience
xueli = html.xpath("//dd[@class = 'job_request']//span[4]/text()")     # education
zhiwei = html.xpath("//dd[@class = 'job_request']//span[5]/text()")    # position type
job_advantage = html.xpath("//dd[@class = 'job-advantage']//p/text()")

# The job description is scattered across many text nodes; strip and
# rejoin them into one comma-separated string
job_detail = html.xpath("//div[@class = 'job-detail']//text()")
job_detail_new = ",".join(list(map(lambda x: x.strip(", \n"), job_detail))).strip(",")
print(job_detail_new)

# The work address ends with a "查看地图" (view map) link we don't want
work_addr = html.xpath("//div[@class = 'work_addr']//text()")
work_addr_new = "".join(list(map(lambda x: x.strip(), work_addr))).replace("查看地图", "")

To be honest, the extracted fields still contain a lot of noise. If you're interested, try making the data cleaner and more structured; comments are welcome! With this, we can extract all the information from a single job page.
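As one example of that cleanup (my own addition, not part of the original script): the salary field comes back as a string like "15k-30k", which a small regex can split into numeric bounds. The exact format on Lagou may vary, so treat this as a sketch.

import re

def parse_salary(salary):
    # Split a string like "15k-30k" into integer bounds; returns
    # (None, None) when the format doesn't match
    match = re.search(r"(\d+)k\s*-\s*(\d+)k", salary, re.IGNORECASE)
    if match:
        return int(match.group(1)), int(match.group(2))
    return None, None

print(parse_salary("15k-30k"))  # (15, 30)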

Put it all together and store it in the database

Beforehand I used Navicat to create a database (lagou) and a table (job), with one column for each field we extract.

Note: when creating the table, think carefully about the column data types!
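Since the Navicat screenshot isn't reproduced here, this is roughly the schema I'd expect the job table to have: one column per extracted field, TEXT for the long description. The exact types in the original may differ, so treat the DDL below as an assumption. Creating it through pymysql instead of Navicat:

import pymysql

conn = pymysql.connect(host="localhost", user="root", password="yanzhiguo140710", port=3306, db="lagou")
cur = conn.cursor()
# Assumed schema: utf8mb4 so Chinese text stores cleanly,
# TEXT for the long job description
cur.execute("""
    CREATE TABLE IF NOT EXISTS job (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(100),
        salary VARCHAR(50),
        didian VARCHAR(50),
        jingyan VARCHAR(50),
        xueli VARCHAR(50),
        zhiwei VARCHAR(50),
        job_advantage VARCHAR(255),
        job_detail TEXT,
        work_addr VARCHAR(255)
    ) DEFAULT CHARSET=utf8mb4
""")
conn.commit()
conn.close()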

The code that writes one record into the database (this becomes the push_data function below):

conn = pymysql.connect(host="localhost", user="root", password="yanzhiguo140710", port=3306, db="lagou")
cur = conn.cursor()
# Build the INSERT dynamically from job_dict's keys, with one %s
# placeholder per value so pymysql handles the escaping
keys = ",".join(job_dict.keys())
values = ",".join(['%s'] * len(job_dict))
sql = "insert into job ({keys}) values ({values})".format(keys=keys, values=values)
cur.execute(sql, tuple(job_dict.values()))
conn.commit()
conn.close()
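One thing I'd add on top of this (not in the original): wrap the insert in try/except so a single bad row rolls back instead of leaving the connection mid-transaction. A sketch:

import pymysql

def push_data_safe(job_dict):
    conn = pymysql.connect(host="localhost", user="root", password="yanzhiguo140710", port=3306, db="lagou")
    try:
        with conn.cursor() as cur:
            keys = ",".join(job_dict.keys())
            placeholders = ",".join(["%s"] * len(job_dict))
            sql = "insert into job ({}) values ({})".format(keys, placeholders)
            cur.execute(sql, tuple(job_dict.values()))
        conn.commit()
    except pymysql.MySQLError as err:
        # Roll back the failed insert and keep going
        conn.rollback()
        print("insert failed:", err)
    finally:
        conn.close()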

Summary

That basically completes everything. The final full code:

from selenium import webdriver
from lxml import etree
import pymysql

# Collect the URLs of all positions
def get_URLS(url):
    browser = webdriver.Edge()
    browser.implicitly_wait(10)
    browser.get(url)
    job_urls = []
    while True:
        page = browser.page_source
        html = etree.HTML(page)
        job_url = html.xpath("//a[@class = 'position_link']/@href")
        job_urls.extend(job_url)
        button = browser.find_element_by_xpath("//div[@class = 'pager_container']//span[last()]")
        if "pager_next_disabled" in button.get_attribute("class"):
            break
        button.click()
    browser.close()
    return job_urls

# Extract the information from one job URL
def getJob(job_url):
    browser = webdriver.Edge()
    browser.implicitly_wait(10)
    browser.get(job_url)
    page = browser.page_source
    html = etree.HTML(page)
    name = html.xpath("//div[@class = 'job-name']//span[@class = 'name']/text()")[0]
    salary = html.xpath("//dd[@class = 'job_request']//span[1]/text()")[0]
    didian = html.xpath("//dd[@class = 'job_request']//span[2]/text()")[0]
    jingyan = html.xpath("//dd[@class = 'job_request']//span[3]/text()")[0]
    xueli = html.xpath("//dd[@class = 'job_request']//span[4]/text()")[0]
    zhiwei = html.xpath("//dd[@class = 'job_request']//span[5]/text()")[0]
    job_advantage = html.xpath("//dd[@class = 'job-advantage']//p/text()")[0]
    job_detail = html.xpath("//div[@class = 'job-detail']//text()")
    job_detail_new = ",".join(list(map(lambda x: x.strip(", \n"), job_detail))).strip(",")
    work_addr = html.xpath("//div[@class = 'work_addr']//text()")
    work_addr_new = "".join(list(map(lambda x: x.strip(), work_addr))).replace("查看地图", "")
    job_dict = {
        "name":name,
        "salary":salary,
        "didian": didian,
        "jingyan": jingyan,
        "xueli": xueli,
        "zhiwei": zhiwei,
        "job_advantage": job_advantage,
        "job_detail": job_detail_new,
        "work_addr": work_addr_new
    }
    browser.close()
    return job_dict

# Store one record in the database
def push_data(job_dict):
    conn = pymysql.connect(host="localhost", user="root", password="yanzhiguo140710", port=3306, db="lagou")
    cur = conn.cursor()
    keys = ",".join(job_dict.keys())
    values = ",".join(['%s']*len(job_dict))
    sql = "insert into job ({keys}) values ({value})".format(keys = keys,value = values)
    cur.execute(sql,tuple(job_dict.values()))
    conn.commit()
    conn.close()

if __name__ == '__main__':
    url = "https://www.lagou.com/jobs/list_python?city=全国&cl=false&fromSearch=true&labelWords=&suginput="
    job_urls = get_URLS(url)
    for item in job_urls:
        push_data(getJob(item))
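One refinement worth considering (my suggestion, not part of the original code): getJob launches and closes a fresh Edge window for every job URL, which is slow. Passing one shared browser in looks roughly like the sketch below; only two fields are shown, the rest of the XPath extraction stays the same as in getJob.

from selenium import webdriver
from lxml import etree

def getJob_shared(browser, job_url):
    # Same extraction as getJob, but on a browser owned by the caller
    browser.get(job_url)
    html = etree.HTML(browser.page_source)
    name = html.xpath("//div[@class = 'job-name']//span[@class = 'name']/text()")[0]
    salary = html.xpath("//dd[@class = 'job_request']//span[1]/text()")[0]
    return {"name": name, "salary": salary}  # plus the other fields

if __name__ == '__main__':
    url = "https://www.lagou.com/jobs/list_python?city=全国&cl=false&fromSearch=true&labelWords=&suginput="
    browser = webdriver.Edge()
    browser.implicitly_wait(10)
    for item in get_URLS(url):  # get_URLS as defined above
        print(getJob_shared(browser, item))
    browser.close()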



And the data shows up in the database as expected.

That completes the project!

As always, anyone interested is welcome to discuss in the comments!


Reposted from blog.csdn.net/yanzhiguo98/article/details/86708941