Tencent Careers: Saving Scraped Data to a MySQL Database

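In the era of big data, we are no longer limited to learning about the world on paper; the internet lets us look far afield without leaving home. This post looks at another crawling technique: dynamic crawling. Unlike the static crawlers in the previous two posts, the data here is not in the page source; it comes from API endpoints that the page calls via Ajax and then renders. We will learn how this works by writing a crawler for Tencent Careers (腾讯招聘).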

Environment: Windows 10, Python 3.6, PyCharm, Google Chrome
Target URL: https://careers.tencent.com/search.html?pcid=40001

===============================================================

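Contents

Tencent Careers: Saving Scraped Data to a MySQL Database

1. Overview

1.1 Ajax overview
1.2 Crawler approach

2. Analyzing the page
3. Understanding the parameters

3.1 Telling request types apart
3.2 Timestamp
3.3 The categoryId parameter

4. Extracting the data

4.1 Building the params
4.2 Requesting the list page
4.3 Requesting the detail page

5. Saving to MySQL
6. Complete code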

================================================

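1. Overview

1.1 Ajax overview

With the static crawlers so far, the only way we communicated with a web server was to send an HTTP request for a new page. If a site can submit a form or fetch information from the server without reloading the page, it is using Ajax.
Ajax is not a language but a group of technologies used to accomplish a network task (one that is not unlike web scraping). The name stands for Asynchronous JavaScript and XML: the site exchanges information with the web server without making a separate page request.
In other words, Ajax uses JavaScript to exchange data with the server and update part of a page while the page is not reloaded and its URL does not change. It works with GET, POST, DELETE and other request methods, and the server can reply with text such as XML, HTML or JSON.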

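1.2 Crawler approach

1. Fetch the JSON data that the list page of the target site renders;
2. Build the full detail link for each position;
3. Request the detail pages and extract their data.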
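
The three steps can be tried offline before any request is sent. The sketch below is only an illustration: sample_list is a made-up reply, extract_post_ids and build_detail_url are hypothetical helper names, and the timestamp parameter is left out so the output stays the same on every run. The endpoint URL and the field names are the ones used later in this post.

```python
import json
from urllib.parse import urlencode

DETAIL_API = "https://careers.tencent.com/tencentcareer/api/post/ByPostId"

def extract_post_ids(list_text):
    """Step 1: parse the list JSON and collect the post IDs."""
    payload = json.loads(list_text)
    return [post["PostId"] for post in payload["Data"]["Posts"]]

def build_detail_url(post_id):
    """Step 2: attach the post ID to the detail endpoint as a query string."""
    return DETAIL_API + "?" + urlencode({"postId": post_id, "language": "zh-cn"})

# Made-up list reply that only mimics the layout of the real one.
sample_list = '{"Data": {"Count": 2, "Posts": [{"PostId": "101"}, {"PostId": "102"}]}}'

for post_id in extract_post_ids(sample_list):
    # Step 3 would request each of these URLs and extract the fields.
    print(build_detail_url(post_id))
```

Keeping parsing and URL building in small functions like these makes each step easy to check on its own before wiring in the real requests.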

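2. Analyzing the page

Target URL: https://careers.tencent.com/search.html?pcid=40001

Before crawling a site, do not start by writing code against it; first work out how its data is delivered. When you fetch a page with requests, you will sometimes find that the result does not match what the browser shows: the data is missing from the page source because JavaScript loads it from an API endpoint via Ajax.

Judging by how the web is evolving, more and more pages present their data this way, that is, they load it asynchronously. The page itself does not contain the data; after the page initializes, it automatically sends Ajax requests to the server and renders the response data into the page.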

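So how do we get at data that is loaded via Ajax? First find the endpoint that serves it:
1. Right-click the page and choose Inspect, or press F12, to open the developer tools;
2. Select the Network tab;
3. Click the XHR filter (API data usually shows up here);
4. Look for the target endpoint in the Name column;
5. Open Preview or Response; both show the data the request returned;
6. Confirm that the data shown on the page really comes from the endpoint you found.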
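
Step 6 can also be checked in code: search for a string you can see in the browser, first in the raw page source and then in the endpoint's reply. This is a minimal sketch under stated assumptions: both sample strings are made up and stand in for response.text of the page and of the endpoint, and found_in_source and found_in_api are hypothetical helper names.

```python
import json

def found_in_source(page_source, keyword):
    """Return True if the keyword is present in the raw page source."""
    return keyword in page_source

def found_in_api(api_text, keyword):
    """Return True if the keyword appears in any string value of the JSON reply."""
    def walk(node):
        if isinstance(node, dict):
            return any(walk(value) for value in node.values())
        if isinstance(node, list):
            return any(walk(value) for value in node)
        return isinstance(node, str) and keyword in node
    return walk(json.loads(api_text))

# Made-up stand-ins for the page source and for the endpoint reply.
sample_source = '<html><body><div id="app"></div><script src="main.js"></script></body></html>'
sample_api = '{"Data": {"Count": 1, "Posts": [{"RecruitPostName": "Backend Engineer"}]}}'

keyword = "Backend Engineer"
print(found_in_source(sample_source, keyword))  # False
print(found_in_api(sample_api, keyword))        # True
```

If the keyword is missing from the source but present in the reply, the page is rendered from that endpoint and the crawler should request the endpoint directly.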

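3. Understanding the parameters

3.1 Telling request types apart

Anyone who has done front-end work knows that requests sent to a server come in several kinds, such as GET and POST. The requests module offers two ways to attach parameters to a request, params and data: params is used with GET requests and data with POST requests.

The difference is that params is appended to the URL as the query string, whereas data is placed in the request body.
The Tencent Careers endpoint takes a GET request with params. POST requests with data will be covered in the next post, so stay tuned!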
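
To make the difference concrete without sending anything over the network, the sketch below builds both kinds of request with the standard library's urllib rather than requests (a deliberate substitution so that it runs without extra installs). The POST variant is for illustration only; the endpoint in this post is queried with GET.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

base_url = "https://careers.tencent.com/tencentcareer/api/post/Query"
fields = {"pageIndex": "1", "pageSize": "10"}

# GET: the parameters are encoded into the query string of the URL.
get_req = Request(base_url + "?" + urlencode(fields))
print(get_req.get_method())                        # GET
print(parse_qs(urlsplit(get_req.full_url).query))  # {'pageIndex': ['1'], 'pageSize': ['10']}

# POST: the same fields travel in the request body and the URL stays bare.
post_req = Request(base_url, data=urlencode(fields).encode("utf-8"))
print(post_req.get_method())                       # POST
print(post_req.data)                               # b'pageIndex=1&pageSize=10'
```

Neither Request object is ever opened, so nothing is sent; they only show where the fields end up.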

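3.2 Timestamp

The first entry in the params is named timestamp, but how can we confirm that it really is one? Search for an online Unix timestamp converter and paste the value in. Note that the value in the request has 13 digits (milliseconds), while the converter expects the 10-digit value in seconds, so drop the last three digits before converting.
Once we know it is a timestamp, we can generate it with the built-in time module (note: the converter takes a 10-digit integer, but the request parameter is a 13-digit integer):

import time
# Millisecond timestamp
timestamp = int(time.time() * 1000)
print(timestamp)

time.time() returns the timestamp as a float in seconds; multiplying by 1000 gives 13 digits before the decimal point, and int() drops the remaining fraction.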
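
The same check can be done in Python instead of on a converter website. In this sketch the sample value is made up and ms_to_datetime is a hypothetical helper name; the conversion uses UTC so the result does not depend on the local time zone.

```python
from datetime import datetime, timezone

def ms_to_datetime(ms):
    """Convert a 13-digit millisecond timestamp to an aware UTC datetime."""
    seconds = ms // 1000          # drop the last three digits (milliseconds)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

sample = 1573689600123            # made-up 13-digit value
print(len(str(sample)))           # 13
print(ms_to_datetime(sample))     # 2019-11-14 00:00:00+00:00
```

A value that converts to a plausible recent date is very likely a timestamp.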

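3.3 The categoryId parameter

The categoryId parameter in the params identifies the job categories to search. Ticking a different category on the page changes categoryId, but for any given selection of categories the value is fixed.

Technology only: categoryId is 40001001,40001002,40001003,40001004,40001005,40001006.
Technology + Design: the value changes to cover both categories.
Here we tick only the Technology category and collect those positions.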
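
Since categoryId is just a comma-separated list of IDs, it can be built from a Python list, which is easier to edit than a long string. A small sketch; TECH_IDS and build_category_id are hypothetical names, and the IDs are the Technology values used in the request in this post.

```python
# Sub-category IDs of the Technology category, as used in this post's request.
TECH_IDS = [40001001, 40001002, 40001003, 40001004, 40001005, 40001006]

def build_category_id(ids):
    """Join category IDs into one comma-separated string."""
    return ",".join(str(category_id) for category_id in ids)

print(build_category_id(TECH_IDS))
# 40001001,40001002,40001003,40001004,40001005,40001006
```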

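4. Extracting the data

With the page analyzed, the next step is to write the code. Start by importing the libraries and defining the request headers, which make the program look like a browser. Other modules are imported later as they are needed.

import requests
import time
import json
import re

# Request headers: present the program as a browser
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
}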

4.1 Building the params

Leaving aside the parameters that are empty strings and the categoryId covered above, which stays fixed, two parameters change on every request: timestamp and pageIndex (the page number).

Because every page request needs the params, we define a function that builds them so it can be called for each page.

def get_params(pn):  # pn: page number
    params = {
        "timestamp": str(int(time.time() * 1000)),  # millisecond timestamp
        "countryId": "",
        "cityId": "",
        "bgIds": "",
        "productId": "",
        "categoryId": "40001001,40001002,40001003,40001004,40001005,40001006",  # Technology categories
        "parentCategoryId": "",
        "attrId": "",
        "keyword": "",
        "pageIndex": str(pn),   # page number
        "pageSize": "10",   # ten positions per page
        "language": "zh-cn",
        "area": "cn",
    }

    return params

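4.2 Requesting the list page

As the target page shows, the Technology category has 274 pages of listings. There are two ways to page through them:

for loop: hard-code the range of page numbers yourself.

for pn in range(1, 275):    # pages 1 to 274
    params = get_params(pn)  # build the params for this page

while loop: read Count (the total number of positions) from the JSON response and use it to decide when the last page has been reached. (This is the approach I use most for Ajax-loaded data; it is simple and effective.)

From the list page we take each position's name and the post ID needed for the detail page.
json.loads: JSON deserialization (parses a JSON string into Python objects such as dict and list).

url = 'https://careers.tencent.com/tencentcareer/api/post/Query'  # list endpoint
total = 10  # records requested so far; 10 corresponds to page 1
while True:
    params = get_params(total // 10)   # params for the current page
    # Request the list endpoint
    response = requests.get(url=url, headers=headers, params=params)
    text = response.text    # response body as text
    datas = json.loads(text)     # JSON deserialization

    for data in datas['Data']['Posts']:
        title = data['RecruitPostName']  # position name
        postid = data['PostId']          # post ID
        print(title, postid)

    # Move on by ten records (one page)
    total += 10
    # Stop once total >= Count + 10, i.e. the last page has been fetched
    if total >= datas['Data']['Count'] + 10:
        break

Running this prints the position name and post ID of every listing.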
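
It is worth confirming that the stop condition of the while loop visits exactly as many pages as Count requires. The sketch below replays the loop's counter arithmetic without sending any requests and compares it with a plain ceiling division. The value 2735 is a made-up Count chosen to match the 274 pages mentioned above, and page_count and pages_visited are hypothetical helper names.

```python
import math

def page_count(total_records, page_size=10):
    """Number of pages needed to cover total_records."""
    return math.ceil(total_records / page_size)

def pages_visited(count, page_size=10):
    """Replay the stop condition of the while loop without any requests."""
    total, visited = page_size, 0
    while True:
        visited += 1                  # one page fetched
        total += page_size
        if total >= count + page_size:
            break
    return visited

print(page_count(2735))      # 274
print(pages_visited(2735))   # 274
```

The two agree for every positive Count. With Count equal to 0 the loop still requests page 1 once, because it has to fetch a page before it can read Count.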

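4.3 Requesting the detail page

The post ID obtained from the list page, postId, is the key parameter in the params of the detail-page request.

Building the detail-page params: just fill in the post ID.

detail_params = {   # query parameters for the detail request
    "postId": str(postid),
    "timestamp": str(int(time.time() * 1000)),    # millisecond timestamp
    "language": "zh-cn",
}

Request the detail data and check that it is correct.

# Detail endpoint URL
detail_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId'
# Send the request and take the response body as text
resp_href = requests.get(url=detail_url, headers=headers, params=detail_params).text
# json.loads: deserialize into a dict
results = json.loads(resp_href)
print(results)

This prints the complete JSON returned for the position.
The next step is to pull the fields we want out of this detail JSON.
re.sub: regex substitution, used here to replace whitespace with an empty string.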

result = results['Data']
postid = result['PostId']                       # post ID
position = result['RecruitPostName']            # position name
position_id = result['RecruitPostId']           # position ID
city = result['LocationName']                   # city
BGId = result['BGId']                           # business group ID
BGName = result['BGName']                       # business group name
CategoryName = result['CategoryName']           # category name
Responsibility = result['Responsibility']       # responsibilities
Responsibility = re.sub(r'\s', '', Responsibility)   # \s already covers \n and \t
Requirement = result['Requirement']             # requirements
Requirement = re.sub(r'\s', '', Requirement)
last_time = result['LastUpdateTime']            # last update time
PostURL = result['PostURL']                     # link
print('ID:', postid, '\n', 'Position:', position, '\n', 'Position ID:', position_id, '\n', 'City:', city, '\n',
    'BG ID:', BGId, '\n', 'BG name:', BGName, '\n', 'Category:', CategoryName, '\n', 'Last update:', last_time, '\n',
    'Responsibilities:', Responsibility, '\n', 'Requirements:', Requirement, '\n', 'Link:', PostURL, '\n')

This prints the extracted fields for each position.
Data from a dynamic crawler is comparatively easy to extract: peel the JSON apart layer by layer until you reach the fields you want. Compared with writing extraction expressions for static pages, it is much less work.
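
Here is the layer-by-layer extraction on a small offline sample. This is only a sketch: sample_text is a made-up reply that mimics the shape of the detail JSON, and the use of dict.get for a possibly missing field is my own suggestion rather than something the crawler in this post does.

```python
import json
import re

# Made-up reply that only mimics the shape of the detail JSON.
sample_text = json.dumps({
    "Data": {
        "PostId": "0000000000000000001",
        "RecruitPostName": "Backend Engineer",
        "Responsibility": "Build services;\n\tReview code.",
    }
})

results = json.loads(sample_text)          # str -> dict
result = results["Data"]                   # first layer

position = result["RecruitPostName"]
responsibility = re.sub(r"\s", "", result["Responsibility"])
# .get() returns a default instead of raising KeyError when a field is absent.
city = result.get("LocationName", "")

print(position)          # Backend Engineer
print(responsibility)    # Buildservices;Reviewcode.
print(repr(city))        # ''
```

Note that removing every whitespace character also glues English words together. That is harmless for Chinese text, but for English descriptions you may prefer re.sub(r"\s+", " ", text), which collapses each run of whitespace to a single space.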

Define an empty list and append the extracted fields to it, ready to be saved to MySQL later.

data_list = []  # holds one row of fields per position
data_list.append([postid,position,position_id,city,BGId,BGName,CategoryName,last_time,Responsibility,Requirement,PostURL])

The get_data function returns this list to its caller.

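5. Saving to MySQL

Open a cmd terminal as administrator. Start the MySQL service with: net start mysql; stop it with: net stop mysql.

Log in to MySQL: mysql -h hostname -u username -P port -ppassword, for example: mysql -h localhost -u root -P 3306 -p123456. When the mysql> prompt appears, you are inside the MySQL environment. If you have not installed MySQL yet, search for a guide; there are plenty of blog posts and videos on the subject. Beginners should avoid installing the very latest version.
Check from Python that the connection works:

import pymysql

# Connect to the database server
connection = pymysql.connect(
    host = 'localhost',          # local machine
    port = 3306,                 # port
    user = 'root',               # user name
    password = '123456',         # password
    charset = 'utf8'
)
print(connection)   # check that the connection succeeded

If this prints a pymysql connection object without raising an exception, the connection succeeded.
cursor = connection.cursor(): creates a cursor, which is used to run SQL statements from Python.
cursor.execute('SQL statement'): executes an SQL statement.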

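Create the database and the table.
1. CREATE: create
2. USE: select
3. DROP: delete

cursor = connection.cursor()  # cursor for running SQL statements
# Create the database if it does not exist yet
cursor.execute("create database if not exists spider_dataes character set 'utf8'")
cursor.execute("use spider_dataes")
# Drop the table first if it already exists
cursor.execute("drop table if exists tx_spider;")
# Create the table (last_time is stored as text, matching the complete code in section 6)
tx_spider_sql = '''CREATE TABLE tx_spider(postid VARCHAR(50),position VARCHAR(150),position_id VARCHAR(20),city VARCHAR(10),
                    BGId VARCHAR(10),BGName VARCHAR(10),CategoryName VARCHAR(15),last_time VARCHAR(50),Responsibility TEXT,
                    Requirement TEXT,PostURL VARCHAR(150)
);'''
cursor.execute(tx_spider_sql)  # run the CREATE TABLE statement

You can check in SQLyog, a graphical MySQL client, whether the table was created.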

Inserting the data
connection.commit(): commit the changes
cursor.close(): close the cursor
connection.close(): close the MySQL connection

# Save the data
for data in data_list:
    # Insert one row
    cursor.execute(
        "insert into tx_spider(postid,position,position_id,city,BGId,BGName,CategoryName,last_time,Responsibility,Requirement,PostURL) values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",data
    )
connection.commit()  # commit once, after all rows have been inserted
cursor.close()       # close the cursor
connection.close()   # close the MySQL connection
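
The insert-then-commit pattern can be tried without a running MySQL server. The sketch below uses the standard library's sqlite3 as a stand-in, so two things differ from the pymysql code: the placeholder is ? instead of %s, and the database lives in memory. The table demo_spider and its rows are made up. It also uses executemany, which inserts all rows in one call; pymysql cursors offer an executemany method as well.

```python
import sqlite3

# Made-up rows standing in for data_list.
rows = [
    ["1", "Backend Engineer", "Shenzhen"],
    ["2", "Data Engineer", "Beijing"],
]

connection = sqlite3.connect(":memory:")   # throwaway in-memory database
cursor = connection.cursor()
cursor.execute("CREATE TABLE demo_spider(postid TEXT, position TEXT, city TEXT)")

# One call inserts every row; values are bound as parameters, not formatted into the SQL.
cursor.executemany(
    "INSERT INTO demo_spider(postid, position, city) VALUES (?, ?, ?)", rows
)
connection.commit()                        # commit once, after the whole batch

cursor.execute("SELECT COUNT(*) FROM demo_spider")
saved = cursor.fetchone()[0]
print(saved)                               # 2

cursor.close()
connection.close()
```

Committing once after the batch, rather than after every row, keeps the cursor and connection open until all the data has been written.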

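Consistent with the page count found earlier for the Technology category, Count in the Ajax data comes to a little over 2,700 records.
The data checks out and the program ran successfully.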

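6. Complete code

The Tencent Careers project is fairly friendly: the program sets no time.sleep delay and the server did not throttle the crawler, so it runs quickly. Tencent Careers is also a classic example of dynamic crawling, since both the list page and the detail page are JSON loaded via Ajax. If you are interested, try it out yourself.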

# Import the required libraries
import requests
import time
import json
import re
import pymysql

# Request headers: present the program as a browser
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
}
data_list = []  # holds one row of fields per position

# Query parameters for the list endpoint
def get_params(pn):
    params = {
        "timestamp": str(int(time.time() * 1000)),  # millisecond timestamp
        "countryId": "",
        "cityId": "",
        "bgIds": "",
        "productId": "",
        "categoryId": "40001001,40001002,40001003,40001004,40001005,40001006",  # Technology categories
        "parentCategoryId": "",
        "attrId": "",
        "keyword": "",
        "pageIndex": str(pn),   # page number
        "pageSize": "10",   # ten positions per page
        "language": "zh-cn",
        "area": "cn",
    }

    return params

# Fetch the data
def get_data(url):
    total = 10  # records requested so far; 10 corresponds to page 1
    while True:
        # Build the params for the current page and request the list endpoint
        params = get_params(total // 10)

        response = requests.get(url=url,headers=headers,params=params)
        text = response.text    # response body as text
        datas = json.loads(text)     # JSON deserialization
        for data in datas['Data']['Posts']:
            title = data['RecruitPostName']  # position name
            postid = data['PostId']          # post ID

            detail_params = {   # query parameters for the detail request
                "postId": str(postid),
                "timestamp": str(int(time.time()*1000)),    # millisecond timestamp
                "language": "zh-cn",
            }
            detail_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId'
            resp_href = requests.get(url=detail_url,headers=headers,params=detail_params).text
            results = json.loads(resp_href)
            result = results['Data']

            postid = result['PostId']                       # post ID
            position = result['RecruitPostName']            # position name
            position_id = result['RecruitPostId']           # position ID
            city = result['LocationName']                   # city
            BGId = result['BGId']                           # business group ID
            BGName = result['BGName']                       # business group name
            CategoryName = result['CategoryName']           # category name
            Responsibility = result['Responsibility']       # responsibilities
            Responsibility = re.sub(r'\s', '', Responsibility)   # strip all whitespace
            Requirement = result['Requirement']             # requirements
            Requirement = re.sub(r'\s', '', Requirement)
            last_time = result['LastUpdateTime']            # last update time
            PostURL = result['PostURL']                     # link
            print(
                'ID:',postid,'\n','Position:',position,'\n','Position ID:',position_id,'\n','City:',city,'\n',
                'BG ID:',BGId,'\n','BG name:',BGName,'\n','Category:',CategoryName,'\n','Last update:',last_time,'\n',
                'Responsibilities:',Responsibility,'\n','Requirements:',Requirement,'\n','Link:',PostURL,'\n'
            )
            data_list.append([postid,position,position_id,city,BGId,BGName,CategoryName,last_time,Responsibility,Requirement,PostURL])

        # Move on by ten records (one page)
        total += 10
        if total >= datas['Data']['Count'] + 10:
            break  # the last page has been fetched

    return data_list

# Save to the MySQL database
def save_mysql(data_list):
    # Connect to the database server
    connection = pymysql.connect(
        host = 'localhost',          # local machine
        port = 3306,                 # port
        user = 'root',               # user name
        password = '123456',         # password
        charset = 'utf8'
    )
    # print(connection)   # check that the connection succeeded
    # Cursor for running SQL statements
    cursor = connection.cursor()

    # Create the database if it does not exist yet
    cursor.execute("create database if not exists spider_dataes character set 'utf8'")
    cursor.execute("use spider_dataes")
    # Drop the table first if it already exists
    cursor.execute("drop table if exists tx_spider;")
    # Create the table
    tx_spider_sql = '''CREATE TABLE tx_spider(postid VARCHAR(50),position VARCHAR(150),position_id VARCHAR(20),city VARCHAR(10),
                        BGId VARCHAR(10),BGName VARCHAR(10),CategoryName VARCHAR(15),last_time VARCHAR(50),Responsibility TEXT,
                        Requirement TEXT,PostURL VARCHAR(150)
    );'''
    cursor.execute(tx_spider_sql)  # run the CREATE TABLE statement
    # Save the data
    for data in data_list:
        # Insert one row
        cursor.execute(
            "insert into tx_spider(postid,position,position_id,city,BGId,BGName,CategoryName,last_time,Responsibility,Requirement,PostURL) values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",data
        )

    connection.commit() # commit the changes
    cursor.close()  # close the cursor
    connection.close()  # close the MySQL connection

if __name__ == '__main__':
    url = 'https://careers.tencent.com/tencentcareer/api/post/Query'
    data_list = get_data(url)   # fetch the data
    save_mysql(data_list)       # save it to MySQL

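Finally, if you have any questions about the code or the steps, leave a comment and I will reply as soon as I see it. If you see room for improvement in this project, I would welcome your advice. Thanks!

Note: this project is for learning purposes only. If you use it commercially, you do so at your own responsibility.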
