Python 3: crawling historical national college admission data for Sichuan (arts/science provincial control lines and major lines) into an Excel table (rewrite)

Original article address: https://blog.csdn.net/memory_qianxiao/article/details/82388370

Since many friends need this data, plenty of people have asked me for the source code ever since the original post went up. Around March 10 someone told me the script no longer returned any data; I tested it, and indeed it didn't. A quick check showed the request URL had changed, but because I was busy preparing for the Lanqiao Cup (doing practice problems), I didn't investigate carefully at first. A closer look later revealed that the changed URL was only part of the problem: the key point is that the site now uses Ajax, so the data is no longer in the page source. It is fetched locally via Ajax requests and then loaded into the page, so we need to find the Ajax request address and parse the data from it. Below I explain the rewrite. Although fixing it was a hassle, it actually reduced the code to only about 200 lines.

Development environment: Python 3.6 + PyCharm

Third-party libraries: requests (an HTTP request library), xlwt (an Excel writing library), json (conversion between Python objects and JSON; part of the standard library)

Since the original article describes the crawling process in detail, I won't write it up as thoroughly this time. If you want to learn more, please refer to the original article: https://blog.csdn.net/memory_qianxiao/article/details/82388370 (even though that code no longer works, the ideas still matter; after all, the ideas are the most important thing to learn, and code is just an expression of your own thinking).

Note: the site has been updated yet again. This time the update renamed the API and changed the request URL, which broke the code; you only need to point the request at the new Ajax address, following the same path as before (thanks to the student Yi Wei from Jiangsu Province for pointing this out). The full code has since been updated (April 30, and again on July 19, 2019); you can get it from the GitHub link below.

Meanwhile, the site has added some anti-crawling measures, so the request needs to be disguised as a browser; that means adding a request header to the request, as follows:
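For reference, a minimal header of this kind (the same User-Agent string used in the full code below):

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3676.400 QQBrowser/10.4.34"
}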

Update (July 19, 2019): the latest working source code is available through the GitHub portal link in the original post.

As per the old rules, a screenshot of the result comes first (partial crawl): one Excel file with four sheets per school: science major line, arts major line, science provincial control line, arts provincial control line.

One: Analysis. Because the site update moved the data behind Ajax, the old request URLs are useless. The analysis logic is given below:

Press F12 to open developer tools, click the Network tab, and filter by XHR. Then click into any college page: every time you switch the year or toggle between arts and science, a new API request appears, and the key data is in that API's response. The next step is to work out how to pull the data out of this API.

Two: Analyze the headers of the API request, as follows:

Request URL: https://gkcx.eol.cn/api (in the updated code below, the endpoint is https://gkcx.eol.cn/gkcx/api)

Request method: POST! POST! POST! This is important; it is not a GET request.

Request parameters: the POST request's parameters are shown below.

The most important part of getting the data is these request parameters:

local_province_id: the province code. After the site update, the site auto-locates by IP and automatically fills in the code for your current location. So if you need to crawl data for another province, just change this parameter to that province's own code; you can experiment to confirm.

local_type_id: the arts/science code; 1 is science and 2 is arts. You can verify this yourself: every click fires a new API request you can inspect.

school_id: the school's id; to query a different school, just change this id.

uri: "hxsjkqt / api / gk / score / province" is the provincial control line hxsjkqt / api / gk / score / special professional line

year: the year to crawl; pass whatever year you need in this parameter.

Therefore, send the request with requests.post() and pass in the parameters above.
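As a quick illustration, here is a minimal sketch of that request, using the endpoint and parameter names from the full code further down (the site may change these again, so treat them as current only as of this update):

import requests, json

url_api = "https://gkcx.eol.cn/gkcx/api"  # Ajax endpoint as of the July 2019 update
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36"}
data = {
    'local_province_id': '51',               # 51 = Sichuan
    'local_type_id': 1,                      # 1 = science, 2 = arts
    'school_id': '30',                       # school id
    'uri': "hxsjkqt/api/gk/score/province",  # province line; .../special for the major line
    'year': 2018,
}
r = requests.post(url_api, params=data, headers=headers)  # POST, with the parameters passed via params
print(json.loads(r.text)['data']['item'])  # the admission rows live under data -> item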

Here are the country's 34 provinces and regions with their corresponding codes:

北京:11
天津:12
河北:13
山西:14
内蒙古:15
辽宁:21
吉林:22
黑龙江:23
上海:31
江苏:32
浙江:33
安徽:34
福建:35
江西:36
山东:37
河南:41
湖北:42
湖南:43
广东:44
广西:45
海南:46
重庆:50
四川:51
贵州:52
云南:53
西藏:54
陕西:61
甘肃:62
青海:63
宁夏:64
新疆:65
台湾:71
香港:81
澳门:82
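If you crawl several provinces in code, the same table is handy as a Python dict (a direct transcription of the list above):

# province name -> local_province_id code, transcribed from the list above
PROVINCE_CODES = {
    '北京': '11', '天津': '12', '河北': '13', '山西': '14', '内蒙古': '15',
    '辽宁': '21', '吉林': '22', '黑龙江': '23',
    '上海': '31', '江苏': '32', '浙江': '33', '安徽': '34', '福建': '35',
    '江西': '36', '山东': '37',
    '河南': '41', '湖北': '42', '湖南': '43', '广东': '44', '广西': '45', '海南': '46',
    '重庆': '50', '四川': '51', '贵州': '52', '云南': '53', '西藏': '54',
    '陕西': '61', '甘肃': '62', '青海': '63', '宁夏': '64', '新疆': '65',
    '台湾': '71', '香港': '81', '澳门': '82',
}
# e.g. data['local_province_id'] = PROVINCE_CODES['四川']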


Three: Since I don't know the schools' ids, I take the brute-force route: loop school ids from 30 up to 3000 and build each request from the current id.

As before, there are three functions: main() is the entry point, Get_html() sends the request, and Get_info() extracts the school information and writes it into Excel. The code is commented; comments that would repeat are not duplicated, since the logic is the same. Some ids may still raise errors, so I crawl in a breakpoint style: print each school's URL (and thus its id) before requesting, and if an error occurs, change the loop's starting point and continue from there; a sketch of this pattern follows. If anyone knows a cleaner way to handle such errors, please leave a comment; I'd be grateful.
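A minimal sketch of that breakpoint-style loop (START_ID is a hypothetical helper variable, not part of the original code; it assumes Get_info from the full code below):

START_ID = 30  # after a crash, set this to the last printed id and rerun

def main():
    for id in range(START_ID, 3000):
        url = "https://gkcx.eol.cn/school/" + str(id)
        print(url)  # print the URL first, so the failing school id is always known
        try:
            Get_info(url, id)
        except Exception as e:
            print("school id", id, "failed:", e)  # note the id, then restart from here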

The full code is as follows; just change the Excel save path to your own file path.

# -*- coding: utf-8 -*-
# @Filename: 全国高校录取数据重构.py
# @Time    : 2019/3/22 16:49
# @Author  : LYT
"""

"""
import requests, xlwt, json
def Get_html(url, data):
    try:
        headers = {
            'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3676.400 QQBrowser/10.4.34"
        }  # request header added after the site update, disguising the request as a browser
        r = requests.post(url, params=data, headers=headers)  # pass the request parameters via params, together with the headers
        r.encoding = r.apparent_encoding
        print("Request succeeded!")
        print(r.text)
        return r.text
    except:
        print("Request failed")
def Get_info(url, id):
    url_api = "https://gkcx.eol.cn/gkcx/api"  # Ajax API endpoint after the site update
    # request parameters
    data = {
        'local_province_id': '51',  # province: 51 = Sichuan (the site defaults to the auto-located region)
        'local_type_id': 1,  # 1 = science, 2 = arts
        'school_id': '30',  # school id
        'uri': "hxsjkqt/api/gk/score/province",  # province line: province; major line: special
        'year': 2018  # year; after the update: province lines 2014-2018, major lines 2014-2017
    }
    info_like = []  # science province-line rows
    info_wenke = []  # arts province-line rows
    special_like = []  # science major-line rows
    special_wenke = []  # arts major-line rows
    name = ""
    for i in range(2014, 2019):  # request the 2014-2018 provincial control lines

        data['year'] = i  # switch the year
        data['school_id'] = id  # switch the school id
        data['local_type_id'] = 1  # switch to science
        infoes = json.loads(Get_html(url_api, data))
        # print(infoes)
        # print(infoes['data']['item'])
        # save the science province admission line
        if len(infoes['data']['item']) == 0:
            l = []
            l.append(name)  # no data for this year: pad the row with --
            l.append(str(i))
            l.append('--')
            l.append('--')
            l.append('--')
            l.append('--')
            l.append('--')
            l.append('--')
            info_like.append(l)
        for j in range(len(infoes['data']['item'])):  # there may be several province lines (early batch, first batch, second batch, ...)
            l = []
            info = infoes['data']['item'][j]
            name = info['name']
            l.append(info['name'])  # school name
            l.append(info['year'])  # year
            l.append(info['local_type_name'])  # subject type
            l.append(info['max'])  # highest admitted score
            l.append(info['average'])  # average score
            l.append(info['min'])  # lowest score
            l.append(info['proscore'])  # provincial control line
            l.append(info['local_batch_name'])  # batch
            info_like.append(l)

        # save the arts province admission line
        data['local_type_id'] = 2  # switch to arts
        infoes = json.loads(Get_html(url_api, data))
        if len(infoes['data']['item']) == 0:
            l = []
            l.append(name)  # no data for this year: pad the row with --
            l.append(str(i))
            l.append('--')
            l.append('--')
            l.append('--')
            l.append('--')
            l.append('--')
            l.append('--')
            info_wenke.append(l)
        for j in range(len(infoes['data']['item'])):  # there may be several province lines (early batch, first batch, second batch, ...)
            l = []
            info = infoes['data']['item'][j]
            name = info['name']
            l.append(info['name'])  # school name
            l.append(info['year'])  # year
            l.append(info['local_type_name'])  # subject type
            l.append(info['max'])  # highest admitted score
            l.append(info['average'])  # average score
            l.append(info['min'])  # lowest score
            l.append(info['proscore'])  # provincial control line
            l.append(info['local_batch_name'])  # batch
            info_wenke.append(l)

        print(info_like)
        print(info_wenke)

    for i in range(2014, 2018):  # request the 2014-2017 major admission lines
        # request parameters
        data['year'] = i  # switch the year
        data['school_id'] = id  # switch the school id
        data['local_type_id'] = 1  # switch to science
        data['uri'] = '/'.join(data['uri'].split('/')[0:-1]) + "/special"  # switch the uri parameter to the major-line endpoint
        infoes = json.loads(Get_html(url_api, data))
        # save the science major info
        if len(infoes['data']['item']) == 0:
            l = []
            l.append(name)  # no data for this year: pad the row with --
            l.append(str(i))
            l.append('--')
            l.append('--')
            l.append('--')
            l.append('--')
            l.append('--')
            l.append('--')
            special_like.append(l)
        for j in range(len(infoes['data']['item'])):
            l = []
            info = infoes['data']['item'][j]
            name = info['name']
            l.append(info['name'])  # school name
            l.append(info['year'])  # year
            l.append(info['spname'])  # major name
            l.append(info['local_type_name'])  # subject type
            l.append(info['average'])  # major average score
            l.append(info['max'])  # highest score
            l.append(info['min'])  # lowest score
            l.append(info['local_batch_name'])  # admission batch
            special_like.append(l)
        # save the arts major admission info
        data['local_type_id'] = 2  # switch to arts
        infoes = json.loads(Get_html(url_api, data))
        if len(infoes['data']['item']) == 0:
            l = []
            l.append(name)  # no data for this year: pad the row with --
            l.append(str(i))
            l.append('--')
            l.append('--')
            l.append('--')
            l.append('--')
            l.append('--')
            l.append('--')
            special_wenke.append(l)
        for j in range(len(infoes['data']['item'])):
            l = []
            info = infoes['data']['item'][j]
            name = info['name']
            l.append(info['name'])  # school name
            l.append(info['year'])  # year
            l.append(info['spname'])  # major name
            l.append(info['local_type_name'])  # subject type
            l.append(info['average'])  # major average score
            l.append(info['max'])  # highest score
            l.append(info['min'])  # lowest score
            l.append(info['local_batch_name'])  # admission batch
            special_wenke.append(l)
        print(special_like)
        print(special_wenke)
    # ******************* save the data to Excel **************************
    print("************* Writing Excel data ***************")
    # create the workbook with utf-8 encoding
    file = xlwt.Workbook(encoding='utf-8')
    # create the sheets: one workbook holds 4 sheets
    print("************* Writing science major-line data to Excel ***************")
    table1 = file.add_sheet(name + '理科专业线')
    value = ['学校名称', '年份','专业名称','科类', '平均分', '最高分', '最低分', '批次']
    for i in range(len(value)):
        table1.write(0,i,value[i])
    for i in range(len(special_like)):
        for j in range(len(special_like[i])):
            table1.write(i+1,j,special_like[i][j])

    print("*************正在写入Excrl文科专业线数据***************")
    table2 = file.add_sheet(name + '文科专业线')
    value = ['学校名称', '年份', '专业名称', '科类', '平均分', '最高分', '最低分', '批次']
    for i in range(len(value)):
        table2.write(0, i, value[i])
    for i in range(len(special_wenke)):
        for j in range(len(special_wenke[i])):
            table2.write(i + 1, j, special_wenke[i][j])

    print("*************正在写入Excrl理科省控线数据***************")
    table3 = file.add_sheet(name + '理科省控线')
    value = ['学校名称', '年份', '科类', '最高分', '平均分', '最低分', '省控线', '批次']
    for i in range(len(value)):
        table3.write(0, i, value[i])
    for i in range(len(info_like)):
        for j in range(len(info_like[i])):
            table3.write(i + 1, j, info_like[i][j])

    print("*************正在写入Excrl文科省控线数据***************")
    table4 = file.add_sheet(name + '文科省控线')
    value = ['学校名称', '年份', '科类', '最高分', '平均分', '最低分', '省控线', '批次']
    for i in range(len(value)):
        table4.write(0, i, value[i])
    for i in range(len(info_wenke)):
        for j in range(len(info_wenke[i])):
            table4.write(i + 1, j, info_wenke[i][j])

    # *********** specify the save path ******************
    file.save('D:/QQPCMgr(1)/Desktop/高校/' + name + '录取数据.xls')  # change this to your own path
    print("**************** " + name + ": all data crawled successfully *********************")
def main():
    for id in range(30,3000):
        url="https://gkcx.eol.cn/school/"+str(id)
        print(url)
        Get_info(url,id)

if __name__ == '__main__':
    main()

 


Origin: https://blog.csdn.net/memory_qianxiao/article/details/88767327