Scraping digital electronics lecture notes with Python and saving them as a PDF

Website

http://cc.xjtu.edu.cn/G2S/site/preview#/rich/v/124828?ref=&currentoc=192

Goal: save the lecture-note content as a PDF to use as review material.

First, the relevant content does not appear in the page source: both the list on the left and the content on the right are generated dynamically.

Refresh the page and inspect the XHR requests to work out the page's pattern and its data API.

Fetching the sidebar data: the number of items returned matches the number of entries in the left-hand list, and each item's id corresponds to an entry's URL.

import json
import requests

# URL for the sidebar list
list_url = 'http://cc.xjtu.edu.cn/G2S/DataProvider/OC/Site/SiteProvider.aspx/OCSiteColumn_List'

# headers is the request-header dict shown further below
with open('./html/list.json', encoding='utf8', mode='w+') as f:
    text = requests.post(list_url, data=json.dumps({'ColumnID': '124828'}), headers=headers).text
    f.write(text)
    print(text)
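
Judging from the fields the later script relies on (d and ColumnID), the sidebar response appears to be a JSON object with a d array of column entries; here is a minimal sketch of listing those ids from the file just saved. The exact response shape is an assumption.

import json

# List the ColumnID of every sidebar entry from the saved response.
# The 'd' / 'ColumnID' field names are taken from the saveAll() script below.
with open('./html/list.json', encoding='utf8', mode='r') as f:
    sidebar = json.load(f)
    for item in sidebar['d']:
        print(item['ColumnID'])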

Fetching the content: it lives in the Conten field of the response, so it can be saved directly after it is fetched.


import json
import requests

url = 'http://cc.xjtu.edu.cn/G2S/DataProvider/OC/Site/SiteProvider.aspx/OCSiteColumn_Get'

headers = {
    'Host': 'cc.xjtu.edu.cn',
    'Connection': 'keep-alive',
    'Content-Length': '21',
    'Accept': 'application/json, text/plain, */*',
    'Origin': 'http://cc.xjtu.edu.cn',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36',
    'Content-Type': 'application/json;charset=UTF-8',
    'Referer': 'http://cc.xjtu.edu.cn/G2S/site/preview',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cookie': 'ASP.NET_SessionId=ouwltyfoe4v1lshm4kscyvga',
}

print(
    requests.post(url, data=json.dumps({'ColumnID': '124843'}), headers=headers).text
)
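
Rather than printing the whole response, the HTML itself can be pulled out of the Conten field, the same d[0]['Conten'] path the saveHtml function below uses. A minimal sketch, reusing the url and headers defined above; the surrounding response structure is an assumption.

# Extract only the HTML body from the response (continues the snippet above,
# so url and headers must already be defined).
resp = requests.post(url, data=json.dumps({'ColumnID': '124843'}), headers=headers).json()
html = resp['d'][0]['Conten']
print(html[:200])  # preview the first 200 characters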

Fetch all the lecture notes and save each one to its own file.

import json
import requests
import pdfkit

headers = {
    'Host': 'cc.xjtu.edu.cn',
    'Connection': 'keep-alive',
    'Content-Length': '21',
    'Accept': 'application/json, text/plain, */*',
    'Origin': 'http://cc.xjtu.edu.cn',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36',
    'Content-Type': 'application/json;charset=UTF-8',
    'Referer': 'http://cc.xjtu.edu.cn/G2S/site/preview',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cookie': 'ASP.NET_SessionId=ouwltyfoe4v1lshm4kscyvga',
}


def saveHtml(uid):
    # Fetch one column's content and save it as an HTML file.
    # The charset tag keeps the Chinese text from turning into mojibake.
    url = 'http://cc.xjtu.edu.cn/G2S/DataProvider/OC/Site/SiteProvider.aspx/OCSiteColumn_Get'
    with open(f"./html/{uid}.html", mode='w+', encoding='utf8') as f:
        text = requests.post(url, data=json.dumps({'ColumnID': uid}), headers=headers).json()
        f.write('<head><meta charset="UTF-8"> </head>' + text['d'][0]['Conten'])


def saveAll():
    # ./html/t.json holds the sidebar list JSON (the same data saved above)
    with open('./html/t.json', mode='r', encoding='utf8') as f:
        url_list = json.load(f)
        for i in url_list['d']:
            print(i['ColumnID'])
            saveHtml(i['ColumnID'])


saveAll()
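
One optional tweak, not in the original script: fetching dozens of pages back-to-back can be hard on the server, so a short pause between requests doesn't hurt. A sketch assuming the same saveHtml and t.json as above; the one-second delay is an arbitrary choice.

import time

def saveAllSlowly(delay=1.0):
    # Same as saveAll(), but pauses between requests to avoid hammering the server.
    with open('./html/t.json', mode='r', encoding='utf8') as f:
        for i in json.load(f)['d']:
            print(i['ColumnID'])
            saveHtml(i['ColumnID'])
            time.sleep(delay)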

Merge the HTML files into a single PDF. Note that garbled text is fixed by prepending a charset tag to each HTML file:

<head><meta charset="UTF-8"> </head>

import pdfkit
import json


def savePdf():
    # Merge every saved HTML file into one PDF, in sidebar order.
    with open('./html/t.json', encoding='utf8', mode='r') as f:
        url_list = json.load(f)['d']
        file_paths = [f"./html/{i['ColumnID']}.html" for i in url_list]

        pdfkit.from_file(file_paths, 'out.pdf')


savePdf()
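
Note that pdfkit is only a wrapper around the wkhtmltopdf command-line tool, so wkhtmltopdf has to be installed separately. If it is not on the PATH, its location can be passed in explicitly; the path below is only a placeholder.

import pdfkit

# Point pdfkit at the wkhtmltopdf binary explicitly; replace the path with
# wherever wkhtmltopdf is installed on your machine (placeholder path).
config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')

file_paths = ['./html/124843.html']  # example input; use the list built in savePdf()
pdfkit.from_file(file_paths, 'out.pdf', configuration=config)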

The result. Even with all this, it still feels like I'm going to fail the course...


Reposted from my.oschina.net/ahaoboy/blog/1825602