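The assignment has two parts.
Part 1:
Analyze the homework page, crawl the information about submitted homework, and generate a list of submitted homework, saved as a CSV file delimited by ASCII commas. The file name is:
hwlist.csv.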
The file content should look like this:
student ID,name,homework title,submission time,homework URL
20194010101,Zhang San,Monty Hall homework,2018-11-13 23:47:36.8,http://www.cnblogs.com/sninius/p/12345678.html
20194010102,Li Si,Monty Hall,2018-11-14 9:38:27.03,http://www.cnblogs.com/sninius/p/87654321.html
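As a rough illustration of how rows in this shape could be produced, here is a minimal sketch. It assumes each record is a dict with the keys StudentNo, RealName, Title, DateAdded and Url (the field names that main.py further down reads from the JSON response). The helper name records_to_csv and the sample record are hypothetical. It uses the built-in csv module rather than manual string concatenation, so a title that itself contains a comma is quoted instead of breaking the columns.

```python
import csv
import io


def records_to_csv(records):
    """Return CSV text with one line per submitted-homework record."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for rec in records:
        writer.writerow([
            rec["StudentNo"],
            rec["RealName"],
            rec["Title"],
            # Assumed ISO-style timestamp: turn the "T" separator into a space.
            str(rec["DateAdded"]).replace("T", " "),
            rec["Url"],
        ])
    return buf.getvalue()


# Hypothetical sample record, shaped like the example rows above.
sample = [{
    "StudentNo": "20194010101",
    "RealName": "Zhang San",
    "Title": "Goats, cars, doors",
    "DateAdded": "2018-11-13T23:47:36.8",
    "Url": "http://www.cnblogs.com/sninius/p/12345678.html",
}]
csv_text = records_to_csv(sample)
print(csv_text, end="")
```

To write the result to disk, the same text can be saved to hwlist.csv with open("hwlist.csv", "w", encoding="utf-8").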
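*Note 1: If you build a crawler that fetches the homework periodically, make sure it does not crawl too frequently;
*Note 2: Some of the libraries used in this part are:
(1) requests: third-party library
(2) json: built-in library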
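Regarding Note 1, one possible way to keep the crawl frequency down is to enforce a minimum interval between consecutive requests. The sketch below is only an illustration: the Throttle class is hypothetical, and the 0.05 second interval is chosen just to keep the demo fast. A real crawler would likely use a much longer pause.

```python
import time


class Throttle:
    """Make consecutive wait() calls at least min_interval seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval=0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a requests.get(...) call would follow here
elapsed = time.monotonic() - start
```

The first call returns immediately; only the later calls sleep, so three calls take roughly two intervals in total.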
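Part 2:
In the same folder as the generated hwlist.csv file, create a folder named hwFolder. For each student who has submitted homework, create a subfolder named after that student's ID, download the student's homework page, and save it in that subfolder as a file named with the student ID and the extension ".html".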
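The folder layout described here can be sketched as follows. This is only an illustration under stated assumptions: the helper save_homework_page is hypothetical, the page content is a placeholder byte string rather than a real download, and a temporary directory stands in for the folder that holds hwlist.csv.

```python
import os
import tempfile


def save_homework_page(base_dir, student_no, html_bytes):
    """Write html_bytes to <base_dir>/hwFolder/<student_no>/<student_no>.html."""
    student_no = str(student_no)
    folder = os.path.join(base_dir, "hwFolder", student_no)
    # Creates hwFolder as well if it is missing; no error if it already exists.
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, student_no + ".html")
    with open(target, "wb") as f:
        f.write(html_bytes)
    return target


with tempfile.TemporaryDirectory() as tmp:
    saved = save_homework_page(tmp, 20194010101, b"<html></html>")
    relative = os.path.relpath(saved, tmp).replace(os.sep, "/")
    exists = os.path.isfile(saved)
```

Using os.makedirs with exist_ok=True avoids a separate os.path.exists check before each os.mkdir call.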
The code is as follows:
main.py
import os
import json
import time
import requests

from config import BASE_URL, HEADERS, SLEEP_TIME


def getJsonData():
    """Fetch the homework list as JSON text; return False on a request error."""
    try:
        ret = requests.get(url=BASE_URL, headers=HEADERS, timeout=30)
        ret.raise_for_status()
    except requests.RequestException:
        return False
    ret.encoding = ret.apparent_encoding
    return ret.text


def jsonLoad(json_data):
    """Write one CSV line per submitted homework to hwlist.csv."""
    if not json_data:
        return False
    data_list = json.loads(json_data)["data"]
    # utf-8 so that Chinese names and titles are written the same way on any platform
    with open("hwlist.csv", "w", encoding="utf-8") as f:
        for data in data_list:
            f.write(",".join([
                str(data["StudentNo"]),
                str(data["RealName"]),
                str(data["Title"]),
                # "2018-11-13T23:47:36.8" -> "2018-11-13 23:47:36.8"
                str(data["DateAdded"]).replace("T", " "),
                str(data["Url"]),
            ]) + "\n")
    return True


def getHtmlContent(url):
    """Download a page and return its raw bytes, or False on failure."""
    if not url:
        return False
    try:
        ret = requests.get(url=url, headers=HEADERS, timeout=30)
        ret.raise_for_status()
    except requests.RequestException:
        return False
    return ret.content


def dirMake(json_data):
    """Save each homework page as hwFolder/<StudentNo>/<StudentNo>.html."""
    if not json_data:
        return
    path = "./hwFolder"
    os.makedirs(path, exist_ok=True)

    data_list = json.loads(json_data)["data"]
    for data in data_list:
        ID = data["StudentNo"]
        if not ID:
            continue
        ID = str(ID)  # convert before building the path
        child_path = path + "/" + ID
        os.makedirs(child_path, exist_ok=True)

        htmlContent = getHtmlContent(str(data["Url"]))
        time.sleep(SLEEP_TIME)  # pause between requests
        if not htmlContent:
            continue  # download failed; skip this student
        with open(child_path + "/" + ID + ".html", "wb") as f:
            f.write(htmlContent)


def main():
    json_data = getJsonData()
    jsonLoad(json_data)
    dirMake(json_data)
    print("Done...")


if __name__ == "__main__":
    main()
config.py
# -*- coding:utf-8; -*-
SLEEP_TIME = 0.2  # seconds to pause between page downloads

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0"}

BASE_URL = "https://edu.cnblogs.com/Homework/GetAnswers?homeworkId=2420&_=1542959851766"