Python Web Crawler Assignment

The assignment has two parts.

Part 1:

 
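Analyze the assignment page, crawl the information on submitted assignments, and generate a list of submitted assignments saved as a CSV file separated by ASCII commas. The file must be named hwlist.csv.
 
Sample file contents: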
 
Student ID,Name,Assignment title,Submission time,Assignment URL
20194010101,Zhang San,Monty Hall homework,2018-11-13 23:47:36.8,http://www.cnblogs.com/sninius/p/12345678.html
20194010102,Li Si,Monty Hall,2018-11-14 9:38:27.03,http://www.cnblogs.com/sninius/p/87654321.html
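
An assignment title may itself contain a comma, which would break a file built by plain string concatenation. The following is a minimal sketch of producing the format above with the standard csv module, which quotes such fields automatically. The field names StudentNo, RealName, Title, DateAdded and Url are the ones read by the solution code in this post; the function name rows_to_csv, the English header text, the "T" separator in the timestamp, and the sample record are assumptions made for illustration.

```python
import csv
import io

# English rendering of the header row shown in the sample (an assumption).
HEADER = ["Student ID", "Name", "Assignment title", "Submission time", "Assignment URL"]
# JSON field names, in the same column order as HEADER.
FIELDS = ["StudentNo", "RealName", "Title", "DateAdded", "Url"]


def rows_to_csv(records):
    """Render a list of dicts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADER)
    for rec in records:
        row = [str(rec[name]) for name in FIELDS]
        # Assumed input form "2018-11-13T23:47:36.8"; the sample uses a space.
        row[3] = row[3].replace("T", " ")
        writer.writerow(row)
    return buf.getvalue()


# Hypothetical record whose title contains a comma.
sample = [{
    "StudentNo": "20194010101",
    "RealName": "Zhang San",
    "Title": "Goats, cars and doors",
    "DateAdded": "2018-11-13T23:47:36.8",
    "Url": "http://www.cnblogs.com/sninius/p/12345678.html",
}]
csv_text = rows_to_csv(sample)
```

When writing to a real file instead of a StringIO buffer, the csv documentation recommends opening it with newline="" so that line endings are not translated twice.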
 
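*Note 1: If you build a crawler that fetches the assignments periodically, make sure it does not send requests too frequently.
*Note 2: Some of the libraries used in this part:
(1) requests: third-party library
(2) json: built-in library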
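
One way to honor Note 1 is to enforce a minimum interval between requests rather than sleeping for a fixed time after each one. The sketch below is only an illustration: the class name Throttle and its parameters are hypothetical, and the clock and sleep functions are injectable so the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawler would create one instance, for example Throttle(1.0), and call its wait() method immediately before each requests.get call.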
 

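Part 2:

In the same folder as the generated hwlist.csv, create a folder named hwFolder. For each student who has submitted an assignment, create a subfolder named after that student's ID, download the student's assignment page, and save it in that subfolder as a file named after the student ID with the extension ".html".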
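
The required layout is hwFolder/<student ID>/<student ID>.html. As a rough sketch of that layout using pathlib (the helper name save_page and the placeholder page bytes are hypothetical, and a temporary directory stands in for the folder that holds hwlist.csv):

```python
import tempfile
from pathlib import Path


def save_page(base_dir, student_id, html_bytes):
    """Write html_bytes to <base_dir>/hwFolder/<id>/<id>.html and return the path."""
    student_id = str(student_id)  # the ID may arrive as an int
    folder = Path(base_dir) / "hwFolder" / student_id
    folder.mkdir(parents=True, exist_ok=True)  # creates hwFolder as well if needed
    target = folder / (student_id + ".html")
    target.write_bytes(html_bytes)
    return target


# Demonstrate the layout inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_page(tmp, 20194010101, b"<html></html>")
    relative = saved.relative_to(tmp).as_posix()
```

Here relative holds the path of the saved file relative to the base folder, with forward slashes on every platform.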

The code is as follows:

 main.py

import os
import json
import time
import requests

from config import *


def getJsonData():
    """Fetch the JSON text that lists the submitted assignments; return False on failure."""
    url = BASE_URL
    try:
        ret = requests.get(url=url, headers=HEADERS, timeout=30)
        ret.raise_for_status()
    except requests.RequestException:
        return False
    else:
        ret.encoding = ret.apparent_encoding
        return ret.text


def jsonLoad(json_data):
    """Write one comma-separated line per submission to hwlist.csv."""
    if not json_data:
        return False
    data_list = json.loads(json_data)["data"]
    # Explicit utf-8 so that Chinese names and titles are written on any platform.
    with open("hwlist.csv", "w", encoding="utf-8") as f:
        for data in data_list:
            f.write(str(data["StudentNo"]) + ","
                    + str(data["RealName"]) + ","
                    + str(data["Title"]) + ","
                    # Replace the "T" between date and time with a space, as in the sample.
                    + str(data["DateAdded"]).replace("T", " ") + ","
                    + str(data["Url"]) + "\n"
                    )
    return True


def getHtmlContent(url):
    """Download a page and return its raw bytes, or False on failure."""
    if not url:
        return False
    try:
        ret = requests.get(url=url, headers=HEADERS, timeout=30)
        ret.raise_for_status()
    except requests.RequestException:
        return False
    else:
        return ret.content


def dirMake(json_data):
    """Save each submission as hwFolder/<student ID>/<student ID>.html."""
    if not json_data:
        return
    path = "./hwFolder"
    os.makedirs(path, exist_ok=True)

    data_list = json.loads(json_data)["data"]
    for data in data_list:
        ID = data["StudentNo"]
        if not ID:
            continue
        ID = str(ID)  # convert before building any path
        child_path = path + "/" + ID
        os.makedirs(child_path, exist_ok=True)

        htmlContent = getHtmlContent(str(data["Url"]))
        time.sleep(SLEEP_TIME)  # pause between requests
        if not htmlContent:
            continue  # download failed, nothing to save
        with open(child_path + "/" + ID + ".html", "wb") as f:
            f.write(htmlContent)


def main():
    json_data = getJsonData()
    jsonLoad(json_data)
    dirMake(json_data)
    print("Done...")


if __name__ == "__main__":
    main()
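
The functions above mix parsing with network and file access, which makes them awkward to try out offline. As a small sketch, the parsing step can be isolated and exercised against a hand-written payload. The payload below is an assumption: it mirrors only the fields the code reads ("data", StudentNo, Url and so on), and the real response may contain more. The function name extract_submissions is hypothetical.

```python
import json

# Hand-written stand-in for the server response (assumed shape, not captured data).
SAMPLE_PAYLOAD = json.dumps({
    "data": [
        {
            "StudentNo": "20194010101",
            "RealName": "Zhang San",
            "Title": "Monty Hall homework",
            "DateAdded": "2018-11-13T23:47:36.8",
            "Url": "http://www.cnblogs.com/sninius/p/12345678.html",
        },
        {
            # An entry without a student number is skipped, as in dirMake.
            "StudentNo": None,
            "RealName": "Li Si",
            "Title": "Monty Hall",
            "DateAdded": "2018-11-14T9:38:27.03",
            "Url": "http://www.cnblogs.com/sninius/p/87654321.html",
        },
    ]
})


def extract_submissions(json_text):
    """Return (student_id, url) pairs for entries that carry a student number."""
    entries = json.loads(json_text)["data"]
    return [
        (str(entry["StudentNo"]), str(entry["Url"]))
        for entry in entries
        if entry.get("StudentNo")
    ]


pairs = extract_submissions(SAMPLE_PAYLOAD)
```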

  config.py

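# -*- coding:utf-8; -*-
# Seconds to wait after each page download
SLEEP_TIME = 0.2

# Browser-like User-Agent sent with every request
HEADERS = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0"}

# JSON endpoint that lists the submitted assignments
BASE_URL = "https://edu.cnblogs.com/Homework/GetAnswers?homeworkId=2420&_=1542959851766"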
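
BASE_URL hard-codes both the homework ID and a value for the "_" parameter. That value looks like a millisecond timestamp of the kind often appended to defeat caching, but the source does not say whether the server requires it. Under that assumption, a sketch of building the URL from its parts could look like this (the names ENDPOINT and build_url are hypothetical):

```python
import time
from urllib.parse import urlencode

# Path portion of BASE_URL, without the query string.
ENDPOINT = "https://edu.cnblogs.com/Homework/GetAnswers"


def build_url(homework_id, now_ms=None):
    """Build the request URL; now_ms defaults to the current time in milliseconds."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    query = urlencode({"homeworkId": homework_id, "_": now_ms})
    return ENDPOINT + "?" + query


# Reproduces the BASE_URL from config.py when given the same values.
url = build_url(2420, now_ms=1542959851766)
```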


Reposted from www.cnblogs.com/zmq620/p/10064689.html