Python: My First Crawler Program

It suddenly struck me how powerful web crawlers are, so I studied them a bit and tried one out on nowcoder (牛客网).

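The script crawls every contest on nowcoder's ACM contest pages, lists how many problems I have solved in each contest, and adds them up into an overall total.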

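Using F12 (the developer tools) in Chrome, I found that this information lives in dynamically loaded JS data (PS: I don't fully understand it myself, so I'll describe it this way for now). Solve status cannot be static: every time you get an AC on a problem, the server partially refreshes your solve information. In short, this information is not in the page source, so the usual approach of treating the page source as a string no longer works. The dynamic data does have its own URL, though, and F12 shows it. Request that URL and process the response as a string instead. (PS: you also need the current cookie of the logged-in account.)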
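The string-processing step can be tried offline before touching the network. The sketch below is only an illustration: `SAMPLE` is a made-up response (only the `"index"`, `"myStatus"` and `"problemCount"` fragments come from the patterns used in this post; the `"data"` wrapper and the `未通过` status are assumptions), and `summarize` is a hypothetical helper name.

```python
import re

# Hypothetical response text: only the "index"/"myStatus"/"problemCount"
# fragments are taken from the patterns used in this post; the rest is made up.
SAMPLE = (
    '{"problemCount":3,"data":['
    '{"index":"A","myStatus":"通过"},'
    '{"index":"B","myStatus":"未通过"},'
    '{"index":"C","myStatus":"通过"}]}'
)

# One match per problem whose status is exactly "通过" (accepted).
ACCEPTED = re.compile(r'"index":"\D","myStatus":"通过"')
PROBLEM_COUNT = re.compile(r'"problemCount":(\d+)')


def summarize(text):
    """Return (accepted, total) from raw response text; total is None if absent."""
    accepted = len(ACCEPTED.findall(text))
    match = PROBLEM_COUNT.search(text)
    total = int(match.group(1)) if match else None
    return accepted, total


print(summarize(SAMPLE))
```

Keeping the parsing in a function that takes plain text makes it easy to check the regular expressions against a response saved from F12 before running the full crawl.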

import re
import time

import requests

# Paste the Cookie header of your own logged-in nowcoder session here.
headers = {
    'Cookie': '_'
}

# Dynamic endpoint that returns a contest's problem list (found via F12).
list_url = 'https://www.nowcoder.com/acm/contest/problem-list?token=&id={}&_=**********'
# Regular contest page, used only to read the contest name from <title>.
page_url = 'https://www.nowcoder.com/acm/contest/{}#question'

tot = 0
tottxt = 'data :\n'
for num in range(1, 139):
    try:
        # verify=False skips TLS certificate checks, as in the original script.
        res = requests.get(list_url.format(num), headers=headers, verify=False)
        # One match per problem whose status is "通过" (accepted).
        ss = re.findall(r'"index":"\D","myStatus":"通过"', res.text)
        if len(ss) > 0:
            ss2 = re.search(r'"problemCount":(\d+)', res.text)
            tottxt += 'Contest id:' + str(num) + '\n'
            temp = requests.get(page_url.format(num))
            ss3 = re.search(r'<title>(.*?)</title>', temp.text)
            if ss3 is not None:
                # Drop the site suffix so only the contest name remains.
                tottxt += ss3.group(1).replace('_牛客网', '') + '\n'
            tottxt += '     "Accepted":' + str(len(ss)) + '/'
            tot += len(ss)
            if ss2 is not None:
                tottxt += ss2.group(1)
            tottxt += '\n'
            time.sleep(2)  # pause between contests to go easy on the server
    except requests.exceptions.ConnectionError:
        # requests raises its own ConnectionError, not the built-in one.
        print("**ConnectionError**")

tottxt += '\n' + 'All Accepted:' + str(tot)
with open('tot2.txt', 'wb') as f:
    f.write(tottxt.encode('utf-8'))
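The report text built up inside the loop above could also be produced by a small pure function, which makes the layout easy to check without any network access. This is just one possible refactoring, not part of the original script; `format_entry` and its parameters are hypothetical names.

```python
def format_entry(contest_id, title, accepted, total):
    """Build the report lines for one contest in the same layout as the script."""
    lines = ['Contest id:' + str(contest_id)]
    # title is None when no <title> tag was found on the contest page.
    if title is not None:
        lines.append(title)
    # total is None when "problemCount" was missing from the response.
    total_text = '' if total is None else str(total)
    lines.append('     "Accepted":' + str(accepted) + '/' + total_text)
    return '\n'.join(lines) + '\n'


# 'Some contest' is a placeholder name, not a real contest title.
print(format_entry(1, 'Some contest', 2, 3), end='')
```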
