OJ System Crawler: A Summary

Background

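Recently my advisor asked me to export the students' code from our OJ (online judge) system. It turned out the system has no one-click export, so I had no choice but to grit my teeth and write a crawler, working through a pile of messy tutorials found on Baidu. After a day and a half of tinkering I finally had something to hand in.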

Requirements

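1. Fetch the CAPTCHA
2. Simulate the login
3. Extract the student account (student number) stuID, the run number runID and the problem number pID, and build the link to each student's code submission page
4. Fetch the submitted code from each link
5. Merge student numbers with student names
6. Save each student's code locally, in a folder named after the student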
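
Step 3 can be sketched as a small helper. The function name build_submit_link is hypothetical; the URL layout (submitpage.php with cid, pid and sid) is taken from the full script further down, where only pID and runID (sent as sid) end up in the link, while stuID is used later to match names.

```python
from urllib.parse import urlencode

# Same base URL as in the full script.
BASE_URL = "http://116.56.140.75:8000/JudgeOnline/"


def build_submit_link(cid, pid, run_id, base_url=BASE_URL):
    """Return the submission-page URL for one run (hypothetical helper)."""
    query = urlencode({"cid": cid, "pid": pid, "sid": run_id})
    return base_url + "submitpage.php?" + query


# The pid and sid values here are made up for illustration.
print(build_submit_link(1117, 0, 123456))
```

Using urlencode rather than string concatenation also quotes any characters that are not safe in a query string.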

Summary

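The main takeaways:
1. An opener is a great tool: bind the cookie jar to it once at the start, and every later request carries the cookies with no extra work.
2. Build the request; for a POST the code looks like this: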

response=opener.open(postURL,postdata) # for a GET, omit the second argument
content=response.read().decode()
print(content) # inspect the result of the login POST
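
The GET/POST distinction can be checked without touching the network. The sketch below is only an illustration: the helper names make_opener and make_request and the example.com URLs are assumptions, not part of the original script. It relies on the fact that urllib sends a POST whenever a request carries a data payload.

```python
import urllib.parse
import urllib.request
from http import cookiejar


def make_opener():
    """Build an opener that stores and resends cookies automatically."""
    jar = cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(jar)
    return urllib.request.build_opener(handler), jar


def make_request(url, form=None):
    """Return a Request: GET when form is None, POST when a form dict is given."""
    data = None
    if form is not None:
        # key1=value1&key2=value2, encoded to bytes as urllib requires
        data = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(url, data=data)


opener, jar = make_opener()
get_req = make_request("http://example.com/status.php")
post_req = make_request("http://example.com/login.php",
                        {"user_id": "demo", "vcode": "1234"})
print(get_req.get_method(), post_req.get_method())  # GET POST
# opener.open(post_req) would actually send the request and keep any cookies in jar.
```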

3. Read the Excel file:

pd.read_excel('d:/stulist.xlsx', sheet_name='stulist',dtype=str) # current pandas spells the parameter sheet_name; the old sheetname has been removed

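4. When packet capture shows that the login needs a CAPTCHA, trace it back to the CAPTCHA URL, request that URL with the opener so the cookie is set, save the image locally, then read it back in and display it.
5. A question mark in a regular expression must be escaped with a backslash.
6. Moving to the next page: extract the next-page URL from the current page and keep crawling until the last page breaks the while loop.
7. try-except lets the script carry on after an error.
8. At first I didn't know what I was doing, so after reading many tutorials I stitched code together, using the opener one moment and requests the next, and wasted a lot of time. There are plenty of crawling libraries; pick one and stick with it. Since the opener is this easy, I'll use it exclusively from now on.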
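
Points 5 and 6 can be tried offline. This is a minimal sketch, not the script itself: FAKE_PAGES stands in for the status.php responses and its markup is a simplified assumption, and crawl is a hypothetical helper that takes any fetch function.

```python
import re

# Unescaped, "?" makes the preceding "p" optional instead of matching "?".
print(bool(re.search(r"php?cid", "status.php?cid=1")))   # False
print(bool(re.search(r"php\?cid", "status.php?cid=1")))  # True

# Hypothetical pages keyed by URL; the last page links back to itself.
FAKE_PAGES = {
    "status.php?cid=1117": "[<a href=status.php?cid=1117&top=20>Next Page</a>]",
    "status.php?cid=1117&top=20": "[<a href=status.php?cid=1117&top=10>Next Page</a>]",
    "status.php?cid=1117&top=10": "[<a href=status.php?cid=1117&top=10>Next Page</a>]",
}

# "\." and "\?" match a literal dot and a literal question mark.
NEXT_LINK = re.compile(r"<a href=(status\.php\?.*?)>Next Page")


def crawl(start_url, fetch):
    """Follow "Next Page" links until a page links to itself or has no link."""
    visited = []
    url = start_url
    while True:
        page = fetch(url)
        visited.append(url)
        match = NEXT_LINK.search(page)
        if match is None or match.group(1) == url:
            break
        url = match.group(1)
    return visited


pages = crawl("status.php?cid=1117", FAKE_PAGES.__getitem__)
print(len(pages))  # 3
```

Passing fetch as a parameter keeps the loop testable; in the real crawler it could be a small function that calls opener.open(url).read().decode().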

Code:

# -*- coding: utf-8 -*-
"""
Created on Tue Jul 10 15:01:36 2018

@author: Administrator
"""


from http import cookiejar
import urllib.request
import re  
import pandas as pd
import os
import urllib.parse
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

stulist=pd.read_excel('d:/stulist.xlsx', sheet_name='stulist',dtype=str) # sheet_name replaces the removed sheetname parameter


orginURL="http://116.56.140.75:8000/JudgeOnline/"
cid=1117
vcodeURL=orginURL+"vcode.php"
postURL=orginURL+"login.php"
statusURL=orginURL+'status.php'
page_URL= statusURL+'?problem_id=&user_id=&cid='+str(cid)+'&language=-1&jresult=-1&showsim=0'

# Bind the cookie jar to an opener; http.cookiejar manages the cookies automatically
cookie=cookiejar.CookieJar()
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
# Set the request headers based on the captured traffic
opener.addheaders=[("User-Agent","Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)")]


# Request the CAPTCHA URL with the opener so the session cookie is set
pic=opener.open(vcodeURL).read()

# Save the CAPTCHA locally (binary write mode)
with open('vcode.jpg','wb') as local:
    local.write(pic)

# Read the CAPTCHA back in and display it
plt.imshow(mpimg.imread('vcode.jpg'))
plt.show()


vcode=input('Enter the CAPTCHA: ')


# Build the form from the captured traffic
postData={'password':'xxxxxx',
          'submit':'Submit',
          'user_id':'xxxxxx',
          'vcode':vcode
         }
# Encode the POST data (key1=value1&key2=value2)
data=urllib.parse.urlencode(postData).encode('utf-8')

# Send the login request
response=opener.open(postURL,data)
content=response.read().decode()
print(content) # shows the login result


runID=[]
stuID=[]
pID=[]

while True:
    response = opener.open(page_URL)
    content=response.read().decode()

    """
    pattern0 = re.compile(r'<tr><td>(.*?)</td><td>',re.S)
    match_ct0 = re.findall(pattern0,content)
    for i in range(len(match_ct0)):
        runID.append(match_ct0[i].lstrip()) # strip leading whitespace
    """
    pattern1 = re.compile(r'#(.*?)\'',re.S)
    match_ct1 = re.findall(pattern1,content)
    for i in range(len(match_ct1)-1):
        stuID.append(match_ct1[i+1]) # the first match is skipped

    pattern2 = re.compile(r'href="submitpage.php\?cid='+str(cid)+'&pid=(.*?)&sid=(.*?)">Edit</a>',re.S)  # the question mark must be escaped in a regex
    match_ct2 = re.findall(pattern2,content)
    for i in range(len(match_ct2)):
        runID.append(match_ct2[i][1])
        pID.append(match_ct2[i][0])
        # pID.append(match_ct2[i].replace('problem','submitpage')) # string replacement

    # Capture the query string of the "Next Page" link, including its leading "?"
    pattern3 = re.compile(r'Previous Page</a>]&nbsp;&nbsp;\[<a href=status\.php(\?.*?)>Next Page',re.S)
    match_ct3 = re.findall(pattern3,content)
    next_page_URL=statusURL+match_ct3[0]

    pre_page_URL=page_URL
    page_URL=next_page_URL  # move to the next page
    if page_URL==pre_page_URL:  # last page reached: leave the while loop
        break

# Pairing table
pair_table=pd.DataFrame()
pair_table['runID']=pd.Series(runID)
pair_table['stuID']=pd.Series(stuID)
pair_table['pID']=pd.Series(pID)
pair_table['link']=orginURL+'submitpage.php?cid='+str(cid)+'&pid='+pair_table['pID']+"&sid="+pair_table['runID']

pair_table = pd.merge(pair_table, stulist, on='stuID',how='left') # left join
pair_table=pair_table.sort_values(by = ['name'],axis = 0,ascending = True) # sort by name, ascending
pair_table = pair_table.reset_index(drop=True) # reset the index

# Fetch each student's code and write it to disk, one folder per student
for i in range(len(pair_table)):
    folder =  'd:\\OJ\\'+str(pair_table.loc[i,'name']) # target folder (.loc replaces the removed .ix)
    fileName=str(pair_table.loc[i,'pID'])+'.txt'  # file name
    if not os.path.exists(folder): # does the folder exist?
        os.makedirs(folder) # if not, create it

    submitpageURL=pair_table.loc[i,'link']
    response = opener.open(submitpageURL)
    submitpage_content=response.read().decode()

    pattern = re.compile(r'<textarea style="width:80%" cols=180 rows=20 id="source" name="source">(.*?)</textarea>',re.S)
    match_ct = re.findall(pattern,submitpage_content)
    grab_content =match_ct[0]
    # try-except lets the loop continue after an error
    try:
        # writing as UTF-8 avoids encoding errors caused by the platform default
        with open(folder+"\\"+fileName,"w",encoding="utf-8") as f:
            f.write(grab_content)
    except Exception:
        print(str(pair_table.loc[i,'name'])+"\terror\t"+submitpageURL)
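
The extract-and-save step can also be written as two small functions and tried on a sample page. This is a sketch under assumptions: the names extract_source and save_source are hypothetical, the looser textarea pattern is a swapped-in variant of the one above, and whether this OJ HTML-escapes the code inside the textarea is a guess (if it does not, html.unescape leaves plain text unchanged unless it contains entity-like sequences).

```python
import html
import os
import re
import tempfile

# Match the textarea by its id only, so attribute changes do not break it.
SOURCE_RE = re.compile(r'<textarea[^>]*id="source"[^>]*>(.*?)</textarea>', re.S)


def extract_source(page):
    """Return the unescaped textarea content, or None when nothing matches."""
    match = SOURCE_RE.search(page)
    if match is None:
        return None
    return html.unescape(match.group(1))


def save_source(folder, file_name, code):
    """Create the folder if needed and write the code as UTF-8."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, file_name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(code)
    return path


# Made-up sample page; a temporary directory stands in for d:\OJ.
sample = ('<textarea style="width:80%" cols=180 rows=20 id="source" '
          'name="source">if (a &lt; b) puts(&quot;ok&quot;);</textarea>')
code = extract_source(sample)
path = save_source(os.path.join(tempfile.mkdtemp(), "demo_student"), "1000.txt", code)
print(code)  # if (a < b) puts("ok");
```

Returning None for a page without a match lets the caller log the URL and move on, instead of failing on match_ct[0] before the try block is reached.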
