[Python] OJ Submission-Record Crawler in Practice (1): Counting Codeforces Submissions


I don't know whether anyone else feels this way, but after more than a year as an ACMer I sometimes really want to know exactly how many problems I have solved. Of course you can look up the count on each individual OJ, but clicking through sites one by one is hardly the programmer's way, so why do such mechanical work by hand? There is already an open-source project that aggregates solve counts across the major OJs, but I still wanted to build something myself. I recently learned Python web scraping, so this is a good chance to make good on a flag I planted a while ago: crawl the submission records of Codeforces, HDU, POJ, Luogu and other mainstream OJs, count the accepted submissions, and eventually put a Qt interface on top. Without further ado, let's start with Codeforces.

1. Requirements analysis

  1. Find the URL of the submission-list page.
  2. Work out the request method.
  3. Analyze the page source with the browser's developer tools.
  4. Extract the fields we need and save them to a text file or database.
  5. After parsing the current page, turn to the next page and repeat the steps above (a rough sketch of this flow follows).
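As a rough sketch of that flow in Python (every name here is a placeholder; the real versions are built step by step below):

username = 'tourist'
doc = fetch(page_url(username, 1))        # steps 1-2: request the first page
num_pages = get_pages_num(doc)            # find how many pages exist
for page in range(1, num_pages + 1):      # step 5: visit every page
    doc = fetch(page_url(username, page))
    save(parse_submissions(doc))          # steps 3-4: parse and persist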

2. Implementation

Let's use the submission page of the legendary tourist as our example.

  • Analyzing the URL

When you first open the submission page, the URL looks like this:

https://codeforces.com/submissions/tourist

Then turn the page and watch what happens. The submission page URLs follow a clear pattern, shown below, ending with the current page number. The page is served by an ordinary synchronous request, which makes it easy to handle: just fetch the HTML text and parse it.

https://codeforces.com/submissions/username/page/2

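Since the pattern is so regular, a one-line helper is enough to build these URLs (page_url is just an illustrative name, not part of the final script):

def page_url(username, page):
    # Build the submissions URL for a given user and page number.
    return 'https://codeforces.com/submissions/{}/page/{}'.format(username, page)

print(page_url('tourist', 2))  # https://codeforces.com/submissions/tourist/page/2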

  • Sending the request

We send the request with the requests library. You could get away without setting headers, since Codeforces does not block simple crawlers here, but sending a browser User-Agent is good practice anyway.

import requests

url = 'https://codeforces.com/submissions/tourist/page/1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
}

# Fetch page 1 of tourist's submission list.
response = requests.get(url=url, headers=headers)
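If you want to fail fast instead of parsing an error page, requests can raise on bad status codes; a minimal optional check:

response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx
print(response.status_code)   # 200 on success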
  • Inspecting the page, then parsing it

In Chrome, right-click one of the submission rows and choose Inspect; other browsers are similar.

A little observation shows that the submission records live in a table, with each record in its own tr tag. So the first step is to extract those tr tags, and then parse out the information we want (problem name, verdict, and so on).


Inspection confirms that the data-submission-id attribute appears only on these tr tags, so we can use it to locate them. We parse with pyquery, whose syntax closely mirrors jQuery and supports CSS selectors.

from pyquery import PyQuery as pq

doc = pq(response.text)
# .items() yields the matched elements as an iterator for easy traversal.
items = doc.find('[data-submission-id]').items()

With each tr tag in hand, the next step is to go after the td tags inside it. I extract only the problem name and the verdict; the other fields work the same way.

Now we parse each td. Within a row, the selector **.status-small>a** locates the a tag, and its text is the problem title. (That selector is not unique across the whole page, so if you search the full document, locate the link via the **data-problemid** attribute instead.) The verdict cell is fetched with **:nth-child(6)**. The extracted values are packed into a dict (any other structure works too, depending on whether you save to a database or a text file).

def solve_tr(tr):
    problemName = tr.find('.status-small>a').text()
    # Alternative that is unique across the whole page:
    # problemName = tr.find('[data-problemid]>a').text()
    state = tr.find(':nth-child(6)').text()
    return {'problemName': problemName, 'state': state}

items = doc.find('[data-submission-id]').items()
for item in items:
    it = solve_tr(item)
  • Getting the number of pages

At this point we can parse one page of submissions, but how do we page through all of them? Paging is actually simple: just swap out the number at the end of the URL.

https://codeforces.com/submissions/username/page/2
https://codeforces.com/submissions/username/page/3

The real question is how many times to turn the page. There are two possible approaches. One is an endless loop that keeps going until an error occurs, but that does not work on Codeforces: if you request a page number past the end, the site just serves the last page again. So we have to count the pages honestly. (A sketch of a loop-until-repeat workaround follows anyway, for completeness.)
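Here is that minimal loop-until-repeat sketch; it assumes crawl_one_page from the full script below, and it stops when the first data-submission-id on a page repeats (meaning Codeforces served the last page again):

def crawl_all(username):
    seen = None
    page = 1
    while True:
        url = 'https://codeforces.com/submissions/{}/page/{}'.format(username, page)
        doc = pq(requests.get(url).text)
        # First submission id on this page; None if the page has no submissions.
        first = doc.find('[data-submission-id]').eq(0).attr('data-submission-id')
        if first is None or first == seen:
            break  # empty page, or a repeat of the previous page: we are done
        seen = first
        crawl_one_page(doc)
        page += 1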

The page links sit in an unordered list, and the second-to-last li tag holds the total page count, so we only need to read its text.

We fetch it with the nth-child selector.

def get_pages_num(doc):
    """
    Get the number of pages to crawl.
    :param doc: PyQuery document for page 1
    :return: int page count, or None on failure
    """
    try:
        # Second-to-last child of the pagination list holds the last page number.
        length = doc.find('#pageContent>.pagination>ul>*').length
        last_li = doc.find('#pageContent>.pagination>ul>li:nth-child(' + str(length - 1) + ')')
        return max(1, int(last_li.text()))
    except Exception:
        return None
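A quick sanity check (assuming doc was built from page 1 of some user's submissions, as in the request example above):

num = get_pages_num(doc)
print(num)  # e.g. 42; the exact value depends on the user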

That covers the core requirements for crawling a Codeforces submission record; tidy up the code above and we are done.

3. Complete code

There is nothing nicer than code you can run directly, so here is the complete script.

from pyquery import PyQuery as pq
import requests
import time


def solve_tr(tr):
    """
    Parse the fields we need from one submission row.
    :param tr: the tr element
    :return: dict
    """
    problemName = tr.find('.status-small>a').text()
    state = tr.find(':nth-child(6)').text()
    it = {'problemName': problemName, 'state': state}
    return it


def get_pages_num(doc):
    """
    Get the number of pages to crawl.
    :param doc: PyQuery document for page 1
    :return: int page count, or None on failure
    """
    try:
        length = doc.find('#pageContent>.pagination>ul>*').length
        last_li = doc.find('#pageContent>.pagination>ul>li:nth-child(' + str(length - 1) + ')')
        return max(1, int(last_li.text()))
    except Exception:
        return None


def crawl_one_page(doc):
    """
    Crawl the submissions on one page.
    :param doc: PyQuery document
    """
    items = doc.find('[data-submission-id]').items()
    for item in items:
        it = solve_tr(item)
        with open('data.txt', 'a+', encoding='utf-8') as f:
            f.write(str(it) + '\n')
        print(it)


def get_username():
    """
    Ask for the username to crawl.
    :return: str
    """
    username = input('Please enter a username: ')
    return username


def main():
    base = 'https://codeforces.com/submissions/'
    username = get_username()
    url = base + username + '/page/1'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
    }

    response = requests.get(url=url, headers=headers)
    doc = pq(response.text)
    num = get_pages_num(doc)

    if num is not None:
        for i in range(1, num + 1):
            url = base + username + '/page/' + str(i)
            print(url)
            response = requests.get(url=url, headers=headers)
            doc = pq(response.text)
            crawl_one_page(doc)
            time.sleep(2)  # be polite: pause between requests
    else:
        print('The username does not exist or has no submissions.')


if __name__ == '__main__':
    main()

A sample run looks like this:

[screenshot: console output of the crawl]
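Since the stated goal is to count accepted submissions, here is a minimal follow-up sketch that tallies verdicts from the data.txt file the script writes. It assumes one dict literal per line, exactly as crawl_one_page produces, and that accepted verdicts start with the text 'Accepted':

import ast

def count_accepted(path='data.txt'):
    total = accepted = 0
    with open(path, encoding='utf-8') as f:
        for line in f:
            it = ast.literal_eval(line.strip())  # each line is a dict literal
            total += 1
            if it['state'].startswith('Accepted'):
                accepted += 1
    return accepted, total

print(count_accepted())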
