Programming Practice (1) - Crawling Niuke ACM Programming Question Information

Brief description

For a recent project, I needed to crawl the information of the programming questions on Nowcoder (Niuke) ACM and then analyze it. The data to be crawled includes: question number (starting with NC), title, difficulty, knowledge-point tags, question description, limit information, input and output samples, and so on.

  • Running environment: Anaconda, Python 3.8;

  • Crawler libraries: BeautifulSoup + requests; refer to the relevant blogs for installation tutorials;

  • Base URL to crawl: https://ac.nowcoder.com/acm/problem/list (not the Niuke homepage: the homepage has relatively few programming questions, so to maximize the number of questions I chose to crawl the ACM question bank directly)

Web page analysis

First, log in to the base site (if you don't have an account, register one, because viewing the detailed information of a question requires being logged in), and you will see the question list.

Scrolling down reveals the pagination. There are more than 20,000 questions in total; we don't need nearly that many, so I only crawl about 30 pages (the code below takes pages 1 through 35).

Click through different pages and the URL changes to this format: https://ac.nowcoder.com/acm/problem/list?keyword=&tagId=&platformTagId=0&sourceTagId=0&difficulty=0&status=all&order=id&asc=true&pageSize=50&page= + page number, so we can traverse different pages by changing this page parameter.

The question list contains each question's number, title, knowledge-point tags, and difficulty. To get the input/output samples, the limits, and the question description, you have to click into the question. Let's take the first question, "Niu Niu's number sequence", as an example to explain the idea.

You must be logged in to an account here, otherwise the page content will not be served to the crawler when the program runs!

The page you click into looks like this, with all the information we want in it (except the knowledge-point tags). Looking at the URL, the format turns out to be: https://ac.nowcoder.com/acm/problem/ + question number.
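To make these two fixed formats concrete, here is a minimal sketch of how the URLs are assembled (the page number and question id below are made-up examples):

# a minimal sketch of the two URL patterns (values are illustrative only)
LIST_URL = "https://ac.nowcoder.com/acm/problem/list?keyword=&tagId=&platformTagId=0&sourceTagId=0&difficulty=0&status=all&order=id&asc=true&pageSize=50&page="
DETAIL_URL = "https://ac.nowcoder.com/acm/problem/"

list_url = LIST_URL + str(3)       # the 3rd page of the question list
detail_url = DETAIL_URL + "14500"  # a hypothetical question id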

Therefore, we can use the fixed formats of these two URLs to traverse and crawl in bulk. The specific idea is:

  • Traverse each list page, crawl the question numbers, difficulty, and tags of all questions on it (because the detail page carries neither the tags nor the difficulty), and save the question numbers into a list numlist;
  • Traverse the numlist crawled from the list page, splice each question number onto https://ac.nowcoder.com/acm/problem/, enter the corresponding detail page, and crawl the remaining detailed information;
  • Organize and output all the information;
for i in page_range:
    # crawl the numlist of all question numbers on this page
    for j in numlist:
        # enter the detail page for each question number and crawl its info
        # store the results

Next is the code part.

Code explanation

For the element and attribute names used in the selectors in this part, refer to the Elements panel of the browser's developer tools (F12) on the corresponding page;

Request headers and storage structure

    url = "https://ac.nowcoder.com/acm/problem/list?		keyword=&tagId=&platformTagId=0&sourceTagId=0&difficulty=0&status=all&order=id&asc=true&pageSize=50&page=" + str(page)
    
    headers = {
    
    
        "User-Agent": '你的user agent',
        "Cookie": '你的cookie'
    }

Here are the simple steps to obtain these two parameters:

  • Right-click on the logged-in page -> Inspect, switch to the "Network" tab, press Ctrl+R to reload, and select the first entry in the request list (its name is usually the question number);
  • Find the cookie and user-agent fields in the right-hand panel and copy them;


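Before crawling in bulk, it's worth checking that the Cookie actually works. A quick sketch (it assumes that an invalid cookie gets redirected away from the problem list, which is a guess about the site's behavior, so treat it as a heuristic):

import requests

headers = {
    "User-Agent": 'your User-Agent',
    "Cookie": 'your Cookie'
}
# if the cookie is invalid, the final URL will probably no longer point at
# the problem list, so print the status code and the URL we ended up on
resp = requests.get("https://ac.nowcoder.com/acm/problem/list", headers=headers)
print(resp.status_code, resp.url)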
In terms of storage structure, I used a simple and crude approach: one list per attribute, appended to one by one during traversal (other methods work too; see the sketch after the lists below).

QNum = []
QTitle = []
QDifficulty = []
QContent = []
QTag = []
QTimeLimit = []
QSpaceLimit = []
QInput = []
QOutput = []
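
For comparison, here is a sketch of what a one-dict-per-question structure would look like (the field values are placeholders, not real crawled data):

# alternative storage: one record per question instead of parallel lists
questions = []
questions.append({
    "questionNum": "NC0",         # placeholder
    "questionTitle": "示例标题",   # placeholder
    "difficulty": "简单"           # placeholder
})

A list of dicts saves you from keeping nine lists aligned by index; the code below sticks with parallel lists to match the original approach.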

Get a list of question numbers

    response = requests.get(url, headers=headers)  # send the request
    BasicSoup = BeautifulSoup(response.content, 'lxml')  # parse the whole list page with bs4

    # find the table element with class "no-border", then take every tr
    # element except the header row
    tablelist = BasicSoup.find_all(name="table", attrs={"class": "no-border"})[0].find_all(name="tr")[1:]
    numlist = []  # list that stores the question numbers

    for i in tablelist:
        item = i.attrs['data-problemid']  # the data-problemid attribute of each tr
        numlist.append(item)  # collect it
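
If the Cookie is missing or expired, find_all comes back empty and the [0] index raises a bare IndexError. A small guard makes the failure easier to diagnose (the failure mode is my assumption, not anything the site documents):

tables = BasicSoup.find_all(name="table", attrs={"class": "no-border"})
if not tables:  # assumption: probably not logged in, or the layout changed
    raise RuntimeError("question table not found - check your Cookie")
tablelist = tables[0].find_all(name="tr")[1:]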

Question number, limit information, and question title

    urls = 'https://ac.nowcoder.com/acm/problem/' + i  # URL of each detail page

    responses = requests.get(urls, headers=headers)
    content = responses.content
    soup = BeautifulSoup(content, 'lxml')  # parsed HTML document of the detail page

    QTitle.append(soup.find_all(name="div", attrs={"class": "question-title"})[0].text.strip("\n"))  # get the title

    mainContent = soup.find_all(name="div", attrs={"class": "terminal-topic"})[0]  # find the main body

    # find the spans holding the question number and the limit info
    div = mainContent.find_all(name="div", attrs={"class": "subject-item-wrap"})[0].find_all("span")

    num = div[0].text.strip("题号:")  # (actually we already crawled this once before...)
    QNum.append(num)
    QTimeLimit.append(div[1].text.strip("时间限制:"))
    QSpaceLimit.append(div[2].text.strip("空间限制:"))  # get the limits
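
One caveat: str.strip("题号:") removes any of the characters 题, 号, : from both ends of the string, not the literal prefix, so a value that happens to begin or end with one of those characters would be clipped too. A safer sketch (Python 3.8, the stated environment, predates str.removeprefix, so we define a small helper):

# cut an exact prefix instead of stripping a character set from both ends
def cutPrefix(text, prefix):
    text = text.strip()
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

num = cutPrefix(div[0].text, "题号:")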

Question description

    div1 = mainContent.find_all(name="div", attrs={"class": "subject-question"})[0]  # problem statement
    div2 = mainContent.find_all(name="pre")[0]  # input description
    div3 = mainContent.find_all(name="pre")[1]  # output description

    # all three belong to the question description, so pack them into one dict
    descriptDict = {"题目描述:": divTextProcess(div1), "输入描述:": divTextProcess(div2), "输出描述:": divTextProcess(div3)}
    QContent.append(descriptDict)  # store the dict in the description list

Handling strange strings

def divTextProcess(div):
    strBuffer = div.get_text()  # extract the text
    strBuffer = strBuffer.replace("{", " $").replace("}", "$ ")  # turn formula markers into $...$
    strBuffer = strBuffer.replace("  ", "")  # remove double spaces
    strBuffer = strBuffer.replace("\n\n\n", "\n")  # collapse repeated newlines
    strBuffer = strBuffer.replace("\xa0", "")  # remove spaces encoded as \xa0
    strBuffer = strBuffer.strip()  # strip leading/trailing whitespace
    return strBuffer
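
A quick illustration of what the function does, using a made-up HTML fragment:

from bs4 import BeautifulSoup

html = "<div>给定 {n} 个正整数\n\n\n求它们的和\xa0</div>"  # made-up fragment
div = BeautifulSoup(html, 'lxml').div
print(divTextProcess(div))
# prints:
# 给定$n$个正整数
# 求它们的和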

Input and output samples

    div4 = mainContent.find_all(name="div", attrs={"class": "question-oi-cont"})[0]  # sample input
    div5 = mainContent.find_all(name="div", attrs={"class": "question-oi-cont"})[1]  # sample output
    QInput.append(divTextProcess(div4))
    QOutput.append(divTextProcess(div5))

Tags and difficulty

Only after I started crawling did I discover that the tags are not on the detail page, so the crawler has to go back to the corresponding list page to fetch them. The logic here is a bit convoluted...

    response = requests.get(url, headers=headers)
    BasicSoup = BeautifulSoup(response.content, 'lxml')  # go back to the list page

    # the 4th td of the tr whose data-problemid matches holds the difficulty
    diff = BasicSoup.find_all(name="tr", attrs={"data-problemid": i})[0].find_all(name="td")[3].text.strip("\n")
    QDifficulty.append(diff)  # get the difficulty

    problem = BasicSoup.find_all(name="tr", attrs={"data-problemid": i})[0].find_all(name="a", attrs={"class": "tag-label js-tag"})
    # a question may have more than one tag, so build one comma-separated
    # string; count marks whether we are at the first tag (no leading comma)
    tag = ""
    count = 0
    for a in problem:
        if count == 0:
            tag = tag + a.text
        else:
            tag = tag + "," + a.text
        count = count + 1
    QTag.append(tag)  # get the tag info
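
The comma bookkeeping can also be written in one line with str.join; this is equivalent:

    # equivalent: join all the tag texts with commas in one step
    tag = ",".join(a.text for a in problem)
    QTag.append(tag)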

Converting to JSON output

result = {}  # Python's json library needs dict-shaped data

for i in range(len(QNum)):  # store every question
    message = {
        "questionNum": QNum[i],
        "questionTitle": QTitle[i],
        "difficulty": QDifficulty[i],
        "content": QContent[i],
        "PositiveTags": QTag[i],
        "TimeLimit": QTimeLimit[i],
        "SpaceLimit": QSpaceLimit[i],
        "Input": QInput[i],
        "Output": QOutput[i]
    }
    result[str(i + 1)] = message

with open("NiuKeACM.json", "w", encoding="UTF-8") as f:
    json.dump(result, f, ensure_ascii=False)  # write the output file
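
To verify the output, you can load the file back (assuming the crawl produced at least one question):

# quick check: read the file back and print the first record's title
with open("NiuKeACM.json", "r", encoding="UTF-8") as f:
    data = json.load(f)
print(data["1"]["questionTitle"])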

Source code

import json
import requests
from bs4 import BeautifulSoup


def divTextProcess(div):
    strBuffer = div.get_text()  # extract the text
    strBuffer = strBuffer.replace("{", " $").replace("}", "$ ")  # turn formula markers into $...$
    strBuffer = strBuffer.replace("  ", "")  # remove double spaces
    strBuffer = strBuffer.replace("\n\n\n", "\n")  # collapse repeated newlines
    strBuffer = strBuffer.replace("\xa0", "")  # remove spaces encoded as \xa0
    strBuffer = strBuffer.strip()  # strip leading/trailing whitespace
    return strBuffer


QNum = []
QTitle = []
QDifficulty = []
QContent = []
QTag = []
QTimeLimit = []
QSpaceLimit = []
QInput = []
QOutput = []

# the same headers are used for every request, so define them once
headers = {
    "User-Agent": 'your User-Agent',
    "Cookie": 'your Cookie'
}

for page in range(1, 36):
    print("page " + str(page) + " begin----------------------------")
    url = "https://ac.nowcoder.com/acm/problem/list?keyword=&tagId=&platformTagId=0&sourceTagId=0&difficulty=0&status=all&order=id&asc=true&pageSize=50&page=" + str(page)
    response = requests.get(url, headers=headers)
    BasicSoup = BeautifulSoup(response.content, 'lxml')

    # every row of the question table except the header
    tablelist = BasicSoup.find_all(name="table", attrs={"class": "no-border"})[0].find_all(name="tr")[1:]
    numlist = []
    for i in tablelist:
        item = i.attrs['data-problemid']
        numlist.append(item)

    for i in numlist:
        # detail page: title, number, limits, description, samples
        urls = 'https://ac.nowcoder.com/acm/problem/' + i
        responses = requests.get(urls, headers=headers)
        content = responses.content
        soup = BeautifulSoup(content, 'lxml')

        QTitle.append(soup.find_all(name="div", attrs={"class": "question-title"})[0].text.strip("\n"))

        mainContent = soup.find_all(name="div", attrs={"class": "terminal-topic"})[0]

        div = mainContent.find_all(name="div", attrs={"class": "subject-item-wrap"})[0].find_all("span")
        num = div[0].text.strip("题号:")
        QNum.append(num)
        QTimeLimit.append(div[1].text.strip("时间限制:"))
        QSpaceLimit.append(div[2].text.strip("空间限制:"))

        div1 = mainContent.find_all(name="div", attrs={"class": "subject-question"})[0]
        div2 = mainContent.find_all(name="pre")[0]
        div3 = mainContent.find_all(name="pre")[1]
        descriptDict = {"题目描述:": divTextProcess(div1), "输入描述:": divTextProcess(div2), "输出描述:": divTextProcess(div3)}
        QContent.append(descriptDict)

        div4 = mainContent.find_all(name="div", attrs={"class": "question-oi-cont"})[0]
        div5 = mainContent.find_all(name="div", attrs={"class": "question-oi-cont"})[1]
        QInput.append(divTextProcess(div4))
        QOutput.append(divTextProcess(div5))

        # tags and difficulty only live on the list page, so fetch it again
        response = requests.get(url, headers=headers)
        BasicSoup = BeautifulSoup(response.content, 'lxml')

        diff = BasicSoup.find_all(name="tr", attrs={"data-problemid": i})[0].find_all(name="td")[3].text.strip("\n")
        QDifficulty.append(diff)

        problem = BasicSoup.find_all(name="tr", attrs={"data-problemid": i})[0].find_all(name="a", attrs={"class": "tag-label js-tag"})
        tag = ""
        count = 0
        for a in problem:  # don't reuse i here: it still holds the question id
            if count == 0:
                tag = tag + a.text
            else:
                tag = tag + "," + a.text
            count = count + 1
        QTag.append(tag)

        print("-----------------" + str(num) + "  finished-----------------")
    print("page " + str(page) + " finished----------------------------\n")

# print(QNum)
# print(QTitle)
# print(QTag)
# print(QContent)
# print(QDifficulty)
# print(QInput)
# print(QOutput)
# print(QSpaceLimit)
# print(QTimeLimit)

result = {}

for i in range(len(QNum)):
    message = {
        "questionNum": QNum[i],
        "questionTitle": QTitle[i],
        "difficulty": QDifficulty[i],
        "content": QContent[i],
        "PositiveTags": QTag[i],
        "TimeLimit": QTimeLimit[i],
        "SpaceLimit": QSpaceLimit[i],
        "Input": QInput[i],
        "Output": QOutput[i]
    }
    result[str(i + 1)] = message

# for item in result.items():
#     print(item)

with open("NiuKeACM.json", "w", encoding="UTF-8") as f:
    json.dump(result, f, ensure_ascii=False)
