Academic Document Element Classification Challenge: Using machine learning and deep learning to automatically organize academic documents (prize pool: ¥30,000)

CompHub[1] aggregates competitions in real time from data-science platforms (Kaggle, Tianchi, ...) and online judges (LeetCode, Niuke, ...). This account pushes the latest competition news; you are welcome to follow it!


The following information was created with AI assistance and is for reference only.

Competition name

Academic Document Element Classification Challenge [2] (see the end of the article for the original link)

Part1 1. Competition background

As the digital age advances, people increasingly rely on electronic documents to record, transmit, and share information. In academic scenarios, a document's elements include the title, authors, email addresses, references, body text, figures, tables, and so on, all of which are indispensable parts of the document. In practice, however, these elements must be classified in order to manage and use documents effectively. For example, an academic publisher needs to identify and classify the title, authors, abstract, body text, and references in each article.

To address this problem, this competition, the "Academic Document Element Classification Challenge", aims to restore the structure of academic documents and organize them automatically by applying advanced techniques such as machine learning and deep learning to classify elements, given academic document images, element positions, and text content. The competition covers 14 classification categories, including title, author, email, section heading, body text, figure, and table. This technology can not only improve the efficiency of classifying academic documents but also reduce the burden of manual classification, giving researchers and academic publishers a more efficient way to manage and use academic documents.

The competition also aims to provide a platform for communication and discussion on academic document element classification and to promote research and application in related fields, in the hope of further improving the accuracy and efficiency of classification.

Part2 2. Competition tasks

Although there has been much research on academic document classification, element classification remains a challenging problem. In practice, academic documents contain a wide variety of elements with differing positions and sizes, along with a certain amount of ambiguity and noise, which places high demands on the accuracy and robustness of an algorithm. This competition therefore provides a challenging dataset to test the accuracy and robustness of contestants' algorithms. The dataset contains document images, element positions, and text content drawn from multiple disciplines and in multiple formats. Contestants need to build a classification system that assigns a category to each element.
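As a toy illustration of what such a classification system must do, here is a trivial rule-based baseline. The heuristics and thresholds are entirely made up for illustration and would fall far short of competitive accuracy:

```python
def classify_element(box, text):
    """Toy heuristic classifier mapping one document element to a category label.

    box  -- [x1, y1, x2, y2] bounding rectangle (upper-left, lower-right corners)
    text -- parsed text inside the rectangle (may be empty for image regions)

    The rules below are hypothetical examples, not the competition baseline.
    """
    x1, y1, x2, y2 = box
    if "@" in text:
        return "mail"        # email addresses almost always contain '@'
    if y1 < 60:
        return "header"      # elements near the very top of the page
    if text.strip() == "":
        return "figure"      # regions with no parsed text are often images
    return "paraline"        # default: an ordinary body-text line
```

A real system would instead combine visual features (the image crop), geometric features (position and size of the box), and textual features, but the input/output contract is the same: one category label per element.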

Part3 3. Review Rules

1. Data description

This competition provides four types of data: document images, element positions, text content, and classification categories. A document image is a picture converted from the PDF file of the original paper. An element's position is given by the upper-left and lower-right coordinates of its bounding rectangle. The text content is the result of parsing the text inside the rectangle. The classification categories comprise 14 classes, including title, author, email, section heading, body text, figure, and table. The training data contains 7,043 document images from 500 documents, and contestants may freely split them into training and validation sets. The competition has a single stage; the test set omits the classification categories but is otherwise identical in form to the training data.

| Data category | Variable name | Format | Description |
|---|---|---|---|
| Document image | none | png | Image converted from the paper's PDF file via PDF2IMG |
| Element position | box | list of int | [x1, y1, x2, y2], where (x1, y1) is the upper-left corner of the rectangle and (x2, y2) is the lower-right corner |
| Text content | text | string | Text parsing result within the rectangle |
| Classification category | class | string | One of 14 categories, including title, author, email, section heading, body text, figure, and table |
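As an illustration, a single annotated element could be represented as follows. The field names mirror the variable names in the table above; the exact annotation file format is not specified in this article, so this is only a hypothetical sketch:

```python
# Hypothetical in-memory representation of one annotated element, using the
# variable names from the data description (box, text, class). The actual
# annotation file format used by the competition may differ.
sample_element = {
    "box": [120, 80, 980, 150],  # [x1, y1, x2, y2]: upper-left and lower-right corners
    "text": "Deep Learning for Document Layout Analysis",
    "class": "title",            # one of the 14 categories (absent in the test set)
}

def box_area(box):
    """Area of the bounding rectangle -- a simple geometric feature a classifier might use."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)
```

Geometric features like the box area, aspect ratio, and page position are cheap to compute from `box` and often complement the textual content.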

The classification categories are described as follows:

| Category name | Description |
|---|---|
| title | Main title of the article; generally appears only on the first page |
| author | Author names of the article |
| mail | Email addresses of the article's authors |
| affiliation | Affiliations of the article's authors |
| section | Section heading |
| fstline | First line of a paragraph |
| paraline | Other lines of a paragraph |
| table | Table area |
| figure | Image area |
| caption | Descriptive text for an image or table |
| equality | Standalone formula area |
| footer | Footer, such as a page number or journal name, located at the bottom of the page |
| header | Header, such as a page number or paper title, located at the top of the page |
| footnote | Notes on the article content, such as links or author information, located at the lower left or lower right of the text area |

2. Evaluation metric

The submitted result files are evaluated with the Macro-F1-score, i.e., the unweighted average of the F1-scores of all categories.

(1) For each category of element (say, category X), count:

- $TP_X$: elements labeled X that the model correctly predicts as X
- $FP_X$: elements of other categories that the model predicts as X
- $FN_X$: elements labeled X that the model predicts as other categories

(2) From the counts in step (1), compute the precision and recall for this category:

$$\mathrm{Precision}_X = \frac{TP_X}{TP_X + FP_X}, \qquad \mathrm{Recall}_X = \frac{TP_X}{TP_X + FN_X}$$

(3) From the results of step (2), compute the F1-score for this category:

$$F1_X = \frac{2 \cdot \mathrm{Precision}_X \cdot \mathrm{Recall}_X}{\mathrm{Precision}_X + \mathrm{Recall}_X}$$

(4) From the results of step (3), average the F1-scores of all categories as the final evaluation metric, where N = 14 is the number of categories:

$$\mathrm{Macro\text{-}F1} = \frac{1}{N} \sum_{i=1}^{N} F1_i$$
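The four steps above can be sketched in Python. This is a minimal illustration of the Macro-F1 computation, not the official evaluation script:

```python
def macro_f1(y_true, y_pred, categories):
    """Macro-F1: the unweighted mean of per-category F1-scores.

    y_true, y_pred -- parallel lists of category labels
    categories     -- the full list of categories to average over (N = 14 here)
    """
    f1_scores = []
    for cat in categories:
        # Step (1): per-category counts
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cat and p == cat)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cat and p == cat)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cat and p != cat)
        # Step (2): precision and recall (0.0 when undefined)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Step (3): per-category F1
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    # Step (4): unweighted average over all categories
    return sum(f1_scores) / len(f1_scores)
```

Note that every category in the list contributes equally to the average, so rare categories (e.g. `footnote`) matter as much as frequent ones (e.g. `paraline`); an equivalent result is given by `sklearn.metrics.f1_score(..., average="macro")`.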

3. Evaluation and ranking

1. Downloadable data is provided for this competition. Contestants develop and debug their algorithms locally and submit results on the competition page.

2. The ranking is sorted by score from high to low. The leaderboard uses each team's best historical score.

Part4 4. Work submission requirements

1. File format: Submit in zip format

2. File size: no requirement

3. Limitation on the number of submissions: each team can submit up to 3 times per day

4. Detailed file description: use UTF-8 encoding; following the submission example, complete all the text categories in the test-anno folder, compress it into a submit.zip file, and upload it

5. The top three contestants must submit their models, source code, and documentation

Part5 5. Schedule Rules

This competition uses a single-round format.

Competition period

May 6, 2023 - July 26, 2023

1. The training set, development set, and test set will be released at 10:00 on May 6 (i.e., when the leaderboard opens)

2. The deadline for submitting entries is 17:00 on July 26; rankings will be announced at 10:00 on July 27

On-site defense

1. The top three teams will be invited to attend the iFlytek Global 1024 Developer Festival and defend their work on site

2. The defense consists of a 10-minute presentation plus a 5-minute Q&A

3. Final rankings are based on a combined score (the entry score counts for 70% and the on-site defense score for 30%)

Part6 6. Award setting

  • Finalists

    • iFlytek 1024 Developer Festival general admission ticket

    • Finalist certificate

    • Fast-track entry to the iFlytek incubation base

    • Privileged entry to the AI service market

  • Final winners

    • Final prize money: the top 3 teams on the track receive tiered prizes: first place ¥15,000, second place ¥10,000, third place ¥5,000

    • Attendance at the 1024 Global Developer Festival award ceremony, where prize money, certificates, and customized trophies are awarded on site

    • Full-chain AI entrepreneurship support

    • Fast-track employment channel & internship/employment offers

References

[1] CompHub homepage: https://comphub.notion.site/CompHub-c353e310c8f84846ace87a13221637e8

[2] Academic Document Element Classification Challenge: https://challenge.xfyun.cn/topic/info?type=academic-documents


Origin blog.csdn.net/CompHub/article/details/130798491