Tianchi NLP Competition: News Text Classification (1) - Understanding the Competition Task


Series of articles
Tianchi NLP Competition: News Text Classification (1) - Understanding the Competition Task


1. Understanding the Competition Task

1.1 Learning objectives

In the first stage of NLP learning (see the previous series: https://blog.csdn.net/bosszhao20190517/article/details/106911793), I mastered the basic methods and principles of NLP. This time, following Datawhale, I am taking part in the Tianchi competition "Zero-Based Introduction to NLP: News Text Classification" (registration link: https://tianchi.aliyun.com/competition/entrance/531810/introduction?spm=5176.12281949.1003.1.493e24487BQPMy), stepping into the world of natural language processing and learning and improving through the competition.

1.2 Competition data

The competition uses news data as its dataset; the data is visible and downloadable after registration. The data consists of news text anonymized at the character level, organized into 14 candidate categories: finance, lottery, real estate, stocks, home furnishing, education, technology, society, fashion, current affairs, sports, constellation, games, and entertainment.
The data consists of the following parts: 200,000 samples in the training set, 50,000 samples in test set A, and 50,000 samples in test set B. To prevent contestants from manually labeling the test sets, the organizers anonymized the text at the character level.

1.3 Data labels

The processed training data looks as follows (a label followed by the anonymized text):

label text
6 57 44 66 56 2 3 3 37 5 41 9 57 44 47 45 33 13 63 58 31 17 47 0 1 1 69 26 60 62 15 21 12 49 18 38 20 50 23 57 44 45 33 25 28 47 22 52 35 30 14 24 69 54 7 48 19 11 51 16 43 26 34 53 27 64 8 4 42 36 46 65 69 29 39 15 37 57 44 45 33 69 54 7 25 40 35 30 66 56 47 55 69 61 10 60 42 36 46 65 37 5 41 32 67 6 59 47 0 1 1 68

The correspondence between labels and categories is as follows: {'Technology': 0, 'Stocks': 1, 'Sports': 2, 'Entertainment': 3, 'Current Affairs': 4, 'Society': 5, 'Education': 6, 'Finance': 7, 'Home Furnishing': 8, 'Game': 9, 'Real Estate': 10, 'Fashion': 11, 'Lottery': 12, 'Constellation': 13}
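
For convenience during analysis, the same mapping can be kept as a Python dict and inverted, so that numeric labels in the data can be translated back to category names. This is a small illustrative helper, not part of the official data:

label2id = {'Technology': 0, 'Stocks': 1, 'Sports': 2, 'Entertainment': 3,
            'Current Affairs': 4, 'Society': 5, 'Education': 6, 'Finance': 7,
            'Home Furnishing': 8, 'Game': 9, 'Real Estate': 10, 'Fashion': 11,
            'Lottery': 12, 'Constellation': 13}
id2label = {v: k for k, v in label2id.items()}
print(id2label[6])  # 'Education' -- the label of the sample row above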

The data comes from news on the Internet, collected and anonymized. Contestants can therefore conduct their own data analysis and make full use of their strengths in feature engineering, with no restrictions on the use of external data or models.
The data columns are separated by \t; the Pandas code for reading the data is as follows:

import pandas as pd
train_df = pd.read_csv('../input/train_set.csv', sep='\t')

1.4 Evaluation Criteria

The evaluation metric is the mean of the per-category f1_score (macro F1). Submitted results are compared against the true categories of the test set; the larger the score, the better.

Calculation formula: $F_1 = 2 \times \frac{precision \times recall}{precision + recall}$

The f1_score calculation can be done through sklearn:

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
# 'macro' computes the F1 of each class and takes their unweighted mean
f1_score(y_true, y_pred, average='macro')

1.5 Reading the data

Use the Pandas library to read the data and perform a preliminary analysis of the competition data.
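
A minimal sketch of what this looks like in practice; the path and '\t' separator follow the snippet above, and the 'label' and 'text' column names follow the sample shown in section 1.3:

import pandas as pd

train_df = pd.read_csv('../input/train_set.csv', sep='\t')
print(train_df.shape)                    # expected: (200000, 2)
print(train_df['label'].value_counts())  # class distribution over the 14 categories
print(train_df['text'].head())           # anonymized character sequences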

1.6 Approach analysis

Analysis of the approach: in essence this is a text classification problem, where each piece of news must be classified based on its characters. However, since the data is anonymized, techniques such as Chinese word segmentation cannot be applied directly, and this is the main difficulty of the task.

Therefore, the difficulty of this competition lies in modeling anonymized characters to complete the text classification. Since text is typical unstructured data, the solution generally involves two parts: feature extraction and a classification model.

  • Idea 1: TF-IDF + machine learning classifier

Extract features from the text directly with TF-IDF, then classify with a traditional classifier. For the classifier, SVM, LR (logistic regression), or XGBoost can be used.
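
A minimal sketch of Idea 1, assuming scikit-learn is installed; the nrows, ngram_range, and max_features values are illustrative rather than tuned:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=20000)

# TF-IDF over the anonymized character tokens; token_pattern keeps
# single-character ids such as '6' that the default pattern would drop
tfidf = TfidfVectorizer(token_pattern=r'\w+', ngram_range=(1, 2), max_features=3000)
X = tfidf.fit_transform(train_df['text'])
y = train_df['label']

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
print(f1_score(y_val, clf.predict(X_val), average='macro'))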

  • Idea 2: FastText

FastText is an entry-level word-embedding approach. Using the fastText tool released by Facebook, a classifier can be built quickly.
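
A minimal sketch of Idea 2, assuming the fasttext Python package from Facebook is installed; the file name and hyperparameters are illustrative:

import pandas as pd
import fasttext

train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=20000)

# fastText's supervised mode reads plain text lines in which the label token
# carries the '__label__' prefix; tab and space both count as separators
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
train_df[['label_ft', 'text']].to_csv('train_fasttext.txt',
                                      index=False, header=False, sep='\t')

model = fasttext.train_supervised('train_fasttext.txt',
                                  lr=1.0, wordNgrams=2, epoch=25, loss='hs')
labels, probs = model.predict('57 44 66 56 2 3 3 37 5 41')  # an anonymized snippet
print(labels[0].replace('__label__', ''))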

  • Idea 3: WordVec + deep learning classifier

Word2Vec is a more advanced word-embedding approach; classification is completed by building a deep learning classifier on top of the embeddings. The network structure can be TextCNN, TextRNN, or BiLSTM.
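
A minimal sketch of the first half of Idea 3, training Word2Vec embeddings over the anonymized character tokens with gensim (assuming gensim >= 4.0; parameters are illustrative). The resulting vectors would then feed a TextCNN / TextRNN / BiLSTM classifier:

import pandas as pd
from gensim.models import Word2Vec

train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=20000)

# each "sentence" is the list of anonymized character ids in one news item
sentences = [text.split() for text in train_df['text']]

# vector_size is the gensim >= 4.0 parameter name (older versions use `size`)
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
print(w2v.wv['57'].shape)  # a 100-dimensional vector for the anonymized character '57'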

  • Idea 4: Bert word vector

BERT is a high-end, contextual word-vector approach with powerful modeling and learning capabilities.
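
Because the corpus is anonymized, a publicly pretrained Chinese BERT cannot be applied directly; one option is to configure a small BERT over the anonymized vocabulary and train it on the competition data. A minimal configuration sketch, assuming the Hugging Face transformers and PyTorch packages are installed; all sizes are illustrative:

from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    vocab_size=7000,          # assumption: roughly the number of distinct anonymized characters
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
    num_labels=14,            # the 14 news categories
)
model = BertForSequenceClassification(config)  # randomly initialized, to be trained on the corpus
print(sum(p.numel() for p in model.parameters()))  # rough parameter count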


Origin blog.csdn.net/bosszhao20190517/article/details/107495216