Douban Score Prediction (How to Use Your Own Dataset for Text Classification): BERT Chinese Text Classification Based on PyTorch, a Super Detailed Tutorial!

Foreword

I believe that after finishing a movie or TV series, most of us inevitably go to Douban to read other people's comments and ratings, either to see how the work was received or to find people who share our tastes.

So is there a connection between Douban comments and Douban scores? We can train a BERT Chinese classification model that takes a Douban comment as input and outputs a predicted score, then compare the prediction with the real Douban score.

In this project we will cover:

  • Text preprocessing
  • Model training and evaluation
  • Testing on real data

Let's first take a look at the final Douban score prediction results, using the comments on "Sweeping Black Storm" as an example:

Prediction result:

Next, let's walk through how to implement Douban score prediction.

1. Project overview

First of all, we implement Douban score prediction based on the Chinese text classification module of the GitHub open-source project EasyBert. At its core, this is a text classification problem.

Configure the relevant environment:

python 3.7
pytorch 1.1
tqdm
sklearn
tensorboardX

Dataset:

Our Douban comment dataset is the DMSC.csv file. The original project's dataset, by contrast, consists of 200,000 news titles extracted from THUCNews, each 20 to 30 characters long, covering 10 categories with 20,000 entries each. Data is fed into the model character by character.

THUCNews
├── data
│ ├── train.txt # training set
│ ├── test.txt # test set
│ ├── dev.txt # validation set
│ └── class.txt # class names
└── saved_dict

 So we need to preprocess the Douban comment data and save it in the same format.
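
For reference, each line of these txt files holds one comment and its label separated by a tab. A made-up example (the comments here are invented purely for illustration):

The plot drags a bit, but the acting is great.	3
Couldn't even finish the first episode.	0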

Code:

The TextClassifier folder contains three main scripts plus the models and bert_pretrain folders. The models folder holds bert.py and ernie.py, and the bert_pretrain folder holds the pre-trained model; model and training parameters can be set in bert.py and ernie.py. run.py is the main entry point, where the parameters are set and the model is trained. train_eval.py contains the training and evaluation functions, and predict.py runs classification on new data.

The detailed code analysis is broken down below!

TextClassifier
├── models
│ ├── bert.py # BERT model
│ └── ernie.py # ERNIE model
├── bert_pretrain # pre-trained model
│ ├── bert_config.json
│ ├── pytorch_model.bin
│ └── vocab.txt
├── run.py
├── predict.py
└── train_eval.py
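
Once the data files are ready, training is started from run.py. As a hypothetical example, assuming run.py exposes a --model argument for choosing between the bert and ernie models (this flag is an assumption, not something stated in this post, so check the argparse section of your copy of run.py):

python run.py --model bert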

Algorithm flow: 

2. Text processing

1. Load data

Since the Douban data comes as DMSC.csv, we read it with pd.read_csv, which reads a CSV file and returns the table as a pandas DataFrame.

import pandas as pd

# Read the data
data = pd.read_csv('DMSC.csv')
# Look at the data format
data.head()
# Print some basic information about the data
data.info()
# Keep only the two columns we need: Comment and Star
data = data[['Comment', 'Star']]
# Check the format of the new data
data.head()

Output result:

   Comment | Star
0  Even Ultron knew that he would go to Korea for plastic surgery. | 3
1  "A person without a dark side is not worthy of trust." The second part strips away the lengthy foreshadowing, the opening is a climax, and until the end, some people feel... | 4
2  Ultron is weak, weak, weak, weak!!!!!! | 2
3  Unlike the first episode, it is a link between the past and the future, gloomy and serious, but it won't be bad, unless you don't like Marvel movies in the first place. The scene is even bigger... | 4
4  After reading it, I excitedly said to my friend, what should I do when Ultron comes to destroy Taipei? She patted me on the shoulder, it's okay, anyway, you bought two copies | 5

2. Text preprocessing

When the training data was first fed into BERT, errors were raised because whitespace characters could not be converted and the label values were out of range, so the data is preprocessed once more: whitespace is removed and each label is reduced by one (mapping the 1-5 stars to labels 0-4).

def clear_character(sentence):
    new_sentence = ''.join(sentence.split())    # remove whitespace
    return new_sentence

data["comment_processed"] = data['Comment'].apply(clear_character)
data['label'] = data['Star'] - 1                # shift labels from 1-5 down to 0-4
data.head()

Output result:

   Comment | Star | comment_processed | label
0  Even Ultron knew that he would go to Korea for plastic surgery. | 3 | Even Ultron knew that he would go to Korea for plastic surgery. | 2
1  "A person without a dark side is not worthy of trust." The second part strips away the lengthy foreshadowing, the opening is a climax, and until the end, some people feel... | 4 | "A person without a dark side is not worthy of trust." The second part strips away the lengthy foreshadowing, the opening is a climax, and until the end, some people feel that it is only... | 3
2  Ultron is weak, weak, weak, weak!!!!!! | 2 | Ultron is weak, weak, weak, weak!!!!!! | 1
3  Unlike the first episode, it is a link between the past and the future, gloomy and serious, but it won't be bad, unless you don't like Marvel movies in the first place. The scene is even bigger... | 4 | Unlike the first episode, it is a link between the past and the future, gloomy and serious, but it won't be bad, unless you don't like Marvel movies in the first place. The scene is even bigger,... | 3
4  After reading it, I excitedly said to my friend, what should I do when Ultron comes to destroy Taipei? She patted me on the shoulder, it's okay, anyway, you bought two copies | 5 | After reading it, I excitedly said to my friend, what should I do when Ultron is coming to destroy Taipei? | 4

3. Split into training and test sets

The dataset is split with sklearn's train_test_split() function.

from sklearn.model_selection import train_test_split

X = data[['comment_processed', 'label']]
test_ratio = 0.2
comments_train, comments_test = train_test_split(X, test_size=test_ratio, random_state=0)
print(comments_train.head(), comments_test.head())

4. Save in txt format 

Since the project expects tab-separated txt files in which each line holds the text followed by its label, we save the data with DataFrame.to_csv.

comments_train.to_csv('train.txt', sep='\t', index=False,header=False)
comments_test.to_csv('test.txt', sep='\t', index=False,header=False)

 Output result:
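
Note that the project's data loader also expects a dev.txt validation file alongside train.txt and test.txt. A minimal sketch for producing it, assuming we simply carve a further 25% out of the training portion (this extra split is my own choice, but it yields the 6:2:2 train/test/validation ratio mentioned in the training section below):

comments_train, comments_dev = train_test_split(comments_train, test_size=0.25, random_state=0)
comments_train.to_csv('train.txt', sep='\t', index=False, header=False)
comments_dev.to_csv('dev.txt', sep='\t', index=False, header=False)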

3. BERT model

1. Feature Transformation

In run.py, the saved training, test, and validation data are first converted into BERT input features.

print("Loading data...")
train_data, dev_data, test_data = build_dataset(config)
train_iter = build_iterator(train_data, config)
dev_iter = build_iterator(dev_data, config)
test_iter = build_iterator(test_data, config)
time_dif = get_time_dif(start_time)
print("Time usage:", time_dif)
def load_dataset(path, pad_size=32):
    contents = []
    with open(path, 'r', encoding='UTF-8') as f:      # 读取数据
        for line in tqdm(f):
            lin = line.strip()
            if not lin:
                continue
            if len(lin.split('\t')) == 2:
                content, label = lin.split('\t')
            token = config.tokenizer.tokenize(content)      # 分词
            token = [CLS] + token                           # 句首加入CLS
            seq_len = len(token)
            mask = []
            token_ids = config.tokenizer.convert_tokens_to_ids(token)

            if pad_size:
                if len(token) < pad_size:
                    mask = [1] * len(token_ids) + [0] * (pad_size - len(token))
                    token_ids += ([0] * (pad_size - len(token)))
                else:
                    mask = [1] * pad_size
                    token_ids = token_ids[:pad_size]
                    seq_len = pad_size
            contents.append((token_ids, int(label), seq_len, mask))
    return contents

The tokenizer splits the input text, and the data is converted into features.

The feature contains 4 fields:

  • token_ids: the vocabulary id of each token; the padding symbol maps to id 0, and [CLS] and [SEP] map to ids 101 and 102 respectively. Note that the Chinese BERT model tokenizes at the character level rather than the word level.
  • mask: marks real tokens versus padding; every real token (including [CLS] and [SEP]) corresponds to 1 and every padding symbol to 0.
  • seq_len: the sentence length (number of real tokens).
  • label: the class label, obtained by mapping the elements of label_list to integer indices with a dictionary.

An example of one converted feature:

Input: Some plots are lacking in succession, but the characters on the screen are great.    3
token_ids: [101, 1196, 2658, 3300, 4638, 2824, 2970, 3612, 5375, 8024, 4514, 7481, 782, 6392, 2523, 3472, 511, 0, ..., 0]
mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ..., 0]
label: 3
seq_len: 17
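
To make the conversion concrete, here is a small stand-alone sketch of the same steps using BertTokenizer. This is an illustration only: the project does this inside load_dataset via config.tokenizer, and the use of the pip package pytorch_pretrained_bert and the local bert_pretrain directory here are assumptions.

from pytorch_pretrained_bert import BertTokenizer

pad_size = 32
tokenizer = BertTokenizer.from_pretrained('./bert_pretrain')      # directory containing vocab.txt (assumed path)

content = '剧情还行,画面人设很棒。'                                  # a made-up comment
token = ['[CLS]'] + tokenizer.tokenize(content)                   # character-level tokens with [CLS] prepended
token_ids = tokenizer.convert_tokens_to_ids(token)
seq_len = len(token)

# pad up to pad_size (this sketch assumes the comment is shorter than pad_size)
mask = [1] * len(token_ids) + [0] * (pad_size - len(token))       # 1 = real token, 0 = padding
token_ids = token_ids + [0] * (pad_size - len(token))

print(token_ids, mask, seq_len)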

2. Model training

After reading the data and converting the features, the features are sent to the model for training.

The optimizer is the Adam variant designed for BERT (BertAdam), which uses learning-rate warmup and weight decay.
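
A sketch of how such an optimizer is typically constructed (assuming the BertAdam class from pytorch_pretrained_bert; model, config, and train_iter refer to the objects built in run.py, and the warmup value is illustrative):

from pytorch_pretrained_bert.optimization import BertAdam

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    # weight decay for all parameters except biases and LayerNorm weights
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
]
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=config.learning_rate,
                     warmup=0.05,                                  # fraction of steps used for learning-rate warmup
                     t_total=len(train_iter) * config.num_epochs)  # total number of optimization steps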

The ratio of training set, test set, and validation set is 6:2:2.

Every 100 batches, the model is evaluated on the validation set and the corresponding accuracy and loss are reported. If the validation loss is lower than the best value seen so far, the model parameters are saved; otherwise the count of batches without improvement keeps growing. If the model has not improved for more than 1,000 consecutive batches, training is stopped early.

# total_batch, dev_best_loss, last_improve and flag are initialized before this loop in train_eval.py
# (e.g. total_batch = 0, dev_best_loss = float('inf'), last_improve = 0, flag = False)
for epoch in range(config.num_epochs):
    print('Epoch [{}/{}]'.format(epoch + 1, config.num_epochs))
    for i, (trains, labels) in enumerate(train_iter):

        outputs = model(trains)
        model.zero_grad()
        loss = F.cross_entropy(outputs, labels)
        loss.backward()
        optimizer.step()
        if total_batch % 100 == 0:
            # every 100 batches, report performance on the training and validation sets
            true = labels.data.cpu()
            predic = torch.max(outputs.data, 1)[1].cpu()
            train_acc = metrics.accuracy_score(true, predic)
            dev_acc, dev_loss = evaluate(config, model, dev_iter)
            if dev_loss < dev_best_loss:
                # new best validation loss: save the model parameters
                dev_best_loss = dev_loss
                torch.save(model.state_dict(), config.save_path)
                improve = '*'
                last_improve = total_batch
            else:
                improve = ''
            time_dif = get_time_dif(start_time)
            msg = 'Iter: {0:>6},  Train Loss: {1:>5.2},  Train Acc: {2:>6.2%},  Val Loss: {3:>5.2},  Val Acc: {4:>6.2%},  Time: {5} {6}'
            print(msg.format(total_batch, loss.item(), train_acc, dev_loss, dev_acc, time_dif, improve))
            model.train()
        total_batch += 1
        if total_batch - last_improve > config.require_improvement:
            # validation loss has not improved for more than 1000 batches: stop training
            print("No optimization for a long time, auto-stopping...")
            flag = True
            break
    if flag:
        break
test(config, model, test_iter)

 Training result:

1245it [00:00, 6290.83it/s]Loading data...
170004it [00:28, 6068.60it/s]
42502it [00:07, 6017.43it/s]
42502it [00:06, 6228.82it/s]
Time usage: 0:00:42
Epoch [1/5]
Iter:      0,  Train Loss:   1.8,  Train Acc:  3.12%,  Val Loss:   1.7,  Val Acc:  9.60%,  Time: 0:02:14 *
Iter:    100,  Train Loss:   1.5,  Train Acc: 25.00%,  Val Loss:   1.4,  Val Acc: 20.60%,  Time: 0:05:10 *
...
Iter:   5300,  Train Loss:  0.75,  Train Acc: 65.62%,  Val Loss:   1.0,  Val Acc: 50.07%,  Time: 2:45:41 *
Epoch [2/5]
Iter:   5400,  Train Loss:   1.0,  Train Acc: 62.50%,  Val Loss:   1.0,  Val Acc: 51.02%,  Time: 2:48:46 
...
Iter:   7000,  Train Loss:  0.77,  Train Acc: 75.00%,  Val Loss:   1.0,  Val Acc: 52.84%,  Time: 3:38:26 
No optimization for a long time, auto-stopping...
Test Loss:   1.0,  Test Acc: 50.89%
Precision, Recall and F1-Score...
              precision    recall  f1-score   support

           1     0.6157    0.5901    0.6026      3706
           2     0.5594    0.1481    0.2342      3532
           3     0.4937    0.5883    0.5369      9678
           4     0.4903    0.5459    0.5166     12899
           5     0.6693    0.6394    0.6540     12687

    accuracy                         0.5543     42502
   macro avg     0.5657    0.5024    0.5089     42502
weighted avg     0.5612    0.5543    0.5463     42502
Time usage: 0:02:25

From the training results, we can see that the overall accuracy and F1 score only reach about 55-60%. Analyzing the comments more closely explains why:

Adjacent scores have little to do with the review text: a two-star review, for example, often reads the same as a three-star review, which makes the exact score hard to predict from the comment alone. However, the test results also show that good and bad reviews can be clearly distinguished, with an accuracy of around 90%.
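
One way to check this claim from the saved predictions (an illustrative sketch only: y_true and y_pred stand for the 0-4 labels and model predictions collected on the test set, and treating labels 3-4, i.e. 4-5 stars, as "good" is an arbitrary cut of my own):

import numpy as np
from sklearn import metrics

# Dummy values so the snippet runs stand-alone; replace with the real test-set arrays.
y_true = np.array([4, 0, 3, 1, 2, 4])
y_pred = np.array([3, 1, 3, 0, 3, 4])

y_true_binary = (y_true >= 3).astype(int)   # 1 = good review (labels 3-4), 0 = the rest
y_pred_binary = (y_pred >= 3).astype(int)
print('binary accuracy:', metrics.accuracy_score(y_true_binary, y_pred_binary))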

3. Model testing

Testing follows the same principle as training: the data is first converted into features, fed into the trained model, and the predictions are read out.


def final_predict(config, model, data_iter):
    map_location = lambda storage, loc: storage
    model.load_state_dict(torch.load(config.save_path, map_location=map_location))  # load the best saved checkpoint
    model.eval()
    predict_all = np.array([])
    with torch.no_grad():
        for texts, _ in data_iter:
            outputs = model(texts)
            pred = torch.max(outputs.data, 1)[1].cpu().numpy()      # index of the highest-scoring class
            pred_label = [match_label(i, config) for i in pred]     # map the class index back to a star rating
            predict_all = np.append(predict_all, pred_label)

    return predict_all

def main(text):
    config = Config()
    model = Model(config).to(config.device)
    test_data = load_dataset(text, config)
    test_iter = build_iterator(test_data, config)
    result = final_predict(config, model, test_iter)
    for i, j in enumerate(result):
        print('text:{}'.format(text[i]))
        print('label:{}'.format(j))
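
A usage example for the prediction script (assuming predict.py's Config, Model, and helper functions are importable and that main takes a list of comment strings, as the loop over text[i] suggests; the comments below are made up):

if __name__ == '__main__':
    comments = ['节奏很快,演员的表演很有张力,看得很过瘾。',
                '剧情太拖沓了,第一集就弃了。']
    main(comments)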

 Test Results:

Summary

This project implements Douban score prediction with PyTorch-based BERT Chinese text classification. Tested on real data, it works reasonably well. I have to say, BERT's current performance on natural language processing tasks really is impressive!

I hope to keep innovating and help push NLP forward!


That's it for today, keep going tomorrow!

If this article is helpful to you, please like, follow, bookmark and support!

Creating content is not easy, and freeloading is not cool; everyone's support and recognition is the biggest motivation for my writing!

If there are any mistakes in this post, please point them out; thank you very much!


References:

How to use BERT to implement Chinese text classification (with code)

EasyBert, a PyTorch-based BERT application


Original post: blog.csdn.net/kobepaul123/article/details/119768892