Study Notes: Deep Learning (8) - BERT Application Practice Based on PyTorch

Study time: 2022.04.26~2022.04.30

7. PyTorch-based BERT application practice

This section focuses on applying the BERT model to a concrete task, so there is still plenty of room for optimization later, such as writing my own Dataset and DataLoader classes, which should make PyTorch usable in a more flexible way.

7.1 Tool selection

To apply BERT with the PyTorch framework, the usual approach is to go to Hugging Face (a community hub where many people publish their trained models, with support for both PyTorch and TensorFlow), download a BERT model, and then use it.

Before using a specific model, you need to install the transformers library provided by Hugging Face, because the models hosted there are all built on top of it; even after a model has been downloaded, it still has to be loaded through this library. (This part is required.)

pip install transformers

In addition, for the data-processing part, if you do not want to write your own Dataset and DataLoader, Hugging Face also provides the datasets library, which can be installed with pip and used directly. (This part is optional.)
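
pip install datasets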

For the usage of these two libraries, you can go straight to the official documentation, which I found fairly detailed (see Transformers and Datasets).

The dataset I used this time is also from Kaggle: Quora Insincere Questions Classification | Kaggle.

7.2 Text Preprocessing

This part mainly draws on code shared by experts on Kaggle, which I summarized and wrapped into functions. Here is a simple example:

  1. Remove punctuation, HTML tags, URLs and emojis:
import re

def clean_data(data):
    # Remove HTML tags, URLs and emojis first, then strip the remaining punctuation
    # (stripping punctuation first would break the HTML/URL patterns)
    html_tag = re.compile(r'<.*?>')
    data = html_tag.sub(r'', data)
    url_clean = re.compile(r"https://\S+|www\.\S+")
    data = url_clean.sub(r'', data)
    emoji_clean = re.compile("["
                             u"\U0001F600-\U0001F64F"  # emoticons
                             u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                             u"\U0001F680-\U0001F6FF"  # transport & map symbols
                             u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                             u"\U00002702-\U000027B0"
                             u"\U000024C2-\U0001F251"
                             "]+", flags=re.UNICODE)
    data = emoji_clean.sub(r'', data)
    punct_tag = re.compile(r'[^\w\s]')
    data = punct_tag.sub(r'', data)
    return data
  2. Remove possessive endings:
def strip_possessives(text):
    # Remove both the ASCII and the curly-apostrophe possessive endings
    text = text.replace("'s", '')
    text = text.replace('’s', '')
    return text
  3. Replace numbers with ##:
def clean_numbers(x):
    x = re.sub("[0-9]{5,}", '#####', x)
    x = re.sub("[0-9]{4}", '####', x)
    x = re.sub("[0-9]{3}", '###', x)
    x = re.sub("[0-9]{2}", '##', x)
    return x

...

Finally, everything is invoked through a single function:

def texts_preprogress(df):
    # Apply all of the preprocessing steps defined above
    df = df.apply(lambda x: clean_data(x))
    df = df.apply(lambda x: expand_contractions(x))
    df = df.apply(lambda x: replace_typical_misspell(x))
    df = df.apply(lambda x: strip_possessives(x))
    df = df.apply(lambda x: replace_multi_exclamation_mark(x))
    df = df.apply(lambda x: clean_text(x))
    df = df.apply(lambda x: change_stopwords(x))
    df = df.apply(lambda x: clean_numbers(x))
    return df
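
As a quick sanity check, here is what the three helpers shown above do to a made-up sentence (the example string is my own):

sample = "Check <b>this</b> out https://example.com 😀 It's 2022!!!"
print(clean_numbers(strip_possessives(clean_data(sample))))
# -> roughly "Check this out Its ####" (HTML tag, URL, emoji and punctuation removed, digits masked)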

7.3 Using the BERT model

The whole process is divided into 5 parts: data input and applying the preprocessing, extracting word vectors, building the network model (with BERT embedded), preparing the parameters, and training the model.

This time I am using the bert-base-uncased model.
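
For reference, the snippets below roughly assume the following imports (plus the import re used by the preprocessing helpers in 7.2); adjust to your own environment:

import time
import numpy as np
import pandas as pd
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer, BertModel
from datasets import Dataset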

7.3.1 Data Input and Application Preprocessing

# Select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
# Load the file and apply the text preprocessing
train_df = pd.read_csv('train.csv', nrows=540)
train_df['question_text'] = texts_preprogress(train_df['question_text'])
eval_df = pd.read_csv('train.csv', nrows=675)[540:675]
eval_df['question_text'] = texts_preprogress(eval_df['question_text'])

7.3.2 Extracting word vectors

⭐ One point here deserves special mention: the padding parameter of the tokenizer.

padding=True is not equivalent to padding='max_length'! padding=True (the same as padding='longest') only pads up to the longest sequence in the current batch and ignores max_length. This is exactly why, after I first set padding=True and also specified max_length=72, the output sequences still came out with different lengths (each batch was padded to its own longest sentence) and could not be fed into the network.

So if you need a truly fixed length for single-sentence inputs, be sure to specify padding='max_length' and then set max_length= explicitly; only then is every sequence padded to the same length.
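
A minimal sketch of the difference (the two example sentences are made up; tokenizer here means the bert-base-uncased tokenizer loaded just below):

texts = ["a short question", "a noticeably longer question about how tokenizer padding behaves"]
out_longest = tokenizer(texts, padding=True, max_length=72, truncation=True)
out_fixed = tokenizer(texts, padding='max_length', max_length=72, truncation=True)
print(len(out_longest['input_ids'][0]))  # padded only to the longest sentence in this small batch
print(len(out_fixed['input_ids'][0]))    # always 72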

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('D:/Py-project/models/huggingface/bert-base-uncased/')


# Define the tokenization function
def tokenize_function(examples):
    return tokenizer(examples['question_text'], padding='max_length', max_length=72, truncation=True)
    # padding='max_length' pads shorter sequences up to max_length (this was the pitfall!!);
    # truncation=True truncates sequences to the maximum length the model accepts


# Define a function that handles the label and id columns
def batch_label(df):
    df = df.drop('qid', axis=1)
    dataset = Dataset.from_pandas(df)
    for k in dataset.column_names:
        if k == 'target':
            dataset = dataset.rename_column(k, 'labels')
    inputs = dataset.map(tokenize_function, batched=True)  # tokenize the text; batched=True enables batched mapping
    inputs.set_format(type='torch')  # convert the columns to PyTorch tensors
    inputs = inputs.remove_columns('question_text')  # drop the raw text column, since the model does not accept raw text as input
    dataloader = DataLoader(inputs, shuffle=True, batch_size=128)  # wrap it all in a DataLoader
    return dataloader


train_dl = batch_label(train_df)
eval_dl = batch_label(eval_df)
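
As a quick check of my own (not in the original notes), one batch can be pulled out to confirm the shapes before building the model:

first_batch = next(iter(train_dl))
print({k: v.shape for k, v in first_batch.items()})
# expected: labels -> [128]; input_ids / token_type_ids / attention_mask -> [128, 72]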

7.3.3 Network Modeling

In fact, transformers also provides a ready-made BERT classification model (BertForSequenceClassification), but it can only be called as-is and is not easy to embed into a custom network, so here only the base BERT model is used and then embedded into the network.

class my_bert(nn.Module):
    def __init__(self):
        super(my_bert, self).__init__()

        # The BERT model is embedded into the network
        self.bert = BertModel.from_pretrained("D:/Py-project/models/huggingface/bert-base-uncased")
        # Make the BERT parameters trainable (fine-tuning)
        for param in self.bert.parameters():
            param.requires_grad = True
        self.linear = torch.nn.Linear(768, 2)
        self.dropout = torch.nn.Dropout(0.5)

    def forward(self, x):

        input_ids = x['input_ids'].to(device)
        token_type_ids = x['token_type_ids'].to(device)
        attention_mask = x['attention_mask'].to(device)

        output = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        output = output['pooler_output']  # BERT returns several outputs; only the pooler_output is needed for this task
        output = self.linear(output)
        output = self.dropout(output)
        output = torch.sigmoid(output)
        return output
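
As another rough sanity check of my own, the forward pass can be tried on a single batch before training (check_model is a throwaway instance; the real model is created in the next subsection):

check_model = my_bert().to(device)
with torch.no_grad():
    print(check_model(next(iter(train_dl))).shape)  # torch.Size([batch_size, 2])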

7.3.4 Parameter preparation

# Set the random seed so the results are reproducible
# (device is a torch.device object, so check device.type; seed the CPU generator as well)
seed = 42
torch.manual_seed(seed)
if device.type == 'cuda':
    torch.cuda.manual_seed(seed)

# Instantiate the model
model = my_bert()
model.to(device)

# Set the training hyperparameters
lr = 2e-5
epoch = 2
show_step = 1
optimizer = optim.AdamW(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

7.3.5 Model training

Only one small point to note here: if the dataset is large, it basically has to be wrapped in a DataLoader and fed to the network batch by batch, for both training and evaluation.

for i in range(epoch):
    model.train()
    losses = []
    accuracy = []
    start_time = time.time()

    for batch in tqdm(train_dl, desc=f'Epoch {i+1}/{epoch} progress', ncols=100):
        pred = model(batch)  # forward pass
        label = batch['labels'].to(device)
        loss = criterion(pred, label)  # compute the loss

        # Record the accuracy and the loss
        losses.append(loss.item())
        pred_labels = torch.argmax(pred, dim=1)
        acc = torch.sum(pred_labels == label).item() / len(pred_labels)
        accuracy.append(acc)

        optimizer.zero_grad()  # zero the optimizer's gradients
        loss.backward()  # backward pass
        optimizer.step()  # update the parameters

    # Evaluation on the validation set
    if i % show_step == 0:  # controls how often results are printed
        model.eval()
        ev_losses = []
        ev_acc = []
        with torch.no_grad():
            for batch in eval_dl:
                ev_pred = model(batch)
                ev_label = batch['labels'].to(device)
                ev_loss = criterion(ev_pred, ev_label)

                # Record the accuracy and the loss
                ev_losses.append(ev_loss.item())
                pred_labels = torch.argmax(ev_pred, dim=1)
                acc = torch.sum(pred_labels == ev_label).item() / len(pred_labels)
                ev_acc.append(acc)

        elapsed_time = time.time() - start_time
        print("\nEpoch: {}/{}: ".format(i+1, epoch),
              "Accuracy: {:.6f}; ".format(np.mean(accuracy)),
              "Val Accuracy: {:.6f}; ".format(np.mean(ev_acc)),
              "Loss: {:.6f}; ".format(np.mean(losses)),
              "Val Loss: {:.6f}; ".format(np.mean(ev_losses)),
              'Time: {:.2f}s'.format(elapsed_time))
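
Saving the fine-tuned weights was not part of this run, but it would be a natural last step; a minimal sketch (the file name is just an example):

torch.save(model.state_dict(), 'my_bert_quora.pt')
# to reload later: model.load_state_dict(torch.load('my_bert_quora.pt', map_location=device))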

The above covers most of my first attempt with BERT. Next I plan to try ALBERT and RoBERTa, and after that I will try writing my own Dataset and DataLoader. Keep it up!

Origin blog.csdn.net/Morganfs/article/details/124513470