[Kaggle Micro-Course] Natural Language Processing - 2. Text Classification

learn from https://www.kaggle.com/learn/natural-language-processing

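A common task in NLP is text classification. This is "classification" in the conventional machine learning sense, applied to text.

Examples include spam detection, sentiment analysis, and tagging customer queries.

In this tutorial, you'll learn text classification with spaCy. The classifier will detect spam messages, a common feature of most email clients.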

Reading the data

import pandas as pd
spam = pd.read_csv("./spam.csv")
spam.head(10)

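1. Bag of words

Models cannot learn directly from raw text; the text must first be converted into numeric features. The simplest common approach is a variation of one-hot encoding: each document is represented as a vector of term counts over the vocabulary.

For example:

Sentence 1: "Tea is life. Tea is love."
Sentence 2: "Tea is healthy, calming, and delicious."

Ignoring punctuation, the vocabulary is {"tea", "is", "life", "love", "healthy", "calming", "and", "delicious"}

Counting how many times each vocabulary word occurs in each sentence gives one vector per sentence: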

$v_1 = [2, 2, 1, 1, 0, 0, 0, 0]$
$v_2 = [1, 1, 0, 0, 1, 1, 1, 1]$

This is the bag-of-words representation. Similar documents have similar bag-of-words vectors.
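As a minimal sketch of the counting described above, the two example sentences can be turned into vectors with the standard library alone. The helper names `tokenize` and `bag_of_words` are made up for this illustration, and the tokenizer is deliberately naive (lowercase alphabetic runs only); it is not how spaCy tokenizes.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic runs only, which drops punctuation
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text, vocab):
    # Count each token, then read the counts off in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

vocab = ["tea", "is", "life", "love", "healthy", "calming", "and", "delicious"]
v1 = bag_of_words("Tea is life. Tea is love.", vocab)
v2 = bag_of_words("Tea is healthy, calming, and delicious.", vocab)
print(v1)  # [2, 2, 1, 1, 0, 0, 0, 0]
print(v2)  # [1, 1, 0, 0, 1, 1, 1, 1]
```

The order of the vocabulary list fixes the meaning of each vector position, so the same list must be used for every document.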

Another common representation is TF-IDF (Term Frequency - Inverse Document Frequency), which scales each term count by how rare the term is across the corpus.
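To make the idea concrete, here is a rough sketch of TF-IDF on the same two sentences. It assumes the plain, unsmoothed weighting tf * log(N / df), where N is the number of documents and df is the number of documents containing the term; real libraries usually apply smoothed or normalized variants, so their numbers will differ. The function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer: lowercase alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(docs):
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw count times log(N / df)
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

docs = ["Tea is life. Tea is love.",
        "Tea is healthy, calming, and delicious."]
weights = tf_idf(docs)
print(weights[0])
```

Under this weighting, "tea" and "is" occur in both sentences and therefore get a weight of zero, while words unique to one sentence keep a positive weight.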

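2. Building a bag-of-words model

spaCy's TextCategorizer handles the bag-of-words conversion and builds a simple linear model. It is a spaCy pipeline component (a pipe). The code in this article uses the spaCy v2 API.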

import spacy
nlp = spacy.blank('en')  # create an empty model

# Create the TextCategorizer with exclusive classes
# and the "bow" architecture
textcat = nlp.create_pipe('textcat', config={
    "exclusive_classes": True,  # each text belongs to exactly one class
    "architecture": "bow"
})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

# help(nlp.create_pipe)
Help on method create_pipe in module spacy.language:

create_pipe(name, config={}) method of spacy.lang.en.English instance
    Create a pipeline component from a factory.
    
    name (unicode): Factory name to look up in `Language.factories`.
    config (dict): Configuration parameters to initialise component.
    RETURNS (callable): Pipeline component.
    
    DOCS: https://spacy.io/api/language#create_pipe

# Add labels to text classifier
textcat.add_label("ham")   # legitimate messages
textcat.add_label("spam")  # spam messages

3. Training a text classification model

Preparing the data

train_texts = spam['text'].values
train_labels = [{'cats': {'ham': label == 'ham',
                          'spam': label == 'spam'}}
                for label in spam['label']]

Pair each text with its label dictionary

train_data = list(zip(train_texts, train_labels))
train_data[:3]

Output:

[
 ('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  {'cats': {'ham': True, 'spam': False}}),
 ('Ok lar... Joking wif u oni...',
  {'cats': {'ham': True, 'spam': False}}),
 ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
  {'cats': {'ham': False, 'spam': True}})
]

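Preparing to train the model

Create an optimizer with nlp.begin_training(); spaCy uses it to update the model weights
Split the data into batches with minibatch
Update the model parameters with nlp.update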

from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# Split the data into batches
batches = minibatch(train_data, size=8)
# Iterate over the batches
for batch in batches:
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)

This is just one epoch, a single pass over the training data.

>>> batch = [("1", True),("2", False)]
>>> texts, labels = zip(*batch)
>>> texts
('1', '2')
>>> labels
(True, False)

https://www.runoob.com/python/python-func-zip.html
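The batching and unpacking steps can be sketched in plain Python. `simple_minibatch` is a hypothetical stand-in written for this illustration, not spaCy's `minibatch` implementation, and the toy data is made up.

```python
def simple_minibatch(items, size):
    """Yield consecutive chunks of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy (text, label) pairs in the same shape as train_data
toy_data = [(f"text {i}", {"cats": {"ham": i % 2 == 0, "spam": i % 2 == 1}})
            for i in range(5)]

batch_sizes = []
for batch in simple_minibatch(toy_data, size=2):
    # Split the list of (text, label) tuples into two parallel tuples
    texts, labels = zip(*batch)
    batch_sizes.append(len(texts))

print(batch_sizes)  # [2, 2, 1]
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size.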

Training for multiple epochs

import random
random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

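loss = {}  # not reset between epochs, so the printed loss is cumulative
for epoch in range(10):
    # Shuffle the data at the start of every epoch
    random.shuffle(train_data)
    # Split the data into batches
    batches = minibatch(train_data, size=8)
    # Iterate over the batches
    for batch in batches:
        texts, labels = zip(*batch)
        nlp.update(texts, labels, drop=0.3, sgd=optimizer, losses=loss)
    print(loss)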

# help(nlp.update)
Help on method update in module spacy.language:

update(docs, golds, drop=0.0, sgd=None, losses=None, component_cfg=None) method of spacy.lang.en.English instance
    Update the models in the pipeline.
    
    docs (iterable): A batch of `Doc` objects.
    golds (iterable): A batch of `GoldParse` objects.
    drop (float): The dropout rate.
    sgd (callable): An optimizer.
    losses (dict): Dictionary to update with the loss, keyed by component.
    component_cfg (dict): Config parameters for specific pipeline
        components, keyed by component name.
    
    DOCS: https://spacy.io/api/language#update

Output:

{'textcat': 0.22436044702671132}
{'textcat': 0.41457826484549287}
{'textcat': 0.5661000985640895}
{'textcat': 0.7119002992385974}
{'textcat': 0.8301601885299159}
{'textcat': 0.9572314705652767}
{'textcat': 1.050187804254974}
{'textcat': 1.1268915971417424}
{'textcat': 1.2132206293363608}
{'textcat': 1.3000399094508472}

4. Making predictions

Before predicting, tokenize the input texts with nlp.tokenizer.

texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA"]
docs = [nlp.tokenizer(text) for text in texts]

textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)
print(scores)

Predicted probabilities:

[[9.9999392e-01 6.1252954e-06]
 [4.1843491e-04 9.9958152e-01]]

Print the predicted labels:

predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])

['ham', 'spam']
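For readers unfamiliar with `argmax(axis=1)`, the same lookup can be written without NumPy. This sketch hard-codes the scores printed above and assumes the label tuple is in the order the labels were added ("ham" first, then "spam"); the `argmax` helper is defined here only for illustration.

```python
labels = ("ham", "spam")  # assumed to match the order of textcat.labels
scores = [
    [9.9999392e-01, 6.1252954e-06],
    [4.1843491e-04, 9.9958152e-01],
]

def argmax(row):
    # Index of the largest value in a single row of scores
    return max(range(len(row)), key=lambda i: row[i])

predicted = [labels[argmax(row)] for row in scores]
print(predicted)  # ['ham', 'spam']
```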

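Exercise:

You did such a great job for DeFalco's restaurant in the previous exercise that the chef has hired you for a new project.

The restaurant's menu includes an email address where visitors can give feedback about their food.

The manager wants you to create a tool that automatically forwards all negative reviews to him so he can address them, and all positive reviews to the owner, so the manager can ask for a raise.

You will first build a model to distinguish positive reviews from negative ones using Yelp reviews, because those reviews come with a rating. Your data consists of the text body of each review along with its star rating.

Ratings of 1-2 stars count as "negative" and ratings of 4-5 stars as "positive". Ratings of 3 stars are "neutral" and have been dropped from the data.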
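The labeling rule above can be sketched as a small function. The name `stars_to_sentiment` and the sample reviews are invented for illustration, and the 0/1 encoding is an assumption chosen to resemble a binary sentiment column; the source does not show how the dataset was actually built.

```python
def stars_to_sentiment(stars):
    """Return 1 (positive), 0 (negative), or None for neutral 3-star reviews."""
    if stars >= 4:
        return 1
    if stars <= 2:
        return 0
    return None

# Made-up (text, stars) pairs, for illustration only
raw_reviews = [("Great sushi", 5), ("Cold fries", 1),
               ("It was fine", 3), ("Nice staff", 4)]

# Keep only reviews with a clear sentiment, dropping the neutral ones
labeled = [(text, stars_to_sentiment(stars))
           for text, stars in raw_reviews
           if stars_to_sentiment(stars) is not None]
print(labeled)
```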

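1. Evaluating the approach

The strength of the approach above is that you can distinguish positive emails from negative ones even though you have no historical emails labeled as positive or negative.
The weakness is that emails may differ systematically from Yelp reviews (a different distribution), which lowers the model's accuracy. For example, customers may use different words or slang in emails, and a model trained on Yelp reviews will never have seen those words.
To find out how serious this is, you could compare word frequencies between the two sources. In practice, manually reading a few texts from each source may be enough to judge whether it is a serious problem.
For something fancier, you could build a dataset containing both Yelp reviews and emails and check whether a model can tell a text's source from its content. Ideally that model would perform poorly, because that would mean the two data sources are similar.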
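One simple way to compare word usage between the two sources is to measure how much of the email text consists of words that never appear in the reviews. The sketch below uses tiny made-up samples and hypothetical helper names; a real check would run on the full datasets and probably use a better tokenizer.

```python
import re
from collections import Counter

def word_counts(texts):
    # Count lowercase word occurrences across all texts
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def unseen_fraction(source_counts, target_counts):
    """Fraction of word occurrences in target whose word never appears in source."""
    total = sum(target_counts.values())
    unseen = sum(count for word, count in target_counts.items()
                 if word not in source_counts)
    return unseen / total if total else 0.0

# Made-up samples, for illustration only
yelp_reviews = ["The burger was great", "The soup was cold"]
emails = ["The burger was meh", "Soup was cold tbh"]

ratio = unseen_fraction(word_counts(yelp_reviews), word_counts(emails))
print(f"{ratio:.2f}")  # 0.25
```

A high fraction would suggest that the emails use vocabulary the review-trained model has never seen.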

2. Data preprocessing and modeling

Splitting the dataset

def load_data(csv_file, split=0.9):
    data = pd.read_csv(csv_file)
    
    # Shuffle data
    train_data = data.sample(frac=1, random_state=7)
    
    texts = train_data.text.values
    labels = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)}
              for y in train_data.sentiment.values]
    split = int(len(train_data) * split)
    
    train_labels = [{"cats": label} for label in labels[:split]]
    val_labels = [{"cats": label} for label in labels[split:]]
    
    return texts[:split], train_labels, texts[split:], val_labels

train_texts, train_labels, val_texts, val_labels = load_data('../input/nlp-course/yelp_ratings.csv')

Inspecting the training data

print('Texts from training data\n------')
print(train_texts[:2])
print('\nLabels from training data\n------')
print(train_labels[:2])

Output:

Texts from training data
------
["Some of the best sushi I've ever had....and I come from the East Coast.  Unreal toro, have some of it's available."

 "One of the best burgers I've ever had and very well priced. I got the tortilla burger and is was delicious especially with there tortilla soup!"]

Labels from training data
------
[{'cats': {'POSITIVE': True, 'NEGATIVE': False}},
 {'cats': {'POSITIVE': True, 'NEGATIVE': False}}]

Building the model

import spacy
nlp = spacy.blank('en')  # create an empty model

# Create the TextCategorizer with exclusive classes
# and the "bow" architecture
textcat = nlp.create_pipe('textcat', config={
    "exclusive_classes": True,  # each text belongs to exactly one class
    "architecture": "bow"
})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

# Add NEGATIVE and POSITIVE labels to text classifier
textcat.add_label("NEGATIVE")  # negative reviews
textcat.add_label("POSITIVE")  # positive reviews

3. Training

from spacy.util import minibatch
import random

def train(model, train_data, optimizer, batch_size=8):
    loss = {}
    random.seed(1)
    random.shuffle(train_data)
    
    batches = minibatch(train_data, size=batch_size)
    for batch in batches:
        # train_data is a list of tuples [(text0, label0), (text1, label1), ...]
        # Split batch into texts and labels
        texts, labels = zip(*batch)
        
        # Update model with texts and labels
        model.update(texts, labels, sgd=optimizer, losses=loss)
        
    return loss

Running the training

# Fix seed for reproducibility
spacy.util.fix_random_seed(1)
random.seed(1)

# This may take a while to run!
optimizer = nlp.begin_training()
train_data = list(zip(train_texts, train_labels))
losses = train(nlp, train_data, optimizer)
print(losses['textcat'])

A quick test

text = "This tea cup was full of holes. Do not recommend."
doc = nlp(text)
print(doc.cats)

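Output:

{'NEGATIVE': 0.7731374502182007, 'POSITIVE': 0.22686253488063812}

The review complains about a tea cup full of holes, and the model gives the NEGATIVE class the higher probability.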

4. Making predictions

def predict(nlp, texts): 
    # Use the model's tokenizer to tokenize each input text
    docs = [nlp.tokenizer(text) for text in texts]
    
    # Use textcat to get the scores for each doc
    textcat = nlp.get_pipe('textcat')
    scores, _ = textcat.predict(docs)
    
    # From the scores, find the class with the highest score/probability
    pred_labels = scores.argmax(axis=1)
    
    return pred_labels

5. Evaluating the model

def evaluate(model, texts, labels):
    """ Returns the accuracy of a TextCategorizer model. 
    
        Arguments
        ---------
        model: spaCy model with a TextCategorizer
        texts: Text samples, from load_data function
        labels: True labels, from load_data function
    
    """
    # Get predictions from textcat model (using your predict method)
    predicted_class = predict(model, texts)
    
    # From labels, get the true class as a list of integers (POSITIVE -> 1, NEGATIVE -> 0)
    true_class = [int(label['cats']['POSITIVE']) for label in labels]
    
    # A boolean or int array indicating correct predictions
    correct_predictions = (true_class == predicted_class)
    
    # The accuracy, number of correct predictions divided by all predictions
    accuracy = sum(correct_predictions)/len(true_class)
    
    return accuracy

accuracy = evaluate(nlp, val_texts, val_labels)
print(f"Accuracy: {accuracy:.4f}")

Output: the validation accuracy is 92.39%

Accuracy: 0.9239
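The accuracy calculation can be checked by hand on a toy example. This pure-Python sketch compares element by element with `zip`; the `evaluate` function above instead relies on `predict` returning a NumPy array, which makes `true_class == predicted_class` an elementwise comparison. The function name `accuracy_score` and the toy labels and predictions are made up for illustration.

```python
def accuracy_score(true_labels, predicted_classes):
    # Convert label dicts to integers: POSITIVE -> 1, NEGATIVE -> 0
    true_classes = [int(label["cats"]["POSITIVE"]) for label in true_labels]
    # Count positions where the prediction matches the true class
    correct = sum(t == p for t, p in zip(true_classes, predicted_classes))
    return correct / len(true_classes)

toy_labels = [
    {"cats": {"POSITIVE": True, "NEGATIVE": False}},
    {"cats": {"POSITIVE": False, "NEGATIVE": True}},
    {"cats": {"POSITIVE": True, "NEGATIVE": False}},
    {"cats": {"POSITIVE": False, "NEGATIVE": True}},
]
toy_predictions = [1, 0, 0, 0]  # the third prediction is wrong

print(accuracy_score(toy_labels, toy_predictions))  # 0.75
```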

Training for more iterations

# This may take a while to run!
n_iters = 5
for i in range(n_iters):
    losses = train(nlp, train_data, optimizer)
    accuracy = evaluate(nlp, val_texts, val_labels)
    print(f"Loss: {losses['textcat']:.3f} \t Accuracy: {accuracy:.3f}")

Loss: 6.752 	 Accuracy: 0.940
Loss: 4.105 	 Accuracy: 0.947
Loss: 2.904 	 Accuracy: 0.945
Loss: 2.267 	 Accuracy: 0.946
Loss: 1.826 	 Accuracy: 0.944

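6. Improving the model

There are various hyperparameters to tune here. The most important one is the TextCategorizer architecture.

The model used above is the simplest one. It trains quickly but likely performs worse than the CNN and ensemble models.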
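A rough sketch of how the architecture could be varied: only the config dict passed to the TextCategorizer changes. The names "simple_cnn" and "ensemble" are assumed here to be the spaCy v2 identifiers for the CNN and ensemble models, and `textcat_config` is a hypothetical helper. The spaCy calls are left as comments so the snippet runs without spaCy installed.

```python
def textcat_config(architecture):
    """Build a TextCategorizer config dict for the given architecture name."""
    return {"exclusive_classes": True, "architecture": architecture}

for arch in ("bow", "simple_cnn", "ensemble"):
    config = textcat_config(arch)
    print(config)
    # With spaCy v2 installed, each config could be tried like this
    # (assumed usage, mirroring the model-building code earlier):
    # nlp = spacy.blank("en")
    # textcat = nlp.create_pipe("textcat", config=config)
    # nlp.add_pipe(textcat)
```

Each variant would then be trained and scored on the validation set in the same way as before, so the accuracies can be compared directly.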

The author's CSDN blog (Michael阿明): https://michael.blog.csdn.net/