[NLP (Natural Language Processing)] Text Feature Extraction

Table of contents

1. Text representation methods

2. Count Vectors (Bag of Words model)

3. TF-IDF model

3.1 jieba

Installing jieba:

Using jieba for word segmentation

3.2 Example of text feature extraction with TF-IDF


1. Text representation methods

  • One-hot
  • Bag of Words
  • N-gram

N-gram is similar to Count Vectors, except that combinations of adjacent words are also added as new terms and counted (see the sketch below).

  • TF-IDF

Defects of these text representation methods: the converted vectors have a very high dimension and require a long training time; and the relationships between words are not considered, only their occurrence counts.
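A minimal sketch of the difference between plain bag of words and n-gram counting (the two-document corpus below is invented for illustration; get_feature_names_out is get_feature_names in older scikit-learn versions, as used later in this post):

# Toy comparison: bag of words vs. n-gram counting
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the cat sat', 'the cat sat on the mat']

# Unigrams only: the classic bag of words
bow = CountVectorizer(ngram_range=(1, 1)).fit(corpus)
print(bow.get_feature_names_out())
# ['cat' 'mat' 'on' 'sat' 'the']  -> 5 features

# Unigrams + bigrams: adjacent word pairs become extra features,
# so the vector dimension grows quickly
ngrams = CountVectorizer(ngram_range=(1, 2)).fit(corpus)
print(ngrams.get_feature_names_out())
# ['cat' 'cat sat' 'mat' 'on' 'on the' 'sat' 'sat on' 'the' 'the cat' 'the mat']  -> 10 features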

2. Count Vectors (Bag of Words model)


# CountVectorizer + RidgeClassifier
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t', nrows=15000)

# Count the occurrences of each token and represent the text as a bag of words (the feature set)
# max_features=3000: keep only the 3000 most frequent terms in the corpus
# ngram_range=(1, 3): count unigrams, bigrams and trigrams
vectorizer = CountVectorizer(max_features=3000, ngram_range=(1, 3))
train_text = vectorizer.fit_transform(df['text'])

X_train, X_val, y_train, y_val = train_test_split(train_text, df.label, test_size=0.3)

# Fit a ridge classifier on the training features and labels
clf = RidgeClassifier()
clf.fit(X_train, y_train)
val_pred = clf.predict(X_val)
print(f1_score(y_val, val_pred, average='macro'))

3. TF-IDF model

The TF-IDF score consists of two parts: the first part is term frequency (Term Frequency), and the second part is inverse document frequency (Inverse Document Frequency). The inverse document frequency is obtained by dividing the total number of documents in the corpus by the number of documents containing the word, then taking the logarithm.

  • TF(t) = number of times the word appears in the current document / total number of words in the current document
  • IDF(t) = log_e(total number of documents / number of documents containing the word)

Multiplying TF (term frequency) by IDF (inverse document frequency) gives a word's TF-IDF value. The larger a word's TF-IDF in a document, the more important that word is to the document. Therefore, by computing the TF-IDF of every word in a document and sorting in descending order, the top-ranked words are the document's keywords.
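A quick hand calculation of the two formulas above (all numbers are invented for illustration):

import math

# Assume the word appears 3 times in a 100-word document,
# and 2 of the 10 documents in the corpus contain it.
tf = 3 / 100             # TF(t)  = 0.03
idf = math.log(10 / 2)   # IDF(t) = log_e(5) ≈ 1.609
print(tf * idf)          # TF-IDF ≈ 0.048

Note that sklearn's TfidfVectorizer (used below) computes a smoothed variant, idf = log_e((1 + n) / (1 + df)) + 1, and L2-normalizes each document row, so its values differ slightly from this textbook formula.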

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # obtain the tf-idf matrix (sparse representation)

vectorizer.get_feature_names()

X.toarray()
# toarray() returns the dense matrix of tf-idf values, one row per document

# Print the term with the highest tf-idf value in each document
word = vectorizer.get_feature_names()
#['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

weight = X.toarray()

for i in range(len(weight)):
    w_sort = np.argsort(-weight[i])  # term indices sorted by descending tf-idf
    print('doc: {0}, top tf-idf is: {1}, {2}'.format(corpus[i], word[w_sort[0]], weight[i][w_sort[0]]))

from sklearn.feature_extraction.text import TfidfVectorizer
document = ["I have a pen.",
            "I have an apple."]
tfidf_model = TfidfVectorizer().fit(document)
sparse_result = tfidf_model.transform(document)  # tf-idf matrix in sparse representation
word = tfidf_model.get_feature_names()
word
# ['an', 'apple', 'have', 'pen']

print(sparse_result)  # for document 0, the term with vocabulary index 3 ('pen') has a TF-IDF of 0.8148

# correspondence between terms and columns:
# '''
#   (0, 3)	0.8148024746671689
#   (0, 2)	0.5797386715376657
#   (1, 2)	0.4494364165239821
#   (1, 1)	0.6316672017376245
#   (1, 0)	0.6316672017376245
# '''

3.1 jieba

Before applying TF-IDF to Chinese text, the text must first be segmented into words. The jieba library is used here for word segmentation.

Installing jieba:

Install it directly from a Jupyter notebook code cell:

pip install jieba

Using jieba for word segmentation


import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
text = """我是一条天狗呀!
我把月来吞了,
我把日来吞了,
我把一切的星球来吞了,
我把全宇宙来吞了。
我便是我了!"""

sentences = text.split()  # split on line breaks: one sentence per line
sent_words = [list(jieba.cut(sen)) for sen in sentences]  # segment each sentence with jieba
document = [' '.join(sen) for sen in sent_words]  # re-join tokens with spaces so sklearn can tokenize
print(document)
# ['我 是 一条 天狗 呀 !', '我 把 月 来 吞 了 ,', '我 把 日来 吞 了 ,',
#  '我 把 一切 的 星球 来 吞 了 ,', '我 把 全宇宙 来 吞 了 。', '我 便是 我 了 !']

model = TfidfVectorizer().fit(document)
print(model.vocabulary_)
# {'一条': 1, '天狗': 4, '日来': 5, '一切': 0, '星球': 6, '全宇宙': 3, '便是': 2}

sparse_result = model.transform(document)
print(sparse_result)
'''
  (0, 4)    0.7071067811865476
  (0, 1)    0.7071067811865476
  (2, 5)    1.0
  (3, 6)    0.7071067811865476
  (3, 0)    0.7071067811865476
  (4, 3)    1.0
  (5, 2)    1.0
'''
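Notice that every single-character token (我, 是, 把, 吞, 了, ...) is missing from the vocabulary above: TfidfVectorizer's default token_pattern, r"(?u)\b\w\w+\b", keeps only tokens of two or more characters. A small sketch of how to keep single-character words as well, should they matter for your task:

# Override the default token_pattern so single-character tokens survive
model = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b').fit(document)
print(model.vocabulary_)  # now also contains 我, 是, 把, 吞, 了, ...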

3.2 Example of text feature extraction with TF-IDF

# TF-IDF + RidgeClassifier
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t', nrows=15000)

# tf-idf features over unigrams to trigrams, keeping the 3000 most frequent terms
train_text = TfidfVectorizer(ngram_range=(1, 3), max_features=3000).fit_transform(df.text)

X_train, X_val, y_train, y_val = train_test_split(train_text, df.label, test_size=0.3)

clf = RidgeClassifier()
clf.fit(X_train, y_train)
val_pred = clf.predict(X_val)
print(f1_score(y_val, val_pred, average='macro'))
  • These two models (Count Vectors and TF-IDF) are generally used together with a machine learning model: the vectorizer extracts features from the text, and the machine learning model handles prediction and classification.
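For example, the two steps can be chained into a single sklearn Pipeline, so that one fit/predict call vectorizes the raw text and trains or applies the classifier (a sketch reusing the df loaded above; hyperparameters match the earlier code):

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# raw text in, labels out: the pipeline vectorizes, then classifies
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), max_features=3000),
    RidgeClassifier(),
)

X_train, X_val, y_train, y_val = train_test_split(df.text, df.label, test_size=0.3)
pipe.fit(X_train, y_train)
print(f1_score(y_val, pipe.predict(X_val), average='macro'))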


Origin: blog.csdn.net/m0_51933492/article/details/126980563