Dictionary feature extraction, text feature extraction, jieba word segmentation, and TF-IDF text feature extraction: concepts and code implementation

1. Feature extraction

Feature extraction: converting arbitrary data (such as text or images) into numerical features that can be used for machine learning. Turning data into feature values helps computers understand the data better.

Feature extraction API: sklearn.feature_extraction

Feature extraction falls into three categories:

  1. Dictionary feature extraction (feature discretization)
  2. Text Feature Extraction
  3. Image Feature Extraction

2. Dictionary feature extraction

  • sklearn.feature_extraction.DictVectorizer(sparse=True,…): converts dictionary data into feature values
    • DictVectorizer.fit_transform(X): X is a dictionary or an iterator of dictionaries; returns a sparse matrix
    • DictVectorizer.get_feature_names_out(): returns the category (feature) names
from sklearn.feature_extraction import DictVectorizer

def dict_demo():
    """
    Extract features from dictionary-typed data
    :return: None
    """
    data = [{'city': '北京', 'temperature': 100},
            {'city': '上海', 'temperature': 60},
            {'city': '深圳', 'temperature': 30}]
    transfer = DictVectorizer()  # instantiate a transformer
    # transfer = DictVectorizer(sparse=False)  # instantiate a transformer that returns a dense array
    data = transfer.fit_transform(data)  # call fit_transform to convert the data (note the return format)
    print("Result:\n", data)
    print("Feature names:", transfer.get_feature_names_out())  # print the feature names
    return None

if __name__ == '__main__':
    dict_demo()

Output:
Result:
   (0, 1)	1.0
  (0, 3)	100.0
  (1, 0)	1.0
  (1, 3)	60.0
  (2, 2)	1.0
  (2, 3)	30.0
Feature names: ['city=上海' 'city=北京' 'city=深圳' 'temperature']
-----------------------------
Output with sparse=False:
Result:
 [[  0.   1.   0. 100.]
 [  1.   0.   0.  60.]
 [  0.   0.   1.  30.]]
Feature names: ['city=上海' 'city=北京' 'city=深圳' 'temperature']
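Equivalently, the sparse result can be densified after the fact, since fit_transform returns a scipy sparse matrix. A minimal sketch, reusing the data above:

from sklearn.feature_extraction import DictVectorizer

data = [{'city': '北京', 'temperature': 100},
        {'city': '上海', 'temperature': 60},
        {'city': '深圳', 'temperature': 30}]
transfer = DictVectorizer()                   # sparse=True by default
sparse_result = transfer.fit_transform(data)
print(sparse_result.toarray())                # same dense array as with sparse=False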

One-hot encoding generates a Boolean column for each category value. One-hot encoding is generally applied to features that carry categorical information.
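For comparison, the same kind of one-hot encoding can be produced on a bare column of categories with sklearn's OneHotEncoder. This is a minimal sketch, not part of the original example:

from sklearn.preprocessing import OneHotEncoder

# One-hot encode the 'city' column on its own; each distinct city
# becomes its own Boolean column, as with DictVectorizer above.
cities = [['北京'], ['上海'], ['深圳']]        # 2-D input: one row per sample
encoder = OneHotEncoder()
encoded = encoder.fit_transform(cities)       # sparse matrix by default
print(encoded.toarray())                      # [[0. 1. 0.], [1. 0. 0.], [0. 0. 1.]]
print(encoder.get_feature_names_out())        # ['x0_上海' 'x0_北京' 'x0_深圳']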

3. Text Feature Extraction

  • sklearn.feature_extraction.text.CountVectorizer(stop_words=[]): extracts feature values from text data and returns a word-frequency matrix. stop_words lists the stop words, i.e. words you do not want counted; single letters and punctuation marks are never counted.

    • CountVectorizer.fit_transform(X): X is text or an iterable of text strings; returns a sparse matrix
    • CountVectorizer.get_feature_names_out(): returns the list of words (feature names)
  • sklearn.feature_extraction.text.TfidfVectorizer (covered in section 5 below)

An example application follows:

from sklearn.feature_extraction.text import CountVectorizer

def text_count_demo():
    data = ["life is short,i like like python", "life is too long,i dislike python"]
    # Instantiate a transformer
    # transfer = CountVectorizer(sparse=False)  # note: CountVectorizer has no sparse parameter
    transfer = CountVectorizer()
    # Call fit_transform to convert the data (note the return format; use toarray() to convert the sparse matrix to an array)
    transfer_data = transfer.fit_transform(data)
    print('transfer_data:\n', transfer_data)
    print("Text feature extraction result:\n", transfer_data.toarray())
    print("Feature names:\n", transfer.get_feature_names_out())
    return None

if __name__ == '__main__':
    text_count_demo()

Output:
transfer_data:
   (0, 2)	1
  (0, 1)	1
  (0, 6)	1
  (0, 3)	2
  (0, 5)	1
  (1, 2)	1
  (1, 1)	1
  (1, 5)	1
  (1, 7)	1
  (1, 4)	1
  (1, 0)	1
Text feature extraction result:
 [[0 1 1 2 0 1 1 0]
 [1 1 1 0 1 1 0 1]]
Feature names:
 ['dislike' 'is' 'life' 'like' 'long' 'python' 'short' 'too']
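The stop_words parameter described above can be seen in action with a minimal sketch (the stop-word choices here are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# 'is' and 'too' are excluded from the vocabulary and never counted.
data = ["life is short,i like like python", "life is too long,i dislike python"]
transfer = CountVectorizer(stop_words=['is', 'too'])
print(transfer.fit_transform(data).toarray())
print(transfer.get_feature_names_out())
# Expected feature names: ['dislike' 'life' 'like' 'long' 'python' 'short']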

-------------------------------- Chinese ----------------------------------
Change data to: data = ["人 生 苦短,我喜欢 Python", "人生太长,我不喜欢Python"]
The output is:
transfer_data:
   (0, 4)	1
  (0, 3)	1
  (0, 0)	1
  (1, 1)	1
  (1, 2)	1
Text feature extraction result:
 [[1 0 0 1 1]
 [0 1 1 0 0]]
Feature names:
 ['python' '人生太长' '我不喜欢python' '我喜欢' '苦短']

As the output shows, English is tokenized by whitespace by default, while the Chinese text is only split at commas and spaces, and single Chinese characters are not counted. In other words, CountVectorizer does not support Chinese word segmentation on its own.

4. jieba word segmentation processing

Installation command: pip install jieba
  • jieba.cut(): returns a generator of words
  • Usage:
    • Prepare the sentences and segment them with jieba.cut
    • Instantiate CountVectorizer
    • Join each segmentation result into a space-separated string and use it as the input to fit_transform (see the sketch below)
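A minimal sketch of jieba.cut on its own, showing the generator being materialized and joined (the sentence is illustrative):

import jieba

# jieba.cut returns a generator; join its words with spaces so that
# CountVectorizer's whitespace-based tokenizer can process the result.
words = jieba.cut("我喜欢Python")
print(' '.join(words))  # e.g. 我 喜欢 Python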

A complete example follows:

from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cut_words(text):
    result = list(jieba.cut(text))
    print('list(jieba.cut(text)) result:', result)
    txt = ' '.join(result)
    return txt

def text_count_demo():  # extract features from Chinese text
    data = ["愿中国青年都能摆脱冷气,只是向上走,不必听自暴自弃者流的话。能做事的做事,能发声的发声。", "有一分热,发一分光,就令萤火一般,也可以在黑暗里发一点光,不必等候炬火。此后如竟没有炬火:我便是唯一的光。"]
    txt_list = []
    for i in data:
        txt_list.append(cut_words(i))
    print('txt_list:', txt_list)
    # Instantiate a transformer
    # transfer = CountVectorizer(sparse=False)  # note: CountVectorizer has no sparse parameter
    transfer = CountVectorizer()
    # Call fit_transform to convert the data (note the return format; use toarray() to convert the sparse matrix to an array)
    transfer_data = transfer.fit_transform(txt_list)
    # print('transfer_data:\n', transfer_data)
    print("Text feature extraction result:\n", transfer_data.toarray())
    print("Feature names:\n", transfer.get_feature_names_out())
    return None

if __name__ == '__main__':
    text_count_demo()

Output:
list(jieba.cut(text)) result: ['愿', '中国', '青年', '都', '能', '摆脱', '冷气', ',', '只是', '向上', '走', ',', '不必', '听', '自暴自弃', '者', '流', '的话', '。', '能', '做事', '的', '做事', ',', '能', '发声', '的', '发声', '。']
list(jieba.cut(text)) result: ['有', '一分', '热', ',', '发一分光', ',', '就', '令', '萤火', '一般', ',', '也', '可以', '在', '黑暗', '里发', '一点', '光', ',', '不必', '等候', '炬火', '。', '此后', '如竟', '没有', '炬火', ':', '我', '便是', '唯一', '的', '光', '。']
txt_list: ['愿 中国 青年 都 能 摆脱 冷气 , 只是 向上 走 , 不必 听 自暴自弃 者 流 的话 。 能 做事 的 做事 , 能 发声 的 发声 。', '有 一分 热 , 发一分光 , 就 令 萤火 一般 , 也 可以 在 黑暗 里发 一点 光 , 不必 等候 炬火 。 此后 如竟 没有 炬火 : 我 便是 唯一 的 光 。']
Text feature extraction result:
 [[0 0 0 1 1 0 2 1 0 2 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0]
 [1 1 1 1 0 1 0 0 1 0 0 1 0 1 1 0 1 1 2 0 1 0 1 1 0 1]]
Feature names:
 ['一分' '一点' '一般' '不必' '中国' '便是' '做事' '冷气' '发一分光' '发声' '只是' '可以' '向上' '唯一'
 '如竟' '摆脱' '此后' '没有' '炬火' '的话' '等候' '自暴自弃' '萤火' '里发' '青年' '黑暗']

5. TF-IDF text feature extraction

  • The main idea of TF-IDF: if a word or phrase appears frequently in one article but rarely in other articles, it is considered to have good discriminative power between categories and to be suitable for classification
  • Purpose of TF-IDF: to evaluate how important a word is to a document within a document set or corpus
  • Importance: an early-stage data-processing method used when applying classification machine learning algorithms to article classification
  • Formula
    • Term frequency (tf): the frequency with which a given word appears in a document
    • Inverse document frequency (idf): a measure of the general importance of a word. The idf of a specific word is obtained by dividing the total number of documents by the number of documents containing the word, then taking the base-10 logarithm of the quotient
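In symbols (the notation here is ours: N is the total number of documents, and df_i is the number of documents containing word i):

 \operatorname{idf}_{i}=\lg \frac{N}{\mathrm{df}_{i}}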

Multiplying the two gives the tf-idf weight of word i in document j:

 \operatorname{tfidf}_{i,j}=\operatorname{tf}_{i,j} \times \operatorname{idf}_{i}

The result can be understood as the degree of importance of word i to document j. Note that scikit-learn's TfidfVectorizer uses a smoothed, natural-log variant of idf and L2-normalizes each row by default, so its output will not exactly match this textbook formula.

Example: Suppose an article contains 100 words in total and the word "very" appears 5 times; the term frequency of "very" in that document is 5/100 = 0.05. The inverse document frequency is computed from the total number of documents in the document set divided by the number of documents in which "very" appears. So, if "very" occurs in 10,000 documents and the total number of documents is 10,000,000, the inverse document frequency is lg(10,000,000 / 10,000) = lg(1,000) = 3. Finally, the tf-idf score of "very" for this document is 0.05 × 3 = 0.15.
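A quick sketch verifying the arithmetic of the worked example with the classic formula:

import math

# Classic tf-idf for the worked example above.
tf = 5 / 100                            # 'very' appears 5 times among 100 words
idf = math.log10(10_000_000 / 10_000)   # base-10 logarithm, as defined above
print(tf, idf, tf * idf)                # 0.05 3.0 ≈ 0.15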

Example:

from sklearn.feature_extraction.text import TfidfVectorizer
import jieba

def cut_words(text):
    result = list(jieba.cut(text))
    print('list(jieba.cut(text)) result:', result)
    txt = ' '.join(result)
    return txt

def text_count_demo():  # extract features from Chinese text
    data = ["愿中国青年都能摆脱冷气,只是向上走,不必听自暴自弃者流的话。能做事的做事,能发声的发声。", "有一分热,发一分光,就令萤火一般,也可以在黑暗里发一点光,不必等候炬火。此后如竟没有炬火:我便是唯一的光。"]
    txt_list = []
    for i in data:
        txt_list.append(cut_words(i))
    print('txt_list:', txt_list)
    # Instantiate a transformer
    transfer = TfidfVectorizer()
    # Call fit_transform to convert the data (note the return format; use toarray() to convert the sparse matrix to an array)
    transfer_data = transfer.fit_transform(txt_list)
    # print('transfer_data:\n', transfer_data)
    print("TF-IDF weights of the text features:\n", transfer_data.toarray())
    print("Feature names:\n", transfer.get_feature_names_out())
    return None

if __name__ == '__main__':
    text_count_demo()

Output:
list(jieba.cut(text)) result: ['愿', '中国', '青年', '都', '能', '摆脱', '冷气', ',', '只是', '向上', '走', ',', '不必', '听', '自暴自弃', '者', '流', '的话', '。', '能', '做事', '的', '做事', ',', '能', '发声', '的', '发声', '。']
list(jieba.cut(text)) result: ['有', '一分', '热', ',', '发一分光', ',', '就', '令', '萤火', '一般', ',', '也', '可以', '在', '黑暗', '里发', '一点', '光', ',', '不必', '等候', '炬火', '。', '此后', '如竟', '没有', '炬火', ':', '我', '便是', '唯一', '的', '光', '。']
txt_list: ['愿 中国 青年 都 能 摆脱 冷气 , 只是 向上 走 , 不必 听 自暴自弃 者 流 的话 。 能 做事 的 做事 , 能 发声 的 发声 。', '有 一分 热 , 发一分光 , 就 令 萤火 一般 , 也 可以 在 黑暗 里发 一点 光 , 不必 等候 炬火 。 此后 如竟 没有 炬火 : 我 便是 唯一 的 光 。']
TF-IDF weights of the text features:
 [[0.         0.         0.         0.17512809 0.24613641 0.
  0.49227283 0.24613641 0.         0.49227283 0.24613641 0.
  0.24613641 0.         0.         0.24613641 0.         0.
  0.         0.24613641 0.         0.24613641 0.         0.
  0.24613641 0.        ]
 [0.23245605 0.23245605 0.23245605 0.1653944  0.         0.23245605
  0.         0.         0.23245605 0.         0.         0.23245605
  0.         0.23245605 0.23245605 0.         0.23245605 0.23245605
  0.4649121  0.         0.23245605 0.         0.23245605 0.23245605
  0.         0.23245605]]
Feature names:
 ['一分' '一点' '一般' '不必' '中国' '便是' '做事' '冷气' '发一分光' '发声' '只是' '可以' '向上' '唯一'
 '如竟' '摆脱' '此后' '没有' '炬火' '的话' '等候' '自暴自弃' '萤火' '里发' '青年' '黑暗']

