Machine learning: feature engineering, text feature extraction, and data feature preprocessing

1. Feature Engineering Definition

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive model, thereby improving the model's accuracy on unseen data.

2. Dictionary feature extraction

from sklearn.feature_extraction import DictVectorizer

# Let's look at the result of extracting features from the dicts in x
def dict_vec():
    dv = DictVectorizer()  # renamed from dict to avoid shadowing the builtin
    x = [{'city': '北京', 'temperature': 100},
         {'city': '上海', 'temperature': 60},
         {'city': '深圳', 'temperature': 30}]
    data = dv.fit_transform(x)
    print(data)
    return None
    
# Note: the result is a sparse matrix; each line gives the (row, column) coordinates followed by the value
  (0, 1)	1.0
  (0, 3)	100.0
  (1, 0)	1.0
  (1, 3)	60.0
  (2, 2)	1.0
  (2, 3)	30.0
# To get the familiar dense 2-D array instead, change one line:

dv = DictVectorizer(sparse=False)

# The result is:
[[  0.   1.   0. 100.]
 [  1.   0.   0.  60.]
 [  0.   0.   1.  30.]]
# Note that feature extraction did not change the numeric values; it split the 'city' field into three columns,
# using 0/1 values to indicate whether each sample has that city (one-hot encoding)
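To confirm which column corresponds to which city, you can print the learned feature names; a minimal sketch, assuming dv is the DictVectorizer fitted above:

print(dv.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.2
# expected: ['city=上海', 'city=北京', 'city=深圳', 'temperature']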

3. Text feature extraction

(1) Counting word frequencies in English text

from sklearn.feature_extraction.text import CountVectorizer
# y is a list containing two text strings

def count_vec():
    cv = CountVectorizer()
    y = ["There is no place for you to go", "Nothing is impossible"]
    data = cv.fit_transform(y)
    print(cv.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.2
    print(data.toarray())
    return None
    
# Let's look at the result
['for', 'go', 'impossible', 'is', 'no', 'nothing', 'place', 'there', 'to', 'you']
[[1 1 0 1 1 0 1 1 1 1]
 [0 0 1 1 0 1 0 0 0 0]]
# Each word is extracted and its occurrences counted (single-letter words such as "I" are not counted; see the sketch below)
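Single-letter words are dropped because the default token_pattern only matches tokens of two or more word characters. If you do want them counted, you can override it; a minimal sketch:

# The default token_pattern is r"(?u)\b\w\w+\b" (2+ word characters)
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # also keep single-letter words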

(2) Extracting features from Chinese text

# One thing to note: Chinese input can be counted too,
# but word extraction here relies on whitespace, which works for English but not for Chinese.
# We therefore need to insert spaces between Chinese words ourselves.
# The recommended tool is jieba:
#   import jieba
#   1. segment the text with jieba.cut
#   2. convert the generator to a list with list()
#   3. join the list into a string with ' '.join(...)
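A minimal sketch of these three steps, using an arbitrary example sentence:

import jieba

text = "今天很残酷,明天更残酷。"   # example sentence
words = list(jieba.cut(text))      # jieba.cut returns a generator; turn it into a list
joined = " ".join(words)           # join with spaces so CountVectorizer can tokenize it
print(joined)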

(3) How to better capture the importance of words in text

TF-IDF  # evaluates how important a word is to one document in a collection or corpus,
# rather than simply counting raw frequencies
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba


def cutword():
    # First segment the Chinese text, using the approach from (2)
    con1 = jieba.cut("今天很残酷,明天更残酷。")
    con2 = jieba.cut("我们看到的从很远星系来的光是在几百万年之前发出的.")
    con3 = jieba.cut("如果只用一种方式了解某样事物,你就不会真正了解它。")
    # Convert the generators to lists
    content1 = list(con1)
    content2 = list(con2)
    content3 = list(con3)
    # Join the lists into space-separated strings
    c1 = ' '.join(content1)
    c2 = ' '.join(content2)
    c3 = ' '.join(content3)
    return c1, c2, c3


def tfidfvec():

    c1, c2, c3 = cutword()
    print(c1, c2, c3)
    tf = TfidfVectorizer()
    data = tf.fit_transform([c1, c2, c3])
    print(tf.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.2
    print(data.toarray())
    return None
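For reference, with its default settings (smooth_idf=True, norm='l2') TfidfVectorizer computes a smoothed idf and then L2-normalizes each document row:

# tf(t, d)     = count of term t in document d
# idf(t)       = ln((1 + n) / (1 + df(t))) + 1   # n = number of documents, df(t) = documents containing t
# tf-idf(t, d) = tf(t, d) * idf(t)               # each document row is then L2-normalized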


4. Data feature preprocessing

(1) What it is: converting data into the form required by the algorithm.

(2) Standardization, normalization, and missing value processing

import numpy as np  # needed for np.nan below
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

def mm():  # both standardization and normalization expect a 2-D array as input
    mm = MinMaxScaler(feature_range=(2, 5))
    x = [[90, 2, 10, 40], [60, 4, 15, 45], [75, 3, 13, 46]]
    data = mm.fit_transform(x)
    print(data)
    return None
# Min-max normalization is easily distorted by outliers, so standardization is used more often in practice
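For reference, MinMaxScaler scales each column to [0, 1] and then maps it to the requested feature_range:

# X_std    = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# X_scaled = X_std * (max - min) + min           # (min, max) = feature_range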

def stand():
    std = StandardScaler()
    x = [[1., -1., 3.], [2., 4., 2.], [4., 6., -1.]]
    data = std.fit_transform(x)
    print(data)
    return None
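StandardScaler instead transforms each column to zero mean and unit variance:

# z = (x - mean) / std    # mean and std are computed per column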


def im():
    im = SimpleImputer(missing_values=np.nan, strategy='mean')
    x = [[1, 2], [np.nan, 3], [7, 6]]
    data = im.fit_transform(x)
    print(data)
    return None
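For the x above, the nan in the first column should be replaced by that column's mean, (1 + 7) / 2 = 4, so im() is expected to print [[1. 2.] [4. 3.] [7. 6.]].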

Origin: blog.csdn.net/tjjyqing/article/details/113930818