Python machine learning - basic concepts and feature engineering


machine learning

Supervised learning: the input data has features and labels, that is, there are standard answers

  • Classification: k-nearest neighbors, Bayesian classification, decision trees and random forests, logistic regression, neural networks

  • Regression: linear regression, ridge regression

  • Sequence labeling: hidden Markov model (not required)

Unsupervised learning: the input data has features but no labels, that is, there is no standard answer

  • Clustering: k-means
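
Among these, k-means is the classic clustering algorithm. A minimal sketch with scikit-learn (toy data, not from the original post):

from sklearn.cluster import KMeans
import numpy as np

# Six 2-D points that form two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)   # cluster index assigned to each sample
print(labels)
print(km.cluster_centers_)   # coordinates of the two centroids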


feature engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to predictive models, thereby improving model accuracy on unseen data.


1. Feature extraction

Conclusions drawn from the demonstrations:

  • Feature extraction applies to non-continuous (categorical) data
  • Feature extraction is used to featurize text and similar data

Dictionary feature extraction: one-hot encoding of dict data


# Feature extraction
# Import the vectorizer
from sklearn.feature_extraction import DictVectorizer

def dictves():
    """
    Process dict data: one-hot encode the categorical fields
    :return: None
    """
    # Instantiate (sparse=False returns a dense array instead of a sparse matrix)
    vec = DictVectorizer(sparse=False)
    # Call fit_transform
    data = vec.fit_transform([{'city': '北京', 'temperature': 100},
                              {'city': '上海', 'temperature': 60},
                              {'city': '深圳', 'temperature': 30}])
    print(data)
    return None

if __name__ == '__main__':
    dictves()
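
For reference, DictVectorizer orders its features alphabetically ('city=上海', 'city=北京', 'city=深圳', 'temperature'), so the printed array should be:

[[  0.   1.   0. 100.]
 [  1.   0.   0.  60.]
 [  0.   0.   1.  30.]]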

Text feature extraction: featurizing text data


# Feature extraction
# Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer

def countvec():
    """
    Featurize English text with bag-of-words counts
    :return: None
    """
    # Instantiate
    vec = CountVectorizer()
    # Call fit_transform
    data = vec.fit_transform(["life is short,i like python",
                              "life is too long,i dislike python"])
    print(data.toarray())  # convert the sparse matrix to an array
    print(vec.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0
    return None

if __name__ == '__main__':
    countvec()
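
With the default token pattern (tokens of at least two characters, so the standalone "i" is dropped), the vocabulary should be ['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too'] and the count matrix:

[[0 1 1 1 0 1 1 0]
 [1 1 1 0 1 1 0 1]]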

jieba word segmentation: featurizing three Chinese paragraphs. CountVectorizer tokenizes on whitespace and punctuation, which does not work for unsegmented Chinese, so the text is first split into words with jieba (a third-party package: pip install jieba).


# Feature extraction
# Import packages
from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cutword():
    # Segment three Chinese paragraphs (jieba.cut returns a generator)
    c11 = jieba.cut("今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。")
    c21 = jieba.cut("我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去。")
    c31 = jieba.cut("如果只用一种方式了解某样事物,你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。")
    # Convert the generators to lists
    content1 = list(c11)
    content2 = list(c21)
    content3 = list(c31)
    # Join each word list into a space-separated string
    c1 = ' '.join(content1)
    c2 = ' '.join(content2)
    c3 = ' '.join(content3)
    return c1, c2, c3

def hanzivec():
    """
    Featurize Chinese text
    :return: None
    """
    c1, c2, c3 = cutword()
    # Instantiate
    vec = CountVectorizer()
    # Call fit_transform
    data = vec.fit_transform([c1, c2, c3])
    print(vec.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0
    print(data.toarray())  # convert the sparse matrix to an array
    return None

if __name__ == '__main__':
    hanzivec()
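
After segmentation each paragraph is a space-separated string (c1 begins roughly with "今天 很 残酷 , 明天 更 残酷 ..."), which CountVectorizer can then tokenize just like English text.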

TF-IDF's role: to evaluate how important a word is to a document within a corpus or collection of documents.
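
The idea, as implemented by scikit-learn with its defaults (smooth_idf=True): tf-idf(t, d) = tf(t, d) × idf(t), with idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t; each document's vector is then L2-normalized.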


# Feature extraction
# Import packages
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba

def cutword():
    # Segment three Chinese paragraphs (identical to the previous example)
    c11 = jieba.cut("今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。")
    c21 = jieba.cut("我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去。")
    c31 = jieba.cut("如果只用一种方式了解某样事物,你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。")
    # Convert the generators to lists
    content1 = list(c11)
    content2 = list(c21)
    content3 = list(c31)
    # Join each word list into a space-separated string
    c1 = ' '.join(content1)
    c2 = ' '.join(content2)
    c3 = ' '.join(content3)
    return c1, c2, c3

def hanzivec():
    """
    Featurize Chinese text with TF-IDF weights
    :return: None
    """
    c1, c2, c3 = cutword()
    # Instantiate
    vec = TfidfVectorizer()
    # Call fit_transform
    data = vec.fit_transform([c1, c2, c3])
    print(vec.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0
    print(data.toarray())  # convert the sparse matrix to an array
    return None

if __name__ == '__main__':
    hanzivec()
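
TfidfVectorizer is equivalent to CountVectorizer followed by a TF-IDF transform: the vocabulary is the same as in the previous example, but the array now contains normalized weights instead of raw counts.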

2. Feature processing


2.1 Normalization: suited to traditional, precise, small-scale data
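
Min-max normalization maps each feature column into a target range (mi, mx):

X' = (x - min) / (max - min)
X'' = X' * (mx - mi) + mi

where min and max are the column's minimum and maximum. This is what MinMaxScaler's feature_range parameter controls below.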


from sklearn.preprocessing import MinMaxScaler

def mm():
    """
    Min-max normalization
    :return: None
    """
    # Scale every feature into the range [2, 3]
    mm = MinMaxScaler(feature_range=(2, 3))
    data = mm.fit_transform([[90, 2, 10, 40], [60, 4, 15, 45], [75, 3, 13, 46]])
    print(data)
    return None

if __name__ == '__main__':
    mm()
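
For reference, the first column maps 90, 60, 75 to 3.0, 2.0, 2.5, and the other columns scale the same way, so the output should be approximately:

[[3.         2.         2.         2.        ]
 [2.         3.         3.         2.83333333]
 [2.5        2.5        2.6        3.        ]]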

Note that the maximum and minimum values vary with the specific scenario, and both are easily distorted by outliers, so this method is not very robust and suits only traditional, precise, small-scale data.


2.2 Standardization: suited to most scenarios

Standardization is relatively stable when there are enough samples, which makes it suitable for modern, noisy, large-scale data.
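
Standardization rescales each feature to zero mean and unit variance: X' = (x - mean) / σ. The original post shows no code for this step; a minimal sketch with scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

def stand():
    """
    Standardization: zero mean and unit variance per feature
    :return: None
    """
    std = StandardScaler()
    data = std.fit_transform([[1., -1., 3.], [2., 4., 2.], [4., 6., -1.]])
    print(data)
    print(std.mean_)  # per-feature means learned from the data
    return None

if __name__ == '__main__':
    stand()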

3. Data dimensionality reduction

Dimensions: the number of features

3.1 Feature selection

Feature selection simply picks a subset of all the extracted features to use as the training features. The feature values may or may not change during selection, but the dimensionality afterward must be lower than before, since only some of the features are kept.


from sklearn.feature_selection import VarianceThreshold

def seltz():
    """
    Feature selection: drop low-variance features
    :return: None
    """
    # The default threshold=0.0 removes features that are constant across all samples
    mm = VarianceThreshold()
    data = mm.fit_transform([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])
    print(data)
    return None

if __name__ == '__main__':
    seltz()
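
The first and last columns are constant (variance 0) and are removed, so the output should be:

[[2 0]
 [1 4]
 [1 1]]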

3.2 Principal component analysis (PCA)

Essence: PCA is a technique for analyzing and simplifying data sets.
Purpose: to compress the data's dimensionality, reducing the dimensionality (complexity) of the original data as much as possible while losing only a small amount of information.
Use: it can reduce the number of features in regression analysis or cluster analysis.
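
In scikit-learn, n_components accepts either an int (the exact number of components to keep) or a float in (0, 1), interpreted as the fraction of variance to retain; n_components=0.94 below keeps enough components to explain at least 94% of the variance.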


from sklearn.decomposition import PCA

def pcaz():
    """
    PCA dimensionality reduction
    :return: None
    """
    # Keep enough components to retain at least 94% of the variance
    mm = PCA(n_components=0.94)
    data = mm.fit_transform([[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]])
    print(data)
    return None

if __name__ == '__main__':
    pcaz()

Case: Supermarket Order Analysis


import pandas as pd

# Load the four tables (the Instacart dataset; adjust the paths to your machine)
prior = pd.read_csv("F:/python/data/order_products__prior.csv")
products = pd.read_csv("F:/python/data/products.csv")
orders = pd.read_csv("F:/python/data/orders.csv")
aisles = pd.read_csv("F:/python/data/aisles.csv")
# Merge the four tables into one (user - item category)
data1 = pd.merge(prior, products, on="product_id")
data2 = pd.merge(data1, orders, on="order_id")
data = pd.merge(data2, aisles, on="aisle_id")
# Cross table (a special grouping tool): one row per user, one column per aisle
cross = pd.crosstab(data["user_id"], data["aisle"])
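
The resulting user × aisle table has one column per aisle (well over a hundred), so a natural next step is the PCA from section 3.2. A sketch, assuming the merge above succeeded:

from sklearn.decomposition import PCA

# Reduce the user-by-aisle count matrix while keeping ~90% of the variance
pca = PCA(n_components=0.9)
reduced = pca.fit_transform(cross)
print(reduced.shape)  # (number of users, number of retained components)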
