Machine Learning - Getting Started

Introduction to Machine Learning

  • Machine learning is an approach to artificial intelligence
  • Deep learning is a method of machine learning developed from neural networks

Machine learning: automatically analyze data to obtain a model, then use the model to make predictions on unknown data.

The format of a dataset:

Feature values + target value

For example, the various attributes of a house are the feature values, and the house price is the target value.

Note:

  • Each row of data is called a sample
  • Some datasets have no target value - these are handled by clustering

The relationship between deep learning and machine learning:

Deep learning is machine learning using deep neural networks.

Machine learning includes a structure called the neural network; a multi-layered neural network is called deep learning, where "deep" refers to having many layers.

Classification of Machine Learning Algorithms:

With a target value - supervised learning

  • The target value is a class (e.g. cat, dog) - classification problem
  • The target value is continuous data (e.g. housing prices) - regression problem

No target value - unsupervised learning

Machine learning development process: obtain data → process data → feature engineering → train a model with a machine learning algorithm → model evaluation → application.

Dataset usage:

Commonly used dataset sources include sklearn, Kaggle and UCI. Here is an example using sklearn:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def datasets_demo():
    # Load the dataset
    iris = load_iris()  # load_* loads small built-in datasets; fetch_* downloads large datasets
    print("Iris dataset:\n", iris)
    print("Dataset description:\n", iris.DESCR)  # besides attribute access, dict-style access also works: iris["DESCR"]
    print("Feature names:\n", iris.feature_names)
    print("Feature values:\n", iris.data, iris.data.shape)

    # Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=22)
    # The four return values are, in order: training-set features, test-set features, training-set targets, test-set targets
    # test_size is the proportion of the test set (float), default 0.25; most of the data is used for training,
    # and the test set (usually 20-30%) is used for model evaluation
    # random_state is the random seed (different seeds give different random splits, the same seed gives the same split);
    # to compare different algorithms fairly later, keep the split identical by using the same seed
    print("Training-set features:\n", x_train, x_train.shape)

    return None


if __name__ == "__main__":
    datasets_demo()

Feature engineering:

Feature engineering is the process of using domain knowledge and skills to process data so that the features work better with machine learning algorithms.

Data and features determine the upper limit of machine learning, while models and algorithms only approach this upper limit.

Feature extraction: 

Convert arbitrary data (text or images) into numerical features that can be used for machine learning.

The purpose of feature extraction is to allow the computer to better understand the data.

  • Dictionary feature extraction (feature discretization)
  • Text Feature Extraction
  • Image feature extraction (covered again with deep learning)

matrix: two-dimensional array

vector: one-dimensional array

Category ——> one-hot encoding

If categories are encoded directly as numbers, the magnitude of the numbers wrongly implies an ordering among the categories. To treat every category equally, use as many positions as there are categories, set the position of the sample's category to 1 and all others to 0; this is one-hot encoding. A minimal sketch is shown below.
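As a quick illustration (a minimal sketch, not part of the original article; the city names are made-up example data), sklearn's OneHotEncoder produces exactly this encoding:

from sklearn.preprocessing import OneHotEncoder

# Hypothetical example data: a single categorical feature with three categories
cities = [["Beijing"], ["Shanghai"], ["Shenzhen"], ["Beijing"]]

encoder = OneHotEncoder()
onehot = encoder.fit_transform(cities)   # returns a sparse matrix by default
print(encoder.categories_)               # the categories found in the data
print(onehot.toarray())                  # each row has a single 1 in its category's position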

Dictionary feature extraction 

from sklearn.feature_extraction import DictVectorizer


def dict_demo():
    data = [{'city': '北京', 'temperature': 10}, {'city': '上海', 'temperature': 15}, {'city': '深圳', 'temperature': 20}]
    # Instantiate a transformer
    transfer = DictVectorizer(sparse=False)  # by default a sparse matrix is returned (only non-zero values are stored with their positions, saving memory and speeding up loading)
    # Call fit_transform() to perform the conversion
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    print("Feature names:\n", transfer.get_feature_names_out())

    return None


if __name__ == "__main__":
    dict_demo()

Text Feature Extraction

Example of English feature extraction:

from sklearn.feature_extraction.text import CountVectorizer


def count_demo():
    data = ["I like C++,C++ like me", "I like python,python also like me"]  # words are usually used as features; single-letter words are ignored
    # For Chinese text, words must first be separated by spaces (tokenization), e.g. data = ["我 爱 中国"]; single Chinese characters are likewise ignored

    # Instantiate a transformer
    transfer = CountVectorizer()  # CountVectorizer() counts how many times each feature word appears in each sample; it has no sparse=False parameter
    # transfer = CountVectorizer(stop_words=["also", "me"]) would remove "also" and "me" from the feature words, marking them as unsuitable features

    # Call fit_transform() to perform the conversion
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new.toarray())  # calling toarray() is equivalent to sparse=False
    print("Feature words:\n", transfer.get_feature_names_out())

    return None


if __name__ == "__main__":
    count_demo()

Chinese feature extraction example:

from sklearn.feature_extraction.text import CountVectorizer
import jieba


def cut_word(text):
    return " ".join(list(jieba.cut(text)))  # convert to a list, then join into a space-separated string; jieba is a Chinese word-segmentation library


def count_demo():
    data = ["我爱广东", "我爱中国"]
    data_new = []
    for sent in data:
        data_new.append(cut_word(sent))
    # Instantiate a transformer
    transfer = CountVectorizer()

    # Call fit_transform() to perform the conversion
    data_final = transfer.fit_transform(data_new)
    print("data_new:\n", data_final.toarray())  # calling toarray() is equivalent to sparse=False
    print("Feature words:\n", transfer.get_feature_names_out())

    return None


if __name__ == "__main__":
    jieba.setLogLevel(jieba.logging.INFO)  # suppress jieba's startup log messages
    count_demo()

Key words: words that appear frequently in articles of one category but rarely in articles of other categories.

Tf-idf text feature extraction

  • The main idea of TF-IDF: if a word or phrase appears with high probability in one article but rarely in other articles, it is considered to have good power to discriminate between categories and to be suitable for classification.
  • The role of TF-IDF: to evaluate how important a word is to a document in a document set or corpus.

Formula:

  • Term frequency (tf) is the frequency with which a given word appears in a document.
  • Inverse document frequency (idf) measures the general importance of a word. The idf of a word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the base-10 logarithm of the quotient.

        tfidf_{i,j}=tf_{i,j}{\times}idf_{i,j} 

The final result can be understood as the degree of importance. 
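A quick worked example with made-up numbers: suppose a word appears 5 times in a 100-word document, and appears in 10 of the 10,000,000 documents in the corpus. Then

tf=\frac{5}{100}=0.05,\qquad idf=\lg\frac{10000000}{10}=6,\qquad tfidf=0.05\times 6=0.3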

from sklearn.feature_extraction.text import TfidfVectorizer
import jieba


def cut_word(text):
    return " ".join(list(jieba.cut(text)))


def tfidf_demo():
    data = ["相遇,是一种美丽,像一座小城向晚,映着夕阳的绚烂。",
            "对执着的人来说,最难莫过于放弃,在间断间续的挣扎中,感谢时间的治愈。",
            "过去无法重写,但它却让我更加坚强。感谢每一次改变,每一次心碎,每一块伤疤。"]
    data_new = []
    for sent in data:
        data_new.append(cut_word(sent))
    # Instantiate a transformer
    transfer = TfidfVectorizer()

    # Call fit_transform() to perform the conversion
    data_final = transfer.fit_transform(data_new)
    print("data_new:\n", data_final.toarray())
    print("Feature words:\n", transfer.get_feature_names_out())

    return None


if __name__ == "__main__":
    jieba.setLogLevel(jieba.logging.INFO)  # suppress jieba's startup log messages
    tfidf_demo()

Feature preprocessing

Feature preprocessing is the process of converting feature data, through conversion functions, into feature data that is more suitable for the algorithm model.

Making numeric data dimensionless:

  • Normalization
  • Standardization

Why normalize/standardize?

When feature units or magnitudes differ greatly, or the variance of one feature is several orders of magnitude larger than that of the others, that feature can easily dominate the target result, preventing some algorithms from learning from the other features.

Test data (data.txt):

height,weight,sex
178,60,1
173,60,2
180,65,1
182,70,1
168,55,2

Normalization

Transform the original data to map it into an interval (default [0,1]).

x^{'}=\frac{x-min}{max-min},\qquad x^{''}=x^{'}\times(mx-mi)+mi

This acts on each column: max is the maximum of the column, min is the minimum of the column, mx and mi are the bounds of the target interval (default mx = 1 and mi = 0), and x^{''} is the final result.
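A quick worked check with the height column of the test data above (min = 168, max = 182): the first sample, 178, becomes

x^{'}=\frac{178-168}{182-168}\approx 0.714,\qquad x^{''}=0.714\times(1-0)+0=0.714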

from sklearn.preprocessing import MinMaxScaler
import pandas as pd


def minmax_demo():
    # Read the data
    data = pd.read_csv("data.txt")
    data = data.iloc[:, :2]  # keep all rows and only the first two columns
    print("data:\n", data)

    # Instantiate a transformer
    transfer = MinMaxScaler()  # equivalent to the default transfer = MinMaxScaler(feature_range=(0, 1)), i.e. the interval [0, 1]

    # Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)

    return None


if __name__ == "__main__":
    minmax_demo()

If the maximum or minimum value happens to be an outlier, the normalization result will be inaccurate. This method is not very robust and is only suitable for traditional small-scale, clean-data scenarios.

Standardization

Transform the original data so that it has a mean of 0 and a standard deviation of 1.

x^{'}=\frac{x-\bar{x}}{\sigma }

\bar{x} is the mean and \sigma is the standard deviation of the column.
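A quick worked check with the height column of the test data above: the mean is 176.2 and the population standard deviation (which is what sklearn's StandardScaler uses) is about 5.08, so the first sample, 178, becomes

x^{'}=\frac{178-176.2}{5.08}\approx 0.35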

If the amount of data is large, a small number of outliers have little effect on the mean and standard deviation. 

from sklearn.preprocessing import StandardScaler
import pandas as pd


def stand_demo():
    # Read the data
    data = pd.read_csv("data.txt")
    data = data.iloc[:, :2]  # keep all rows and only the first two columns
    print("data:\n", data)

    # Instantiate a transformer
    transfer = StandardScaler()

    # Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)

    return None


if __name__ == "__main__":
    stand_demo()

It is relatively stable when there are enough samples, and it is suitable for modern noisy big data scenarios.

Feature dimensionality reduction

Dimensionality reduction is the process of reducing the number of random variables (features) under certain constraints to obtain a set of "uncorrelated" principal variables (i.e. the features are uncorrelated with one another).

Too many features cause data redundancy, so dimensionality reduction is needed. Features are what the model learns from during training, and strongly correlated features (for example, relative humidity and rainfall) can have a large impact on the algorithm's results.

Feature selection

Filter methods:

  • Variance selection method
  • Correlation coefficient

Embedded methods:

  • Decision tree
  • Regularization
  • Deep learning

Variance Selection Method: Low Variance Feature Filtering

  • Small feature variance: most samples have similar values for the feature - for example, "whether a bird has claws" has variance 0 and is removed
  • Large feature variance: the feature's values differ across many samples - kept

from sklearn.feature_selection import VarianceThreshold
import pandas as pd


def variance_demo():
    # Read the data
    data = pd.read_csv("data.txt")
    print("data:\n", data)

    # Instantiate a transformer
    transfer = VarianceThreshold(threshold=1)  # features with variance below the threshold are dropped; default threshold=0

    # Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)

    return None


if __name__ == "__main__":
    variance_demo()

Correlation coefficient

Example: the Pearson correlation coefficient - for continuous data

r=\frac{n\sum xy-\sum x\sum y}{\sqrt{n\sum x^{2}-(\sum x)^{2}}\sqrt{n\sum y^{2}-(\sum y)^{2}}}

Here -1\leqslant r\leqslant 1: when r > 0 the two variables are positively correlated, when r < 0 they are negatively correlated, and the closer |r| is to 1 the stronger the linear correlation (the closer to 0, the weaker).

from scipy.stats import pearsonr
import pandas as pd


def pearsonr_demo():
    # Read the data
    data = pd.read_csv("data.txt")
    print("data:\n", data)

    # Compute the Pearson correlation coefficient between the two variables
    r = pearsonr(data["height"], data["weight"])
    print("Correlation coefficient:\n", r)  # the first value is the correlation coefficient, the second is its p-value; the smaller the p-value, the more significant the correlation

    return None


if __name__ == "__main__":
    pearsonr_demo()

Principal Component Analysis (PCA)

  • Definition: the process of transforming high-dimensional data into low-dimensional data, during which original variables may be discarded and new variables created.
  • Function: data dimension compression - reduce the dimensionality (complexity) of the original data as much as possible at the cost of losing a small amount of information (i.e. retain as much information as possible).

from sklearn.decomposition import PCA


def pca_demo():
    # Prepare the data
    data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]

    # Instantiate a transformer
    transfer = PCA(n_components=2)  # an integer n_components means reduce to that many features (dimensions); a float means keep that fraction of the information

    # Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)

    return None


if __name__ == "__main__":
    pca_demo()

Classification algorithms

K-nearest neighbor algorithm (KNN)

The K-nearest neighbor algorithm is also called KNN. Its principle: if most of the k samples most similar to a given sample (i.e. its nearest neighbors in the feature space) belong to a certain category, then the sample also belongs to that category.

The distance between two samples can be measured with the Euclidean distance; for a(a1,a2,a3) and b(b1,b2,b3):

d=\sqrt{(a1-b1)^{2}+(a2-b2)^{2}+(a3-b3)^{2}}
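For instance, with two made-up points a(1, 2, 3) and b(4, 6, 3):

d=\sqrt{(1-4)^{2}+(2-6)^{2}+(3-3)^{2}}=\sqrt{9+16+0}=5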

If the k value is too small, it is easily affected by outliers, and if it is too large, it is easily affected by sample imbalance. 

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn_demo():
    # 1) Load the data
    iris = load_iris()

    # 2) Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)  # the first two arguments are the features and the targets

    # 3) Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)  # fit_transform = fit (compute mean and std) + transform (scale the data with them)
    x_test = transfer.transform(x_test)  # the test set must be scaled with the training set's mean and std, so only transform is called

    # 4) KNN estimator
    estimator = KNeighborsClassifier(n_neighbors=3)  # n_neighbors=3 means k = 3; the default is 5
    estimator.fit(x_train, y_train)  # fit here trains the model

    # 5) Model evaluation
    # Method 1: compare the predicted values with the true values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Compare true and predicted values:\n", y_test == y_predict)  # element-wise comparison; equal entries return True
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)  # computes the prediction accuracy, on top of Method 1's comparison
    print("Accuracy:\n", score)

    return None


if __name__ == "__main__":
    knn_demo()

Model selection and tuning

  • Cross-validation
  • Hyperparameter search

Cross-validation

The purpose is to make the model results obtained from training more reliable. Method: split the training data into a training set and a validation set. For example, divide the data into 4 parts and use one part as the validation set; then run 4 rounds (groups) of tests, each time with a different validation set, obtaining 4 sets of model results whose average is taken as the final result. This is called 4-fold cross-validation. A minimal sketch follows the list below.

  • Training set: training set + validation set
  • Test set: test set
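A minimal sketch of k-fold cross-validation on the iris data (this helper is not part of the original article; cross_val_score is sklearn's convenience function for it):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
estimator = KNeighborsClassifier(n_neighbors=3)

# cv=4 means 4-fold cross-validation: 4 rounds, each with a different fold as the validation set
scores = cross_val_score(estimator, iris.data, iris.target, cv=4)
print(scores)         # one score per fold
print(scores.mean())  # the average is taken as the final result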

Hyperparameter search - grid search (Grid Search)

Usually many parameters must be specified manually (such as the k value in the k-nearest neighbor algorithm); these are called hyperparameters. Manual tuning is tedious, so several hyperparameter combinations are preset for the model, each combination is evaluated by cross-validation, and the optimal combination is finally selected to build the model.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn_gscv_demo():
    # 1) Load the data
    iris = load_iris()

    # 2) Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)

    # 3) Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # 4) KNN estimator
    estimator = KNeighborsClassifier()
    # Add grid search and cross-validation
    param_dict = {"n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8]}  # parameter grid: each of these values is tried to see which works best
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)  # with little data cv can be larger (more folds); with lots of data a large cv is too time-consuming

    estimator.fit(x_train, y_train)

    # 5) Model evaluation
    # Method 1: compare the predicted values with the true values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Compare true and predicted values:\n", y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)  # the prediction accuracy on the test set of the train/test split

    print("Best parameters:\n", estimator.best_params_)
    print("Best score:\n", estimator.best_score_)  # the best validation score inside the training set (training set = training + validation)
    print("Best estimator:\n", estimator.best_estimator_)
    print("Cross-validation results:\n", estimator.cv_results_)

    return None


if __name__ == "__main__":
    knn_gscv_demo()

Naive Bayes Algorithm

"Naive" refers to an added assumption: the features are mutually independent. Thus the naive Bayes algorithm = the naive assumption + Bayes' formula.

Bayes' formula:

 p(c|w)=\frac{p(w|c)p(c)}{p(w)}

Note: w stands for the feature values of a given document, and c for a document category.

In practice the Laplace smoothing coefficient is also introduced, to prevent a calculated class probability from being 0 (which easily happens when there is little data).

P(F_1|C)=\frac{N_i+\alpha}{N+\alpha m}

\alpha is a specified coefficient, generally 1; N_i is the number of times the feature word F_1 appears in documents of class C, N is the total number of feature-word occurrences in documents of class C, and m is the number of distinct feature words counted in the training documents.
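A quick worked illustration with made-up counts: suppose the documents of class C contain N = 100 feature-word occurrences in total, the word F_1 appears N_1 = 0 times among them, the vocabulary has m = 500 distinct feature words, and \alpha = 1. Then

P(F_1|C)=\frac{0+1}{100+1\times 500}=\frac{1}{600}\approx 0.0017

which is small but no longer exactly 0.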

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def nb_demo():
    '''
    Classify news articles with the naive Bayes algorithm
    :return:
    '''
    # 1) Load the data
    news = fetch_20newsgroups(subset="all")  # fetch is used because the dataset is large; subset defaults to the training set, "all" loads everything

    # 2) Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target)

    # 3) Feature engineering: text feature extraction with tf-idf
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # 4) Naive Bayes estimator
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)

    # 5) Model evaluation
    # Method 1: compare the predicted values with the true values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Compare true and predicted values:\n", y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)

    return None


if __name__ == "__main__":
    nb_demo()

Disadvantage: because of the assumption that sample attributes are independent, the algorithm does not perform well when the features are correlated.

Decision tree

Similar to a tree built with nested if-else statements.

Information entropy

Put simply, information is something that eliminates random uncertainty. For example, if I don't know Xiao Ming's age and Xiao Ming says he is 18 this year, his statement is information; if Xiao Hua then says Xiao Ming will be 19 next year, that statement is no longer information. Information entropy is introduced to measure how much uncertainty there is to eliminate.

H(X)=-\sum_{i=1}^{n}P(x_{i})\log_{b}P(x_{i})

H is called the information entropy; its unit is the bit, and the base b is usually 2.
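A quick worked example: a fair coin toss has two equally likely outcomes, so

H=-\left(\frac{1}{2}\log_{2}\frac{1}{2}+\frac{1}{2}\log_{2}\frac{1}{2}\right)=1\ \text{bit}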

One splitting criterion for decision trees: information gain

The information gain g(D,A) of feature A with respect to training dataset D is defined as the difference between the information entropy H(D) of set D and the conditional entropy H(D|A) of D given feature A:

g(D,A)=H(D)-H(D|A)

As an example, consider a loan-approval dataset: which feature should be chosen first to start building the tree, i.e. which feature is most useful for deciding whether to grant a loan? A small sketch of the computation follows.
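Since the original example table is not reproduced here, the sketch below uses a small hypothetical dataset (the counts are made up) just to show how H(D), H(D|A) and g(D,A) are computed for one binary feature:

import math


def entropy(labels):
    """Information entropy H of a list of class labels, in bits."""
    total = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical loan data: each row is (has_house, approved)
data = [("yes", "approve"), ("yes", "approve"), ("yes", "approve"),
        ("no", "approve"), ("no", "reject"), ("no", "reject"),
        ("no", "reject"), ("no", "reject")]

labels = [y for _, y in data]
h_d = entropy(labels)  # H(D): entropy of the whole dataset

# H(D|A): weighted entropy after splitting on the feature "has_house"
h_d_a = 0.0
for value in {"yes", "no"}:
    subset = [y for x, y in data if x == value]
    h_d_a += len(subset) / len(data) * entropy(subset)

gain = h_d - h_d_a  # g(D, A) = H(D) - H(D|A)
print(f"H(D)={h_d:.3f}, H(D|A)={h_d_a:.3f}, g(D,A)={gain:.3f}")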

 

from matplotlib import pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree


def decision_demo():
    '''
    Classify with a decision tree
    :return:
    '''
    # 1) Load the data
    iris = load_iris()
    feature_names = iris.feature_names

    # 2) Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)

    # 3) Decision tree estimator
    estimator = DecisionTreeClassifier(criterion="entropy")  # split using entropy, i.e. information gain
    estimator.fit(x_train, y_train)

    # 4) Model evaluation
    # Method 1: compare the predicted values with the true values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Compare true and predicted values:\n", y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)

    # 5) Visualize the decision tree
    # Set the figure size
    plt.figure(figsize=(18, 12))
    # Draw the tree
    _ = tree.plot_tree(estimator, filled=True, feature_names=feature_names)  # the return value is not needed, so assign it to _
    plt.show()
    # Save the figure
    # plt.savefig('./tree.jpg')  # to save the figure, comment out plt.show() first

    return None


if __name__ == "__main__":
    decision_demo()

 

Random forest (an ensemble learning method)

Ensemble learning solves a single prediction problem by building and combining several models. It generates multiple classifiers/models, each of which learns and makes predictions independently; these predictions are then combined into one final prediction, which is therefore better than any single model's prediction. In machine learning, a random forest is a classifier containing multiple decision trees, and its output class is the mode of the classes output by the individual trees. A minimal sketch is shown below.
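A minimal sketch using sklearn's RandomForestClassifier on the iris data (this example is not from the original article; n_estimators and max_depth are common hyperparameters that could also be tuned with GridSearchCV as above):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def rf_demo():
    # 1) Load and split the data
    iris = load_iris()
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)

    # 2) Random forest estimator: n_estimators is the number of decision trees in the forest
    estimator = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=22)
    estimator.fit(x_train, y_train)

    # 3) Model evaluation
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)

    return None


if __name__ == "__main__":
    rf_demo()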

 

 
