Python Machine Learning (7) Decision Trees (Part 2): Feature Engineering, Dictionary Features, Text Features, Decision Tree Algorithm API, Visualization, Regression Problem Solving

Decision Tree Algorithm

Feature Engineering - Feature Extraction

Feature extraction is the conversion of arbitrary data into numerical features that can be used for machine learning. Computers cannot directly work with character strings; only by converting strings into numerical features can the machine understand the meaning carried by the string.
It is mainly divided into: dictionary feature extraction (feature discretization) and text feature extraction (the frequency with which feature words occur in documents).

Dictionary feature extraction

Transform categorical data. The computer cannot recognize the imported city and temperature strings directly; they first need to be converted into 0/1 (one-hot) codes before the machine can use them. A code sketch follows the API description below.

Dictionary Feature Extraction API

sklearn.feature_extraction.DictVectorizer(sparse=True, ...)
DictVectorizer.fit_transform(X): X is a dict or an iterable of dicts; the return value is a sparse matrix
DictVectorizer.get_feature_names(): returns the category (feature) names

When the amount of data is relatively large, the sparse-matrix form displays the feature data more intuitively: zero entries are not shown, which also saves memory.
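A minimal sketch of dictionary feature extraction with DictVectorizer; the city/temperature records are illustrative stand-ins, since the original code appears only as a screenshot:

from sklearn.feature_extraction import DictVectorizer

data = [{"city": "Beijing", "temperature": 100},
        {"city": "Shanghai", "temperature": 60},
        {"city": "Shenzhen", "temperature": 30}]

# sparse=True (the default) returns a scipy sparse matrix that stores only non-zero entries
transfer = DictVectorizer(sparse=True)
result = transfer.fit_transform(data)
print(result)                              # (row, column) value triples
print(transfer.get_feature_names_out())    # feature names (get_feature_names() in older sklearn)

# sparse=False returns a dense one-hot encoded array instead
transfer_dense = DictVectorizer(sparse=False)
print(transfer_dense.fit_transform(data))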

Text Feature Extraction

Characterizes text data by the frequency with which each word appears in a document.

Text Feature Extraction API

sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
Returns a word-frequency matrix.
CountVectorizer.fit_transform(X)
X: text, or an iterable of text strings
Return value: a sparse matrix
CountVectorizer.get_feature_names(): returns the word list

English text feature extraction implementation

Requirement: count the frequency of each word in the following sentences.

["Life is a never - ending road", "I walk,walk,keep walking."]

Note:
1. Text feature extraction has no sparse parameter; the result can only be received as the default sparse matrix.
2. Single letters, such as "I" and "a", are not counted.
3. Stop words are specified through stop_words.

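A sketch of the English example above with CountVectorizer (the printed dense view is an illustrative choice):

from sklearn.feature_extraction.text import CountVectorizer

data = ["Life is a never - ending road", "I walk,walk,keep walking."]

transfer = CountVectorizer()                # no sparse parameter; always returns a sparse matrix
result = transfer.fit_transform(data)
print(transfer.get_feature_names_out())     # word list; single letters like "I" are dropped
print(result.toarray())                     # dense view of the word-frequency matrix

# stop words can be excluded explicitly:
transfer_sw = CountVectorizer(stop_words=["is", "never"])
print(transfer_sw.fit_transform(data).toarray())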

Implementation of Chinese text feature extraction

Requirement: count the frequency of each word in the following text.

data = ["This encounter", "It's so beautiful, so beautiful, so beautiful, so beautiful, so beautiful."]

Requirement: count the occurrence frequency of the words in a longer passage of text.
The words in the text are extracted, deduplicated, and placed in a list; the resulting matrix then shows how many times each word appears in each row (document). Based on which words appear most often, a document can be classified as being related to those words.
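CountVectorizer splits on whitespace and punctuation, so Chinese text must be segmented into words first. A sketch using the jieba segmenter (an assumed choice, with illustrative sentences, since the original code appears only as a screenshot):

import jieba
from sklearn.feature_extraction.text import CountVectorizer

data = ["这次相遇太美", "太美太美，这次相遇太美"]    # illustrative sentences

# segment each document and re-join with spaces so CountVectorizer can tokenize it
segmented = [" ".join(jieba.cut(text)) for text in data]

transfer = CountVectorizer()
result = transfer.fit_transform(segmented)
print(transfer.get_feature_names_out())    # deduplicated word list
print(result.toarray())                    # how often each word appears in each document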

Tf-idf text feature extraction

The main idea of TF-IDF: if a word or phrase appears with high probability in one article but rarely in other articles, it is considered to have good category-discriminating ability and is suitable for classification.
The role of TF-IDF: to evaluate how important a word is to one document within a document set or corpus.

Tf-idf text feature extraction formula: tfidf_{i,j} = tf_{i,j} × idf_i
Term frequency (tf): the frequency with which a given word appears in a document.
Inverse document frequency (idf): a measure of the general importance of a word. The idf of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, then taking the base-10 logarithm of the quotient.

For example, if an article contains 1000 words and "real estate" appears 500 times, the term frequency of "real estate" in this article is tf = 500/1000 = 0.5. If "real estate" appears in 1000 documents out of a total of 1,000,000, then idf = log10(1,000,000/1,000) = 3, and tf-idf = 0.5 × 3 = 1.5.
So we look not only at how many times (how frequently) a word appears in one article, but also at how often it appears across the whole document set.

Tf-idf Text Feature Extraction API

sklearn.feature_extraction.text.TfidfVectorizer

The result is the computed tf-idf matrix. When there is no separate document set, the input is split by element: the list serves as the document set and each element is processed as one document. By comparing tf-idf values, the words with larger values can be treated as the important, discriminating vocabulary.
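A sketch of TfidfVectorizer on a small illustrative document set (the sentences are stand-ins for the original screenshot):

from sklearn.feature_extraction.text import TfidfVectorizer

data = ["life is short, i like python",
        "life is too long, i dislike python",
        "life is short, i use java"]       # illustrative documents

transfer = TfidfVectorizer(stop_words=["too"])
result = transfer.fit_transform(data)      # each list element is one document
print(transfer.get_feature_names_out())
print(result.toarray())                    # larger values mark more discriminating words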

Decision Tree Algorithm API

Classification API

sklearn.tree.DecisionTreeClassifier: the decision-tree classification algorithm
 - criterion: sets the type of tree
 - entropy: splits on information entropy, i.e. the ID3 algorithm; in practice the results differ little from C4.5
 - gini: the default, equivalent to the Gini index. The CART algorithm partitions attributes based on the Gini index,
so criterion='gini' actually runs the CART algorithm.
 - splitter: the principle for choosing split features when building the tree; can be 'best' or 'random'. The default is 'best':
 - 'best' chooses the best feature among all features; 'random' chooses the best among a subset of features.
 - max_depth: the maximum depth of the decision tree; limiting the depth helps prevent the tree from overfitting.
 - min_samples_split: a node with fewer samples than min_samples_split is not split further; the default is 2.
 - min_samples_leaf: the minimum number of samples required at a leaf node. If a leaf node falls below this threshold, it is
pruned together with its siblings. Can be an int or a float.
 - max_leaf_nodes: the maximum number of leaf nodes (int). Usually there is no need to set it when there are few features;
with many features, it can be set to prevent overfitting.
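A minimal usage sketch of DecisionTreeClassifier with some of the parameters above, using the bundled iris dataset as an assumed stand-in:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

clf = DecisionTreeClassifier(criterion="entropy",   # ID3-style information-entropy splits
                             max_depth=5,           # cap the depth to limit overfitting
                             min_samples_split=2)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))                    # accuracy on the held-out test set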

Case: Titanic Passenger Survival Prediction

Requirement: read the following data and predict survival.

train.csv is the training data set, containing both the feature information and the survival label;
test.csv is the test data set, containing only the feature information.
PassengerId: passenger number; Survived: survival label; Pclass: ticket class; Name: name; Sex: gender; Age: age;
SibSp: number of relatives aboard (siblings/spouses); Parch: number of relatives aboard (parents/children); Ticket: ticket number; Fare: fare; Cabin: cabin.


From the analysis of the fields: purely numeric columns can have missing values replaced with the mean; for string columns, if too many values are missing the column is dropped outright, while a small number of missing values is filled with the most frequent value. Feature selection should favor features closely related to the label. The text in those features is converted into corresponding numeric values, the model is trained, and finally K-fold cross-validation is performed.
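A sketch of this preprocessing and training flow; the file path, feature choice, and hyperparameters are assumptions, since the original code appears only in screenshots:

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("train.csv")              # assumed path to the training set

# keep features closely related to the label and fill the missing ages with the mean
X = data[["Pclass", "Sex", "Age"]].copy()
y = data["Survived"]
X["Age"] = X["Age"].fillna(X["Age"].mean())

# convert the text feature (Sex) into numeric one-hot values
transfer = DictVectorizer(sparse=False)
X = transfer.fit_transform(X.to_dict(orient="records"))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

clf = DecisionTreeClassifier(criterion="entropy", max_depth=8)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))                             # test-set accuracy
print(cross_val_score(clf, X_train, y_train, cv=10).mean())  # K-fold cross-validation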

Decision Tree Visualization

Install the Graphviz tool (download address: http://www.graphviz.org/download/).
Add Graphviz to the PATH environment variable, then install the Python library with pip install graphviz.

The generated image is relatively large and can be saved as a PDF file.
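A sketch of the visualization step; clf and transfer are assumed to be the trained classifier and the DictVectorizer from the Titanic sketch above:

from sklearn.tree import export_graphviz
import graphviz

dot_data = export_graphviz(clf,
                           feature_names=transfer.get_feature_names_out(),
                           filled=True)      # color nodes by majority class
graph = graphviz.Source(dot_data)
graph.render("titanic_tree")                 # writes titanic_tree.pdf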

Decision Trees for Regression Problems

Decision trees are built on the ID3, C4.5, and CART algorithms; the regression problem is implemented with the CART algorithm (CART splits on the Gini index for classification and on squared error for regression).
Import the Boston housing-price data and the decision tree's regression interface. Call the data interface to get the data set, obtain the feature names and the feature set, split the features and labels into training and test sets, and then train and predict.
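A sketch of the regression flow; note that load_boston was removed in scikit-learn 1.2, so this assumes an older version (or substitute another regression dataset):

from sklearn.datasets import load_boston     # requires scikit-learn < 1.2
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

boston = load_boston()
print(boston.feature_names)                  # feature names
X, y = boston.data, boston.target            # feature set and labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

reg = DecisionTreeRegressor(max_depth=4)     # CART regression tree
reg.fit(X_train, y_train)
print(reg.predict(X_test)[:5])               # first few predicted house prices
print(reg.score(X_test, y_test))             # R^2 on the test set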

Origin blog.csdn.net/hwwaizs/article/details/132114797