Feature selection and data visualization

Author: Quanta branch

 

For most of us, data is abstract and disorderly. Today, let's see how Python can turn abstract data into clear, visual charts!

 

When doing research or building data-analysis algorithms, data visualization may not seem very appealing: visualizing your data rarely brings immediate returns, and producing the charts can be tedious, so it can feel like thankless work. In reality, though, visualization helps you understand your data much better, and a good visualization can point your research in the right direction before you even start, saving you detours.

Below, we work through a few examples to show how to use visualization tools cleverly to analyze your data and guide feature selection.

0. Importing Data

First, we import the necessary data-processing packages and load our data with pandas. The data in this example is a breast cancer diagnostic dataset; although we have no medical expertise, we can still draw some conclusions by analyzing the data.

import numpy as np # linear algebra

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns # data visualization library 

import matplotlib.pyplot as plt

import time

from subprocess import check_output

data=pd.read_csv('data.csv')
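If you don't have data.csv on hand, an equivalent DataFrame can be built from scikit-learn's bundled copy of the same Wisconsin breast cancer dataset. This is only a sketch under that assumption: note that scikit-learn's column names differ slightly from the Kaggle CSV (e.g. 'mean radius' instead of 'radius_mean'), so it is not a drop-in replacement.

# Alternative loading path (assumes scikit-learn is installed; column names
# differ from the Kaggle CSV, so rename them if you follow along literally)
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()
alt = pd.DataFrame(bc.data, columns=bc.feature_names)
alt['diagnosis'] = np.where(bc.target == 0, 'M', 'B')  # in sklearn, target 0 = malignant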

1. Data Analysis

Before feature extraction and selection, we first perform a basic analysis of the data. Let's start by looking at what features our data contains:

data.head()

Looking at these first rows, we should note four points: 1. id cannot be used for classification. 2. diagnosis should be our label. 3. The last column, Unnamed: 32, is all NaN, so we discard it. 4. We don't know what the other features represent, but that does not affect our analysis of the data.

Next, we separate the labels from the features:

# col stores the feature names
col = data.columns

# y stores the labels
y = data.diagnosis

# drop the label, the id, and the all-NaN column
# (avoid calling this variable "list", which shadows the Python builtin)
drop_cols = ['Unnamed: 32', 'id', 'diagnosis']
x = data.drop(drop_cols, axis=1)
x.head()

 

We plot the label distribution with the seaborn library:

ax = sns.countplot(y,label="Count")
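To read off the exact counts behind the bar chart, pandas' value_counts is a handy complement (the numbers in the comment are what this dataset should produce):

print(y.value_counts())
# B    357
# M    212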

Next we look at the features. We don't need to understand what each feature means, but we do want to know basic statistics of the data: the variance, the bias, and the minimum and maximum values. This kind of information helps us understand the state of the data and prepares us for the next steps. For example, the maximum of the area_mean feature is 2500, while the maximum of smoothness_mean is 0.16340. Therefore, before visualization, feature selection, feature extraction, or classification, we need to standardize the data.

Below we call pandas' describe function to compute these basic statistics:

x.describe()
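As a quick numeric check of the scale mismatch described above (an added snippet, not part of the original walkthrough), we can compare the spread of the two features directly:

# compare the ranges of two features on very different scales
print(x[['area_mean', 'smoothness_mean']].agg(['min', 'max', 'std']))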

 

 

2. Data Visualization

For visualization we use the seaborn library, because it offers many chart types and supports a wide range of data analysis.

First, let's introduce the violin plot, which shows clearly how each feature is distributed under the different labels.

 

# first ten features
data_dia = y
data = x

# standardize the data
data_n_2 = (data - data.mean()) / (data.std())

# select the first ten features
data = pd.concat([y,data_n_2.iloc[:,0:10]],axis=1)

data = pd.melt(data,id_vars="diagnosis",

                    var_name="features",

                    value_name='value')

plt.figure(figsize=(10,10))

sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")

plt.xticks(rotation=90)

From the violin plot above, we can see that for the texture_mean feature, the centers of the benign and malignant distributions are clearly separated, whereas for fractal_dimension_mean the two distributions are not well separated. So texture_mean is better suited than fractal_dimension_mean for distinguishing the labels.
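We can back up this visual impression with a rough numeric check (an added sketch; the assumption is that the gap between per-class medians, in standardized units, tracks how well a feature separates the labels):

# median of each standardized feature per class; a larger gap between the
# B and M rows suggests better separation
medians = data_n_2.join(y).groupby('diagnosis')[['texture_mean', 'fractal_dimension_mean']].median()
print(medians)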

Below is the violin plot for the second group of ten features:

# Second ten features

data = pd.concat([y,data_n_2.iloc[:,10:20]],axis=1)

data = pd.melt(data,id_vars="diagnosis",

                    var_name="features",

                    value_name='value')

plt.figure(figsize=(10,10))

sns.violinplot(x="features", y="value", hue="diagnosis", data=data,split=True, inner="quart")

plt.xticks(rotation=90)

As an alternative to the violin plot, the box plot achieves the same effect. Box plots can also be used to check whether the data contains outliers (we verify this numerically after the plot below).

plt.figure(figsize=(10,10))

sns.boxplot(x="features", y="value", hue="diagnosis", data=data)

plt.xticks(rotation=90)
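To turn the visual outlier check into numbers, here is a small sketch using the same 1.5 × IQR rule that the box-plot whiskers are based on (the feature chosen is just an illustration):

# flag values beyond the whiskers (1.5 * IQR) for one feature
q1, q3 = data_n_2['area_se'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = data_n_2[(data_n_2['area_se'] < q1 - 1.5 * iqr) |
                    (data_n_2['area_se'] > q3 + 1.5 * iqr)]
print(len(outliers), 'potential outliers in area_se')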

(Figure: the structure of a box plot.)

One thing we can take away from the plots above: the label distributions of concavity_worst and concave points_worst look strikingly similar, so we can infer that they are very likely correlated. In most cases, when two features are correlated, one of them can be discarded.

To compare these two variables more closely, we use a chart called a joint plot:

sns.jointplot(x.loc[:,'concavity_worst'], x.loc[:,'concave points_worst'], kind="reg", color="#ce1414")

As the figure shows, the x and y axes are the two features, and the joint plot draws one point per sample. The annotated Pearson r value (a measure of linear correlation) of the two variables is very high.
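The same Pearson r can be computed directly, without reading it off the plot (a sketch assuming scipy is available):

# Pearson correlation between the two features
from scipy.stats import pearsonr
r, p = pearsonr(x['concavity_worst'], x['concave points_worst'])
print('Pearson r = %.3f (p = %.3g)' % (r, p))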

So how do we compare the relationships among more than two variables? We can use a pair grid plot:

sns.set(style="white")

df = x.loc[:,['radius_worst','perimeter_worst','area_worst']]

g = sns.PairGrid(df, diag_sharey=False)

g.map_lower(sns.kdeplot, cmap="Blues_d")

g.map_upper(plt.scatter)

g.map_diag(sns.kdeplot, lw=3)
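For a quicker version of the same idea, seaborn's higher-level pairplot builds a comparable grid in one call (a convenience alternative, not part of the original walkthrough):

# scatter plots off the diagonal, distributions on the diagonal
sns.pairplot(df)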

 

The swarm plot is another very practical chart for data analysis. Below we use three swarm plots to display all of our features:

sns.set(style="whitegrid", palette="muted")

data_dia = y

data = x

data_n_2 = (data - data.mean()) / (data.std())              # standardization

data = pd.concat([y,data_n_2.iloc[:,0:10]],axis=1)

data = pd.melt(data,id_vars="diagnosis",

                    var_name="features",

                    value_name='value')

plt.figure(figsize=(10,10))

tic = time.time()

sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)

 

plt.xticks(rotation=90)

 

 

data = pd.concat([y,data_n_2.iloc[:,10:20]],axis=1)

data = pd.melt(data,id_vars="diagnosis",

                    var_name="features",

                    value_name='value')

plt.figure(figsize=(10,10))

sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)

plt.xticks(rotation=90)

 

 

data = pd.concat([y,data_n_2.iloc[:,20:31]],axis=1)

data = pd.melt(data,id_vars="diagnosis",

                    var_name="features",

                    value_name='value')

plt.figure(figsize=(10,10))

sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)

toc = time.time()

plt.xticks(rotation=90)

print("swarm plot time: ", toc-tic ," s")

These three plots let us see the variance much more clearly. It is immediately obvious that perimeter_worst and area_worst in the third plot separate the benign and malignant classes most distinctly, while smoothness_se in the second plot hardly separates them at all.
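Here too the visual reading can be backed by a rough ranking (an added sketch; the assumption is that the absolute difference of standardized class means is a reasonable proxy for separability):

# rank features by the gap between the standardized class means
class_means = data_n_2.join(y).groupby('diagnosis').mean()
sep = (class_means.loc['M'] - class_means.loc['B']).abs().sort_values(ascending=False)
print(sep.head())   # perimeter_worst and area_worst should rank near the top
print(sep.tail())   # smoothness_se should rank near the bottom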

If we want to see the relationships between all the features, we can use a heatmap.

 

#correlation map

f,ax = plt.subplots(figsize=(18, 18))

sns.heatmap(x.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
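As a programmatic complement to eyeballing the heatmap (an added sketch; the 0.9 threshold is an arbitrary choice), we can list the most strongly correlated feature pairs directly:

# keep only the upper triangle so each pair appears once
corr = x.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
high_pairs = corr.where(mask).stack()
print(high_pairs[high_pairs > 0.9].sort_values(ascending=False))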

3. Feature Selection and Prediction with a Random Forest

With the charts above, we can now use correlation to select the features for prediction. From the heatmap we can see that radius_mean, perimeter_mean, and area_mean are correlated (variable pairs with a correlation of 1.0 in the map), so we keep only area_mean. We choose area_mean because, in the swarm plots, its values separate the benign and malignant classes most clearly. In the same way we keep the other weakly correlated features, arrive at the following list of features we can discard, and drop them, leaving the features we will use for prediction:

drop_list1 = ['perimeter_mean','radius_mean','compactness_mean','concave points_mean','radius_se','perimeter_se','radius_worst','perimeter_worst','compactness_worst','concave points_worst','compactness_se','concave points_se','texture_worst','area_worst']

x_1 = x.drop(drop_list1,axis = 1 )        # do not modify x, we will use it later

x_1.head()

 

Next, we use these selected features to test the prediction accuracy of a random forest model.

 

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import f1_score,confusion_matrix

from sklearn.metrics import accuracy_score

 

# split data train 70 % and test 30 %

x_train, x_test, y_train, y_test = train_test_split(x_1, y, test_size=0.3, random_state=42)

 

#random forest classifier with n_estimators=10 (default)

clf_rf = RandomForestClassifier(random_state=43)     

clf_rf = clf_rf.fit(x_train,y_train)

 

ac = accuracy_score(y_test,clf_rf.predict(x_test))

print('Accuracy is: ',ac)

cm = confusion_matrix(y_test,clf_rf.predict(x_test))

sns.heatmap(cm,annot=True,fmt="d")

The prediction accuracy exceeds 96%.
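Since f1_score was imported above but never used, here is how it would be called at this point (a sketch; treating 'M', malignant, as the positive class is our assumption):

# F1 combines precision and recall; pos_label picks the positive class
f1 = f1_score(y_test, clf_rf.predict(x_test), pos_label='M')
print('F1 score (malignant as positive class): %.3f' % f1)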

4. Summary

Using the example of predicting whether a tumor is benign or malignant, this article showed how to draw violin plots, box plots, joint plots, swarm plots, and heatmaps, used these charts to analyze and select features, and finally tested the selection with a prediction model. There are many ways to select features, and algorithmic feature selection may well be more efficient, but these visualizations are better suited for presentation and give a more intuitive understanding.

About Us

Mo (https://momodel.cn) is an online AI modeling platform that supports Python and helps you quickly develop, train, and deploy models.

The Mo AI Club was founded by the R&D and product team behind the platform, with the goal of lowering the barrier to developing and using AI. The team has experience in big-data processing and analysis, visualization, and data modeling; it has undertaken intelligent projects in many fields and has full-stack design and development capability from backend to frontend. Its main research interests are big-data management and analysis and AI technology, applied to promote data-driven scientific research.

The team currently hosts an offline meetup in Hangzhou every other Saturday, sharing machine learning papers and exchanging ideas. We hope to bring together people from all walks of life who are interested in AI, to keep learning and growing together, and to promote the democratization and wider adoption of AI.