Plotting a matrix with python-pandas

Note: This article is a translation based on "Visualize Machine Learning Data in Python With Pandas" from Machine Learning Mastery. The original author's point is that before applying machine learning algorithms to a dataset we must first understand the data, and the fastest way to understand it is to visualize it. The visualization methods the author uses apply to many kinds of data and are built around plot matrices, such as the histogram matrix and the scatter plot matrix. Following the author's analysis, this article shows how to draw these matrix plots with pandas.

(1) Data

The data is the Pima Indians diabetes dataset; the author's code contains the URL of the data source. It has 768 samples, and the variables are:

preg: number of times pregnant

plas: plasma glucose concentration at 2 hours in an oral glucose tolerance test

pres: diastolic blood pressure (mm Hg)

skin: triceps skinfold thickness (mm)

test: 2-hour serum insulin (μU/ml)

mass: body mass index (weight in kg / (height in m)^2)

pedi: diabetes pedigree function

age: age (years)

class: class variable (0 or 1), indicating whether the patient tested positive for diabetes (1) or not (0).
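The data file has no header row, which is why the variable names above have to be passed in explicitly when the file is read. A minimal sketch of that step, using a few illustrative rows in an in-memory buffer instead of the real download (the row values are placeholders for demonstration and the name `sample` is introduced only here):

```python
import io

import pandas

# Column names from the list above; the raw file itself has no header row.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Illustrative rows in the nine-column layout (placeholder values, not the download).
raw = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

# names= supplies the header; without it the first data row would become the header.
sample = pandas.read_csv(raw, names=names)

print(sample.shape)                    # (rows, columns)
print(sample.dtypes)                   # which columns are integers and which are floats
print(sample['class'].value_counts())  # number of rows in each class
```

Running the same three checks on the full dataset is a quick way to confirm that it loaded with 768 rows and 9 columns before plotting anything.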

(2) Histograms (histogram matrix)
import pandas
import matplotlib.pyplot as plt

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # set the variable names
data = pandas.read_csv(url, names=names)  # read the CSV data with pandas
data.hist()
plt.show()

However, the resulting figure is poorly laid out: the subplot titles (the variable names) overlap the axis tick labels. This can be fixed with the parameters of hist(): xlabelsize and ylabelsize set the font size of the x-axis and y-axis tick labels, and figsize sets the overall size of the figure:

data.hist(xlabelsize=7, ylabelsize=7, figsize=(8, 6))
plt.show()
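Besides the label sizes and figsize, hist() also accepts bins and layout, and it returns the array of subplot axes, which can be useful for further tweaks. A rough sketch on synthetic stand-in data (the frame `demo` and its columns are made up for illustration; the Agg backend is only an assumption so that the sketch runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open a window
import matplotlib.pyplot as plt
import numpy
import pandas

# Synthetic stand-in data, not the Pima dataset.
rng = numpy.random.default_rng(0)
demo = pandas.DataFrame({
    'a': rng.normal(50, 10, 200),
    'b': rng.exponential(2.0, 200),
    'c': rng.normal(0, 1, 200),
    'd': rng.integers(0, 2, 200),
})

# bins sets the number of bars per histogram, layout the grid of subplots.
axes = demo.hist(bins=20, layout=(2, 2), xlabelsize=7, ylabelsize=7, figsize=(8, 6))

# tight_layout() asks matplotlib to space the subplots so labels do not collide.
plt.tight_layout()

print(axes.shape)  # one Axes object per cell of the layout
```

With an interactive backend, finish with plt.show() as in the rest of this article.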

The histograms show the distribution of each variable. mass, plas, and pres look roughly normal. The remaining variables, apart from class, have most of their values piled up on the left with a long tail to the right, i.e. they are right-skewed.
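Reading the shape of a distribution off a histogram is subjective, so it can help to back it up with a number. One option is DataFrame.skew(), which returns the sample skewness of each column: values near 0 suggest a symmetric shape, clearly positive values a long right tail. A small sketch on synthetic data (the columns `bell` and `tail` are made up for illustration and are not Pima variables):

```python
import numpy
import pandas

# Synthetic stand-in data: one symmetric column and one with a long right tail.
rng = numpy.random.default_rng(0)
demo = pandas.DataFrame({
    'bell': rng.normal(loc=30.0, scale=5.0, size=768),
    'tail': rng.exponential(scale=10.0, size=768),
})

# skew() returns one skewness value per numeric column.
skewness = demo.skew()
print(skewness)
```

Calling data.skew() on the real dataset gives the same kind of per-variable summary to compare against the histograms.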

(3) Density Plots (Density Plot Matrix)

data.plot(kind='density', subplots=True, layout=(3,3), sharex=False,fontsize=8,figsize=(8,6))
plt.show()

The output of the original code still shows some overlap, so two parameters are added here: fontsize, which sets the font size of the tick labels, and figsize, which sets the overall figure size. Note that kind='density' relies on SciPy being installed.

(4) Box and Whisker Plots

data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False, fontsize=8,figsize=(8,6))
plt.show()

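This is similar to (3). Note that subplots can share their x-axis and y-axis; sharex=False and sharey=False turn sharing off, so every variable gets its own axis range. That matters here because the variables are on very different scales.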
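The effect of sharey is easiest to see by comparing the y-axis limits of the subplots with sharing off and on. A minimal sketch with two synthetic columns on very different scales (the frame `demo` is made up for illustration; the Agg backend is an assumption so the code runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open a window
import numpy
import pandas

# Synthetic stand-in data with two very different value ranges.
rng = numpy.random.default_rng(0)
demo = pandas.DataFrame({
    'small': rng.uniform(0.0, 1.0, 100),
    'large': rng.uniform(0.0, 200.0, 100),
})

# sharey=False: each box plot gets its own y-axis range.
result = demo.plot(kind='box', subplots=True, layout=(1, 2),
                   sharex=False, sharey=False, figsize=(6, 3))
independent = [ax.get_ylim() for ax in numpy.asarray(result).ravel()]

# sharey=True: both box plots are forced onto one common y-axis range.
result = demo.plot(kind='box', subplots=True, layout=(1, 2),
                   sharex=False, sharey=True, figsize=(6, 3))
shared = [ax.get_ylim() for ax in numpy.asarray(result).ravel()]

print(independent)
print(shared)
```

With a shared axis the box for the small-range column is squeezed into a thin line, which is why the plots in this article keep sharing turned off.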

(5) Correlation Matrix Plot

import numpy
correlations = data.corr()  # compute the correlation matrix of the variables
# plot correlation matrix
fig = plt.figure()  # create a figure object
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)  # draw the heat map, scaled from -1 to 1
fig.colorbar(cax)  # add a color bar for the heat map drawn by matshow
ticks = numpy.arange(0, 9, 1)  # tick positions 0 to 8, step 1
ax.set_xticks(ticks)  # set the tick positions
ax.set_yticks(ticks)
ax.set_xticklabels(names)  # label the x-axis ticks with the variable names
ax.set_yticklabels(names)
plt.show()

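The color of each cell encodes the correlation coefficient of the corresponding pair of variables, read against the color bar, which runs from -1 to 1. Cells near either end of the scale indicate a strong (negative or positive) correlation, and cells near the middle indicate a weak one. The diagonal is always 1, since every variable is perfectly correlated with itself.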
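Colors are hard to compare precisely, so it can be useful to list the strongest pairs numerically as well. One possible approach, sketched on synthetic data (the frame `demo` and its columns are made up for illustration), keeps the upper triangle of the correlation matrix and sorts the pairs by absolute value:

```python
import numpy
import pandas

# Synthetic stand-in data: 'pos' follows 'x' closely, 'neg' loosely and inversely.
rng = numpy.random.default_rng(0)
x = rng.normal(size=500)
demo = pandas.DataFrame({
    'x': x,
    'pos': 2.0 * x + rng.normal(scale=0.5, size=500),
    'neg': -x + rng.normal(scale=2.0, size=500),
    'noise': rng.normal(size=500),
})

correlations = demo.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
mask = numpy.triu(numpy.ones(correlations.shape, dtype=bool), k=1)
pairs = correlations.where(mask).stack().dropna()

# Order the pairs from strongest to weakest, ignoring the sign.
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```

The same steps applied to data.corr() would rank the pairs of Pima variables that stand out in the heat map.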

(6) Scatterplot Matrix

from pandas.plotting import scatter_matrix  # pandas.tools.plotting in old pandas versions
scatter_matrix(data, figsize=(10, 10))
plt.show()
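scatter_matrix() returns a square array of axes, one per pair of variables, with the diagonal showing each variable's own distribution. A short sketch on synthetic data (the frame `demo` is made up for illustration; the Agg backend is an assumption so the code runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open a window
import numpy
import pandas
from pandas.plotting import scatter_matrix

# Synthetic stand-in data with three columns.
rng = numpy.random.default_rng(0)
demo = pandas.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])

# diagonal='hist' draws a histogram of each variable on the diagonal;
# alpha makes overlapping points easier to see.
axes = scatter_matrix(demo, figsize=(6, 6), diagonal='hist', alpha=0.5)

print(axes.shape)  # one row and one column of subplots per variable
```

For the nine Pima variables this gives a 9 x 9 grid, which is why the larger figsize=(10, 10) is used above.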

Origin blog.csdn.net/ruiyiin/article/details/77141979