Visualizing missing-value statistics with matplotlib and seaborn

Using matplotlib:

  step1. Create a blank canvas. plt.figure() returns the figure (canvas) object, stored here as fig.

fig=plt.figure()

  step2. Create a subplot.

ax = fig.add_subplot(1, 2, 1)  # split the canvas into 1 row and 2 columns and return the first subplot
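
The three arguments of add_subplot are (number of rows, number of columns, index), and the index counts from 1, left to right and then top to bottom. A minimal sketch of this numbering on a 2x2 grid follows; the Agg backend and the `positions` dictionary are assumptions added only so the example runs without a display and prints something checkable.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

fig = plt.figure()
positions = {}  # hypothetical helper: subplot index -> (row, column), both 0-based
for index in range(1, 5):
    ax = fig.add_subplot(2, 2, index)  # 2 rows, 2 columns, index-th cell
    spec = ax.get_subplotspec()
    positions[index] = (spec.rowspan.start, spec.colspan.start)
    ax.set_title(f"index={index}")

print(positions)  # {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1)}
plt.close(fig)
```

So add_subplot(1, 2, 1) is the left cell of a one-row, two-column layout, and add_subplot(1, 2, 2) is the right one.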

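  step3. Start drawing. Here we use seaborn, which is a high-level wrapper around matplotlib. It does not need to be given the canvas: by default it draws on the current subplot, which is the one created in the previous step. The arguments are the x and y data.

sns.barplot(x=missing[col], y=missing.index)  # pass x and y by keyword; recent seaborn versions no longer accept them positionally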

  step4. The ax object lets us manipulate the subplot easily. Here we set the subplot's title. The f prefix marks a formatted string (f-string): a variable enclosed in {} inside it, here col, is replaced by its value, so the value appears in the figure.

ax.set_title(f'Missing values on each columns({col})')
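
A short sketch of how the f-string is expanded, independent of any plotting. The values of `col`, `mean` and `std` are hypothetical; the `:.2f` part is a standard format specification that rounds to two decimals.

```python
col = "proportion"           # hypothetical column name
mean, std = 0.0312, 0.0041   # hypothetical statistics

title = f"Missing values on each column ({col})"
label = f"Average {col} of missing values: {mean:.2f} \u00B1 {std:.2f}"

print(title)  # Missing values on each column (proportion)
print(label)  # Average proportion of missing values: 0.03 ± 0.00
```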

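  step5. Finally, display the figure with plt.show(). In my run fig.show() did not display the image, for a reason I have not identified. A likely explanation is that fig.show() does not start a blocking GUI event loop, so in a plain script the window can close immediately.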

plt.show()

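  step6. To save the image, call plt.savefig(). In a script, call it before plt.show(), because once the window has been closed the current figure may be empty and the saved file blank.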

plt.savefig("1.png")
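
The steps above can be put together as one small script. This is only a sketch under stated assumptions: the `missing` table is a made-up stand-in for the real summary, the Agg backend is chosen so the code runs without a display, and the figure is written to a temporary path (`out_path`) instead of being shown.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so plt.show() is not needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical summary table standing in for `missing`
missing = pd.DataFrame(
    {"number": [3, 0, 5], "proportion": [0.03, 0.0, 0.05]},
    index=["bin_0", "nom_1", "ord_2"],
)

fig = plt.figure(figsize=(10, 4))                            # step1: blank canvas
for i, col in enumerate(missing.columns):
    ax = fig.add_subplot(1, 2, i + 1)                        # step2: 1 row, 2 columns
    sns.barplot(x=missing[col], y=missing.index, ax=ax)      # step3: draw on this subplot
    ax.set_title(f"Missing values on each column ({col})")   # step4: title via f-string

out_path = os.path.join(tempfile.mkdtemp(), "missing.png")   # hypothetical output location
fig.savefig(out_path)                                        # step6: save the figure
plt.close(fig)                                               # release the figure
```

Passing ax=ax to seaborn makes the target subplot explicit, so the result does not depend on which subplot happens to be current.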

The example comes from a Kaggle practice competition:

https://www.kaggle.com/c/cat-in-the-dat-ii

Reference: https://www.kaggle.com/warkingleo2000/first-step-on-kaggle/data

The complete code follows. There are two approaches. The first is based on https://zhuanlan.zhihu.com/p/93423829

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def plot_missing_values(df):
    cols = df.columns
    count = [df[col].isnull().sum() for col in cols]  # isnull() flags missing cells, sum() counts them
    percent = [i / len(df) for i in count]  # fraction of rows that are missing
    missing = pd.DataFrame({'number': count, 'proportion': percent}, index=cols)  # note how the DataFrame is built

    fig = plt.figure(figsize=(20, 7))

    for i, col in enumerate(missing.columns):
        ax = fig.add_subplot(1, 2, i + 1)
        ax.set_title(f'Missing values on each columns({col})')
        sns.barplot(x=missing[col], y=missing.index)  # x and y passed by keyword

    plt.show()

if __name__ == '__main__':
    raw_train=pd.read_csv("train.csv")
    raw_test=pd.read_csv("test.csv")
    plot_missing_values(raw_train)
    #plt.savefig("1.png")
    plot_missing_values(raw_test)
    #plt.savefig("2.png")

The second approach is based on the Kaggle notebook https://www.kaggle.com/warkingleo2000/first-step-on-kaggle/data

import numpy as np  # needed for np.mean and np.std; pd, plt and sns are imported as in the first approach

def plot_missing_values(df):
    cols = df.columns
    count = [df[col].isnull().sum() for col in cols]
    percent = [i/len(df) for i in count]
    missing = pd.DataFrame({'number':count, 'proportion': percent}, index=cols)
    
    fig, ax = plt.subplots(1,2, figsize=(20,7))
    for i, col in enumerate(missing.columns):
        plt.subplot(1,2,i+1)  # make the (i+1)-th subplot the current one
        plt.title(f'Missing values on each columns({col})')
        sns.barplot(x=missing[col], y=missing.index)  # x and y passed by keyword
        mean = np.mean(missing[col])
        std = np.std(missing[col])
        plt.ylabel('Columns')
        # an empty, invisible line whose only purpose is to carry the legend text
        plt.plot([], [], ' ', label=f'Average {col} of missing values: {mean:.2f} \u00B1 {std:.2f}')
        plt.legend()
    plt.show()
    return missing.sort_values(by='number', ascending=False)
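
The second approach creates the subplots with plt.subplots and then selects each one again with plt.subplot. As a possible variation (not the notebook's own code), the Axes objects returned by plt.subplots can be used directly. The sketch below assumes a made-up `missing` table and the Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical summary table standing in for `missing`
missing = pd.DataFrame(
    {"number": [3, 0, 5], "proportion": [0.03, 0.0, 0.05]},
    index=["bin_0", "nom_1", "ord_2"],
)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))  # axes is an array holding two Axes
for ax, col in zip(axes, missing.columns):
    sns.barplot(x=missing[col], y=missing.index, ax=ax)  # draw on this Axes explicitly
    values = missing[col].to_numpy()
    mean, std = np.mean(values), np.std(values)
    ax.set_title(f"Missing values on each column ({col})")
    ax.set_ylabel("Columns")
    # Empty, invisible line used only to place the summary text in the legend
    ax.plot([], [], " ", label=f"Average {col} of missing values: {mean:.2f} \u00B1 {std:.2f}")
    ax.legend()

plt.close(fig)
```

Working on `ax` rather than on the pyplot "current subplot" keeps each call tied to a specific subplot, which tends to be easier to follow once a figure has several panels.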

Analysis of the plots:

train data

test data

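From the two figures above we can see that, for both the training data and the test data, the proportion of missing values in each feature is very small, all in the 0.0x range.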
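
The proportions read off the plots can also be checked numerically. The sketch below builds the same kind of summary table with vectorized pandas calls instead of list comprehensions. The toy DataFrame `df` and its column names are made up for illustration; with the real data, `df` would come from pd.read_csv("train.csv").

```python
import pandas as pd

# Hypothetical toy data; None marks a missing cell
df = pd.DataFrame({
    "bin_0": [0, 1, None, 1],
    "nom_1": ["a", "b", "c", "d"],
    "ord_2": [None, None, 2.0, 3.0],
})

missing = pd.DataFrame({
    "number": df.isnull().sum(),       # missing cells per column
    "proportion": df.isnull().mean(),  # mean of a boolean mask = fraction missing
})

print(missing.sort_values(by="number", ascending=False))
```

Here ord_2 has 2 missing values out of 4 rows (proportion 0.5), bin_0 has 1 (0.25), and nom_1 has none.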


Origin www.cnblogs.com/AI-Creator/p/12405638.html