Pandas basic attributes and statistical methods

The previous article introduced the pandas data structures, the most important of which is DataFrame.

Basic properties and methods of Series

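
As a rough illustration of the basic Series attributes, here is a minimal sketch. The data and the name `s` are made up for this example and do not come from the original article.

```python
import pandas as pd

# Hypothetical example data (not from the original article)
s = pd.Series([18, 16, 17], index=['A', 'B', 'C'], name='age')

print(s.index)    # row labels
print(s.dtype)    # data type of the elements
print(s.shape)    # (3,)
print(s.size)     # 3
print(s.ndim)     # 1, a Series is one-dimensional
print(s.empty)    # False, the Series has elements
print(s.values)   # underlying NumPy array
print(s.head(2))  # first two elements
```

Most of these attributes have a direct DataFrame counterpart, which the next section walks through.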

Basic properties and methods of DataFrame

Code example

import pandas as pd

# Columns: 名字 (name), 年龄 (age); the last row has no age, so it becomes NaN
df = pd.DataFrame([{'名字': '小明', '年龄': 18},
                   {'名字': '小亮', '年龄': 16},
                   {'名字': '小红', '年龄': 17},
                   {'名字': '小黑'}],
                  index=['A', 'B', 'C', 'D'])
df
=====Output=====
   名字    年龄
A  小明  18.0
B  小亮  16.0
C  小红  17.0
D  小黑   NaN

df.T  # transpose rows and columns
=====Output=====
     A   B   C    D
名字  小明  小亮  小红   小黑
年龄  18  16  17  NaN

df.axes  # list of the row axis labels and the column axis labels
=====Output=====
[Index(['A', 'B', 'C', 'D'], dtype='object'),
 Index(['名字', '年龄'], dtype='object')]

df.dtypes  # data type of each column
=====Output=====
名字     object
年龄    float64
dtype: object

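df.empty  # whether the object is empty
=====Output=====
False  # returns True if the object is empty

df.ndim  # number of dimensions of the data
=====Output=====
2  # a DataFrame is a two-dimensional data structure

df.shape  # dimensions of the object as (rows, columns)
=====Output=====
(4, 2)  # 4 rows, 2 columns

df.size  # number of elements in the DataFrame
=====Output=====
8  # 8 elements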
 
df.values  # return the actual data in the DataFrame as a NumPy ndarray
=====Output=====
array([['小明', 18.0],
       ['小亮', 16.0],
       ['小红', 17.0],
       ['小黑', nan]], dtype=object)

df.head(), df.tail()  # return the first / last rows; the number of rows can be specified and defaults to 5
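
To make the default of 5 visible, here is a small sketch with a longer frame. The name `df_demo` and its contents are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical data: ten rows, so the default n=5 is visible
df_demo = pd.DataFrame({'x': range(10)})

first_default = df_demo.head()   # no argument: first 5 rows
last_three = df_demo.tail(3)     # explicit n: last 3 rows

print(first_default)
print(last_three)
```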

Statistical functions

The following are the more important statistical functions.
[Figure: table of statistical functions]
Note: Because a DataFrame is a heterogeneous data structure, not every operation applies to every column. Some functions, such as sum() and cumsum(), do not raise errors on string data, while others, such as abs(), do.
Because a DataFrame is two-dimensional, the same function gives different results depending on the axis it is applied along. Take sum() as an example below.

import pandas as pd

df = pd.DataFrame([{'名字': '小明', '年龄': 18},
                   {'名字': '小亮', '年龄': 16},
                   {'名字': '小红', '年龄': 17},
                   {'名字': '小黑'}],
                  index=['A', 'B', 'C', 'D'])
                   
df.sum(0)  # axis=0 is the default: sum down each column
=====Output=====
名字    小明小亮小红小黑  # strings are simply concatenated
年龄          51
dtype: object

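df.sum(1, numeric_only=True)  # axis=1: sum across each row; numeric_only=True makes skipping the string column explicit
=====Output=====
A    18.0
B    16.0
C    17.0
D     0.0
dtype: float64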
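
The note about cumsum() and abs() can be checked with a short sketch. The two Series below are hypothetical. `dtype=object` is set explicitly on the assumption that this mirrors how the article's string column is stored.

```python
import pandas as pd

# Hypothetical columns; dtype=object is an assumption matching the article's output
names = pd.Series(['小明', '小亮'], dtype=object)
ages = pd.Series([18.0, -16.0])

name_cumsum = names.cumsum()   # strings are concatenated cumulatively
age_abs = ages.abs()           # abs() works on numeric data

print(name_cumsum)
print(age_abs)

# abs() has no meaning for strings, so it raises TypeError
try:
    names.abs()
except TypeError as exc:
    abs_error = exc
    print('abs() failed on strings:', exc)
```

This is why it can be safer to select the numeric columns first before applying a purely numeric function.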

Summary method

In addition to the methods above, there is a summary method: describe() computes summary statistics for the columns of a DataFrame. Its include parameter specifies which columns are summarized. It takes a list of values and defaults to numeric columns ("number").

object    summarizes string columns
number    summarizes numeric columns
all       summarizes all columns together (pass it as the plain string 'all', not as a list value)
df.describe()
=====Output=====
         年龄
count   3.0
mean   17.0
std     1.0
min    16.0
25%    16.5
50%    17.0
75%    17.5
max    18.0

df.describe(include=['object'])
=====Output=====
        名字
count    4
unique   4
top     小红
freq     1

df.describe(include='all')  # combines the two outputs above into one table
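
Since no output is shown for include='all', the following sketch shows roughly what to expect. It uses a hypothetical frame `df_demo` with English column names rather than the article's data.

```python
import pandas as pd

# Hypothetical stand-in for the article's frame: one string column, one numeric column
df_demo = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                        'age': [18.0, 16.0, 17.0, None]})

numeric_summary = df_demo.describe(include=['number'])  # numeric columns only
full_summary = df_demo.describe(include='all')          # string and numeric rows together

print(numeric_summary)
print(full_summary)
```

In the combined table, statistics that do not apply to a column (for example mean for the string column) show up as NaN.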

Unique values and counting by value

Two methods are covered here: unique() and value_counts(). Both are applied to a single column (a Series) selected from the DataFrame, rather than to the DataFrame as a whole.

import pandas as pd

df = pd.DataFrame([{'名字': '小明', '年龄': 18},
                   {'名字': '小亮', '年龄': 16},
                   {'名字': '小红', '年龄': 17},
                   {'名字': '小明', '年龄': 16},
                   {'名字': '小红', '年龄': 18},
                   {'名字': '小黑'}],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])  # one index label per row
                   
df['名字'].unique()  # distinct values
=====Output=====
array(['小明', '小亮', '小红', '小黑'], dtype=object)

df['名字'].value_counts()  # count occurrences, sorted in descending order
=====Output=====
小明    2
小红    2
小亮    1
小黑    1
Name: 名字, dtype: int64
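
The point that these methods belong to a column rather than to the whole DataFrame can be sketched as follows. The frame `df_demo` and its English values are hypothetical, and the `hasattr` check reflects the pandas versions I am aware of rather than a guarantee.

```python
import pandas as pd

# Hypothetical frame with one repeated name
df_demo = pd.DataFrame({'name': ['Ming', 'Liang', 'Hong', 'Ming'],
                        'age': [18, 16, 17, 16]})

name_col = df_demo['name']           # selecting one column gives a Series
distinct = list(name_col.unique())   # distinct values, in order of first appearance
counts = name_col.value_counts()     # occurrences per value, most frequent first

print(distinct)
print(counts)
print(hasattr(df_demo, 'unique'))    # the DataFrame itself exposes no unique() method
```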

Correlation coefficient and covariance

  1. Covariance: measures how strongly two variables move in the same or in opposite directions. If the covariance is positive, X and Y change in the same direction, and the larger the covariance, the stronger that tendency. If the covariance is negative, X and Y change in opposite directions, and the smaller (more negative) the covariance, the stronger that tendency. When Cov(X,Y) = 0, the two are linearly uncorrelated.
  2. Correlation coefficient: measures how similarly two variables change. When the correlation coefficient is 1, the two variables are most similar in the positive direction; when it is -1, they are most similar in the opposite direction.
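
The two definitions above can be checked by hand against the pandas methods. This is a minimal sketch with made-up numbers, assuming the sample (n - 1) form of covariance and the Pearson correlation, which are the pandas defaults as far as I know.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only for illustration
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Sample covariance: sum of mean-centered products divided by n - 1
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Pearson correlation: covariance scaled by both sample standard deviations
corr_manual = cov_manual / (x.std() * y.std())

print(cov_manual, x.cov(y))    # the two values should agree
print(corr_manual, x.corr(y))  # the two values should agree
```

Dividing by the standard deviations is what confines the correlation coefficient to the range from -1 to 1, whereas the covariance depends on the units of the data.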

The DataFrame above is too small to illustrate these two functions, so I use basketball game scores as an example instead. If you are interested in the data source, you can crawl it from the website.

# Columns: 命中 (field goals made), 投篮数 (field goal attempts), 投篮命中率 (field goal percentage),
# 3分命中率 (three-point percentage), 篮板 (rebounds), 助攻 (assists), 得分 (points)
df.corr()
=====Output=====
                 命中    投篮数  投篮命中率  3分命中率      篮板      助攻      得分
命中         1.000000  0.634690  0.839126  0.798088 -0.150331 -0.100641  0.867068
投篮数       0.634690  1.000000  0.123891  0.294384 -0.001462 -0.233715  0.675462
投篮命中率   0.839126  0.123891  1.000000  0.829982 -0.204990  0.052883  0.650011
3分命中率    0.798088  0.294384  0.829982  1.000000 -0.092656 -0.061292  0.776918
篮板        -0.150331 -0.001462 -0.204990 -0.092656  1.000000 -0.132013 -0.086104
助攻        -0.100641 -0.233715  0.052883 -0.061292 -0.132013  1.000000 -0.131812
得分         0.867068  0.675462  0.650011  0.776918 -0.086104 -0.131812  1.000000

df.cov()
=====Output=====
                  命中     投篮数  投篮命中率  3分命中率       篮板       助攻       得分
命中          9.083333   6.408333   0.263117   0.370050  -1.233333  -0.983333  22.800000
投篮数        6.408333  11.223333   0.043182   0.151727  -0.013333  -2.538333  19.743333
投篮命中率    0.263117   0.043182   0.010824   0.013285  -0.058055   0.017837   0.590035
3分命中率     0.370050   0.151727   0.013285   0.023669  -0.038803  -0.030570   1.042848
篮板         -1.233333  -0.013333  -0.058055  -0.038803   7.410000  -1.165000  -2.045000
助攻         -0.983333  -2.538333   0.017837  -0.030570  -1.165000  10.510000  -3.728333
得分         22.800000  19.743333   0.590035   1.042848  -2.045000  -3.728333  76.123333

# We can also look at the relationship between two individual columns
df['得分'].corr(df['命中'])
Out[7]: 0.8670683274541471
df['得分'].corr(df['助攻'])
Out[8]: -0.13181185657005592
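
The basketball data itself is not reproduced in this article, so the calls above cannot be run as written. As a runnable stand-in, here is a sketch on made-up numbers. The frame `stats`, its column names and its values are all hypothetical. It shows that a pairwise Series.corr() call returns the same value as the matching cell of the DataFrame.corr() matrix.

```python
import pandas as pd

# Made-up numbers standing in for the basketball data, which is not included here
stats = pd.DataFrame({
    'made': [10, 8, 12, 7, 9],
    'assists': [5, 7, 4, 8, 6],
    'points': [27, 22, 31, 19, 24],
})

corr_matrix = stats.corr()                   # all pairwise correlations at once
pair = stats['points'].corr(stats['made'])   # one pair of columns

print(corr_matrix)
print(pair)
```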


Origin blog.csdn.net/qq_44091773/article/details/105879430