The previous article introduced the pandas data structures, the most important of which is DataFrame.
Article title
Basic properties and methods of Series
Basic properties and methods of DataFrame
Code example
import pandas as pd

df = pd.DataFrame(
    [
        {'名字': '小明', '年龄': 18},
        {'名字': '小亮', '年龄': 16},
        {'名字': '小红', '年龄': 17},
        {'名字': '小黑'},  # no age given, so it becomes NaN
    ],
    index=['A', 'B', 'C', 'D'],
)
df
=====Output=====
名字 年龄
A 小明 18.0
B 小亮 16.0
C 小红 17.0
D 小黑 NaN
df.T # transpose rows and columns
=====Output=====
A B C D
名字 小明 小亮 小红 小黑
年龄 18 16 17 NaN
df.axes # returns a list holding the row-axis labels and the column-axis labels
=====Output=====
[Index(['A', 'B', 'C', 'D'], dtype='object'),
Index(['名字', '年龄'], dtype='object')]
df.dtypes # returns the data type of each column
=====Output=====
名字 object
年龄 float64
dtype: object
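df.empty # indicates whether the object is empty
=====Output=====
False # True would mean the object is empty
df.ndim # number of dimensions of the data
=====Output=====
2 # a DataFrame is a two-dimensional data structure
df.shape # size of each dimension, as (rows, columns)
=====Output=====
(4, 2) # 4 rows, 2 columns
df.size # returns the number of elements in the DataFrame
=====Output=====
8 # 8 elements
df.values # returns the actual data in the DataFrame as a NumPy ndarray
=====Output=====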
array([['小明', 18.0],
['小亮', 16.0],
['小红', 17.0],
['小黑', nan]], dtype=object)
df.head(), df.tail() # return the first / last rows; the row count can be passed as an argument and defaults to 5
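No output is shown for head() and tail() above, so here is a minimal sketch, assuming the same four-row DataFrame. The variable names first_two, last_one and default_head are only illustrative.

```python
import pandas as pd

# Same data as the example above; the last row has no age, so it becomes NaN.
df = pd.DataFrame(
    [
        {'名字': '小明', '年龄': 18},
        {'名字': '小亮', '年龄': 16},
        {'名字': '小红', '年龄': 17},
        {'名字': '小黑'},
    ],
    index=['A', 'B', 'C', 'D'],
)

first_two = df.head(2)     # rows A and B
last_one = df.tail(1)      # row D
default_head = df.head()   # default is 5 rows, so all 4 rows come back

print(first_two)
print(last_one)
print(len(default_head))
```

Because the frame has fewer than five rows, the default call simply returns the whole DataFrame.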
Statistical functions
The following are the more important statistical functions.
Note: Because a DataFrame is a heterogeneous data structure, not every operation applies to every column. Some functions, such as sum() and cumsum(), run on string data without raising an error, while others, such as abs(), raise an error.
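The note above can be checked with a small sketch. It assumes a string column stored with object dtype (set explicitly here so the result does not depend on the pandas version's default string dtype); the variable names are illustrative.

```python
import pandas as pd

# A string column with explicit object dtype (an assumption for this sketch).
names = pd.Series(['小明', '小亮', '小红'], dtype=object)

total = names.sum()        # strings are concatenated
running = names.cumsum()   # running concatenation, one value per row

# abs() has no meaning for strings, so it raises TypeError.
try:
    names.abs()
    abs_failed = False
except TypeError:
    abs_failed = True

print(total)
print(list(running))
print(abs_failed)
```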
Because a DataFrame is two-dimensional, the same function gives different results depending on the axis it is applied along. Take sum() as an example below.
import pandas as pd

df = pd.DataFrame(
    [
        {'名字': '小明', '年龄': 18},
        {'名字': '小亮', '年龄': 16},
        {'名字': '小红', '年龄': 17},
        {'名字': '小黑'},
    ],
    index=['A', 'B', 'C', 'D'],
)
df.sum(0) # axis defaults to 0, i.e. the operation runs down each column
=====Output=====
名字 小明小亮小红小黑 # strings are simply concatenated
年龄 51
dtype: object
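df.sum(axis=1, numeric_only=True) # sums across each row, skipping the string column (older pandas versions skipped it automatically; newer ones need numeric_only=True)
=====Output=====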
A 18.0
B 16.0
C 17.0
D 0.0
dtype: float64
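To see the effect of the axis argument without string columns getting in the way, here is a minimal sketch on an all-numeric frame. The scores data and column names are hypothetical and not part of the original example.

```python
import pandas as pd

# Hypothetical all-numeric data, used only to illustrate the axis argument.
scores = pd.DataFrame(
    {'math': [90, 80, 70], 'english': [85, 75, 95]},
    index=['A', 'B', 'C'],
)

per_column = scores.sum(axis=0)  # one total per column
per_row = scores.sum(axis=1)     # one total per row

print(per_column)
print(per_row)
```

axis=0 collapses the rows and leaves one value per column; axis=1 collapses the columns and leaves one value per row.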
Aggregation method
In addition to the methods above, there is a summary method: describe() computes summary statistics for the columns of a DataFrame. Its include parameter specifies which columns are taken into account. It accepts a list of dtypes; by default only numeric columns are summarized.
object | number | all |
---|---|---|
Summarizes string columns | Summarizes numeric columns | Summarizes all columns together (pass it as the string 'all', not as a list) |
df.describe()
=====Output=====
年龄
count 3.0
mean 17.0
std 1.0
min 16.0
25% 16.5
50% 17.0
75% 17.5
max 18.0
df.describe(include=['object'])
=====Output=====
名字
count 4
unique 4
top 小红
freq 1
df.describe(include='all') # combines the two outputs above into one table
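The three values of include can be compared side by side. This is a minimal sketch, assuming the name column is stored with object dtype (set explicitly here, since the default string dtype can differ between pandas versions); the variable names are illustrative.

```python
import pandas as pd

# Same values as the article's DataFrame; object dtype is set explicitly (assumption).
df = pd.DataFrame({
    '名字': pd.Series(['小明', '小亮', '小红', '小黑'], dtype=object),
    '年龄': [18, 16, 17, None],  # None becomes NaN, so the column is float64
})

numeric = df.describe()                      # default: numeric columns only
text = df.describe(include=['object'])       # string (object) columns only
everything = df.describe(include='all')      # both kinds in one table

print(numeric)
print(text)
print(everything)
```

In the combined table, statistics that do not apply to a column (for example mean for the name column) show up as NaN.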
Unique deduplication and count by value
This section covers two methods, unique() and value_counts(). Both are called here on a single column (a Series) rather than on the DataFrame itself.
import pandas as pd

df = pd.DataFrame(
    [
        {'名字': '小明', '年龄': 18},
        {'名字': '小亮', '年龄': 16},
        {'名字': '小红', '年龄': 17},
        {'名字': '小明', '年龄': 16},
        {'名字': '小红', '年龄': 18},
        {'名字': '小黑'},
    ],
    index=['A', 'B', 'C', 'D', 'E', 'F'],  # six rows, so six index labels are needed
)
df['名字'].unique() # returns the distinct values
=====Output=====
array(['小明', '小亮', '小红', '小黑'], dtype=object)
df['名字'].value_counts() # counts how often each value occurs, sorted in descending order
=====Output=====
小明 2
小红 2
小亮 1
小黑 1
Name: 名字, dtype: int64
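The same two methods behave slightly differently when a column contains missing values. The sketch below uses a standalone Series holding the ages from the DataFrame above (the variable names are illustrative): unique() keeps NaN as a distinct value, while value_counts() leaves it out by default.

```python
import pandas as pd

# Ages from the example above; the missing age becomes NaN.
ages = pd.Series([18, 16, 17, 16, 18, None])

distinct = ages.unique()      # NaN is included as one of the distinct values
counts = ages.value_counts()  # NaN is dropped by default (dropna=True)

print(distinct)
print(counts)
```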
Correlation coefficient and covariance
- Covariance: measures whether two variables move in the same or in opposite directions, and how strongly. If the covariance is positive, X and Y change in the same direction, and the larger the covariance, the stronger that tendency. If the covariance is negative, X and Y move in opposite directions, and the smaller (more negative) the covariance, the stronger that tendency. When Cov(X, Y) = 0, the two are linearly uncorrelated.
- Correlation coefficient: measures how similarly two variables change. A coefficient of 1 means the two variables change in the same direction with maximum similarity; a coefficient of -1 means they change in opposite directions with maximum similarity.
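The two definitions above can be sketched on toy data. The series x, y and z below are hypothetical: y rises with x and z falls as x rises. The manual formulas use the sample versions (dividing by n - 1), which matches the pandas defaults.

```python
import pandas as pd

# Hypothetical toy data: y moves with x, z moves against x.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])
z = pd.Series([10.0, 8.0, 6.0, 4.0, 2.0])

# Sample covariance computed by hand, then via pandas.
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
cov_xy = x.cov(y)
cov_xz = x.cov(z)

# Correlation is covariance scaled by both standard deviations.
manual_corr = manual_cov / (x.std() * y.std())
corr_xy = x.corr(y)
corr_xz = x.corr(z)

print(cov_xy, cov_xz)
print(corr_xy, corr_xz)
```

The covariance changes with the scale of the data, whereas the correlation coefficient always stays between -1 and 1, which makes it easier to compare across pairs of columns.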
The DataFrame above is too small to illustrate these two functions, so I use basketball game statistics as the example instead. If you are interested in the data source, you can scrape it from the website.
df.corr()
=====Output=====
命中 投篮数 投篮命中率 3分命中率 篮板 助攻 得分
命中 1.000000 0.634690 0.839126 0.798088 -0.150331 -0.100641 0.867068
投篮数 0.634690 1.000000 0.123891 0.294384 -0.001462 -0.233715 0.675462
投篮命中率 0.839126 0.123891 1.000000 0.829982 -0.204990 0.052883 0.650011
3分命中率 0.798088 0.294384 0.829982 1.000000 -0.092656 -0.061292 0.776918
篮板 -0.150331 -0.001462 -0.204990 -0.092656 1.000000 -0.132013 -0.086104
助攻 -0.100641 -0.233715 0.052883 -0.061292 -0.132013 1.000000 -0.131812
得分 0.867068 0.675462 0.650011 0.776918 -0.086104 -0.131812 1.000000
df.cov()
=====Output=====
命中 投篮数 投篮命中率 3分命中率 篮板 助攻 得分
命中 9.083333 6.408333 0.263117 0.370050 -1.233333 -0.983333 22.800000
投篮数 6.408333 11.223333 0.043182 0.151727 -0.013333 -2.538333 19.743333
投篮命中率 0.263117 0.043182 0.010824 0.013285 -0.058055 0.017837 0.590035
3分命中率 0.370050 0.151727 0.013285 0.023669 -0.038803 -0.030570 1.042848
篮板 -1.233333 -0.013333 -0.058055 -0.038803 7.410000 -1.165000 -2.045000
助攻 -0.983333 -2.538333 0.017837 -0.030570 -1.165000 10.510000 -3.728333
得分 22.800000 19.743333 0.590035 1.042848 -2.045000 -3.728333 76.123333
# We can also look at the relationship between two individual factors
df['得分'].corr(df['命中'])
Out[7]: 0.8670683274541471
df['得分'].corr(df['助攻'])
Out[8]: -0.13181185657005592
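The basketball data itself is not included in the article, so the sketch below uses a small hypothetical table (the column names and numbers are made up for illustration) to show that a pairwise Series.corr() call returns the same value as the matching cell of the DataFrame.corr() matrix.

```python
import pandas as pd

# Hypothetical game statistics; not the data used in the article.
games = pd.DataFrame({
    'shots_made': [5, 8, 3, 10, 6],
    'points': [14, 21, 9, 27, 15],
    'assists': [4, 2, 7, 1, 5],
})

matrix = games.corr()                               # all pairs at once
pairwise = games['points'].corr(games['shots_made'])  # one pair only

print(matrix)
print(pairwise)
```

The matrix is symmetric with ones on the diagonal, so each pair only needs to be read once.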