Introduction to pandas: pandas is a NumPy-based library built for data analysis tasks. It builds on several other libraries and provides standard data models, along with the tools needed to manipulate large structured datasets efficiently.
1. Core data structure
1.1 Series object
A Series can be understood as a one-dimensional array whose index labels can be customized. It resembles a fixed-length ordered dictionary that maps each index label to a value.
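The dictionary comparison can be made concrete with a minimal sketch. The names and scores are made-up sample data, not taken from the text; the point is only that labels behave like dictionary keys and can be replaced.

```python
import pandas as pd

# Hypothetical scores keyed by name; a dict's keys become the index labels.
scores = pd.Series({'zs': 23, 'ls': 45, 'ww': 12})

print('ls' in scores)        # membership tests look at the index, like dict keys
print(scores['ls'])          # lookup by label, like dict[key]
print(list(scores.index))    # the labels, in insertion order

# Unlike the keys of a plain dict, the labels can be replaced as a whole.
scores.index = ['a', 'b', 'c']
print(scores)
```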
1.1.1 Creation of Series objects
import pandas as pd
import numpy as np
# 1. Create a Series: an empty Series
s1 = pd.Series()
print(s1, type(s1), s1.dtype, s1.ndim)
# 2. Create a Series from an ndarray (or another container; with a dict, the keys become the index)
ary1 = np.array([23, 45, 12, 34, 56])
s2 = pd.Series(ary1)
print(s2)
Output result:
# 3. Specify the row index labels when creating a Series
ary1 = np.array([23, 45, 12, 34, 56])
s3 = pd.Series(ary1, index=['zs', 'ls', 'ww', 'll', 'tq'])
print(s3)
Output result:
# 5. Create a Series from a scalar
s5 = pd.Series(5, index=[0, 1, 2, 3])
print(s5)
Output result:
1.1.2 References to Series object elements
import numpy as np
import pandas as pd
s1 = pd.Series(np.array([78, 98, 67, 100, 76]), index=['lily', 'bob', 'jim', 'jack', 'mary'])
# Method 1: retrieve elements by position
print(s1[:3])  # returns a Series
print(s1.iloc[1])  # returns a single value; .iloc makes the positional lookup explicit
Output result:
# Method 2: retrieve data by label (several labels can be passed at once)
print(s1['lily'])  # returns a single value
print(s1[['bob', 'jim', 'jack']])  # returns a Series
Output result:
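Slices behave differently depending on whether positions or labels are used. The sketch below reuses the sample Series from this section and relies on .iloc and .loc to make the two cases explicit: a positional slice excludes its stop, while a label slice includes it.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.array([78, 98, 67, 100, 76]),
               index=['lily', 'bob', 'jim', 'jack', 'mary'])

by_position = s1.iloc[1:3]        # positions 1 and 2; the stop position is excluded
by_label = s1.loc['bob':'jack']   # 'bob' through 'jack'; the stop label is included

print(by_position)
print(by_label)
```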
1.2 Date type
datetime64[ns]: date type
timedelta64[ns]: time offset type
1.2.1 Date processing
Date string formats recognized by pandas:
import pandas as pd
# Convert a list of date strings into a Series
dates = pd.Series(['2021', '2011-02', '2011-03-02', '2011/04/01', '2011/5/1 01:01:01', '01 Jun 2011'])
print(dates)
Output result:
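# to_datetime() converts the strings to the datetime type
# (on pandas 2.0 and later, strings in mixed formats like these may need format='mixed')
dates = pd.to_datetime(dates)
print(dates, '\n', dates.dtype)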
Output result:
Data of the datetime type supports date arithmetic:
delta = dates - pd.to_datetime('1970-01-01')
print(delta, type(delta))
Note: the elements of the resulting Series are now of the timedelta type.
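A timedelta Series has its own .dt accessor. As a small sketch with two made-up dates in a single format, the offsets can be turned into whole days, and diff() gives the gap between consecutive dates.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2011-03-02', '2011-04-01']))
delta = dates - pd.to_datetime('1970-01-01')

print(delta.dtype)        # a timedelta64 dtype
days = delta.dt.days      # whole days in each offset, as integers
print(days)

gaps = dates.diff()       # difference between consecutive dates; the first entry is NaT
print(gaps)
```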
1.2.2 Date-related operations
Series.dt exposes date-related operations; for the full API reference, run help(DatetimeProperties).
import pandas as pd
from pandas.core.indexes.accessors import DatetimeProperties
dates = pd.Series(['2021', '2011-02', '2011-03-02', '2011/04/01', '2011/5/1 01:01:01', '01 Jun 2011'])
dates = pd.to_datetime(dates)
print(dates)
print("*" * 45)
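# Day of the month for each date
print(dates.dt.day)
print("*" * 45)
# Day of the week for each date (Monday=0)
print(dates.dt.dayofweek)
print("*" * 45)
# Seconds component of each date
print(dates.dt.second)
print(dates.dt.month)
# Week of the year for each date (dt.weekofyear was removed in pandas 2.0)
print(dates.dt.isocalendar().week)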
In addition to the above, Series.dt provides many other date-related attributes:
Series.dt.year The year of the datetime.
Series.dt.month The month as January=1, December=12.
Series.dt.day The days of the datetime.
Series.dt.hour The hours of the datetime.
Series.dt.minute The minutes of the datetime.
Series.dt.second The seconds of the datetime.
Series.dt.microsecond The microseconds of the datetime.
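Series.dt.week The week ordinal of the year (deprecated in pandas 1.1 and removed in 2.0; use Series.dt.isocalendar().week).
Series.dt.weekofyear The week ordinal of the year (deprecated in pandas 1.1 and removed in 2.0; use Series.dt.isocalendar().week).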
Series.dt.dayofweek The day of the week with Monday=0, Sunday=6.
Series.dt.weekday The day of the week with Monday=0, Sunday=6.
Series.dt.dayofyear The ordinal day of the year.
Series.dt.quarter The quarter of the date.
Series.dt.is_month_start Indicates whether the date is the first day of the month.
Series.dt.is_month_end Indicates whether the date is the last day of the month.
Series.dt.is_quarter_start Indicator for whether the date is the first day of a quarter.
Series.dt.is_quarter_end Indicator for whether the date is the last day of a quarter.
Series.dt.is_year_start Indicate whether the date is the first day of a year.
Series.dt.is_year_end Indicate whether the date is the last day of the year.
Series.dt.is_leap_year Boolean indicator if the date belongs to a leap year.
Series.dt.days_in_month The number of days in the month.
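Several of the attributes listed above can be collected into one table for comparison. This is a minimal sketch on three made-up dates; the iso_week column uses Series.dt.isocalendar(), which is not in the list above and is included here as an assumption about how to obtain the week number.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2011-02-28', '2012-02-29', '2011-12-31']))

summary = pd.DataFrame({
    'date': dates,
    'quarter': dates.dt.quarter,
    'days_in_month': dates.dt.days_in_month,
    'is_month_end': dates.dt.is_month_end,
    'is_year_end': dates.dt.is_year_end,
    'is_leap_year': dates.dt.is_leap_year,
    # ISO week number, used in place of the older week/weekofyear attributes
    'iso_week': dates.dt.isocalendar().week,
})
print(summary)
```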
1.3 DatetimeIndex
DatetimeIndex: a sequence of dates, created with the date_range() function by specifying the number of periods and the frequency. By default, the frequency is one day.
1.3.1 Detailed Explanation of date_range Parameters
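# date_range parameters explained
def date_range(
    start=None,  # first date of the generated range
    end=None,  # last date of the generated range
    periods=None,  # number of dates to generate
    freq=None,  # interval or frequency between generated dates
    tz=None,  # time zone
    normalize=False,
    name=None,
    closed=None,
    **kwargs,
) -> DatetimeIndex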
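The start, end, periods and freq parameters are usually combined a few at a time. A minimal sketch with arbitrarily chosen dates:

```python
import pandas as pd

# start + periods: five consecutive days (the default frequency is daily)
by_periods = pd.date_range(start='2023-05-17', periods=5)

# start + end: every day between the two dates, both endpoints included
by_end = pd.date_range(start='2023-05-17', end='2023-05-21')

# start + periods + freq: one date every 7 days
weekly = pd.date_range(start='2023-05-17', periods=3, freq='7D')

print(by_periods)
print(by_end)
print(weekly)
```

The first two calls describe the same five dates in different ways, which is why only some of the parameters need to be given.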
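1.3.2 Create DatetimeIndex
# freq="M" generates one date per month, at month end, so the first date is the last day of the start date's month
# (pandas 2.2 and later spell this alias "ME")
dates = pd.date_range('2023-5-17', periods=10, freq="M")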
print(dates, dates.dtype, type(dates))
Output result:
1.4 DataFrame
A DataFrame is a table-like data type that can be understood as a two-dimensional array. It has two indexes, one for rows and one for columns, and both can be changed.
Features: columns can hold different types; the size is mutable; both axes are labeled; arithmetic operations can be performed on rows and columns.
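These features can be seen together in a short sketch. The table below is made-up sample data, and the column names are placeholders chosen for the illustration.

```python
import pandas as pd

# Hypothetical marks table: columns of different dtypes, labeled rows and columns
df = pd.DataFrame({'name': ['zs', 'ls', 'ww'],
                   'chinese': [87, 67, 99],
                   'math': [76, 99, 100]},
                  index=['s1', 's2', 's3'])

print(df.dtypes)                                  # one dtype per column
df['total'] = df['chinese'] + df['math']          # column arithmetic; adding a column changes the size
col_means = df[['chinese', 'math']].mean(axis=0)  # one result per column
row_means = df[['chinese', 'math']].mean(axis=1)  # one result per row
print(df)
print(col_means)
print(row_means)
```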
1.4.1 Creation of DataFrame objects
(1) Create an empty object
# Create a DataFrame (1): empty
df1 = pd.DataFrame()
print(df1, type(df1))
(2) Create a DataFrame object using a one-dimensional array
# Create a DataFrame (2): from a one-dimensional array
data = [1, 2, 3, 4, 5]
df2 = pd.DataFrame(data)
print(df2)
(3) Create a DataFrame object using a two-dimensional array
# Create a DataFrame (3): from a two-dimensional array
data1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(3, 3)
df3 = pd.DataFrame(data1)
print(df3)
(4) Set the row labels (index) and column labels (columns)
# Set row and column index labels
data2 = np.array([[87, 76], [67, 99], [99, 100]])
df4 = pd.DataFrame(data2, index=['zs', 'ls', 'ww'], columns=['Chinese', 'Math'])
print(df4)
(5) Create a DataFrame object from dictionaries
# A list of dicts: each dict is one row, and missing keys become NaN
data3 = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4, 'c': 9}]
print(pd.DataFrame(data3))
# A dict of lists: each key is one column
data4 = {'Name': ['tom', 'jack', 'jim', 'bob'], 'Age': [23, 24, 21, 22]}
print(pd.DataFrame(data4))
(6) Get a row or a column of data directly through an index label or a positional index
data4 = {'Name': ['tom', 'jack', 'jim', 'bob'], 'Age': [23, 24, 21, 22]}
df5 = pd.DataFrame(data4)
print(df5['Name'])  # get the 'Name' column by its column label
2. Core data structure operation
2.1 Column operation
2.1.1 Column access
import numpy as np
import pandas as pd
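df = pd.DataFrame({
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    # the index must be an ordered list; a set has no defined order
    'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])})
# Column access
print(df['one'], '--> access one column')
print(df[['one', 'two']], '--> access several columns')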
Output result:
2.1.2 Column addition
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])})
# Column addition
df['three'] = pd.Series([2, 3, 5, 6], index=['a', 'b', 'c', 'd'])
# df['six'] = pd.Series([2, 3, 5, 6])  # without an index, the default labels 0, 1, 2, 3 do not match a, b, c, d, so every value is NaN
df['seven'] = pd.Series([2, 3, 5, 6], index=df.index)
df['four'] = [12, 3, 4, 5]
df['five'] = np.array([1, 4, 6, 8])
print(df)
Output result:
Note: A Series is aligned on its index labels when it is added as a column, so give it an index that matches the DataFrame's. Otherwise the default labels 0, 1, 2, 3 do not match a, b, c, d, and every value in the new column is NaN.
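The alignment rule in the note can be checked with a minimal sketch. The column names here are placeholders chosen for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

# The default index 0..3 shares no label with a..d, so nothing lines up.
df['unaligned'] = pd.Series([2, 3, 5, 6])

# Reusing df.index makes every label match.
df['aligned'] = pd.Series([2, 3, 5, 6], index=df.index)

# A plain list has no labels, so it is assigned by position.
df['by_position'] = [2, 3, 5, 6]
print(df)
```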
2.1.3 Column deletion
There are two common ways to delete a column:
Method 1: use the pop method provided by the DataFrame class.
Method 2: use the del statement with the column label.
df.pop('seven')
print(df, '--> column seven deleted')
del df['five']
print(df, '--> column five deleted')
Output result:
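The two methods differ in what they hand back. The following sketch, on a small made-up frame, also shows drop(columns=...), which is not mentioned above and is added here as a third option that leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]})

popped = df.pop('three')      # removes the column in place and returns it as a Series
print(popped)

del df['two']                 # removes the column in place and returns nothing
print(list(df.columns))

# drop(columns=...) returns a new DataFrame and does not modify df
trimmed = df.drop(columns=['one'])
print(list(df.columns), list(trimmed.columns))
```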
2.2 Row operation
2.2.1 Row access
(1) Access method 1: using slices
import pandas as pd
name = pd.Series(['zs', 'ls', 'ww', 'tq'], index=['s1', 's2', 's3', 's4'])
age = pd.Series([23, 24, 21, 10], index=['s1', 's2', 's3', 's4'])
df = pd.DataFrame({'Name1': name, 'Age': age})
print(df)
print('*' * 45)
# Row access using a slice
print(df[0:1])  # access row 0
Output result:
(2) Access method 2: the loc method, which selects and slices by index label
import pandas as pd
name = pd.Series(['zs', 'ls', 'ww', 'tq'], index=['s1', 's2', 's3', 's4'])
age = pd.Series([23, 24, 21, 10], index=['s1', 's2', 's3', 's4'])
df = pd.DataFrame({
'Name1': name, 'Age': age})
print(df.loc['s1'])
print('*' * 45)
print(df.loc[['s1', 's2']])
Output results:
(3) Access method 3: the iloc method. Unlike loc, iloc accepts only integer positions for the row and column indexes.
import pandas as pd
name = pd.Series(['zs', 'ls', 'ww', 'tq'], index=['s1', 's2', 's3', 's4'])
age = pd.Series([23, 24, 21, 10], index=['s1', 's2', 's3', 's4'])
df = pd.DataFrame({
'Name1': name, 'Age': age})
print(df.iloc[2], '--> row 2')
print(df.iloc[[2, 3]], '--> rows 2 and 3')
print(df.iloc[1, 1], '--> row 1, column 1')
Output result:
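To contrast the two accessors, the sketch below reads the same cell once by label and once by position, then shows that loc rejects an integer when the row labels are strings. The frame is the one used in this section.

```python
import pandas as pd

df = pd.DataFrame({'Name1': ['zs', 'ls', 'ww', 'tq'], 'Age': [23, 24, 21, 10]},
                  index=['s1', 's2', 's3', 's4'])

by_label = df.loc['s3', 'Age']     # row label, column label
by_position = df.iloc[2, 1]        # row position, column position
print(by_label, by_position)       # both refer to the same cell

# An integer is not a row label in this frame, so loc raises KeyError.
try:
    df.loc[2]
    loc_accepts_position = True
except KeyError:
    loc_accepts_position = False
print(loc_accepts_position)
```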
2.2.2 Row addition
import numpy as np
import pandas as pd
age = np.array([23, 45, 67, 89])
name = np.array(['lily', 'bob', 'jack', 'jim'])
df = pd.DataFrame({
'Age_info': age, 'Name_info': name})
print(df)
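# Combine df1 with df when the two DataFrames have the same column names
df1 = pd.DataFrame({
    'Age_info': pd.Series([34, 56]), 'Name_info': pd.Series(['kevin', 'Mary'])})
# print(df1)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
print(pd.concat([df, df1]))
# Combine df2 with df when the two DataFrames have different column names
df2 = pd.DataFrame({
    'sex_info': pd.Series(['W', 'M']), 'score_info': pd.Series([67.7, 89.5])})
# print(df2)
print(pd.concat([df, df2]))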
Output result:
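When rows from two frames are stacked, each frame keeps its own row labels, so the result can contain repeated labels. A minimal sketch using pd.concat on shortened sample data; ignore_index=True is one way to renumber the rows.

```python
import pandas as pd

df = pd.DataFrame({'Age_info': [23, 45], 'Name_info': ['lily', 'bob']})
df1 = pd.DataFrame({'Age_info': [34, 56], 'Name_info': ['kevin', 'Mary']})

stacked = pd.concat([df, df1])                        # row labels 0, 1, 0, 1
renumbered = pd.concat([df, df1], ignore_index=True)  # row labels 0, 1, 2, 3

print(stacked)
print(renumbered)
```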
2.2.3 Row deletion
Deleting with drop: rows are removed from a DataFrame by index label (or by integer index when there are no labels). If a label is repeated, every row with that label is deleted.
Note: drop returns a new object and leaves the original unchanged.
import numpy as np
import pandas as pd
age = np.array([23, 45, 67, 89])
name = np.array(['lily', 'bob', 'jack', 'jim'])
df = pd.DataFrame({
'Age_info': age, 'Name_info': name}, index=['s1', 's2', 's3', 's4'])
print(df)
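# Delete rows from the DataFrame by index label (or by integer index when there are no labels). If a label is repeated, several rows are deleted
df1 = df.drop('s1')
print(df1, '--> row s1 deleted')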
Output result:
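Both points about drop, repeated labels and the unchanged original, can be verified with a small sketch. The duplicated label 's1' is introduced on purpose for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'Age_info': [23, 45, 67], 'Name_info': ['lily', 'bob', 'jack']},
                  index=['s1', 's2', 's1'])   # 's1' appears twice on purpose

remaining = df.drop('s1')      # both rows labeled 's1' are removed
print(remaining)
print(len(df))                 # the original still has all of its rows
```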
2.3 Value modification
(1) Method 1: Use loc to find the element to be modified
import numpy as np
import pandas as pd
age = np.array([23, 45, 67, 89])
name = np.array(['lily', 'bob', 'jack', 'jim'])
df = pd.DataFrame({
'Age_info': age, 'Name_info': name})
print(df)
df.loc[0, 'Age_info'] = 444
df.iloc[1, 0] = 555  # iloc takes integer positions only, not index labels
print(df)
Output results:
(2) SettingWithCopyWarning: "A value is trying to be set on a copy of a slice from a DataFrame". Cause and solution:
[1] Cause: the code tries to change a value in what may be a copy of part of the DataFrame (for example, a subset selected from it), so the change might not reach the original.
[2] Solution: select with loc in a single step so that the assignment acts on the DataFrame itself rather than on a copy.
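A minimal sketch of the pattern behind the warning and of the loc-based fix. The threshold of 50 is arbitrary, and the explicit .copy() at the end is an extra suggestion not taken from the text above.

```python
import pandas as pd

df = pd.DataFrame({'Age_info': [23, 45, 67, 89],
                   'Name_info': ['lily', 'bob', 'jack', 'jim']})

# Chained indexing, the pattern that typically triggers the warning:
#     df[df['Age_info'] > 50]['Age_info'] = 0
# The first [] may return a copy, so the assignment can be lost.

# A single .loc call selects rows and column in one step,
# so the assignment goes to df itself.
df.loc[df['Age_info'] > 50, 'Age_info'] = 0
print(df)

# When a separate subset is really wanted, an explicit copy makes that clear.
young = df[df['Age_info'] > 0].copy()
young['Age_info'] = young['Age_info'] + 1
print(young)
print(df)   # not affected by the edit to young
```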
2.4 Case
Boolean masks also work with DataFrames.
Case: change all 0 values in the score column to np.nan
import numpy as np
import pandas as pd
s1 = pd.Series(['ll', 'ww', 'zz', 'qq'])
s2 = pd.Series([78, 0, 45, 0])
df = pd.DataFrame({
'name': s1, 'score': s2})
print(df)
mask = df[df['score'] == 0].index  # use a boolean mask to find the index labels of rows whose score is 0
print(mask)
df.loc[mask, 'score'] = np.nan
print(df)
Output result:
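The same case can also be written with the boolean Series itself rather than the index labels it selects. This sketch uses Series.mask, which is a swapped-in alternative to the loc assignment shown in the case; it returns a new column and converts it to a float dtype so that it can hold NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['ll', 'ww', 'zz', 'qq'], 'score': [78, 0, 45, 0]})

is_zero = df['score'] == 0                        # boolean Series, True where score is 0
df['score'] = df['score'].mask(is_zero, np.nan)   # replace the values where the mask is True
print(df)
```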
2.5 Common properties of DataFrame
Example code:
import pandas as pd
data = {
'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
df['score']=pd.Series([90, 80, 70, 60], index=['s1','s2','s3','s4'])
print(df)
print(df.axes)
print(df['Age'].dtype)
print(df.empty)
print(df.ndim)
print(df.size)
print(df.values)
print(df.head(3))  # first three rows of df
print(df.tail(3))  # last three rows of df
Output result:
Name Age score
s1 Tom 28 90
s2 Jack 34 80
s3 Steve 29 70
s4 Ricky 42 60
[Index(['s1', 's2', 's3', 's4'], dtype='object'), Index(['Name', 'Age', 'score'], dtype='object')]
int64
False
2
12
[['Tom' 28 90]
['Jack' 34 80]
['Steve' 29 70]
['Ricky' 42 60]]
Name Age score
s1 Tom 28 90
s2 Jack 34 80
s3 Steve 29 70
Name Age score
s2 Jack 34 80
s3 Steve 29 70
s4 Ricky 42 60
3. Descriptive statistics
Descriptive statistics of numerical data mainly cover completeness (counts of non-missing values), minimum, maximum, median, mean, quartiles, range, standard deviation, variance, and covariance. Some commonly used statistical functions from the NumPy library can also be applied to DataFrames.
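Quartiles and range are named above but have no dedicated method call in the examples of this chapter. As a small sketch on a list of sample ages, quantile() gives the quartiles, the range is the maximum minus the minimum, and describe() bundles several of these statistics in one call.

```python
import pandas as pd

age = pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46, 46])

quartiles = age.quantile([0.25, 0.5, 0.75])   # lower quartile, median, upper quartile
value_range = age.max() - age.min()           # range = maximum - minimum
print(quartiles)
print(value_range)
print(age.describe())                         # count, mean, std, min, quartiles, max
```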
3.1 Common APIs
Example code:
import pandas as pd
# Create a Dictionary of series
d = {
'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack',
'Lee', 'David', 'Gasper', 'Betina', 'Andres', 'Andres']),
'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46, 46]),
'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65, 3.65]),
'Score': pd.Series([3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.80, 3.65, 4.23, 3.24, 3.98, 2.56, 2.56])}
s = pd.DataFrame({
'a': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 25, 26, 25, 23]),
'b': pd.Series([3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])})
# Create a DataFrame
df = pd.DataFrame(d)
print(df)
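# Numeric columns only: recent pandas versions no longer drop non-numeric columns automatically
num = df[['Age', 'Rating', 'Score']]
print(num.mean(0))  # mean; the argument is the axis
print(df.max())
print(num.prod())
print(num.median())  # median
print(df.count())  # number of non-missing values
print(df.value_counts())  # how many times each distinct row occurs
# print(num.cumprod(), "cumulative product")  # non-numeric columns must be removed first
print(num.std(), '------------------------------standard deviation')
print(num.cov(), '----covariance')
print(num.var(), '--------variance')
print(num.corr(), '-------corr')  # correlation coefficient between every pair of columns
print(num.corrwith(s['a']), ' - -----------------corrwith')  # correlation of each column with the given object; returns a Series
# print(df.describe())
# print(df.describe(include=['object']))
# print(df.describe(include=['number']))
Output result: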
Name Age Rating Score
0 Tom 25 4.23 3.20
1 James 26 3.24 4.60
2 Ricky 25 3.98 3.80
3 Vin 23 2.56 3.78
4 Steve 30 3.20 2.98
5 Minsu 29 4.60 4.80
6 Jack 23 3.80 4.80
7 Lee 34 3.78 3.65
8 David 40 2.98 4.23
9 Gasper 30 4.80 3.24
10 Betina 51 4.10 3.98
11 Andres 46 3.65 2.56
12 Andres 46 3.65 2.56
Age 32.923077
Rating 3.736154
Score 3.706154
dtype: float64
Name Vin
Age 51
Rating 4.8
Score 4.8
dtype: object
Age -3.964810e+18
Rating 2.306847e+07
Score 1.894188e+07
dtype: float64
Age 30.00
Rating 3.78
Score 3.78
dtype: float64
Name 13
Age 13
Rating 13
Score 13
dtype: int64
Name Age Rating Score
Andres 46 3.65 2.56 2
Betina 51 4.10 3.98 1
David 40 2.98 4.23 1
Gasper 30 4.80 3.24 1
Jack 23 3.80 4.80 1
James 26 3.24 4.60 1
Lee 34 3.78 3.65 1
Minsu 29 4.60 4.80 1
Ricky 25 3.98 3.80 1
Steve 30 3.20 2.98 1
Tom 25 4.23 3.20 1
Vin 23 2.56 3.78 1
dtype: int64
Age 9.673517
Rating 0.633989
Score 0.773892
dtype: float64 ------------------------------standard deviation
Age Rating Score
Age 93.576923 0.226346 -3.057821
Rating 0.226346 0.401942 0.004109
Score -3.057821 0.004109 0.598909 ----covariance
Age 93.576923
Rating 0.401942
Score 0.598909
dtype: float64 --------variance
Age Rating Score
Age 1.000000 0.036907 -0.408458
Rating 0.036907 1.000000 0.008375
Score -0.408458 0.008375 1.000000 -------corr
Age 0.775174
Rating 0.211911
Score -0.275430
dtype: float64 - -----------------corrwith
3.2 Data deduplication
(1) A DataFrame is de-duplicated with the drop_duplicates function. Its parameters are:
[1] Parameter 1: subset. By default, all columns are compared together to identify duplicates; with subset=[...], only the listed columns are compared.
[2] Parameter 2: keep. One of {'first', 'last', False}; the default is 'first', which keeps the first of the identified duplicates in index order and deletes the rest; 'last' keeps the last one instead; False deletes all duplicates.
[3] Parameter 3: inplace. When False, the original object is not modified and the result is returned as a new object; when True, the original object is modified.
Code example:
import pandas as pd
# Create a DataFrame from a list of dictionaries
data = [{'name': 'lily', 'age': 24, 'sex': 'M', 'score': 89.7},
        {'name': 'jack', 'age': 22, 'sex': 'M', 'score': 76.6},
        {'name': 'mary', 'age': 24, 'sex': 'W', 'score': 69.7},
        {'name': 'bob', 'age': 22, 'sex': 'M', 'score': 99.7},
        {'name': 'james', 'age': 25, 'sex': 'W', 'score': 91},
        {'name': 'lily', 'age': 24, 'sex': 'M', 'score': 89.7}]
df = pd.DataFrame(data)
print(df)
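# Remove duplicate data
# By default, all columns are compared, the first of each set of duplicates (in index order) is kept and the rest are deleted; the original data is not changed and the result is assigned to a new variable
df1 = df.drop_duplicates()  # does not modify the original data
print(df1)
df.drop_duplicates(subset=['age', 'sex'], inplace=True)  # modifies the original object; duplicates are identified on the 'age' and 'sex' columns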
print(df)
Output result:
name age sex score
0 lily 24 M 89.7
1 jack 22 M 76.6
2 mary 24 W 69.7
3 bob 22 M 99.7
4 james 25 W 91.0
5 lily 24 M 89.7
name age sex score
0 lily 24 M 89.7
1 jack 22 M 76.6
2 mary 24 W 69.7
3 bob 22 M 99.7
4 james 25 W 91.0
name age sex score
0 lily 24 M 89.7
1 jack 22 M 76.6
2 mary 24 W 69.7
4 james 25 W 91.0
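The three values of keep can be compared side by side on a tiny frame. This is a sketch on made-up rows; duplicated() is included as a way to inspect which rows would be dropped before anything is removed.

```python
import pandas as pd

df = pd.DataFrame({'name': ['lily', 'jack', 'lily'],
                   'age': [24, 22, 24]})

first = df.drop_duplicates()              # keeps row 0, drops row 2
last = df.drop_duplicates(keep='last')    # keeps row 2, drops row 0
none = df.drop_duplicates(keep=False)     # drops every row that has a duplicate
flags = df.duplicated()                   # True for rows that keep='first' would drop

print(first)
print(last)
print(none)
print(flags)
```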
3.3 Sorting
pandas has two kinds of sorting: by label and by actual value.
3.3.1 Sorting by label
With the sort_index() method, the rows of a DataFrame can be sorted by passing the axis parameter and the sort order. By default, row labels are sorted in ascending order.
(1) sort_index(): important parameters
axis parameter: the default value is 0, which sorts by row label (vertical); 1 sorts by column label (horizontal);
ascending parameter: the default value is True, which sorts in ascending order; False sorts in descending order;
inplace parameter: the default value is False, in which case a new variable is needed to receive the sorted object; when True, the original object is modified.
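The effect of the axis parameter is easiest to see on a frame whose row and column labels are both out of order. A minimal sketch with placeholder labels:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['r2', 'r1'], columns=['c', 'a', 'b'])

by_rows = df.sort_index()                               # axis=0: reorder rows by row label
by_cols = df.sort_index(axis=1)                         # axis=1: reorder columns by column label
by_cols_desc = df.sort_index(axis=1, ascending=False)   # column labels in descending order

print(by_rows)
print(by_cols)
print(by_cols_desc)
```

Because inplace is left at its default, df itself keeps its original order in all three calls.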
(2) Code example
import numpy as np
import pandas as pd
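# np.random.randn(10, 2) generates a two-dimensional array with 10 rows and 2 columns
df = pd.DataFrame(np.random.randn(10, 2), index=[8, 2, 4, 6, 1, 7, 0, 5, 3, 9], columns=['col1', 'col2'])
print(df)
# inplace defaults to False (the original object is not modified); True modifies the original object
df.sort_index(inplace=True, ascending=False)  # ascending=False sorts in descending order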
print(df)
Output result:
col1 col2
8 -0.670793 -0.037655
2 0.994857 -2.152398
4 1.304834 -0.292244
6 1.360664 1.097519
1 -0.336153 -0.289120
7 -1.964574 1.090914
0 -1.339923 -1.153182
5 -0.552900 0.279713
3 0.015910 -0.582301
9 -1.666869 0.146527
col1 col2
9 -1.666869 0.146527
8 -0.670793 -0.037655
7 -1.964574 1.090914
6 1.360664 1.097519
5 -0.552900 0.279713
4 1.304834 -0.292244
3 0.015910 -0.582301
2 0.994857 -2.152398
1 -0.336153 -0.289120
0 -1.339923 -1.153182
3.3.2 Sorting by actual value
With the sort_values() method, when sorting by multiple columns, the sort order can be specified separately for each column.
Code example:
import pandas as pd
# Create a Dictionary of series
d = {
'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack',
'Lee', 'David', 'Gasper', 'Betina', 'Andres', 'Andres']),
'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46, 46]),
'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65, 3.65]),
'Score': pd.Series([3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.80, 3.65, 4.23, 3.24, 3.98, 2.56, 2.56])}
df = pd.DataFrame(d)
print(df)
# Sort by Age first, then by Rating within equal Age values: Age ascending, Rating descending
df.sort_values(by=['Age', 'Rating'], ascending=[True, False], inplace=True)
print(df)
Output result:
Name Age Rating Score
0 Tom 25 4.23 3.20
1 James 26 3.24 4.60
2 Ricky 25 3.98 3.80
3 Vin 23 2.56 3.78
4 Steve 30 3.20 2.98
5 Minsu 29 4.60 4.80
6 Jack 23 3.80 4.80
7 Lee 34 3.78 3.65
8 David 40 2.98 4.23
9 Gasper 30 4.80 3.24
10 Betina 51 4.10 3.98
11 Andres 46 3.65 2.56
12 Andres 46 3.65 2.56
Name Age Rating Score
6 Jack 23 3.80 4.80
3 Vin 23 2.56 3.78
0 Tom 25 4.23 3.20
2 Ricky 25 3.98 3.80
1 James 26 3.24 4.60
5 Minsu 29 4.60 4.80
9 Gasper 30 4.80 3.24
4 Steve 30 3.20 2.98
7 Lee 34 3.78 3.65
8 David 40 2.98 4.23
11 Andres 46 3.65 2.56
12 Andres 46 3.65 2.56
10 Betina 51 4.10 3.98