Introduction to pandas
pandas is a NumPy-based library created for data analysis. It builds on a number of underlying libraries and standard data models, and provides the data structures plus a large collection of functions and methods needed to manipulate large data sets quickly and easily.
Data structures
- Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both also resemble Python's built-in List; the difference is that a List may hold elements of different data types, while a NumPy array and a Series store a single data type, which uses memory more efficiently and improves operational efficiency.
- DataFrame: a two-dimensional tabular data structure, functionally similar to data.frame in R. A DataFrame can be understood as a container of Series. The following content is mainly based on DataFrame.
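A minimal sketch of the two structures (the names `s` and `df` and the sample values are illustrative):

```python
import numpy as np
import pandas as pd

# A Series: one-dimensional, single dtype, with a labeled index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# A DataFrame: two-dimensional table; each column is itself a Series
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

print(s.dtype)        # a Series has one dtype for all elements
print(type(df['x']))  # selecting a column yields a pandas Series
```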
reindex
An important pandas method is reindex, which creates a new object conforming to a new index. (Only the basic usage is shown below; try the variants yourself, such as series.reindex(range(8),method='ffill') or series.reindex(range(8),method='bfill').)
series=pd.Series([1,2,3,4,5,6,7],index=['less0','less1','less2','less3','less4','less5','less6'])
less0 1
less1 2
less2 3
less3 4
less4 5
less5 6
less6 7
dtype: int64
series.reindex(['less1','less0','less2','less3','less4','less5','less6','b'])
less1 2.0
less0 1.0
less2 3.0
less3 4.0
less4 5.0
less5 6.0
less6 7.0
b NaN
dtype: float64
series.reindex(['12','211'])
12 NaN
211 NaN
dtype: float64
series.reindex(['12','211'],fill_value=0)
12 0
211 0
dtype: int64
series=pd.Series([1,2,3,4,5,6])
series.reindex(range(8),method='ffill')
0 1
1 2
2 3
3 4
4 5
5 6
6 6
7 6
dtype: int64
In [3]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
...: dataframe
...:
Out[3]:
a b c d
1 0 1 2 3
2 4 5 6 7
3 8 9 10 11
4 12 13 14 15
In [4]: dataframe.reindex(index=[2,1,3,4],columns=['c','b','a','d'])
Out[4]:
c b a d
2 6 5 4 7
1 2 1 0 3
3 10 9 8 11
4 14 13 12 15
In [5]: dataframe.reindex(index=[2,1,3,4],columns=['c','b','a','d','e'])
Out[5]:
c b a d e
2 6.0 5.0 4.0 7.0 NaN
1 2.0 1.0 0.0 3.0 NaN
3 10.0 9.0 8.0 11.0 NaN
4 14.0 13.0 12.0 15.0 NaN
In [6]: dataframe.reindex(index=[2,1,3,4],columns=['c','b','a','d','e'],method='ffill')
Out[6]:
c b a d e
2 6 5 4 7 7
1 2 1 0 3 3
3 10 9 8 11 11
4 14 13 12 15 15
In [7]: dataframe.reindex(index=[2,1,3,4],columns=['c','b','a','d','e'],method='bfill')
Out[7]:
c b a d e
2 6 5 4 7 NaN
1 2 1 0 3 NaN
3 10 9 8 11 NaN
4 14 13 12 15 NaN
In [8]: dataframe.reindex(index=[2,1,3,4],columns=['c','b','a','d','e'],method='bfill',fill_value=0)
Out[8]:
c b a d e
2 6 5 4 7 0
1 2 1 0 3 0
3 10 9 8 11 0
4 14 13 12 15 0
# copy defaults to True, which always copies; with copy=False, no copy is made when the new index equals the old one
In [9]: dataframe.reindex(index=[2,1,3,4],columns=['c','b','a','d','e'],method='bfill',fill_value=0,copy=True)
Out[9]:
c b a d e
2 6 5 4 7 0
1 2 1 0 3 0
3 10 9 8 11 0
4 14 13 12 15 0
Discard data on the specified axis
series=pd.Series([1,2,3,4,5,6,7],index=['less0','less1','less2','less3','less4','less5','less6'])
series.drop(['less1','less0','less2'])
less3 4
less4 5
less5 6
less6 7
dtype: int64
In [10]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
...: dataframe.drop([1,2])
...:
Out[10]:
a b c d
3 8 9 10 11
4 12 13 14 15
In [11]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
...: dataframe.drop(columns=['a','b'],index=[1,2])
...:
Out[11]:
c d
3 10 11
4 14 15
Index selection and filtering
series=pd.Series([1,2,3,4,5,6])
series
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
In [12]: series=pd.Series([1,2,3,4,5,6])
...: series[series>2]
...:
Out[12]:
2 3
3 4
4 5
5 6
dtype: int64
In [13]: series=pd.Series([1,2,3,4,5,6])
...: series[2]
...:
Out[13]: 3
series=pd.Series([1,2,3,4,5,6])
series[2:5]
2 3
3 4
4 5
dtype: int64
series=pd.Series([1,2,3,4,5,6])
series[2:5]=7
series
0 1
1 2
2 7
3 7
4 7
5 6
dtype: int64
In [14]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
...: dataframe
...:
Out[14]:
a b c d
1 0 1 2 3
2 4 5 6 7
3 8 9 10 11
4 12 13 14 15
In [16]: dataframe.loc[[1,2,3],['a','c']]
Out[16]:
a c
1 0 2
2 4 6
3 8 10
In [17]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
...: dataframe[dataframe['a']>3]
...:
Out[17]:
a b c d
2 4 5 6 7
3 8 9 10 11
4 12 13 14 15
In [18]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
...: dataframe[dataframe.loc[[1,2,3],['a','c']]>3]
...:
Out[18]:
a b c d
1 NaN NaN NaN NaN
2 4.0 NaN 6.0 NaN
3 8.0 NaN 10.0 NaN
4 NaN NaN NaN NaN
In [20]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
...: dataframe[dataframe<3]=0
...: dataframe
...:
Out[20]:
a b c d
1 0 0 0 3
2 4 5 6 7
3 8 9 10 11
4 12 13 14 15
Application and Mapping of Functions
In [21]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
...: f=lambda x :x.max()-x.min()
...: dataframe.apply(f,axis=1)
...:
Out[21]:
1 3
2 3
3 3
4 3
dtype: int64
In [22]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
...: f=lambda x :x.max()-x.min()
...: dataframe.apply(f,axis=0)
...:
Out[22]:
a 12
b 12
c 12
d 12
dtype: int64
# applymap applies a function element-wise
In [24]: f2=lambda x:x/2
In [25]: dataframe.applymap(f2)
Out[25]:
a b c d
1 0.0 0.5 1.0 1.5
2 2.0 2.5 3.0 3.5
3 4.0 4.5 5.0 5.5
4 6.0 6.5 7.0 7.5
# for a one-dimensional Series, use apply() directly
series=pd.Series([1,2,3,4,5,6])
series.apply(f2)
0 0.5
1 1.0
2 1.5
3 2.0
4 2.5
5 3.0
dtype: float64
Common functions
df represents any pandas DataFrame object; s represents any pandas Series object.
-
Import data
pd.read_csv(filename): Import data from a CSV file
pd.read_table(filename): Import data from a delimited text file
pd.read_excel(filename): Import data from an Excel file
pd.read_sql(query, connection_object): import data from SQL table/library
pd.read_json(json_string): import data from a string in JSON format
pd.read_html(url): Parse URLs, strings or HTML files, and extract tables from them
pd.read_clipboard(): Get the content from your clipboard and pass it to read_table()
pd.DataFrame(dict): Import data from a dictionary object, Key is the column name, Value is the data
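For instance, `pd.DataFrame(dict)` builds a table directly from a dictionary (the column names and sample values below are illustrative):

```python
import pandas as pd

# Keys become column names, values become column data
data = {'name': ['Ann', 'Bob'], 'score': [90, 85]}
df = pd.DataFrame(data)
print(df.columns.tolist())  # column names come from the dict keys
```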
-
Export data
df.to_csv(filename): export data to CSV file
df.to_excel(filename): export data to Excel file
df.to_sql(table_name,connection_object): export data to SQL table
df.to_json(filename): export data to text file in Json format
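A round trip through `to_csv`/`read_csv` is a quick way to see the export functions in action (the temporary file path is illustrative):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# index=False avoids writing the row index as an extra column
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
df.to_csv(path, index=False)

# Reading it back reproduces the original frame
df2 = pd.read_csv(path)
print(df2.equals(df))
```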
-
View, inspect data
df.head(n): View the first n rows of the DataFrame object
df.tail(n): View the last n rows of a DataFrame object
df.shape: View the number of rows and columns (an attribute, not a method)
df.info(): View index, data type and memory information
df.describe(): View summary statistics for numeric columns
s.value_counts(dropna=False): View the unique values and counts of the Series object
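A short sketch of the inspection methods on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'a': range(10),
                   'b': [1, 1, 2, 2, 2, 3, 3, 3, 3, 3]})

print(df.head(3))    # first 3 rows
print(df.shape)      # (rows, columns) -- note: an attribute, no parentheses
print(df['b'].value_counts().to_dict())  # unique values and their counts
```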
-
Data selection
df[col]: Select a column by name and return it as a Series
df[[col1,col2]]: Returns multiple columns as a DataFrame
s.iloc[0]: select data by location
s.loc['index_one']: select data by index
df.iloc[0,:]: returns the first row
df.iloc[0,0]: returns the first element of the first column
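The label-based (`loc`) and position-based (`iloc`) selectors can be compared on the same frame used throughout this section:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=[1, 2, 3, 4], columns=['a', 'b', 'c', 'd'])

print(df['a'].tolist())   # column 'a' as a Series
print(df.iloc[0, 0])      # first element by integer position
print(df.loc[1, 'a'])     # the same element selected by labels
```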
-
Data cleaning
df.columns=['a','b','c']: rename column names
pd.isnull(): Checks the DataFrame object for null values and returns a Boolean array
pd.notnull(): Checks the DataFrame object for non-null values and returns a Boolean array
df.dropna(): drop all rows containing null values
df.dropna(axis=1): drop all columns containing null values
df.dropna(axis=1,thresh=n): drop all columns with fewer than n non-null values
df.fillna(x): replace all null values in the DataFrame object with x
s.astype(float): Change the data type in Series to float type
s.replace(1,'one'): Replace all values equal to 1 with 'one'
s.replace([1,3],['one','three']): replace 1 with 'one' and 3 with 'three'
df.rename(columns=lambda x: x+1): batch change column names
df.rename(columns={'old_name':'new_name'}): Selectively change column names
df.set_index('column_one'): change the index column
df.rename(index=lambda x: x+1): batch rename indexes
In [27]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=[1,2,3,4])
...: dataframe
...: dataframe.set_index(4)
...:
Out[27]:
1 2 3
4
3 0 1 2
7 4 5 6
11 8 9 10
15 12 13 14
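A small sketch of the null-handling functions on a frame with missing values (the sample data is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0]})

# dropna() keeps only rows with no nulls; here that is the last row
print(df.dropna().shape)

# fillna(x) replaces every null with x
print(df.fillna(0)['a'].tolist())
```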
-
Data processing: Filter, Sort, and GroupBy
df[df[col]>0.5]: Select rows where the value in column col is greater than 0.5
df.sort_values(col1): Sort data by column col1, in ascending order by default
df.sort_values(col2,ascending=False): sort data in descending order of column col2
df.sort_values([col1,col2],ascending=[True,False]): sort the data in ascending order by column col1, then in descending order by col2
df.groupby(col): return a GroupBy object grouped by column col
df.groupby([col1,col2]): Returns a GroupBy object grouped by multiple columns
df.groupby(col1)[col2].mean(): Returns the mean of column col2 after grouping by column col1
df.pivot_table(index=col1,values=[col2,col3],aggfunc=max): Create a pivot table grouped by column col1 that calculates the maximum of col2 and col3
df.groupby(col1).agg(np.mean): Returns the mean of all columns grouped by column col1
data.apply(np.mean): apply the function np.mean to each column in the DataFrame
data.apply(np.max,axis=1): apply the function np.max to each row in the DataFrame
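The sort and groupby entries above can be sketched on a tiny frame (the column names `col1`/`col2` match the cheat sheet; the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['x', 'x', 'y', 'y'],
                   'col2': [1, 2, 3, 4]})

# Sort descending by col2
print(df.sort_values('col2', ascending=False)['col2'].tolist())

# Group by col1 and take the mean of col2 within each group
print(df.groupby('col1')['col2'].mean().to_dict())
```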
-
Data Merge
df1.append(df2): add the rows of df2 to the tail of df1
pd.concat([df1,df2],axis=1): add the columns of df2 to the tail of df1
df1.join(df2,on=col1,how='inner'): perform SQL-style join on the columns of df1 and df2
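Note that `DataFrame.append` was removed in pandas 2.x; `pd.concat` covers the same case. A sketch of row-wise concatenation and a SQL-style inner join (using `pd.merge`, which is often more convenient than `join` for joining on a shared column; the key column `k` is illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'k': [1, 2], 'v1': ['a', 'b']})
df2 = pd.DataFrame({'k': [2, 3], 'v2': ['c', 'd']})

# Row-wise concatenation (the modern replacement for df1.append)
rows = pd.concat([df1, df1], ignore_index=True)
print(len(rows))

# Inner join on column 'k': only keys present in both frames survive
inner = pd.merge(df1, df2, on='k', how='inner')
print(inner['k'].tolist())
```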
-
Data statistics
df.describe(): View summary statistics for numeric columns
df.mean(): returns the mean of all columns
df.corr(): returns the correlation coefficient between columns
df.count(): Returns the number of non-null values in each column
df.max(): returns the maximum value of each column
df.min(): returns the minimum value of each column
df.median(): returns the median of each column
df.std(): returns the standard deviation of each column
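A quick sketch of the statistics functions (the two perfectly linear columns are illustrative, so their correlation is exactly 1):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8]})

print(df.mean().to_dict())              # column means
print(df.max().to_dict())               # column maxima
print(df['a'].corr(df['b']))            # b = 2*a, so correlation is 1.0
```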
Hierarchical Index
series=pd.Series([1,2,34,5,67,8,90,23,4,5],index=[['one','one','one','two','one','one','three','three','three','three'],[1,2,3,4,5,6,7,8,9,10]])
one 1 1
2 2
3 34
two 4 5
one 5 67
6 8
three 7 90
8 23
9 4
10 5
dtype: int64
series['one']
1 1
2 2
3 34
5 67
6 8
dtype: int64
series['one',3]
34
# reorder the levels
series=pd.Series([1,2,34,5,67,8,90,23,4,5],index=[['one','one','one','two','one','one','three','three','three','three'],[1,2,3,4,5,6,7,8,9,10]])
# similarly, column index levels can also be given names
series.index.names=['key1','key2']
key1 key2
one 1 1
2 2
3 34
two 4 5
one 5 67
6 8
three 7 90
8 23
9 4
10 5
dtype: int64
series.swaplevel('key2','key1')
key2 key1
1 one 1
2 one 2
3 one 34
4 two 5
5 one 67
6 one 8
7 three 90
8 three 23
9 three 4
10 three 5
dtype: int64
# aggregate statistics by level
series.sum(level='key1')
key1
one 112
two 5
three 122
dtype: int64
series.sum(level='key2')
key2
1 1
2 2
3 34
4 5
5 67
6 8
7 90
8 23
9 4
10 5
dtype: int64
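Note that `Series.sum(level=...)` was deprecated and later removed in recent pandas releases; current pandas expresses the same aggregation with `groupby(level=...)`. A sketch reproducing the key1 sums above:

```python
import pandas as pd

series = pd.Series([1, 2, 34, 5, 67, 8, 90, 23, 4, 5],
                   index=[['one', 'one', 'one', 'two', 'one', 'one',
                           'three', 'three', 'three', 'three'],
                          [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
series.index.names = ['key1', 'key2']

# Equivalent to series.sum(level='key1') in older pandas
print(series.groupby(level='key1').sum().to_dict())
```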