pandas: Using Python for data analysis

Introduction to pandas

pandas is a NumPy-based library created for data analysis tasks. It builds on several other libraries and provides standard data models, along with a large number of functions and methods for manipulating large data sets quickly and efficiently.

Data structures

  • Series: A one-dimensional labeled array, similar to a one-dimensional NumPy array. Both resemble Python's built-in list. The difference is that a list can hold elements of different data types, while an array or Series normally stores a single data type, which uses memory more efficiently and speeds up operations.
  • DataFrame: A two-dimensional tabular data structure, similar in many ways to data.frame in R. A DataFrame can be understood as a container of Series. The rest of this article focuses mainly on DataFrame.
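The difference between a list, a Series, and a DataFrame described above can be sketched as follows. The variable names are illustrative only, and the exact integer dtype may vary by platform.

```python
import pandas as pd

# A Python list can mix types freely.
mixed = [1, "two", 3.0]

# A Series built from integers gets a single integer dtype.
s = pd.Series([1, 2, 3], name="a")
print(s.dtype)

# A Series can still hold mixed values, but it falls back to the
# generic (and slower) object dtype.
mixed_series = pd.Series(mixed)
print(mixed_series.dtype)

# A DataFrame can be built from a dict of Series; each column is a Series.
df = pd.DataFrame({"a": s, "b": pd.Series([0.5, 1.5, 2.5])})
print(type(df["a"]).__name__)
print(df.dtypes)
```

Each column keeps its own dtype, which is why a DataFrame can be thought of as a container of Series.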

Reindexing

An important pandas method is reindex, which creates a new object that conforms to a new index. (Not every variant is shown in full here; others, such as series.reindex(range(8),method='ffill') or series.reindex(range(8),method='bfill'), are worth trying yourself.)

import numpy as np
import pandas as pd

series=pd.Series([1,2,3,4,5,6,7],index=['less0','less1','less2','less3','less4','less5','less6'])
series
less0    1
less1    2
less2    3
less3    4
less4    5
less5    6
less6    7
dtype: int64

series.reindex(['less1','less0','less2','less3','less4','less5','less6','b'])
less1    2.0
less0    1.0
less2    3.0
less3    4.0
less4    5.0
less5    6.0
less6    7.0
b        NaN
dtype: float64

series.reindex(['12','211'])
12    NaN
211   NaN
dtype: float64

series.reindex(['12','211'],fill_value=0)
12     0
211    0
dtype: int64

series=pd.Series([1,2,3,4,5,6])
series.reindex(range(8),method='ffill')

0    1
1    2
2    3
3    4
4    5
5    6
6    6
7    6
dtype: int64
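For comparison with the forward fill above, here is a minimal sketch of the backward-fill variant on the same Series. The variable names `forward` and `backward` are chosen for illustration.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5, 6])

# ffill: the new labels 6 and 7 take the last earlier value (6).
forward = series.reindex(range(8), method="ffill")

# bfill: new labels look for the next later value. None exists after
# label 5, so labels 6 and 7 become NaN and the result is upcast to float.
backward = series.reindex(range(8), method="bfill")

print(forward)
print(backward)
```

Backward fill only helps when the missing labels fall before existing ones, so here it leaves the two trailing entries empty.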

In [3]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
   ...: dataframe
   ...: 
Out[3]: 
    a   b   c   d
1   0   1   2   3
2   4   5   6   7
3   8   9  10  11
4  12  13  14  15

In [4]: dataframe.reindex(index=[2,1,3,4],columns=['c','b','a','d'])
Out[4]: 
    c   b   a   d
2   6   5   4   7
1   2   1   0   3
3  10   9   8  11
4  14  13  12  15


In [5]: dataframe.reindex(index=[2,1,3,4],columns=['c','b','a','d','e'])
Out[5]: 
      c     b     a     d   e
2   6.0   5.0   4.0   7.0 NaN
1   2.0   1.0   0.0   3.0 NaN
3  10.0   9.0   8.0  11.0 NaN
4  14.0  13.0  12.0  15.0 NaN

In [6]: dataframe.reindex(index=[2,1,3,4],columns=['c','b','a','d','e'],method='ffill')
Out[6]: 
    c   b   a   d   e
2   6   5   4   7   7
1   2   1   0   3   3
3  10   9   8  11  11
4  14  13  12  15  15

In [7]: dataframe.reindex(index=[2,1,3,4],columns=['c','b','a','d','e'],method='bfill')
Out[7]: 
    c   b   a   d   e
2   6   5   4   7 NaN
1   2   1   0   3 NaN
3  10   9   8  11 NaN
4  14  13  12  15 NaN

In [8]: dataframe.reindex(index=[2,1,3,4],columns=['c','b','a','d','e'],method='bfill',fill_value=0)
Out[8]: 
    c   b   a   d  e
2   6   5   4   7  0
1   2   1   0   3  0
3  10   9   8  11  0
4  14  13  12  15  0
# copy defaults to True, which always copies the data; with copy=False, no copy is made when the new index equals the old one
In [9]: dataframe.reindex(index=[2,1,3,4],columns=['c','b','a','d','e'],method='bfill',fill_value=0,copy=True)
Out[9]: 
    c   b   a   d  e
2   6   5   4   7  0
1   2   1   0   3  0
3  10   9   8  11  0
4  14  13  12  15  0

Dropping entries from an axis

series=pd.Series([1,2,3,4,5,6,7],index=['less0','less1','less2','less3','less4','less5','less6'])
series.drop(['less1','less0','less2'])
less3    4
less4    5
less5    6
less6    7
dtype: int64

In [10]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
    ...: dataframe.drop([1,2])
    ...: 
Out[10]: 
    a   b   c   d
3   8   9  10  11
4  12  13  14  15

In [11]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
    ...: dataframe.drop(columns=['a','b'],index=[1,2])
    ...: 
Out[11]: 
    c   d
3  10  11
4  14  15
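The drop calls above can also be written with an explicit axis argument. The sketch below additionally shows that drop returns a new object and leaves the original untouched; the names `without_rows` and `without_cols` are illustrative.

```python
import numpy as np
import pandas as pd

dataframe = pd.DataFrame(np.arange(16).reshape((4, 4)),
                         index=[1, 2, 3, 4], columns=["a", "b", "c", "d"])

# axis=0 (the default) drops row labels; axis=1 drops column labels.
without_rows = dataframe.drop([1, 2], axis=0)
without_cols = dataframe.drop(["a", "b"], axis=1)

# drop returns a new object; the original still has all rows and columns.
print(without_rows)
print(without_cols)
print(dataframe.shape)
```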

Indexing, selection, and filtering

series=pd.Series([1,2,3,4,5,6])
series
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

In [12]: series=pd.Series([1,2,3,4,5,6])
    ...: series[series>2]
    ...: 
Out[12]: 
2    3
3    4
4    5
5    6
dtype: int64

In [13]: series=pd.Series([1,2,3,4,5,6])
    ...: series[2]
    ...: 
Out[13]: 3

series=pd.Series([1,2,3,4,5,6])
series[2:5]
2    3
3    4
4    5
dtype: int64

series=pd.Series([1,2,3,4,5,6])
series[2:5]=7
series
0    1
1    2
2    7
3    7
4    7
5    6
dtype: int64

In [14]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
    ...: dataframe
    ...: 
Out[14]: 
    a   b   c   d
1   0   1   2   3
2   4   5   6   7
3   8   9  10  11
4  12  13  14  15


In [16]: dataframe.loc[[1,2,3],['a','c']]
Out[16]: 
   a   c
1  0   2
2  4   6
3  8  10


In [17]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
    ...: dataframe[dataframe['a']>3]
    ...: 
Out[17]: 
    a   b   c   d
2   4   5   6   7
3   8   9  10  11
4  12  13  14  15

In [18]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
    ...: dataframe[dataframe.loc[[1,2,3],['a','c']]>3]
    ...: 
Out[18]: 
     a   b     c   d
1  NaN NaN   NaN NaN
2  4.0 NaN   6.0 NaN
3  8.0 NaN  10.0 NaN
4  NaN NaN   NaN NaN

In [20]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
    ...: dataframe[dataframe<3]=0
    ...: dataframe
    ...: 
Out[20]: 
    a   b   c   d
1   0   0   0   3
2   4   5   6   7
3   8   9  10  11
4  12  13  14  15
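Because this DataFrame's index starts at 1 rather than 0, label-based and position-based selection give different rows. The following sketch contrasts loc and iloc on the same data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

dataframe = pd.DataFrame(np.arange(16).reshape((4, 4)),
                         index=[1, 2, 3, 4], columns=["a", "b", "c", "d"])

# loc selects by label: the row whose index label is 1 (the first row).
by_label = dataframe.loc[1]

# iloc selects by position: position 1 is the second row (label 2).
by_position = dataframe.iloc[1]

# loc slices include the end label; iloc slices exclude the end position.
label_slice = dataframe.loc[1:3, "a"]   # labels 1, 2, 3
pos_slice = dataframe.iloc[1:3, 0]      # positions 1, 2 (labels 2, 3)

print(by_label.tolist(), by_position.tolist())
print(label_slice.tolist(), pos_slice.tolist())
```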

Application and Mapping of Functions


In [21]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
    ...: f=lambda x :x.max()-x.min()
    ...: dataframe.apply(f,axis=1)
    ...: 
Out[21]: 
1    3
2    3
3    3
4    3
dtype: int64

In [22]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
    ...: f=lambda x :x.max()-x.min()
    ...: dataframe.apply(f,axis=0)
    ...: 
Out[22]: 
a    12
b    12
c    12
d    12
dtype: int64

# Apply a function element-wise (applymap was renamed to DataFrame.map in pandas 2.1)
In [24]: f2=lambda x:x/2

In [25]: dataframe.applymap(f2)
Out[25]: 
     a    b    c    d
1  0.0  0.5  1.0  1.5
2  2.0  2.5  3.0  3.5
3  4.0  4.5  5.0  5.5
4  6.0  6.5  7.0  7.5
# For a one-dimensional Series, use apply() directly
series=pd.Series([1,2,3,4,5,6])
series.apply(f2)
0    0.5
1    1.0
2    1.5
3    2.0
4    2.5
5    3.0
dtype: float64
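Since applymap is deprecated in recent pandas releases, a version-tolerant way to write the element-wise example above might look like this. The helper names `halve` and `elementwise` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

dataframe = pd.DataFrame(np.arange(16).reshape((4, 4)),
                         index=[1, 2, 3, 4], columns=["a", "b", "c", "d"])


def halve(x):
    """Element-wise function, equivalent to f2 in the text."""
    return x / 2


# DataFrame.map exists from pandas 2.1 on; older versions only have applymap.
elementwise = dataframe.map if hasattr(pd.DataFrame, "map") else dataframe.applymap
halved = elementwise(halve)

# For simple arithmetic, a vectorized expression gives the same values
# and is usually faster than calling a Python function per element.
vectorized = dataframe / 2

print(halved)
```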

Common functions

In the list below, df stands for any pandas DataFrame object and s stands for any pandas Series object.

  • Import data
    pd.read_csv(filename): Import data from a CSV file

    pd.read_table(filename): Import data from a delimited text file

    pd.read_excel(filename): Import data from an Excel file

    pd.read_sql(query, connection_object): import data from SQL table/library

    pd.read_json(json_string): import data from a string in JSON format

    pd.read_html(url): Parse URLs, strings or HTML files, and extract tables from them

    pd.read_clipboard(): Get the content from your clipboard and pass it to read_table()

    pd.DataFrame(dict): Import data from a dictionary object, Key is the column name, Value is the data

  • Export data
    df.to_csv(filename): export data to a CSV file

    df.to_excel(filename): export data to an Excel file

    df.to_sql(table_name,connection_object): export data to a SQL table

    df.to_json(filename): export data to a text file in JSON format

  • View and inspect data
    df.head(n): View the first n rows of the DataFrame object

    df.tail(n): View the last n rows of the DataFrame object

    df.shape: View the number of rows and columns (an attribute, not a method)

    df.info(): View index, data type and memory information

    df.describe(): View summary statistics for numeric columns

    s.value_counts(dropna=False): View the unique values and counts of the Series object

  • Data selection
    df[col]: Select a column by name and return it as a Series

    df[[col1,col2]]: Returns multiple columns as a DataFrame

    s.iloc[0]: select data by location

    s.loc['index_one']: select data by index

    df.iloc[0,:]: returns the first row

    df.iloc[0,0]: returns the first element of the first column

  • Data cleaning
    df.columns=['a','b','c']: rename column names

    pd.isnull(df): Checks the DataFrame object for null values and returns a Boolean array

    pd.notnull(df): Checks the DataFrame object for non-null values and returns a Boolean array

    df.dropna(): drop all rows containing null values

    df.dropna(axis=1): drop all columns containing null values

    df.dropna(axis=1,thresh=n): drop all columns with fewer than n non-null values

    df.fillna(x): replace all null values in the DataFrame object with x

    s.astype(float): Change the data type in Series to float type

    s.replace(1,'one'): Replace all values equal to 1 with 'one'

    s.replace([1,3],['one','three']): replace 1 with 'one' and 3 with 'three'

    df.rename(columns=lambda x: x+1): batch change column names

    df.rename(columns={'old_name':'new_name'}): Selectively change column names

    df.set_index('column_one'): change the index column

    df.rename(index=lambda x: x+1): batch rename indexes

In [27]: dataframe=pd.DataFrame(np.arange(16).reshape((4,4)),index=[1,2,3,4],columns=[1,2,3,4])
    ...: dataframe
    ...: dataframe.set_index(4)
    ...: 
Out[27]: 
     1   2   3
4
3    0   1   2
7    4   5   6
11   8   9  10
15  12  13  14
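Several of the import and cleaning calls listed above can be tried on a small in-memory CSV. This is a minimal sketch; the CSV text and variable names are made up for illustration.

```python
import io

import pandas as pd

# Empty fields in the CSV text are read as NaN.
csv_text = "a,b,c\n1,2,3\n4,,6\n,,9\n"
df = pd.read_csv(io.StringIO(csv_text))

null_counts = df.isnull().sum()             # nulls per column
complete_rows = df.dropna()                 # rows with no nulls
complete_cols = df.dropna(axis=1)           # columns with no nulls
mostly_full = df.dropna(axis=1, thresh=2)   # columns with >= 2 non-null values
filled = df.fillna(0)                       # replace every null with 0
renamed = df.rename(columns=lambda x: x.upper())

print(null_counts)
print(mostly_full)
```

Note that with axis=1 the thresh argument counts non-null values per column, so column b (only one non-null value) is the one that gets dropped.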

  • Data processing: Filter, Sort, and GroupBy
    df[df[col]>0.5]: Select rows whose col value is greater than 0.5

    df.sort_values(col1): Sort data by column col1, in ascending order by default

    df.sort_values(col2,ascending=False): Sort data by column col2 in descending order

    df.sort_values([col1,col2],ascending=[True,False]): Sort data by column col1 in ascending order, then by col2 in descending order

    df.groupby(col): Returns a GroupBy object grouped by column col

    df.groupby([col1,col2]): Returns a GroupBy object grouped by multiple columns

    df.groupby(col1)[col2].mean(): Returns the mean of column col2 after grouping by column col1

    df.pivot_table(index=col1,values=[col2,col3],aggfunc=max): Create a pivot table that groups by column col1 and calculates the maximum value of col2 and col3

    df.groupby(col1).agg(np.mean): Returns the mean of all columns grouped by column col1

    df.apply(np.mean): apply the function np.mean to each column in the DataFrame

    df.apply(np.max,axis=1): apply the function np.max to each row in the DataFrame

  • Data Merge
    df1.append(df2): add the rows of df2 to the end of df1 (removed in pandas 2.0; use pd.concat([df1,df2]) instead)

    pd.concat([df1,df2],axis=1): add the columns of df2 to the right of df1

    df1.join(df2,on=col1,how='inner'): perform a SQL-style join on the columns of df1 and df2

  • Data statistics
    df.describe(): View summary statistics for numeric columns

    df.mean(): returns the mean of all columns

    df.corr(): returns the correlation coefficient between columns

    df.count(): Returns the number of non-null values in each column

    df.max(): returns the maximum value of each column

    df.min(): returns the minimum value of each column

    df.median(): returns the median of each column

    df.std(): returns the standard deviation of each column
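The sorting, grouping, pivoting, and merging calls from the list above can be sketched on a small frame. The column names follow the placeholders used in the list; the data and the `extra` frame are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["x", "y", "x", "y"],
    "col2": [1, 2, 3, 4],
    "col3": [10.0, 20.0, 30.0, 40.0],
})

# Sort by col1 ascending, then by col2 descending within each col1 value.
sorted_df = df.sort_values(["col1", "col2"], ascending=[True, False])

# Mean of col2 within each col1 group.
means = df.groupby("col1")["col2"].mean()

# Pivot table with the per-group maximum of col2 and col3.
table = df.pivot_table(index="col1", values=["col2", "col3"], aggfunc="max")

# pd.concat stacks rows; ignore_index=True renumbers the result from 0.
extra = pd.DataFrame({"col1": ["z"], "col2": [5], "col3": [50.0]})
stacked = pd.concat([df, extra], ignore_index=True)

print(sorted_df)
print(means)
print(table)
```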

Hierarchical Index


series=pd.Series([1,2,34,5,67,8,90,23,4,5],index=[['one','one','one','two','one','one','three','three','three','three'],[1,2,3,4,5,6,7,8,9,10]])
series
one    1      1
       2      2
       3     34
two    4      5
one    5     67
       6      8
three  7     90
       8     23
       9      4
       10     5
dtype: int64

series['one']
1     1
2     2
3    34
5    67
6     8
dtype: int64

series['one',3]
34

# Reorder the index levels
series=pd.Series([1,2,34,5,67,8,90,23,4,5],index=[['one','one','one','two','one','one','three','three','three','three'],[1,2,3,4,5,6,7,8,9,10]])
# Name the index levels (column levels can be named in the same way)
series.index.names=['key1','key2']
key1   key2
one    1        1
       2        2
       3       34
two    4        5
one    5       67
       6        8
three  7       90
       8       23
       9        4
       10       5
dtype: int64

series.swaplevel('key2','key1')
key2  key1 
1     one       1
2     one       2
3     one      34
4     two       5
5     one      67
6     one       8
7     three    90
8     three    23
9     three     4
10    three     5
dtype: int64

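# Compute summary statistics by level (the level argument of sum was removed in pandas 2.0; series.groupby(level='key1').sum() is the replacement)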

series.sum(level='key1')
key1
one      112
two        5
three    122
dtype: int64

series.sum(level='key2')
key2
1      1
2      2
3     34
4      5
5     67
6      8
7     90
8     23
9      4
10     5
dtype: int64
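On recent pandas versions the per-level sums above are written with groupby. Here is a minimal sketch on the same data; sort=False is used so the groups keep their order of first appearance, matching the output shown above.

```python
import pandas as pd

series = pd.Series(
    [1, 2, 34, 5, 67, 8, 90, 23, 4, 5],
    index=[
        ["one", "one", "one", "two", "one", "one",
         "three", "three", "three", "three"],
        [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    ],
)
series.index.names = ["key1", "key2"]

# Sum within each outer-level label, keeping first-appearance order.
by_key1 = series.groupby(level="key1", sort=False).sum()

# Sorting the index brings equal outer labels together, which also
# makes label-based selection on the MultiIndex more efficient.
sorted_series = series.sort_index()

print(by_key1)
print(sorted_series.loc["one"])
```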
