Introduction to the DataFrame data structure in pandas, the Python data processing library

pandas has two major data structures, Series and DataFrame. This article introduces DataFrame, which represents tabular data as labeled rows and columns.

For an introduction to Series, see: Introduction to the Series data structure in the Python data processing library pandas

There are many ways to create a DataFrame, for example from a dictionary of equal-length lists:

In [1]: import pandas as pd

In [8]: data = {'name': ['张三', '张三', '张三', '李四', '李四', '李四'],
   ...:         'year': [2016, 2017, 2018, 2016, 2017, 2018],
   ...:         'income': [6000, 6500, 7000, 25000, 26000, 29000]}

In [9]: frame = pd.DataFrame(data)

In [10]: frame
Out[10]: 
  name  year  income
0   张三  2016    6000
1   张三  2017    6500
2   张三  2018    7000
3   李四  2016   25000
4   李四  2017   26000
5   李四  2018   29000
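As a quick check on what the constructor produced, the sketch below rebuilds the same frame and inspects its shape, column labels, and default index. It only uses the data shown above.

```python
import pandas as pd

data = {'name': ['张三', '张三', '张三', '李四', '李四', '李四'],
        'year': [2016, 2017, 2018, 2016, 2017, 2018],
        'income': [6000, 6500, 7000, 25000, 26000, 29000]}
frame = pd.DataFrame(data)

print(frame.shape)          # (number of rows, number of columns)
print(list(frame.columns))  # column labels, in dictionary key order
print(list(frame.index))    # default integer index starting at 0
```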

The head method returns the first five rows by default:

In [11]: frame.head()
Out[11]: 
  name  year  income
0   张三  2016    6000
1   张三  2017    6500
2   张三  2018    7000
3   李四  2016   25000
4   李四  2017   26000
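head also accepts a row count, and tail works the same way from the end of the frame. A minimal sketch, reusing the data from above:

```python
import pandas as pd

frame = pd.DataFrame({'name': ['张三', '张三', '张三', '李四', '李四', '李四'],
                      'year': [2016, 2017, 2018, 2016, 2017, 2018],
                      'income': [6000, 6500, 7000, 25000, 26000, 29000]})

first_two = frame.head(2)  # first two rows
last_two = frame.tail(2)   # last two rows, original index labels kept
print(first_two)
print(last_two)
```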

Pass columns to the constructor to set the column order:

In [13]: pd.DataFrame(data, columns=['year', 'income', 'name'])
Out[13]: 
   year  income name
0  2016    6000   张三
1  2017    6500   张三
2  2018    7000   张三
3  2016   25000   李四
4  2017   26000   李四
5  2018   29000   李四
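The columns argument applies at construction time. For a frame that already exists, one possible approach is to select the columns in the desired order, or to call reindex; both are sketched here and give the same result when every listed column exists.

```python
import pandas as pd

frame = pd.DataFrame({'name': ['张三', '李四'],
                      'year': [2016, 2016],
                      'income': [6000, 25000]})

order = ['year', 'income', 'name']
by_selection = frame[order]                # select columns in a new order
by_reindex = frame.reindex(columns=order)  # same result for existing columns
print(by_selection)
```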

If columns names a column that does not exist in the data, that column is filled with NaN:

In [14]: frame2 = pd.DataFrame(data, columns=['income', 'year', 'name', 'gender'],
    ...:                       index=['one', 'two', 'three', 'four', 'five', 'six'])
    ...: 

In [15]: frame2
Out[15]: 
       income  year name gender
one      6000  2016   张三    NaN
two      6500  2017   张三    NaN
three    7000  2018   张三    NaN
four    25000  2016   李四    NaN
five    26000  2017   李四    NaN
six     29000  2018   李四    NaN

In [17]: frame2.columns
Out[17]: Index(['income', 'year', 'name', 'gender'], dtype='object')
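To confirm that a column requested in columns but absent from the data really is all missing, isna can be used. A small sketch with a shortened version of the data:

```python
import pandas as pd

data = {'name': ['张三', '李四'],
        'year': [2016, 2016],
        'income': [6000, 25000]}
frame2 = pd.DataFrame(data, columns=['income', 'year', 'name', 'gender'],
                      index=['one', 'two'])

# 'gender' is not a key in data, so every value in it is missing.
all_missing = bool(frame2['gender'].isna().all())
print(all_missing)
```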

Select a column by name, either with dictionary-style indexing or as an attribute:

In [18]: frame2['name']
Out[18]: 
one      张三
two      张三
three    张三
four     李四
five     李四
six      李四
Name: name, dtype: object


In [20]: frame2.income
Out[20]: 
one       6000
two       6500
three     7000
four     25000
five     26000
six      29000
Name: income, dtype: int64
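Attribute access only works when the column name is a valid Python identifier, so bracket indexing is the more general form. The sketch below uses a hypothetical column name containing a space to show the difference, and also selects several columns at once.

```python
import pandas as pd

# 'net income' is a made-up column used only for illustration.
frame = pd.DataFrame({'income': [6000, 6500],
                      'net income': [5000, 5400]})

same = frame['income'].equals(frame.income)  # both forms return the same Series
net = frame['net income']                    # a name with a space needs brackets
both = frame[['income', 'net income']]       # a list of names returns a DataFrame
print(same)
print(type(net).__name__, type(both).__name__)
```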

Select a row by its index label with loc:

In [21]: frame2.loc['six']
Out[21]: 
income    29000
year       2018
name         李四
gender      NaN
Name: six, dtype: object
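loc selects by label. Its counterpart iloc selects by integer position, and loc also accepts a column label to pick out a single cell. A minimal sketch with only the numeric columns:

```python
import pandas as pd

frame2 = pd.DataFrame(
    {'income': [6000, 6500, 7000, 25000, 26000, 29000],
     'year': [2016, 2017, 2018, 2016, 2017, 2018]},
    index=['one', 'two', 'three', 'four', 'five', 'six'])

by_label = frame2.loc['six']        # row with index label 'six'
by_position = frame2.iloc[5]        # sixth row, counted from 0
cell = frame2.loc['six', 'income']  # single value: row label, column label
print(by_label)
print(cell)
```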

You can assign a single value or a list of values to a column. A list must have the same length as the DataFrame:

In [22]: frame2['gender'] = 'male'

In [23]: frame2
Out[23]: 
       income  year name gender
one      6000  2016   张三   male
two      6500  2017   张三   male
three    7000  2018   张三   male
four    25000  2016   李四   male
five    26000  2017   李四   male
six     29000  2018   李四   male

In [24]: frame2['gender'] = ['male', 'male', 'male', 'female', 'female', 'female']

In [25]: frame2
Out[25]: 
       income  year name  gender
one      6000  2016   张三    male
two      6500  2017   张三    male
three    7000  2018   张三    male
four    25000  2016   李四  female
five    26000  2017   李四  female
six     29000  2018   李四  female
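Column assignment also accepts the result of an expression on other columns, and it rejects a list of the wrong length. The sketch below illustrates both; the column name high_income and the 10000 threshold are arbitrary choices for the example.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {'income': [6000, 6500, 7000, 25000, 26000, 29000],
     'year': [2016, 2017, 2018, 2016, 2017, 2018]},
    index=['one', 'two', 'three', 'four', 'five', 'six'])

# A boolean column computed from an existing column.
frame2['high_income'] = frame2['income'] > 10000

# A list with 2 values cannot fill a column of 6 rows.
try:
    frame2['gender'] = ['male', 'female']
    length_error = None
except ValueError as exc:
    length_error = exc

print(frame2)
print(length_error)
```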

When you assign a Series to a column, its values are aligned to the DataFrame's index. Index labels that are missing from the Series are filled with NaN:

In [26]: gender = pd.Series(['male', 'female'], index=['one', 'four'])

In [27]: frame2['gender'] = gender

In [28]: frame2
Out[28]: 
       income  year name  gender
one      6000  2016   张三    male
two      6500  2017   张三     NaN
three    7000  2018   张三     NaN
four    25000  2016   李四  female
five    26000  2017   李四     NaN
six     29000  2018   李四     NaN
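Alignment works in the other direction too: a label in the Series that does not exist in the DataFrame's index is simply ignored. The sketch below adds a hypothetical label 'seven' to show this.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {'income': [6000, 6500, 7000, 25000, 26000, 29000]},
    index=['one', 'two', 'three', 'four', 'five', 'six'])

# 'seven' is not a row label in frame2, so its value is dropped on assignment.
gender = pd.Series(['male', 'female', 'male'], index=['one', 'four', 'seven'])
frame2['gender'] = gender

missing_count = int(frame2['gender'].isna().sum())
print(frame2)
print(missing_count)
```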

Delete a column with del:

In [29]: del frame2['gender']

In [30]: frame2.columns
Out[30]: Index(['income', 'year', 'name'], dtype='object')
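del modifies the frame in place. When the original should stay untouched, drop is one alternative: it returns a new DataFrame without the listed columns. A minimal sketch:

```python
import pandas as pd

frame2 = pd.DataFrame({'income': [6000, 25000],
                       'year': [2016, 2016],
                       'gender': ['male', 'female']},
                      index=['one', 'two'])

trimmed = frame2.drop(columns=['gender'])  # new frame; frame2 is unchanged
print(list(frame2.columns))
print(list(trimmed.columns))
```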

Another way to create a DataFrame is from a nested dictionary. The outer keys become the columns and the inner keys become the row index:

In [31]: income = {'张三': {2016: 6000, 2017:6500, 2018:7000},
    ...:           '李四': {2016: 25000, 2017:26000}}

In [32]: frame3= pd.DataFrame(income)

In [33]: frame3
Out[33]: 
        张三       李四
2016  6000  25000.0
2017  6500  26000.0
2018  7000      NaN
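The 李四 column is displayed as floats because it has no entry for 2018: the gap is filled with NaN, and NaN is a floating-point value. The sketch below checks the resulting dtypes with helper functions from pd.api.types.

```python
import pandas as pd

income = {'张三': {2016: 6000, 2017: 6500, 2018: 7000},
          '李四': {2016: 25000, 2017: 26000}}
frame3 = pd.DataFrame(income)

# The complete column keeps an integer dtype; the column with a gap becomes float.
complete_is_int = pd.api.types.is_integer_dtype(frame3['张三'])
gapped_is_float = pd.api.types.is_float_dtype(frame3['李四'])
print(frame3.dtypes)
```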

A DataFrame can be transposed with the T attribute:

In [34]: frame3.T
Out[34]: 
       2016     2017    2018
张三   6000.0   6500.0  7000.0
李四  25000.0  26000.0     NaN
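In the transposed output every value is a float, including the 张三 row that was integer before. After transposing, each new column mixes values from an integer column and a float column, so pandas stores them as floats. A short sketch that checks this:

```python
import pandas as pd

income = {'张三': {2016: 6000, 2017: 6500, 2018: 7000},
          '李四': {2016: 25000, 2017: 26000}}
frame3 = pd.DataFrame(income)

transposed = frame3.T  # rows become columns and columns become rows
all_float = all(pd.api.types.is_float_dtype(t) for t in transposed.dtypes)
print(transposed.shape)
print(all_float)
```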

Give names to the row index and to the columns:

In [35]: frame3.index.name = 'year'

In [36]: frame3.columns.name = 'name'

In [37]: frame3
Out[37]: 
name    张三       李四
year               
2016  6000  25000.0
2017  6500  26000.0
2018  7000      NaN
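Setting index.name and columns.name changes the frame in place. As a possible alternative, rename_axis returns a new DataFrame with the names set and leaves the original as it was. A minimal sketch:

```python
import pandas as pd

income = {'张三': {2016: 6000, 2017: 6500, 2018: 7000},
          '李四': {2016: 25000, 2017: 26000}}
frame3 = pd.DataFrame(income)

named = frame3.rename_axis(index='year', columns='name')
print(named.index.name, named.columns.name)
print(frame3.index.name)  # the original frame still has no index name
```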

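The to_numpy method converts a DataFrame into a two-dimensional NumPy array. If the columns have different data types, the array's dtype is chosen to accommodate all of them, as with the object dtype in the second example: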

In [38]: frame3.to_numpy()
Out[38]: 
array([[ 6000., 25000.],
       [ 6500., 26000.],
       [ 7000.,    nan]])

In [39]: frame2.to_numpy()
Out[39]: 
array([[6000, 2016, '张三'],
       [6500, 2017, '张三'],
       [7000, 2018, '张三'],
       [25000, 2016, '李四'],
       [26000, 2017, '李四'],
       [29000, 2018, '李四']], dtype=object)
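The dtype of the resulting array can be checked directly. The sketch below uses two small made-up frames: one with only numeric columns and one that mixes numbers and strings.

```python
import pandas as pd

# Hypothetical frames used only to compare the resulting array dtypes.
numeric = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})
mixed = pd.DataFrame({'a': [1, 2], 'name': ['x', 'y']})

numeric_arr = numeric.to_numpy()  # int and float columns share a float dtype
mixed_arr = mixed.to_numpy()      # numbers and strings fall back to object
print(numeric_arr.dtype)
print(mixed_arr.dtype)
```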

Reference: Python for Data Analysis, 2nd Edition by Wes McKinney

Origin blog.csdn.net/bo17244504/article/details/124692061