[Self-study] Using Python for data analysis LESSON6 <Introduction to pandas: Introduction to pandas data structures 2>

Table of contents

Foreword

1. DataFrame

1. Column selection

2. Selection of rows

3. Column modification

4. Column deletion

5. Assign nested dictionaries to DataFrame

Summary


Foreword

This lesson continues from the previous one:

[Self-study] Using Python for data analysis LESSON5 <Introduction to pandas: Introduction to pandas data structures 1> (Rachel MuZy's blog, CSDN), which covers the pandas data structures Series and DataFrame: https://blog.csdn.net/mzy20010420/article/details/127026241


1. DataFrame

Columns can be referenced directly by name, and rows can be selected by position or with the special loc attribute.

1. Column selection

A column in a DataFrame can be retrieved as a Series, either with dictionary-like bracket notation or as an attribute:

Example:

import pandas as pd
import numpy as np

data = {'state': ['Astrilia', 'Mexico', 'China', 'Japan'],
        'years': [2000, 2001, 2002, 2003],
        'pop': [1.5, 3.6, 2.4, 5.1]}
frame = pd.DataFrame(data, columns = ['years', 'state', 'pop'])
val = pd.Series([-1.2, -1.5, -1.7])
frame['debt'] = val
val_1 = pd.Series([100, 200, 300], index = [0, 1, 3])
frame['pofit'] = val_1
print(frame)
frame_1 = frame['state']
print(frame_1)
frame_2 = frame.state
print(frame_2)
# frame['state'] and frame.state are equivalent here

result:

   years     state  pop  debt  pofit
0   2000  Astrilia  1.5  -1.2  100.0
1   2001    Mexico  3.6  -1.5  200.0
2   2002     China  2.4  -1.7    NaN
3   2003     Japan  5.1   NaN  300.0
0    Astrilia
1      Mexico
2       China
3       Japan
Name: state, dtype: object
0    Astrilia
1      Mexico
2       China
3       Japan
Name: state, dtype: object
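
One caveat is worth a small sketch: attribute access only reaches a column when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. In the example data, 'pop' happens to share its name with the DataFrame.pop method, so frame.pop does not return the column. The variable names below (same_state, pop_is_method, pop_column) are only illustrative.

```python
import pandas as pd

data = {'state': ['Astrilia', 'Mexico', 'China', 'Japan'],
        'years': [2000, 2001, 2002, 2003],
        'pop': [1.5, 3.6, 2.4, 5.1]}
frame = pd.DataFrame(data, columns=['years', 'state', 'pop'])

# For 'state', both forms return the same Series.
same_state = frame['state'].equals(frame.state)
print(same_state)

# 'pop' is also the name of a DataFrame method, so attribute access
# finds the method first and never reaches the column.
pop_is_method = callable(frame.pop)
print(pop_is_method)

# Bracket notation always refers to the column.
pop_column = frame['pop']
print(pop_column.tolist())
```

For this reason, bracket notation is the safer default when column names are not known in advance.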

2. Selection of rows

Select by special attribute loc:

Example:

import pandas as pd
import numpy as np

data = {'state': ['Astrilia', 'Mexico', 'China', 'Japan'],
        'years': [2000, 2001, 2002, 2003],
        'pop': [1.5, 3.6, 2.4, 5.1]}
frame = pd.DataFrame(data, columns = ['years', 'state', 'pop'])
val = pd.Series([-1.2, -1.5, -1.7])
frame['debt'] = val
val_1 = pd.Series([100, 200, 300], index = [0, 1, 3])
frame['pofit'] = val_1
print(frame)

# When the rows use the default integer index labels
frame_1row = frame.loc[1]
print(frame_1row)

# When the rows have custom index labels
frame_label = pd.DataFrame(data, columns = ['years', 'state', 'pop'], index = ['one', 'two', 'three', 'four'])
print(frame_label)
frame_label_row = frame_label.loc['two']
print(frame_label_row)

result:

   years     state  pop  debt  pofit
0   2000  Astrilia  1.5  -1.2  100.0
1   2001    Mexico  3.6  -1.5  200.0
2   2002     China  2.4  -1.7    NaN
3   2003     Japan  5.1   NaN  300.0
years      2001
state    Mexico
pop         3.6
debt       -1.5
pofit     200.0
Name: 1, dtype: object
       years     state  pop
one     2000  Astrilia  1.5
two     2001    Mexico  3.6
three   2002     China  2.4
four    2003     Japan  5.1
years      2001
state    Mexico
pop         3.6
Name: two, dtype: object
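
Selection by position, mentioned at the start of this section, uses iloc rather than loc. The following is a minimal sketch that reuses the labelled frame from the example above; the names by_label, by_position, label_slice, and position_slice are only illustrative.

```python
import pandas as pd

data = {'state': ['Astrilia', 'Mexico', 'China', 'Japan'],
        'years': [2000, 2001, 2002, 2003],
        'pop': [1.5, 3.6, 2.4, 5.1]}
frame_label = pd.DataFrame(data, columns=['years', 'state', 'pop'],
                           index=['one', 'two', 'three', 'four'])

# loc selects by index label, iloc selects by integer position.
by_label = frame_label.loc['two']
by_position = frame_label.iloc[1]
print(by_label.equals(by_position))

# Label slices include the end label; position slices exclude the end position.
label_slice = frame_label.loc['one':'two']
position_slice = frame_label.iloc[0:2]
print(len(label_slice), len(position_slice))
```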

3. Column modification

Columns can be modified by assignment. For example, a 'debt' column can be assigned a scalar value, which is repeated for every row, or an array of values.

Example:

import pandas as pd
import numpy as np

data = {'state': ['Astrilia', 'Mexico', 'China'],
        'years': [2000, 2001, 2002],
        'pop': [1.5, 3.6, 2.4]}
frame = pd.DataFrame(data, columns = ['years', 'state', 'pop'])
print(frame)
frame['debt'] = 16.2
print(frame)
frame['pofit'] = np.random.randint(100, 200, size = 3)
print(frame)

result:

   years     state  pop
0   2000  Astrilia  1.5
1   2001    Mexico  3.6
2   2002     China  2.4
   years     state  pop  debt
0   2000  Astrilia  1.5  16.2
1   2001    Mexico  3.6  16.2
2   2002     China  2.4  16.2
   years     state  pop  debt  pofit
0   2000  Astrilia  1.5  16.2    192
1   2001    Mexico  3.6  16.2    138
2   2002     China  2.4  16.2    140

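When assigning a list or array to a column, the length of the value must match the length of the DataFrame. When assigning a Series, its labels are aligned to the index of the DataFrame instead, and any index value missing from the Series is filled with NaN.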

Example:

import pandas as pd
import numpy as np

data = {'state': ['Astrilia', 'Mexico', 'China', 'Japan'],
        'years': [2000, 2001, 2002, 2003],
        'pop': [1.5, 3.6, 2.4, 5.1]}
frame = pd.DataFrame(data, columns = ['years', 'state', 'pop'])
print(frame)
val = pd.Series([-1.2, -1.5, -1.7])
frame['debt'] = val
print(frame)
val_1 = pd.Series([100, 200, 300], index = [0, 1, 3])
frame['pofit'] = val_1
print(frame)

result:

   years     state  pop
0   2000  Astrilia  1.5
1   2001    Mexico  3.6
2   2002     China  2.4
3   2003     Japan  5.1
   years     state  pop  debt
0   2000  Astrilia  1.5  -1.2
1   2001    Mexico  3.6  -1.5
2   2002     China  2.4  -1.7
3   2003     Japan  5.1   NaN
   years     state  pop  debt  pofit
0   2000  Astrilia  1.5  -1.2  100.0
1   2001    Mexico  3.6  -1.5  200.0
2   2002     China  2.4  -1.7    NaN
3   2003     Japan  5.1   NaN  300.0
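
The difference between assigning a list and assigning a Series can be sketched as follows, using the same four-row frame. A three-element list is rejected, while a three-element Series is aligned on the index. The variable length_error is only illustrative.

```python
import pandas as pd

data = {'state': ['Astrilia', 'Mexico', 'China', 'Japan'],
        'years': [2000, 2001, 2002, 2003],
        'pop': [1.5, 3.6, 2.4, 5.1]}
frame = pd.DataFrame(data, columns=['years', 'state', 'pop'])

# A list with 3 values cannot fill a 4-row column.
try:
    frame['debt'] = [-1.2, -1.5, -1.7]
    length_error = None
except ValueError as exc:
    length_error = exc
print(type(length_error).__name__)

# A Series with 3 values is aligned by index label; row 3 becomes NaN.
frame['debt'] = pd.Series([-1.2, -1.5, -1.7])
print(frame['debt'].isna().tolist())
```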

If the column being assigned does not exist, a new column is created:

Example:

import pandas as pd
import numpy as np

data = {'state': ['Astrilia', 'Mexico', 'China', 'Japan'],
        'years': [2000, 2001, 2002, 2003],
        'pop': [1.5, 3.6, 2.4, 5.1]}
frame = pd.DataFrame(data, columns = ['years', 'state', 'pop'])
val = pd.Series([-1.2, -1.5, -1.7])
frame['debt'] = val
val_1 = pd.Series([100, 200, 300], index = [0, 1, 3])
frame['pofit'] = val_1
print(frame)

# Assign values to a new column
frame['date'] = np.random.randint(1, 10, size = 4)
print(frame)

result:

   years     state  pop  debt  pofit
0   2000  Astrilia  1.5  -1.2  100.0
1   2001    Mexico  3.6  -1.5  200.0
2   2002     China  2.4  -1.7    NaN
3   2003     Japan  5.1   NaN  300.0
   years     state  pop  debt  pofit  date
0   2000  Astrilia  1.5  -1.2  100.0     7
1   2001    Mexico  3.6  -1.5  200.0     1
2   2002     China  2.4  -1.7    NaN     8
3   2003     Japan  5.1   NaN  300.0     4

4. Column deletion

Use the del keyword to delete a column.

Example: First add a column consisting of boolean values:

import pandas as pd
import numpy as np

data = {'state': ['Astrilia', 'Mexico', 'China', 'Mexico'],
        'years': [2000, 2001, 2002, 2003],
        'pop': [1.5, 3.6, 2.4, 5.1]}
frame = pd.DataFrame(data, columns = ['years', 'state', 'pop'])
val = pd.Series([-1.2, -1.5, -1.7])
frame['debt'] = val
val_1 = pd.Series([100, 200, 300], index = [0, 1, 3])
frame['pofit'] = val_1
print(frame)

# Build a boolean column: True where state == 'Mexico', otherwise False
# Method 1: vectorized comparison
frame['TF'] = frame.state == 'Mexico'
print(frame)

print(frame.TF[0])

# Method 2: explicit loop over the rows
Buer = []
for i in range(4):
    Buer.append(frame.state[i] == 'Mexico')
frame['tf'] = Buer
print(frame)
print(frame)

result:

   years     state  pop  debt  pofit
0   2000  Astrilia  1.5  -1.2  100.0
1   2001    Mexico  3.6  -1.5  200.0
2   2002     China  2.4  -1.7    NaN
3   2003    Mexico  5.1   NaN  300.0
   years     state  pop  debt  pofit     TF
0   2000  Astrilia  1.5  -1.2  100.0  False
1   2001    Mexico  3.6  -1.5  200.0   True
2   2002     China  2.4  -1.7    NaN  False
3   2003    Mexico  5.1   NaN  300.0   True
False
   years     state  pop  debt  pofit     TF     tf
0   2000  Astrilia  1.5  -1.2  100.0  False  False
1   2001    Mexico  3.6  -1.5  200.0   True   True
2   2002     China  2.4  -1.7    NaN  False  False
3   2003    Mexico  5.1   NaN  300.0   True   True

Then delete the TF column:

import pandas as pd
import numpy as np

data = {'state': ['Astrilia', 'Mexico', 'China', 'Mexico'],
        'years': [2000, 2001, 2002, 2003],
        'pop': [1.5, 3.6, 2.4, 5.1]}
frame = pd.DataFrame(data, columns = ['years', 'state', 'pop'])
val = pd.Series([-1.2, -1.5, -1.7])
frame['debt'] = val
val_1 = pd.Series([100, 200, 300], index = [0, 1, 3])
frame['pofit'] = val_1
print(frame)

# Build a boolean column: True where state == 'Mexico', otherwise False
# Create the new column
frame['TF'] = frame.state == 'Mexico'
print(frame)

# Delete the column
del frame['TF']
print(frame)

result:

   years     state  pop  debt  pofit
0   2000  Astrilia  1.5  -1.2  100.0
1   2001    Mexico  3.6  -1.5  200.0
2   2002     China  2.4  -1.7    NaN
3   2003    Mexico  5.1   NaN  300.0
   years     state  pop  debt  pofit     TF
0   2000  Astrilia  1.5  -1.2  100.0  False
1   2001    Mexico  3.6  -1.5  200.0   True
2   2002     China  2.4  -1.7    NaN  False
3   2003    Mexico  5.1   NaN  300.0   True
   years     state  pop  debt  pofit
0   2000  Astrilia  1.5  -1.2  100.0
1   2001    Mexico  3.6  -1.5  200.0
2   2002     China  2.4  -1.7    NaN
3   2003    Mexico  5.1   NaN  300.0

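Note:

The deletion must be written with bracket notation, del frame['TF'], as in the example above, for it to run correctly.

The original post showed two alternative ways of writing the deletion at this point (the screenshots are not preserved). Both of those forms raise an error.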
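
As a sketch of one form that does fail, and under the assumption that attribute-style deletion was among the failing forms the note refers to, the code below tries del frame.TF and then shows two approaches that work. DataFrame.drop is a swapped-in alternative not used in the original post; unlike del, it returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

data = {'state': ['Astrilia', 'Mexico', 'China', 'Mexico'],
        'years': [2000, 2001, 2002, 2003],
        'pop': [1.5, 3.6, 2.4, 5.1]}
frame = pd.DataFrame(data, columns=['years', 'state', 'pop'])
frame['TF'] = frame['state'] == 'Mexico'

# Assumed failing form: columns cannot be deleted through attribute access.
try:
    del frame.TF
    attribute_delete_failed = False
except AttributeError:
    attribute_delete_failed = True
print(attribute_delete_failed)

# drop returns a new DataFrame without the column; frame itself is unchanged.
dropped = frame.drop(columns=['TF'])
print('TF' in dropped.columns, 'TF' in frame.columns)

# del with bracket notation removes the column in place.
del frame['TF']
print('TF' in frame.columns)
```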

5. Assign nested dictionaries to DataFrame

If a nested dictionary is passed to DataFrame, pandas treats the keys of the outer dictionary as columns and the keys of the inner dictionaries as row indices:

Example:

import pandas as pd
import numpy as np

pop = {'MZY': {2001: 2.4, 2002: 2.9},
       'DRX': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(pop)
print(frame)

result:

      MZY  DRX
2001  2.4  1.7
2002  2.9  3.6
2000  NaN  1.5
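
In the output above the row labels appear in the order in which the inner keys were first encountered (2001, 2002, 2000); this ordering may differ between pandas versions. A minimal sketch of getting a predictable row order is to call sort_index on the result. The name sorted_frame is only illustrative.

```python
import pandas as pd

pop = {'MZY': {2001: 2.4, 2002: 2.9},
       'DRX': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(pop)

# Sort the rows by index label; the original frame is left unchanged.
sorted_frame = frame.sort_index()
print(sorted_frame)

# 'MZY' has no entry for 2000, so that cell is NaN.
print(pd.isna(sorted_frame.loc[2000, 'MZY']))
```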

The DataFrame can be transposed (rows and columns swapped) using NumPy-like syntax:

Example:

import pandas as pd
import numpy as np

pop = {'MZY': {2001: 2.4, 2002: 2.9},
       'DRX': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(pop)
print(frame)

# Transpose
print(frame.T)

result:

      MZY  DRX
2001  2.4  1.7
2002  2.9  3.6
2000  NaN  1.5
     2001  2002  2000
MZY   2.4   2.9   NaN
DRX   1.7   3.6   1.5

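By default, the keys of the inner dictionaries are combined to form the index of the result. If an index is specified explicitly, that index is used exactly as given, and any label not present in the data is filled with NaN: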

Example:

import pandas as pd
import numpy as np

pop = {'MZY': {2001: 2.4, 2002: 2.9},
       'DRX': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(pop)
print(frame)

frame1 = pd.DataFrame(pop, index=[2000, 2002, 2001, 2003])
print(frame1)

result:

      MZY  DRX
2001  2.4  1.7
2002  2.9  3.6
2000  NaN  1.5
      MZY  DRX
2000  NaN  1.5
2002  2.9  3.6
2001  2.4  1.7
2003  NaN  NaN
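
A similar result can be obtained from an existing DataFrame with reindex, which is a swapped-in alternative to passing index to the constructor. This is a minimal sketch; it builds the frame first and then reorders the rows, adding the missing label 2003.

```python
import pandas as pd

pop = {'MZY': {2001: 2.4, 2002: 2.9},
       'DRX': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(pop)

# reindex returns a new DataFrame with rows in the requested order;
# labels that do not exist in frame (here 2003) are filled with NaN.
frame1 = frame.reindex([2000, 2002, 2001, 2003])
print(frame1)
```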


Summary

Although Series and DataFrame cannot solve all problems, they provide an effective and easy-to-use foundation for most applications.

Origin blog.csdn.net/mzy20010420/article/details/127042765