Pandas basic common operations

Pandas

1. Introduction

The name Pandas comes from "panel data" and "Python data analysis". Pandas is a powerful tool set for analyzing structured data. It is built on NumPy and provides high-level data structures and data manipulation tools. It is one of the important factors that make Python a powerful and efficient data analysis environment.

  • A powerful set of tools needed to analyze and manipulate large structured data sets
  • The foundation is NumPy, which provides high-performance matrix operations
  • Provides a large number of functions and methods that can process data quickly and conveniently
  • Widely applied to data mining and data analysis
  • Provides data cleaning functions

The first operation on any data set is cleaning: handling missing values, illegal values, and NaN values, and converting data types. Cleaning takes a long time in the early stage, but it makes the later processing much more convenient.
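The cleaning steps named above can be sketched roughly as follows. The column names and values are made up for illustration, and filling with the column mean is only one possible choice.

```python
import pandas as pd

# Hypothetical raw data: numbers stored as text, with missing and invalid entries
raw = pd.DataFrame({
    "age": ["25", "31", None, "abc"],
    "city": ["Shanghai", None, "Beijing", "Shanghai"],
})

# Convert to numeric; values that cannot be parsed become NaN
raw["age"] = pd.to_numeric(raw["age"], errors="coerce")

# Count the missing values in each column
missing_counts = raw.isnull().sum()

# Fill the missing values, then convert the column to an integer type
cleaned = raw.copy()
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean()).astype("int64")
cleaned["city"] = cleaned["city"].fillna("unknown")
print(cleaned)
```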

2. Data structure and basic operations

1. Series

  • A Series is an object similar to a one-dimensional array, composed of a set of data (of various NumPy data types) and the corresponding index (data labels)
  • When printed, the index is on the left
  • The data (values) are on the right
  • If no index is given, one is created automatically

1.1 Build Series through list

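import pandas as pd
# Without an explicit index, the default index starts at 0
# ser_obj  = pd.Series(range(1,5))

# With an explicit index (its length must match the number of values)
ser_obj = pd.Series(range(1, 6), index = ['a', 'b', 'c', 'd', 'e'])
# Show the first 3 rows
print(ser_obj.head(3))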

1.2 Build Series through dict

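import pandas as pd
# The dict keys become the Series index and the dict values become the Series values
data = {"first": "Life is short, I use Python", "second": "hello", "third": "word"}
ser_obj = pd.Series(data)
print(ser_obj)
# Print the index
print(ser_obj.index)
# Print the values
print(ser_obj.values)

# Get data by position (iloc makes the positional lookup explicit)
print(ser_obj.iloc[1])
# Get data by index label
print(ser_obj["second"])
# Contiguous slicing is supported; an integer slice selects rows by position
print(ser_obj[1:3])
# Non-contiguous selection is supported by passing a list of index labels
print(ser_obj[["first","third"]])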
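As a small sketch of the same lookups written with the explicit accessors: iloc always works by position and loc always works by label. Recent pandas versions warn about integer keys in plain [] on a label-indexed Series, so the explicit forms are the safer habit. The values here are made up.

```python
import pandas as pd

ser = pd.Series({"first": 10, "second": 20, "third": 30})

by_position = ser.iloc[1]                 # second element, by position
by_label = ser.loc["second"]              # the same element, by label
pos_slice = ser.iloc[1:3]                 # position slice: end is excluded
label_slice = ser.loc["first":"second"]   # label slice: end is included

print(by_position, by_label)
print(pos_slice)
print(label_slice)
```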

2. DataFrame

  • A tabular data structure that contains an ordered set of columns, where each column can hold a different type of value. A DataFrame has both a row index and a column index, and the data is stored in a two-dimensional structure.
  • Similar to a two-dimensional array or other tabular data (for example, an Excel spreadsheet or data.frame in R)
  • Each column of data can be of different types
  • Index includes column index and row index
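The points above can be checked on a tiny frame. This is only an illustrative sketch with made-up column names; it prints the per-column types and the two indexes.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["python", "java"],   # text column
    "score": [1.5, 2.5],          # float column
    "count": [3, 4],              # integer column
})

print(df.dtypes)    # one dtype per column
print(df.index)     # row index, created automatically
print(df.columns)   # column index
print(df.shape)     # (number of rows, number of columns)
```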

2.1 Construct DataFrame through numpy.ndarray

import numpy as np
import pandas as pd
arr_obj = np.random.rand(3,4)
df_obj = pd.DataFrame(arr_obj)
print(df_obj)
# Show the first two rows
print(df_obj.head(2))

2.2 Build DataFrame through dict

(a) Use numpy to customize data
data = {
    "A":1,
    "B":pd.Timestamp("20200101"),
    "C":pd.Series(range(10,14),dtype="float64"),
    "D":["python","java","c++","c"],
    "E":np.array([3] *4,dtype="int32"),
    "F":"Shanghai"
}
df_obj = pd.DataFrame(data)
print(df_obj)

# Add a new column
df_obj["new_col"] = np.arange(1,5)
print(df_obj)

# Arithmetic on columns is supported
df_obj["second_new_col"] = df_obj["new_col"] + 12
print(df_obj)

# Delete a column by name
del df_obj["E"]

(b) dict initialization data
data = {'a':[11,22,33],'b':[44,55,66]}
test = pd.DataFrame(data)
print(test)
(c) Custom index and data
import pandas as pd
import numpy as np

data = ["a", "b", "c", "d", "e", "f"]
index = np.arange(1, 7)
columns = ["test"]
# Custom column names
df = pd.DataFrame(data, index,columns=columns)
print(df)
(d) Define a multidimensional DataFrame
df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])


3. Pandas advanced indexing

Series and DataFrame are indexed in much the same way. Because the two differ so little, the cases are worked through mainly on Series and only summarized for DataFrame.

  • Series supports positional slicing (a contiguous range of rows) and selection by a list of index labels (the labels need not be adjacent and are passed as a list)

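    import pandas as pd
    # The dict keys become the Series index and the dict values become the Series values
    data = {"first": "Life is short, I use Python", "second": "hello", "third": "word"}
    ser_obj = pd.Series(data)
    # Contiguous slicing is supported; an integer slice selects rows by position
    print(ser_obj[1:3])
    # Selection by a list of index labels
    print(ser_obj[["first","third"]])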
    
  • DataFrame

    df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
    print(df_obj[1:3])
    print(df_obj[["a","c"]])
    

    If you want the first three rows of columns b and c, this is awkward to do with the basic indexing of a DataFrame.

    It calls for the advanced indexing of DataFrame objects.

The following explains the Pandas advanced indexing methods.

Advanced indexing: label, position, and mixed

1. loc label index

loc selects by label, that is, by the index names we defined. Label slices include the end label.

Using the example above, get the first three rows of columns b and c:

df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
# Non-contiguous selection: a list of column labels
print(df_obj.loc[:2,["b","c"]])
# Contiguous selection: a slice of column labels
print(df_obj.loc[:2,"b":"c"])

2. iloc position index

iloc selects by integer position. Position slices exclude the end position.

Using the example above, get the first three rows of columns b and c:

df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
# Rows 0 to 2 and columns 1 to 2 (b and c)
print(df_obj.iloc[0:3, 1:3])
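To make the difference in slice endpoints concrete, here is a small sketch on a frame with fixed values (chosen only so the result is reproducible). The label slice `:2` includes row 2, while the position slice `0:3` stops before position 3, so both select the same three rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=["a", "b", "c", "d"])

by_label = df.loc[:2, "b":"c"]      # label slices include the end label
by_position = df.iloc[0:3, 1:3]     # position slices exclude the end position

print(by_label)
print(by_label.equals(by_position))
```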

3. ix label and position hybrid index

  • ix is a combination of the two above: it accepts both integer positions and custom labels, depending on the situation.

  • If the index contains both numbers and strings, it is easy to confuse positions with labels, so this method is not recommended.

    This method is error-prone, so try to avoid it. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0; use loc and iloc instead.
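When positions are wanted on one axis and labels on the other, one possible approach is to translate one side first and then use loc or iloc alone. The following is a sketch with made-up row labels, not the only way to do it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=list("vwxyz"), columns=["a", "b", "c", "d"])

# First three rows by position, columns by label:
# turn the row positions into labels, then use loc
mixed_via_loc = df.loc[df.index[:3], ["b", "c"]]

# Or turn the column labels into positions, then use iloc
col_positions = df.columns.get_indexer(["b", "c"])
mixed_via_iloc = df.iloc[:3, col_positions]

print(mixed_via_loc)
```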

4. Pandas alignment operation

  • Alignment is an important step in data cleaning: operations between two objects are matched up by index
  • Positions that exist in only one of the objects are filled with NaN
  • The methods add, sub, mul, and div perform addition, subtraction, multiplication, and division

1. Series

ser_obj1 = pd.Series(range(10, 20), index = range(10))
ser_obj2 = pd.Series(range(20, 25), index = range(5))
# With plain addition, the last 5 values are NaN
# print(ser_obj1 + ser_obj2)
print(ser_obj1.add(ser_obj2))

Such operations can leave unwanted NaN values in the result, so use the fill_value parameter to substitute a default for the side that is missing.

ser_obj1 = pd.Series(range(10, 20), index = range(10))
ser_obj2 = pd.Series(range(20, 25), index = range(5))
# Treat values missing on one side as 0
print(ser_obj1.add(ser_obj2,fill_value=0))

2. DataFrame

df_obj1 = pd.DataFrame(np.ones((2,2)),columns = ['a','b'])
df_obj2 = pd.DataFrame(np.ones((3,3)),columns = ['a','b','c'])
print(df_obj1.add(df_obj2,fill_value = 0))
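One detail worth checking is what fill_value does when a position is missing from both frames. In the sketch below (frames made up so that they only partly overlap), a position present in one frame uses the fill value for the other side, while a position present in neither frame stays NaN.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 2)), columns=["a", "b"])                # rows 0, 1
df2 = pd.DataFrame(np.ones((2, 2)), columns=["b", "c"], index=[1, 2])  # rows 1, 2

result = df1.add(df2, fill_value=0)
print(result)
```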

5. Pandas function application

  • In NumPy, a function that operates on each element of an array is called a ufunc (universal function)
  • NumPy's ufuncs can be used directly on Pandas objects

1. Use functions in NumPy directly

Example:

df = pd.DataFrame(np.random.randn(5,4) - 1)
print(df)
print(np.abs(df))

2. Apply the function to the row or column through apply

The axis parameter specifies the axis:

  • The default value is 0: the function is applied to each column
  • With a value of 1, the function is applied to each row

df = pd.DataFrame(np.random.randn(5,4))
# Lambda that returns the maximum value
f = lambda x : x.max()
# axis defaults to 0 (column direction)
print(df.apply(f))
# Row direction
print(df.apply(f, axis=1))
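Because the example above uses random data, the direction of axis is hard to read off the output. The same calls on a small fixed frame (values made up) make it easier to see.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5], "b": [7, 2], "c": [3, 4]})

col_max = df.apply(lambda col: col.max())           # axis=0: one result per column
row_max = df.apply(lambda row: row.max(), axis=1)   # axis=1: one result per row

print(col_max)
print(row_max)
```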

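3. Map a function over every element of a DataFrame with applymap

df = pd.DataFrame(np.random.randn(5,4))
print(df)
# Format every element as a string with two decimal places
# Note: since pandas 2.1, applymap is deprecated in favor of DataFrame.map
f1 = lambda x : '%.2f' % x
print(df.applymap(f1))
# Double every element
f2 = lambda x: x+x
print(df.applymap(f2))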

6. Sorting

1. Index sorting

(a) Series
ser_obj2 = pd.Series(range(10, 15), index = np.random.randint(5, size=5))
print(ser_obj2)
# Ascending by default
print(ser_obj2.sort_index())
# Descending
print(ser_obj2.sort_index(ascending = False))

(b) DataFrame
df_obj = pd.DataFrame(np.random.randn(3, 5),
                      index=np.random.randint(3, size=3),
                      columns=np.random.randint(5, size=5))
print(df_obj)
# The axis and ascending parameters switch between row/column sorting and ascending/descending order
df_obj_sort = df_obj.sort_index(axis=0, ascending=False)
print(df_obj_sort)
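A small sketch of the two parameters on fixed data (labels made up): axis=0 reorders the rows by the row index, and axis=1 reorders the columns by the column labels.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=[2, 0], columns=["c", "a", "b"])

rows_sorted = df.sort_index(axis=0)                    # rows in ascending index order
cols_sorted = df.sort_index(axis=1, ascending=False)   # columns in descending label order

print(rows_sorted)
print(cols_sorted)
```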

2. Value sorting

(a) Series
ser_obj = pd.Series(np.random.randint(10,20,size= 10))
print(ser_obj)
# Ascending by default
print(ser_obj.sort_values())
# Descending
print(ser_obj.sort_values(ascending = False))

(b) DataFrame
df4 = pd.DataFrame(np.random.randn(3, 5),
                   index=np.random.randint(2, size=3),
                   columns=np.arange(5))
print(df4)
# by takes a column label or a list of column labels (with axis=1 it takes row labels instead)
# The column labels are kept unique here: a missing or duplicated label makes sort_values raise an error
print(df4.sort_values(by=[1]))
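Since by accepts a list, several columns can serve as sort keys, and ascending can be given per key. The following sketch uses made-up column names and values.

```python
import pandas as pd

df = pd.DataFrame({"group": ["x", "y", "x", "y"], "score": [3, 1, 2, 4]})

# Sort by group ascending, then by score descending within each group
sorted_df = df.sort_values(by=["group", "score"], ascending=[True, False])
print(sorted_df)
```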

7. Missing value and NaN handling

df_obj = pd.DataFrame([
    [1, 2, np.nan, np.nan],
    [np.nan, 3, 4, np.nan],
    list(range(4))])
print(df_obj)
# Returns a Boolean matrix marking the missing values
print(df_obj.isnull())
# Drop the rows or columns that contain NaN: axis=1 drops columns, axis=0 drops rows
print(df_obj.dropna(axis=1))
# Replace NaN with a given value
print(df_obj.fillna(0))
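Dropping every column that contains a NaN can discard a lot of data. As a hedged sketch of two gentler options on the same values: the thresh parameter of dropna keeps rows or columns with at least that many non-NaN entries, and fillna accepts per-column values such as the column means.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, 2, np.nan, np.nan],
    [np.nan, 3, 4, np.nan],
    [0, 1, 2, 3]], dtype="float64")

# Keep only the columns that have at least 2 non-NaN values
cols_kept = df.dropna(axis=1, thresh=2)

# Keep only the rows without any NaN
rows_kept = df.dropna(axis=0)

# Fill each NaN with the mean of its own column
filled_mean = df.fillna(df.mean())

print(cols_kept)
print(filled_mean)
```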

8. Hierarchical index

A hierarchical index has two or more index levels, as in the following output.

a  0   -0.816360
   1   -0.459840
   2    0.664878
b  0    0.039940
   1    1.049324
   2   -0.525796
c  0   -1.887801
   1    1.361369
   2    0.120353
d  0   -1.432332
   1    0.143934
   2    0.320637
   

Use code to generate hierarchical Series objects

ser_obj = pd.Series(np.random.randn(12),
                    index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'],
                           [0, 1, 2] * 4])

print(type(ser_obj.index))
# <class 'pandas.core.indexes.multi.MultiIndex'>

You can view ser_obj.index in a for loop, which makes it easier to understand the concept of hierarchical index.
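A minimal sketch of that loop, on a smaller Series with fixed values so the output is easy to follow: each entry of the index is a tuple holding one label per level.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(6),
                index=[["a", "a", "a", "b", "b", "b"], [0, 1, 2] * 2])

# Each index entry is an (outer label, inner label) tuple
for outer, inner in ser.index:
    print(outer, inner)

index_tuples = list(ser.index)
```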

Elements are accessed in basically the same way as with a single-level Series or DataFrame.

# Get all the data under outer label a
print(ser_obj["a"])
# Get the entry with inner label 0 under outer label a
print(np.round(ser_obj["a"][0], 6))
# Get the entry with inner label 1 under every outer label
print(ser_obj[:,1])

9. Pandas statistical calculation

describe() generates summary statistics for a data set: for each column, the count, mean, standard deviation, minimum, quartiles, and maximum.

df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
print(df_obj)
print(df_obj.describe())

Common methods

method           Description
count            Number of non-NA values
describe         Summary statistics for a Series or for each DataFrame column
min, max         Minimum and maximum values
argmin, argmax   Integer positions of the minimum and maximum values
idxmin, idxmax   Index labels of the minimum and maximum values
sum              Sum of the values
mean             Mean of the values
median           Median of the values
var              Sample variance
std              Sample standard deviation
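A few of these methods on a small Series with made-up values. The sketch shows the difference between a position (argmax) and a label (idxmax), and that std is the square root of var.

```python
import pandas as pd

ser = pd.Series([3, 9, 1], index=["x", "y", "z"])

max_position = ser.argmax()   # integer position of the largest value
max_label = ser.idxmax()      # index label of the largest value
variance = ser.var()          # sample variance (divides by n - 1)
std_dev = ser.std()           # sample standard deviation

print(max_position, max_label)
print(variance, std_dev)
```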

Origin blog.csdn.net/qq_41292236/article/details/108620803