Python: Detailed Explanation of pandas Usage

Table of contents

1. Introduction to pandas

1.1 The origin of pandas

1.2 pandas features

1.3 Two main data structures of pandas

2. Detailed explanation of pandas data structure

2.1 pandas Series

2.1.1 Create a series from a dictionary

2.1.2 Create Series from ndarray

2.1.3 Creating Series from Scalar

2.1.4 Series features

2.1.4.1 ndarray-like

2.1.4.2 dict-like (dictionary)

2.1.4.3 Vectorization operation (broadcast) and label alignment

2.1.4.4 Name attribute

2.2 DataFrame (two-dimensional data)

2.2.1 Create DataFrame from Series dictionary

2.2.2 Create a DataFrame from a dictionary object

2.2.3 Create DataFrame from ndarray or dictionary of lists

2.2.4 Create DataFrame from structured or record ndarray

2.2.5 Create DataFrame from dictionary list

2.2.6 Create DataFrame from tuple dictionary

2.2.7 Create DataFrame from a Series

2.2.8 Create DataFrame from named tuples

2.2.9 Create DataFrame from a list of data classes

2.2.10 Other creation methods

DataFrame.from_dict

DataFrame.from_records

2.2.11 DataFrame operations

2.2.11.1 Column selection, addition and deletion

2.2.11.2 Assigning new columns in method chains

2.2.11.3 Index/Select

2.2.11.4 Data alignment and arithmetic operations

2.2.11.5 Transposing

2.2.11.6 Interoperation between DataFrame and NumPy functions

3. Basic operations of pandas

3.1 Import commonly used libraries

3.2 Reading and writing files

3.3 Index

3.4 Indexing through the loc function

3.5 Addition and deletion of data

3.6 Data modification and search

3.7 Time and date format processing

3.8 Data stacking and merging

3.9 String processing

3.10 Data statistics and sorting

3.11 Read txt file


1. Introduction to pandas

1.1 The origin of pandas

Python's built-in data analysis capabilities are limited; to strengthen them you need to install third-party extension libraries. Among these, the library with the strongest ability to analyze structured data (which you can simply think of as two-dimensional tables, like the grid data we commonly use in Excel spreadsheets) is pandas.

pandas is a data analysis package for Python. It was originally developed at AQR Capital Management starting in April 2008 and was open-sourced at the end of 2009. It is currently developed and maintained by the PyData team, a group of developers focused on Python data packages, and it is part of the PyData project. pandas was initially developed as a financial data analysis tool, and therefore it provides good support for time series analysis. The name pandas comes from "panel data" and "Python data analysis".

1.2 pandas features

  • pandas is a toolkit built on top of NumPy, created to solve data analysis tasks. NumPy, however, can only process numbers; to handle other types of data, such as strings, you need pandas.

  • pandas incorporates a large number of libraries and several standard data models, providing the tools needed to operate on large datasets efficiently.

  • pandas provides a large number of functions and methods for processing data quickly and conveniently, which is one of the important factors that make Python a powerful and efficient data analysis language.

  • pandas can import data from various file formats such as CSV, JSON, SQL, and Microsoft Excel.

  • pandas can perform operations on all kinds of data, such as merging, reshaping, and selection, and offers data cleaning and data processing features.

  • pandas is widely used in many data analysis fields, such as academia, finance, and statistics.

1.3 Two main data structures of pandas

The main data structures of Pandas are Series(one-dimensional data) and DataFrame(two-dimensional data). These two data structures are sufficient to handle most typical use cases in finance, statistics, social sciences, engineering and other fields.

A Series is a one-dimensional array-like object consisting of a set of data (of various NumPy data types) and an associated set of data labels (i.e., the index).

 

A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean). A DataFrame has both a row index and a column index; it can be viewed as a dictionary of Series objects that share the same index.

2. Detailed explanation of pandas data structure

 

2.1 pandas Series

Basic method of creating Series:

#s = pd.Series(data, [index=index])

The data can be of different types, including dictionary, ndarray, and scalar. The index parameter is optional; the final index generated depends on the type of the data passed in.

2.1.1 Create a series from a dictionary
data = {'b': 1, 'a': 0, 'c': 2}
ser1 = pd.Series(data)
print(ser1)
​
# Result
b    1
a    0
c    2
dtype: int64

When no index parameter is passed, the Series index comes from the dictionary keys, and the values come from the corresponding dictionary values. According to the official documentation, if the Python version is >= 3.6 and the pandas version is >= 0.23, the index keeps the dict's insertion order. With Python < 3.6 or pandas < 0.23, the index is sorted alphabetically by dictionary key.

If index is specified, only the values in the dictionary corresponding to the labels in index are taken to create the Series. If a label is not a dictionary key, the value for that label is NaN (not a number, used in pandas to mark missing data).

Example:

data = {'b': 1, 'a': 0, 'c': 2}
index = ['b', 'c', 'd', 'a']
ser1 = pd.Series(data=data,index=index)
print(ser1)
​
Result:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

2.1.2 Create Series from ndarray

If data is an ndarray and index is specified, the index must be the same length as data. If no index is passed, an integer index with values [0, ..., len(data) - 1] is created automatically.

ser2 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(ser2)
print(ser2.values) # print the values
print(ser2.index)  # print the index
ser2 = pd.Series(np.random.randn(5))
print(ser2)
​
# Result
a   -0.691551
b    0.484485
c    0.240262
d   -1.184450
e   -0.533851
dtype: float64
[-0.69155096  0.48448515  0.24026173 -1.18444977 -0.53385096]
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
0    1.118744
1    1.570550
2   -0.069444
3   -0.086134
4   -0.950202
dtype: float64

2.1.3 Creating Series from Scalar

If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

ser3 = pd.Series(5.0, index=['a', 'b', 'c', 'd', 'e'])
print(ser3)
​
# Result
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

2.1.4 Series features
2.1.4.1 ndarray-like

A Series behaves very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing also slice the index. Example:

ser2 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(ser2[0])
print(ser2[1:3])
print(ser2[ser2 > ser2.median()])  # greater than the median
print(np.exp(ser2))

A Series also has a dtype attribute:

print(ser2.dtype)

To get the underlying array of a Series, you can use the Series.array property:

print(ser2.array)
​
# Result:
[  -0.7815805591394526, -0.033036078599971104,    1.4844533748085762,
    1.2854696909097223,   -0.9010676391265999]
Length: 5, dtype: float64

Obtain the corresponding NumPy array ndarray through the Series.to_numpy() method:

print(ser2.to_numpy())
​
# the resulting array
[ 1.66084398 -0.07187783 -1.18355426  2.5447355  -0.45818444]

2.1.4.2 dict-like (dictionary)

We saw earlier that a Series can be created from a dictionary. Correspondingly, the original dictionary and the Series created from it behave essentially the same for getting and setting data. That is, a Series is like a fixed-size dict whose values can be gotten and set via index labels.

data = {'b': 1, 'a': 3, 'c': 2}
index = ['b', 'c', 'd', 'a']
ser1 = pd.Series(data=data,index=index)
# print(ser1)
​
print(ser1['a'])
ser1['f'] = 12
ser1['d'] = 10
print(ser1)
​
# check whether a label exists
print('f' in ser1)
print('z' in ser1)

Using the get method, requesting a label that is not in the Series returns None or the specified default value:

print(ser1.get('z',np.nan))

2.1.4.3 Vectorization operation (broadcast) and label alignment

When working with raw NumPy arrays, you generally don't need to loop through values. The same is true when using Series in pandas. Series can also be passed to most NumPy methods that require an ndarray.

Example:

data = {'b': 1, 'a': 3, 'c': 2}
index = ['b', 'c', 'd', 'a']
ser1 = pd.Series(data=data,index=index)
print(ser1 + ser1)
print(ser1 - 1)
print(np.square(ser1))

A key difference between Series and ndarray is that operations between Series automatically align data based on labels, that is, operations are performed between elements with the same label.

print(ser1[1:] + ser1[:-1])

The index of the result of an operation between Series with misaligned indexes is the union of all the indexes involved. If a label is not found in one of the Series, the corresponding value in the result is NaN. Code can thus be written without any explicit data alignment, which provides great freedom and flexibility for interactive data analysis and research. The integrated data alignment capability of pandas data structures distinguishes pandas from most other tools for working with labeled data.

Usually, to avoid losing information, the default result of an operation on objects with different indexes uses the union of the participating indexes. Some labels in the result will be marked as missing data (NaN) because those labels are absent from one or more of the objects; such gaps are often important information about the calculation. If you don't need the missing data, you can discard those values and labels with the dropna() function.

print((ser1[1:] + ser1[:-1]).dropna())
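As an alternative to dropping the misaligned labels, the gaps can be filled during the operation itself. This is a minimal sketch using Series.add with its fill_value parameter, reusing the data from the example above:

```python
import pandas as pd

ser1 = pd.Series({'b': 1, 'a': 3, 'c': 2}, index=['b', 'c', 'd', 'a'])

# ser1[1:] drops the first label 'b'; ser1[:-1] drops the last label 'a'.
# add() with fill_value=0 substitutes 0 for a value present in only one
# operand, instead of producing NaN.
summed = ser1[1:].add(ser1[:-1], fill_value=0)
print(summed)
```

Note that label d remains NaN: fill_value only covers labels absent from one operand, not values that are missing in both.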

2.1.4.4 Name attribute

Series has a name attribute, which can be understood as a column name:

ser1 = pd.Series(data=data,index=index,name='test')
print(ser1.name)

You can use the Series.rename() method to rename a Series. It returns a new Series object; the name attribute of the original Series is unchanged:

ser1 = pd.Series(data=data,index=index,name='test')
print(ser1.name)
ser2 = ser1.rename('te')
print(ser2.name)

2.2 DataFrame (two-dimensional data)

DataFrame is the most commonly used pandas object. It is a two-dimensional labeled data structure whose columns may have different data types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects.

The basic method of creating a DataFrame is as follows:

#df = pd.DataFrame(data, [index=index, columns=columns])

Like Series, DataFrame accepts many different types of input data:

  • Dictionary of one-dimensional ndarray, list, dictionary, Series and other objects

  • 2D NumPy ndarray

  • Structured or record ndarray

  • A Series

  • Another DataFrame

In addition to data, optionally, you can also pass index (row labels) and columns (column labels) parameters. If index and/or column parameters are passed, the index and/or column of the generated DataFrame object will be the index and/or column you specified. If no axis labels are passed, they will be constructed from the input data according to default rules.

2.2.1 Create DataFrame from Series dictionary
#df = pd.DataFrame(data, [index=index, columns=columns])
​
d = {
"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])
}
​
# With no index or columns passed, the result's index is the union of the Series indexes, and the columns are the dict keys
df = pd.DataFrame(d)
print(df)
​
# With index specified, data with matching labels is taken from each Series; unmatched labels get NaN
df = pd.DataFrame(d, index=["d", "b", "a"])
print(df)
​
# With both index and columns specified, likewise, a column label with no matching dict key is filled entirely with NaN
df = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
print(df)
​
# Result
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

2.2.2 Create a DataFrame from a dictionary object

These nested dictionaries are first converted to Series, and then a DataFrame is created from the Series dictionary:

dd = {'one': {'a':1, 'b':2},
'two': {'c':3, 'd':4}}
​
pa = pd.DataFrame(dd)
print(pa)

The keys of the outer dictionary become the column index, the keys of the inner dictionaries become the row index, and positions without a label default to NaN.

2.2.3 Create DataFrame from ndarray or dictionary of lists

The ndarray or list must be of the same length. If an index is specified, the length of the index must also be the same as the length of the array/list. If no index is passed, an integer index range(n) is automatically created, where n is the length of the array/list.

Example:

data = {'one': np.array([1.0, 2.0, 3.0, 4.0]), 'two': np.array([4.0, 3.0, 2.0, 1.0])}
pb = pd.DataFrame(data)
print(pb)
​
pc = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print(pc)
​
data = {'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, 2.0, 1.0]}
pe = pd.DataFrame(data)
print(pe)
​
# Result:
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

2.2.4 Create DataFrame from structured or record ndarray

These are handled in the same way as regular arrays:

data = np.array([(1, 2.0, "Hello"), (2, 3.0, "World")], dtype=[("A", "i4"), ("B", "f4"), ("C", "a10")])
pa = pd.DataFrame(data)
print(pa)
​
pb = pd.DataFrame(data, index=["first", "second"])
print(pb)
​
pc = pd.DataFrame(data, columns=["C", "A", "B"])
print(pc)
​
# Result:
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0
​
 
 
2.2.5 Create DataFrame from dictionary list

Dictionary keys become the column names by default:

data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
pa = pd.DataFrame(data2)
print(pa)
​
pb = pd.DataFrame(data2, index=["first", "second"])
print(pb)
​
pc = pd.DataFrame(data2, columns=["a", "b"])
print(pc)

2.2.6 Create DataFrame from tuple dictionary

A DataFrame with multi-level indexes can be automatically created by passing a dictionary of tuples:

pa = pd.DataFrame(
     {
         ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
         ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
         ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
         ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
         ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}
     }
)
print(pa)
​
# Result:
       a              b      
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
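With MultiIndex columns like these, selecting a first-level key returns the whole group of sub-columns at once. A small sketch, reusing two of the tuple keys from the example above:

```python
import pandas as pd

pa = pd.DataFrame({
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
})

# Selecting the top-level column label 'a' yields a sub-DataFrame whose
# columns are the second-level labels under 'a'
sub = pa['a']
print(sub)
```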

2.2.7 Create DataFrame from a Series

A DataFrame created from a Series has only one column of data. The name of the column is the original name of the Series (when no other column names are provided), and its index is the same as the input Series.

ser = pd.Series(range(3), index=list('abc'), name='ser')
​
print(pd.DataFrame(ser))

2.2.8 Create DataFrame from named tuples

The field names of the first namedtuple in the list determine the columns of the DataFrame. The subsequent named tuples (or plain tuples) are simply unpacked and their values fill the rows of the DataFrame. If any of these tuples is shorter than the first namedtuple, the later columns in the corresponding row are marked as missing values. If any is longer than the first namedtuple, a ValueError is raised.

from collections import namedtuple
​
Point = namedtuple("Point", "x y")
​
# Columns are determined by the first namedtuple Point(0, 0); later tuples may be namedtuples or plain tuples
print(pd.DataFrame([Point(0, 0), Point(0, 3), (2, 3)]))
​
Point3D = namedtuple("Point3D", "x y z")
# The first tuple gives the DataFrame 3 columns (x, y, z), but the third tuple in the list has length 2,
# so in the third row of the DataFrame the third column is NaN
print(pd.DataFrame([Point3D(0, 0, 0), Point3D(0, 3, 5), Point(2, 3)]))

2.2.9 Create DataFrame from a list of data classes

Passing a list of dataclass instances to the DataFrame constructor is equivalent to passing a list of dictionaries. Note that all values in the list should be data classes; mixing types in the list raises a TypeError.

from dataclasses import make_dataclass
Point = make_dataclass('Point', [('x', int), ('y', int)])
​
print(pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)]))

2.2.10 Other creation methods

In addition to using the class constructor to create DataFrame objects, the DataFrame class itself also provides some class methods for creating objects.

DataFrame.from_dict

It accepts a dictionary of dictionaries or a dictionary of array-like sequences and returns a DataFrame. It behaves like the DataFrame constructor except for the orient parameter, which defaults to 'columns' but can be set to 'index' to use the dictionary keys as row labels.

print(pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])])))
​
# Passing orient='index' makes the dict keys row labels
print(pd.DataFrame.from_dict(
dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),
orient='index',
columns=['one', 'two', 'three'],
))

DataFrame.from_records

It accepts a list of tuples or a structured array as argument and works like the ordinary DataFrame constructor, except that the index of the resulting DataFrame can be a specific field of the structured data type.

data = np.array([(1, 2., b'Hello'), (2, 3., b'World')],dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
print(pd.DataFrame.from_records(data, index='C'))  # use field C as the index

2.2.11 DataFrame operations
2.2.11.1 Column selection, addition and deletion

Like a Series, a DataFrame is similar to a dictionary. You can semantically think of it as a dictionary of Series objects that share the same index. The syntax for getting, setting, and deleting columns is the same as the corresponding dictionary operations:

d = {
"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])
}
# With no index or columns passed, the index is the union of the Series indexes and the columns are the dict keys
df = pd.DataFrame(d)
print(df)
print(df['one'])
df['three'] = df['one'] * df['two']
df['flag'] = df['one'] > 2  # boolean
print(df)

Columns can be removed or popped like dictionary operations:

del df['one']
three = df.pop('three')
print(three)

When inserting a scalar value, the value fills the entire column:

df['foo'] = 'bar'
print(df)

When inserting a Series with a different index than the DataFrame, only values with matching labels are kept; labels not in the DataFrame index are discarded, and index labels missing from the Series are set to NaN:

df['one_trunc'] = pd.Series([1,2,3,4], index=list('acef'))
print(df)

It is also possible to insert a raw ndarray, but its length must match the length of the DataFrame index:

df['array'] = np.array([5, 6, 7, 8])
print(df)

By default, columns are inserted at the end. Use the DataFrame.insert() method to insert at a specific position among the columns:

DataFrame.insert(loc, column, value, allow_duplicates=False): insert data at the specified column position of the DataFrame.

Parameters:

  • loc: int, the target column position; to insert as the first column, use loc=0
  • column: the name of the inserted column, e.g. column='new column'
  • value: a number, array, Series, etc.
  • allow_duplicates: whether to allow duplicate column names; True allows the new column name to duplicate an existing one

df.insert(1, 'one', 'bar')  # insert as the second column
print(df)

2.2.11.2 Assigning new columns in method chains

DataFrame has an assign() method that makes it easy to create new columns derived from existing columns:

iris = pd.DataFrame({
     'SepalLength': [5.1, 4.9, 4.7, 4.6, 5.0],
     'SepalWidth': [3.5, 3.0, 3.2, 3.1, 3.6],
     'PetalLength': [1.4, 1.4, 1.3, 1.5, 1.4],
     'PetalWidth': [0.2, 0.2, 0.2, 0.2, 0.2],
     'Name': 'Iris-setosa'
})
iris = iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])
​
print(iris)

You can also pass a function object that only accepts one parameter. In this process, the DataFrame object calling the assign method will be passed to this function, and this function will generate a new column:

iris = pd.DataFrame({
     'SepalLength': [5.1, 4.9, 4.7, 4.6, 5.0],
     'SepalWidth': [3.5, 3.0, 3.2, 3.1, 3.6],
     'PetalLength': [1.4, 1.4, 1.3, 1.5, 1.4],
     'PetalWidth': [0.2, 0.2, 0.2, 0.2, 0.2],
     'Name': 'Iris-setosa'
})
# iris = iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])
iris = iris.assign(sepal_ratio=lambda x: (x['SepalWidth'] / x['SepalLength']))
print(iris)

assign() returns a copy of the data with the new columns inserted, leaving the original DataFrame unchanged.
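A quick sketch of that copy semantics, with a small made-up frame:

```python
import pandas as pd

dfa = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
dfb = dfa.assign(C=dfa['A'] + dfa['B'])

# The new column only exists on the returned copy
print('C' in dfa.columns)
print('C' in dfb.columns)
```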

The assign() method also allows dependent assignment: among the keyword arguments passed to assign, expressions are evaluated in order from left to right, so later expressions can refer to columns created earlier:

dfa = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
​
print(dfa.assign(C=lambda x: x['A'] + x['B'], D=lambda x: x['A'] + x['C']))

2.2.11.3 Index/Select

The basic indexing methods, as summarized in the pandas documentation, are:

  • Select column: df[col] (returns a Series)
  • Select row by label: df.loc[label] (returns a Series)
  • Select row by integer position: df.iloc[loc] (returns a Series)
  • Slice rows: df[5:10] (returns a DataFrame)
  • Select rows by boolean vector: df[bool_vec] (returns a DataFrame)

There is also a df.col syntax for selecting columns; the column name col must be a valid Python variable name to use this syntax.

data = {'one': np.array([1.0, 2.0, 3.0, 4.0]), 'two': np.array([4.0, 3.0, 2.0, 1.0])}
pb = pd.DataFrame(data)
# print(pb)
​
pc = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print(pc)
​
# select columns
print(pc['one'])
print(pc.one)
​
# select rows by label
print(pc.loc['a'])
print(pc.iloc[2])
​
# slice rows
print(pc[:2])

You can also retrieve data based on boolean values:

d = {
"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
"two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])
}
# With no index or columns passed, the index is the union of the Series indexes and the columns are the dict keys
df = pd.DataFrame(d)
print(df)
print(df['one'])
df['three'] = df['one'] * df['two']
df['flag'] = df['one'] > 2
print(df)
​
# select by boolean value
print(df.flag)
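To actually filter rows with the boolean column, pass it as an indexer. A minimal self-contained sketch, rebuilding the frame above:

```python
import pandas as pd

d = {
    'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
    'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(d)
df['flag'] = df['one'] > 2

# Indexing with the boolean Series keeps only the rows where it is True
filtered = df[df['flag']]
print(filtered)
```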

2.2.11.4 Data alignment and arithmetic operations

Data alignment between DataFrame objects is automatic on both the columns and the index. As with Series, the resulting object has the union of the column and row labels.

df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
print(df)
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
print(df2)
print(df + df2)

When performing operations between a DataFrame and a Series, the default behavior is to align the Series index with the DataFrame columns and then broadcast row by row. Example:

print(df - df.iloc[0])
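If instead you want to match the Series against the DataFrame's index and broadcast column by column, use the method form of the operator with axis=0. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=['A', 'B', 'C'])

# The method form with axis=0 aligns df['A'] against the row index,
# subtracting it from every column
col_wise = df.sub(df['A'], axis=0)
print(col_wise)
```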

Arithmetic operations with scalars operate element-wise:

print(df * 5 - 2)
print(1 / df)
print(df ** 2)

Boolean operators also operate on an element-by-element basis, performing Boolean operations on elements at the same position:

df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)
print(df1 & df2)
print(df1 | df2)
print(df1 ^ df2)
print(-df1)

2.2.11.5 Transposing

Like an ndarray, a DataFrame can be transposed by accessing the T attribute or calling the DataFrame.transpose() method:

print(df1.T)

2.2.11.6 Interoperation between DataFrame and NumPy functions

Most NumPy functions can be called directly on Series and DataFrame:

df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
print(np.abs(df))
print(np.square(df))
print(np.asarray(df))

When two pandas objects are passed to a NumPy function, they are aligned first and then the operation is performed.

numpy.remainder() is a NumPy mathematical function: it returns the element-wise remainder of division between two arrays arr1 and arr2, i.e. arr1 % arr2; where arr2 is 0 and both arr1 and arr2 are integer arrays, the result is 0.

ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
​
ser2 = pd.Series([1, 3, 5], index=['b', 'a', 'c'])
​
print(np.remainder(ser1, ser2))

As with Series, the corresponding ndarray can be obtained using the DataFrame.to_numpy() method:

df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
print(df.to_numpy())

3. Basic operations of pandas

3.1 Import commonly used libraries

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import ssl
warnings.filterwarnings('ignore')

ssl._create_default_https_context = ssl._create_unverified_context

3.2 Reading and writing files

To read CSV format data, use pd.read_csv(), and to write to a file, use DataFrame.to_csv():

work_path = '../save_file/'
os.chdir(work_path)  # change the working directory to work_path
print(os.getcwd())  # print the current directory
data = pd.read_csv(
    'data.csv', encoding = 'utf-8', dtype = {'id':str, 'name':str,'age':int})  # read data.csv
print(data.head())
data.to_csv('data_out.csv', encoding = 'utf-8')  # write data_out.csv; the file is created automatically if absent
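By default, to_csv also writes the row index as the first column; pass index=False to drop it. A small self-contained sketch writing to an in-memory buffer instead of a file (the frame here is made-up, not the data.csv above):

```python
import io

import pandas as pd

df = pd.DataFrame({'id': ['1', '2'], 'name': ['Ann', 'Bob']})

buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False omits the row-label column
print(buf.getvalue())
```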

3.3 Index

With direct indexing, columns are easy to locate and select; rows, however, can apparently only be selected as contiguous slices, not flexibly:

print(data.columns)  # show the column labels
print(data['id'])  # get one column, 'id'
print(data[['id','name']])  # get two columns; selecting with a list is why there are 2 layers of brackets
print(data[1:4])  # get rows 1:3; non-contiguous rows seem to require the loc or iloc functions
print(data[['id','name']][1:4])  # get two columns, then rows 1:4 of the resulting dataframe

3.4 Indexing through the loc function

The loc function locates by label (index), not by row position. The first argument of loc operates on rows, the second on columns.

print(data.loc[1:3, ['id','name']])  # rows with labels 1:3, columns 'id' and 'name'
print(data.loc[data.id == '1', ['id','name']])  # rows where id equals '1', columns 'id' and 'name'
print(data.loc[data.age > 22, ])  # all rows with age > 22
print(data.loc[(data.age > 24) | (data.id == '1') ])  # all rows with age > 24 or id == '1'

The iloc function locates by row position. Its first argument operates on rows, the second on columns.

print(data.iloc[[1, 3], 0:2 ])  # rows at positions 1 and 3, columns 0 to 2
print(data.iloc[[1, 3], [0, 2]])  # rows at positions 1 and 3, columns 0 and 2

3.5 Addition and deletion of data

Columns are added mainly by direct assignment or the insert() method, and deleted mainly with the drop and del methods. Rows can also be deleted with drop, and appended at the end with _append. There are also better methods.

tmp = data['age']
data.insert(0, 'copy购买量', tmp)  # insert a column named 'copy购买量' at position 0, with values tmp
print(data)
del data['copy购买量']  # delete the column named 'copy购买量' from the dataframe
data.insert(1, 'copy购买量', tmp)  # insert 'copy购买量' again, at position 1
data.insert(0, 'copy购买量1', tmp)  # insert 'copy购买量1' at position 0
data.drop(labels = ['copy购买量', 'copy购买量1'], axis = 1, inplace = True)
# the line above deletes the columns named 'copy购买量' and 'copy购买量1';
# axis=1 drops columns, axis=0 drops rows, and inplace=True modifies data itself
print(data.drop(labels = [0, 2], axis = 0, inplace = False))  # drop the rows labeled 0 and 2
print(data.drop(labels = range(1, 3), axis = 0, inplace = False))  # drop the rows labeled 1 to 2
print(data._append(data))  # append rows at the end of the df
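Among the better methods: _append is a private method, and the public DataFrame.append it wraps was removed in pandas 2.0; the documented replacement is pd.concat. A minimal sketch with a small made-up frame (not the data above):

```python
import pandas as pd

df = pd.DataFrame({'id': ['1', '2'], 'age': [20, 25]})
extra = pd.DataFrame({'id': ['3'], 'age': [30]})

# ignore_index=True renumbers the result 0..n-1 instead of keeping
# each frame's original row labels
combined = pd.concat([df, extra], ignore_index=True)
print(combined)
```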

3.6 Data modification and search

Use rename to modify row labels or column names. Combine a df column with logical operators such as ==, >, < to select df rows that satisfy a condition. You can also use the between and isin methods to select matching data.

data.rename(columns = {'id':'用户ID'}, inplace = True)  # rename column labels with rename, using a dict
data.rename(index = {1:11, 2:22}, inplace = True)  # rename row labels with rename, using a dict
print(data)
data.loc[data['性别'] == 0, '性别'] = '女性'  # in the column labeled '性别', change values equal to 0 to '女性'
data.loc[data['性别'] == 1, '性别'] = '男性'
data.loc[data['性别'] == 2, '性别'] = '未知'
print(data)
print(data[data['age']>23])  # rows where the 'age' column is greater than 23; all columns are returned
print(data[(data['age'] < 23) & (data['性别'] == '男性')])
print(data[data['age'].isin([12,21,20])])  # rows where 'age' is in the list [12, 21, 20]

3.7 Time and date format processing

Use pd.to_datetime() to parse various string representations of time in pandas; use Series.dt.strftime() to format datetime64 or Timestamp data as strings.

start_datetime = np.arange('2021-11-01', '2021-11-10', dtype = 'datetime64[D]')  # NumPy sequence of start times
end_datetime = np.arange('2021-11-01T12:14:30.789', '2021-11-10T12:14:30.789', 86400000, dtype = 'datetime64[ms]')
segment_time = np.array([start_datetime, end_datetime]).T  # NumPy array of start/end times
df = pd.DataFrame(segment_time, columns = ['起始时间', '结束时间'])  # build a new dataframe
df['持续时间'] = df['结束时间'] - df['起始时间']  # add a duration column
df['间隔时间'] = np.append((df['起始时间'].values[1:] - df['结束时间'].values[0:-1]), 0)  # add a column with the gap between consecutive segments
df['start_time'] = df['起始时间'].dt.strftime('%A %B %d %Y %H:%M:%S.%f')  # format datetime64 as strings in the given format
df['end_time'] = df['结束时间'].dt.strftime('%a %b %d %Y %H:%M:%S.%f')
df['读入起始时间'] = pd.to_datetime(df['start_time'], format = '%A %B %d %Y %H:%M:%S.%f')
df['读入结束时间'] = pd.to_datetime(df["end_time"], format = '%a %b %d %Y %H:%M:%S.%f')
print(df)
# another way to build the df:
start_datetime_char = np.datetime_as_string(start_datetime, unit = 'ms')  # convert NumPy datetime64 to strings (the format is inflexible)
end_datetime_char = np.datetime_as_string(end_datetime, unit = 'ms')  # convert to strings
df_char = pd.DataFrame(start_datetime_char, columns = ['起始时间'])
df_char['结束时间'] = end_datetime_char
print(df_char)

For datetime64 data use Series.dt.date/time/year/month/week/weekday/day/hour/minute/second; for timedelta64 use Series.dt.days/total_seconds(); Series.dt.dayofyear/weekofyear also apply to datetime64.

df_char['date_time'] = pd.to_datetime(df['结束时间'], format = '%Y-%m-%dT%H%M%S.%f')
df_char['年'] = df_char['date_time'].dt.year
df_char['周'] = df_char['date_time'].dt.isocalendar().week
df_char['周几'] = df_char['date_time'].dt.weekday
df_char['分'] = df_char['date_time'].dt.minute
df_char['秒'] = df_char['date_time'].dt.second
df_char['微秒'] = df_char['date_time'].dt.microsecond
df_char['总秒数'] = df['间隔时间'].dt.total_seconds()  # don't forget the parentheses
df_char['总天数'] = (df['间隔时间']).dt.days  # .days is an attribute, no parentheses
df_char['加天序列后总天数'] = (df['间隔时间'] + np.arange(0, 9, 1, dtype = 'timedelta64[D]')).dt.days  # no parentheses
# np.arange() above adds a sequence of day offsets so that the displayed day counts vary
pd.set_option('display.max_columns', None)  # set the pandas display option to show all columns
print(df_char)

3.8 Data stacking and merging

Use concat() to stack two dataframes horizontally or vertically and use merge() to merge two dataframes according to the primary key.

merge1 = pd.concat([df, df_char], axis = 0, join = 'outer')  # stack a list of frames along axis 0 (vertically, more rows); join='inner' drops non-matching labels, 'outer' keeps them
merge2 = pd.concat([df, df_char, data], axis = 1, join = 'outer')  # stack a list of frames along axis 1 (horizontally, more columns)
merge3 = pd.merge(left = df, right = df_char, how = 'inner', left_on = '结束时间', right_on = 'date_time')  # join on the keys df['结束时间'] and df_char['date_time']
merge4 = pd.merge(left = df, right = df_char, how = 'inner', left_on = '起始时间', right_on = 'date_time')  # these keys never match, so the inner join empties the result
merge5 = pd.merge(left = df, right = df_char, how = 'outer', left_on = '起始时间', right_on = 'date_time')  # the keys never match, but the outer join keeps all the data
​
print(merge1)
print(merge2)
print(merge3)
print(merge4)
print(merge5)

3.9 String processing

The main functions used are:

  • Series.str.contains() returns whether each string contains the given pattern
  • replace() replaces a substring
  • lower() returns a copy of the string with all letters converted to lowercase
  • upper() returns a copy of the string with all letters converted to uppercase
  • split() returns a list of the words in a string
  • strip() removes leading and trailing whitespace
  • join() returns a string that is the concatenation of all strings in the given sequence

There is also pd.isnull(), which tests for missing elements and is often used together with T and any(). Use df.fillna (e.g. df.age.fillna) to replace missing values.

df1 = pd.read_csv('MotorcycleData.csv', encoding = 'gbk')
print(df1['Price'].str[0])  # .str supports slice-style indexing
print(df1['Price'].str[:2])
df1['价格'] = df1['Price'].str.strip('$')  # strip the leading/trailing '$' characters
df1['价格'] = df1['价格'].str.replace(',', '')  # remove the ',' characters
df1['价格'] = df1['价格'].astype(int)
# print(df1.info())
# print(df1[['Price', '价格']])
df1['位置'] = df1['Location'].str.split(',')  # split on ',' into a list of strings
# print(df1['位置'].str[1])  # print the second string element of each list
# print(df1['Location'].str.len())  # get the string lengths
df1.loc[df1[['Location']].isnull().T.any()] = 'aaa'  # fill the rows whose 'Location' is missing with a string, so later string tests don't fail on NaN
# df1[['Location']] above needs the double brackets so the result is a DataFrame; df1['Location'] would be a Series
# without transpose: df1.isnull().any() computes any() per column, giving a Series over columns
# with transpose: df1.isnull().T.any() computes any() per row, giving a Series over rows
# here we need to know which row has NaN, hence the transpose
print(df1[df1[['Location', '价格']].isnull().T.any()]['Location'])  # check whether the 'Location' and '价格' columns have missing data
# df.fillna is more convenient for replacing missing values
df1.fillna('datamiss', inplace = True)
print(df1.loc[df1['Location'].str.contains('New Hampshire'), 'Location'])  # find the rows containing a given substring

3.10 Data statistics and sorting

The describe() method displays summary statistics. For sorting, use:

  • sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last'), which sorts by the values of a row or column
  • sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True), which sorts by row or column labels

print(df1.describe())  # numeric columns only: mean, std, min, max and the 25%/50%/75% quantiles of each column
print(df1.sort_index(axis = 1))
print(df1.sort_values(by = ['Bid_Count', 'Price']))  # sort the whole table by the given column names, in priority order

3.11 Read txt file

You can use read_csv or read_table to read txt files. read_csv assumes comma-separated fields by default, while read_table assumes tab-separated fields; both accept a sep parameter to specify the delimiter.

dtxt = pd.read_table('sample_data_out.txt', encoding = 'gbk', header = 0, sep = ' ')
dtxt = pd.read_csv('sample_data_out.txt', encoding = 'gbk', header = 0, sep = ' ')
'''
The two lines above are essentially equivalent
'''
print(dtxt.iloc[:,0].str.split(','))  # split manually; the result is a nested list
print(dtxt.columns.str.split(',')[0])  # split the column names manually; also a nested list
dtxt_deal = (dtxt.iloc[:,0].str.split(',')).apply(pd.Series, index = dtxt.columns.str.split(',')[0])
dtxt_deal['buy_mount'] = dtxt_deal['buy_mount'].astype(int)
print(dtxt_deal.info())

Origin blog.csdn.net/longhaierwd/article/details/131904699