pandas library overview
pandas provides a large number of data structures and functions for processing structured data quickly and conveniently. Since its emergence in 2010, it has helped make Python a powerful and efficient data analysis environment. Its two main data structures are DataFrame, a column-oriented two-dimensional table, and Series, a one-dimensional labeled array.
pandas combines the high-performance array computation of NumPy with the flexible data handling of spreadsheets and relational databases (such as SQL). It provides sophisticated indexing that makes it easier to reshape, slice and dice, aggregate, and select subsets of data. Data manipulation, preparation, and cleaning are among the most important skills in data analysis, and pandas is one of the preferred Python libraries for them.
Personally, I find it best to learn pandas in the Jupyter environment that ships with Anaconda, which makes it convenient to debug, inspect intermediate results, and run code step by step.
Install pandas
Installation under Windows/Linux system environment
Installation via conda
conda install pandas
Installation via pip3
py -3 -m pip install --upgrade pandas # Windows
python3 -m pip install --upgrade pandas # Linux
pandas library usage
pandas adopts much of NumPy's coding style, but the biggest difference between the two is that pandas is designed specifically for tabular and mixed-type data, whereas NumPy is better suited to homogeneous numerical arrays.
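The difference can be seen in a minimal sketch. The column names and values below are made up for illustration and are not part of the original examples:

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype; mixing numbers and strings
# forces every element to a common (string) type.
arr = np.array([[1, "a"], [2, "b"]])
print(arr.dtype.kind)  # 'U' (unicode strings)

# A DataFrame keeps a separate dtype for each column.
mixed = pd.DataFrame({"count": [1, 2], "label": ["a", "b"]})
print(mixed.dtypes)
```

Here the integer column stays numeric in the DataFrame, while the NumPy array has converted the integers to strings.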
Import pandas together with the commonly used Series and DataFrame classes (NumPy is imported as well, because the examples below use np):
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
Create a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1,3,5,np.nan,6,8])
s
output
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Introduction to pandas data structure
To use pandas, you must first be familiar with its two main data structures: Series and DataFrame. While they don't solve every problem, they provide a reliable, easy-to-use foundation for most applications.
Series data structure
A Series is a one-dimensional array-like object that consists of a set of data (various NumPy data types) and a set of data labels (i.e., indices) associated with it. The simplest Series can be generated from just one set of data. Code example:
import pandas as pd
obj = pd.Series([1,4,7,8,9])
obj
The string representation of a Series is: index on the left and value on the right. Since we did not specify an index for the data, an integer index ranging from 0 to N-1 (N is the length of the data) will be automatically created. You can also obtain its array representation and index object through the values and index properties of the Series. Code example:
obj.values
obj.index # like range(5)
Output:
array([1, 4, 7, 8, 9])
RangeIndex(start=0, stop=5, step=1)
Often you will want to create a Series whose index labels each data point. The index must have the same length as the data. Code example:
obj2 = pd.Series([1, 4, 7, 8, 9], index=['a', 'b', 'c', 'd', 'e'])
obj2
obj2.index
output
a 1
b 4
c 7
d 8
e 9
dtype: int64
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Unlike with ordinary NumPy arrays, you can use the index labels to select a single value or a group of values from a Series. Code example:
obj2[['a', 'b', 'c']]
obj2['a']=2
obj2[['a', 'b', 'c']]
['a', 'b', 'c'] is a list of index labels, even though it contains strings rather than integers.
Using NumPy functions or NumPy-like operations (such as filtering with a boolean array, scalar multiplication, or applying mathematical functions) preserves the link between index and value. Code example:
obj2*2
np.exp(obj2)
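A minimal sketch of the same idea that also covers boolean filtering. It rebuilds obj2 so that it runs on its own, and np.sqrt is used here only as an illustrative ufunc:

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([2, 4, 7, 8, 9], index=["a", "b", "c", "d", "e"])

# Boolean filtering keeps the labels of the surviving elements.
large = obj2[obj2 > 5]
print(large.index.tolist())  # ['c', 'd', 'e']

# Element-wise arithmetic and NumPy ufuncs return a Series with the same index.
doubled = obj2 * 2
roots = np.sqrt(obj2)
print(doubled["a"])                      # 4
print(roots.index.equals(obj2.index))    # True
```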
You can also think of a Series as a fixed-length, ordered dictionary, since it maps index values to data values. It can be used in many functions that expect a dictionary argument. Code example:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000} # named sdata to avoid shadowing the built-in dict
obj3 = pd.Series(sdata)
obj3
output
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
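The dictionary analogy can be sketched a little further. In this example the name obj4 and the label 'California' are illustrative additions, not part of the original data:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Membership tests look at the index labels, like dict keys.
print("Texas" in obj3)   # True
print("Nevada" in obj3)  # False

# Passing an explicit index reorders the data; labels that are missing
# from the dict ('California' here) get NaN, and unused keys are dropped.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
print(obj4.isna().tolist())  # [True, False, False, False]
```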
DataFrame data structure
DataFrame is a tabular data structure that contains an ordered set of columns, each of which can have a different value type (numeric, string, Boolean, etc.). A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series that all share the same index. Internally, the data is stored in one or more two-dimensional blocks (rather than in lists, dictionaries, or other one-dimensional structures).
Although a DataFrame holds data in a two-dimensional structure, you can still easily represent higher-dimensional data in it by using hierarchical indexing, which is a key ingredient of many advanced data-processing features in pandas.
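As a rough sketch of what hierarchical indexing looks like, the example below uses a two-level row index. The data and the names idx, hier, ohio, and wide are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical data: two states observed in two years.
idx = pd.MultiIndex.from_product(
    [["Ohio", "Nevada"], [2001, 2002]], names=["state", "year"]
)
hier = pd.DataFrame({"pop": [1.7, 3.6, 2.4, 2.9]}, index=idx)

# Selecting on the outer level returns the sub-table for that state,
# indexed by the remaining level (year).
ohio = hier.loc["Ohio"]
print(ohio)

# unstack pivots the inner level into columns, giving a state-by-year table.
wide = hier["pop"].unstack("year")
print(wide)
```

The (state, year) pairs act as a third dimension folded into the row labels of a two-dimensional table.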
There are many ways to create a DataFrame. The most commonly used one is to directly pass in a dictionary composed of equal-length lists or NumPy arrays. Code example:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada','Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame
The resulting DataFrame is given an automatic integer index (as with Series). In current pandas versions the columns keep the order of the dictionary keys (state, year, pop); older versions sorted them alphabetically.
For particularly large DataFrames, the head method returns only the first five rows by default:
frame.head()
If a column sequence is specified, the columns of the DataFrame will be arranged in the specified order. Code example:
pd.DataFrame(data,columns=['state','year','pop'])
If a requested column is not found in the data, it is created and filled with missing values. Code example:
frame2 = pd.DataFrame(data,columns=['state','year','pop','debt'],
index=['one','two','three','four','five','six'])
frame2
Get the columns and index of DataFrame, code example:
frame2.columns
frame2.index
output
Index(['state', 'year', 'pop', 'debt'], dtype='object')
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')
A column of the DataFrame can be retrieved as a Series either with dictionary-style notation or by attribute access. Code example:
frame2['state']
frame2.state
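One caveat about attribute access is worth a small sketch: it only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. The 'pop' column used in this article is such a collision, because DataFrame already has a pop method. The two-row frame below is a shortened stand-in for the article's data:

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

# Bracket notation always returns the column.
print(frame["pop"].tolist())   # [1.5, 2.4]

# Attribute access finds the DataFrame.pop method first, not the column.
print(callable(frame.pop))     # True

# For a non-conflicting name, both forms return the same Series.
print(frame.state.equals(frame["state"]))  # True
```

Bracket notation is therefore the safer default, especially for column names that come from external data.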
Columns can be modified by assignment, just as with Series. For example, we can assign a scalar value or a set of values (an array or list) to the empty "debt" column. Code example:
frame2['debt'] = np.arange(6.)
frame2
Note: When assigning a list or array to a column, its length must match the length of the DataFrame.
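The length rule can be checked with a short sketch. The three-row frame and the variable error_message are hypothetical and exist only for this demonstration:

```python
import pandas as pd

# Hypothetical three-row frame, used only to demonstrate the length check.
df = pd.DataFrame({"year": [2000, 2001, 2002]})

error_message = None
try:
    df["debt"] = [1.0, 2.0]  # two values for three rows
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# A scalar is broadcast to every row, so it never causes a mismatch.
df["debt"] = 16.5
print(df)
```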
If the assigned value is a Series, its labels are aligned with the index of the DataFrame, and any rows without a matching label are filled with missing values. Code example:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
Assigning a value to a column that does not exist creates a new column. The del keyword deletes a column.
As an example of del, first add a new Boolean column that indicates whether the state is 'Ohio'. Code example:
frame2['eastern'] = frame2.state == 'Ohio'
frame2
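The deletion step itself can be sketched as follows. A shortened three-row version of the frame is rebuilt here so that the snippet runs on its own:

```python
import pandas as pd

# Shortened stand-in for frame2.
frame2 = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001, 2001]}
)

# New columns must be created with bracket notation.
frame2["eastern"] = frame2["state"] == "Ohio"
print(frame2.columns.tolist())  # ['state', 'year', 'eastern']

# del removes the column in place.
del frame2["eastern"]
print(frame2.columns.tolist())  # ['state', 'year']
```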
Another common form of input data is a nested dictionary. If a nested dictionary is passed to DataFrame, pandas interprets the keys of the outer dictionary as columns and the keys of the inner dictionaries as row index labels. Code example:
# Another common form of input for a DataFrame is a nested dictionary
pop = {
'Nvidia':{2001:2.4,2002:3.4},
'Intel':{2000:3.7,2001:4.7,2002:7.8}
}
frame3 = pd.DataFrame(pop,columns=['Nvidia','Intel'])
frame3
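A small sketch of what this produces. The call to sort_index is an addition here, so that the row order does not depend on the pandas version:

```python
import numpy as np
import pandas as pd

pop = {
    "Nvidia": {2001: 2.4, 2002: 3.4},
    "Intel": {2000: 3.7, 2001: 4.7, 2002: 7.8},
}

# The row index is the union of the inner keys; sort it for a stable order.
frame3 = pd.DataFrame(pop).sort_index()
print(frame3)

# 'Nvidia' has no entry for 2000, so that cell is NaN.
print(np.isnan(frame3.loc[2000, "Nvidia"]))  # True

# Transposing swaps rows and columns.
print(frame3.T)
```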
Table 5-1 lists the various data that the DataFrame constructor can accept.
Index object
The pandas Index object is responsible for holding axis labels and other metadata (such as the axis name). When you build a Series or DataFrame, any array or other sequence of labels you pass is converted into an Index. Code example:
import numpy as np
import pandas as pd
obj = pd.Series(np.arange(4),index=['a','b','c','d'])
index = obj.index
#index
index[:-1]
Note: Index objects are immutable, so they cannot be modified by the user.
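A minimal sketch of what immutability means in practice. The variable names error_message and extended are illustrative:

```python
import pandas as pd

index = pd.Index(["a", "b", "c", "d"])

error_message = None
try:
    index[1] = "z"  # in-place modification is not allowed
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Methods such as append return a new Index and leave the original untouched.
extended = index.append(pd.Index(["e"]))
print(index.tolist())     # ['a', 'b', 'c', 'd']
print(extended.tolist())  # ['a', 'b', 'c', 'd', 'e']
```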
Immutability allows Index objects to be shared safely between multiple data structures, code example:
# pd.Index stores the axis labels for all pandas objects;
# it is an immutable ndarray implementing an ordered, sliceable set
labels = pd.Index(np.arange(3))
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
#print(obj2.index is labels)
Note: Although you will rarely need to work with Index objects directly, some operations return results that contain indexed data, so it is important to understand how they work.
Unlike a Python set, a pandas Index can contain duplicate labels. Code example:
dup_labels = pd.Index(['foo','foo','bar','alice'])
dup_labels
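Duplicate labels affect selection, as this short sketch shows. The Series s and its values are made up for illustration:

```python
import pandas as pd

dup_labels = pd.Index(["foo", "foo", "bar", "alice"])
s = pd.Series([1, 2, 3, 4], index=dup_labels)

print(dup_labels.is_unique)  # False
print("foo" in dup_labels)   # True

# Selecting a duplicated label returns every matching entry as a Series,
# while a unique label returns a single scalar.
print(s["foo"].tolist())     # [1, 2]
print(s["bar"])              # 3
```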
Each index has methods and properties that you can use to set up logic and answer common questions about the data contained in the index. Table 5-2 lists these functions.
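The table itself is not reproduced here. As a rough sketch, the following are a few set-like methods and properties that pd.Index provides; they are not necessarily the exact selection in that table, and the names left and right are illustrative:

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

# Set-style operations return new Index objects.
print(left.union(right).tolist())         # ['a', 'b', 'c', 'd']
print(left.intersection(right).tolist())  # ['b', 'c']
print(left.difference(right).tolist())    # ['a']

# Membership as a boolean array, and a couple of properties.
print(left.isin(["a", "c"]).tolist())     # [True, False, True]
print(left.is_monotonic_increasing)       # True
print(left.is_unique)                     # True

# insert and drop also return new objects rather than modifying in place.
print(left.insert(1, "z").tolist())       # ['a', 'z', 'b', 'c']
print(left.drop("b").tolist())            # ['a', 'c']
```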
Selecting data in pandas
import numpy as np
import pandas as pd
dates = pd.date_range('20190325', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
print(df)
'''
A B C D
2019-03-25 0 1 2 3
2019-03-26 4 5 6 7
2019-03-27 8 9 10 11
2019-03-28 12 13 14 15
2019-03-29 16 17 18 19
2019-03-30 20 21 22 23
'''
# Select column A
print(df['A']) # equivalent to print(df.A)
'''
2019-03-25 0
2019-03-26 4
2019-03-27 8
2019-03-28 12
2019-03-29 16
2019-03-30 20
Freq: D, Name: A, dtype: int64
'''
# Select multiple rows by slicing
print(df[0:3]) # equivalent to print(df['2019-03-25':'2019-03-27'])
'''
A B C D
2019-03-25 0 1 2 3
2019-03-26 4 5 6 7
2019-03-27 8 9 10 11
'''
# Select data by label with loc
# Get a specific row or column
# Select a single row
print(df.loc['2019-03-25'])
bb = df.loc['2019-03-25']
print(type(bb))
'''
A 0
B 1
C 2
D 3
Name: 2019-03-25 00:00:00, dtype: int64
<class 'pandas.core.series.Series'>
'''
# Select columns; there are two equivalent ways
print(df.loc[:, ['A', 'B']]) # same as print(df.loc[:, 'A':'B'])
'''
A B
2019-03-25 0 1
2019-03-26 4 5
2019-03-27 8 9
2019-03-28 12 13
2019-03-29 16 17
2019-03-30 20 21
'''
# Select by row and column at the same time
cc = df.loc['20190325', ['A', 'B']]
print(cc)
print(type(cc.values)) # numpy ndarray
'''
A 0
B 1
Name: 2019-03-25 00:00:00, dtype: int64
<class 'numpy.ndarray'>
'''
print(df.loc['20190326', 'A'])
'''
4
'''
# Use iloc to get values by position; iloc indexes by integer row and column number
print(df.iloc[1, 0]) # row 1, column 0: the scalar 4
'''
4
'''
print(df.iloc[3:5, 1:3]) # the stop positions (5 and 3) are excluded, as in list slicing
'''
B C
2019-03-28 13 14
2019-03-29 17 18
'''
# Select non-contiguous rows
print(df.iloc[[1, 3, 5], 1:3])
'''
B C
2019-03-26 5 6
2019-03-28 13 14
2019-03-30 21 22
'''
# Filter rows with a boolean condition
print(df[df.A>8])
'''
A B C D
2019-03-28 12 13 14 15
2019-03-29 16 17 18 19
2019-03-30 20 21 22 23
'''
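One difference between the two indexers is easy to miss: label slices with loc include both endpoints, while positional slices with iloc exclude the stop position. The sketch below rebuilds the same df to show this, and also combines two conditions in one filter. The names by_label, by_position, and subset are illustrative:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190325", periods=6)
df = pd.DataFrame(
    np.arange(24).reshape((6, 4)), index=dates, columns=["A", "B", "C", "D"]
)

# Label slices (loc) include both endpoints ...
by_label = df.loc["2019-03-25":"2019-03-27", "A":"B"]
# ... while positional slices (iloc) exclude the stop position.
by_position = df.iloc[0:3, 0:2]
print(by_label.equals(by_position))  # True

# Conditions are combined with & (and) or | (or); each one needs parentheses.
subset = df.loc[(df["A"] > 8) & (df["D"] < 23), ["A", "D"]]
print(subset)
```

Passing the boolean mask to loc allows rows to be filtered and columns to be chosen in a single step.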
Summary
This article covers the basic characteristics of Series and DataFrame, the two core data structures of the pandas library: how to create them, how to specify columns and an index at construction time, assignment, attribute access, Index objects, and the basic ways of selecting data.
References
- Wes McKinney, "Python for Data Analysis"