Python data analysis: getting started with the pandas library

pandas library overview

pandas provides a large number of data structures and functions for processing structured data quickly and conveniently. Since its emergence in 2010, it has helped make Python a powerful and efficient data analysis environment. The most commonly used pandas object is the DataFrame, a column-oriented two-dimensional table; the other main object is the Series, a one-dimensional labeled array.

pandas combines the high-performance array computation of NumPy with the flexible data handling of spreadsheets and relational databases (such as SQL). Its sophisticated indexing makes it easier to reshape, slice and dice, aggregate, and select subsets of data. Data manipulation, preparation, and cleaning are among the most important skills in data analysis, and pandas is one of the preferred Python libraries for this work.

Personally, I find it best to learn pandas in the Jupyter environment that comes with Anaconda. It is convenient for debugging and analysis, and it makes it easy to run code step by step.

Install pandas

Installation under Windows/Linux system environment

Installation via conda

conda install pandas

Installation via pip3

py -3 -m pip install --upgrade pandas    # Windows
python3 -m pip install --upgrade pandas    # Linux
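To confirm that the installation worked, a quick check like the following can be run in Python. This is only a minimal sketch; the variable name version is chosen here for illustration.

```python
import pandas as pd

# If the import succeeds, pandas is installed; the version string tells
# you which release the interpreter is actually using.
version = pd.__version__
print(version)
```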

pandas library usage

pandas adopts many of NumPy's coding idioms, but the biggest difference between the two is that pandas is designed for tabular and mixed-type data, while NumPy is better suited to uniform numerical array data.
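The difference can be seen in how each library handles mixed values. The sketch below uses made-up data (the names arr and mixed are illustrative): a NumPy array coerces everything to one dtype, while a DataFrame keeps a separate dtype per column.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype; mixing numbers and a string
# coerces every element to a string type.
arr = np.array([1, 2.5, "three"])
print(arr.dtype.kind)  # 'U' stands for a Unicode string dtype

# A DataFrame stores each column with its own dtype.
mixed = pd.DataFrame({"count": [1, 2, 3],
                      "ratio": [0.5, 1.5, 2.5],
                      "label": ["a", "b", "c"]})
print(mixed.dtypes)
```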

Import the pandas module, the commonly used Series and DataFrame classes, and NumPy (needed for np.nan below):

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

Create a Series by passing a list of values, letting pandas create a default integer index:

s = pd.Series([1,3,5,np.nan,6,8])
s

output

0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64

Introduction to pandas data structure

To use pandas, you must first be familiar with its two main data structures: Series and DataFrame. While they don't solve every problem, they provide a reliable, easy-to-use foundation for most applications.

Series data structure

A Series is a one-dimensional array-like object that consists of a set of data (of any NumPy data type) and an associated set of data labels, called its index. The simplest Series can be created from just one set of data. Code example:

import pandas as pd
obj = pd.Series([1,4,7,8,9])
obj


The string representation of a Series is: index on the left and value on the right. Since we did not specify an index for the data, an integer index ranging from 0 to N-1 (N is the length of the data) will be automatically created. You can also obtain its array representation and index object through the values and index properties of the Series. Code example:

obj.values
obj.index # like range(5)

Output:

array([ 1, 4, 7, 8, 9])
RangeIndex(start=0, stop=5, step=1)

Often you will want to create a Series whose index labels each data point. The index must have the same length as the data. Code example:

obj2 = pd.Series([1, 4, 7, 8, 9], index=['a', 'b', 'c', 'd', 'e'])
obj2
obj2.index

output

a 1
b 4
c 7
d 8
e 9
dtype: int64

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')


Unlike with ordinary NumPy arrays, you can use index labels to select a single value or a group of values from a Series. Code example:

obj2[['a', 'b', 'c']] 
obj2['a']=2
obj2[['a', 'b', 'c']]


['a', 'b', 'c'] is a list of index labels, even though it contains strings rather than integers.

Using NumPy functions or NumPy-like operations (such as filtering with a boolean array, scalar multiplication, or applying mathematical functions) preserves the link between each index label and its value. Code example:

obj2*2
np.exp(obj2)
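Boolean filtering, mentioned above but not shown, could look like the following sketch. It rebuilds obj2 with the values it has after the assignment obj2['a']=2; the names big and roots are illustrative.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([2, 4, 7, 8, 9], index=["a", "b", "c", "d", "e"])

# Boolean filtering keeps the labels of the entries that pass the test.
big = obj2[obj2 > 4]
print(big)

# Element-wise NumPy functions return a Series with the same index.
roots = np.sqrt(obj2)
print(roots.index.equals(obj2.index))
```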


You can also think of a Series as a fixed-length ordered dictionary, since it maps index labels to data values. It can be used in many places where a dictionary would be expected, and it can be built directly from one. Code example:

# named sdata rather than dict, to avoid shadowing the built-in dict type
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

output

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
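The dictionary analogy can be sketched a little further. In the example below, the states list and the name obj4 are illustrative assumptions; 'California' is deliberately absent from the data to show how a missing key is handled.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Membership tests check the index labels, like dictionary keys.
print("Ohio" in obj3)
print("California" in obj3)

# Passing an explicit index selects and orders the keys; a label that is
# missing from the dictionary gets NaN, and unlisted keys are dropped.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
print(obj4.isna())
```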


DataFrame data structure

A DataFrame is a tabular data structure that contains an ordered set of columns, each of which can hold a different value type (numeric, string, Boolean, etc.). A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series that all share the same index. Internally, the data is stored in one or more two-dimensional blocks rather than in lists, dictionaries, or other one-dimensional structures.

Although a DataFrame holds data in a two-dimensional structure, you can still use it to represent higher-dimensional data through hierarchical indexing, which is a key element of many advanced data processing functions in pandas.
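As a rough illustration of hierarchical indexing, the sketch below stores made-up population figures under a two-level index (state, year) and then pivots the inner level into columns. The data and the names idx, pop, and table are assumptions for this example only.

```python
import pandas as pd

# Hypothetical data: population by state and year, stored as a Series
# with a two-level (hierarchical) index.
idx = pd.MultiIndex.from_tuples(
    [("Nevada", 2001), ("Nevada", 2002), ("Ohio", 2000), ("Ohio", 2001)],
    names=["state", "year"],
)
pop = pd.Series([2.4, 2.9, 1.5, 1.7], index=idx)

# Selecting an outer label returns all rows beneath it.
print(pop.loc["Ohio"])

# unstack() moves the inner level into columns, giving a 2-D table;
# combinations that do not exist in the data become NaN.
table = pop.unstack()
print(table)
```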

There are many ways to create a DataFrame. The most commonly used one is to directly pass in a dictionary composed of equal-length lists or NumPy arrays. Code example:

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada','Nevada'],
             'year': [2000, 2001, 2002, 2001, 2002, 2003],
              'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

The resulting DataFrame is given an automatic integer index (as with Series). In recent pandas versions the columns appear in the insertion order of the dictionary keys; older versions sorted them alphabetically.

For particularly large DataFrames, the head method selects the first five rows:

frame.head()

If a column sequence is specified, the columns of the DataFrame will be arranged in the specified order. Code example:

pd.DataFrame(data,columns=['state','year','pop'])


If a requested column is not found in the data, it appears in the result filled with missing values. Code example:

frame2 = pd.DataFrame(data,columns=['state','year','pop','debt'],
                                    index=['one','two','three','four','five','six'])
frame2


To get the columns and index of a DataFrame, use its columns and index attributes. Code example:

frame2.columns
frame2.index

output

Index(['state', 'year', 'pop', 'debt'], dtype='object')

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')


A column of a DataFrame can be retrieved as a Series either with dictionary-style notation or as an attribute. Code example:

frame2['state']
frame2.state


Columns can be modified by assignment, much as with a Series. For example, the empty "debt" column can be given a scalar value or a set of values (as an array or list). Code example:

frame2.debt = np.arange(6.)
frame2


Note: When assigning a list or array to a column, its length must match the length of the DataFrame.
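The length rule can be checked with a small sketch like the one below. The three-row frame and the rejected flag are illustrative; the failed assignment is expected to raise ValueError.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada", "Utah"]})

rejected = False
try:
    frame["debt"] = [1.0, 2.0]  # two values for three rows
except ValueError as exc:
    rejected = True
    print("assignment rejected:", exc)

# A list with one value per row is accepted.
frame["debt"] = [1.0, 2.0, 3.0]
print(frame)
```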

If the assigned value is a Series, its labels are aligned exactly to the index of the DataFrame, and any rows without a matching label are filled with missing values. Code example:

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four','five'])
frame2.debt = val
frame2
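The effect of this alignment can be seen in a reduced sketch. It uses a smaller stand-in for frame2 with only a state column; rows whose labels are absent from val end up as NaN.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"]},
    index=["one", "two", "three", "four", "five", "six"],
)
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])

# Values are matched by label, not by position.
frame2["debt"] = val
print(frame2["debt"])
```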

Assigning a value to a column that does not exist creates a new column. The del keyword deletes a column.

First add a new Boolean column that records whether the state is 'Ohio'; it can later be removed with del. Code example:

frame2['eastern'] = frame2.state=='Ohio'
frame2
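Removing that column again with del might look like this. The sketch rebuilds a two-row stand-in for frame2 so that it runs on its own. Note that a new column has to be created with bracket notation; attribute syntax such as frame2.eastern = ... does not create one.

```python
import pandas as pd

frame2 = pd.DataFrame({"state": ["Ohio", "Nevada"], "year": [2000, 2001]})
frame2["eastern"] = frame2["state"] == "Ohio"
print(frame2.columns)

# del removes the column in place, as it would remove a dictionary key.
del frame2["eastern"]
print(frame2.columns)
```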


Another common input form for a DataFrame is a nested dictionary. If a nested dictionary is passed to DataFrame, pandas interprets the keys of the outer dictionary as columns and the keys of the inner dictionaries as row index labels. Code example:

# Another common input form for a DataFrame is a nested dictionary
pop = {
      'Nvidia':{2001:2.4,2002:3.4},
      'Intel':{2000:3.7,2001:4.7,2002:7.8}
}
frame3 = pd.DataFrame(pop,columns=['Nvidia','Intel'])
frame3
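Because the inner dictionaries do not share all keys, the result contains a missing value. The following sketch shows this, along with transposing the frame to swap rows and columns. It does not rely on any particular row order, since that can differ between pandas versions.

```python
import pandas as pd

pop = {
    "Nvidia": {2001: 2.4, 2002: 3.4},
    "Intel": {2000: 3.7, 2001: 4.7, 2002: 7.8},
}
frame3 = pd.DataFrame(pop)

# Nvidia has no entry for 2000, so that cell is NaN.
print(frame3.loc[2000])

# .T transposes the frame: companies become rows, years become columns.
transposed = frame3.T
print(transposed)
```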


Table 5-1 lists the various data that the DataFrame constructor can accept.


Index object

The pandas index object is responsible for managing axis labels and other metadata (such as axis names, etc.). When building a Series or DataFrame, any array or other sequence labels used will be converted into an Index, code example:

import numpy as np
import pandas as pd
obj = pd.Series(np.arange(4),index=['a','b','c','d'])
index = obj.index
#index
index[:-1]


Note: Index objects are immutable, so they cannot be modified by the user.
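A small sketch of what immutability means in practice: assigning to a single position of an Index raises TypeError, while replacing the whole index of a Series is still allowed. The rejected flag and the new labels are illustrative.

```python
import pandas as pd

obj = pd.Series(range(4), index=["a", "b", "c", "d"])
index = obj.index

rejected = False
try:
    index[1] = "z"  # attempt to change one label in place
except TypeError as exc:
    rejected = True
    print("not allowed:", exc)

# Replacing the whole index of the Series is allowed; the old Index
# object itself is left unchanged.
obj.index = ["w", "x", "y", "z"]
print(obj.index)
print(index)
```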

Immutability allows Index objects to be shared safely between multiple data structures, code example:

# pd.Index stores the axis labels for all pandas objects
# It is an immutable ndarray that implements an ordered, sliceable set
labels = pd.Index(np.arange(3))
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
#print(obj2.index is labels)


Note: Although you will rarely need to use Index features directly, some operations return results that contain indexed data, so it is important to understand how Index objects work.

Unlike a Python set, a pandas Index can contain duplicate labels. Code example:

dup_labels = pd.Index(['foo','foo','bar','alice'])
dup_labels
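Duplicate labels affect selection: a label that occurs more than once returns every matching entry. A minimal sketch, with an illustrative Series s built on the same labels:

```python
import pandas as pd

dup_labels = pd.Index(["foo", "foo", "bar", "alice"])
print(dup_labels.is_unique)  # the index contains a repeated label

s = pd.Series([1, 2, 3, 4], index=dup_labels)

# A duplicated label selects all matching entries as a Series.
print(s["foo"])

# A unique label selects a single scalar value.
print(s["bar"])
```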

Each Index has methods and properties for set logic, which answer common questions about the data it contains. Table 5-2 lists these functions.
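A few of these set-style methods are sketched below on two illustrative Index objects, left and right. This is a small sample, not the full list from the table.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

print(left.union(right))         # labels found in either index
print(left.intersection(right))  # labels found in both
print(left.difference(right))    # labels only in left
print(left.isin(["a", "c"]))     # boolean array, one entry per label
print("b" in left)               # membership test
```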


Selecting data with pandas

import numpy as np
import pandas as pd
# Build a date index of six consecutive days
dates = pd.date_range('20190325', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
print(df)
'''
             A   B   C   D
2019-03-25   0   1   2   3
2019-03-26   4   5   6   7
2019-03-27   8   9  10  11
2019-03-28  12  13  14  15
2019-03-29  16  17  18  19
2019-03-30  20  21  22  23
'''
# Select column A
print(df['A'])    # same as print(df.A)
'''
2019-03-25     0
2019-03-26     4
2019-03-27     8
2019-03-28    12
2019-03-29    16
2019-03-30    20
Freq: D, Name: A, dtype: int64
'''
# Select multiple rows by slicing
print(df[0:3])    # same as print(df['2019-03-25':'2019-03-27'])
'''
            A  B   C   D
2019-03-25  0  1   2   3
2019-03-26  4  5   6   7
2019-03-27  8  9  10  11
'''
# Select data by label with loc
# Get a specific row or column
# Select one row
print(df.loc['2019-03-25'])
bb = df.loc['2019-03-25']
print(type(bb))
'''
A    0
B    1
C    2
D    3
Name: 2019-03-25 00:00:00, dtype: int64
<class 'pandas.core.series.Series'>
'''
# Select columns, in two ways
print(df.loc[:, ['A', 'B']])    # print(df.loc[:, 'A':'B'])
'''
             A   B
2019-03-25   0   1
2019-03-26   4   5
2019-03-27   8   9
2019-03-28  12  13
2019-03-29  16  17
2019-03-30  20  21
'''
# Select by row and column at the same time
cc = df.loc['20190325', ['A', 'B']]
print(cc)
print(type(cc.values))  # numpy ndarray
'''
A    0
B    1
Name: 2019-03-25 00:00:00, dtype: int64
<class 'numpy.ndarray'>
'''
print(df.loc['20190326', 'A'])
'''
4
'''
# iloc selects by integer position, using row and column numbers
print(df.iloc[1,0])     # row 1, column 0
'''
4
'''
print(df.iloc[3:5,1:3]) # end positions 5 and 3 are excluded, as in list slicing
'''
             B   C
2019-03-28  13  14
2019-03-29  17  18
'''
# Select non-adjacent rows
print(df.iloc[[1, 3, 5], 1:3])
'''
             B   C
2019-03-26   5   6
2019-03-28  13  14
2019-03-30  21  22
'''
# Filter rows with a boolean condition
print(df[df.A>8])
'''
             A   B   C   D
2019-03-28  12  13  14  15
2019-03-29  16  17  18  19
2019-03-30  20  21  22  23
'''
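One detail from the examples above is worth isolating: label slices with loc include both endpoints, while position slices with iloc exclude the end, as Python lists do. The sketch below rebuilds the same df; the names by_label, by_position, and both are illustrative. It also combines two conditions with &, which needs parentheses around each condition.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20190325", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=["A", "B", "C", "D"])

# Label-based slices include both endpoints: 3 rows, 2 columns.
by_label = df.loc["2019-03-25":"2019-03-27", "A":"B"]
print(by_label)

# Position-based slices exclude the end: 2 rows, 1 column.
by_position = df.iloc[0:2, 0:1]
print(by_position)

# Combine boolean conditions with & (and) or | (or).
both = df[(df["A"] > 8) & (df["D"] < 23)]
print(both)
```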

Summary

This article covered the main characteristics of Series and DataFrame, the basic data structures of the pandas library: how to create pandas objects, how to specify columns and an index when creating them, assignment, attribute access, and Index objects. It also introduced the basic ways of selecting data in a Series or DataFrame.

References

"Python for Data Analysis" by Wes McKinney