Pandas study notes (1) - Pandas basics
For more of the code covered in this article, see the blogger's personal website: https://www.iwtmbtly.com/
Introduction to Pandas
Pandas is Python's core data-analysis support library. It provides fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. Pandas aims to be an essential high-level tool for practical Python data analysis, and its long-term goal is to become the most powerful and flexible open-source data-analysis tool available in any language. After years of sustained effort, Pandas is getting ever closer to that goal.
Pandas is suitable for working with the following types of data:
- Tabular data with heterogeneous columns, like a SQL table or an Excel sheet;
- Ordered and unordered (not necessarily fixed-frequency) time series data;
- Matrix data with row and column labels, whether homogeneous or heterogeneous;
- Any other form of observational or statistical data set; the data does not need to be labeled in advance to be loaded into a Pandas data structure.
The main data structures in Pandas are Series (one-dimensional data) and DataFrame (two-dimensional data), which are sufficient for most typical use cases in finance, statistics, social science, engineering, and beyond. For R users, DataFrame provides richer functionality than R's data.frame. Pandas is built on top of NumPy and integrates well with other third-party scientific computing libraries.
Pandas is like a universal Swiss Army knife; here are just a few of its advantages:
- Handles missing data (represented as NaN) in both floating-point and non-floating-point data;
- Size mutability: columns can be inserted into or deleted from multidimensional objects such as DataFrame;
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the labels can be ignored and the data aligned automatically when Series and DataFrame are used in computations;
- Powerful and flexible group by: split-apply-combine operations for aggregating and transforming data sets;
- Easy conversion of ragged, differently indexed data from Python and NumPy data structures into DataFrame objects;
- Smart label-based slicing, fancy indexing, and subsetting of large data sets;
- Intuitive merging and joining of data sets;
- Flexible reshaping and pivoting of data sets;
- Hierarchical axis labels: multiple labels per tick;
- Mature IO tools: read data from text files (CSV and other delimited formats), Excel files, databases, and other sources, and save/load data in the very fast HDF5 format;
- Time series functionality: date range generation, frequency conversion, moving-window statistics, moving-window linear regression, date shifting, and more.
These features address pain points common to other programming languages and scientific research environments. Data processing generally falls into several stages: data wrangling and cleaning, data analysis and modeling, and data visualization and tabulation. Pandas is an ideal tool for all of them.
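As a quick taste of the split-apply-combine workflow mentioned above, here is a minimal sketch (the data and column names are made up for illustration):

```python
import pandas as pd

# a tiny illustrative dataset (hypothetical values)
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1, 2, 3, 4],
})

# split by 'group', apply a mean, then combine the results
result = df.groupby('group')['value'].mean()
print(result)
# a    1.5
# b    3.5
```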
Other notes:
- Pandas is fast. Many of its low-level algorithms are optimized with Cython, but some performance is sacrificed to keep it general-purpose; if you focus on a single task, you can build a dedicated tool that is faster than Pandas.
- Pandas is a dependency of statsmodels, which makes it an important part of the statistical computing ecosystem in Python.
- Pandas has been widely used in the financial field.
Import Pandas
>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__ # check the Pandas version
'1.3.4'
1. File reading and writing
Reading
1. csv format
>>> df = pd.read_csv("data/table.csv", encoding="utf_8_sig")
>>> df.head() # view the first few rows
School Class ID Gender Address Height Weight Math Physics
0 S_1 C_1 1101 M street_1 173 63 34.0 A+
1 S_1 C_1 1102 F street_2 192 73 32.5 B+
2 S_1 C_1 1103 M street_2 186 82 87.2 B+
3 S_1 C_1 1104 F street_2 167 81 80.4 B-
4 S_1 C_1 1105 F street_4 159 64 84.8 B+
2. txt format
>>> df_txt = pd.read_table("data/table.txt")
>>> df_txt
col1 col2 col3 col4
0 2 a 1.4 apple
1 3 b 3.4 banana
2 6 c 2.5 orange
3 5 d 3.2 lemon
3. xls or xlsx format
# requires the xlrd package
>>> df_excel = pd.read_excel("data/table.xlsx")
>>> df_excel.head() # view the first few rows
School Class ID Gender Address Height Weight Math Physics
0 S_1 C_1 1101 M street_1 173 63 34.0 A+
1 S_1 C_1 1102 F street_2 192 73 32.5 B+
2 S_1 C_1 1103 M street_2 186 82 87.2 B+
3 S_1 C_1 1104 F street_2 167 81 80.4 B-
4 S_1 C_1 1105 F street_4 159 64 84.8 B+
Writing
1. csv format
# df.to_csv("data/new_table.csv", index=False) # drop the index when saving
>>> df.to_csv("data/new_table.csv", encoding="utf_8_sig") # Chinese text is easily garbled when written; use encoding="utf_8_sig"
2. xls or xlsx format
# requires the openpyxl package
>>> df.to_excel("data/new_table.xlsx", sheet_name="Sheet1")
2. Basic data structure
Series
A Pandas Series is similar to a single column of a table: a one-dimensional array that can hold any data type. A Series consists of an index and its values. Common usage is as follows:
1. Create a Series
The most commonly used attributes of a Series are its values, index, name, and dtype:
# np.random.randn(6) draws 6 random samples from the standard normal distribution
>>> s = pd.Series(np.random.randn(6), index=['a', 'b', 'c', 'd', 'e', 'f'], name="这是一个Series", dtype='float64')
>>> s
a 0.945151
b 2.071048
c 0.983989
d 0.032149
e 1.465196
f 0.416414
Name: 这是一个Series, dtype: float64
2. Access the Series property
>>> s.values # view the values
array([0.945151, 2.071048, 0.983989, 0.032149, 1.465196, 0.416414])
>>> s.name # view the name
'这是一个Series'
>>> s.index # view the index
Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
>>> s.dtype # view the dtype
dtype('float64')
3. Take out an element
>>> s["a"]
0.9451514803304423
4. Call method
>>> s.mean()
0.9856579470429936
# a Series has many methods you can call
>>> print([attr for attr in dir(s) if not attr.startswith('_')])
['T', 'a', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'append', 'apply', 'argmax', 'argmin', 'argsort', 'array', 'asfreq', 'asof', 'astype', 'at', 'at_time', 'attrs', 'autocorr', 'axes', 'b', 'backfill', 'between', 'between_time', 'bfill', 'bool', 'c', 'clip', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'd', 'describe', 'diff', 'div', 'divide', 'divmod', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtype', 'dtypes', 'duplicated', 'e', 'empty', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'f', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'hasnans', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'interpolate', 'is_monotonic', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'isin', 'isna', 'isnull', 'item', 'items', 'iteritems', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lt', 'mad', 'map', 'mask', 'max', 'mean', 'median', 'memory_usage', 'min', 'mod', 'mode', 'mul', 'multiply', 'name', 'nbytes', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'radd', 'rank', 'ravel', 'rdiv', 'rdivmod', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'repeat', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'searchsorted', 'sem', 'set_axis', 'set_flags', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort_index', 'sort_values', 'squeeze', 'std', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_frame', 'to_hdf', 'to_json', 'to_latex', 'to_list', 'to_markdown', 'to_numpy', 'to_period', 'to_pickle', 'to_sql', 'to_string', 
'to_timestamp', 'to_xarray', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unique', 'unstack', 'update', 'value_counts', 'values', 'var', 'view', 'where', 'xs']
DataFrame
A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series that share a common index.
The DataFrame construction method is as follows:
pandas.DataFrame(data, index, columns, dtype, copy)
Parameter Description:
- data: A set of data (ndarray, series, map, lists, dict, etc.).
- index: Index value, or can be called row label.
- columns: column labels, the default is RangeIndex (0, 1, 2, …, n) .
- dtype: data type.
- copy: copy data, the default is False.
A Pandas DataFrame is, in essence, a labeled two-dimensional array structure.
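A small sketch of these constructor parameters in action (the data and labels are made up for illustration): columns names the columns and dtype coerces the stored values:

```python
import pandas as pd

# build a DataFrame from a nested list, labeling the rows and columns
# and coercing the integer data to float64
df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['r1', 'r2'],
                  columns=['c1', 'c2'],
                  dtype='float64')
print(df.dtypes)  # both columns are float64
```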
1. Create a DataFrame
>>> df = pd.DataFrame({'col1': list('abcde'), 'col2': range(5, 10),
...                    'col3': [1.3, 2.5, 3.6, 4.6, 5.8]}, index=list('一二三四五'))
>>> df
col1 col2 col3
一 a 5 1.3
二 b 6 2.5
三 c 7 3.6
四 d 8 4.6
五 e 9 5.8
2. Get a Series from a DataFrame
>>> df['col1']
一 a
二 b
三 c
四 d
五 e
Name: col1, dtype: object
>>> type(df)
pandas.core.frame.DataFrame
>>> type(df['col1'])
pandas.core.series.Series
3. Modify row or column names
>>> df.rename(index={'一': 'one'}, columns={'col1': 'new_col1'})
new_col1 col2 col3
one a 5 1.3
二 b 6 2.5
三 c 7 3.6
四 d 8 4.6
五 e 9 5.8
4. Access properties and call methods
>>> df.index
Index(['一', '二', '三', '四', '五'], dtype='object')
>>> df.columns
Index(['col1', 'col2', 'col3'], dtype='object')
>>> df.values
array([['a', 5, 1.3],
['b', 6, 2.5],
['c', 7, 3.6],
['d', 8, 4.6],
['e', 9, 5.8]], dtype=object)
>>> df.shape
(5, 3)
5. Index alignment
This is a very powerful feature of Pandas, and not understanding it can sometimes cause trouble:
>>> df1 = pd.DataFrame({'A': [1, 2, 3]}, index=[1, 2, 3])
>>> df2 = pd.DataFrame({'A': [1, 2, 3]}, index=[3, 1, 2])
>>> df1 - df2 # the indexes are aligned by label, so the result is not all zeros
A
1 -1
2 -1
3 2
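If a positional, label-ignoring subtraction is what you actually want here, one option is to drop down to the underlying NumPy arrays; a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=[1, 2, 3])
df2 = pd.DataFrame({'A': [1, 2, 3]}, index=[3, 1, 2])

# .values discards the index labels and subtracts by position
diff = df1['A'].values - df2['A'].values
print(diff)  # [0 0 0]
```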
6. Delete and add columns
For deletion, you can use the drop function, del, or pop:
>>> # drop row '五' and column 'col1'
>>> df.drop(index='五', columns='col1') # with inplace=True the original DataFrame is modified directly
col2 col3
一 5 1.3
二 6 2.5
三 7 3.6
四 8 4.6
>>> df['col1'] = [1, 2, 3, 4, 5]
>>> del df['col2']
>>> df
col1 col3
一 1 1.3
二 2 2.5
三 3 3.6
四 4 4.6
五 5 5.8
The pop method operates directly on the original DataFrame and returns the deleted column, similar to Python's list.pop:
>>> df['col2'] = [6, 7, 8, 9, 10]
>>> df.pop('col1')
一 1
二 2
三 3
四 4
五 5
Name: col1, dtype: int64
>>> df
col3 col2
一 1.3 6
二 2.5 7
三 3.6 8
四 4.6 9
五 5.8 10
You can add new columns directly, or use the assign method
>>> df1['B'] = list('abc')
>>> df1
A B
1 1 a
2 2 b
3 3 c
>>> df1.assign(C=pd.Series(list('def'))) # Why does NaN appear? (hint: index alignment) The indexes on the two sides of assign differ; whose index determines the result?
A B C
1 1 a e
2 2 b f
3 3 c NaN
But the assign method will not modify the original DataFrame
>>> df1
A B
1 1 a
2 2 b
3 3 c
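To answer the question in the comment above: assign aligns the new Series on the index. pd.Series(list('def')) carries the default index 0, 1, 2 while df1 is indexed 1, 2, 3, so only labels 1 and 2 match (picking up 'e' and 'f') and label 3 gets NaN; the result always keeps df1's index. Passing a Series whose index matches df1 avoids the NaN:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=[1, 2, 3])

# give the new column the same index as df1 so every row aligns
res = df1.assign(C=pd.Series(list('def'), index=[1, 2, 3]))
print(res['C'].tolist())  # ['d', 'e', 'f']
```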
7. Select columns by type
>>> df.select_dtypes(include=['number']).head()
col3 col2
一 1.3 6
二 2.5 7
三 3.6 8
四 4.6 9
五 5.8 10
>>> df.select_dtypes(include=['float']).head()
col3
一 1.3
二 2.5
三 3.6
四 4.6
五 5.8
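select_dtypes also accepts an exclude parameter to drop types instead of keeping them; a minimal sketch on an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'col1': list('abc'),
                   'col2': [1, 2, 3],
                   'col3': [1.1, 2.2, 3.3]})

# keep everything except floating-point columns
kept = df.select_dtypes(exclude=['float'])
print(kept.columns.tolist())  # ['col1', 'col2']
```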
8. Convert Series to DataFrame
>>> s = df.mean()
>>> s.name = 'to_DataFrame'
>>> s
col3 3.56
col2 8.00
Name: to_DataFrame, dtype: float64
>>> s.to_frame()
to_DataFrame
col3 3.56
col2 8.00
Use the T notation to transpose
>>> s.to_frame().T
col3 col2
to_DataFrame 3.56 8.0
3. Common basic functions
From this point on, including all subsequent sections, we will use this data set:
>>> df = pd.read_csv("data/table.csv")
1. head and tail
>>> df.head() # view the first few rows
School Class ID Gender Address Height Weight Math Physics
0 S_1 C_1 1101 M street_1 173 63 34.0 A+
1 S_1 C_1 1102 F street_2 192 73 32.5 B+
2 S_1 C_1 1103 M street_2 186 82 87.2 B+
3 S_1 C_1 1104 F street_2 167 81 80.4 B-
4 S_1 C_1 1105 F street_4 159 64 84.8 B+
>>> df.tail() # view the last few rows
School Class ID Gender Address Height Weight Math Physics
30 S_2 C_4 2401 F street_2 192 62 45.3 A
31 S_2 C_4 2402 M street_7 166 82 48.7 B
32 S_2 C_4 2403 F street_6 158 60 59.7 B+
33 S_2 C_4 2404 F street_2 160 84 67.7 B
34 S_2 C_4 2405 F street_6 193 54 47.6 B
You can specify how many rows to display with the n parameter:
>>> df.head(3)
School Class ID Gender Address Height Weight Math Physics
0 S_1 C_1 1101 M street_1 173 63 34.0 A+
1 S_1 C_1 1102 F street_2 192 73 32.5 B+
2 S_1 C_1 1103 M street_2 186 82 87.2 B+
2. unique and nunique
nunique returns the number of distinct values, while unique returns the distinct values themselves:
>>> df['Physics'].nunique()
7
>>> df['Physics'].unique()
array(['A+', 'B+', 'B-', 'A-', 'B', 'A', 'C'], dtype=object)
3. count and value_counts
count returns the number of non-missing elements, while value_counts returns the number of occurrences of each element:
>>> df['Physics'].count()
35
>>> df['Physics'].value_counts()
B+ 9
B 8
B- 6
A 4
A+ 3
A- 3
C 2
Name: Physics, dtype: int64
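value_counts can also report proportions instead of raw counts via its normalize parameter; a minimal sketch on made-up data:

```python
import pandas as pd

s = pd.Series(['B+', 'B+', 'B+', 'A'])

# normalize=True divides each count by the total number of elements
print(s.value_counts(normalize=True))
# B+    0.75
# A     0.25
```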
4. describe and info
The info function reports which columns exist, how many non-missing values each has, and each column's type:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 School 35 non-null object
1 Class 35 non-null object
2 ID 35 non-null int64
3 Gender 35 non-null object
4 Address 35 non-null object
5 Height 35 non-null int64
6 Weight 35 non-null int64
7 Math 35 non-null float64
8 Physics 35 non-null object
dtypes: float64(1), int64(3), object(5)
memory usage: 2.6+ KB
By default, describe reports summary statistics for the numeric columns:
>>> df.describe()
ID Height Weight Math
count 35.00000 35.000000 35.000000 35.000000
mean 1803.00000 174.142857 74.657143 61.351429
std 536.87741 13.541098 12.895377 19.915164
min 1101.00000 155.000000 53.000000 31.500000
25% 1204.50000 161.000000 63.000000 47.400000
50% 2103.00000 173.000000 74.000000 61.700000
75% 2301.50000 187.500000 82.000000 77.100000
max 2405.00000 195.000000 100.000000 97.000000
You can choose your own quantiles:
>>> df.describe(percentiles=[.05, .25, .75, .95])
ID Height Weight Math
count 35.00000 35.000000 35.000000 35.000000
mean 1803.00000 174.142857 74.657143 61.351429
std 536.87741 13.541098 12.895377 19.915164
min 1101.00000 155.000000 53.000000 31.500000
5% 1102.70000 157.000000 56.100000 32.640000
25% 1204.50000 161.000000 63.000000 47.400000
50% 2103.00000 173.000000 74.000000 61.700000
75% 2301.50000 187.500000 82.000000 77.100000
95% 2403.30000 193.300000 97.600000 90.040000
max 2405.00000 195.000000 100.000000 97.000000
For non-numeric types, you can also use the describe function
>>> df['Physics'].describe()
count 35
unique 7
top B+
freq 9
Name: Physics, dtype: object
5. idxmax and nlargest
The idxmax function returns the index label of the maximum value, which is especially handy in some situations; idxmin is analogous. The nlargest function returns the largest few values; nsmallest is analogous.
>>> df['Math'].idxmax()
5
>>> df['Math'].nlargest(3)
5 97.0
28 95.5
11 87.7
Name: Math, dtype: float64
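The idxmin and nsmallest counterparts work the same way; a minimal sketch on an illustrative Series:

```python
import pandas as pd

s = pd.Series([61.7, 31.5, 97.0, 47.4])

print(s.idxmin())      # 1, the label of the smallest value
print(s.nsmallest(2))  # the two smallest values, smallest first
```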
6. clip and replace
clip and replace are two kinds of substitution functions.
clip truncates values that fall outside the given bounds:
>>> df['Math'].head()
0 34.0
1 32.5
2 87.2
3 80.4
4 84.8
Name: Math, dtype: float64
>>> df['Math'].clip(33, 80).head()
0 34.0
1 33.0
2 80.0
3 80.0
4 80.0
Name: Math, dtype: float64
replace substitutes specified values with new ones:
>>> df['Address'].head()
0 street_1
1 street_2
2 street_2
3 street_2
4 street_4
Name: Address, dtype: object
>>> df['Address'].replace(['street_1', 'street_2'], ['one', 'two']).head()
0 one
1 two
2 two
3 two
4 street_4
Name: Address, dtype: object
>>> df.replace({
'Address': {
'street_1': 'one', 'street_2': 'two'}}).head()
School Class ID Gender Address Height Weight Math Physics
0 S_1 C_1 1101 M one 173 63 34.0 A+
1 S_1 C_1 1102 F two 192 73 32.5 B+
2 S_1 C_1 1103 M two 186 82 87.2 B+
3 S_1 C_1 1104 F two 167 81 80.4 B-
4 S_1 C_1 1105 F street_4 159 64 84.8 B+
7. apply function
apply is a highly flexible function. For a Series, it applies an operation to each value:
>>> df['Math'].apply(lambda x: str(x) + '!').head() # you can pass a lambda expression or a named function
0 34.0!
1 32.5!
2 87.2!
3 80.4!
4 84.8!
Name: Math, dtype: object
For a DataFrame, with the default axis=0 it applies the function to each column:
>>> df.apply(lambda x: x.apply(lambda y: str(y) + '!')).head() # a slightly more involved example that helps clarify what apply does
School Class ID Gender Address Height Weight Math Physics
0 S_1! C_1! 1101! M! street_1! 173! 63! 34.0! A+!
1 S_1! C_1! 1102! F! street_2! 192! 73! 32.5! B+!
2 S_1! C_1! 1103! M! street_2! 186! 82! 87.2! B+!
3 S_1! C_1! 1104! F! street_2! 167! 81! 80.4! B-!
4 S_1! C_1! 1105! F! street_4! 159! 64! 84.8! B+!
Note that in Pandas the axis parameter indicates the direction of processing, not the direction of aggregation: with axis=0 (or axis='index'), the function iterates over the columns, aggregating down the rows; with axis=1 (or axis='columns'), it iterates over the rows, aggregating across the columns.
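A small sketch of this convention on an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# axis=0: the lambda receives each column, so we get column sums
print(df.apply(lambda col: col.sum(), axis=0))  # a: 3, b: 30

# axis=1: the lambda receives each row, so we get row sums
print(df.apply(lambda row: row.sum(), axis=1))  # 11, 22
```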
4. Sorting
1. Index sorting
>>> df.set_index('Math').head() # set_index makes a column the index
School Class ID Gender Address Height Weight Physics
Math
34.0 S_1 C_1 1101 M street_1 173 63 A+
32.5 S_1 C_1 1102 F street_2 192 73 B+
87.2 S_1 C_1 1103 M street_2 186 82 B+
80.4 S_1 C_1 1104 F street_2 167 81 B-
84.8 S_1 C_1 1105 F street_4 159 64 B+
>>> df.set_index('Math').sort_index().head() # the ascending parameter defaults to True (ascending order)
School Class ID Gender Address Height Weight Physics
Math
31.5 S_1 C_3 1301 M street_4 161 68 B+
32.5 S_1 C_1 1102 F street_2 192 73 B+
32.7 S_2 C_3 2302 M street_5 171 88 A
33.8 S_1 C_2 1204 F street_5 162 63 B
34.0 S_1 C_1 1101 M street_1 173 63 A+
2. Value sorting
>>> df.sort_values(by='Class').head() # ascending by default
School Class ID Gender Address Height Weight Math Physics
0 S_1 C_1 1101 M street_1 173 63 34.0 A+
19 S_2 C_1 2105 M street_4 170 81 34.2 A
18 S_2 C_1 2104 F street_5 159 97 72.2 B+
16 S_2 C_1 2102 F street_6 161 61 50.6 B+
15 S_2 C_1 2101 M street_7 174 84 83.3 C
To sort by multiple columns, sort by the first key, then break ties with the second key:
>>> df.sort_values(by=['Address', 'Height']).head()
School Class ID Gender Address Height Weight Math Physics
0 S_1 C_1 1101 M street_1 173 63 34.0 A+
11 S_1 C_3 1302 F street_1 175 57 87.7 A-
23 S_2 C_2 2204 M street_1 175 74 47.2 B-
33 S_2 C_4 2404 F street_2 160 84 67.7 B
3 S_1 C_1 1104 F street_2 167 81 80.4 B-