Pandas study notes (1) - Pandas basics

For more code details, see the blogger's personal website: https://www.iwtmbtly.com/


Introduction to Pandas

Pandas is Python's core data analysis support library, providing fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. Pandas aims to be the fundamental high-level building block for practical, real-world data analysis in Python. Its broader goal is to become the most powerful and flexible open-source data analysis tool available in any language, and after years of sustained effort it is getting steadily closer to that goal.

Pandas is suitable for working with the following types of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet;
  • Ordered and unordered (not necessarily fixed-frequency) time series data;
  • Matrix data with row and column labels, whether homogeneous or heterogeneous;
  • Any other form of observational or statistical data set; the data does not need to be labeled in advance to be placed into a Pandas data structure.

The main data structures of Pandas are Series (one-dimensional data) and DataFrame (two-dimensional data), which are sufficient to handle most typical use cases in finance, statistics, the social sciences, engineering, and other fields. For R users, DataFrame provides richer functionality than R's data.frame. Pandas is built on top of NumPy and integrates well with other third-party scientific computing libraries.

Pandas is like a universal Swiss Army knife; here are just a few of its advantages:

  • Handles missing data (represented as NaN) in both floating-point and non-floating-point data;
  • Size mutability: columns can be inserted into or deleted from DataFrame and higher-dimensional objects;
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the labels can be ignored and the data aligned automatically in Series and DataFrame computations;
  • Powerful and flexible group-by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data;
  • Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Intuitive merging and joining of data sets;
  • Flexible reshaping and pivoting of data sets;
  • Hierarchical labeling of axes (multiple labels per tick are possible);
  • Robust IO tools for loading data from flat files (CSV and other delimited formats), Excel files, and databases, and for saving/loading data in the ultra-fast HDF5 format;
  • Time series functionality: date range generation, frequency conversion, moving window statistics, moving window linear regression, date shifting, and more.
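
The automatic alignment mentioned above can be seen in a minimal sketch (the Series values here are made up for illustration):

```python
import pandas as pd

# Two Series with the same labels in a different order
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'a', 'b'])

# Arithmetic aligns on the labels first, not on position
total = s1 + s2

# A label missing from one operand yields NaN rather than an error
partial = s1 + pd.Series([100], index=['a'])
```

Here `total['a']` is 21 (1 + 20), because the values were matched by label, not by position.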

These features address common pain points of other programming languages and research environments. Data processing generally falls into several stages: data cleaning and munging, analysis and modeling, then visualization and tabulation. Pandas is an ideal tool for all of them.

Other notes:

  • Pandas is fast. Many of its low-level algorithms have been optimized with Cython. However, to stay general-purpose it must sacrifice some performance; a tool specialized for a single task can always be made faster than Pandas.
  • Pandas is a dependency of statsmodels, which makes it an important part of the statistical computing ecosystem in Python.
  • Pandas is widely used in the financial field.

Import Pandas

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__  # check the Pandas version
'1.3.4'

1. File reading and writing

Reading

1. csv format

>>> df = pd.read_csv("data/table.csv", encoding="utf_8_sig")
>>> df.head()  # view the first few rows
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+
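
read_csv also accepts many useful parameters, such as sep, usecols, and nrows. A self-contained sketch using an in-memory CSV instead of the data/table.csv file above (the column subset is illustrative):

```python
import io
import pandas as pd

csv_text = "School,Class,ID\nS_1,C_1,1101\nS_1,C_1,1102\nS_2,C_4,2405\n"

# usecols keeps only the named columns; nrows limits how many rows are parsed
df_small = pd.read_csv(io.StringIO(csv_text), usecols=["School", "ID"], nrows=2)
```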

2. txt format

>>> df_txt = pd.read_table("data/table.txt")
>>> df_txt
   col1 col2  col3    col4
0     2    a   1.4   apple
1     3    b   3.4  banana
2     6    c   2.5  orange
3     5    d   3.2   lemon

3. xls or xlsx format

# requires openpyxl (xlrd is only needed for legacy .xls files)
>>> df_excel = pd.read_excel("data/table.xlsx")
>>> df_excel.head()  # view the first few rows
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+

Writing

1. csv format

# df.to_csv("data/new_table.csv", index=False) # drop the index when saving
>>> df.to_csv("data/new_table.csv", encoding="utf_8_sig")  # Chinese text is easily garbled when written; use encoding="utf_8_sig"
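
A sketch of the index=False behaviour using an in-memory buffer, so no file on disk is needed:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

buf = io.StringIO()
df_demo.to_csv(buf, index=False)   # omit the row index when writing
text = buf.getvalue()

# Reading it back restores the same frame with a fresh RangeIndex
df_back = pd.read_csv(io.StringIO(text))
```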

2. xls or xlsx format

# requires the openpyxl package
>>> df.to_excel("data/new_table.xlsx", sheet_name="Sheet1")

2. Basic data structure

Series

A Pandas Series is like a single column of a table: a one-dimensional labeled array that can hold any data type. A Series consists of an index and the corresponding values. Commonly used features are shown below:

1. Create a Series

For a Series, the most commonly used attributes are its values, index, name, and dtype.

# use NumPy's np.random.randn(6) to generate six samples from the standard normal distribution
>>> s = pd.Series(np.random.randn(6), index=['a', 'b', 'c', 'd', 'e', 'f'], name="这是一个Series", dtype='float64')
>>> s
a    0.945151
b    2.071048
c    0.983989
d    0.032149
e    1.465196
f    0.416414
Name: 这是一个Series, dtype: float64

2. Access Series attributes

>>> s.values  # view the values
array([0.945151, 2.071048, 0.983989, 0.032149, 1.465196, 0.416414])
>>> s.name  # view the name
'这是一个Series'
>>> s.index  # view the index
Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
>>> s.dtype  # view the dtype
dtype('float64')

3. Access a single element

>>> s["a"]
0.9451514803304423

4. Call methods

>>> s.mean()
0.9856579470429936
# Series has many methods that can be called
>>> print([attr for attr in dir(s) if not attr.startswith('_')])
['T', 'a', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'append', 'apply', 'argmax', 'argmin', 'argsort', 'array', 'asfreq', 'asof', 'astype', 'at', 'at_time', 'attrs', 'autocorr', 'axes', 'b', 'backfill', 'between', 'between_time', 'bfill', 'bool', 'c', 'clip', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'd', 'describe', 'diff', 'div', 'divide', 'divmod', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtype', 'dtypes', 'duplicated', 'e', 'empty', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'f', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'hasnans', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'interpolate', 'is_monotonic', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'isin', 'isna', 'isnull', 'item', 'items', 'iteritems', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lt', 'mad', 'map', 'mask', 'max', 'mean', 'median', 'memory_usage', 'min', 'mod', 'mode', 'mul', 'multiply', 'name', 'nbytes', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'radd', 'rank', 'ravel', 'rdiv', 'rdivmod', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'repeat', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'searchsorted', 'sem', 'set_axis', 'set_flags', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort_index', 'sort_values', 'squeeze', 'std', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_frame', 'to_hdf', 'to_json', 'to_latex', 'to_list', 'to_markdown', 'to_numpy', 'to_period', 'to_pickle', 'to_sql', 'to_string', 
'to_timestamp', 'to_xarray', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unique', 'unstack', 'update', 'value_counts', 'values', 'var', 'view', 'where', 'xs']

DataFrame

A DataFrame is a tabular data structure containing an ordered set of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be regarded as a dictionary of Series that share a common index.
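
The "dictionary of Series" view can be made concrete with a small sketch (the names here are made up); note how the two Series are aligned on their shared labels:

```python
import pandas as pd

heights = pd.Series([173, 192], index=["s1", "s2"])
weights = pd.Series([63, 73], index=["s2", "s1"])  # deliberately reordered

# Each dict value becomes a column; rows are aligned on the index labels
people = pd.DataFrame({"Height": heights, "Weight": weights})
```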

The DataFrame construction method is as follows:

pandas.DataFrame(data, index, columns, dtype, copy)

Parameter Description:

  • data: the input data (ndarray, Series, map, lists, dict, etc.).
  • index: the index values, also called the row labels.
  • columns: the column labels; defaults to RangeIndex(0, 1, 2, …, n).
  • dtype: the data type.
  • copy: whether to copy the input data; defaults to False.

A Pandas DataFrame is a two-dimensional structure, similar to a two-dimensional array.
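
A quick sketch exercising the constructor parameters listed above (the labels are arbitrary):

```python
import numpy as np
import pandas as pd

data = np.arange(6).reshape(3, 2)               # a plain 2-D ndarray
df_demo = pd.DataFrame(data,
                       index=["r1", "r2", "r3"],  # row labels
                       columns=["x", "y"],        # column labels
                       dtype="float64")           # cast on construction
```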

1. Create a DataFrame

>>> df = pd.DataFrame({'col1': list('abcde'), 'col2': range(5, 10), 'col3': [1.3, 2.5, 3.6, 4.6, 5.8]},
...                   index=list('一二三四五'))
>>> df
  col1  col2  col3
一    a     5   1.3
二    b     6   2.5
三    c     7   3.6
四    d     8   4.6
五    e     9   5.8

2. Get a Series from DataFrame

>>> df['col1']
一    a
二    b
三    c
四    d
五    e
Name: col1, dtype: object
>>> type(df)
pandas.core.frame.DataFrame
>>> type(df['col1'])
pandas.core.series.Series

3. Modify row or column names

>>> df.rename(index={'一': 'one'}, columns={'col1': 'new_col1'})
    new_col1  col2  col3
one        a     5   1.3
二          b     6   2.5
三          c     7   3.6
四          d     8   4.6
五          e     9   5.8

4. Access properties and call methods

>>> df.index
Index(['一', '二', '三', '四', '五'], dtype='object')
>>> df.columns
Index(['col1', 'col2', 'col3'], dtype='object')
>>> df.values
array([['a', 5, 1.3],
       ['b', 6, 2.5],
       ['c', 7, 3.6],
       ['d', 8, 4.6],
       ['e', 9, 5.8]], dtype=object)
>>> df.shape
(5, 3)

5. Index alignment

This is a very powerful Pandas feature, and not understanding it can cause trouble:

>>> df1 = pd.DataFrame({'A': [1, 2, 3]}, index=[1, 2, 3])
>>> df2 = pd.DataFrame({'A': [1, 2, 3]}, index=[3, 1, 2])
>>> df1 - df2  # the indexes are aligned before subtracting, so the result is not 0
   A
1 -1
2 -1
3  2

6. Delete and add columns

For deletion, you can use the drop function, del, or pop

>>> # delete row '五' and column 'col1'
>>> df.drop(index='五', columns='col1')  # with inplace=True, the change is made directly in the original DataFrame
   col2  col3
一     5   1.3
二     6   2.5
三     7   3.6
四     8   4.6
>>> df['col1'] = [1, 2, 3, 4, 5]
>>> del df['col2']
>>> df
   col1  col3
一     1   1.3
二     2   2.5
三     3   3.6
四     4   4.6
五     5   5.8

The pop method operates on the original DataFrame in place and returns the removed column, similar to Python's built-in pop:

>>> df['col2'] = [6, 7, 8, 9, 10]
>>> df.pop('col1')
一    1
二    2
三    3
四    4
五    5
Name: col1, dtype: int64
>>> df
   col3  col2
一   1.3     6
二   2.5     7
三   3.6     8
四   4.6     9
五   5.8    10

You can add new columns directly, or use the assign method

>>> df1['B'] = list('abc')
>>> df1
   A  B
1  1  a
2  2  b
3  3  c
>>> df1.assign(C=pd.Series(list('def')))  # why does NaN appear? (hint: index alignment) The indexes on the two sides of assign differ; whose index decides the result?
   A  B    C
1  1  a    e
2  2  b    f
3  3  c  NaN

But the assign method will not modify the original DataFrame

>>> df1
   A  B
1  1  a
2  2  b
3  3  c
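
To answer the question posed in the comment above: assign aligns the new column on the caller's index, so labels the right-hand Series does not have become NaN. Giving the Series a matching index avoids this (a small sketch):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=[1, 2, 3])

# The default RangeIndex (0, 1, 2) overlaps df1's index only at 1 and 2
with_nan = df1.assign(C=pd.Series(list('def')))

# An explicitly matching index aligns every row
no_nan = df1.assign(C=pd.Series(list('def'), index=[1, 2, 3]))
```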

7. Select columns by type

>>> df.select_dtypes(include=['number']).head()
   col3  col2
一   1.3     6
二   2.5     7
三   3.6     8
四   4.6     9
五   5.8    10
>>> df.select_dtypes(include=['float']).head()
   col3
一   1.3
二   2.5
三   3.6
四   4.6
五   5.8

8. Convert Series to DataFrame

>>> s = df.mean()
>>> s.name = 'to_DataFrame'
>>> s
col3    3.56
col2    8.00
Name: to_DataFrame, dtype: float64
>>> s.to_frame()
      to_DataFrame
col3          3.56
col2          8.00

Use the T attribute to transpose:

>>> s.to_frame().T
              col3  col2
to_DataFrame  3.56   8.0

3. Common basic functions

From here on (including in all subsequent chapters), we will use this data set:

df = pd.read_csv("data/table.csv")

1. head and tail

>>> df.head()  # view the first few rows
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+
>>> df.tail()  # view the last few rows
   School Class    ID Gender   Address  Height  Weight  Math Physics
30    S_2   C_4  2401      F  street_2     192      62  45.3       A
31    S_2   C_4  2402      M  street_7     166      82  48.7       B
32    S_2   C_4  2403      F  street_6     158      60  59.7      B+
33    S_2   C_4  2404      F  street_2     160      84  67.7       B
34    S_2   C_4  2405      F  street_6     193      54  47.6       B

You can specify how many rows to display with the n parameter:

>>> df.head(3)
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+

2. unique and nunique

nunique returns the number of distinct values; unique returns the distinct values themselves:

>>> df['Physics'].nunique()
7
>>> df['Physics'].unique()
array(['A+', 'B+', 'B-', 'A-', 'B', 'A', 'C'], dtype=object)

3. count and value_counts

count returns the number of non-missing elements; value_counts returns how many times each element occurs:

>>> df['Physics'].count()
35
>>> df['Physics'].value_counts()
B+    9
B     8
B-    6
A     4
A+    3
A-    3
C     2
Name: Physics, dtype: int64
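
value_counts also accepts normalize=True to report proportions instead of raw counts; a self-contained sketch with made-up grades:

```python
import pandas as pd

grades = pd.Series(['B+', 'B', 'B+', 'A', 'B+', 'B'])

# Proportions sum to 1 and are sorted from most to least frequent
props = grades.value_counts(normalize=True)
```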

4. describe and info

The info function reports which columns exist, how many non-missing values each has, and each column's type:

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   School   35 non-null     object
 1   Class    35 non-null     object
 2   ID       35 non-null     int64
 3   Gender   35 non-null     object
 4   Address  35 non-null     object
 5   Height   35 non-null     int64
 6   Weight   35 non-null     int64
 7   Math     35 non-null     float64
 8   Physics  35 non-null     object
dtypes: float64(1), int64(3), object(5)
memory usage: 2.6+ KB

describe computes summary statistics for the numeric columns by default:

>>> df.describe()
               ID      Height      Weight       Math
count    35.00000   35.000000   35.000000  35.000000
mean   1803.00000  174.142857   74.657143  61.351429
std     536.87741   13.541098   12.895377  19.915164
min    1101.00000  155.000000   53.000000  31.500000
25%    1204.50000  161.000000   63.000000  47.400000
50%    2103.00000  173.000000   74.000000  61.700000
75%    2301.50000  187.500000   82.000000  77.100000
max    2405.00000  195.000000  100.000000  97.000000

You can choose your own quantiles:

>>> df.describe(percentiles=[.05, .25, .75, .95])
               ID      Height      Weight       Math
count    35.00000   35.000000   35.000000  35.000000
mean   1803.00000  174.142857   74.657143  61.351429
std     536.87741   13.541098   12.895377  19.915164
min    1101.00000  155.000000   53.000000  31.500000
5%     1102.70000  157.000000   56.100000  32.640000
25%    1204.50000  161.000000   63.000000  47.400000
50%    2103.00000  173.000000   74.000000  61.700000
75%    2301.50000  187.500000   82.000000  77.100000
95%    2403.30000  193.300000   97.600000  90.040000
max    2405.00000  195.000000  100.000000  97.000000

For non-numeric types, you can also use the describe function

>>> df['Physics'].describe()
count     35
unique     7
top       B+
freq       9
Name: Physics, dtype: object

5. idxmax and nlargest

The idxmax function returns the index label of the maximum value, which is especially useful in some cases (idxmin is analogous). The nlargest function returns the largest few values (nsmallest is analogous):

>>> df['Math'].idxmax()
5
>>> df['Math'].nlargest(3)
5     97.0
28    95.5
11    87.7
Name: Math, dtype: float64
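
The counterparts mentioned above work the same way; a sketch with a hand-made Series (the labels are chosen to mimic the Math column):

```python
import pandas as pd

math_scores = pd.Series([34.0, 97.0, 31.5, 87.7], index=[0, 5, 12, 28])

lowest_label = math_scores.idxmin()    # index label of the smallest value
bottom_two = math_scores.nsmallest(2)  # two smallest values, smallest first
```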

6. clip and replace

clip and replace are two kinds of substitution functions.

clip truncates values that exceed or fall below given bounds:

>>> df['Math'].head()
0    34.0
1    32.5
2    87.2
3    80.4
4    84.8
Name: Math, dtype: float64
>>> df['Math'].clip(33, 80).head()
0    34.0
1    33.0
2    80.0
3    80.0
4    80.0
Name: Math, dtype: float64

replace substitutes specified values with new ones:

>>> df['Address'].head()
0    street_1
1    street_2
2    street_2
3    street_2
4    street_4
Name: Address, dtype: object
>>> df['Address'].replace(['street_1', 'street_2'], ['one', 'two']).head()
0         one
1         two
2         two
3         two
4    street_4
Name: Address, dtype: object
>>> df.replace({'Address': {'street_1': 'one', 'street_2': 'two'}}).head()
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M       one     173      63  34.0      A+
1    S_1   C_1  1102      F       two     192      73  32.5      B+
2    S_1   C_1  1103      M       two     186      82  87.2      B+
3    S_1   C_1  1104      F       two     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+

7. apply function

apply is a very flexible function. For a Series, it applies an operation to each value:

>>> df['Math'].apply(lambda x: str(x) + '!').head()  # a lambda expression works, as does an ordinary function
0    34.0!
1    32.5!
2    87.2!
3    80.4!
4    84.8!
Name: Math, dtype: object

For a DataFrame, with the default axis=0 it applies the operation to each column:

>>> df.apply(lambda x: x.apply(lambda y: str(y) + '!')).head()  # a slightly more involved example that helps clarify what apply does
  School Class     ID Gender    Address Height Weight   Math Physics
0   S_1!  C_1!  1101!     M!  street_1!   173!    63!  34.0!     A+!
1   S_1!  C_1!  1102!     F!  street_2!   192!    73!  32.5!     B+!
2   S_1!  C_1!  1103!     M!  street_2!   186!    82!  87.2!     B+!
3   S_1!  C_1!  1104!     F!  street_2!   167!    81!  80.4!     B-!
4   S_1!  C_1!  1105!     F!  street_4!   159!    64!  84.8!     B+!

In Pandas, the axis parameter always indicates the direction along which the operation is applied, not the direction that remains. With axis='index' (or axis=0), the function is applied down each column (i.e., across rows); with axis='columns' (or axis=1), it is applied across each row.
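
A minimal sketch of the two directions (the frame is made up):

```python
import pandas as pd

df_ax = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# axis=0 (default): the lambda receives each COLUMN as a Series
col_sums = df_ax.apply(lambda s: s.sum(), axis=0)

# axis=1: the lambda receives each ROW as a Series
row_sums = df_ax.apply(lambda s: s.sum(), axis=1)
```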

4. Sorting

1. Index sorting

>>> df.set_index('Math').head()  # set_index makes a column the index
     School Class    ID Gender   Address  Height  Weight Physics
Math
34.0    S_1   C_1  1101      M  street_1     173      63      A+
32.5    S_1   C_1  1102      F  street_2     192      73      B+
87.2    S_1   C_1  1103      M  street_2     186      82      B+
80.4    S_1   C_1  1104      F  street_2     167      81      B-
84.8    S_1   C_1  1105      F  street_4     159      64      B+
>>> df.set_index('Math').sort_index().head()  # the ascending parameter controls the order; it defaults to True (ascending)
     School Class    ID Gender   Address  Height  Weight Physics
Math
31.5    S_1   C_3  1301      M  street_4     161      68      B+
32.5    S_1   C_1  1102      F  street_2     192      73      B+
32.7    S_2   C_3  2302      M  street_5     171      88       A
33.8    S_1   C_2  1204      F  street_5     162      63       B
34.0    S_1   C_1  1101      M  street_1     173      63      A+

2. Value sorting

>>> df.sort_values(by='Class').head()  # ascending by default
   School Class    ID Gender   Address  Height  Weight  Math Physics
0     S_1   C_1  1101      M  street_1     173      63  34.0      A+
19    S_2   C_1  2105      M  street_4     170      81  34.2       A
18    S_2   C_1  2104      F  street_5     159      97  72.2      B+
16    S_2   C_1  2102      F  street_6     161      61  50.6      B+
15    S_2   C_1  2101      M  street_7     174      84  83.3       C

Sorting by multiple columns: rows are ordered by the first key, and ties are broken by the second key

>>> df.sort_values(by=['Address', 'Height']).head()
   School Class    ID Gender   Address  Height  Weight  Math Physics
0     S_1   C_1  1101      M  street_1     173      63  34.0      A+
11    S_1   C_3  1302      F  street_1     175      57  87.7      A-
23    S_2   C_2  2204      M  street_1     175      74  47.2      B-
33    S_2   C_4  2404      F  street_2     160      84  67.7       B
3     S_1   C_1  1104      F  street_2     167      81  80.4      B-
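
ascending can also be a list, giving each sort key its own direction; a small self-contained sketch:

```python
import pandas as pd

df_s = pd.DataFrame({'Address': ['street_2', 'street_1', 'street_2'],
                     'Height': [160, 173, 175]})

# Address ascending, Height descending within each address
ordered = df_s.sort_values(by=['Address', 'Height'], ascending=[True, False])
```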

Origin blog.csdn.net/qq_43300880/article/details/124971639