Pandas Quick Start Tutorial

Pandas overview

Pandas is a core data analysis library for Python, providing fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. Its goal is to become an essential high-level tool for practical, real-world data analysis in Python. Its long-term goal is to become the most powerful and flexible open source data analysis tool available in any language. After years of sustained effort, Pandas is getting closer and closer to this goal.

Pandas is suitable for processing the following types of data:

Tabular data with heterogeneous columns similar to SQL or Excel tables;
Ordered and unordered (non-fixed frequency) time series data;
Matrix data with row and column labels, including homogeneous or heterogeneous data;
Any other form of observational or statistical data set. The data does not need to be labeled in advance to be placed into a Pandas data structure.

The main data structures of Pandas are Series (one-dimensional data) and DataFrame (two-dimensional data). These two data structures are sufficient to handle most typical use cases in finance, statistics, social sciences, engineering, and other fields. For R users, DataFrame provides richer functionality than the R language's data.frame. Pandas is built on NumPy and integrates well with other third-party scientific computing libraries.

Pandas is like a Swiss Army knife; here are just some of its strengths:

Handles missing data, represented as NaN, in both floating point and non-floating point data;
Size mutability: columns can be inserted into or deleted from DataFrame and other multidimensional objects;
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the labels can be ignored and Series and DataFrame will align the data automatically during computation;
Powerful, flexible group by functionality: split-apply-combine operations on data sets, for both aggregating and transforming data;
Easily converts ragged, differently indexed data in Python and NumPy data structures into DataFrame objects;
Intelligent label-based slicing, fancy indexing, and subsetting;
Intuitive merging and joining of data sets;
Flexible reshaping and pivoting of data sets;
Hierarchical axis labels: a single tick can carry multiple labels;
Mature IO tools: read data from text files (CSV and other delimited files), Excel files, databases, and other sources, and save or load data using the very fast HDF5 format;
Time series: supports time series functionality such as date range generation, frequency conversion, moving window statistics, moving window linear regression, and date shifting.
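
A few of the features listed above (label alignment, missing data handling, and group by) can be sketched in a handful of lines. This is only a minimal illustration; the Series `a` and `b` and the `sales` table with its `region` and `amount` columns are made-up data, not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Two Series whose labels only partly overlap (illustrative data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on labels; "x" has no partner in b, so the result is NaN there.
total = a + b

# Missing values can be filled explicitly.
filled = total.fillna(0.0)

# Split-apply-combine: group rows by a key column and aggregate each group.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [100, 150, 200, 50],
})
per_region = sales.groupby("region")["amount"].sum()

print(total)
print(filled)
print(per_region)
```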

These features mainly address pain points found in other programming languages and scientific research environments. Data processing generally falls into several stages: data munging and cleaning, data analysis and modeling, and data visualization and tabulation. Pandas is an ideal tool for these tasks.

Other notes:

Pandas is fast. Many of its underlying algorithms have been optimized using Cython. However, some performance is sacrificed to keep the library general-purpose. If you focus on a single feature, you can develop a dedicated tool that is faster than Pandas.
Pandas is a dependency of statsmodels, which makes it an important part of the statistical computing ecosystem in Python.
Pandas is widely used in the financial field.

Data structures

Dimensions	Name	Description
1	Series	Labeled one-dimensional homogeneous array
2	DataFrame	Labeled, variable-sized, two-dimensional heterogeneous tables

Why are there multiple data structures?

Pandas data structures can be thought of as containers for lower-dimensional data. For example, a DataFrame is a container for Series, and a Series is a container for scalars. This makes it possible to insert objects into and remove objects from these containers in a dictionary-like fashion.
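
The container idea can be illustrated with a small sketch. The frame and its column names `a`, `b`, and `c` are hypothetical and chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Insert a column the same way a key is added to a dict.
df["c"] = df["a"] + df["b"]

# Each column is a Series, and each element of a Series is a scalar.
col = df["c"]
first = col.iloc[0]

# Remove columns with `del` or `pop`, again as with a dict.
del df["a"]
popped = df.pop("b")

print(list(df.columns))
print(first)
```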

In addition, the default behavior of common API functions takes the typical orientation of time series and cross-sectional data sets into account. When two- or three-dimensional data is stored in plain multi-dimensional arrays, the user has to keep track of the orientation of the data set when writing functions, which is a burden. Apart from the performance impact of C- versus Fortran-contiguous memory layout, the axes are essentially interchangeable as far as the program is concerned. In Pandas, axes are meant to give the data more intuitive semantic meaning, that is, to express the orientation of the data set in a more natural way. This reduces the mental effort needed to write data transformation functions.

When working with tabular data such as a DataFrame, thinking in terms of the index (the rows) and the columns is more intuitive than thinking in terms of axis 0 and axis 1. Iterating over the columns of a DataFrame this way makes the code more readable:

for col in df.columns:
    series = df[col]
    # do something with series
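
As a rough illustration of the named axes, the sketch below reduces a small, made-up frame along each axis using both the numeric and the named form. The frame and the labels `r1` and `r2` are assumptions for this example only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# axis=0 and axis="index" are equivalent: reduce down the rows, one result per column.
by_number = df.sum(axis=0)
by_name = df.sum(axis="index")

# axis=1 and axis="columns" are equivalent: reduce across the columns, one result per row.
row_totals = df.sum(axis="columns")

# items() yields (column name, Series) pairs, an alternative to indexing by name.
for name, series in df.items():
    print(name, series.tolist())
```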
 

import pandas as pd
import random
from faker import Faker
from openpyxl import Workbook

# Create a fake-data generator
fake = Faker()

# Build a DataFrame of fake records.
# Column names: 姓名 (name), 电子邮件 (email), 电话号码 (phone number),
# 地址 (address), 工资 (salary)
n = 1000
data = {
    '姓名': [fake.name() for _ in range(n)],
    '电子邮件': [fake.email() for _ in range(n)],
    '电话号码': [fake.phone_number() for _ in range(n)],
    '地址': [fake.address() for _ in range(n)],
    '工资': [random.randint(1000, 10000) for _ in range(n)]
}

df = pd.DataFrame(data)

# Create an Excel workbook and write the data to its active worksheet
wb = Workbook()
ws = wb.active

# Write the DataFrame's column headers to the first row of the worksheet
for col, column_name in enumerate(df.columns, start=1):
    ws.cell(row=1, column=col, value=column_name)

# Write the data rows, starting at the second row
for row, record in enumerate(df.values, start=2):
    for col, value in enumerate(record, start=1):
        ws.cell(row=row, column=col, value=value)

# Save the Excel file (the file name means "fake data")
wb.save('虚假数据.xlsx')

# Read the file back and print each column as a Series
df = pd.read_excel('虚假数据.xlsx')
for col in df.columns:
    series = df[col]
    print(col, series)

姓名 0                  Susan Ortiz
1              Bryan Patterson
2                Betty Stevens
3                  Tina Mendez
4                Edward Miller
                ...           
995                Maria Davis
996        Stephanie Underwood
997            Nicole Robinson
998    Mr. Charles Lawrence MD
999             Angela Johnson
Name: 姓名, Length: 1000, dtype: object
电子邮件 0             [email protected]
1             [email protected]
2              [email protected]
3               [email protected]
4                [email protected]
                   ...             
995              [email protected]
996        [email protected]
997               [email protected]
998    [email protected]
999             [email protected]
Name: 电子邮件, Length: 1000, dtype: object
电话号码 0        (551)885-6301x47608
1                 1430520935
2       001-117-798-3751x241
3          701.251.0146x3110
4          180-238-9727x8554
               ...          
995         340-283-5841x845
996        (668)810-5237x874
997            (844)113-8337
998    001-563-272-0102x6982
999       914-902-3095x90088
Name: 电话号码, Length: 1000, dtype: object
地址 0      9266 Abbott Burg Suite 758\nNorth Stephanie, I...
1       439 Gonzalez Turnpike\nWest Zoechester, PA 51406
2      56319 Matthew Estate Apt. 619\nNorth Michael, ...
3                 4664 Dawson Burgs\nLongmouth, MH 51592
4      485 Rogers Prairie Suite 472\nPearsonmouth, MP...
                             ...                        
995    78764 Jennifer Squares Suite 495\nAimeetown, V...
996     739 Donald Mill Apt. 480\nJenniferstad, CT 56683
997         540 Christine Shoals\nRichardsview, IA 65480
998    20154 Craig Path Suite 691\nPetersonland, SD 8...
999    0537 Rebecca Lock Apt. 835\nLindsayport, MA 44282
Name: 地址, Length: 1000, dtype: object
工资 0      9183
1      3573
2      7649
3      3934
4      1371
       ... 
995    3751
996    2966
997    7967
998    3144
999    4289
Name: 工资, Length: 1000, dtype: int64

Mutability and copying of data

The values of all Pandas data structures are mutable, but not all data structures are size-mutable. For example, the length of a Series cannot be changed, but columns can be inserted into a DataFrame.

In Pandas, most methods do not modify the original input data; instead they copy the data and produce a new object. Generally speaking, it is safer to leave the original input data unchanged.
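
A small sketch of this behavior, using a made-up one-column frame (the column names `score` and `doubled` are assumptions for illustration): `sort_values` returns a new object, while values and columns of the original can still be changed in place.

```python
import pandas as pd

df = pd.DataFrame({"score": [3, 1, 2]})

# sort_values returns a new DataFrame; the original row order is untouched.
ordered = df.sort_values("score")

# Values are mutable in place...
df.loc[0, "score"] = 30

# ...and a DataFrame can grow by a column.
df["doubled"] = df["score"] * 2

print(ordered)
print(df)
```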

Get started with Pandas in ten minutes

This section is a short introduction to help newcomers get started with Pandas quickly. The Cookbook covers more practical recipes.

This section imports Pandas and NumPy as follows:

In [1]: import numpy as np

In [2]: import pandas as pd

Creating objects

See the Data Structure Introduction document for details.

When a Series is created from a list of values, Pandas generates a default integer index automatically:


In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [4]: s
Out[4]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Create a DataFrame from a NumPy array, with a datetime index and labeled columns:

dates = pd.date_range('20130101', periods=6)
print(dates)
df = pd.DataFrame(np.random.random((6, 5)), index=dates, columns=list('ABCDE'))
print(df)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
                   A         B         C         D         E
2013-01-01  0.821722  0.838978  0.684531  0.867492  0.084755
2013-01-02  0.784795  0.971571  0.509171  0.046268  0.806170
2013-01-03  0.546673  0.073271  0.738921  0.297711  0.735907
2013-01-04  0.621592  0.403766  0.802696  0.109643  0.171212
2013-01-05  0.469991  0.893711  0.461032  0.326327  0.424273
2013-01-06  0.680158  0.605057  0.230274  0.458527  0.647544

Create a DataFrame from a dict of objects that can be converted to Series:

In [9]: df2 = pd.DataFrame({'A': 1.,
   ...:                     'B': pd.Timestamp('20130102'),
   ...:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ...:                     'D': np.array([3] * 4, dtype='int32'),
   ...:                     'E': pd.Categorical(["test", "train", "test", "train"]),
   ...:                     'F': 'foo'})
   ...: 

In [10]: df2
Out[10]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

The columns of a DataFrame can have different dtypes:

In [11]: df2.dtypes
Out[11]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
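
Mixed dtypes can be filtered and converted. The following is a minimal sketch that rebuilds a reduced version of `df2` (only columns A, C, D, and F) so it runs on its own; `numeric` and `d_as_float` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# A reduced version of df2 with a mix of dtypes.
df2 = pd.DataFrame({
    "A": 1.0,
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "F": "foo",
})

# Keep only the numeric columns.
numeric = df2.select_dtypes(include="number")

# astype returns a new object with the requested dtype; df2 itself is unchanged.
d_as_float = df2["D"].astype("float64")

print(numeric.dtypes)
print(d_as_float.dtype)
```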

IPython supports tab completion for column names and public attributes. Here is a subset of the attributes that can be completed:

In [12]: df2.<TAB>  # noqa: E225, E999
df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.clip_lower
df2.align              df2.clip_upper
df2.all                df2.columns
df2.any                df2.combine
df2.append             df2.combine_first
df2.apply              df2.compound
df2.applymap           df2.consolidate

df2.D

Columns A, B, C, D, and E can all be autocompleted; for brevity, only some of the properties are shown here.

View data

For details, see the Basic usage document.

The following code shows how to view the head and tail data of a DataFrame:

In [13]: df.head()
Out[13]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.49492