Big data (8): Detailed explanation of the basic application of Pandas (5)


Detailed explanation of the basic application of Pandas (5)

Applications of DataFrames

Window calculation

The `rolling` method of a `DataFrame` object allows us to place the data in a window and then apply functions to the data inside that window. For example, if we have obtained the recent data of a certain stock and want to compute its 5-day and 10-day moving averages, we need to set up the window first and then perform the calculation. We can use the third-party library `pandas-datareader` to obtain the data of a specified stock over a certain period of time; the specific operations are as follows.

Install the `pandas-datareader` third-party library.

pip install pandas-datareader

Use the `get_data_stooq` function provided by `pandas-datareader` to get the recent stock data of Baidu (stock code: BIDU) from the Stooq website.

import pandas_datareader as pdr

baidu_df = pdr.get_data_stooq('BIDU', start='2021-11-22', end='2021-12-7')
baidu_df.sort_index(inplace=True)
baidu_df

output:

*(table of Baidu's daily Open, High, Low, Close, and Volume data for the selected date range)*

The `DataFrame` above has five columns: `Open`, `High`, `Low`, `Close`, and `Volume`, which represent the opening price, highest price, lowest price, closing price, and trading volume of the stock. Next, we perform window calculations on Baidu's stock data.

baidu_df.rolling(5).mean()

output:

The data in the `Close` column above is the 5-day moving average we need. Of course, we can also call the method below directly on the `Series` object corresponding to the `Close` column to compute the 5-day moving average.

baidu_df.Close.rolling(5).mean()

output:

Date
2021-11-22        NaN
2021-11-23        NaN
2021-11-24        NaN
2021-11-26        NaN
2021-11-29    150.608
2021-11-30    151.014
2021-12-01    150.682
2021-12-02    150.196
2021-12-03    147.062
2021-12-06    146.534
2021-12-07    146.544
Name: Close, dtype: float64
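As a small extension (the column names `MA5` and `MA10` are just illustrative), we can compute both the 5-day and the 10-day moving averages mentioned earlier and view them next to the closing price, without modifying `baidu_df`:

# 5-day and 10-day moving averages of the closing price
ma5 = baidu_df.Close.rolling(5).mean()
ma10 = baidu_df.Close.rolling(10).mean()
baidu_df.assign(MA5=ma5, MA10=ma10)[['Close', 'MA5', 'MA10']]

With only about two weeks of data here, most of the `MA10` values will be `NaN`; over a longer date range both averages fill in.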

Correlation judgment

In statistics, we usually use covariance to measure the joint variability of two random variables. If larger values of a variable $X$ mainly correspond to larger values of another variable $Y$, and the same holds for smaller values, the two variables tend to behave similarly and the covariance is positive. If larger values of one variable mostly correspond to smaller values of the other, the two variables tend to exhibit opposite behavior and the covariance is negative. Simply put, the sign of the covariance shows how two variables are related. Variance is a special case of covariance: the covariance of a variable with itself.

$$\mathrm{cov}(X,Y) = E\big((X - \mu)(Y - \upsilon)\big) = E(X \cdot Y) - \mu\upsilon$$

If $X$ and $Y$ are statistically independent, their covariance is 0, because when $X$ and $Y$ are independent:

$$E(X \cdot Y) = E(X) \cdot E(Y) = \mu\upsilon$$

The magnitude of the covariance depends on the scale of the variables and is usually not easy to interpret, but its normalized form shows the strength of the linear relationship between the two variables. In statistics, the Pearson product-moment correlation coefficient is that normalized form of covariance; it measures the degree of (linear) correlation between two variables $X$ and $Y$, with values between $-1$ and $1$.

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$

Estimating the covariance and standard deviations from a sample yields the sample Pearson correlation coefficient, usually denoted by the Greek letter $\rho$.

$$\rho = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$

When judging the correlation between indicators by the value of $\rho$, we follow the two steps below (a small numeric check of the formula follows the list).

  1. Determine whether the indicators are positively correlated, negatively correlated, or not correlated.
    • When $ \rho \gt 0 $, it is considered that the variables are positively correlated, that is, the trends of the two are consistent.
    • When $ \rho \lt 0 $, it is considered that the variables are negatively correlated, that is, the trends of the two are opposite.
    • When $ \rho = 0 $, it is considered that the variables are not correlated, but it does not mean that the two indicators are statistically independent.
  2. Determine the degree of correlation between indicators.
    • When the absolute value of $\rho$ is between $[0.6,1]$, the variables are considered to be strongly correlated.
    • When the absolute value of $ \rho $ is between $ [0.1,0.6) $, it is considered that the variables are weakly correlated.
    • When the absolute value of $\rho$ is between $[0,0.1)$, it is considered that there is no correlation between variables.
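As a quick sanity check of the formula above (a minimal sketch with made-up data; `x` and `y` are hypothetical paired observations), we can compute $\rho$ directly from the definition and compare it with the `corr` method of a `Series`:

import pandas as pd

# hypothetical paired observations, just for checking the formula
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])

# Pearson coefficient computed from the definition
num = ((x - x.mean()) * (y - y.mean())).sum()
den = (((x - x.mean()) ** 2).sum() ** 0.5) * (((y - y.mean()) ** 2).sum() ** 0.5)
print(num / den)   # about 0.7746
print(x.corr(y))   # pandas' Pearson correlation gives the same value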

Pearson's correlation coefficient applies to:

  1. There is a linear relationship between the two variables and both are continuous data.
  2. The population of two variables is normally distributed, or nearly normal and unimodal.
  3. Observations of two variables are paired, and each pair of observations is independent of the other.

The `cov` and `corr` methods of a `DataFrame` object are used to compute the covariance and the correlation coefficient, respectively. The first parameter of the `corr` method, `method`, defaults to `pearson`, meaning that the Pearson correlation coefficient is computed; you can also pass `kendall` or `spearman` to obtain the Kendall coefficient or the Spearman rank correlation coefficient.

Next, we read the famous Boston housing price dataset from the file named `boston_house_price.csv` to create a `DataFrame`, and use the `corr` method to see which of the `13` factors that may affect housing prices are positively or negatively correlated with the price. The code is as follows.

import pandas as pd

boston_df = pd.read_csv('data/csv/boston_house_price.csv')
boston_df.corr()

Note: If you need the CSV file used in the example above, you can obtain it from the following Baidu cloud disk address; the data is in the "Learning Data Analysis from Scratch" directory. Link: https://pan.baidu.com/s/1rQujl5RQn9R7PadB2Z5g_g , extraction code: e7b4.

output:

*(correlation matrix of the Boston housing price dataset)*
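The `cov` method and the other `method` options mentioned above are called the same way; a minimal sketch using the `boston_df` loaded above:

# covariance matrix of all numeric columns
boston_df.cov()

# Kendall rank correlation instead of the default Pearson
boston_df.corr(method='kendall')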


The Spearman correlation coefficient is less strict about the data than the Pearson correlation coefficient: as long as the observed values of the two variables are paired ranked data, or ranked data converted from observations of continuous variables, the Spearman rank correlation coefficient can be used regardless of the overall distribution shape of the two variables and the size of the sample. We compute the Spearman correlation coefficient as follows.

boston_df.corr('spearman')

output:

In Notebook or JupyterLab, we can add a color gradient to the `PRICE` column and use color to visually show which columns are negatively correlated, positively correlated, or uncorrelated with the house price. The `background_gradient` method of the `style` property of a `DataFrame` object does this; the code is as follows.

boston_df.corr('spearman').style.background_gradient('RdYlBu', subset=['PRICE'])

The `RdYlBu` colormap used in the code above is shown below: it runs from red at the low end, through yellow in the middle, to blue at the high end, so the columns most negatively correlated with the price appear closest to red, values near `0` appear yellow, and the columns most positively correlated appear closest to blue.

import matplotlib.pyplot as plt

plt.get_cmap('RdYlBu')

*(the RdYlBu colormap: a gradient from red through yellow to blue)*

Application of Index

Let's take another look at the `Index` type, which provides indexing services for `Series` and `DataFrame` objects. The commonly used `Index` types are listed below.

Range Index (RangeIndex)

code:

import numpy as np

sales_data = np.random.randint(400, 1000, 12)
month_index = pd.RangeIndex(1, 13, name='月份')
ser = pd.Series(data=sales_data, index=month_index)
ser

output:

月份
1     703
2     705
3     557
4     943
5     961
6     615
7     788
8     985
9     921
10    951
11    874
12    609
dtype: int64
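Incidentally, a `RangeIndex` starting at `0` is also what pandas creates by default when no index is given; a quick check:

# when no index is specified, pandas builds a RangeIndex automatically
temp = pd.Series([10, 20, 30])
temp.index    # RangeIndex(start=0, stop=3, step=1)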

Categorical Index (CategoricalIndex)

code:

amount = [6, 6, 7, 6, 8, 6]  # purchase amounts matching the output below
cate_index = pd.CategoricalIndex(
    ['苹果', '香蕉', '苹果', '苹果', '桃子', '香蕉'],
    ordered=True,
    categories=['苹果', '香蕉', '桃子']
)
ser = pd.Series(data=amount, index=cate_index)
ser

output:

苹果    6
香蕉    6
苹果    7
苹果    6
桃子    8
香蕉    6
dtype: int64

code:

ser.groupby(level=0).sum()

output:

苹果    19
香蕉    12
桃子     8
dtype: int64
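Since the index above was created with an explicit category order and `ordered=True`, sorting by the index follows that declared order rather than dictionary order; a minimal sketch:

# rows come out in the declared category order: 苹果, 香蕉, 桃子
ser.sort_index()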

Multi-level index (MultiIndex)

code:

ids = np.arange(1001, 1006)
sms = ['期中', '期末']
index = pd.MultiIndex.from_product((ids, sms), names=['学号', '学期'])
courses = ['语文', '数学', '英语']
scores = np.random.randint(60, 101, (10, 3))
df = pd.DataFrame(data=scores, columns=courses, index=index)
df

Explanation: The code above uses the `MultiIndex` class method `from_product`, which constructs a multi-level index from the Cartesian product of the two sequences `ids` and `sms`.

output:

             语文 数学 英语
学号	学期			
1001  期中	93	77	60
      期末	93	98	84
1002  期中	64	78	71
      期末	70	71	97
1003  期中	72	88	97
      期末	99	100	63
1004  期中	80	71	61
      期末	91	62	72
1005  期中	82	95	67
      期末	84	78	86

code:

# Compute each student's grade: the midterm counts for 25% and the final exam for 75%
df.groupby(level=0).agg(lambda x: x.values[0] * 0.25 + x.values[1] * 0.75)

output:

        语文    数学    英语
学号			
1001	93.00	92.75	78.00
1002	68.50	72.75	90.50
1003	92.25	97.00	71.50
1004	88.25	64.25	69.25
1005	83.50	82.25	81.25
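With a multi-level index, we can also select data by a single level; a minimal sketch using the `df` built above:

# all rows for student 1001 (first level, 学号)
df.loc[1001]

# cross-section of all midterm (期中) rows (second level, 学期)
df.xs('期中', level='学期')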

Datetime Index (DatetimeIndex)

  1. Using the `date_range()` function, we can create a datetime index; the code is shown below.

    code:

    pd.date_range('2021-1-1', '2021-6-1', periods=10)
    

    output:

    DatetimeIndex(['2021-01-01 00:00:00', '2021-01-17 18:40:00',
                   '2021-02-03 13:20:00', '2021-02-20 08:00:00',
                   '2021-03-09 02:40:00', '2021-03-25 21:20:00',
                   '2021-04-11 16:00:00', '2021-04-28 10:40:00',
                   '2021-05-15 05:20:00', '2021-06-01 00:00:00'],
                  dtype='datetime64[ns]', freq=None)
    

    code:

    pd.date_range('2021-1-1', '2021-6-1', freq='W')
    

    output:

    DatetimeIndex(['2021-01-03', '2021-01-10', '2021-01-17', '2021-01-24',
                   '2021-01-31', '2021-02-07', '2021-02-14', '2021-02-21',
                   '2021-02-28', '2021-03-07', '2021-03-14', '2021-03-21',
                   '2021-03-28', '2021-04-04', '2021-04-11', '2021-04-18',
                   '2021-04-25', '2021-05-02', '2021-05-09', '2021-05-16',
                   '2021-05-23', '2021-05-30'],
                  dtype='datetime64[ns]', freq='W-SUN')
    
  2. Using the `DateOffset` type, we can specify a time offset and perform arithmetic with a `DatetimeIndex`. The specific operations are as follows.

    code:

    index = pd.date_range('2021-1-1', '2021-6-1', freq='W')
    index - pd.DateOffset(days=2)
    

    output:

    DatetimeIndex(['2021-01-01', '2021-01-08', '2021-01-15', '2021-01-22',
                   '2021-01-29', '2021-02-05', '2021-02-12', '2021-02-19',
                   '2021-02-26', '2021-03-05', '2021-03-12', '2021-03-19',
                   '2021-03-26', '2021-04-02', '2021-04-09', '2021-04-16',
                   '2021-04-23', '2021-04-30', '2021-05-07', '2021-05-14',
                   '2021-05-21', '2021-05-28'],
                  dtype='datetime64[ns]', freq=None)
    

    code:

    index + pd.DateOffset(days=2)
    

    output:

    DatetimeIndex(['2021-01-05', '2021-01-12', '2021-01-19', '2021-01-26',
                   '2021-02-02', '2021-02-09', '2021-02-16', '2021-02-23',
                   '2021-03-02', '2021-03-09', '2021-03-16', '2021-03-23',
                   '2021-03-30', '2021-04-06', '2021-04-13', '2021-04-20',
                   '2021-04-27', '2021-05-04', '2021-05-11', '2021-05-18',
                   '2021-05-25', '2021-06-01'],
                  dtype='datetime64[ns]', freq=None)
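    Besides `days`, `DateOffset` also accepts units such as `weeks` and `months`; a quick sketch continuing with the `index` above:

    index + pd.DateOffset(weeks=1)    # shift every date forward by one week
    index + pd.DateOffset(months=1)   # shift forward by one calendar month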
    
  3. Data can be manipulated with methods related to the `DatetimeIndex` type, including:

    • `shift()` method: shifts the data forward or backward in time. We again take the Baidu stock data above as an example; the code is as follows (a sketch of the day-over-day price change follows the two examples).

      code:

      baidu_df.shift(3, fill_value=0)
      

      output:

      code:

      baidu_df.shift(-1, fill_value=0)
      

      output:
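      A common use of `shift()` is to compare each row with the previous one, for example the day-over-day change of the closing price; a minimal sketch:

      # day-over-day change of the closing price (the first row becomes NaN)
      baidu_df.Close - baidu_df.Close.shift(1)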

    • `asfreq()` method: extracts the data corresponding to a specified time frequency; the code is as follows.

      code:

      baidu_df.asfreq('5D')
      

      output:

*(baidu_df data extracted at a 5-day frequency)*

      code:

      baidu_df.asfreq('5D', method='ffill')

    • `resample()` method: resamples the data based on time, which is equivalent to grouping the data by time period; the code is as follows (a comparison with `asfreq` is sketched after the note below).

      code:

      baidu_df.resample('1M').mean()

> **Explanation**: In the code above, `W` means one week, `5D` means `5` days, and `1M` means `1` month.
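As a rough comparison (using the `baidu_df` from above): `asfreq` only picks the single row that falls on each timestamp of the given frequency, while `resample` aggregates every row inside each period; a minimal sketch:

# asfreq: the value at each 5-day timestamp (NaN if there is no trading data on that day)
baidu_df.asfreq('5D')

# resample: the mean of all rows that fall inside each 5-day period
baidu_df.resample('5D').mean()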
  4. Time zone conversion

    • Get time zone information.

      import pytz
      
      pytz.common_timezones
      
    • `tz_localize()` method: localizes the datetime to a given time zone.

      code:

      baidu_df = baidu_df.tz_localize('Asia/Chongqing')
      baidu_df
      

      output:

    • `tz_convert()` method: converts the datetime index to another time zone.

      code:

      baidu_df.tz_convert('America/New_York')
      

      output:
