Column introduction
This column combines my own experience with Python tutorials distilled from internal materials. At 3-5 chapters a day, a comprehensive study of Python, through to practical development, can be completed in as little as one month. Keep at it!
For the full series, please visit the column: "Python Full Stack Tutorial (0 Basics)".
Detailed explanation of the basic application of Pandas (5)
Applications of DataFrames
Window Calculations

The `rolling` method of a `DataFrame` object lets us place data into a window and then apply functions to the data in that window. For example, if we have obtained recent price data for a stock and want to compute a 5-day and a 10-day moving average, we first set up the window and then perform the calculation. We can use the third-party library `pandas-datareader` to obtain data for a specified stock over a certain period, as follows.
Install the `pandas-datareader` third-party library:

```shell
pip install pandas-datareader
```
Use the `get_data_stooq` function provided by `pandas-datareader` to get recent stock data for Baidu (ticker: BIDU) from the Stooq website.

```python
import pandas_datareader as pdr

baidu_df = pdr.get_data_stooq('BIDU', start='2021-11-22', end='2021-12-7')
baidu_df.sort_index(inplace=True)
baidu_df
```
output:
The `DataFrame` above has five columns, `Open`, `High`, `Low`, `Close`, and `Volume`, which represent the stock's opening price, highest price, lowest price, closing price, and trading volume respectively. Next, we will perform window calculations on Baidu's stock data.
```python
baidu_df.rolling(5).mean()
```
output:
The data in the `Close` column above is the 5-day moving average we need. Of course, we can also compute the 5-day moving average directly on the `Series` object corresponding to the `Close` column.

```python
baidu_df.Close.rolling(5).mean()
```
output:
```
Date
2021-11-22        NaN
2021-11-23        NaN
2021-11-24        NaN
2021-11-26        NaN
2021-11-29    150.608
2021-11-30    151.014
2021-12-01    150.682
2021-12-02    150.196
2021-12-03    147.062
2021-12-06    146.534
2021-12-07    146.544
Name: Close, dtype: float64
```
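If the Stooq service is unavailable, the same window calculation can be reproduced on synthetic data. The sketch below uses made-up closing prices; the `Close` name simply mirrors the stock data above.

```python
import pandas as pd

# Made-up closing prices indexed by business day (hypothetical data)
close = pd.Series(
    [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    index=pd.date_range('2021-11-22', periods=6, freq='B'),
    name='Close'
)

# 5-day moving average: the first 4 values are NaN because
# the window is not yet full
ma5 = close.rolling(5).mean()
print(ma5)
```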
Judging Correlation
In statistics, we usually use covariance to measure the joint variability of two random variables. If larger values of a variable $X$ mainly correspond to larger values of another variable $Y$, and the same holds for smaller values, then the two variables tend to behave similarly and the covariance is positive. If larger values of one variable mostly correspond to smaller values of the other, the two variables tend to exhibit opposite behavior and the covariance is negative. Simply put, the sign of the covariance shows how two variables are related. Variance is a special case of covariance: the covariance of a variable with itself.
$$cov(X, Y) = E((X - \mu)(Y - \upsilon)) = E(X \cdot Y) - \mu\upsilon$$
If $X$ and $Y$ are statistically independent, their covariance is $0$, because when $X$ and $Y$ are independent:

$$E(X \cdot Y) = E(X) \cdot E(Y) = \mu\upsilon$$
The magnitude of the covariance depends on the scale of the variables and is usually hard to interpret, but its normalized form reveals the strength of the linear relationship between the two variables. In statistics, the Pearson product-moment correlation coefficient is this normalized form of covariance; it measures the degree of linear correlation between two variables $X$ and $Y$, with values between $-1$ and $1$.
$$\rho_{X,Y} = \frac{cov(X, Y)}{\sigma_{X} \sigma_{Y}}$$
Estimating the covariance and standard deviations from a sample gives the sample Pearson coefficient, usually denoted by the Greek letter $\rho$.
$$\rho = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$
We follow two steps when using the value of $\rho$ to judge the correlation between indicators.
- Determine whether the indicators are positively correlated, negatively correlated, or not correlated.
- When $ \rho \gt 0 $, it is considered that the variables are positively correlated, that is, the trends of the two are consistent.
- When $ \rho \lt 0 $, it is considered that the variables are negatively correlated, that is, the trends of the two are opposite.
- When $ \rho = 0 $, it is considered that the variables are not correlated, but it does not mean that the two indicators are statistically independent.
- Determine the degree of correlation between indicators.
- When the absolute value of $\rho$ is between $[0.6,1]$, the variables are considered to be strongly correlated.
- When the absolute value of $ \rho $ is between $ [0.1,0.6) $, it is considered that the variables are weakly correlated.
- When the absolute value of $\rho$ is between $[0,0.1)$, it is considered that there is no correlation between variables.
Pearson's correlation coefficient applies when:
- There is a linear relationship between the two variables and both are continuous data.
- The population of two variables is normally distributed, or nearly normal and unimodal.
- Observations of two variables are paired, and each pair of observations is independent of the other.
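The sample formula above can be checked against pandas directly. The following sketch uses made-up paired observations and compares the hand-computed coefficient with `Series.corr`:

```python
import numpy as np
import pandas as pd

# Made-up paired observations (hypothetical data)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Sample Pearson coefficient computed from the definition
rho = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum())
)

# pandas yields the same value through Series.corr (default method='pearson')
print(rho, x.corr(y))
```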
The `cov` and `corr` methods of the `DataFrame` object calculate the covariance and the correlation coefficient respectively. The first parameter of the `corr` method, `method`, defaults to `pearson`, which means the Pearson correlation coefficient is computed; you can also pass `kendall` or `spearman` to obtain the Kendall coefficient or the Spearman rank correlation coefficient.
Next, we read the famous Boston housing price dataset from a file named `boston_house_price.csv` to create a `DataFrame`, then use the `corr` method to see which of the `13` factors that may affect housing prices are positively or negatively correlated with the price. The code is as follows.

```python
import pandas as pd

boston_df = pd.read_csv('data/csv/boston_house_price.csv')
boston_df.corr()
```
> **Note**: If you need the CSV file used in the example above, you can get it from the Baidu cloud disk address below; the data is in the "Learning Data Analysis from Scratch" directory. Link: https://pan.baidu.com/s/1rQujl5RQn9R7PadB2Z5g_g , extraction code: e7b4.
output:
Spearman's correlation coefficient places less strict requirements on the data than Pearson's: as long as the observed values of the two variables are paired ranked data, or ranked data converted from continuous observations, the Spearman rank correlation coefficient can be used regardless of the two variables' population distributions or sample sizes. We calculate the Spearman correlation coefficient as follows.

```python
boston_df.corr('spearman')
```
output:
In Notebook or JupyterLab, we can add a color gradient to the `PRICE` column and use color to visually distinguish the columns that are negatively correlated, positively correlated, or uncorrelated with the house price. The `background_gradient` method of the `DataFrame` object's `style` property does exactly this. The code is as follows.

```python
boston_df.corr('spearman').style.background_gradient('RdYlBu', subset=['PRICE'])
```
The `RdYlBu` colormap used in the code above is shown below: the closer the correlation coefficient is to `1`, the closer the color is to red; the closer it is to `-1`, the closer the color is to blue; values near `0` are yellow.

```python
import matplotlib.pyplot as plt

plt.get_cmap('RdYlBu')
```
Application of Index
Let's take another look at the `Index` type, which provides indexing services for `Series` and `DataFrame` objects. The commonly used `Index` types are as follows.
Range Index (RangeIndex)
code:

```python
import numpy as np
import pandas as pd

sales_data = np.random.randint(400, 1000, 12)
month_index = pd.RangeIndex(1, 13, name='月份')
ser = pd.Series(data=sales_data, index=month_index)
ser
```
output:

```
月份
1     703
2     705
3     557
4     943
5     961
6     615
7     788
8     985
9     921
10    951
11    874
12    609
dtype: int64
```
Categorical Index (CategoricalIndex)
code:

```python
# `amount` was not defined in the original; the values here are chosen
# to match the output below
amount = [6, 6, 7, 6, 8, 6]
cate_index = pd.CategoricalIndex(
    ['苹果', '香蕉', '苹果', '苹果', '桃子', '香蕉'],
    ordered=True,
    categories=['苹果', '香蕉', '桃子']
)
ser = pd.Series(data=amount, index=cate_index)
ser
```
output:

```
苹果    6
香蕉    6
苹果    7
苹果    6
桃子    8
香蕉    6
dtype: int64
```
code:

```python
ser.groupby(level=0).sum()
```

output:

```
苹果    19
香蕉    12
桃子     8
dtype: int64
```
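Because the categorical index above was declared with `ordered=True` and an explicit category list, sorting follows the declared order (苹果 < 香蕉 < 桃子) rather than character order. A minimal sketch:

```python
import pandas as pd

cate_index = pd.CategoricalIndex(
    ['苹果', '香蕉', '苹果', '苹果', '桃子', '香蕉'],
    ordered=True,
    categories=['苹果', '香蕉', '桃子']
)
ser = pd.Series(data=[6, 6, 7, 6, 8, 6], index=cate_index)

# sort_index sorts by the declared category order, not alphabetically
sorted_ser = ser.sort_index()
print(sorted_ser)
```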
Multi-level index (MultiIndex)
code:

```python
ids = np.arange(1001, 1006)
sms = ['期中', '期末']
index = pd.MultiIndex.from_product((ids, sms), names=['学号', '学期'])
courses = ['语文', '数学', '英语']
scores = np.random.randint(60, 101, (10, 3))
df = pd.DataFrame(data=scores, columns=courses, index=index)
df
```
> **Explanation**: The code above uses the `from_product` class method of `MultiIndex`, which constructs a multi-level index from the Cartesian product of the `ids` and `sms` collections.
output:

```
           语文  数学  英语
学号  学期
1001  期中   93    77    60
      期末   93    98    84
1002  期中   64    78    71
      期末   70    71    97
1003  期中   72    88    97
      期末   99   100    63
1004  期中   80    71    61
      期末   91    62    72
1005  期中   82    95    67
      期末   84    78    86
```
code:

```python
# Compute each student's score: midterm counts 25%, final counts 75%
df.groupby(level=0).agg(lambda x: x.values[0] * 0.25 + x.values[1] * 0.75)
```
output:

```
       语文    数学    英语
学号
1001  93.00  92.75  78.00
1002  68.50  72.75  90.50
1003  92.25  97.00  71.50
1004  88.25  64.25  69.25
1005  83.50  82.25  81.25
```
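A multi-level index also makes it convenient to select data by level. The sketch below uses fixed, made-up scores for just two students so the results are reproducible:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    ([1001, 1002], ['期中', '期末']), names=['学号', '学期'])
# Hypothetical scores, two subjects only
df = pd.DataFrame(
    {'语文': [90, 80, 70, 60], '数学': [88, 92, 76, 84]}, index=index)

# Select all rows for student 1001 (the outer level)
mid_final_1001 = df.loc[1001]

# Select the 期末 (final exam) rows across all students via xs on the inner level
finals = df.xs('期末', level='学期')
print(mid_final_1001)
print(finals)
```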
Datetime Index (DatetimeIndex)
- Through the `date_range()` function, we can create a datetime index. The code is shown below.

code:

```python
pd.date_range('2021-1-1', '2021-6-1', periods=10)
```

output:

```
DatetimeIndex(['2021-01-01 00:00:00', '2021-01-17 18:40:00', '2021-02-03 13:20:00', '2021-02-20 08:00:00', '2021-03-09 02:40:00', '2021-03-25 21:20:00', '2021-04-11 16:00:00', '2021-04-28 10:40:00', '2021-05-15 05:20:00', '2021-06-01 00:00:00'], dtype='datetime64[ns]', freq=None)
```
code:

```python
pd.date_range('2021-1-1', '2021-6-1', freq='W')
```

output:

```
DatetimeIndex(['2021-01-03', '2021-01-10', '2021-01-17', '2021-01-24', '2021-01-31', '2021-02-07', '2021-02-14', '2021-02-21', '2021-02-28', '2021-03-07', '2021-03-14', '2021-03-21', '2021-03-28', '2021-04-04', '2021-04-11', '2021-04-18', '2021-04-25', '2021-05-02', '2021-05-09', '2021-05-16', '2021-05-23', '2021-05-30'], dtype='datetime64[ns]', freq='W-SUN')
```
- Through the `DateOffset` type, we can express a time difference and use it in arithmetic with a `DatetimeIndex`. The specific operations are as follows.

code:

```python
index = pd.date_range('2021-1-1', '2021-6-1', freq='W')
index - pd.DateOffset(days=2)
```
output:

```
DatetimeIndex(['2021-01-01', '2021-01-08', '2021-01-15', '2021-01-22', '2021-01-29', '2021-02-05', '2021-02-12', '2021-02-19', '2021-02-26', '2021-03-05', '2021-03-12', '2021-03-19', '2021-03-26', '2021-04-02', '2021-04-09', '2021-04-16', '2021-04-23', '2021-04-30', '2021-05-07', '2021-05-14', '2021-05-21', '2021-05-28'], dtype='datetime64[ns]', freq=None)
```
code:

```python
index + pd.DateOffset(days=2)
```

output:

```
DatetimeIndex(['2021-01-05', '2021-01-12', '2021-01-19', '2021-01-26', '2021-02-02', '2021-02-09', '2021-02-16', '2021-02-23', '2021-03-02', '2021-03-09', '2021-03-16', '2021-03-23', '2021-03-30', '2021-04-06', '2021-04-13', '2021-04-20', '2021-04-27', '2021-05-04', '2021-05-11', '2021-05-18', '2021-05-25', '2021-06-01'], dtype='datetime64[ns]', freq=None)
```
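`DateOffset` accepts other units besides days (months, weeks, and so on), and month arithmetic clips to the end of shorter months. A minimal sketch:

```python
import pandas as pd

ts = pd.Timestamp('2021-01-31')

# Adding one calendar month clips to the last day of February
print(ts + pd.DateOffset(months=1))

# Adding days is a plain shift
print(ts + pd.DateOffset(days=14))
```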
- Data can be manipulated using the `DatetimeIndex` type's related methods, including:
- `shift()` method: shifts the data forward or backward in time. Taking the Baidu stock data above as an example, the code is as follows.

code:

```python
baidu_df.shift(3, fill_value=0)
```
output:

code:

```python
baidu_df.shift(-1, fill_value=0)
```

output:
- `asfreq()` method: extracts the corresponding data at a specified time frequency. The code is as follows.

code:

```python
baidu_df.asfreq('5D')
```

output:

code:

```python
baidu_df.asfreq('5D', method='ffill')
```
- `resample()` method: resamples the data based on time, which is equivalent to grouping the data by time period. The code is shown below.

code:

```python
baidu_df.resample('1M').mean()
```

> **Note**: In the code above, `W` means one week, `5D` means `5` days, and `1M` means `1` month.
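If the stock data is not at hand, `resample` can be tried on a synthetic daily series; the sketch below averages made-up daily values into calendar months:

```python
import numpy as np
import pandas as pd

# 60 made-up daily values starting on 2021-01-01 (hypothetical data)
daily = pd.Series(
    np.arange(1.0, 61.0),
    index=pd.date_range('2021-01-01', periods=60, freq='D')
)

# Group the daily values into calendar months and average each group
# ('M' is the month-end alias; recent pandas versions also accept 'ME')
monthly = daily.resample('M').mean()
print(monthly)
```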
- Time zone conversion. First, get the list of available time zone names.

code:

```python
import pytz

pytz.common_timezones
```
- `tz_localize()` method: localizes naive datetimes to a time zone.

code:

```python
baidu_df = baidu_df.tz_localize('Asia/Chongqing')
baidu_df
```
output:
- `tz_convert()` method: converts to another time zone.

code:

```python
baidu_df.tz_convert('America/New_York')
```
output:
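The difference between the two methods can be verified on a small index: `tz_localize` only attaches a zone to naive timestamps, while `tz_convert` changes the wall-clock time to another zone. A minimal sketch (using `Asia/Shanghai`, which keeps the same UTC+8 offset as the `Asia/Chongqing` zone above):

```python
import pandas as pd

naive = pd.date_range('2021-12-01 12:00', periods=2, freq='D')

# Attach the Asia/Shanghai zone (UTC+8) without changing the wall clock
localized = naive.tz_localize('Asia/Shanghai')

# Convert to New York time (UTC-5 in December):
# 12:00 +08:00 becomes 23:00 -05:00 on the previous day
converted = localized.tz_convert('America/New_York')
print(localized)
print(converted)
```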