Computer science master's project: stock forecasting and analysis based on time series


1 Introduction

Hi everyone, this is Senior Dancheng. Today I will introduce a big data project.

Big data analysis: stock prediction and analysis based on time series

2 Origin of time series

Any discussion of time series analysis has to mention the AR, MA, ARMA, and ARIMA models. What these four methods have in common is that they step away from analysing the components of variation and start from the time series itself, seeking a quantitative relationship between earlier and later data. The resulting model takes the earlier data as the independent variables and the later data as the dependent variable, and is then used for prediction. To use a popular metaphor: who you were three days ago, the day before yesterday, and yesterday made you who you are today.

2.1 Names of the four models:

  • AR model: Autoregressive model;
  • MA model: Moving Average model;
  • ARMA model: Autoregressive Moving Average model;
  • ARIMA model: Autoregressive Integrated Moving Average model (an ARMA model applied to differenced data).

AR model:

If every value of a time series can be expressed by the following regression equation, the series follows a p-order autoregressive process, written AR(p). In its standard form:

$$X_t = c + \varphi_1 X_{t-1} + \varphi_2 X_{t-2} + \dots + \varphi_p X_{t-p} + \varepsilon_t$$

Here $c$ is a constant, $\varphi_1, \dots, \varphi_p$ are the autoregressive coefficients, and $\varepsilon_t$ is white noise.

The AR model uses the correlation between earlier and later values (autocorrelation) to build a regression equation that links them and can then be used for prediction, hence the name autoregressive process. White noise needs explaining here. It can be understood as the random fluctuation in the values of a time series. These fluctuations have a mean of zero, so over many observations they tend to cancel out. For example, an automated biscuit production line is set to pack 500 grams per package, but because of random factors the packages cannot weigh exactly 500 grams. The weights fluctuate around 500 grams, and the deviations tend to cancel each other out, averaging to 0.
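
As a minimal sketch of the idea above (not the author's code), the snippet below simulates an AR(1) series driven by zero-mean white noise and then recovers its coefficient by ordinary least squares on the lagged values. The helper names `simulate_ar1` and `fit_ar`, the coefficient 0.7, and the series length are illustrative assumptions.

```python
import numpy as np

def simulate_ar1(phi, n, seed=0):
    """Simulate x[t] = phi * x[t-1] + noise[t] with standard normal white noise."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, 1.0, size=n)  # zero-mean random fluctuations
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x

def fit_ar(x, p):
    """Least-squares fit of an AR(p) model; returns [c, phi_1, ..., phi_p]."""
    y = x[p:]  # later values (dependent variable)
    # Column k holds the values lagged by k steps (independent variables)
    lags = [x[p - k : len(x) - k] for k in range(1, p + 1)]
    design = np.column_stack([np.ones(len(y))] + lags)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

series = simulate_ar1(phi=0.7, n=5000)   # hypothetical example series
coef = fit_ar(series, p=1)
next_value = coef[0] + coef[1] * series[-1]  # one-step-ahead prediction
print("estimated [c, phi_1]:", coef)
print("one-step forecast:", next_value)
```

With enough data the estimated `phi_1` should land close to the value used in the simulation, which is the sense in which earlier values "explain" later ones.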

3 Data preview

import pandas as pd
import matplotlib.pyplot as plt
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline

# Prepare two lists
list1 = [6,4,8]
list2 = [8,6,10]

# Convert list1 and list2 to Series
list1_series = pd.Series(list1)
print(list1_series)
list2_series = pd.Series(list2)
print(list2_series)

# Combine the two Series into a DataFrame with columns 'Col A' and 'Col B'
frame = {'Col A': list1_series, 'Col B': list2_series}
result = pd.DataFrame(frame)

result.plot()
plt.show()


4 Theoretical formulas

4.1 Covariance

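First, let's look at the formula for the covariance of two variables X and Y:

$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$

For n paired samples it is estimated as:

$$\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$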
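
To make the covariance concrete, here is a small sketch that computes it for the two lists from the data preview, once by hand and once with `numpy.cov`. It assumes the sample covariance (denominator n - 1), since the original formula images are not available; the helper name `sample_covariance` is illustrative.

```python
import numpy as np

list1 = [6, 4, 8]
list2 = [8, 6, 10]

def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator (an assumption here)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

cov_manual = sample_covariance(list1, list2)
# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(list1, list2)
cov_numpy = float(np.cov(list1, list2)[0, 1])
print(cov_manual, cov_numpy)  # 4.0 4.0
```

The deviations from the means are (0, -2, 2) for both lists, so the products sum to 8 and dividing by n - 1 = 2 gives 4.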

4.2 Correlation coefficient

Once Cov has been calculated, the correlation coefficient can be computed. Its value lies between -1 and 1: the closer it is to 1, the stronger the positive correlation; the closer it is to -1, the stronger the negative correlation; 0 means no linear correlation. The formula is as follows:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
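
A short sketch of the correlation coefficient as covariance divided by the product of the two standard deviations, using the same two lists as the data preview. It is checked against pandas' built-in `Series.corr`; the variable names are illustrative.

```python
import pandas as pd

a = pd.Series([6, 4, 8])
b = pd.Series([8, 6, 10])

cov_ab = a.cov(b)                          # sample covariance
rho_manual = cov_ab / (a.std() * b.std())  # Cov / (sigma_a * sigma_b)
rho_pandas = a.corr(b)                     # Pearson correlation from pandas
rho_negative = a.corr(-b)                  # flipping the sign flips the correlation

print(rho_manual, rho_pandas, rho_negative)
```

Because `b` is just `a` shifted up by 2, the two series move together perfectly and the coefficient is 1 (up to floating-point rounding); negating one of them gives -1.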

4.3 scikit-learn calculates correlation

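import seaborn as sns
iris = sns.load_dataset('iris')  # assumed loader; the original loading step is not shown
# Pairwise scatter-plot matrix of the relationships between features
sns.pairplot(iris, hue='species', height=3, aspect=1)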


Andrews curves are a way to visualize multidimensional data by mapping each observation to a function.
Each multivariate observation becomes a curve whose Fourier series coefficients are the observation's feature values, which is useful for detecting outliers in time series data.

plt.subplots(figsize = (10,8))
pd.plotting.andrews_curves(iris, 'species', colormap='cool')

The classic iris data set is used as the example here.

Setosa, versicolor, and virginica are three species of iris. The curves show that the species overlap on some features but are still separable to a certain degree.

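# Finally, use a heat map to find the correlations between features in the data set; high positive or negative values indicate strongly correlated features:

fig = plt.gcf()
fig.set_size_inches(10,6)
# Drop the non-numeric 'species' column before computing correlations
ax = sns.heatmap(iris.drop(columns='species').corr(), annot=True, cmap='GnBu',
                 linewidths=1, linecolor='k', square=True, vmin=-1, vmax=1,
                 cbar_kws={"orientation": "vertical"}, cbar=True)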


5 Time series analysis of financial data

This section mainly covers: calculating changes in a time series, resampling a time series, and window functions.

5.1 Data overview

import pandas as pd

tm = pd.read_csv('/home/kesci/input/gupiao_us9955/Close.csv')
tm.head()


The meaning of each indicator in the data:

  • AAPL.O | Apple Stock
  • MSFT.O | Microsoft Stock
  • INTC.O | Intel Stock
  • AMZN.O | Amazon Stock
  • GS.N | Goldman Sachs Stock
  • SPY | SPDR S&P 500 ETF Trust
  • .SPX | S&P 500 Index
  • .VIX | VIX Volatility Index
  • EUR= | EUR/USD Exchange Rate
  • XAU= | Gold Price
  • GDX | VanEck Vectors Gold Miners ETF
  • GLD | SPDR Gold Trust

Overview of price (or indicator) trends over the 8-year period


5.2 Calculation of sequence changes

  • Calculate the daily difference of each indicator (each day's value minus the previous day's)
  • Calculate pct_change: the growth rate, (current value - previous value) / previous value
  • Calculate the mean of pct_change for each indicator
  • Plot the means to see which indicator has the highest average growth rate
  • Calculate the growth rate in continuous time (the logarithm of today's price divided by yesterday's price)

Calculate the daily difference of each indicator (each day's value minus the previous day's):
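
The original code for this step is not available, so the following is only a sketch of one way to do it with `DataFrame.diff`. The small `prices` frame and its column names `AAA` and `BBB` are hypothetical stand-ins for the `tm` data.

```python
import pandas as pd

# Hypothetical closing prices standing in for the real data set
prices = pd.DataFrame(
    {"AAA": [100.0, 102.0, 101.0, 105.0], "BBB": [50.0, 49.0, 52.0, 52.0]},
    index=pd.to_datetime(["2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"]),
)

# Each row minus the previous row; the first row has no predecessor and becomes NaN
daily_diff = prices.diff()
print(daily_diff)
```

`prices.diff()` is equivalent to `prices - prices.shift(1)`, which makes the "today minus yesterday" reading explicit.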

Calculate pct_change: the growth rate is (current value - previous value) / previous value:
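
As a sketch of this step (the original code is not available), `DataFrame.pct_change` computes exactly this ratio for every column. The `prices` frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical closing prices standing in for the real data set
prices = pd.DataFrame(
    {"AAA": [100.0, 102.0, 101.0, 105.0], "BBB": [50.0, 49.0, 52.0, 52.0]}
)

# (current value - previous value) / previous value, column by column
growth = prices.pct_change()

# The same quantity written out by hand, for comparison
growth_manual = (prices - prices.shift(1)) / prices.shift(1)

print(growth.round(4))
```

The first row is NaN because there is no previous value to compare against.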

Calculate the mean of pct_change for each indicator and plot it to see which indicator has the highest average growth rate:
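
A minimal sketch of the averaging step, again on hypothetical data rather than the real price file: take the column-wise mean of `pct_change` and pick the largest.

```python
import pandas as pd

# Hypothetical prices: AAA grows 10% per step, BBB goes up 2% and then down 2%
prices = pd.DataFrame(
    {"AAA": [100.0, 110.0, 121.0], "BBB": [50.0, 51.0, 49.98]}
)

# Average growth rate per column (NaN in the first row is skipped by mean)
mean_growth = prices.pct_change().mean()

# Column label with the highest average growth rate
best = mean_growth.idxmax()

print(mean_growth)
print("highest average growth:", best)
```

In a notebook, `mean_growth.plot(kind="bar")` would draw the comparison as a bar chart.
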
Apart from the Volatility Index (.VIX), which has the highest average growth rate, the leader is Amazon's stock price! Bezos is simply the most powerful bald guy in the universe.

Calculate the growth rate in continuous time, i.e. the log return (this needs the logarithm of today's price divided by yesterday's price):

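import numpy as np

# Previous day's data: shift(1) moves every row down by one
# (assumes tm holds only numeric price columns, e.g. with the date as the index)
tm.shift(1).head()

# Log growth rate: ln(today's price / yesterday's price)
rets = np.log(tm/tm.shift(1))
print(rets.tail().round(3))

# A small cumsum example:
print('Result of the small example:', np.cumsum([1,2,3,4]))

# Cumulative sum of the log growth rates; apply e^x to undo the log
rets.cumsum().apply(np.exp).plot(figsize=(10,6))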

The plot above shows the cumulative growth in continuous time: taking Amazon as an example, 1 yuan in 2010 had grown to more than 10 yuan by 2018.
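
Why summing log growth rates and then applying e^x gives the growth of 1 unit of money can be checked on a tiny example. Log returns add up, because the sum of ln(p_t / p_{t-1}) telescopes to ln(p_T / p_0). The price series below is hypothetical.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 120.0])  # hypothetical prices

# Log return for each step: ln(today's price / yesterday's price)
log_rets = np.log(prices / prices.shift(1))

# Summing log returns and exponentiating recovers price / initial price
growth = np.exp(log_rets.cumsum())

# Direct calculation for comparison
growth_direct = prices / prices.iloc[0]

print(growth.round(4).tolist())
```

The last value is 1.2: one unit invested at the start is worth 1.2 units at the end, matching 120 / 100.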


Origin blog.csdn.net/HUXINY/article/details/133384880