Financial Mathematical Modeling - 2022 Greater Bay Area Cup Financial Mathematical Modeling, Question B (problem-solving ideas and some Python code)

Table of contents

I. Overview

2. Questions and Interpretation

1. Details of the competition

2. Interpretation of the competition questions

3. Problem-solving method

1. The first question

Code for the first question

2. The second question

Partial code for the second question

3. The third question

Code for the third question

4. The fourth question

4. Summary


I. Overview

This competition was our team's first attempt at financial mathematical modeling. Although we had practiced on Question A of the 2020 Greater Bay Area Cup beforehand, we still found the financial modeling questions strenuous. Fortunately, after 7 days of hard work, we completed all four questions. The official results were released last month, and our team won the third prize. For me personally that is already a very good result (I had no hope of winning a prize at first, hhhhh).

So let's take a look at these four questions :)

2. Questions and Interpretation

1. Details of the competition

Question B of this competition is not very difficult. Its overall title is "Construction of Asset Allocation Strategies Based on Macroeconomic Cycle".

There are four questions in total, as shown in the figure below:

In addition, the question setter provides the data sets needed to solve the problem. The data files look roughly as follows:

(Appendix 1: Macroeconomic Indicator Data (Questions 1 and 2))

(Appendix 2: Market Data of Major Asset Indexes (Questions 3 and 4))

2. Interpretation of the competition questions

The four questions can be divided into three parts:

The first part is the analysis and classification of macroeconomic data and the forecast of future economic conditions (questions 1 and 2).

The second part is a correlation analysis of the investable major asset index data (the third question).

The third part combines the first two. The conclusion about China's macroeconomic situation over the next 5 years from the first part and the correlation analysis of the major asset indexes from the second part are used, within the Merrill Lynch clock framework, to determine how to combine the major asset classes for investment (the fourth question).

Note that the data provided by the question setter do not all have to be used. They should be screened and reused according to the modeling method each team chooses.

3. Problem-solving method

1. First question

Topic analysis: For the first question, first fix the time range as 2001 to 2021. From the macroeconomic indicator data, select GDP (under national economic accounting) and the M2 money supply (under banking and currency) as the indicators for dividing economic states, and calculate the GDP growth rate and the inflation rate. The four economic states defined by the Merrill Lynch clock framework can be measured with these two indicators.

(Figure: Merrill Lynch clock framework)

Data and formulas used: Based on the ready-made data in Appendix 1, such as GDP, M2 (broad money supply) and the actual annual inflation rate, calculate the GDP growth rate and the M2 growth rate. Because of the constraints of the question, only data on macroeconomic performance in the roughly 20 years from 2001 to 2021 are used as division indicators.

If the GDP growth rate is greater than 10%, the year is judged as high growth, otherwise as low growth; if the inflation rate is greater than 6%, the year is judged as high inflation, otherwise as low inflation.
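The threshold rule above can be written as a small function. This is a minimal sketch rather than the author's code: the function name `clock_phase` and its parameter names are hypothetical, the phase mapping follows the usual Merrill Lynch clock quadrants, and the default thresholds are the 10% and 6% values stated in the text.

```python
def clock_phase(gdp_growth, inflation, growth_cut=10.0, inflation_cut=6.0):
    """Map one year's GDP growth and inflation (both in %) to a clock phase.

    Thresholds default to the 10% / 6% values quoted in the text.
    """
    high_growth = gdp_growth > growth_cut
    high_inflation = inflation > inflation_cut
    if high_growth and high_inflation:
        return "overheat"
    if high_growth:
        return "recovery"
    if high_inflation:
        return "stagflation"
    return "recession"


# Illustrative inputs only (not values from Appendix 1)
print(clock_phase(12.0, 3.0))   # recovery
print(clock_phase(7.0, 8.0))    # stagflation
```

Values exactly at a threshold count as "low" here, which matches the "greater than ..., otherwise ..." wording of the rule.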

After this calculation we have the GDP growth rate and the inflation rate. Using these two indicators, we wrote a Python program to classify the past 20 years according to the Merrill Lynch clock framework. The result is shown in the figure below:

Summarizing the classification of all years from 2001 to 2021: the years in the recession stage are 2002, 2009, 2014, 2017; the years in the recovery stage are 2001, 2003, 2004, 2005, 2006, 2007, 2008, 2010, 2011, 2012, 2013, 2018, 2021; the years in the stagflation stage are 2015, 2016, 2019, 2020; there were no years in the overheating stage.

Code for the first question:

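import pandas as pd


def classify(GDP, CPI):
    data1 = pd.DataFrame(GDP)   # annual data; column 1 is read as the GDP level
    data2 = pd.DataFrame(CPI)   # monthly data; column 6 is read as M2 year-on-year growth (%), judging by the variable name
    GDP_speed_up = []   # GDP growth rate per year (%)
    M_tongbi = []       # December M2 year-on-year growth per year (%)
    tongzhanglv = []    # inflation rate per year: M2 growth minus GDP growth

    drop = []           # recession years
    recover = []        # recovery years
    overheat = []       # overheating years
    stagflation = []    # stagflation years

    # Year-on-year GDP growth; the loop needs 22 rows, so row 0 is taken
    # to be the year before 2001.
    for i in range(21):
        then_year = data1.iloc[i+1, 1]
        ago_year = data1.iloc[i, 1]
        zhengzhang = ((then_year - ago_year) / ago_year) * 100
        GDP_speed_up.append(zhengzhang)

    # December observation of each year (rows 11, 23, ..., 251).
    for j in range(11, 253, 12):
        M_tb = data2.iloc[j, 6]
        M_tongbi.append(M_tb)

    # Pair each year's GDP growth with the same year's M2 growth.
    # NOTE: the code as first posted used a nested loop with `break`, which
    # subtracted every year's GDP growth from the first M2 value only, so
    # results produced by that version may differ.
    for k, z in zip(GDP_speed_up, M_tongbi):
        tongzhanglv.append(z - k)
    print(GDP_speed_up)
    print(tongzhanglv)

    year = list(range(2001, 2022))
    # NOTE: the growth threshold in this code is 15, while the text states 10%.
    for a in range(len(GDP_speed_up)):
        if GDP_speed_up[a] < 15 and tongzhanglv[a] < 6:
            drop.append(year[a])
        elif GDP_speed_up[a] > 15 and tongzhanglv[a] < 6:
            recover.append(year[a])
        elif GDP_speed_up[a] > 15 and tongzhanglv[a] > 6:
            overheat.append(year[a])
        else:
            stagflation.append(year[a])
    return drop, recover, overheat, stagflation


def divide(drop_1, recover_1, overhead_1, staflation_1):
    # Unfinished helper: it only builds and prints a date index.
    index = pd.date_range('2001', '2021')
    print(index)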
    

2. The second question

Topic analysis: Starting from the GDP growth rate calculated in the first question, the consumer price index (CPI) is used to calculate the inflation rate and the monthly demand deposit interest rate is used for the interest rate, in order to simulate and forecast China's macroeconomic development over the next five years. An LSTM model predicts economic growth, inflation and interest rates for the next five years, using a time-series sliding window with one year as the time slice.

Since the second question asks us to predict future data, we model it with the LSTM algorithm. As with other machine learning methods such as CNNs, building the mathematical model and writing the code involves three steps: data preprocessing, model training, and prediction.

Data preprocessing: Since these are time series data, their timestamps must be unified. On inspection, all series are complete (no missing values) from 1988 onward, so the data from 1988 to 2021 are selected. In addition, because the GDP and inflation-rate data are dated December of each year, we slice the M2 money supply data and the RMB demand deposit interest rate and extract the December value of each year, so that all series share the same timestamps. The DataFrame is then saved as a two-dimensional array and differenced, the time series is converted into a supervised learning set, the data set is split into a training set and a test set, and MinMaxScaler is used to scale the data to [-1,1] to speed up convergence (note that the code excerpt for this question uses feature_range=(0,1)).
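The preprocessing steps just described (differencing, conversion to a supervised learning set, chronological split, scaling) can be sketched in plain Python. This is an illustration under assumptions, not the author's pipeline: the helper names `difference`, `to_supervised` and `scale` are hypothetical, the series values are made up, and a window of one step is assumed.

```python
def difference(series):
    """First-order differencing: x[t] - x[t-1]."""
    return [series[i] - series[i - 1] for i in range(1, len(series))]


def to_supervised(series, look_back=1):
    """Turn a series into (window, next value) pairs for supervised learning."""
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back])
    return X, y


def scale(values, lo, hi, new_min=-1.0, new_max=1.0):
    """Min-max scale values from [lo, hi] to [new_min, new_max]."""
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Made-up annual series, for illustration only
raw = [10.0, 12.0, 15.0, 14.0, 18.0, 21.0]
diffed = difference(raw)                 # [2.0, 3.0, -1.0, 4.0, 3.0]
X, y = to_supervised(diffed, look_back=1)

# Split chronologically, then take the scaling range from the training part only
split = int(len(X) * 0.75)
train_y, test_y = y[:split], y[split:]
lo, hi = min(train_y), max(train_y)
print(scale(train_y, lo, hi))
```

Taking `lo` and `hi` from the training part and reusing them for the test part keeps information from the test period out of the scaling step.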

Model training (method selection): Model training requires an algorithm with memory. A recurrent neural network (RNN) can be trained on ordered data, but it suffers from the vanishing gradient problem and can therefore only realize short-term memory. To solve this problem, we use LSTM (Long Short-Term Memory), a variant of the RNN. It can learn long-term dependencies, which makes it well suited to our target problem.

Forecasting: In terms of forecasting, we adopt sliding window forecasting.
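Sliding-window forecasting can be sketched as follows: predict one step from the most recent window, append the prediction to the window, drop the oldest value, and repeat. The function `sliding_window_forecast` and the stand-in `mean_model` are hypothetical names used only for illustration; in the article the one-step predictor would be the trained LSTM.

```python
def sliding_window_forecast(history, predict_one, look_back, steps):
    """Forecast `steps` values ahead by feeding each prediction back in."""
    window = list(history[-look_back:])
    forecasts = []
    for _ in range(steps):
        next_value = predict_one(window)
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the newest prediction
        window = window[1:] + [next_value]
    return forecasts


def mean_model(window):
    """Stand-in one-step predictor (NOT an LSTM): the mean of the window."""
    return sum(window) / len(window)


# Made-up history; five steps correspond to a five-year horizon with annual data
print(sliding_window_forecast([2.0, 4.0, 6.0], mean_model, look_back=2, steps=5))
```

Because later steps are built on earlier predictions, errors accumulate along the horizon, which is worth keeping in mind when reading a five-year forecast.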

Result:

Combining the calculated results with the Merrill Lynch clock, China's economic growth over the next five years is forecast to be high, while inflation and interest rates are forecast to be low. We therefore judge that China will be in the recovery stage for the next five years.

Partial code for the second question:

import numpy
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense

# create_dataset() and look_back are defined elsewhere in the author's code
# and are not shown in this excerpt.

# Set the random seed
numpy.random.seed(7)

df = pd.read_excel('./data/GDP_TZ.xlsx')

df.drop(['tongzhanglv'], axis=1, inplace=True)
df['Date'] = pd.to_datetime(df['Date'], format='%Y')
# print(data.head())
df = df.set_index(['Date'], drop=True)
dataframe = pd.DataFrame(df)
# print(dataframe)

dataset = dataframe.values
dataset = dataset.astype("float64")
# print(dataframe.head())

# Outlier detection with a box plot
fig = plt.figure(1, figsize=(9, 6))
ax = fig.add_subplot(111)
bp = ax.boxplot(dataset)
# print(bp['fliers'][0].get_ydata())
# plt.show()

# Scale the data to the range 0-1
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)

# Split into training and test sets
train_size = int(len(dataset) * 0.75)
test_size = len(dataset) - train_size
# print(test_size)

train, test = dataset[0:train_size, :], dataset[train_size:len(dataset), :]
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
print(trainX)
print(trainY)

# Build an LSTM model: 4 LSTM units followed by one dense output unit
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# Reshape the input to (samples, time steps, features)
trainX = trainX.reshape(trainX.shape[0], 1, trainX.shape[1])
# trainY= trainY.reshape(trainY.shape[0],1,trainY.shape[0])
model.fit(trainX, trainY, epochs=100, batch_size=3, verbose=2)

3. The third question:

Problem analysis: Using the four economic states from the first question (divided according to the Merrill Lynch clock framework), calculate the risk-return characteristics (expected return, standard deviation of returns, Sharpe ratio) of the major asset indexes in Appendix 2 under each economic state. Use the Pearson coefficient to calculate the correlation between the major asset indexes, and use a heat map to show how strong the correlations are.
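The three risk-return measures can be sketched with the standard library alone. The assumptions are mine and are not taken from the competition material: daily simple returns, 252 trading days per year, and an annual risk-free rate of 3% (mirroring the `rf = 3` passed in the author's code). The function name `risk_return_summary` and the price list are hypothetical.

```python
import math
import statistics


def risk_return_summary(prices, rf_annual=0.03, period=252):
    """Return (mean daily return, daily return std, annualized Sharpe ratio)."""
    returns = [prices[i] / prices[i - 1] - 1 for i in range(1, len(prices))]
    mean_return = statistics.mean(returns)
    std_return = statistics.stdev(returns)          # sample standard deviation
    excess = mean_return - rf_annual / period       # daily excess return
    sharpe = excess / std_return * math.sqrt(period)
    return mean_return, std_return, sharpe


# Made-up index levels, for illustration only
mean_r, std_r, sharpe = risk_return_summary([100.0, 101.0, 100.5, 102.0, 103.0])
print(mean_r, std_r, sharpe)
```

In practice the same quantities would be computed per economic state by first selecting the rows of the years that fall in that state.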

Pearson coefficient:
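For reference, the standard definition of the Pearson correlation coefficient between two index series $x$ and $y$ with $n$ observations is given below. The symbols are the usual textbook ones, not notation taken from the competition material.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Here $\bar{x}$ and $\bar{y}$ are the sample means. The value lies between -1 and 1, and values near 1 indicate that the two indexes tend to move together.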

Compute the results and plot a heatmap:

(e.g., risk-return characteristics in a recessionary economy)

(heat map)

The calculation shows that the correlation coefficient between ChinaBond-Comprehensive Wealth (3-5 years) and ChinaBond-Comprehensive Wealth (7-10 years) is the highest. The results also show that the correlation between major asset indexes of the same category is higher than the correlation between indexes of different categories.

Code for the third question:

import math

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# path3 is the path to the Appendix 2 Excel file. It is defined elsewhere in
# the author's code and is not shown in this excerpt.


def cal_sharp(daily_returns: pd.Series, rf=0, period=252):
    """Sharpe ratio: (expected portfolio return - risk-free return) / portfolio volatility."""
    # Mean daily return minus the daily risk-free return
    Er = daily_returns.sum() / len(daily_returns) - rf / period
    sharp = Er / daily_returns.std() * math.sqrt(period)
    return sharp


def divide():
    """Load the major asset index data, indexed by date."""
    data_dalei = pd.read_excel(path3)
    data_dalei.set_index('time', inplace=True)
    # Column order relied on below:
    # stock1..stock4 (0-3), goods1..goods2 (4-5), bond1..bond3 (6-8), cash1 (9).
    # NOTE: the original called data_dalei.reindex(columns=stats) without
    # assigning the result, so it had no effect; the file's own order is kept.
    return data_dalei


def risk_return(prices, rf=3):
    """Return (expected profit, Sharpe ratio, standard deviation) for one index in one year."""
    # Expected profit: change in the index level over the period
    expect_profit = prices.iloc[-1] - prices.iloc[0]
    # Sharpe ratio from daily percentage returns
    returns = (prices.diff() / prices.shift(1)).iloc[1:] * 100
    sharp = cal_sharp(returns, rf=rf)
    # Standard deviation of the daily changes in the index level.
    # NOTE: the original divided by len(prices)-1 in drop() and by len(prices)
    # elsewhere; the number of daily changes is used consistently here.
    biaozhuncha = prices.diff().iloc[1:].std(ddof=0)
    return expect_profit, sharp, biaozhuncha


def summarize(years, columns_for_year, out_path):
    """Compute the three measures per (year, index) and write them to Excel."""
    data_dalei = divide()
    rows = []
    for year in years:
        data_year = data_dalei.loc[str(year)]   # all rows of that year
        for col in columns_for_year(int(year)):
            # Returns are computed per index; the original accumulated them
            # across indexes within a year, which mixed the series.
            rows.append(risk_return(data_year.iloc[:, col]))
    # Columns 0, 1, 2: expected profit, Sharpe ratio, standard deviation
    pd.DataFrame(rows).to_excel(out_path)


def drop(drop_1):
    """Recession years: bond indexes (columns 6-8)."""
    summarize(drop_1, lambda year: range(6, 9), './data/dropnew.xlsx')


def recover(recover_1):
    """Recovery years: stock indexes (columns 0-3), with the original special cases."""
    def columns(year):
        # Special cases kept exactly as in the original code
        if year == 2001:
            return [5]
        if year == 2003:
            return [1]
        if year == 2004:
            return range(0, 2)
        return range(0, 4)
    summarize(recover_1, columns, './data/recovernew.xlsx')


def overhead(overhead_1):
    # No year was classified as overheating, so there is nothing to compute.
    pass


def staflation(staflation_1):
    """Stagflation years: cash index (column 9)."""
    summarize(staflation_1, lambda year: [9], './data/stagflationnew.xlsx')


def xiangguanxing():
    """Pearson correlation matrix of the indexes, drawn as a heat map."""
    s = divide().corr()   # Pearson is the default method
    print(s)
    fig, ax = plt.subplots(1, 1)
    sns.heatmap(s, vmax=1, square=True, annot=True, ax=ax)
    plt.show()

4. The fourth question

Topic analysis: Combine the 4 stock indexes, 3 bond indexes, 2 commodity indexes and 1 money market fund into 4 × 3 × 2 × 1 = 24 combinations, and screen them using the correlation heat map from the third question and the risk-return characteristics of the past five years. The portfolio judged better suited to the recovery stage forecast for the next five years, with high returns, is: SSE 50, CSI 300, South China Commodity Index, ChinaBond-Comprehensive Wealth (3-5 years, 7-10 years), and money market funds. The LSTM algorithm is then used to predict the risk-return characteristics of this portfolio (same method as in the second question).
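The count of 24 can be reproduced by enumerating the combinations. This is a sketch under one reading of the text, namely that each combination takes one index from each asset class. The labels are placeholders borrowed from the column names in the third question's code, not the real index names.

```python
from itertools import product

# Placeholder labels following the column names used in the third question's code
stocks = ["stock1", "stock2", "stock3", "stock4"]
bonds = ["bond1", "bond2", "bond3"]
goods = ["goods1", "goods2"]
cash = ["cash1"]

# One index from each asset class per combination
portfolios = list(product(stocks, bonds, goods, cash))
print(len(portfolios))   # 24
print(portfolios[0])     # ('stock1', 'bond1', 'goods1', 'cash1')
```

Each tuple can then be scored with the risk-return measures and the correlation matrix from the third question before choosing what to hold.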

(Since the fourth question is a comprehensive application of the first three questions, I won't go into details here)

Forecast result:

(e.g., CSI 300 risk-return characteristics prediction results for the next five years)

4. Summary

Question B mainly draws on data mining and data analysis. Future data are predicted with the LSTM algorithm from machine learning. As a variant of the RNN, LSTM inherits most of the RNN's characteristics while addressing the vanishing gradient problem, in which gradients shrink step by step during backpropagation, so it can retain information over long sequences. Combined with sliding-window forecasting, it can therefore be used to predict data over a longer future period.

This competition was very rewarding for me and served as a learning experience in the unfamiliar field of financial data modeling. Overall, the difficulty was not too high, the 8-day duration left plenty of time, and the experience was good, so it is well suited as practice for financial mathematical modeling.

 

 

 

 

 

 


Origin blog.csdn.net/weixin_52135595/article/details/128659211