Python data analysis case 22 - financial news credibility analysis (linear regression, principal component regression, random forest regression)

This case suits the humanities and social sciences, finance, or journalism. Linear regression and principal component regression are enough for an undergraduate thesis; postgraduates can add random forest regression. Together, these methods are sufficient for a master's thesis in the humanities and social sciences.


Case background

There are eight independent variables ['Weibo platform credibility', 'professionalism', 'reliability', 'forwarding volume', 'Weibo content quality', 'timeliness', 'verification degree', 'interpersonal trust'] and one dependent variable: investment information credibility.

Investigate the effect of these eight independent variables on the dependent variable.


Reading the data

Import packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import statsmodels.formula.api as smf

plt.rcParams['font.sans-serif'] = 'SimHei'      # display Chinese characters
plt.rcParams['axes.unicode_minus'] = False      # display minus signs correctly
sns.set_style("darkgrid", {"font.sans-serif": ['KaiTi', 'Arial']})

Read the data. Mine is in SPSS .sav format, but pandas can read it too (pd.read_spss relies on the pyreadstat package).

# read the cleaned data
spss = pd.read_spss('数据2.sav')
#spss

Select the required variables and display the first five rows

data=spss[['微博平台可信','专业性','可信赖性','转发量','微博内容质量','时效性','验证程度','人际信任','投资信息可信度']]
data.head()

Get the column names

columns1=data.columns

Descriptive statistics: mean, variance, quantiles, etc.

data.describe()  # descriptive statistics

I don't have much data... 

Take out X and y

X=data.iloc[:,:-1]
y=data.iloc[:,-1]

 


Visualization

Draw boxplots for eight independent variables and one dependent variable 

column = data.columns.tolist()  # list of column names
fig = plt.figure(figsize=(10, 10), dpi=128)  # set figure size and resolution
for i in range(9):
    plt.subplot(3, 3, i + 1)  # 3x3 grid of subplots
    sns.boxplot(data=data[column[i]], orient="v", width=0.5)  # boxplot
    plt.ylabel(column[i], fontsize=16)
plt.tight_layout()
plt.show()

 

Draw kernel density plots

column = data.columns.tolist()  # list of column names
fig = plt.figure(figsize=(10, 10), dpi=128)  # set figure size and resolution
for i in range(9):
    plt.subplot(3, 3, i + 1)  # 3x3 grid of subplots
    sns.kdeplot(data=data[column[i]], color='blue', shade=True)  # kernel density plot
    plt.ylabel(column[i], fontsize=16)
plt.tight_layout()
plt.show()

Pairwise scatterplots between variables

sns.pairplot(data[column],diag_kind='kde')
#plt.savefig('散点图.jpg',dpi=256)

Correlation coefficient heatmap between variables

# Pearson correlation heatmap
fig, ax = plt.subplots(figsize=(14, 14))
sns.heatmap(data[column].corr(), annot=True, square=True, ax=ax)

It can be seen that the correlation coefficients between many of the x variables are quite high, so a linear regression model is likely to suffer from serious multicollinearity.
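To quantify this, a quick check (not part of the original analysis) is to compute variance inflation factors with statsmodels; VIF values far above 10 are usually read as serious collinearity:

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# Variance inflation factors for each x (rough multicollinearity check; VIF > 10 is a common warning sign)
X_vif = sm.add_constant(X)   # add an intercept column before computing VIFs
vif = pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(1, X_vif.shape[1])],
                index=X.columns, name='VIF')
print(vif.sort_values(ascending=False))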


Linear regression analysis

Import the package

import statsmodels.formula.api as smf

Print the regression formula

all_columns = "+".join(data.columns[:-1])
print('x variables: ' + all_columns)
formula = '投资信息可信度~' + all_columns
print('regression formula: ' + formula)

Fit the model

results = smf.ols(formula, data=data).fit()
results.summary()

 

It can be seen that the goodness of fit is quite high, about 84%, but looking at the p-value of each variable, almost none are significant at the 0.05 level.
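If you prefer to pull these numbers out of the results object directly rather than reading the summary table, statsmodels exposes them as attributes (this snippet is just a convenience, not part of the original post):

print(results.rsquared)          # goodness of fit R^2
print(results.pvalues.round(3))  # p-value of each coefficient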

This is most likely caused by multicollinearity.

You can also view the regression results like this:

print(results.summary().tables[1])

The coefficients and p-values are the same as above.

Logarithmic regression

A common trick in econometrics is to log-transform the data, which can reduce the influence of heteroscedasticity and similar problems. Let's try it: take the logarithm of every variable and rerun the regression:

# take the natural log of every column (requires all values to be positive)
data_log = pd.DataFrame(columns=columns1)
for i in columns1:
    data_log[i] = data[i].apply(np.log)

Fit the model

results_log = smf.ols(formula, data=data_log).fit()
results_log.summary()

It is not much better.... Only timeliness has a p-value below 0.05; the other variables are still not significant.

Next, use principal component regression.


Principal component regression

Principal component regression compresses the original variables into a few new ones, which takes care of the multicollinearity between them.

Each new variable is a linear combination of the old variables, but it is harder to interpret and loses its practical meaning in economics or journalism.

Import packages

from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut
from mpl_toolkits import mplot3d

Before running the principal component regression, first decide how many principal components to use:

model = PCA()
model.fit(X)
# variance explained by each principal component
model.explained_variance_
# proportion of variance explained by each principal component
model.explained_variance_ratio_
# visualization
plt.plot(model.explained_variance_ratio_.cumsum(), 'o-')
plt.xlabel('Principal Component')
plt.ylabel('Cumulative Proportion of Variance Explained')
plt.axhline(0.9, color='k', linestyle='--', linewidth=1)
plt.title('Cumulative PVE')

It can be seen that with 4 principal components, more than 90% of the variance of the original data is explained (the axis shows 3 because indexing starts from 0).

The following uses four principal components for the regression analysis:

Transform X into a matrix of 4 principal component scores and check the shape of the data.

model = PCA(n_components = 4)
model.fit(X)
X_train_pca = model.transform(X)
X_train_pca.shape

 

The shape is (25, 4): 25 is my sample size and 4 is the number of principal components (25 really is a small sample....).

Turn it into a data frame (the principal component score matrix):

columns = ['PC' + str(i) for i in range(1, 5)]
X_train_pca_df = pd.DataFrame(X_train_pca, columns=columns)
X_train_pca_df.head()

Only the first 5 rows are shown above.

It is also possible to compute the principal component loading matrix, which shows the relationship between the original variables and the principal components.

pca_loadings= pd.DataFrame(model.components_.T, columns=columns,index=columns1[:-1])
pca_loadings
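To make the "linear combination of the old variables" point concrete: each principal component score is just the mean-centered original variables multiplied by the corresponding column of loadings. A small sanity check (illustrative, not in the original post):

# each score column = centered X @ loading column, so the scores can be reproduced by hand
X_centered = X - X.mean()
scores_manual = X_centered.values @ pca_loadings.values   # (25, 8) @ (8, 4) -> (25, 4)
print(np.allclose(scores_manual, X_train_pca))            # expected: True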

 

Print the principal component regression formula

X_train_pca_df['财经信息可信度'] = y
all_columns = "+".join(X_train_pca_df.columns[:-1])
print('x variables: ' + all_columns)
formula = '财经信息可信度~' + all_columns
print('regression formula: ' + formula)

Fit the model

results = smf.ols(formula, data=X_train_pca_df).fit()
results.summary()

 

 Only the first and fourth principal components are significant.

Print to view:

print(results.summary().tables[1])

The principal component regression results are also mediocre.

These traditional statistical models, parametric and linear, come with too many restrictions and assumptions and are not easy to use.

As soon as you run into problems such as multicollinearity or heteroscedasticity, they fall apart.

The following uses a non-parametric regression method, random forest, which avoids the influence of multicollinearity and also produces a ranking of variable importance.


 

Random forest regression

Machine learning models such as random forests, support vector machines, and gradient boosting completely outclass the old traditional statistical models that are still standard in the humanities and social sciences, where those older models often do not perform very well.

Random forest regression is a very basic model in statistics, computer science, and related disciplines, but in the humanities and social sciences it is still considered advanced when it appears in a paper.

Standardize the data first

# standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
data = scaler.transform(data)
data[:5]

Take out X and y

airline_scale = data
airline_scale.shape

X=airline_scale[:,:-1]
y=airline_scale[:,-1]
X.shape,y.shape

Fit the model:

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=5000,
                              max_features=int(X.shape[1] / 3),
                              random_state=0)
model.fit(X, y)
model.score(X, y)

The above code builds a random forest with 5000 decision trees, fits it, and evaluates it on the training data.

The in-sample goodness of fit is as high as 95%!
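Note that this 95% is computed on the training data; with only 25 observations it is easy to overfit. A rough out-of-sample check (an extra step, not part of the original post) is the out-of-bag score of the forest:

# out-of-bag R^2 is a rough out-of-sample estimate (extra sanity check, not in the original analysis)
model_oob = RandomForestRegressor(n_estimators=5000,
                                  max_features=int(X.shape[1] / 3),
                                  oob_score=True, random_state=0)
model_oob.fit(X, y)
print('OOB R^2:', model_oob.oob_score_)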

View a plot of true and fitted values:

pred = model.predict(X)
plt.scatter(pred, y, alpha=0.6)
w = np.linspace(min(pred), max(pred), 100)
plt.plot(w, w)  # 45-degree reference line
plt.xlabel('pred')
plt.ylabel('y_true')
plt.title('Predicted vs. true financial information credibility')

 

The predictions are very close to the true values.

Calculate variable importance:

model.feature_importances_
sorted_index = model.feature_importances_.argsort()

plt.barh(range(X.shape[1]), model.feature_importances_[sorted_index])
plt.yticks(np.arange(X.shape[1]), columns1[:-1][sorted_index], fontsize=14)
plt.xlabel('Feature Importance', fontsize=14)
plt.ylabel('Feature')
plt.title('Ranking of feature importances', fontsize=24)
plt.tight_layout()

It can be seen that for the dependent variable, investment information credibility, the timeliness of the information, the forwarding volume, and platform credibility matter most, followed by variables such as interpersonal trust and reliability.

The above shows how important each variable is for y.
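Impurity-based importances can be distorted when features are strongly correlated, as they are here, so a permutation-importance cross-check (optional, not in the original post) is one way to confirm the ranking:

# cross-check the ranking with permutation importance (optional, not in the original post)
from sklearn.inspection import permutation_importance

perm = permutation_importance(model, X, y, n_repeats=30, random_state=0)
perm_series = pd.Series(perm.importances_mean, index=columns1[:-1])
print(perm_series.sort_values(ascending=False))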

The following partial dependence plots show how each variable affects y:

X2 = pd.DataFrame(X, columns=columns1[:-1])
from sklearn.inspection import PartialDependenceDisplay
#plt.figure(figsize=(12,12),dpi=100)
PartialDependenceDisplay.from_estimator(model, X2, ['时效性','转发量','微博平台可信'])
# draw the partial dependence plots

We can clearly see how changes in timeliness, forwarding volume, and Weibo platform credibility affect y. The relationship is clearly not linear: the overall direction is a positive correlation, but the strength of the effect is nonlinear, growing slowly at first, then quickly, then slowly again.


Summary

This time, I used three methods on a regression problem in the field of financial news. Each method has its advantages and disadvantages, but in terms of fit the machine learning model clearly performs best. Some students may wonder: the machine learning model has no p-values, so how can we judge significance?

In fact, machine learning methods have no parameter estimation or hypothesis testing, and therefore no p-values, so statistical inference is not possible; this is one of their shortcomings. Traditional linear regression and principal component regression do allow statistical inference, but here they fit the data poorly.

Which model you need depends on the purpose, but using a machine learning model in a humanities or social sciences paper is still considered innovative.


Origin: blog.csdn.net/weixin_46277779/article/details/129465523