Forest biomass inversion based on random forest algorithm [Matlab Python]

1. Significance and technical route

  Methods for estimating forest biomass fall roughly into two categories. The first is the traditional approach, which relies mainly on field sampling surveys to estimate forest biomass. It demands considerable manpower and material resources, and the resulting data lack spatial continuity and cannot reflect the influence of environmental factors on the estimates. The second is estimation by remote sensing. Remote sensing image bands are spatially continuous, macroscopic, fast and repeatable, which provides the conditions needed to study forest biomass and its spatial distribution; the estimates are therefore not only close to reality but also provide intuitive information on the spatial distribution of forest biomass.

  The route used here is the latter. The remote sensing estimation approach uses Landsat-8 observations: biomass fitting models are built and trained from the individual bands with several algorithms, the model with the best fit is selected, and that model is then used to invert forest aboveground biomass (AGB).

 2. Variable extraction

Use the Band Math tool in ENVI to calculate the five selected vegetation indices according to their respective formulas. The five indices chosen here are ARVI, NDVI, SR, OSAVI and VIGreen, giving five index images, as shown below.

ARVI, NDVI, OSAVI, SR and VIGreen were selected as the vegetation index parameters for the inversion.
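The original formula images are not reproduced here. As a reference, a minimal Matlab sketch of the index calculations using the standard definitions of these five indices (the same expressions can be entered in ENVI's Band Math) might look as follows; the formulas are assumed, while the band order follows the layer stack used in the sampling script later in this article:

% Standard vegetation index formulas (assumed definitions; verify against the formulas used in ENVI)
Multi_data = readgeoraster("Multispec_ref_layer_stack_clip_corre_allstudyarea_resam_AGB.tif");
Nir   = double(Multi_data(:,:,3));
Red   = double(Multi_data(:,:,4));
Green = double(Multi_data(:,:,5));
Blue  = double(Multi_data(:,:,6));

NDVI    = (Nir - Red) ./ (Nir + Red);                         % Normalized Difference Vegetation Index
SR      = Nir ./ Red;                                         % Simple Ratio
OSAVI   = (Nir - Red) ./ (Nir + Red + 0.16);                  % Optimized Soil-Adjusted VI (soil factor 0.16)
ARVI    = (Nir - (2*Red - Blue)) ./ (Nir + (2*Red - Blue));   % Atmospherically Resistant VI (gamma = 1)
VIGreen = (Green - Red) ./ (Green + Red);                     % Green vegetation index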

Use ENVI software to calculate texture features of the remote sensing image: Mean, Variance, Homogeneity, Contrast, Dissimilarity, Entropy, Second Moment and Correlation, giving eight texture images.
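These eight measures match ENVI's grey-level co-occurrence (GLCM) texture set. A rough Matlab sketch of how such measures can be computed for one band is shown below; the quantization level and offset are illustrative assumptions, not the settings used in ENVI:

% Illustrative GLCM texture measures for the NIR band (Image Processing Toolbox)
Nir  = double(Multi_data(:,:,3));                  % NIR band from the layer stack read above
glcm = graycomatrix(mat2gray(Nir), 'NumLevels', 64, 'Offset', [0 1], 'Symmetric', true);
p    = glcm / sum(glcm(:));                        % normalize to probabilities
[I, J] = ndgrid(1:size(p,1), 1:size(p,2));         % row and column grey levels
mu   = sum(p(:) .* I(:));                          % GLCM mean
vari = sum(p(:) .* (I(:) - mu).^2);                % variance
hom  = sum(p(:) ./ (1 + (I(:) - J(:)).^2));        % homogeneity
con  = sum(p(:) .* (I(:) - J(:)).^2);              % contrast
dis  = sum(p(:) .* abs(I(:) - J(:)));              % dissimilarity
ent  = -sum(p(p > 0) .* log(p(p > 0)));            % entropy
asm  = sum(p(:).^2);                               % second moment (energy)
corr_g = sum(p(:) .* (I(:) - mu) .* (J(:) - mu)) / vari;  % correlation (symmetric GLCM)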

Only two of the eight texture images are shown here: Mean and Homogeneity.

 3. Sample extraction

Use Matlab to extract corresponding sample points (the same pixel locations) from the preprocessed remote sensing images, the texture images and the measured AGB raster. Here, 556 points were selected to form the training and validation sets for model training.

Extracted points

 

clear;
% Read the measured AGB raster and the vegetation index images
AGB_data=readgeoraster("AGB_lidarplots.tif");
ARVI_data=readgeoraster("ARVI.tif");
VIGreen_data=readgeoraster("VIGreen.tif");
NDVI_data=readgeoraster("NDVI.tif");
SR_data=readgeoraster("SR.tif");
OSAVI_data=readgeoraster("OSAVI.tif");


% Read the multispectral layer stack and split out the individual bands
Multi_data=readgeoraster("Multispec_ref_layer_stack_clip_corre_allstudyarea_resam_AGB.tif");
Swir1=Multi_data(:,:,2);
Nir=Multi_data(:,:,3);
Red=Multi_data(:,:,4);
Green=Multi_data(:,:,5);
Blue=Multi_data(:,:,6);
Coastal=Multi_data(:,:,7);


% Read the NIR-band texture image (eight GLCM measures)
texture_data=readgeoraster("texture_NIR.tif");
Mean=texture_data(:,:,1);
Variance=texture_data(:,:,2);
Homogeneity=texture_data(:,:,3);
Contrast=texture_data(:,:,4);
Dissimilarity=texture_data(:,:,5);
Entropy=texture_data(:,:,6);
SecondMoment=texture_data(:,:,7);
Correlation=texture_data(:,:,8);

% Randomly select 1000 pixel locations (row/column indices within the image extent)
x=randi(266,1,1000);
y=randi(400,1,1000);

AGB_selected=zeros(1,1000);
SWIR1_selected=zeros(1,1000);
NIR_selected=zeros(1,1000);
RED_selected=zeros(1,1000);
GREEN_selected=zeros(1,1000);
Blue_selected=zeros(1,1000);
COASTAL_selected=zeros(1,1000);
NDVI_selected=zeros(1,1000);
SR_selected=zeros(1,1000);
OSAVI_selected=zeros(1,1000);
ARVI_selected=zeros(1,1000);
VIGreen_selected=zeros(1,1000);
MEAN_selected=zeros(1,1000);
Variance_selected=zeros(1,1000);
Homogeneity_selected=zeros(1,1000);
Contrast_selected=zeros(1,1000);
Dissimularity_selected=zeros(1,1000);
Entropy_selected=zeros(1,1000);
SecondMoment_selected=zeros(1,1000);
Correlation_selected=zeros(1,1000);

% Sample every layer at the selected pixel locations
for i=1:size(x,2)
xi=x(1,i);
yi=y(1,i);
if xi==0 || yi==0
xi=1;
yi=1;
end
AGB_selected(1,i)=AGB_data(xi,yi);
SWIR1_selected(1,i)=Swir1(xi,yi);
NIR_selected(1,i)=Nir(xi,yi);
RED_selected(1,i)=Red(xi,yi);
GREEN_selected(1,i)=Green(xi,yi);
Blue_selected(1,i)=Blue(xi,yi);
COASTAL_selected(1,i)=Coastal(xi,yi);
NDVI_selected(1,i)=NDVI_data(xi,yi);
SR_selected(1,i)=SR_data(xi,yi);
OSAVI_selected(1,i)=OSAVI_data(xi,yi);
ARVI_selected(1,i)=ARVI_data(xi,yi);
VIGreen_selected(1,i)=VIGreen_data(xi,yi);
MEAN_selected(1,i)=Mean(xi,yi);
Variance_selected(1,i)=Variance(xi,yi);
Homogeneity_selected(1,i)=Homogeneity(xi,yi);
Contrast_selected(1,i)=Contrast(xi,yi);
Dissimularity_selected(1,i)=Dissimilarity(xi,yi);
Entropy_selected(1,i)=Entropy(xi,yi);
SecondMoment_selected(1,i)=SecondMoment(xi,yi);
Correlation_selected(1,i)=Correlation(xi,yi);
end


%Export the sampled values to Excel
data=cell(1000,20);
%title={'AGB','SWIR1','NIR','RED',"GREEN",'Blue','COASTAL','NDVI','SR','OSAVI','ARVI','VIGreen','MEAN','Variance','Homogeneity','Contrast','Dissimularity','Entropy','Second Moment','Correlation'};
result=[AGB_selected;SWIR1_selected;NIR_selected;RED_selected;GREEN_selected;Blue_selected;COASTAL_selected;NDVI_selected;SR_selected;OSAVI_selected;ARVI_selected;VIGreen_selected;MEAN_selected;Variance_selected;Homogeneity_selected;Contrast_selected;Dissimularity_selected;Entropy_selected; SecondMoment_selected;Correlation_selected];
result=result';
xlswrite('data.xlsx',result)

4. Variable filtering

Perform variable screening on all the data to determine the variables used in model fitting. Methods such as correlation analysis, importance ranking, and forward/backward selection can be used. In this design, correlation analysis is used: the Pearson correlation coefficient measures the correlation between AGB and each variable.

clear;
%Read the Excel data
[num]=readtable("筛选后.xls");
AGB=table2array(num(:,1));
SWIR1=table2array(num(:,2));
NIR=table2array(num(:,3));
RED=table2array(num(:,4));
Green=table2array(num(:,5));
Blue=table2array(num(:,6));
Coastal=table2array(num(:,7));
NDVI=table2array(num(:,8));
SR=table2array(num(:,9));
OSAVI=table2array(num(:,10));
ARVI=table2array(num(:,11));
VIGreen=table2array(num(:,12));
Mean=table2array(num(:,13));
Variance=table2array(num(:,14));
Homogeneity=table2array(num(:,15));
Contrast=table2array(num(:,16));
Dissimularity=table2array(num(:,17));
Entropy=table2array(num(:,18));
SecondMoment=table2array(num(:,19));
Correlation=table2array(num(:,20));


data=[AGB,SWIR1,NIR,RED,Green,Blue,Coastal,NDVI,SR,OSAVI,ARVI,VIGreen,Mean,Variance,Homogeneity,Contrast,Dissimularity,Entropy,SecondMoment,Correlation];
data1=[SR,NDVI,OSAVI,ARVI,VIGreen];%prepare the independent variables

[rho,pval]=corr(data,'type','Pearson');
string_name={'AGB','SWIR1','NIR','RED',"GREEN",'Blue','COASTAL','NDVI','SR','OSAVI','ARVI','VIGreen','MEAN','Variance','Homogeneity','Contrast','Dissimularity','Entropy','Second Moment','Correlation'};
x_values=string_name;
y_values=string_name;
h=heatmap(x_values,y_values,rho);
h.Title="Pearson correlation";
% colormap summer
% size(pval)
% h1=heatmap(x_values,y_values,pval);
% h1.Title="Significance level";
% colormap summer
Pearson correlation coefficient heatmap

 As can be seen from the figure above, the correlations of NDVI, SR, OSAVI, ARVI and VIGreen with AGB are stronger than those of the other variables (Pearson correlation coefficients of about 0.6), so these five variables are selected as the independent variables for model training.

5. Prediction models (random forest, partial least squares, SVR support vector regression)

 5.1 Partial least squares algorithm

Partial least squares regression (PLSR) is a linear regression algorithm. It is a multivariate statistical method based on principal component analysis that models the linear relationship between multiple independent variables and one or more dependent variables in high-dimensional data sets. PLSR reduces dimensionality by projecting the independent and dependent variables into a new latent space, which also resolves multicollinearity among the independent variables. Its goal is to minimize the sum of squared prediction errors and thereby find the best prediction model.

The main steps of the PLSR algorithm include:

  1. Choose projection direction
  2. Calculate projection coefficient
  3. Perform regression analysis on projected variables
  4. Cross-validate regression results
  5. Choose the best prediction model

The code is shown below:

%Numbers of validation and training samples
VerifySize=round(size(AGB,1)/3);
TrainSize=size(AGB,1)-VerifySize;

input=data1(1:TrainSize,:);%training inputs (independent variables)
predict_input=data1(TrainSize+1:end,:);%validation inputs (independent variables)
output=AGB(1:TrainSize,:);%training outputs (dependent variable)
predict_output=AGB(TrainSize+1:end,:);%validation outputs (dependent variable)

%Partial least squares

var=[input,output];
mu=mean(var);sig=std(var); %means and standard deviations
rr=corrcoef(var); %correlation coefficient matrix
ab=zscore(var); %standardize the data
a=ab(:,1:5); %standardized independent variables
b=ab(:,6); %standardized dependent variable

%% Determine the number of extracted component pairs
[XL,YL,XS,YS,BETA,PCTVAR,MSE,stats] =plsregress(a,b);
xw=a\XS; %coefficients of the extracted X components, one column per component (equals stats.W)
yw=b\YS; %coefficients of the extracted Y components
a_0=PCTVAR(1,:);b_0=PCTVAR(2,:);% variance explained by each component for X and Y
a_1=cumsum(a_0);b_1=cumsum(b_0);% cumulative explained variance
i=1;%initial value
while ((a_1(i)<0.999)&(a_0(i)>0.001)&(b_1(i)<0.999)&(b_0(i)>0.001)) % condition for extracting components
i=i+1;
end
ncomp=i;% number of component pairs selected
fprintf('The %d component pairs are:\n',ncomp);% print component information
for i=1:ncomp
fprintf('Component pair %d:\n',i);
fprintf('u%d=',i);
for k=1:size(a,2)%number of x variables
fprintf('+(%fx_%d)',xw(k,i),k);
end
fprintf('\n');
fprintf('v%d=',i);
for k=1:size(b,2)%number of y variables
fprintf('+(%fy_%d)',yw(k,i),k);
end
fprintf('\n');
end

%% Regression analysis with the chosen number of components
[XL2,YL2,XS2,YS2,BETA2,PCTVAR2,MSE2,stats2] =plsregress(a,b,ncomp);
n=size(a,2); m=size(b,2);%n is the number of independent variables, m the number of dependent variables
beta3(1,:)=mu(n+1:end)-mu(1:n)./sig(1:n)*BETA2([2:end],:).*sig(n+1:end); %constant term of the regression equation in the original units
beta3([2:n+1],:)=(1./sig(1:n))'*sig(n+1:end).*BETA2([2:end],:); %coefficients of the original variables x1,...,xn, one regression equation per column
fprintf('The resulting regression equations are:\n')
for i=1:size(b,2)%number of y variables
fprintf('y%d=%f',i,beta3(1,i));
for j=1:size(a,2)%number of x variables
fprintf('+(%f*x%d)',beta3(j+1,i),j);
end
fprintf('\n');
end
%% Compute predictions
y1 = repmat(beta3(1,:),[size(predict_input,1),1])+predict_input(:,[1:n])*beta3([2:end],:); %predicted values of y1,...,ym
y0 = predict_output(:,end-size(y1,2)+1:end); % measured values

abs(y1-y0);

%% Predictions over the whole data set (accuracy assessment)
y_all=repmat(beta3(1,:),[size(data1,1),1])+data1(:,[1:n])*beta3([2:end],:);
y_real=AGB(:,end-size(y1,2)+1:end);
abs(y_all-y_real);

RMSE=sqrt(sum((y_all-y_real).*(y_all-y_real))/size(data1,1));
R2=1-sum((y_all-y_real).^2)/sum((y_real-mean(y_real)).^2);

y2=predict_output;
y3 = repmat(beta3(1,:),[size(predict_input,1),1])+predict_input(:,[1:n])*beta3([2:end],:);

plotregression(y2,y3)
ylabel("Predicted AGB (Mg/ha)")
xlabel("Measured AGB (Mg/ha)")
title("Partial least squares model fitting result (R=0.73291, RMSE=39.7270)")

The result graph is as follows:

 5.2 RF random forest algorithm

The decision tree is a classic machine learning algorithm that can handle both classification and regression problems. It is also the weak learner most often chosen in ensemble learning methods such as RF and GBDT.

A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents a test outcome, and each leaf node represents a category. It is a supervised machine learning algorithm based on if-then-else rules.

Decision tree learning is usually a process of recursively selecting the optimal feature and splitting the training data on that feature so that each resulting subset is classified as well as possible. This process corresponds to partitioning the feature space and building the tree.

Random forest (RF) is essentially an ensemble of many decision trees. Different training sample sets are obtained by resampling the original samples, a learner is trained on each resampled set, and the results of all learners are combined into the final output, with each sample weighted equally. As shown below:

In this method the B learners are independent of each other, which makes the approach easy to parallelize. The bootstrap method performs random sampling with replacement, and each learner is a decision tree (DT).

Algorithm steps

Each tree is generated according to the following rules:

(1) If the training set contains N samples, then for each tree N samples are drawn from the training set with replacement to form that tree's training set. (Each tree therefore has a different training set, which contains repeated samples.)

(2) If each sample has M features, a constant m << M is specified; at every split, m features are randomly drawn from the M features and the best of these m features is chosen for the split. (This is repeated until no further splits can be made.)

(3) Each tree is grown to its maximum extent, with no pruning. (A large number of such trees forms the forest.)

Randomness comes from both the random samples and the random features, which makes the model less prone to overfitting.

Classification problems: each decision tree in the forest outputs a category for the test sample, and the final decision is made by majority vote over all trees. For regression problems, as in this design, the prediction is the average of the trees' outputs.

Trend of MSE with the number of trees and minimum leaf size

 

It can be seen from the figure that when the number of trees is around 166 the MSE decreases only slowly, and the MSE is smallest when the minimum leaf size is 20. The model therefore uses 166 trees and a minimum leaf size of 20.

The model is trained on the training set: the independent variables are input, predictions are computed, and training is repeated until the RMSE falls below 35.

The result is as follows:

 

The code is shown below:

%RF random forest
clc;
clear;
[num]=readtable("筛选后.xls");
AGB=table2array(num(:,1));
NDVI=table2array(num(:,8));
SR=table2array(num(:,9));
OSAVI=table2array(num(:,10));
ARVI=table2array(num(:,11));
VIGreen=table2array(num(:,12));
data1=[SR,NDVI,OSAVI,ARVI,VIGreen];%prepare the independent variables

%Training set
input=data1(:,:);%training inputs (independent variables)
output=AGB(:,:);%training outputs (dependent variable)


%Determine the number of leaves and trees for the RF algorithm
for RFOptimizationNum=1:5
RFLeaf=[5,10,20,50,100,200,500];
col='rgbcmyk';
figure('Name','RF Leaves and Trees');
for i=1:length(RFLeaf)
RFModel=TreeBagger(2000,data1,AGB,'Method','R','OOBPrediction','On','MinLeafSize',RFLeaf(i));
plot(oobError(RFModel),col(i));
hold on
end
xlabel('Number of Grown Trees');
ylabel('Mean Squared Error') ;
LeafTreelgd=legend({'5' '10' '20' '50' '100' '200' '500'},'Location','NorthEast');
title(LeafTreelgd,'Number of Leaves');
hold off;
disp(RFOptimizationNum);
end

%From the figure, when the number of trees is around 166 the MSE decreases slowly, and the MSE is smallest when the leaf size is 20

RFRMSEMatrix=[];
RFrAllMatrix=[];
RFRunNumSet=50000;
for RFCycleRun=1:RFRunNumSet
% Split the data into training and test subsets
RandomNumber=(randperm(length(output),floor(length(output)/3)))';
TrainYield=output;
TestYield=zeros(length(RandomNumber),1);
TrainVARI=input;
TestVARI=zeros(length(RandomNumber),size(TrainVARI,2));
for i=1:length(RandomNumber)
m=RandomNumber(i,1);
TestYield(i,1)=TrainYield(m,1);
TestVARI(i,:)=TrainVARI(m,:);
TrainYield(m,1)=0;
TrainVARI(m,:)=0;
end
TrainYield(all(TrainYield==0,2),:)=[];
TrainVARI(all(TrainVARI==0,2),:)=[];

%% RF
nTree=166;
nLeaf=20;
RFModel=TreeBagger(nTree,TrainVARI,TrainYield,...
'Method','regression','OOBPredictorImportance','on', 'MinLeafSize',nLeaf);
[RFPredictYield,RFPredictConfidenceInterval]=predict(RFModel,TestVARI);


%Evaluate RF model accuracy
RFRMSE=sqrt(sum(sum((RFPredictYield-TestYield).^2))/size(TestYield,1));
RFrMatrix=corrcoef(RFPredictYield,TestYield);
RFr=RFrMatrix(1,2);
RFRMSEMatrix=[RFRMSEMatrix,RFRMSE];
RFrAllMatrix=[RFrAllMatrix,RFr];
if RFRMSE<34
disp(RFRMSE);
break;
end
disp(RFRMSE);
disp(RFCycleRun);
end


% Save the RF model
RFModelSavePath='E:\微信存储位置\WeChat Files\wxid_xjealsgnaqio21\FileStorage\File\2023-06\海测综合设计-2020级\海测综合设计-2020级\生物量反演\data\matlab';
save(sprintf('%sRF.mat',RFModelSavePath),'nLeaf','nTree',...
'RandomNumber','RFModel','RFPredictConfidenceInterval','RFPredictYield','RFr','RFRMSE',...
'TestVARI','TestYield','TrainVARI','TrainYield');

%Accuracy assessment
R2=1-sum((RFPredictYield-TestYield).^2)/sum((output-mean(output)).^2);

5.3 SVR support vector regression algorithm

An "interval zone" is created on both sides of the linear function, and the distance between this "interval zone" is ϵ (this value is often given based on experience). No loss is calculated for all samples that fall into the interval zone. , that is, only the support vector will have an impact on its function model, and finally the optimized model is obtained by minimizing the total loss and maximizing the interval. For nonlinear models, like SVM, the kernel function is used to map to the feature space, and then regression is performed.

 

Formulaic representation

In use, the SVR model differs somewhat from the traditional linear regression model. The main difference is that in SVR the loss is counted only when the absolute difference between f(x) and y exceeds ϵ, whereas in an ordinary linear model a loss is incurred whenever f(x) and y are not equal.
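In other words, SVR commonly uses the ε-insensitive loss, which can be written as

loss_ε(y, f(x)) = max(0, |y − f(x)| − ε),

so residuals smaller than ε contribute nothing to the objective.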

The optimization methods of the two models also differ: the SVR model is optimized by maximizing the width of the interval band while minimizing the loss, whereas an ordinary linear regression model is typically optimized by minimizing the mean squared error, for example by least squares or gradient descent.

 After dividing the data set, set the penalty parameter and the ε width, and train the model. The result is shown below.

 The code is shown below:

import scipy.io as scio
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Load the sampled variables exported from Matlab
matdata=scio.loadmat("data.mat")
AGB=matdata["AGB"].flatten()
ARVI=matdata["ARVI"].flatten()
NDVI=matdata["NDVI"].flatten()
OSAVI=matdata["OSAVI"].flatten()
SR=matdata["SR"].flatten()
VIGreen=matdata["VIGreen"].flatten()
data=matdata["data1"]

from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


C = 6.0  # penalty parameter
epsilon = 0.001  # width of the ε-tube
model = SVR(C=C, epsilon=epsilon)

# Split the data and fit the model
X_train, X_test, y_train, y_test = train_test_split(data, AGB, test_size=1/3, random_state=42)
print(model.fit(X_train, y_train))

# Predictions on the training set
y_train_pred = model.predict(X_train)

# Predictions on the test set

y_test_pred = model.predict(X_test)

# Model evaluation
train_rmse = mean_squared_error(y_train, y_train_pred, squared=False)
test_rmse = mean_squared_error(y_test, y_test_pred, squared=False)
print("Training RMSE:", train_rmse)
print("Test RMSE:", test_rmse)

train_r = metrics.r2_score(y_train, y_train_pred)
test_r = metrics.r2_score(y_test, y_test_pred)

print("Training R2:", train_r)
print("Test R2:", test_r)

print("回归分析")
T_y=np.array(y_test_pred).reshape((len(y_test_pred),1))
T_x=np.array(y_test).reshape((len(y_test),1))

slope,intercept=np.polyfit(y_test,y_test_pred,1)

import seaborn as sns
plt.rcParams['font.sans-serif'] = ['SimHei']  # 中文字体设置-黑体
plt.rcParams['axes.unicode_minus'] = False  # 解决保存图像是负号'-'显示为方块的问题


ax=sns.regplot(x=T_x,y=T_y,marker="+")
plt.title("SVR-支持向量回归拟合结果(R^2=0.404,RMSE=43.925(Mg/ha))")
plt.legend(["原始数据点",'拟合曲线',"置信区间"])
plt.ylabel("AGB预测值(Mg/ha)")
plt.xlabel("AGB实际测量值(Mg/ha)")
plt.savefig("SVR")
plt.show()

A comprehensive assessment yields:

               Random forest    Partial least squares    SVR support vector regression
R^2            0.744            0.732                    0.404
RMSE (Mg/ha)   28.895           39.727                   43.925

To sum up, the RF random forest algorithm gives the best fit: its coefficient of determination is closest to 1 and its RMSE is the smallest, so the RF model is used for the inversion.

6. Biomass inversion

%Inversion using the RF model
ARVI_data=readgeoraster("ARVI.tif");
VIGreen_data=readgeoraster("VIGreen.tif");
NDVI_data=readgeoraster("NDVI.tif");
SR_data=readgeoraster("SR.tif");
OSAVI_data=readgeoraster("OSAVI.tif");
AGB_predict=zeros(size(ARVI_data,1),size(ARVI_data,2));
%Invert using the five parameters above, row by row
for i=1:1:size(ARVI_data,1)
ncols=size(ARVI_data,2); %number of columns (avoid shadowing the built-in length)
ARVI=ARVI_data(i,1:ncols);
VIGreen=VIGreen_data(i,1:ncols);
NDVI=NDVI_data(i,1:ncols);
SR=SR_data(i,1:ncols);
OSAVI=OSAVI_data(i,1:ncols);
data=[SR',NDVI',OSAVI',ARVI',VIGreen'];
[predict_result,interval]=predict(RFModel,data);
AGB_predict(i,1:ncols)=predict_result;
end
%After prediction, save the result to a .mat file for export
save('AGB_predict.mat','AGB_predict');
Export the inverted tif image (Python):
import scipy.io as scio
import numpy as np
from osgeo import gdal

# Load the predicted AGB grid exported from Matlab
matdata = scio.loadmat("AGB_predict.mat")
AGB = matdata['AGB_predict']

# Write the array to a single-band GeoTIFF (400 columns x 266 rows)
driver = gdal.GetDriverByName("GTiff")
dataset = driver.Create(r'AGB_predict.tif', 400, 266, 1, gdal.GDT_Float32)
band = dataset.GetRasterBand(1)
band.WriteArray(AGB)
dataset.FlushCache()
dataset = None  # close the file

The result drawn in ENVI is as follows:

 

Origin blog.csdn.net/m0_49684834/article/details/131747522