Learning Mathematical Modeling (6): Data Preprocessing Topics in Mathematical Modeling

1 What is data preprocessing?

In mathematical modeling competitions, the official data given to all contestants may contain problems caused by subjective or objective conditions. If the data is used directly without processing, the final result may be affected. To ensure the authenticity of the data and the reliability of the modeling results, the data must therefore be preprocessed before modeling.
Data preprocessing generally includes data cleaning, data integration, data transformation and data reduction.

 2 Data preprocessing - data cleaning

When we get a set of data, it may contain missing values and outliers (noisy data). In that case we clean the data, which mainly involves two parts: missing value processing and outlier processing.

 2.1 Missing value processing

There are three main ways to deal with missing values: deleting records, data imputation, and no treatment.
Deleting records: when one item of a record is missing, the whole record is deleted. This method is convenient, but it should be used with caution when the data set is small.
Data imputation: fill in the missing data with an imputation method. The main methods are mean/median/mode imputation, fixed-value imputation, nearest-neighbor imputation, regression imputation, and interpolation.
Nearest-neighbor imputation: find the sample closest to the sample with the missing value and use its attribute value. Closeness can be measured by the Euclidean distance.
Regression imputation: build a fitted model from the existing data and from other related variables, and use it to predict the missing values.
Interpolation: many interpolation methods are in common use, mainly Lagrange interpolation and Newton interpolation.
No treatment: sometimes all samples with missing data can be placed in a separate group and handled specially.
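Two of the imputation methods listed above, mean imputation and nearest-neighbor imputation, can be sketched in MATLAB as follows. This is a minimal sketch under stated assumptions: the matrix X is made-up illustrative data, NaN is assumed to mark a missing value, and the variable names are hypothetical.

```matlab
% Rows are samples, columns are attributes; NaN marks a missing value.
% The numbers are illustrative only.
X = [1.0 2.0 3.0;
     1.1 NaN 2.9;
     5.0 6.0 7.0;
     4.9 6.2 NaN];

% Mean imputation: replace each NaN by the mean of the observed
% values in the same column.
Xmean = X;
for j = 1:size(X,2)
    col = X(:,j);
    miss = isnan(col);
    col(miss) = mean(col(~miss));
    Xmean(:,j) = col;
end

% Nearest-neighbor imputation: copy the missing attribute from the
% closest complete row. The Euclidean distance is computed only on
% the attributes that are observed in the incomplete row.
Xnn = X;
complete = find(~any(isnan(X),2));        % indices of rows without NaN
for i = find(any(isnan(X),2))'            % loop over incomplete rows
    obs = ~isnan(X(i,:));
    diffs = X(complete,obs) - repmat(X(i,obs), numel(complete), 1);
    d = sqrt(sum(diffs.^2, 2));
    [~, idx] = min(d);
    Xnn(i,~obs) = X(complete(idx),~obs);
end

disp(Xmean);
disp(Xnn);
```

Mean imputation pulls every filled value toward the column average, while nearest-neighbor imputation keeps values that are typical for similar samples; which one is more appropriate depends on the data.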

Matlab interpolation: one-dimensional interpolation
yi=interp1(x,y,xi,'method')
% x, y are the known data points; xi are the query points and yi the interpolated values; x, y, xi and yi are usually vectors
% 'method' selects the interpolation method; common choices are 'nearest', 'linear', 'spline' and 'cubic'
% 'nearest': nearest-neighbor interpolation, takes the value of the closest data point
% 'linear': linear interpolation, joins neighboring data points with linear functions
% 'spline': cubic spline interpolation, constructs a cubic polynomial on each interval
% 'cubic': cubic interpolation, constructs cubic functions for interpolation
% The default method is 'linear'
Matlab interpolation: two-dimensional interpolation
zi=interp2(x,y,z,xi,yi,'method')
% x, y, z are the known data points; xi, yi are the query points and zi is the interpolated result, i.e. the value of the interpolating function at (xi,yi); x, y are vectors, xi, yi are vectors or matrices, and z and zi are matrices
% 'method' selects the interpolation method; common choices are 'nearest', 'linear', 'spline' and 'cubic'
% 'nearest': nearest-neighbor interpolation, takes the value of the closest data point
% 'linear': bilinear interpolation, constructs two sets of linear functions for interpolation
% 'spline': bicubic spline interpolation, constructs a cubic polynomial on each interval
% 'cubic': bicubic interpolation, constructs cubic functions for interpolation
% The default method is bilinear interpolation ('linear')
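% One-dimensional interpolation
clc; clear all;
y = [0.31472 0.84549 0.98429 0.81619 0.51237];
x = [1 2 3 4 5];
x1 = 0:0.1:5;                     % query points; values below x = 1 are extrapolated
y1 = interp1(x,y,x1,'spline');    % cubic spline interpolation
plot(x1,y1)

% Two-dimensional interpolation
x = 1:5;
y = 1:3;
temps = [82 81 80 82 84;79 63 61 65 81;84 84 82 85 86];
xi = 1:.2:5;
yi = 1:.2:3;
[XI,YI] = meshgrid(xi,yi);        % full grid of query points
zi = interp2(x,y,temps,XI,YI,'spline');
mesh(XI,YI,zi);

% Interpolation of scattered data
clc; clear all;
x = [123 55 89 84 56 54 100];
y = [2 5 8 9 10 16 15];
z = [165 654 852 254 0 456 251];
x1 = 50:0.1:150;
y1 = 0:0.1:20;
[x1,y1] = meshgrid(x1,y1);
z1 = griddata(x,y,z,x1,y1,'linear');   % NaN outside the convex hull of the data
meshc(x1,y1,z1);
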
2.2 Outlier processing
For example, in a set of height data most values are between 1 and 2 meters. A sudden value of 5 meters differs greatly from the rest, so it is judged to be an outlier.
There are two ways to identify outliers: the 3σ rule of the normal distribution and the box plot.
1. The 3σ rule of the normal distribution
For normally distributed data, the probability that a value falls in (μ-3σ, μ+3σ) is 99.73%, where μ is the mean and σ is the standard deviation.
Steps:
(1) Calculate the mean μ and the standard deviation σ; (2) check whether each value lies within (μ-3σ, μ+3σ); if it does not, it is an outlier.
Applicable: populations that follow a normal distribution, such as demographic data, measurement errors, manufacturing quality, and test scores.
Not applicable: populations that follow other distributions. For example, the number of people waiting at a bus stop, as studied in queuing theory, follows a Poisson distribution.
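The two steps of the 3σ rule can be sketched in MATLAB as follows. The height values are made up for illustration, and the sample is assumed to be roughly normal apart from the suspicious entry.

```matlab
% Hypothetical height sample in meters; 5.00 is the suspicious value.
h = [1.62 1.70 1.75 1.68 1.80 1.73 1.66 1.71 1.78 1.69 1.74 1.65 5.00];

mu = mean(h);                      % step 1: mean
sigma = std(h);                    % step 1: sample standard deviation

isOut = abs(h - mu) > 3*sigma;     % step 2: outside (mu-3*sigma, mu+3*sigma)
outliers = h(isOut)
cleaned = h(~isOut);               % data with the outliers removed
```

Note that the outlier itself inflates both μ and σ, so on very small samples the rule can fail to flag even an extreme value.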
2. Box plot
To draw a box plot, first sort the data in ascending order.
The lower quartile Q1 is the 25th percentile and the upper quartile Q3 is the 75th percentile.
The interquartile range is IQR = Q3 - Q1, that is, the 75th percentile minus the 25th percentile.
As with the normal distribution, set a reasonable interval; values outside the interval are outliers.
Usually values in [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as normal.
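The box-plot rule can be sketched in MATLAB without any toolbox. The data are made up, and the quartiles are computed by linear interpolation between order statistics, which is only one of several quartile conventions (an assumption of this sketch; other conventions give slightly different Q1 and Q3).

```matlab
% Hypothetical height sample in meters; 5.00 is the suspicious value.
h = [1.62 1.70 1.75 1.68 1.80 1.73 1.66 1.71 1.78 1.69 1.74 1.65 5.00];

hs = sort(h);                      % sort in ascending order
n = numel(hs);

% Quartiles by linear interpolation at position (n-1)*p + 1
% in the sorted data (assumed convention).
pos = @(p) (n-1)*p + 1;
Q1 = interp1(1:n, hs, pos(0.25));
Q3 = interp1(1:n, hs, pos(0.75));
IQR = Q3 - Q1;

lo = Q1 - 1.5*IQR;                 % lower fence
hi = Q3 + 1.5*IQR;                 % upper fence
outliers = h(h < lo | h > hi)
```

Unlike the 3σ rule, the fences are built from quartiles, so a single extreme value has little influence on them and no normality assumption is needed.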

3 Data preprocessing - data transformation

3.1 Consistency processing of indicator types
Very-large type: the larger the indicator value, the better;
Very-small type: the smaller the indicator value, the better;
Intermediate type: the indicator value should be neither too large nor too small, and a suitable middle value is best;
Interval type: the indicator value is best when it falls within a certain interval.
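Consistency processing usually means converting every indicator to the larger-is-better form. The sketch below shows one common set of conversions. The formulas, the data, the ideal value best and the interval [a, b] are assumptions chosen for illustration; other conversions (for example the reciprocal for smaller-is-better indicators) are also used.

```matlab
% Smaller-is-better -> larger-is-better: reflect about the maximum.
x_small = [3 7 1 9];
y_small = max(x_small) - x_small;

% Intermediate type -> larger-is-better: the score falls off linearly
% with the distance from the ideal value.
x_mid = [6.5 7.0 8.2 5.9];
best = 7.0;                                  % assumed ideal value
dist = abs(x_mid - best);
y_mid = 1 - dist ./ max(dist);

% Interval type -> larger-is-better: 1 inside [a, b], linear decay outside.
x_int = [35.2 36.6 37.4 38.5];
a = 36; b = 37;                              % assumed best interval
c = max([a - min(x_int), max(x_int) - b]);   % largest distance from the interval
y_int = ones(size(x_int));
below = x_int < a;
above = x_int > b;
y_int(below) = 1 - (a - x_int(below)) ./ c;
y_int(above) = 1 - (x_int(above) - b) ./ c;

disp(y_small); disp(y_mid); disp(y_int);
```

After the conversion every indicator can be read the same way: a larger converted value is better.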


Example: to objectively evaluate the actual state of graduate education in China and the teaching quality of graduate schools, the Academic Degrees Office of the State Council organized an evaluation of graduate schools. To gain experience, 5 graduate schools were selected first and the relevant data were collected for a trial evaluation. Table 1 shows part of the data.

 3.2 Dimensionless processing of data indicators

Commonly used methods: the standard deviation method, the extreme value difference method, the efficacy coefficient method, etc.

(1) Standard deviation method

(2) Extreme value difference method

(3) Efficacy coefficient method
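The three methods can be sketched in MATLAB as follows, assuming their usual textbook definitions: the standard deviation method computes (x - mean)/s, the extreme value difference method computes (x - min)/(max - min), and the efficacy coefficient method computes c + d*(x - min)/(max - min). The constants c = 60 and d = 40 and the data matrix are illustrative assumptions.

```matlab
% Rows are evaluated objects, columns are indicators
% (assumed to be already of larger-is-better type).
X = [0.1  5  5000;
     0.2  6  6000;
     0.4  7  7000;
     0.9 10 10000;
     1.2  2   400];
n = size(X,1);

mu = mean(X,1);   s  = std(X,0,1);     % column means and standard deviations
mn = min(X,[],1); mx = max(X,[],1);    % column minima and maxima

% (1) Standard deviation method: zero mean, unit standard deviation
Z1 = (X - repmat(mu,n,1)) ./ repmat(s,n,1);

% (2) Extreme value difference method: values mapped to [0, 1]
Z2 = (X - repmat(mn,n,1)) ./ repmat(mx - mn,n,1);

% (3) Efficacy coefficient method: values mapped to [c, c + d]
c = 60; d = 40;                        % assumed constants
Z3 = c + d * Z2;

disp(Z1); disp(Z2); disp(Z3);
```

All three remove the units of the original indicators, so columns measured on very different scales become comparable.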

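A function for dimensionless processing of data (save as topsis.m):

% Data preprocessing: linear normalization
% a: data vector to process
% u: processing method, 1 = benefit type, 2 = cost type, 3 = interval type
% qujian: optimal interval for interval-type attributes
% rennai: tolerance interval (lower and upper limits)
function b=topsis(a,u,qujian,rennai)
am1=min(a);am2=max(a);
b=zeros(size(a)); % output has the same shape as the input
% Benefit-type data (the larger the better)
if u==1
b=(a-am1)./(am2-am1);
% Cost-type data (the smaller the better)
elseif u==2
b=(am2-a)./(am2-am1);
% Interval-type data
elseif u==3
for k=1:numel(a)
if a(k)>=rennai(1)&&a(k)<qujian(1)
% between the lower tolerance limit and the optimal interval
b(k)=1-(qujian(1)-a(k))/(qujian(1)-rennai(1));
elseif a(k)>=qujian(1)&&a(k)<=qujian(2)
% inside the optimal interval
b(k)=1;
elseif a(k)>qujian(2)&&a(k)<=rennai(2)
% between the optimal interval and the upper tolerance limit
b(k)=1-(a(k)-qujian(2))/(rennai(2)-qujian(2));
else
% outside the tolerance interval
b(k)=0;
end
end
end

Call the function for dimensionless processing:

A=[0.1 0.2 0.4 0.9 1.2;
5 6 7 10 2;
5000 6000 7000 10000 400;
4.7 5.6 6.7 2.3 1.8];
A=A'; % one column per indicator
a1=A(:,1);a2=A(:,2);a3=A(:,3);a4=A(:,4);
b1=topsis(a1,1); % benefit type
b2=topsis(a2,3,[5,6],[2,12]); % interval type
b3=topsis(a3,2); % cost type
b4=topsis(a4,2); % cost type
[b1,b2,b3,b4]
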
3.3 Quantification of qualitative indicators
In social practice, many problems involve the quantitative treatment of qualitative factors (indicators), such as teaching quality, scientific research level, work performance, personnel quality, various kinds of satisfaction, and other issues in the political, social and humanities fields.
How can such issues be analyzed quantitatively?
According to national evaluation standards, evaluation factors are generally divided into five grades, such as A, B, C, D, E.
How should they be quantified? And how can grades such as A-, B+, C-, D+ be quantified reasonably?
Simply mapping each grade to a number is unscientific!
Depending on the practical problem, constructing a fuzzy membership function is a feasible and effective quantification method.
Suppose several appraisers rate a factor on five grades A, B, C, D and E: {v1, v2, v3, v4, v5}.
For example, the appraisers' "satisfaction" with an event can be graded as
{very satisfied, satisfied, fairly satisfied, not very satisfied, very dissatisfied}
and the 5 grades are mapped to 5, 4, 3, 2, 1 in turn.
For continuous quantification, a large-skewed Cauchy distribution and a logarithmic function are combined as the membership function.
According to this rule, a suitable quantized value can be given for any evaluation value.
Other membership functions can also be constructed according to the actual situation.
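A membership function that combines a large-skewed Cauchy branch with a logarithmic branch can be sketched in MATLAB as follows. The sketch assumes the form often used for this five-grade scale: f(x) = 1/(1 + alpha*(x - beta)^(-2)) for 1 <= x <= 3 and f(x) = a*ln(x) + b for 3 < x <= 5, with the anchor values f(1) = 0.01, f(3) = 0.8 and f(5) = 1. The anchors and the symbol names are assumptions of this sketch.

```matlab
% Logarithmic branch a*ln(x) + b on (3, 5]:
% solve f(3) = 0.8 and f(5) = 1 for a and b.
a = 0.2 / log(5/3);
b = 1 - a*log(5);

% Cauchy branch 1/(1 + alpha*(x - beta)^(-2)) on [1, 3]:
% f(3) = 0.8  gives alpha/(3 - beta)^2 = 0.25,
% f(1) = 0.01 gives alpha/(1 - beta)^2 = 99.
r = sqrt(99 / 0.25);               % (3 - beta)/(1 - beta)
beta = (r - 3) / (r - 1);
alpha = 0.25 * (3 - beta)^2;

% Piecewise membership function, valid for 1 <= x <= 5.
f = @(x) (x <= 3) ./ (1 + alpha .* (x - beta).^(-2)) ...
       + (x > 3) .* (a .* log(x) + b);

grades = [1 2 3 4 5];              % very dissatisfied ... very satisfied
scores = f(grades)
```

Because f is defined on the whole interval, in-between grades can be scored as well; for example, a grade between "fairly satisfied" and "satisfied" could be evaluated as f(3.5).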


Origin blog.csdn.net/qq_60498436/article/details/131986854