MATLAB Gray Relational Analysis


1. Terminology

Gray relational analysis (also written grey relational analysis, GRA) is a multi-factor statistical analysis method. Simply put, in a gray system we want to know how strongly each of several other factors influences an item we care about. Put more bluntly: we know that some indicator may be related to a number of other factors, and we want to know which of those factors are relatively more related to the indicator and which are relatively less so. Ranking the factors in this way gives an analysis result that tells us which factors are most relevant to the indicator we care about.

(Note: the concept of a gray system is defined relative to white systems and black systems. It was first proposed by Professor Deng Julong, who works in control science and engineering. Following the convention in cybernetics, the color indicates how much information we have about a system. White means the information is sufficient: for example, a mechanical system in which the relationships between the components can all be determined is a white system. Black means that we do not know the internal structure of the system at all; such a system is usually called a black box. Gray lies in between and means that we understand the system only partially.)

2. Examples

To illustrate a typical scenario for gray relational analysis, we use the table below:
[Figure: table of tourism data for 1998 to 2002]
The study behind this table looks at the factors that affect the development of tourism. The first row is the total tourism revenue over five years, which represents the level of tourism development. The rows below it are the factors we want to analyze, such as the number of college students, the number of travel agencies, the number of star-rated hotels, the number of A-level scenic spots, and so on. The ultimate goal is a ranking that shows how strongly each of these factors is related to total tourism revenue.

3. Detailed procedures and principles

(1) Establish the parent sequence (the reference sequence; in the example above, the total tourism revenue from 1998 to 2002) and the sub-sequences (the comparative sequences, that is, the factors to be ranked; in the example, every row except the first can serve as a sub-sequence).
To simplify the discussion below, we fix the notation here:

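We write x_i(k) for the k-th value of the i-th factor.
Using the example above:
if the first factor is the number of college students, then x_1(1) is the number of college students in 1998, namely 341, and x_1(2) is the value for 1999,
while x_2(1) is the number of tourism employees in 1998, and so on.
We write x_0(k) for the parent sequence; the sequences with i ≥ 1 are the sub-sequences, that is, the sequences of the factors to be analyzed.
Without the parentheses, x_i stands for the whole sequence of that factor, that is, the vector x_i = [x_i(1), x_i(2), ... , x_i(n)].
n is the dimension of each vector, that is, the number of values recorded for each factor. In the example above n is 5, because there are five years of data, so each vector is five-dimensional.
All statements below use this notation.

Explanation: this step sets up the goal of our task (finding how strongly each sub-sequence is related to the reference sequence).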
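The notation maps naturally onto a MATLAB matrix. The following is a minimal sketch rather than part of the original derivation: it assumes the sub-sequences are stacked as the rows of a matrix named X (a name chosen here for illustration), with the numbers taken from the appendix code. Because MATLAB indices start at 1, the parent sequence x_0 is kept in its own vector.

```matlab
% Columns are the years 1998..2002; data copied from the appendix code.
x0 = [3439, 4002, 4519, 4995, 5566];   % parent sequence x_0: total tourism revenue

% Assumption: sub-sequences stored as rows of X, so X(i, k) plays the role of x_i(k).
X = [ 341,  409,  556,  719,  903;     % x_1: college students
      183,  196,  564,  598,  613;     % x_2: tourism employees
     3248, 3856, 6029, 7358, 8880];    % x_3: star-rated hotels

[m, n] = size(X);    % m = number of factors, n = number of values per factor
x_1_1 = X(1, 1);     % x_1(1): college students in 1998
x_2   = X(2, :);     % the whole sequence x_2

fprintf('m = %d, n = %d, x_1(1) = %d\n', m, n, x_1_1);
```

Keeping all sub-sequences in one matrix is only a convenience; the appendix uses one named vector per factor, which is equivalent.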
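(2) Normalization, also called nondimensionalization.
Explanation: the factors are indicators of different kinds of things, so some values may be very large and others very small. This is not determined by any intrinsic property of the factors; it is only a consequence of their different units. We therefore need to make them dimensionless. In data processing this operation is usually called normalization: it reduces the differences in absolute magnitude, brings the values into roughly the same range, and lets us focus on their changes and trends.
The figure below shows how the first three factors in the table above change by year, together with total tourism revenue, which is the parent sequence:
[Figure: the first three factors and total tourism revenue plotted by year]
Two of the curves have very large absolute values and the other two very small ones. Without preprocessing, the influence of the large-valued variables would inevitably "drown out" the influence of the small-valued ones.

So we need to normalize the data. The main methods are the following:
(1) Initial-value normalization: as the name suggests, every value in a sequence is divided by the first value of that sequence. Because the values within one factor's sequence do not differ much in magnitude, dividing by the initial value brings them all to the order of 1.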

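Formula: x_i(k)' = x_i(k) / x_i(1)   i = 0,...,m, k = 1,...,n
(m is the number of factors and n is the dimension of each factor's data. In the example above, n = 5 and m = 3: we look only at the first three factors, the three drawn in the plot, and their relation to total tourism revenue, and the data dimension is 5, that is, five years. The index i = 0 is the parent sequence, which is normalized in the same way.)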
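As a sketch of initial-value normalization, the snippet below divides each sequence by its first element. The matrix name X and the row-per-factor layout are assumptions made for this illustration, and the row-wise division relies on implicit expansion (MATLAB R2016b or later, or Octave).

```matlab
% Data copied from the appendix code; columns are 1998..2002.
x0 = [3439, 4002, 4519, 4995, 5566];   % parent sequence
X  = [ 341,  409,  556,  719,  903;    % college students
       183,  196,  564,  598,  613;    % tourism employees
      3248, 3856, 6029, 7358, 8880];   % star-rated hotels

% x_i(k)' = x_i(k) / x_i(1), applied to the parent and to every row of X.
x0_init = x0 ./ x0(1);
X_init  = X ./ X(:, 1);   % implicit expansion: each row divided by its own first value

disp(x0_init);
disp(X_init);
```

After this step every sequence starts at exactly 1, so the curves share a common starting point and only their relative growth differs.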

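(2) Mean-value normalization: as the name suggests, every value in a sequence is divided by the mean of that sequence. A sequence with a large order of magnitude also has a large mean, so dividing by the mean brings the values to the order of 1.

Formula: x_i(k)' = x_i(k) / mean(x_i)        (divide by the mean)
where: mean(x_i) = (1/n) sum_{k=1}^{n} x_i(k)     (the mean of the i-th factor's sequence)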
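Mean-value normalization can be sketched in the same way. As before, the matrix X with one factor per row is an assumption of this example, not something the original code uses, and the division relies on implicit expansion (MATLAB R2016b or later, or Octave).

```matlab
% Sub-sequences copied from the appendix code; columns are 1998..2002.
X = [ 341,  409,  556,  719,  903;    % college students
      183,  196,  564,  598,  613;    % tourism employees
     3248, 3856, 6029, 7358, 8880];   % star-rated hotels

% x_i(k)' = x_i(k) / mean(x_i): mean(X, 2) is the column vector of row means.
X_mean = X ./ mean(X, 2);

% Each normalized row now has mean 1 (up to rounding).
row_means = mean(X_mean, 2);
disp(row_means');
```

Unlike initial-value normalization, the normalized sequences here do not all pass through the same point, so the distance to the parent sequence is generally nonzero in every dimension.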

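Normalizing with the initial-value method gives the result shown below:
[Figure: the sequences after initial-value normalization]
After normalization the differences in magnitude are much smaller. This prepares for the next step, because what we actually care about is the difference in the shapes of the curves, and we do not want absolute magnitudes to affect the later calculation.
(3) Compute the gray relational coefficients.
First, the formula:
zeta_i(k) = ( min_i min_k |x_0(k) - x_i(k)| + rho * max_i max_k |x_0(k) - x_i(k)| ) / ( |x_0(k) - x_i(k)| + rho * max_i max_k |x_0(k) - x_i(k)| )
Explanation:
First, treat i as fixed. For one factor, we evaluate the formula in every dimension k and obtain a new sequence; each point of this sequence represents how strongly the sub-sequence is related to the parent sequence in that dimension (the larger the number, the stronger the relation).
Looking closely at the formula, rho is an adjustable coefficient with values in (0,1), that is, greater than zero and less than one. Its purpose is to adjust how far apart the output values are, which we discuss later. Suppose for now that we set rho to 0; the formula then becomes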

pseudo_zeta_i(k) = min_i min_k |x_0(k) - x_i(k)| / |x_0(k) - x_i(k)|
                 = constant / |x_0(k) - x_i(k)|

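In the expression above, the numerator is the same for all sub-sequences: it is the smallest distance to the parent sequence (the reference sequence, the one we compare against) over all dimensions of all factors. Why do this? Suppose we had not normalized, or had used mean-value normalization or some other method instead of initial-value normalization. The curves, that is, the parent sequence and the sub-sequences, might then still be some distance apart, and dividing this minimum distance by the distance in each dimension can be seen as another way of removing units. Because the numerator is the same for all sub-sequences, the coefficient pseudo_zeta in dimension k is inversely proportional to the distance between the sub-sequence and the parent sequence in that dimension (the absolute value of the difference, commonly called the l1 norm). In other words, the further apart the two values are, the less related we consider them, which matches intuition.

Of course, if the data were normalized with the initial-value method, as in the normalized plot above, every sequence starts at 1, so min min |x_0(k) - x_i(k)| is 0. That is a problem, because every zeta_i(k) then becomes 0, which is meaningless. This is where the rho max max term comes in. It is also a constant that does not depend on i, so it can be understood as adding the same number to both the numerator and the denominator of the expression above, as follows: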

zeta_i(k) = (a + b) / (|x_0(k) - x_i(k)| + b),   where the constants are a = min min |x_0(k) - x_i(k)| and b = rho * max max |x_0(k) - x_i(k)|

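What is the purpose of this? An example: take the two fractions 1/5 and 1/4. Their numerators are equal and their denominators differ by 1, and their values differ by 1/20, that is, 0.05. This is the case without the rho max max term: the numerators are the same, and the difference between the denominators represents the difference in distance to the reference sequence. If we add 20 to both the numerator and the denominator, we get 21/25 and 21/24, which differ by 0.035. So adding this term shrinks the gap between the coefficients of points at the same distances. In general, the larger rho is, the smaller the relative gap between different zeta coefficients.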
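The arithmetic in this example is easy to check numerically. The sketch below only reproduces the two cases from the text (offset 0 and offset 20); the helper name gap and the variables a, b, d are hypothetical names introduced for this illustration.

```matlab
d = [4, 5];   % the two denominators, standing in for two distances
a = 1;        % the shared numerator, standing in for the min min term

% Gap between (a + b)/(d(1) + b) and (a + b)/(d(2) + b) for an offset b,
% where b stands in for the rho * max max term.
gap = @(b) abs(diff((a + b) ./ (d + b)));

gap0  = gap(0);    % 1/4 - 1/5
gap20 = gap(20);   % 21/24 - 21/25

fprintf('gap without offset: %.3f\n', gap0);
fprintf('gap with offset 20: %.3f\n', gap20);
```

The ratio of the two fractions, (d(2) + b) / (d(1) + b), moves from 1.25 toward 1 as b grows, which is the sense in which a larger offset makes the coefficients harder to tell apart.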

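In addition, because the numerator is min min, the global minimum of the distances, the denominator is never smaller than the numerator (ignoring the rho max max term). If the denominator is very large, that is, the curves are very far apart, zeta is close to 0; conversely, if the difference between x_i and x_0 is exactly the same in every dimension, the fraction equals 1. So zeta takes values between 0 and 1, where 0 means unrelated and 1 means strongly related, which again matches intuition. Once the rho max max term is included, recall that adding the same positive value to the numerator and the denominator of a proper fraction still gives a proper fraction (this is really the problem of adding solute to a solution). So the value still lies between 0 and 1.

In summary, rho is a coefficient that controls how well the zeta coefficients discriminate. rho takes values between 0 and 1; the smaller rho is, the greater the discrimination, and 0.5 is usually a suitable choice. The zeta relational coefficients fall between 0 and 1.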
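The formula can be evaluated for all factors at once. The following is a compact sketch under a few assumptions: the sub-sequences are stacked as rows of a matrix X (a layout chosen here, not used in the appendix), the data are normalized with the initial-value method, rho is 0.5, and implicit expansion is available (MATLAB R2016b or later, or Octave).

```matlab
% Data copied from the appendix code; columns are 1998..2002.
x0 = [3439, 4002, 4519, 4995, 5566];
X  = [ 341,  409,  556,  719,  903;
       183,  196,  564,  598,  613;
      3248, 3856, 6029, 7358, 8880];

% Initial-value normalization of the parent and of every sub-sequence.
x0 = x0 ./ x0(1);
X  = X ./ X(:, 1);

% D(i, k) = |x_0(k) - x_i(k)|, the distance in each dimension.
D = abs(X - x0);
d_min = min(D(:));   % min over i and k (0 here, since all sequences start at 1)
d_max = max(D(:));   % max over i and k

rho  = 0.5;
zeta = (d_min + rho * d_max) ./ (D + rho * d_max);   % m-by-n matrix of zeta_i(k)

fprintf('zeta range: [%.4f, %.4f]\n', min(zeta(:)), max(zeta(:)));
```

Every entry of zeta lies in (0, 1], and the first column equals 1 because all normalized sequences coincide in 1998.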

Returning to the example, we compute the sequence of zeta relational coefficients for each of the three sub-sequences. The results are as follows:
[Figure: zeta relational coefficients of the three sub-sequences]
This plot already shows that the number of college students is generally the most strongly related to tourism revenue, the number of employees has a relatively smaller influence, and the number of star-rated hotels lies in between.
(4) Compute the mean relational coefficient and form the relational order.
The plot above already shows the rough trend, but only because the trend happens to be consistent across all dimensions. In practice, once we have the zeta relational coefficients, we should average each factor's values over the dimensions; in other words, take the mean of each curve (each color) in the zeta plot above. The results are as follows:

>> mean(zeta_1)
ans =
   0.7505
>> mean(zeta_2)
ans =
    0.5848
>> mean(zeta_3)
ans =
   0.7154

Sorting by the size of the mean relational coefficient gives:

Number of college students > number of star-rated hotels > number of employees
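The averaging and ranking step can also be automated instead of read off by eye. This is a sketch under the same assumptions as the appendix (initial-value normalization, rho = 0.5); the matrix layout, the variable names r and order, and the English labels are choices made for this illustration.

```matlab
% Data copied from the appendix code; columns are 1998..2002.
x0 = [3439, 4002, 4519, 4995, 5566];
X  = [ 341,  409,  556,  719,  903;
       183,  196,  564,  598,  613;
      3248, 3856, 6029, 7358, 8880];
names = {'college students', 'tourism employees', 'star-rated hotels'};

% Initial-value normalization, then the zeta coefficients with rho = 0.5.
x0 = x0 ./ x0(1);
X  = X ./ X(:, 1);    % implicit expansion (R2016b+ or Octave)
D  = abs(X - x0);
rho  = 0.5;
zeta = (min(D(:)) + rho * max(D(:))) ./ (D + rho * max(D(:)));

% Mean relational coefficient per factor, then sort from most to least related.
r = mean(zeta, 2);
[r_sorted, order] = sort(r, 'descend');

for j = 1:numel(order)
    fprintf('%d. %s (%.4f)\n', j, names{order(j)}, r_sorted(j));
end
```

With the example data this reproduces the means and the order reported above.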

4. Summary

In essence, the GRA algorithm provides a way to measure the distance between two vectors. For factors that vary over time, a vector can be regarded as a curve over time, and GRA measures whether the shape and trend of two curves are similar. To keep other effects from interfering with the shape and trend, GRA first normalizes, bringing all vectors to the same scale and position, and then computes the distance at each point. Finally, the min min and max max terms correct the result so that the output falls between 0 and 1, as is conventional for a coefficient. rho adjusts the differences between the relational coefficients; in other words, it makes the distribution of the outputs sparser or more compact. In short, from a mathematical point of view, the algorithm takes the reciprocal of the l1-norm distance between the normalized sub-vector and the parent vector in each dimension, maps it into the range 0 to 1, and sums (averages) the results over the dimensions, as one strategy for measuring how related the vectors are.

5. Appendix: MATLAB Code

% Grey relation analysis

clear all
close all
clc

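% data for 1998 to 2002 (variable names are pinyin)
zongshouru = [3439, 4002, 4519, 4995, 5566];      % total tourism revenue
daxuesheng = [341, 409, 556, 719, 903];           % number of college students
congyerenyuan = [183, 196, 564, 598, 613];        % number of tourism employees
xingjifandian = [3248, 3856, 6029, 7358, 8880];   % number of star-rated hotels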

% define comparative and reference
x0 = zongshouru;
x1 = daxuesheng;
x2 = congyerenyuan;
x3 = xingjifandian;

% normalization
x0 = x0 ./ x0(1);
x1 = x1 ./ x1(1);
x2 = x2 ./ x2(1);
x3 = x3 ./ x3(1);

% global min and max
global_min = min(min(abs([x1; x2; x3] - repmat(x0, [3, 1]))));
global_max = max(max(abs([x1; x2; x3] - repmat(x0, [3, 1]))));

% set rho
rho = 0.5;

% calculate zeta relation coefficients
zeta_1 = (global_min + rho * global_max) ./ (abs(x0 - x1) + rho * global_max);
zeta_2 = (global_min + rho * global_max) ./ (abs(x0 - x2) + rho * global_max);
zeta_3 = (global_min + rho * global_max) ./ (abs(x0 - x3) + rho * global_max);

% show
figure;
plot(x0, 'ko-' )
hold on
plot(x1, 'b*-')
hold on
plot(x2, 'g*-')
hold on
plot(x3, 'r*-')
legend('zongshouru', 'daxuesheng', 'congyerenyuan', 'xingjifandian')

figure;
plot(zeta_1, 'b*-')
hold on
plot(zeta_2, 'g*-')
hold on
plot(zeta_3, 'r*-')
title('Relation zeta')
legend('daxuesheng', 'congyerenyuan', 'xingjifandian')


Origin www.cnblogs.com/xiegaosen/p/12004690.html