Association Analysis (3): The GSP Algorithm

Reposted from: http://www.cnblogs.com/liuqing910/p/8964863.html

Association rules, as discussed with the Apriori algorithm, emphasize co-occurrence relationships and ignore the sequence information (time/space) in the data:

Temporal sequence: a customer who buys product X is likely to buy product Y within a certain period of time;

Spatial sequence: when phenomenon A is observed at one location, phenomenon Y is likely to be observed at the next location.

Example: customers who bought a Pentium PC six months ago are likely to order a new CPU chip within a month.

Note: 1) sequential pattern = association rule + time/space dimension

      2) the sequential pattern mining discussed here refers to mining along the time dimension.

I. Basic definitions

Sequence: arranging all events associated with object A in order of increasing timestamp yields the sequence s of object A.

Element (transaction): a sequence is an ordered list of elements (transactions), denoted s = <e1 e2 ... en>, where each element ej is a set of one or more events (items), i.e. ej = {i1, i2, ..., ik}.

Length of a sequence: the number of elements in the sequence.

Size of a sequence: the number of events in the sequence; a k-sequence is a sequence containing k events.

For example, the course sequence shown in the source figure contains four elements and eight events.

Subsequence: a sequence t is a subsequence of another sequence s if each ordered element of t is a subset of an ordered element of s. That is, t = <t1 t2 ... tm> is a subsequence of s = <s1 s2 ... sn> if there exist integers 1 <= j1 < j2 < ... < jm <= n such that t1 ⊆ s_j1, t2 ⊆ s_j2, ..., tm ⊆ s_jm.

Example:

Sequence database: a data set containing one or more data sequences, as follows:

 

II. Sequential pattern mining

Support of a sequence: the support of a sequence s is the fraction of all data sequences that contain s, where a data sequence is the ordered list of events associated with a single data object (here A, B or C). If the support of s is greater than or equal to minsup, s is called a sequential pattern (frequent sequence).

Sequential pattern mining: given a sequence data set D and a user-specified minimum support minsup, find all sequences whose support is greater than or equal to minsup.

Example: in the following example, assume minsup = 50%. The sequence (subsequence) <{2} {3}> is contained in A, B and C, so its support is 3/5 = 0.6; the other sequences are handled in the same way.
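The support definition can be sketched in a few lines of Python. This is only a minimal sketch, not the author's code: the database `db` is hypothetical (the original figure is not available) and was chosen only so that <{2} {3}> is contained in A, B and C, and `contains` and `support` are illustrative names.

```python
def contains(data_seq, pattern):
    """Return True if pattern is a subsequence of data_seq (no timing constraints)."""
    pos = 0
    for element in pattern:
        # Greedily match each pattern element to the earliest remaining
        # element of the data sequence that is a superset of it.
        while pos < len(data_seq) and not element <= data_seq[pos]:
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1
    return True


def support(db, pattern):
    """Fraction of the data sequences in db that contain pattern."""
    return sum(contains(seq, pattern) for seq in db.values()) / len(db)


# Hypothetical sequence database: one data sequence (list of event sets) per object.
db = {
    "A": [{1, 2, 4}, {2, 3}, {5}],
    "B": [{1, 2}, {2, 3, 4}],
    "C": [{1, 2}, {2, 3, 4}, {2, 4, 5}],
    "D": [{2, 4}, {4, 5}],
    "E": [{1, 3}, {2, 4, 5}],
}

pattern = [{2}, {3}]
supporters = [name for name, seq in db.items() if contains(seq, pattern)]
print(supporters, support(db, pattern))  # ['A', 'B', 'C'] 0.6
```

With minsup = 50%, the pattern counts as frequent because 0.6 >= 0.5.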

 

 

Generating sequential patterns

1. Brute force

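Enumerate all possible sequences and count the support of each. Note that the number of candidate sequences is far larger than the number of candidate itemsets, for the following two reasons:

1) an item can appear at most once in an itemset, but an event can appear more than once in a sequence;

2) order matters in a sequence but not in an itemset.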

 

2. Apriori-like algorithm

Candidate generation: a pair of frequent (k-1)-sequences is merged to produce a candidate k-sequence. To avoid generating duplicate candidates, merging follows this rule:

A sequence S1 can be merged with a sequence S2 only if the subsequence obtained by removing the first event of S1 is identical to the subsequence obtained by removing the last event of S2. The merged result is S1 extended with the last event of S2, which is attached in one of two ways:

1) If the last two events of S2 belong to the same element, the last event of S2 becomes part of the last element of S1 in the merged sequence;

2) If the last two events of S2 belong to different elements, the last event of S2 is appended to the end of S1 as a separate element in the merged sequence.

Example:

<(1) (2) (3)> + <(2) (3) (4)> = <(1) (2) (3) (4)>: removing the first event (1) from S1 and removing the last event (4) from S2 both leave the subsequence <(2) (3)>, and the last two events of S2, (3) and (4), belong to different elements, so (4) is appended as a separate element;

<(2 5) (3)> + <(5) (3 4)> = <(2 5) (3 4)>: after removing event 2 and event 4 the remaining subsequences are identical; since the last two events of S2, (3 4), belong to the same element, event 4 is merged into the last element of S1 rather than writing <(2 5) (3) (3 4)>.
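The merge rule can be sketched as follows. This is a minimal sketch under assumptions that are not in the original: a sequence is represented as a tuple of elements, each element is a sorted tuple of events, both inputs contain at least two events (merging 1-sequences needs separate handling), and `drop_first`, `drop_last` and `merge` are illustrative names.

```python
def drop_first(seq):
    """Remove the first event of the first element (drop the element if it becomes empty)."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last event of the last element (drop the element if it becomes empty)."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def merge(s1, s2):
    """Merge two frequent (k-1)-sequences into a candidate k-sequence, or return None."""
    if drop_first(s1) != drop_last(s2):
        return None
    last_event = s2[-1][-1]
    if len(s2[-1]) > 1:
        # Last two events of s2 share an element: extend the last element of s1.
        return s1[:-1] + (tuple(sorted(s1[-1] + (last_event,))),)
    # Last two events of s2 are in different elements: append a new element.
    return s1 + ((last_event,),)


print(merge(((1,), (2,), (3,)), ((2,), (3,), (4,))))  # ((1,), (2,), (3,), (4,))
print(merge(((2, 5), (3,)), ((5,), (3, 4))))          # ((2, 5), (3, 4))
```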

 

 

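Candidate pruning: a candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent.

In the example above, only <{1} {2,5} {3}> remains after candidate pruning.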
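The pruning step can be sketched as below. Since the original figure is not available, the set of frequent 3-sequences and the list of candidate 4-sequences are assumptions, chosen to be consistent with the merge examples and the pruning result quoted in the text; `delete_one_event` and `prune` are illustrative names.

```python
def delete_one_event(seq):
    """Yield every (k-1)-subsequence obtained by deleting a single event from seq."""
    for i, element in enumerate(seq):
        for event in element:
            rest = tuple(e for e in element if e != event)
            yield seq[:i] + ((rest,) if rest else ()) + seq[i + 1:]


def prune(candidates, frequent):
    """Keep only candidates whose (k-1)-subsequences are all frequent."""
    return [c for c in candidates
            if all(sub in frequent for sub in delete_one_event(c))]


# Assumed frequent 3-sequences (tuple of elements, each a sorted tuple of events).
frequent_3 = {
    ((1,), (2,), (3,)),
    ((1,), (2, 5)),
    ((1,), (5,), (3,)),
    ((2,), (3,), (4,)),
    ((2, 5), (3,)),
    ((3,), (4,), (5,)),
    ((5,), (3, 4)),
}

# Assumed candidate 4-sequences produced by the merge step.
candidates_4 = [
    ((1,), (2,), (3,), (4,)),
    ((1,), (2, 5), (3,)),
    ((1,), (5,), (3, 4)),
    ((2,), (3,), (4,), (5,)),
    ((2, 5), (3, 4)),
]

print(prune(candidates_4, frequent_3))  # [((1,), (2, 5), (3,))]
```

For instance, <(1) (2) (3) (4)> is dropped because deleting event 2 gives <(1) (3) (4)>, which is not in the assumed frequent set.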

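3. Timing constraints

When timing constraints are imposed, each element of a sequential pattern is associated with a time window [l, u], where l is the earliest occurrence time of an event within the window and u is the latest occurrence time of an event within the window.

Maximum span constraint: the maximum allowed time difference between the latest and the earliest occurrence of events in the entire sequence, denoted maxspan. In general, the longer maxspan is, the more likely a pattern is to be detected in a data sequence, but a longer maxspan may also capture spurious patterns.

Note: the maximum span affects support counting in the sequential pattern discovery algorithm; once a maximum span constraint is imposed, some data sequences no longer support a candidate pattern.

Minimum gap and maximum gap constraints: suppose the maximum gap is maxgap = 3 (days) and the minimum gap is mingap = 1. Then the events of an element must occur more than one day after, and no later than three days after, the events of the previous element.

Note: using the maximum gap constraint may violate the Apriori principle. Taking Figure 2.1 as an example, without constraints <{2} {5}> and <{2}{3}{5}> both have a support of 60%. With mingap = 0 and maxgap = 1, the support of <{2} {5}> drops to 40% (it loses the support of D), while the support of <{2}{3}{5}> is still 60%. That is, the longer sequence has a higher support than its subsequence, which contradicts the Apriori principle. The concept of a contiguous subsequence avoids this problem.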
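The effect can be reproduced with a small experiment. The timestamped database below is an assumption (the original Figure 2.1 is not available); it was chosen so that the supports match the numbers quoted in the note. The check is a plain recursive search with ws = 0, and `supports` and `support` are illustrative names.

```python
def supports(data_seq, pattern, mingap=0, maxgap=float("inf"), prev_time=None):
    """True if the timestamped data sequence contains pattern under the gap constraints."""
    if not pattern:
        return True
    for time, events in data_seq:
        # Each element must occur more than mingap and at most maxgap after the previous one.
        if prev_time is not None and not (mingap < time - prev_time <= maxgap):
            continue
        if set(pattern[0]) <= events and supports(
                data_seq, pattern[1:], mingap, maxgap, time):
            return True
    return False


def support(db, pattern, **constraints):
    """Fraction of data sequences that contain pattern under the given constraints."""
    return sum(supports(seq, pattern, **constraints) for seq in db.values()) / len(db)


# Assumed database: each data sequence is a list of (timestamp, event set) pairs.
db = {
    "A": [(1, {1, 2, 4}), (2, {2, 3}), (3, {5})],
    "B": [(1, {1, 2}), (2, {2, 3, 4})],
    "C": [(1, {1, 2}), (2, {2, 3, 4}), (3, {2, 4, 5})],
    "D": [(1, {2}), (2, {3, 4}), (3, {4, 5})],
    "E": [(1, {1, 3}), (2, {2, 4, 5})],
}

print(support(db, [(2,), (5,)]))                  # 0.6 without a maxgap
print(support(db, [(2,), (5,)], maxgap=1))        # 0.4: D no longer supports it
print(support(db, [(2,), (3,), (5,)], maxgap=1))  # 0.6: higher than its subsequence
```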

 

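Example:

Using contiguous subsequences, the Apriori principle is modified as follows:

Modified Apriori principle: if a k-sequence is frequent, then all of its contiguous (k-1)-subsequences must also be frequent.

Note: by this principle, not every (k-1)-subsequence needs to be checked in the candidate pruning phase; subsequences that would violate the maximum gap constraint are skipped.

Example: if maxgap = 1, there is no need to check whether the subsequence <{1}{2,3}{5}> of <{1}{2,3}{4}{5}> is frequent, because the time difference between {2,3} and {5} is 2, which is more than one time unit. Only the contiguous subsequences need to be examined: <{1}{2,3}{4}>, <{2,3}{4}{5}>, <{1}{2}{4}{5}>, <{1}{3}{4}{5}>.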
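The contiguous (k-1)-subsequences of a k-sequence can be enumerated mechanically: an event may be deleted from the first element, from the last element, or from any element that holds at least two events. The following is a minimal sketch with an illustrative function name and a tuple-of-tuples representation that is assumed here, not taken from the original.

```python
def contiguous_subsequences(seq):
    """Return the set of contiguous (k-1)-subsequences of a k-sequence."""
    result = set()
    last = len(seq) - 1
    for i, element in enumerate(seq):
        # A deletion is allowed in the first or last element,
        # or in any element that contains at least two events.
        if i in (0, last) or len(element) >= 2:
            for event in element:
                rest = tuple(e for e in element if e != event)
                result.add(seq[:i] + ((rest,) if rest else ()) + seq[i + 1:])
    return result


candidate = ((1,), (2, 3), (4,), (5,))
for sub in sorted(contiguous_subsequences(candidate)):
    print(sub)
```

For the candidate above this produces exactly the four subsequences listed in the example; <{1}{2,3}{5}> is not among them because {4} is a single-event element in the middle of the sequence.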

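Window size constraint: the events within an element need not occur at the same time. A window size threshold (ws) can be defined to specify the maximum allowed time difference between the latest and the earliest occurrence of events in any element of a sequential pattern (ws = 0 means that all events in the same element must occur simultaneously).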

  

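The GSP algorithm

Basic idea of the algorithm

1. The sequential patterns of length 1, L1, form the initial seed set;

2. From the seed set Li of length-i patterns, generate candidate sequential patterns of length i+1 through the join and prune operations; then scan the database, compute the support of each candidate, and keep the sequential patterns of length i+1 as the new seed set;

3. Repeat step 2 until no new sequential patterns or no new candidate sequential patterns are produced.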
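The three steps can be put together in a compact level-wise loop. This is a sketch under simplifying assumptions that are not in the original: no timing constraints, no hash tree (every candidate is checked against every data sequence), "length" counts events, sequences are tuples of sorted event tuples, and all function names and the tiny database are illustrative.

```python
def contains(data_seq, pattern):
    """Greedy subsequence test without timing constraints."""
    pos = 0
    for element in pattern:
        while pos < len(data_seq) and not set(element) <= set(data_seq[pos]):
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1
    return True


def count(db, pattern):
    return sum(contains(d, pattern) for d in db)


def drop_first(seq):
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def join(s1, s2):
    """Merge two frequent (k-1)-sequences (k >= 3) or return None."""
    if drop_first(s1) != drop_last(s2):
        return None
    event = s2[-1][-1]
    if len(s2[-1]) > 1:
        return s1[:-1] + (tuple(sorted(s1[-1] + (event,))),)
    return s1 + ((event,),)


def delete_one_event(seq):
    for i, element in enumerate(seq):
        for event in element:
            rest = tuple(e for e in element if e != event)
            yield seq[:i] + ((rest,) if rest else ()) + seq[i + 1:]


def gsp(db, min_count):
    """Return all sequences contained in at least min_count data sequences."""
    items = sorted({e for d in db for element in d for e in element})
    level = [((i,),) for i in items if count(db, ((i,),)) >= min_count]  # step 1
    patterns, k = [], 1
    while level:                                                         # step 3
        patterns.extend(level)
        k += 1
        if k == 2:
            # 1-sequences are combined both as <(x)(y)> and as <(x y)>.
            singles = [s[0][0] for s in level]
            candidates = [((x,), (y,)) for x in singles for y in singles]
            candidates += [((x, y),) for x in singles for y in singles if x < y]
        else:
            frequent = set(level)
            joined = {join(a, b) for a in level for b in level} - {None}
            candidates = [c for c in sorted(joined)
                          if all(s in frequent for s in delete_one_event(c))]
        level = [c for c in candidates if count(db, c) >= min_count]     # step 2
    return patterns


db = [((1,), (2,), (3,)), ((1,), (2,), (3,)), ((1, 2), (3,))]
for p in gsp(db, min_count=2):
    print(p)
```

On this toy database the loop stops after the single frequent 3-sequence <(1)(2)(3)>, because joining it with itself yields no candidate of length 4.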

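Two problems to solve

1. Candidate generation: join + prune, with the aim of producing as few candidates as possible;

2. Support counting

Two techniques:

1) Store the candidate sequences in a hash tree, which reduces the number of candidates that must be checked against each data sequence.

2) Transform the representation of each data sequence so that testing whether a candidate is a subsequence of it is efficient.

3. Concrete procedure:

Hash every item of each data sequence in the transaction database to determine which leaf nodes of the hash tree hold candidate k-sequences that need to be examined. For each candidate k-sequence in those leaf nodes, check whether it is contained in the data sequence, and increment the count of every candidate that is.

How do we check whether a data sequence d contains a candidate k-sequence s? In two phases, a forward phase and a backward phase: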

 

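Example: suppose maxgap = 30, mingap = 5 and ws = 0; check whether the candidate sequence s = <(1,2)(3)(4)> is contained in the following data sequence.

1) First find the first occurrence of the first element (1,2) of s in the data sequence; its time is 10;

2) Since mingap = 5, look for the next element (3) after time 15; its first occurrence is at time 45, and since 45 - 10 > 30, switch to the backward phase;

3) Search again for (1,2), now from time 15 (45 - 30) onward, and find it at time 50; then look for (3) after time 55 and find it at time 65. Since 65 - 50 < 30, the maximum gap constraint is satisfied and the forward phase continues;

4) Look for the first occurrence of the next element, (4), after time 70 (65 + 5); it is at time 90, and since 90 - 65 < 30, the constraint is satisfied;

5) The check ends: s is contained in the data sequence.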
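The walkthrough can be traced with a simplified forward/backward check. This sketch only covers ws = 0 (each element must match a single timestamp), and the data sequence is an assumption, because the original figure is missing; it was chosen to be consistent with the times quoted in the steps above. `first_time` and `find_embedding` are illustrative names.

```python
def first_time(data_seq, element, lower, strict=True):
    """Earliest time > lower (or >= lower if not strict) whose events contain element."""
    for time, events in data_seq:
        if (time > lower or (not strict and time == lower)) and set(element) <= events:
            return time
    return None


def find_embedding(data_seq, pattern, mingap, maxgap):
    """Return the matched times of pattern in data_seq, or None if not contained."""
    times = [None] * len(pattern)
    i, lower = 0, float("-inf")
    while i < len(pattern):
        # Forward phase: find element i strictly after `lower`.
        t = first_time(data_seq, pattern[i], lower)
        if t is None:
            return None
        times[i] = t
        if i > 0 and t - times[i - 1] > maxgap:
            # Backward phase: pull up earlier elements until maxgap holds again.
            j = i - 1
            while True:
                pulled = first_time(data_seq, pattern[j],
                                    times[j + 1] - maxgap, strict=False)
                if pulled is None:
                    return None
                times[j] = pulled
                if j > 0 and times[j] - times[j - 1] > maxgap:
                    j -= 1
                else:
                    break
            # Resume the forward phase after the last element pulled up.
            i, lower = j + 1, times[j] + mingap
        else:
            i, lower = i + 1, t + mingap
    return times


# Assumed data sequence: (timestamp, event set) pairs sorted by time.
data_seq = [(10, {1, 2}), (25, {4, 6}), (45, {3}), (50, {1, 2}),
            (65, {3}), (90, {2, 4}), (95, {6})]

print(find_embedding(data_seq, ((1, 2), (3,), (4,)), mingap=5, maxgap=30))  # [50, 65, 90]
```

The returned times 50, 65 and 90 are the positions reached in steps 3) and 4).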


Origin blog.csdn.net/App_12062011/article/details/90341639