Data Algorithms: Hadoop/Spark data processing techniques (5. Moving Average; 6. Market Basket Analysis, MBA, in data mining)

Five. Moving Average

  The average of time series data (observations taken at equal time intervals, such as once an hour or once a day) over a number of consecutive periods is called a moving average. It is called "moving" because, as new time series data arrives, we must keep recomputing the mean: each new value is added and the oldest value is dropped, so the average "moves" accordingly.
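A minimal plain-Java sketch of this definition, assuming a simple (unweighted) moving average over a fixed-size window. The class name `SimpleMovingAverage` and its methods are hypothetical and only illustrate the add-newest, drop-oldest behaviour.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: simple moving average over the last `windowSize` values.
class SimpleMovingAverage {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    SimpleMovingAverage(int windowSize) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be >= 1");
        }
        this.windowSize = windowSize;
    }

    // Add the newest observation and drop the oldest once the window is full.
    void add(double value) {
        sum += value;
        window.addLast(value);
        if (window.size() > windowSize) {
            sum -= window.removeFirst();
        }
    }

    // Mean of the values currently in the window (0.0 if nothing was added yet).
    double average() {
        return window.isEmpty() ? 0.0 : sum / window.size();
    }
}
```

With a window of 3 and the values 10, 20, 30, 40 added in that order, the average after each step is 10, 15, 20 and 30: the last step drops 10 and adds 40.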

  example:

            

Java code:

MapReduce (MR) program:

  Scheme 1: For each reducer key, sort the time series data in memory (RAM). The problem with this method is that if a reducer does not have enough RAM to complete the sort, the approach is not feasible.

  Scheme 2: Let the MapReduce framework sort the time series data (a core feature of a MapReduce framework is sorting and grouping by key, and Hadoop does this very well). Compared with Scheme 1, this scheme scales much better, because the sorting is done by the framework's sort and shuffle phase. If this scheme is adopted, we need to change the key-value pairs and write some custom plug-in classes to implement a secondary sort.

  

  Scheme 1: the map() function parses each input record and emits it directly under its key. reduce() sorts the data that share the same key and then computes the moving average over the window.
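Scheme 1 can be sketched without Hadoop as a plain-Java method that plays the role of reduce() for a single key. This is an illustration under assumptions, not the original MR program: `InMemorySortReducer`, `Point` and `windowSize` are hypothetical names, and early positions average over however many values are available so far.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the reducer of Scheme 1 (no Hadoop classes involved).
class InMemorySortReducer {
    // One observation of a time series.
    static final class Point {
        final long timestamp;
        final double value;

        Point(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Mimics reduce(key, values): sort all values in memory, then slide the window.
    static List<Double> reduce(List<Point> values, int windowSize) {
        List<Point> sorted = new ArrayList<>(values); // the whole series is held in RAM
        sorted.sort(Comparator.comparingLong((Point p) -> p.timestamp));

        List<Double> averages = new ArrayList<>();
        double sum = 0.0;
        for (int i = 0; i < sorted.size(); i++) {
            sum += sorted.get(i).value;
            if (i >= windowSize) {
                sum -= sorted.get(i - windowSize).value; // drop the oldest value
            }
            int count = Math.min(i + 1, windowSize);
            averages.add(sum / count);
        }
        return averages;
    }
}
```

The copy into `sorted` is exactly where the memory limit of Scheme 1 shows up: all values of one key must fit in the reducer's RAM at once.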

   Scheme 2: a secondary sort must be provided. The partitioner decides, from the mapper's output key, which reducer each mapper output is sent to. Normally different keys fall into different groups, but here we may want different keys to land in the same group; in that case a grouping comparator is used to group the mapper output. A sort comparator is used to order the mapper output keys during the sort phase.
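The three plug-in pieces of Scheme 2 (partitioner, sort comparator, grouping comparator) can be imitated in plain Java to show how they cooperate. This is only a simulation under assumptions: it does not use Hadoop's classes, and `SecondarySortSketch`, `CompositeKey`, `Record`, `partition` and `shuffle` are hypothetical names. The composite key is assumed to be (series name, timestamp).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java simulation of a secondary sort (no Hadoop classes).
class SecondarySortSketch {
    // Composite key: the natural key (name) plus the field to sort by (timestamp).
    static final class CompositeKey {
        final String name;
        final long timestamp;

        CompositeKey(String name, long timestamp) {
            this.name = name;
            this.timestamp = timestamp;
        }
    }

    static final class Record {
        final CompositeKey key;
        final double value;

        Record(CompositeKey key, double value) {
            this.key = key;
            this.value = value;
        }
    }

    // Partitioner: only the natural key decides which reducer receives the record.
    static int partition(CompositeKey key, int numReducers) {
        return (key.name.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sort comparator: natural key first, then timestamp.
    static final Comparator<CompositeKey> SORT =
        Comparator.comparing((CompositeKey k) -> k.name).thenComparingLong(k -> k.timestamp);

    // Grouping comparator: keys with the same name belong to one reduce() call.
    static final Comparator<CompositeKey> GROUP =
        Comparator.comparing((CompositeKey k) -> k.name);

    // Mimics sort and shuffle: sort by SORT, then cut groups wherever GROUP sees a change.
    static Map<String, List<Double>> shuffle(List<Record> mapOutput) {
        List<Record> sorted = new ArrayList<>(mapOutput);
        sorted.sort((a, b) -> SORT.compare(a.key, b.key));

        Map<String, List<Double>> groups = new LinkedHashMap<>();
        CompositeKey current = null;
        for (Record r : sorted) {
            if (current == null || GROUP.compare(current, r.key) != 0) {
                current = r.key;
                groups.put(current.name, new ArrayList<>());
            }
            groups.get(current.name).add(r.value); // values arrive in time order
        }
        return groups;
    }
}
```

Because the values of each group already arrive ordered by timestamp, the reducer can feed them straight into a sliding window without sorting anything in memory.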

         

    

 

Six. Market Basket Analysis (MBA)

  MBA can reveal affinities between different products or product groups. The general goal of data mining is to extract interesting, correlated information from large data sets, for example millions of supermarket transactions. MBA helps us identify goods that are likely to be bought together, and association rule mining finds correlations between the items in a set of transactions. These association rules can then be used to place related goods next to each other on store shelves or online. This is a computationally intensive problem, which makes it a good fit for MapReduce.

  1. An MR solution for tuples of order N, which can be used to find frequent patterns.

  2. A Spark solution, which not only finds frequent patterns but also generates association rules for them.

  

  In data mining, association rules have two metrics.
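The two metrics are not spelled out in the text here; assuming they are the usual pair, support and confidence, they can be sketched in plain Java as follows. `RuleMetrics`, `support` and `confidence` are hypothetical names, and each transaction is assumed to be a set of item names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, assuming the two metrics are support and confidence.
class RuleMetrics {
    // Support: fraction of transactions that contain every item of the itemset.
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        if (transactions.isEmpty()) {
            return 0.0;
        }
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    // Confidence of the rule lhs -> rhs: support(lhs union rhs) / support(lhs).
    static double confidence(List<Set<String>> transactions, Set<String> lhs, Set<String> rhs) {
        Set<String> both = new HashSet<>(lhs);
        both.addAll(rhs);
        double lhsSupport = support(transactions, lhs);
        return lhsSupport == 0.0 ? 0.0 : support(transactions, both) / lhsSupport;
    }
}
```

For the four transactions {bread, milk}, {bread, butter}, {milk}, {bread, milk, butter}, the support of {bread, milk} is 2/4 = 0.5 and the confidence of bread -> milk is 0.5 / 0.75, about 0.67.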

    

  

   1. MR solution: generates the frequent patterns.

    Main algorithm: map -> reduce
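The map -> reduce flow for frequent patterns can be sketched in plain Java. This is a simulation under assumptions rather than the original MR job: map() is assumed to emit every sorted N-item combination of a transaction with a count of 1, and reduce() is assumed to sum those counts. `FrequentPatternsMR`, `map` and `run` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical plain-Java simulation of the map -> reduce frequent-pattern job.
class FrequentPatternsMR {
    // map(): emit every n-item combination of one transaction.
    static List<List<String>> map(List<String> transaction, int n) {
        // Sort and de-duplicate so that (a,b) and (b,a) become the same key.
        List<String> items = new ArrayList<>(new TreeSet<>(transaction));
        List<List<String>> out = new ArrayList<>();
        combine(items, n, 0, new ArrayList<>(), out);
        return out;
    }

    // Recursively build combinations of size n from items[start..].
    private static void combine(List<String> items, int n, int start,
                                List<String> current, List<List<String>> out) {
        if (current.size() == n) {
            out.add(new ArrayList<>(current));
            return;
        }
        for (int i = start; i < items.size(); i++) {
            current.add(items.get(i));
            combine(items, n, i + 1, current, out);
            current.remove(current.size() - 1);
        }
    }

    // reduce(): sum the 1s emitted for each pattern (the key is the joined item list).
    static Map<String, Integer> run(List<List<String>> transactions, int n) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> transaction : transactions) {
            for (List<String> pattern : map(transaction, n)) {
                counts.merge(String.join(",", pattern), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For the transactions [a, b, c], [b, a] and [c, a] with N = 2, the counts are a,b = 2, a,c = 2 and b,c = 1.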

        

    

  2. The Spark solution generates not only the frequent patterns but also the rules.

    Workflow:

      

    The first MR step in the workflow (that is, generating the frequent patterns), and the second MR step:

                           

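  The second MR step is harder to understand:

    For the output of map (that is, generating all sub-patterns of each frequent pattern):

           Rule for generating sub-patterns: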
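The sub-pattern rule itself is not reproduced in the text. A common way to do it, assumed here, is to emit each frequent pattern once under its own key with a null tag, and once under every sub-pattern obtained by removing a single item. The plain-Java sketch below illustrates that assumption; `SubPatterns`, `Tagged` and `expand` are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the map side of the second step (plain Java, no Spark).
class SubPatterns {
    // A (pattern, count) value; pattern == null means "count of the key itself".
    static final class Tagged {
        final List<String> pattern;
        final int count;

        Tagged(List<String> pattern, int count) {
            this.pattern = pattern;
            this.count = count;
        }
    }

    // For one frequent pattern, emit (pattern, (null, count)) and, for every
    // item removed, (subPattern, (pattern, count)).
    static List<Map.Entry<List<String>, Tagged>> expand(List<String> pattern, int count) {
        List<Map.Entry<List<String>, Tagged>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(pattern, new Tagged(null, count)));
        if (pattern.size() == 1) {
            return out; // a single item has no non-empty sub-pattern
        }
        for (int i = 0; i < pattern.size(); i++) {
            List<String> sub = new ArrayList<>(pattern);
            sub.remove(i); // remove by index: drop the i-th item
            out.add(new SimpleEntry<>(sub, new Tagged(pattern, count)));
        }
        return out;
    }
}
```

For the pattern [a, b, c] with count 2 this yields four pairs: [a, b, c] with a null tag, plus [b, c], [a, c] and [a, b], each tagged with ([a, b, c], 2).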

              

                               

    Then groupByKey():

      

    Then generate the rules:
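A plain-Java sketch of the grouping and rule-generation steps, with ordinary collections standing in for Spark's RDD operations. It assumes that every frequent pattern was emitted once under its own key with a null tag and once under each of its sub-patterns, and that the confidence of a rule is the count of the larger pattern divided by the count of the key. `RuleGeneration`, `Tagged`, `Rule`, `groupByKey` (a local method, not Spark's) and `rules` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group tagged pairs by key, then derive rules per key.
class RuleGeneration {
    // pattern == null means "count of the key itself".
    static final class Tagged {
        final List<String> pattern;
        final int count;

        Tagged(List<String> pattern, int count) {
            this.pattern = pattern;
            this.count = count;
        }
    }

    static final class Rule {
        final List<String> lhs;
        final List<String> rhs;
        final double confidence;

        Rule(List<String> lhs, List<String> rhs, double confidence) {
            this.lhs = lhs;
            this.rhs = rhs;
            this.confidence = confidence;
        }
    }

    // Local imitation of groupByKey(): collect all values that share a key.
    static Map<List<String>, List<Tagged>> groupByKey(List<Map.Entry<List<String>, Tagged>> pairs) {
        Map<List<String>, List<Tagged>> grouped = new LinkedHashMap<>();
        for (Map.Entry<List<String>, Tagged> e : pairs) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    // For one key: rule "key -> (largerPattern minus key)" for every tagged larger pattern.
    static List<Rule> rules(List<String> key, List<Tagged> values) {
        int keyCount = 0;
        for (Tagged t : values) {
            if (t.pattern == null) {
                keyCount = t.count;
            }
        }
        List<Rule> out = new ArrayList<>();
        if (keyCount == 0) {
            return out; // the key itself was never counted, so no confidence can be computed
        }
        for (Tagged t : values) {
            if (t.pattern == null) {
                continue;
            }
            List<String> rhs = new ArrayList<>(t.pattern);
            rhs.removeAll(key);
            out.add(new Rule(key, rhs, (double) t.count / keyCount));
        }
        return out;
    }
}
```

For example, if [a] occurs 3 times and [a, b] occurs 2 times, the key [a] groups the values (null, 3) and ([a, b], 2), which gives the rule [a] -> [b] with confidence 2/3.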

      

 

    The code for generating the rules is:

      

      

 

 


Origin www.cnblogs.com/dhName/p/11364106.html