What is Mapsidejoin? The most detailed application guide is here

We know that preparing the data is the first step of data analysis, so in the previous lesson we introduced metadata. Today's article introduces a Yonghong application for combined data sets over large data volumes: Mapsidejoin.

What is Mapsidejoin? Taken literally, Mapsidejoin is a join (combination) performed on the Map (M) node side. Before looking at Mapsidejoin, we need to understand the roles of the four node types in the Yonghong product (C, N, M, R) and the MapReduce model. With the MapReduce model in hand, we can compare Mapsidejoin with Reducesidejoin and see the advantages of Mapsidejoin when combining large data sets.

Introduction to Yonghong cluster nodes

Client Node - the C node is the client access node; customers submit tasks by accessing the C node.

Naming Node - the N node is the brain of the cluster: besides monitoring the other nodes in the cluster, it collects the tasks customers submit through the C node and assigns them, among other duties.

Map Node - the M node is the node that stores data files.

Reduce Node - the R node performs the summary (aggregation) calculations.

Introduction to the MapReduce model

Baidu Encyclopedia's definition of MapReduce is fairly comprehensive. A brief recap: MapReduce is a cluster-based computing platform, a distributed computing framework that simplifies programming, and a programming model that abstracts distributed computation into two stages, Map and Reduce. Yonghong uses the MapReduce model when computing combined data sets.

Application scenario: a distributed cluster with multiple M nodes, combining large tables, including large table join small table and large table join large table.

1. Why use Mapsidejoin
In the MapReduce model, a combination (join) computation can be done in two ways: Map-side join and Reduce-side join. A brief example follows:

Suppose we have two tables: Table 1, the person table, is the large table, and Table 2, the area table, is the small table, as shown below:

[Figure: Table 1 (person table) and Table 2 (area table)]

We want to connect Name in Table 1 with Address in Table 2 by id, so we use id as the join column and do an inner join. The matching ids are id = 1, id = 2, id = 3, and id = 4.
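As a minimal sketch of the inner join described above, the following Python builds the two tables and joins them on id. The original tables are only shown as an image, so the row values and the helper name `inner_join` here are hypothetical.

```python
# Hypothetical sample rows; the real table contents are not given in the text.
person = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cid"},
    {"id": 4, "name": "Dee"},
    {"id": 5, "name": "Eve"},  # no matching area row, so the inner join drops it
]
area = [
    {"id": 1, "address": "Beijing"},
    {"id": 2, "address": "Shanghai"},
    {"id": 3, "address": "Tianjin"},
    {"id": 4, "address": "Chengdu"},
]


def inner_join(left, right, key):
    """Inner join two lists of dict rows on the column named by `key`."""
    # Index the right table by join key so each left row is a dictionary lookup.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


result = inner_join(person, area, "id")
for row in result:
    print(row["id"], row["name"], row["address"])
```

Only the rows whose id appears in both tables survive, which is the inner join behaviour the rest of the article relies on.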

Suppose our cluster has two Map nodes, Map1 and Map2. After Table 1 and Table 2 are imported into the data mart and split for storage, either of the following cases can occur:

► Case 1: Mapsidejoin is possible

[Figure: Case 1, matching ids of Table 1 and Table 2 stored on the same Map node]
As shown above, after Table 1 and Table 2 are split, the rows for join column id = 1 to 4 are stored on the same node in both tables. At join time, each Map node checks whether the rows matching the join column are all present locally. If they are, the join is performed on the Map node, the Map node sends the join result to the Reduce node, and the Reduce node computes the final summary.

► Case 2: Mapsidejoin is not possible, so Reducesidejoin is performed

[Figure: Case 2, matching ids of Table 1 and Table 2 stored on different Map nodes]
As shown in the figure, after splitting, the id = 1, 2 rows of Table 1 are stored on the Map1 node, while the id = 1, 2 rows of Table 2 are stored on the Map2 node. At join time, the Map nodes detect that the matching data are not on the same node, so the Reduce node fetches all of the data and performs the join over the full data set.

The two cases above give a simple description of Mapsidejoin and Reducesidejoin.
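The check that separates the two cases can be sketched in a few lines of Python. This is an illustrative model, not Yonghong's implementation: the function names and the placement dictionaries are hypothetical, and it assumes that a node listed as holding an id of Table 2 holds every Table 2 row for that id.

```python
def locate(placement):
    """Map each join-key value to the set of nodes that store it."""
    where = {}
    for node, ids in placement.items():
        for key_value in ids:
            where.setdefault(key_value, set()).add(node)
    return where


def can_map_side_join(table1_placement, table2_placement):
    """True when every id shared by both tables is co-located on the Map nodes."""
    where1 = locate(table1_placement)
    where2 = locate(table2_placement)
    shared_ids = where1.keys() & where2.keys()
    # Each node holding Table 1 rows for an id must also hold the Table 2 rows.
    return all(where1[i] <= where2[i] for i in shared_ids)


# Case 1: matching ids sit on the same node, so the join can run on the Map side.
case1 = can_map_side_join(
    {"Map1": {1, 2}, "Map2": {3, 4}},
    {"Map1": {1, 2}, "Map2": {3, 4}},
)

# Case 2: ids 1 and 2 of Table 1 are on Map1 but those of Table 2 are on Map2.
case2 = can_map_side_join(
    {"Map1": {1, 2}, "Map2": {3, 4}},
    {"Map1": {3, 4}, "Map2": {1, 2}},
)

print("Case 1:", "Mapsidejoin" if case1 else "Reducesidejoin")
print("Case 2:", "Mapsidejoin" if case2 else "Reducesidejoin")
```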

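2. Advantages of Mapsidejoin and Reducesidejoin

The advantage of a Map-side join is that it can filter out, ahead of time, the large amount of data that the join would exclude, which reduces data transfer. Mapsidejoin is therefore suited to joins over large data volumes.
The advantage of a Reduce-side join is flexibility; the disadvantages are that it requires a large amount of data transfer and that the whole join process is relatively time-consuming. Reducesidejoin is therefore suited to small data volumes.
In addition, joins are very resource-intensive when the data volume is huge. For non-Mapsidejoin forms, whether a direct database connection that pushes the join down to the database or a Reducesidejoin in the data mart, the join puts great pressure on the nodes. This easily makes the product sluggish and, in more serious cases, causes OOM errors, crashes, and so on. We use Mapsidejoin to avoid this. When the data volume is large, we can deploy multiple M nodes, first import the data into the data mart so that it is stored across the M nodes of the cluster, and then compute on the M nodes to achieve Mapsidejoin. This spreads the pressure that would otherwise fall on the C and R nodes evenly across the M nodes, relieves the load that large joins can bring, and makes resource use more efficient.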
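To make the data-transfer argument concrete, here is a rough counting model in Python. It is a sketch under stated assumptions, not a measurement: the node layout, the row values, and both function names are hypothetical, and it assumes ids are unique within Table 2 so that each Table 1 row yields at most one joined row.

```python
def rows(ids):
    """Build minimal rows that carry only the join column."""
    return [{"id": i} for i in ids]


# Hypothetical layout: each Map node holds a slice of Table 1 (t1) and Table 2 (t2).
nodes = [
    {"t1": rows([1, 1, 2, 9]), "t2": rows([1, 2])},
    {"t1": rows([3, 8, 8]), "t2": rows([3, 4])},
]


def rows_shipped_reduce_side(nodes):
    """Reduce-side join: every stored row of both tables travels to the Reduce node."""
    return sum(len(node["t1"]) + len(node["t2"]) for node in nodes)


def rows_shipped_map_side(nodes, key="id"):
    """Map-side join: only rows that survive the local join leave the Map node."""
    shipped = 0
    for node in nodes:
        local_keys = {row[key] for row in node["t2"]}
        shipped += sum(1 for row in node["t1"] if row[key] in local_keys)
    return shipped


reduce_side = rows_shipped_reduce_side(nodes)
map_side = rows_shipped_map_side(nodes)
print("rows sent to the Reduce node, Reducesidejoin:", reduce_side)
print("rows sent to the Reduce node, Mapsidejoin:", map_side)
```

In this toy layout the Table 1 rows with ids 8 and 9 never leave their Map node, because the local join already excludes them.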

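So how do we implement Mapsidejoin? How do we guarantee that, after the data are split, the rows matching on the join column are always stored on the same Map node? Below we introduce the two ways Yonghong implements Mapsidejoin.

Two forms of Yonghong Mapsidejoin

Fact table - dimension table

Applicable scenario (large table join small table)

In a distributed system, when star-shaped data (one large table and several small tables) need to be joined, the small tables can be copied to every Map node and a Mapsidejoin performed there, with no need to do the join on the Reduce node. This improves the efficiency of the table join.

In the MPP data mart, we import the large table with an ordinary incremental import, and for every small table we tick the dimension table option during incremental import, as shown below: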

[Figure: ticking the dimension table option during incremental import]

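A small table imported with the dimension table option ticked is then generated in full on every Map node.

Take Table 1 (person table) and Table 2 (area table) above as an example: Table 1 is imported incrementally and split as usual, while Table 2 is imported incrementally as a dimension table.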

[Figure: Table 1 split across the M nodes, Table 2 stored in full on every M node]

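As shown in the figure, because Table 2 is stored in full on every M node, the rows of Table 1 and Table 2 with matching ids can always be found on the same M node.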
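The fact table - dimension table form corresponds to what is commonly called a broadcast join. The following Python is a minimal sketch of the idea rather than Yonghong's implementation: the round-robin split, the sample rows, and the function names are assumptions for illustration, and ids are assumed unique in the dimension table.

```python
def split_round_robin(rows, n_nodes):
    """Spread the large (fact) table over the Map nodes, one row at a time."""
    parts = [[] for _ in range(n_nodes)]
    for position, row in enumerate(rows):
        parts[position % n_nodes].append(row)
    return parts


def broadcast_join(fact_rows, dim_rows, key, n_nodes):
    """Join on each node against a full local copy of the small (dimension) table."""
    results = []
    for local_fact in split_round_robin(fact_rows, n_nodes):
        # Every node holds the whole dimension table, so no matching row is remote.
        local_dim = {row[key]: row for row in dim_rows}
        for fact_row in local_fact:
            if fact_row[key] in local_dim:
                results.append({**fact_row, **local_dim[fact_row[key]]})
    return results


# Hypothetical sample rows.
person = [{"id": i, "name": "person" + str(i)} for i in range(1, 5)]
area = [{"id": i, "address": "area" + str(i)} for i in range(1, 5)]

joined = broadcast_join(person, area, "id", n_nodes=2)
print(sorted(row["id"] for row in joined))
```

The cost of this form is visible in the sketch: `local_dim` is rebuilt in full on every node, which is acceptable for a small table but not for a large one.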

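However, the fact table - dimension table form also has limitations. For example, when two or more large tables are joined, one or more of them would have to be stored on every M node. Storing a large table as a dimension table already increases resource consumption, and a large table used as a dimension table cannot be loaded into memory for computation, so Mapsidejoin cannot be used.

For this case, we use a sharding column to support the large table join large table scenario.

Sharding column

Applicable scenario (large table join large table)

Before version 8.5.1, Mapsidejoin was only possible in the dimension table join fact table form. In some user scenarios the tables cannot be joined ahead of time into a wide-table model before being imported into the mart, and the requirements for a Mapsidejoin (or broadcast join) computation are not met either, so the mart needs to support distributed join computation.

Specific scenarios include: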

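1) Business needs, for example: joining after partial aggregation, such as the distribution of product repair batches for products whose sales in a given period exceed a specific value; joining on specific values, such as taking the last record that appears in a given period and joining it to another table; self-joins, such as computing over this month's data joined with last month's; and so on. In these scenarios (usually a snowflake model or something more complex), joining ahead of time inflates the data and produces a great deal of redundancy, whereas in actual use the filter conditions keep the result small;

2) A large fact table needs frequent incremental updates, and joining the full data into a wide table for import into the mart takes too long;

3) In self-service scenarios, whether tables will be joined, and on which fields, is uncertain, so the original detail tables must be kept for self-service queries.

The Mapsidejoin logic for sharding columns is in fact similar to the figure for Case 1 above.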

[Figure: Table 1 and Table 2 sharded by id, with matching ids on the same Map node]

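By importing incrementally with a sharding column, we apply a hash algorithm to the join column of Table 1 and Table 2. This guarantees that, after splitting, the rows of the two tables with the same id are always stored on the same Map node, so the split large tables can be loaded into memory for computation.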
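The sharding-column idea can be sketched in Python as follows. The article does not say which hash algorithm Yonghong uses, so `zlib.crc32` is only a stand-in here, and the function names and sample rows are hypothetical.

```python
import zlib


def node_for(key_value, n_nodes):
    """Pick a Map node from the join-column value (crc32 is a stand-in hash)."""
    # crc32 gives the same result on every run, unlike Python's hash() for str.
    return zlib.crc32(str(key_value).encode("utf-8")) % n_nodes


def shard(rows, key, n_nodes):
    """Split a table so that rows with equal key values land on the same node."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[node_for(row[key], n_nodes)].append(row)
    return parts


def sharded_join(table1, table2, key, n_nodes):
    """Shard both tables on the join column, then join node by node."""
    joined = []
    for local1, local2 in zip(shard(table1, key, n_nodes), shard(table2, key, n_nodes)):
        index = {}
        for row in local2:
            index.setdefault(row[key], []).append(row)
        for left_row in local1:
            for right_row in index.get(left_row[key], []):
                joined.append({**left_row, **right_row})
    return joined


# Hypothetical sample rows.
table1 = [{"id": i, "name": "person" + str(i)} for i in range(1, 5)]
table2 = [{"id": i, "address": "area" + str(i)} for i in range(1, 5)]

joined = sharded_join(table1, table2, "id", n_nodes=2)
print(sorted(row["id"] for row in joined))
```

Because both tables go through the same `node_for` function, neither table has to be copied in full to every node, which is what makes this form usable for two large tables.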

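Steps:

1. Import the large tables to be combined into the mart by incremental import, ticking the sharding column property and choosing the join column as the sharding column. For example, when Table 1 is imported incrementally into the mart, tick id as the sharding column; Table 2 needs the same operation.

2. Build the combined data set from the data mart data sets. When the Map nodes inspect the data, they automatically perform the join as a Mapsidejoin.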

Summary

Always remember the prerequisites for using Mapsidejoin: a distributed cluster with multiple M nodes, and a join over large data sets in the data mart.

Finally, we use one picture to briefly review the two forms of Mapsidejoin.

[Figure: the two forms of Mapsidejoin]
Large table join large table: sharding column
Large table join small table: fact table - dimension table

That concludes our introduction to the applications of Mapsidejoin.


Origin blog.51cto.com/14637453/2464936