Alibaba Cloud's big data tool MaxCompute: using mapjoin to optimize queries

Big Data Computing Service (MaxCompute, formerly ODPS) is a fast, fully managed GB/TB/PB level data warehouse solution.
https://help.aliyun.com/document_detail/27800.html?spm=5176.7840267.6.539.po3IvS
There are three main ways to work with data: SQL, UDF, and MapReduce. Readers who know Hadoop will already be familiar with these.

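The biggest difference between MaxCompute SQL and standard SQL is that in MaxCompute, SQL is parsed into MapReduce jobs for execution. You can of course also write MapReduce directly to compute over the data. A UDF is a function that you write yourself in code, and reference when the SQL runs, for cases where the built-in SQL functions cannot meet the needs of your business computation.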

As you can see, the underlying computation is actually performed by the MapReduce engine, so first let's understand what MapReduce is. When a dataset is large, it is stored on MaxCompute in a distributed manner, that is, split across many servers. When a task is executed, a process is started on the server where the data resides to read the data and perform the computation; another process is then started to aggregate and analyze the results and output them. The former process is called Map and the latter Reduce; together they form a MapReduce task.
When using SQL to manipulate data, join is used frequently. For example, when select * from A a join B b on a.id=b.id is converted into a MapReduce task, it executes as follows:
1. The map tasks read the data and tag the rows of the two tables with different tags so they can be told apart.
2. The reduce side receives the tagged data and puts together rows from the differently tagged tables that have the same value in the join field.
Suppose there are two tables, which we will call the Big table and the Small table for now. The Big table has a relatively large amount of data and is stored in distributed form on n instance servers, while the Small table fits on a single server.
First, MaxCompute starts some Map processes (map tasks) to read and tag the data. The number of map tasks is controlled by a parameter, which will not be explained here. Note that a map task reading the Big table may run on a different server from the data, in which case it has to pull the data from the server where the data resides. One or several map tasks are also started for the Small table to read it from the file system. After the data is read, it is sent to the Reduce side, which receives it and performs the join: rows whose join fields are equal are put together and output, which produces the join result.
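The tag, shuffle, and reduce steps described above can be sketched in a few lines of Python. This is only a toy, single-process simulation under assumed names (map_phase, shuffle, reduce_phase, and the sample rows are all made up for illustration); it is not how MaxCompute is implemented internally.

```python
from collections import defaultdict

def map_phase(rows, tag):
    """Map task: emit (join_key, (tag, row)) so the reducer can tell the tables apart."""
    return [(row["id"], (tag, row)) for row in rows]

def shuffle(pairs):
    """Group tagged records by join key, as the shuffle between map and reduce would."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups

def reduce_phase(groups):
    """Reduce task: for each key, pair every row tagged 'A' with every row tagged 'B'."""
    out = []
    for key in sorted(groups):
        a_rows = [r for t, r in groups[key] if t == "A"]
        b_rows = [r for t, r in groups[key] if t == "B"]
        for a in a_rows:
            for b in b_rows:
                out.append((a, b))
    return out

def reduce_side_join(table_a, table_b):
    """Inner join on 'id': tag both tables, shuffle by key, then join in the reducer."""
    pairs = map_phase(table_a, "A") + map_phase(table_b, "B")
    return reduce_phase(shuffle(pairs))

# Hypothetical sample data for the two tables.
table_a = [{"id": 1, "v": "a1"}, {"id": 2, "v": "a2"}, {"id": 3, "v": "a3"}]
table_b = [{"id": 2, "w": "b2"}, {"id": 3, "w": "b3"}, {"id": 4, "w": "b4"}]
joined = reduce_side_join(table_a, table_b)
```

Note that every row of both tables passes through the shuffle, which is the expensive part when one table has hundreds of millions of rows.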
Let's look at an example. I have prepared a relatively large table, train_user_lt, about 5 GB in size with roughly 700 million rows.
I have also prepared a relatively small table, map_join_test, with only 3 rows.

select a.* from train_user_lt a left outer join map_join_test b on a.user_id = b.user_id;

After executing this SQL, the result is as shown in the figure:
[Figure: logview]
This execution process diagram is unique to MaxCompute and helps users see how a task was executed. It is called logview, a tool for viewing and debugging tasks after an ODPS job is submitted: https://help.aliyun.com/document_detail/27987.html
The figure shows that the job is divided into three parts:
1. The large table train_user_lt starts 39 map tasks to read 707025259 rows.
2. The small table starts one map task to read 3 rows.
3. The reduce phase receives 3+707025259=707025262 rows and outputs 707025259 rows, since a left outer join outputs according to the large table on the left.
But the time consumed was 40 minutes, which is a long time. So how can we optimize this and improve the speed? Is there a more convenient, more direct, brute-force way to optimize?
This brings us to the focus of this article: mapjoin.
MAPJOIN reads all of the small tables into memory, makes multiple copies of them, and distributes the copies to the memory of the instances where the large table's data resides. The large table's data is then matched directly against the in-memory table during the map phase. Because the join is performed in the map phase, the reduce operation is eliminated and efficiency is much higher.
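As a rough illustration of the idea, the following Python sketch simulates a map-side left outer join: the small table is loaded into an in-memory hash table once, and each simulated map task joins its own partition of the large table against that copy. The function names, the partition layout, and the sample rows are assumptions made for this example, not MaxCompute internals.

```python
def build_hash_table(small_rows, key):
    """Load the whole small table into an in-memory dict keyed by the join column."""
    table = {}
    for row in small_rows:
        table.setdefault(row[key], []).append(row)
    return table

def map_task_left_outer(big_partition, small_hash, key):
    """One map task: join its partition of the big table against the in-memory copy.
    No reduce step is needed because every map task holds the full small table."""
    out = []
    for big_row in big_partition:
        matches = small_hash.get(big_row[key])
        if matches:
            out.extend((big_row, m) for m in matches)
        else:
            out.append((big_row, None))  # left outer join keeps unmatched big rows
    return out

# Hypothetical data: the big table is split into two partitions on different servers.
big_partitions = [
    [{"user_id": 1}, {"user_id": 2}],
    [{"user_id": 3}, {"user_id": 1}],
]
small_table = [{"user_id": 1, "tag": "x"}, {"user_id": 3, "tag": "y"}]

small_hash = build_hash_table(small_table, "user_id")
result = []
for partition in big_partitions:
    # Each partition is processed independently; nothing is shuffled between tasks.
    result.extend(map_task_left_outer(partition, small_hash, "user_id"))
```

In this sketch the large table's rows never leave the task that read them; only the small table is copied around.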
It applies when a large table is joined with one or more small tables. SQL loads all of the small tables specified by the user into the memory of the program that performs the join, which speeds up the join. Note the following when using mapjoin in MaxCompute:
The left table of a left outer join must be the large table;
The right table of a right outer join must be the large table;
For an inner join, either the left or the right table can be the large table;
A full outer join cannot use mapjoin;
mapjoin supports subqueries as small tables;
When a mapjoin references a small table or subquery, it must reference it by its alias;
In a mapjoin you can use non-equi joins, or use or to connect multiple conditions;
Currently, MaxCompute supports specifying up to 8 small tables in a mapjoin; otherwise a syntax error is reported;
When mapjoin is used, the total memory occupied by all small tables must not exceed 512MB. Note that because MaxCompute stores data compressed, a small table expands sharply once loaded into memory. The 512MB limit refers to the size after loading into memory;
When multiple tables are joined, the two leftmost tables cannot both be mapjoin tables.
So why must the left table of a left outer join be the large table? Because when the left table is the large table, the data on each instance server holding part of the large table is matched against all the data of the small table, which is already sitting in memory. If the left table were the small table, all the data of the large table would have to be pulled over to match against it, and you can imagine how that would perform.
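Beyond the performance argument, a toy simulation hints at a second problem with putting the small table on the preserved (left) side. Each map task sees only its own partition of the large table, so it cannot tell whether a small-table row has no match anywhere or merely no match locally. The Python sketch below uses made-up function names and data, and describes the general behavior of map-side joins rather than anything confirmed about MaxCompute's internals.

```python
def naive_map_side_left_outer(small_rows, big_partitions, key):
    """Hypothetical (incorrect) plan: the small table is the preserved left side.
    Each map task only sees its own partition of the big table, so an unmatched
    small row is padded with None once per partition where it has no local match."""
    out = []
    for partition in big_partitions:
        for s in small_rows:
            matches = [b for b in partition if b[key] == s[key]]
            if matches:
                out.extend((s, b) for b in matches)
            else:
                out.append((s, None))  # may be spurious: a match can exist elsewhere
    return out

def reference_left_outer(small_rows, big_rows, key):
    """Reference result computed with a global view of the big table."""
    out = []
    for s in small_rows:
        matches = [b for b in big_rows if b[key] == s[key]]
        if matches:
            out.extend((s, b) for b in matches)
        else:
            out.append((s, None))
    return out

# Hypothetical data: user 1 exists only in the first partition of the big table.
small = [{"user_id": 1}]
partitions = [[{"user_id": 1, "n": 10}], [{"user_id": 2, "n": 20}]]
all_big_rows = [row for part in partitions for row in part]

naive = naive_map_side_left_outer(small, partitions, "user_id")
expected = reference_left_outer(small, all_big_rows, "user_id")
```

Here the naive plan emits an extra NULL-padded row from the second partition, while the reference join returns only the single matched row.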
Let's look at the syntax:

select /*+ mapjoin(b) */  a.* from train_user_lt a left outer join map_join_test b on a.user_id = b.user_id;
That is, add a hint to the SQL statement marking it as a mapjoin, and write the small table's alias inside the parentheses.

Looking at the result after optimization, the task now has only two parts. The map side reads the data, joins it directly against the small table in memory, and outputs the result, with one fewer reduce stage. In other words, the join moves from the reduce side to the map side, eliminating the reduce step, hence the name mapjoin.
The execution time is now a little over 1 minute and 20 seconds, compared with 40 minutes before. Of course, my test compares two extreme data sizes, so the effect is especially pronounced. Still, it shows that mapjoin can be used to optimize queries that join a large table with a small table.
So what can mapjoin do besides optimize performance?
MaxCompute SQL does not support complex join conditions, such as non-equi expressions, or, and like, in the on condition of an ordinary join, but these can be used in a mapjoin. For example:

    select /*+ mapjoin(a) */
        a.total_price,
        b.total_price
    from shop a join sale_detail b
    on a.total_price < b.total_price or a.total_price + b.total_price < 500;
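One plausible way to see why a mapjoin can handle such conditions: a reduce-side join groups rows by equal key values, which a condition like the one above does not provide, whereas a map task holding the whole small table in memory can simply test every pair. The Python sketch below simulates one such map task. The function names and the sample prices are assumptions for illustration; only the on condition is taken from the SQL above.

```python
def on_condition(a, b):
    """The on condition from the SQL above, applied to one pair of rows."""
    return (a["total_price"] < b["total_price"]
            or a["total_price"] + b["total_price"] < 500)

def map_task_theta_join(big_partition, small_rows, predicate):
    """One map task: test every big-table row against every in-memory small-table row."""
    out = []
    for b in big_partition:
        for a in small_rows:
            if predicate(a, b):
                out.append((a["total_price"], b["total_price"]))
    return out

# Hypothetical data: shop is the small (in-memory) table, sale_detail the big one.
shop = [{"total_price": 100}, {"total_price": 600}]
sale_detail_partition = [{"total_price": 50}, {"total_price": 700}]

rows = map_task_theta_join(sale_detail_partition, shop, on_condition)
```

The nested loop costs (rows in the partition) x (rows in the small table) comparisons, which is one reason the small table really does need to be small.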

Summary: mapjoin looks like a small change, but it can bring a large improvement in efficiency, and it also handles some business scenarios that need non-equi joins.
As Jack Ma often says:
Small is beautiful, small is powerful!
