Why are ODPS DSHIP imports and JDBC batch writes so fast?

Alibaba Data Cloud ODPS does not accept a traditional INSERT statement with literal values in its command line window. The main reason is that the command line window is implemented too simply and does not parse true INSERT statements.

When a row needs to be inserted into an ODPS table from the command line, a workaround like the following is usually used:

INSERT INTO TABLE DES_TABLE
SELECT C1,C2,C3,....
FROM (
    SELECT COUNT(1) FROM DES_TABLE
) A

Here ODPS parses the SQL, generates a DAG job, and executes it step by step. Inserting data this way typically takes about 5 to 10 seconds per row, which is clearly intolerable when a large amount of data has to be inserted. In that case a file import is needed, for example through DSHIP:

sh ./odps-dship/dship upload -p project_test aa.dat TABLE_AA -s true -fd "," -ni "NULL";
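As an illustration, a data file such as aa.dat could be produced by a small script like the one below. This is only a sketch: it assumes that -fd "," sets the field delimiter and -ni "NULL" sets the marker for null values, as the flags in the command suggest. The function names and sample rows are hypothetical, and values containing the delimiter are rejected because the article does not say how DSHIP handles quoting.

```python
DELIMITER = ","        # assumed to correspond to -fd "," in the dship command
NULL_MARKER = "NULL"   # assumed to correspond to -ni "NULL" in the dship command


def format_row(row):
    """Render one row as a delimited line, writing None as the NULL marker."""
    fields = []
    for value in row:
        text = NULL_MARKER if value is None else str(value)
        # Quoting rules are unknown here, so refuse ambiguous values outright.
        if DELIMITER in text or "\n" in text:
            raise ValueError(f"value {text!r} contains a delimiter or newline")
        fields.append(text)
    return DELIMITER.join(fields)


def render_rows(rows):
    """Render all rows as text, one line per row."""
    return "".join(format_row(row) + "\n" for row in rows)


def write_data_file(path, rows):
    """Write the rendered rows to path (for example "aa.dat")."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(render_rows(rows))
```

Calling write_data_file("aa.dat", [(1, "alice", None), (2, "bob", 3.5)]) would write two lines, 1,alice,NULL and 2,bob,3.5, which could then be passed to the upload command.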

DSHIP import is very fast, usually at least 5,000 rows per second.
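To put the two figures side by side, the arithmetic can be sketched as follows. The 5 to 10 seconds per row and the 5,000 per second come from the text above (treating the latter as rows per second); the row count of 100,000 is an arbitrary example, and the function names are hypothetical.

```python
def sql_insert_seconds(rows, seconds_per_row):
    """Total time when every row is inserted by its own SQL job."""
    return rows * seconds_per_row


def dship_seconds(rows, rows_per_second=5000):
    """Total time for a bulk import at the given throughput."""
    return rows / rows_per_second


if __name__ == "__main__":
    rows = 100_000  # arbitrary example size
    low = sql_insert_seconds(rows, 5)
    high = sql_insert_seconds(rows, 10)
    bulk = dship_seconds(rows)
    print(f"per-row SQL insert: {low / 86400:.1f} to {high / 86400:.1f} days")
    print(f"DSHIP import: about {bulk:.0f} seconds or less")
```

For 100,000 rows this works out to roughly 5.8 to 11.6 days of per-row SQL inserts, against about 20 seconds or less for the DSHIP import.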

Both DSHIP and the ODPS command line window are built on the ODPS API, so why is the gap so large? The DSHIP source code shows that it essentially uses ODPS Tunnel to write data directly into ODPS in batches, serialized with Protobuf. It avoids what happens when SQL is executed in the ODPS command line window, where each statement is parsed into a job and the resulting DAG is executed slowly, step by step.

Like DSHIP, ODPS-JDBC also uses ODPS Tunnel to write data in batches. So if an application needs to insert data into ODPS, the ODPS-JDBC batch write method is recommended; its write efficiency is the same as that of a DSHIP batch import.

So why can ODPS-JDBC implement batch INSERT when the ODPS command line window cannot? It turns out that ODPS requires data to be pre-validated before it is written, so the INSERT is split into two steps: first the INSERT statement is validated, then the data is written to ODPS through Tunnel. That is, the precompiled SQL is submitted first (note: ODPS requires the number of ? placeholders in the INSERT statement to equal the number of columns in the table):

INSERT  INTO  TABLE  DEST_DATE VALUES(?,?,?,?);

Then the data to be inserted is submitted. The ODPS command line does not implement INSERT statements with VALUES clauses. (Actually, it would not be that hard to do.)
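The client-side half of these two steps can be sketched as below. This is not ODPS-JDBC code and does not talk to ODPS. It only shows, under the stated rule that the number of ? must equal the number of columns, how the placeholder statement might be built and how rows might be grouped into batches before submission. In a real JDBC program the corresponding standard calls would be Connection.prepareStatement, PreparedStatement.addBatch and PreparedStatement.executeBatch. The function names and the batch size are hypothetical.

```python
def build_insert_sql(table, column_count):
    """Build an INSERT with one ? placeholder per table column."""
    if column_count < 1:
        raise ValueError("column_count must be at least 1")
    placeholders = ",".join(["?"] * column_count)
    return f"INSERT INTO TABLE {table} VALUES({placeholders});"


def batches(rows, column_count, batch_size):
    """Yield lists of up to batch_size rows, checking each row's width."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for index, row in enumerate(rows):
        # Each row must supply exactly one value per placeholder.
        if len(row) != column_count:
            raise ValueError(
                f"row {index} has {len(row)} values, expected {column_count}"
            )
        batch.append(tuple(row))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch
```

With four columns, build_insert_sql("DEST_DATE", 4) produces the statement shown above, and batches(rows, 4, 1000) would hand over the rows in groups of at most 1,000 for the batch write step.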

Therefore, the ODPS command line window has to fall back on a workaround SQL statement to insert data, which creates the illusion that ODPS writes data very slowly. In fact, JDBC batch insert and DSHIP import, both based on ODPS Tunnel, are very fast.

Hopefully the ODPS command line window will be improved to support data insertion through ODPS Tunnel.
