Doris-04-Data import and export & data backup and recovery

Data import and export

Data import

The Import (Load) function imports the user's raw data into Doris. After a successful import, the user can query the data through the MySQL client. To meet different data import needs, the Doris system provides a number of import methods. Each import method supports different data sources and has a different usage mode (asynchronous or synchronous).

All import methods support csv data format. Broker load also supports parquet and orc data formats.

  • Broker load: accesses and reads external data sources (such as HDFS) through the Broker process and imports them into Doris. The user submits the import job through the MySQL protocol, it is executed asynchronously, and the result is checked with the SHOW LOAD command.
  • Stream load: the user submits a request through the HTTP protocol, carrying the raw data, to create an import. It is mainly used to quickly import local files or data streams into Doris. The import command returns the import result synchronously. Stream Load currently supports two data formats: CSV (text) and JSON.
  • Insert: similar to the INSERT statement in MySQL. Doris provides INSERT INTO tbl SELECT ...; to read data from one Doris table and import it into another, and INSERT INTO tbl VALUES(...); to insert a single row.
  • Multi load: the user submits multiple import jobs through the HTTP protocol. Multi Load guarantees that the multiple import jobs take effect atomically.
  • Routine load: the user submits a routine import job through the MySQL protocol, which starts a resident thread that continuously reads data from a data source (such as Kafka) and imports it into Doris.
  • Direct import through the S3 protocol: the user imports data directly through the S3 protocol; the usage is similar to Broker Load (an asynchronous import method whose supported data sources depend on the Broker process): the user creates the import through the MySQL protocol and checks the result with the view import command.
  • Binlog Load: provides a CDC (Change Data Capture) mechanism that lets Doris incrementally synchronize the user's data update operations from a MySQL database. It relies on canal as an intermediary.
Import method    Supported formats
Broker Load      Parquet, ORC, csv, gzip
Stream Load      csv, gzip, json
Routine Load     csv, json

Broker Load

(1) Applicable scenarios

The source data is in the storage system that Broker can access, such as HDFS. The amount of data ranges from tens to hundreds of GB.

(2) Basic principles

After the user submits the import task, FE will generate the corresponding Plan and distribute the Plan to multiple BEs for execution based on the current number of BEs and file size. Each BE will execute a portion of the imported data.

During execution, each BE pulls data from the Broker, transforms it, and imports it into the system. After all BEs have completed the import, FE finally decides whether the import was successful.

(3) Basic grammar:

LOAD LABEL db_name.label_name 
(data_desc, ...)
WITH BROKER broker_name broker_properties
[PROPERTIES (key1=value1, ... )]
* data_desc:
 DATA INFILE ('file_path', ...)
 [NEGATIVE]
 INTO TABLE tbl_name
 [PARTITION (p1, p2)]
 [COLUMNS TERMINATED BY separator ]
 [(col1, ...)]
 [PRECEDING FILTER predicate]
 [SET (k1=f1(xx), k2=f2(xx))]
 [WHERE predicate]
* broker_properties: 
 (key1=value1, ...)

For the detailed syntax of creating an import, execute HELP BROKER LOAD to view the syntax help. This section mainly introduces the meaning of the parameters and the precautions in the Broker load creation syntax.

  • Label : The identifier of the import task. Each import task has a unique Label within a single database. Label is a user-defined name in the import command. Through this Label, users can view the execution status of the corresponding import task.

    Another function of Label is to prevent users from repeatedly importing the same data. It is strongly recommended that users use the same Label for the same batch of data. In this way, repeated requests for the same batch of data will only be accepted once, ensuring At-Most-Once semantics.

    When the status of the import job corresponding to the Label is CANCELLED, you can use the Label again to submit the import job.

  • Data description parameters : Data description parameters mainly refer to the parameters belonging to the data_desc part in the Broker load creation import statement.

    Each group of data_desc mainly describes the data source address, ETL function, target table and partition and other information involved in this import. The following mainly explains in detail some parameters of the data description class:

    • Multi-table import: Broker load supports one import task involving multiple tables. Each Broker load import task can declare multiple tables in multiple data_descs to implement multi-table import . Each individual data_desc can also specify the data source address belonging to the table. Broker load ensures atomic success or failure between multiple tables imported in a single time.
    • Negative: data_desc can also be set to negate a data import. This function is mainly applicable when all aggregate columns of the data table are of SUM type. If you want to undo a certain batch of imported data, you can import the same batch of data again with the negative parameter; Doris will automatically negate the values of the aggregation columns for this batch, thereby cancelling out the previously imported batch.
    • partition: In data_desc, you can specify the partition information of the table to be imported. If the data to be imported does not belong to the specified partition, it will not be imported. At the same time, data that is not in the specified Partition will be considered as error data.
    • preceding filter predicate: used to filter original data . Raw data is data without column mapping or conversion. Users can filter the data before conversion, select the desired data, and then perform conversion.
    • set column mapping: The SET statement in data_desc is responsible for setting the column function transformation. The column function transformation here supports the equivalent expression transformation of all queries. This attribute is needed if the columns of the original data do not correspond to the columns in the table.
    • where predicate: the WHERE statement in data_desc filters the data that has already been transformed. Data filtered out by the WHERE clause does not count toward the tolerance rate statistics. If WHERE conditions for the same table are declared in multiple data_descs, they are merged, and the merge strategy is AND. A combined sketch of these data_desc options is shown after this parameter list.
  • Import job parameters : Import job parameters mainly refer to the parameters belonging to the opt_properties part in the Broker load creation import statement. Import job parameters are applied to the entire import job.

    The following mainly explains in detail some parameters of the imported job parameters:

    • timeout : The timeout period of the import job (in seconds), the user can set the timeout period for each import in opt_properties. If the import task is not completed within the set timeout, it will be canceled by the system and become CANCELLED. The default import timeout for Broker load is 4 hours.

      Normally, users do not need to manually set the timeout for import tasks. When the import cannot be completed within the default timeout, you can manually set the timeout for the task.

      Recommended timeout: Total file size (MB) / slowest import speed of the user's Doris cluster (MB/s) > timeout > (Total file size (MB) * number of tables to be imported and their related Rollup tables) / (10 * import concurrency). For example, importing a 10240 MB file into one table with two related Rollup tables at a concurrency of 3 gives a lower bound of (10240 * 3) / (10 * 3) ≈ 1024 seconds.

    • max_filter_ratio : The maximum tolerance rate of the imported task, the default is 0 tolerance, the value range is 0~1. When the import error rate exceeds this value, the import fails.

      If the user wants to ignore erroneous rows, he can set this parameter greater than 0 to ensure that the import can be successful.

      The calculation formula is:

      max_filter_ratio = (dpp.abnorm.ALL / (dpp.abnorm.ALL + dpp.norm.ALL ) )
      

      dpp.abnorm.ALL indicates the number of rows whose data quality is unacceptable, for example because the type does not match, the number of columns does not match, or the length does not match.

      dpp.norm.ALL refers to the number of correct data during the import process. You can query the correct data volume of the import task through the SHOW LOAD command.

      Number of lines in the original file = dpp.abnorm.ALL + dpp.norm.ALL

    • exec_mem_limit : Import memory limit. The default is 2GB. The unit is bytes.

    • strict_mode : Broker load import can enable strict mode, by setting PROPERTIES ("strict_mode" = "true"). Strict mode is off by default.

      The strict mode means: strict filtering of column type conversion during the import process. The strict filtering strategy is as follows:

      ①For column type conversion, if strict mode is true, incorrect data will be filtered. The erroneous data here refers to data in which the original data is not null, but the result is null after participating in column type conversion.

      ② When an imported column is generated by function transformation, strict mode has no effect on it.

      ③ If the imported column type contains a range restriction, and the original data can pass the type conversion normally but cannot pass the range restriction, strict mode has no effect on it. For example, if the type is decimal(1,0) and the original value is 10, the value can be converted by type but is outside the declared range; strict mode has no effect on such data.

    • merge_type : The merge type of data. It supports three types: APPEND, DELETE, and MERGE. APPEND is the default value, which means that all the data in this batch need to be appended to the existing data. DELETE means that all rows with the same key as this batch of data are deleted. MERGE semantics need to be used in conjunction with the delete condition, which means that data that meets the delete condition is processed according to DELETE semantics and the rest is processed according to APPEND semantics.
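
As a combined sketch of the data_desc options described above (the table, column and broker names here are hypothetical, and the broker authentication properties depend on your environment), a single load can apply a preceding filter on the raw columns, a SET column mapping, and a WHERE filter on the transformed data, following the syntax skeleton shown earlier:

LOAD LABEL example_db.label_data_desc_demo
(
    DATA INFILE("hdfs://hadoop1:8020/input/file.csv")
    INTO TABLE `example_tbl`
    COLUMNS TERMINATED BY ","
    (c1, c2, c3)
    PRECEDING FILTER c1 > 0
    SET (
        id = c1,
        name = c2,
        score = c3 * 100
    )
    WHERE score > 60
)
WITH BROKER broker_name
(
    "username" = "hdfs_user",
    "password" = "hdfs_passwd"
)
PROPERTIES
(
    "timeout" = "3600",
    "max_filter_ratio" = "0.1"
);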

(4) Import example

  • Create table in Doris

    create table student_result
    (
        id int,
        name varchar(50),
        age int,
        score decimal(10,4)
    )
    DUPLICATE KEY(id)
    DISTRIBUTED BY HASH(id) BUCKETS 10;
    
  • File upload to HDFS

    Start the HDFS-related services and upload the file:

    hadoop fs -put student.csv /
    
  • Import Data

    csv file import:

    LOAD LABEL test_db.student_result
    (
        DATA INFILE("hdfs://my_cluster/student.csv")
        INTO TABLE `student_result`
        COLUMNS TERMINATED BY ","
        FORMAT AS "csv"
        (id, name, age, score)
    )
    WITH BROKER broker_name
    (  # Configuration when HDFS HA is enabled; other HDFS parameters can be specified here
        "dfs.nameservices" = "my_cluster",
        "dfs.ha.namenodes.my_cluster" = "nn1,nn2,nn3",
        "dfs.namenode.rpc-address.my_cluster.nn1" = "hadoop1:8020",
        "dfs.namenode.rpc-address.my_cluster.nn2" = "hadoop2:8020",
        "dfs.namenode.rpc-address.my_cluster.nn3" = "hadoop3:8020",
        "dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
    )
    PROPERTIES
    (
        "timeout" = "3600"
    );
    
  • A general form, using intermediate column names with SET column mapping:

    LOAD LABEL test_db.student_result
    (
        DATA INFILE("hdfs://hadoop1:8020/student.csv")
        INTO TABLE `student_result`
        COLUMNS TERMINATED BY ","
        (c1, c2, c3, c4)
        set(
            id=c1,
            name=c2, 
            age=c3,
            score=c4
        ) )
    WITH BROKER broker_name
    (  # Configuration when HDFS HA is enabled; other HDFS parameters can be specified here
        "dfs.nameservices" = "my_cluster",
        "dfs.ha.namenodes.my_cluster" = "nn1,nn2,nn3",
        "dfs.namenode.rpc-address.my_cluster.nn1" = "hadoop1:8020",
        "dfs.namenode.rpc-address.my_cluster.nn2" = "hadoop2:8020",
        "dfs.namenode.rpc-address.my_cluster.nn3" = "hadoop3:8020",
        "dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
    )
    PROPERTIES
    (
        "timeout" = "3600"
    );
    

(5) View import

Since the Broker load import method is asynchronous, the user must record the Label used when creating the import and use this Label in the view import (SHOW LOAD) command to check the import result. The view import command is common to all import methods.

mysql> show load order by createtime desc limit 1\G
*************************** 1. row ***************************
         JobId: 76391
         Label: label1
         State: FINISHED
      Progress: ETL:N/A; LOAD:100%
          Type: BROKER
       EtlInfo: unselected.rows=4; dpp.abnorm.ALL=15; dpp.norm.ALL=28133376
      TaskInfo: cluster:N/A; timeout(s):10800; max_filter_ratio:5.0E-5
      ErrorMsg: N/A
    CreateTime: 2019-07-27 11:46:42
  EtlStartTime: 2019-07-27 11:46:44
 EtlFinishTime: 2019-07-27 11:46:44
 LoadStartTime: 2019-07-27 11:46:44
LoadFinishTime: 2019-07-27 11:50:16
           URL: http://192.168.1.1:8040/api/_load_error_log?file=__shard_4/error_log_insert_stmt_4bb00753932c491a-a6da6e2725415317_4bb00753932c491a_a6da6e2725415317
    JobDetails: {"Unfinished backends":{"9c3441027ff948a0-8287923329a2b6a7":[10002]},"ScannedRows":2390016,"TaskNumber":1,"All backends":{"9c3441027ff948a0-8287923329a2b6a7":[10002]},"FileNumber":1,"FileSize":1073741824}

The following mainly introduces the meaning of parameters in the result set returned by the import command:

  • JobId: The unique ID of the import task. The JobId of each import task is different and automatically generated by the system. Unlike Label, JobId will never be the same, and Label can be reused after the import job fails.

  • Label: The identification of the import task.

  • State: The current stage of the import task. During the Broker load import process, the two import states of PENDING and LOADING will mainly appear. If the Broker load is in the PENDING state, it means that the current import task is waiting to be executed; the LOADING state means that it is being executed.

    There are two final stages of the import task: CANCELLED and FINISHED . When the Load job is in these two stages, the import is complete. Among them, CANCELLED means that the import has failed, and FINISHED means that the import has succeeded.

  • Progress: Progress description of the import task. There are two progresses: ETL and LOAD, which correspond to the two stages of the import process, ETL and LOADING. Currently, the Broker load only has the LOADING stage, so the ETL will always be displayed as N/A.

    The progress range of LOAD is: 0~100%.

    LOAD progress = number of tables that have finished importing / total number of tables involved in this import task * 100%

    If all imported tables have been imported, the LOAD progress will be 99% and the import will enter the final effective stage. The LOAD progress will not change to 100% until the entire import is completed.

    Import progress is not linear. So if the progress has not changed for a period of time, it does not mean that the import is not being executed.

  • Type: Type of import task. The type value of Broker load is only BROKER.

  • EtlInfo: Mainly displays the imported data volume indicators unselected.rows, dpp.norm.ALL and dpp.abnorm.ALL. From the first value, users can judge how many rows were filtered out by the WHERE condition; the latter two indicators are used to check whether the error rate of the current import task exceeds max_filter_ratio. The sum of the three indicators is the total number of rows in the original data.

  • TaskInfo: mainly displays the current import task parameters, that is, the import task parameters specified by the user when creating the Broker load import task, including: cluster, timeout and max_filter_ratio.

  • ErrorMsg: When the status of the import task is CANCELLED, the reason for the failure will be displayed. The display is divided into two parts: type and msg. If the import task is successful, N/A will be displayed.

    The value meaning of type:

    USER_CANCEL: task canceled by the user

    ETL_RUN_FAIL: Import task failed during ETL stage

    ETL_QUALITY_UNSATISFIED: The data quality is unqualified, that is, the error data rate exceeds max_filter_ratio

    LOAD_RUN_FAIL: Import tasks that failed during the LOADING phase

    TIMEOUT: The import task was not completed within the timeout period

    UNKNOWN: unknown import error

  • The values of CreateTime/EtlStartTime/EtlFinishTime/LoadStartTime/LoadFinishTime represent, respectively, the creation time of the import, the start time of the ETL phase, the completion time of the ETL phase, the start time of the LOADING phase, and the completion time of the entire import task.

    Since there is no ETL stage in the Broker load import, its EtlStartTime, EtlFinishTime, and LoadStartTime are set to the same value.

    If the import task stays at CreateTime for a long time, and the LoadStartTime is N/A, it means that the import task is seriously accumulated at present. Users can reduce the frequency of import submissions.

    LoadFinishTime - CreateTime = time taken by the entire import task

    LoadFinishTime - LoadStartTime = Entire Broker load import task execution time = time consumed by the entire import task - import task waiting time

  • URL: The error data sample of the import task. You can access the error data sample of this import by accessing the URL address. When there is no error data in this import, the URL field is N/A.

  • JobDetails: Displays the detailed running status of some jobs. Including the number of imported files, the total size (bytes), the number of subtasks, the number of processed original lines, the ID of the BE node running the subtask, and the ID of the unfinished BE node.

    The number of raw rows processed, updated every 5 seconds. This number of rows is only used to show the current progress and does not represent the final actual number of rows processed. The actual number of rows processed is subject to what is displayed in EtlInfo.

(6) Cancel import

When the Broker load job status is not CANCELLED or FINISHED, it can be canceled manually by the user. When canceling, you need to specify the Label of the import task to be canceled. You can execute HELP CANCEL LOAD to view the cancel import command syntax.

CANCEL LOAD
[FROM db_name]
WHERE LABEL="load_label";
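
For example, assuming the import created above with label student_result is still running, it could be cancelled like this:

CANCEL LOAD
FROM test_db
WHERE LABEL = "student_result";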

Stream Load

Stream load is a synchronous import method. Users import local files or data streams into Doris by sending a request over the HTTP protocol. Stream load executes the import synchronously and returns the import result, so the user can directly judge whether the import succeeded from the response body of the request.

(1) Applicable scenarios

Stream load is mainly suitable for importing local files or importing data in data streams through programs.

Currently Stream Load supports two data formats: CSV (text) and JSON .

(2) Basic principles

The following figure shows the main process of Stream load, omitting some import details.

In Stream load, Doris will select a node as the Coordinator node. This node is responsible for receiving data and distributing data to other data nodes.

Users submit import commands via the HTTP protocol. If it is submitted to FE, FE will forward the request to a certain BE through the HTTP redirect command. Users can also directly submit import commands to a specified BE.

The final result of the import is returned to the user by Coordinator BE.
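
As a minimal sketch of both submission paths (host names are placeholders; 8030 and 8040 are the default FE http_port and BE webserver_port, which may differ in your deployment):

# Submit to FE, which redirects the request to a Coordinator BE
curl --location-trusted -u user:passwd -T data.csv -XPUT http://fe_host:8030/api/db_name/tbl_name/_stream_load

# Or submit directly to a specific BE
curl --location-trusted -u user:passwd -T data.csv -XPUT http://be_host:8040/api/db_name/tbl_name/_stream_load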

(3) Basic grammar

Stream load submits and transmits data via HTTP protocol. Here we show how to submit the import through the curl command.

Users can also operate through other HTTP clients:

curl --location-trusted -u user:passwd [-H ""...] -T data.file -XPUT http://fe_host:http_port/api/{db}/{table}/_stream_load

The detailed syntax of creating an import can be viewed by executing HELP STREAM LOAD. The following mainly introduces the meaning of some parameters of creating a Stream load.

  • Signature parameters : user/passwd

    Because Stream load creates imports over the HTTP protocol, it is signed through HTTP Basic access authentication. The Doris system verifies user identity and import permissions based on the signature.

  • Import task parameters

    Since Stream load uses the HTTP protocol, all parameters related to the import task are set in the Header . The format is: -H "key1:value1". The following mainly introduces the meaning of some parameters of the Stream load import task parameters.

    • Label: the identification of the import task

    • column_separator: used to specify the column separator in the imported file, the default is \t. If it is an invisible character, you need to add \x as a prefix, and use hexadecimal to represent the separator.

      For example, the separator \x01 of a Hive file needs to be specified as -H "column_separator:\x01". A combination of several characters can also be used as the column separator.

    • line_delimiter: used to specify the newline character in the imported file, the default is \n. You can use a combination of multiple characters as line breaks.

    • max_filter_ratio: the maximum tolerance rate of the import task

    • where: filter conditions specified by the import task. Stream load supports filtering original data by specifying where statements. The filtered data will not be imported and will not participate in the calculation of filter ratio, but will be counted in num_rows_unselected.

    • partition: Partition information of the table to be imported. If the data to be imported does not belong to the specified Partition, it will not be imported. These data will be counted in dpp.abnorm.ALL.

    • Columns: Function transformation configuration of the data to be imported. Currently, the function transformation methods supported by Stream load include column order changes and expression transformation. The expression transformation method is consistent with the query statement.

      Example of column order transformation: The original data has three columns (src_c1, src_c2, src_c3), and the current doris table also has three columns (dst_c1, dst_c2, dst_c3). If the src_c1 column of the original table corresponds to the dst_c1 column of the target table, the src_c2 column of the original table corresponds to the dst_c2 column of the target table, and the src_c3 column of the original table corresponds to the dst_c3 column of the target table, the writing method is as follows:

      columns: dst_c1, dst_c2, dst_c3

      If the src_c1 column of the original table corresponds to the dst_c2 column of the target table, the src_c2 column of the original table corresponds to the dst_c3 column of the target table, and the src_c3 column of the original table corresponds to the dst_c1 column of the target table, the writing method is as follows:

      columns: dst_c2, dst_c3, dst_c1

      Expression transformation example: The original file has two columns, and the target table also has two columns (c1, c2). However, both columns of the original file need to be transformed by functions to correspond to the two columns of the target table. The writing is as follows:

      columns: tmp_c1, tmp_c2, c1 = year(tmp_c1), c2 = month(tmp_c2)

      Among them, tmp_* is a placeholder, representing the two original columns in the original file.

    • exec_mem_limit: Import memory limit. The default is 2GB, the unit is bytes.

    • strict_mode: whether to enable strict mode; the semantics are the same as the strict_mode parameter described for Broker load above, and it is off by default.

    • two_phase_commit : Stream load can enable two-phase transaction commit mode by declaring two_phase_commit=true in the HEADER. Two-phase commit is off by default.

      This mode means: during the Stream load process, a result is returned to the user once the data has been written, but the data is not yet visible at that point and the transaction status is PRECOMMITTED. The data becomes visible only after the user manually triggers a commit. Users can call the following interface to trigger the commit operation on the stream load transaction:

      curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:commit" http://fe_host:http_port/api/{db}/_stream_load_2pc
      curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:commit" http://be_host:webserver_port/api/{db}/_stream_load_2pc
      

      Users can call the following interface to trigger the abort operation on the stream load transaction:

      curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:abort" http://fe_host:http_port/api/{db}/_stream_load_2pc
      curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:abort" http://be_host:webserver_port/api/{db}/_stream_load_2pc
      

(4) Import example

curl --location-trusted -u root -H "label:123" -H "column_separator:," -T student.csv -X PUT http://hadoop1:8030/api/test_db/student_result/_stream_load

Since Stream load is a synchronous import method, the import results will be directly returned to the user by creating the import return value.

Note: Since Stream load is a synchronous import method, the import information is not recorded in the Doris system, so users cannot view Stream load jobs asynchronously with the view import command. To obtain the import result, you need to examine the return value of the create-import request.
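
For reference, the return body is a JSON object roughly like the following; the exact fields depend on the Doris version, and the values here are illustrative only:

{
    "TxnId": 1003,
    "Label": "123",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 1000,
    "NumberLoadedRows": 1000,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 40888,
    "LoadTimeMs": 2144
}

If the import fails or rows are filtered, the response typically also carries an ErrorURL field pointing to samples of the error data.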

(5) Cancel import

Users cannot manually cancel a Stream load. A Stream load that times out or encounters an import error is automatically cancelled by the system.

Routine Load

The Routine Load function provides users with a function to automatically import data from specified data sources.

(1) Applicable scenarios

Currently only routine import from Kafka is supported. Usage restrictions:

  • Supports Kafka access without authentication and Kafka cluster authenticated through SSL.
  • Supported message formats are csv, json text formats. Each message in csv is one line, and the end of the line does not contain a newline character.
  • By default, Kafka 0.10.0.0 (inclusive) and later versions are supported. If you want to use a Kafka version below 0.10.0.0 (0.9.0, 0.8.2, 0.8.1, 0.8.0), you need to modify the BE configuration by setting kafka_broker_version_fallback to the older version, or set property.broker.version.fallback to the older version when creating the routine load. The cost of using an old version is that some new features of routine load may not be available, such as setting the Kafka partition offset by time.

(2) Basic principles

(figure: overall flow of a Routine Load job)

As shown in the figure above, Client submits a routine import job to FE.

  • FE splits an import job into several Tasks through JobScheduler. Each Task is responsible for importing a specified part of the data. Task is assigned by TaskScheduler to the specified BE for execution.
  • On BE, a Task is regarded as an ordinary import task and is imported through the import mechanism of Stream Load. After the import is completed, report to FE.
  • The JobScheduler in FE will continue to generate subsequent new tasks based on the report results, or retry failed tasks.
  • The entire routine import job continuously generates new tasks to complete the uninterrupted import of data.

(3) Basic grammar

CREATE ROUTINE LOAD [db.]job_name ON tbl_name
[merge_type]
[load_properties]
[job_properties]
FROM data_source
[data_source_properties]
  • [db.]job_name : The name of the import job. In the same database, only one job with the same name can be running.

  • tbl_name : Specify the name of the table to be imported.

  • merge_type : The merge type of data. It supports three types: APPEND, DELETE, and MERGE. APPEND is the default value, which means that all the data in this batch need to be appended to the existing data. DELETE means that all rows with the same key as this batch of data are deleted. MERGE semantics need to be used in conjunction with the delete on condition, which means that data that meets the delete condition is processed according to DELETE semantics and the rest is processed according to APPEND semantics. The syntax is [WITH MERGE|APPEND|DELETE]

  • load_properties : used to describe imported data. Syntax: [column_separator], [columns_mapping], [where_predicates], [delete_on_predicates], [source_sequence], [partitions], [preceding_predicates]

    • column_separator: Specify the column separator, such as: COLUMNS TERMINATED BY "," This parameter only needs to be specified when importing text data. This parameter does not need to be specified when importing data in JSON format. Default: \t;

    • columns_mapping: Specify the mapping relationship of columns in the source data, and define how derived columns are generated.

      Mapping columns: Specify in order which columns in the source data correspond to the columns in the destination table. For columns you wish to skip, you can specify a non-existent column name. Assume that the destination table has three columns k1, k2, v1. The source data has 4 columns, of which columns 1, 2, and 4 correspond to k2, k1, and v1 respectively. Then it is written as follows: COLUMNS (k2, k1, xxx, v1)

      Among them, xxx is a column that does not exist and is used to skip the third column in the source data.

      Derived columns: Columns expressed in the form col_name = expr are called derived columns. That is, it supports calculating the value of the corresponding column in the target table through expr. Derived columns are usually arranged after mapped columns. Although this is not a mandatory rule, Doris always parses mapped columns first and then parses derived columns. Continuing from the previous example, assume that the destination table also has a fourth column v2, and v2 is generated by the sum of k1 and k2. Then it can be written as follows:

      COLUMNS (k2, k1, xxx, v1, v2 = k1 + k2);

      For another example, assume that the user needs to import a table containing only one column k1, and the column type is int. And the corresponding columns in the source file need to be processed: negative numbers are converted to positive numbers, and positive numbers are multiplied by 100. This function can be implemented through the case when function. The correct writing method should be as follows:

      COLUMNS (xx, k1 = case when xx < 0 then cast(-xx as varchar) else cast((xx + '100') as varchar) end)

    • where_predicates: used to specify filter conditions to filter out unnecessary columns. Filter columns can be mapped or derived columns. For example, if we only want to import columns where k1 is greater than 100 and k2 is equal to 1000, we would write as follows: WHERE k1 > 100 and k2 = 1000

    • partitions: Specify which partitions of the import destination table. If not specified, it will be automatically imported into the corresponding partition.

      Example: PARTITION(p1, p2, p3)

    • delete_on_predicates: Indicates deletion conditions, only meaningful when merge type is MERGE, the syntax is the same as where

    • source_sequence: only applicable to UNIQUE_KEYS. Under the same key column, ensure that the value column is REPLACEd according to the source_sequence column. The source_sequence can be a column in the data source or a column in the table structure.

    • preceding_predicates: PRECEDING FILTER predicate. Used to filter raw data. Raw data is data without column mapping or conversion. Users can filter the data before conversion, select the desired data, and then perform conversion.

  • job_properties : common parameters of the routine import job. Syntax:

    PROPERTIES (
    "key1" = "val1",
    "key2" = "val2"
    )
    

    The following parameters are currently supported:

    • desired_concurrent_number: desired concurrency. A routine import job will be divided into multiple subtasks for execution. This parameter specifies how many tasks a job can execute at the same time. Must be greater than 0. Default is 3. This concurrency is not the actual concurrency. The actual concurrency will be comprehensively considered based on the number of nodes in the cluster, load conditions, and data sources.

      A job has a maximum number of tasks that can be executed simultaneously. For Kafka imports, the current actual concurrency is calculated as follows:

      Min(partition num, desired_concurrent_number, alive_backend_num, Config.max_routine_load_task_concurrent_num)
      

      Among them, Config.max_routine_load_task_concurrent_num is the system's default maximum concurrency limit. It is an FE configuration and can be adjusted; the default is 5.

      partition num refers to the number of partitions of the subscribed Kafka topic, and alive_backend_num is the current number of healthy BE nodes. For example, with a 4-partition topic, desired_concurrent_number = 3, 10 alive BE nodes and the default limit of 5, the actual concurrency is Min(4, 3, 10, 5) = 3.

    • max_batch_interval/max_batch_rows/max_batch_size

      These three parameters represent:

      ① The maximum execution time of each subtask, in seconds. The range is 5 to 60. Default is 10.

      ② The maximum number of rows read by each subtask. Must be greater than or equal to 200000. The default is 200000.

      ③ The maximum number of bytes read by each subtask. The unit is byte and the range is 100MB to 1GB. The default is 100MB.

      These three parameters are used to control the execution time and processing volume of a single subtask. When any one of them reaches its threshold, the task finishes. For example:

      "max_batch_interval" = "20",
      "max_batch_rows" = "300000",
      "max_batch_size" = "209715200"
      
    • max_error_number: The maximum number of error lines allowed within the sampling window. Must be greater than or equal to 0. The default is 0, which means no error lines are allowed.

      The sampling window is max_batch_rows * 10. That is, if within the sampling window, the number of error rows is greater than max_error_number, routine jobs will be suspended, and manual intervention is required to check data quality problems. Rows filtered out by the where condition are not counted as error rows

    • strict_mode: Whether to enable strict mode, the default is off. If turned on, the column type transformation result of non-null original data will be filtered if it is NULL. Specified as "strict_mode" = "true"

    • timezone: Specifies the time zone used by the import job. The default is to use the Session's timezone parameter. This parameter will affect the results of all imported time zone-related functions

    • format: Specify the import data format, the default is csv, and supports json format

    • jsonpaths: there are two ways to import JSON: simple mode and matching mode. If jsonpaths is set, the import uses matching mode; otherwise it uses simple mode. For details, please refer to the examples.

    • strip_outer_array: Boolean type, true means that the json data starts with an array object and the array object will be flattened, the default value is false

    • json_root: json_root is a legal jsonpath string, used to specify the root node of the json document, the default value is ""

    • send_batch_parallelism: integer, used to set the parallelism of sending batch data, if the parallelism value exceeds max_send_batch_parallelism_per_job in the BE configuration, then the BE as the coordination point will use the value of max_send_batch_parallelism_per_job

    • data_source_properties: The type of data source. Current support: Kafka

      (
      "key1" = "val1",
      "key2" = "val2"
      )
      

(4) Kafka import example

  • Create a corresponding table in doris

    create table student_kafka
    (
        id int,
        name varchar(50),
        age int
    )
    DUPLICATE KEY(id)
    DISTRIBUTED BY HASH(id) BUCKETS 10;
    
  • Start kafka and prepare data

    bin/kafka-topics.sh --create \
    --zookeeper hadoop1:2181/kafka \
    --replication-factor 1 \
    --partitions 1 \
    --topic test_doris1
    bin/kafka-console-producer.sh \
    --broker-list hadoop1:9092,hadoop2:9092,hadoop3:9092 \
    --topic test_doris1
    
  • Create import task

    CREATE ROUTINE LOAD test_db.kafka_test ON student_kafka
    COLUMNS TERMINATED BY ",",
    COLUMNS(id, name, age)
    PROPERTIES
    ("desired_concurrent_number"="3",
     "strict_mode" = "false"
    )
    FROM KAFKA
    (
        "kafka_broker_list"= "hadoop1:9092,hadoop2:9092,hadoop3:9092",
        "kafka_topic" = "test_doris1",
        "property.group.id"="test_doris_group",
        "property.kafka_default_offsets" = "OFFSET_BEGINNING",
        "property.enable.auto.commit"="false"
    );
    
  • View table

    select * from student_kafka;
    

    Continue to send data to kafka and check the changes in the table

(5) Check the import job status

Specific commands and examples for viewing job status can be viewed through the HELP SHOW ROUTINE LOAD; command.

Specific commands and examples for viewing task running status can be viewed through the HELP SHOW ROUTINE LOAD TASK; command.

Only tasks that are currently running can be viewed. Ended and unstarted tasks cannot be viewed.
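
For the job created in the example above, the commands look roughly like this (syntax may vary slightly by version; see the HELP output for details):

-- Show the status of a specific routine load job
SHOW ROUTINE LOAD FOR test_db.kafka_test;

-- Show the currently running tasks of that job
SHOW ROUTINE LOAD TASK WHERE JobName = "kafka_test";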

(6) Modify job attributes

Users can modify already created jobs. Specific instructions can be viewed through the HELP ALTER ROUTINE LOAD; command, or in the ALTER ROUTINE LOAD documentation.
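
As a hedged sketch (in most versions the job must be in the PAUSED state before it can be altered), changing the desired concurrency of the example job created above might look like this:

ALTER ROUTINE LOAD FOR test_db.kafka_test
PROPERTIES ("desired_concurrent_number" = "5");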

(7) Operation control

Users can control the stop, pause and restart of jobs through the three commands STOP/PAUSE/RESUME. Help and examples can be viewed through the three commands: HELP STOP ROUTINE LOAD; HELP PAUSE ROUTINE LOAD; and HELP RESUME ROUTINE LOAD;.
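
For the example job above, the control commands look roughly as follows; note that, as described later, a stopped job is cleaned up and cannot be resumed, while a paused job can:

PAUSE ROUTINE LOAD FOR test_db.kafka_test;
RESUME ROUTINE LOAD FOR test_db.kafka_test;
STOP ROUTINE LOAD FOR test_db.kafka_test;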

(8) Other instructions

  • The relationship between routine import jobs and ALTER TABLE operations

    Routine import does not block SCHEMA CHANGE and ROLLUP operations. However, note that if the column mapping relationship cannot be matched after SCHEMA CHANGE is completed, it will cause the job's error data to surge, eventually causing the job to be suspended. It is recommended to reduce such problems by explicitly specifying column mappings in routine import jobs and by adding Nullable columns or columns with Default values.

    Deleting the partition of the table may cause the imported data to be unable to find the corresponding partition, and the job will be suspended.

  • The relationship between routine import jobs and other import jobs (LOAD, DELETE, INSERT)

    Routine imports do not conflict with other LOAD jobs and INSERT operations. When performing a DELETE operation, the corresponding table partition cannot have any import tasks being executed. Therefore, before executing the DELETE operation, you may need to pause the routine import job and wait for all assigned tasks to be completed before executing the DELETE.

  • The relationship between routine import jobs and DROP DATABASE/TABLE operations

    When the database or table corresponding to the routine import is deleted, the job will automatically CANCEL.

  • The relationship between routine import jobs of kafka type and kafka topic

    When the kafka_topic declared by the user when creating a routine import does not exist in the kafka cluster:

    • If the broker of the user's kafka cluster sets auto.create.topics.enable = true, kafka_topic will be automatically created first, and the number of automatically created partitions is determined by the broker configuration num.partitions in the user's kafka cluster. The routine job will continue to read the data of the topic normally.
    • If the broker of the user's kafka cluster sets auto.create.topics.enable = false, the topic will not be automatically created, and the routine job will be suspended before reading any data, and the status will be PAUSED.

    Therefore, if the user wants the kafka topic to be automatically created by a routine job when it does not exist, he only needs to set the broker in the user's kafka cluster to auto.create.topics.enable = true.

  • Possible problems that may occur in a network isolation environment. In some environments, there are isolation measures for network segments and domain name resolution, so you need to pay attention to:

    • The Broker list specified in the creation of the Routine load task must be accessible to the Doris service
    • If advertised.listeners is configured in Kafka, the address in advertised.listeners must be accessible to the Doris service
  • About Partition and Offset of specified consumption

    Doris supports specifying Partition and Offset to start consumption. The new version also supports the function of consumption at a specified time point. Here is the configuration relationship of the corresponding parameters.

    There are three relevant parameters:

    kafka_partitions: Specify the partition list to be consumed, such as: "0, 1, 2, 3".

    kafka_offsets: Specifies the starting offset of each partition, which must correspond to the number of kafka_partitions lists. For example: "1000, 1000, 2000, 2000"

    property.kafka_default_offset: Specifies the default starting offset of the partition.

    When creating an import job, these three parameters can have the following combinations:

    Combination 1 (no kafka_partitions, no kafka_offsets, no default offset): the system automatically finds all partitions of the topic and starts consuming from OFFSET_END.
    Combination 2 (no kafka_partitions, no kafka_offsets, default offset set): the system automatically finds all partitions of the topic and starts consuming from the position specified by the default offset.
    Combination 3 (kafka_partitions set, no kafka_offsets, no default offset): the system starts consuming from OFFSET_END of the specified partitions.
    Combination 4 (kafka_partitions and kafka_offsets set, no default offset): the system starts consuming from the specified offsets of the specified partitions.
    Combination 5 (kafka_partitions set, no kafka_offsets, default offset set): the system starts consuming the specified partitions from the position specified by the default offset.
  • The difference between STOP and PAUSE

    FE will automatically clean up the ROUTINE LOAD in the STOP state regularly, and the ROUTINE LOAD in the PAUSE state can be restored and enabled again.

Binlog Load

Binlog Load provides a CDC (Change Data Capture) function that enables Doris to incrementally synchronize users' data update operations in the Mysql database.

(1) Applicable scenarios

  • INSERT/UPDATE/DELETE operations are supported.
  • Query statements are filtered out (not synchronized).
  • DDL statements are not yet compatible.

(2) Basic principles

In the first phase of the design, Binlog Load relies on canal as an intermediary: canal pretends to be a slave node to obtain and parse the Binlog from the MySQL master node, and Doris then obtains the parsed data from canal. The process mainly involves the MySQL side, the canal side and the Doris side. The overall data flow is as follows:

As shown in the figure above, the user submits a data synchronization job to FE.

  • FE will start a canal client for each data synchronization job to subscribe to the canal server and obtain data.
  • The receiver in the client will be responsible for receiving data through the Get command. Every time a data batch is obtained, it will be distributed to different channels by the consumer according to the corresponding table. Each channel will generate a subtask for sending data for this data batch.
  • On FE, a Task is a sub-task for the channel to send data to BE, which contains the data of the same batch distributed to the current channel.
  • Channels control the start, commit, and termination of individual table transactions. Within a transaction cycle, multiple batches of data will generally be obtained from the consumer, so multiple sub-tasks will be generated to send data to BE. These Tasks will not actually take effect until the transaction is successfully submitted.
  • When certain conditions are met (such as exceeding a certain time and reaching the maximum submitted data size), the consumer will block and notify each channel to submit the transaction.
  • If and only if all channels are successfully submitted, the canal will be notified through the Ack command and continue to acquire and consume data.
  • If any channel fails to submit, the data will be retrieved from the location where the last consumption was successful and submitted again (the successfully submitted channel will not be submitted again to ensure idempotence).
  • During the entire data synchronization operation, FE continuously obtains data from canal through the above process and submits it to BE to complete data synchronization.

(3) Configure MySQL side

In the master-slave replication of a MySQL cluster, the binary log (Binlog) records all data changes on the master node. Data synchronization and backup among the nodes of the cluster are performed through the Binlog, thereby improving the availability of the cluster. The architecture usually consists of one master node (responsible for writes) and one or more slave nodes (responsible for reads), and all data changes that occur on the master node are replicated to the slave nodes.

Note: Currently, Mysql 5.7 and above must be used to support the Binlog Load function.

  • Turn on the binary binlog function of mysql and edit the my.cnf configuration file

    [mysqld]
    log-bin = mysql-bin    # enable binlog
    binlog-format=ROW      # use ROW format
    binlog-do-db=test      # specify the database to synchronize (optional)
    
  • Enable GTID mode [optional]

    A global transaction identifier (GTID) identifies a transaction that has been committed on the master node and is globally unique and valid. After Binlog is enabled, GTIDs are written into the Binlog file, corresponding one-to-one with transactions.

    Edit the my.cnf configuration file.

    gtid-mode=on                  # enable GTID mode
    enforce-gtid-consistency=1    # enforce consistency between GTID and transactions
    

    In GTID mode, the main server can easily track transactions, restore data, and replicate copies without the need for Binlog file names and offsets.

    In GTID mode, due to the global validity of GTID, the slave node no longer needs to save the file name and offset to locate the Binlog location on the master node, but can locate it through the data itself. During data synchronization, slave nodes will skip executing any GTID transactions that are identified as executed.

    GTID is expressed as a pair of coordinates: source_id identifies the master node, and transaction_id indicates the order in which this transaction was executed on the master node (at most 2^63 - 1).

  • Restart MySQL for the configuration to take effect

    sudo systemctl restart mysqld
    
  • Create user and authorize

    set global validate_password_length=4;
    set global validate_password_policy=0;
    GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%' IDENTIFIED BY 'canal' ;
    
  • Prepare a test table

    CREATE TABLE `test`.`tbl1` (
        `a` int(11) NOT NULL COMMENT "",
        `b` int(11) NOT NULL COMMENT ""
    );
    insert into test.tbl1 values(1,1),(2,2),(3,3);
    

(4) Configure Canal side

Canal is a sub-project of Alibaba's otter project. Its main purpose is to provide incremental data subscription and consumption based on incremental log parsing of MySQL databases, and it is used for cross-machine-room synchronization scenarios. Canal version 1.1.5 or above is recommended.

Download address: https://github.com/alibaba/canal/releases

  • Upload and unzip canal deployer

    mkdir /opt/module/canal-1.1.5
    tar -zxvf canal.deployer-1.1.5.tar.gz -C /opt/module/canal-1.1.5
    
  • Create a new directory under the conf folder and rename it

    There can be multiple instances in a canal service. Each directory under conf/ is an instance. Each instance has an independent configuration file.

    mkdir /opt/module/canal-1.1.5/conf/doris-load
    # Copy the configuration file template
    cp /opt/module/canal-1.1.5/conf/example/instance.properties /opt/module/canal-1.1.5/conf/doris-load
    
  • Modify the configuration of conf/canal.properties

    canal.destinations = doris-load
    
  • Modify the instance configuration file

    vim /opt/module/canal-1.1.5/conf/doris-load/instance.properties
    ## canal instance serverId
    canal.instance.mysql.slaveId = 1234
    ## mysql address
    canal.instance.master.address = hadoop1:3306 
    ## mysql username/password
    canal.instance.dbUsername = canal
    canal.instance.dbPassword = canal
    
  • start up

    sh bin/startup.sh
    
  • Verify startup is successful

    cat logs/doris-load/doris-load.log
    

    Note: there is a one-to-one correspondence between the canal client and the canal instance; Binlog Load restricts multiple data synchronization jobs from connecting to the same destination.

(5) Configure target table

  • Doris creates the target table corresponding to Mysql

    CREATE TABLE `binlog_test` (
        `a` int(11) NOT NULL COMMENT "",
        `b` int(11) NOT NULL COMMENT ""
    ) ENGINE=OLAP
    UNIQUE KEY(`a`)
    COMMENT "OLAP"
    DISTRIBUTED BY HASH(`a`) BUCKETS 8;
    

    Binlog Load only supports target tables of the Unique type, and the Batch Delete function of the target table must be enabled (a sketch for enabling it follows this list).

  • Turn on SYNC function

    Set enable_create_sync_job to true in fe.conf. If you do not want to modify the configuration file and restart, you can execute the following:

    -- Log in with the root account
    ADMIN SET FRONTEND CONFIG ("enable_create_sync_job" = "true");
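
As mentioned above, the target table must have the Batch Delete feature enabled. On an existing Unique table this can roughly be done as follows (a sketch; in newer versions Batch Delete may already be enabled when the table is created):

ALTER TABLE binlog_test ENABLE FEATURE "BATCH_DELETE";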
    

(6) Basic grammar

For the detailed syntax of creating a data synchronization job, you can connect to Doris and execute HELP CREATE SYNC JOB; view the syntax help.

CREATE SYNC [db.]job_name
(
 channel_desc, 
 channel_desc
 ...
)
binlog_desc
  • job_name: job_name is the unique identifier of the data synchronization job in the current database. Only one job with the same job_name can be running.

  • channel_desc: channel_desc defines a data channel under the job and represents the mapping relationship between a MySQL source table and a Doris target table. When setting this item, if there are multiple mapping relationships, each MySQL source table must correspond one-to-one to a Doris target table; any other mapping relationship (such as one-to-many) will be considered illegal during syntax checking.

    FROM mysql_db.src_tbl INTO des_tbl
    [partitions]
    [columns_mapping]
    

    column_mapping mainly refers to the mapping relationship between the columns of the MySQL source table and the Doris target table. If not specified, FE defaults to mapping the columns of the source table and the target table one-to-one in order. However, we still recommend explicitly specifying the column mapping, so that when the structure of the target table changes (for example, a nullable column is added), the data synchronization job can still proceed. Otherwise, when such a change occurs, the import will report an error because the column mapping no longer corresponds one-to-one.

  • binlog_desc: The attributes in binlog_desc define some necessary information for docking with the remote Binlog address. Currently, the only supported docking type is canal mode, and all configuration items need to be preceded by the canal prefix.

    FROM BINLOG
    (
     "key1" = "value1", 
     "key2" = "value2"
    )
    

    canal.server.ip: address of canal server

    canal.server.port: port of canal server

    canal.destination: the string identifier of the instance mentioned above

    canal.batchSize: The maximum batch size obtained from the canal server in each batch, default 8192

    canal.username: username of instance

    canal.password: password of instance

    canal.debug: When set to true, detailed information about the batch and each row of data will be printed, which will affect performance.

(7) Example

  • Create a sync job:

    CREATE SYNC test_db.job1
    (
        FROM test.tbl1 INTO binlog_test
    )
    FROM BINLOG 
    (
        "type" = "canal",
        "canal.server.ip" = "hadoop1",
        "canal.server.port" = "11111",
        "canal.destination" = "doris-load",
        "canal.username" = "canal",
        "canal.password" = "canal"
    );
    
  • View job status: Specific commands and examples for viewing job status can be viewed through the HELP SHOW SYNC JOB; command.

    # Show the status of all data synchronization jobs in the current database.
    SHOW SYNC JOB;
    # Show the status of all data synchronization jobs in database `test_db`.
    SHOW SYNC JOB FROM `test_db`;
    

    The parameters of the returned result set have the following meanings:

    • State: The current stage of the job. The transition between job states is shown in the figure below:

      After the job is submitted, its status is PENDING. After the canal client is started by FE scheduling, the status changes to RUNNING. The user can control the job with the three commands STOP/PAUSE/RESUME, after which the job status becomes CANCELLED/PAUSED/RUNNING respectively.

      There is only one CANCELLED in the final stage of the job. When the job status changes to CANCELLED, it cannot be restored again. When an error occurs in a job, if the error is unrecoverable, the status will change to CANCELLED, otherwise it will change to PAUSED.

    • Channel: The mapping relationship between all source tables of the job and the target table.

    • Status: The current binlog consumption position (if the GTID mode is set, the GTID will be displayed), and the delay time of the doris side execution time compared to the mysql side.

    • JobConfig: The connected remote server information, such as the address of the canal server and the destination of the connected instance.

  • Continue inserting data into the MySQL table and observe the changes in the Doris table.

  • Controlling jobs: Users can control the stop, pause and resume of jobs through the three commands STOP/PAUSE/RESUME.

    # Stop the data synchronization job named `job_name`
    STOP SYNC JOB [db.]job_name
    # Pause the data synchronization job named `job_name`
    PAUSE SYNC JOB [db.]job_name
    # Resume the data synchronization job named `job_name`
    RESUME SYNC JOB [db.]job_name
    

Insert Into

The use of the Insert Into statement is similar to the use of the Insert Into statement in databases such as MySQL. But in Doris, all data writing is an independent import job. Therefore, Insert Into is also introduced here as an import method.

The main Insert Into commands include the following two types:

INSERT INTO tbl SELECT ...
INSERT INTO tbl (col1, col2, ...) VALUES (1, 2, ...), (1,3, ...);

The second command is only for Demo and should not be used in testing or production environments.

The Insert Into command needs to be submitted through the MySQL protocol. Creating an import request will return the import results synchronously.

(1) Grammar

INSERT INTO table_name [partition_info] [WITH LABEL label] [col_list] [query_stmt] [VALUES];

WITH LABEL:

The INSERT operation acts as an import task and can also specify a label. If not specified, the system will automatically specify a UUID as label.

This feature requires version 0.11+.

Note: It is recommended to specify the Label rather than letting the system assign it automatically. If it is assigned automatically by the system and the connection is broken during the execution of the Insert Into statement (for example due to a network error), it is impossible to know whether the Insert Into succeeded. If you specify a Label, you can look up the task result by that Label afterwards (see the sketch after the examples below).

Example:

INSERT INTO tbl2 WITH LABEL label1 SELECT * FROM tbl3;
INSERT INTO tbl1 VALUES ("qweasdzxcqweasdzxc"), ("a");
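
With a Label specified as in the first example, the job can later be looked up by that Label, for example (a sketch; the database name here is assumed):

SHOW LOAD FROM test_db WHERE LABEL = "label1";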

Notice:

When you need to use CTE (Common Table Expressions) as the query part in the insert operation, you must specify the WITH LABEL and column list parts. Example

INSERT INTO tbl1 WITH LABEL label1
WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2)
SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1;
INSERT INTO tbl1 (k1)
WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2)
SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1;

(2)SHOW LAST INSERT

In the MySQL client libraries of some languages, it is difficult to obtain the JSON string embedded in the returned result. Therefore, Doris also provides the SHOW LAST INSERT command to explicitly obtain the result of the latest insert operation.

After executing an insert operation, SHOW LAST INSERT can be executed in the same session connection. This command will return the result of the latest insert operation, such as:

mysql> show last insert\G
*************************** 1. row ***************************
 TransactionId: 64067
 Label: insert_ba8f33aea9544866-8ed77e2844d0cc9b
 Database: default_cluster:db1
 Table: t1
TransactionStatus: VISIBLE
 LoadedRows: 2
 FilteredRows: 0

This command returns the result of the insert and the details of the corresponding transaction. Therefore, the user can execute the show last insert command after each insert operation to obtain the insert result.

Note: This command will only return the result of the latest insert operation in the same session connection. If the connection is broken or replaced with a new one, an empty set will be returned.

S3 Load

Starting from version 0.14, Doris supports importing data directly from online storage systems that support the S3 protocol through the S3 protocol.

This document mainly introduces how to import data stored in AWS S3. It also supports the import of other object storage systems that support the S3 protocol, such as Baidu Cloud's BOS, Alibaba Cloud's OSS, and Tencent Cloud's COS.

(1) Applicable scenarios

  • The source data is in a storage system that supports the S3 protocol, such as S3, BOS, etc.
  • The amount of data ranges from tens to hundreds of GB.

(2) Preparation work

  • Prepare AK and SK: first find or generate your AWS Access Keys. They can be created and viewed under My Security Credentials in the AWS console.
  • Prepare REGION and ENDPOINT: the REGION can be selected when creating a bucket or viewed in the bucket list; the ENDPOINT corresponding to each REGION can be found in the AWS documentation.

(3) Example

The import method is basically the same as Broker Load. You only need to replace the WITH BROKER broker_name (...) clause with the following part:

WITH S3
(
    "AWS_ENDPOINT" = "AWS_ENDPOINT",
    "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
    "AWS_SECRET_KEY"="AWS_SECRET_KEY",
    "AWS_REGION" = "AWS_REGION"
)

The complete example is as follows:

LOAD LABEL example_db.example_label_1
(
    DATA INFILE("s3://your_bucket_name/your_file.txt")
    INTO TABLE load_test
    COLUMNS TERMINATED BY ","
)
WITH S3
(
    "AWS_ENDPOINT" = "AWS_ENDPOINT",
    "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
    "AWS_SECRET_KEY"="AWS_SECRET_KEY",
    "AWS_REGION" = "AWS_REGION"
)
PROPERTIES
(
    "timeout" = "3600"
);
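
Like Broker Load, an S3 load is executed asynchronously, so the result can be checked afterwards through SHOW LOAD. A minimal sketch, assuming the label from the example above:

SHOW LOAD FROM example_db WHERE LABEL = "example_label_1";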

Data output

Export

Data export is a function provided by Doris to export data. This function can export the data of the table or partition specified by the user to the remote storage, such as HDFS/BOS, etc., through the Broker process in text format.

(1) Basic principles

After the user submits an Export job, Doris will count all the tablets involved in the job. These tablets are then grouped, and each group generates a special query plan. The query plan reads the data on the included tablets and then writes the data to the path specified on the remote storage through the Broker. It can also export directly to a remote storage that supports the S3 protocol through the S3 protocol.

a. Scheduling method:

  • The user submits an Export job to FE.
  • FE's Export scheduler will execute an Export job in two stages:
    • PENDING: FE generates an ExportPendingTask, sends a snapshot command to the BEs to take a snapshot of all involved tablets, and generates multiple query plans.
    • EXPORTING: FE generates ExportExportingTask and starts executing the query plan.

b. Query plan split

The Export job will generate multiple query plans, and each query plan is responsible for scanning a part of the tablet. The number of tablets scanned by each query plan is specified by the FE configuration parameter export_tablet_num_per_task, which is 5 by default. That is, assuming a total of 100 Tablets, 20 query plans will be generated. Users can also specify this value through the job attribute tablet_num_per_task when submitting a job.
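
For example, a minimal sketch of overriding this value for a single job through the PROPERTIES clause (the table, path, and broker names here are placeholders):

EXPORT TABLE example_db.example_tbl
TO "hdfs://host/path/to/export/"
PROPERTIES ("tablet_num_per_task" = "10")
WITH BROKER "broker_name" ("username" = "user", "password" = "passwd");

With the 100-tablet example above, this would produce 10 query plans instead of 20.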

c. Query plan execution

Multiple query plans for a job are executed sequentially.

A query plan scans multiple shards, organizes the read data into rows, and writes a batch every 1024 rows, calling Broker to write to the remote storage.

If the query plan encounters an error, it will be automatically retried three times. If a query plan fails after three retries, the entire job fails.

Doris will first create a temporary directory named __doris_export_tmp_12345 (where 12345 is the job id) in the specified remote storage path. The exported data will first be written to this temporary directory. Each query plan will generate a file, example file name:

export-data-c69fcf2b6db5420f-a96b94c1ff8bccef-1561453713822

Among them, c69fcf2b6db5420f-a96b94c1ff8bccef is the query id of the query plan, and 1561453713822 is the timestamp when the file was generated.

When all data is exported, Doris will rename the files to the user-specified path.

(2) Basic grammar

The detailed command of Export can be viewed through HELP EXPORT:

EXPORT TABLE db1.tbl1 
PARTITION (p1,p2)
[WHERE [expr]]
TO "hdfs://host/path/to/export/" 
PROPERTIES
(
    "label" = "mylabel",
    "column_separator"=",",
    "columns" = "col1,col2",
    "exec_mem_limit"="2147483648",
    "timeout" = "3600"
)
WITH BROKER "hdfs"
(
    "username" = "user",
    "password" = "passwd"
);
  • label: The identifier of this export job. You can use this identifier to check the job status later.
  • column_separator: Column separator. The default is \t. Supports invisible characters, such as '\x07'.
  • columns: Columns to be exported, separated by commas. If this parameter is not specified, all columns of the table are exported by default.
  • line_delimiter: Line delimiter. The default is \n. Supports invisible characters, such as '\x07'.
  • exec_mem_limit: The memory usage limit of a single query plan on a single BE in the Export job. The default is 2GB. Unit: bytes.
  • timeout: Job timeout. The default is 2 hours. Unit: seconds.
  • tablet_num_per_task: The maximum number of shards allocated per query plan. The default is 5.

(3) Export example

  • Start the hadoop cluster

  • Execute export

    export table example_site_visit2
    to "hdfs://mycluster/doris-export"
    PROPERTIES
    (
        "label" = "mylabel",
        "column_separator"="|",
        "timeout" = "3600"
    )
    WITH BROKER "broker_name"
    (
        # Required when HDFS HA is enabled; other HA-related parameters are also specified here
        "dfs.nameservices"="mycluster",
        "dfs.ha.namenodes.mycluster"="nn1,nn2,nn3",
        "dfs.namenode.rpc-address.mycluster.nn1"= "hadoop1:8020",
        "dfs.namenode.rpc-address.mycluster.nn2"= "hadoop2:8020",
        "dfs.namenode.rpc-address.mycluster.nn3"="hadoop3:8020",
        "dfs.client.failover.proxy.provider.mycluster"="org.apache.hadoop
        .hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" 
    );
    
  • After the export completes, check the corresponding HDFS path; the exported files will appear there.

(4) Query export job status

After submitting the job, you can query the export job status through the SHOW EXPORT command. Examples of results are as follows:

JobId: 14008
 Label: mylabel
 State: FINISHED
 Progress: 100%
 TaskInfo: {"partitions":["*"],"exec mem limit":2147483648,"column separator":",","line delimiter":"\n","tablet num":1,"broker":"hdfs","coord num":1,"db":"default_cluster:db1","tbl":"tbl3"}
 Path: bos://bj-test-cmy/export/
CreateTime: 2019-06-25 17:08:24
StartTime: 2019-06-25 17:08:28
FinishTime: 2019-06-25 17:08:34
 Timeout: 3600
 ErrorMsg: N/A
  • JobId: the unique ID of the job

  • Label: custom job identification

  • State: Job status:

    PENDING: Job to be scheduled

    EXPORTING: Data is being exported

    FINISHED: The job was successful

    CANCELLED: Job failed

  • Progress: job progress. The progress is in query plans. Suppose there are 10 query plans in total, and 3 of them are currently completed, so the progress is 30%.

  • TaskInfo: Job information displayed in Json format:

    db: database name

    tbl: table name

    partitions: Specify the exported partitions. * indicates all partitions.

    exec mem limit: query plan memory usage limit. Unit byte.

    column separator: The column separator of the exported file.

    line delimiter: The line delimiter of the exported file.

    tablet num: The total number of tablets involved.

    broker: The name of the broker used.

    coord num: The number of query plans.

  • Path: The export path on the remote storage.

  • CreateTime/StartTime/FinishTime: The creation time, start scheduling time, and finish time of the job.

  • Timeout: job timeout. The unit is seconds. This time is calculated from CreateTime.

  • ErrorMsg: If an error occurs in the job, the reason for the error will be displayed here.

(5) Precautions

  • It is not recommended to export large amounts of data at once. The maximum amount of export data recommended for an Export job is tens of GB. Excessively large exports result in more junk files and higher retry costs.
  • If the amount of table data is too large, it is recommended to export by partition.
  • During the running of the Export job, if FE is restarted or the master is switched, the Export job will fail and the user will need to resubmit it.
  • If the Export job fails, the __doris_export_tmp_xxx temporary directory generated in the remote storage and the files already generated will not be deleted and need to be deleted manually by the user.
  • If the Export job runs successfully, the __doris_export_tmp_xxx directories generated in the remote storage may be retained or cleared based on the file system semantics of the remote storage. For example, in Baidu Object Storage (BOS), after the last file in a directory is removed through the rename operation, the directory will also be deleted. If the directory has not been cleared, the user can clear it manually.
  • If FE restarts or the master switches after the Export job has completed (successfully or not), some of the job information displayed by SHOW EXPORT will be lost and can no longer be viewed.
  • The Export job will only export the data of the Base table, not the data of the Rollup Index.
  • The Export job will scan data, occupy IO resources, and may affect the query delay of the system.

Export query results

The SELECT INTO OUTFILE statement can export query results to a file. Currently, it supports exporting to remote storage such as HDFS, S3, BOS, and COS (Tencent Cloud) through the Broker process, through the S3 protocol, or directly through the HDFS protocol.

(1) Grammar

The syntax is as follows:

query_stmt
INTO OUTFILE "file_path"
[format_as]
[properties]
  • file_path

    file_path points to the path where the file is stored and the file prefix. Such as hdfs://path/to/my_file_.

    The final file name is composed of my_file_, a file sequence number, and the file format suffix. The sequence number starts from 0, and the number of files equals the number of files the result set is split into. For example:

    my_file_abcdefg_0.csv

    my_file_abcdefg_1.csv

    my_file_abcdefg_2.csv

  • [format_as]

    FORMAT AS CSV
    

    Specify the export format. Default is CSV.

  • [properties]

    Specify relevant properties. Currently, export through the Broker process or through the S3 protocol is supported.

    Broker-related properties need to be prefixed with broker.. Please refer to the Broker documentation for details.

    HDFS-related properties need to be prefixed with hdfs.. Among them, hdfs.fs.defaultFS specifies the namenode address and port, and is required.

    For the S3 protocol, you can directly use the S3 protocol configuration.

    Example:

    ("broker.prop_key" = "broker.prop_val", ...)
    or
    ("hdfs.fs.defaultFS" = "xxx", "hdfs.hdfs_user" = "xxx")
    or 
    ("AWS_ENDPOINT" = "xxx", ...)
    

    Other properties:

    ("key1" = "val1", "key2" = "val2", ...)
    

    Currently the following attributes are supported:

    • column_separator: Column separator, only applicable to CSV format. The default is \t.
    • line_delimiter: Line delimiter, only applicable to CSV format. The default is \n.
    • max_file_size: The maximum size of a single file. The default is 1GB. The value range is between 5MB and 2GB. Files exceeding this size will be split.
    • schema: PARQUET file schema information. Applies to PARQUET format only. When the export file format is PARQUET, schema must be specified.

(2) Concurrent export

  • Conditions for concurrent export: By default, the export of query result sets is non-concurrent, that is, single-point export. If the user wants the query result set to be exported concurrently, the following conditions need to be met:

    • session variable 'enable_parallel_outfile' turns on concurrent export:

      set enable_parallel_outfile = true;

    • The export method is S3 or HDFS instead of using a broker;

    • The query can meet the requirements for concurrent export; for example, the top level does not contain a single-node operator such as sort. (Examples below illustrate which types of queries cannot export result sets concurrently.)

    If the above three conditions are met, the concurrent export of the query result set can be triggered.

    Concurrency = be_instance_num * parallel_fragment_exec_instance_num (a sketch of this calculation follows after this list)

  • Verification result sets are exported concurrently.

    After turning on concurrent export through the session variable, you can verify whether the current query can be exported concurrently using the following method.

    explain select xxx from xxx where xxx into outfile "s3://xxx" 
    format as csv properties ("AWS_ENDPOINT" = "xxx", ...);
    

    After explaining the query, Doris will return the plan of the query. If RESULT FILE SINK appears in PLAN FRAGMENT 1, it means that the export concurrency is successfully enabled. If RESULT FILE SINK appears in PLAN FRAGMENT 0, it means that the current query cannot be exported concurrently (the current query does not meet the three conditions for concurrent export at the same time).
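
As a rough illustration of the concurrency formula above, a sketch assuming a hypothetical cluster with 3 BE nodes:

set enable_parallel_outfile = true;
set parallel_fragment_exec_instance_num = 2;
# with 3 BE instances, concurrency = 3 * 2 = 6, and each instance writes its own output file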

(3) Usage examples

**Example 1:** Use the broker method to export simple query results

SELECT * FROM example_site_visit
INTO OUTFILE "hdfs://hadoop1:8020/doris-out/broker_a_"
FORMAT AS CSV
PROPERTIES
(
    "broker.name" = "broker_name",
    "column_separator" = ",",
    "line_delimiter" = "\n",
    "max_file_size" = "100MB"
);

If the final generated file is not larger than 100MB, it will be a single file such as broker_a_0.csv.

If it is larger than 100MB, it may be split into broker_a_0.csv, broker_a_1.csv, and so on.

**Example 2:** Use the broker method and specify the export format as PARQUET

SELECT city, age FROM example_site_visit
INTO OUTFILE "hdfs://hadoop1:8020/doris-out/broker_b_"
FORMAT AS PARQUET
PROPERTIES
(
    "broker.name" = "broker_name",
    "schema"="required,byte_array,city;required,int32,age"
);

Exporting query results to Parquet files requires explicitly specifying the schema.

**Example 3:** Export using HDFS

SELECT * FROM example_site_visit
INTO OUTFILE "hdfs://doris-out/hdfs_"
FORMAT AS CSV
PROPERTIES
(
    "hdfs.fs.defaultFS" = "hdfs://hadoop1:8020",
    "hdfs.hdfs_user" = "atguigu",
    "column_separator" = ","
);

If the final generated file is not larger than the default max_file_size (1GB), it will be a single file such as hdfs_0.csv.

If it is larger, it may be split into hdfs_0.csv, hdfs_1.csv, and so on.

**Example 4:** Export using HDFS and enable concurrent export

set enable_parallel_outfile = true;
EXPLAIN SELECT * FROM example_site_visit
INTO OUTFILE "hdfs://doris-out/hdfs_"
FORMAT AS CSV
PROPERTIES
(
    "hdfs.fs.defaultFS" = "hdfs://hadoop1:8020",
    "hdfs.hdfs_user" = "atguigu",
    "column_separator" = ","
);

**Example 5:** Export the query results of the CTE statement to a file hdfs://path/to/result.txt. The default export format is CSV. Use my_broker and set up the HDFS high availability information. Use the default row and column separators.

WITH
x1 AS
(SELECT k1, k2 FROM tbl1),
x2 AS
(SELECT k3 FROM tbl2)
SELECT k1 FROM x1 UNION SELECT k3 FROM x2
INTO OUTFILE "hdfs://path/to/result_"
PROPERTIES
(
    "broker.name" = "my_broker",
    "broker.username"="user",
    "broker.password"="passwd",
    "broker.dfs.nameservices" = "my_ha",
    "broker.dfs.ha.namenodes.my_ha" = "my_namenode1, my_namenode2",
    "broker.dfs.namenode.rpc-address.my_ha.my_namenode1" = 
    "nn1_host:rpc_port",
    "broker.dfs.namenode.rpc-address.my_ha.my_namenode2" = 
    "nn2_host:rpc_port",
    "broker.dfs.client.failover.proxy.provider" = 
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProx
    yProvider"
);

If the final generated file is not larger than 1GB, it will be a single file such as result_0.csv.

If it is larger than 1GB, it may be split into result_0.csv, result_1.csv, and so on.

**Example 6:** Export the query results of the UNION statement to a file bos://bucket/result.txt. Specify the export format as PARQUET. Use my_broker and set up the HDFS high availability information. The PARQUET format does not require specifying column delimiters.

After the export is completed, an identification file is generated.

SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1
INTO OUTFILE "bos://bucket/result_"
FORMAT AS PARQUET
PROPERTIES
(
    "broker.name" = "my_broker",
    "broker.bos_endpoint" = "http://bj.bcebos.com",
    "broker.bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
    "broker.bos_secret_accesskey" = "yyyyyyyyyyyyyyyyyyyyyyyyyy",
    "schema"="required,int32,k1;required,byte_array,k2"
);

**Example 7:** Export the query results of the select statement to a file cos://${bucket_name}/path/result.txt. Specify the export format as csv.

After the export is completed, an identification file is generated.

select k1,k2,v1 from tbl1 limit 100000
into outfile "s3a://my_bucket/export/my_file_"
FORMAT AS CSV
PROPERTIES
(
    "broker.name" = "hdfs_broker",
    "broker.fs.s3a.access.key" = "xxx",
    "broker.fs.s3a.secret.key" = "xxxx",
    "broker.fs.s3a.endpoint" = "https://cos.xxxxxx.myqcloud.com/",
    "column_separator" = ",",
    "line_delimiter" = "\n",
    "max_file_size" = "1024MB",
    "success_file_name" = "SUCCESS"
)

If the final generated file is not larger than 1GB, it will be a single file such as my_file_0.csv.

If it is larger than 1GB, it may be split into my_file_0.csv, my_file_1.csv, and so on.

Verify on cos:

① Paths that do not exist will be automatically created

② The access.key/secret.key/endpoint values need to be confirmed with the COS administrator. In particular, the endpoint value should not include the bucket_name.

**Example 8:** Use the s3 protocol to export to bos, and concurrent export is enabled:

set enable_parallel_outfile = true;
select k1 from tb1 limit 1000
into outfile "s3://my_bucket/export/my_file_"
format as csv
properties
(
    "AWS_ENDPOINT" = "http://s3.bd.bcebos.com",
    "AWS_ACCESS_KEY" = "xxxx",
    "AWS_SECRET_KEY" = "xxx",
    "AWS_REGION" = "bd"
)

The final generated file has a prefix of my_file_{fragment_instance_id}_.

**Example 9:** Use the s3 protocol to export to BOS, and concurrent export of session variables is enabled.

Note: However, since the query statement has a top-level sorting node, even if the session variable for concurrent export is enabled for this query, it cannot be exported concurrently.

set enable_parallel_outfile = true;
select k1 from tb1 order by k1 limit 1000
into outfile "s3://my_bucket/export/my_file_"
format as csv
properties
(
    "AWS_ENDPOINT" = "http://s3.bd.bcebos.com",
    "AWS_ACCESS_KEY" = "xxxx",
    "AWS_SECRET_KEY" = "xxx",
    "AWS_REGION" = "bd"
)

mysqldump export

Doris 1.0 supports exporting data or table structures through the mysqldump tool. Example operations:

  • Export the user table in the test_db database:

    mysqldump -h127.0.0.1 -P9030 -uroot --no-tablespaces --databases test_db --tables user > dump1.sql
    
  • Export the user table structure in the test_db database:

    mysqldump -h127.0.0.1 -P9030 -uroot --no-tablespaces --databases test_db --tables user --no-data > dump2.sql
    
  • Export all tables in the test_db database:

    mysqldump -h127.0.0.1 -P9030 -uroot --no-tablespaces --databases test_db
    
  • Export all databases and tables

    mysqldump -h127.0.0.1 -P9030 -uroot --no-tablespaces --all-databases
    
  • The exported results can be redirected to a file, and then imported into Doris through the source command

    source /opt/module/doris-1.0.0/dump1.sql
    

Data backup and recovery

Doris supports backing up the current data to the remote storage system through the broker in the form of files. Afterwards, the data can be restored from the remote storage system to any Doris cluster through the restoration command. Through this function, Doris can support regular snapshot backup of data. You can also use this function to migrate data between different clusters.

This function requires Doris version 0.8.2+, and a broker corresponding to the remote storage (such as BOS or HDFS) needs to be deployed. You can view the currently deployed brokers through SHOW BROKER;.

Brief explanation of principle

Backup

The backup operation uploads the data files of the specified table or partition, in the form in which they are stored in Doris, directly to the remote warehouse for storage. When the user submits a Backup request, the system performs the following operations internally:

(1) Snapshot and snapshot upload

The snapshot phase takes a snapshot of the data files of the specified table or partition, and the backup is then performed on the snapshot. After the snapshot, changes to the table, imports, and other operations no longer affect the backup result. Taking a snapshot only creates hard links to the current data files and takes very little time. After the snapshot is completed, the snapshot files are uploaded one by one. The snapshot upload is performed concurrently by each Backend.

(2) Metadata preparation and uploading

After the data file snapshots are uploaded, the Frontend first writes the corresponding metadata to a local file, and then uploads the local metadata file to the remote warehouse through the broker to complete the backup job.

Restore

The recovery operation needs to specify an existing backup in the remote warehouse, and then restore the contents of the backup to the local cluster. When the user submits a Restore request, the system will perform the following operations internally:

(1) Create corresponding metadata locally

This step first creates the corresponding structures, such as tables and partitions, in the local cluster. After creation, the tables are visible but not yet accessible.

(2) Local snapshot

This step is to take a snapshot of the table created in the previous step. This is actually an empty snapshot (because the newly created table has no data). Its main purpose is to generate the corresponding snapshot directory on the Backend, which is used to later receive the snapshot files downloaded from the remote warehouse.

(3) Download snapshot

The snapshot files in the remote warehouse will be downloaded to the corresponding snapshot directory generated in the previous step. This step is completed concurrently by each Backend.

(4) Make the snapshot effective

After the snapshot download is completed, we need to map each snapshot to the metadata of the current local table. Then reload these snapshots to make them effective and complete the final recovery operation.

Best Practices

(1) Backup

Currently, full backup is supported at the minimum granularity of a partition (Partition); incremental backup may be supported in future versions. If you need to back up data regularly, you first need to plan the table's partitioning and bucketing reasonably when creating the table, for example partitioning by time. Then, during subsequent operation, perform regular data backups at partition granularity.
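
For instance, a minimal sketch of backing up a single time partition (the snapshot, database, table, partition, and repository names below are placeholders):

BACKUP SNAPSHOT example_db.snapshot_20220401
TO `example_repo`
ON ( example_tbl PARTITION (p20220401) );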

(2) Data migration

Users can first back up the data to the remote warehouse, and then restore the data to another cluster through the remote warehouse to complete the data migration. Because data backup is done in the form of snapshots, new imported data after the snapshot phase of the backup job will not be backed up. Therefore, between the completion of the snapshot and the completion of the recovery job, the data imported on the original cluster must also be imported on the new cluster.

It is recommended that, after the migration is completed, data continue to be imported into both the new and old clusters in parallel for a period of time. After verifying the correctness of the data and the business, migrate the business to the new cluster.

(3) Key points

  • Backup and recovery related operations are currently only allowed to be performed by users with ADMIN permissions.
  • Within a Database, only one backup or recovery job is allowed to be executed.
  • Both backup and recovery support operations at the minimum partition (Partition) level. When the amount of data in a table is large, it is recommended to perform operations by partition to reduce the cost of failed retries.
  • Because the backup and restore operations operate on actual data files. Therefore, when a table has too many shards, or a shard has too many small versions, it may take a long time to back up or restore even if the total data volume is small. Users can use SHOW PARTITIONS FROM table_name; and SHOW TABLET FROM table_name; to check the number of shards in each partition and the number of file versions in each shard to estimate the job execution time. The number of files has a great impact on the job execution time, so it is recommended to plan partitions and buckets reasonably when creating tables to avoid excessive fragmentation.
  • When checking the job status through the SHOW BACKUP or SHOW RESTORE command. It is possible to see error messages in the TaskErrMsg column. But as long as the State column is not CANCELLED, it means the job is still continuing. These Tasks may be retried successfully. Of course, some Task errors will directly cause job failure.
  • If the recovery job is an overwrite operation (restoring data to a table or partition that already exists), then starting from the COMMIT phase of the recovery job, the overwritten data on the current cluster may no longer be recoverable. If the recovery job fails or is cancelled at this point, the previous data may be damaged and inaccessible, and the only remedy is to perform the recovery operation again and wait for the job to complete. Therefore, we recommend not using the overwrite method to restore data unless you are sure the current data is no longer in use.

Backup

(1) Create a remote warehouse path

CREATE REPOSITORY `hdfs_ods_dw_backup`
WITH BROKER `broker_name`
ON LOCATION "hdfs://hadoop1:8020/tmp/doris_backup"
PROPERTIES (
    "username" = "",
    "password" = ""
)

(2) Perform backup

BACKUP SNAPSHOT [db_name].{snapshot_name}
TO `repository_name`
ON (
    `table_name` [PARTITION (`p1`, ...)],
    ...
)
PROPERTIES ("key"="value", ...);

Example:

BACKUP SNAPSHOT test_db.backup1
 TO hdfs_ods_dw_backup
 ON
 (
     table1
 );

(3) View backup tasks

SHOW BACKUP [FROM db_name]

(4) View the remote warehouse image

Grammar:

SHOW SNAPSHOT ON `repo_name` [WHERE SNAPSHOT = "snapshot" [AND TIMESTAMP = "backup_timestamp"]];

Example 1: View existing backups in the warehouse hdfs_ods_dw_backup:

SHOW SNAPSHOT ON hdfs_ods_dw_backup;    

Example 2: View only the backup named backup1 in the warehouse hdfs_ods_dw_backup:

SHOW SNAPSHOT ON hdfs_ods_dw_backup WHERE SNAPSHOT = "backup1";

Example 3: View the detailed information of the backup named backup1 in the warehouse hdfs_ods_dw_backup, with the time version "2021-05-05-15-34-26":

SHOW SNAPSHOT ON hdfs_ods_dw_backup WHERE SNAPSHOT = "backup1" AND TIMESTAMP = "2021-05-05-15-34-26";

(5) Cancel backup

Syntax for canceling an ongoing backup job:

CANCEL BACKUP FROM db_name;

Example: Cancel the BACKUP task under test_db

CANCEL BACKUP FROM test_db;

Recover

Restore the data previously backed up by the BACKUP command to the specified database. This command is an asynchronous operation. After the submission is successful, you need to check the progress through the SHOW RESTORE command.

  • Only tables of type OLAP are supported
  • Supports restoring multiple tables at a time; the tables need to be consistent with those in the corresponding backup.

(1) Usage

RESTORE SNAPSHOT [db_name].{snapshot_name}
FROM `repository_name`
ON (
    `table_name` [PARTITION (`p1`, ...)] [AS `tbl_alias`],
    ...
)
PROPERTIES ("key"="value", ...);

Explanation:

  • There can only be one executing BACKUP or RESTORE task under the same database.

  • The tables and partitions to be restored are identified in the ON clause. If no partition is specified, all partitions of the table are restored by default. The specified tables and partitions must already exist in the warehouse backup.

  • The table name backed up in the warehouse can be restored as a new table through the AS statement. But the new table name must not already exist in the database. The partition name cannot be modified.

  • You can restore the backed up table in the warehouse and replace the existing table with the same name in the database, but you must ensure that the table structure of the two tables is completely consistent. The table structure includes: table name, column, partition, Rollup, etc.

  • You can specify some partitions of the recovery table, and the system will check whether the partition Range or List can match.

  • PROPERTIES currently supports the following properties:

    "backup_timestamp" = "2018-05-04-16-45-08": Specifies which time version to restore the corresponding backup, required. This information can be obtained with the SHOW SNAPSHOT ON repo; statement.

    "replication_num" = "3": Specifies the number of replicas of the restored table or partition. Default is 3. If restoring an existing table or partition, the number of copies must be the same as the number of copies of the existing table or partition. At the same time, there must be enough hosts to accommodate multiple replicas.

    "timeout" = "3600": task timeout, the default is one day. Unit seconds.

    "meta_version" = 40: Use the specified meta_version to read the metadata of the previous backup. Note that this parameter is a temporary solution and is only used to restore data backed up by the old version of Doris. The latest version of the backup data already contains the meta version, so there is no need to specify it again.

(2) Usage example

  • Example one

    Restore table backup_tbl from backup snapshot_1 in example_repo to database example_db1, using time version "2021-05-04-16-45-08" and restoring with 1 replica:

    RESTORE SNAPSHOT example_db1.`snapshot_1`
    FROM `example_repo`
    ON ( `backup_tbl` )
    PROPERTIES
    (
        "backup_timestamp"="2021-05-04-16-45-08",
        "replication_num" = "1"
    );
    
  • Example two

    Restore partitions p1 and p2 of table backup_tbl, as well as table backup_tbl2 renamed to new_tbl, from backup snapshot_2 in example_repo to database example_db1, using time version "2021-05-04-17-11-01" and the default of 3 replicas:

    RESTORE SNAPSHOT example_db1.`snapshot_2`
    FROM `example_repo`
    ON
    (
     `backup_tbl` PARTITION (`p1`, `p2`),
     `backup_tbl2` AS `new_tbl`
    )
    PROPERTIES
    (
     "backup_timestamp"="2021-05-04-17-11-01"
    );
    
  • Demo

    RESTORE SNAPSHOT test_db.backup1 
    FROM `hdfs_ods_dw_backup` 
    ON 
    (
        table1 AS table_restore
    )
    PROPERTIES 
    (
        "backup_timestamp"="2022-04-01-16-45-19" 
    );
    

(3) View recovery tasks

You can check the status of data recovery through the following statement:

SHOW RESTORE [FROM db_name]

(4) Cancel recovery

The following statement is used to cancel a job that is performing data recovery:

CANCEL RESTORE FROM db_name;

When cancellation occurs around COMMIT or later stages of recovery, it may cause the recovered tables to become inaccessible. At this time, data can only be recovered by executing the recovery job again.

Delete remote repository

This statement is used to delete a created repository. Only the root or superuser user (here, user refers to a Doris user) can delete repositories. Syntax:

DROP REPOSITORY `repo_name`;

Explanation:

Deleting a warehouse only deletes the mapping of the warehouse in Doris and does not delete the actual warehouse data. After deletion, you can map to the warehouse again by specifying the same broker and LOCATION.
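
For example, a sketch of re-creating the mapping after deletion, reusing the repository definition from the backup section above:

DROP REPOSITORY `hdfs_ods_dw_backup`;

CREATE REPOSITORY `hdfs_ods_dw_backup`
WITH BROKER `broker_name`
ON LOCATION "hdfs://hadoop1:8020/tmp/doris_backup"
PROPERTIES (
    "username" = "",
    "password" = ""
);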
