Doris (5): Data Import (Load) via Broker Load

To meet different data import requirements, the Doris system provides five data import methods. Each method supports different data sources and works in a different mode (asynchronous or synchronous).

  • Broker load

The Broker process accesses and reads an external data source (such as HDFS) and imports the data into Doris. The user submits the import job through the MySQL protocol, it is executed asynchronously, and the import result can be checked with the SHOW LOAD command.

  • Stream load

The user submits a request over the HTTP protocol, carrying the raw data, to create the import. It is mainly used to quickly load data from a local file or a data stream into Doris, and the import command returns its result synchronously.

  • Insert

Similar to the INSERT statement in MySQL, Doris provides INSERT INTO table SELECT ... to read data from one Doris table and import it into another, and INSERT INTO table VALUES(...) to insert a single row of data (see the short sketch after this list).

  • Multi load

Users can submit multiple import jobs over the HTTP protocol, and Multi load ensures that these import jobs take effect atomically.

  • Routine load

The user submits a routine import job through the MySQL protocol; a resident thread is created that continuously reads data from the data source (such as Kafka) and imports it into Doris.
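
As a quick illustration of the Insert method (the table names here are hypothetical):

INSERT INTO target_table SELECT * FROM source_table;
INSERT INTO target_table VALUES (1, "tom");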

1 Broker Load

Broker load is an asynchronous import method; different data sources require different broker processes to be deployed. The deployed brokers can be viewed with the SHOW BROKER command.
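
For example, to list the brokers deployed in the cluster:

SHOW BROKER;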

1.1 Applicable scenarios

  • The source data is in a storage system that Broker can access, such as HDFS
  • The amount of data ranges from tens to hundreds of GB

1.2 Basic Principles

After the user submits the import task, the FE (the metadata and scheduling node of the Doris system) generates the corresponding import execution plan (PLAN). Based on the number of BEs (the computing and storage nodes of the Doris system) and the number and size of the files, the PLAN is distributed to multiple BEs for execution, and each BE imports part of the data. During execution, each BE pulls data from the Broker, transforms it, and imports it into the system. After all BEs have finished importing, the FE decides whether the import ultimately succeeded.

1.3 Preconditions

Start hdfs cluster: start-dfs.sh

1.4 Syntax

LOAD LABEL load_label
(
data_desc1[, data_desc2, ...]
)
WITH BROKER broker_name
[broker_properties]
[opt_properties];

  • load_label

The label of the current import batch, unique within a database.

Syntax:

[database_name.]your_label
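
For example, the import job created later in this article uses the label test_db.user_result.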

  • data_desc

Used to describe a batch of imported data.

Syntax:

DATA INFILE
(
"file_path1"[, file_path2, ...]
)
[NEGATIVE]
INTO TABLE `table_name`
[PARTITION (p1, p2)]
[COLUMNS TERMINATED BY "column_separator"]
[(column_list)]
[SET (k1 = func(k2))]

file_path: the file path. You can specify a single file, or use the * wildcard to match all files in a directory. The wildcard must match files, not directories.

PARTITION: If this parameter is specified, only the specified partitions are imported, and data outside these partitions is filtered out. If not specified, all partitions of the table are imported by default.

NEGATIVE: If this parameter is specified, it is equivalent to importing a batch of "negative" data, used to offset the same batch of data imported earlier. This parameter only applies when the table has value columns and the aggregation type of those value columns is SUM.

column_separator: Specifies the column separator in the import file. The default is \t. If the separator is an invisible character, prefix it with \\x and express it in hexadecimal. For example, Hive's default separator \x01 is specified as "\\x01".

column_list: Specifies the mapping between the columns in the import file and the columns in the table. To skip a column in the import file, specify it as a column name that does not exist in the table.

Syntax: (col_name1, col_name2, ...)

SET: If this parameter is specified, a column of the source file can be transformed by a function, and the transformed result is then imported into the table.

Currently supported functions are:

strftime(fmt, column) date conversion function

  • fmt: date format, in the form of %Y%m%d%H%M%S (year, month, day, hour, minute, second)
  • column: The column in column_list, that is, the column in the input file. The storage content should be a numeric timestamp.
  • If there is no column_list, the columns of the input file are mapped, in order, to the columns of the Doris (Palo) table.

time_format(output_fmt, input_fmt, column) date format conversion

  • output_fmt: converted date format, in the form of %Y%m%d%H%M%S (year, month, day, hour, minute, second)
  • input_fmt: the date format of the column before conversion, in the form of %Y%m%d%H%M%S (year, month, day, hour, minute, second)
  • column: The column in column_list, that is, the column in the input file. The storage content should be a date string in input_fmt format.
  • If there is no column_list, the columns of the input file are mapped, in order, to the columns of the Doris (Palo) table.

alignment_timestamp(precision, column) align the timestamp to the specified precision

  • precision: year|month|day|hour
  • column: The column in column_list, that is, the column in the input file. The storage content should be a numeric timestamp.
  • If there is no column_list, the columns of the input file are mapped, in order, to the columns of the Doris (Palo) table.
  • Note: When the alignment precision is year and month, only timestamps in the range of 20050101~20191231 are supported.

default_value(value) Set the default value for an imported column

  • If not specified, the default value defined for the column when the table was created is used

md5sum(column1, column2, ...) Calculate the md5sum of the values of the specified imported columns, returning a 32-character hexadecimal string

replace_value(old_value[, new_value]) Replace old_value in the import file with new_value

  • If new_value is not specified, the default value defined for the column when the table was created is used

hll_hash(column) Convert a column in the table or data into an HLL column data structure

now() Set the imported value of a column to the time when the import is executed. The column must be of type DATE/DATETIME
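
A minimal sketch of how column_list and SET fit together inside a data_desc; the file path, table name and column names below are hypothetical, and the exact set of supported transform functions may vary by Doris version:

DATA INFILE("hdfs://namenode_host:9000/datas/input/example_file")
INTO TABLE `example_table`
COLUMNS TERMINATED BY ","
(event_ts, user_id, tmp_skip, city)
SET (
    event_time = strftime("%Y%m%d%H%M%S", event_ts)
)

Here tmp_skip is not a column of example_table, so that source column is discarded, and event_ts (a numeric timestamp in the file) is converted into the table's event_time column.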

  • broker_name

The name of the broker to use, which can be viewed with the SHOW BROKER command. Different data sources need to use their corresponding brokers.

  • broker_properties

Used to provide information for accessing data sources through brokers. Different brokers and different access methods require different information.

Apache HDFS:

The community version of HDFS supports simple authentication and Kerberos authentication, as well as HA configurations.

Simple authentication:

  • hadoop.security.authentication = simple (default)
  • username: hdfs username
  • password: hdfs password

kerberos authentication:

  • hadoop.security.authentication = kerberos
  • kerberos_principal: specify the principal of kerberos
  • kerberos_keytab: Specifies the path of the keytab file of kerberos. This file must be a file on the server where the broker process resides.
  • kerberos_keytab_content: Specifies the base64-encoded content of the Kerberos keytab file. Only one of this and kerberos_keytab needs to be configured.

namenode HA:

By configuring namenode HA, the new active namenode can be recognized automatically when a namenode switchover occurs.

  • dfs.nameservices: Specifies the name of the hdfs service, customized, such as: "dfs.nameservices" = "my_ha"
  • dfs.ha.namenodes.xxx: custom namenode name, multiple names separated by commas. Where xxx is a custom name in dfs.nameservices, such as "dfs.ha.namenodes.my_ha" = "my_nn"
  • dfs.namenode.rpc-address.xxx.nn: Specifies the rpc address information of the namenode. Where nn represents the namenode name configured in dfs.ha.namenodes.xxx, such as: "dfs.namenode.rpc-address.my_ha.my_nn" = "host:port"
  • dfs.client.failover.proxy.provider: Specify the provider for the client to connect to the namenode, the default is: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

  • opt_properties

Used to specify some special parameters.

Syntax:

[PROPERTIES ("key"="value", ...)]

The following parameters can be specified:

  • timeout: Specifies the timeout period for the import operation. The default timeout is 4 hours. The unit is second.
  • max_filter_ratio: The maximum tolerable data ratio that can be filtered (for reasons such as data irregularity). The default is zero tolerance.
  • exec_mem_limit: Sets the upper limit of memory used by the import, in bytes. The default is 2 GB. This is the memory limit of a single BE node, and an import may be distributed across multiple BEs. Suppose processing 1 GB of data requires at most 5 GB of memory on a single node; if that 1 GB file is distributed to 2 nodes, each node processes about 0.5 GB and theoretically needs about 2.5 GB of memory, so this parameter can be set to 2684354560, i.e. 2.5 GB.
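
A sketch of an opt_properties block that combines the three parameters above (values are illustrative; 2684354560 bytes is the 2.5 GB from the memory estimate):

PROPERTIES
(
    "timeout" = "14400",
    "max_filter_ratio" = "0.1",
    "exec_mem_limit" = "2684354560"
)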

1.5 Examples

Start hdfs cluster

start-dfs.sh

Connect with the MySQL client and create a table:

CREATE TABLE test_db.user_result(
    id INT,
    name VARCHAR(50),
    age INT,
    gender INT,
    province  VARCHAR(50),
    city   VARCHAR(50),
    region  VARCHAR(50),
    phone VARCHAR(50),
    birthday VARCHAR(50),
    hobby  VARCHAR(50),
    register_date VARCHAR(50)
)DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 10;

1.6 Upload data to HDFS

hdfs dfs -put user.csv /datas/user.csv

1.7 Import data

LOAD LABEL test_db.user_result
(
DATA INFILE("hdfs://192.168.222.138:9000/datas/user.csv")
INTO TABLE `user_result`
COLUMNS TERMINATED BY ","
FORMAT AS "csv"
(id, name, age, gender, province,city,region,phone,birthday,hobby,register_date)
)
WITH BROKER broker_name
(
"dfs.nameservices" = "my_cluster",
"dfs.ha.namenodes.my_cluster" = "nn1,nn2,nn3",
"dfs.namenode.rpc-address.my_cluster.nn1" = "192.168.222.143:9000",
"dfs.namenode.rpc-address.my_cluster.nn2" = "192.168.222.144:9000",
"dfs.namenode.rpc-address.my_cluster.nn3" = "192.168.222.145:9000",
"dfs.client.failover.proxy.provider" = 	"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)
PROPERTIES
(
    "max_filter_ratio"="0.00002"
);

Notice

broker_name: the name of the broker, which can be viewed through show broker;

1.8 View the load job

show load;

1.9 View imported data

select * from user_result;

1.10 View the import result

Since Broker load is an asynchronous import method, the user must record the Label used when creating the import and use that Label in the view-import command to check the result. The view-import command is common to all import methods; the specific syntax can be viewed by executing HELP SHOW LOAD.

show load order by createtime desc limit 1\G
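
To look up a specific job by its Label instead (here assuming the label from the import example above), the same command can be filtered:

show load where label = "user_result"\G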

  • JobId

The unique ID of the import job. Each import job has a different JobId, automatically generated by the system. Unlike a Label, a JobId is never reused, whereas a Label can be reused after an import task fails.

  • Label

The ID of the import task.

  • State

The current phase of the import task. During a Broker load import there are mainly two states, PENDING and LOADING. PENDING means that the import task is waiting to be executed; LOADING means that it is being executed.

There are two final phases of the import task: CANCELLED and FINISHED. When the Load job is in these two phases, the import is complete. Among them, CANCELLED means that the import has failed, and FINISHED means that the import has succeeded.

  • Progress

A progress description of the import task. There are two progress values, ETL and LOAD, corresponding to the two stages of the import process, ETL and LOADING. Since Broker load only has the LOADING stage, ETL is always displayed as N/A.

The progress range of LOAD is: 0~100%.

LOAD progress = number of tables that have finished importing / total number of tables involved in this import task * 100%

When all tables have finished importing, the LOAD progress is 99%; the import then enters the final effective stage, and after the entire import is complete the LOAD progress changes to 100%.

Import progress is not linear. So if the progress has not changed for a period of time, it does not mean that the import is not being executed.

  • Type

The type of import task. The type value of Broker load is only BROKER.

  • EtlInfo

It mainly shows the imported data volume metrics unselected.rows, dpp.norm.ALL and dpp.abnorm.ALL. The first value indicates how many rows were filtered out by WHERE conditions; the latter two are used to verify whether the error rate of the current import task exceeds max_filter_ratio.

The sum of the three metrics is the total number of rows in the original data.
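
For example (with purely illustrative numbers): if unselected.rows = 100, dpp.norm.ALL = 99,000 and dpp.abnorm.ALL = 900, the original data contained 100,000 rows; 100 rows were filtered by WHERE conditions, and the error rate checked against max_filter_ratio is roughly 900 / (99,000 + 900), i.e. about 0.9%.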

  • TaskInfo

It mainly displays the current import task parameters, that is, the import task parameters specified by the user when creating the Broker load import task, including: cluster, timeout and max_filter_ratio.

  • ErrorMsg

When the status of the import task is CANCELLED, the reason for the failure is displayed in two parts: type and msg. If the import task succeeds, N/A is displayed here.

The meanings of the type values:

USER_CANCEL: task canceled by the user

ETL_RUN_FAIL: Import task that failed during the ETL phase

ETL_QUALITY_UNSATISFIED: The data quality is unqualified, that is, the error data rate exceeds max_filter_ratio

LOAD_RUN_FAIL: Import tasks that failed during the LOADING phase

TIMEOUT: The import task did not complete within the timeout period

UNKNOWN: unknown import error

  • CreateTime/EtlStartTime/EtlFinishTime/LoadStartTime/LoadFinishTime

These values represent the time the import was created, the time the ETL phase started, the time the ETL phase finished, the time the LOADING phase started, and the time the entire import task completed.

Broker load import has no ETL stage, so its EtlStartTime, EtlFinishTime, LoadStartTime are set to the same value.

If the import task stays at CreateTime for a long time and LoadStartTime is N/A, it means that import tasks are currently heavily backlogged, and users should reduce the frequency of import submissions.

LoadFinishTime - CreateTime = time taken by the entire import task

LoadFinishTime - LoadStartTime = Entire Broker load import task execution time = time consumed by the entire import task - import task waiting time

  • URL

The error data sample of the import task; visit the URL to see the error data samples of this import. If there is no error data in this import, the URL field is N/A.

  • JobDetails

Displays the detailed running status of the job, including the number of imported files, the total file size (in bytes), the number of subtasks, the number of raw rows processed, the IDs of the BE nodes running the subtasks, and the IDs of the BE nodes that have not yet finished.

{"Unfinished backends":{"9c3441027ff948a0-8287923329a2b6a7":[10002]},"ScannedRows":2390016,"TaskNumber":1,"All backends":{"9c3441027ff948a0-8287923329a2b6a7":[10002]},"FileNumber":1,"FileSize":1073741824}

The number of raw rows processed is updated every 5 seconds. This number only shows the current progress and does not represent the final number of rows processed; the actual number of processed rows is subject to what is shown in EtlInfo.

1.11 Cancel import

When a Broker load job is not in the CANCELLED or FINISHED state, it can be cancelled manually by the user. To cancel, specify the Label of the import task to cancel. Execute HELP CANCEL LOAD to view the cancel-import command syntax.
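
A minimal sketch of the cancel command, assuming the label from the earlier import example:

CANCEL LOAD FROM test_db WHERE LABEL = "user_result";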

1.12 Other import case references

Import a batch of data from HDFS. The data format is CSV; Kerberos authentication is used and namenode HA is configured at the same time.

  • Set the maximum tolerable ratio of data that may be filtered out (for reasons such as irregular data).
LOAD LABEL test_db.user_result2
(
        DATA INFILE("hdfs://node1:9000/datas/user.csv")
        INTO TABLE `user_result`
        COLUMNS TERMINATED BY ","
        FORMAT AS "csv"
        (id, name, age, gender, province,city,region,phone,birthday,hobby,register_date)
)
WITH BROKER broker_name
(
        "hadoop.security.authentication"="kerberos",
        "kerberos_principal"="[email protected]",
        "kerberos_keytab_content"="BQIAAABEAAEACUJBSURVLkNPTQAEcGFsbw",
        "dfs.nameservices" = "my_ha",
        "dfs.ha.namenodes.my_ha" = "my_namenode1, my_namenode2",
        "dfs.namenode.rpc-address.my_ha.my_namenode1" = "node1:9000",
        "dfs.namenode.rpc-address.my_ha.my_namenode2" = "node2:9000",
        "dfs.client.failover.proxy.provider" ="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)
PROPERTIES
(

    "max_filter_ratio"="0.00002"
);

  • Import a batch of data from HDFS, specifying the timeout and the filtering ratio. Use the broker named my_hdfs_broker, with simple authentication.
LOAD LABEL test_db.user_result3
(
    DATA INFILE("hdfs://node1:9000/datas/user.csv")
    INTO TABLE `user_result`
)
WITH BROKER broker_name
(
    "username" = "hdfs_user",
    "password" = "hdfs_passwd"
)
PROPERTIES
(
    "timeout" = "3600",
    "max_filter_ratio" = "0.1"
);

In the hdfs:// URL, the host is the namenode host and the port is the fs.defaultFS port (default 9000).

  • Import a batch of data from HDFS, specifying Hive's default separator \x01 and using the * wildcard to match all files in the directory

Use simple authentication and configure namenode HA at the same time.

LOAD LABEL test_db.user_result4
(
    DATA INFILE("hdfs://node1:9000/datas/input/*")
    INTO TABLE `user_result`
    COLUMNS TERMINATED BY "\\x01"
)
WITH BROKER broker_name
(
    "username" = "hdfs_user",
    "password" = "hdfs_passwd",
    "dfs.nameservices" = "my_ha",
    "dfs.ha.namenodes.my_ha" = "my_namenode1, my_namenode2",
    "dfs.namenode.rpc-address.my_ha.my_namenode1" = "node1:8020",
    "dfs.namenode.rpc-address.my_ha.my_namenode2" = "node2:8020",
    "dfs.client.failover.proxy.provider" ="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);

  • Import a batch of "negative" data from HDFS, also using Kerberos authentication and providing the keytab file path
LOAD LABEL test_db.user_result5
(
    DATA INFILE("hdfs://node1:9000/datas/input/old_file)
    NEGATIVE
    INTO TABLE `user_result`
    COLUMNS TERMINATED BY "\t"
)
WITH BROKER broker_name
(
    "hadoop.security.authentication" = "kerberos",
    "kerberos_principal"="[email protected]",
    "kerberos_keytab"="/home/palo/palo.keytab"
);

  • Import a batch of data from HDFS and specify partitions, also using Kerberos authentication and providing the base64-encoded keytab file content
LOAD LABEL test_db.user_result6
(
    DATA INFILE("hdfs://node1:9000/datas/input/file")
    INTO TABLE `user_result`
    PARTITION (p1, p2)
    COLUMNS TERMINATED BY ","
    (k1, k3, k2, v1, v2)
)
WITH BROKER broker_name
(
    "hadoop.security.authentication"="kerberos",
    "kerberos_principal"="[email protected]",
    "kerberos_keytab_content"="BQIAAABEAAEACUJBSURVLkNPTQAEcGFsbw"
);
