[Druid] (8) Apache Druid core plug-ins: Kafka Indexing Service & SLS Indexing Service

1. Introduction

Kafka Indexing Service is a plug-in provided by Apache Druid that uses Druid's Indexing Service to consume Kafka data in real time.

Kafka Indexing Service runs a supervisor on the Overlord (specifically a KafkaSupervisor, which supervises the KafkaIndexTasks of a single dataSource; when it is constructed it accepts a KafkaSupervisorSpec describing the Kafka topic and the ingestion rules, and uses it to generate KafkaIndexTask indexing tasks). The supervisor is responsible for creating the Kafka indexing tasks and managing their life cycle. These KIS tasks read events using Kafka's own partition and offset mechanism, so they can provide an exactly-once ingestion guarantee (the older Tranquility mechanism uses a push model, so it simply cannot guarantee no loss and no duplication). KIS tasks can also read non-recent events from Kafka and are not affected by the window period imposed by other ingestion mechanisms. In addition, the supervisor monitors the status of the indexing tasks to handle failures and to ensure scalability and easy replication. For more differences, see the comparison below:

(Figure: comparison table of ingestion methods; image not reproduced here)

In version 0.16.0, Apache Druid completely removed the Realtime Node related plug-ins, including druid-kafka-eight, druid-kafka-eight-simpleConsumer, druid-rabbitmq, and druid-rocketmq.

Although the newly introduced KIS has many benefits, there is no silver bullet. Because KIS ingests data in pull mode, there is necessarily a pull interval. The interval is controlled by the offsetFetchPeriod parameter: the default is one pull every 30 seconds, and the fastest allowed is once every 5 seconds. Why can't you set an even smaller value? Because issuing requests to Kafka too frequently may affect the stability of the Kafka cluster.
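For reference, here is a hedged sketch of where this parameter sits: in the open-source Druid Kafka supervisor spec, offsetFetchPeriod is a tuningConfig property (the full spec format is shown later in this article).

 "tuningConfig": {
     "type": "kafka",
     "maxRowsInMemory": "100000",
     "offsetFetchPeriod": "PT30S"
 }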

Supplement: as mentioned above, this plug-in starts a supervisor on the Overlord. After the supervisor starts, it launches indexing tasks on the MiddleManagers; these tasks connect to the Kafka cluster to consume topic data and build the index.

  1. Task creation and running process

(Figure: task creation and running process; image not reproduced here)

  2. Task stop process

(Figure: task stop process; image not reproduced here)
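The stop flow can also be triggered manually through the Overlord's standard supervisor API; a hedged sketch (the host, port, and supervisor id metrics-kafka are placeholders matching the examples later in this article):

 # Suspend the supervisor: running tasks publish their segments and ingestion pauses
 curl -XPOST http://emr-header-1.cluster-1234:18090/druid/indexer/v1/supervisor/metrics-kafka/suspend

 # Terminate the supervisor: tasks stop reading, publish what they have, and exit
 curl -XPOST http://emr-header-1.cluster-1234:18090/druid/indexer/v1/supervisor/metrics-kafka/terminate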

2. Interacting with the Kafka cluster

The configuration of the interaction between the E-MapReduce Druid cluster and the Kafka cluster is similar to that of the Hadoop cluster. Both connectivity and hosts need to be set.

For non-secure Kafka clusters, please follow the steps below:

  1. Ensure that the clusters can communicate (two clusters are under one security group, or two clusters are in different security groups, but access rules are configured between the two security groups).
  2. Write the hosts of the Kafka cluster into the hosts list of each node of the E-MapReduce Druid cluster.

Note that the hostname of the Kafka cluster should be a long name, such as emr-header-1.cluster-xxxxxxxx.
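For example, an /etc/hosts entry on a Druid node might look like the following (the IP address and cluster id are placeholders):

 192.168.xx.xx   emr-header-1.cluster-xxxxxxxx emr-header-1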

For a secure Kafka cluster, you need to perform the following operations (the first two steps are the same as for a non-secure Kafka cluster):

  1. Ensure that the clusters can communicate (two clusters are under one security group, or two clusters are in different security groups, but access rules are configured between the two security groups).
  2. Write the hosts of the Kafka cluster into the hosts list of each node of the E-MapReduce Druid cluster.

Note that the hostname of the Kafka cluster should be a long name, such as emr-header-1.cluster-xxxxxxxx.

  3. Set up Kerberos cross-domain trust between the two clusters (see the cross-domain trust documentation for details); two-way trust is recommended.
  4. Prepare a client security configuration file with the following content.
KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    storeKey=true
    keyTab="/etc/ecm/druid-conf/druid.keytab"
    principal="[email protected]";
};

After the file is prepared, synchronize it to all nodes of the E-MapReduce Druid cluster and place it in a fixed directory (for example /tmp/kafka/kafka_client_jaas.conf).
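For example, the file can be distributed with a simple scp loop; a sketch assuming password-free SSH between nodes and hypothetical node hostnames:

 for host in emr-header-1 emr-worker-1 emr-worker-2 emr-worker-3; do
     ssh $host "mkdir -p /tmp/kafka"
     scp /tmp/kafka/kafka_client_jaas.conf $host:/tmp/kafka/
 done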

  5. Add the following option to overlord.jvm on the E-MapReduce Druid configuration page.
-Djava.security.auth.login.config=/tmp/kafka/kafka_client_jaas.conf
  6. Add druid.indexer.runner.javaOpts=-Djava.security.auth.login.config=/tmp/kafka/kafka_client_jaas.conf and the other JVM startup parameters to middleManager.runtime on the E-MapReduce Druid configuration page.
  7. Restart the Druid service.

3. Use Apache Druid Kafka Indexing Service to consume Kafka data in real time

  1. Run the following command on the Kafka cluster (or Gateway) to create a topic named metrics.
 # If Kafka high-security mode is enabled:
 export KAFKA_OPTS="-Djava.security.auth.login.config=/etc/ecm/kafka-conf/kafka_client_jaas.conf"

 kafka-topics.sh --create --zookeeper emr-header-1:2181,emr-header-2:2181,emr-header-3:2181/kafka-1.0.0 --partitions 1 --replication-factor 1 --topic metrics

When actually creating the topic, replace each parameter in the command above according to your environment. The /kafka-1.0.0 part of the --zookeeper parameter is a path; to obtain it, log in to the Alibaba Cloud E-MapReduce console, open the Kafka service configuration page of the Kafka cluster, and check the value of the zookeeper.connect configuration item. If your Kafka cluster is self-built, replace the --zookeeper parameter according to the actual configuration of the cluster.
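To confirm the topic was created, you can optionally list topics with the same --zookeeper value; a minimal sketch:

 kafka-topics.sh --list --zookeeper emr-header-1:2181,emr-header-2:2181,emr-header-3:2181/kafka-1.0.0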

  2. Define the data format description file for the data source (named metrics-kafka.json) and place it in the current directory (or another directory you specify).
{
    "type": "kafka",
    "dataSchema": {
        "dataSource": "metrics-kafka",
        "parser": {
            "type": "string",
            "parseSpec": {
                "timestampSpec": {
                    "column": "time",
                    "format": "auto"
                },
                "dimensionsSpec": {
                    "dimensions": ["url", "user"]
                },
                "format": "json"
            }
        },
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "hour",
            "queryGranularity": "none"
        },
        "metricsSpec": [{
                "type": "count",
                "name": "views"
            },
            {
                "name": "latencyMs",
                "type": "doubleSum",
                "fieldName": "latencyMs"
            }
        ]
    },
    "ioConfig": {
        "topic": "metrics",
        "consumerProperties": {
            "bootstrap.servers": "emr-worker-1.cluster-xxxxxxxx:9092 (your Kafka cluster's bootstrap.servers)",
            "group.id": "kafka-indexing-service",
            "security.protocol": "SASL_PLAINTEXT",
            "sasl.mechanism": "GSSAPI"
        },
        "taskCount": 1,
        "replicas": 1,
        "taskDuration": "PT1H"
    },
    "tuningConfig": {
        "type": "kafka",
        "maxRowsInMemory": "100000"
    }
}

Note that ioConfig.consumerProperties.security.protocol and ioConfig.consumerProperties.sasl.mechanism are security-related options and are not required for non-secure Kafka clusters.

  3. Execute the following command to add the Kafka supervisor.
curl --negotiate -u:druid -b ~/cookies -c ~/cookies -XPOST -H 'Content-Type: application/json' -d @metrics-kafka.json http://emr-header-1.cluster-1234:18090/druid/indexer/v1/supervisor

The --negotiate, -u, -b, and -c options are only required for a secure E-MapReduce Druid cluster.
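After submitting the spec, you can check that the supervisor is running through the Overlord's supervisor API (the supervisor id is the dataSource name, metrics-kafka); a sketch, with the security options again only needed on a secure cluster:

 # List registered supervisors
 curl --negotiate -u:druid -b ~/cookies -c ~/cookies http://emr-header-1.cluster-1234:18090/druid/indexer/v1/supervisor

 # Check the status of the metrics-kafka supervisor
 curl --negotiate -u:druid -b ~/cookies -c ~/cookies http://emr-header-1.cluster-1234:18090/druid/indexer/v1/supervisor/metrics-kafka/status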

  4. Start a console producer on the Kafka cluster.
 # If Kafka high-security mode is enabled:
 export KAFKA_OPTS="-Djava.security.auth.login.config=/etc/ecm/kafka-conf/kafka_client_jaas.conf"
 echo -e "security.protocol=SASL_PLAINTEXT\nsasl.mechanism=GSSAPI" > /tmp/kafka/producer.conf

 kafka-console-producer.sh --producer.config /tmp/kafka/producer.conf --broker-list emr-worker-1:9092,emr-worker-2:9092,emr-worker-3:9092 --topic metrics
 >

The --producer.config /tmp/kafka/producer.conf option is only required for a secure Kafka cluster.

  5. Enter some data at the kafka-console-producer prompt.
{"time": "2018-03-06T09:57:58Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
{"time": "2018-03-06T09:57:59Z", "url": "/", "user": "bob", "latencyMs": 11}
{"time": "2018-03-06T09:58:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}

The timestamp can be generated with the following python command:

python -c 'import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))'
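If you prefer, a record with the current timestamp can be generated and piped into the producer in one go; a sketch reusing the producer settings from the previous step (the --producer.config option is only needed on a secure cluster):

 TS=$(python -c 'import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))')
 echo "{\"time\": \"$TS\", \"url\": \"/foo/bar\", \"user\": \"alice\", \"latencyMs\": 32}" | kafka-console-producer.sh --producer.config /tmp/kafka/producer.conf --broker-list emr-worker-1:9092,emr-worker-2:9092,emr-worker-3:9092 --topic metrics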
  6. Prepare a query file and name it metrics-search.json.
{
    "queryType" : "search",
    "dataSource" : "metrics-kafka",
    "intervals" : ["2018-03-02T00:00:00.000/2018-03-08T00:00:00.000"],
    "granularity" : "all",
    "searchDimensions": [
        "url",
        "user"
    ],
    "query": {
        "type": "insensitive_contains",
        "value": "bob"
    }
}
  7. Execute the query on the E-MapReduce Druid cluster Master.
curl --negotiate -u:druid -b ~/cookies -c ~/cookies -XPOST -H 'Content-Type: application/json' -d @metrics-search.json http://emr-header-1.cluster-1234:18082/druid/v2/?pretty

The --negotiate, -u, -b, and -c options are only required for a secure E-MapReduce Druid cluster.

Examples of normal return results:

[ {
    "timestamp" : "2018-03-06T09:00:00.000Z",
    "result" : [ {
        "dimension" : "user",
        "value" : "bob",
        "count" : 2
    } ]
} ]

4. About SLS Indexing Service

SLS Indexing Service is a Druid plug-in provided by E-MapReduce for consuming data from SLS (Alibaba Cloud Log Service).

4.1 Background introduction

The consumption model of SLS Indexing Service is similar to that of Kafka Indexing Service, so it also supports the same exactly-once semantics. It combines the advantages of SLS and Kafka Indexing Service:

  • Convenient data collection: SLS's many collection methods can be used to import data into SLS in real time.
  • No additional Kafka cluster needs to be maintained, removing one link from the data pipeline.
  • Exactly-once semantics are supported.
  • High reliability for consumer jobs: failed jobs are retried, and cluster restarts/upgrades are transparent to the business.

4.2 Preparation

  • If you have not activated the SLS service, activate it first and configure the corresponding Project and Logstore.
  • Prepare the following configuration items:
    • The endpoint of the SLS service (note: use the intranet endpoint)
    • An AccessKeyId and the corresponding AccessKeySecret with access to the SLS service

4.3 Use SLS Indexing Service

  1. Prepare the data format description file.
    If you are familiar with Kafka Indexing Service, SLS Indexing Service will feel very similar; for details, refer to the Kafka Indexing Service introduction above. We index the same data, so the data format description file of the data source is as follows (save it as metrics-sls.json):
{
    "type": "sls",
    "dataSchema": {
        "dataSource": "metrics-sls",
        "parser": {
            "type": "string",
            "parseSpec": {
                "timestampSpec": {
                    "column": "time",
                    "format": "auto"
                },
                "dimensionsSpec": {
                    "dimensions": ["url", "user"]
                },
                "format": "json"
            }
        },
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "hour",
            "queryGranularity": "none"
        },
        "metricsSpec": [{
                "type": "count",
                "name": "views"
            },
            {
                "name": "latencyMs",
                "type": "doubleSum",
                "fieldName": "latencyMs"
            }
        ]
    },
    "ioConfig": {
        "project": <your_project>,
        "logstore": <your_logstore>,
        "consumerProperties": {
            "endpoint": "cn-hangzhou-intranet.log.aliyuncs.com", (Hangzhou as an example; note that the intranet endpoint is used)
            "access-key-id": <your_access_key_id>,
            "access-key-secret": <your_access_key_secret>,
            "logtail.collection-mode": "simple"/"other"
        },
        "taskCount": 1,
        "replicas": 1,
        "taskDuration": "PT1H"
    },
    "tuningConfig": {
        "type": "sls",
        "maxRowsInMemory": "100000"
    }
}

Comparing with the Kafka Indexing Service section above, the two are basically the same. The fields that need attention are briefly listed here:

  • type: sls.
  • dataSchema.parser.parseSpec.format: related to ioConfig.consumerProperties.logtail.collection-mode, i.e. to the SLS log collection mode. If logs are collected in simple (minimalist) mode, fill in the format of the original file; if they are collected in another mode (other), the value here is json.
  • ioConfig.project: the project of the logs you want to collect.
  • ioConfig.logstore: the logstore of the logs you want to collect.
  • ioConfig.consumerProperties.endpoint: the SLS intranet endpoint, for example cn-hangzhou-intranet.log.aliyuncs.com for Hangzhou.
  • ioConfig.consumerProperties.access-key-id: the AccessKeyId of the account.
  • ioConfig.consumerProperties.access-key-secret: the AccessKeySecret of the account.
  • ioConfig.consumerProperties.logtail.collection-mode: the SLS log collection mode; fill in simple for minimalist mode, otherwise other.

Note that the ioConfig format in the above configuration file only applies to EMR-3.20.0 and earlier versions. Starting from EMR-3.21.0, ioConfig changes as follows:

"ioConfig": {
    
    
        "project": <your_project>,
        "logstore": <your_logstore>,
        "endpoint": "cn-hangzhou-intranet.log.aliyuncs.com", (以杭州为例,注意使用内网服务入口)
        "accessKeyId": <your_access_key_id>,
        "accessKeySec": <your_access_key_secret>,
        "collectMode": "simple"/"other"
        "taskCount": 1,
        "replicas": 1,
        "taskDuration": "PT1H"
    },

That is, the consumerProperties level is removed, and access-key-id, access-key-secret, and logtail.collection-mode are renamed to accessKeyId, accessKeySec, and collectMode respectively.

  2. Execute the following command to add the SLS supervisor.
curl --negotiate -u:druid -b ~/cookies -c ~/cookies -XPOST -H 'Content-Type: application/json' -d @metrics-sls.json http://emr-header-1.cluster-1234:18090/druid/indexer/v1/supervisor

Note that the --negotiate, -u, -b, and -c options are only required for a secure Druid cluster.

  3. Import data into SLS.

You can import data into SLS in many ways. Please refer to the SLS document for details.

  4. Perform related queries on the Druid side.
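For example, the search query from the Kafka Indexing Service section can be reused by changing its dataSource to metrics-sls (saved here under the hypothetical name metrics-sls-search.json) and posting it to the Broker as before; the security options are again only needed on a secure cluster:

 curl --negotiate -u:druid -b ~/cookies -c ~/cookies -XPOST -H 'Content-Type: application/json' -d @metrics-sls-search.json http://emr-header-1.cluster-1234:18082/druid/v2/?pretty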
