Kafka MaxCompute Data Migration Best Practices

Prerequisites
Build a Kafka cluster
Before migrating data, you need to make sure your Kafka cluster environment works correctly. This document uses the Alibaba Cloud EMR service to automatically build a Kafka cluster; for the detailed procedure, see: Quick Start with Kafka.

The EMR Kafka version information used in this document is as follows:
EMR version: EMR-3.12.1
Cluster type: Kafka
Software: Ganglia 3.7.2, ZooKeeper 3.4.12, Kafka 2.11-1.0.1, Kafka-Manager 1.3.3.16
The Kafka cluster uses a VPC network in the China (Hangzhou) region, and the ECS compute resources of the master instance group are configured with public IP addresses. The specific configuration is shown below.

Create a MaxCompute project
Activate the MaxCompute service and create a project. In this example, the project bigdata_DOC is created in the China East 1 (Hangzhou) region, with the related DataWorks services enabled, as shown below. For details, see Activate MaxCompute.

Background
Kafka is a distributed publish-subscribe messaging middleware. Thanks to its high performance and high throughput it is widely used, and it can process millions of messages per second. Kafka is suited to processing streaming data and is mainly used in scenarios such as user behavior tracking and log collection.

A typical Kafka cluster consists of a number of producers, brokers, consumers, and a ZooKeeper cluster. Kafka relies on ZooKeeper to manage and coordinate the cluster.

A Topic is the most common collection of messages in a Kafka cluster, and is a logical concept for message storage. Topics are not stored physically; instead, the messages in each Topic are stored by partition on the disks of the cluster nodes. A Topic can have multiple producers sending messages to it, and multiple consumers pulling (consuming) messages from it.

As each message is added to a partition, it is assigned an offset (numbered from 0), which uniquely identifies the message within that partition.
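The per-partition offset numbering described above can be sketched in Python. This is a toy in-memory model for illustration only, not actual Kafka code; the key-hashing partition strategy is one common assignment scheme, not the only one:

```python
# Toy model (illustrative, not Kafka source code): messages appended to a
# partition receive sequential offsets starting from 0, so an offset
# uniquely identifies a message within its partition.
from collections import defaultdict

class ToyTopic:
    """A minimal in-memory model of per-partition offset assignment."""

    def __init__(self, num_partitions):
        self.partitions = defaultdict(list)  # partition id -> list of messages
        self.num_partitions = num_partitions

    def append(self, key, value):
        # Pick a partition by hashing the key (one common strategy).
        partition = hash(key) % self.num_partitions
        offset = len(self.partitions[partition])  # next sequential offset
        self.partitions[partition].append((offset, key, value))
        return partition, offset

topic = ToyTopic(num_partitions=10)
p1, o1 = topic.append("user-1", "click")
p2, o2 = topic.append("user-1", "scroll")  # same key -> same partition
print(o1, o2)  # offsets within that partition: 0, then 1
```

Note that offsets are only unique within a partition; two messages in different partitions of the same Topic can share the same offset.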

Steps
Prepare test tables and data
Create test data on the Kafka cluster
To make sure you can log on to the EMR cluster's Header host, and that MaxCompute and DataWorks can communicate with the Header host, first configure the Header host's security group to open TCP ports 22 and 9092.
Log on to the EMR cluster's Header host
Go to the EMR Hadoop cluster management console > Host List page, confirm the address of the EMR cluster's Header host, and connect to it via SSH.

Create a test Topic
Use the kafka-topics.sh --zookeeper emr-header-1:2181/kafka-1.0.1 --partitions 10 --replication-factor 3 --topic testkafka --create command to create a test Topic named testkafka. You can then use the kafka-topics.sh --list --zookeeper emr-header-1:2181/kafka-1.0.1 command to view the Topics that have been created.
[root@emr-header-1 ~]# kafka-topics.sh --zookeeper emr-header-1:2181/kafka-1.0.1 --partitions 10 --replication-factor 3 --topic testkafka --create
Created topic "testkafka".
[root@emr-header-1 ~]# kafka-topics.sh --list --zookeeper emr-header-1:2181/kafka-1.0.1
__consumer_offsets
_emr-client-metrics
_schemas
connect-configs
connect-offsets
connect-status
testkafka
Write test data
You can use the kafka-console-producer.sh --broker-list emr-header-1:9092 --topic testkafka command to simulate a producer writing data to the Topic testkafka. Since Kafka is used to process streaming data, you can keep writing data to it continuously. To make the test results meaningful, we recommend writing more than 10 records.
[root@emr-header-1 ~]# kafka-console-producer.sh --broker-list emr-header-1:9092 --topic testkafka

123
abc

To verify that the writes took effect, you can open a second SSH window at the same time and use the kafka-console-consumer.sh --bootstrap-server emr-header-1:9092 --topic testkafka --from-beginning command to simulate a consumer, verifying that the data has been successfully written to Kafka. As shown below, when the writes succeed, you can see the written data.

[root@emr-header-1 ~]# kafka-console-consumer.sh --bootstrap-server emr-header-1:9092 --topic testkafka --from-beginning
123
abc
Create a MaxCompute table
To make sure MaxCompute can receive the Kafka data, first create a table on MaxCompute. For easy testing, this example uses a non-partitioned table.
Log on to DataWorks to create the table; for details, see Table Management.

You can click DDL mode to build the table; the table creation statement is as follows.
CREATE TABLE testkafka (
    `key` string,
    `value` string,
    `partition1` string,
    `timestamp1` string,
    `offset` string,
    `t123` string,
    `event_id` string,
    `tag` string
);
Here each column corresponds to a default column of the DataWorks Data Integration Kafka Reader, and you can name the columns yourself. For details, see Configuring Kafka Reader:
__key__ represents the message key.
__value__ represents the full content of the message.
__partition__ represents the partition the message is in.
__headers__ represents the headers of the message.
__offset__ represents the offset of the message.
__timestamp__ represents the timestamp of the message.
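The mapping from those Kafka Reader default columns to the MaxCompute table created above can be sketched as follows. This is purely illustrative: the ConsumerRecord type is a hypothetical stand-in for a consumed Kafka message, not the actual DataWorks implementation.

```python
# Illustrative sketch: how the Kafka Reader's default columns conceptually
# map onto the testkafka MaxCompute table. ConsumerRecord is a stand-in
# (hypothetical), not a real DataWorks or Kafka client type.
from dataclasses import dataclass

@dataclass
class ConsumerRecord:          # stand-in for one consumed Kafka message
    key: bytes
    value: bytes
    partition: int
    offset: int
    timestamp: int             # epoch milliseconds

def to_maxcompute_row(rec: ConsumerRecord) -> dict:
    """Build a row whose keys match the testkafka table columns."""
    return {
        "key": rec.key.decode(),            # from __key__
        "value": rec.value.decode(),        # from __value__
        "partition1": str(rec.partition),   # from __partition__
        "timestamp1": str(rec.timestamp),   # from __timestamp__
        "offset": str(rec.offset),          # from __offset__
    }

row = to_maxcompute_row(ConsumerRecord(b"user-1", b"123", 6, 0, 1557000000000))
print(row["value"], row["offset"])  # -> 123 0
```

Because all table columns are declared as string, every field is converted to a string before being written, which is why numeric fields such as partition and offset map to string columns.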
Synchronize the data
Create a custom resource group
Because the current DataWorks default resource group cannot fully support the Kafka plugin, you need to use a custom resource group to complete the data synchronization. For details about custom resource groups, see Add Task Resources.

In this example, to save resources, the EMR cluster's Header host is used directly as the custom resource group. When you are done, wait for the server status to become Available.

New and run the synchronization task
, right-click your business process data integration, data integration node and select New> Data Synchronization.

After the new data synchronization node, you need to select the data source data source for Kafka, the whereabouts of the data source for data ODPS, and use the default data source odps_first. Select the data table for the whereabouts of your new testkafka. After the above configuration, please click below button box, and converted into a script mode.

Configure the script as follows; for an explanation of the code, see Configuring Kafka Reader.
{
    "type": "job",
    "steps": [
        {
            "stepType": "kafka",
            "parameter": {
                "server": "47.xxx.xxx.xxx:9092",
                "kafkaConfig": {
                    "group.id": "console-consumer-83505"
                },
                "valueType": "ByteArray",
                "column": [
                    "__key__",
                    "__value__",
                    "__partition__",
                    "__timestamp__",
                    "__offset__",
                    "'123'",
                    "event_id",
                    "tag.desc"
                ],
                "topic": "testkafka",
                "keyType": "ByteArray",
                "waitTime": "10",
                "beginOffset": "0",
                "endOffset": "3"
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "odps",
            "parameter": {
                "partition": "",
                "truncate": true,
                "compress": false,
                "datasource": "odps_first",
                "column": [
                    "key",
                    "value",
                    "partition1",
                    "timestamp1",
                    "offset",
                    "t123",
                    "event_id",
                    "tag"
                ],
                "emptyAsNull": false,
                "table": "testkafka"
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "version": "2.0",
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    },
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "throttle": false,
            "concurrent": 1,
            "dmu": 1
        }
    }
}
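A common source of failed sync runs is a mismatch between the reader and writer column lists, or an invalid offset range. The check below is an illustrative sketch (not a DataWorks feature) of how you could sanity-check the job script before running it; the trimmed-down job dictionary mirrors only a few fields of the script above:

```python
# Illustrative sanity check (not part of DataWorks): the Kafka Reader's
# column list must line up one-to-one with the ODPS Writer's column list,
# and beginOffset must be smaller than endOffset.
def validate_sync_job(job):
    reader, writer = job["steps"]
    r_cols = reader["parameter"]["column"]
    w_cols = writer["parameter"]["column"]
    assert len(r_cols) == len(w_cols), "reader/writer column counts differ"
    begin = int(reader["parameter"]["beginOffset"])
    end = int(reader["parameter"]["endOffset"])
    assert begin < end, "beginOffset must be smaller than endOffset"
    return list(zip(r_cols, w_cols))  # the column mapping, in order

# A trimmed-down version of the job script above:
job = {
    "steps": [
        {"stepType": "kafka",
         "parameter": {"column": ["__key__", "__value__", "__offset__"],
                       "beginOffset": "0", "endOffset": "3"}},
        {"stepType": "odps",
         "parameter": {"column": ["key", "value", "offset"]}},
    ]
}
print(validate_sync_job(job)[0])  # -> ('__key__', 'key')
```

Columns are matched by position, not by name, so the order of the writer's column list must follow the order of the reader's.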
You can use the kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --list command on the Header host to view the consumer group names, which are used as the group.id parameter.
[root@emr-header-1 ~]# kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --list
Note: This will not show information about old Zookeeper-based consumers.

_emr-client-metrics-handler-group
console-consumer-69493
console-consumer-83505
console-consumer-21030
console-consumer-45322
console-consumer-14773
Taking console-consumer-83505 as an example, you can use the kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --describe --group console-consumer-83505 command on the Header host and confirm the beginOffset and endOffset parameters based on its output.
[root@emr-header-1 ~]# kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --describe --group console-consumer-83505
Note: This will not show information about old Zookeeper-based consumers.
Consumer group 'console-consumer-83505' has no active members.
TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST  CLIENT-ID
testkafka  6          0               0               0    -            -     -
test       6          3               3               0    -            -     -
testkafka  0          0               0               0    -            -     -
testkafka  1          1               1               0    -            -     -
testkafka  5          0               0               0    -            -     -
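Reading the offsets out of that describe output can be sketched as follows. This is an illustrative helper, not an official Kafka tool; it simply parses the tabular output above and finds the largest LOG-END-OFFSET for a given topic, which gives an upper bound for the endOffset parameter:

```python
# Illustrative helper (not an official Kafka tool): parse describe output
# and find the largest LOG-END-OFFSET across a topic's partitions.
describe_output = """\
testkafka  6  0  0  0  -  -  -
test       6  3  3  0  -  -  -
testkafka  0  0  0  0  -  -  -
testkafka  1  1  1  0  -  -  -
testkafka  5  0  0  0  -  -  -
"""

def max_log_end_offset(text, topic):
    best = 0
    for line in text.splitlines():
        fields = line.split()
        # Columns: TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG ...
        if fields and fields[0] == topic:
            best = max(best, int(fields[3]))
    return best

print(max_log_end_offset(describe_output, "testkafka"))  # -> 1
```

Note that CURRENT-OFFSET and LOG-END-OFFSET are tracked per partition, so the beginOffset and endOffset you set in the sync script apply to every partition the reader consumes.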
After completing the script configuration, first switch the task's resource group to the resource group you just created, then click Run.

After the run completes, you can see the results in the run log; the log of a successful run looks like the following.

Verify the results
You can create a data development task and run SQL statements to see whether the data from Kafka has been synchronized to the table. This example runs the select * from testkafka; statement and clicks Run.

The execution results are as follows. In this example, multiple records were entered into the testkafka Topic to guarantee a result; you can check whether the data is consistent with what you entered.

Origin yq.aliyun.com/articles/704008