Kafka: a necessary skill for big data practitioners

 


 

Preface

The hottest technologies in the Internet industry are often summed up as "ABC": AI (artificial intelligence), BigData, and Cloud computing platforms. Blockchain technology and the digital currency the central bank recently piloted may also come up. A and C are high-threshold skills that most companies do not generally need and that are hard to master. B, by contrast, is relatively accessible: companies large and small can take part using the open source technology stack.

Why Kafka?

At the time of writing, the latest release of Kafka is 2.6.0, which is also the current stable version.

As an engineer or architect, you have probably taken part in building many big data business systems in your day-to-day work. Because these systems serve the company's business, they usually execute only conventional business logic, which makes them data-intensive rather than compute-intensive applications.

For data-intensive applications, how you handle surging data volume, growing data complexity, and ever-faster rates of data change is the clearest demonstration of a big data engineer's or architect's skill. In actual engineering practice, we have found that Kafka helps greatly with all of these problems. Take surging data volume as an example: Kafka effectively isolates upstream and downstream services, buffers sudden spikes in upstream traffic, and delivers it to downstream subsystems at a smooth rate, shielding them from irregular traffic bursts. For a big data practitioner, then, proficiency in Kafka is an essential skill.
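As a minimal sketch of this peak-shaving idea (assuming a local broker at 127.0.0.1:9092 and a hypothetical topic named upstream-events), a downstream consumer can cap how much it pulls per poll with max.poll.records and drain the backlog at its own steady pace, no matter how fast the upstream produces:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PacedConsumer {

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "downstream-service");
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // max.poll.records: pull at most 100 records per poll; the broker buffers
        // the upstream burst while the downstream works through it steadily
        properties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties)) {
            consumer.subscribe(Collections.singletonList("upstream-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value()); // downstream business logic goes here
                }
            }
        }
    }
}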

In fact, Kafka has a very wide range of application scenarios, and the industry currently regards Apache Kafka as the leader of the entire messaging engine field. That alone makes it worth learning and mastering. By learning a single framework, we can implement messaging, application integration, distributed storage, and even stream processing application development and deployment in real business systems.

The 2019 Two Sessions once again called for deepening R&D and application in big data, artificial intelligence, and other fields, and Kafka has an important role to play in big data engineering, whether as a messaging engine or as a real-time stream processing platform.

How to learn Kafka?

The first step in mastering Kafka is to find the Kafka client that matches the programming language you know. At present, the two most important clients are the Java client and librdkafka (the C/C++ client). Both are actively maintained and frequently updated, which makes them well worth a sustained investment of time.

https://github.com/edenhill/librdkafka

  1. Once you have chosen the client you want to use, go to the official website and study its code examples. If you can compile and run those examples correctly, you have a basic command of the client.
  • For example, the Java client: http://kafka.apache.org/24/documentation.html#api

Maven Dependency:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.6.0</version>
</dependency>

Producer demo code:

package kafka;

import java.util.Properties;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class Producer {

    public static void main(String[] args) {
        Properties properties = new Properties();
        // bootstrap.servers: Kafka cluster address list, host1:port1,host2:port2,...
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        // key.serializer: how message keys are serialized
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // value.serializer: how message values are serialized
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);

        // 0: asynchronous fire-and-forget send
        for (int i = 0; i < 10; i++) {
            String data = "async :" + i;
            // send the message without waiting for the result
            producer.send(new ProducerRecord<>("demo-topic", data));
        }

        // 1: synchronous send; get() blocks until the result returns
        for (int i = 0; i < 10; i++) {
            String data = "sync : " + i;
            try {
                // send the message and wait for its metadata
                Future<RecordMetadata> send = producer.send(new ProducerRecord<>("demo-topic", data));
                RecordMetadata recordMetadata = send.get();
                System.out.println(recordMetadata);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // 2: asynchronous send with a callback
        for (int i = 0; i < 10; i++) {
            String data = "callback : " + i;
            // send the message; the callback fires when the send completes
            producer.send(new ProducerRecord<>("demo-topic", data), new Callback() {
                @Override
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    // callback invoked with the send result
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.println(metadata);
                    }
                }
            });
        }

        producer.close();
    }
}

Consumer demo code:

package kafka;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class Consumer {

    public static void main(String[] args) {
        Properties properties = new Properties();

        // bootstrap.servers: Kafka cluster address list, host1:port1,host2:port2,...
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        // key.deserializer: how message keys are deserialized
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // value.deserializer: how message values are deserialized
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // group.id: consumer group id
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        // enable.auto.commit: commit offsets automatically
        properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);
        // auto.offset.reset: where to start when no committed offset exists
        properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
        String[] topics = new String[]{"demo-topic"};
        consumer.subscribe(Arrays.asList(topics));


        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record);
            }
        }

    }
}
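The demo above relies on automatic offset commits (enable.auto.commit=true). For at-least-once processing you often want to commit manually after the records have been processed. A minimal sketch of that variation, reusing the consumer from the demo with auto-commit switched off:

        // assumes: properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record); // process the record first
            }
            if (!records.isEmpty()) {
                // commit only after processing, so a crash replays rather than loses records
                consumer.commitSync();
            }
        }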

librdkafka examples:

https://github.com/edenhill/librdkafka/tree/master/examples

Alternatively, you can use pykafka, a Python client for Kafka:

https://github.com/Parsely/pykafka

Install the pykafka client module:

$ pip install pykafka

Initialize the client object

>>> from pykafka import KafkaClient
>>> client = KafkaClient(hosts="127.0.0.1:9092,127.0.0.1:9093,...")

Connect over TLS (SSL):

>>> from pykafka import KafkaClient, SslConfig
>>> config = SslConfig(cafile='/your/ca.cert',
...                    certfile='/your/client.cert',  # optional
...                    keyfile='/your/client.key',  # optional
...                    password='unlock my client key please')  # optional
>>> client = KafkaClient(hosts="127.0.0.1:<ssl-port>,...",
...                      ssl_config=config)

List the available topics and get a topic reference:

>>> client.topics
>>> topic = client.topics['my.test']

Send messages to the topic. Here the send is synchronous: each call waits for the message to be acknowledged before the next one is sent (note that pykafka expects payloads as bytes):

>>> with topic.get_sync_producer() as producer:
...     for i in range(4):
...         producer.produce(('test message ' + str(i ** 2)).encode())

To improve throughput, it is recommended that the producer send messages asynchronously; produce() then returns immediately after being called:

>>> from queue import Empty
>>> with topic.get_producer(delivery_reports=True) as producer:
...     count = 0
...     while True:
...         count += 1
...         producer.produce(b'test msg', partition_key=str(count).encode())
...         if count % 10 ** 5 == 0:  # adjust this or bring lots of RAM ;)
...             while True:
...                 try:
...                     msg, exc = producer.get_delivery_report(block=False)
...                     if exc is not None:
...                         print('Failed to deliver msg {}: {}'.format(
...                             msg.partition_key, repr(exc)))
...                     else:
...                         print('Successfully delivered msg {}'.format(
...                             msg.partition_key))
...                 except Empty:
...                     break

Consume messages from the topic:

>>> consumer = topic.get_simple_consumer()
>>> for message in consumer:
...     if message is not None:
...         print(message.offset, message.value)
0 b'test message 0'
1 b'test message 1'
2 b'test message 4'
3 b'test message 9'

Load-balanced consumer (BalancedConsumer):

>>> balanced_consumer = topic.get_balanced_consumer(
...     consumer_group='testgroup',
...     auto_commit_enable=True,
...     zookeeper_connect='myZkClusterNode1.com:2181,myZkClusterNode2.com:2181/myZkChroot'
... )

PyKafka also includes an optional C extension backed by librdkafka that speeds up producer and consumer operations; per the pykafka documentation, you opt in by passing use_rdkafka=True to methods such as topic.get_producer().

  2. Next, try modifying the sample code to understand and use the other APIs, and observe how your changes affect the results. If none of this poses a problem, write a small project to verify what you have learned, then work on improving the client's reliability and performance. At that point, reread the official Kafka documentation to make sure you understand the parameters that can affect reliability and performance (see the config sketch after this list).
  3. Finally, learn Kafka's advanced features, such as stream processing application development. The stream processing API can not only produce and consume messages, but also perform advanced operations such as time-window aggregation and stream joins (see the Kafka Streams sketch below).
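As an illustration of the reliability-related producer parameters referred to in step 2, here is a minimal sketch. The values shown are a starting point under the assumption that durability matters more than latency, not a definitive recommendation (broker-side settings such as min.insync.replicas matter too):

import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

public class ReliableProducerConfig {

    public static Properties reliableProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        // acks=all: wait for all in-sync replicas to acknowledge each write
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // retries: retry transient failures instead of dropping messages
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        // enable.idempotence: prevent duplicates introduced by those retries
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        // delivery.timeout.ms: bound how long a send may keep trying before failing
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);
        return props;
    }
}

These Properties are passed to the KafkaProducer constructor exactly as in the producer demo above.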
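And as a taste of the stream processing API from step 3, a minimal Kafka Streams sketch (assuming the kafka-streams 2.6.0 artifact is on the classpath and a hypothetical input topic demo-topic) that counts records per key over one-minute tumbling windows:

import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.TimeWindows;

public class StreamsDemo {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // count records per key in 1-minute tumbling windows and print the results
        builder.<String, String>stream("demo-topic")
               .groupByKey()
               .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
               .count()
               .toStream()
               .foreach((windowedKey, count) ->
                       System.out.println(windowedKey + " -> " + count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}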

For a system administrator or operations engineer, the corresponding goal should be learning to build and manage Kafka in production: how to evaluate, plan, and build the production environment based on actual business needs. Monitoring that environment is just as important. Kafka exposes a large number of JMX metrics, which you can watch with any well-known monitoring framework, such as Kafka-Eagle.

https://github.com/smartloli/kafka-eagle
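To give a feel for those JMX metrics, here is a small sketch that reads one broker metric over a JMX connection. It assumes the broker was started with JMX enabled (for example, with the JMX_PORT environment variable set to 9999) and that the MBean shown exists on your broker version:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxMetricDemo {

    public static void main(String[] args) throws Exception {
        // connect to the broker's JMX endpoint
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            // broker-wide incoming message rate
            ObjectName metric = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            Object rate = connection.getAttribute(metric, "OneMinuteRate");
            System.out.println("MessagesInPerSec (1-min rate): " + rate);
        }
    }
}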

 

The core Kafka content you need to master

  1. The general principles and uses of messaging engines, and how Kafka, as a representative of excellent messaging engines, performs in this regard.
  2. How Kafka is used in production, especially how online environment plans are drawn up.
  3. All aspects of the Kafka clients, including the principles and practice of producers and consumers.
  4. Kafka's core design principles, including the Controller design mechanism and a full walkthrough of request processing.
  5. Kafka operations and monitoring: running Kafka clusters efficiently and monitoring them effectively.
