Application of Kafka in Big Data Processing

1. Introduction to Kafka

1. Basic concepts

Kafka is a highly available distributed messaging system whose main job is to support reliable, continuous message transmission between different applications. In this process, Kafka distributes, balances, and stores the message data.

2. The main functions of Kafka

The main functions of Kafka are message production and consumption. On the production side, Kafka supports delivering messages to multiple receivers, enabling efficient transmission between applications; on the consumption side, Kafka tracks consumption progress (offsets) so that each consumer can receive messages in the way it expects.

3. Features of Kafka

Kafka has the following characteristics:

  • High availability: partition and replica mechanisms ensure high availability.
  • High scalability: supports horizontal scaling and can handle PB-level data volumes.
  • Persistence: messages are persisted to disk and retained for a configurable period of time.
  • High performance: Kafka's I/O uses sequential reads and writes, giving high read/write speed and meeting high-throughput requirements.
  • Multi-language support: Kafka provides APIs in multiple languages such as Java, C++, and Python, suitable for various development scenarios.

2. Application scenarios

1. Data collection and consumption

As an efficient message transmission mechanism, Kafka is widely used in data collection. Data producers send raw data to Kafka, and various data consumers read it from Kafka, forming a complete data collection and transmission pipeline.

The following shows how to perform production and consumption operations on Kafka:

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.clients.consumer.*;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class KafkaDemo {

    public static void main(String[] args) throws Exception {

        // Producer
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(producerProps);
        producer.send(new ProducerRecord<>("test_topic", "key", "value"));
        producer.close();

        // Consumer
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "demo-group");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Consumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
        consumer.subscribe(Arrays.asList("test_topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.key() + ":" + record.value());
            }
        }
    }
}

2. Data storage and persistence

Kafka can also serve as an efficient data storage and persistence mechanism. Using the persistence mechanism provided by Kafka, different types of data can be stored on the Kafka Brokers in the form of logs and read back when needed.
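
As an illustration, the following minimal sketch (assuming a local broker at localhost:9092 and a topic named test_topic with at least one partition) reads back previously stored messages by assigning a partition and seeking to an earlier offset:

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class KafkaReplayExample {

    public static void main(String[] args) {
        // Hypothetical connection settings; adjust for your environment
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually assign partition 0 of "test_topic" and rewind to offset 0
            TopicPartition partition = new TopicPartition("test_topic", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 0L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.offset() + ": " + record.value());
            }
        }
    }
}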

3. Real-time data processing and stream computing

Kafka supports real-time data processing and stream computing as part of a streaming data architecture. Users can build real-time applications with the Kafka Streams API, and Kafka also integrates with stream processing frameworks such as Storm and Flink.
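
As a small illustration of the Kafka Streams API, the sketch below reads records from one topic, filters out empty values, and writes the result to another topic. The application id and the topic names input_topic and output_topic are assumptions made for this example:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StreamsFilterExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from an input topic, keep only non-empty values, write to an output topic
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input_topic");
        input.filter((key, value) -> value != null && !value.isEmpty())
             .to("output_topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Close the streams application when the JVM shuts down
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}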

4. Data Communication and Collaboration

As a powerful message queuing system, Kafka can support data communication and collaboration between different distributed components. For example, users can use Kafka to send data to various endpoints, thereby realizing the interaction between different components.

3. Technology integration

1. The integration of Kafka and Hadoop ecological technology

As a high-throughput distributed publish-subscribe messaging system, Kafka integrates well with the Hadoop ecosystem. Two commonly used approaches are:

1) Use Kafka as the data source of Hadoop

We can use Kafka as the data source of Hadoop for data collection, data transmission and other scenarios:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import kafka.serializer.StringDecoder;

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class KafkaStreamingApp {

   public static void main(String[] args) throws Exception {

      String brokers = "localhost:9092";
      String topics = "testTopic";

      SparkConf conf = new SparkConf().setAppName("KafkaStreamingApp").setMaster("local[2]");
      JavaStreamingContext context = new JavaStreamingContext(conf, Durations.seconds(10));

      Map<String, String> kafkaParams = new HashMap<>();
      kafkaParams.put("metadata.broker.list", brokers);

      Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));

      JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
         context,
         String.class,
         String.class,
         StringDecoder.class,
         StringDecoder.class,
         kafkaParams,
         topicsSet
      );

      stream.print();

      context.start();
      context.awaitTermination();
   }
}

2) Use Hadoop as a Kafka producer

After using Kafka as a data source for Hadoop, we can also send data from Hadoop out to other consumers through Kafka:

import org.apache.kafka.clients.producer.*;
import java.util.Properties;
 
public class KafkaProducerExample {

    public static void main(String[] args) throws Exception {

        String topicName = "testTopic";
        String key = "Key1";
        String value= "Value-99";
        
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer");         
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        
        ProducerRecord<String, String> record = new ProducerRecord<>(topicName,key,value);
        producer.send(record);
        
        producer.close();
        System.out.println("A message has been successfully sent!");
    }
}

2. Integration of Kafka with stream processing frameworks such as Spark and Flink

Kafka integrates well with stream processing frameworks such as Spark and Flink. In these frameworks, Kafka is widely used as both the input source and the output sink, and offers high efficiency and stability. Let's take Spark Streaming as an example:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.*;
import kafka.serializer.DefaultDecoder;
import scala.Tuple2;

import java.util.*;

public class KafkaStreamingApp {

  public static void main(String[] args) throws InterruptedException {

      String brokers = "localhost:9092";  // Kafka connection information
      String topics = "testTopic";        // name of the subscribed topic
      SparkConf conf = new SparkConf().setAppName("KafkaStreamingApp").setMaster("local[2]"); // Spark configuration
      JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(2000));         // batch interval of the stream

      Map<String, String> kafkaParams = new HashMap<String, String>();
      kafkaParams.put("metadata.broker.list", brokers); // list of Kafka Broker addresses to connect to

      Set<String> topicsSet = new HashSet<String>(Arrays.asList(topics.split(","))); // set of subscribed topics

      JavaPairInputDStream<byte[], byte[]> messages = KafkaUtils.createDirectStream(
          jssc,
          byte[].class,
          byte[].class,
          DefaultDecoder.class,
          DefaultDecoder.class,
          kafkaParams,
          topicsSet
      ); // create the input data stream

      JavaDStream<String> lines = messages.map(new Function<Tuple2<byte[], byte[]>, String>() {
          public String call(Tuple2<byte[], byte[]> tuple2) {
              // convert the tuple value to a String
              return new String(tuple2._2());
          }
      });

      lines.print();            // print the records in the stream

      jssc.start();             // start the Spark Streaming application
      jssc.awaitTermination();  // wait for the application to terminate
   }
}

3. Integration of Kafka with log search engines such as Elasticsearch

Kafka integrates well with log search engines such as Elasticsearch for real-time processing and searching. In practice, we use Kafka Connect to bridge Kafka and Elasticsearch and to batch-process and import the data. Example code is shown below:

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.json.JsonConverter;
import org.apache.kafka.connect.sink.SinkConnector;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

import java.util.*;

public class ElasticsearchSinkExample extends SinkConnector {

    private Map<String, String> configProps;

    public void start(Map<String, String> props) {
        this.configProps = props;
    }

    public Class<? extends Task> taskClass() {
        return ElasticsearchSinkTask.class;
    }

    public List<Map<String, String>> taskConfigs(int maxTasks) {
        List<Map<String, String>> configs = new ArrayList<>(maxTasks);
        for (int x = 0; x < maxTasks; x++) {
            configs.add(configProps);
        }
        return configs;
    }

    public void stop() {
    }

    public ConfigDef config() {
        return new ConfigDef();
    }

    public String version() {
        return "1";
    }

    public static class ElasticsearchSinkTask extends SinkTask {

        private String hostname;
        private int port;
        private String indexPrefix;

        public String version() {
            return "1";
        }

        public void start(Map<String, String> props) {
            hostname = props.get("address");
            port = Integer.parseInt(props.get("port"));
            indexPrefix = props.get("index");

            // Connect to Elasticsearch and create index if not exists
            //...
        }

        public void put(Collection<SinkRecord> records) {
            // Convert records to JSON
            Schema schema = SchemaBuilder.struct().name("record").version(1)
                .field("id", Schema.STRING_SCHEMA)
                .field("name", Schema.STRING_SCHEMA)
                .field("age", Schema.INT32_SCHEMA)
                .build();

            JsonConverter converter = new JsonConverter();
            converter.configure(new HashMap<>(), false);

            List<Map<String, String>> convertedRecords = new ArrayList<>(records.size());
            for (SinkRecord record : records) {
                byte[] json = converter.fromConnectData("topic", schema, record.value());
                Map<String, String> convertedRecord = new HashMap<>();
                convertedRecord.put("id", record.key().toString());
                convertedRecord.put("json", new String(json));
                convertedRecords.add(convertedRecord);
            }

            // Write records to Elasticsearch
            //...
        }

        public void flush(Map<TopicPartition, OffsetAndMetadata> offsets) {
        }

        public void stop() {
        }
    }
}

4. Performance optimization

In high-concurrency, high-traffic scenarios, Kafka needs performance tuning to ensure its stability and reliability. This article discusses Kafka's performance tuning process, including parameter adjustments for producers, consumers, and the cluster.

1. The main process of Kafka performance tuning

Kafka performance tuning is mainly divided into the following two processes:

  • Identify the current bottleneck: before doing any performance tuning, first determine what the current bottleneck is. For example, is throughput dropping because of slow network transmission, or are messages piling up because production and consumption speeds are mismatched?
  • Adjust performance parameters: once the bottleneck is identified, adjust the relevant performance parameters for that situation, relieving the bottleneck and improving Kafka's performance.

2. Producer performance tuning

2.1 Send messages in batches

When a producer sends messages to Kafka, it can send multiple messages in one batch instead of sending them one by one. This reduces the overhead of network transmission and I/O, thereby improving throughput. We can control the batch size by setting the batch.size parameter.

2.2 Compressed messages

Compressing messages can also greatly reduce the overhead of network transmission, thereby increasing throughput. Kafka supports multiple compression algorithms, such as gzip, snappy, and lz4. We can choose which compression algorithm to use by setting the compression.type parameter.

2.3 Send messages asynchronously

When sending messages, the producer can work in synchronous or asynchronous mode. Synchronous sending blocks the thread until the message has been acknowledged, while asynchronous sending does not. The asynchronous approach can greatly improve throughput, but increases the risk of unnoticed delivery failures. We can set the linger.ms parameter to control how long the producer waits to accumulate a batch before sending it.
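
The sketch below pulls the three producer settings above together. The concrete values (a 32 KB batch size, lz4 compression, a 10 ms linger) are only illustrative assumptions and should be tuned against the actual workload:

import org.apache.kafka.clients.producer.*;

import java.util.Properties;

public class TunedProducerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", 32768);        // batch up to 32 KB of records per partition
        props.put("compression.type", "lz4");  // compress batches before sending
        props.put("linger.ms", 10);            // wait up to 10 ms to fill a batch

        Producer<String, String> producer = new KafkaProducer<>(props);

        // Asynchronous send: the callback reports success or failure without blocking
        producer.send(new ProducerRecord<>("test_topic", "key", "value"), (metadata, exception) -> {
            if (exception != null) {
                exception.printStackTrace();
            } else {
                System.out.println("Sent to partition " + metadata.partition()
                        + " at offset " + metadata.offset());
            }
        });

        producer.close();
    }
}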

3. Consumer performance tuning

3.1 Increase the number of partitions and consumers

In a Kafka consumer group, each consumer instance only processes messages from the partitions assigned to it, and each partition is consumed by at most one consumer in the group. Consumer parallelism is therefore limited by the number of partitions, and we can raise consumption throughput by increasing both the number of partitions and the number of consumers.

3.2 Increase the pull data size

When consumers fetch data, the amount of data pulled in each request also affects performance. Generally, the more data pulled per request, the higher the consumer throughput. We can increase the amount of data pulled per fetch by setting the fetch.max.bytes parameter.

3.3 Increase the interval of pulling data

When consumers fetch data, the wait interval for each pull can also be adjusted. We can control how long the broker waits for data to accumulate before answering a fetch request by setting the fetch.max.wait.ms parameter.
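
The sketch below combines the consumer settings discussed above; the specific values and the group id are illustrative assumptions, not recommendations:

import org.apache.kafka.clients.consumer.*;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class TunedConsumerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "tuned-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("fetch.max.bytes", 52428800); // allow up to 50 MB per fetch response
        props.put("fetch.max.wait.ms", 500);    // let the broker wait up to 500 ms for data
        props.put("max.poll.records", 1000);    // return up to 1000 records per poll()

        try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("test_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.key() + ":" + record.value());
                }
            }
        }
    }
}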

4. Cluster performance tuning

4.1 Improve recovery speed

Kafka's cluster consists of multiple Brokers. When one of the Brokers goes down, Kafka needs to recover its data. To improve recovery speed, we can use synchronous replication (for example, acks=all together with min.insync.replicas), which keeps the replicas consistent with the leader so that leadership can be switched quickly to a replica when the leader goes down.
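
As a minimal sketch of this idea, the following code (the topic name and sizing are assumptions) creates a topic with three replicas and requires at least two in-sync replicas, so that a producer using acks=all only gets an acknowledgement once a write has been replicated:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class ReplicatedTopicExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 3 partitions, replication factor 3,
            // and at least 2 in-sync replicas required for acks=all writes
            NewTopic topic = new NewTopic("replicated_topic", 3, (short) 3)
                    .configs(Collections.singletonMap("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
        // Producers writing to this topic should set acks=all so that a write is
        // acknowledged only after it has been replicated to the in-sync replicas.
    }
}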

4.2 Allocation Partition Balance

In a Kafka cluster there are multiple Brokers and multiple Topics. To keep the number of partitions on each Broker roughly balanced, we can use Kafka's partition reassignment tooling to redistribute partitions so that the load is balanced across Brokers.

5. Storage Management

1. Message compression configuration

Messages can be compressed in Kafka to save disk space and network bandwidth. Kafka supports several compression algorithms, including GZIP, Snappy, and LZ4. The Producer compresses message batches according to its configuration before sending them, and the Broker stores them in compressed form. When a Consumer reads a message from Kafka, Kafka automatically decompresses it and passes it to the Consumer for processing.

The following is an example of configuring message compression using the Java API:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("compression.type", "gzip"); // set the compression type to GZIP
Producer<String, String> producer = new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());

2. Storage cleanup strategy

Kafka's messages are stored on the Broker. If messages are continuously written and never deleted, disk usage will keep growing. Therefore, Kafka provides storage cleanup strategies to control the amount of message data kept on the Broker.

Kafka supports two cleanup strategies: deletion and compaction. The deletion strategy removes old or expired log segments, freeing disk space; the compaction strategy retains only the most recent message for each key, which also keeps the log size under control.

The following shows how a storage cleanup policy is configured. Note that these are Broker- or Topic-level settings (for example in server.properties or as topic configs), not producer client configurations:

log.cleanup.policy=delete   # clean up by deleting old log segments
log.retention.hours=24      # retain messages for 24 hours
# Alternatively, keep only the latest message per key:
# log.cleanup.policy=compact
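
For completeness, topic-level cleanup settings can also be changed programmatically. The sketch below shows one way to do it with the AdminClient API (assuming Kafka 2.3 or later); the topic name and retention value are assumptions for the example:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class TopicCleanupConfigExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "test_topic");

            // Use the delete policy and retain data for 24 hours (86400000 ms)
            Collection<AlterConfigOp> ops = Arrays.asList(
                new AlterConfigOp(new ConfigEntry("cleanup.policy", "delete"), AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("retention.ms", "86400000"), AlterConfigOp.OpType.SET)
            );

            Map<ConfigResource, Collection<AlterConfigOp>> updates = Collections.singletonMap(topic, ops);
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}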

3. Message storage and retrieval principle

The message storage and retrieval principle of Kafka is simple. In Kafka, each Topic is divided into multiple Partitions. When a Producer writes a message to a Topic, Kafka stores the message in one of the Partitions of that Topic. The messages in each Partition are stored sequentially, and each message has a unique offset. Consumers can start reading from any offset, which allows them to process messages one by one and to re-consume messages when needed.

Kafka's storage model is log-structured. In Kafka, each Partition maintains a message log (Message Log), which is a collection of messages ordered by time, and all messages are appended to the end of the log. Since writes only append to the log and never modify or delete existing records, both writing and reading data can be very efficient.

Kafka maps log and index files into memory via mmap and relies on the operating system page cache to achieve efficient message reads. In addition, Kafka maintains offset and time indexes in its data organization, which allows it to quickly locate the message corresponding to a given offset or timestamp.
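
The time index is what makes offset lookup by timestamp possible on the client side. The following sketch (broker address, group id, and topic are assumptions) uses offsetsForTimes to find the first offset at or after a given timestamp and then seeks to it:

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class OffsetLookupExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "lookup-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (Consumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("test_topic", 0);
            consumer.assign(Collections.singletonList(partition));

            // Find the earliest offset whose timestamp is at or after one hour ago
            long oneHourAgo = System.currentTimeMillis() - 60 * 60 * 1000;
            Map<TopicPartition, OffsetAndTimestamp> result =
                consumer.offsetsForTimes(Collections.singletonMap(partition, oneHourAgo));

            OffsetAndTimestamp offset = result.get(partition);
            if (offset != null) {
                consumer.seek(partition, offset.offset());
            }
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.println(r.offset() + " @ " + r.timestamp()));
        }
    }
}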

6. Kafka Security

Kafka provides various security features, including authentication, authorization, and encryption. Among them, SSL/TLS encryption is used to ensure the security of data transmission, and the ACL mechanism realizes fine-grained authorization management.

1. Authentication, authorization and encryption

In Kafka, authentication, authorization, and encryption are all configurable. Authentication and authorization are implemented mainly on top of the JAAS (Java Authentication and Authorization Service) framework, while encryption uses the SSL/TLS protocol.

Authentication can use Kafka's built-in mechanisms or external systems such as LDAP or Kerberos (the latter via SASL/GSSAPI). Authorization is implemented through the ACL (access control list) mechanism, which limits users' access to resources such as the Kafka cluster, topics, and partitions.
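
As a hedged illustration, the snippet below configures a producer for SASL/PLAIN authentication over TLS; the broker address and credentials are placeholders for the example:

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SaslProducerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1.example.com:9093");
        props.put("security.protocol", "SASL_SSL"); // SASL authentication over TLS
        props.put("sasl.mechanism", "PLAIN");       // username/password mechanism
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"alice\" password=\"alice-secret\";");
        // The ssl.truststore.* settings shown in the next section may also be required
        // so the client trusts the broker's certificate.

        Producer<String, String> producer =
            new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());
        producer.send(new ProducerRecord<>("my-topic", "hello"));
        producer.close();
    }
}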

2. SSL/TLS encryption ensures data transmission security

Kafka supports encrypting data transmission with the SSL/TLS protocol to protect data during network transport. You can enable SSL/TLS by configuring parameters such as the SSL certificate, truststore, and private key.

Here is a sample code for transferring data using SSL/TLS:

Properties props = new Properties();
props.setProperty("bootstrap.servers", "kafka1.example.com:9093");
props.setProperty("security.protocol", "SSL");
props.setProperty("ssl.truststore.location", "/path/to/truststore");
props.setProperty("ssl.truststore.password", "password");

Producer<String, String> producer = new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());

producer.send(new ProducerRecord<>("my-topic", "my-message"));

In this sample code, we set security.protocol to "SSL" and specified the location and password of the truststore that holds the SSL certificates. With these settings, data is transmitted over the SSL/TLS protocol.

3. ACL mechanism realizes fine-grained authorization management

Kafka's ACL mechanism provides fine-grained authorization management, which can limit users' access to different resources. The ACL mechanism is resource-based and can authorize resources such as Kafka clusters, topics, and partitions.

The following is a sample code for authorization management using the ACL mechanism:

Properties props = new Properties();
props.setProperty("bootstrap.servers", "kafka1.example.com:9092");

AdminClient adminClient = KafkaAdminClient.create(props);

// Grant the user "alice" read access to the topic "my-topic"
ResourcePattern pattern = new ResourcePattern(ResourceType.TOPIC, "my-topic", PatternType.LITERAL);
AccessControlEntry entry = new AccessControlEntry("User:alice", "*", AclOperation.READ, AclPermissionType.ALLOW);

Set<AclBinding> acls = new HashSet<>();
acls.add(new AclBinding(pattern, entry));

adminClient.createAcls(acls);
adminClient.close();

In this sample code, we granted the user "alice" read permission on the topic "my-topic". Using the ACL mechanism, we can exercise fine-grained control and management over user operations in the Kafka cluster.

7. Kafka management tools

1. ZooKeeper cluster management tool

1.1 Introduction to ZooKeeper

ZooKeeper is an open source distributed coordination service for distributed applications. It is an open source counterpart of Google's Chubby and an important component of distributed systems such as Hadoop and Kafka.

1.2 ZooKeeper and Kafka

In Kafka, ZooKeeper is responsible for managing broker status information, electing controllers, and managing metadata of Topic and Partition. ZooKeeper is crucial to the normal operation of Kafka. Once ZooKeeper is abnormal or fails, the Kafka cluster will be unavailable.

2. Management tools that come with Kafka

2.1 Kafka Manager

Kafka Manager is a web-based Kafka management tool developed by Yahoo, which can easily view and manage Topic, Broker, Partition and other information in the Kafka cluster. Users can use Kafka Manager to monitor the health of the Kafka cluster, and configure and manage it as needed.

2.2 Kafka Connect

Kafka Connect is a unified data transmission framework provided by Kafka for connecting Kafka with other data sources or data storage systems. Users can quickly read or write data from data sources or data storage systems through Kafka Connect, and inject data directly into Kafka.

3. Third-party Kafka monitoring tools

3.1 Burrow

Burrow is an advanced Kafka Consumer monitoring tool developed by LinkedIn, used to monitor and manage the health of Kafka Consumer Groups and their message-processing progress. Burrow can provide very detailed information such as partition details, the number of consumed messages, and the number of remaining messages.

3.2 Kafka Web Console

Kafka Web Console is a free, open source, web-based Kafka management tool, which can easily view and manage the Topic, Broker, Partition and other information of the Kafka cluster. Users can know the status of the cluster at any time through the Kafka Web Console and perform related configuration and management operations.
