Apache Kafka - Building Data Pipelines with Kafka Connect




Overview

Kafka Connect is a tool for moving data between Kafka and other systems. For example, if you run a website and want to ship user data to another system for analysis, Kafka Connect can handle that transfer for you.

Kafka Connect is simple to use. It is built around two main concepts: source and sink. A source reads data from an external data source into Kafka, and a sink writes data from Kafka to a target system. You only need to configure the source and sink, and the data is then moved automatically from one place to the other.


Main concepts

The following concepts are central to how Kafka Connect coordinates data flow:

Connector

  • A connector is a high-level abstraction that coordinates data flow. It describes how to read data from a data source and send it to a specific topic in a Kafka cluster, or how to read data from a specific topic in a Kafka cluster and write it to a data store or other target system.

Connectors in Kafka Connect define where data should be copied to and from. A connector instance is a logical job that manages the replication of data between Kafka and another system. All of the classes implemented or used by a connector are defined in its connector plugin. Both connector instances and connector plugins may be referred to as "connectors".
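To make the connector abstraction more concrete, here is a minimal sketch of a custom source connector in Java. It is illustrative only: the class name, the single "topic" config key, and the task class (sketched in the Tasks section below) are made up for this example; a real connector would add proper validation and work partitioning.

```java
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical connector that copies records from an external system into a Kafka topic.
public class ExampleSourceConnector extends SourceConnector {

    private Map<String, String> configProps;

    @Override
    public void start(Map<String, String> props) {
        // Configuration supplied via the REST API or a properties file.
        this.configProps = props;
    }

    @Override
    public Class<? extends Task> taskClass() {
        return ExampleSourceTask.class;  // a SourceTask implementation, sketched in the Tasks section below
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Split the job into up to maxTasks identical task configurations.
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < maxTasks; i++) {
            configs.add(configProps);
        }
        return configs;
    }

    @Override
    public void stop() { /* release resources held by the connector */ }

    @Override
    public ConfigDef config() {
        return new ConfigDef()
                .define("topic", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH,
                        "Target Kafka topic");
    }

    @Override
    public String version() {
        return "0.1.0";
    }
}
```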

Kafka Connect makes it easy to stream data from many different sources into Kafka and from Kafka out to many different targets. Hundreds of connectors are available; some of the most popular categories are described below.

RDBMS connectors: read data from relational databases (such as Oracle, SQL Server, DB2, Postgres, and MySQL) and write it to a topic in a Kafka cluster, or read data from a Kafka topic and write it into a relational database.

Cloud object store connectors: read data from cloud object storage (such as Amazon S3, Azure Blob Storage, and Google Cloud Storage) and write it to a Kafka topic, or read data from a Kafka topic and write it to cloud object storage.

Message queue connectors: read data from message queues (such as ActiveMQ, IBM MQ, and RabbitMQ) and write it to a Kafka topic, or read data from a Kafka topic and write it to a message queue.

NoSQL and document store connectors: read data from NoSQL databases (such as Elasticsearch, MongoDB, and Cassandra) and write it to a Kafka topic, or read data from a Kafka topic and write it into a NoSQL database.

Cloud data warehouse connectors: read data from cloud data warehouses (such as Snowflake, Google BigQuery, and Amazon Redshift) and write it to a Kafka topic, or read data from a Kafka topic and write it into a cloud data warehouse.


In addition to the popular connectors above, Kafka Connect supports many other data sources and targets, including:

  • Hadoop File System (HDFS)
  • Amazon Kinesis
  • Twitter
  • FTP/SFTP
  • Salesforce
  • JMS
  • Apache HBase
  • Apache Cassandra
  • InfluxDB
  • Apache Druid

These connectors make Kafka Connect a flexible and extensible data pipeline platform that can easily move data into Kafka from a wide range of sources and out of Kafka to a wide range of destinations.


Tasks

Tasks are the components in the Kafka Connect data model that perform the actual data copying. Each connector instance coordinates a set of tasks responsible for copying data from a source to a target.

Kafka Connect provides built-in support for parallelism and scalability by allowing a connector to break a single job into multiple tasks. The tasks themselves are stateless and do not store any state information locally. Instead, task configurations, status, and source offsets are stored in special Kafka topics (config.storage.topic, status.storage.topic, and offset.storage.topic) and managed by the Connect framework.

By storing task state in Kafka, Kafka Connect enables elastic, scalable data pipelines. This means tasks can be started, stopped or restarted at any time without losing state information. Additionally, since task state is stored in Kafka, state information can be easily shared across different Kafka Connect instances, enabling high availability and fault tolerance.
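To illustrate how a task performs the actual copying, below is a minimal sketch of a source task in Java that pairs with the hypothetical connector shown earlier. The record contents are placeholders; a real task would read from the external system and record meaningful source offsets so Connect can resume after a restart.

```java
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical task paired with ExampleSourceConnector from the Connector section.
public class ExampleSourceTask extends SourceTask {

    private String topic;
    private long sequence = 0L;

    @Override
    public void start(Map<String, String> props) {
        topic = props.get("topic");
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        Thread.sleep(1000);  // in reality: read the next batch from the external system

        // Source partition and offset let Connect resume from where the task left off.
        Map<String, ?> sourcePartition = Collections.singletonMap("source", "example");
        Map<String, ?> sourceOffset = Collections.singletonMap("position", sequence);

        SourceRecord record = new SourceRecord(
                sourcePartition, sourceOffset, topic,
                Schema.STRING_SCHEMA, "record-" + sequence++);
        return Collections.singletonList(record);
    }

    @Override
    public void stop() { /* signal poll() to stop and clean up */ }

    @Override
    public String version() {
        return "0.1.0";
    }
}
```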



Workers

  • Workers are the running processes that execute connectors and tasks. They read connector and task configurations from specific topics in the Kafka cluster and assign tasks to the connector instances they host.



Converters

Converters are the mechanism in Kafka Connect for converting data between Kafka and the systems that send or receive it. They convert data from one format to another so it can be transferred between different systems.

In Kafka Connect, data is usually transmitted in the form of byte arrays. Converters are responsible for serializing Java objects into byte arrays and deserializing byte arrays into Java objects. In this way, data can be transferred between different systems without worrying about data format compatibility.

Kafka Connect supports a variety of converters, such as the JSON converter that ships with Kafka and the Avro and Protobuf converters provided with Confluent Schema Registry. These converters cover a range of data formats and are easy to configure and use.

In addition, Kafka Connect also supports custom converters, and users can write their own converters to meet specific needs. Custom converters usually need to implement the org.apache.kafka.connect.storage.Converter interface and provide implementations of serialization and deserialization methods.
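As a rough sketch of that interface (a hypothetical converter that treats every payload as a UTF-8 string, which would rarely be enough in practice), a custom converter might look like this:

```java
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.storage.Converter;

import java.nio.charset.StandardCharsets;
import java.util.Map;

// Minimal converter that treats every value as a UTF-8 string.
public class Utf8StringConverter implements Converter {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // No configuration needed for this sketch.
    }

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        // Serialize the Connect value into the bytes that Kafka stores.
        return value == null ? null : value.toString().getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        // Deserialize Kafka bytes back into a Connect schema + value pair.
        if (value == null) {
            return new SchemaAndValue(Schema.OPTIONAL_STRING_SCHEMA, null);
        }
        return new SchemaAndValue(Schema.STRING_SCHEMA,
                new String(value, StandardCharsets.UTF_8));
    }
}
```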

In short, converters are a very useful mechanism in Kafka Connect: they allow data to be moved between different systems while handling the format conversion.



Transforms

Transforms are a mechanism in Kafka Connect for modifying messages by applying simple logic to every message produced by a source connector or delivered to a sink connector. Transforms are commonly used for data cleaning, data conversion, and data enrichment.

With transforms, a series of operations can be applied to each message, such as removing a field, renaming a field, adding a timestamp, or changing a data type. A transform chain usually consists of several single message transforms, each responsible for one specific operation.

Kafka Connect provides a variety of built-in transforms (single message transforms, or SMTs), such as ExtractField, TimestampConverter, and ValueToKey. Additionally, custom transforms can be written to meet specific needs.
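To show the shape of a custom single message transform, here is a hedged sketch that re-routes every record to a suffixed topic. The class name and the "suffix" config key are invented for this example.

```java
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

import java.util.Map;

// Hypothetical SMT that appends a fixed suffix to the topic of every record.
public class TopicSuffixTransform<R extends ConnectRecord<R>> implements Transformation<R> {

    private String suffix;

    @Override
    public void configure(Map<String, ?> configs) {
        Object value = configs.get("suffix");
        suffix = value == null ? ".transformed" : value.toString();
    }

    @Override
    public R apply(R record) {
        // Re-route the record to "<original topic><suffix>" without touching key or value.
        return record.newRecord(
                record.topic() + suffix,
                record.kafkaPartition(),
                record.keySchema(), record.key(),
                record.valueSchema(), record.value(),
                record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef()
                .define("suffix", ConfigDef.Type.STRING, ".transformed",
                        ConfigDef.Importance.MEDIUM, "Suffix appended to the topic name");
    }

    @Override
    public void close() { /* nothing to clean up */ }
}
```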

In short, transforms are a very useful mechanism in Kafka Connect: they change the structure and content of messages in flight, enabling data cleaning, conversion, and enrichment.


Dead Letter Queue

The dead letter queue is the mechanism Kafka Connect uses to handle connector errors. When a sink connector cannot process a message, the message can be sent to the dead letter queue for later inspection and handling.

The dead letter queue is simply a special topic that stores messages the connector could not process. These messages may have failed deserialization or transformation, could not be written to the target system, or contain invalid data. In any of these cases, routing them to the dead letter queue helps keep the data flow reliable and consistent.

With a dead letter queue, connector errors can be easily monitored and handled appropriately. For example, you can inspect the messages in the dead letter queue manually and try to resolve the problem, or write a script or application that checks and processes these messages automatically.
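As an example of such an application (a sketch only: the bootstrap address, group id, and dead letter topic name are assumptions), a plain Kafka consumer can read the dead letter topic and print each failed record together with its headers, which carry the error context when errors.deadletterqueue.context.headers.enable is set on the sink connector:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

// Reads a (hypothetical) dead letter topic and prints failed records with their error headers.
public class DeadLetterQueueInspector {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dlq-inspector");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-connector-dlq"));  // assumed DLQ topic name
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    System.out.printf("offset=%d, value=%s%n", record.offset(),
                            record.value() == null ? "null"
                                    : new String(record.value(), StandardCharsets.UTF_8));
                    // Error context (connector name, stage, exception, ...) arrives as record headers.
                    for (Header header : record.headers()) {
                        System.out.printf("  header %s = %s%n", header.key(),
                                header.value() == null ? "null"
                                        : new String(header.value(), StandardCharsets.UTF_8));
                    }
                }
            }
        }
    }
}
```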

In short, the dead letter queue is an important error-handling mechanism in Kafka Connect. It helps ensure the reliability and consistency of data streams and simplifies error handling.


Main usage scenarios

Kafka typically has two main usage scenarios in data pipelines:

  1. Kafka acts as an endpoint of a data pipeline, either a source or a destination. For example, exporting data from Kafka to S3, or importing data from MongoDB into Kafka.


  2. Kafka acts as middleware between two endpoints in a data pipeline. For example, importing data from an upstream system into Kafka and then exporting it from Kafka to Elasticsearch.



Main value

The main value Kafka brings to data pipelines is:

  1. It can act as a large buffer, effectively decoupling data producers and consumers.

  2. It is reliable and efficient, which makes it an excellent choice for building data pipelines.


Kafka Connect API vs. Producer and Consumer API

The Kafka Connect API is designed to solve common problems in data integration.

Some advantages of Kafka Connect API over using Producer and Consumer API directly are:

  • Simplifies development. No need to manually write producer and consumer logic.
  • It is fault-tolerant. Connect automatically restarts failed tasks and continues to sync data without loss.
  • Common data sources and destinations are already built in. For example, connectors for MySQL, Postgres, Elasticsearch, and many other systems are already available and easy to use.
  • Consistent configuration and management interface. Connectors can be easily configured, started, and stopped via the REST API (see the sketch below).
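For example (a hedged sketch, assuming a Connect worker listening on localhost:8083 and the FileStreamSource connector that ships with Kafka), a connector can be registered by POSTing its JSON configuration to the REST API using Java's built-in HTTP client:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Registers a connector by POSTing its configuration to the Connect REST API.
public class RegisterConnector {

    public static void main(String[] args) throws Exception {
        // Assumes a Connect worker on localhost:8083 and the bundled FileStreamSource connector.
        String payload = """
                {
                  "name": "demo-file-source",
                  "config": {
                    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                    "tasks.max": "1",
                    "file": "/tmp/demo.txt",
                    "topic": "demo-topic"
                  }
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // 201 Created on success; GET /connectors lists registered connectors,
        // and DELETE /connectors/<name> removes one.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```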

In addition to the Kafka Connect API, Kafka can also be integrated with other systems for data integration and processing. For example:

  • Integrate with Spark Streaming for real-time data analysis and machine learning.
  • Combine with Flink for stream processing with exactly-once semantics.
  • Combine with Storm to build real-time computing applications.
  • Combine with Hadoop for both real-time and batch computing.

Key issues to consider when building a data pipeline

  1. Timeliness: support different latency requirements and allow moving between them. Kafka acts as a buffer that decouples producers and consumers and supports both real-time and batch processing.
  2. Reliability: avoid single points of failure and recover quickly. Kafka supports at-least-once delivery, and exactly-once delivery is possible in combination with external systems.
  3. High and dynamic throughput: support high concurrency and burst traffic. Kafka has high throughput, producers and consumers are decoupled, and throughput can be adjusted dynamically.
  4. Data format: support a variety of formats, with connectors able to convert between them. The Kafka and Connect APIs are format-agnostic and use pluggable converters.
  5. Transformation: ETL vs. ELT. ETL saves storage and compute but constrains downstream systems; ELT retains the raw data and is more flexible.
  6. Security: data encryption, authentication and authorization, audit logs. Kafka supports these security features.
  7. Fault handling: handle abnormal data, retry, and repair. Because Kafka retains data for a long time, historical data can be reprocessed.
  8. Coupling and flexibility:
    • Avoid creating a separate data pipeline for each application, which increases maintenance costs.
    • Preserve metadata and allow schema changes to avoid tight coupling between producers and consumers.
    • Process as little data as possible in the pipeline, leaving more flexibility to downstream systems; over-processing constrains them.

In short, building a good data pipeline requires considering timeliness, reliability, security, format conversion, and fault handling. At the same time, the pipeline should be as loosely coupled as possible to give maximum flexibility to the downstream systems that consume the data.

As a stream processing platform, Kafka can solve these problems well and play a buffer role in decoupling producers and consumers. At the same time, Kafka Connect provides a common interface for data input and output, which simplifies the integration work.

The data pipeline built using Kafka can serve both real-time and batch processing scenarios, and has the characteristics of high availability, high throughput, and high scalability.


ETL vs. ELT

There are two different approaches to data integration:

  • ETL: Extract-Transform-Load, that is, extract-transform-load. In this approach, data is extracted from the source system, transformed and processed before being loaded into the target system.
  • ELT: Extract-Load-Transform, that is, extract-load-transform. In this way, after the data is extracted from the source system, it is first loaded into the target system, and then transformed and processed in the target system.
  • The main difference between ETL and ELT is the timing and location of the data transformation: ETL transforms data before loading it, while ELT transforms it after loading. With ETL, the transformation happens between the source and target systems; with ELT, it happens inside the target system.

Both ETL and ELT have advantages and disadvantages:

Advantages of ETL:

  • Data can be filtered, aggregated, and sampled during loading, reducing storage and computing costs.
  • Data format and quality can be ensured before the data is loaded into the target system.

Disadvantages of ETL:

  • Transformation logic is intermingled in the data pipeline, making it difficult to maintain and debug.
  • Downstream systems can only access the transformed data, which reduces flexibility.

Advantages of ELT:

  • Downstream systems receive the raw data, which is more flexible; they can process and transform it themselves as needed.
  • The transformation logic lives in the downstream system, which is easier to debug and maintain.
  • Source data is easier to backtrack and reprocess.

Disadvantages of ELT:

  • The target system must have powerful data processing capabilities.
  • More storage space is required to hold the raw data.
  • The transformation step may place a heavy load on the target system.

Generally speaking, if the downstream systems need high flexibility and have strong data processing capabilities, ELT is often the better fit. Otherwise, ETL can be more efficient by preprocessing the data before loading it, offloading work from the downstream systems. In many cases, a mixture of ETL and ELT is used.


Source: blog.csdn.net/yangshangwei/article/details/130980826