Brooklin: LinkedIn's open-source, massively scalable, reliable distributed service for near real-time data streams

LinkedIn recently open-sourced a new tool, Brooklin, a distributed and scalable service for near real-time data streaming. Brooklin has been running in production at LinkedIn since 2016, handling thousands of data streams and two trillion messages per day.

Why Brooklin was developed

Moving large volumes of data quickly and reliably is not the only problem LinkedIn needed to solve; the rapid growth in the diversity of data storage and streaming systems created equally serious challenges. LinkedIn built Brooklin to meet a new scalability requirement: the system must scale not only with the volume of data, but also with the variety of systems it connects.

What is Brooklin

Brooklin is a highly reliable distributed service for streaming data across many different data stores and messaging systems. It exposes a set of abstractions; by writing new Brooklin consumers and producers, you can extend it to consume data from, and produce data to, new systems.
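As a rough illustration of that extension model, the sketch below shows what a minimal consumer/producer pair might look like. The interface names and methods are hypothetical placeholders, not Brooklin's actual API.

```java
// Hypothetical sketch of the extension model described above, not Brooklin's real API:
// a consumer-side connector reads from a source system and hands each record to a
// producer-side transport that writes it to a destination system.
interface SourceConnector {
    // Start reading from the source and forward every record to the transport.
    void start(DestinationTransport transport);

    // Stop reading and release any source-side resources.
    void stop();
}

interface DestinationTransport {
    // Deliver one record to the destination (for example, a Kafka topic or an Event Hub).
    void send(String destination, byte[] key, byte[] value);
}
```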


Brooklin use cases


Brooklin's use cases fall into two major categories: the streaming bridge and change data capture.

Streaming bridge

Data can be spread across different environments (corporate data centers and the public cloud), geographic locations, or deployment groups. Differing access mechanisms, serialization formats, compliance requirements, and security requirements typically add extra complexity. Brooklin can act as a bridge for moving data between these environments: for example, between different cloud services, between different data centers, or between different clusters within a single data center.


Figure 2. A hypothetical scenario: a Brooklin cluster is used as a streaming bridge, moving data coming in from Kinesis into Kafka, and then from Kafka into Event Hubs.


Because Brooklin is a dedicated service for streaming data between different environments, all of this complexity can be managed within a single service, letting developers focus on processing data rather than moving it. In addition, this centralized, managed, scalable framework allows organizations to enforce policies and promote data governance. For example, Brooklin can be configured to enforce company-wide policies, such as requiring that all data streams use JSON format or that every data stream be encrypted.

Kafka mirroring

Before Brooklin, data was moved between Kafka clusters with Kafka MirrorMaker (KMM), which ran into problems at large scale. Because Brooklin is designed as a generic bridge for streaming data, it can easily handle Kafka data transfer at enormous scale.

At LinkedIn, one example of using Brooklin as a streaming bridge is mirroring Kafka data between data centers and between Kafka clusters.


Figure 3. A hypothetical scenario: with Brooklin, making all data available in a single data center becomes very easy. A single Brooklin cluster in each data center can handle multiple source/destination pairs.

Brooklin's Kafka mirroring solution has been tested at large scale at LinkedIn, where it has completely replaced KMM. Using it has addressed several of KMM's pain points while adding new features, which are discussed below.

Feature 1: Multitenancy

In the KMM deployment model, mirroring can only be set up between two Kafka clusters. This means a separate KMM cluster must be built for each data pipeline, resulting in hundreds of KMM clusters that are very difficult to manage. Brooklin, by contrast, is designed to manage many data pipelines at the same time, so only a single Brooklin cluster needs to be deployed. Comparing Figures 3 and 4 gives a more intuitive picture.


Figure 4. A hypothetical scenario: data mirroring across data centers implemented with KMM.

Feature 2: Dynamic provisioning and management

With Brooklin, creating a new data pipeline (also called a datastream) or modifying an existing one is as simple as making an HTTP call to a REST endpoint. For the Kafka mirroring use case, this endpoint makes it very easy to create new mirroring pipelines or modify the whitelist of existing ones, without changing and deploying static configuration.
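As a minimal sketch of such a call, the snippet below posts a mirroring datastream definition to a management endpoint. The host, port, endpoint path, connector name, and payload fields are assumptions made for illustration and may not match Brooklin's actual REST schema; consult the project documentation for the real API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateMirrorPipeline {
    public static void main(String[] args) throws Exception {
        // Hypothetical payload: a mirroring datastream with a topic whitelist pattern.
        String body = "{"
                + "\"name\": \"mirror-pageviews\","
                + "\"connectorName\": \"kafkaMirroringConnector\","
                + "\"source\": {\"connectionString\": \"kafka://source-cluster:9092/^PageView.*$\"},"
                + "\"metadata\": {\"owner\": \"data-team\"}"
                + "}";

        HttpRequest request = HttpRequest.newBuilder()
                // Host, port, and path are assumptions for this sketch.
                .uri(URI.create("http://brooklin-host:32311/datastream"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```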

Although mirroring pipelines can coexist on the same Brooklin cluster, each pipeline can be controlled and configured individually. For example, you can edit a pipeline's mirroring whitelist or add more resources to it without affecting any other pipeline. In addition, Brooklin allows individual pipelines to be paused and resumed on demand, which is very useful for temporary operational changes. For the Kafka mirroring use case, Brooklin supports pausing or resuming an entire pipeline, a single topic in the whitelist, or even a single partition of a topic.

Feature 3: Diagnostics

Brooklin also exposes a diagnostics REST endpoint for querying the status of a datastream on demand. This API makes it easy to inspect the internal state of a pipeline, including lag or errors on any individual topic partition. Because the diagnostics endpoint aggregates its findings across the whole Brooklin cluster, it is useful for quickly diagnosing problems with a specific partition without scanning log files.
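For illustration, querying such an endpoint might look like the following sketch. The path and query parameter are hypothetical, not Brooklin's documented API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class QueryPipelineStatus {
    public static void main(String[] args) throws Exception {
        // Hypothetical diagnostics query for one datastream; path and parameter are illustrative.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://brooklin-host:32311/diag?datastream=mirror-pageviews"))
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The response would include per-partition state such as lag or errors.
        System.out.println(response.body());
    }
}
```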

Special features

Because Brooklin is used in place of KMM, it has been optimized for stability and operability. It also offers some special features for Kafka mirroring:

  1. Fault isolation: Most importantly, we strove for better fault isolation, so that a mirroring error on a specific partition or topic does not affect the entire pipeline or cluster the way it would with KMM. Brooklin detects errors at the partition level and can automatically pause mirroring of the problematic partitions. These paused partitions can be automatically resumed after a configurable period of time, eliminating the need for manual intervention, which is especially useful for transient errors. Meanwhile, processing of the other partitions and pipelines is unaffected.
  2. Flushless produce mode: To improve mirroring latency and throughput, Brooklin's Kafka mirroring can also run in a flushless-produce mode, which tracks the progress of Kafka consumption at the partition level. Checkpointing is done per partition rather than for the whole pipeline. This lets Brooklin avoid the costly Kafka producer flush call, a synchronous blocking call that can stall the entire pipeline, often for several minutes. A sketch of the idea follows this list.
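The sketch below illustrates the flushless idea under stated assumptions: producer acknowledgements are tracked per source partition, and offsets are committed per partition without a blocking flush. The class and method names are hypothetical; this is not Brooklin's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

// Illustrative sketch of flushless production: track producer acknowledgements per
// source partition and checkpoint each partition independently, instead of calling
// the blocking producer.flush() before committing for the whole pipeline.
public class FlushlessMirrorSketch {
    // Highest source offset acknowledged by the destination, per source partition.
    // A real implementation would commit only up to the lowest in-flight offset
    // (the acknowledged low-water mark); this sketch keeps just the max for brevity.
    private final Map<TopicPartition, Long> ackedOffsets = new ConcurrentHashMap<>();

    void mirror(KafkaProducer<byte[], byte[]> producer,
                TopicPartition sourcePartition, long sourceOffset,
                ProducerRecord<byte[], byte[]> destinationRecord) {
        producer.send(destinationRecord, (metadata, exception) -> {
            if (exception == null) {
                // Only bookkeeping happens on the producer callback thread.
                ackedOffsets.merge(sourcePartition, sourceOffset, Math::max);
            }
        });
    }

    // Called periodically from the consumer's poll loop (KafkaConsumer is not thread-safe).
    void checkpoint(KafkaConsumer<byte[], byte[]> consumer) {
        Map<TopicPartition, OffsetAndMetadata> commits = new HashMap<>();
        ackedOffsets.forEach((tp, offset) -> commits.put(tp, new OffsetAndMetadata(offset + 1)));
        if (!commits.isEmpty()) {
            consumer.commitAsync(commits, null);
        }
    }
}
```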

By replacing KMM with Brooklin, LinkedIn reduced its mirroring clusters from several hundred to a dozen or so. It has also sped up the addition of new features and iterative improvements.


Change data capture (CDC)

Brooklin's second major use case is change data capture. The goal here is to stream database updates as a low-latency change stream. For example, most of LinkedIn's source-of-truth data (such as jobs, connections, and profile information) resides in various databases. Several applications are interested in knowing when a new job is posted, a new professional connection is made, or a member's profile is updated. Rather than having each interested application issue expensive queries against the online database to detect these changes, the database updates are streamed to them in near real time. One of the biggest advantages of using Brooklin to generate change data capture events is better resource isolation between the applications and the online stores. Applications can scale independently of the database, avoiding the risk of bringing the database down. With Brooklin, we have built change data capture solutions for Oracle, Espresso, and MySQL at LinkedIn; moreover, Brooklin's extensible model makes it easy to write new connectors to add CDC support for any database source.


                                         Figure 5
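To make the contrast with polling the database concrete, here is a minimal sketch of an application reacting to change-capture events, assuming the change events are delivered to a Kafka topic. The topic name, consumer group, and event payload are hypothetical.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Illustrative only: an application consuming profile-update change events from a
// change-capture stream instead of repeatedly querying the online database.
public class ProfileChangeListener {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "profile-change-listener");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("ProfileChangeEvents")); // hypothetical CDC topic
            while (true) {
                for (ConsumerRecord<String, String> event : consumer.poll(Duration.ofSeconds(1))) {
                    // Each event describes a changed row; react here
                    // (update a cache, send a notification, etc.).
                    System.out.println("member " + event.key() + " changed: " + event.value());
                }
            }
        }
    }
}
```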

Feature 1: Bootstrap support

Sometimes an application needs a complete snapshot of the data store before it can consume incremental updates. This can happen when the application starts up for the first time, or when its processing logic changes and it needs to reprocess the entire dataset. Brooklin's connector model is extensible enough to support such use cases.
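A rough sketch of the bootstrap-then-stream pattern described above follows; all types, methods, and position handling are hypothetical placeholders rather than Brooklin APIs.

```java
// Illustrative sketch: replay a full snapshot first, then switch to the incremental
// change stream from the position at which the snapshot was taken.
public class BootstrapSketch {
    interface SnapshotReader { String nextRow(); long snapshotPosition(); }
    interface ChangeStream { void seekTo(long position); String nextChange(); }

    static void run(SnapshotReader snapshot, ChangeStream changes) {
        // 1. Replay the full snapshot to build the application's initial state.
        for (String row = snapshot.nextRow(); row != null; row = snapshot.nextRow()) {
            System.out.println("bootstrap row: " + row);
        }
        // 2. Consume incremental updates from the point the snapshot was taken,
        //    so no change is missed or applied twice.
        changes.seekTo(snapshot.snapshotPosition());
        while (true) {
            System.out.println("incremental change: " + changes.nextChange());
        }
    }
}
```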

Feature 2: Transaction support

Many databases support transactions; for these sources, Brooklin connectors can ensure that transaction boundaries are preserved.

More information

For more information about Brooklin, including an overview of its architecture and features, please see our previous blog posts about the project.

The first release of Brooklin introduces Kafka mirroring, which you can test-drive using the simple instructions and scripts we provide. We are working on adding support for more sources and destinations to the project - stay tuned!

Future

Brooklin has been running successfully in LinkedIn's production environment since October 2016. It has replaced Databus as our change capture solution for Oracle and Espresso, and serves as our streaming bridge for moving data between Azure, AWS, and LinkedIn, including mirroring a trillion messages a day across many of our Kafka clusters.

We will continue to build connectors for additional data sources (MySQL, Cosmos DB, Azure SQL) and destinations (Azure Blob storage, Kinesis, Cosmos DB, Couchbase). We also plan to add features that optimize Brooklin, such as auto-scaling based on traffic demand, skipping decompression and recompression of messages to increase mirroring throughput, and additional read and write optimizations in the mirroring path.

Project address

GitHub

Summary

  1. Brooklin is a distributed data streaming service that can interface with different types of data storage and messaging systems.
  2. Brooklin already implements Kafka mirroring and provides multitenancy, dynamic provisioning and management, diagnostics, fault isolation, and a flushless-produce mode.
  3. Brooklin also provides CDC functionality, but currently only supports Oracle and Espresso sources; connectors for MySQL, Azure SQL, and other databases are not yet available.


Source: juejin.im/post/5d4a47c5f265da03b31bb703