How to Design a Real-Time Data Platform (Technical Part)

An Agile Song

I extract, therefore I am | DBus

Stream processing fun for everyone | Wormhole

I stand above the databases | Moonbox

Good looks for the last ten kilometers | Davinci

OVERVIEW: The Real-Time Data Platform (RTDP) is an important and common big data infrastructure platform. In the previous part (the design part), we introduced RTDP from the perspectives of modern data warehouse architecture and typical data processing, and explored an overall architecture design for RTDP. This follow-up (the technical part) starts from a technical perspective, introduces the technology selection and related components of RTDP, and explores which usage patterns fit different application scenarios. Let's set out on the agile road to RTDP~

Further reading: How to Design a Real-Time Data Platform (Design Part)

1. Introduction to Technology Selection

In the design part, we gave an overall architecture design for RTDP (Figure 1). In this technical part, we recommend an overall technology selection; briefly describe each technology component; focus on the design ideas of the four abstract technical platforms we developed (the unified data collection platform, the unified stream processing platform, the unified computing services platform, and the unified data visualization platform); and close the section with a discussion of pipeline topics, including functional integration, data management, data security, and so on.

Figure 1 RTDP architecture

1.1 Overall Technology Selection

Figure 2 Overall technology selection

First, let's briefly walk through Figure 2:

  • Data sources and clients: lists the most common types of data sources in data application projects.
  • Data bus platform DBus: as the unified data collection platform, DBus is responsible for connecting to various data sources, extracting data incrementally or in full, performing some routine data processing, and finally publishing the processed messages onto Kafka.
  • Distributed messaging system Kafka: a distributed, highly available, high-throughput publish-subscribe messaging system that connects message producers and consumers.
  • Stream processing platform Wormhole: as the unified stream processing platform, Wormhole is responsible for processing data on the stream and writing to various target data stores. Wormhole consumes messages from Kafka, supports configuring SQL-based processing logic on the stream, and supports writing data into different target stores (Sinks) with eventual-consistency (idempotent) semantics through configuration.
  • Data compute and storage layer: the RTDP architecture keeps the technology selection of this layer open. Users choose appropriate storage for the concrete data problem based on actual data characteristics, computation patterns, access patterns, data volume, and so on. RTDP also supports selecting several different data stores at once to support different project requirements more flexibly.
  • Computing services platform Moonbox: as the unified computing services platform, Moonbox is responsible, toward the heterogeneous data stores, for integration, computation pushdown optimization, and mixed computation across heterogeneous stores (data virtualization technology); toward the data presentation and interaction endpoints, it is responsible for unified metadata collection, a unified data computation and query language (SQL), unified data computation and delivery, and unified data service interfaces.
  • Visualization application platform Davinci: as the unified data visualization platform, Davinci supports different data visualization and interaction requirements in a configuration-driven way, can be integrated with other data applications to provide the data visualization part of a solution, and also supports different data practitioners collaborating on daily data applications on the platform. Other data-consuming endpoint systems, such as the data development platform Zeppelin and the data algorithm platform Jupyter, are not covered in this article.
  • Cross-cutting topics such as data management, data security, development and operations, and the driving engine can be supported end to end for control and governance requirements by integrating with and building on the service interfaces of DBus, Wormhole, Moonbox, and Davinci.

Below we further break down the technology components and topics in the diagram: we introduce the functional features of each technology component, focus on the design ideas of our self-developed components, and then discuss the topic areas.

1.2 Technical Components Introduction

1.2.1 Data Bus platform DBus

Figure 3 DBus in the RTDP architecture

1.2.1.1 DBus design ideas

1) Design ideas from an external perspective

  • Responsible for connecting to different data sources and extracting incremental data in real time; for databases, operation-log-based extraction is used, and multiple log types are supported through dedicated Agents.
  • All messages are published onto Kafka in the unified UMS message format. UMS is a standardized JSON format that carries its own metadata. UMS decouples logical messages from physical Kafka topics, so that the same topic can carry the UMS message flows of multiple tables.
  • For databases, supports pulling full data and merging it with incremental data into unified UMS messages, transparently and without downstream consumers noticing.

2) Design ideas from an internal perspective

  • Based on the Storm computation engine to keep end-to-end message latency to a minimum.
  • Standardizes the formats of different data sources and generates UMS messages, including:

A unique, monotonically increasing id generated for each message, corresponding to the system field ums_id_

The confirmed event timestamp (event timestamp) of each message, corresponding to the system field ums_ts_

The confirmed operation type of each message (insert/update/delete, or insert only), corresponding to the system field ums_op_

  • Versions database table structure changes and uses version-number management to ensure that upstream metadata changes are perceived in real time and consumed downstream without ambiguity.
  • When publishing to Kafka, ensures messages are strongly ordered (though not absolutely ordered) and delivered with at-least-once semantics.
  • A table-level heartbeat mechanism ensures end-to-end liveness detection of message flows.
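To make the three system fields concrete, here is a minimal sketch of reading them from a UMS-like message. The field names ums_id_, ums_ts_, ums_op_ and the 7-segment namespace come from the text above; the exact JSON layout shown (a schema section plus tuple payloads) is a simplified assumption for illustration, not the full DBus wire format:

```python
import json

# Simplified, hypothetical UMS-like message; only the three system fields
# and the namespace are taken from the text above.
raw = json.dumps({
    "schema": {
        "namespace": "oracle.oracle01.db1.table1.v2.dbpar01.tablepar01",
        "fields": [
            {"name": "ums_id_", "type": "long"},
            {"name": "ums_ts_", "type": "datetime"},
            {"name": "ums_op_", "type": "string"},
            {"name": "amount", "type": "string"},
        ],
    },
    "payload": [
        {"tuple": [1001, "2019-01-01 00:00:01", "i", "12.50"]},
        {"tuple": [1002, "2019-01-01 00:00:02", "u", "13.75"]},
    ],
})

ums = json.loads(raw)
names = [f["name"] for f in ums["schema"]["fields"]]
for row in ums["payload"]:
    rec = dict(zip(names, row["tuple"]))
    # ums_id_ is unique and monotonically increasing; ums_op_ marks
    # the operation type (i = insert, u = update, d = delete)
    print(rec["ums_id_"], rec["ums_op_"], rec["amount"])
```

Because the schema travels with the message, a consumer needs no call to a central metadata service to interpret the tuples.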
1.2.1.2 DBus Features
  • Supports configuration-driven full data pull
  • Supports configuration-driven incremental data pull
  • Supports configuration-driven online log format parsing
  • Supports visual monitoring and alerting
  • Supports configuration-driven multi-tenant security control
  • Supports merging partitioned sub-tables into a single logical data table
1.2.1.3 DBus technical architecture

Figure 4 DBus data flow architecture

For more technical details of DBus and its user manual, see:

GitHub: https://github.com/BriData

1.2.2 Distributed messaging system Kafka

Kafka has become the de facto standard of distributed messaging systems for big data streaming. Of course, Kafka keeps expanding and improving, and now also offers certain storage capabilities and stream processing capabilities. There is plenty of material available on Kafka's own features and technology, so this article will not elaborate on Kafka's capabilities themselves.

Here we specifically discuss the topics of message metadata management (Metadata Management) and schema evolution (Schema Evolution) on Kafka.

Figure 5

Source: http://cloudurable.com/images/kafka-ecosystem-rest-proxy-schema-registry.png

As Figure 5 shows, Confluent, the company behind Kafka, has introduced a metadata management component into its solution: Schema Registry. This component manages the metadata and Topic information of messages transferred over Kafka and provides a series of metadata management services. It is introduced so that Kafka consumers can understand what data flows in each Topic, along with its metadata, and parse the data efficiently.

Any data transfer link, no matter which system the data flows through, raises the issue of metadata management for that link, and Kafka is no exception. Schema Registry is a centralized metadata management solution for Kafka data links, and on top of Schema Registry, Confluent also provides corresponding security mechanisms and data schema evolution mechanisms for Kafka.

For more about Schema Registry, see:

Kafka Tutorial:Kafka, Avro Serialization and the Schema Registry

http://cloudurable.com/blog/kafka-avro-schema-registry/index.html

So in the RTDP architecture, how are message metadata management and schema evolution on Kafka solved?

1.2.2.1 Metadata Management
  • DBus automatically perceives database metadata changes in real time, records them, and provides query services
  • DBus automatically records the metadata of online log formats and provides query services
  • DBus publishes unified UMS messages onto Kafka, and a UMS message carries its own metadata, so downstream consumers do not need to call a centralized metadata service; they can obtain the data's metadata directly from the UMS message
1.2.2.2 Schema Evolution
  • A UMS message carries the schema's Namespace. A Namespace is a 7-segment string that can uniquely locate any table in any life cycle; it is like the IP address of a data table, in the form:

[Datastore].[Datastore Instance].[Database].[Table].[TableVersion].[Database Partition].[Table Partition]

Example: oracle.oracle01.db1.table1.v2.dbpar01.tablepar01

Here [TableVersion] represents the schema version of the table; if the data source is a database, this version number is automatically maintained by DBus.

  • In the RTDP architecture, Kafka is consumed downstream by Wormhole. When consuming UMS, Wormhole treats [TableVersion] as *, meaning that when a table's schema changes upstream, the version number is bumped automatically, but Wormhole ignores the version change and consumes the incremental/full data of all versions of the table. So how does Wormhole support compatible schema evolution? In Wormhole, the SQL and output fields for processing on the stream can be configured. When an upstream schema change is a "compatible change" (meaning adding fields, or widening/modifying field types, etc.), it does not affect the correct execution of the Wormhole SQL. When an incompatible change happens upstream, Wormhole reports an error, and human intervention is then needed to fix the logic for the new schema.
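To make the 7-segment Namespace and the version wildcard concrete, here is a small sketch (the segment names and function names below are our own, not a DBus or Wormhole API) of parsing a namespace and matching it against a subscription that treats [TableVersion] as *:

```python
# Our own labels for the 7 segments described in the text
SEGMENTS = ["datastore", "instance", "database", "table",
            "table_version", "db_partition", "table_partition"]

def parse_namespace(ns: str) -> dict:
    parts = ns.split(".")
    assert len(parts) == 7, "a namespace has exactly 7 segments"
    return dict(zip(SEGMENTS, parts))

def matches(subscription: str, ns: str) -> bool:
    """True if each subscription segment equals the namespace segment or is '*'."""
    return all(s == "*" or s == n
               for s, n in zip(subscription.split("."), ns.split(".")))

ns = parse_namespace("oracle.oracle01.db1.table1.v2.dbpar01.tablepar01")
# Wormhole-style subscription: ignore the schema version by using '*'
sub = "oracle.oracle01.db1.table1.*.*.*"
print(ns["table_version"])
print(matches(sub, "oracle.oracle01.db1.table1.v3.dbpar01.tablepar01"))
```

Because the version segment is wildcarded, a bump from v2 to v3 upstream still matches the same subscription, which is exactly why compatible changes flow through without reconfiguration.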

As can be seen from the above, Schema Registry and DBus + UMS are two solutions with different design ideas for metadata management and schema evolution; each has its advantages and disadvantages. A simple comparison is given in Table 1.

Table 1 Comparison of Schema Registry and DBus + UMS

Here is an example of a UMS message:

Figure 6 UMS message example

1.2.3 Stream processing platform Wormhole

Figure 7 Wormhole in the RTDP architecture

1.2.3.1 Wormhole design ideas

1) Design ideas from an external perspective

  • Consumes UMS messages and custom JSON messages from Kafka
  • Responsible for writing to different target data stores (Sinks) and achieving eventual consistency through idempotent Sink write logic
  • Supports configuring SQL-based processing logic on the stream
  • Provides the Flow abstraction. A Flow is defined by a Source Namespace and a Sink Namespace and is unique. Processing logic can be defined on a Flow. A Flow is a logical abstraction of stream processing that is decoupled from the physical Spark Streaming / Flink Streaming pipelines, so that the same Stream can process multiple Flows, and a Flow can be switched between different Streams.
  • Supports the Kappa architecture based on backfill; supports the Lambda architecture based on Wormhole Job

2) Design ideas from an internal perspective

  • Processes data streams based on the Spark Streaming and Flink computation engines. Spark Streaming supports high-throughput scenarios such as batch Lookup and batch Sink writes; Flink supports low-latency and CEP rule scenarios.
  • Achieves idempotent storage logic for different Sinks through ums_id_ and ums_op_
  • Achieves Lookup optimization through computation pushdown
  • Several unified abstractions support configurability and flexibility with a consistent design:

DAG: unified higher-order fractal abstraction

UMS: unified message flow protocol abstraction

Namespace: unified data logical table namespace abstraction

  • Several abstract interfaces support extensibility:

SinkProcessor: extend to support more Sink types

SwiftsInterface: support custom stream processing logic

UDF: support more stream processing UDFs

  • Collects and computes streaming operational metrics and statistics dynamically in real time through Feedback messages
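The idempotent-Sink idea driven by ums_id_ and ums_op_ can be sketched as follows. This is a simplified key-value sink of our own, not Wormhole's actual Sink implementation: a record is applied only if its ums_id_ is newer than the one already stored for that key, so replayed or out-of-order messages converge to the same final state.

```python
sink = {}  # primary key -> {"ums_id_": ..., "row": ...}

def apply(msg: dict) -> bool:
    """Apply an i/u/d message idempotently; return True if it was applied."""
    key, ums_id, op = msg["key"], msg["ums_id_"], msg["ums_op_"]
    current = sink.get(key)
    if current is not None and ums_id <= current["ums_id_"]:
        return False  # stale or duplicate: ums_id_ is monotonic per table
    if op == "d":
        sink[key] = {"ums_id_": ums_id, "row": None}  # tombstone for delete
    else:  # "i" (insert) or "u" (update)
        sink[key] = {"ums_id_": ums_id, "row": msg["row"]}
    return True

msgs = [
    {"key": 1, "ums_id_": 10, "ums_op_": "i", "row": {"v": 1}},
    {"key": 1, "ums_id_": 12, "ums_op_": "u", "row": {"v": 2}},
    {"key": 1, "ums_id_": 12, "ums_op_": "u", "row": {"v": 2}},  # replay
    {"key": 1, "ums_id_": 11, "ums_op_": "u", "row": {"v": 9}},  # out of order
]
print([apply(m) for m in msgs])  # [True, True, False, False]
```

The replayed and out-of-order messages are rejected, which is what lets an at-least-once pipeline still reach an eventually consistent Sink.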
1.2.3.2 Wormhole Features
  • Supports visual, configuration-driven, SQL-based development and implementation of streaming projects
  • Supports dynamic directive-based management, operations, diagnostics, and monitoring of streams
  • Supports structured UMS messages and semi-structured custom JSON messages
  • Supports processing the three event states of insert, update, and delete in the message flow
  • Supports one physical stream processing multiple logical business flows in parallel
  • Supports Lookup Anywhere and Pushdown Anywhere on the stream
  • Supports event-timestamp-based stream processing according to business policies
  • Supports UDF registration management and dynamic loading
  • Supports concurrent idempotent writes to multiple target data stores
  • Supports multi-level, incremental-message-based data quality management
  • Supports incremental-message-based stream processing and batch processing
  • Supports both Lambda and Kappa architectures
  • Supports seamless integration with third-party systems, serving as a third-party system's flow-control engine
  • Supports private cloud deployment, security permission control, and multi-tenant resource management
1.2.3.3 Wormhole Technology Architecture

Figure 8 Wormhole data flow architecture

For more technical details of Wormhole and its user manual, see:

GitHub:https://github.com/edp963/wormhole

1.2.4 Selection of common data compute and storage systems

The RTDP architecture takes an open, integrative attitude toward the selection of data compute and storage. Different data systems have their own strengths and suitable scenarios, and no single data storage system fits all computation scenarios. So whenever an appropriate, mature, mainstream data system appears, Wormhole and Moonbox will integrate and support it according to need.

Here are some of the more common basic selections:

  • Relational databases (Oracle/MySQL, etc.): suitable for complex relational computation on small data volumes
  • Columnar distributed storage systems

Kudu: optimized for Scan, suitable for OLAP analysis scenarios

HBase: fast random reads and writes, suitable for data service scenarios

Cassandra: high write performance, suitable for high-frequency writes of massive data

ClickHouse: high-performance computation, suitable for insert-only scenarios (update and delete will be supported later)

  • Distributed file systems

HDFS/Parquet/Hive: append only, suitable for batch computation scenarios on massive data

  • Distributed document databases

MongoDB: balanced capabilities, suitable for moderately complex computation on large data volumes

  • Distributed indexing systems

ElasticSearch: strong indexing capability, suitable for fuzzy queries and OLAP analysis scenarios

  • Distributed pre-computation systems

Druid/Kylin: pre-computation capability, suitable for high-performance OLAP analysis scenarios

1.2.5 Computing services platform Moonbox

Figure 9 Moonbox in the RTDP architecture

1.2.5.1 Moonbox design ideas

1) Design ideas from an external perspective

  • Responsible for connecting to different data systems and supporting ad hoc mixed computation across heterogeneous data systems in a unified way
  • Provides three client call methods: RESTful service, JDBC connection, ODBC connection
  • Unified metadata collection; unified query language SQL; unified access control
  • Provides two result-write modes: Merge and Replace
  • Provides two interaction modes: Batch mode and Adhoc mode
  • A data virtualization implementation supporting multi-tenancy; can be seen as a virtual database
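To illustrate the difference between the two result-write modes, here is a tiny dict-based sketch (the functions are our own illustration, not Moonbox's API): Merge upserts the result rows into the target table by key while keeping the other rows, whereas Replace makes the result set the new content of the target.

```python
def merge(target: dict, results: dict) -> dict:
    """Merge mode: upsert result rows into the target by key, keep other rows."""
    out = dict(target)
    out.update(results)
    return out

def replace(target: dict, results: dict) -> dict:
    """Replace mode: the result set becomes the new content of the target."""
    return dict(results)

target = {1: "a", 2: "b"}
results = {2: "B", 3: "C"}
print(merge(target, results))    # {1: 'a', 2: 'B', 3: 'C'}
print(replace(target, results))  # {2: 'B', 3: 'C'}
```

Merge suits incremental result delivery; Replace suits recomputed full snapshots.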

2) Design ideas from an internal perspective

  • Parses SQL through the normal Catalyst parsing and resolution process, finally generating pushdown-able logical subtrees to be executed in the underlying data systems, then performs mixed computation on the returned results and returns them
  • Provides a two-level namespace, database.table, to give a virtual database experience
  • Provides the distributed service module Moonbox Grid for high availability and high concurrency
  • Provides a fast execution channel for logic that can be fully pushed down (no mixed computation)
1.2.5.2 Moonbox Features
  • Supports seamless mixed computation across heterogeneous systems
  • Supports unified SQL syntax for queries, computation, and writes
  • Supports three call methods: RESTful service, JDBC connection, ODBC connection
  • Supports two interaction modes: Batch mode and Adhoc mode
  • Supports the Cli command-line tool and Zeppelin
  • Supports a multi-tenant user permission system
  • Supports table-level permissions, column-level permissions, read permissions, write permissions, and UDF permissions
  • Supports YARN resource scheduling and management
  • Supports metadata services
  • Supports scheduled tasks
  • Supports security policies
1.2.5.3 Moonbox technical architecture

Figure 10 Moonbox logical modules

For more technical details of Moonbox and its user manual, see:

GitHub: https://github.com/edp963/moonbox

1.2.6 Visualization application platform Davinci

Figure 11 Davinci in the RTDP architecture

1.2.6.1 Davinci design ideas

1) Design ideas from an external perspective

  • Responsible for various data visualization display functions
  • Supports JDBC data sources
  • Provides an equal-rights user system; each user can create their own Orgs, Teams, and Projects
  • Supports SQL for writing data processing logic and drag-and-drop editing of visualizations, providing a multi-user collaborative environment with a social division of labor
  • Provides various chart interaction and customization capabilities to meet different data visualization needs
  • Provides the ability to embed visualizations into other data applications

2) Design ideas from an internal perspective

  • Revolves around View and Widget. A View is a logical view of data; a Widget is a visual view of data
  • Users define categorical data, ordinal data, and quantitative data by selection, and the view is visualized automatically according to reasonable visualization logic
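As a toy illustration of that idea (the heuristic below is entirely our own, not Davinci's actual selection logic), a chart type can be suggested from the roles of the fields the user selects:

```python
def suggest_chart(categorical: int, ordinal: int, quantitative: int) -> str:
    """Toy heuristic: map counts of field roles to a chart type."""
    if ordinal and quantitative:
        return "line"     # ordered axis + measure -> trend
    if categorical and quantitative:
        return "bar"      # categories + measure -> comparison
    if quantitative >= 2:
        return "scatter"  # two measures -> correlation
    return "table"        # fallback: just show the rows

print(suggest_chart(categorical=1, ordinal=0, quantitative=1))  # bar
print(suggest_chart(categorical=0, ordinal=1, quantitative=1))  # line
```

The point is only that field roles, not raw column types, drive the automatic visualization.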
1.2.6.2 Davinci Features

1) Data sources

  • Supports JDBC data sources
  • Supports CSV file upload

2) Data views

  • Supports defining SQL templates
  • Supports SQL highlighting
  • Supports SQL testing
  • Supports write-back operations

3) Visual components

  • Supports predefined charts
  • Supports controller components
  • Supports free styles

4) Interactivity

  • Supports full-screen display of visual components
  • Supports local controllers on visual components
  • Supports filter linkage between visual components
  • Supports group control of visual components by controllers
  • Supports local advanced filters on visual components
  • Supports paging and slideshows for large amounts of data

5) Integration capabilities

  • Supports CSV download of visual components
  • Supports public sharing of visual components
  • Supports authorized sharing of visual components
  • Supports public sharing of dashboards
  • Supports authorized sharing of dashboards

6) Security and permissions

  • Supports row- and column-level data permissions
  • Supports LDAP login integration

For more technical details of Davinci and its user manual, see:

GitHub:https://github.com/edp963/davinci

1.3 Topic Discussion

1.3.1 Data Management

1) Metadata management

  • DBus obtains data source metadata in real time and provides query services
  • Moonbox obtains data system metadata in real time and provides query services
  • For the RTDP architecture, real-time and ad hoc metadata of data sources can be collected by calling the RESTful services of DBus and Moonbox, and an enterprise-level metadata management system can be built on this basis

2) Data quality

  • Wormhole can be configured to write messages onto HDFS in real time (hdfslog). Wormhole Jobs based on hdfslog support the Lambda architecture; Backfill based on hdfslog supports the Kappa architecture. By setting scheduled tasks, either the Lambda or the Kappa architecture can be chosen to refresh the Sink regularly and ensure eventual data consistency. Wormhole also feeds information about stream processing exceptions or abnormal Sink writes back to the Wormhole system in real time, and provides RESTful services for third-party applications to call and handle them.
  • Moonbox's ability to run ad hoc mixed computation across heterogeneous systems gives it "Swiss army knife"-like convenience. Scheduled SQL scripts can be written with Moonbox to compare data of interest across heterogeneous systems, or to run field statistics on data tables of interest; a data quality detection system can be built on Moonbox's capabilities through secondary development.

3) Lineage analysis

  • Wormhole's stream processing logic is generally SQL, and these SQLs can be collected via RESTful services.
  • Moonbox controls the unified entry of data queries, and all logic is SQL; these SQLs can be collected from Moonbox logs.
  • For the RTDP architecture, the SQL of real-time processing logic and of ad hoc processing logic can be collected by calling Wormhole's RESTful service and from Moonbox logs, and an enterprise-level lineage analysis system can be built on this basis.

1.3.2 Data Security

Figure 12 RTDP data security

The figure above shows that in the RTDP architecture, the four open source platforms cover the end-to-end data link, and every node takes data security aspects into account and supports them, ensuring that the real-time data pipeline is secure end to end.

In addition, since Moonbox is the unified entry for application-layer data access, Moonbox operation audit logs can yield a great deal of security-related information. A data security alerting mechanism can be built on the operation audit logs, and an enterprise-level data security system can be built from there.

1.3.3 Development and operations

1) Operations management

  • Operations management of real-time data processing has always been a pain point. DBus and Wormhole provide visual operations management capabilities through visual UIs, making manual operations easy.
  • DBus and Wormhole provide RESTful services for health checks, operations management, Backfill, Flow drift, etc., on which automated operations systems can be developed.

2) Monitoring and alerting

  • DBus and Wormhole both provide visual monitoring interfaces where throughput and latency information can be seen in real time at the logical table level.
  • DBus and Wormhole provide RESTful services for heartbeat, Stats, status, etc., on which automated alerting systems can be developed.

2. Discussion of Usage Patterns

In the previous chapter we introduced the design architecture of RTDP and the architecture and features of each technology component, and readers should by now have a concrete understanding of the RTDP architecture. So which common data application scenarios can the RTDP architecture solve? Below we explore several usage patterns and which scenario needs each pattern fits.

2.1 Synchronization mode

2.1.1 Pattern description

Synchronization mode refers to the usage pattern of only configuring real-time data synchronization between heterogeneous data systems, without configuring any processing logic on the stream.

Specifically, DBus is configured to extract data from the data source onto Kafka in real time, and Wormhole is configured to write the data from Kafka into the Sink storage in real time. Synchronization mode mainly provides two capabilities:

  • Subsequent data processing logic is no longer executed on the business backup database, reducing the pressure on business backup databases
  • Provides the possibility of synchronizing business data from different physical backup databases into the same physical data store in real time

2.1.2 Technical difficulty

The concrete implementation is relatively simple.

The implementer needs to understand common streaming issues, but does not need to think about the design and implementation of processing logic on the stream; knowing the basic flow control parameters is enough.

2.1.3 Operations management

Operations management is relatively simple.

It requires manual operations. However, since there is no processing logic on the stream, it is easy to control the stream rate, and the processing logic on the stream is itself idempotent, so a relatively stable configuration can be set and the synchronization pipeline then left largely alone. Scheduled end-to-end data comparison is also easy to do to ensure data quality, because the source-side and target-side data are exactly the same.

2.1.4 Application scenarios

  • Real-time data synchronization for cross-department data sharing
  • Decoupling of transactional and analytical databases
  • Construction of the ODS layer of a real-time data warehouse
  • Simple real-time reports developed by users themselves
  • Etc.

2.2 Stream computation mode

2.2.1 Pattern description

Stream computation mode refers to the usage pattern of configuring processing logic on the stream on top of synchronization mode.

In the RTDP architecture, processing logic on the stream is mainly configured and supported on the Wormhole platform. On top of the capabilities of synchronization mode, stream computation mode mainly provides two new capabilities:

  • Spreads batch computation power, concentrated at scheduled computation times, into continuous incremental computation power on the stream, greatly reducing the latency of result snapshots
  • Provides a new entry point for mixed computation across heterogeneous systems: Lookup on the stream
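As a tiny example of turning a full computation into an incremental one on the stream (a pure-Python sketch of the general idea, not tied to Wormhole's API): instead of re-scanning all rows to produce each average snapshot, maintain a running count and sum and update them per incoming message.

```python
class IncrementalAvg:
    """Keep (count, total) so each new message updates the average in O(1),
    instead of re-scanning the full data set for every result snapshot."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value: float) -> float:
        self.count += 1
        self.total += value
        return self.total / self.count  # the latest result snapshot

avg = IncrementalAvg()
snapshots = [avg.update(v) for v in [10.0, 20.0, 30.0]]
print(snapshots)  # [10.0, 15.0, 20.0]
```

Rewriting full logic into such incremental state updates is exactly the transformation the difficulty section below refers to.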

2.2.2 Technical difficulty

The concrete implementation is relatively difficult.

Users need to understand what stream processing is suitable for and what it can do, and how to transform full computation logic into incremental computation logic. They also need to consider the dependence of the processing logic on the stream on external data systems, idempotency factors, and more configuration parameters to tune.

2.2.3 Operations management

Operations management is relatively difficult.

It requires manual operations, but is harder to manage than synchronization mode, mainly in that more flow control parameters need to be considered, end-to-end data comparison is not supported, an eventual-consistency implementation strategy has to be chosen for result snapshots, time-alignment strategies for Lookup on the stream have to be considered, and so on.

2.2.4 Application scenarios

  • Data application projects or reports that require low latency
  • Low-latency calls to external services (such as calling external rules engines or using online algorithm models on the stream)
  • Construction of wide fact tables + real-time dimension tables for a real-time data warehouse
  • Scenarios of real-time multi-table joining, splitting, cleansing, and standardizing Mapping
  • Etc.

2.3 Rotation mode

2.3.1 Pattern description

Rotation mode refers to the usage pattern, on top of stream computation mode, where data is written to storage in real time while short-period scheduled tasks run further computation over the stored data, and the results are placed back onto Kafka for another round of computation on the stream, so that on-stream computation and post-storage computation take turns: stream computing serves incremental computation, and batch computing serves full computation, running in a bypass fashion.

In the RTDP architecture, the integrated path Kafka → Wormhole → Sink → Moonbox → Kafka can be used to run rotating computation of any number of rounds at any frequency. On top of the capabilities of stream computation mode, the main capability rotation mode provides is: theoretically supporting arbitrarily complex computation logic with low-latency delivery.
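A rough simulation of one rotation round (pure Python; the deque stands in for a Kafka topic and the list for the Sink storage; none of this is the platforms' real API): a stream step writes incoming values to the sink, a scheduled batch step computes over the sink, and its result is republished for the next round on the stream.

```python
from collections import deque

kafka = deque([1, 2, 3])  # stands in for a Kafka topic
sink = []                 # stands in for the Sink storage

def stream_step():
    """On-stream step (Wormhole-like role): drain the topic into the sink."""
    while kafka:
        sink.append(kafka.popleft())

def batch_step():
    """Scheduled batch step (Moonbox-like role): compute over the sink and
    publish the result back onto the topic for the next round on the stream."""
    result = sum(sink)
    kafka.append(result)
    return result

stream_step()
round1 = batch_step()  # 1 + 2 + 3 = 6, republished onto the topic
stream_step()          # the next round consumes the batch result
print(round1, sink)    # 6 [1, 2, 3, 6]
```

In the real architecture, each hop is a configured pipeline rather than a function call, and the rotation frequency is set by the scheduled task.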

2.3.2 Technical difficulty

The concrete implementation is difficult.

The introduction of Moonbox on top of Wormhole further increases the variables to consider compared with stream computation mode, such as the selection among multiple Sinks, the frequency setting of Moonbox computation, and how to split the computation between Wormhole and Moonbox.

2.3.3 Operations management

Operations management is difficult.

It requires manual operations. Compared with stream computation mode, more data system factors need to be considered, more tuning parameters need to be configured, and data quality management and diagnostic monitoring are more difficult.

2.3.4 Application scenarios

  • Low-latency scenarios with complex multi-step data processing logic
  • Construction of a company-level real-time data stream processing network

2.4 Intelligent mode

2.4.1 Pattern description

Intelligent mode refers to the usage pattern of optimizing for efficiency with rules or algorithm models.

Points where intelligence can be applied:

  • Intelligent drift of Wormhole Flows (intelligent automated operations)
  • Intelligent optimization of Moonbox pre-computation (intelligent automated tuning)
  • Intelligent conversion of full computation logic into on-stream computation logic, then deployment on Wormhole + Moonbox (intelligent automated development and deployment)
  • Etc.

2.4.2 Technical difficulty

The concrete implementation is in theory the easiest, but an effective technical implementation is the most difficult.

Users only need to complete the development of offline logic; the rest is handed over to intelligent tools, which complete development, deployment, tuning, and operations.

2.4.3 Operations management

Zero operations.

2.4.4 Application scenarios

All scenarios.

With this, our discussion of the topic "how to design a real-time data platform" comes to a temporary close. We started from background concepts, discussed the design architecture, introduced the technology components, and finally explored usage patterns. Since every topic touched on is a large one in its own right, we have only been able to give a shallow introduction and discussion here. In the future we will occasionally publish detailed discussions of specific topics, present our practice and experience, and open discussions for brainstorming. If you are interested in the four open source platforms of the RTDP architecture, find us on GitHub to learn more, use them, and share your suggestions.

Author: Lu Shanwei

Source: CreditEase Institute of Technology


Origin: yq.aliyun.com/articles/707015