Transwarp's self-developed technology accelerates the value path of big data from persistence and unification, through assetization and business enablement, to ecosystem building

Since its founding in 2013, Transwarp Technology has focused on better combining foundational big data technology with enterprise data business. Facing China's more complex data application scenarios, it has developed a range of solutions better suited to domestic big data needs and achieved a number of fundamental technical breakthroughs in big data management. On its path of insisting on self-developed technology, Transwarp has produced several world-class technical achievements. This article introduces Transwarp's big data technology.

—Overview of Transwarp Big Data Technology—

To meet new data business needs and solve long-standing technical problems, Transwarp redesigned the big data technology stack and built a highly unified data platform. It effectively addresses the "4 Vs" of big data and opens up the technology chain for delivering data value, accelerating the value path of big data from persistence and unification, through assetization and business enablement, to ecosystem building. This is the Transwarp Big Data 3.0 technology system.

In 2015, Transwarp delivered a Hadoop-based distributed analytical database, the first to support the complete SQL standard, stored procedures, and distributed transactions. In the same year it launched a low-latency stream computing engine and was the first in the industry to introduce the StreamSQL language extension, which lowers the difficulty of developing streaming applications; the engine's computing latency is under 5 ms, far lower than that of Spark Streaming.

In 2017, it was the first in the industry to launch a big data cloud service based on Docker and Kubernetes, giving its big data products better cross-platform and cloud capabilities. It was the first vendor in the industry to adopt Kubernetes for this purpose; Cloudera did not complete comparable development work until Q3 2020.

In 2018, it released a distributed graph database supporting graphs with trillions of vertices and edges, providing powerful graph analysis and storage capabilities and accelerating cognitive-intelligence computing. In the same year it released a new-generation distributed analytical database built on flash storage; its new columnar storage, based on the Raft protocol and self-developed flash-optimized storage, significantly improves interactive analysis performance and meets the requirements of data warehousing and interactive analysis over the full data set.

—Design Considerations and Overall Architecture—

At the start of the design, we determined that the new generation of big data technology must have the following characteristics:

(1) A unified, integrated data platform that replaces hybrid architectures

Current enterprise data architectures often include separate systems such as data lakes, data warehouses, data marts, and unified search. Many enterprises therefore adopt complex hybrid architectures, which not only create huge data redundancy but also severely limit the timeliness of data applications. The new big data platform must meet all of these needs in one place, covering everything from fast response to massive-scale analysis, and eliminate the hybrid architecture model.

(2) Unified development methods, with SQL as the single interface

As a structured query language tested by decades of use, SQL has a huge user base and great flexibility. API-based development, by contrast, has suffered from poor application compatibility and high development difficulty. The new generation of big data platforms must expose all capabilities through SQL, including data warehousing, online transactions, search engines, and spatio-temporal databases, lowering the barrier to entry for developers and speeding up product development and launch.

(3) Cloudification of big data, making big data broadly accessible

The elasticity and ubiquitous access of cloud computing allow more data businesses and developers to use big data technology, so the new big data technology must provide cloud capabilities. At the hardware level, the big data platform manages and allocates CPU, GPU, network, storage, and other resources in a unified way; based on container technology, it deploys big data applications uniformly on the cloud, and platform tenants request big data technologies and products on demand.

(4) Integration of big data with the application ecosystem, supporting both turning data into business and digitizing the business

Turning data into business value is the ultimate expression of big data technology. At the data layer, all data on the platform is stored in a unified way, with a unified data warehouse and data asset catalog that each business unit calls on as needed. At the model layer, a model marketplace lets tenants publish trained models with one click so that other tenants can call them directly. At the application layer, users can publish business-validated applications to an enterprise application marketplace and share them with other users, and all running applications are managed centrally.

To meet enterprises' higher integration requirements for big data and to support new data storage and computing needs, Transwarp redesigned the big data technology stack as a whole, while keeping the layers connected through common interfaces. This ensures future extensibility, avoids the architectural shortcomings of Hadoop, and allowed the foundational big data technology to be developed in-house step by step. After more than seven years of work, the technology stack is now essentially fully self-developed.

The figure above shows the logical architecture of Transwarp's big data technology stack, from bottom to top. The bottom layer is the resource scheduling layer, which manages and schedules all kinds of computing tasks; we chose to build it on Kubernetes. As data applications evolve, computing tasks are no longer just MapReduce but also Spark, deep learning, and even high-performance computing tasks such as MPI, as well as elastic data applications, so YARN, designed specifically for Hadoop, can no longer meet these requirements. By combining Kubernetes with innovations at the bottom of the big data stack, our resource scheduling layer supports all of these computing tasks and also connects to the underlying cloud infrastructure, solving the problem of moving big data to the cloud.

To better meet future data storage and analysis needs and to support new kinds of storage engines, we abstracted a unified storage management layer into which different storage engines can be plugged to serve storage, retrieval, and analysis requests for different types of data. In the future there may be dedicated distributed storage engines for specific applications; with a unified distributed block storage management layer, an architect only needs to design a single-node storage engine or file system and connect it to the storage management layer to obtain a distributed storage engine with distributed transactions, MVCC, indexing, SQL expression pushdown, and other capabilities, greatly reducing the complexity of storage development.

Below the block storage management layer sit the individual database cores and storage formats, including columnar storage for analytical databases, NoSQL BigTable-style storage, full-text indexes for search engines, and graph storage engines for graph computing. These engines take execution plans and turn them into scan/put/write/transaction operations on the storage layer to complete the specific processing tasks.

Above the storage layer is a unified computing engine layer. We chose a DAG-based computing model to support the various kinds of big data computation. Compared with the MPP model, DAG computing is better suited to the diverse communication and computing patterns of large clusters, scales better, and can handle multi-iteration workloads such as graph computing and deep learning. At the same time, through code generation and other techniques, its performance can be optimized to a level very close to native code.

The top layer is a unified development interface layer. For analytical and transactional databases we provide developers with a standard SQL interface, reducing the complexity of data development and analysis. Thanks to a well-designed SQL optimizer, SQL workloads achieve very high performance without special tuning, often better than direct API-level programming, and without requiring knowledge of the underlying architecture. For the graph database we provide a Cypher language interface whose optimizer reuses the SQL optimizer framework. The development interface layer also provides a unified transaction processing unit, ensuring that data development has complete transaction guarantees and that the ACID properties of the data are preserved.

—Development Interface Layer—

The core of the unified development interface layer is the SQL compiler, optimizer, and transaction management unit. Together they give developers a better database experience: business development no longer has to be done against low-level APIs, traditional business logic remains supported, and it can be optimized better.

Unlike traditional big data SQL engines such as Hive, we redesigned the SQL compiler. It includes three parsers, which generate semantic expressions from SQL, stored procedures, or Cypher statements, as well as a distributed transaction processing unit. After a SQL statement is processed by the parser, it passes through four different optimizers to produce the best execution plan, which is then pushed down to the vectorized execution engine layer.

  • RBO (Rule-Based Optimizer) applies existing expert rules. Different storage engines or database developers provide their own optimization rules, and we have accumulated hundreds of them so far. The most effective rules target I/O, such as filter pushdown, implicit filter condition folding, partition- or bucket-based I/O optimization, partition elimination, and redundant column elimination; they eliminate as many of a SQL statement's I/O operations as possible and thereby improve overall performance (a small SQL sketch after this list illustrates the idea).

  • ISO (Inter-SQL Optimizer) optimizes within stored procedures. When several SQL statements in a stored procedure contain similar queries or analyses, it merges these operations to reduce unnecessary computation tasks or SQL operations. To give stored procedures better performance, the PL/SQL parser builds a SQL DAG from the context of the stored procedure, recompiles the execution plan of each SQL statement, and uses the physical optimizer to merge execution plans that have no dependencies on each other into a final physical execution plan DAG. A stored procedure is thus parsed into one large DAG, so that many stages can run concurrently, avoiding the startup overhead of executing SQL statements one by one and preserving the system's concurrency.

  • MBO (Materialize-Based Optimizer) optimizes using materialized views or cubes. If the database already contains materialized views or cubes that a SQL operation can exploit, MBO rewrites the plan to operate on those materialized objects, reducing the amount of computation (also shown in the sketch after this list).

  • CBO (Cost-Based Optimizer) selects the best execution plan from several candidate plans based on their I/O, network, and computation costs, with cost estimates coming from the metadata service. In the future we also plan to introduce machine learning, using statistics from historical SQL executions to generate more robust plans. Particularly effective cost-based optimizations include multi-table join order optimization, join type selection, and task concurrency control.

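To make the I/O-oriented rules and the materialized-view rewriting concrete, here is a minimal SQL sketch. The table, columns, and materialized view are hypothetical, and the exact DDL syntax (PARTITIONED BY, CREATE MATERIALIZED VIEW) varies across engines; the point is only to show the kind of query shape these optimizers accelerate.

-- Hypothetical partitioned fact table, used only for illustration
CREATE TABLE sales(sale_id BIGINT, region STRING, amount DECIMAL(18,2))
PARTITIONED BY (sale_month STRING);

-- RBO: the predicate on sale_month enables partition elimination, and the
-- predicate on region can be pushed down into the storage scan
SELECT region, SUM(amount) AS total_amount
FROM sales
WHERE sale_month = '2023-01'
  AND region = 'East'
GROUP BY region;

-- MBO: if a materialized object like this already exists, the query above
-- can be rewritten to read the pre-aggregated data instead of the fact table
CREATE MATERIALIZED VIEW sales_by_region_month AS
SELECT sale_month, region, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_month, region;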
The SQL compiler and optimizer are critical to the big data technology stack; as noted above, they largely determine whether an ecosystem can be built around the technology. Besides the SQL interface, distributed transactions and their interfaces are also critical components: data consistency must be guaranteed under a complex, fault-tolerant distributed architecture, and multiple transaction isolation levels must be supported, so that the database can be extended to support more applications.
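As a minimal sketch of what the unified transaction interface looks like to a developer, the standard SQL below moves an amount between two rows inside a single transaction. The accounts table is hypothetical, and which isolation levels are actually available depends on the engine and its configuration.

-- Hypothetical table, shown only to illustrate transactional SQL with an explicit isolation level
SET TRANSACTION ISOLATION LEVEL READ COMMITTED; -- in most dialects this applies to the next transaction
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT; -- both updates become visible atomically; a ROLLBACK would undo both

With distributed transactions, the same guarantee is expected to hold even when the two rows live on different nodes.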

—Computing Engine Layer—

Our execution engine uses a DAG-based model. To achieve better execution efficiency, we use a vectorized execution engine to speed up data processing: each computation processes a batch of data rather than one record at a time, which gives a very large speed-up on columnar storage. In addition, in line with much recent research, Transwarp uses the same computing engine for both real-time and offline computing, which better supports unified stream-batch business scenarios.

There are currently two mainstream frameworks for scaling database computing performance: one based on MPP (Massively Parallel Processing) and one based on a DAG (Directed Acyclic Graph). Overall, the MPP approach is not flexible enough in fault tolerance, scalability, and business adaptability, and cannot meet our need to support diverse future data services, so we chose the DAG-based computing model and then deeply optimized its execution performance. This supports more diverse computing requirements while still delivering excellent performance. The comparison below summarizes the differences between the two approaches.

  • SQL compilation: MPP relies on the SQL capabilities of the underlying single-node databases, while DAG uses a self-developed SQL compiler.

  • Data storage: MPP uses a shared-nothing architecture, while DAG uses a shared distributed storage architecture.

  • Metadata: MPP has relatively limited metadata, making it hard to optimize computing tasks globally; DAG has global metadata, which better coordinates data exchange between executors and the starting and stopping of tasks.

  • Performance within a shard: in MPP, local database execution is fast and is in theory the upper bound for DAG; in DAG, performance can be optimized with techniques such as vectorized executors and Codegen.

  • Fault tolerance: MPP relies on each database node to complete its share of the task, so fault tolerance is weak; with DAG's shared data storage, tasks can be kept simple and idempotent, giving better fault tolerance.

  • Data communication performance: MPP relies on data distribution to reduce communication overhead, so it is inflexible; DAG relies on global metadata to reduce communication overhead and is more flexible.

  • Core advantages: MPP has mature optimizers and better local execution performance; DAG has higher flexibility and fault tolerance and can better reduce data communication cost.

  • Architectural issues: MPP's overall performance depends on business characteristics and data distribution, and the scalability of some MPP systems still needs improvement; DAG's SQL, transactions, optimizers, and so on still need continuous improvement, though its performance is basically approaching that of MPP.

Since 2018, enterprise demand for real-time computing has grown very rapidly. Moreover, because real-time computing mostly runs in production systems, it has higher technical requirements than analytical systems, including:

  • High concurrency: sustained, highly concurrent data operations or analysis

  • Low latency: millisecond-level processing response times

  • Accuracy: no data loss or duplication, and high business availability

  • Business continuity: connects to production data services online

To adapt to these business needs in a systematic way, we gave up on open source solutions such as Spark and Flink and designed the entire real-time computing product ourselves. First, we redesigned the computing model of the stream engine so that its processing latency on data streams can be as low as 5 milliseconds, while keeping the whole data link secure.

In terms of the computing model, streaming data must not only be combined with data in other time windows for complex computation, but also be computed together with historical data (data persisted in various databases). We therefore introduced a CEP (Complex Event Processing) engine, which can evaluate multiple input events, perform complex pattern matching and aggregation, supports various sliding-window computations, and can join streams with historical or persisted data.

For complex application logic, we designed a rule engine to process business rules; it is compatible with rules designed in other rule engines, so complex business rules can be implemented. Finally, to better handle business metrics, we added a memory-based distributed cache to the streaming engine to accelerate high-speed storage and retrieval of data metrics, with support for data publish and subscribe.

At the SQL layer, we defined the StreamSQL language extension and added objects such as Stream, Stream Application, and Stream Job. A Stream receives data from a data source, either directly or after applying certain transformations to the data. A Stream Job defines the concrete stream processing logic, such as rule matching or real-time ETL. A Stream Application is a group of Stream Jobs related by business logic. The example below creates a stream over a Kafka topic and uses a CEP pattern to raise an alarm when a robot arm passes position A but does not then pass position B within one minute:

USE APPLICATION cep_example;

CREATE STREAM robotarm_2(armid STRING, location STRING) tblproperties(
    "topic"="arm_t2",
    "kafka.ZooKeeper"="localhost:2181",
    "kafka.broker.list"="localhost:9092"
);

CREATE TABLE coords_miss(armid STRING, location STRING);

INSERT INTO coords_miss
SELECT e1.armid, e1.location
FROM PATTERN(
    -- Raise an alarm when the robot arm passes position A but then does not pass position B
    e1=robotarm_2[e1.location='A'] NOTNEXT
    e2=robotarm_2[e2.armid=e1.armid AND e2.location='B']
) WITHIN ('1' minute);

—Distributed Block Storage Management Layer—

The unified distributed block storage management layer is a major change we made for the new generation of big data technology. Data consistency is the foundation of a distributed system: the Paxos protocol proved its feasibility in theory, and the simpler Raft protocol is more efficient to implement in engineering. In practice, several open source distributed stores still have shortcomings in how they achieve high availability and data consistency. For example, Cassandra guarantees high availability architecturally but suffers from replica inconsistency and cannot support transactional operations; HBase relies on HDFS underneath for data persistence and consistency, but HMaster uses an active-standby design whose failover can take a relatively long time, so there is a single point of failure and availability cannot be guaranteed; Elasticsearch is similar, and the consistency of data within a shard is also a problem in production.

As enterprise data business develops further, more requirements for dedicated storage engines will appear, such as storage and analysis of geographic information, graph data, and high-dimensional features, on top of the existing four types of NoSQL storage. Implementing a separate storage engine for each scenario is a very heavy workload and amounts to reinventing the wheel.

To solve this problem, we abstracted the parts common to every distributed store into the storage management layer, including data consistency, the storage engine optimization interface, the transaction interface, the MVCC interface, distributed metadata management, data partitioning strategies, and fault tolerance and disaster recovery strategies, with the various roles coordinated by a self-developed Raft-based distributed control layer. Each storage engine only needs to implement its single-node engine and connect to the unified storage management layer to become a highly available distributed storage system.

In the implementation, we use the Raft protocol to guarantee consistency across the various stores, mainly covering:

  • State machine synchronization between the tablet replicas built from each single-node store

  • Master election and state machine synchronization

  • Master election and state machine synchronization for the transaction coordination group

  • Recovery service capability of the storage service

  • Other management and operational capabilities

—Resource Scheduling Layer—

Similar to the scheduler of an operating system, the resource scheduling layer is the key to making the whole big data platform run effectively. The figure below shows its overall architecture: the bottom layer is the Kubernetes service, with our self-developed products and services running on top of it. The configuration center collects and manages, in real time, the configuration parameters of services running on the cloud platform; the physical resource pool is a logical pool of the various hardware resources; the cloud storage service is a distributed storage service built on local storage that persists stateful service data, ensuring application data durability and system disaster recovery; and the cloud network is a self-developed network service that gives applications and tenants VPC-like network capabilities. Above these sits the cloud scheduling system, which takes application requests as input, obtains real-time operating metrics from the configuration center, label center, cloud storage, and network services, and obtains resource usage from the resource pool, so that it can make precise scheduling decisions. On top of the scheduling system are the various application services, including big data, AI, databases, and all kinds of microservices, that is, the applications the cloud platform supports well.

—Summary—

This article has introduced Transwarp Technology's big data technology. Going forward, Transwarp will continue to improve this new big data architecture, adding more new data storage and computing capabilities and completing the technical puzzle of turning data into business value, including machine-learning-based data governance and data service publishing, further bridging the gap between data and business so that big data technology can deliver its value even better.

