You don't need a real-time data warehouse | What you need is a suitable, powerful OLAP database (Part 1)

Foreword

This year, building real-time data warehouses has suddenly become something everyone cares about. I have also written and reposted several articles and solutions on real-time data warehouses on my personal public account.

But there is no need to chase real-time data warehouses fanatically.

First, there is almost no technical difficulty: with powerful open source middleware, meeting real-time data warehouse requirements has become much easier. Second, a real-time data warehouse must evolve together with the business, and it is wrong to assert arbitrarily that the Kappa architecture is the best real-time data warehouse architecture. In practice, as the business develops, the choice of data warehouse architecture becomes less of an either-or decision.

Throughout the construction of a real-time data warehouse, the choice of OLAP database directly constrains the warehouse's usability and functionality. This article starts from several typical data warehouse builds in the industry, then looks at the open source OLAP engines on the market today in terms of architecture, technology selection, and strengths and weaknesses, so that you can choose according to your actual business needs during technology selection.

Different views: real-time data warehouse construction at Cainiao / Zhihu / Meituan / NetEase Yanxuan

Why build a real-time data warehouse

A traditional offline data warehouse stores business data centrally and then, through scheduled ETL and modeling with fixed computation logic, outputs reports and feeds other applications. It is built mainly on T+1 offline data: scheduled tasks pull incremental data every day, subject-area dimension data is built for each business, and T+1 data query interfaces are exposed externally. Both computation and data timeliness are poor, so business users cannot get data from a few minutes ago when they need it immediately. The value of data fades over time, so data needs to reach users as soon as possible after it is generated. This is how the need for real-time data warehouses arose.

In one sentence: timeliness requirements.

Alibaba Cainiao real-time data warehouse design

Figure: overall design of Cainiao's real-time data warehouse

The figure shows the overall design of Cainiao's real-time data warehouse. It is built on a data-service approach. The data model follows the traditional data warehouse layering (detail / light summary / high summary). The compute engine is Blink, Alibaba's internal engine. Data access goes through Tiangong, a tool that connects to multiple data sources so that applications are shielded from connecting directly to the many different databases. The data applications correspond to Cainiao's individual businesses.

Cainiao's real-time data warehouse architecture is a very typical and well-proven design. Real-time data access goes through messaging middleware (in the open source big data field this is almost always Kafka, with Pulsar as a rising star), and HBase serves as an auxiliary KV store for querying highly aggregated data.

So where is the large volume of business queries served directly? Here: ADS.

ADS (later renamed ADB, with new features added) is a cloud database developed in-house by Alibaba for real-time, highly concurrent online analysis (Realtime OLAP) of massive data. (https://help.aliyun.com/document_detail/93838.html)

Figure: typical scenario, real-time data cleansing
Figure: typical scenario, real-time data warehouse

The official ADB documentation lists ADB's capabilities:

Fast
ADB uses a fused MPP + DAG engine with hybrid row-column storage and automatic indexing, and can scale out quickly to thousands of nodes.

Flexible
The number of nodes can be adjusted and instance specifications can be scaled up or down dynamically.

Ease of use
Fully compatible with the MySQL protocol and SQL.

Very large scale
Fully distributed architecture with no single point of failure; it scales out easily to increase concurrent SQL processing capacity.

High concurrent write
A small deployment can write 100,000 TPS, and scaling out nodes brings write capacity to 2 million+ TPS. Data can be analyzed in real time about 1 second after it is written. A single table supports up to 2 PB of data and ten trillion records.

Zhihu real-time data warehouse design

Zhihu's real-time data warehouse practice and architecture evolution is divided into three phases:

  • Real-time data warehouse 1.0. Theme: real-time ETL logic. Technical solution: Spark Streaming
  • Real-time data warehouse 2.0. Theme: data layering and real-time metric computation. Technical solution: Flink Streaming
  • Future of the real-time data warehouse: Streaming SQL as a platform, systematic metadata management, automated result verification

Figure: real-time data warehouse 1.0
Figure: real-time data warehouse 2.0

In the technical architecture of version 2.0, a summary metric layer was added. It is obtained by aggregating the detail layer or the detail summary layer, and it produces most of the real-time data warehouse's metrics. This is the biggest difference from version 1.0.

For technology selection, Zhihu chose HBase and Redis as the storage engines for real-time metrics, depending on the scenario. For OLAP, Zhihu chose Druid.

Figure: Zhihu real-time multidimensional analysis platform architecture
Figure: Druid overall architecture

Druid is an efficient data query system that mainly addresses aggregate queries over large volumes of time-series data. Data can be ingested in real time and queried immediately after it enters Druid, while data that has already been ingested is almost never modified. The data is usually time-based fact events: once a fact is written to Druid, external systems can query it.
Architectures Druid uses:

  • shared-nothing architecture and lambda architecture

Druid's three design principles:

  • Fast queries: partial aggregation (Partial Aggregate) + in-memory (In-Memory) + indexing (Index)
  • Horizontal scalability: distributed data (Distributed Data) + parallelizable queries (Parallelizable Query)
  • Real-time analysis: Immutable Past, Append-Only Future
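
To make the "partial aggregation" idea concrete, here is a minimal Python sketch of rolling raw events up into one row per time bucket and dimension combination at ingestion time, so that later aggregate queries scan far fewer rows. It only illustrates the principle and is not Druid's implementation or API; the function names, the event layout, and the 60-second granularity are all assumptions made for the example.

```python
from collections import defaultdict

def rollup(events, granularity_seconds=60):
    """Pre-aggregate raw events into one row per (time bucket, dimensions).

    Each event is (epoch_seconds, dimensions_dict, metric_value).
    Hypothetical layout, chosen only for this illustration.
    """
    buckets = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for ts, dims, value in events:
        # Truncate the event time to the start of its time bucket.
        bucket_start = ts - ts % granularity_seconds
        key = (bucket_start, tuple(sorted(dims.items())))
        buckets[key]["count"] += 1
        buckets[key]["sum"] += value
    return dict(buckets)

def query_sum(rolled, dim_name, dim_value):
    """Answer an aggregate query from the rolled-up rows, not the raw events."""
    return sum(
        row["sum"]
        for (_, dims), row in rolled.items()
        if (dim_name, dim_value) in dims
    )

events = [
    (1000, {"city": "Hangzhou", "channel": "app"}, 10.0),
    (1010, {"city": "Hangzhou", "channel": "app"}, 5.0),
    (1070, {"city": "Hangzhou", "channel": "web"}, 2.0),
    (1075, {"city": "Beijing", "channel": "app"}, 7.0),
]
rolled = rollup(events)
hangzhou_total = query_sum(rolled, "city", "Hangzhou")
```

The first two events fall into the same bucket with the same dimensions and collapse into a single row. The rolled-up rows are only ever appended to or incremented, which fits the "Immutable Past, Append-Only Future" principle.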

If you are not familiar with Druid, see: https://zhuanlan.zhihu.com/p/35146892

Meituan real-time data warehouse design

Figure: data layering architecture of Meituan's real-time data warehouse

Meituan's design consists of the following four layers:

  • ODS layer: Binlog and traffic logs, plus the real-time queues of each business.
  • Detail layer: extracts business-domain fact data, and builds real-time dimension data by combining offline full data with real-time change data.
  • Summary layer: uses a wide-table model to enrich the detail data with dimension data, and aggregates common metrics.
  • App layer: the application layer built for specific requirements, serving external consumers through an RPC framework.
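
The flow through the first three layers can be sketched in a few lines of Python. This is only an illustration of the layering idea under assumed data: the record fields, the `SHOP_DIM` dimension table, and the function names are hypothetical and do not come from Meituan's system.

```python
from collections import defaultdict

# Hypothetical dimension data: shop_id -> attributes.
SHOP_DIM = {
    "s1": {"city": "Beijing", "category": "food"},
    "s2": {"city": "Shanghai", "category": "food"},
}

def to_detail(ods_record):
    """ODS -> detail layer: extract the business fact from a raw change record."""
    return {
        "order_id": ods_record["id"],
        "shop_id": ods_record["shop_id"],
        "amount": float(ods_record["amount"]),
    }

def to_wide(detail_row, dim=SHOP_DIM):
    """Detail -> summary layer: widen the fact row with dimension attributes."""
    return {**detail_row, **dim.get(detail_row["shop_id"], {})}

def summarize(wide_rows, by="city"):
    """Summary layer: aggregate a common metric (order amount) per dimension value."""
    totals = defaultdict(float)
    for row in wide_rows:
        totals[row.get(by, "unknown")] += row["amount"]
    return dict(totals)

# Raw records as they might arrive from a Binlog or business queue.
ods = [
    {"id": 1, "shop_id": "s1", "amount": "20.5"},
    {"id": 2, "shop_id": "s2", "amount": "10"},
    {"id": 3, "shop_id": "s1", "amount": "4.5"},
]
wide = [to_wide(to_detail(r)) for r in ods]
city_totals = summarize(wide)
```

An App layer would then expose something like `city_totals` through a service interface. In a real deployment each step would be a streaming job rather than a list comprehension.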

Depending on the scenario, the storage solution and OLAP engine used at each model layer of the real-time data warehouse are as follows:

Figure: storage solution and OLAP engine for each model layer

  • Detail layer: for scenarios where dimension data is joined at up to 100,000+ TPS, we chose Cellar (Meituan's internal distributed KV storage system, similar to Redis) as the storage, wrapped as a dimension service that supplies dimension data to the real-time data warehouse.

  • Summary layer: for general summary metrics that need to be joined with historical data, we use the same approach as for dimension data, with Cellar as the storage and joins done through a service.

  • Application layer: the design here is more complex. After comparing several storage options, we set a read/write frequency of 1000 QPS as the dividing line. For real-time applications whose average read/write frequency is above 1000 QPS but whose queries are not very complex, such as real-time business operations data, we use Cellar as the storage to provide real-time data services. For applications with complex queries or that need detail lists, Elasticsearch is a better fit as the storage. For low-frequency queries, such as some internal operations data, we use Druid, which builds indexes by processing messages in real time and can quickly provide real-time OLAP analysis through pre-aggregation. When converting historical versions of data products to real time, MySQL can also be used as the storage to make product iteration easier.
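
The application-layer rules above can be written down as a small decision function. This is a sketch only: the function name, its parameters, and in particular the order in which the rules are checked are assumptions for illustration, since the text gives the criteria but not their precedence.

```python
QPS_THRESHOLD = 1000  # dividing line stated in the text

def choose_app_layer_store(avg_qps, complex_query=False,
                           needs_detail_list=False, legacy_product=False):
    """Pick an application-layer store following the rules described above.

    The precedence of the checks is an assumption made for this sketch.
    """
    if legacy_product:
        # Historical data products being converted to real time.
        return "MySQL"
    if complex_query or needs_detail_list:
        # Complex queries or detail lists.
        return "Elasticsearch"
    if avg_qps > QPS_THRESHOLD:
        # High-frequency reads and writes with simple queries.
        return "Cellar"
    # Low-frequency queries, served by pre-aggregated OLAP.
    return "Druid"

merchant_dashboard = choose_app_layer_store(avg_qps=5000)
order_search = choose_app_layer_store(avg_qps=200, needs_detail_list=True)
internal_report = choose_app_layer_store(avg_qps=5)
```

Writing the rules this way makes the gaps visible: a workload that is both above 1000 QPS and query-heavy is not covered by the text, and the sketch arbitrarily sends it to Elasticsearch.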

In short, Meituan's OLAP choice is likewise based on Druid.

NetEase Yanxuan real-time data warehouse design

Figure: overall framework of NetEase Yanxuan's real-time data warehouse

The overall framework of NetEase Yanxuan's real-time data warehouse is divided into layers along the data flow. The data access layer collects data from each business system with various data ingestion tools. The data in the message queue is both the raw data of the offline data warehouse and the raw data for real-time computation, which guarantees that real-time and offline raw data are consistent. The compute layer processes the data with Flink and the real-time compute engine, and the results land in the storage layer on different storage media, chosen according to the scenario. The framework also includes the interaction between Flink and Kafka: for the layered data design, the compute engine reads data from Kafka, processes it, and writes it back to Kafka. Data in the storage layer passes through two services in the service layer: unified query and metric management. Unified query is a single service interface that business teams call to retrieve data; metric management handles the definition and management of metrics. Through the service layer, data reaches the different data applications, which may be our official products or business systems that query directly.

Based on the design above, the technology selection is as follows:
Figure: NetEase Yanxuan technology selection

The storage layer chooses different storage media based on the characteristics of each layer's data. The ODS and DWD layers hold real-time data and use Kafka; where the DWD layer needs to join some historical detail data, that data is put in Redis. The DIM layer mainly serves highly concurrent dimension lookups and is generally stored in HBase. The aggregated models are more complex, and their storage is chosen by weighing the requirements for landing the data against the specific query engine:

  • Common metric summary models go directly into MySQL.
  • Models with many dimensions and heavy write and update volume go into HBase.
  • Detail data that needs multidimensional analysis or joins is stored in Greenplum.
  • Data with many dimensions that needs sorting and has demanding query requirements, such as large lists like users' sales lists during promotions, is stored directly in Redis.

NetEase Yanxuan chose Greenplum, HBase, Redis, and MySQL as the data computation and presentation layer.

Greenplum's technical characteristics are as follows:

  • Supports storage and processing of massive data
  • Supports Just In Time BI: near-real-time and real-time data loading keeps the data warehouse updated in real time, enabling an active data warehouse (ADW); on top of it, business users can run BI analysis on current business data in real time
  • Supports mainstream SQL syntax, so it is convenient to use and cheap to learn
  • Extensible: supports user-defined functions in multiple languages and user-defined types
  • Provides a large number of maintenance tools and is easy to maintain
  • Supports linear scaling: it uses an MPP parallel processing architecture, and adding nodes increases the system's storage capacity and processing power linearly
  • Good support for concurrency and high availability: in addition to hardware-level RAID, it provides a database-level Mirror mechanism for protection and a Master / Standby mechanism for master node fault tolerance, so that when the master node fails, service can switch to the Standby node and continue
  • Supports MapReduce, a technique for large-scale data analysis
  • Internal database compression

If you are not familiar with Greenplum, see:
https://www.cnblogs.com/wujin/p/6781264.html

Summary

From the analysis above, we can see that the industry already has mature solutions for building real-time data warehouses. The overall architecture uses layered design to take pressure off OLAP queries and free up compute capacity: complex computation is done uniformly in the real-time compute layer so that OLAP queries are not overloaded, while summary computation is handed to the OLAP database. It is fair to say that, across the whole architecture, real-time computation is generally done by Spark and Flink together. Among message queues Kafka dominates and still holds a near-monopoly on message queue applications across the big data field, so it will be very hard for the latecomer Pulsar to overtake it. HBase, Redis, and MySQL each have their place in specific scenarios.
Only the OLAP field has many contenders, each with its own merits. Open source OLAP engines in the big data field include, but are not limited to, Hive, Druid, Hawq, Presto, Impala, Spark SQL, ClickHouse, and Greenplum. Next, we will compare the strengths, weaknesses, and usage scenarios of these open source OLAP engines in detail, so that developers can make informed choices during technology selection.

Big Data technology and architecture

Origin www.cnblogs.com/importbigdata/p/11521403.html