"Offline and real-time big data development combat" (2) Big data platform architecture & technology overview

Preface

Following the previous chapter, which built a knowledge map of big data development, this post continues my study notes on "Offline and Real-Time Big Data Development in Practice". What kind of platform counts as a big data platform? With this question in mind, let's start today's content (•̀ ω •́)✧

What is a data platform? Or, more fashionably, what is a big data platform? The industry currently has no precise definition, but what is commonly referred to as a data platform mainly includes the following three parts:

  • Data-related tools, products and technologies: for example, Sqoop for batch data collection and transfer, Hadoop and Hive for offline data processing, Storm and Spark for real-time stream processing, and R for data analysis;
  • Data assets: not only the data generated and accumulated by the company's own business, but also data generated by the company's operations (such as finance and administration), as well as data purchased, exchanged, or crawled from outside;
  • Data management: with data tools and data assets in place, the data must also be managed so that its value is maximized and its risks are minimized. Data platforms therefore usually also include concepts and technologies related to data management, such as data warehouses, data modeling, data quality, data standards, data security, and metadata management.

The above is a logical division of what a data platform covers. In practice, from the perspective of data-processing timeliness, data platforms are generally divided into offline data platforms and real-time data platforms.

  1. Offline data platforms usually take a day as the typical data-processing cycle, and data delays are likewise measured in days. The data applications of offline data platforms are mainly about "seeing" the data. Judging from the current state of the industry, the offline data platform is still the main battlefield of the data platform.

  2. However, with the deepening application of big data and the rise of the artificial intelligence wave, the trend toward intelligent products is becoming more and more obvious. Real-time and online applications place ever higher requirements on the timeliness of the data platform: from minute-level delays at the beginning to second-level or even millisecond-level delays today. Real-time data platforms are receiving more and more attention, facing bigger and bigger challenges, and becoming increasingly mainstream. With the development of technologies such as Spark, Flink, and Beam, the technology and architecture of offline data platforms may one day be disrupted.

The rest of this article introduces the main concepts and technologies of the data platform from three aspects: the offline data platform, the real-time data platform, and data management.

1. Architecture, technology and design of the offline data platform

For company managers and front-line business personnel, the questions that often need answering are: What are the sales trends for the current and past quarter or month? Which products are selling well, and which are not? Which customers are buying our products? Managers and business personnel need to monitor these business indicators constantly and adjust business strategies and tactics accordingly, and repeating this process forms a closed loop. This is the basic idea of data-driven operations.

This kind of analytical, "seeing" requirement is exactly what offline data platforms are good at. For such analysis, data timeliness is not a hard requirement: even if the current day's data is not yet available, the impact is small. Offline data technologies and tools have matured over many years, and there are plenty of open source and commercial solutions that already solve this class of problems very well.

The offline data platform is the cornerstone of a company's or enterprise's data platform, and it is also the main battlefield of data platforms today.

1.1 The overall architecture of the offline data platform

The offline data platform is usually associated with concepts and technologies such as Hadoop, Hive, data warehouses, ETL, dimensional modeling, and the data common layer.

Before the advent of Hadoop, the main processing technology for data warehouses was commercial databases, such as Microsoft's SQL Server, Oracle's Oracle, and IBM's DB2. With the rise of big data and the explosive, exponential growth of data volume, big data processing technologies such as Hadoop, MapReduce, and Hive have become more and more widely used and accepted.

Big picture of the overall architecture of the offline data platform

1.2 Data warehouse technology

1. OLTP and OLAP

The data warehouse developed gradually out of the need for data analysis. The earliest data analysis and reports were produced from the databases of business systems, that is, OLTP databases, such as the commercial Oracle and MS SQL Server and the open source MySQL.

The full name of OLTP is Online Transaction Processing. As the name implies, an OLTP database is mainly used for transaction processing, such as adding an order, modifying an order, querying an order, and canceling an order. The core requirement of an OLTP database is efficient, fast processing of individual records; techniques such as indexing and splitting databases and tables exist fundamentally to solve this problem.

  • This is naturally at odds with the needs of data analysis. Data analysis usually requires access to large amounts of data; analyzing a single record means little. Data analysis not only accesses large volumes of data but also runs frequent statistics and queries. Database administrators soon found that these statistical analysis requests took up a great deal of database resources and seriously affected the performance of the production system.
  • Therefore, it was a natural choice to isolate these data analysis requests on a separate standby database, or to copy out an entirely new database for data analysts to use.

After solving the impact on the production database, OLTP database administrators quickly discovered that standby and replicated databases still could not meet the needs of data analysts, especially in terms of performance. Large data accesses usually require full table scans, and frequent, often concurrent full table scans make an OLTP database respond extremely slowly or even go down. New theoretical support and technical breakthroughs were needed to satisfy these analysis requests.

So the OLAP database came into being. It is a specialized analytical database developed to meet the statistical analysis needs of analysts.

  • An OLAP database can itself process and aggregate large amounts of data and, unlike an OLTP database, does not need to handle inserts, updates, deletes, and concurrency lock control. An OLAP database generally only needs to handle query requests, and data is imported in batches, so technologies such as columnar storage, column compression, and bitmap indexes can greatly speed up query responses.

Simple comparison of OLTP and OLAP databases

2. Analytical database

Analytical databases mainly serve analysts and business analysts performing statistics and aggregation over large data sets. Their architecture, design, and principles are completely different from those of traditional database products (OLTP databases). Generally speaking, data warehouse products must be distributed, but the problems they solve clearly differ from those of distributed OLTP databases. A distributed OLTP database (using techniques such as splitting databases and tables) mainly addresses the pressure of a large number of single-record requests, and its main purpose is to distribute all user requests evenly across the nodes. A distributed OLAP database, by contrast, splits a user's request over a large data set into tasks that each node computes independently, then aggregates the results and returns them to the user.

In addition, OLAP databases generally use columnar storage, while OLTP generally uses row storage.

  • So-called columnar storage stores each column of a table together, rather than storing all the fields of a row together as row storage does.
  • For a database table, each column's type is fixed, so columnar storage can easily apply high-compression-ratio algorithms for compression and decompression, and disk I/O is greatly reduced.
  • Columnar storage is very well suited to statistical queries over large data volumes, because analysis and statistics usually target one column or a few columns: a columnar database only needs to read and process the corresponding columns instead of reading every row of the entire table (see the sketch after this list).
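
To make the difference concrete, here is a minimal sketch using pandas and Parquet (a columnar file format) as a stand-in for a columnar store; the file names and generated data are hypothetical:

```python
# Minimal sketch: contrast a row-oriented read (CSV) with a column-oriented
# read (Parquet) when only one column is needed for a statistic.
import numpy as np
import pandas as pd

# A hypothetical "orders" table with several columns.
orders = pd.DataFrame({
    "order_id": np.arange(1_000_000),
    "user_id": np.random.randint(0, 100_000, size=1_000_000),
    "amount": np.random.rand(1_000_000) * 100,
    "status": np.random.choice(["paid", "cancelled"], size=1_000_000),
})

# Row-oriented path: the CSV text must be scanned in full even if we only need one column.
orders.to_csv("orders.csv", index=False)
amount_sum_row = pd.read_csv("orders.csv")["amount"].sum()

# Column-oriented path: Parquet stores columns separately, so we can read just "amount".
orders.to_parquet("orders.parquet", index=False)
amount_sum_col = pd.read_parquet("orders.parquet", columns=["amount"])["amount"].sum()

print(amount_sum_row, amount_sum_col)
```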

3. Hadoop data warehouse

With the maturing of Hadoop over the years and the rise of the Hadoop ecosystem, Hadoop-based data warehouses have taken over the main track in just a few years. In Internet companies in particular, a Hadoop-based data warehouse is basically standard equipment.

Hadoop's inherent technical genes mean that Hadoop-based data warehouse solutions (currently mainly Hive) are very easy to scale (nodes can easily be added to grow data-processing capacity from GB and TB to PB or even EB), and the cost is very low (no expensive commercial servers and storage are needed, only ordinary hardware, with the Hadoop framework handling fault tolerance). These two points are also key factors in Hadoop's growing popularity in Internet companies.

  • The biggest challenge facing Hadoop-based (especially Hive-based) data warehouse solutions is query latency: Hive's latency is generally on the order of minutes, depending on the complexity of the Hive SQL and the amount of data to be processed, and in many cases a query even takes hours, although simple Hive SQL over small data may return in a few seconds (a minimal sketch of submitting such an analytical query follows this list).
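
As an illustration, here is a minimal sketch of running a typical warehouse aggregation through the Hive metastore with PySpark; the database and table names (dw.orders) and columns are hypothetical, and a working Hive/Spark deployment is assumed:

```python
# Minimal sketch: an analytical query over a Hive-registered table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-warehouse-query")
    .enableHiveSupport()          # read tables registered in the Hive metastore
    .getOrCreate()
)

# A typical analytical query: daily sales per category.
# On a large table this runs as a distributed batch job, so latency is
# minutes rather than milliseconds.
daily_sales = spark.sql("""
    SELECT dt, category, SUM(amount) AS total_amount
    FROM dw.orders
    GROUP BY dt, category
""")

daily_sales.show(20)
```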

But big data and cloud computing are the future, and future business systems will also run in the cloud, whether private, public, or hybrid. The cloud also dictates that future architectures must be distributed and able to scale approximately linearly. Based on this, the author believes that Hadoop and Hadoop-like data warehouse solutions will become the future mainstream and standard, for Internet companies and traditional enterprises alike.

1.3 Data warehouse modeling technology

Since the concept of the data warehouse was born, two widely recognized methods of building data warehouses have emerged, represented by Bill Inmon and Ralph Kimball. Bill Inmon is called the "father of the data warehouse" and Ralph Kimball the "father of business intelligence."

Since these two viewpoints appeared, the debate about "which architecture is best" has never stopped. Everyone has an opinion, yet no unified conclusion has been reached, much like the debate over "which programming language is the best programming language." It can be called the "religious war" of the data warehouse field.

1. Ralph Kimball Modeling Methodology

Kimball's theoretical contributions to data warehousing all relate to dimensional design and modeling. Dimensional modeling divides the objective world into measurements and context.

  • Measurements are captured by the organization's business processes and the source systems that support them. They usually appear in numeric form (such as order amount or inventory quantity); dimensional modeling theory calls them facts;
  • Facts are surrounded by a large amount of textual context, and this context can often be divided intuitively into several independent logical blocks, which dimensional modeling calls dimensions. Dimensions describe the five Ws of a fact (When, Where, What, Who, Why), such as when the order was placed, where it was placed, what was bought, and who the customer is.

A Kimball data warehouse built with dimensional modeling theory is usually presented as a star schema: at the center of the star is the fact table, and around the fact table are dimension tables covering the various angles.

Kimball Dimensional Modeling Star Architecture
In dimensional modeling, because the star schema closely follows the business process, is very intuitive, and matches the perspective of business personnel, it is used very widely. The star schema is also one of Kimball's major contributions to data warehouse modeling; a minimal sketch of a star schema is shown below.
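
As an illustration (with hypothetical table and column names), here is a minimal sketch of a star schema defined and queried with Spark SQL: one fact table holding the measurements, joined to dimension tables that provide the context:

```python
# Minimal sketch of a star schema: fact table in the middle, dimension tables around it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_key BIGINT, customer_name STRING, city STRING)
    USING parquet
""")
spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_product (
        product_key BIGINT, product_name STRING, category STRING)
    USING parquet
""")
spark.sql("""
    CREATE TABLE IF NOT EXISTS fact_order (
        order_id BIGINT, customer_key BIGINT, product_key BIGINT,
        order_date DATE,
        amount DECIMAL(18, 2)      -- the numeric measurement, i.e. the fact
    )
    USING parquet
""")

# A typical star join: the fact provides the measure, the dimensions the context.
spark.sql("""
    SELECT d.category, c.city, SUM(f.amount) AS total_amount
    FROM fact_order f
    JOIN dim_product  d ON f.product_key  = d.product_key
    JOIN dim_customer c ON f.customer_key = c.customer_key
    GROUP BY d.category, c.city
""").show()
```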

Kimball's second major contribution to data warehouse modeling theory is the dimension-based "bus architecture". In real projects, an enterprise's business processes are usually diverse and complex and spread across multiple business subjects. The bus architecture and conformed dimensions together ensure that the fact tables and dimension tables of the different subjects can ultimately be integrated, providing a consistent and unique caliber to business personnel.

The data warehouse system architecture using Kimball modeling theory is shown in the figure:

Data warehouse system architecture using Kimball modeling theory
As can be seen, the subjects of a Kimball dimensional model are based on the star schema, and conformed dimensions and the enterprise bus architecture are used between subjects to ensure the integration and consistency of the data warehouse.

2. Bill Inmon Modeling Methodology

In the field of data warehousing, Bill Inmon was the first to propose the concepts of OLAP and the data warehouse, which is why he is called the father of the data warehouse. Bill Inmon has written many books and articles introducing his data warehouse method. He defines the data warehouse as "a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decisions."

Unlike other database applications, a data warehouse is more like a process: a process of integrating, processing, and analyzing business data distributed across the enterprise, rather than a product that can simply be purchased. This is what he calls the "Corporate Information Factory".

Enterprise-level data warehouse system architecture using Bill Inmon modeling theory
Inmon's Corporate Information Factory includes source systems, a staging area, ETL, the enterprise data warehouse, and data marts, with the enterprise data warehouse as the hub of the information factory. Unlike Kimball, Inmon holds that the enterprise data warehouse should integrate atomic data and should be modeled with the third normal form and ER theory, rather than with the fact and dimension tables of dimensional modeling.

Inmon's Corporate Information Factory also involves the concept of the "data mart": a data mart is a department-level data warehouse. For data marts, Inmon advocates extracting the required data from the enterprise data warehouse to guarantee data consistency. The problem this brings is that the enterprise data warehouse must be built before department-level data marts can be built; this is the second major difference between the Inmon and Kimball architectures. At the same time, Inmon agrees that Kimball's dimensional modeling theory should be used to build the data marts.

3. Data warehouse modeling practice

From the above introduction of the two architectures, it can be seen that Inmon's method is a top-down approach to building a data warehouse. It advocates first planning the enterprise data warehouse as a whole and concentrating the data from the various OLTP systems into a subject-oriented, integrated, non-volatile, and time-variant enterprise data warehouse. Data marts should be subsets of the data warehouse, each designed specifically for an independent department.

The Kimball method is the opposite: bottom-up. Kimball believes that a data warehouse is a collection of data marts; an enterprise can incrementally integrate its various data marts through conformed dimension tables and the "enterprise bus architecture" to build the data warehouse of the whole enterprise.

Summarize their differences in one sentence:

Kimball: let people build what they want when they want it, we will integrate it all when and if we need to.

Inmon: don’t do anything until you have designed everything.

Inmon's method has a long deployment and development cycle, but it is easy to maintain and highly integrated; Kimball's method can respond quickly to business needs and build a data warehouse rapidly, but later integration and maintenance are more troublesome. Neither is absolutely right or wrong; each has pros and cons at different stages and in different scenarios.

4. Data warehouse logical architecture design

Offline data warehouses are usually built on dimensional modeling theory, and in addition they are usually layered logically. Logical layering of the data warehouse is also an industry best practice.

The logical layering of offline data warehouses is mainly based on the following considerations:

Isolation: users should consume data carefully processed by the data team rather than raw data from the business systems. One benefit is that users get carefully prepared, standardized, clean data organized from a business perspective, which is very easy to understand and use; the second is that if an upstream business system changes or is even refactored (for example, its table structure or the business meaning of its fields), the data team takes responsibility for handling all of these changes, minimizing the impact on downstream users.

Performance and maintainability: professional people do professional things. Layering keeps data processing largely inside the data team, so the same business logic does not have to be computed repeatedly, saving the corresponding storage and computation overhead; after all, big data is not cost-free. In addition, layering makes maintenance of the data warehouse clear and convenient: each layer is responsible only for its own tasks, and if the processing of a certain layer has a problem, only that layer needs to be fixed.

Standardization: for a company or organization, data calibers are very important. When an indicator is discussed, it must be based on a clear and agreed-upon caliber. In addition, tables, fields, and indicators must also follow standards.

The data warehouse is generally divided into the following layers:

ODS layer: the data warehouse usually stores the source systems' tables as-is; this is called the ODS (Operational Data Store) layer, often also referred to as the staging area. The ODS layer is the data source for the subsequent warehouse layers (that is, the fact and dimension tables produced through Kimball dimensional modeling, plus the summary-layer data processed from those fact and dimension tables), and it also stores historical incremental or full data.

DWD and DWS layers: the Data Warehouse Detail (DWD) layer and the Data Warehouse Summary (DWS) layer are the main body of the data platform. DWD and DWS data are produced from the ODS layer through ETL cleaning, transformation, and loading; they are usually built on Kimball's dimensional modeling theory, with conformed dimensions and the data bus guaranteeing dimensional consistency across the subject areas.

ADS layer: the application layer mainly consists of data marts (Data Mart, hereinafter DM) built by the various business parties or departments on top of DWD and DWS. A data mart is defined relative to the data warehouse (Data Warehouse, hereinafter DW) formed by DWD/DWS. Generally, application-layer data comes from the DW layer, and in principle direct access to the ODS layer is not allowed. In addition, compared with the DW layer, the application layer contains only the detail-level and summary-level data that the department or business side cares about.

The logical layered architecture of the data warehouse (ODS layer → DW layer → application layer) is shown in the figure:

Data warehouse logical layered architecture
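
To make the ODS → DWD → DWS flow concrete, here is a minimal sketch with PySpark; the database and table names (ods.orders, dwd.fact_order, dws.order_daily_summary) and fields are hypothetical, and a Hive metastore is assumed:

```python
# Minimal sketch: promote data across layers, raw ODS -> cleaned DWD detail -> DWS summary.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("layered-dw-sketch").enableHiveSupport().getOrCreate()

# ODS: source-system orders landed as-is.
ods_orders = spark.table("ods.orders")

# DWD: cleaned, standardized detail data (drop bad rows, normalize the status field).
dwd_orders = (
    ods_orders
    .where(F.col("order_id").isNotNull())
    .withColumn("order_status",
                F.when(F.col("status").isin("1", "success"), "SUCCESS")
                 .otherwise("FAILED"))
    .select("order_id", "user_id", "order_status", "amount", "dt")
)
dwd_orders.write.mode("overwrite").saveAsTable("dwd.fact_order")

# DWS: daily summary built on top of the detail layer.
dws_daily = (
    dwd_orders.groupBy("dt", "order_status")
              .agg(F.sum("amount").alias("total_amount"),
                   F.count("order_id").alias("order_cnt"))
)
dws_daily.write.mode("overwrite").saveAsTable("dws.order_daily_summary")
```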

2. Architecture, technology and design of the real-time data platform

The cycle at which an offline data platform produces data is generally a day; in other words, what you see today is yesterday's data. For most analytical, "seeing" scenarios, this T+1 offline data satisfies business analysis needs. But as business operations become more refined, the requirements for data timeliness keep rising, and more and more business scenarios need to see business effects immediately, especially during big promotion activities (typically the Double 11 and 618 promotions).

More importantly, with the rise of the artificial intelligence wave, real-time data is no longer a nice-to-have but a must-have. Data is no longer only for analysis and "seeing"; together with algorithms, it becomes part of the production business systems.

2.1 The overall architecture of the real-time data platform

The supporting technologies of the real-time data platform mainly cover four aspects: real-time data collection (such as Flume), message middleware (such as Kafka), stream computing frameworks (such as Storm, Spark Streaming, Flink, and Beam), and real-time data storage (such as the column-family store HBase). Mainstream real-time data platforms today are built on these four kinds of technologies.

Big picture of the overall architecture of the real-time data platform
The real-time data platform must first ensure that the data sources themselves are real-time. Data sources can generally be divided into two categories: databases and log files. For the former, the industry best practice is not to extract data by querying the database directly, but to collect the database change logs.

For MySQL this is the binlog, MySQL's change log, which records the state of the data before and after each change.

The speed and frequency at which data collection tools (such as Flume) pick up binlog events usually depend on the source system, and they usually do not match the processing speed of downstream real-time processing tools (stream computing frameworks and platforms such as Storm, Spark, and Flink). In addition, real-time processing often needs to restart from some historical point in time, and several real-time tasks may consume the same source data. Therefore, message middleware is usually used as a buffer to match real-time data collection with real-time data processing.

Real-time data is stored in different data stores according to how the downstream uses it. For data serving (that is, the data consumer passes in a business ID and gets back all related fields for that ID), the data is usually placed in HBase; for real-time dashboards and large screens, it is usually placed in some relational database (such as MySQL); and sometimes, to improve performance and reduce the pressure on the underlying database, a cache (such as Redis) is also used. A minimal sketch of such a pipeline follows.
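
Here is a minimal sketch, assuming the kafka-python, happybase, and redis client libraries and reachable services; the topic, table, host, and key names are all hypothetical:

```python
# Minimal sketch: consume real-time events from Kafka, serve them by business ID
# from HBase, and keep a running total in Redis for a real-time dashboard.
import json

from kafka import KafkaConsumer   # message middleware buffer
import happybase                  # HBase for serving by business ID
import redis                      # cache layer for hot keys / dashboards

consumer = KafkaConsumer(
    "order-events",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
hbase = happybase.Connection("hbase-host")
orders_table = hbase.table("rt_orders")
cache = redis.Redis(host="redis-host", port=6379)

for message in consumer:
    event = message.value                      # e.g. {"order_id": "...", "amount": 12.5, "status": "..."}
    row_key = event["order_id"].encode("utf-8")

    # Serve-by-ID store: all fields of this order, keyed by order_id.
    orders_table.put(row_key, {
        b"cf:amount": str(event["amount"]).encode("utf-8"),
        b"cf:status": event["status"].encode("utf-8"),
    })

    # Dashboard store: running total for a real-time big screen.
    cache.incrbyfloat("gmv:today", float(event["amount"]))
```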

2.2 Stream computing technology

The popularity and acceptance of stream computing began with Storm, which appeared around 2011. Storm quickly became known and accepted as the "real-time Hadoop".

So, what is stream computing? How does it differ from offline batch processing? Unlike offline batch processing (such as Hadoop MapReduce), stream computing has the following typical characteristics:

Unbounded: the data source of stream computing is continuous, like a river that never stops flowing. Accordingly, stream computing tasks need to run continuously.

Triggering: unlike Hadoop offline tasks, which are launched by scheduled jobs, each computation of a stream computing task is triggered by the arrival of source data. Triggering is a very important concept in stream computing; in some business scenarios the triggering logic is quite complex, which poses a great challenge to stream computing.

Latency: obviously, stream computing must process data efficiently and quickly. Unlike offline Hadoop tasks, whose processing delays are at least minutes or even hours, the delay of stream computing is usually seconds or even milliseconds; minute-level delays are accepted only in special circumstances.

Historical data: if a Hadoop offline task finds a problem in some historical day's data, it is usually easy to fix the problem and rerun the task; for a stream computing task this is basically impossible or very costly, because real-time stream messages are usually not retained for long (typically a few days) and the complete historical scene basically cannot be preserved. Real-time stream computing can therefore generally only repair data from the moment the problem is discovered onward; historical data cannot be back-filled through the stream.

At the level of the underlying mechanism, stream computing currently has two implementation approaches. One imitates offline batch processing by using micro-batches (mini batches): micro-batching raises throughput, but the data delay also increases, typically to the second or minute level; the typical technology is Spark Streaming. The other processes native message data, that is, the processing unit is a single record. Early native stream computing had low latency (usually tens of milliseconds) but limited throughput; the typical example is the native Storm framework, although with the emergence and development of Flink, throughput is no longer a problem. A minimal micro-batch sketch follows.
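
To illustrate the micro-batch approach, here is a minimal sketch with Spark Structured Streaming; the Kafka broker address and topic are hypothetical, and the Spark–Kafka connector is assumed to be on the classpath:

```python
# Minimal sketch: process an unbounded Kafka stream as a series of micro-batches.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("micro-batch-sketch").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "order-events")
    .load()
)

# Count events per 1-minute window; each trigger processes one micro-batch.
counts = (
    events
    .withColumn("ts", F.col("timestamp"))
    .groupBy(F.window("ts", "1 minute"))
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .trigger(processingTime="10 seconds")   # micro-batch interval: latency vs. throughput trade-off
    .start()
)
query.awaitTermination()
```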

2.3 Mainstream open source stream computing frameworks

Comparison of mainstream stream computing technologies

3. Data management

For a company or organization, having data technology alone is not enough; the data must also be managed. The scope of data management is very broad, but it mainly includes data exploration, data integration, data quality, metadata management, and data masking.

3.1 Data exploration

Data exploration, as the name suggests, means analyzing the content of the data itself and the relationships among the data, including but not limited to: whether the required data is available, what fields exist, whether the meanings of the fields are standardized and clear, and the distribution and quality of the fields. Commonly used data exploration techniques include primary/foreign key analysis, field types, field lengths, proportion of null values, distribution of enumeration values, minimum, maximum, average, and so on. A minimal profiling sketch is shown below.
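
As an illustration, here is a minimal sketch of such profiling with pandas, assuming a hypothetical sample file and column names:

```python
# Minimal sketch: basic data exploration / profiling with pandas.
import pandas as pd

df = pd.read_csv("ods_orders_sample.csv")

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_ratio": df.isna().mean(),          # proportion of missing values per field
    "distinct_values": df.nunique(),         # helps spot candidate keys / enum fields
})
print(profile)

# Numeric summary: min, max, mean, etc.
print(df.describe())

# Enumeration-value distribution for a hypothetical status field.
print(df["order_status"].value_counts(dropna=False))
```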

Data exploration is divided into strategic and tactical.

  • Strategic data exploration refers to lightweight analysis of a data source before the data is used, to determine whether it is usable and how stable it is, and thus whether it can be brought into the data platform. Strategic data exploration is the first task before building a data platform; unqualified data sources must be eliminated as early as possible, because discovering that a data source is unqualified at a later stage has a major impact on the construction of the data platform.
  • Tactical data exploration refers to detailed analysis of the data by technical means, discovering as many data quality problems as possible and feeding them back to business personnel or notifying the source system so that they can be improved.

3.2 Data integration

The data integration of the data warehouse is also called ETL (extract, transform, load). It is the core of data platform construction, and also the stage that consumes the most time and effort in building a data platform.

ETL generally refers to the process of extracting data from data sources, applying transformations such as cleaning, conversion, and association, and finally loading the data into the data warehouse according to a pre-designed data model.

Data platform users and business personnel usually do not know or care how many sources a piece of data (such as orders) comes from, which databases it lives in, or whether field definitions are consistent (for example, order system 1 uses 1 for a successful order and 0 for a failed one, while system 2 uses "success" and "fail"), nor which related tables exist (such as the customer's profile information or the product category). What data users ultimately want to see is a standardized wide table containing all the relevant order information. All of the extraction, cleaning, transformation, association, and final aggregation behind the scenes are done by ETL, and this is one of the important values a data warehouse brings to its users. A minimal sketch of such a consolidation is shown below.
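
Here is a minimal sketch of that kind of ETL step with PySpark; all database, table, and column names are hypothetical:

```python
# Minimal sketch: harmonize status codes from two order systems and join
# dimension data into one standardized wide table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-wide-table-sketch").enableHiveSupport().getOrCreate()

# Two sources with different status conventions: 1/0 vs. "success"/"fail".
sys1 = spark.table("ods.orders_sys1").withColumn(
    "order_status", F.when(F.col("status") == 1, "SUCCESS").otherwise("FAILED"))
sys2 = spark.table("ods.orders_sys2").withColumn(
    "order_status", F.when(F.col("status") == "success", "SUCCESS").otherwise("FAILED"))

cols = ["order_id", "customer_id", "product_id", "amount", "order_status"]
orders = sys1.select(*cols).unionByName(sys2.select(*cols))

# Join the customer profile and product category dimensions to build the wide table.
customers = spark.table("dwd.dim_customer")
products = spark.table("dwd.dim_product")

order_wide = (
    orders
    .join(customers, "customer_id", "left")
    .join(products, "product_id", "left")
)
order_wide.write.mode("overwrite").saveAsTable("dws.order_wide")
```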

3.3 Data quality

Data quality is mainly measured from the following four aspects:

Completeness: completeness refers to whether data is missing. Missing data may mean an entire record is absent or that a field within a record is absent. The value that can be drawn from incomplete data is greatly reduced, and completeness is the most basic criterion for evaluating data quality.

Consistency: consistency refers to whether the data complies with the agreed standards and whether the data set keeps a uniform format. Consistency is mainly reflected in whether data records follow the specification and whether the data is logically valid. For example, an IP address must consist of four numbers from 0 to 255 joined by ".".

Accuracy: accuracy refers to whether the information recorded in the data contains anomalies or errors and whether it meets business expectations.

Timeliness: timeliness refers to whether data is produced on time, punctually, and as expected. A minimal sketch of checks for these four aspects follows.
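
As an illustration, here is a minimal sketch of automated checks for the four aspects, with a hypothetical DataFrame and column names:

```python
# Minimal sketch: simple checks for completeness, consistency, accuracy, and timeliness.
from datetime import datetime, timedelta

import pandas as pd

df = pd.read_parquet("dwd_fact_order.parquet")

# Completeness: no missing order_id or amount.
completeness_ok = df[["order_id", "amount"]].notna().all().all()

# Consistency: client_ip must be four 0-255 numbers joined by '.'.
ip_pattern = r"^((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$"
consistency_ok = df["client_ip"].astype(str).str.match(ip_pattern).all()

# Accuracy: order amounts should be positive and below a business-defined ceiling.
accuracy_ok = df["amount"].between(0, 1_000_000).all()

# Timeliness: the partition for yesterday must have been produced.
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
timeliness_ok = (df["dt"] == yesterday).any()

print(dict(completeness=completeness_ok, consistency=consistency_ok,
           accuracy=accuracy_ok, timeliness=timeliness_ok))
```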

3.4 Data masking

The data warehouse stores all of an enterprise's data, some of which is very sensitive, such as users' credit card and ID card information. If this information leaks, it brings catastrophic consequences to the enterprise; but excluding it entirely affects development, testing, analysis, and statistics.

Data masking is about processing data irreversibly so that the processed data can be used for development, testing, analysis, and statistics without revealing any sensitive information. Common methods include encryption, substitution, and deletion/scrambling. A minimal sketch follows.
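
Here is a minimal sketch of a few common masking strategies in Python; the field names, sample values, and salt handling are hypothetical:

```python
# Minimal sketch: hashing, partial substitution, and deletion of sensitive fields.
import hashlib

import pandas as pd

SALT = "a-secret-salt"   # hypothetical; in practice managed by a secrets service

def hash_value(value: str) -> str:
    """Irreversible hash so joins still work but the raw value is hidden."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

def mask_card(card_no: str) -> str:
    """Keep only the last 4 digits of a card number."""
    return "*" * (len(card_no) - 4) + card_no[-4:]

users = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "id_card": ["110101199001011234", "310101198512054321"],
    "credit_card": ["6222021234567890", "6222029876543210"],
    "phone": ["13800000001", "13900000002"],
})

masked = users.assign(
    id_card=users["id_card"].map(hash_value),         # hashing
    credit_card=users["credit_card"].map(mask_card),  # partial substitution
).drop(columns=["phone"])                             # deletion of a sensitive field

print(masked)
```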

4. Summary

Today I mainly introduced the scope of the data platform. We looked at the architecture, main concepts, and technologies of data platforms from both the offline and the real-time perspectives.

Offline data platforms are currently the main battlefield of data platforms. The related concepts and technologies (such as data warehouses, dimensional modeling, logical layering, Hadoop, and Hive) are relatively mature and widely used in companies of all kinds.

With rising requirements for data timeliness and the rise of artificial intelligence, real-time data platforms are receiving more and more attention and being given a strategic position. The related technologies, such as Storm, Flink, and Beam, are also developing and maturing. We need to keep paying attention to this area and actively embrace these technologies.

Life is not about surpassing others, but about surpassing yourself.

This is Yun Qi, see you next time.



Origin blog.csdn.net/BeiisBei/article/details/108757269