Explaining the underlying architecture of the data middle platform in detail

What exactly is a data middle platform? Several years after the concept emerged, opinions still differ.

The author believes that the data middle platform is not simply a system or a software tool, but an architecture together with a set of data flow patterns.

The data middle platform collects data as the raw material for data processing and data modeling, stores it by category, and then builds various data services (including data application platforms) around actual business scenarios to empower and accelerate the business.

Realizing this process, however, requires the support of corresponding systems and products. So what systems or products should a basic data middle platform consist of?

Let us first look at the data middle platform architectures of several enterprises.

(Figures: example data middle platform architectures from several enterprises)

It can be seen that although each enterprise's data middle platform differs with its business, the overall structure is basically the same: data passes through the stages of "data collection and access", "processing and storage", "unified management", and "service application".

Here, the author believes that the data middle platform architecture summarized in the book "Data Middle Office Product Manager: From Data System to Data Platform Practice" is fairly universal. Whether in the Internet industry or traditional industries, this architecture can be adapted to design and build one's own middle platform.

In general, the functional architecture of the data middle platform consists of three major parts: the big data platform, the data asset management platform, and the data service platform. Within the data service platform, the self-service analytics platform and the label management system have the widest range of application scenarios.

(Figure: functional architecture of the data middle platform)

1. Big data platform

The big data platform is the base of the data middle platform; we can also call it the big data development platform. It must provide big-data-related development capabilities: data storage, data cleaning/computing, data query and display, permission management, and so on. So how should these functions and services be built? Does having these capabilities amount to successfully building a big data platform?

In fact, the big data platform architectures of different companies are quite similar. They all include data acquisition components, data storage components, data computing engines, data permission and security components, and cluster management and monitoring components.

With the exception of a few companies such as Alibaba, which committed to building the self-developed "Apsara" (Feitian) system, most companies select open-source components for the underlying layer and then optimize, improve, and extend them. For example, HBase and Hive can serve as data storage components, and distributed engines such as Spark and Flink as data computing engines.
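The layered component choices described above can be sketched as a simple selection matrix. This is a hypothetical illustration: the component names are real open-source projects, but the layers, lists, and selection rule are invented here to show that each layer has interchangeable options and the "right" pick depends on the workload, not on novelty.

```python
# Hypothetical component-selection matrix for a big data platform.
# Each layer maps to interchangeable open-source choices.
PLATFORM_STACK = {
    "ingestion": ["Flume", "Sqoop", "Kafka Connect"],
    "storage":   ["HDFS", "Hive", "HBase"],
    "compute":   ["Spark", "Flink"],        # batch vs. streaming engines
    "security":  ["Ranger", "Kerberos"],
    "ops":       ["Ambari", "Prometheus"],
}

def pick(layer: str, requirement: str) -> str:
    """Toy selector: e.g. choose a storage engine by access pattern."""
    if layer == "storage":
        # Random single-row reads favor HBase; batch scans favor Hive.
        return "HBase" if requirement == "random-read" else "Hive"
    return PLATFORM_STACK[layer][0]

print(pick("storage", "random-read"))  # HBase
print(pick("storage", "batch-scan"))   # Hive
```

The point of the sketch is the selection logic, not the table: like the desktop-computer analogy below, the same catalog of parts yields very different platforms depending on the requirements fed into `pick`.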

Since everyone selects the same or similar components, why is there still a gap in service capability between the big data platforms of different enterprises? It is somewhat like buying parts to assemble a desktop computer: the parts need not be the most expensive, but they should be the ones most suitable for the actual needs.

A useful big data platform must solve problems for its users. Building the big data platform of a data middle platform is therefore not a contest of how many new technologies are adopted or how many technical components are covered. What matters is whether it can handle the complex data landscape faced during construction, whether it can serve as the technical guarantee for breaking down data barriers, whether it can provide simple and effective data processing tools (such as self-service configurable data collection and data cleaning tools), and whether it can deliver additional value.

Building the big data platform inside the data middle platform avoids the waste of resources caused by each division's technical team building its own big data cluster. For an enterprise, a unified, mature big data platform cannot be achieved overnight; it must be implemented step by step, building out the enterprise's big data platform ecosystem through continuous iteration.

2. Data asset management platform

The data asset management platform mainly addresses the management of data resources. Data assets are spread across the various big data components (Hive tables, HBase tables, Druid datasources, streams in Kafka), and the management and control systems of these components have difficulty interoperating. A unified data asset management service is therefore needed to coordinate the management of big data resources.

With the big data platform in place, building a data system in the data middle platform becomes possible. By classifying and integrating the data of each business line, we can construct the various data subject domains, complete standardized data storage, and form data assets, and then carry out data asset management.

In the data middle platform system, the data asset management platform consists mainly of metadata management and data model management. Let's look at each in turn.

  • Metadata management

To talk about metadata management, we first need to figure out what metadata is.

Metadata is usually defined as "data about data": descriptive information about data and information resources. Metadata is the most important of all data.

Here is a simple example. When we go to a library, we face tens of thousands of books directly, and finding one is naturally difficult. But by entering the book's title, author, publisher, and other information into the library's query system, we can obtain its exact location. The title, author, and similar attributes can be understood as metadata, while the book's storage location, borrowing history, and so on are ordinary data in the system.

In a database, each table's metadata includes its table name, creation information (creator, creation time, department), modification information, table fields (field name, field type, field length, etc.), and the relationships between this table and other tables.
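Concretely, a single table's metadata record might look like the following. This is an illustrative sketch: the table names, column names, and owner fields are all invented, and a real system would store this in a metadata repository rather than a literal dictionary.

```python
# Hypothetical metadata record for one data table, covering the
# attributes listed above: naming, creation/modification info,
# field definitions, and relationships to other tables (lineage).
table_metadata = {
    "table_name": "dwd_order_detail",
    "creator": "zhang.san",
    "created_at": "2021-06-01",
    "department": "trade-platform",
    "last_modified": "2021-09-15",
    "columns": [
        {"name": "order_id", "type": "BIGINT", "comment": "order primary key"},
        {"name": "user_id",  "type": "BIGINT", "comment": "references dim_user"},
        {"name": "amount",   "type": "DECIMAL(10,2)", "comment": "paid amount"},
    ],
    "upstream": ["ods_order_raw"],   # tables this one is derived from
    "downstream": ["ads_gmv_daily"], # tables derived from this one
}

# A "global view of data assets" is then just queries over such records,
# e.g. find every table a given column flows into:
print(table_metadata["downstream"])  # ['ads_gmv_daily']
```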

There are in fact many ways to classify metadata. The author prefers to classify it by usage, into three categories: business metadata, technical metadata, and management metadata.

►Business metadata: describes the business meaning and business rules of the data, including business rules, data dictionaries, and security standards. Clarifying business metadata gives everyone a shared understanding of the data, eliminates data ambiguity, and lets business users who do not understand databases still understand the contents of a data table.

►Technical metadata: describes data source information, data flow information, and data structure information, mainly serving data developers; it lets developers understand the structure of data tables and the upstream and downstream tasks they depend on. It mainly includes database table fields (storage location, database tables, field lengths and types), data models, ETL scripts (scheduling information), SQL scripts, and so on.

►Management metadata: describes the management ownership of data, including business ownership, system ownership, operations and maintenance ownership, and data permission ownership; it is the basis of data security management.

Hence some say that metadata records the whole life cycle of data: like a "dictionary" of data, it lets us look up the meaning and origin of each field; like a "map", it lets us trace the path along which the data was produced.

Through the construction of the data system, the metadata of the data middle platform aggregates data information from the enterprise's business lines and systems, giving the middle platform a global view of data assets and achieving the goal of unified data asset query and access.

Metadata management includes creating, deleting, and editing metadata, version management, metadata statistics and analysis, and metamodel management. Through these functional modules, the data system can be implemented in a planned way and the metadata can be structured and modeled, which both avoids metadata clutter and redundancy and makes it easier for users to query and locate data.
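The management operations just listed can be sketched as a tiny registry. This is a minimal sketch under stated assumptions: an in-memory, name-keyed store where every update appends a new version (so history is retained for auditing); a real system would back this with a database and a metamodel layer.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataRegistry:
    """Toy metadata registry: CRUD plus version history and statistics."""
    _store: dict = field(default_factory=dict)  # name -> list of versions

    def create(self, name: str, record: dict) -> int:
        self._store.setdefault(name, []).append(record)
        return len(self._store[name])            # 1-based version number

    def read(self, name: str, version: int = -1) -> dict:
        # version=-1 means "latest"; otherwise versions are 1-based
        idx = version if version == -1 else version - 1
        return self._store[name][idx]

    def update(self, name: str, record: dict) -> int:
        # Updates append a new version rather than overwrite in place
        return self.create(name, record)

    def delete(self, name: str) -> None:
        self._store.pop(name, None)

    def stats(self) -> dict:
        """Metadata statistics: how many versions each asset has."""
        return {name: len(v) for name, v in self._store.items()}

reg = MetadataRegistry()
reg.create("dwd_order", {"owner": "team-a"})
reg.update("dwd_order", {"owner": "team-b"})
print(reg.read("dwd_order"))            # latest: {'owner': 'team-b'}
print(reg.read("dwd_order", version=1)) # audit:  {'owner': 'team-a'}
```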

  • Data Model Management

When introducing metadata, we mentioned that technical metadata includes data models. A data model here refers to the work product of data modeling performed with the help of metadata.

By drawing on metadata about how the underlying data is actually used, such as table association information and SQL script information (data aggregation and query patterns), the modeler can better abstract the business and improve modeling efficiency.

The data model is an effective means of data integration. It defines the mapping relationships between the various data sources and provides the "construction blueprint" for building data subject areas.

At the same time, in the process of data modeling, clarifying data standards both ensures data consistency and eliminates redundant data.

Data model management means that, during data modeling, an established data model management system handles the creation, deletion, modification, and querying of data models, while complying with data standardization and unification requirements to ensure data quality.

3. Data service platform

  • Self-service analytics platform

The self-service analytics platform is also known as the business intelligence (BI) platform. The BI platform has become standard equipment in many enterprises, and competition in the commercial BI market is increasingly fierce. The entrants fall into three categories:

►Domestic (Chinese) BI vendors, typified by FanRuan, which has held the largest domestic market share for many consecutive years

►Foreign BI vendors, such as Tableau

►Products incubated internally by the Internet giants

The BI platform is the main outlet for the service capabilities of the data middle platform. For the middle platform to deliver its full value, building a BI platform is essential, and its construction should be planned as part of the overall middle platform system. Overall, the BI platform should have the following capabilities.

(1) Data access

In addition to the data middle platform's own data sources, the BI platform needs to support access to external data sources. There are three main access methods.

►File upload: supports uploading file data such as Excel.

►Data connection: supports databases such as MySQL and Oracle, as well as big data platforms such as Hadoop and Spark (the data middle platform's own big data platform is also accessed this way).

►API read: supports obtaining third-party system data through APIs.

(Figure: data sources supported by the FanRuan BI platform)

(2) Data processing

The BI platform needs to provide users with data modeling tools that help them create target data (datasets). These include basic operations such as dragging and dropping table fields, automatically identifying dimensions/indicators, custom view statements, data preview, virtual fields, function calculation, and parameter setting, as well as data processing functions such as JOIN/UNION across heterogeneous sources.

(Figure: FineBI self-service dataset data processing interface)
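Under the hood, the dataset-building step boils down to familiar relational operations. The sketch below uses Python's built-in sqlite3 as a stand-in for the BI platform's processing engine; the table names, columns, and values are invented for illustration, and a real multi-source JOIN would first pull the two tables from different systems.

```python
import sqlite3

# Two "sources": a user dimension table and an order fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_users (user_id INT, city TEXT);
    CREATE TABLE order_sys (user_id INT, amount REAL);
    INSERT INTO crm_users VALUES (1, 'Beijing'), (2, 'Shanghai');
    INSERT INTO order_sys VALUES (1, 99.0), (1, 25.0), (2, 10.0);
""")

# JOIN the sources, then aggregate an indicator (total amount)
# per dimension value (city) -- the essence of building a dataset.
rows = conn.execute("""
    SELECT u.city, SUM(o.amount) AS gmv
    FROM crm_users u
    JOIN order_sys o ON u.user_id = o.user_id
    GROUP BY u.city
    ORDER BY u.city
""").fetchall()
print(rows)  # [('Beijing', 124.0), ('Shanghai', 10.0)]
```

The drag-and-drop interface described above essentially generates queries of this shape; "dimensions" become GROUP BY columns and "indicators" become aggregate expressions.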

(3) Data analysis and visualization

Building on data processing, the BI platform also needs to provide users with rich chart building and online analytical processing (OLAP) operations, allowing users to complete data analysis and data visualization on the front-end page.

The workflow is as follows: the user selects a processed dataset, filters dimensions and indicators, and then completes the analysis of the business requirement through operations such as roll-up/drill-down, chart linkage, and report jumping. The BI platform also provides visualization components with which users finalize the design of the visual content.


(4) Content distribution and basic services

The BI platform needs to be able to distribute visual content while controlling viewing permissions and data permissions. The main distribution channels include the BI platform itself, mobile BI (app), large data screens, email, link sharing, and third-party embedding.

At the same time, the BI platform also needs basic functions such as operations management, role management, a help center, and message push.

Only a BI platform that provides the above functions, along with service capabilities such as multidimensional analysis, data visualization, and large-screen dashboards, can maximize its value within the data middle platform system and effectively help analysts and operations teams improve their efficiency.

  • Label Management System

In addition to the BI platform, the label (tag) management system is another important application direction of data services. Business departments currently face a large number of precision marketing scenarios: personalized recommendations and pushes must be implemented on top of a complete and accurate user portrait, and user portraits in turn must be supported by a large number of comprehensive user tags.

Therefore, as the basic data of personalized business applications, the credibility and effectiveness of tag data become key indicators of the maturity of user portraits.

We can regard the tag management system as the base of the user portrait system. On top of the data system created by the data middle platform, tag management naturally breaks through data barriers, enabling an enterprise-level, unified user tag system, and from it an enterprise-level user portrait system.

The label management system of the data middle platform mainly provides the following functions.

(1) Unique user identification

Many enterprises have an independent user identification system in each business line. In 58 Group, for example, there are identification methods such as 58 device fingerprints, Anjuke unique users, recruitment natural persons, and finance natural persons. Most of these serve a single business line, and the tags within each business line are likewise developed against that line's own user IDs.

The label management system of the data middle platform can provide a unified user identification service that associates and unifies the independent user IDs of each business line, thereby establishing an enterprise-wide scheme for converting between each line's independent user IDs and tags.
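One common way to implement this kind of ID unification is the classic union-find (disjoint-set) structure, often called "ID mapping": identifiers observed together (for example a device ID and an account seen on the same login event) are merged into one logical user. The IDs and linkages below are invented for illustration; this is a sketch of the technique, not of any particular vendor's implementation.

```python
# Union-find over user identifiers from different business lines.
parent = {}

def find(x: str) -> str:
    """Return the canonical representative of x's identity cluster."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

def union(a: str, b: str) -> None:
    """Record that a and b were observed to be the same person."""
    parent[find(a)] = find(b)

# Hypothetical observed linkages across business lines:
union("device:58-abc", "account:anjuke-1001")   # same login session
union("account:anjuke-1001", "account:hr-778")  # same verified identity

# All three identifiers now resolve to one unified user.
print(find("device:58-abc") == find("account:hr-778"))  # True
```

Tags developed against any one of the linked IDs can then be served against the unified representative, which is what makes enterprise-wide tag interchange possible.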

(2) Label system management

The main task of tag system management is to formulate tag data and information interaction plans and to break down the information and data barriers in user portrait development and services. It provides functions such as tag access, visual tag information display, visual tag permission control, visual user tag analysis, visual audience extraction, and lookalike audience expansion.

(3) Label data service

The label management system needs to provide services such as tag extraction and tag query for the development and application of user portraits, and deliver solutions to business parties as standardized service interfaces (APIs), enabling business lines to build personalized services on top of the data middle platform's capabilities.
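The two core services mentioned, tag query and audience (crowd) extraction, can be sketched as follows. This is a minimal in-memory sketch: the store, user IDs, tag names, and function signatures are all hypothetical, standing in for what a real tag service would expose over an HTTP API backed by the tag store.

```python
# Hypothetical tag store keyed by the unified user ID from ID mapping.
TAG_STORE = {
    "uid-1001": {"city": "Beijing", "life_stage": "new_parent", "ltv": "high"},
    "uid-1002": {"city": "Shanghai", "ltv": "mid"},
}

def get_tags(user_id, tags=None):
    """Tag query: all tags for a user, or only a requested subset."""
    profile = TAG_STORE.get(user_id, {})
    if tags is None:
        return profile
    return {t: profile[t] for t in tags if t in profile}

def extract_audience(tag, value):
    """Crowd extraction: all users whose tag matches the given value."""
    return sorted(uid for uid, p in TAG_STORE.items() if p.get(tag) == value)

print(get_tags("uid-1001", ["city"]))      # {'city': 'Beijing'}
print(extract_audience("ltv", "high"))     # ['uid-1001']
```

A business line calling such an API never needs to know which component the tags were computed in, which is exactly the decoupling the standardized interface is meant to provide.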

Beyond business intelligence (BI) and tag management, enterprises also need to maximize the value of data applications according to the characteristics of their own industries.

