Data Middle Platform 02: Data Middle Platform Architecture

1. The overall architecture of the data middle platform

Previously, we built an understanding of the data middle platform at the conceptual level. Next, we will take a closer look at how it is designed at the architectural level.


The data middle platform is a complete system that sits between the underlying storage and computing platforms and the upper-layer data applications.

The data middle platform shields the complexity of the underlying storage and computing technologies, reduces the dependence on specialized technical talent, and lowers the cost of using data.

Enterprise data assets are established through the data aggregation and data development modules.

The data system stores data in layers.

Through asset management and data services, data assets are turned into data service capabilities that serve the business.

Data security management and the data operation system ensure the long-term, healthy, and continuous operation of the data middle platform.

1. Data aggregation

Data aggregation is the entry point for data into the data middle platform. The platform itself does not generate data; all data comes from business systems, databases, logs, files, and so on. Because these data are scattered across different network environments and storage platforms, they are hard to use and hard to turn into business value, so they must first be aggregated in a unified way.

2. Data development

Data development is a complete set of tools for processing data. The data brought into the platform by the aggregation module is still raw and essentially piled up in its original form, so the business cannot use it directly. The data development module processes this raw data into valuable data and delivers it to the business departments.

3. Data system

With data aggregation and data development in place, the middle platform has the basic capability to build a data warehouse. The data system is essentially about organizing the collected data according to data warehouse standards.

4. Data asset management

The data assets built through the data warehouse are fairly technical and hard for business staff to understand. Data asset management presents these assets to the enterprise's business staff in a form they can readily understand.

5. Data service system

The data service system turns data into a service capability. Through data services, data participates in the business and brings the whole data middle platform to life; the data service system is where the platform's value is realized.

6. Data operation system

The data operation system is the basis for the healthy and continuous operation of the data middle platform.

7. Data security management

Data security management ensures the security of data within the data middle platform.

This is a typical overall architecture design for a data middle platform.

2. The four-word formula of the data middle platform

If you have never worked with one before, you may still find the data middle platform hard to grasp, so here I summarize its functions in four words: collect, store, connect, and use.

Let us now look at these four words in detail.

1. Collect

Collect means to gather all of the data in the enterprise.

With the rise of the Internet, the mobile Internet, and the Internet of Things, enterprise business forms have diversified, and so have the ways data is generated, which in turn calls for multiple forms of collection.

Common forms of collection include: buried-point (event tracking) collection, hardware collection, crawler collection, database collection, and log collection.

Buried-point (event tracking) collection: generally captures user behavior, such as browsing, clicking, and dwell time on the platform (see the sketch after this list).

Hardware collection: refers to Internet of Things data collection, such as gathering air quality indicators through drone-mounted sensors.

Crawler collection: refers to collecting public data from the Internet, for example, competitor prices on e-commerce platforms.

Database collection: generally collects business data within the enterprise, such as user transaction data and user profile data.

Log collection: gathers the logs generated while software is running.

These are common forms of collection.
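
For a concrete feel of buried-point collection, here is a minimal sketch of the kind of event payload a front end might report; the endpoint URL and field names are illustrative assumptions, not part of the original text.

```python
import json
import time
import urllib.request

# Hypothetical buried-point (event tracking) report: the client sends one
# small JSON record per user action to a collection endpoint, from which
# the events flow into the aggregation layer.
event = {
    "user_id": "u_1001",
    "event": "click",
    "page": "/product/42",
    "ts": int(time.time() * 1000),            # event time in milliseconds
    "properties": {"button": "add_to_cart"},
}

req = urllib.request.Request(
    "http://collector.example.com/track",     # placeholder collection endpoint
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)                    # fire-and-forget for the sketch
```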

By how the data is organized, it can be divided into structured, semi-structured, and unstructured data.

Structured data: data that is regular and complete, can be expressed as a two-dimensional table, and strictly follows format and length specifications. Common examples are data in databases and Excel spreadsheets.

Semi-structured data: data that is regular and complete and follows format specifications, but cannot be represented by a two-dimensional relational table. Common formats include JSON and XML.

Unstructured data: data whose structure is irregular or incomplete and which is not conveniently represented as a two-dimensional table; extracting information from it requires more complex processing. Common examples include Word documents, pictures, video, and audio.

According to the timeliness of data, it can be divided into: offline data and real-time data.

Offline data: mainly used for periodic migration of large batches of data where timeliness is not critical. It is generally handled as distributed batch synchronization: the source is read through a connection, either in full or incrementally, and after unified processing the data is written to the target storage.
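
To make the full versus incremental distinction concrete, here is a minimal sketch of an incremental batch read driven by a watermark column; the table, column, and connection details are illustrative assumptions rather than anything prescribed by the platform.

```python
import pymysql  # assumed source: a MySQL business database

# Hypothetical incremental batch sync: read only rows changed since the last
# successful run, tracked by a watermark column (here, update_time).
def sync_batch(last_watermark, batch_size=10000):
    conn = pymysql.connect(host="src-db", user="etl",
                           password="***", database="shop")
    try:
        with conn.cursor() as cur:
            # A full sync would simply omit the WHERE clause.
            cur.execute(
                "SELECT id, amount, update_time FROM orders "
                "WHERE update_time > %s ORDER BY update_time LIMIT %s",
                (last_watermark, batch_size),
            )
            rows = cur.fetchall()
    finally:
        conn.close()

    # After unified processing, rows would be written to the target storage
    # (HDFS, Hive, ...) and the new watermark persisted for the next run.
    new_watermark = max((r[2] for r in rows), default=last_watermark)
    return rows, new_watermark
```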

Real-time data: mainly for low-latency application scenarios, generally collected by continuously monitoring the source, for example by reading the database's binlog to capture changes in real time.
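
As a rough sketch of binlog-based collection (in practice this role is usually filled by tools such as Canal or Flink CDC), the open-source python-mysql-replication package can stream row-change events; the connection settings below are placeholders.

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent,
)

# Stream MySQL binlog row events in near real time.
stream = BinLogStreamReader(
    connection_settings={"host": "src-db", "port": 3306,
                         "user": "repl", "passwd": "***"},
    server_id=100,                        # must be unique among replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,                        # keep waiting for new events
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        # In a real pipeline each change record would be forwarded to a
        # message queue (e.g. Kafka) for downstream real-time processing.
        print(event.table, row)
```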

We have now looked at collection methods, data organization forms, and data timeliness. So what tools should be used to collect this data?

Common collection tools include: Flume, FileBeat, Logstash, Sqoop, Canal, DataX, etc.

Among them, Flume, FileBeat, and Logstash are suitable for collecting log data. The characteristics of these three components have been analyzed in detail in the previous project courses, so I won't repeat them here.

Sqoop is a tool for batch data migration between structured data stores and HDFS, and is suitable for batch collection from databases. Its main advantage is that in suitable scenarios the data exchange is highly performant. Its main disadvantage is that jobs are highly customized: parameters must be tuned in scripts, and it is relatively weak at user-defined logic and at monitoring the synchronization pipeline.

DataX is a data collection tool open-sourced by Alibaba. It monitors the whole collection pipeline, reporting job status, data volume, throughput, execution speed, and other information; it can detect dirty data and apply configurable strategies for handling transmission errors.

Because it reads and writes through direct in-process connections, it demands a relatively large amount of memory in high-concurrency collection scenarios. DataX also does not support collecting unstructured data.
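
To give a feel for how a DataX job is described, here is a sketch that generates a job configuration for syncing a MySQL table to HDFS; the plugin parameter names follow the common mysqlreader/hdfswriter plugins, but treat the details as assumptions to be checked against the DataX documentation for your version.

```python
import json

# Hypothetical DataX job: MySQL table -> HDFS, expressed as the JSON config
# that DataX consumes. All host names, paths, and columns are illustrative.
job = {
    "job": {
        "setting": {"speed": {"channel": 3}},   # degree of parallelism
        "content": [{
            "reader": {
                "name": "mysqlreader",
                "parameter": {
                    "username": "etl",
                    "password": "***",
                    "column": ["id", "amount", "update_time"],
                    "connection": [{
                        "table": ["orders"],
                        "jdbcUrl": ["jdbc:mysql://src-db:3306/shop"],
                    }],
                },
            },
            "writer": {
                "name": "hdfswriter",
                "parameter": {
                    "defaultFS": "hdfs://namenode:8020",
                    "path": "/warehouse/ods/orders",
                    "fileName": "orders",
                    "fileType": "text",
                    "fieldDelimiter": "\t",
                    "writeMode": "append",
                    "column": [
                        {"name": "id", "type": "BIGINT"},
                        {"name": "amount", "type": "DOUBLE"},
                        {"name": "update_time", "type": "STRING"},
                    ],
                },
            },
        }],
    }
}

with open("mysql_to_hdfs.json", "w") as f:
    json.dump(job, f, indent=2)
# The job would then be submitted with: python datax.py mysql_to_hdfs.json
```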

No single tool can fully cover an enterprise's complex collection scenarios, so existing collection tools usually need secondary development: wrap them behind a visual configuration interface that hides the complexity of the underlying tools, supports common data sources (relational databases, NoSQL databases, MQ, file systems, and so on), and supports both incremental and full synchronization.

2. Store

After the data is collected, data storage needs to be considered.

Here we can divide data into two types: static data and dynamic data.

Static data uses distributed file systems such as HDFS and S3 as the storage engine and suits high-throughput offline big data analysis. Its limitation is that the data cannot be read and written randomly.

Dynamic data uses NoSQL databases such as HBase and Cassandra as the storage engine and suits random reads and writes over large data sets. Its limitation is that batch read throughput is far below HDFS, so it is not well suited to batch data analysis.
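
A small sketch of the two storage styles, assuming a WebHDFS endpoint and an HBase Thrift server are available (hosts, ports, table, and column names below are placeholders):

```python
from hdfs import InsecureClient   # WebHDFS client (pip install hdfs)
import happybase                  # HBase Thrift client (pip install happybase)

# Static data: append-style files on HDFS, good for high-throughput scans,
# not for random updates.
hdfs_client = InsecureClient("http://namenode:9870", user="etl")
hdfs_client.write("/warehouse/logs/2024-01-01.log",
                  data=b"uid=1001\taction=click\n", overwrite=True)

# Dynamic data: row-level random reads and writes through HBase.
hbase_conn = happybase.Connection("hbase-thrift-host")
user_table = hbase_conn.table("user_profile")
user_table.put(b"1001", {b"info:city": b"Beijing", b"info:level": b"3"})
print(user_table.row(b"1001"))    # random read by row key
hbase_conn.close()
```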

3. Connect

Connect means to process and compute the data, build an enterprise-level data warehouse, and link up data across the whole enterprise.

Data processing and computation can be divided into two parts: offline computing and real-time computing.

Representative frameworks in offline computing are: MapReduce, Hive, and Spark.

Representative frameworks in real-time computing are Storm, Spark Streaming, and Flink; today Flink is the mainstream choice.

If every computing task required writing code against these frameworks, the platform would be unfriendly to users, especially business staff who can write SQL but not code. So we need a one-stop, SQL-based development platform whose underlying engines (Spark and Flink) support both offline and real-time computation, freeing users from heavy low-level coding work.
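
As a minimal sketch of the SQL-first idea, the platform would take a user's SQL statement and submit it to an engine such as Spark (or Flink for real-time data); the table and path names here are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# The user only writes SQL; the platform handles engine submission.
spark = SparkSession.builder.appName("sql-platform-demo").getOrCreate()

# Register a warehouse table (illustrative path) as a SQL view.
orders = spark.read.parquet("hdfs://namenode:8020/warehouse/dwd/orders")
orders.createOrReplaceTempView("dwd_orders")

# The same statement could equally be handed to Flink SQL for streaming data.
daily_gmv = spark.sql("""
    SELECT dt, SUM(amount) AS gmv
    FROM dwd_orders
    GROUP BY dt
""")
daily_gmv.write.mode("overwrite").saveAsTable("ads_daily_gmv")
```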

4. Use

Once the enterprise's data has been collected, stored, and connected, the question becomes how to use it.
"Use" here covers several levels.

First, it includes data asset management, also called data governance, which covers data standard management, data label management, data model management, metadata management, data quality management, and so on, ensuring that data in the middle platform is rational and standardized so that its value can be fully realized.

For data owners and managers, sound management and effective application unlock and fully release the enormous value of data. If data cannot be managed effectively, it will not be used, or will not be used well, and piles of disordered data will instead impose high costs on the enterprise.

When using data, data security management must also be done well. As big data technology and applications develop rapidly, more and more of the multi-dimensional business value carried by data is being mined and realized, and with it data security and privacy have become a worldwide concern, rising to the level of national strategy. For example, Trump recently made a lot of noise about banning the overseas version of TikTok, on the grounds that the data on the platform poses a threat to the United States.

Data security is therefore essential. A complete data security management system is built through layered construction and graded protection, forming a data-centered security management framework.

When building a data middle platform, data security management should always be given top priority: a well-designed security management system protects the data from multiple aspects and at multiple levels.

Ultimately, we need to deliver safe, valuable data to upper-layer applications quickly and conveniently, which means opening it up through data services, that is, as API interfaces.

As an analogy, water is the source of life and a vital resource that people depend on for survival and development; in daily life it can be used in many ways and brings great convenience.

In the data world, data assets are like water: ubiquitous and indispensable. But without treatment plants and pipelines, people would have to fetch drinking water directly from the reservoir, which would obviously disrupt normal life and work. Likewise, only by encapsulating data into data services and providing them to upper-layer applications as interfaces can the value of data assets be fully released and amplified.
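
Here is a minimal sketch of what such a data service might look like, exposing one processed indicator over HTTP with Flask; the route, indicator, and in-memory "serving store" are all illustrative assumptions.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the real serving store (e.g. MySQL/HBase) that the computing
# layer would populate with the processed indicator.
SERVING_STORE = {"2024-01-01": 1234567.89}

@app.route("/api/v1/indicators/daily_gmv/<dt>")
def daily_gmv(dt):
    """Return the daily GMV indicator for one date as a JSON API response."""
    value = SERVING_STORE.get(dt)
    if value is None:
        return jsonify({"code": 404, "msg": "no data for " + dt}), 404
    return jsonify({"code": 0, "data": {"dt": dt, "gmv": value}})

if __name__ == "__main__":
    app.run(port=8080)
```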

To sum up, the data middle platform can be understood like this: collect enterprise-wide data, store it, connect the data through processing and computation, and finally provide data services in the form of API interfaces. That is what a data middle platform does.


Source: blog.csdn.net/weixin_40612128/article/details/123547489