Data middle platform in action (00): Is the data middle platform the next stop for big data?

In addition to supporting the group's big data construction, our team also provides To B services, so I have had the opportunity to work with traditional companies undergoing digital transformation. Starting at the end of 2018, tenders for big data platforms suddenly disappeared from the market, replaced by data middle platform projects. Building a data middle platform has become the first choice for the digital transformation of traditional enterprises, and many experts in the big data field believe the data middle platform is the next stop for big data.

Why is the data middle platform the next stop for big data? How does it differ from the data warehouse, the data lake, and the big data platform? To answer that, let's trace the history of big data, from the emergence of the data warehouse, through the data lake, to the big data platform. Only then can we understand the problems at each stage of big data's development and the historical position the data middle platform occupies.

1 Data warehouse

Business Intelligence (BI) was born in the 1990s. It converts an enterprise's existing data into knowledge to support analysis and decision-making. Take store management in the retail industry: to maximize the profit of a single store, you must analyze each product's sales data and inventory information and formulate a sales and purchasing plan for every product:

  • Some products are unsalable and should be marked down.
  • Some products are popular and should be stocked up in advance based on forecasts of future sales.

Both decisions depend on analyzing large amounts of data.

Data analysis needs to aggregate data from multiple business systems, such as the trading system and the warehousing system; it also needs to retain historical data and run range queries over large volumes of it. A traditional database serves a single business system and mainly handles transaction-oriented create, read, update, and delete operations, so it is unsuited to these scenarios. This gave birth to the data warehouse.

In "Building the Data Warehouse" published in 1991, Bill Inmon, the father of data warehouse, gave the first complete definition of data warehouse:

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.

1.1 Understanding the four elements of a data warehouse through an e-commerce case

Suppose an e-commerce company has:

  • one database dedicated to order data
  • another database that stores member-related data

How do we build a data warehouse on top of them?

First, we need to synchronize data from different business systems into a unified data warehouse, and then organize the data according to subject areas.


1.1.1 Subject area

A subject area is a high-level abstraction of business processes: products, transactions, users, and traffic can each be a subject area. You can think of subject areas as the directories of the data warehouse. Data in the warehouse is generally stored in time-based partitions and retained for five years or more. Data in each time partition is append-only; individual records cannot be updated.
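
To make the time-partitioned, append-only storage concrete, here is a minimal PySpark sketch; the path, table, and column names are illustrative assumptions, not from the original:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dw-demo").getOrCreate()

# Pretend this is one day of order data synced from the business database.
orders = spark.createDataFrame(
    [("o1001", "u01", 59.9, "2023-09-26"), ("o1002", "u02", 120.0, "2023-09-26")],
    ["order_id", "buyer_id", "amount", "dt"],
)

# Append today's time partition; existing partitions are never rewritten,
# matching the "append-only, no record updates" rule described above.
orders.write.mode("append").partitionBy("dt").parquet("/warehouse/trade/orders")
```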

1.1.2 Design method

The two data warehouse modeling methods, pioneered by Inmon and Kimball respectively, remain highly significant for modern data warehouse design, including designs based on data lakes.

Top-down: Inmon modeling

Inmon modeling starts from the data source, which in a traditional data warehouse is each business database.

The data warehouse is built around each entity in the business and the relationships between entities.

For example, when a buyer purchases a product, we first identify the entities involved in the business process: the buyer and the product are each an entity, and the buyer purchasing the product is a relationship. This yields the following model:

A buyer table, a product table, and a buyer-product transaction table (the original schema screenshots are omitted here).
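
As a rough sketch of what this entity-relationship design might look like, here is illustrative DDL executed through sqlite3; all column names are assumptions, not from the original:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Entity: buyer
conn.execute("""CREATE TABLE buyer (
    buyer_id TEXT PRIMARY KEY,
    name     TEXT,
    city     TEXT)""")

# Entity: product
conn.execute("""CREATE TABLE product (
    product_id TEXT PRIMARY KEY,
    title      TEXT,
    price      REAL)""")

# Relationship: a buyer purchases a product
conn.execute("""CREATE TABLE buyer_product_trade (
    trade_id   TEXT PRIMARY KEY,
    buyer_id   TEXT REFERENCES buyer(buyer_id),
    product_id TEXT REFERENCES product(product_id),
    amount     REAL,
    trade_time TEXT)""")
```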

Bottom-up: Kimball modeling

Kimball modeling is the opposite of Inmon's: it starts from the needs of data analysis and splits the model into dimensions and facts:

  • Users and products are dimensions
  • Inventory and user account balances are facts

Mapped onto the same business as above, the tables are called:

  • User dimension table
  • Product dimension table
  • Account balance fact table
  • Product inventory fact table
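
As a hedged sketch of the dimensional approach (table, column, and value names are illustrative assumptions), a typical analysis joins a fact table to its dimension tables and aggregates:

```python
import pandas as pd

# Dimension tables describe "who" and "what".
dim_user = pd.DataFrame({"user_id": ["u01", "u02"], "city": ["Beijing", "Hangzhou"]})
dim_product = pd.DataFrame({"product_id": ["p01", "p02"], "category": ["phone", "book"]})

# The fact table records measurable events, keyed by dimension IDs.
fact_trade = pd.DataFrame({
    "user_id": ["u01", "u01", "u02"],
    "product_id": ["p01", "p02", "p01"],
    "amount": [4999.0, 39.0, 4999.0],
})

# Analysis: sales amount per city and product category.
report = (fact_trade
          .merge(dim_user, on="user_id")
          .merge(dim_product, on="product_id")
          .groupby(["city", "category"])["amount"].sum())
print(report)
```
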
Comparison

  • Inmon modeling builds from the data source, so its construction cost is relatively high. It suits businesses with relatively fixed application scenarios, such as finance, and its advantage is less redundant data.

  • Kimball modeling starts from the analysis scenario and suits fast-changing businesses, such as Internet businesses. Because business today changes rapidly, the Kimball method is the one more often recommended.

The traditional data warehouse established, for the first time, that data analysis scenarios should be served by a dedicated solution instead of relying on the business databases, and it proposed a methodology for warehouse model design, laying the foundation for the later large-scale application of data. With the arrival of the Internet era, however, the traditional data warehouse declined, and Internet technology gave birth to the big data era.

2 From Hadoop to Data Lake

Then came the Internet era.

2.1 Major changes

Unprecedented scale of data

A successful Internet product can have over 100 million daily active users; Douyin, for example, generates hundreds of billions of user behavior records every day. Traditional data warehouses are hard to scale and cannot carry such massive data volumes.

Data types become heterogeneous

Internet data comes from:

  • structured data in business databases
  • event-tracking data from the App and Web front ends and logs from the business back ends, which are generally semi-structured or even unstructured

Traditional data warehouses impose strict requirements on the data model: before data can be imported into the warehouse, the model must be defined in advance, and the data must be stored according to that design.

These limitations in data scale and data type meant traditional data warehouses could not support Internet-scale BI.

Internet giant Google was the first to explore a way forward. Starting in 2003, it published a series of papers (on the Google File System, MapReduce, and BigTable) that laid the foundation for modern big data technology by proposing a new, unified way to store and compute over massive heterogeneous data.

Big data technology took off with the emergence of Hadoop in 2005, the open source implementation of those papers.

2.2 Hadoop vs. the traditional data warehouse

  • Fully distributed and easy to scale: clusters of low-cost machines provide strong computing and storage capacity to meet massive data processing needs.
  • Weakened data format requirements: data can be integrated into Hadoop without any predefined format, decoupling the data model from data storage. When the data is used, it can be read under different models (schema-on-read), which allows flexible analysis of heterogeneous data.
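
A minimal sketch of schema-on-read with PySpark (the log path and fields are illustrative assumptions): the raw JSON logs are stored as-is, and a schema is applied only at read time.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Raw, semi-structured event logs were dumped into HDFS without a schema.
# Different consumers can read the same files under different models.
click_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("ts", LongType()),
])

clicks = spark.read.schema(click_schema).json("hdfs:///logs/events/2023-09-26/")
clicks.groupBy("page").count().show()
```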

As Hadoop matured, Pentaho founder and CTO James Dixon proposed a new concept at the Hadoop World conference in 2010: the data lake.

2.3 Data Lake

A data lake is a repository or system that stores data in its raw format. The data lake marks Hadoop's transition from an open source project to commercial maturity: enterprises can build data lakes on Hadoop and manage data as a core corporate asset.

However, a commercial Hadoop distribution contains more than 20 computing engines, data development involves many steps, and the technical threshold is high; all of this limited Hadoop's commercialization. How can data processing become like a factory, completed directly on an assembly line?

3 Data Factory: Big Data Platform

3.1 Data development process

  • First, import the data into the big data platform.
  • Then carry out data development according to requirements.
  • After development, verify and compare the data to confirm it meets expectations.
  • Publish the data job online and submit it for scheduling (see the sketch below).
  • Perform daily task operation and maintenance to ensure the job produces data normally every day.
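
As a hedged illustration of the publish-and-schedule step, here is a minimal daily task sketch assuming Apache Airflow 2.x; the DAG and task names are made up:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def clean_orders():
    # Placeholder for the actual data development logic,
    # e.g. a Spark job that cleans yesterday's order partition.
    print("cleaning orders partition")

# Once published, the scheduler runs this task once a day; keeping it
# producing data every day is the "task operation and maintenance" stage.
with DAG(
    dag_id="dwd_orders_daily",
    start_date=datetime(2023, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="clean_orders", python_callable=clean_orders)
```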

Without an efficient platform, data development is like writing code without an IDE: others complete ten requirements in the time it takes you to finish none.

The big data platform exists to improve the efficiency of data development and lower its threshold, so that data can be processed quickly, as if on an assembly line.

The big data platform is oriented to data development scenarios: a data workbench covering the complete data development pipeline.

3.2 Big data platform usage scenarios

  • Data integration
  • Data development
  • Data testing
  • Publishing online
  • Task operation and maintenance

The big data platform serves data development. Its underlying infrastructure, represented by Hadoop, divides into computing, resource scheduling, and storage.

3.3 Big data computing engine

  • Hive and Spark handle offline data cleaning and processing; Spark is used more and more, and its performance is far higher than Hive's.
  • Flink handles real-time computing.
  • Impala handles interactive queries.

These computing engines all run on Yarn, which allocates the computing resources. Resource scheduling based on Kubernetes also exists: Spark 2.4.4, for example, can run in a Kubernetes-managed cluster, which enables mixed deployment of online and offline workloads and saves machine costs.

3.4 Data storage

  • HDFS is not updatable and mainly stores full data.
  • HBase provides an updatable key-value store and mainly holds dimension tables.
  • ClickHouse and Kudu provide real-time updates and are generally used when building real-time data warehouses.
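
As a hedged sketch of the dimension table use case, assuming the happybase client and an HBase Thrift server (host, table, and column names are made up):

```python
import happybase

# Connect to an HBase Thrift server (the host name is an assumption).
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("dim_user")

# Dimension records are small keyed rows that can be updated in place,
# unlike files stored on HDFS.
table.put(b"u01", {b"info:city": b"Beijing", b"info:level": b"gold"})
print(table.row(b"u01"))
```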

The big data platform is like an assembly line: after its processing, raw data becomes indicators that appear in reports and data products. But as demand for data grew rapidly, reports, indicators, and data models multiplied, and problems became acute: data cannot be found, data is hard to use, responses to data requests are slow. These became stumbling blocks that keep data from generating value.

4 Data middle platform (data value)

Around 2016, the Internet was developing rapidly; demand for data kept rising and data application scenarios multiplied. Large numbers of data products entered daily operations and became part of operational work. An e-commerce business, for example, has a supply chain system that generates replenishment decisions based on each product's gross margin, inventory, sales data, and public sentiment, then pushes them to the procurement system.

4.1 Problems exposed by large-scale data applications: data fragmentation

Not daring to use the data

In the early stages of business development, siloed development aimed at shipping quickly separates the data of different business lines, and even of different applications within the same business line. When the same indicator shows inconsistent results across two data applications, operations staff lose trust in the data. For example, if you work in operations and want to check a product's sales, and the sales indicator shows two different values in two reports, your first reaction is that the data was calculated wrong, and you no longer dare to use it.

Large amounts of repeated computation and development

This wastes R&D effort as well as computing and storage resources, making big data more and more expensive to apply:

  • When operations ask for data, development says it will take at least a week. Can it be faster?
  • Data developers face a flood of demands and complain that the work never ends.
  • The boss sees the monthly bill growing exponentially and wonders whether it can be cheaper.

The root cause of all these problems is the same:

4.2 Data cannot be shared

In 2016, Alibaba proposed the "data middle platform". Its core idea is to avoid repeated data computation and, through data servitization, improve data-sharing capability and empower data applications.

  • Before: intermediate data was difficult to share and could not be accumulated, so much work was done in vain.
  • After building a data middle platform, shared data is readily available. The development speed of data applications is no longer limited by the speed of data development; for a given scenario, many data applications can be hatched overnight, and these applications let the data generate value (see the sketch below).
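
To make "data servitization" concrete, here is a hedged sketch of what a shared indicator service might look like, assuming FastAPI; the endpoint, metric names, and backing store are all illustrative:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

# In a real data middle platform this would query the unified common layer
# (e.g. an OLAP engine); here a dict stands in for that store.
SALES_BY_PRODUCT = {"p01": 10499.0, "p02": 39.0}

@app.get("/metrics/sales/{product_id}")
def get_sales(product_id: str):
    # Every application reads the same serviced indicator, so two reports
    # can no longer show two different values for one metric.
    if product_id not in SALES_BY_PRODUCT:
        raise HTTPException(status_code=404, detail="unknown product")
    return {"product_id": product_id, "sales": SALES_BY_PRODUCT[product_id]}
```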

5 Summary

  1. The data middle platform is built on the data lake: it can uniformly compute over and store the heterogeneous data in the lake, and it governs the lake's messy data in a standardized way.
  2. The data middle platform relies on the big data platform. The big data platform covers the full data development process; the data middle platform adds data governance and data services on top.
  3. The data middle platform borrows the subject-area-oriented data organization of the traditional data warehouse and builds a unified common data layer based on dimensional modeling theory.

In short, the data middle platform:

  • absorbs the advantages of the traditional data warehouse, the data lake, and the big data platform
  • solves the problem of data sharing, realizing the value of data through data applications


Source: https://blog.csdn.net/qq_33589510/article/details/133344858