Data Lake: Concept, Characteristics, Architecture, and Case Overview

1. What is a data lake

The data lake is a hot topic at the moment, and many enterprises are building or planning to build their own. Before starting such a project, however, it is crucial to be clear about what a data lake actually is and what the basic components of a data lake project are, and only then to design its basic architecture. There are several common definitions of a data lake.

Wikipedia defines it this way:

A data lake is a system or repository that stores data in its natural/raw format, usually as object blobs or files. A data lake is usually a single store for all the data in an enterprise, including raw copies of data produced by source systems and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (such as CSV, logs, XML, JSON), unstructured data (such as emails, documents, PDFs) and binary data (such as images, audio, video). A data swamp is a deteriorated, poorly managed data lake that is either inaccessible to its users or provides little value.

AWS's definition is more concise:

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store data as-is (without having to structure it first) and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.

Microsoft's definition is vaguer still; rather than saying what a data lake is, it defines the data lake by its functions:

Azure Data Lake includes all the capabilities needed to make it easy for developers, data scientists, and analysts to store and process data. It lets users store data of any size, type, and ingestion speed, and run all types of processing and analysis across platforms and languages. The data lake helps users put their data to work faster while removing the complexity of data collection and storage, and it supports batch processing, stream computing, interactive analysis, and more. It works with existing IT investments in data management and governance to keep data consistent, manageable, and secure, and it integrates seamlessly with existing operational databases and data warehouses to extend existing data applications. Azure Data Lake draws on the experience of a large number of enterprise users and supports large-scale processing and analysis scenarios in several Microsoft businesses, including Office 365, Xbox Live, Azure, Windows, Bing, and Skype. As a service, it addresses many efficiency and scalability challenges and enables users to maximize the value of their data assets to meet current and future needs.


There are actually many definitions of data lakes, but they basically revolve around the following characteristics.

  1. A data lake needs to provide sufficient storage capacity to hold all the data of an enterprise/organization.

  2. Data lakes can store massive amounts of data of any type, including structured, semi-structured, and unstructured data.

  3. The data in the data lake is raw data, a complete copy of the business data, preserved in the form it had in the business systems.

  4. A data lake needs complete data management capabilities (i.e., complete metadata), covering the various data-related elements: data sources, data formats, connection information, data schemas, and permissions.

  5. A data lake needs diverse analysis capabilities, including but not limited to batch processing, stream computing, interactive analysis, and machine learning; it also needs to provide certain task scheduling and management capabilities.

  6. A data lake needs complete data lifecycle management capabilities: it must not only store the raw data but also preserve the intermediate results of the various analysis and processing steps, and completely record the processing history of the data, so that users can trace in full detail how any piece of data was produced.

  7. A data lake needs complete data acquisition and data publishing capabilities. It must support a variety of data sources, obtain full/incremental data from them, and store it in a standardized way; it must also be able to push analysis and processing results to a storage engine suited to the access requirements of different applications.

  8. Support for big data, including ultra-large-scale storage and scalable large-scale data processing capabilities.

To sum up, I personally think a data lake should be an evolving, scalable infrastructure for big data storage, processing, and analysis. It is data-oriented, enabling full acquisition, full storage, multi-mode processing, and full lifecycle management of data from any source, at any speed, at any scale, and of any type; and, through interaction and integration with various external heterogeneous data sources, it supports a variety of enterprise-level applications.


                                            Figure 1. Basic capabilities of the data lake

Two more points need to be pointed out here:

  1. Scalability refers to both scalability of scale and extensibility of capabilities: the data lake must not only provide "sufficient" storage and computing power as data volumes grow, but must also be able to add new data processing modes as needed. For example, a business may only need batch processing at the beginning, but as it develops it may require interactive ad hoc analysis; and as its demands on timeliness grow, it may need to support rich capabilities such as real-time analysis and machine learning.

  2. Data-oriented means the data lake should be simple and easy to use, freeing users from complex IT infrastructure operation and maintenance so that they can focus on the business, models, algorithms, and data. Data lakes are aimed at data scientists and analysts. At present, cloud native is arguably the ideal way to build a data lake; this is discussed in detail later in the "Basic architecture of the data lake" section.

2. Basic Features of Data Lakes

Having established a basic understanding of the data lake concept, we need to further clarify its basic characteristics, especially in comparison with a big data platform or a traditional data warehouse. Before the detailed analysis, let's look at a comparison table from the AWS official website.


The above table compares data lakes with traditional data warehouses. I personally think the characteristics of a data lake can be further analyzed on two levels: data and computation. On the data side:

  • "Fidelity". A complete copy of the data in the business system is stored in the data lake. The difference from the data warehouse is that a copy of the original data must be stored in the data lake, and the data format, data schema, and data content should not be modified. In this regard, the data lake emphasizes the preservation of "authentic" business data. At the same time, the data lake should be able to store any type/format of data.

  • "Flexibility": A point in the above table is "write schema" vs "read schema", which is essentially a question of which stage of data schema design occurs. For any data application, schema design is actually essential. Even for some databases that emphasize "schemaless", such as mongoDB, the best practice still recommends that records use the same/similar structure as possible. The implicit logic behind the "write-type schema" is that before data is written, the schema of the data needs to be determined according to the access method of the business, and then the data import is completed according to the established schema, which brings the benefit of good compatibility between the data and the business. However, this also means that the initial cost of ownership of the data warehouse will be relatively high, especially when the business model is unclear and the business is still in the exploratory stage, the flexibility of the data warehouse is not enough. The underlying logic behind the "read schema" emphasized by the data lake is that business uncertainty is the norm: we cannot anticipate business changes, so we maintain a certain degree of flexibility, delay the design, and let the entire Infrastructure has the ability to make data "on-demand" with the business. Therefore, I personally think that "fidelity" and "flexibility" are in the same line: since there is no way to predict business changes, then simply keep the data in the most primitive state, and when needed, the data can be processed according to the needs. Therefore, data lakes are more suitable for innovative enterprises and enterprises with fast-changing business development. At the same time, users of data lakes have higher requirements accordingly. Data scientists and business analysts (with certain visualization tools) are the target customers of data lakes.

  • "Manageable": The data lake should provide comprehensive data management capabilities. Since data requires "fidelity" and "flexibility," at least two types of data exist in the data lake: raw data and processed data. The data in the data lake will continue to accumulate and evolve. Therefore, data management capabilities will also be very demanding, and at least the following data management capabilities should be included: data sources, data connections, data formats, and data schemas (libraries/tables/columns/rows). At the same time, the data lake is a unified data storage place in a single enterprise/organization. Therefore, it also needs to have certain rights management capabilities.

  • "Traceability": A data lake is a storage place for the full amount of data in an organization/enterprise. It is necessary to manage the entire life cycle of data, including the entire process of data definition, access, storage, processing, analysis, and application. A powerful data lake implementation needs to be able to trace the access, storage, processing, and consumption process of any piece of data in between, and to be able to clearly reproduce the complete data generation process and flow process.

On the computation side, I personally think the data lake's requirements for computing power are very broad and depend entirely on the computing needs of the business.

  • Rich computing engines. From batch processing, stream computing, and interactive analysis to machine learning, all kinds of computing engines fall within the scope a data lake should cover. In general, batch engines are used for data loading, transformation, and processing; stream engines handle the parts that require real-time computation; and interactive analysis engines may be introduced for exploratory analysis scenarios. As big data and AI technologies converge, various machine learning/deep learning algorithms are also being brought in; for example, TensorFlow/PyTorch can already read training samples directly from HDFS/S3/OSS (see the sketch after this list). So for a qualified data lake project, the scalability/pluggability of the computing engine should be a basic capability.

  • Multi-modal storage engines. In theory, the data lake should have built-in multi-modal storage engines to meet the access requirements of different applications (taking response time, concurrency, access frequency, cost, and other factors into account). In practice, however, the data in the lake is usually not accessed frequently and the related applications are mostly exploratory, so to achieve an acceptable price/performance ratio, data lakes are usually built on relatively cheap storage engines (such as S3/OSS/HDFS/OBS) and work with external storage engines when needed to satisfy diverse application requirements.
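As mentioned in the first bullet, training frameworks can pull samples directly from lake storage. Below is a minimal, hedged sketch of that idea in TensorFlow; the bucket and path are made-up placeholders, and reading s3:// (or oss://) URIs assumes the corresponding filesystem plugin (e.g. tensorflow-io) and credentials are in place.

```python
import tensorflow as tf

# Hypothetical lake path; s3:// access assumes the S3 filesystem plugin is installed.
files = tf.data.Dataset.list_files("s3://example-lake/training/part-*.tfrecord")

feature_spec = {
    "x": tf.io.FixedLenFeature([16], tf.float32),   # toy feature vector
    "y": tf.io.FixedLenFeature([], tf.int64),       # toy label
}

# Build an input pipeline that streams training samples straight from lake storage.
dataset = (
    tf.data.TFRecordDataset(files)
    .map(lambda rec: tf.io.parse_single_example(rec, feature_spec))
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)
)
```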

3. Basic architecture of the data lake

Data lakes can be considered a new generation of big data infrastructure. To better understand the basic architecture of the data lake, let's first look at how big data infrastructure has evolved.

1) The first stage: offline data processing infrastructure represented by Hadoop. As shown in the figure below, Hadoop is a batch data processing infrastructure with HDFS as its core storage and MapReduce (MR) as its basic computing model. Around HDFS and MR, a series of components emerged to continuously improve the data processing capability of the big data platform, such as HBase for online key-value access, Hive for SQL, and Pig for workflows. Meanwhile, as performance requirements for batch processing grew, new computing models kept being proposed, producing engines such as Tez, Spark, and Presto, and the MR model gradually evolved into the DAG model.

On the one hand, the DAG model improves the expressiveness and concurrency of the computing model: each computation is decomposed into stages according to the aggregation points in the processing logic, each stage consists of one or more tasks, and the tasks within a stage can run concurrently, which improves the parallelism of the whole computation. On the other hand, to reduce the writing of intermediate results to files during processing, engines such as Spark and Presto try to cache data in the memory of the compute nodes, improving the efficiency and throughput of the whole data process.


                    Figure 2. Schematic diagram of Hadoop architecture

2) The second stage: the Lambda architecture. As data processing capabilities and requirements kept changing, more and more users found that, no matter how much batch performance improved, it could not meet scenarios with high real-time requirements. Stream computing engines emerged to meet this need, such as Storm, Spark Streaming, and Flink.

However, as more and more applications came online, it became clear that most needs could be met by combining batch processing and stream computing. Users do not really care what the underlying computing model is; they want both batch and streaming results to be served through a unified data model. The Lambda architecture was proposed to address this, as shown in the figure below.


                                  Figure 3. Schematic diagram of Lambda architecture

The core idea of the Lambda architecture is "stream-batch unification" at the serving layer. As shown in the figure above, the data stream flows into the platform from left to right and is split into two paths, one processed in batch mode and the other by stream computing; whichever mode is used, the results are served to applications through a serving layer, ensuring consistent access.

3) The third stage: the Kappa architecture. The Lambda architecture solves the consistency of the results that applications read, but its separate stream and batch processing chains increase development complexity. This led people to ask whether one system could solve all the problems. The currently popular approach is to build everything on stream computing: the naturally distributed nature of stream computing implies good scalability, and by increasing the parallelism of stream jobs and widening the "time window" over the streaming data, batch and stream processing can be unified under one model.


                                      Figure 4. Kappa architecture schematic
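A hedged sketch of the Kappa idea using Spark Structured Streaming follows: a single streaming job whose aggregation window can be widened so the same code covers both near-real-time and batch-style aggregation. The input path, schema, and sink below are hypothetical placeholders, not part of any specific Kappa implementation.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

# Read a replayable event log that has landed in lake storage (placeholder path).
events = (
    spark.readStream
    .schema("user STRING, amount DOUBLE, ts TIMESTAMP")
    .json("s3a://example-lake/events/")
)

# Widening this window (minutes -> hours -> a day) moves the same job from
# streaming-style aggregation toward batch-style aggregation.
hourly_spend = events.groupBy(window(col("ts"), "1 hour"), col("user")).sum("amount")

query = (
    hourly_spend.writeStream
    .outputMode("complete")
    .format("console")      # demo sink; a real job would write to a lake table
    .start()
)
query.awaitTermination()
```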

In summary, from the traditional Hadoop architecture to the Lambda architecture, and from Lambda to Kappa, the evolution of big data infrastructure has gradually absorbed the various data processing capabilities that applications need, and the big data platform has gradually evolved into an enterprise/organization-wide data processing platform. In current enterprise practice, apart from relational databases tied to individual business systems, almost all other data is expected to be consolidated into the big data platform for unified processing. However, today's big data platforms focus on storage and computation while neglecting the management of data as an asset, and that is exactly one of the directions that data lakes, as a new generation of big data infrastructure, focus on.

I once read a very interesting article that asked the following question: why is it called a data lake rather than a data river or a data sea? One interesting answer goes:

  1. "River" emphasizes mobility, "the sea contains hundreds of rivers", rivers will eventually flow into the sea, and enterprise-level data needs to be precipitated for a long time, so it is more appropriate to call "lake" than "river"; at the same time, lake water is naturally divided into This is consistent with the needs of enterprises to build a unified data center to store and manage data. "Hot" data is in the upper layer, which is convenient for applications to use at any time; warm data and cold data are located in different storages in the data center. In the medium, the balance between data storage capacity and cost is achieved.

  2. It is not called a "sea" because the sea is boundless, whereas a "lake" has boundaries, and that boundary is the business boundary of the enterprise/organization; this is also why data lakes need strong data management and permission management capabilities.

  3. Another important reason it is called a "lake" is that a data lake needs fine-grained governance. A data lake that lacks management and governance will eventually degenerate into a "data swamp", in which applications cannot effectively access the data and the data stored in it loses its value.

The evolution of big data infrastructure reflects one point: within enterprises/organizations it is now a consensus that data is an important asset. To make better use of data, enterprises/organizations need to:

  1. Store the raw data as-is for the long term;

  2. Manage the data effectively and govern it centrally;

  3. Provide multi-modal computing power to meet processing needs;

  4. Face the business and provide unified data views, data models, and data processing results.

It is against this background that the data lake emerged. In addition to the basic capabilities of a big data platform, the data lake emphasizes the management, governance, and assetization of data. In concrete implementations, the data lake needs to include a series of data management components, including:

  1. Data access

  2. Data relocation (migration)

  3. Data governance

  4. Quality management

  5. Asset catalog

  6. Access control

  7. Task management

  8. Task scheduling

  9. Metadata management, etc.
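As a rough illustration of what several of these components (asset catalog, metadata management, access control) have to keep track of, the sketch below models a single catalog record. It is a toy structure for illustration only, not any vendor's API, and all field values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CatalogEntry:
    name: str                       # logical table name
    source: str                     # originating system
    location: str                   # where the data physically lives
    fmt: str                        # storage format
    schema: Dict[str, str]          # column name -> type
    owner: str                      # accountable team/person
    permissions: List[str] = field(default_factory=list)   # coarse ACL entries
    lineage: List[str] = field(default_factory=list)        # upstream datasets

# One hypothetical record in the lake's asset catalog.
orders_raw = CatalogEntry(
    name="sales_db.orders_raw",
    source="mysql://erp/orders",
    location="oss://example-lake/raw/orders/",
    fmt="parquet",
    schema={"order_id": "bigint", "amount": "double", "order_date": "date"},
    owner="data-platform",
    permissions=["analyst:read"],
)
```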

The figure below shows a reference architecture for a data lake system. Like a big data platform, a typical data lake has the storage and computing power needed to handle ultra-large-scale data and provides multi-mode data processing capabilities; what it adds is much stronger data management capability, embodied in:

  1. More powerful data access (ingestion) capabilities. Data access means being able to define and manage various external heterogeneous data sources, and to extract and migrate data from them; the extracted and migrated data includes both the metadata of the external data sources and the actually stored data.

  2. More powerful data management capabilities. These can be divided into basic and extended management capabilities. Basic capabilities include management of various metadata, data access control, and data asset management, which any data lake system must support. Extended capabilities include task management, process orchestration, and capabilities related to data quality and data governance. Task management and process orchestration are mainly used to manage, orchestrate, schedule, and monitor the tasks that process data in the lake; data lake builders usually purchase or develop customized data integration or data development subsystems/modules to provide these capabilities, and such customized systems/modules can integrate with the data lake by reading its metadata. Data quality and data governance are more complex topics; generally the data lake system does not provide these functions directly but exposes interfaces or metadata so that capable enterprises/organizations can integrate existing data governance software or develop their own.

  3. Shareable metadata. The various computing engines in the data lake are deeply integrated with the data in the lake, and the basis of that integration is the lake's metadata. In a good data lake system, when processing data the computing engine can obtain the storage location, format, schema, distribution, and other information directly from the metadata and process the data without manual or programmatic intervention. Furthermore, a good data lake system can enforce access control on the data in the lake, with granularity down to the database, table, column, and row levels.
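A hedged sketch of the "shareable metadata" point: when the lake's catalog is exposed as a Hive-compatible metastore (or an equivalent shared catalog), a Spark job resolves storage location, format, and schema from metadata, so nothing is hard-coded in the program. The database and table names below are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shared-metadata-demo")
    .enableHiveSupport()            # attach to the shared catalog/metastore
    .getOrCreate()
)

# The catalog tells Spark where "sales_db.orders_raw" lives and how it is stored.
orders = spark.table("sales_db.orders_raw")
daily = orders.groupBy("order_date").count()

# Results are registered back into the same catalog for other engines to reuse.
daily.write.mode("overwrite").saveAsTable("sales_db.orders_daily_agg")
```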


                                                      Figure 5. Data Lake Components Reference Architecture

It should also be pointed out that the "centralized storage" in the figure above is centralization more as a business concept: the point is that the data within an enterprise/organization should settle in a clear, unified place. Physically, the data lake's storage should be a distributed file system that can scale on demand; most data lake practices recommend distributed systems such as S3/OSS/OBS/HDFS as the lake's unified storage.
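As a minimal, hedged illustration of landing raw data as-is in that unified object storage (S3 here; OSS/OBS have analogous SDKs), the snippet below uploads an unmodified log export under a partition-style prefix. The bucket, prefix, and file names are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Upload the raw export unchanged; the prefix encodes a date partition.
s3.upload_file(
    Filename="/var/log/app/events-2020-05-01.json",
    Bucket="example-lake-raw",
    Key="events/dt=2020-05-01/events-2020-05-01.json",
)
```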

We can also switch to the data dimension and look at how the data lake handles data from the perspective of the data lifecycle, shown in Figure 6. In theory, a well-managed data lake retains the raw data permanently, while the processed (intermediate) data is continuously refined and evolved to meet the needs of the business.


                                                 Figure 6. Diagram of the data life cycle in the data lake

4. Data lake solutions from the major vendors

Data lakes are a current trend, and the major cloud vendors have launched their own data lake solutions and related products. This section analyzes the solutions of the mainstream vendors and maps them onto the data lake reference architecture to help you understand the strengths and weaknesses of each.

4.1 AWS Data Lake Solutions


                                                    Figure 7. AWS data lake solution

Figure 7 shows the data lake solution recommended by AWS. The whole solution is based on AWS Lake Formation, which is essentially a management component that works with other AWS services to deliver an enterprise-level data lake. Read from left to right, the figure reflects four steps: data inflow, data storage, data computation, and data application. Let's look at the key points:

1) Data inflow


Data inflow is where the construction of the data lake begins, and it covers both metadata and business data. Metadata inflow consists of creating data sources and crawling metadata, which ultimately produces a data catalog and generates the corresponding security settings and access control policies. The solution provides a dedicated component to obtain metadata from external data sources: it connects to them, detects data formats and schemas, and creates the corresponding metadata in the data lake's catalog. Business data flows in via ETL.

In terms of concrete products, AWS packages metadata crawling, ETL, and data preparation into a separate product, AWS Glue. AWS Glue and AWS Lake Formation share the same data catalog; as the AWS Glue documentation states, "Each AWS account has one AWS Glue Data Catalog per AWS region".
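A hedged sketch (not AWS's reference code) of this metadata-inflow step: crawl an S3 prefix into the Glue Data Catalog, then query the resulting table by name with Athena, which shares the same catalog. The bucket names, database name, and IAM role are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")
athena = boto3.client("athena", region_name="us-east-1")

# Register a crawler that detects schema/format under an S3 prefix
# and writes table definitions into the Glue Data Catalog.
glue.create_crawler(
    Name="raw-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder ARN
    DatabaseName="lake_raw",
    Targets={"S3Targets": [{"Path": "s3://example-lake-raw/events/"}]},
)
glue.start_crawler(Name="raw-events-crawler")

# Once the crawler has populated the catalog, Athena can query the table by name.
athena.start_query_execution(
    QueryString="SELECT count(*) FROM events",
    QueryExecutionContext={"Database": "lake_raw"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```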

Support for heterogeneous data sources: the AWS data lake solution supports S3, AWS relational databases, and AWS NoSQL databases, and AWS uses components such as Glue, EMR, and Athena to let data flow freely among them.

2) Data storage

Amazon S3 is used as the centralized storage of the entire data lake, scaling on demand and billed per use.

3) Data computation

The solution uses AWS Glue for basic data processing. Glue's basic unit of computation is the batch ETL job, which can be started in three ways: manual trigger, scheduled trigger, and event trigger. The AWS services fit together very well ecologically: in event-triggered mode, AWS Lambda can be used for extended development, triggering one or more jobs at once, which greatly improves the customizability of job triggering; meanwhile, the various ETL jobs can be monitored through CloudWatch.
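A hedged sketch of the event-triggered mode just described: an AWS Lambda function that starts a Glue ETL job whenever a new object lands in S3. The Glue job name and argument key are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

def handler(event, context):
    # Each S3 event record identifies one newly created object.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Kick off a Glue job run for the new object (job name is a placeholder).
        glue.start_job_run(
            JobName="normalize-raw-events",
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )
```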

4) Data application

Beyond the basic batch mode, AWS provides rich computing modes through external engines: Athena/Redshift for SQL-based interactive analysis and batch processing, and EMR for the various capabilities of Spark, including the stream computing and machine learning that Spark can provide.

5) Permission management

AWS's data lake solution provides fairly complete permission management through Lake Formation, with granularity covering "database-table-column". There is one exception: when Glue accesses Lake Formation, the granularity is only the two levels "database-table". This also shows, from another angle, that Glue is more tightly integrated with Lake Formation and has broader access to the data.

Lake Formation permissions are further divided into data catalog access permissions and underlying data access permissions, corresponding to metadata and the actually stored data respectively. Permissions on the actually stored data are further divided into data access permissions and data location (storage) permissions: data access permissions are similar to table access permissions in a database, while data location permissions refine access down to specific directories in S3 (granted explicitly or implicitly). As shown in Figure 8, user A, who has only data access permissions, cannot create a table under the specified S3 bucket.
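A hedged sketch of catalog-level permission granting in Lake Formation: granting a principal SELECT on specific columns of a catalog table. The role ARN, database, table, and column names are hypothetical placeholders.

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Grant column-level SELECT on a catalog table to an analyst role (placeholders).
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "lake_raw",
            "Name": "events",
            "ColumnNames": ["event_time", "channel"],
        }
    },
    Permissions=["SELECT"],
)
```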

Personally, I think this further reflects the need for data lakes to support multiple storage engines. In the future a data lake may not only include core storage such as S3/OSS/OBS/HDFS but may also incorporate more types of storage engines according to application access patterns: for example, S3 for raw data, a NoSQL store for processed data suited to key-value access, and an OLAP engine for data that must power real-time reports and ad hoc queries. Although much of today's material emphasizes the difference between the data lake and the data warehouse, in essence the data lake is a concrete realization of a unified data management philosophy, and the "integration of lake and warehouse" is likely to be a future trend.


             Figure 8. Schematic representation of permissions separation for AWS data lake solutions

To summarize, the AWS data lake solution is highly mature, especially in metadata management and permission management: it connects heterogeneous data sources with the various computing engines upstream and downstream, allowing data to "move" freely. In stream computing and machine learning, AWS's offering is also fairly complete. For stream computing, AWS provides the dedicated Kinesis family: Kinesis Data Firehose offers a fully managed data delivery service, so data processed in real time through Kinesis Data Streams can be conveniently written to S3 via Firehose, with format conversion supported along the way, such as converting JSON to Parquet.
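A hedged illustration of that Kinesis-to-S3 path: records pushed into a Firehose delivery stream are delivered to S3 (and can be converted to Parquet by the stream's configuration). The delivery stream name is a hypothetical placeholder assumed to already be configured to deliver into the lake bucket.

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# A single click event pushed into the delivery stream; Firehose buffers and
# delivers it to S3 according to the stream's configuration.
event = {"user": "u-42", "action": "click", "ts": "2020-05-01T12:00:00Z"}
firehose.put_record(
    DeliveryStreamName="ad-events-to-lake",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```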

A particularly nice part of the AWS solution is that Kinesis can access the metadata in Glue, which again reflects the ecosystem completeness of the AWS data lake solution. Similarly, for machine learning AWS provides the SageMaker service, which can read training data from S3 and write trained models back to S3. Note that in the AWS data lake solution, stream computing and machine learning are not rigidly bundled in; they are simply extensions of computing power that can be integrated easily.

Finally, let's go back to the data lake component reference architecture in Figure 5 and look at the component coverage of AWS's data lake solution; see Figure 9.


                                    Figure 9. Mapping of AWS data lake solutions in the reference architecture

In summary, AWS's data lake solution covers all the functions except quality management and data governance. In practice, quality management and data governance are strongly tied to an enterprise's organizational structure and business type and require a lot of custom development, so it is understandable that a general-purpose solution leaves them out. There are also excellent open source projects in this space, such as Apache Griffin; if you have strong requirements for quality management and data governance, you can build on them and customize them yourself.

4.2 Huawei Data Lake Solution


                                                 Figure 10. Huawei Data Lake Solution

The information on Huawei's data lake solution comes from Huawei's official website. The related products currently visible there include Data Lake Insight (DLI) and the intelligent data lake operation platform DAYU. DLI is roughly the equivalent of AWS's Lake Formation, Glue, Athena, and EMR (Flink & Spark) combined. I could not find an overall architecture diagram for DLI on the official website, so I drew one according to my own understanding, keeping the form as consistent as possible with the AWS solution for ease of comparison; if you know Huawei DLI well, please do not hesitate to correct me.

Huawei's data lake solution is fairly complete, with DLI covering all the core functions of building the lake, processing data, managing data, and serving applications. DLI's biggest strength is the completeness of its analysis engines, which include SQL-based interactive analysis and a Spark+Flink based unified stream/batch processing engine. For core storage, DLI is backed by the built-in OBS, which basically matches the capabilities of AWS S3. In terms of upstream and downstream ecosystem, Huawei's data lake solution is even more complete than AWS's: for external data sources it supports almost all the data source services currently offered on HUAWEI CLOUD.

DLI can interface with Huawei's CDM (Cloud Data Migration) service and DIS (Data Access Service):

  1. With DIS, DLI can define various data ingestion points, which Flink jobs can use as sources or sinks;

  2. With CDM, DLI can even ingest data from on-premises IDCs and third-party clouds.

To better support advanced data lake functions such as data integration, data development, data governance, and quality management, HUAWEI CLOUD provides the DAYU platform. DAYU is the implementation of Huawei's data lake governance and operation methodology: it covers the core processes of data lake governance and provides tool support for them; Huawei's official documentation even gives suggestions on how to set up a data governance organization. The implementation of DAYU's data governance methodology is shown in Figure 11 (from the HUAWEI CLOUD official website).

                                            Figure 11 DAYU Data Governance Methodology Process

It can be seen that, in essence, the DAYU data governance methodology is an extension of traditional data warehouse governance methodology onto data lake infrastructure: from a data model perspective it still includes a source layer, a multi-source integration layer, and a detail data layer, fully consistent with a data warehouse. Quality rules and transformation models are generated according to the data model and indicator model, and DAYU connects to DLI and directly calls DLI's data processing services to carry out data governance.

HUAWEI CLOUD's data lake solution covers the full data processing lifecycle, explicitly supports data governance, and provides model- and indicator-based data governance process tools; it is clearly evolving in the direction of "lake-warehouse integration".

4.3 Alibaba Cloud Data Lake Solution

Alibaba Cloud has many data products. Since I currently work in the database BU, this section focuses on how to build a data lake with the database BU's products; other cloud products will only be touched on briefly. Alibaba Cloud's database-based data lake solution is more focused, concentrating on two scenarios: data lake analytics and federated analytics. The solution is shown in Figure 12.


                                             Figure 12. Alibaba Cloud Data Lake Solution

The solution likewise uses OSS as the centralized storage of the data lake. In terms of data sources, it currently supports all Alibaba Cloud databases, including OLTP, OLAP, and NoSQL databases. The key points are as follows:

  1. Data access and relocation. For building the lake, DLA's Formation component provides metadata discovery and "one-click lake building". At the time of writing, one-click lake building only supports full loads, but incremental lake building based on binlogs is already under development and expected to launch soon. Incremental lake building will greatly improve the freshness of the data in the lake while minimizing the pressure on the source business databases. Note that DLA Formation is an internal component and is not exposed externally.

  2. Data catalog. DLA provides a Meta data catalog component for unified management of the data assets in the lake, regardless of whether the data is "in the lake" or "outside the lake". The Meta data catalog is also the unified metadata entry point for federated analysis.

  3. Built-in computing engines. DLA provides a SQL engine and a Spark engine, both deeply integrated with the Meta data catalog so that they can easily obtain metadata. Based on Spark's capabilities, the DLA solution supports batch processing, stream computing, and machine learning.

  4. External ecosystem. Besides ingesting and aggregating various heterogeneous data sources, DLA is deeply integrated with the cloud-native data warehouse (formerly ADB) for outbound access: on the one hand, DLA's processing results can be pushed to ADB at any time to serve real-time, interactive, and ad hoc complex queries; on the other hand, data in ADB can easily be written back to OSS using external tables. Based on DLA, the various heterogeneous data sources on Alibaba Cloud can be fully connected and data can flow freely.

  5. Data integration and development. Alibaba Cloud's data lake solution offers two options: DataWorks or DMS. Either one provides visual process orchestration, task scheduling, and task management. For data lifecycle management, DataWorks' data map capability is relatively more mature.

  6. Data management and data security. DMS provides strong capabilities here, with management granularity of "database-table-column-row", fully supporting enterprise-grade data security and control requirements. Beyond permission management, DMS's finer touch is extending the database-oriented DevOps concept to the data lake, making data lake operations and development more refined.

The data application architecture of the whole solution is further refined as shown in the figure below.

                                      Figure 13. Alibaba Cloud data lake data application architecture

Reading from left to right along the data flow, data producers generate various types of data (off-cloud, on-cloud, and on other clouds) and upload it, using various tools, to generic/standard data sources such as OSS, HDFS, and databases. Over these sources, DLA can build the lake through its data discovery, data ingestion, and data migration capabilities.

For the data "into the lake", DLA provides data processing capabilities based on SQL and Spark, and can provide external visualization data integration and data development capabilities based on Dataworks/DMS; in terms of external application service capabilities, DLA provides standardized JDBC interfaces , you can directly connect to various reporting tools, large-screen display functions, etc. The feature of Alibaba Cloud's DLA is that it is backed by the entire Alibaba Cloud database ecosystem, including OLTP, OLAP, NoSQL and other databases, and provides SQL-based data processing capabilities. For traditional enterprise database-based development technology stacks, the cost of transformation is relatively high. Lower, the learning curve is relatively flat.

Another feature of Alibaba Cloud's DLA solution is "cloud-native lake-warehouse integration". In the big data era the traditional enterprise data warehouse remains irreplaceable for reporting applications, but it cannot meet the flexibility that data analysis and processing now demand.

Therefore, we recommend treating the data warehouse as an upper-layer application of the data lake: the data lake is the single authoritative store of an enterprise/organization's raw business data; it processes that raw data according to the needs of the various business applications to form reusable intermediate results; and when the schema of those intermediate results is relatively stable, DLA can push them into the data warehouse for warehouse-based business applications. Alongside DLA, Alibaba Cloud also provides a cloud-native data warehouse (formerly ADB), and the two are deeply integrated on the following two points:

  • A shared SQL parsing engine. DLA's SQL is fully syntax-compatible with ADB's SQL, which means developers can use a single technology stack to develop both data lake applications and data warehouse applications.

  • Built-in access to OSS on both sides. OSS is DLA's native storage, and ADB can easily access structured data on OSS through external tables. With external tables, data can move freely between DLA and ADB, achieving genuine lake-warehouse integration.

The combination of DLA+ADB truly delivers cloud-native lake-warehouse integration (what "cloud native" means is beyond the scope of this article). In essence, DLA can be regarded as an extended source-aligned (ODS) layer of the data warehouse. Compared with the source layer of a traditional warehouse, it:

  1. can store all kinds of structured, semi-structured, and unstructured data;

  2. can connect to various heterogeneous data sources;

  3. can discover, manage, and synchronize metadata;

  4. has built-in SQL/Spark computing engines and stronger data processing power to meet diverse processing needs;

  5. has full lifecycle management over the full data set. A lake-warehouse integrated solution based on DLA+ADB thus covers the capabilities of "big data platform + data warehouse" at the same time.

Another important capability of DLA is building a data flow system that "extends in all directions" and offering it with a database-like experience, whether the data is on or off the cloud, inside or outside the organization. With the help of the data lake, data no longer faces barriers between systems and can flow in and out freely; more importantly, this flow is governed, and the data lake completely records how the data moves.

4.4 Azure Data Lake Solution

Azure's data lake solution includes data lake storage, an interface layer, resource scheduling, and a computing engine layer, as shown in Figure 15 (from the Azure official website). The storage layer is built on Azure object storage and likewise supports structured, semi-structured, and unstructured data.

The interface layer is WebHDFS; notably, the HDFS interface is implemented on top of Azure object storage, a capability Azure calls "multi-protocol access on Data Lake Storage". Resource scheduling is based on YARN. For computing engines, Azure provides U-SQL, Hadoop, Spark, and other processing engines.


                       Figure 15. Azure Data lake analysis architecture
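A hedged sketch of the "multi-protocol access" idea above: the same storage account can be addressed through the Data Lake (dfs) endpoint with a filesystem-style API, here via the azure-storage-file-datalake Python SDK. The account name, credential, filesystem, and path are hypothetical placeholders.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Connect to the Data Lake (dfs) endpoint of a (placeholder) storage account.
service = DataLakeServiceClient(
    account_url="https://examplelake.dfs.core.windows.net",
    credential="<storage-account-key>",
)
fs = service.get_file_system_client(file_system="raw")
file_client = fs.get_file_client("events/2020/05/01/events.json")

# Read the object through the filesystem-style API.
data = file_client.download_file().readall()
print(len(data), "bytes read from the lake")
```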

What is special about Azure is the support it provides for customer development based on Visual Studio.

  1. Development tooling deeply integrated with Visual Studio. Azure recommends U-SQL as the development language for data lake analytics applications, and Visual Studio provides a complete development environment for it. To reduce the complexity of developing for a distributed data lake system, Visual Studio organizes work as projects: when developing U-SQL you can create a "U-SQL database project", in which coding and debugging are very convenient, and a wizard is provided to publish the developed U-SQL scripts to the production environment. U-SQL can also be extended with Python and R to meet custom development needs.

  2. Adaptation of multiple computing engines: SQL, Apache Hadoop, and Apache Spark. Hadoop here includes Azure's managed HDInsight service, and Spark includes Azure Databricks.

  3. The ability to convert jobs between different engines. Microsoft recommends U-SQL as the default development tool for the data lake and provides various conversion tools to translate between U-SQL scripts and Hive, Spark (HDInsight & Databricks), and Azure Data Factory data flows.

4.5 Summary

This article discusses data lake solutions rather than any single product from any cloud vendor. We briefly summarized the solutions along the dimensions of data ingestion, data storage, data computing, data management, and application ecosystem, as in the table below.

Due to space constraints, the data lake solutions of other well-known cloud vendors, such as Google and Tencent, are not covered here. According to their official websites, their data lake solutions are relatively simple and largely conceptual, with "object storage + Hadoop (EMR)" as the recommended implementation.

In fact, a data lake should not be viewed simply as a technology platform, and there are many ways to implement one. The key to evaluating the maturity of a data lake solution is the data management capability it provides, including but not limited to metadata, the data asset catalog, data sources, data processing tasks, the data lifecycle, data governance, and permission management, as well as its ability to connect with the surrounding ecosystem.

5. Typical data lake application cases

5.1 Advertising data analysis

In recent years the cost of acquiring traffic has kept rising, and the exponential growth in the cost of acquiring customers through online channels poses a severe challenge to every industry. Against the backdrop of ever more expensive Internet advertising, the strategy of simply buying traffic to attract new users no longer works, and optimizing front-end traffic alone has reached its limits. Using data tools to improve conversion after traffic reaches the site, and running refined operations on every link of the advertising chain, is a more direct and effective way to change the situation; after all, improving the conversion rate of advertising traffic must rely on big data analysis.

To provide more decision support, it is necessary to collect and analyze more event (tracking) data, including but not limited to channel, delivery time, and target audience, and to analyze it against click-through metrics so as to give faster and better recommendations and achieve high efficiency and high output. For the collection, storage, analysis, and decision support of the structured, semi-structured, and unstructured data that advertising generates across many dimensions, media, and campaigns, data lake analytics solutions have therefore been very warmly received by advertisers and publishers as a new generation of technology.

DG is a world-leading international intelligent marketing service provider. Based on advanced advertising technology, big data, and operational capabilities, DG provides customers with high-quality global user acquisition and traffic monetization services. DG decided to build its IT infrastructure on the public cloud from the start. Initially it chose AWS, storing its advertising data as a data lake in S3 and running interactive analysis through Athena. However, with the rapid development of Internet advertising, the mobile ad delivery and tracking system had to solve several key problems:

  1. Concurrency and traffic peaks. Traffic spikes are common in advertising; instantaneous clicks can reach tens or even hundreds of thousands, so the system must scale very well to respond quickly and process every click.

  2. Real-time analysis of massive data. To monitor ad delivery, the system needs to analyze every user click and activation in real time and simultaneously forward the relevant data to downstream media.

  3. Rapidly growing data volumes. Business log data is generated and uploaded continuously every day, and impression, click, and push data are processed continuously; daily new data is already around 10-50 TB, which places higher demands on the platform. The challenge is how to efficiently complete offline/near-real-time statistics on the advertising data and aggregate it along the dimensions advertisers care about.

In the face of these three business challenges, and with DG's daily incremental data growing rapidly (the daily data scan volume now exceeds 100 TB), continuing to use Athena on AWS ran into bottlenecks in the bandwidth available for Athena to read S3 data and in analysis latency. After careful testing and analysis, DG finally decided to migrate from the AWS platform to Alibaba Cloud. The new architecture is shown below:


                                     Figure 16. The reformed advertising data lake solution architecture

After moving from AWS to Alibaba Cloud, we designed a "Data Lake Analytics + OSS" analytics setup for this customer to cope with business peaks and valleys. On the one hand it easily handles ad hoc analysis from brand customers; on the other hand, the computing power of Data Lake Analytics is used for monthly and quarterly advertising analysis, accurately calculating how many campaigns a brand runs, broken down by media, market, channel, and DMP. This further improved the sales conversion that the Jiahe intelligent traffic platform brings to brand marketing.

In terms of the total cost of ownership for ad placement and analysis, the serverless elastic service provided by Data Lake Analytics is billed on demand with no need to purchase fixed resources, which greatly reduces operations and usage costs.


                                          Figure 17 Schematic diagram of data lake deployment

Overall, after switching from AWS to Alibaba Cloud, DG saved substantially on hardware, labor, and development costs. Thanks to the serverless DLA cloud service, DG does not need to invest heavily up front in servers, storage, and other hardware, nor buy large amounts of cloud services at once; its infrastructure scales entirely on demand, adding services when demand rises and shrinking when it falls, which improves capital efficiency.

The second significant benefit of the Alibaba Cloud platform is improved performance. During DG's rapid business growth and the subsequent onboarding of multiple business lines, traffic to its mobile advertising system often grew explosively, and the original AWS setup hit a serious bottleneck in the bandwidth for Athena to read S3 data, with analysis times getting longer and longer. The Alibaba Cloud DLA and OSS teams carried out extensive optimization; at the same time, DLA's analysis runs on a compute engine shared with AnalyticDB (which ranks first worldwide on TPC-DS) and is dozens of times faster than native Presto, which also greatly improved analysis performance for DG.

5.2 Game Operation Analysis

A data lake is a kind of big data infrastructure with excellent TCO. For many fast-growing game companies, the data associated with a hit game often grows extremely quickly in a short period, while the technology stack of the company's developers can hardly keep up with that volume and growth rate in the short term; the explosively growing data then cannot be used effectively. The data lake is a technology choice for exactly this kind of problem.

YJ is a fast-growing game company that hopes to use player behavior data for in-depth analysis to guide game development and operations. The core logic behind the analysis is that as competition in the game industry intensifies, players demand higher quality and game project lifecycles get shorter, which directly affects a project's return on investment; refined data operations can effectively extend a project's lifecycle and precisely steer the business at each stage.

As traffic costs keep rising, building an economical and efficient refined data operations system to support the business has become increasingly important. Such a system needs supporting infrastructure, and choosing that infrastructure is what the company's technical decision makers had to think through. Their considerations included:

  1. Sufficient elasticity. Games often see short-term bursts in which data volume surges, so whether the infrastructure can absorb explosive data growth and meet elasticity requirements, in both compute and storage, is a key consideration.

  2. Sufficient cost-effectiveness. User behavior data usually has to be analyzed and compared over long windows; retention, for example, often needs to be looked at over 90 or even 180 days. How to store massive data for a long time in the most cost-effective way is therefore an important consideration.

  3. Sufficient analytical capability and extensibility. User behavior is mostly captured in event (tracking) data, which must be joined and analyzed together with structured data such as user registration, login, and billing records; the analysis layer therefore needs at least big data ETL capabilities, access to heterogeneous data sources, and modeling capabilities for complex analysis.

  4. A match for the company's existing technology stack and future hiring. For YJ, an important factor in technology selection is its engineers' skills: most of YJ's technical team is only familiar with traditional database development, i.e. MySQL; staffing is tight, with only one engineer doing data operations analysis; and there is simply no capacity to build big data analysis infrastructure independently in a short time. From YJ's point of view, it is best if most analysis can be done in SQL, and in the hiring market there are far more SQL developers than big data engineers. Based on the customer's situation, we helped them transform their existing setup.


                                      Figure 18. Scheme before retrofit

Before the transformation, all the customer's structured data was stored in a high-spec MySQL instance, while player behavior data was collected into Log Service (SLS) via Logtail and then delivered from Log Service to OSS and Elasticsearch respectively. The problems with this architecture were:

  1. Behavioral data and structured data were completely separated and could not be analyzed together;

  2. The behavioral data could only be searched and retrieved, with no deep mining or analysis possible;

  3. OSS was used only as a storage resource, so the value of the data in it was not being mined.

In fact, the customer's existing architecture already had the prototype of a data lake: the full data set was already being saved in OSS, and what was needed was the ability to analyze the data in OSS. Moreover, the SQL-based data processing model of the data lake also matched the customer's development stack. In summary, we adjusted the customer's architecture as follows to help them build a data lake.


                              Figure 19. Transformed data lake solution

In general, we did not change the customer's data flow; we added DLA on top of OSS to perform secondary processing of the OSS data. DLA provides a standard SQL computing engine and supports access to various heterogeneous data sources; after processing the OSS data with DLA, the result is data that the business can use directly. DLA, however, cannot support interactive analysis scenarios with low latency requirements, so we introduced ADB, a cloud-native data warehouse, to address the latency of interactive analysis. At the same time, we introduced QuickBI on the front end as the customer's visual analysis tool. The YJ solution is a classic implementation, in the game industry, of the integrated lake-and-warehouse solution shown in Figure 14.
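To make DLA's role more concrete, here is a minimal sketch of the kind of cross-source analysis this architecture enables: a 90-day retention query that joins behavioral events (backed by OSS) with registration data synchronized from MySQL. DLA is accessed through a MySQL-compatible endpoint, so an ordinary Python MySQL client can submit SQL; the endpoint, schema, table and column names below are hypothetical, and the exact date functions depend on the engine's SQL dialect.

```python
import pymysql

# Hypothetical DLA endpoint and credentials; DLA exposes a MySQL-compatible protocol,
# so an ordinary MySQL client library can submit SQL to it.
conn = pymysql.connect(
    host="dla-endpoint.example.aliyuncs.com",  # hypothetical endpoint
    port=3306,
    user="dla_user",
    password="***",
    database="game_lake",  # hypothetical schema mapping OSS and MySQL sources
)

# Join OSS-backed behavior events with the registration table synced from MySQL
# to compute 90-day retention per registration date (names are illustrative).
sql = """
SELECT r.reg_date,
       COUNT(DISTINCT CASE
                WHEN e.event_date = DATE_ADD(r.reg_date, INTERVAL 90 DAY)
                THEN e.user_id END) * 1.0
       / COUNT(DISTINCT r.user_id) AS d90_retention
FROM user_register r
LEFT JOIN behavior_events e ON e.user_id = r.user_id
GROUP BY r.reg_date
ORDER BY r.reg_date
"""

with conn.cursor() as cur:
    cur.execute(sql)
    for reg_date, d90 in cur.fetchall():
        print(reg_date, d90)
conn.close()
```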

YM is a data intelligence service provider that offers a series of data analysis and operation services to small and medium-sized merchants. The technical logic of its implementation is shown in the following figure.


                                                   Figure 20. Schematic diagram of YM intelligent data service SaaS model

The platform provides multi-terminal SDKs (covering merchants' web pages, apps, mini programs, and other access channels) to collect various kinds of buried-point data, and offers unified data access and data analysis services in SaaS form. By using these analysis services, merchants can conduct fine-grained data analysis and complete basic functions such as behavior statistics, customer profiling, customer segmentation, and advertisement placement monitoring. However, this SaaS model has certain problems:

  1. Because merchant types and needs are diverse, it is difficult for the platform's SaaS analysis functions to cover all types of merchants or to meet their customized needs; for example, some merchants focus on sales, some on customer operations, and some on cost optimization, and no single set of functions can satisfy them all.

  2. Some advanced analysis functions, such as customer segmentation based on custom tags or customer-defined extensions, cannot be satisfied by the unified data analysis service; in particular, custom tags that depend on merchant-specific algorithms are beyond what a shared service can provide.

  3. Data asset management needs. In the era of big data, it has become a consensus that data is an asset of an enterprise/organization. How to let the data belonging to merchants accumulate reasonably and over the long term is also something the SaaS service needs to consider.

To address these issues, we introduced the data lake model into the basic architecture above, so that the data lake serves as the supporting facility for merchants to accumulate data, output models, and run analysis and operations. The SaaS data intelligence service model after introducing the data lake is shown below.


                                                        Figure 21. Data lake-based data intelligence service

As shown in Figure 21, the platform provides a one-click lake building service for each user, and merchants use it to build their own data lakes. On the one hand, the merchant's structured business data is synchronized into the data lake; on the other hand, all buried-point data belonging to the merchant is fully synchronized into the lake, with daily incremental data archived in "T+1" mode (a minimal archiving sketch is given after the list below). On top of the traditional data analysis services, the data-lake-based service model gives merchants three capabilities: data assetization, analysis modeling, and service customization:

  1. Data assetization capability. Using the data lake, merchants can continuously accumulate their own data; how long the data is kept and how much is spent on it are entirely up to the merchant. The data lake also provides data asset management capability: besides raw data, merchants can store processed intermediate data and result data in separate categories, which greatly increases the value of buried-point data.

  2. Analysis modeling capability. The data lake holds not only raw data but also the model (schema) of the buried-point data, which reflects the platform's abstraction of the business logic behind that data. Through the data lake, the platform outputs not only the raw data as an asset but also the data model. With the help of this model, merchants can understand more deeply the user behavior behind the buried-point data, gaining better insight into customer behavior and user needs.

  3. Service customization capability. With the data integration and data development capabilities provided by the data lake, and based on an understanding of the buried-point data model, merchants can customize their data processing flows, iteratively process the raw data, extract valuable information from it, and ultimately obtain analysis value that goes beyond what the original SaaS services provide.
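As a concrete illustration of the "T+1" archiving mentioned above, the following is a minimal sketch of a daily job that appends the previous day's buried-point data into a merchant's lake as a date partition. It assumes a Spark environment with an OSS-compatible Hadoop connector configured; the bucket names, paths, and fields are hypothetical.

```python
from datetime import date, timedelta
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("t_plus_1_archive").getOrCreate()

# T+1: the job running on day T archives the data generated on day T-1.
dt = (date.today() - timedelta(days=1)).isoformat()

# Hypothetical staging area where the platform lands the merchant's raw buried-point events.
raw = spark.read.json(f"oss://merchant-staging/events/{dt}/")

# Keep the raw fields untouched (the lake stores raw data in full) and add a partition column.
(raw.withColumn("dt", F.lit(dt))
    .write.mode("append")
    .partitionBy("dt")
    .parquet("oss://merchant-001-lake/raw/buried_point_events/"))  # hypothetical lake path
```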

6. The basic process of data lake construction

Personally, I think the data lake is a more complete big data infrastructure than the traditional big data platform, and a well-built data lake sits closer to the customer's business. All the capabilities a data lake provides beyond a big data platform (metadata, data asset catalogs, permission management, data lifecycle management, data integration and data development, data governance and quality management, and so on) exist to bring it closer to the business and make it easier for customers to use. The basic technical features the data lake emphasizes, such as elasticity, independent scaling of storage and compute, a unified storage engine, and multi-mode computing engines, likewise serve business needs and give business parties the most cost-effective TCO.

The construction of a data lake should be closely integrated with the business, yet it should differ from building a traditional data warehouse or even the currently popular data middle platform. The difference is that a data lake should be built in a more agile way: "use while building, govern while using". To better understand this agility, let us first look at how traditional data warehouses are built. The industry has proposed two approaches, "top-down" and "bottom-up", put forward by Inmon and Kimball respectively. The specific processes will not be described in detail here (that would take hundreds of pages); only the basic ideas are briefly explained.

  • Inmon proposed the top-down (EDW-DM) construction model: data from operational or transactional systems is extracted, transformed, and loaded through ETL into the ODS layer of the data warehouse; the ODS data is then processed according to a pre-designed EDW (Enterprise Data Warehouse) paradigm and loaded into the EDW. The EDW is usually a general data model for the enterprise/organization and is not convenient for upper-layer applications to analyze directly, so each business unit derives its own data mart layer (DM) from the EDW according to its needs.

    Advantages: easy to maintain and highly integrated. Disadvantages: once the structure is determined, flexibility is limited, and adapting it to the business makes the deployment cycle long. Data warehouses built this way suit relatively mature and stable businesses, such as finance.

  • Kimball proposed the bottom-up (DM-DW) architecture: data from operational or transactional systems is extracted or loaded into the ODS layer; then, using dimensional modeling, multi-dimensional subject data marts (DM) are built on the ODS data. The various DMs are linked through conformed dimensions and together form the enterprise/organization's common data warehouse.

    Advantages: rapid construction, the fastest return on investment, agility and flexibility. Disadvantages: as an enterprise-level resource it is not easy to maintain, the structure is complex, and integrating the data marts is difficult. This approach is often used in small and medium-sized enterprises or in the Internet industry.

The above is, of course, only the theoretical process. In practice, whether the EDW or the DMs are built first, neither approach can avoid a thorough understanding of the data and careful data model design before the warehouse is built; even the currently popular "data middle platform" cannot escape the basic construction process shown in the figure below.


                                      Figure 22. Basic process of data warehouse / data middle platform construction

  1. Data survey. For an enterprise/organization, the initial work is a comprehensive survey of the data within the enterprise/organization, including data sources, data types, data forms, data schemas, total data volume, and data growth. An implicit but important task at this stage is to use the data to further sort out the enterprise's organizational structure and clarify the relationship between the data and that structure, laying the foundation for later decisions on user roles, permission design, and service modes.

  2. Model abstraction. According to the business characteristics of the enterprise/organization, the various types of data are sorted, classified, and divided into domains, forming metadata for data management; on the basis of this metadata, a general data model is constructed.

  3. Data access. Based on the results of the first step, determine the data sources to be accessed; based on the data sources, determine the required data access capabilities and complete the technology selection for data access. The data to be accessed should include at least: data source metadata, raw data metadata, and the raw data itself. All data is classified and stored according to the results of the second step.

  4. Integrated governance. Simply put, the various computing engines provided by the platform are used to process the data, producing intermediate and result data that are properly managed and saved. The platform should have complete data development, task management, and task scheduling capabilities, and should record the data processing process in detail. During governance, additional data models and indicator models are usually formed.

  5. Business support. On the basis of the general model, each business department customizes its own detailed data model, data usage process, and data access services.

This process is too heavy for a fast-growing Internet company and in many cases simply cannot be carried out. The most unrealistic part is the second step, model abstraction: much of the business is still in trial-and-error and exploration, the future direction is unknown, and a general data model cannot be extracted; without a data model, none of the subsequent steps can proceed. This is one of the important reasons why many fast-growing enterprises find data warehouses / data middle platforms hard to implement and unable to meet their needs.

A data lake should be built in a more "agile" way. We recommend the following steps for building a data lake.


                                              Figure 23. Basic process of data lake construction

Compared with Figure 22, there are still five steps, but they have been thoroughly simplified and made far more practical to implement.

  1. Data survey. It is still necessary to find out the basic situation of the data, including data sources, data types, data forms, data schemas, total data volume, and data growth. But that is all that is needed: the data lake stores the raw data in full, so there is no need for in-depth schema design in advance.

  2. Technology selection. Based on the data situation, determine the technology for building the data lake. In practice this step is also fairly simple, because there are already common industry practices. I personally recommend three basic principles: separation of compute and storage, elasticity, and independent scaling. For storage, a distributed object store (such as S3/OSS/OBS) is recommended; for computing engines, focus on batch processing and SQL capability, because in practice these two are the key to data processing (stream computing engines are discussed later). For both compute and storage, serverless forms are recommended first; the architecture can evolve gradually as applications grow, and a dedicated cluster should be considered only when an independent resource pool is truly needed.

  3. Data access. Determine the data sources to be accessed, and complete full data extraction and incremental ingestion.

  4. Application governance. This step is the key to the data lake. I deliberately changed "integrated governance" to "application governance": from the perspective of the data lake, data application and data governance should be integrated and inseparable. Start from data applications, clarify the requirements within those applications, and gradually form business-usable data during ETL; at the same time, form the data model, indicator system, and corresponding quality standards. The data lake emphasizes storing raw data and the exploratory analysis and application of data, but this does not mean the data lake needs no data model; on the contrary, understanding and abstracting the business greatly promotes the development and application of the data lake, while data lake technology keeps data processing and modeling agile enough to adapt quickly to business development and change. (A minimal ETL sketch of this step follows the list.)
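As a concrete illustration of the application governance step, here is a minimal sketch of an application-driven ETL job: it starts from raw buried-point events already in the lake and derives a daily-active-users indicator table that the business can query directly. The paths, fields, and indicator are hypothetical; in a real data lake this job would also be registered with the lake's task scheduling, metadata, and quality management.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dau_indicator_etl").getOrCreate()

# Raw buried-point events previously ingested into the lake (hypothetical path and schema).
events = spark.read.parquet("oss://company-lake/raw/buried_point_events/")

# Application-driven ETL: the business needs a daily-active-users indicator,
# so we aggregate the raw events into a small, directly usable result table.
dau = (events
       .filter(F.col("event_type") == "app_launch")  # hypothetical event type
       .groupBy("dt")
       .agg(F.countDistinct("user_id").alias("dau")))

# Persist the indicator back into the lake's governed "result data" area.
dau.write.mode("overwrite").parquet("oss://company-lake/dm/indicator_dau/")
```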

From a technical point of view, what distinguishes a data lake from a big data platform is that, in order to support full-lifecycle data management and application, the data lake needs fairly complete capabilities for data management, category management, process orchestration, task scheduling, data lineage, data governance, quality management, permission management, and so on. In terms of computing, current mainstream data lake solutions support both SQL and programmable batch processing (for machine learning, the built-in capabilities of Spark or Flink can be used); in terms of processing paradigm, almost all adopt a workflow (directed acyclic graph) pattern and provide a corresponding integrated development environment. For stream computing, the various data lake solutions currently take different approaches. Before discussing them, let us first classify stream computing:

  1. Mode 1: real-time mode. This mode processes data record by record or in micro-batches; it is common in online businesses such as risk control, recommendation, and early warning.

  2. Mode 2: stream-like mode. This mode needs to read the data changed after a specified point in time, read a specific version of the data, or read the current latest data; it is a stream-like access pattern and is common in data exploration, such as analyzing daily active users, retention, and conversion over a given period.

The essential difference between the two is that in mode 1 the data being processed is usually not yet stored in the data lake and only flows through the network or memory, while in mode 2 the data has already been stored in the data lake. In summary, I personally recommend the following pattern:


                                                         Figure 24. Schematic diagram of data flow in the data lake


As shown in Figure 24, when the data lake needs mode-1 processing capability, Kafka-like middleware should still be introduced as the infrastructure for data forwarding. A complete data lake solution should be able to stream raw data into Kafka (or a similar component), and the stream computing engine should be able to read data from it. After processing, the engine writes the results to OSS/RDBMS/NoSQL/DW as the application requires. In a sense, the mode-1 stream computing engine is not necessarily an inseparable part of the data lake; it only needs to be easy to introduce when an application requires it. However, a few points should be noted (a minimal streaming sketch follows these notes):

  1. The streaming engine still needs to be able to easily read the metadata of the data lake;

  2. Streaming engine tasks also need to be integrated into the task management of the data lake;

  3. Streaming tasks still need to be incorporated into unified permission management.
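Below is a minimal sketch of the mode-1 pattern described above, using Spark Structured Streaming as the stream computing engine: it reads raw events from a Kafka topic and writes the processed stream to a lake path for downstream use. The broker address, topic, paths, and fields are hypothetical, and the Spark-Kafka integration package plus an OSS-compatible connector are assumed to be available.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mode1_stream_demo").getOrCreate()

# Read raw events forwarded through Kafka-like middleware (hypothetical broker and topic).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka-broker.example.com:9092")
       .option("subscribe", "player_events")
       .load())

# Kafka delivers the payload as binary; keep it as a string for this sketch.
events = raw.select(F.col("value").cast("string").alias("event_json"),
                    F.col("timestamp"))

# Write the processed stream to the sink the application needs (here a lake path on OSS);
# mode-1 results could equally go to an RDBMS, a NoSQL store, or a data warehouse.
query = (events.writeStream
         .format("parquet")
         .option("path", "oss://game-datalake/streaming/player_events/")
         .option("checkpointLocation", "oss://game-datalake/checkpoints/player_events/")
         .outputMode("append")
         .start())

query.awaitTermination()
```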

Mode 2 is essentially closer to batch processing. Many classic big data components now provide support for it, such as Hudi, Iceberg, and Delta Lake, all of which work with classic computing engines such as Spark and Presto. Taking Hudi as an example, through special table types (COW/MOR) it provides the ability to read snapshot data (a specified version), incremental data, and near-real-time data. At present, AWS, Tencent, and others have integrated Hudi into their EMR services, and Alibaba Cloud's DLA is also planning to launch DLA-on-Hudi capabilities.
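As a sketch of what mode-2 access looks like in practice, the following reads the incremental changes of a (hypothetical) Hudi table with Spark and aggregates them. The option names follow the Hudi Spark DataSource and may vary slightly across Hudi versions; the table path and commit instant are illustrative.

```python
from pyspark.sql import SparkSession

# Kryo serialization is the configuration commonly recommended for Hudi workloads.
spark = (SparkSession.builder
         .appName("hudi_incremental_read")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

base_path = "oss://game-datalake/hudi/player_events"  # hypothetical Hudi table location
begin_time = "20240101000000"                          # read commits after this instant (illustrative)

# Incremental query: only the data written after begin_time is returned.
incremental = (spark.read.format("hudi")
               .option("hoodie.datasource.query.type", "incremental")
               .option("hoodie.datasource.read.begin.instanttime", begin_time)
               .load(base_path))

incremental.createOrReplaceTempView("events_delta")
spark.sql("SELECT event_date, COUNT(DISTINCT user_id) AS dau "
          "FROM events_delta GROUP BY event_date").show()
```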

Returning to the first chapter of this article: the main users of a data lake are data scientists and data analysts, for whom exploratory analysis and machine learning are the common workloads; stream computing (real-time mode) mostly serves online business and, strictly speaking, is not a hard requirement for the data lake's target users. However, real-time stream computing is an important part of most Internet companies' online business today, and the data lake, as the centralized data store of an enterprise/organization, needs to keep its architecture extensible enough that stream computing capability can be incorporated easily.

  5. Business support. Although most data lake solutions provide standard access interfaces such as JDBC, and popular BI reporting and dashboard tools can access the data in the lake directly, in practice we still recommend pushing processed data from the data lake into data engines that support online services, so that applications get a better experience.
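As an illustration of this recommendation, here is a minimal sketch that pushes an aggregated result table from the lake into a MySQL-compatible serving database via Spark's JDBC writer, so that dashboards query a low-latency engine rather than the lake itself. The endpoint, credentials, and table names are hypothetical, and the MySQL JDBC driver is assumed to be available to Spark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("push_results_to_serving_db").getOrCreate()

# Hypothetical result table produced in the lake by earlier ETL jobs.
report = spark.read.parquet("oss://game-datalake/dm/daily_retention/")

# Push the aggregated result into a MySQL-compatible serving database
# (hypothetical endpoint and credentials) for online dashboards to query.
(report.write
 .format("jdbc")
 .option("url", "jdbc:mysql://serving-db.example.com:3306/report")
 .option("dbtable", "daily_retention")
 .option("user", "report_writer")
 .option("password", "***")
 .mode("overwrite")
 .save())
```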

7. Summary

As the infrastructure for the new generation of big data analysis and processing, the data lake needs to go beyond the traditional big data platform. Personally, I think the following are the likely directions in which data lake solutions will develop.

1. Cloud-native architecture. Opinions differ on what exactly a cloud-native architecture is, and a unified definition is hard to find, but for the data lake scenario I personally see three characteristics:

(1) Storage and computing are separated, and computing power and storage capacity can be independently expanded;

(2) Support for multi-modal computing engines: SQL, batch processing, stream computing, machine learning, etc.;

(3) Provide serverless services to ensure sufficient elasticity and support pay-as-you-go.

2. Sufficient data management capability. The data lake needs to provide stronger data management capabilities, including but not limited to data source management, data category management, processing flow orchestration, task scheduling, data lineage, data governance, quality management, and permission management.

3. The capability of big data, the experience of a database. Today the vast majority of data analysts only have experience using databases; big data platforms are powerful but not user-friendly. Data scientists and data analysts should focus on the data, the algorithms, the models, and their relationship to business scenarios, rather than spending large amounts of time and energy learning big data platform development. For the data lake to develop quickly, providing a good user experience is the key. SQL-based database application development is deeply ingrained, and how to expose the capabilities of the data lake through SQL is a major direction for the future.

4. Complete data integration and data development capabilities. Management of and support for heterogeneous data sources, support for full/incremental migration of heterogeneous data, and support for various data formats all need continuous improvement, along with a complete, visual, and extensible integrated development environment.

5. Deep integration with the business. The composition of a typical data lake architecture has basically become an industry consensus: distributed object storage + multi-modal computing engines + data management. The key to a successful data lake solution lies in data management: whether it is the management of raw data, data categories, data models, data permissions, or processing tasks, none of it can be separated from adaptation to and integration with the business. In the future, more and more industry-specific data lake solutions will emerge, forming a healthy interplay with data scientists and data analysts. How to preset industry data models, ETL processes, analysis models, and customized algorithms in data lake solutions may become a key point of differentiation in the data lake field.
