The concept, characteristics, architecture, and cases of data lakes

This article has seven sections: 1. What is a data lake; 2. The basic characteristics of a data lake; 3. The basic architecture of a data lake; 4. Data lake solutions from various vendors; 5. Typical data lake application scenarios; 6. The basic process of data lake construction; 7. Summary. Given the limits of my own knowledge, mistakes are inevitable; readers are welcome to discuss, criticize, and correct.

1. What is a data lake

The data lake is a hot concept at present, and many enterprises are building or planning to build their own data lakes. Before planning one, however, it is crucial to figure out what a data lake actually is, clarify the basic components of a data lake project, and only then design its basic architecture. As for what a data lake is, there are several different definitions.

Wikipedia describes a data lake as a system or repository that stores data in its natural/raw format, usually as object blobs or files. It holds copies of raw data from source systems as well as data transformed for various tasks, including structured data from relational databases (rows and columns), semi-structured data (such as CSV, logs, XML, JSON), unstructured data (such as emails, documents, PDFs), and binary data (such as images, audio, video).

AWS defines a data lake as a centralized repository that allows you to store all structured and unstructured data at any scale.

Microsoft's definition is vaguer still. Rather than stating what a data lake is, it defines one by its capabilities: everything that makes it easier for developers, data scientists, and analysts to store and process data, allowing users to store data of any size, type, and speed and to perform all types of analysis and processing across platforms and languages.

There are actually many definitions of data lakes, but they basically revolve around the following characteristics.

1. The data lake needs to provide sufficient storage capacity to hold all of the data of an enterprise/organization.

2. Data lakes can store massive amounts of data of any type, including structured, semi-structured, and unstructured data.

3. The data in the data lake is raw data, a complete copy of the business data; it preserves the original form the data had in the business systems.

4. The data lake needs comprehensive data management capabilities (complete metadata), able to manage all data-related elements, including data sources, data formats, connection information, data schemas, permission management, etc.

5. The data lake needs to have diversified analysis capabilities, including but not limited to batch processing, streaming computing, interactive analysis, and machine learning; at the same time, it also needs to provide certain task scheduling and management capabilities.

6. The data lake needs comprehensive data life cycle management. It must store not only the raw data but also the intermediate results of various analysis and processing steps, and record the processing history completely, so that users can trace in detail how any piece of data was produced.

7. The data lake needs complete data acquisition and data publishing capabilities. It must support a variety of data sources, obtain full/incremental data from them, and store it in a standardized way; it must also be able to push analysis and processing results to storage engines appropriate for different application access requirements.

8. Support for big data, including ultra-large-scale storage and scalable large-scale data processing capabilities.

In summary, I personally think a data lake should be an evolving and scalable infrastructure for big data storage, processing, and analysis. It is data-oriented, achieving full ingestion, full storage, multi-mode processing, and full life cycle management of data from any source, at any speed, of any scale, and of any type; and through integration with various external heterogeneous data sources, it supports all kinds of enterprise-level applications.


Figure 1. Schematic diagram of the basic capabilities of a data lake

Two more points need to be pointed out here:

1) Scalability refers to the scalability of both scale and capability. The data lake must not only provide "sufficient" storage and computing capacity as data volumes grow; it must also be able to add new data processing capabilities as needed. For example, a business may only need batch processing at first, but as it develops it may require interactive ad-hoc analysis; and as its latency requirements keep rising, it may need real-time analysis and machine learning.

2) Data-oriented means the data lake should be simple and easy to use, freeing users from complex IT infrastructure operation and maintenance so that they can focus on business, models, algorithms, and data. Data lakes are aimed at data scientists and analysts. At present, cloud native is arguably the ideal way to build a data lake; this point will be discussed in detail in the section "Basic Architecture of Data Lake".

2. The basic characteristics of the data lake

After gaining a basic understanding of the concept, we need to clarify which basic characteristics a data lake should have, especially in comparison with a big data platform or a traditional data warehouse. Before the detailed analysis, let's look at a comparison table from the AWS official website (quoted from: https://aws.amazon.com/cn/big-data/datalakes-and-analytics/what-is-a-data-lake/).

The table above compares data lakes with traditional data warehouses. Personally, I think the characteristics of data lakes can be further analyzed on two levels: data and computing. In terms of data:

1) "Fidelity". In the data lake, an "exactly the same" complete copy of the data in the business system will be stored . The difference from the data warehouse is that a copy of the original data must be saved in the data lake, and no matter the data format, data mode, or data content should be modified. In this regard, the data lake emphasizes the preservation of the "authentic" business data. At the same time, a data lake should be able to store data of any type/format.

2) "Flexibility": One of the points in the above table is "write-in schema" vs "read-in schema". In fact, it is essentially a question of at which stage the design of the data schema occurs. For any data application, the design of the schema is actually essential. Even for some databases such as mongoDB that emphasize "schemaless", the best practice still recommends that the records use the same/similar structure as much as possible. The logic behind the "write-in schema" is that before data is written, it is necessary to determine the schema of the data according to the access method of the business, and then complete the data import according to the established schema. However, this also means that the initial cost of ownership of the data warehouse will be relatively high, especially when the business model is not clear and the business is still in the exploratory stage, the flexibility of the data warehouse is not enough.

The logic behind the schema on read emphasized by the data lake is that business uncertainty is normal: since we cannot predict how the business will change, we keep a degree of flexibility and defer schema design, so that the whole infrastructure can make the data fit the business "on demand" (a minimal sketch of this follows below). Personally, I therefore see "fidelity" and "flexibility" as two sides of the same coin: since business change cannot be predicted, simply keep the data in its most original state, and process it as required once a need arises. Data lakes are thus better suited to innovative enterprises and businesses that change and grow rapidly. They also put higher demands on their users: data scientists and business analysts (equipped with suitable visualization tools) are the target users of data lakes.
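To make the schema-on-read idea concrete, here is a minimal PySpark sketch; the path and field names are hypothetical. The raw JSON events are landed in the lake untouched, and a schema is only applied at query time, so different consumers can project different schemas onto the same raw copy.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw events were copied into the lake as-is; no schema was enforced at write time.
raw_path = "s3a://my-data-lake/raw/click_events/"   # hypothetical location

# Consumer A: let Spark infer a schema for quick exploration.
explore_df = spark.read.json(raw_path)
explore_df.printSchema()

# Consumer B: apply an explicit schema at read time for a specific analysis.
click_schema = StructType([
    StructField("user_id", StringType()),
    StructField("campaign_id", StringType()),
    StructField("event_time", TimestampType()),
])
clicks = spark.read.schema(click_schema).json(raw_path)
clicks.groupBy("campaign_id").count().show()
```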

3) "Manageable": The data lake should provide comprehensive data management capabilities. Since data requires "fidelity" and "flexibility", there will be at least two types of data in the data lake: raw data and processed data. The data in the data lake will continue to accumulate and evolve. Therefore, there will be high requirements for data management capabilities, which should at least include the following data management capabilities: data source, data connection, data format, and data schema (library/table/column/row). At the same time, the data lake is a unified data storage place in a single enterprise/organization, so it also needs to have certain authority management capabilities.

4) "Traceability": A data lake is a storage place for all data in an organization/enterprise. It needs to manage the entire life cycle of data, including the entire process of data definition, access, storage, processing, analysis, and application. To implement a powerful data lake, it is necessary to be able to trace the access, storage, processing, and consumption process of any piece of data in between, and to be able to clearly reproduce the complete process of data generation and flow.

In terms of computing, I personally think data lakes have very broad requirements for computing capability, which entirely depend on the computing requirements of the business.

5) A rich set of computing engines. From batch processing, stream computing, and interactive analysis to machine learning, all kinds of computing engines fall within the scope a data lake should cover. In general, batch engines are used for data loading, transformation, and processing; stream engines handle real-time computation; and interactive analysis engines may be needed for exploratory scenarios. As big data and AI technologies converge, machine learning/deep learning algorithms keep being introduced; for example, the TensorFlow/PyTorch frameworks can already read sample data from HDFS/S3/OSS for training (see the sketch below). For a qualified data lake project, the scalability/pluggability of computing engines should therefore be a basic capability.
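As an illustration of the last point, the sketch below reads Parquet sample data directly from S3-compatible object storage into a PyTorch training loop via pyarrow/s3fs; the libraries are real, while the bucket path and column names are hypothetical. The same files remain readable by batch and interactive engines.

```python
import pyarrow.parquet as pq
import s3fs
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical lake location holding prepared training samples as Parquet.
fs = s3fs.S3FileSystem()   # credentials resolved from the environment
table = pq.read_table("my-data-lake/training/samples.parquet", filesystem=fs)
df = table.to_pandas()

features = torch.tensor(df[["f1", "f2", "f3"]].values, dtype=torch.float32)
labels = torch.tensor(df["label"].values, dtype=torch.float32)

loader = DataLoader(TensorDataset(features, labels), batch_size=256, shuffle=True)
for x, y in loader:
    pass  # feed batches into the model's training step here
```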

6) Multi-modal storage engines. In theory, a data lake should have built-in multi-modal storage engines to satisfy the access requirements of different applications (considering response time, concurrency, access frequency, cost, and so on). In practice, however, the data in the lake is usually not accessed very frequently and the applications on top of it are mostly exploratory, so to achieve an acceptable price/performance ratio, data lakes are usually built on relatively cheap storage engines (such as S3/OSS/HDFS/OBS) and work with external storage engines when needed to meet diverse application requirements.

3. Basic Architecture of Data Lake

A data lake can be regarded as a new generation of big data infrastructure. To better understand its basic architecture, let's first look at how big data infrastructure has evolved.

1) The first stage: offline data processing infrastructure represented by Hadoop. As shown in the figure below, Hadoop is a batch data processing infrastructure with HDFS as its core storage and MapReduce (MR) as its basic computing model. Around HDFS and MR, a series of components emerged that continuously improved the data processing capability of the platform, such as HBase for online key-value access, Hive for SQL, and Pig for workflows. As performance requirements for batch processing kept rising, new computing models were proposed and engines such as Tez, Spark, and Presto appeared, and the MR model gradually evolved into a DAG model. On one hand, the DAG model increases the concurrency the computing model can express: each job is decomposed logically at its aggregation (shuffle) points into stages, each stage consists of one or more tasks, and tasks within a stage can run in parallel, improving the overall parallelism of the job. On the other hand, to reduce the writing of intermediate results to disk, engines such as Spark and Presto cache data in the memory of compute nodes wherever possible, improving the efficiency and throughput of the whole data pipeline.



Figure 2. Schematic architecture of Hadoop
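As a minimal illustration of the DAG model described above (the dataset path and columns are hypothetical), the PySpark job below is split by Spark into stages at its shuffle boundaries, and cache() keeps an intermediate result in executor memory instead of writing it back to disk between steps.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

orders = spark.read.parquet("s3a://my-data-lake/raw/orders/")   # hypothetical path

# A chain of transformations; Spark builds a DAG and cuts it into stages
# at shuffle points (the groupBy calls below), rather than one MR job per step.
cleaned = (orders
           .filter(F.col("status") == "PAID")
           .withColumn("order_date", F.to_date("created_at")))

cleaned.cache()   # keep the intermediate result in memory for reuse

daily_revenue = cleaned.groupBy("order_date").agg(F.sum("amount").alias("revenue"))
top_skus = cleaned.groupBy("sku_id").count().orderBy(F.desc("count")).limit(10)

daily_revenue.show()
top_skus.show()
```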

2) The second stage: the Lambda architecture. As data processing capabilities and requirements kept changing, more and more users found that no matter how much batch processing performance improved, it could not satisfy scenarios with strict real-time requirements. Stream computing engines emerged to fill this gap, such as Storm, Spark Streaming, and Flink. As more applications came online, it became clear that combining batch processing and stream computing could meet the needs of most of them; and users do not really care about the underlying computing model, they simply want both batch and stream processing to return results based on a unified data model. The Lambda architecture was proposed for this, as shown in the figure below. (For convenience, both the Lambda and Kappa architecture diagrams are taken from the Internet.)



Figure 3. Schematic diagram of Lambda architecture

The core idea of the Lambda architecture is to combine stream and batch processing. As shown above, data flows into the platform from left to right and is split into two paths: one processed in batch mode, the other in streaming mode. Regardless of the computing mode, the final results are exposed to applications through a serving layer, ensuring consistency of access.

3) The third stage: the Kappa architecture. The Lambda architecture solves the consistency problem for applications reading data, but the separate stream and batch pipelines increase development complexity. This raised the question of whether one system could solve everything, and the currently popular approach is to build on stream computing. The inherently distributed nature of stream computing gives it good scalability; by increasing the parallelism of the streaming job and widening the "time window" over the streaming data, batch processing and stream processing can be unified in a single computing mode.



Figure 4. Schematic diagram of Kappa architecture
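A minimal Spark Structured Streaming sketch of the Kappa idea described above; the Kafka topic, schema, and paths are hypothetical. One streaming job with an event-time window covers the real-time case, and widening the window (or replaying historical data as a bounded stream) moves the same job toward batch-style aggregation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("kappa-demo").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("amount", DoubleType()),
])

# Read the event stream (requires the spark-sql-kafka connector package).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# A tumbling event-time window: widening the window moves the same job
# toward batch-style aggregation, which is the core of the Kappa argument.
per_hour = (events
            .withWatermark("event_time", "10 minutes")
            .groupBy(F.window("event_time", "1 hour"), "user_id")
            .agg(F.sum("amount").alias("revenue")))

query = (per_hour.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "s3a://my-data-lake/serving/hourly_revenue/")
         .option("checkpointLocation", "s3a://my-data-lake/checkpoints/hourly_revenue/")
         .start())
query.awaitTermination()
```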

In summary, from the traditional Hadoop architecture to the Lambda architecture, and from Lambda to Kappa, the evolution of big data infrastructure has gradually incorporated all the data processing capabilities applications require, and the big data platform has evolved into an enterprise's/organization's platform for processing all of its data. In current enterprise practice, apart from relational databases that remain attached to individual business systems, almost all other data is expected to be consolidated into the big data platform for unified processing. However, today's big data infrastructure focuses on storage and computing and neglects the asset-oriented management of data, and that is precisely one of the directions that the data lake, as a new generation of big data infrastructure, focuses on.

The evolution of big data infrastructure reflects one point: within enterprises/organizations, it has become a consensus that data is an important asset. To make better use of data, enterprises/organizations need to: 1) preserve data assets as-is for the long term; 2) manage and govern them effectively and centrally; 3) provide multi-mode computing capabilities to meet processing needs; and 4) provide, in a business-oriented way, unified data views, data models, and data processing results. The data lake was born in this context. On top of the basic capabilities of a big data platform, it places extra emphasis on the management, governance, and assetization of data. Concretely, a data lake needs a series of data management components, including: 1) data ingestion (access); 2) data migration; 3) data governance; 4) quality management; 5) asset catalog; 6) access control; 7) task management; 8) task orchestration; and 9) metadata management. The figure below shows a reference architecture for a data lake system. Like a big data platform, a typical data lake has the storage and computing capabilities needed to handle data at very large scale and offers multi-mode data processing; what it adds is stronger data management capability, embodied in the following three points:

1) Stronger data ingestion capabilities. Ingestion means being able to define and manage various external heterogeneous data sources and to extract and migrate data from them; what is extracted and migrated includes both the metadata of the external sources and the actual stored data.

2) Stronger data management capabilities. These can be divided into basic and extended management capabilities. Basic capabilities include metadata management, data access control, and data asset management; they are mandatory for a data lake system, and later we will compare how well each vendor supports them. Extended capabilities include task management, process orchestration, and capabilities related to data quality and data governance. Task management and process orchestration are mainly used to manage, orchestrate, schedule, and monitor the jobs that process data in the lake; data lake builders usually buy or custom-develop data integration or data development subsystems/modules to provide them, and such customized systems/modules integrate with the data lake by reading its metadata. Data quality and data governance are more complex topics; a data lake system generally does not provide these functions directly but exposes interfaces or metadata so that capable enterprises/organizations can integrate with existing data governance software or build their own.

3) Sharable metadata. The computing engines of the data lake are deeply integrated with the data in the lake, and the basis of that integration is the lake's metadata. In a good data lake system, when processing data, a computing engine can obtain the storage location, format, schema, distribution, and other information directly from the metadata and process the data without manual or programmatic intervention. Furthermore, a good data lake system can enforce access control on the data, with granularity down to the database, table, column, and row levels.
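A sketch of what sharable metadata means in practice, assuming a Spark session configured against the lake's catalog (for example a Hive Metastore or the AWS Glue Data Catalog) and a hypothetical database and table: the engine resolves the storage location, format, and schema from the catalog, so the query references only logical names.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() attaches Spark to the external metastore configured for
# the cluster; with AWS Glue this can be the Glue Data Catalog.
spark = (SparkSession.builder
         .appName("shared-metadata-demo")
         .enableHiveSupport()
         .getOrCreate())

# No paths, formats, or schemas in the query: they come from the catalog.
spark.sql("SHOW TABLES IN lake_db").show()   # 'lake_db' is hypothetical
df = spark.sql("""
    SELECT campaign_id, count(*) AS clicks
    FROM lake_db.click_events
    WHERE dt = '2020-06-01'
    GROUP BY campaign_id
""")
df.show()
```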


Figure 5. Data Lake Component Reference Architecture

It should also be pointed out that the "centralized storage" in the figure above is centralization more in the business sense: the hope is that data within an enterprise/organization settles in one clearly defined, unified place. Technically, data lake storage should be a distributed file system or object store that can expand on demand; in practice, most data lakes recommend distributed systems such as S3/OSS/OBS/HDFS as the unified storage.

We can also switch to the data dimension and look at how the data lake handles data across its life cycle, shown in Figure 6. In theory, a well-managed data lake retains raw data permanently, while derived data is continuously refined and evolved to meet business needs.


Figure 6. Schematic diagram of the data life cycle in the data lake

4. Data lake solutions from various vendors

Data lakes are currently a hot area, and major cloud vendors have launched their own data lake solutions and related products. This section analyzes the solutions from the mainstream vendors and maps them onto the data lake reference architecture, to help readers understand the strengths and weaknesses of each.

4.1 AWS Data Lake Solution


Figure 7. AWS data lake solution

Figure 7 shows the data lake solution recommended by AWS. The whole solution is based on AWS Lake Formation, which is essentially a management component that works with other AWS services to deliver enterprise-level data lake construction. The figure shows four steps from left to right: data ingestion, data storage, data computation, and data application. Let's look at the key points:

1) Data ingestion.
Data ingestion is the starting point of data lake construction and covers both metadata ingestion and business data ingestion. Metadata ingestion involves two steps, data source creation and metadata capture, and ultimately produces a data catalog together with the corresponding security settings and access control policies. The solution provides a dedicated component to obtain metadata from external data sources: it can connect to them, detect data formats and schemas, and create the corresponding metadata in the lake's data catalog. Business data is ingested through ETL.

In terms of concrete products, AWS packages metadata capture, ETL, and data preparation into a separate product called AWS Glue. AWS Glue and AWS Lake Formation share the same data catalog; the AWS Glue documentation states this clearly: "Each AWS account has one AWS Glue Data Catalog per AWS region".

As for heterogeneous data sources, the AWS data lake solution supports S3, Amazon relational database services, and Amazon NoSQL databases, and uses components such as Glue, EMR, and Athena to let data flow freely among them.
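As a sketch of the metadata-capture step: the boto3 Glue calls shown (create_crawler, start_crawler) are real APIs, while the crawler name, role ARN, bucket, and database are hypothetical. The crawler is pointed at an S3 prefix, infers formats and schemas, and writes table definitions into the Glue Data Catalog shared with Lake Formation.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler over a hypothetical raw-data prefix in S3.
glue.create_crawler(
    Name="raw-click-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # hypothetical role
    DatabaseName="lake_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/click_events/"}]},
    Schedule="cron(0 1 * * ? *)",   # re-crawl daily to pick up schema drift
)

# Run it once immediately; discovered tables appear in the shared Data Catalog.
glue.start_crawler(Name="raw-click-events-crawler")
```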

2) Data storage.

Amazon S3 serves as the centralized storage of the entire data lake, expanding on demand and billed by usage.

3) Data computation.

The solution uses AWS Glue for basic data processing. Glue's basic computation form is batch-mode ETL jobs, which can be started in three ways: manually, on a schedule, or by an event. It must be said that AWS's services work together very well as an ecosystem: in the event-triggered mode, AWS Lambda can be used for extended development and can trigger one or more jobs at once, greatly improving the customizability of job triggering; at the same time, all ETL jobs can be monitored through CloudWatch.
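A sketch of the job-start modes mentioned above, using boto3 (start_job_run and create_trigger are real Glue APIs; the job and trigger names are hypothetical): an on-demand start, a scheduled trigger, and a conditional trigger that fires when an upstream job succeeds. The Lambda-based event path described above is not shown here.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# 1) Manual: start an ETL job on demand.
glue.start_job_run(JobName="clean-click-events")

# 2) Timed: run the job every night at 02:00 UTC.
glue.create_trigger(
    Name="nightly-clean-click-events",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "clean-click-events"}],
    StartOnCreation=True,
)

# 3) Conditional: start aggregation once the cleaning job succeeds.
glue.create_trigger(
    Name="aggregate-after-clean",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "clean-click-events",
        "State": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "aggregate-click-events"}],
    StartOnCreation=True,
)
```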

4) Data application.

Beyond the basic batch computing mode, AWS provides rich computing support through external engines, such as SQL-based interactive analysis through Athena/Redshift, and the stream computing and machine learning capabilities that Spark can provide.
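For the interactive SQL path, a minimal Athena sketch with boto3 (start_query_execution is a real API; the database, table, and result bucket are hypothetical): Athena reads the table definition from the same Glue Data Catalog and scans the data directly on S3.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="""
        SELECT campaign_id, count(*) AS clicks
        FROM click_events
        WHERE dt = '2020-06-01'
        GROUP BY campaign_id
    """,
    QueryExecutionContext={"Database": "lake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("query id:", resp["QueryExecutionId"])
# Results can be polled with get_query_execution / get_query_results.
```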

5) Permission management.

AWS's data lake solution provides fairly complete permission management through Lake Formation, with granularity down to database-table-column. There is one exception: when Glue accesses Lake Formation, the granularity is only database-table; this also shows that Glue is more tightly integrated with Lake Formation and has broader access to the data.

Lake Formation permissions can be further divided into data catalog access permissions and underlying data access permissions, corresponding to metadata and the actual stored data respectively. Access to the actual stored data is further split into data access permissions and data storage access permissions. Data access permissions are similar to table access permissions in a database, while data storage permissions further refine access to specific directories in S3 (in two forms, explicit and implicit). As shown in Figure 8, with only data access permission, user A cannot create a table under the specified S3 bucket.
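A sketch of a column-level grant through Lake Formation with boto3 (grant_permissions is a real API; the account ID, role, database, table, and columns are hypothetical): the principal receives SELECT on two columns of a catalog table, independently of the underlying S3 storage permissions discussed above.

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "lake_db",
            "Name": "click_events",
            "ColumnNames": ["campaign_id", "event_time"],   # column-level grant
        }
    },
    Permissions=["SELECT"],
)
```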

Personally, I think this further shows that the data lake needs to support multiple storage engines. A future data lake may not only have core storage such as S3/OSS/OBS/HDFS but may also incorporate more types of storage engines according to application access requirements: for example, S3 for raw data, a NoSQL store for processed data accessed in key-value mode, and an OLAP engine for data that must feed reports and ad-hoc queries in real time. Although much of today's material stresses the differences between data lakes and data warehouses, in essence the data lake is a concrete realization of an integrated data management idea, and "lake-warehouse integration" (the lakehouse) is likely to be a future trend.


Figure 8. Schematic diagram of AWS data lake solution permission separation

To sum up, the AWS data lake solution is highly mature, especially in metadata management and permission management. It connects heterogeneous data sources with the various computing engines up and down the stack, letting data "move" freely. Its stream computing and machine learning stories are also fairly complete: for stream computing, AWS offers the dedicated Kinesis family; Kinesis Data Firehose provides a fully managed delivery service, so data processed in real time through Kinesis Data Streams can easily be written to S3 via Firehose, with format conversion supported along the way (for example, converting JSON to Parquet). A particularly nice touch is that Kinesis can use the metadata in Glue, which again reflects how complete the AWS ecosystem is. Likewise, for machine learning, AWS provides SageMaker, which can read training data from S3 and write trained models back to S3. Note that in AWS's data lake solution, stream computing and machine learning are not hard-wired into the lake; they are simply computing capabilities that can be integrated easily.

Finally, let's return to the data lake component reference architecture in Figure 5 and look at the component coverage of AWS's solution, shown in Figure 9.


Figure 9. Mapping of AWS data lake solutions in the reference architecture

In summary, AWS's data lake solution covers everything except quality management and data governance. In practice, quality management and data governance are closely tied to an enterprise's organizational structure and business type and require a lot of custom development, so it is understandable that a general-purpose solution leaves them out. There are in fact good open-source projects in this area, such as Apache Griffin; enterprises with strong requirements for quality management and data governance can build on them with custom development.

4.2 Huawei Data Lake Solution


Figure 10. Huawei Data Lake Solution

Information on Huawei's data lake solution comes from Huawei's official website. The related products currently visible there are Data Lake Insight (DLI) and the intelligent data lake operation platform DAYU. DLI is roughly equivalent to the combination of AWS's Lake Formation, Glue, Athena, and EMR (Flink & Spark). I could not find an overall architecture diagram of DLI on the official website and drew one based on my own understanding, keeping the form as consistent as possible with the AWS diagram for comparison. Readers who know Huawei DLI well are welcome to correct me.

Huawei's data lake solution is fairly complete. DLI carries all the core functions of data lake construction, data processing, data management, and data application. Its biggest feature is the completeness of its analysis engines, which include SQL-based interactive analysis and a Spark+Flink stream-batch unified processing engine. For core storage, DLI relies on the built-in OBS, which roughly matches the capabilities of AWS S3. In terms of upstream and downstream ecosystem, Huawei's solution is even more complete than AWS's: as external data sources, it supports almost all of the data source services currently offered on Huawei Cloud.

DLI connects with Huawei's CDM (Cloud Data Migration) and DIS (Data Access Service): 1) with DIS, DLI can define data access points that can be used as sources or sinks in Flink jobs; 2) with CDM, DLI can even ingest data from IDCs and third-party clouds.

To better support advanced data lake functions such as data integration, data development, data governance, and quality management, Huawei Cloud provides the DAYU platform. DAYU embodies Huawei's methodology for data lake governance and operation: it covers the core processes of data lake governance and provides the corresponding tooling, and Huawei's official documentation even gives suggestions on how to set up a data governance organization. DAYU's data governance methodology is shown in Figure 11 (from the Huawei Cloud official website).

Figure 11. DAYU data governance methodology and process

As can be seen, DAYU's data governance methodology is essentially an extension of traditional data warehouse governance methodology onto data lake infrastructure: from the data model perspective it still includes a source layer, a multi-source integration layer, and a detail data layer, fully consistent with a data warehouse. Quality rules and transformation models are generated from the data model and the metric model; DAYU connects to DLI and directly invokes DLI's data processing services to carry out governance. Huawei Cloud's data lake solution thus covers the whole data processing life cycle, explicitly supports data governance, and provides model- and metric-based governance process tooling; it is gradually evolving in the direction of "lake-warehouse integration".

4.3 Alibaba Cloud Data Lake Solution

Alibaba Cloud has many data products. Since I currently work in the database BU, this section focuses on how to build a data lake with the database BU's products; other cloud products are only touched on briefly. Alibaba Cloud's database-based data lake offering is more focused, targeting two scenarios: data lake analytics and federated analytics. The solution is shown in Figure 12.

Figure 12. Alibaba Cloud Data Lake Solution

The whole solution still uses OSS as the centralized storage of the data lake. In terms of data sources, it currently supports all Alibaba Cloud databases, including OLTP, OLAP, and NoSQL databases. The key points are as follows:

1) Data ingestion and migration. During lake building, DLA's Formation component provides metadata discovery and one-click lake building. At the time of writing, one-click lake building only supports full loads, but incremental lake building based on binlogs is under development and expected to launch soon; it will greatly improve the freshness of data in the lake while minimizing pressure on the source business databases. Note that DLA Formation is an internal component and is not exposed externally.

2) Data catalog. DLA provides the Meta data catalog component for unified management of data assets in the data lake, whether the data is "in the lake" or "outside the lake". The Meta data catalog is also the unified metadata entry point for federated analytics.

3) For built-in computing engines, DLA provides a SQL engine and a Spark engine. Both are deeply integrated with the Meta data catalog and can easily obtain metadata from it. Based on Spark, the DLA solution supports batch processing, stream computing, and machine learning.

4) In the surrounding ecosystem, besides ingesting and aggregating data from heterogeneous sources, DLA is deeply integrated with the cloud-native data warehouse (formerly ADB) for downstream access. On one hand, DLA's processing results can be pushed to ADB to serve real-time, interactive, and ad-hoc complex queries; on the other hand, data in ADB can easily be written back to OSS through external tables. With DLA, the heterogeneous data sources on Alibaba Cloud can all be connected and data can flow freely.

5) For data integration and development, Alibaba Cloud's data lake solution offers two options: DataWorks or DMS. Either one provides visual process orchestration, task scheduling, and task management. For data life cycle management, DataWorks' data map capability is currently more mature.

6) For data management and data security, DMS provides strong capabilities. Its management granularity is database-table-column-row, which fully meets enterprise-grade data security requirements. Beyond permission management, DMS's more refined contribution is extending the database-oriented DevOps concept to the data lake, making data lake operation and development more fine-grained.

The data application architecture of the whole solution can be further refined, as shown in the figure below.


Figure 13. Alibaba Cloud Data Lake data application architecture

Reading from left to right along the data flow: data producers generate all kinds of data (off-cloud, on-cloud, or on other clouds) and upload it with various tools to general/standard data sources, including OSS, HDFS, databases, and so on. For these sources, DLA performs lake building through data discovery, data ingestion, and data migration. For data "in the lake", DLA provides SQL- and Spark-based processing and, with DataWorks/DMS, visual data integration and data development; for serving applications, DLA exposes standardized JDBC interfaces that can be connected directly to reporting tools, dashboards, and the like. The strength of Alibaba Cloud's DLA is that it leans on the entire Alibaba Cloud database ecosystem (OLTP, OLAP, NoSQL, etc.) and exposes SQL-based data processing, so for teams with a traditional database development stack the migration cost is low and the learning curve is gentle.

Another feature of Alibaba Cloud's DLA solution is "cloud-native lake-warehouse integration". Traditional enterprise data warehouses remain irreplaceable for reporting applications in the big data era, but they cannot satisfy the flexibility that data analysis and processing now demand. We therefore recommend treating the data warehouse as an upper-layer application of the data lake: the data lake is the single authoritative store of raw business data in an enterprise/organization; it processes that raw data according to the needs of business applications to produce reusable intermediate results; and once the schema of those intermediate results is relatively stable, DLA can push them into the data warehouse, where the enterprise/organization builds its warehouse-based business applications. Alongside DLA, Alibaba Cloud also offers the cloud-native data warehouse (formerly ADB), and the two are deeply integrated on the following two points.
1) A SQL parsing engine of common origin. DLA's SQL is fully compatible with ADB's SQL syntax, which means developers can use a single technology stack to develop both data lake applications and data warehouse applications.
2) Built-in access to OSS on both sides. OSS is DLA's native storage; for ADB, structured data on OSS can easily be accessed through external tables. With external tables, data can flow freely between DLA and ADB, achieving true lake-warehouse integration (a sketch follows below).
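A sketch of this interplay, assuming a DLA endpoint reachable over the MySQL protocol; the endpoint, credentials, schema, bucket path, and table layout are all placeholders, and the DDL syntax is illustrative only. An external table is declared over files in OSS and then queried with ordinary SQL, which is also how the JDBC-based reporting access described earlier works.

```python
import pymysql

# DLA exposes a MySQL-protocol endpoint; all connection details here are placeholders.
conn = pymysql.connect(host="dla-endpoint.example.com", port=10000,
                       user="dla_user", password="***", database="lake_schema")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS click_events (
    user_id     STRING,
    campaign_id STRING,
    event_time  TIMESTAMP
)
STORED AS PARQUET
LOCATION 'oss://my-data-lake/cleaned/click_events/'
"""

with conn.cursor() as cur:
    cur.execute(ddl)   # registers table metadata only; the data stays in OSS
    cur.execute("""
        SELECT campaign_id, COUNT(*) AS clicks
        FROM click_events
        GROUP BY campaign_id
        ORDER BY clicks DESC
        LIMIT 10
    """)
    for campaign_id, clicks in cur.fetchall():
        print(campaign_id, clicks)

conn.close()
```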

The combination of DLA and ADB truly achieves cloud-native lake-warehouse integration (what "cloud native" means is beyond the scope of this article). In essence, DLA can be regarded as a source-aligned (ODS-like) layer of the data warehouse with expanded capabilities. Compared with a traditional warehouse, this layer: (1) can store structured, semi-structured, and unstructured data; (2) can connect to various heterogeneous data sources; (3) has metadata discovery, management, and synchronization capabilities; (4) has stronger data processing power thanks to the built-in SQL/Spark engines, meeting diverse processing needs; and (5) provides full life cycle management for all data. A DLA+ADB lake-warehouse solution thus covers the processing capacity of "big data platform + data warehouse" at once.

Another important capability of DLA is building a data flow system that "extends in all directions" and exposing it with a database-like experience, regardless of whether the data is on or off the cloud, inside or outside the organization. With the data lake, barriers between systems disappear and data can flow in and out freely; more importantly, this flow is governed, and the data lake keeps a complete record of how the data moves.

4.4 Azure Data Lake Solution

Azure's data lake solution consists of data lake storage, an interface layer, resource scheduling, and a computing engine layer, as shown in Figure 15 (from the Azure official website). The storage layer is built on Azure object storage and supports structured, semi-structured, and unstructured data. The interface layer is WebHDFS; notably, the HDFS interface is implemented on top of Azure object storage, a capability Azure calls "multi-protocol access on Data Lake Storage". Resource scheduling is based on YARN, and for computing engines Azure provides U-SQL, Hadoop, Spark, and other processing engines.


Figure 15. Azure Data lake analysis architecture

What is special about Azure is its support for customer development based on Visual Studio.

1) Development tooling deeply integrated with Visual Studio. Azure recommends U-SQL as the development language for data lake analytics applications, and Visual Studio provides a complete development environment for it. To reduce the complexity of developing for a distributed data lake system, Visual Studio organizes work into projects: when developing U-SQL you can create a "U-SQL database project", in which you can easily code and debug, and a wizard helps publish the developed U-SQL scripts to production. U-SQL can be extended with Python and R to meet custom development needs.

2) Adaptation of multiple computing engines: SQL, Apache Hadoop, and Apache Spark. Hadoop here includes HDInsight (Azure's managed Hadoop service), and Spark includes Azure Databricks.

3) Automatic conversion between engines' jobs. Microsoft recommends U-SQL as the default development tool for the data lake and provides conversion tools that translate between U-SQL scripts and Hive, Spark (HDInsight & Databricks), and Azure Data Factory Data Flow.

4.5 Summary

This article discusses data lake solutions rather than any single product of any cloud vendor. Their capabilities in data ingestion, data storage, data computation, data management, and application ecosystem are briefly summarized in the table below.

Due to space constraints, the data lake solutions of other well-known cloud vendors such as Google and Tencent are not covered. Judging from their official websites, their data lake offerings are relatively simple and mostly conceptual, with "object storage (OSS) + Hadoop (EMR)" as the recommended implementation. In fact, a data lake should not be viewed merely as a technology platform, and there are many ways to implement one. To judge how mature a data lake solution is, the key is the data management capability it provides, including but not limited to metadata, data asset catalog, data sources, data processing tasks, data life cycle, data governance, and permission management, as well as how well it connects to the surrounding ecosystem.

5. Typical data lake application cases

5.1 Advertising Data Analysis

In recent years the cost of acquiring traffic has kept rising, and the exponential growth in the cost of acquiring customers through online channels poses a severe challenge across industries. Against this backdrop of rising Internet advertising costs, the strategy of simply buying traffic to attract new users is bound to fail, and optimizing the front end of the traffic funnel is reaching its limits. Using data tools to improve the conversion of traffic once it arrives, and refining every link of advertising operations, is the most direct and effective way to change the status quo; ultimately, improving the conversion rate of advertising traffic has to rely on big data analysis.

To better support decision-making, more tracking (buried-point) data must be collected and analyzed, including but not limited to channel, delivery time, and target audience, together with metrics such as click-through rate, so as to produce more efficient and faster recommendations and achieve high efficiency and high output. Faced with the collection, storage, analysis, and decision-making requirements of the multi-dimensional, multimedia, multi-placement structured, semi-structured, and unstructured data in advertising, data lake analytics products have become a warmly received new-generation technology choice for advertisers and publishers.

DG is a world-leading intelligent marketing service provider for enterprises; based on advanced advertising technology, big data, and operational capabilities, it offers customers global high-quality user acquisition and traffic monetization services. DG decided from the start to build its IT infrastructure on the public cloud. Initially it chose AWS, storing its advertising data in S3 as a data lake and running interactive analysis through Athena. However, with the rapid development of Internet advertising, the industry posed several major challenges, and a mobile advertising delivery and tracking system must solve these key issues:

1) Concurrency and traffic peaks. In the advertising industry traffic peaks are common, and instantaneous click volume can reach tens or even hundreds of thousands, so the system must scale very well to respond to and process every click quickly.

2) Real-time analysis of massive data. To monitor advertising effectiveness, the system must analyze each user click and activation in real time and forward the relevant data to downstream media.

3) Rapidly growing data volume. Business logs are generated and uploaded continuously, and exposure, click, and push data are processed continuously, adding roughly 10-50 TB of new data per day. This places higher demands on the system: how to efficiently complete offline/near-real-time statistics on advertising data and aggregate it along the dimensions advertisers need.

Facing these three business challenges, and with DG's daily incremental data growing rapidly (the daily data scan volume has reached 100+ TB), continued use of the AWS platform ran into Athena's bandwidth bottleneck when reading data from S3, and analysis latency kept lengthening. After careful testing and analysis, DG decided to migrate entirely from AWS to Alibaba Cloud. The new architecture is shown below:


Figure 16. The transformed advertising data lake solution architecture

After moving from AWS to Alibaba Cloud, we designed a "Data Lake Analytics + OSS" analysis architecture for this customer to cope with business peaks and troughs. On one hand it easily handles ad-hoc analysis requests from brand customers; on the other hand, the computing power of Data Lake Analytics is used for monthly and quarterly advertising analysis, accurately calculating how many campaigns a brand runs, broken down by media, market, channel, and DMP, which further improved the sales conversion that the Jiahe intelligent traffic platform brings to brand marketing. In addition, in terms of total cost of ownership for advertising delivery and analysis, the serverless elastic service provided by Data Lake Analytics is charged on demand and requires no purchase of fixed resources, greatly reducing operation, maintenance, and usage costs.


Figure 17. Schematic diagram of the data lake deployment

Overall, after switching from AWS to Alibaba Cloud, DG greatly reduced hardware, labor, and development costs. Thanks to the serverless DLA cloud service, DG does not need to invest heavily up front in servers and storage, nor buy large amounts of cloud services at once; its infrastructure scales entirely on demand, adding capacity when demand is high and shrinking it when demand falls, which improves capital efficiency. The second significant benefit is performance. During DG's period of rapid business growth and the onboarding of multiple business lines, traffic to the mobile advertising system often grew explosively, and the original AWS setup hit a severe data-read bandwidth bottleneck when Athena read from S3, with analysis times growing longer and longer. The Alibaba Cloud DLA and OSS teams carried out extensive optimization, and DLA's computing engine (which shares the AnalyticDB compute engine that ranked first worldwide on TPC-DS) is dozens of times faster than native Presto, which also greatly improved analysis performance for DG.

5.2 Game Operation Analysis

A data lake is a type of big data infrastructure with excellent TCO. For many fast-growing game companies, a hit game often brings extremely fast data growth in a short time, while the technology stack of the company's engineers can hardly keep pace with that growth; the explosive data growth is then hard to exploit effectively. A data lake is a natural technology choice for this kind of problem.

YJ is a fast-growing game company that wants to analyze user behavior data in depth to guide game development and operation. The logic behind this is that as competition in the game industry intensifies, players demand ever higher quality and game life cycles keep shortening, which directly affects a project's return on investment; refined, data-driven operation can effectively extend a project's life cycle and precisely steer the business at each stage. With traffic costs rising, building an economical, efficient, and refined data operation system to support the business has become increasingly important. Such a system needs supporting infrastructure, and how to choose it is a question for the company's technical decision makers. Their considerations include:

1) Sufficient elasticity. Games often burst in the short term, with data volume surging; whether the infrastructure can absorb explosive data growth and meet elastic demand is a key consideration, for both computing and storage.

2) Sufficient price/performance. User behavior data often needs to be analyzed and compared over long periods; retention, for example, frequently has to be evaluated over 90 or even 180 days. How to store massive data over the long term in the most cost-effective way therefore has to be considered.

3) Sufficient analytical capability and extensibility. In many cases user behavior is captured as tracking (buried-point) data, which must be joined with structured data such as registration information, login records, and bills for analysis; so the analysis side needs at least big-data ETL, access to heterogeneous data sources, and modeling capabilities for complex analysis.

4) A match with the company's existing technology stack and future hiring. For YJ, an important factor in technology selection is its engineers' skill set. Most of YJ's technical team is only familiar with traditional database development, i.e. MySQL; staffing is tight, with only one engineer doing data operation analysis, so the company cannot build big data analysis infrastructure on its own in a short time. From YJ's perspective it is best if most analysis can be done in SQL, and in the hiring market SQL developers far outnumber big data engineers. Based on the customer's situation, we helped modify the existing architecture.


Figure 18. The scheme before transformation

Before the transformation, all of the customer's structured data was stored in a high-spec MySQL instance, while player behavior data was collected through LogTail into Log Service (SLS) and then delivered from Log Service to OSS and Elasticsearch respectively. The problems with this architecture were: 1) behavior data and structured data were completely separated and could not be analyzed jointly; 2) Elasticsearch only provided search over the behavior data, without deep mining and analysis; 3) OSS was used merely as storage, and the value of the data in it was not exploited.

In fact, the customer's existing architecture already contained the prototype of a data lake: the full data set was already stored in OSS, and what was needed was to improve the customer's ability to analyze the data in OSS. Moreover, the SQL-based processing mode of the data lake also matched the customer's development stack. In summary, we made the following adjustments to help the customer build a data lake.


Figure 19. The transformed data lake solution

Overall, we did not change the customer's data flow but added DLA on top of OSS to reprocess the OSS data. DLA provides a standard SQL computing engine and supports access to various heterogeneous data sources; after processing with DLA, the OSS data becomes directly usable by the business. DLA's limitation, however, is that it cannot serve interactive analysis scenarios with low latency requirements, so we introduced the cloud-native data warehouse ADB to solve the latency of interactive analysis, and Quick BI at the front end as the customer's visualization and analysis tool. The YJ case is a classic implementation of the lake-warehouse integration solution shown in Figure 14, applied to the game industry.

YM is a data intelligence service provider that offers a range of data analysis and operation services to small and medium-sized merchants. The technical logic of its implementation is shown in the figure below.


Figure 20. Schematic diagram of YM smart data service SaaS model

The platform provides a multi-terminal SDK for merchants (whose products take various forms: web pages, apps, mini programs, etc.) to report tracking (buried-point) data, and the platform delivers unified data ingestion and data analysis services in SaaS form. Merchants use these analysis services for fine-grained analysis of their tracking data, covering basic functions such as behavior statistics, customer profiling, audience selection, and advertising monitoring. This SaaS model, however, has some problems:

1) Because merchant types and needs vary widely, it is difficult for the platform's SaaS analysis functions to cover every type of merchant or to satisfy customized needs; for example, some merchants focus on sales, some on customer operations, and some on cost optimization, and one service can hardly satisfy them all.

2) Some advanced analysis functions, such as audience selection based on custom tags and merchant-defined extensions, cannot be satisfied by a unified data analysis service; in particular, some custom tags depend on merchant-defined algorithms, which the unified service cannot accommodate.

3) Data asset management requirements. In the big data era it is a consensus that data is an enterprise's/organization's asset; how to let the data that belongs to each merchant accumulate in a reasonable, long-term way is also something the SaaS service must consider.

To address this, we introduced the data lake on top of the basic model shown above, letting the data lake serve as the merchants' supporting infrastructure for accumulating data, building models, and analyzing operations. The SaaS data intelligence service model with the data lake is shown below.


Figure 21. Data intelligence service based on data lake

As shown in Figure 21, the platform side provides each user with a one-click lake-building service, and merchants use this function to build their own data lakes. On one hand, the platform delivers the event tracking data model (schema) to the data lake; on the other hand, all tracking data belonging to the merchant is fully synchronized to the data lake, and daily incremental data is archived into the lake in "T+1" mode. On top of the traditional data analysis services, the data lake-based service model gives users three major capabilities: data assetization, analysis and modeling, and service customization:

1) Data assetization capability. Using the data lake, merchants can continuously accumulate their own data; how long the data is kept, and at what cost, is entirely up to the merchant. The data lake also provides data asset management capabilities: besides the raw data, merchants can store processed intermediate data and result data by category, which greatly increases the value of the tracking data.

2) Analysis and modeling capability. The data lake holds not only raw data but also the model (schema) of the tracking data, which reflects the platform's abstraction of the business logic. Through the data lake, the data model is delivered as an asset along with the raw data. With the help of this model, merchants can understand more deeply the user behavior logic embodied in the tracking data, better understand customer behavior, and derive user needs.

3) Service customization capability. With the data integration and data development capabilities provided by the data lake, and based on an understanding of the tracking data model, merchants can customize their own data processing, iteratively process the raw data, extract valuable information from it, and ultimately obtain value beyond that of the original data analysis services.

6. The basic process of data lake construction

In my personal view, a data lake is a more complete big data processing infrastructure than the traditional big data platform, and it sits closer to the customer's business. All of the capabilities a data lake provides beyond a big data platform, such as metadata, a data asset catalog, permission management, data life cycle management, data integration and data development, and data governance and quality management, exist to bring the platform closer to the business and make it easier for customers to use. The basic technical characteristics the data lake emphasizes, such as elasticity, independent scaling of storage and compute, a unified storage engine, and multi-modal computing engines, are likewise there to meet business needs and offer the business the best possible TCO.

The construction of a data lake should be closely integrated with the business, but its construction process should differ from that of a traditional data warehouse or even the currently popular "data middle platform": a data lake should be built in a more agile way, "used while being built, managed while being used". To better understand this agility, let us first look at how traditional data warehouses are built. For traditional data warehouses, the industry has proposed "top-down" and "bottom-up" models, put forward by Inmon and Kimball respectively. The detailed processes will not be described here (that could fill hundreds of pages); only the basic ideas are briefly explained.

1) Inmon proposed the top-down (EDW-DM) data warehouse construction model: data from operational or transactional systems is extracted, transformed, and loaded (ETL) into the ODS layer of the data warehouse; the ODS data is then processed according to a pre-designed EDW (Enterprise Data Warehouse) paradigm and loaded into the EDW. The EDW is generally a common data model for the enterprise/organization and is not convenient for upper-layer applications to analyze directly, so each business department processes data from the EDW again into its own data mart layer (DM) according to its needs.

Advantages: easy to maintain and highly integrated. Disadvantages: once the structure is fixed, flexibility is limited, and adapting it to the business makes the deployment cycle long. Data warehouses built this way suit relatively mature, stable businesses such as finance.

2) Kimball proposed the bottom-up (DM-DW) data architecture: data from operational or transactional systems is extracted or loaded into the ODS layer; dimensional modeling is then applied to the ODS data to build data marts (DM) around multi-dimensional subjects. The individual DMs are linked together through conformed dimensions and eventually form the enterprise/organization's common data warehouse.

Advantages: fast to build, the quickest return on investment, agile and flexible. Disadvantages: hard to maintain as an enterprise-wide resource, structurally complex, and integrating the data marts is difficult. It is often used in small and medium-sized enterprises or in the Internet industry.

In reality, the above is only the theoretical process. Whether the EDW or the DMs are built first, the work is inseparable from a thorough understanding of the data and the design of the data model before the warehouse is built; even the currently popular "data middle platform" cannot escape the basic construction process shown in the figure below.


Figure 22. The basic process of data warehouse/data center construction

1) Find out the data. For an enterprise/organization, the initial work is to conduct a comprehensive investigation of the data within the enterprise/organization, including data sources, data types, data forms, data schemas, data volumes, and data increments. An implicit but important task at this stage is to use the data investigation to further sort out the enterprise's organizational structure and clarify the relationship between data and organizational structure; this lays the foundation for later defining user roles, permission design, and service modes for the data lake.

2) Model abstraction. According to the business characteristics of the enterprise/organization, sort and classify the various types of data, divide the data into domains, form the metadata used for data management, and build a general data model based on that metadata.
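As an illustration of what the "metadata for data management" produced in this step can look like, here is a minimal sketch in Python; the field names and the example record are purely hypothetical and not tied to any particular catalog product.

```python
# A minimal sketch of a metadata record produced by the "model abstraction" step:
# each data set is classified into a domain and described well enough to drive
# later access, storage and permission decisions. Field names are illustrative.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetMetadata:
    name: str                    # logical data set name
    domain: str                  # business domain it is classified into
    source_system: str           # where the raw data comes from
    data_format: str             # e.g. "mysql-table", "json-log", "csv"
    schema: List[str]            # simplified column list; real catalogs keep types too
    owner: str                   # organizational owner, used for permission design
    ingest_mode: str = "full"    # "full" or "incremental"
    tags: List[str] = field(default_factory=list)


catalog = [
    DatasetMetadata(
        name="player_behavior_log",
        domain="user_behavior",
        source_system="log-service",
        data_format="json-log",
        schema=["player_id", "event", "event_time"],
        owner="game-ops",
        ingest_mode="incremental",
    ),
]
```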

3) Data access. Based on the results of the first step, determine the data sources to be connected; based on the data sources, determine the required data access capabilities and complete the technology selection for data access. The data to be accessed includes at least: data source metadata, raw data metadata, and the raw data itself. All data is classified and stored according to the results formed in the second step.

4) Integrated governance. Simply put, this means using the available computing engines to process the data, producing intermediate and result data, and managing and storing them properly. The platform should have comprehensive data development, task management, and task scheduling capabilities and should record the data processing process in detail. Governance will also call for further data models and indicator models.

5) Business support. On the basis of the general model, each business department customizes its own detailed data model, data usage process, and data access service.

The above process is too heavy for a fast-growing Internet company and often simply cannot be implemented. The most practical problem lies in the second step, model abstraction: in many cases the business is still in a trial-and-error, exploratory phase, and no one knows where it is headed, so a general data model cannot be extracted; without a data model, everything downstream becomes impossible. This is one of the important reasons why many fast-growing companies feel that the data warehouse/data middle platform cannot be implemented and cannot meet their needs.

A data lake calls for a more "agile" construction method. We recommend the following steps for building a data lake.


Figure 23. Basic process of data lake construction

Compared with Figure 22, there are still five steps, but these five steps have been comprehensively simplified and made far more practical to implement.

1) Find out the data. It is still necessary to establish the basic facts about the data: data sources, data types, data forms, data schemas, data volumes, and data increments. But that is all that is needed. Since the data lake stores the raw data in full, there is no need for in-depth design in advance.

2) Technology selection. Determine the technology choices for building the data lake based on the state of the data. In practice this step is also fairly simple, because there are already many common industry practices for data lake technology selection. The three basic principles I personally recommend are "separation of compute and storage", "elasticity", and "independent scaling". For storage, a distributed object storage system (such as S3/OSS/OBS) is recommended; for the computing engine, focus on batch processing and SQL capabilities, because in practice these two are the key to data processing (the stream computing engine is discussed later). For both compute and storage, serverless offerings are recommended first; the solution can then evolve gradually with the application, and a dedicated cluster can be considered only when an independent resource pool is really needed.
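The following sketch shows, under stated assumptions, what "separation of compute and storage" can look like in code: a Spark session (one possible engine choice) configured to read directly from an object store via the hadoop-aliyun OSS connector. The endpoint, credentials, and bucket name are placeholders.

```python
# A minimal sketch of compute/storage separation: the compute engine (Spark here,
# as one possible choice) keeps no data locally and reads/writes a distributed
# object store directly. Endpoint, keys and bucket names are placeholders; the
# hadoop-aliyun OSS connector (or an S3A equivalent) is assumed to be available.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("data-lake-batch")
    # object-store connector configuration (values are placeholders)
    .config("spark.hadoop.fs.oss.endpoint", "oss-cn-hangzhou.aliyuncs.com")
    .config("spark.hadoop.fs.oss.accessKeyId", "<access-key-id>")
    .config("spark.hadoop.fs.oss.accessKeySecret", "<access-key-secret>")
    # elasticity: let the engine scale executors with the workload
    .config("spark.dynamicAllocation.enabled", "true")
    .getOrCreate()
)

# Compute reads straight from the object store; tearing the cluster down later
# loses no data because nothing lives on local disks.
df = spark.read.parquet("oss://my-lake-bucket/raw/orders/")
print(df.count())
```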

3) Data access. Determine the data sources to be connected, and complete the full data extraction and the incremental data connection.
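A minimal sketch of this step, assuming the source is a MySQL database reachable over JDBC and the lake sits on object storage; the table names, paths, and the update_time watermark column are hypothetical.

```python
# A minimal sketch of data access: a one-off full extraction of a source table
# into the lake, followed by a daily (T+1) incremental pull based on an
# update-time column. Table and path names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

jdbc_url = "jdbc:mysql://source-db:3306/shop"
props = {"user": "etl", "password": "***", "driver": "com.mysql.cj.jdbc.Driver"}

# 1) Full extraction: land a complete copy of the raw table in the lake.
orders = spark.read.jdbc(jdbc_url, "orders", properties=props)
orders.write.mode("overwrite").parquet("oss://my-lake/raw/orders/full/")

# 2) Incremental access: pull only rows changed since the last run
#    (assumes the source table carries an update_time column).
last_run = "2020-08-01 00:00:00"
incr_query = f"(SELECT * FROM orders WHERE update_time > '{last_run}') AS t"
orders_incr = spark.read.jdbc(jdbc_url, incr_query, properties=props)
orders_incr.write.mode("append").parquet("oss://my-lake/raw/orders/incr/dt=2020-08-02/")
```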

4) Application governance. This step is the key to the data lake. I have deliberately changed "convergence governance" to "application governance": from a data lake's perspective, data application and data governance should be integrated and inseparable. Start from the data application, clarify the requirements within the application, and gradually produce business-usable data through the ETL process; at the same time, form the data model, the indicator system, and the corresponding quality standards. Data lakes emphasize storing raw data and exploratory analysis and application of data, but that by no means implies that data lakes do not need data models. On the contrary, understanding and abstracting the business will greatly promote the development and application of the data lake, while data lake technology keeps data processing and modeling highly agile, able to adapt quickly to business development and change.
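To illustrate the "start from the application, govern along the way" idea, here is a minimal sketch that derives one business indicator (daily active users) while persisting the cleaned intermediate layer so the lineage of the result can be traced later; all paths and column names are hypothetical.

```python
# A minimal sketch of application governance: process raw lake data toward a
# concrete business question, and keep both the intermediate layer and the
# result layer in the lake. Paths and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-governance").getOrCreate()

raw = spark.read.parquet("oss://my-lake/raw/behavior/")
raw.createOrReplaceTempView("raw_behavior")

# Intermediate layer: cleaned, deduplicated events (retained, not thrown away).
cleaned = spark.sql("""
    SELECT DISTINCT player_id, event, CAST(event_time AS TIMESTAMP) AS event_time
    FROM raw_behavior
    WHERE player_id IS NOT NULL
""")
cleaned.write.mode("overwrite").parquet("oss://my-lake/cleaned/behavior/")

# Result layer: an indicator the business actually asked for (daily active users).
cleaned.createOrReplaceTempView("cleaned_behavior")
dau = spark.sql("""
    SELECT DATE(event_time) AS dt, COUNT(DISTINCT player_id) AS dau
    FROM cleaned_behavior
    GROUP BY DATE(event_time)
""")
dau.write.mode("overwrite").parquet("oss://my-lake/result/dau/")
```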

From a technical perspective, data lakes differ from big data platforms in that, to support full life cycle management and application of data, a data lake needs relatively complete capabilities for data management, category management, process orchestration, task scheduling, data traceability, data governance, quality management, and authority management. In terms of computing power, current mainstream data lake solutions support SQL and programmable batch processing (machine learning can use the built-in capabilities of Spark or Flink), support orchestrating multiple jobs into workflows in the form of directed acyclic graphs (DAGs), and provide a corresponding integrated development environment. For streaming computing, current data lake solutions take different approaches. Before discussing the specific approaches, let us first classify stream computing:

1) Mode 1: real-time mode. This stream computing mode processes data record by record or in micro-batches; it is common in online business, such as risk control, recommendation, and early warning.

2) Mode 2: stream-like mode. This mode needs to obtain data changed after a specified point in time, read a specific version of the data, or read the current latest data; it is a "stream-like" mode, common in data exploration applications, such as analyzing daily activity, retention, and conversion over a certain period.

The essential difference between the two is that in mode 1 the data being processed has usually not yet been stored in the data lake and only flows across the network or in memory, while in mode 2 the data has already been stored in the data lake. In summary, I personally recommend the following pattern:


Figure 24. Schematic diagram of data flow in the data lake

As shown in Figure 24, when the data lake needs the processing capability of mode 1, Kafka-like middleware should be introduced as the data-forwarding infrastructure. A complete data lake solution should be able to divert raw data to Kafka, and the stream computing engine should be able to read data from the Kafka-like component. After processing, the stream computing engine can write the results to OSS/RDBMS/NoSQL/DW as needed for applications to access. In a sense, the mode-1 stream computing engine does not have to exist as an integral part of the data lake; it only needs to be easy to introduce when the application requires it (a minimal sketch follows the points below). That said, a few things need to be pointed out:

1) The streaming engine still needs to be able to easily read the metadata of the data lake;

2) Streaming engine tasks also need to be integrated into the task management of the data lake;

3) Stream processing tasks still need to be incorporated into unified authority management.
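As a concrete illustration of mode 1, here is a minimal sketch using Spark Structured Streaming as one possible stream engine (Flink would serve equally well); it assumes the spark-sql-kafka package is available, and the broker address, topic, and schema are hypothetical.

```python
# A minimal sketch of mode 1 (real-time): a stream engine reads from Kafka-like
# middleware and writes results to a serving store without the events first
# landing in the lake. Assumes the spark-sql-kafka package is on the classpath;
# broker, topic and schema are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructType, TimestampType

spark = SparkSession.builder.appName("mode1-realtime").getOrCreate()

event_schema = (
    StructType()
    .add("player_id", StringType())
    .add("event", StringType())
    .add("event_time", TimestampType())
)

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "player-events")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Example real-time use: running event counts per player (risk control / early
# warning style). In practice the sink would be an RDBMS/NoSQL/DW as the text
# describes; the console sink keeps the sketch self-contained.
query = (
    events.groupBy("player_id").count()
    .writeStream.outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```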

Mode 2 is essentially closer to batch processing. Many classic big data components now support it, such as Hudi, Iceberg, and Delta, all of which work with classic computing engines such as Spark and Presto. Taking Hudi as an example, through its special table types (Copy-on-Write and Merge-on-Read) it provides the ability to read snapshot data (a specified version), incremental data, and near-real-time data. At present, AWS, Tencent, and others have integrated Hudi into their EMR services, and Alibaba Cloud's DLA is also planning to launch DLA-on-Hudi capabilities.
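To make the two read patterns concrete, here is a minimal sketch of snapshot and incremental queries against a Hudi table with Spark; the table path and commit instant are hypothetical, and the hudi-spark bundle is assumed to be on the classpath.

```python
# A minimal sketch of mode 2 ("stream-like") reads on a Hudi table with Spark:
# the same table can be read as a snapshot (the latest version of the data) or
# incrementally (only changes after a given commit time). Base path and commit
# timestamp are hypothetical; the hudi-spark bundle is assumed to be available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-read").getOrCreate()

base_path = "oss://my-lake/hudi/user_events/"

# Snapshot query: the latest view of the table.
snapshot = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load(base_path)
)

# Incremental query: only records committed after the given instant time.
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20200801000000")
    .load(base_path)
)

print(snapshot.count(), incremental.count())
```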

Let us return to the first chapter of this article. We said that the main users of a data lake are data scientists and data analysts, and that exploratory analysis and machine learning are their common activities; stream computing in the real-time mode is mostly used for online business and, strictly speaking, is not something the data lake's target users necessarily need. However, real-time stream computing is now an important part of the online business of most Internet companies, so the data lake, as the centralized data store within the enterprise/organization, needs to keep its architecture extensible enough that stream computing capabilities can be conveniently added and integrated.

5) Business support. Although most data lake solutions provide standard access interfaces such as JDBC, and the popular BI and dashboard tools on the market can access data in the data lake directly, in practice we still recommend pushing the data processed by the data lake to the various data engines that serve online business, so that applications get a better experience.
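A minimal sketch of this push, assuming the serving engine is a MySQL-compatible database reachable over JDBC; the connection details and table names are placeholders.

```python
# A minimal sketch of business support: rather than letting BI tools query the
# lake directly, the processed result table is pushed to an online engine (an
# RDBMS here, via JDBC) built for concurrent, low-latency access.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-serve").getOrCreate()

dau = spark.read.parquet("oss://my-lake/result/dau/")

dau.write.mode("overwrite").jdbc(
    url="jdbc:mysql://serving-db:3306/reports",   # placeholder endpoint
    table="daily_active_users",
    properties={"user": "report", "password": "***",
                "driver": "com.mysql.cj.jdbc.Driver"},
)
```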

7. Summary

As the infrastructure for next-generation big data analysis and processing, data lakes need to go beyond traditional big data platforms. In my personal view, the following are possible future directions for data lake solutions.

1) Cloud-native architecture. Opinions differ on what a cloud-native architecture is, and a unified definition is hard to find. For the data lake scenario, I personally think it has three characteristics: (1) storage and compute are separated, and computing power and storage capacity can scale independently; (2) multi-modal computing engines are supported, covering SQL, batch processing, and stream processing; (3) serverless services are provided to guarantee sufficient elasticity and support pay-as-you-go.

2) Sufficient data management capabilities. Data lakes need to provide more powerful data management capabilities, including but not limited to data source management, data category management, processing flow orchestration, task scheduling, data traceability, data governance, quality management, authority management, etc.

3) Big data capabilities with a database-like experience. At present, the vast majority of data analysts only have experience using databases; big data platforms are powerful but not user-friendly. Data scientists and data analysts should be focusing on data, algorithms, models, and how they fit the business scenario, rather than spending a great deal of time and energy learning big data platform development. For data lakes to grow quickly, providing users with a good experience is key. SQL-based database application development is deeply ingrained, and exposing data lake capabilities through SQL is a major direction for the future.

4) Complete data integration and data development capabilities. Management of and support for various heterogeneous data sources, full/incremental migration of heterogeneous data, and support for various data formats all need continuous improvement, along with a complete, visual, and extensible integrated development environment.

5) Deep integration with the business. The composition of a typical data lake architecture has basically become an industry consensus: distributed object storage + multi-modal computing engines + data management. The key to whether a data lake solution wins lies in data management: whether it is management of raw data, data categories, data models, data permissions, or processing tasks, none of it can be separated from adaptation to and integration with the business. In the future, more and more industry-specific data lake solutions will emerge and develop healthily in interaction with data scientists and data analysts. How to pre-build industry data models, ETL processes, analysis models, and custom algorithms into data lake solutions may be a key point of differentiation in the data lake field.
