What is a data lake? The concept and development history of data lakes

With the rapid development of new-generation Internet technologies such as cloud computing, social media, the Internet of Things, and short video, the volume and complexity of data continue to grow. Many enterprises and organizations have accumulated large amounts of data of many types. How to store and manage this data, and how to analyze and use it efficiently, is a major challenge facing every organization today. For enterprises, processing and analyzing massive data effectively is vital to decision-making during digital transformation.

The rise of big data has brought many challenges to database research. Big data is usually described by four typical characteristics: volume, variety, velocity, and veracity. In particular, fields such as the Internet of Things, social networks, scientific research, and audio and video generate large amounts of semi-structured and unstructured data. These complex and ever-changing data sets are often scattered across isolated silos. Enterprises and organizations therefore need more effective data architectures to store and process this data, as well as more flexible methods of data analysis.

Although many enterprises and organizations still process and analyze data with traditional data marts and data warehouses, in the era of big data, data warehouses that excel at structured data can no longer meet the needs of complex data analysis. Traditional data storage technology faces three main challenges: insufficient flexibility, high storage costs, and an inability to handle multiple types of data.

To enable comprehensive and flexible analysis of this complex data, the concept of the data lake has emerged in recent years. A data lake is a technology that stores data from various sources in its raw format, simplifying and improving the storage, management, and analysis of big data. The advantages of data lakes have attracted widespread attention from business and technical experts and from academic researchers in the big data field, and research on data lakes has developed rapidly. To date, a variety of data lake solutions and system architectures have been proposed. However, because the data lake concept is still in its early stages, many studies and solutions target specific application areas. In addition, research on key data lake technologies concentrates on a few areas such as architecture and metadata management, and lacks detailed technical analysis and discussion of each stage of the overall data processing pipeline.

In China in particular, the concept of the data lake is still relatively unfamiliar, and many organizations and enterprises do not yet fully understand it. Moreover, Chinese academic research on data lake technologies is very limited, and few scholars have systematically compared and analyzed the concepts related to data lakes and big data technology platforms.

The development history of data lakes

Searching Google Scholar for the keyword "datalake" yields the following statistical chart. The figure shows that through 2014, data lakes were still in their infancy, and the number of related articles stayed at a low level. From 2015 to 2017, the concept of the data lake became well known in the industry and related technologies accumulated, so the number of papers showed a clear upward trend. Since 2018, data lake technology has developed vigorously, and the number of related articles and papers has increased rapidly. Based on this data, the development of data lakes can be divided into three stages: infancy, technology accumulation, and rapid development.

Infancy

In the business world, data lakes have been proposed, hyped, criticized, and improved upon. Data lakes first appeared as big data platform solutions that addressed the limitations of traditional data marts. In 2013, Pivotal proposed the business data lake architecture, which tried to use the data lake idea to solve the problems of data integration and instant access to analytical data, but it did not propose a complete data governance solution. In 2014, the business world broadly accepted data lakes as centralized data repositories that improve scalability and flexibility. Many big data vendors began to hype the "data lake" concept, while Gartner criticized and questioned it, and in doing so pointed out the direction in which data lake technology would develop over the following years. PricewaterhouseCoopers applied data lakes to enterprise data integration solutions, and IBM proposed big data analytics solutions oriented toward business topics. Academics also began to pay attention to data lakes and suggested using AI and crowdsourcing to overcome the challenges data lakes face in data integration, access, and data quality.

Technology accumulation period

From 2015 to 2017, the technology accumulation period, recognition of data lakes grew in both business and academic circles, and the underlying technology matured. In 2015, Terrizzano et al. described the challenges of implementing data lakes in "Data Wrangling: The Challenging Journey from the Wild to the Lake", including problems encountered in data collection, curation, provisioning, and security. This paper was the first comprehensive account of the challenges facing data lakes since Gartner raised its questions, and it pointed out the problems that needed to be solved in the future. Around the same time, Huang published "Managing Data Lakes in Big Data Era", and data lakes began to receive widespread attention in academia. Research on data lake applications also began to appear during this period, and many major IT vendors launched their own data lake products, such as Google's Goods system, Microsoft's Azure Data Lake Store, and SAP's Vora. Data lake research in this period focused mainly on defining the concept; work on data lake architecture was limited and concentrated mostly on metadata management. Application studies were few, and data lakes were used mainly for big data storage rather than for deeper applications.

Rapid development period

Since 2018, data lakes have flourished in both the business and academic worlds. During this period, the data lake has been enriched in terms of architecture, concepts, applications, and governance. First, many major IT vendors have proposed their own data lake solutions. Internationally these include Amazon, Microsoft, and Google, and in China they include Alibaba, Huawei, Tencent, and Transwarp (Xinghuan), all of which provide mature methods, solutions, and tools for each component of the data lake.

At the same time, academic research on prototype implementations of data lakes has received widespread attention, covering metadata management, data quality, data provenance, data preparation, data set organization, data integration, and data discovery. A large number of data lake application studies also emerged during this period. The application fields include healthcare, electric power, smart cities, education, and communications, and this work played a key role in the deep integration of big data platforms into these fields. At this stage, Chinese researchers also began to pay attention to data lake technology, with research covering data lake architecture and security technology.

The rapid development of recent years, together with continuous breakthroughs in industry and academia, has given enterprises and organizations a richer set of solutions and recommendations for implementing data lakes. However, the data lake concept is still in its early stages, and its architecture has not yet converged on an industry standard. Many technical details remain unresolved, and problems such as the over-reliance of some solutions on machine learning still need to be addressed.

Data lake concept

The current definition of a data lake is relatively vague. It can refer to a method of storing massive data, to an evolution of existing data architectures, or to a flexible and scalable data storage and management system.

Data lakes and data warehouses

The concept of the data warehouse was first proposed by IBM. Inmon defined it as a subject-oriented, integrated, time-variant, and non-volatile collection of data that supports management decisions. When the data lake concept emerged, many people associated it with the data warehouse, and some even regard the data lake as the data warehouse of the big data era. Both store data from different sources centrally, providing an important basis for an organization's data integration. Both also give the organization a data management and processing platform for data analysis, mining, and decision-making. However, the two concepts arose at different times and against very different backgrounds. More importantly, they take very different approaches to data processing.

One of the main differences between the two is how data is ingested. A data warehouse primarily ingests processed and filtered data, while a data lake primarily ingests raw or unprocessed data. Specifically, data is processed (for example, cleaned and transformed through an ETL process) before being stored in the warehouse, while the data stored in the data lake is unprocessed raw data. The data in the data warehouse has already been cleaned and can be analyzed directly, an approach known as "schema-on-write". In contrast, data lakes adopt "schema-on-read", where data is selectively organized and analyzed as needed, allowing for more flexible processing.
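The contrast between the two ingestion styles can be sketched in a few lines of Python. This is only an illustrative sketch, not a real warehouse or lake implementation: the sample events, the fixed `user`/`amount` schema, and the function names `load_warehouse`, `load_lake`, and `read_lake` are all assumptions made up for this example.

```python
import json

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"user": "a", "amount": "12.5", "ts": "2024-01-01"}',
    '{"user": "b", "amount": "oops", "ts": "2024-01-02"}',
    '{"user": "c", "clicks": 3}',
]


def load_warehouse(raw_lines):
    """Schema-on-write: clean and transform each record before storing it."""
    table = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            # Assumed fixed schema: user (str), amount (float).
            row = {"user": str(record["user"]), "amount": float(record["amount"])}
        except (KeyError, ValueError):
            # Records that do not fit the schema are rejected at load time.
            continue
        table.append(row)
    return table


def load_lake(raw_lines):
    """Data lake ingestion: keep every record in its raw form."""
    return list(raw_lines)


def read_lake(lake, fields):
    """Schema-on-read: parse and project only when a question is asked."""
    for line in lake:
        record = json.loads(line)
        if all(field in record for field in fields):
            yield {field: record[field] for field in fields}


warehouse = load_warehouse(RAW_EVENTS)
lake = load_lake(RAW_EVENTS)
clicks = list(read_lake(lake, ["user", "clicks"]))
```

In this sketch the warehouse ends up holding only the one record that fits its schema, so the click event is lost before anyone asks about clicks. The lake keeps all three raw records, and the later question about `clicks` can still be answered by applying a different structure at read time.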

Another key difference is the purpose the data serves. The data ingested by a data warehouse is usually intended for specific topics or goals, so storage space is not wasted on unneeded data and analysts do not need specialized expertise to use it. In contrast, the purpose of the data in a data lake is not determined up front, and the data can be used for any future analysis goal. This means analysts need to be familiar with large amounts of unprocessed data and may need to rely on data scientists with specific skills.

Accessibility, or ease of use, is another aspect that differentiates a data warehouse from a data lake. Because the structure of a data warehouse is relatively fixed, adjusting that structure can be very costly. A data lake, by contrast, has no fixed data structure and is therefore extremely flexible.

Data middle platform and data lake

The concept of the middle platform was first proposed by Alibaba Group. It grew out of the idea of sharing business capabilities within the enterprise. Middle platforms are divided into the business middle platform, the data middle platform, and the technology middle platform. Among them, the data middle platform (DataPlatform) is data-centered and provides full life cycle management of data in the form of services, built on data integration (especially semantic integration), in order to support business development and realize the value of data for applications. In essence, it is a data platform.

The data middle platform and the data lake are both data architecture solutions that help enterprises cope with the challenges of the internal and external big data ecosystem. At their core, both concepts include unified data integration, open data capabilities, and flexible data access. Although both were born in the era of big data, the scope of the problems they solve is different.

The data lake emphasizes data storage and governance as a response to big data challenges, while the data middle platform is an enterprise-wide data solution. The data middle platform is a superset of the data lake concept. In addition to the characteristics of the data lake, it must also provide further system functions, including data asset management, governance mechanisms, data security, and data capability sharing.

The two also address different problems. The data lake concept changed the way data is stored and explored, effectively meeting the technical challenges brought by big data. The data middle platform addresses problems at the implementation level of enterprise big data platforms, focusing on how to better mine the value of data, which falls within the scope of enterprise information management.

Many Chinese data vendors, enterprises, and institutions have introduced the data middle platform concept into their digital transformation plans. In China, the data middle platform concept has effectively subsumed the data lake concept. The data middle platform is currently discussed mostly in the commercial field and has not received enough attention in academia. In contrast, the data lake concept has developed very rapidly in international academia and has formed a recognizable body of research.