On Data Lake Technology and Its Application

Abstract

In June 2020, my company won the bid for Phase 1.0 of a bank's data lake platform construction project, with a two-year schedule and a total investment of 50 million yuan. The project built the bank's data lake, bringing all of the bank's business data and user behavior logs into the lake. It provides precision marketing for the bank's investment, wealth management, and loan businesses, supports the mining of potential customers, and has helped the bank achieve rapid business growth. I was fortunate to serve as the project lead and architect throughout its construction and development. The schedule was tight, the workload heavy, and many people and organizations were involved: more than 600 people from 40 of the bank's internal departments, and more than 300 people from over 20 vendor teams. The system went live in May 2022 and passed final acceptance in June 2022; it was unanimously affirmed by users and successfully achieved the project's stated goals. Drawing on practical experience and taking this project as an example, this article discusses data lake technology and its application in the course of project construction.


Body

In June 2020, as the project lead and architect, I presided over a bank's data lake platform construction project, with a two-year schedule and a total investment of 50 million RMB. The schedule was tight and the workload heavy, which made the project quite challenging. First, many departments had to cooperate on the transformation: nearly 40 departments and 60 applications took part in building the data lake platform, and all 60 applications had to agree on a unified data format for entering the lake and access a unified data lake interface. Second was the choice of data lake architecture: selecting a highly available architecture with sufficient storage capacity became the project's main technical difficulty, because a data lake must store a very large volume of data, the required storage space and memory are considerable, and the architecture must also allow convenient later expansion.

According to our research, a data lake differs clearly from a data warehouse. A data warehouse is a database optimized for analyzing relational data from transactional systems and line-of-business applications. Data warehouse technology requires the data structures and schemas to be defined in advance in order to optimize fast SQL queries, whose results are typically used for operational reporting and analysis. Data is cleaned, enriched, and transformed so it can serve as a trusted "single source of truth" for users. A data lake, by contrast, can store both relational data from line-of-business applications and non-relational data from mobile apps, IoT devices, and social media, and no data structure or schema needs to be defined when the data is captured. A data lake lets users apply different types of analysis to the data (such as SQL queries, big data analysis, full-text search, real-time analysis, and machine learning), supporting intelligent decision-making in the enterprise.
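The schema-timing difference described above can be made concrete with a small sketch. This is an illustrative plain-Python example, not code from the project; the field names and schema are hypothetical.

```python
import json

# Schema-on-write (data warehouse style): records must match a
# predefined schema before they are stored.
WAREHOUSE_SCHEMA = {"account_id": str, "amount": float}

def load_into_warehouse(record: dict) -> dict:
    for field, ftype in WAREHOUSE_SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing required field: {field}")
        record[field] = ftype(record[field])  # coerce or fail up front
    return record

# Schema-on-read (data lake style): raw records are stored as-is;
# structure is applied only when the data is analyzed.
def load_into_lake(raw: str) -> str:
    return raw  # stored untouched, no schema enforced

def read_from_lake(raw: str, fields: list) -> dict:
    parsed = json.loads(raw)
    return {f: parsed.get(f) for f in fields}  # schema applied at query time

raw = '{"account_id": "A1", "amount": "99.5", "channel": "mobile"}'
stored = load_into_lake(stored if False else raw)
print(read_from_lake(stored, ["account_id", "amount"]))
```

Note how the lake accepts the extra `channel` field without complaint, while the warehouse loader would validate and convert every record before storage.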

The following compares data lake and data warehouse technologies from six aspects: main data sources, when the data schema is applied, data storage cost, data quality, target users, and the main application types supported:

  1. Main data sources. Data lakes mainly ingest structured, semi-structured, and unstructured data from IoT devices, the Internet, mobile applications, social media, and enterprise applications. Data warehouses mainly ingest structured data from transactional systems, operational databases, and line-of-business applications.
  2. When the data schema is applied. No schema conversion is performed when data enters a data lake; the schema is applied only when the data is actually analyzed (schema-on-read). For a data warehouse, the schema generally must be designed before the data is loaded (schema-on-write).
  3. Data storage cost. Data lakes are usually built on low-cost, non-relational storage, so storage costs are relatively low. Data warehouses are usually built on relational databases, so storage costs are high.
  4. Data quality. A data lake holds raw, unprocessed data. A data warehouse holds high-quality data that serves as an important factual basis.
  5. Target users. Data lakes generally serve business analysts, application developers, and data scientists. Data warehouses generally serve business analysts.
  6. Main application types supported. Data lakes mainly support machine learning, predictive analytics, and data discovery and analysis. Data warehouses mainly support batch reporting, business intelligence, and data visualization.

Having understood the differences between data lake and data warehouse technologies, we needed to find an appropriate architecture for the data lake. Data warehouses are generally implemented with technologies such as Hadoop, Flink, and Hive, and in principle data must be cleaned and filtered before entering the warehouse. Data entering a data lake, however, needs no pre-processing: cleaning, filtering, and visual presentation happen only when the data in the lake is actually used. After multiple rounds of investigation and evaluation, the senior architects in the industry and in our company agreed to build the data lake on the Hudi architecture.

The raw data of all applications enters the lake in two parts. The first part is business data, which we capture with Flink CDC and write into the ods layer (Hudi tables). The second part is user behavior logs, which are written into the ods layer (Hudi tables) through Flume and Kafka, and then mapped with Flink SQL. Concretely, some of the relevant ods-layer data is converted into the dim layer (dimension layer), where multiple business tables are presented dimensionally and written into Hudi tables. Other relevant ods-layer data is converted into the dwd layer (detail layer), where multiple business tables (such as the order table and the order detail table) are joined and written into Hudi tables. The dwd detail-layer data is then aggregated (using Flink aggregation functions such as SUM) and associated with the dimension information. Finally, the order data, marketing data, user behavior logs, and other data that the business needs to query are mapped into a clustered MySQL deployment and visualized through Superset.
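The layered flow from ods to the aggregated result can be sketched in plain Python. This simulates the join-then-aggregate logic that the Flink SQL jobs perform; the table and field names are hypothetical, not the project's actual schema.

```python
# Hypothetical ods-layer records (illustrative only).
ods_orders = [
    {"order_id": 1, "user_id": "u1"},
    {"order_id": 2, "user_id": "u2"},
]
ods_order_details = [
    {"order_id": 1, "item": "fund_a", "amount": 100.0},
    {"order_id": 1, "item": "fund_b", "amount": 50.0},
    {"order_id": 2, "item": "loan_x", "amount": 200.0},
]

# dwd layer: join the order table with the order detail table,
# as the Flink SQL job would, producing one wide record per item.
dwd = [
    {**order, **detail}
    for order in ods_orders
    for detail in ods_order_details
    if order["order_id"] == detail["order_id"]
]

# Aggregation step: the SUM-style aggregation per user that Flink
# performs on the dwd detail layer before the result is served.
totals = {}
for row in dwd:
    totals[row["user_id"]] = totals.get(row["user_id"], 0.0) + row["amount"]

print(totals)  # {'u1': 150.0, 'u2': 200.0}
```

In the real pipeline each stage is a continuous streaming job writing to Hudi tables rather than an in-memory list, but the data movement between layers follows this shape.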

The project passed acceptance in June 2022 and has run well ever since, without a single major production incident. Phase 1.0 of the bank's data lake platform has been a great success. It not only brings the raw business data into the lake but also gives users a visual interface for the views they need, so that they have a clear grasp of existing marketing data and business production data. It also supports data analysis by business staff and users, which has real practical value for precision marketing and improves the bank's resource utilization. Bringing the raw business data into the lake also gives the bank a degree of data protection in responding to supervision by the China Banking Regulatory Commission.

There were two shortcomings. First, we neglected system availability during architecture design: in the system testing phase we found that a single Hudi deployment losing its connection to the server room could sometimes make the entire data lake unavailable. We therefore adopted redundancy and a heartbeat detection mechanism, deploying the Hudi architecture across the north and south server rooms; when one server becomes unavailable, the other takes over, which improves the system's availability. Second, when mapping data to MySQL, queries were sometimes quite slow because too much data was mapped into a single MySQL instance. For this we deployed a MySQL cluster with multiple master-slave nodes, implemented database and table sharding in MySQL, and adopted read-write separation, which successfully resolved the slow queries.
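The sharding and read-write separation described above can be sketched as a simple routing layer. This is an illustrative example, assuming a hash-based sharding key and one master plus one replica per shard; the host names are hypothetical.

```python
import hashlib

# Hypothetical cluster layout: one master and one read replica per shard.
SHARDS = [
    {"master": "mysql-shard0-master", "replica": "mysql-shard0-replica"},
    {"master": "mysql-shard1-master", "replica": "mysql-shard1-replica"},
]

def shard_for(user_id: str) -> dict:
    """Sub-database sharding: a stable hash of the sharding key
    always maps the same user to the same shard."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def route(user_id: str, is_write: bool) -> str:
    """Read-write separation: writes go to the shard's master,
    reads go to its replica."""
    shard = shard_for(user_id)
    return shard["master"] if is_write else shard["replica"]

print(route("u1", is_write=True))
print(route("u1", is_write=False))
```

In practice middleware such as a sharding proxy handles this routing transparently, but the decision it makes per query is essentially the one shown here.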

In short, I learned a great deal from this project. In my future work, I will apply my professional knowledge and strive to contribute to the development of the country and society.

Origin blog.csdn.net/miachen520/article/details/131236243