Enterprise Big Data Lake: Overall Planning and Construction Plan for Integrated Operation Management

Background: Data enters the lake quickly, analysis is more intelligent, applications are more diverse, and services are more open

3d0ea4734423f9253d3e5b07a949a08a.jpeg

More enterprise data will enter the data lake. Data from traditional systems will continue to merge with new data sources such as sensors, and data silos will continue to break down.
As big data analysis capabilities improve, artificial intelligence is becoming steadily more important. Today's most advanced machine learning and artificial intelligence systems are moving beyond traditional rule-based algorithms toward systems that understand, learn, predict, adapt, and even operate autonomously.

Data service: Data is opened up more deeply and big data is linked across industries, making it possible to build targeted industry solutions with more diverse application capabilities.

Data application: Intelligent applications provide personalized services based on AI and machine learning analysis. Applications are built rapidly on the data lake, with fine-grained collection, exploration, and analysis.

Data analysis: Analysis spans machine learning, deep learning, and artificial intelligence, with deep training and fast analysis on the large volume of raw data in the data lake.

Data governance: Governance starts at ingestion ("entering the lake is governance"). Define data-driven ingestion standards for the input data of each source system, and maintain governance rules in real time with data at the core.

Data platform: The platform's storage model shifts to a data lake model that aggregates data of many kinds, supporting ingestion of structured, semi-structured, and unstructured data.

Definition and characteristics of data lake

A data lake is a method of storing data in its natural format in a system or repository. It accommodates data in many schemas and structural forms, usually as object blobs or files. The lake holds structured data from relational databases (rows and columns), semi-structured data (CSV, XML, JSON, logs), unstructured data (emails, documents, PDFs), and binary data (images, audio, video), forming a centralized store for all forms of data.
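
The four categories above can be illustrated with a small classifier that tags an incoming object before it lands in the lake. This is only a sketch: the extension table and the `from_relational_db` flag are assumptions made for illustration, not part of any particular lake product.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the categories named above.
CATEGORY_BY_EXTENSION = {
    ".csv": "semi-structured",
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".log": "semi-structured",
    ".eml": "unstructured",
    ".docx": "unstructured",
    ".pdf": "unstructured",
    ".jpg": "binary",
    ".mp3": "binary",
    ".mp4": "binary",
}


def classify(path: str, from_relational_db: bool = False) -> str:
    """Return the data category for an object arriving in the lake."""
    if from_relational_db:
        # Table extracts (rows and columns) count as structured, whatever the file type.
        return "structured"
    suffix = PurePosixPath(path).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "unknown")
```

Because the lake keeps data in its natural format, the category is only a label for later processing; nothing here converts or rejects the object.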

69ce82f844c43b8e0e2488a2f60a75c1.jpeg

Understanding the data lake concept through comparison - advantages

67c2fb34342824ac2a7266eb8b58c030.jpeg

Understanding the data lake concept through comparison - disadvantages

90ae3b1439dcd375a41dd0ba066a6bc5.jpeg

The data lake is an extension of the concept of big data

A "data lake" is a concept for applying big data within the enterprise, and it is regarded here as the best big data solution for enterprises.
A "data lake" is not only a unit of data storage and processing, but also a process of releasing the value of data.
The key to successfully applying big data in the enterprise is not to store all the data, but to build a more meaningful "data lake" that helps the enterprise extract high-value data faster.
The data lake is an advanced stage in the development of big data. It is a construction concept, not a specific implementation method: an architectural concept, an evolution of the data warehouse, and an extension of the concept of big data.

f516786af10d24928406a393769e0711.jpeg

Architecture Planning of Data Lake System

Logical architecture of the data lake

Persistence layer: stores all structured, semi-structured, and unstructured data obtained from internal and external sources.
Analytics sandbox: data scientists and analysts are granted access to the persistence layer and use it for data research and experimentation.
Curated data sources (exploration): data analysts process commercially valuable data and create new data sources for business analysts.
Operational layer: business analysts continue to refine the processed data and work with the data management team to turn it into data that is easier to operate on and use, then store it for wider use.
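
The four layers form a refinement path from raw to ready-to-use data. A minimal sketch of that ordering follows; the enum, the `PRIMARY_USERS` table, and `next_layer` are hypothetical names used only to make the flow concrete.

```python
from enum import Enum
from typing import Optional


class Layer(Enum):
    # Members are declared in the order data is refined.
    PERSISTENCE = "persistence layer"
    ANALYTICS_SANDBOX = "analytics sandbox"
    CURATED = "curated data sources"
    OPERATIONAL = "operational layer"


# Who mainly works in each layer, following the descriptions above.
PRIMARY_USERS = {
    Layer.ANALYTICS_SANDBOX: "data scientists and analysts",
    Layer.CURATED: "data analysts",
    Layer.OPERATIONAL: "business analysts and the data management team",
}


def next_layer(layer: Layer) -> Optional[Layer]:
    """Return the layer that refined data moves to next, or None at the end."""
    ordered = list(Layer)  # Enum iteration preserves definition order
    position = ordered.index(layer)
    return ordered[position + 1] if position + 1 < len(ordered) else None
```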

9bccdb21848797d3b7ba2bdce6452a3b.jpeg

Big Data Lake System Planning

39bdf8dcdef3eb0343005efcd8c00e31.jpeg

Big Data Lake Storage Capacity Planning - Unified Standard/Partitioned Storage

Principles of storage partitioning:
Production data area: follows the Telecom Group's data modeling standards and master data specification requirements; at the specification level it belongs to the big data lake, while at the physical resource level it can use lake resources or be self-built;
Native data area: stores production data by domain and category; standardizes and transforms non-standard data;
Integrated data area: uses technologies such as big data mining to collect and complete entities; builds entity association views;
Master data area: stores enterprise-level master data for the entire network and is the only provider of master data in the big data lake;
Application area: following the principle that data leaves the lake and that its value is fully mined, it provides users with a processing space based on their own, native, and integrated data, and carries out application-oriented data processing.
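
One way to make the partitioning concrete is a table of which areas each area may read from. The sketch below is an assumption-heavy illustration: only the application area's sources (its own, native, and integrated data) and the master data area's role as sole master data provider come from the principles above; the remaining entries and all names are hypothetical.

```python
# Which source areas each consuming area may read from.
READABLE_SOURCES = {
    # From the principles above: native data is stored production data.
    "native": {"production"},
    # Assumption: integration works on native data plus master data.
    "integrated": {"native", "master"},
    # Assumption: master data is synchronized from production systems.
    "master": {"production"},
    # From the principles above: own, native, and integrated data,
    # with master data served by the master data area.
    "application": {"application", "native", "integrated", "master"},
}


def can_read(consumer: str, source: str) -> bool:
    """Return True if the consuming area is allowed to read the source area."""
    return source in READABLE_SOURCES.get(consumer, set())
```

Keeping the rules in one table means an access request can be checked in one place instead of in each application.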

02e83f5148749cbb0b8acccf41a84a78.jpeg

Big Data Lake Native Data Area Planning - Native Lake/Classified Storage/On-Demand Use

To support on-demand use of natively ingested, classified data, the native data area stores raw data periodically by domain and category. It provides native data sharing services to cloud companies, the group ODS, and provincial big data platforms, and provides raw data services to the integrated data area and the application data area within the lake.
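
Domain-based, classified, periodic storage maps naturally onto a partitioned directory layout. The sketch below assumes a `key=value` path convention (similar to Hive-style partitioning) and a hypothetical `/lake/native` root; neither is specified in the original plan.

```python
from datetime import date


def native_path(domain: str, category: str, load_date: date,
                root: str = "/lake/native") -> str:
    """Build the storage path for one periodic load of raw data."""
    return (
        f"{root}/domain={domain}"
        f"/category={category}"
        f"/dt={load_date.isoformat()}"
    )
```

For example, `native_path("billing", "invoice", date(2023, 4, 1))` yields `/lake/native/domain=billing/category=invoice/dt=2023-04-01`, so a consumer that needs one domain or one period reads only the matching directories.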

dd30a520c20596c4ccd516d7f0d37140.jpeg

Big data lake integrated data area planning - building an enterprise-level core entity association view

The integrated data area performs data cleaning, code conversion, and entity alignment, and builds an enterprise-level core entity association view that provides integrated data services to the application area. It keeps data at atomic granularity, does not aggregate it, and does not affect the processing of business indicators in the business area.
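
The three integration steps can be sketched on a single record. Everything specific here is assumed for illustration: the source system names, the gender code tables, and the `ENTITY_IDS` lookup standing in for a real entity alignment service.

```python
# Hypothetical per-source code tables used for code conversion.
GENDER_CODES = {
    "crm": {"1": "M", "2": "F"},
    "billing": {"male": "M", "female": "F"},
}

# Hypothetical alignment table: (source system, local id) -> enterprise entity id.
ENTITY_IDS = {
    ("crm", "c-17"): "E001",
    ("billing", "9001"): "E001",
    ("billing", "9002"): "E002",
}


def integrate(record: dict) -> dict:
    """Clean, convert, and align one record without aggregating it."""
    source = record["source"]
    source_id = record["source_id"].strip()                       # cleaning
    raw_gender = record["gender"].strip().lower()                 # cleaning
    gender = GENDER_CODES[source].get(raw_gender)                 # code conversion
    entity_id = ENTITY_IDS.get((source, source_id))               # entity alignment
    return {"entity_id": entity_id, "gender": gender,
            "source": source, "amount": record["amount"]}


def integrate_all(records: list) -> list:
    # One output row per input row: atomic granularity is preserved.
    return [integrate(record) for record in records]
```

Two rows from different systems that describe the same customer come out with the same `entity_id` but remain separate rows, which is what leaves indicator calculation to the business area.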

2b5e949bdaf74c6122a9a2cca39342fb.jpeg

Big Data Lake Application Data Zone Planning - Application-oriented self-built and self-maintained data zone

Independence: ensure resource independence, data independence, and application independence;
Availability: ensure high availability and stability of storage, computing, and data resources; ensure that zone resources can be expanded smoothly online;
Ease of use: provide rich visual development and zone operation tools;
Manageability: the big data lake can monitor and audit each zone;
Data serviceability: zone data can be exposed as data services, and can also be linked and called directly by applications under development.

3054f3afa9e8c5dfcfb454a8a0f00fdd.jpeg

Big data lake master data area planning - Enterprise-level core/unified operation guarantee

The master data area stores global master data and keeps it synchronized with the master data producers. It provides a single master data source for every area of the big data lake, ensuring the consistency and integrity of enterprise-level core entity data in the lake and improving the lake's operational efficiency and effectiveness.

Unified master data standards: provide master data standards for all production systems in all domains across the country;
Unified master data storage: provide unified master data storage capabilities for the big data lake;
Unified master data integration: clean and integrate master data from the various domains to form unified, standardized, unique master data;
Unified master data service: provide master data services for all areas in the big data lake.
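
Unified integration, in its simplest form, means collapsing the records that several domains hold for the same entity into one standardized record. The sketch below assumes a shared `customer_id` key and a "first non-empty value wins" merge rule; both are illustrative choices, not rules stated in the plan.

```python
def integrate_master(records: list) -> list:
    """Merge per-domain master records into one unique record per key."""
    merged = {}
    for record in records:
        # Standardize the key so that " c001" and "C001" refer to the same entity.
        key = record["customer_id"].strip().upper()
        target = merged.setdefault(key, {"customer_id": key})
        for field, value in record.items():
            if field == "customer_id":
                continue
            if isinstance(value, str):
                value = value.strip()
            if value is None or value == "":
                continue  # empty values never overwrite or fill a field
            # Assumed rule: the first non-empty value seen for a field is kept.
            target.setdefault(field, value)
    return list(merged.values())
```

A production system would normally rank sources by trust rather than by arrival order; the point here is only that every area of the lake then reads the same single record.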

2a5ac92c3f9db3f050dc25dd1109d597.jpeg

Collection and alignment of natural person entities in the ecosystem

Customer data collection should combine two identification methods: rule matching based on data attributes, and mining-based identification that draws on the massive communication-related information of customers:
1) Rule matching efficiently identifies natural persons when the underlying data is highly accurate;
2) A natural person identification model built with big data technology serves as an effective supplement to rule-based identification, raising the identification success rate and reducing the workload of manual verification and confirmation.
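
The two-stage approach can be sketched as a rule check followed by a scoring fallback. The field names, the thresholds, and the contact-overlap score are all assumptions; in particular, the Jaccard overlap below is a simple stand-in for a trained identification model, not the model itself.

```python
def rule_match(a: dict, b: dict) -> bool:
    """Rule: two accounts with the same non-empty ID number are one person."""
    id_a, id_b = a.get("id_number"), b.get("id_number")
    return bool(id_a) and id_a == id_b


def contact_overlap(a: dict, b: dict) -> float:
    """Stand-in for a model score: Jaccard overlap of frequent contacts."""
    contacts_a = set(a.get("contacts", ()))
    contacts_b = set(b.get("contacts", ()))
    if not contacts_a or not contacts_b:
        return 0.0
    return len(contacts_a & contacts_b) / len(contacts_a | contacts_b)


def identify(a: dict, b: dict, accept: float = 0.6, reject: float = 0.2) -> str:
    if rule_match(a, b):
        return "same person (rule)"
    score = contact_overlap(a, b)
    if score >= accept:
        return "same person (model)"
    if score <= reject:
        return "different person"
    # Only the uncertain middle band is sent to manual verification.
    return "manual review"
```

The design intent matches the text: rules settle the high-accuracy cases cheaply, the score settles most of the rest, and manual confirmation is left for the cases neither can decide.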

85c4be05735008139b9998e4ec66bb6c.jpeg

Ecosystem data access and storage

Data from the five major ecosystems is collected into the lake and, after unified and standardized conversion, provides data support for the applications of the various special areas. Based on how each ecosystem's systems are built, plan multiple collection methods for ingestion scientifically. Ecosystem data specifications are organized around the functional partitions of the big data lake: explore the storage requirements of each kind of data and build the corresponding capabilities. Determine the application support model of the big data lake, and establish construction specifications for the special areas.

dc74b0f73f8ca85f1f17087b9c13bab8.jpeg

Big data lake unified access shared construction plan - unified directory/transparent access

Access sharing is the bridge between data in the lake and the applications and capabilities that use it. Any function or application module that uses lake data does not need to care about the data's storage method, storage medium, storage location, or similar details; connecting to the access sharing layer is enough to access data in the lake.
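
A unified directory with transparent access can be sketched as a catalog that maps a logical dataset name to a storage medium and location, then delegates to a reader for that medium. The class and method names are hypothetical, and the readers are plain callables supplied by the caller rather than real storage clients.

```python
from typing import Any, Callable, Dict, Tuple


class AccessCatalog:
    """Resolves a logical dataset name to wherever the data physically lives."""

    def __init__(self) -> None:
        self._readers: Dict[str, Callable[[str], Any]] = {}
        self._entries: Dict[str, Tuple[str, str]] = {}

    def register_reader(self, medium: str, reader: Callable[[str], Any]) -> None:
        """Attach the function that knows how to read one storage medium."""
        self._readers[medium] = reader

    def register_dataset(self, name: str, medium: str, location: str) -> None:
        """Record where a dataset is stored; callers never see these details."""
        if medium not in self._readers:
            raise ValueError(f"no reader registered for medium {medium!r}")
        self._entries[name] = (medium, location)

    def read(self, name: str) -> Any:
        """Read a dataset by logical name only."""
        medium, location = self._entries[name]
        return self._readers[medium](location)
```

An application calls `catalog.read("customers")` and gets the data; if the dataset later moves to another medium, only its catalog entry changes and the application code stays the same.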

772c7d0556c34d61afb56d537ddfebb5.jpeg

Origin blog.csdn.net/zuoan1993/article/details/130085790