Data service: a sharp tool to ensure data security and enhance data value

Parts 04-08 covered metadata and the five application scenarios built on it: data discovery (data map), metric management, model design, data quality, and cost optimization. That content corresponds to the OneData methodology of the data center. Having worked through it, you should understand how OneData is implemented within an enterprise.

This part turns to the other core methodology of the data center, OneService, which is realized through the data service.

Servitization is common in business systems, where it is the only practical way to split the business into smaller parts (the idea behind microservices). What does servitization mean in the data center, and what problems does a data service solve?

Servitization means that different systems interact through services, and a service is usually exposed as an API.

To see what problems a data service solves, first look at the pain points of day-to-day data development without one.

1 There are many data access methods, but the access efficiency is low

Processed data in the data center is usually stored on HDFS as Hive tables. Querying Hive directly from a report or a data product front end is too slow, so the data is first imported into an intermediate storage chosen for query speed:

  • MySQL, Oracle, and other relational databases: easy to deploy and maintain, with strong query performance on small data sets. They are the recommended intermediate storage when the data volume is below 5 million records.
  • Greenplum: suited to large data volumes with multi-dimensional queries, where its OLAP performance on massive data is excellent. Use it when the data exceeds 5 million records and queries filter on multiple conditions.
  • HBase: suited to single-key lookups over large data volumes, where it has good read and write performance. Use it when the data exceeds 5 million records and queries fetch a value by its key. HBase does not natively support secondary indexes, so if you need one, Elasticsearch (ES) can be introduced to hold the mapping from the secondary index to the RowKey (the key in HBase). A query first finds the RowKey in ES by the secondary index and then fetches the value from HBase by that RowKey.
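The rules of thumb above, and the two-step lookup through ES and HBase, can be sketched as follows. This is a minimal illustration, not a real client: the two dictionaries stand in for an ES index and an HBase table, and the function names, the email field, and the sample row are all hypothetical.

```python
from typing import Optional

# Threshold taken from the text: 5 million records.
SMALL_DATA_LIMIT = 5_000_000


def choose_storage(record_count: int, single_key_lookup: bool) -> str:
    """Pick an intermediate storage using the rules of thumb listed above."""
    if record_count < SMALL_DATA_LIMIT:
        return "relational DB (MySQL/Oracle)"
    if single_key_lookup:
        return "HBase"
    return "Greenplum"


# In-memory stand-ins (hypothetical data), not real ES or HBase clients.
es_index = {"alice@example.com": "user#0001"}                 # secondary index -> RowKey
hbase_table = {"user#0001": {"name": "Alice", "orders": 12}}  # RowKey -> value


def lookup_by_secondary_index(email: str) -> Optional[dict]:
    """Step 1: resolve the RowKey in ES. Step 2: read the row from HBase."""
    row_key = es_index.get(email)
    if row_key is None:
        return None
    return hbase_table.get(row_key)
```

In a real deployment the two lookups are separate network calls, and the ES mapping has to be kept in sync with HBase writes; the sketch ignores both concerns.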

Each intermediate storage has its own access API, so every data application must write code specific to the storage it uses. An application that touches several storages must maintain several sets of access code, which makes data access very inefficient to develop.

The data service hides the differences between intermediate storages from application developers. Applications access data through one unified API, which greatly improves the development efficiency of data applications.
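One way to picture the unified API is an adapter per storage behind a single entry point. The sketch below is an assumption about how such a layer could look, not the design of any particular product; the class names, API names, and sample rows are hypothetical, and the adapters are in-memory stand-ins rather than real database clients.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Row = Dict[str, Any]


class StorageAdapter(ABC):
    """Storage-specific access code lives behind this one method."""

    @abstractmethod
    def query(self, filters: Row) -> List[Row]:
        ...


class InMemoryTableAdapter(StorageAdapter):
    """Stand-in for a relational or OLAP store: filter on any columns."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def query(self, filters: Row) -> List[Row]:
        return [r for r in self._rows
                if all(r.get(k) == v for k, v in filters.items())]


class InMemoryKeyValueAdapter(StorageAdapter):
    """Stand-in for a key-value store: look rows up by a single key."""

    def __init__(self, key_field: str, rows: List[Row]) -> None:
        self._key_field = key_field
        self._by_key = {r[key_field]: r for r in rows}

    def query(self, filters: Row) -> List[Row]:
        row = self._by_key.get(filters.get(self._key_field))
        return [row] if row is not None else []


class DataService:
    """Applications call APIs by name and never see the storage behind them."""

    def __init__(self) -> None:
        self._apis: Dict[str, StorageAdapter] = {}

    def register(self, api_name: str, adapter: StorageAdapter) -> None:
        self._apis[api_name] = adapter

    def call(self, api_name: str, **params: Any) -> List[Row]:
        return self._apis[api_name].query(params)


service = DataService()
service.register("shop_sales", InMemoryTableAdapter([
    {"shop": "A", "day": "2023-07-01", "sales": 10},
    {"shop": "B", "day": "2023-07-01", "sales": 7},
]))
service.register("user_profile", InMemoryKeyValueAdapter(
    "user_id", [{"user_id": "u1", "level": "gold"}]))
```

Both `service.call("shop_sales", shop="A")` and `service.call("user_profile", user_id="u1")` use the same calling convention, even though the two APIs are backed by different kinds of storage.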

Having to integrate with different intermediate storages is only one cause of inefficient data access. The other is that data and interfaces cannot be reused.

2 There is no way to reuse data and interfaces

Schematic diagram of data and interfaces that cannot be reused

When building "Data Application - Business Analysis", the data developers process table c from table a. The application developers then import the data of a and b into db1, the database of "Data Application - Business Analysis", and write the business analysis server-side code, which serves the web front end through interface 1.

When the task of building "Data Application - Gross Profit Analysis" arrives, it also needs the data in table b. That data already exists in db1, but db1 belongs to "Data Application - Business Analysis" and cannot be shared with "Data Application - Gross Profit Analysis".

The server-side interface of business analysis cannot be reused for gross profit analysis either, because it belongs to the business analysis application and has been heavily customized to that application's requirements.

So even when the data is identical, neither the intermediate storage nor the server-side interfaces can be reused across data applications. This silo-style ("chimney") development makes data application development inefficient.

With a data service, what the data center exposes is no longer data but interfaces. An interface no longer belongs to one data application; it lives on the unified data service and can be shared between applications. Because the data service also provides rate limiting, the data behind an interface can be shared safely: one application's traffic cannot crowd out another's.
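A common way to keep applications that share an interface from affecting each other is a per-application token bucket. The source only says that the data service has rate limiting, so the algorithm, the class names, and the quotas below are assumptions chosen for illustration.

```python
import time
from typing import Any, Callable, Dict


class TokenBucket:
    """Token bucket: refills at rate_per_sec, holds at most capacity tokens."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class SharedApi:
    """One interface shared by several applications, each with its own quota."""

    def __init__(self, handler: Callable[..., Any],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._handler = handler
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def grant(self, app: str, rate_per_sec: float, capacity: int) -> None:
        self._buckets[app] = TokenBucket(rate_per_sec, capacity, self._clock)

    def call(self, app: str, **params: Any) -> Any:
        bucket = self._buckets.get(app)
        if bucket is None:
            raise PermissionError(f"{app} has no access to this interface")
        if not bucket.allow():
            raise RuntimeError(f"{app} exceeded its rate limit")
        return self._handler(**params)
```

Because each application draws from its own bucket, a burst from gross profit analysis is rejected on its own quota and leaves the business analysis quota untouched. The injectable `clock` exists only to make the behavior easy to test.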

Once a data application goes online, it enters the operation and maintenance stage. What happens at this stage without a data service?

3 Not knowing which applications access the data

Schematic Diagram of Fault Recovery

Zhang Haoliang is a data developer. One morning he received an alert: a large number of tasks had failed (the tasks that produce the tables marked in red in the figure above). He traced the problem to a source database of an upstream business system. A change to the structure of a table there had broken the raw data cleaning in the data center, which in turn affected many downstream tasks.

In front of him was a pile of tasks to recover and rerun. Queue resources are limited, so which task should be recovered first? Which task ultimately feeds the report the boss will read the next day?

Data lineage records the links between tables, but it stops at the last table: we do not know which applications access that table, so the link from table to application is missing. When a task fails, we cannot quickly determine which data applications it affects, and we cannot set recovery priorities by the scope of the impact. The result is that unimportant reports may be restored first while important ones wait.

The same gap hurts cost management: with no link between applications and data, nobody dares to take a table offline.

The data service closes this gap. Because every application reaches data through it, it can record full-link data lineage from each data application down to the data in the data center. That is like being handed a map in a maze. When any task fails, we can follow the map to find which applications are affected and recover the important ones first. For the same reason, we can safely take offline any table in the data center that no application depends on.
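The "map" can be thought of as a directed graph from tasks to tables to interfaces to applications. The sketch below walks that graph breadth-first to list affected applications in priority order. The graph, the node names, and the priorities are all hypothetical; a real data service would build the table-to-application edges from its own access records.

```python
from collections import deque
from typing import Dict, List

# Edges point downstream: task -> table -> interface -> application.
# All names here are hypothetical examples.
DOWNSTREAM: Dict[str, List[str]] = {
    "task_clean_orders": ["table_a"],
    "table_a": ["table_c", "api_1"],
    "table_b": ["api_1", "api_2"],
    "table_c": ["api_3"],
    "api_1": ["app_business_analysis"],
    "api_2": ["app_gross_profit"],
    "api_3": ["app_boss_report"],
}

# Lower number = recover first.
APP_PRIORITY: Dict[str, int] = {
    "app_boss_report": 1,
    "app_business_analysis": 2,
    "app_gross_profit": 3,
}


def affected_apps(failed_node: str) -> List[str]:
    """Breadth-first walk downstream; return affected apps, most important first."""
    seen = {failed_node}
    queue = deque([failed_node])
    apps = set()
    while queue:
        node = queue.popleft()
        for nxt in DOWNSTREAM.get(node, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt in APP_PRIORITY:
                apps.add(nxt)
            else:
                queue.append(nxt)
    return sorted(apps, key=lambda app: APP_PRIORITY[app])


def safe_to_offline(table: str) -> bool:
    """A table is safe to take offline when no application is reachable from it."""
    return not affected_apps(table)
```

With this graph, a failure in `task_clean_orders` affects the boss report and business analysis but not gross profit analysis, so the tasks feeding the boss report are rerun first.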

Not knowing which downstream applications use the data is one operations problem. Another is that tables are often restructured during this phase, which may be the worst nightmare of data application developers.

4 Field changes in the data department force application changes

Fields in the underlying models of the data center change fairly often, because the summary-layer models are themselves continually optimized as requirements evolve.

Suppose "Data Application - Business Analysis" reads the c field of the ads_mamager_1d table in the data center. If that table is rebuilt and the application must read the e field instead, the application code has to be modified. It is unreasonable for an application to be redeployed because data in the data center changed: this adds work for the application developers and also slows down the data change itself.

With a data service, data applications are decoupled from the data in the data center. If the structure of a table changes, only the mapping between interface parameters and data fields on the data service needs to be updated. The data application's code does not change, and the application does not need to be redeployed.
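A minimal sketch of that mapping follows. The table name and the fields c and e come from the example above; the interface parameter name `metric_value`, the class, and the sample rows are hypothetical.

```python
from typing import Any, Dict


class MappedApi:
    """Maps stable interface parameters to whatever the table's fields are today."""

    def __init__(self, table: str, field_map: Dict[str, str]) -> None:
        self.table = table
        self._field_map = dict(field_map)  # interface parameter -> table field

    def remap(self, param: str, new_field: str) -> None:
        """Update the mapping after a table rebuild; callers are unaffected."""
        self._field_map[param] = new_field

    def fetch(self, row: Dict[str, Any]) -> Dict[str, Any]:
        """Return the row as seen through the interface's parameter names."""
        return {param: row[field] for param, field in self._field_map.items()}


# Before the rebuild the value lives in field c.
api = MappedApi("ads_mamager_1d", {"metric_value": "c"})
before = api.fetch({"c": 120})

# After the rebuild it lives in field e; only the mapping changes.
api.remap("metric_value", "e")
after = api.fetch({"e": 120})
```

The application sees `{"metric_value": 120}` both before and after the rebuild, so its code never needs to know that the field was renamed.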

5 Summary

This part walked through typical problems in data access and in operation and maintenance, and briefly analyzed why a data service helps solve them. Left unsolved, these problems make it inefficient for data applications to use data from the data center and make that data harder to maintain.

The next part covers the functions a data service should provide, which deserve attention if you are planning to design a data service or to select a data service product. It closes with an implementation plan that highlights the key design decisions.

6 FAQ

Does it make sense to say that a data service solves data security problems?

Yes. A data service can apply several security measures. For example, it may encrypt data, and it can enforce authentication and access control so that only authorized users can reach the data. In that sense a data service does help address data security.

Origin blog.csdn.net/qq_33589510/article/details/131969609