Exploring the essence of data services: possibilities beyond the API

Data services play an important role in data platform construction. But what exactly is a data service? Does it just expose an API to the outside world? Is it really that simple?

I hope that after finishing this article, you will truly grasp both the product function design and the system architecture design of data services. That will be a great help whether you are designing a data service yourself or choosing a commercial product.

1 Eight functions a data service should have

A data service must provide at least eight functions to solve the usual problems: data access methods are fragmented and inefficient; data and interfaces cannot be shared; and nobody knows which applications are accessing the data.

Imagine a large Cainiao post station with many sets of shelves. Staff stand in front of each shelf to help us pick up packages, and a queue forms in front of each one.

To pick up a package, you first agree on an interface with the station (for example, everyone uses a unified pickup code). Then, to ensure that every queue gets served, the station limits the flow of each queue (for example, only one person at a time). When you collect a package, the station's machine scans and records which package was taken, which makes tracing easy.

Recently the post station has upgraded its service: besides in-station pickup, it now also delivers to your door. The shelves also differ by package type: fresh food sits on refrigerated shelves, while documents and envelopes go into file cabinets.

For the person picking up packages, buying fresh food and envelopes at the same time means queuing in several lines, which is inconvenient. Ideally you queue just once, and a staff member fetches your packages from multiple shelves in a single trip.

Because the station has so many shelves, it provides a guide so that everyone can quickly find the right shelf and queue. And to keep staff from making mistakes, they must pass a strict test before they can start work.

Back to the eight functions of data services. In the package-pickup story, you can think of:

  • The data service as the Cainiao post station
  • The staff as the API interfaces, decoupling the people picking up packages from the shelves
  • The shelves as the intermediate storage
  • The packages as the data

These map to the eight functions:

  • Standardized interface definition: the agreed pickup code, so that everyone retrieves packages with the same unified code
  • Data gateway: limiting the flow of the queue in front of each shelf, so that every queue can get its packages
  • Full-link lineage maintenance: the station recording who took which package
  • Data delivery: the station offering both in-station pickup and door-to-door delivery
  • Multiple intermediate storages: different types of shelves for different types of packages
  • Logical model: one staff member fetching packages from multiple shelves at once
  • API marketplace: the station's guide directing different queues to different shelves
  • API testing: the test staff must pass before starting work

With this story in mind, you should now have a vivid picture of the eight functions of data services. Next, let's look at what each of them specifically involves.

1.1 Standardized interface definition

The standardized interface definition is the pickup code we agreed on when collecting packages. The data service shields each data application from the different intermediate storages behind it and provides a unified API.

(Figure: schematic diagram of the data service interface.)

On the data service, you define the input and output parameters of each API interface.
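As a minimal sketch (the field names and the example API are hypothetical), an interface definition can record the model behind the API together with its typed input and output parameters:

```python
# A minimal sketch of a standardized API definition; the field names and the
# example "shop_sales" API are hypothetical, for illustration only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Param:
    name: str       # parameter name exposed to callers
    dtype: str      # declared type, e.g. "string", "bigint", "date"
    required: bool = True

@dataclass
class ApiDefinition:
    path: str                       # unified URL, e.g. /api/v1/shop_sales
    model: str                      # logical or physical model behind the API
    inputs: List[Param] = field(default_factory=list)
    outputs: List[Param] = field(default_factory=list)

shop_sales_api = ApiDefinition(
    path="/api/v1/shop_sales",
    model="dws_shop_sales_daily",
    inputs=[Param("shop_id", "bigint"), Param("dt", "date")],
    outputs=[Param("gmv", "decimal"), Param("order_count", "bigint")],
)
```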

1.2 Data Gateway

As a gateway, the data service must provide four core functions: authentication, authorization, rate limiting, and monitoring. These are the prerequisites for reusing data and interfaces, just as the Cainiao station authenticates us with a pickup code and limits the flow of each queue.

Authentication

To keep interfaces secure, the data service first assigns each registered application a pair of keys, an accesskey and a secretkey, which the application must carry on every API call.
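One common scheme (a sketch under my own assumptions, not any specific product's protocol) is to sign each request with the secretkey so the gateway can verify the caller without the secretkey itself ever being transmitted:

```python
# A minimal request-signing sketch; the header names and signing scheme are
# assumptions for illustration, not a specific gateway's protocol.
import hashlib
import hmac
import time

def sign_request(access_key: str, secret_key: str, path: str) -> dict:
    """Build auth headers; the gateway recomputes the HMAC to verify the caller."""
    timestamp = str(int(time.time()))
    message = f"{access_key}:{path}:{timestamp}".encode()
    signature = hmac.new(secret_key.encode(), message, hashlib.sha256).hexdigest()
    return {
        "X-Access-Key": access_key,
        "X-Timestamp": timestamp,   # lets the gateway reject stale requests
        "X-Signature": signature,
    }

headers = sign_request("app-123-ak", "app-123-sk", "/api/v1/shop_sales")
```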

Authorization

For each published API, the person in charge of the API can authorize applications, and only authorized applications may call the interface.

Rate limiting

The person in charge of an API can also rate-limit each application (for example, capping it at 200 QPS) and trigger a circuit breaker when the limit is exceeded.

Rate limiting is essential for interface reuse; without it, different applications sharing an interface would affect one another.
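As an illustration, a per-application token bucket is one common way to enforce such a cap; this is a minimal sketch, not a specific gateway's algorithm:

```python
# A minimal per-application token-bucket rate limiter; the parameters are
# illustrative, not taken from any real gateway.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens refilled per second (the QPS cap)
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, never beyond capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # caller should return HTTP 429 / trip the breaker

buckets = {"app-123": TokenBucket(rate=200, capacity=200)}  # 200 QPS per app
```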

(Figure: schematic diagram of application-to-interface authorization.)

Monitoring

Of course, the data service also monitors each interface: the 90th-percentile response time, the number of calls, and the number of failures. In addition, APIs that have not been called for a long time should be taken offline, which keeps unused interfaces from occupying resources.

1.3 Connecting the full link

The data service is also responsible for maintaining the lineage between data models and data applications.

Suppose "business analysis" is a data application and Zhen Meili is its developer. When she wants to call an interface in the data service to get data from tables A and B, she applies for authorization from Ma Shuai, the interface's publisher. Once authorized, the business analysis application can fetch data through the interface.

The data service then pushes the access relationship between the business analysis application and tables A and B to the metadata center. From then on, tables A and B, together with all their upstream tables (say, D and E), carry a label for the business analysis data application in the metadata center.

When the output task of table D fails, Ma Shuai can quickly tell from the metadata center that the failure affects the data output of the business analysis product. Likewise, when Ma Shuai wants to take table D offline, he can check whether the table still carries any application labels to see whether a downstream application still accesses it. When an API authorization is revoked, the metadata center also cleans up the table's related labels.

A data application may involve many pages, so analyzing impact at application granularity alone is too coarse. When a task fails, you want to know not only which data product is affected but also which page. To support this, the page name can be recorded when the interface is authorized.
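To make the labeling concrete, here is a minimal sketch of propagating an application label to the accessed tables and everything upstream of them (the lineage map and names are hypothetical):

```python
# A minimal lineage-tagging sketch; the upstream map and names are hypothetical.
from typing import Dict, List, Set

# table -> tables it is derived from (upstream lineage held by the metadata center)
upstream: Dict[str, List[str]] = {"A": ["D"], "B": ["D", "E"], "D": [], "E": []}

labels: Dict[str, Set[str]] = {}  # table -> set of data-application labels

def tag_application(app: str, tables: List[str]) -> None:
    """Label the accessed tables and, recursively, everything upstream of them."""
    stack = list(tables)
    while stack:
        table = stack.pop()
        if app in labels.setdefault(table, set()):
            continue                       # already visited, avoid re-walking
        labels[table].add(app)
        stack.extend(upstream.get(table, []))

tag_application("business_analysis", ["A", "B"])
# labels now covers A and B plus their upstream tables D and E
print(labels["D"])  # -> {'business_analysis'}
```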

1.4 Push and pull data delivery methods

Data services are usually described as providing API interfaces, but in real business scenarios an API alone is not enough. An API is a pull method; many scenarios also require push.

For example, in a live-streaming sales scenario, merchants need sales data for the event as soon as possible, so the data service must be able to push. I call this the data door-to-door service: the data service writes data into Kafka in real time, and applications get a real-time feed by subscribing to the Kafka topic.
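A sketch of this push path, assuming the kafka-python client (broker address, topic name, and payload shape are illustrative):

```python
# A minimal push-delivery sketch using kafka-python; broker address, topic name,
# and payload shape are assumptions for illustration.
import json
from kafka import KafkaProducer, KafkaConsumer

# Data-service side: push each fresh sales record onto the topic.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("live_sales_stats", {"shop_id": 42, "gmv": 1234.5})
producer.flush()

# Application side: subscribe to the topic for a real-time feed.
consumer = KafkaConsumer(
    "live_sales_stats",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode()),
)
for record in consumer:
    print(record.value)   # react to each sales update as it arrives
```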

1.5 Use intermediate storage to speed up data query

Data in the data platform is stored as Hive tables, and query engines such as Hive or Spark cannot meet data products' low-latency, high-concurrency access requirements.

The common approach is to export data from the Hive table into an intermediate storage that offers real-time query capability. The data service needs to support multiple intermediate storages for different application scenarios, for example a key-value store such as HBase or Redis for point lookups by primary key, and an analytical engine such as ClickHouse or Elasticsearch for ad hoc aggregation and search.

1.6 Logical model for data reuse

In the post-station analogy, if every shelf has its own staff, a person with packages on several shelves has to queue several times, which is unfriendly; it is far better for one staff member to fetch all the packages at once. The logical model plays the same role in data services.

In the data service you can define a logical model and then publish an API based on it. Behind the logical model sit multiple physical tables, so from the user's perspective one interface can reach multiple different physical tables.

A logical model is comparable to a database view. Unlike a physical model, it defines only the mapping between tables and fields; the data is computed dynamically at query time. You can think of a logical model as a wide table assembled from physical models that share the same primary key. It solves the data-reuse problem: on the same physical models, each application can build its own logical model and see only the columns it needs.

Suppose there are three physical models sharing the product ID as primary key. For the product operations system and for store staff, we can build two different logical models that view the data from different perspectives. The logical model does not physically exist; at query time, the request is split across the physical models according to the field mapping, and the multiple query results are then aggregated into the logical model's result.
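A minimal sketch of such mappings, with hypothetical model and field names:

```python
# A minimal logical-model mapping sketch; table and field names are hypothetical.
from typing import Dict, List

# Three physical models sharing the primary key "product_id".
physical_fields: Dict[str, List[str]] = {
    "dws_product_sales": ["product_id", "gmv", "order_count"],
    "dws_product_stock": ["product_id", "stock", "warehouse"],
    "dim_product_info":  ["product_id", "title", "category"],
}

# Two logical models over the same physical tables: each application
# sees only the columns it needs.
logical_models: Dict[str, List[str]] = {
    "product_ops_view": ["product_id", "gmv", "order_count", "stock"],
    "store_staff_view": ["product_id", "title", "stock", "warehouse"],
}

def resolve(model: str) -> Dict[str, List[str]]:
    """Map a logical model's fields back to the physical tables that hold them."""
    plan: Dict[str, List[str]] = {}
    for f in logical_models[model]:
        for table, fields in physical_fields.items():
            if f in fields:
                plan.setdefault(table, []).append(f)
                break
    return plan

print(resolve("product_ops_view"))
```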

1.7 Build an API marketplace for interface reuse

To enable interface reuse, we build an API marketplace. Application developers can find existing data interfaces there, apply for API permission directly, and then access the data without duplicating development.

Through the metadata center, the data service knows which indicators are associated with the tables each interface accesses. Users can therefore filter interfaces by combinations of indicators, searching from the data they want to the interfaces that can provide it, forming a closed loop.
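As a small illustration (the metric names and registry are hypothetical), marketplace search can reduce to filtering interfaces whose registered metrics cover what the user wants:

```python
# A minimal marketplace-search sketch; the registry of API -> metrics is
# hypothetical, standing in for what the metadata center would provide.
from typing import Dict, List, Set

api_metrics: Dict[str, Set[str]] = {
    "shop_sales_api":   {"gmv", "order_count"},
    "user_profile_api": {"active_users", "retention"},
}

def find_apis(wanted: Set[str]) -> List[str]:
    """Return APIs whose registered metrics cover everything the user wants."""
    return [name for name, metrics in api_metrics.items() if wanted <= metrics]

print(find_apis({"gmv"}))  # -> ['shop_sales_api']
```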

So how should a data service be implemented?

2 Architecture Design of Data Service System

Three designs matter most when implementing a data service: cloud native, the logical model, and automatic data export. With them:

  • You can follow this approach when designing your own data service
  • Or use it as an architecture reference when choosing a commercial product

2.1 Cloud Native

The core advantage of cloud native is that each service runs at least two replicas for high availability, and the number of replicas can be adjusted dynamically with traffic. Service discovery makes elastic scaling transparent to clients, and container-based isolation keeps services from affecting one another. These properties suit data services well: high-concurrency, low-latency, online data queries.

In the deployment architecture of the data service, each published API interface corresponds to a Kubernetes Service, and each Service consists of multiple Pod replicas. The code that lets an API interface access the back-end storage engine runs in the Pods' containers, and Pods are created and destroyed dynamically as call volume changes.
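To illustrate the "one Service per API" pattern, here is a sketch using the official kubernetes Python client; the namespace, names, image, and ports are assumptions:

```python
# A minimal "one Deployment + Service per published API" sketch using the
# official kubernetes client; namespace, names, image, and ports are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

apps = client.AppsV1Api()
core = client.CoreV1Api()

def deploy_api(name: str, image: str, replicas: int = 2) -> None:
    """Create one Deployment and one Service for a published API interface."""
    labels = {"app": name}
    container = client.V1Container(
        name=name, image=image,
        ports=[client.V1ContainerPort(container_port=8080)])
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels=labels),
        spec=client.V1PodSpec(containers=[container]))
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1DeploymentSpec(
            replicas=replicas,  # at least two replicas for availability
            selector=client.V1LabelSelector(match_labels=labels),
            template=template))
    apps.create_namespaced_deployment(namespace="data-service", body=deployment)

    service = client.V1Service(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1ServiceSpec(
            selector=labels,
            ports=[client.V1ServicePort(port=80, target_port=8080)]))
    core.create_namespaced_service(namespace="data-service", body=service)

deploy_api("api-shop-sales", "registry.example.com/api-shop-sales:latest")
```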

Envoy acts as the service gateway, load-balancing HTTP requests across a Service's Pods. An Ingress Controller watches Pod changes for each Service in Kubernetes and writes the Pod IPs back into Envoy, achieving dynamic service discovery. Front-end apps, web pages, or business-system servers reach Envoy through a layer-4 load balancer (LB).

This cloud-native design solves three problems:

  • Resource isolation between different interfaces of the data service
  • Dynamic horizontal scaling based on request volume
  • Rate limiting and circuit breaking via Envoy

2.2 Logical Model

Unlike a physical model, the logical model stores no actual data; it holds only the mapping between itself and the physical models, and the data is generated dynamically on each query. This design lets different interfaces over the same data each see only the data they need.

(Figure: system design of the data service logical model.)

The interface publisher selects several physical tables with the same primary key in the data service, builds a logical model, and then publishes an interface on top of it. When the API service receives a query, it uses the field mapping between the logical and physical models to disassemble the logical execution plan into physical execution plans, dispatches them to the physical models for execution, and finally aggregates the results and returns them to the client.

The physical models behind one logical model may live on different query engines, but in that case, for performance reasons, only primary-key-based filtering is supported.
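Here is a minimal fan-out-and-merge sketch of that execution path; the per-engine lookup tables below are stand-ins for real storage engines:

```python
# A minimal fan-out/merge sketch for a logical-model query; the fake per-engine
# lookup tables stand in for real storage engines, for illustration only.
from typing import Any, Dict, List

# Each physical model keyed by the shared primary key (product_id).
engines: Dict[str, Dict[int, Dict[str, Any]]] = {
    "dws_product_sales": {1: {"gmv": 99.0, "order_count": 3}},
    "dws_product_stock": {1: {"stock": 17}},
}

def query_logical(product_ids: List[int],
                  plan: Dict[str, List[str]]) -> List[Dict[str, Any]]:
    """Split a logical query into per-physical-model queries, merge on the key."""
    rows: Dict[int, Dict[str, Any]] = {pid: {"product_id": pid} for pid in product_ids}
    for table, fields in plan.items():
        for pid in product_ids:
            record = engines[table].get(pid, {})
            rows[pid].update({f: record.get(f) for f in fields if f != "product_id"})
    return list(rows.values())

plan = {"dws_product_sales": ["gmv"], "dws_product_stock": ["stock"]}
print(query_logical([1], plan))  # -> [{'product_id': 1, 'gmv': 99.0, 'stock': 17}]
```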

2.3 Automatic data export

In the data service you select a table from the data platform, export its data to an intermediate storage, and expose an API on top. When should the data be imported into the intermediate storage? Only after the table's output task completes.

So after the user selects a table and defines its intermediate storage, the data service automatically generates a data export task and registers a dependency on the table's output task in the data platform's scheduler. On each scheduling cycle, once the output task finishes, the export task is triggered and the data lands in the intermediate storage; from then on, the API interface serves the latest data.
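To illustrate the dependency wiring, here is a toy-scheduler sketch (task names are hypothetical, and a real platform would rely on its own scheduler, such as Airflow):

```python
# A toy scheduler sketch showing an export task that depends on the table's
# output task; the names and the scheduler itself are illustrative only.
from typing import Callable, Dict, List

tasks: Dict[str, Callable[[], None]] = {}
depends_on: Dict[str, List[str]] = {}
done: set = set()

def register(name: str, deps: List[str], fn: Callable[[], None]) -> None:
    tasks[name] = fn
    depends_on[name] = deps

def run(name: str) -> None:
    for dep in depends_on[name]:          # finish upstream tasks first
        if dep not in done:
            run(dep)
    tasks[name]()
    done.add(name)

register("output_dws_shop_sales", [], lambda: print("Hive table produced"))
# The data service auto-generates this export task when the user picks the table:
register("export_dws_shop_sales", ["output_dws_shop_sales"],
         lambda: print("exported to intermediate storage; API now serves fresh data"))

run("export_dws_shop_sales")
```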

3 Summary

A data service is far more than an API interface; behind it is a complete process for standardized data delivery. This article covered eight key function designs and three system architecture designs of data services.

  • The data service connects the full link between data platform models and data applications, solving the problem of not knowing which applications will be affected by a task failure or by taking a table offline
  • Logical models can be built on physical models that share a primary key, solving data reuse and improving the efficiency of publishing interfaces
  • Data services should adopt a cloud-native design pattern, which provides high availability, elastic scaling, and resource isolation

By accelerating data delivery and easing operations and maintenance after delivery, the data service plays an important role and is a key component of the data platform.

FAQ

If the data service is to answer the question of which applications access which data, then all data applications must obtain platform data through the data service. The question then becomes: how do we ensure the data service is the only exit of the data platform?

Ensuring the data service is the only exit of the data platform can be achieved through the following measures:

  1. Define access permissions: only authorized applications may access the data service, and all other access paths are closed.

  2. Enforce network isolation: deploy the data service in an independent network so that, at the network level, applications can reach the data only through it.

  3. Enforce authentication and authorization: every application calling the data service must authenticate and be explicitly authorized.

  4. Audit and monitor: record every application's access so that unauthorized access and anomalies are discovered and handled promptly.
