DTBoost: A new generation of enterprise-level big data application model revealed

In the DT (data technology) era, enterprises no longer need to build data applications the traditional way. DTBoost, a new generation of enterprise-level big data application model, was created for the DT era to accelerate data-driven operations in the enterprise.

A new generation of enterprise-level big data application model

Three questions

  • Is a sophisticated EDW (Enterprise Data Warehouse) still needed today?
  • Who are the target users of the data system?
  • Should data adapt to computing power, or should computing follow the data?

Since Bill Inmon proposed the concept of the data warehouse more than 20 years ago, almost every IT vendor has entered this field and designed very complex architectures and data models for enterprise data warehouses. A typical enterprise-level data application architecture looks like this:
[Figure: typical enterprise-level data application architecture]

  • This architecture has a very clear layered structure, but the processing chain is very long, which creates a great deal of data redundancy. The table structures are also complex. It is a model built for technical staff: business staff find the data very hard to use because they cannot understand the complex underlying table structures or the relationships between tables. This situation persists today. Many companies have invested heavily in enterprise data warehouses with the goal of improving their data-driven operations, yet what they ended up with is a reporting system.
  • With the rise of the Internet and the mobile Internet, almost all enterprises are embracing the Internet. Many Internet applications have appeared inside enterprises, and they generate large amounts of unstructured data. The problem is that a data model designed on this architecture does not give the enterprise the ability to use unstructured data, nor does it do much to improve data processing efficiency.
  • Businesses are changing rapidly. This is particularly evident in emerging Internet companies, where innovation drives very frequent business changes. At the same time, with big data, combining data from multiple sources will become the mainstream way of using data. As a result, it is difficult for data warehouse engineers to abstract a relatively stable data warehouse model.
  • A large amount of sleeping data is generated. Many enterprises design ODS, DW, DM, and RT layers, which produce a huge number of tables and data tasks. Yet little of that data is actually used in production, so a large number of tasks consume resources every day to little effect. In some cases I have encountered, only a little over 10,000 tables were extracted from the business databases each day, but after various processing steps the warehouse held millions of tables, which made it very hard for technical staff to find the data they needed.

Let's step back and analyze why this happens, keeping those three questions in mind. Why were systems designed this way?

  • In my opinion, the biggest reason for the architecture above is insufficient computing power. A traditional IT architecture was used to implement what is really a DT architecture, and because computing power was limited, the organization of the data had to be changed to suit the computation. This is what I meant earlier by data following computing: the data actively adapts to the available computing power. This leads to another, more serious problem. Over more than 20 years of development, data vendors have put most of their energy into designing and building data models and have neglected the exploration of computing models for upper-level business scenarios. Under the new technical system, we need to rethink these issues.
  • This approach limits how data can be applied at the upper layer. Business staff cannot be expected to understand a complex IT architecture, so every requirement is met by business staff driving technical staff. The business people who really need data do not understand the language of technology: what a table is, what a field is, what primary and foreign keys are, how tables relate to one another, let alone how to write SQL. What business staff do understand from their daily work is who their customers are, what they look like, and what their temperament is; what products they have, what those products do, and what problems they solve; and how customers interact with the products and what results from those interactions. So what we need to do now is abstract and provide a data model that business staff can understand and use directly, and this model cannot be a traditional data warehouse model.
  • The system is closed and opaque, which is another important cause of data bloat in the enterprise. The same label, or even the same table, can be found in many places in the enterprise data warehouse, which creates a lot of redundancy, because many technical staff do not know what the warehouse already contains. The metadata management offered by vendors is also a technology-oriented solution and does not fundamentally provide a business view of the data. Enterprises therefore need a new data application system that supports collaboration, sharing, and co-construction, so that useful data is effectively made visible.
  • Enterprises need data applications to improve their data-driven operations, but do they need a complex DW model? I do not think they do any more. The design should go light on the data model (note: light, not absent) and heavy on the computing model.

DTBoost: the new-generation enterprise data application model

What is DTBoost? DTBoost is an enterprise-level big data application platform that Alibaba Cloud abstracted from Alibaba's own big data application scenarios after years of experience. Its goals are to let business staff quickly understand and apply data; to keep data model design light and computing model design heavy; to offer an open structure that quickly supports data application development; and to enable internal co-construction, sharing, and collaborative development.

What is DTBoost's benchmark product?

People around me often ask this question, and I can say directly that DTBoost currently has no benchmark product. As I see it, DTBoost is a brand-new model for developing enterprise-level data applications. Using DT technology, we have implemented this model as a PaaS for data applications on the public cloud, and it can also be deployed on proprietary clouds. With DTBoost, enterprises can quickly put data business solutions into practice, and business staff can finally use data directly.

DTBoost Architecture

The core of DTBoost is a three-tier architecture system:
[Figure: DTBoost three-tier architecture]

Data model

As the analysis above shows, the DT era needs a new data model, and this model is the foundation of the whole of DTBoost. It has to be designed from the business perspective. A data model management system should also be provided to make the model easier to design and build. To this end, the data model part contains the following core modules.

[Figure: core modules of the DTBoost data model]

In the figure above, the three modules at the bottom (the label factory, domain OLT templates, and intelligent OL discovery) mainly serve to accelerate the construction of business OLT models and the production of labels.

  • OLT (Object Link Tag) model: Objects are entities such as consumers, merchants, and commodities; each can be represented as an entity that business staff understand directly. Links are relationships such as transactions, favorites, clicks, and searches; each is a behavior that occurs between multiple entities. We then attach many tags to entities and relationships to describe them. This sounds very similar to the OLP model, and indeed the overall model structure is the same, but we focus on the tag part. Tags are the form of data that business staff find easiest to understand. A tag can be a score produced by an algorithm after deep processing, or the result of calculation logic that combines several tags.

[Figure: OLT model]
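As a rough illustration of the OLT structure described above, the following Python sketch models objects, links, and tags as plain data classes. The names `OLTObject`, `Link`, and `combine_tags`, and all sample values, are hypothetical and are not DTBoost APIs. The sketch only shows that tags attach to both entities and relationships, and that one tag can be computed from others.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Union

TagValue = Union[int, float, str]


@dataclass
class OLTObject:
    """An entity (the O), e.g. a consumer, merchant, or commodity."""
    kind: str
    key: str
    tags: Dict[str, TagValue] = field(default_factory=dict)


@dataclass
class Link:
    """A relationship (the L): a behavior that occurs between entities."""
    kind: str  # e.g. "transaction", "favorite", "click", "search"
    members: Tuple[OLTObject, ...]
    tags: Dict[str, TagValue] = field(default_factory=dict)


def combine_tags(obj: OLTObject, weights: Dict[str, float]) -> float:
    """Hypothetical combined tag: a weighted sum of existing numeric tags."""
    return sum(w * float(obj.tags[name]) for name, w in weights.items())


# Made-up sample data for illustration only.
consumer = OLTObject("consumer", "c-001", {"activity_score": 0.8, "orders_30d": 5})
item = OLTObject("commodity", "sku-42", {"category": "books"})
purchase = Link("transaction", (consumer, item), {"amount": 59.0})

# A new tag derived from two existing tags on the same entity.
consumer.tags["value_score"] = combine_tags(
    consumer, {"activity_score": 10.0, "orders_30d": 1.0}
)
```

Keeping tags as a flat name-to-value mapping mirrors the article's point that tags, not tables, are the unit that business staff work with.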

  • Co-construction and sharing: DTBoost implements permission control at tag granularity, which makes a co-construction and sharing model for data applications possible within the enterprise. Tags can be developed by multiple teams and then published and authorized for other departments to view and use, which keeps the business application data layer open and transparent.
  • Market mechanism: At this layer, a market mechanism can also be used to ensure data quality, or strictly speaking the quality of tag data. DTBoost keeps tag metadata open and transparent so that business staff can quickly understand what a tag means in business terms; it visualizes the data distribution of each tag to help keep data output stable; and it uses each tag's usage across business lines to decide whether the tag should be retired. If a tag has not been used for a long time, the system can consider deactivating it and releasing the underlying computing resources. Going further, the way the upper layer uses tags can drive automatic optimization of how data is organized at the physical layer. For example, if the business side often uses tags A, B, and C together, and those three tags originally sit in three separate physical tables, DTBoost will detect this automatically and build a new physical table that holds all three tags, optimizing both storage and computation.
  • Smart relocation: In the tag metadata, DTBoost records in detail the physical storage behind each tag. When the business side applies a tag, it only needs to select a computing model and does not need to care where the data is physically stored. Based on the computing model's instructions, this module automatically joins and relocates the underlying physical data (relocation here means automatically moving data from one storage system to the storage that the computing model requires), so data developers no longer need to configure physical joins and data transfer tasks.
  • API: DTBoost wraps all of the underlying functions in standard APIs for secondary development by partners or developers.
  • UI: DTBoost wraps the underlying functions in an official standard interactive interface to give users a unified operating experience.
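The market mechanism above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DTBoost's implementation: the tag-to-table mapping, the usage log, the 30-day idle threshold, and the minimum co-usage count are all made up for the example.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import combinations

# Hypothetical tag metadata: tag name -> physical table that stores it.
TAG_TABLE = {"A": "t1", "B": "t2", "C": "t3", "D": "t1"}

# Hypothetical usage log: (day, set of tags used together in one request).
USAGE_LOG = [
    (date(2016, 8, 1), {"A", "B", "C"}),
    (date(2016, 8, 2), {"A", "B", "C"}),
    (date(2016, 8, 3), {"A", "B", "C"}),
    (date(2016, 8, 3), {"A"}),
]


def stale_tags(tag_table, usage_log, today, max_idle_days=30):
    """Return tags with no recorded use in the last max_idle_days days."""
    cutoff = today - timedelta(days=max_idle_days)
    recent = set()
    for day, tags in usage_log:
        if day >= cutoff:
            recent |= tags
    return sorted(set(tag_table) - recent)


def colocation_candidates(tag_table, usage_log, min_count=3, group_size=3):
    """Return tag groups that are often used together but stored apart."""
    counts = Counter()
    for _, tags in usage_log:
        for group in combinations(sorted(tags), group_size):
            counts[group] += 1
    return [
        group
        for group, n in counts.items()
        if n >= min_count and len({tag_table[t] for t in group}) > 1
    ]


print(stale_tags(TAG_TABLE, USAGE_LOG, date(2016, 8, 10)))  # ['D']
print(colocation_candidates(TAG_TABLE, USAGE_LOG))  # [('A', 'B', 'C')]
```

Tag `D` is a candidate for deactivation because nothing in the log uses it, while `A`, `B`, and `C` are candidates for merging into one physical table.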

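The smart relocation idea from the list above can be illustrated in the same spirit. In this sketch the storage names, table names, and computing model names are hypothetical; the only point is that the tag metadata, not the user, decides which physical data has to move.

```python
# Hypothetical tag metadata: tag -> (storage system, physical table).
TAG_STORAGE = {
    "age": ("warehouse", "dim_user"),
    "orders_30d": ("warehouse", "agg_user_orders"),
    "last_click": ("kv_store", "user_realtime"),
}

# Hypothetical mapping: computing model -> storage it reads from.
MODEL_STORAGE = {
    "cross_analysis": "olap_engine",
    "realtime_lookup": "kv_store",
}


def plan_relocation(model, tags):
    """List (source storage, table, target storage) moves needed for a model."""
    target = MODEL_STORAGE[model]
    tasks = []
    for tag in tags:
        source, table = TAG_STORAGE[tag]
        task = (source, table, target)
        # Skip data already in the right place and avoid duplicate moves.
        if source != target and task not in tasks:
            tasks.append(task)
    return tasks


for task in plan_relocation("realtime_lookup", ["age", "orders_30d", "last_click"]):
    print(task)
```

The caller names only a computing model and a list of tags; the physical tables never appear in the request.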

  • Label Factory: Why is this module needed? Labels are a very important part of the DTBoost data model, so how labels are generated and which labels are valid matters a great deal. There are many ways to generate labels. Data developers can implement them one by one in SQL or MR (MapReduce) according to definitions from business staff, and some of that work is unavoidable. But an analysis of business requirements shows that part of the calculation logic is very generic. DTBoost therefore provides this part as a built-in capability that can cover 30-50% of an enterprise's label processing needs and lets business staff do generic label processing themselves. The label factory also hides the table join logic of the underlying layers: users only need to know the meaning of the labels they use. When multiple labels are generated, processed, and analyzed in the same period, the label factory can automatically find what those jobs have in common, such as shared dependencies and identical calculations, which saves computing resources and avoids repeated full scans of popular physical tables. The functions planned at this stage are as follows:

[Figure: planned functions of the label factory]
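The shared-scan idea in the label factory description can be sketched as follows. This is an assumption-laden illustration rather than the label factory's actual logic: `group_jobs_by_table`, `scan_once`, the job format, and the three aggregation names are hypothetical. Label jobs that read the same table are grouped, and each table is then read once while all of its labels are computed together.

```python
from collections import defaultdict


def group_jobs_by_table(label_jobs):
    """Group label jobs by source table so each table is scanned only once.

    label_jobs maps label -> (table, column, how), with how in
    {"cnt", "sum", "max"}.
    """
    by_table = defaultdict(dict)
    for label, (table, column, how) in label_jobs.items():
        by_table[table][label] = (column, how)
    return dict(by_table)


def scan_once(rows, jobs):
    """Compute every label in jobs during a single pass over rows."""
    acc = {label: None for label in jobs}
    for row in rows:
        for label, (column, how) in jobs.items():
            value, current = row[column], acc[label]
            if how == "cnt":
                acc[label] = (current or 0) + 1
            elif how == "sum":
                acc[label] = (current or 0) + value
            elif how == "max":
                acc[label] = value if current is None else max(current, value)
            else:
                raise ValueError(f"unknown aggregation: {how}")
    return acc


# Made-up jobs and rows for illustration.
jobs = {
    "total_amount": ("orders", "amount", "sum"),
    "order_cnt": ("orders", "amount", "cnt"),
    "max_qty": ("orders", "qty", "max"),
}
orders = [{"amount": 10, "qty": 1}, {"amount": 30, "qty": 2}]

plan = group_jobs_by_table(jobs)
result = scan_once(orders, plan["orders"])
```

Three labels are produced from one pass over `orders` instead of three separate full scans.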

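Currently supported derivation methods:

Derivations over time series:

Method      Description
cnt         Number of times the variable occurs within a given period
cntd        Number of distinct values of the variable within a given period
totv        Sum of the variable within a given period
ttav        Mean of the variable within a given period
hmax        Maximum of the variable within a given period
hmin        Minimum of the variable within a given period
hmedian     Median of the variable within a given period
stddev      Standard deviation of the variable within a given period
variance    Variance of the variable within a given period
days        Number of days within a given period on which the variable meets the condition
ftdays      Time elapsed since the first behavior that meets the condition within a given period
ltdays      Time elapsed since the last behavior that meets the condition within a given period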
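To make the derivation methods concrete, here is a minimal Python sketch that computes all twelve for one variable of one entity. Several details are assumptions not stated in the source: the period is taken as the last `period_days` days up to and including `today`, `stddev` and `variance` use the population formulas, "meeting the condition" simply means that an event exists on that day, and `ftdays`/`ltdays` are measured in days.

```python
from datetime import date, timedelta
from statistics import mean, median, pstdev, pvariance


def derive(events, today, period_days):
    """Compute time-series derived labels from (day, value) events."""
    start = today - timedelta(days=period_days)
    window = [(d, v) for d, v in events if start < d <= today]
    if not window:
        return {"cnt": 0}
    days = [d for d, _ in window]
    values = [v for _, v in window]
    return {
        "cnt": len(values),
        "cntd": len(set(values)),
        "totv": sum(values),
        "ttav": mean(values),
        "hmax": max(values),
        "hmin": min(values),
        "hmedian": median(values),
        "stddev": pstdev(values),      # population formula (assumption)
        "variance": pvariance(values),  # population formula (assumption)
        "days": len(set(days)),
        "ftdays": (today - min(days)).days,
        "ltdays": (today - max(days)).days,
    }


# Made-up purchase amounts for one customer.
events = [
    (date(2016, 8, 1), 10),
    (date(2016, 8, 1), 20),
    (date(2016, 8, 5), 10),
    (date(2016, 7, 1), 99),  # falls outside a 30-day window ending 2016-08-10
]
labels = derive(events, today=date(2016, 8, 10), period_days=30)
```

With this sample, `labels["cnt"]` is 3, `labels["cntd"]` is 2, and `labels["ftdays"]` is 9, since the first in-window purchase was nine days before `today`.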

Expressions and functions supported by combined tags:

Arithmetic operators: +, -, *, /, %
Math functions: abs, acos, asin, atan, ceil, conv, cos, cosh, cot, exp, floor, ln, log, pow, round, sin, sinh, sqrt, tanh
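One way to evaluate such combined-tag expressions safely is to parse them and walk the syntax tree with a whitelist, rather than calling `eval`. The sketch below uses Python's standard `ast` module. It is an illustration, not DTBoost's expression engine: `eval_combined_tag` is a hypothetical name, only a subset of the listed functions is mapped, and `ln` is assumed to mean the natural logarithm.

```python
import ast
import math
import operator

# Whitelisted binary operators: +, -, *, /, %
BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}

# A subset of the listed math functions, mapped to Python equivalents.
FUNCS = {
    "abs": abs, "ceil": math.ceil, "floor": math.floor, "round": round,
    "sqrt": math.sqrt, "exp": math.exp, "ln": math.log, "pow": math.pow,
    "sin": math.sin, "cos": math.cos, "tanh": math.tanh,
}


def eval_combined_tag(expr, tags):
    """Evaluate an arithmetic expression over tag values, whitelist only."""

    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.Name):
            return tags[node.id]  # a tag name resolves to its value
        if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
            return BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FUNCS and not node.keywords):
            return FUNCS[node.func.id](*(ev(arg) for arg in node.args))
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    return ev(ast.parse(expr, mode="eval"))


score = eval_combined_tag(
    "round(totv / cnt) + abs(delta) % 3",
    {"totv": 40, "cnt": 3, "delta": -7},
)
print(score)  # 14
```

Anything outside the whitelist, such as attribute access or an unknown function name, raises `ValueError` instead of being executed.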
  • Intelligent discovery: This module accelerates the construction of OLT models. If the label factory accelerates the T, intelligent discovery accelerates the O and L. Building an effective OLT model is critical, and it is probably the most time-consuming part of this new generation of big data application model, so we use technical means to help. In physical data, most entities appear as keys, and relationships generally appear as composite keys. We apply machine learning methods to the logs of the business databases to automatically discover candidate entities and relationships, and then cut them into sub-graphs according to relationship strength, which helps modelers confirm and discover the key business models.
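The key-based discovery idea can be illustrated with a deliberately simple stand-in. The article mentions machine learning over database logs; the sketch below swaps in a plain counting heuristic instead. It treats every key column as a candidate entity, counts how many tables hold two keys together as the strength of their relationship, drops weak edges, and returns the remaining connected sub-graphs. The table names, key names, and the `min_strength` threshold are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: table name -> key columns found in that table.
TABLE_KEYS = {
    "orders": ("user_id", "item_id"),
    "favorites": ("user_id", "item_id"),
    "clicks": ("user_id", "item_id"),
    "shipments": ("item_id", "warehouse_id"),
    "users": ("user_id",),
}


def discover(table_keys, min_strength=2):
    """Return (relationship strengths, sub-graphs of strongly linked keys)."""
    entities = set()
    strength = defaultdict(int)
    for keys in table_keys.values():
        entities.update(keys)
        ordered = sorted(keys)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                strength[(a, b)] += 1  # composite key => candidate relationship

    # Keep only relationships that are strong enough.
    adjacent = {entity: set() for entity in entities}
    for (a, b), n in strength.items():
        if n >= min_strength:
            adjacent[a].add(b)
            adjacent[b].add(a)

    # Cut the graph into connected sub-graphs with a depth-first search.
    seen, subgraphs = set(), []
    for start in sorted(entities):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacent[node] - component)
        seen |= component
        subgraphs.append(sorted(component))
    return dict(strength), subgraphs


strengths, subgraphs = discover(TABLE_KEYS)
print(subgraphs)  # [['item_id', 'user_id'], ['warehouse_id']]
```

Here `user_id` and `item_id` appear together in three tables and stay in one sub-graph, while the single `shipments` table is too weak a link to pull `warehouse_id` into it.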


  • Domain OLT template: This part is very interesting, and it is domain knowledge in the true sense. As DTBoost is delivered to different industries, entity-relationship models for each field can be distilled and accumulated, along with label models and label classification systems, to form the DTBoost domain knowledge base. It is not only a template at the model layer: it is linked with the upper-layer computing model to form a complete set of templates from the model layer to the application layer. For example, in finance, a set of entity-relationship-label models for the financial field is accumulated first. On top of it, multi-dimensional cross-analysis templates, risk-control early-warning templates, and marketing templates can be built up. When DTBoost is delivered again in the same domain, it can be customized quickly from this knowledge base.

This concludes the data model part. It is the foundation of the new generation of enterprise big data application model and is very important, which is why DTBoost has invested a great deal of time and resources in its design and development. But important as it is, it is only a foundation and cannot solve business problems directly. What really affects the business is the computing model built on this data model. Stay tuned for the next chapter: the DTBoost computing model.
