Data Warehousing Practice Zatan (4): Metadata

[Table of Contents]


Whether in big data or in data warehousing, metadata is the single most important thing.

Metadata is usually defined as "data about data": descriptive information about data and information resources.

The concept sounds abstract, so here is a concrete example. Each of the business systems you handle has its own table structures: T1, T2, T3, and so on, collectively called T. Now create another set of tables, M, that describes all of T: how many tables each system has and what they are, how many fields each table has and what they are, what each table is for, where its data comes from and how it arrives, and what type and value constraints apply to each field. In short, anything that describes what data you are processing, and the parameters of how you process it, can be considered metadata. And this example covers only a small part: all the parameters and configuration involved in data processing can be treated as metadata. "Metadata-driven" means that when something changes, you change the metadata; in other words, the program is parameter-driven.
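
As a minimal sketch (in Python, with invented table and field names), the metadata tables M can be as simple as a list of records, and the program derives its behavior, here DDL generation, from those records instead of hard-coding any table names:

```python
# Hypothetical "table M describing tables T": the program is driven by these
# metadata records, not by hard-coded table structures.
METADATA = [
    {"system": "loan", "table": "t1_contract", "purpose": "loan contracts",
     "fields": [{"name": "contract_id", "type": "varchar(32)", "nullable": False},
                {"name": "amount", "type": "decimal(18,2)", "nullable": False}]},
    {"system": "deposit", "table": "t2_account", "purpose": "deposit accounts",
     "fields": [{"name": "account_id", "type": "varchar(32)", "nullable": False}]},
]

def generate_ddl(meta):
    """Emit a CREATE TABLE statement from one metadata record."""
    cols = ", ".join(
        f"{f['name']} {f['type']}" + ("" if f["nullable"] else " NOT NULL")
        for f in meta["fields"])
    return f"CREATE TABLE {meta['table']} ({cols})"

for m in METADATA:
    print(generate_ddl(m))
```

Change the metadata, and the generated structures change with it; the code itself stays untouched.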

So when designing a data processing program, the most important principle is to design for the metadata rather than for the data itself. Once the program is running, it hardly matters what the actual data looks like, as long as the metadata definitions are correct.

Which also means OR-mapping frameworks such as MyBatis have little to do with us, because the data we deal with is dynamic: dynamic content, dynamic structure, everything depends on the metadata definitions.

More specifically, how is metadata defined? A typical definition looks like this:

  • Business domain: which business areas the system covers, e.g. loans, deposits, and so on;
  • Source system: which system the data comes from, and how to obtain it from that system;
  • Entity description: describes the tables, including the standardized name, code, the business domain it belongs to, which system it comes from, how its data is loaded, and so on;
  • Attribute description: describes the fields in a table, including which table and which business a field belongs to, its value constraints and validation rules, any special meaning, and so on;
  • Data mapping: the mapping relationships between the data (tables and fields) of each data area and the next.

If necessary, the metadata content can be extended further: the business meaning of the data, processing rules, usage rules, and so on.
(Figure: a simple metadata ER model)
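
The metadata definition above can be sketched as Python dataclasses; the class and field names here are illustrative only, not from any real framework:

```python
from dataclasses import dataclass, field

@dataclass
class Attribute:
    name: str
    dtype: str
    nullable: bool = True
    check: str = ""          # value constraint / validation rule

@dataclass
class Entity:
    name: str                # standardized name
    code: str
    domain: str              # business domain, e.g. "loans"
    source_system: str       # which system the data comes from
    attributes: list = field(default_factory=list)

@dataclass
class Mapping:
    src: str                 # "area.entity.field" upstream
    dst: str                 # "area.entity.field" downstream
    rule: str = "1:1"

# Example records (all names invented):
contract = Entity("loan_contract", "T_LOAN_CT", "loans", "core_banking",
                  [Attribute("contract_id", "varchar(32)", nullable=False)])
m = Mapping("staging.loan_contract.contract_id", "dw.contract.contract_id")
```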
What can metadata do, or how is it used? Consider the following aspects:

  • Define how each table's data is acquired and how it enters the next layer. For example, for a TXT file provided by a source system, the metadata can define how to obtain the file (FTP address, path, URL, etc.), the file format (delimiters, field lengths), and which table in the full-volume detail area the file is loaded into (source to full-volume detail area can be treated as one-to-one);
  • Define the loading rule for each table's data, e.g. full load, incremental load, or zipper (history-chain) mode;
  • Define data validation rules, e.g. allowed values (value range, dictionary set), nullability, field length, data type, and so on;
  • Define special business meanings of the data, e.g. fields related to branch-level access rights;
  • Data mapping relationships: also simply called mapping, mainly the field-level mappings from one area to another (e.g. from the full-volume data area to the model tables of the integration area);
  • Data association: the relationships among the data themselves also matter. Like a foreign key, one table can be linked to another (in the same area, or across adjacent areas) through a field, which is very valuable for data analysis.
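Taking the validation item above as an example, the rules themselves can be stored as metadata, with one generic function applying whatever is configured. A sketch with made-up field names and rule keys:

```python
# Metadata-driven validation: each rule is data, not code.
RULES = {
    "currency": {"in_set": {"CNY", "USD", "EUR"}},   # dictionary set
    "balance":  {"not_null": True, "dtype": float},  # nullability + data type
    "branch":   {"max_len": 10},                     # field length
}

def validate(row, rules=RULES):
    """Return a list of violations for one record, driven entirely by `rules`."""
    errors = []
    for col, rule in rules.items():
        v = row.get(col)
        if v is None:
            if rule.get("not_null"):
                errors.append(f"{col}: null")
            continue
        if "in_set" in rule and v not in rule["in_set"]:
            errors.append(f"{col}: {v!r} not in allowed set")
        if "dtype" in rule and not isinstance(v, rule["dtype"]):
            errors.append(f"{col}: wrong type")
        if "max_len" in rule and len(str(v)) > rule["max_len"]:
            errors.append(f"{col}: too long")
    return errors

good = {"currency": "CNY", "balance": 100.0, "branch": "BJ01"}
bad  = {"currency": "JPY", "balance": None, "branch": "X" * 20}
```

Adding a new check means adding a metadata entry, not changing program code.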

It is fair to say that complete metadata can control the whole operational life cycle of the data, and on that basis record every step of how the data changes. But as you know, data processing is complicated, with all kinds of special cases, as shown in the figure below.

(Figure: metadata driving the entire process)

Logically we should be able to know what relationship two pieces of data have, but that relationship is hard to describe formally, or can only be described at a depth that is no longer practical. This is the so-called "data lineage." Which brings us back to mapping. Mapping was once a killer feature, especially in commercial visual tools (e.g. Informatica): drag in a source table, drag in a target table, and the tool quickly auto-matches and maps all the fields inside. More specifically:

  1. In operation 1, the data structure basically does not change; the table only gains flag, timestamp, or zipper fields, and some business-coded field content may be translated. The field-by-field mapping here is mechanical and can be configured directly as metadata parameters;
  2. In operation 2, the relational and set operators come in: selection, projection, and joins, plus merging and splitting of both data and fields. A field may be processed into two or more fields, or two or more fields merged into one; with many levels of processing, these relationships become very complicated;
  3. In operation 3, business detail is lost. You can basically assume the lineage can no longer be tracked, or can only be described by a "formula" (usually the four arithmetic operations plus aggregate functions such as SUM). So the formulas here are better standardized as indicators, i.e. statistics.
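The three cases above can themselves be expressed as mapping metadata; it is the non-1:1 shapes (split, merge, formula) that make lineage hard to track automatically. A sketch with invented field names:

```python
# Field-level mappings as metadata. Case 1 is a mechanical copy; cases 2 and 3
# are the split / merge / formula shapes that complicate lineage.
MAPPINGS = [
    # case 1: one-to-one copy (the loader may also stamp flag/date fields)
    {"kind": "copy",    "src": ["cust_name"],             "dst": ["cust_name"]},
    # case 2a: split one field into two
    {"kind": "split",   "src": ["full_addr"],             "dst": ["city", "street"],
     "fn": lambda s: s.split("|", 1)},
    # case 2b: merge two fields into one
    {"kind": "merge",   "src": ["first", "last"],         "dst": ["full_name"],
     "fn": lambda a, b: a + " " + b},
    # case 3: a formula over several fields (detail is lost here)
    {"kind": "formula", "src": ["principal", "interest"], "dst": ["total_due"],
     "fn": lambda p, i: p + i},
]

def apply_mappings(row, mappings=MAPPINGS):
    """Transform one source record according to the mapping metadata."""
    out = {}
    for m in mappings:
        args = [row[c] for c in m["src"]]
        if m["kind"] == "copy":
            out[m["dst"][0]] = args[0]
        elif m["kind"] == "split":
            for col, val in zip(m["dst"], m["fn"](*args)):
                out[col] = val
        else:  # merge / formula
            out[m["dst"][0]] = m["fn"](*args)
    return out

row = {"cust_name": "Li", "full_addr": "Beijing|Main St",
       "first": "San", "last": "Zhang", "principal": 100.0, "interest": 5.0}
```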

Many people will ask: many applications are in fact based on detail-level reports. If we ignore the summary layer and go only as far as the integration layer, can lineage then be traced? Again: in theory, yes; in practice, the cost is high. Take a bank as an example. A bank has hundreds of systems; counting only the larger ones, each with a hundred-odd tables, you easily have three thousand-plus tables, and at ten fields per table that is thirty thousand-plus fields. Every area boundary needs its mappings, and in particular, producing a usable report table from the integration layer often takes more than one transformation: different indicators repeatedly reuse and re-aggregate the same data, so the field mappings multiply almost without bound. Whichever way you manage them, the cost is considerable. The usual outcome is a project that starts full of ambition and grand plans, but whose mappings drift out of sync with the actual configuration as the work proceeds. After all, to deliver a report, writing a stored procedure or a Java program directly is far faster and easier than wiring up such mappings. And you cannot really blame anyone for breaking the rules: with tight deadlines and high pressure, shipping the report at all is already an achievement.

"Business people building their own reports" has always been just a myth.

In fact, I think it is somewhat easier for technical people to learn the business. After all, for this purpose we do not need to care about business processes, only about the data structures and the meaning of the data. Expecting business people to learn the technology is basically unrealistic.

Another point: this kind of lineage tracing basically concerns only technical staff. The usual case is that a report figure looks wrong, and someone traces back to the underlying data to check whether the statistical standard and the raw data are correct. So the business side may well not pay for it; it is a technical requirement, not a business one.

So how is the whole mapping process managed? That will be discussed later, along with the ETL framework and its features.

As for data association: to some extent it resembles data mapping, but it stands at a higher level, the level of analyzing the business relationships among data. It even touches on the concept of "master data." Take the typical credit business again: given a loan contract, users usually also want to see the customer who signed it, the loan project, the promissory notes, the disbursement records, and the repayment records, and all of that lives in further tables. Through "business foreign keys," these elements can be chained together. You can even start from a customer's total loan amount in a summary layer and trace back through the indicator to the individual loan contracts.


Within one data area, the business associations among data entities form a network, i.e. a directed graph. Open it up to other data areas, and the data associations become a three-dimensional space.
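
That directed graph can be sketched directly; the entity names and edges below are invented for illustration, and the traversal answers "what can be reached from this record?":

```python
# Business associations as a directed graph: edges are "business foreign keys"
# recorded in metadata, e.g. a customer links to its loan contracts.
EDGES = {
    "customer":        ["loan_contract"],
    "loan_contract":   ["promissory_note", "repayment_record"],
    "promissory_note": ["repayment_record"],
}

def reachable(start, edges=EDGES):
    """All entities reachable from `start` by following business foreign keys."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```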

Providing a traceability mechanism from the business perspective wins far more recognition and support from business staff. Translated into business language, the example in the figure says: when querying someone's loan contract or records, the user may need to look further at the customer information, so the integrated customer information system provides a consolidated customer view; and the user may want to go further still, e.g. the customer's information in the credit reporting system, or in the core banking system.

Finally, let me emphasize that all the functions described above should be developed on the basis of metadata, not hard-coded into the program. That is also metadata's biggest contribution. Put plainly, metadata is just parameters that follow a certain standard. Academically there is a great deal of work and many specifications on metadata, the most famous being the series of standards proposed by the Object Management Group (OMG): starting from 1995, it adopted MOF (the Meta Object Facility), then UML in 1997, and CWM in 2000. I once spent time specializing in metadata, but in the end it felt like doing academic research, and afterwards I could barely remember any of it; what remained was a sense of it being highbrow: something not inherently complicated, explained so that nobody can understand it. To this day, my approach to building a data warehouse is to build a set of metadata tables like those recommended above, then extend and enrich them step by step. That covers the most basic and most important uses of metadata and carries through every part of the system. That is the key; as for those specifications, read them when you really have spare time.

To be continued.

Previous: Data Warehousing Practice Zatan (3): The Overall Implementation Framework

Next: Data Warehousing Practice Zatan (5): ETL


Origin blog.csdn.net/cfy_fantasyxx/article/details/103004560