[Repost] On the development and classification of data warehouse architectures

A very good article, so I am reposting it as is.

Jerome 20061210

There has been a lot of discussion about data warehouse architecture recently, so I will briefly summarize some of the architectures here. The aim is to give everyone a target: criticism and additions after this post are welcome.

I have grouped all the architectures I have heard of into six categories. Many of the explanations are my personal understanding and may not be correct, so corrections are welcome.

1. Independent data mart architecture

The independent data mart architecture, sometimes called the independent data warehouse architecture, is probably the earliest and most common approach. Small and medium-sized enterprises and small and medium-sized development companies in particular adopt it for reasons of cost and quick results. Everyone should be familiar with this structure.

The disadvantage of this approach is also obvious: the data is not consistent across the enterprise, which produces information silos. Of course, if the company is very small and has only one system, so that there is nothing to integrate and a single data mart is sufficient, this approach is fine. The initial investment is small, the enterprise gets to see results, and a proper data warehouse can be built later once the company has grown.

2. Federated data warehouse architecture

I have written a brief introduction to this architecture before, though I am not very familiar with it and the introduction was messy. I think it arose because enterprises built several independent data marts early on and later found that this did not work: the data was not integrated, and something had to be done about the information silos. Tearing everything down and rebuilding would be ideal, but the investment is too large and people still want to keep using the existing data marts. So what can be done? The workaround is to set up comparison (mapping) tables between the independent data marts so that they can exchange data without being replaced. Over time it turned out that this already amounts to an integration strategy, and that a data warehouse can be built this way as well. The concepts of regional federation and functional federation were proposed on that basis.
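The comparison-table idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two marts, their key names (`cust_id`, `client_no`) and the `customer_xref` table are all hypothetical and do not come from any particular product.

```python
# Hypothetical rows from two independent data marts, each with its own customer key.
sales_mart = [
    {"cust_id": "S-001", "revenue": 1200.0},
    {"cust_id": "S-002", "revenue": 450.0},
]
service_mart = [
    {"client_no": 9001, "open_tickets": 3},
    {"client_no": 9002, "open_tickets": 0},
]

# The comparison (cross-reference) table: which keys refer to the same customer.
customer_xref = [
    {"sales_cust_id": "S-001", "service_client_no": 9001},
    {"sales_cust_id": "S-002", "service_client_no": 9002},
]


def federated_customer_view(sales, service, xref):
    """Combine the two marts through the cross-reference table without changing either mart."""
    sales_by_key = {row["cust_id"]: row for row in sales}
    service_by_key = {row["client_no"]: row for row in service}
    view = []
    for link in xref:
        s = sales_by_key.get(link["sales_cust_id"])
        v = service_by_key.get(link["service_client_no"])
        if s is None or v is None:
            # A key missing on either side is silently lost: this is the
            # incomplete-integration risk of the federated approach.
            continue
        view.append(
            {"cust_id": s["cust_id"], "revenue": s["revenue"], "open_tickets": v["open_tickets"]}
        )
    return view


combined = federated_customer_view(sales_mart, service_mart, customer_xref)
```

Neither mart is modified; all of the integration knowledge lives in the cross-reference table, which is also why its upkeep decides how complete the integration is.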

The shortcomings of the federated architecture are also obvious. Unless something like the bus architecture is used from the start to ensure data consistency, inconsistencies creep in easily and the integration ends up incomplete. And if consistency is designed in from the start, there is little difference from the bus architecture. Still, the federated architecture is useful as a temporary fix for exchanging data among a company's existing independent data marts.

3. Centralized architecture

The emergence of the centralized architecture marks the point at which data warehouse architecture entered a relatively mature period. Its approach is to build one physical EDW, that is, a central enterprise data warehouse. The data is concentrated in the EDW, applications and analysis programs access the EDW directly, and the data is consistent across the enterprise. With the development of ROLAP, building ROLAP on top of this centralized architecture has become popular; MicroStrategy's usual solution is to build ROLAP on the EDW. ROLAP keeps metadata in separate tables that record only the relationships of the dimensional model, not the model's data. The MicroStrategy application interprets this metadata, the application server acts as a cache, and the speed is acceptable.
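To make "metadata that stores only the relationships" concrete, here is a minimal sketch using Python's standard `sqlite3` module. It is not how MicroStrategy actually works internally; the `METADATA` layout, the table names and the `build_query` helper are assumptions made purely for illustration.

```python
import sqlite3

# Hypothetical metadata: it records only how the tables relate, never the business data.
METADATA = {
    "fact_table": "fact_sales",
    "measures": {"revenue": "SUM(fact_sales.amount)"},
    "dimensions": {
        "region": {"table": "dim_region", "key": "region_id", "label": "region_name"},
    },
}


def build_query(measure, dimension, meta=METADATA):
    """Translate a (measure, dimension) request into SQL against the warehouse tables."""
    dim = meta["dimensions"][dimension]
    fact = meta["fact_table"]
    return (
        f"SELECT {dim['table']}.{dim['label']}, {meta['measures'][measure]} "
        f"FROM {fact} JOIN {dim['table']} "
        f"ON {fact}.{dim['key']} = {dim['table']}.{dim['key']} "
        f"GROUP BY {dim['table']}.{dim['label']} "
        f"ORDER BY {dim['table']}.{dim['label']}"
    )


# A tiny in-memory stand-in for the EDW.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE fact_sales (region_id INTEGER, amount REAL);
    INSERT INTO dim_region VALUES (1, 'East'), (2, 'West');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 70.0);
    """
)
rows = conn.execute(build_query("revenue", "region")).fetchall()
```

Every report becomes a query against the relational warehouse at request time, which is why this style puts such heavy demands on the RDBMS behind the EDW.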

This approach has disadvantages too, such as poor scalability and very high demands on the RDBMS that hosts the EDW. As data volumes and analysis workloads grow, the data eventually has to be split up. If the data is split off from the EDW into separate data marts or mining warehouses for different applications, the centralized architecture evolves into a hub and spoke architecture.

4. Hub and spoke architecture

Actually, I would rather call it the corporate information factory (CIF) architecture directly. "Hub and spoke" sounds awkward and is not very catchy, and the corporate information factory is the best-known representative of this kind of architecture. The name itself gives a rough idea of how it works. The central data warehouse (EDW) collects data from the various source systems and supplies it to the data marts and mining warehouses. It functions much like the hub of a wheel, hence "hub". A diagram makes this more vivid: there is a line between the EDW and each source database, data mart, and mining warehouse, so the whole thing looks like a wheel with the lines as its spokes, hence "spoke". Integrating data in the central EDW and then distributing it to the data marts for use is what the name hub and spoke architecture describes.
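The data flow can be sketched as follows. This is a toy illustration only: the source rows, the name-normalization rule and the two mart builders are hypothetical.

```python
# Hypothetical source systems feeding the hub.
crm_source = [{"customer": "acme", "region": "east"}]
billing_source = [
    {"customer": "ACME", "amount": 120.0},
    {"customer": "Beta", "amount": 80.0},
]


def load_edw(crm_rows, billing_rows):
    """Hub: integrate the sources once, matching on a normalized customer name."""
    region_by_customer = {r["customer"].lower(): r["region"] for r in crm_rows}
    return [
        {
            "customer": b["customer"].lower(),
            "region": region_by_customer.get(b["customer"].lower(), "unknown"),
            "amount": b["amount"],
        }
        for b in billing_rows
    ]


def build_finance_mart(edw_rows):
    """Spoke: revenue per customer."""
    totals = {}
    for row in edw_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


def build_regional_mart(edw_rows):
    """Spoke: revenue per region."""
    totals = {}
    for row in edw_rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


edw = load_edw(crm_source, billing_source)
finance_mart = build_finance_mart(edw)
regional_mart = build_regional_mart(edw)
```

Both marts read from the same integrated rows, but each has its own build logic, so nothing forces the two to stay comparable once their ETL programs diverge.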

Of course, this architecture has shortcomings as well. Although the data marts are built on the integrated central EDW, they still cannot exchange data with each other. Each is built with its own methods and ETL programs, so the data in different data marts is not necessarily consistent. This is also the point at which the architecture starts to get complicated.

5. Bus architecture

The biggest difference between the bus architecture and the hub and spoke architecture is that the atomic layer is dimensionally modeled and that conformed dimensions are established. Because the bus and the conformed dimensions are defined in advance, this architecture keeps enterprise data consistent while the data marts are built up one by one. The bus architecture is a step back from complexity toward a simpler data warehouse architecture: it merges the dimensionally modeled atomic layer of the data warehouse and the data marts into one. The data warehouse is built as a single layer that also supports the various data marts and analytical applications.
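A small sketch shows why conformed dimensions matter. It assumes two hypothetical marts (sales and complaints) that were built separately but share one product dimension; all names and figures are invented for illustration.

```python
# Conformed dimension shared by every data mart on the bus (hypothetical data).
dim_product = {1: "Broadband", 2: "Mobile"}

# Two marts built at different times, both keyed by the same product_key.
fact_sales = [(1, 100.0), (2, 250.0), (1, 40.0)]  # (product_key, revenue)
fact_complaints = [(1, 2), (2, 5), (2, 1)]        # (product_key, complaint_count)


def total_by_product(fact_rows):
    """Aggregate one fact table by the conformed product key."""
    totals = {}
    for product_key, value in fact_rows:
        totals[product_key] = totals.get(product_key, 0) + value
    return totals


def drill_across(dim, *facts):
    """Line up the summaries of several marts on the shared dimension."""
    summaries = [total_by_product(f) for f in facts]
    return {name: tuple(s.get(key, 0) for s in summaries) for key, name in dim.items()}


report = drill_across(dim_product, fact_sales, fact_complaints)
```

Because both fact tables use the same `product_key`, the results line up without any mapping table; that is the consistency the bus is meant to guarantee as marts are added one at a time.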

Of course, the bus architecture has shortcomings too. The central data warehouse is stored as a dimensional model, which limits it for special non-dimensional analysis applications; these are not well supported.

6. Composite architecture

"Composite architecture" is a name I made up; it may well have a proper name that I do not know, and I would welcome being told. This architecture takes both the hub and spoke architecture and the bus architecture into account, or put differently, it is obtained by combining the two. innovate511 probably has more to say on this; the CDW architecture should be the representative example.

The disadvantage of the composite architecture is also obvious: it is too complicated (more complicated than CIF). If the enterprise has a large amount of data, moving it around every time will be very troublesome.

 

Hehe, that is my simple rundown of the various architectures, with a brief look at some of their disadvantages. As for which architecture is good and which is not, I do not want to argue about that; I think each has its own scope of application. Corrections to errors and unclear descriptions in my post are welcome.

 

Bumper 20061210

innovate511 20061211

I think it depends on the specific situation.

It all depends on how large the project is, on business expectations (including the expected life cycle), on business investment, and so on. As analyzed earlier, various architectures have been tried in my current project, and the advantages and disadvantages of each are obvious.

However, the disadvantage of the composite architecture is not that moving data is troublesome but that the overall requirements are very high. From EDW data integration, to building the hub-like CDW, and on to the data marts, there are many interfaces and the design requirements are high. The requirements for development and testing are also very high, and any step done badly can bring the project to the verge of failure. This is an important reason why many projects that tried to imitate the CDW idea in the 1990s failed. But the benefits are also obvious. Even with tens of thousands of end users, the four major BI requirements of every department and branch around the world can all be met. Because all the complex data transformation has already been done in the back end, BI can be rolled out more stably and more quickly. And because the architecture is very flexible, it can accommodate additional data sources and services and keep the system in use for the long term. That avoids the problems of repeated projects: excessive capital spending from repeated investment, reduced stability, consistency that can no longer be guaranteed, too many maintenance problems left over from earlier projects, and dissatisfied end users as a result.

As for moving data, I do not think it is a big problem, because no step can be regarded as an independent whole. As long as the migration follows the process, there is no risk. That includes using Kimball's ideas for DMR (Data Mart Restructure): the data marts can also continue to be expanded or split to meet the overall needs of more complex BI. I remain very confident in this flexibility.

 

Qing 20061211

The discussion about data warehouse architecture over the past two days has been very enjoyable. After reading the posts by Jerome and Innovate, some things have become clearer to me. Jerome proposed six architectures; Innovate proposed four requirements that an architecture faces and four levels of architectural abstraction. I will now play host and summarize the six, the four, and the four. Please continue the discussion.

Six architectures (proposed by Jerome):

1. Independent data mart: if you are not confident about building a large data warehouse, you can start with an independent data mart;

2. Federated: if you already have several data marts but do not want to build a physically centralized data warehouse, consider the federated type;

3. Centralized: a physically centralized data warehouse is built over the different data sources, and all analysis draws on it;

4. CIF: if you are determined to build a large data warehouse, plan the whole information factory thoroughly: build a physically centralized data warehouse, then build data marts for specialized applications;

5. Bus: if you want to see results sooner but do not want to risk wasting your investment as with independent data marts, adopt the bus architecture: plan as a whole and implement as soon as possible;

6. Composite: combines CIF and the bus architecture, with CIF for the data and the bus architecture for the applications.

I think most business analysis systems in the telecommunications industry are closer to "centralized": they physically centralize data from different sources and do not build separate data marts. Most of them only carve out a DM layer within the data warehouse architecture (I do not know whether that counts). Of course, the bus architecture may also be in use; after all, it delivers faster, and at this stage "fast" is still the priority.

Four requirements for architecture (proposed by Innovate):

1. Data integration needs;

2. Decomposition requirements for business;

3. The need for data efficiency;

4. The need to accommodate changing requirements;

Innovate did not say much about these four kinds of requirements. Why these four? Are there other categories? Business decomposition and changing requirements both seem to concern business requirements (perhaps the former is the functional decomposition of the existing business and the latter is change in what the business needs). To categorize requirements, you have to look at which roles raise them. Those who process the data want the architecture to make their work convenient; those who use the data want to access it quickly and securely... I think roles are the proper basis for classifying requirements.

Four abstraction levels of the architecture (proposed by Innovate):

1. Overall IT architecture: hardware, network systems, SOA, and so on;

2. Data warehouse architecture: for example the bus and CIF architectures described above;

3. Functional architecture: how to support ETL, OLAP, report presentation, portal, and other functions;

4. Application architecture: how to support specific services, such as analysis of customer call behavior, which is specific to the telecommunications industry.

Dividing architecture into different levels of abstraction makes obvious sense. For example, SOA is definitely more abstract than J2EE: J2EE may not suit the architecture of a data warehouse system, but SOA can. At the third level the architectures still apply to data warehouse projects across industries, whereas at the fourth level they serve only a specific industry.

 

[email protected] 20061212

A question about an e-government case: how do you integrate data across vertical systems that are each self-contained?

In municipal e-government, data has to be integrated across the various departments at the city level. Most of these departments run top-down systems that reach from the national level through the provinces down to local districts. Public security, taxation, and the People's Bank of China, for example, each have their own self-contained business system. Likewise, the National Bureau of Statistics' information systems for individual professional indicators (population, industry, agriculture, and so on) are self-contained and are rolled out from the center down to local districts and towns. To produce comprehensive indicators (income, GDP, total retail sales of consumer goods, and so on), the various professional indicators have to be integrated. These systems all have heterogeneous databases and data standards (can they be called data marts?), and is the data platform formed by integrating them then called a data warehouse? Should one define middleware-style, centrally planned data standards, or build interface tables? If the data of each bureau is to be used horizontally, for example because the city government wants to use it, which of the architectures above should be adopted? Although our application goals start from different places, I think the data integration problems we run into should be the same.

 

goldenfish 20061213

The first challenge of data integration in this case comes from the management model. The city government wants to use the data, but it does not directly own the data. In the past, the mayor's office has invited the heads of the departments to discuss how the data would be provided. Each department said its detailed data was highly confidential, that releasing it might cause social problems, and that it was unwilling to hand it over. Could they at least provide last year's data, then? Or hand it over just once? Obtaining the data dynamically and in real time is even less feasible. Large state-owned enterprises have the same problem: there are too many levels of legal entities. Headquarters asks the subsidiaries for data, but they stall with one excuse after another and refuse whether they are asked nicely or pressured.

The second challenge is how to establish a comprehensive statistical indicator system at the city level. Defining such a system from the perspective of a single city is somewhat difficult. Is it consistent with the statistical definitions used by the National Bureau of Statistics? If they are exactly the same, why not take the figures directly from the city's statistics bureau? Of course, the statistics bureau may use its own statistical methods, such as sampling, and its own frequency, such as once a quarter, and it may answer only to the higher-level statistical agency rather than to lower levels, all of which makes its figures difficult to use directly. The indicator system should bring together the core indicators of each major business department and eliminate overlap and inconsistency, and some indicators require merging data across departments. Establishing this indicator system is also a great challenge.

Finally, there is the question of how to set up the technical architecture. No technical architecture is universal or absolutely correct; it is there to solve problems and is adjusted to actual conditions. For the management and security reasons above, fetching data directly from the source systems may be difficult, and the next best thing is to use interface tables or interface files. If the data cannot be pulled directly from the source system, the interface format may also have to be defined and supplied by the vendor that developed the system. The interface tables (or files) are placed in an interface buffer and transferred from there to a central site (server) for data integration. It is hard to impose centrally planned data standards on the departments' business systems, since they are all handed down vertically; realistically, only the interface standards can be constrained. The design also has to consider how the data is stored (the data structure), how often data is fetched, how long it is retained, and whether the data needs to be written back to the source systems after integration. In addition, the original detail of some data will be hard to obtain, so a manual supplementary entry method is needed.

 

Qing 20061213

When I saw this question, my first thought was "federated data warehouse architecture". For systems that are already self-contained, is it realistic to consolidate the data into one physical data warehouse? Then I saw goldenfish's reply, which says the primary challenge is the "management model". That is probably the reason. Integrating management is harder than integrating data, so why not try to keep things simple?

Of course, I know too little about e-government systems to say whether this is reasonable; it is enough if it helps develop ideas.

Why was "federated data warehouse" my first reaction? I asked myself that. Perhaps it is because Mr. Yu Shan mentioned self-contained systems. These systems were built by different developers without a unified standard, so their design ideas and coding are bound to vary wildly, and there are security issues as well. So I think a virtual data warehouse may be more appropriate: the actual data stays in each system, and "data integration" means building a framework that spans all these systems (in practice it may cover only the part where they intersect). The framework holds things like "data pointers", which are virtual. When you query for some data, the framework routes the query to the specific system that holds it. For example, to look up a person's credit, criminal record, and tax status, you appear to enter an ID number on a unified platform and get all of this information back, but the data is still stored in the financial, public security, and taxation systems.
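The "data pointer" idea might look roughly like this in Python. It is a minimal sketch: the three dictionaries stand in for the real departmental systems, and the topic names and `lookup_person` function are hypothetical.

```python
# Hypothetical stand-ins for the self-contained departmental systems.
FINANCE_DB = {"ID-1": {"credit": "good"}}
POLICE_DB = {"ID-1": {"criminal_record": "none"}}
TAX_DB = {"ID-1": {"tax_status": "paid"}}

# "Data pointers": the platform stores no data, only where each topic lives.
DATA_POINTERS = {
    "credit": lambda person_id: FINANCE_DB.get(person_id, {}).get("credit"),
    "criminal_record": lambda person_id: POLICE_DB.get(person_id, {}).get("criminal_record"),
    "tax_status": lambda person_id: TAX_DB.get(person_id, {}).get("tax_status"),
}


def lookup_person(person_id, topics):
    """Route each requested topic to the system that owns it and merge the answers."""
    return {topic: DATA_POINTERS[topic](person_id) for topic in topics}


profile = lookup_person("ID-1", ["credit", "criminal_record", "tax_status"])
unknown = lookup_person("ID-2", ["credit"])
```

The caller sees one platform, yet every value is fetched from its owning system at query time, so nothing is copied into a central store.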

Thinking it through further, I believe this case is less about "data integration" than about "application integration". So none of the data warehouse architectures above really applies, not even the federated one, and the answer should probably be sought in Service Oriented Architecture (SOA).

Rather than building a data warehouse, it is better to establish some standards. Picture each independent system as a service provider that registers the services it can offer. For example, the financial system can return a person's accounts, loans, credit, and other information given a personal ID, can return a company's information given a company code, or can provide annual statistical indicators, and so on. Of course, this information should be extracted from a data warehouse inside the financial system rather than always from the business system itself. The key is how to define the services these "service providers" offer, that is, the standards: what goes in, what comes out, the standardization of input and output formats (such as ID requirements, output data, and gender coding rules), permission control, how these services are registered, and so on.

Once such a standard exists, the concrete implementation is left to the vendor of each independent system. What the services are then used for depends on the application.
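As a rough sketch of what such a standard could pin down, the Python below models a registry in which each provider declares its input and promised outputs. It assumes a deliberately simplified contract; `ServiceSpec`, `ServiceRegistry`, the service name `finance.person_credit` and the handler are all hypothetical, and permission control is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ServiceSpec:
    provider: str                    # which independent system offers the service
    input_field: str                 # what the caller must supply
    output_fields: Tuple[str, ...]   # what the provider promises to return
    handler: Callable[[str], dict]   # vendor-supplied implementation


class ServiceRegistry:
    def __init__(self) -> None:
        self._services: Dict[str, ServiceSpec] = {}

    def register(self, name: str, spec: ServiceSpec) -> None:
        if name in self._services:
            raise ValueError(f"service {name!r} is already registered")
        self._services[name] = spec

    def call(self, name: str, value: str) -> dict:
        spec = self._services[name]
        result = spec.handler(value)
        missing = [f for f in spec.output_fields if f not in result]
        if missing:
            raise ValueError(f"{name!r} broke its contract, missing: {missing}")
        # Return only the fields the standard allows, nothing extra.
        return {f: result[f] for f in spec.output_fields}


def finance_credit_lookup(person_id: str) -> dict:
    # Stand-in for a query against the finance system's own data warehouse.
    return {"person_id": person_id, "credit": "good", "internal_note": "not for export"}


registry = ServiceRegistry()
registry.register(
    "finance.person_credit",
    ServiceSpec("finance", "person_id", ("person_id", "credit"), finance_credit_lookup),
)
answer = registry.call("finance.person_credit", "ID-1")
```

The registry enforces only the declared interface; how each vendor produces the answer stays inside its own system, which matches the point that only the standard has to be agreed on.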

 

goldenfish3 20061213

Data integration and application integration are different things. Application integration usually refers to EAI, which nowadays tends to be implemented in terms of SOA. Data integration usually refers to integrating, consolidating, or mediating data through a data integration layer, a data integration platform, an EDW, and the like. Application integration in the broad sense includes data integration between business applications, where data from A is sent to B; for example, policy charges in an insurance business system are sent to the financial system to be booked. In the narrower sense, application integration means that system A asks B to carry out an operation and B returns a confirmation when it is done, and a sequence of such steps forms a process. Process integration is an advanced form of application integration. Data integration, by contrast, is usually done for analytical purposes: data from different systems is sent to an integration platform for cleaning and integration in order to support analytical applications. The combination of the two can be called Enterprise Integration (EI).

As mentioned above, the two overlap, especially now that SOA is being combined with master data management: master data is integrated through SOA and presents a unified view to the business systems (telecommunications already has an application of this in SID). This concept reflects the impact of SOA on traditional business system architecture and is one direction in which application integration is developing. However, master data management cannot replace data integration, nor can it replace data cleaning and consolidation. The two serve different purposes and have different (though possibly overlapping) scopes. EII, on the other hand, has had an impact on traditional data integration architecture. EII embodies the idea of real-time data integration: data from different source systems is not stored centrally but is integrated on the fly as it is transmitted at the time of use. But obviously, current EII implementations cannot shield the business system databases from the access load, and they struggle to process large volumes of data efficiently.

[email protected] 20061213

Thank you all for the guidance. As noted above, integrating management is difficult. But it is precisely the boundaries of management that set the boundaries within which data can be planned in a unified way, and that is how heterogeneous systems come about. Within an enterprise, heterogeneous departmental data can be eliminated through unified planning, but that is hard to achieve across the professional bureaus at the city level.

As far as my research so far goes, the city-level OA system, joint approval (the so-called Sunshine Government Affairs), and the unified treasury payment that the Bureau of Finance has to implement for other departments (for purchases of more than 2,000) are horizontal systems. Coordination among them is relatively close, and it seems that unified data planning and integration can be carried out there.

Liu Qing's last proposal about application integration was very enlightening for me. To use an analogy, is it like the way XML eliminates heterogeneity on the network: not by unifying how text is stored and encoded, but by unifying the language of the presentation interface? Could unity then be achieved regardless of operating system, database, or storage standard for character data?

 

flame 20061220

Personally, I think the bus architecture has the better prospects and the better scalability. It can be used to build multi-level data warehouses and BI applications in large enterprises. As for the shortcomings mentioned, I do not think they are a big problem. A warehouse will always have special applications, and each component on the bus will always have its own peculiarities. As long as the standards such an application uses are uniform, or its scope of use is narrow, it will not have a big impact on the overall bus architecture.

 

