Building a data warehouse system on big data

In the era of big data, upgrading a traditional data warehouse system to a big data platform is very common, and it is also a good opportunity. If you want to study the design of big data warehouse systems, I personally recommend a video course shared by a former architect: http://t.cn/EJ07vua. It is mostly practical material, including 15 hands-on cases and several projects.


First, what is a database?

1. A database is a repository that organizes, stores, and manages data according to a data structure, built on computer storage devices.

2. A database is an organized, shareable collection of data stored long-term in a computer. The data is organized, described, and stored according to a data model, with as little redundancy as possible, high data independence, and easy extensibility, and it can be shared by multiple users within a certain scope.


Data Warehouse Definition:


A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical changes and is used to support management decisions.






Data warehouse and database comparison:




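Business-oriented databases are often referred to as OLTP (online transaction processing) systems. Data warehouses are built for analysis and are often referred to as OLAP (online analytical processing) systems.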
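To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table and its rows are made up for illustration, and a single in-memory database stands in for both kinds of system; in practice OLTP and OLAP workloads usually run on separate systems.

```python
import sqlite3

# Hypothetical orders table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "region TEXT, amount REAL, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "alice", "north", 120.0, "2019-07-01"),
        (2, "bob", "south", 80.0, "2019-07-01"),
        (3, "alice", "north", 40.0, "2019-07-02"),
        (4, "carol", "south", 60.0, "2019-07-02"),
    ],
)

# OLTP-style access: read or change a single row by its key.
conn.execute("UPDATE orders SET amount = 90.0 WHERE order_id = 2")
single_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = 2"
).fetchone()

# OLAP-style access: scan many rows and aggregate them for analysis.
revenue_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(single_order)
print(revenue_by_region)
```

The first query touches one row and must be fast and transactional; the second reads the whole table to answer a business question, which is the kind of work a data warehouse is designed for.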


Second, the development of the data warehouse

The data warehouse concept dates back to the 1970s, when the aim was to provide an architecture that separated business processing systems and analytical processing into different layers.

In the 1980s, the TA2 (Technical Architecture 2) specification was established. It clearly defined four components of an analysis system: data acquisition, data access, directory, and user services.

In 1988, IBM first introduced the concept of the data warehouse: a structured environment that supports end users in managing their entire business and supports IT departments in ensuring data quality. It abstracted the basic components (data extraction, transformation, validation, loading, and cube development) and largely established the basic principles of the data warehouse, its framework structure, and the main principles of analysis systems.


In 1991, Bill Inmon published "Building the Data Warehouse", which put forward more specific data warehouse principles:

1. The data warehouse is subject-oriented

2. Integrated

3. Includes history

4. Not updated

5. Oriented toward decision support

6. Enterprise-wide

7. Stores the most detailed data

8. Data is acquired as snapshots
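Principles 3, 4, and 8 fit together: if data is loaded as dated snapshots and earlier rows are never updated, history is preserved automatically. The following is a minimal sketch of that idea, not a method taken from the book; the function names, the in-memory list, and the customer records are all hypothetical.

```python
from datetime import date

# Hypothetical in-memory "warehouse" table: a list that is only appended to.
warehouse_rows = []


def load_snapshot(source_rows, snapshot_date):
    """Append a dated copy of the source rows; earlier rows are never modified."""
    for row in source_rows:
        warehouse_rows.append({**row, "snapshot_date": snapshot_date})


def history_of(customer_id):
    """Return every recorded state of one customer, oldest first."""
    rows = [r for r in warehouse_rows if r["customer_id"] == customer_id]
    return sorted(rows, key=lambda r: r["snapshot_date"])


# State of a hypothetical source system on two consecutive days.
load_snapshot([{"customer_id": 1, "city": "Beijing"}], date(2019, 7, 1))
load_snapshot([{"customer_id": 1, "city": "Shanghai"}], date(2019, 7, 2))

for row in history_of(1):
    print(row["snapshot_date"], row["city"])
```

The source system only knows the customer's current city, while the warehouse keeps both states, which is what makes historical analysis possible.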


Although some of his theories are still controversial, Bill Inmon came to be known as the "father of data warehousing" on the strength of this book.


Bill Inmon advocated building the enterprise data warehouse top-down and regarded the data warehouse as one part of an overall business intelligence system. An enterprise has only one data warehouse, data marts draw their information from the data warehouse, and information in the data warehouse is stored in third normal form.

[Figure: rough architecture of the top-down approach]




Ralph Kimball published "The Data Warehouse Toolkit", which advocates building the data warehouse bottom-up and strongly promotes building data marts. The data warehouse is regarded as the collection of all data marts within the enterprise, and information is always stored in a multidimensional model.

[Figure: outline of the bottom-up approach]
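A multidimensional model is commonly laid out as a star schema: one fact table of measurements surrounded by dimension tables that describe them. Below is a minimal sketch using Python's built-in sqlite3 module; the table names, columns, and rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive attributes used to slice the facts.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);

    -- Fact table: one row per sale, pointing at the dimensions by key.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'toys');
    INSERT INTO dim_date VALUES (20190701, '2019-07'), (20190801, '2019-08');
    INSERT INTO fact_sales VALUES
        (1, 20190701, 30.0), (2, 20190701, 20.0), (1, 20190801, 50.0);
    """
)

# A typical analytical query: join facts to dimensions, then aggregate.
sales_by_category_month = conn.execute(
    """
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
    """
).fetchall()

print(sales_by_category_month)
```

Each data mart in the bottom-up approach would hold a model of this shape for one business subject.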




In practice, both approaches proved very difficult to carry through to successful project delivery, until Bill Inmon finally proposed a new BI architecture, CIF (Corporate Information Factory), which brought data marts into the design. The core of CIF is to divide the data warehouse architecture into different layers to meet the needs of different scenarios, such as the common ODS, DW, and DM layers, with each layer built differently according to the actual scenario. This idea still guides data warehouse architecture today, but whether to build the warehouse top-down or bottom-up has never been settled.
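The layering can be sketched in a few lines of Python. This example reads ODS as an operational data store holding raw records, DW as the cleaned detail layer, and DM as a subject-specific data mart; what each layer does here (type conversion, de-duplication, aggregation) is an assumption for illustration, and all names and records are hypothetical.

```python
from collections import defaultdict

# ODS layer: raw records as they arrive from source systems.
ods_orders = [
    {"order_id": "1", "region": " North ", "amount": "120.0"},
    {"order_id": "2", "region": "south", "amount": "80.0"},
    {"order_id": "2", "region": "south", "amount": "80.0"},  # duplicate delivery
    {"order_id": "3", "region": "NORTH", "amount": "40.0"},
]


def build_dw(ods_rows):
    """DW layer: cleaned, typed, de-duplicated detail records."""
    seen_ids = set()
    dw_rows = []
    for row in ods_rows:
        order_id = int(row["order_id"])
        if order_id in seen_ids:
            continue  # keep only the first copy of each order
        seen_ids.add(order_id)
        dw_rows.append({
            "order_id": order_id,
            "region": row["region"].strip().lower(),
            "amount": float(row["amount"]),
        })
    return dw_rows


def build_dm(dw_rows):
    """DM layer: a subject-specific aggregate, here revenue per region."""
    totals = defaultdict(float)
    for row in dw_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


dw_orders = build_dw(ods_orders)
dm_revenue_by_region = build_dm(dw_orders)
print(dm_revenue_by_region)
```

Keeping the layers separate means a new data mart can be built from the DW layer without reprocessing the raw data.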


Characteristics of building a data warehouse on big data


As we move from the IT era into the DT era, the amount of accumulated data keeps growing. With the development of the Internet, more and more application scenarios have emerged, and traditional data processing and storage methods can no longer meet the growing demand. Compared with traditional industries, the Internet industry is more receptive to new things and has more complex application scenarios, so it was the first to try building data warehouses on big data.


Although the data warehouse modeling methodology is the same, the industries and scenarios differ. In the Internet field, building a data warehouse on big data cannot follow the original project process and development model. It needs to combine the new technology stack with flexibly changing business scenarios and be oriented toward responding quickly to demand.


A wider range of application scenarios


1) Traditional data warehouses have long construction cycles and high stability requirements. They serve DSS, CRM, and BI systems, and their timeliness requirements are low.


2) Data warehouses built on big data must respond quickly to requirements that are flexible, changeable, and real-time to varying degrees. Besides traditional applications such as DSS and BI, they must also serve complex scenarios such as user profiling, personalized recommendation, machine learning, and data analysis.


A more comprehensive and complex technology stack


Traditional data warehouses are mostly built on mature commercial data integration platforms such as Teradata, Oracle, and Informatica. The technology is relatively mature but also relatively closed, the technical demands on implementers are relatively narrow, and such warehouses are most common in "wealthy" industries such as banking, insurance, and telecommunications.

Data warehouses built on big data are generally based on non-commercial, open-source technology, most commonly the Hadoop ecosystem. They involve a broader and more complex range of technologies. Compared with commercial products, their stability and service support are weaker, so teams have to maintain more technical frameworks themselves.


Third, changes in the technology stack

 




More flexible data warehouse model design

1. Traditional data warehouses have relatively stable business scenarios, relatively reliable data quality, and relatively stable requirements. Their construction follows fairly complete project process management and control, and model design follows strict, stable standards.

2. In the Internet industry:

1) The industry changes rapidly and business is flexible; the Internet industry survives on speed


2) Data sources are diverse: structured databases, Nginx logs, user browsing trails, and other unstructured and semi-structured data


3) Data quality is relatively poor and uneven


So, in the Internet field, data warehouse model design pays more attention to flexibility, speed, and responding to changing business conditions, so that business and operational problems can be solved more quickly. It is oriented toward fast data access and fast onboarding of business needs, and there is no once-and-for-all design.
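As a small example of turning a semi-structured source into rows a warehouse can load, the sketch below parses Nginx access logs. It assumes the logs use Nginx's default "combined" format; the sample lines and the choice to count unparseable lines rather than fail are illustrative assumptions, the latter reflecting the uneven data quality mentioned above.

```python
import re

# Pattern for Nginx's default "combined" access log format (an assumption;
# a custom log_format directive would need a different pattern).
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a structured row for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    row["body_bytes_sent"] = int(row["body_bytes_sent"])
    return row


# Made-up sample input: one valid line and one corrupted line.
sample_lines = [
    '203.0.113.7 - - [01/Jul/2019:10:00:00 +0800] '
    '"GET /item/42 HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    "not a log line",
]

parsed = [parse_line(line) for line in sample_lines]
rows = [row for row in parsed if row is not None]
rejected = parsed.count(None)

print(rows[0]["request"], rows[0]["status"])
print("rejected lines:", rejected)
```

Tracking rejected lines instead of silently dropping them gives a simple data quality signal for each load.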


Fourth, the scope of application and prospects of the data warehouse

Why the data warehouse exists




Fifth, data warehouses built on big data are mainly used in the Internet industry



Origin blog.51cto.com/14485508/2426997