A look at the data warehouse

0.5 What is a database?

  • A database (DB) is a repository, built on computer storage devices, that organizes, stores, and manages data according to a data structure.

  • A database is an organized, shareable collection of data stored in a computer over the long term. The data in a database is organized, described, and stored together according to a data model, with as little redundancy as possible and with high data independence and extensibility, and it can be shared by multiple users within a defined scope.

1. What is a data warehouse?

A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical change and is used to support management decisions.

  • Subject-oriented: an abstraction in which data from enterprise information systems is consolidated at a high level for analysis. Each subject roughly corresponds to one area of analysis.

  • Integrated: enterprise data is brought together while its consistency, completeness, validity, and accuracy are maintained.

  • Relatively stable: viewed over a given period, the data stays the same. There are no updates or deletes, and the workload is mainly query and analysis.

  • Reflects change: the data records how values have changed over time.

Feature | Data warehouse | Database
Data range | Stores historical, complete data that reflects historical change | Current-state data
Data changes | Append only: no deletes or changes, reflecting historical change | Supports frequent insert, delete, update, and query operations
Application scenarios | Analysis-oriented; supports strategic decisions | Oriented toward business transaction processes
Design theory | Deliberately violates the normal forms; appropriate redundancy | Follows the normal forms (1NF, 2NF, 3NF, etc.) to avoid redundancy
Processing volume | Infrequent, large batches, high throughput, with latency | Frequent, small batches, high concurrency, low latency

Business-oriented databases are commonly referred to as OLTP (online transaction processing) systems, and analysis-oriented data warehouses as OLAP (online analytical processing) systems.
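The contrast between current-state data and append-only history can be sketched in a few lines. This is only an illustration under stated assumptions: SQLite (through Python's built-in sqlite3 module) stands in for both the business database and the warehouse, and the table names, the snapshot helper, and the dates are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# OLTP-style table (hypothetical): one row per account, current state only.
cur.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
cur.execute("INSERT INTO account VALUES (1, 100)")

# Warehouse-style table (hypothetical): append-only, one row per account per snapshot.
cur.execute(
    "CREATE TABLE account_history (snapshot_date TEXT, id INTEGER, balance INTEGER)"
)


def snapshot(date):
    """Append the current OLTP state to the history table (no update, no delete)."""
    cur.execute(
        "INSERT INTO account_history SELECT ?, id, balance FROM account", (date,)
    )


snapshot("2019-06-01")
# OLTP workload: a small, frequent update that overwrites the current value.
cur.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
snapshot("2019-06-02")

# The database only knows the present; the history table keeps every snapshot.
current = cur.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
history = cur.execute(
    "SELECT snapshot_date, balance FROM account_history "
    "WHERE id = 1 ORDER BY snapshot_date"
).fetchall()
```

After the update, the account table has lost the old balance, while account_history still holds one row per snapshot date. That difference is what the "Data range" and "Data changes" rows of the comparison describe.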

2. The development of the data warehouse

The data warehouse concept dates back to the 1970s, when the aim was an architecture that separated business processing systems and analytical processing into different layers.

In the 1980s, the TA2 (Technical Architecture 2) specification was established. It clearly defined the four components of an analytical system: data acquisition, data access, directory, and user services.

In 1988, IBM first introduced the concept of the data warehouse: a structured environment that supports end users in managing their entire business and supports IT departments in ensuring data quality. It abstracted the basic components (data extraction, transformation, validation, loading, and cube development) and largely established the basic principles of the data warehouse, its framework structure, and the main principles of analytical systems.


[Pay attention, these are the key points]:

Transformation: Sqoop handles only simple data conversion; most transformation work is done through ETL.

Validation: this is where data quality issues come in.


How should a data warehouse be built?

  • 1) In 1991, Bill Inmon proposed a top-down approach to building an enterprise data warehouse (Data Warehouse, DW), holding that the DW is one part of an overall business intelligence (BI) system. A company has only one DW; data marts (Data Mart, DM) draw their information from the DW, and information in the DW is stored in third normal form (3NF).
  • 2) Ralph Kimball advocated a bottom-up approach to building the DW that starts with data marts: the DW is the collection of all DMs within the enterprise, and information is always stored in a dimensional model.

  • 3) Bill Inmon later proposed a new BI architecture, the CIF (Corporate Information Factory), which brings the DM into the picture. The core idea of the CIF is to divide the DW architecture into layers that serve different scenarios, such as the common ODS, DW (e.g. DWD, DWS), and DM layers, each built differently according to the actual scenario. This idea guides how DW architectures are built today, but there is still no consensus on whether a DW should ultimately be built top-down or bottom-up.
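To make "stored in a dimensional model" concrete, here is a minimal star schema sketch. It is illustrative only: SQLite via Python's sqlite3 module stands in for the warehouse engine, and every table name, column, and row is an assumption invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (hypothetical): descriptive attributes, deliberately denormalized.
-- In a 3NF design, category would live in its own table referenced by product.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT);

-- Fact table (hypothetical): one row per sale, holding keys and numeric measures.
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    amount INTEGER
);

INSERT INTO dim_product VALUES
    (1, 'keyboard', 'hardware'), (2, 'mouse', 'hardware'), (3, 'ebook', 'media');
INSERT INTO dim_date VALUES
    (20190601, '2019-06-01', '2019-06'), (20190602, '2019-06-02', '2019-06');
INSERT INTO fact_sales VALUES
    (20190601, 1, 2, 100), (20190601, 3, 1, 10), (20190602, 2, 4, 80);
""")

# A typical analytical query: join the fact table to its dimensions and aggregate.
sales_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category
""").fetchall()
```

The redundancy in the dimension tables is intentional: queries need only one join per dimension, at the cost of repeating values such as the category name.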


[Pay attention, these are the key points]:

Many people say that dimensional modeling is unnecessary in the Hadoop era. That is not true: modeling is still useful in the Hadoop era, and in fact very meaningful. Hadoop provides the underlying technology for big data applications as a whole; Hive, Spark, and the like are only technical architecture. When we talk about big data, we care more about the value of the data, so when organizing it we still need to sort out its content and do conceptual data modeling.

Many companies, however, may be only at an early stage with big data and care more about the business side, so modeling is not required at first. But once the volume of data reaches a certain level, data quality cannot be guaranteed without data modeling, and the data deteriorates and rots.


Data warehouse layered architecture diagram:

Core Description:

  • Data in the business DB is synchronized to a read-only DB
  • Sqoop imports the data in the read-only DB into HDFS (the ODS layer)
  • In Hive, the ODS-layer data is filtered with HQL into the DWS layer (the DWS layer stores aggregated intermediate results)
  • In Hive, the DWS-layer data is filtered with HQL, according to specific requirements, into the DM layer (the DM layer stores the metrics to be computed)
  • The Hive DM-layer data is exported via Sqoop or Spark SQL and stored in an RDBMS (MySQL); this is the data used for visual display
  • The visualization layer (pie charts, bar charts, line charts, and so on) reads the result data from the RDBMS using the appropriate visualization technology
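The ODS to DWS to DM flow described in the list above can be sketched as follows. This is a toy stand-in, not the real pipeline: SQLite replaces Hive/HQL, a plain INSERT replaces the Sqoop import, and all table names, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- ODS: raw rows as they would arrive from the read-only business DB.
CREATE TABLE ods_orders (
    order_id INTEGER, user_id INTEGER, amount INTEGER, status TEXT, order_date TEXT
);
INSERT INTO ods_orders VALUES
    (1, 10, 50, 'paid',      '2019-06-01'),
    (2, 10, 20, 'cancelled', '2019-06-01'),
    (3, 11, 30, 'paid',      '2019-06-01'),
    (4, 10, 40, 'paid',      '2019-06-02');

-- DWS: filtered, aggregated intermediate result (paid orders per user per day).
CREATE TABLE dws_user_daily AS
SELECT order_date, user_id, COUNT(*) AS order_cnt, SUM(amount) AS paid_amount
FROM ods_orders
WHERE status = 'paid'
GROUP BY order_date, user_id;

-- DM: the metrics one specific report needs (daily revenue and paying users).
CREATE TABLE dm_daily_revenue AS
SELECT order_date, SUM(paid_amount) AS revenue, COUNT(*) AS paying_users
FROM dws_user_daily
GROUP BY order_date;
""")

# The rows a reporting database would receive for charting.
report_rows = conn.execute(
    "SELECT order_date, revenue, paying_users FROM dm_daily_revenue ORDER BY order_date"
).fetchall()
```

Each layer reads only from the layer directly beneath it, so the DM query never touches raw order rows.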

Why layer the data warehouse?

  • Trading space for time: a large amount of preprocessing improves the user experience (efficiency) of the application system, so the data warehouse will contain a large amount of redundant data.
  • Without layering, any change to the business rules of a source business system affects the entire data cleaning process, which means a huge amount of work.
  • Layered data management simplifies the cleaning process. Work that was originally done in one step is split into several steps, which amounts to breaking one complex job into several simple ones and turning one big black box into white boxes. The processing logic of each layer is relatively simple and easy to understand, so it is easier to verify that each step is correct, and when a data error occurs we often need to adjust only one particular step.
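The "black box into white boxes" point can be illustrated with three small functions, each of which can be inspected and checked on its own. The function names, field names, and sample rows below are assumptions made up for this sketch.

```python
def clean(raw_rows):
    """Step 1: drop rows with a missing amount and normalize the status field."""
    return [
        {**row, "status": row["status"].strip().lower()}
        for row in raw_rows
        if row["amount"] is not None
    ]


def keep_paid(rows):
    """Step 2: the business rule. If the rule changes, only this step changes."""
    return [row for row in rows if row["status"] == "paid"]


def total_by_day(rows):
    """Step 3: aggregate the filtered rows into one total per day."""
    totals = {}
    for row in rows:
        totals[row["day"]] = totals.get(row["day"], 0) + row["amount"]
    return totals


raw = [
    {"day": "2019-06-01", "amount": 50, "status": " Paid "},
    {"day": "2019-06-01", "amount": None, "status": "paid"},
    {"day": "2019-06-01", "amount": 20, "status": "cancelled"},
    {"day": "2019-06-02", "amount": 40, "status": "PAID"},
]

# Every intermediate result is available for inspection, like a layer in the warehouse.
cleaned = clean(raw)
paid = keep_paid(cleaned)
totals = total_by_day(paid)
```

If a total looks wrong, the intermediate lists show which step introduced the error, and a change to the "paid" rule touches only keep_paid.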

Having said that, what are the steps for building a data warehouse?

  • 1) Collect and analyze business requirements
  • 2) Establish the data model and the physical design of the data warehouse
  • 3) Define the data sources
  • 4) Select the data warehouse technology and platform
  • 5) Extract, cleanse, and transform data from the operational databases into the data warehouse
  • 6) Select access and reporting tools
  • 7) Select database connectivity software
  • 8) Select data analysis and data presentation software
  • 9) Update the data warehouse

3. Characteristics of building a data warehouse on big data

    1. Characteristics

Because of the "change or die" pressure in the Internet industry, data warehouses built on big data in the Internet field do not follow the original project process and development model. They need to incorporate new technology stacks, adjust flexibly to business scenarios, and be oriented toward responding quickly to demand.

    1. A wide range of application scenarios
Traditional data warehouse | Data warehouse built on big data
Long construction period | Requires a rapid response to requirements
Stable demand | Flexible, changing demand
Less demanding on timeliness | Real-time requirements at different levels
Serves DSS, CRM, BI, and similar systems | In addition to traditional DSS and BI applications, also serves user profiling, personalized recommendation (for example, after you read an article, ten more articles of the same type are recommended to you), machine learning, data analysis, and other application scenarios

Alibaba: OneData (at the company level, all data applications are built on one set of data with a single data outlet; whether real-time or offline, there is only one unified outlet) and OneService (data services are likewise unified).

When a data warehouse is mentioned, people often assume it is offline, or at least T+1. That describes the traditional data warehouse. A data warehouse built on big data should not be understood that way; real-time processing should also be brought into the data warehouse.

    2. A more comprehensive and complex technology stack
    • Traditional data warehouse construction: mostly based on mature commercial data integration platforms such as Oracle and Informatica. The technology is relatively mature but relatively closed, and the skills required of implementers are relatively specialized and narrow.
    • Data warehouse construction on big data: generally based on non-commercial, open source technology. The technology involved is broader and more complex, and no commercial vendor provides service, so more technical frameworks have to be maintained in-house. It is commonly built on the Hadoop ecosystem.

Overall technology stack diagram:

    3. More flexible data warehouse model design
    • Traditional data warehouse:
      • Business scenarios are relatively stable, data quality is relatively reliable, and demand is relatively stable. Construction follows fairly complete project process management and control, and model design follows strict and stable construction standards
    • Internet industry:
      • The industry changes rapidly and the business is flexible; the Internet is an industry that survives on speed
      • Wide range of data sources: structured databases, Nginx logs, user browsing trails, and other structured, unstructured, and semi-structured data
      • Data quality is relatively poor and uneven

In summary, in the Internet field a data warehouse model must exist (not necessarily dimensional modeling, but at least some modeling is needed). The design of the model focuses more on flexibility, speed, and responding to changing business conditions, and is oriented toward quickly solving business and operational problems: fast data access and fast business access. There is no once-and-for-all model.

4. Applications and prospects of the data warehouse

  • Why the data warehouse exists

    • Project management
  • Applications of big-data-based data warehouses in the Internet industry:

    • WITH A
    • Push Message
    • Personalization ("a thousand faces for a thousand people")
    • User profiling
    • Fraud detection

5. Development directions

  • 1) Data analysis, data mining, artificial intelligence, machine learning, risk control, autonomous driving

  • 2) Data-driven operations, precision operations

  • 3) Precision advertising, intelligent ad delivery

Reproduced from: https://juejin.im/post/5cfd3d94e51d4510a732808f

Origin blog.csdn.net/weixin_34388207/article/details/91419146