Talking about Big Data - Offline Data Warehouse Based on SparkSQL

Table of contents

Knowledge Supplement

Hierarchical Design of Offline Data Warehouse

Data Layering: ODS

Data Layering: DW

Data Layering: APP

Data Modeling


Knowledge Supplement

Before reading further, it helps to understand the following basic concepts of data warehousing:

  • Business segment: a way of dividing the business at a higher level than the data domain, suited to very large business systems.
  • Dimensions: dimensional modeling was proposed by Ralph Kimball. It advocates building models starting from analysis and decision-making needs, and serving those needs. A dimension is the environment in which a measurement takes place, the angle from which we observe the business, and it reflects a class of attributes of the business. A collection of such attributes constitutes a dimension, which can also be called an entity object. For example, when analyzing the transaction process, the environment in which a transaction occurs can be described through dimensions such as buyer, seller, commodity, and time.
  • Attributes (dimension attributes): the columns contained in a dimension are called dimension attributes. They are the basic source of query constraints, groupings, and report labels, and are the key to making the data usable.
  • Measures (metrics): in dimensional modeling, measurements are called facts, and the context that describes them is captured by dimensions, which are the various environments needed to analyze the facts. Measures are usually numeric data and act as the facts of a fact logical table.
  • Indicators: indicators are divided into atomic indicators and derived indicators. An atomic indicator is a measure of a certain business event's behavior; it cannot be broken down further within the business definition. It is a noun with a clear business meaning and reflects a well-defined statistical caliber and calculation logic, for example payment amount.
    • Atomic indicator = business process + measure.
    • Derived indicator = time period + modifiers + atomic indicator. A derived indicator can be understood as delimiting the statistical scope of an atomic indicator.
  • Business limitation: the business scope of the statistic, used to filter out records that meet the business rules (similar to the WHERE conditions in SQL, excluding the time range).
  • Statistical period: the time range of the statistic, such as the last day or the last 30 days (similar to the time condition in a SQL WHERE clause).
  • Statistical granularity: the object or perspective of the statistical analysis. It defines how far the data should be summarized and can be understood as the grouping condition of an aggregation (similar to the GROUP BY columns in SQL). Granularity is a combination of dimensions and indicates the scope of the statistic. For example, if an indicator is the turnover of a certain seller in a certain province, the granularity is the combination of the seller and region dimensions; if the whole table is counted, the granularity is the whole table. When specifying granularity, fully consider the relationship between the business and the dimensions. Statistical granularity often acts as a modifier of derived indicators. A minimal SparkSQL sketch tying these terms together follows this list.
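To make these terms concrete, here is a minimal PySpark sketch that expresses one derived indicator, the payment amount per seller and province over the last 30 days for paid orders, as a SparkSQL query. The table and column names (dwd.dwd_trade_order_detail, seller_id, province, pay_amount, order_status, dt) are hypothetical placeholders, not part of the original article.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("indicator-demo").enableHiveSupport().getOrCreate()

# Derived indicator = time period + modifiers + atomic indicator
#   Statistical period     : last 30 days        -> WHERE dt >= ...
#   Business limitation    : only paid orders    -> AND order_status = 'PAID'
#   Statistical granularity: seller x province   -> GROUP BY seller_id, province
#   Atomic indicator       : payment amount      -> SUM(pay_amount)
spark.sql("""
    SELECT seller_id,
           province,
           SUM(pay_amount) AS pay_amount_30d
    FROM   dwd.dwd_trade_order_detail
    WHERE  dt >= date_sub(current_date(), 30)
      AND  order_status = 'PAID'
    GROUP BY seller_id, province
""").show()
```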

Hierarchical Design of Offline Data Warehouse

Traditional data warehouse:

Why should the data warehouse be layered?

  • Clear data structure: each layer has its own scope of responsibility, so tables are easier to locate and understand when we use them.
  • Data lineage tracking: put simply, a business table often has many upstream sources; if a source table has a problem, we want to locate it quickly and accurately and understand the scope of its impact.
  • Less repetitive development: standardizing the layering and building common intermediate-layer data greatly reduces repeated computation. Decomposing a complex task into multiple steps, with each layer handling a single step, keeps the work simple and easy to understand. It also makes data accuracy easier to maintain: when something goes wrong, we do not need to repair all of the data, only to restart from the problematic step.
  • Shielding raw-data anomalies: isolate downstream layers from anomalies in the raw data and from business changes, so that the data does not have to be re-ingested every time the business changes.

Now let's understand the layering in theory:

As an abstraction, the data warehouse can be divided into the following three layers: the data operation layer (ODS), the data warehouse layer (DW), and the data product layer (APP).

Data Layering: ODS

The full name of ODS is Operational Data Store, which is operational data storage.

The "topic-oriented" data post source layer, also called the ODS layer, is the layer closest to the data in the data source. The data in the data source is extracted, cleaned, and transmitted, that is to say, after the legendary ETL, it is installed into this layer. The data at this layer is generally classified according to the classification method of the source business system.


 

Data Layering: DW

This layer is the main body of the data warehouse and contains:

  • Public summary granularity fact layer (DWS): driven by the subject of analysis and by the indicator requirements of upper-level applications and products, this layer builds summary indicator fact tables at a common granularity and physicalizes the model as wide tables. It constructs statistical indicators with standardized naming and consistent calibers, provides public indicators to the upper layer, and establishes summary wide tables and detailed fact tables.

    Tables in the public summary granularity fact layer are usually also called summary logical tables and are used to store derived indicator data.

  • Detail-grained fact layer (DWD): driven by business processes and based on the characteristics of each specific business process, this layer builds fact tables at the finest level of detail. Depending on how the enterprise uses the data, some important dimension attribute fields can be appropriately made redundant in the detailed fact tables, i.e. wide-table processing.

    Tables in the detail-grained fact layer are also commonly referred to as logical fact tables. A sketch of building DWD and DWS tables with SparkSQL follows this list.
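The sketch below shows one way the two layers might be built with SparkSQL: a DWD table keeping the finest-grained order detail with a few redundant dimension attributes, and a DWS table summarizing it at a common seller-by-province daily granularity. All database, table, and column names (ods.ods_order_raw, dim.dim_seller, dwd.dwd_trade_order_detail, dws.dws_trade_seller_1d) are hypothetical, chosen only to illustrate the layering.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dw-build").enableHiveSupport().getOrCreate()

# DWD: finest-grained facts, with frequently used dimension attributes made
# redundant (wide-table processing) to save repeated joins downstream.
spark.sql("""
    INSERT OVERWRITE TABLE dwd.dwd_trade_order_detail PARTITION (dt = '2022-04-12')
    SELECT o.order_id,
           o.seller_id,
           s.seller_name,   -- redundant dimension attribute
           s.province,      -- redundant dimension attribute
           o.pay_amount,
           o.order_status
    FROM   ods.ods_order_raw o
    LEFT JOIN dim.dim_seller s ON o.seller_id = s.seller_id
    WHERE  o.dt = '2022-04-12'
""")

# DWS: a common-granularity daily summary (seller x province) that upper
# layers reuse instead of recomputing from the detail data.
spark.sql("""
    INSERT OVERWRITE TABLE dws.dws_trade_seller_1d PARTITION (dt = '2022-04-12')
    SELECT seller_id,
           province,
           COUNT(order_id) AS order_count_1d,
           SUM(pay_amount) AS pay_amount_1d
    FROM   dwd.dwd_trade_order_detail
    WHERE  dt = '2022-04-12'
      AND  order_status = 'PAID'
    GROUP BY seller_id, province
""")
```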

Data Layering: APP

Data product layer (APP): this layer provides the result data used by data products.

The data here mainly serves data products and data analysis. For online systems it is generally stored in Elasticsearch, MySQL, and similar stores, while for data analysis and data mining it may also be kept in Hive or Druid. A minimal export sketch follows.
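As a sketch of that export, reusing the hypothetical dws.dws_trade_seller_1d table from above and a placeholder MySQL connection, an APP-layer result could be pushed out over JDBC like this (writing to Elasticsearch would use the es-hadoop connector instead):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("app-export").enableHiveSupport().getOrCreate()

# APP-layer result: province-level payment amount for the reporting system.
report = spark.sql("""
    SELECT province,
           SUM(pay_amount_1d) AS pay_amount_1d
    FROM   dws.dws_trade_seller_1d
    WHERE  dt = '2022-04-12'
    GROUP BY province
""")

# Overwrite the serving table that the online reporting system reads.
# Requires the MySQL JDBC driver on the classpath; URL and credentials are placeholders.
report.write.format("jdbc") \
    .option("url", "jdbc:mysql://mysql-host:3306/report_db") \
    .option("dbtable", "app_province_pay_amount_1d") \
    .option("user", "report_user") \
    .option("password", "change-me") \
    .mode("overwrite") \
    .save()
```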
 

Of course, in practice, we can also expand the hierarchical structure as needed:

Data Modeling

First of all, clarify the requirements:

  • What are the business volume (DAU), data volume (GB/TB/PB scale), and growth rate?
  • Beyond offline analysis scenarios, is real-time analysis required? Will complex queries be involved? Does it need to support an upper-level reporting system, and will it be open to non-technical staff?
  • Does the business department currently have clear data requirements? Is there a need for data monitoring, analysis, and indicator statistics within the next half year?
  • Is there a budget for commercial-grade products, or is open source the priority?

After that, consider the cost (no specific suggestions here).

Then consider the scale:

  • Make a rough assessment of the data volume expected over the coming period
  • Oracle RAC supports clusters with a small number of nodes and scale-up scenarios
  • Hadoop clusters can be scaled out horizontally
  • PostgreSQL plus a proxy can also be sharded horizontally
  • Besides the engine, peripheral systems also need to account for the data size
  • Besides data size, tenant usage also needs to be considered

After that come ease of use, operations and maintenance, and so on.

Then we build the data warehouse layering and engine architecture:

 Related technology stack:

You can refer to the following technical selection:

So what advantages does SparkSQL offer for such a data warehouse?

SparkSQL has its own architecture inside it:
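One hands-on way to see that internal pipeline (parser, analyzer, Catalyst optimizer, physical planner) is to ask SparkSQL to explain a query; the sketch below reuses the hypothetical DWS table from the earlier examples:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explain-demo").enableHiveSupport().getOrCreate()

df = spark.sql("""
    SELECT province, SUM(pay_amount_1d) AS pay_amount
    FROM   dws.dws_trade_seller_1d
    WHERE  dt = '2022-04-12'
    GROUP BY province
""")

# Prints the parsed, analyzed, and optimized logical plans plus the physical
# plan, i.e. the stages SparkSQL runs a query through.
df.explain(True)
```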

 Finally, let's look at a real data warehouse architecture:

Written at the end: a data warehouse is, after all, composed of many tables, but always remember what a data warehouse is.

A data warehouse (Data Warehouse, abbreviated DW or DWH) is a strategic collection of all types of data that supports decision-making processes at all levels of an enterprise. It is a single data store created for analytical reporting and decision support, providing guidance for enterprises that need business intelligence on improving business processes and monitoring time, cost, quality, and control.

For a further introduction to data warehouses, see:

Talking about Big Data - Real-time Data Warehouse and Practical Application: https://blog.csdn.net/qq_52213943/article/details/124132686?spm=1001.2014.3001.5502

Origin blog.csdn.net/qq_52213943/article/details/124156599