Multidimensional Data Analysis

Data Mining and Business Intelligence (BI): Data Analysis and Mining Concepts

Data mining is currently booming in all kinds of businesses and institutions. We have therefore produced a summary of common terminology in this field.

  1. Analytical CRM/aCRM: Used to support decision-making, improve the company's interaction with customers or increase the value of the interaction. Collect, analyze, and apply knowledge about customers and how to effectively contact customers.
  2. Big Data: Big Data is both an overused buzzword and a real trend in today's society. The term refers to the ever-increasing volume of data that is captured, processed, aggregated, stored, and analyzed every day. Wikipedia describes "Big Data" as "data sets so large and complex that they are difficult to handle with existing database management tools (…)".
  3. Business Intelligence: Applications, facilities, tools, and processes that analyze data and present information to help business executives, management, and others make more informed business decisions.
  4. Churn Analysis/Attrition Analysis: Identifies which customers are likely to stop using the company's products or services, and which customers' churn would be most costly. The results of churn analysis are used to prepare new offers for the customers most likely to churn.
  5. Conjoint Analysis/ Trade-off Analysis: Comparing several different variants of the same product/service on the basis of actual usage by consumers. It can predict the acceptance of products/services after they are launched, and is used for product line management, pricing and other activities.
  6. Credit Scoring: Assessing the creditworthiness of an entity (a company or an individual). The bank (the lender) uses the score to judge whether a borrower will repay the loan.
  7. Cross / Up selling: A marketing concept. Selling complementary products (kit selling) or additional products (value-added selling) to specific consumers based on their characteristics and past behaviour.
  8. Customer Segmentation & Profiling: According to existing customer data, customers with similar characteristics and behaviors are classified into groups. Describe and compare groups.
  9. Data Mart: Data stored by a specific organization on a specific topic or department, such as sales, financial, and marketing data.
  10. Data Warehouse: A central repository of data that collects and stores data from multiple business systems of an enterprise.
  11. Data Quality: Processes and techniques related to ensuring the reliability and usefulness of data. High-quality data should faithfully reflect the transaction process behind it, and be able to meet the expected use in operations, decision-making, and planning.
  12. Extract-Transform-Load (ETL): A process in data warehousing: acquire data from a source, transform it as required for subsequent use, and then load it into the correct target database.
  13. Fraud Detection: Identify suspected fraudulent transfers, orders, and other illegal activities against specific organizations or companies. Triggered alerts are pre-designed in the IT system, and warnings will appear when attempting or performing such activities.
  14. Hadoop: Another big data buzzword today. Apache Hadoop is an open source software architecture for distributed storage and processing of huge data sets on a computer cluster composed of existing commercial hardware. It enables large-scale data storage and faster data processing.
  15. Internet of Things (IoT): A widely distributed network consisting of electronic devices of many types (personal, domestic, industrial) and many purposes (medical, leisure, media, shopping, manufacturing, environmental regulation). These devices exchange data and coordinate activities with each other via the Internet.
  16. Customer Lifetime Value (Lifetime Value, LTV): The expected discounted profit that a customer will generate for a company during his/her lifetime.
  17. Machine Learning: A discipline that studies automatic learning from data so that computers can adjust their operations based on the feedback they receive. It is closely related to artificial intelligence, data mining, and statistical methods.
  18. Market Basket Analysis: Identify combinations of goods or services that often co-occur in transactions, such as products that are often purchased together. The results of such analysis are used to recommend additional products, provide a basis for decision-making on displaying products, etc.
  19. On-Line Analytical Processing (OLAP): A tool that allows users to easily create and browse reports that summarize relevant data and analyze it from multiple perspectives.
  20. Predictive Analytics: Extracting information from existing data sets in order to identify patterns and predict future returns and trends. In business, predictive models and analytics are used to analyze current data and historical facts to better understand customers, products, partners, and identify opportunities and risks for the company.
  21. Real Time Decisioning (RTD): Help companies make real-time (almost no delay) optimal sales/marketing decisions. For example, a real-time decision-making system (scoring system) can score and rank customers at the moment they interact with the company through various business rules or models.
  22. Retention / Customer Retention: Refers to the percentage of customer relationships that can be maintained for a long time after establishment.
  23. Social Network Analysis (SNA): Depicts and measures the relationships and flows between people, groups and groups, institutions and institutions, computers and computers, URLs and URLs, and other kinds of connected information/knowledge entities. These people or groups are nodes in the network, and the lines between them represent relationships or flows. SNA provides a method for analyzing human relationships that is both mathematical and visual.
  24. Survival Analysis: Estimating how long a customer will keep doing business with the company, or the likelihood of losing the customer in subsequent periods. Such information lets businesses judge customer retention over the desired forecast period and introduce appropriate loyalty policies.
  25. Text Mining: The analysis of data containing natural language. Statistical calculations are performed on the words and phrases in the source data to express the text structure in mathematical terms, and then the text structure is analyzed using traditional data mining techniques.
  26. Unstructured Data: Data either lacks a pre-defined data model or is not organized according to a pre-defined specification. This term usually refers to information that cannot be placed in a traditional columnar database, such as email messages, comments.
  27. Web Mining / Web Data Mining : The use of data mining techniques to automatically discover and extract information from Internet sites, documents or services.

Difference Between Database and Data Warehouse

The difference between a database and a data warehouse is actually the difference between OLTP and OLAP.

Operational processing, known as OLTP (On-Line Transaction Processing), can also be called transaction-oriented processing. It covers the day-to-day operations that specific business processes perform on the database, usually querying and modifying a small number of records at a time. Users care most about operation response time, data security, integrity, and the number of concurrent users supported. As the main means of data management, traditional database systems are used chiefly for operational processing.

Analytical processing, called OLAP (On-Line Analytical Processing), generally analyzes historical data of certain subjects to support management decisions.

First of all, we must understand that the data warehouse did not emerge to replace the database.

  • A database is designed around transactions; a data warehouse is designed around subjects.
  • Databases generally store business data, and data warehouses generally store historical data.
  • A database is designed to avoid redundancy as much as possible and is usually built for a specific business application. A simple User table recording user names and passwords, for example, suits the application but is poorly suited to analysis. A data warehouse intentionally introduces redundancy and is designed around analysis requirements, analysis dimensions, and analysis indicators (a small sketch of this difference follows this list).
  • Databases are designed for capturing data, data warehouses are designed for analyzing data.
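As a minimal sketch of that design difference (all table and column names here are invented for illustration): the OLTP side stays normalized and stores only keys, while the warehouse dimension deliberately repeats descriptive attributes so analysis needs no extra joins.

-- OLTP: normalized, redundancy avoided (hypothetical tables)
CREATE TABLE Users (
    UserID    INT PRIMARY KEY,
    UserName  VARCHAR(50),
    Password  VARCHAR(50)
);
CREATE TABLE Orders (
    OrderID   INT PRIMARY KEY,
    UserID    INT REFERENCES Users(UserID),  -- only the key is stored
    Amount    DECIMAL(10, 2),
    OrderDate DATE
);

-- Data warehouse: a dimension table with intentional redundancy,
-- organized by the attributes analysts group and filter on
CREATE TABLE DimCustomer (
    CustomerKey  INT PRIMARY KEY,
    CustomerName VARCHAR(50),
    City         VARCHAR(50),
    Province     VARCHAR(50),
    Country      VARCHAR(50)   -- city/province/country repeated per row for easy roll-up
);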

Take banking, for example. The database is the data platform of the transaction system. Every transaction made by the customer in the bank will be written into the database and recorded. Here, it can be simply understood as using the database to keep accounts. The data warehouse is the data platform of the analysis system. It obtains data from the transaction system, summarizes and processes it, and provides decision-makers with a basis for decision-making. For example, how many transactions occur in a certain branch of a bank in a month, and what is the current deposit balance of the branch. If there are more deposits and more consumer transactions, then it is necessary to set up an ATM in the area.
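A rough sketch of the kind of analytical query this implies is shown below; the table and column names are illustrative only, not an actual banking schema.

-- Monthly transaction count and total deposits per branch (illustrative schema)
SELECT b.BranchName,
       d.CalendarYear,
       d.MonthNumberOfYear,
       COUNT(*)             AS TransactionCount,
       SUM(f.DepositAmount) AS TotalDeposits
FROM FactTransaction f
JOIN DimBranch b ON f.BranchKey = b.BranchKey
JOIN DimDate   d ON f.DateKey   = d.DateKey
GROUP BY b.BranchName, d.CalendarYear, d.MonthNumberOfYear;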

Obviously, banks' transaction volumes are huge, usually measured in millions or even tens of millions of transactions. The transaction system is real-time and demands timeliness: if depositing a sum of money took tens of seconds, customers could not bear it. This is also why the operational database only needs to keep data for a short period. The analysis system works after the fact and must provide all valid data within the time window of interest. That data is massive and summary calculations are slower, but as long as effective analysis data can be provided, the goal is achieved.

The data warehouse arose when enterprises already had large numbers of databases, in order to mine those data resources further and support decision-making. It is by no means simply a "large database".

Related concepts

Data Warehouse (DW): A subject-oriented, integrated, relatively stable collection of data that reflects historical changes and is used to support management decisions.

Cube: The main object in Online Analytical Processing (OLAP), and a technology for fast access to data in a data warehouse. A cube is a collection of data, usually constructed from a subset of the data warehouse, organized and summarized into a multidimensional structure defined by a set of dimensions and measures.

Dimensions: Structural properties of a cube. They are organized hierarchies (levels) of categories used to describe the data in fact tables. These categories and levels describe sets of similar members on which users base their analysis.

Measures: In a cube, a measure is a set of values based on a column in the cube's fact table, usually numeric. Measures are the central values of the cube being analyzed.

Fact table: A table that stores large volumes of business measurement data. The measures in a fact table are generally referred to as facts.
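To tie these terms together, here is a minimal sketch (the table and column names are hypothetical): the fact table carries the measures plus foreign keys into the dimensions, and each dimension carries the levels users analyze by.

-- Fact table: measures plus dimension keys (hypothetical)
CREATE TABLE FactSales (
    DateKey     INT,            -- references DimDate
    ProductKey  INT,            -- references DimProduct
    SalesAmount DECIMAL(12, 2), -- measure
    Quantity    INT             -- measure
);

-- Dimension table: levels of a Category -> Subcategory -> Product hierarchy
CREATE TABLE DimProduct (
    ProductKey  INT PRIMARY KEY,
    ProductName VARCHAR(50),
    Subcategory VARCHAR(50),
    Category    VARCHAR(50)
);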

For other related concepts, refer to the introductions elsewhere in this blog; for details, see the Mondrian-based multidimensional analysis system.
ETL: extraction, transformation, and loading.
The essence of ETL work is to extract data from the various data sources, transform it, and finally load it into the dimensionally modeled tables of the data warehouse. The ETL work is not complete until those dimension/fact tables have been populated. The three steps of extraction, transformation, and loading are explained in turn below:

1. Extract

Data warehouses are analysis-oriented, while operational databases are application-oriented. Obviously, not all of the data used to support the business systems needs to be analyzed. This stage therefore mainly determines, according to the subjects and subject domains of the data warehouse, which data needs to be extracted from the application databases.

During development, developers will often find that some ETL steps do not match the table definitions produced during data warehouse modeling. When that happens, the design requirements must be re-checked and the ETL redone. As mentioned in the database series, any change that touches requirements has to start from the requirements documents and update them first.

2. Transform

The transformation step converts the extracted data so that its structure meets the target data warehouse model. In addition, the transformation process is responsible for data quality work; this part is also known as data cleaning.
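A small sketch of what one cleaning/conversion step might look like in SQL; the staging table, target table, and code mappings below are hypothetical.

-- Standardize values and reject obviously bad rows while moving data
-- from a staging area into the warehouse schema (illustrative only)
INSERT INTO DimCustomerClean (CustomerKey, CustomerName, Gender, City)
SELECT CustomerKey,
       UPPER(LTRIM(RTRIM(CustomerName))),       -- trim whitespace, normalize case
       CASE Gender WHEN 'M' THEN 'Male'
                   WHEN 'F' THEN 'Female'
                   ELSE 'Unknown' END,          -- map source codes to standard labels
       COALESCE(City, 'N/A')                    -- fill missing values
FROM StagingCustomer
WHERE CustomerKey IS NOT NULL;                  -- reject rows without a key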

3. Load

The loading process loads the extracted and transformed data, with its quality assured, into the target data warehouse. Loading comes in two kinds: the first load and the refresh load. The first load moves a large amount of data, while the refresh load applies smaller incremental batches.
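A sketch of the difference between the two kinds of load; the staging table and watermark table below are assumptions for illustration, reusing the hypothetical FactSales from earlier.

-- First load: move the full history into the fact table
INSERT INTO FactSales (DateKey, ProductKey, SalesAmount, Quantity)
SELECT DateKey, ProductKey, SalesAmount, Quantity
FROM StagingSales;

-- Refresh load: append only rows newer than the last load watermark
INSERT INTO FactSales (DateKey, ProductKey, SalesAmount, Quantity)
SELECT DateKey, ProductKey, SalesAmount, Quantity
FROM StagingSales
WHERE LoadDate > (SELECT LastLoadDate
                  FROM EtlWatermark
                  WHERE TableName = 'FactSales');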

One more thing, now with the rise of various distributed and cloud computing tools, ETL has actually become ELT. That is, the business system itself does not do the conversion work, but imports the data into the distributed platform after simple cleaning, so that the platform can perform cleaning and conversion work in a unified manner. Doing so can make full use of the distributed nature of the platform, while making the business system more focused on the business itself.

OLAP/BI tools
After the data warehouse is built, users can write SQL to access it and analyze the data in it. Writing SQL for every query would be tedious, however, and the SQL routines for analyzing dimensionally modeled data are fairly fixed. Hence OLAP tools, which are dedicated to analyzing dimensionally modeled data. BI tools display OLAP results graphically and usually appear together with OLAP tools. (Note: in this article, "OLAP tools" refers to both.)

The relationship between OLAP tools and data warehouses in normalized data warehouses is roughly as follows:

[figure: OLAP tools and the normalized data warehouse]

In this case, OLAP does not allow access to the central database. On the one hand, the central database adopts standardized modeling, while OLAP only supports the analysis of dimensional modeling data; on the other hand, the central database of the standardized data warehouse itself does not allow upper-level developers to access it. In the dimensional modeling data warehouse, the relationship between OLAP/BI tools and the data warehouse is as follows:

[figure: OLAP/BI tools and the dimensionally modeled data warehouse]


2.3 Query case

# Sample 1: dimension table query
 
SELECT TOP (10) [DateKey] '日期Key'
      ,[FullDateAlternateKey] '日期代理key'
      ,[DayNumberOfWeek] '周所在日'
      ,[EnglishDayNameOfWeek] '所在周'
      ,[DayNumberOfMonth] '月所在日'
      ,[DayNumberOfYear] '年所在日'
      ,[WeekNumberOfYear] '年所在周'
      ,[EnglishMonthName] '英文月名'
      ,[MonthNumberOfYear] '年所在月'
      ,[CalendarQuarter] '所在季度'
      ,[CalendarYear] '日历年'
      ,[FiscalQuarter] '财季度'
      ,[FiscalYear] '财年'
  FROM [AdventureWorksDW2019].[dbo].[DimDate]
  ORDER BY DateKey DESC
# Sample 2: fact table query
# View the product name, currency name, order date, customer information, sales amount, total product cost, discount amount, etc., for online sales in fiscal year 2013.
SELECT TOP (10) B.EnglishProductName, C.CurrencyName CurrencyName,
D.FrenchPromotionName FrenchPromotionName, E.FirstName, E.LastName,
A.SalesAmount, A.TaxAmt, A.TotalProductCost, A.DiscountAmount
FROM FactInternetSales A
JOIN DimProduct B
      ON A.ProductKey = B.ProductKey
JOIN DimCurrency C
    ON A.CurrencyKey = C.CurrencyKey
JOIN DimPromotion D
    ON A.PromotionKey =  D.PromotionKey
JOIN DimCustomer E
    ON A.CustomerKey = E.CustomerKey
JOIN DimDate F
    ON A.OrderDateKey =F.DateKey
WHERE F.FiscalYear=2013

0 Terms and constraints

  1. Extraction-Transformation-Loading is the process of extracting, transforming, and loading OLTP data (hereinafter referred to as ETL)

  2. The descriptions in this document all follow the order ETL → DW → CUBE → presentation

1 ETL related

1.1 Dimension table

1.1.1 Time dimension

  1. Explanation: This dimension records calendar dates; the finest granularity is the day, and it can be rolled up into weeks, months, years, and other granularities.

  2. Corresponding table: tbl_dimdate

  3. Corresponding process: pro_supportdw_dimdate

  4. Whether it is public: yes

  5. Note: a Hierarchy (with levels such as day, week, month, year) can be built in this dimension. The original figure is not available; a sketch of what such a table might look like follows this list.
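The layout below is only an illustration of a date dimension supporting such a hierarchy; the column names are assumed and do not come from the actual tbl_dimdate definition.

-- Hypothetical layout for a date dimension with a day -> month -> quarter -> year hierarchy
CREATE TABLE tbl_dimdate_example (
    date_key        INT PRIMARY KEY,  -- e.g. 20240131
    full_date       DATE,
    day_of_month    INT,
    week_of_year    INT,
    month_of_year   INT,
    quarter_of_year INT,
    calendar_year   INT
);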

1.1.2 Equipment Dimensions

  1. Description: This dimension records the equipment information. It can be divided into brands, models and other granularities.

  2. Correspondence table: tbl_dimdevice

  3. Corresponding process: pro_supportdw_dimdevice

  4. Whether it is public: No

  5. Note: a Hierarchy (layer), such as brand → model, can be built in this dimension (the original figure is not included).

1.1.3 Regional dimension

  1. Description: This dimension records the geographical information. It can be divided into countries, provinces, districts and other granularities.

  2. Correspondence table: tbl_dimgeography

  3. Corresponding process: None, manually add region data if necessary

  4. Whether it is public: No

  5. Explanation: There is no Hierarchy (layer) in this dimension (the original figure is not included).

1.1.4 Resolution dimension

  1. Description: This dimension records the resolution information.

  2. Correspondence table: tbl_dimresolution

  3. Corresponding process: pro_supportdw_dimresolution

  4. Whether it is public: No

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.1.5 Operating System Dimensions

  1. Description: This dimension records the information of the operating system.

  2. Correspondence table: tbl_dimos

  3. Corresponding process: pro_supportdw_dimos

  4. Whether it is public: No

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.1.6 Network Type Dimension

  1. Description: This dimension records the information of the network type.

  2. Correspondence table: tbl_dimnetworktype

  3. Corresponding process: None, manually maintain data

  4. Whether it is public: No

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.1.7 Carrier Dimensions

  1. Description: This dimension records the information of the operator type.

  2. Correspondence table: tbl_dimoperator

  3. Corresponding process: None, manually maintain data

  4. Whether it is public: No

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.1.8 System Dimensions

  1. Description: This dimension records system information (for example, the market [Market] and desktop [LAU] projects).

  2. Correspondence table: tbl_dimsystem

  3. Corresponding process: None, manually maintain data

  4. Whether it is public: yes

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.1.9 Package Dimensions

  1. Description: This dimension records the package information.

  2. Correspondence table: tbl_cms_apk_package_ref

  3. Corresponding process: None, manually maintain data, sourced from tbl_cms_apk_package (requires data synchronization)

  4. Whether it is public: yes

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.1.10 Manufacturer Dimensions

  1. Description: This dimension records the information of the manufacturer.

  2. Corresponding table: tbl_user

  3. Corresponding process: no

  4. Whether it is public: yes

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.1.11 System Version Dimensions

  1. Description: This dimension records the version information of the system.

  2. Correspondence table: tbl_dimappversion

  3. Corresponding process: pro_supportdw_dimappversion

  4. Whether it is public: yes

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.1.12 Dimensions of advertising resources

  1. Description: This dimension records information about resources or advertisements.

  2. Corresponding table: tbl_dimresource

  3. Corresponding process: None, manually maintain data, sourced from tbl_resource (requires data synchronization)

  4. Is public: No, unique to the inventory model

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.1.13 Differentiating Dimensions of Advertising Resources

  1. Description: This dimension records the information of resource or advertisement distinction.

  2. Correspondence table: tbl_dimadres_type

  3. Corresponding process: None, manually maintain data

  4. Is public: No, unique to the inventory model

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.1.14 Dimensions of old and new advertising resources

  1. Description: This dimension records the information of resource or advertisement distinction.

  2. Correspondence table: tbl_dimnewold

  3. Corresponding process: None, manually maintain data

  4. Is public: No, unique to the inventory model

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.1.15 System type dimension

  1. Description: This dimension records the information of the system subtype (similar to airpush type, uubao type)

  2. Correspondence table: tbl_dimsystemtype

  3. Corresponding process: None, manually maintain data

  4. Is public: No, unique to the inventory model

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.1.16 System source dimension

  1. Description: This dimension records the information of the source type of the system (the source is similar to advertising resources, manually edited)

  2. Correspondence table: tbl_dimresourcetype

  3. Corresponding process: None, manually maintain data

  4. Is public: No, unique to the inventory model

  5. Description: There is no Hierarchy (layer) in this dimension, only Level (level)

1.2 Fact table and measure (measure)

1.2.1 Market fact table and measure (measurement)
1.2.1.1 Market fact table

  1. TBL_FACTMARKET: the Market fact table. Its indicators include new users, unique (independent) users, launches, retention, etc.; its grain is the IMEI.

  2. TBL_FACTMARKET_FIN: the same indicators at the APK_ID grain.

1.2.1.2 market measure

  1. New additions: count of new Market users

  2. Unique users: count of unique (independent) Market users

  3. Launches: count of Market launches

  4. Next-day retention of Market users (computed via postUpdate; a rough sketch of such a calculation follows this list)

  5. 7-day retention of Market users (postUpdate)

  6. 15-day retention of Market users (postUpdate)

  7. 21-day retention of Market users (postUpdate)

  8. 30-day retention of Market users (postUpdate)

  9. Weekly retention rate

  10. Monthly retention rate
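The following is only a rough illustration of how a next-day retention measure might be derived; the activity table and its columns are invented for the example and do not reflect the actual TBL_FACTMARKET definition.

-- Of the users active on day D, what fraction is active again on day D+1? (illustrative)
SELECT a.activity_date,
       COUNT(DISTINCT b.imei) * 1.0 / COUNT(DISTINCT a.imei) AS next_day_retention
FROM user_activity a
LEFT JOIN user_activity b
       ON b.imei = a.imei
      AND b.activity_date = DATEADD(DAY, 1, a.activity_date)
GROUP BY a.activity_date;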

1.2.2 Inventory fact table and measure
1.2.2.1 Inventory fact table

  1. TBL_FACTADRES: the advertising resource (inventory) fact table. Its indicators include receipts, views, downloads, completed downloads, installs, etc.; its grain is the IMEI.

  2. TBL_FACTADRES_FIN: the advertising resource fact table with indicators such as views, clicks, downloads, completed downloads, installs, etc.; its grain is the APK_ID.

1.2.2.2 Ad inventory measure (metric)

  1. Receiving: statistics of receiving volume of advertising resources

  2. Viewing volume: statistics on the reading volume of advertising resources

  3. Downloads: Download statistics of inventory

  4. Downloads completed: the statistics of the completed downloads of advertising resources

  5. Installs: Install statistics for inventory

1.3 ETL

1.3.1 Market model

  1. Pro_supportdw_factmarket: Market 2.0+ fact table extraction

  2. Pro_support_oldfactmarket: Market 1.2 fact table extraction (including airpush)

  3. Pro_supportdw_loadfactmarket: Market fact table extraction summary (aggregated to the apk_id dimension)

  4. pro_supportdw_preservemarket: 2.0 retention extraction (this is PostUpdate)

1.3.2 Inventory Model

  1. pro_supportdw_factadres inventory fact table extraction

1.3.3 Vendor Model

  1. pro_supportdw_loadaggrmarket: the aggregation of the market model and the advertising resource model; its dimension is apk_id

1.4 ETL scheduling

1.4.1 Dimension table job

  1. Job corresponding process: pro_supportDW_Dim_jobs

  2. The process of including dimension tables is as follows:

pro_supportdw_dimdevice(sysdate);    -- device dimension (brand/model)

pro_supportdw_dimos(sysdate);        -- operating system dimension

pro_supportdw_dimresolution(SYSDATE); -- resolution dimension

pro_new_user_install(SYSDATE);       -- new-user information, used by AdRes to distinguish new and old users

pro_supportdw_dimresource;           -- update of newly added advertisement dimension data

1.4.2 Fact table job

1.4.2.1 market job

  1. Market Job corresponding process: PRO_Support_Market_JOBs

  2. The process of including the fact table is as follows:

pro_supportdw_factmarket

pro_support_oldfactmarket

pro_supportdw_loadfactmarket

1.4.2.2 Ad inventory job

  1. This job is included in the manufacturer (vendor) job

1.4.2.3 Manufacturer job

  1. Manufacturer Job Corresponding Process: pro_support_adres_agg_jobs

  2. The process of including the fact table is as follows:

pro_supportdw_factadres

pro_supportdw_loadaggrmarket

2 Cube related

2.1 Cube introduction
2.1.1 Cube description
An OLAP cube is an array of data understood in terms of its zero or more dimensions.

"Cube" is shorthand for the multidimensional data model.

2.1.2 Cube-related terms
1) Cube: The main object in online analytical processing (OLAP), and a technology for fast access to data in the data warehouse. A cube is a collection of data, usually constructed from a subset of the data warehouse, organized and summarized into a multidimensional structure defined by a set of dimensions and measures.

2) Dimensions: The structural characteristics of cubes. They are organized hierarchies (levels) of categories used to describe the data in fact tables. These categories and levels describe sets of similar members on which users base their analysis.

3) Measures: In a cube, a measure is a set of values based on a column in the cube's fact table, usually numeric. Measures are the central values of the cube being analyzed, i.e. the numerical data that end users focus on when browsing the cube. The measures chosen depend on the type of information end users request; common measures are sales, cost, expenses, production counts, and so on.

4) Metadata: The structural model of data and applications in the different OLAP components. Metadata describes objects such as tables in OLTP databases and cubes in data warehouses and data marts, and also records which applications reference which pieces of data.

5) Level: A level is an element of a dimension hierarchy. Levels describe the hierarchy of data, from the highest (most summarized) level of data down to the lowest (most detailed) level.

6) Data mining: Data mining lets you define models, including grouping and predictive rules, to apply to data in relational databases or multidimensional OLAP data sets. These predictive models can then be used to automate complex data analysis, discovering trends that help identify new opportunities and pick likely winners.

7) Multidimensional OLAP (MOLAP): The MOLAP storage mode stores a partition's aggregations, and a copy of its source data, in a multidimensional structure on the analysis server. Depending on the percentage and design of the partition aggregations, the MOLAP storage mode offers the potential for the fastest query response times. In short, MOLAP suits partitions of cubes that are used frequently and need fast query response.

8) Relational OLAP (ROLAP): The ROLAP storage mode stores a partition's aggregations in tables of a relational database (the one specified in the partition data source). The ROLAP storage mode can also be used for partition data without creating aggregations in the relational database.

9) Hybrid OLAP (HOLAP): The HOLAP storage mode combines the characteristics of both MOLAP and ROLAP.

10) Granularity: The level or depth of data aggregation.

11) Aggregation: An aggregation is a pre-calculated data summary; it improves query response time because the answer is prepared before the question is asked (see the sketch at the end of this section).

12) Dicing: Partition data defined by members of multiple dimensions is called a dice (cut block).

13) Slicing: Partition data defined by a member of a single dimension is called a slice.

14) Data drilling (drillthrough): The end user selects a single cell from a regular cube, virtual cube, or linked cube and retrieves a result set from that cell's source data to obtain more detailed information. This operation is data drilling.

Remarks: Mondrian is based on ROLAP
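To make the aggregation concept from item 11 concrete, a pre-computed summary can be as simple as the statement below; the table and column names are hypothetical, and SELECT ... INTO is the SQL Server way of creating a table from a query.

-- Pre-compute a summary at the (quarter, region) grain; queries that only need
-- this level can read the small aggregate table instead of the raw fact table
SELECT d.CalendarQuarter,
       g.Region,
       SUM(f.SalesAmount) AS SalesAmount
INTO   AggSalesByQuarterRegion
FROM   FactSales f
JOIN   DimDate d      ON f.DateKey = d.DateKey
JOIN   DimGeography g ON f.GeographyKey = g.GeographyKey
GROUP BY d.CalendarQuarter, g.Region;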

Data Warehouse Structure Hierarchy

[figure: data warehouse structure hierarchy]

Data Warehousing and Data Mining - Multidimensional Data Operations

Data Cube
Before introducing the specific use of OLAP tools, we must first understand this concept: Data Cube.
In the days when information had to be extracted from data by hand, analysts worked from stacks of data reports. Typically such reports are two-dimensional: tables of rows and columns. In the real world, however, we may want to analyze data from many perspectives. A data cube can be understood as a two-dimensional table extended to more dimensions.
The figure below shows a three-dimensional data cube:
[figure: a three-dimensional data cube]
Although this example is three-dimensional, more often the data cube is N-dimensional. It is implemented in two ways, which will be discussed later in this article. The star schema mentioned in the previous article is one of them. This schema is actually a bridge connecting relational tables and data cubes. But for most pure OLAP users, the object of data analysis is the data cube in this logical concept, and there is no need to delve into its specific implementation. For users of these OLAP tools, the basic usage is to first configure the dimension tables and fact tables, and then tell OLAP the dimensions and fact fields and operation types to be displayed in each query.

The five most common operations on a data cube are introduced below: slicing, dicing, pivoting (rotating), rolling up, and drilling down.

The example data cube looks like this:

[figure: example data cube]

Slicing and dicing (Slice and Dice)
Selecting a single member on one dimension of the data cube is called slicing; selecting members on two or more dimensions is called dicing.
The figure below logically shows the slicing and dicing operations:
[figure: slicing and dicing]
The SQL simulation statements for these two operations are as follows; they mainly work on the WHERE clause.

# Slicing
SELECT Locates.地区, Products.分类, SUM(数量)
FROM Sales, Dates, Products, Locates
WHERE Dates.季度 = 2
    AND Sales.Date_key = Dates.Date_key
    AND Sales.Locate_key = Locates.Locate_key
    AND Sales.Product_key = Products.Product_key
GROUP BY Locates.地区, Products.分类

# Dicing
SELECT Locates.地区, Products.分类, SUM(数量)
FROM Sales, Dates, Products, Locates
WHERE (Dates.季度 = 2 OR Dates.季度 = 3) AND (Locates.地区 = '江苏' OR Locates.地区 = '上海')
    AND Sales.Date_key = Dates.Date_key
    AND Sales.Locate_key = Locates.Locate_key
    AND Sales.Product_key = Products.Product_key
GROUP BY Dates.季度, Locates.地区, Products.分类

Pivot (Rotate)
Pivoting refers to changing the display orientation of a report or page. For the user it is a view operation; from the perspective of the SQL simulation, it merely changes the order of the fields after SELECT. The figure below logically shows the pivot operation:
[figure: pivot]
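Continuing the SQL simulation used for the other operations, a pivot of the slicing query above would simply swap the order of the selected dimensions; the underlying data is unchanged. This sketch is added for illustration and is not from the original post.

# Pivot (rotate)
SELECT Products.分类, Locates.地区, SUM(数量)
FROM Sales, Dates, Products, Locates
WHERE Dates.季度 = 2
    AND Sales.Date_key = Dates.Date_key
    AND Sales.Locate_key = Locates.Locate_key
    AND Sales.Product_key = Products.Product_key
GROUP BY Products.分类, Locates.地区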

Roll-up and Drill-down
Rolling up can be understood as "ignoring" certain dimensions; drilling down means subdividing certain dimensions further. The figure below logically shows the roll-up and drill-down operations:
[figure: roll-up and drill-down]

The SQL simulation statements for these two operations are as follows; they mainly work on the GROUP BY clause.

# Roll-up
SELECT Locates.地区, Products.分类, SUM(数量)
FROM Sales, Products, Locates
WHERE Sales.Locate_key = Locates.Locate_key
    AND Sales.Product_key = Products.Product_key
GROUP BY Locates.地区, Products.分类

# Drill-down
SELECT Locates.地区, Dates.季度, Products.分类, SUM(数量)
FROM Sales, Dates, Products, Locates
WHERE Sales.Date_key = Dates.Date_key
    AND Sales.Locate_key = Locates.Locate_key
    AND Sales.Product_key = Products.Product_key
GROUP BY Dates.季度, Locates.地区, Products.分类

4. Other OLAP operations
In addition to the basic operations above, different OLAP tools also provide their own OLAP query functions, such as drill-through; this article will not cover them one by one. A complex OLAP query is usually the result of stacking several of these OLAP operations.

OLAP architectural model
1. MOLAP (Multidimensional Online Analytical Processing)

The MOLAP architecture generates a new cube, or an actual data cube, so to speak. Its structure is shown in the figure below:
[figure: MOLAP architecture]

In the cube, each cell corresponds to a direct address, and frequently used queries are precomputed. Therefore, each query is very fast, but because the update of the cube is relatively slow, whether to use this architecture needs to be analyzed in detail.

2. ROLAP(Relational Online Analytical Processing)

The ROLAP schema does not generate actual cubes, but simulates a data cube using a star schema with multiple relational tables. Its structure is shown in the figure below:

[figure: ROLAP architecture]

Obviously, queries under this architecture are not as fast as MOLAP. Because in ROLAP, all queries are converted into SQL statements for execution. The execution of these SQL statements will involve JOIN operations between multiple tables, which is not as fast as MOLAP.

3. HOLAP(Hybrid Online Analytical Processing)

This architecture adopts a hybrid solution with comprehensive reference to MOLAP and ROLAP, putting some queries that need special speed-up into the MOLAP engine, and other queries call the ROLAP engine.

The author has noticed an interesting pattern in how many tools develop: tool A is created, and its shortcomings surface once it is in use; tool B is created to make up for them, but brings new shortcomings of its own; then tool C is created to call A or B depending on the situation. Make of that what you will...

Summary
The development of an entire data warehouse system involves many teams: data modeling, business analysis, system architecture, platform maintenance, front-end development, and so on. For those determined to work in this field, there is still a lot to learn. But for those who, like the author, want to be a good "data scientist", these data fundamentals are enough. In the author's view, a data scientist's core competitive advantages lie in three areas: data fundamentals, data visualization, and algorithmic models. Across these three, the time and cost required to learn them increase while the marginal importance of the knowledge decreases, which makes the database series and the data warehouse series the two most cost-effective to study.

Work requirements:

  • Create a database in SQL Server 2012 containing four tables; the reference table design is shown in the figure below.
  • Then slice, dice, rotate, roll up and drill down based on the above database tables.
  • Four self-built tables and multi-dimensional operations (slicing, dicing, rotation, roll-up and drill-down) performed on the tables.

[figure: reference table design]

Create the table structures and insert simulated data
This DDL was exported from a SQL Server 2012 database and is for reference only; some illustrative sample INSERT statements follow the table definitions.

**1 Sales analysis (fact) table structure**
/****** Object:  Table [dbo].[analysisTable]    Script Date: 2019/3/11 15:33:52 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[analysisTable](
	[timeID] [tinyint] NOT NULL,
	[productID] [tinyint] NOT NULL,
	[areaID] [tinyint] NOT NULL,
	[number] [int] NOT NULL,
	[money] [int] NOT NULL
) ON [PRIMARY]
GO
Region dimension table structure
CREATE TABLE [dbo].[areaTable](
	[areaID] [tinyint] IDENTITY(1,1) NOT NULL,
	[areaCou] [varchar](200) NOT NULL,
	[areaPro] [varchar](50) NOT NULL,
	[areaCity] [varchar](50) NOT NULL,
	[areaDoor] [varchar](200) NOT NULL,
 CONSTRAINT [PK_areaTable] PRIMARY KEY CLUSTERED 
(
	[areaID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Product dimension table structure
CREATE TABLE [dbo].[productTable](
	[productID] [tinyint] IDENTITY(1,1) NOT NULL,
	[productType] [nvarchar](50) NOT NULL,
	[productName] [nvarchar](50) NOT NULL,
 CONSTRAINT [PK_productTable] PRIMARY KEY CLUSTERED 
(
	[productID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Time dimension table structure
CREATE TABLE [dbo].[timeTable](
	[timeID] [tinyint] IDENTITY(1,1) NOT NULL,
	[timeYear] [varchar](50) NOT NULL,
	[timeMonth] [varchar](50) NOT NULL,
 CONSTRAINT [PK_timeTable] PRIMARY KEY CLUSTERED 
(
	[timeID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
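The rows below are made-up sample values, shown only so that the later queries have something to run against; they are not part of the original export.

-- Illustrative sample rows (values invented for demonstration)
INSERT INTO [dbo].[timeTable] (timeYear, timeMonth) VALUES ('2018', '12');
INSERT INTO [dbo].[timeTable] (timeYear, timeMonth) VALUES ('2019', '01');
INSERT INTO [dbo].[productTable] (productType, productName) VALUES (N'Drinks', N'Orange juice');
INSERT INTO [dbo].[productTable] (productType, productName) VALUES (N'Food', N'Biscuits');
INSERT INTO [dbo].[areaTable] (areaCou, areaPro, areaCity, areaDoor) VALUES ('China', 'Jiangsu', 'Nanjing', 'Store 01');
INSERT INTO [dbo].[analysisTable] (timeID, productID, areaID, [number], [money]) VALUES (1, 1, 1, 100, 500);
INSERT INTO [dbo].[analysisTable] (timeID, productID, areaID, [number], [money]) VALUES (2, 2, 1, 80, 240);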

Multidimensional data operation
Slicing operation
Operation SQL statement

select timeTable.timeMonth, productTable.productName, areaTable.areaDoor 
from analysisTable, timeTable, productTable, areaTable 
where
analysisTable.timeID = timeTable.timeID AND
analysisTable.productID = productTable.productID AND
analysisTable.areaID = areaTable.areaID AND 
analysisTable.productID = 1;

Query result: [figure not included]
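The assignment also calls for dicing, pivoting, roll-up, and drill-down. As one more sketch against the same tables (added for illustration, not part of the original), a roll-up that ignores the time and area dimensions and totals sales by product type could look like this:

select productTable.productType, SUM(analysisTable.[number]) as totalNumber, SUM(analysisTable.[money]) as totalMoney
from analysisTable, productTable
where analysisTable.productID = productTable.productID
group by productTable.productType;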

Go language basics: array, slice, map

In the Go language, in order to facilitate the storage and management of user data, its data structure design is divided into three structures: Array, Slice, and Map.

Recently, while reading up on Go language basics, I looked at how these three structures are implemented:

Array
An array is the basic data structure underlying slices and maps.
An array is a fixed-length data type allocated contiguously in memory, so access through an index is very fast.
When declaring an array, you must specify the element type and the number of elements it stores (the array's length).
An array variable's type includes both its length and its element type; only arrays whose length and element type both match can be assigned to each other.
Creation and Initialization
Once an array is declared, its data type and length cannot be changed.

// Declare an array using an array literal
array := [5]int{1, 2, 3, 4, 5}

// Let the compiler deduce the length from the number of initialized elements
array := [...]int{1, 2, 3, 4, 5, 6}

// Declare an array and initialize specific elements (index: value)
array := [5]int{1: 10, 2: 20}

Pointer elements
An array's element type can be any built-in type, a struct type, or a pointer type.

// Declare an array of 3 pointers to strings
var array1 [3]*string

// Each element must point at a string before a value can be assigned through it
s0, s1, s2 := "demo0", "demo1", "demo2"
array1[0], array1[1], array1[2] = &s0, &s1, &s2

Multidimensional arrays
An array itself is one-dimensional; a multidimensional array is built by combining arrays of arrays.

// Declare a two-dimensional array: 3 outer elements, each an array of 2 ints
var array [3][2]int

// Initialize a two-dimensional array
var array = [3][2]int{ {1, 2}, {3, 4}, {5, 6}}
Passing arrays between functions: when variables are passed between functions, what is passed is always a copy of the variable's value, so passing an array variable copies the entire array! For large data types, a function parameter should therefore be a pointer, so that a call only allocates 8 bytes of stack memory per pointer; the trade-off is that the pointed-to values can then be modified (memory is shared). In practice, slices rather than arrays should be used in most cases.

Slice
A slice is a reference type: it refers to part or all of an underlying array through its pointer field.
Slices are built around the concept of a dynamic array.
A slice grows dynamically through append.
A slice shrinks through re-slicing; the new slice obtained by re-slicing shares the underlying array with the original slice, and both pointers point at the same underlying array.
Creation and initialization
The slice type has 3 fields:

Pointer: Points to the address of the first element contained in the slice in the underlying array;
Length: The number of elements in the underlying array contained in the slice (the number of elements accessible to the slice);
Capacity: The maximum number of elements the slice is allowed to grow to, bounded by the length of the underlying array.
make and slice literals
// Use make to create a slice of length 3
slice := make([]int, 3)

// Create a slice with an explicit length and capacity
slice := make([]int, 1, 6) // length 1, capacity 6
nil and empty slice
// A nil string slice
var slice []string

// An empty integer slice
slice := []int{}
Because a slice only references its underlying array, and the underlying array's data does not belong to the slice itself, a slice needs only 24 bytes of memory on a 64-bit machine: 8 bytes for the pointer field, 8 for the length field, and 8 for the capacity field. Passing slices directly between functions is therefore efficient: only 24 bytes of stack memory need to be allocated.

The len function can return the length of the slice, and the cap function can return the capacity of the slice.

Map
A map stores an unordered set of key-value pairs.
A map is an unordered collection implemented with a hash table.
The map's hash table contains a set of buckets, and each bucket stores a portion of the key-value pairs.
Two arrays are used inside each bucket:
the first array stores the high eight bits of each key's hash, used to distinguish the individual key-value pairs stored in the bucket;
the second array is a byte array that stores all of the bucket's keys in order, followed by all of the bucket's values.
creation and initialization
// Create a map storing student information
students := map[string]string{
	"name":  "mengxiaoyu",
	"age":   "22",
	"sex":   "boy",
	"hobby": "pingpang",
}




// Display every key-value pair in the map
for key, value := range students {
	fmt.Printf("key:%s,\t value:%s\n", key, value)
}

The order in which a map's key-value pairs are traversed is random. To read a map's key-value pairs in a fixed order, collect the keys into a slice, sort the slice, and then iterate over the sorted slice, looking up each key's value in the map.

