Common Big Data Terminology

A List of Common Big Data Terms

The main content includes the following (feel free to bookmark this and share it with friends):

Snowflake model, star model, and constellation model
Fact table and dimension table
Drill-up and drill-down
Dimension degeneration
Data lake
UV and PV
User portrait
ETL
Machine learning
Big data "killing familiarity"
SKU and SPU
Ad hoc query
Data middle platform
ODS, DWD, DWS, DWT and ADS
T+0 and T+1

User portrait
A user portrait, also known as a user persona, is an effective tool for profiling target users and for connecting user demands with design direction, and it is widely used in many fields. In practice, we usually describe a user's attributes, behaviors, and expectations in the simplest, most everyday words. As a virtual stand-in for real users, the persona formed from a user portrait must not be constructed apart from the product and the market; it needs to be representative of the product's main audience and target group.
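
A toy sketch of how such a portrait is often stored in practice (every tag name and value below is invented): the user's attributes and behaviors are condensed into a set of tags that downstream systems such as recommendation or marketing can query.

```python
# A user portrait is typically a bag of tags derived from attributes and
# behavior; all names and values here are made up for illustration.
user_portrait = {
    "user_id": "u_1024",
    "demographics": {"gender": "female", "age_range": "25-30", "city": "Hangzhou"},
    "behavior": {"avg_orders_per_month": 3, "preferred_category": "electronics",
                 "last_active": "2024-03-02"},
    "derived_tags": ["price_sensitive", "night_shopper", "new_parent"],
}

# Downstream systems simply query the tags.
is_target = "price_sensitive" in user_portrait["derived_tags"]
print(is_target)
```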


Big data "killing familiarity"
"Killing familiarity" (price discrimination against existing customers) is a negative application of big data.

Different consumers have different price sensitivity and different willingness to pay, so compared with uniform pricing, differentiated pricing can increase a merchant's profit. Where an Internet platform effectively monopolizes access to customers, "killing familiarity" becomes an almost natural reaction.

"Killing familiarity" with big data means turning all kinds of consumption data into tags on the customer and then charging the profiled, loyal customers more. This kind of discrimination is easy to notice in face-to-face transactions, but much harder to identify in online commodity transactions, and it undermines both the fairness of individual transactions and social equity.

Ad hoc query
An ad hoc query (Ad Hoc) means that users can flexibly choose query conditions according to their own needs, and the system generates the corresponding statistical report from that selection. The biggest difference between an ad hoc query and an ordinary application query is that an ordinary application query is custom-developed in advance, while an ad hoc query is defined by conditions the user chooses at query time.
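
A minimal sketch of the idea, assuming the data already sits in a relational table (the table, columns, and sample values are hypothetical): the query is assembled at runtime from whichever conditions the user picks, instead of being hard-coded in advance.

```python
import sqlite3

def ad_hoc_query(conn, table, conditions):
    """Build and run a query from user-selected conditions at runtime."""
    where = " AND ".join(f"{col} = ?" for col in conditions)  # e.g. "region = ? AND channel = ?"
    sql = f"SELECT COUNT(*), SUM(amount) FROM {table}"
    if conditions:
        sql += f" WHERE {where}"
    return conn.execute(sql, list(conditions.values())).fetchone()

# Demo with an in-memory table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, channel TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("east", "app", 30.0), ("east", "web", 12.5), ("west", "app", 99.0)])

# The user picks the conditions; nothing about them is fixed in the code.
print(ad_hoc_query(conn, "orders", {"region": "east", "channel": "app"}))  # (1, 30.0)
```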

Data Lake
A data lake (Data Lake) is a large repository that stores an enterprise's raw data of every kind; the data in it can be accessed, processed, analyzed, and transferred. Hadoop (often together with table formats such as Apache Hudi) is currently the most common technology for deploying a data lake, so many people equate a data lake with a Hadoop cluster. In fact, the data lake is a concept, while Hadoop is a technology used to realize that concept.

A data lake can hold all types of data: structured, semi-structured, and unstructured; the type simply depends on the original format produced by the source system. Unstructured data (voice, images, video, etc.) is stored as-is, patterns are mined from the massive raw data and fed back to the operations teams, and the lake is backed by very strong computing power for processing that data.

The differences from a data warehouse are:

A data warehouse mainly deals with historical, structured data, and that data must fit the models the warehouse defines in advance. The indicators analyzed in a data warehouse (daily active users, new users, retention, conversion rate, etc.) are all specified in advance by product managers, and data is then analyzed on demand against those predefined models.

Data middle platform
A data middle platform is the consolidation of the business and data of existing and new information systems, and it serves as an intermediate, supporting platform for empowering new businesses and new applications with data.

In data development, the core data model changes relatively slowly and the workload of maintaining data is very large, while business innovation and data requirements change very quickly.

The data middle platform emerged to bridge exactly this mismatch: the responsiveness of data development cannot keep up with the pace of application development.

Data Mart
A data mart (Data Mart), also called a data market, is built to meet the needs of a specific department or group of users. Its data is stored in a multidimensional way: the dimensions, the indicators to be calculated, and the levels of each dimension are defined, and data cubes are generated for decision-analysis needs.

A data mart is a subset of the enterprise data warehouse, oriented to department-level business and to one specific subject. To resolve the tension between flexibility and performance, the data mart is added to the data warehouse architecture as a small, department- or workgroup-level data warehouse. It stores pre-computed data for specific users to meet their performance requirements, and it can relieve, to some extent, the bottleneck of accessing the data warehouse directly.

Features:

1. Small in size.

2. Oriented to specific applications.

3. Department-oriented.

4. Defined, designed, and developed by the business unit.

5. Managed and maintained by the business unit.

6. Can be implemented quickly.

7. Relatively cheap to purchase.

8. Quick return on investment.

9. Tight integration of toolsets.

10. Provides a detailed, pre-built summarized subset of the data warehouse.

11. Can be upgraded to a full data warehouse.

ETL
ETL stands for Extract, Transform, and Load. It refers to the process of extracting raw data, transforming it through cleaning and enrichment into a form suitable for use, and loading it into the appropriate store for systems to use. Although ETL originated with data warehouses, the same process is also used when ingesting data, for example from external sources, into big data systems.
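
A toy illustration of the three steps, assuming the raw data sits in a CSV file and the target store is a SQLite table (the file names and fields are invented for the example):

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from the source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean and enrich -- drop bad rows, normalize, derive a field."""
    out = []
    for r in rows:
        if not r["user_id"]:                      # cleaning: skip records with no user
            continue
        amount = round(float(r["amount"]), 2)     # normalize the amount
        out.append((r["user_id"].strip(), amount, amount > 100))  # enrich: big-order flag
    return out

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_order (user_id TEXT, amount REAL, is_big INTEGER)")
    conn.executemany("INSERT INTO fact_order VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect("dw.db")                   # hypothetical target database
load(transform(extract("raw_orders.csv")), conn)  # hypothetical source file
```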

Snowflake model, star model and constellation model
Star model: a multidimensional data model consisting of one fact table (Fact Table) and a set of dimension tables (Dimension Table). Each dimension table has its dimension key as the primary key, and together these dimension keys make up the composite primary key of the fact table.
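
A minimal sketch of a star model, expressed through SQLite (table and column names are made up for the example): one fact table whose foreign keys reference the surrounding dimension tables, plus a typical query that joins them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- dimension tables: one row per date / product, carrying descriptive attributes
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- fact table: one row per sales event, foreign keys plus measures
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);

INSERT INTO dim_date    VALUES (20240101, 2024, 1), (20240215, 2024, 2);
INSERT INTO dim_product VALUES (1, 'phone', 'electronics'), (2, 'mug', 'kitchen');
INSERT INTO fact_sales  VALUES (20240101, 1, 699.0), (20240101, 2, 9.9), (20240215, 1, 649.0);
""")

-- = None  # (not valid Python; remove this line if copying)
```

The query side of a star model is just a join of the fact table to its dimensions followed by an aggregation:

```python
for row in conn.execute("""
    SELECT d.month, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.month, p.category"""):
    print(row)   # e.g. (1, 'electronics', 699.0)
```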

Snowflake model: when one or more dimension tables are not joined directly to the fact table but connect to it through other dimension tables, the diagram looks like several snowflakes joined together, hence the name snowflake model. The snowflake model is an extension of the star model: it further normalizes the star model's dimension tables into hierarchies, and the original dimension tables may be split into smaller dimension tables that form local "hierarchy" areas. These decomposed tables join to the main dimension table rather than to the fact table.

Constellation model: composed of multiple fact tables that share dimension tables; each dimension table is public and can be shared by several fact tables.

Fact table
Each row in a fact table represents a business event. The term "fact" refers to a measure of the business event, for example the order amount in an order-placement event.

(1) A transactional fact table takes each transaction or event as its unit: each sales order record, payment record, and so on becomes one row in the fact table.

(2) A periodic snapshot fact table does not keep every event, only data at fixed time intervals, such as daily or monthly sales, or the account balance at the end of each month.

(3) A cumulative snapshot fact table is used to track the progress of a business process. For example, the data warehouse may store the point-in-time data of each stage of an order, from placement through packing, shipping, and sign-off, in order to track the order's lifecycle. While the business process is still in progress, the corresponding row in the fact table keeps being updated.
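
A rough sketch of the contrast with a transactional fact table, using a hypothetical order: the transactional table appends one immutable row per event, while the cumulative snapshot keeps a single row per order and updates its milestone columns as the order moves through its lifecycle.

```python
from datetime import date

# Transactional fact table: one new row per event, never updated afterwards.
transactional_facts = [
    {"order_id": 1001, "event": "placed",  "event_date": date(2024, 3, 1), "amount": 59.0},
    {"order_id": 1001, "event": "paid",    "event_date": date(2024, 3, 1), "amount": 59.0},
    {"order_id": 1001, "event": "shipped", "event_date": date(2024, 3, 3), "amount": 59.0},
]

# Cumulative snapshot fact table: one row per order, milestone columns filled in over time.
cumulative_snapshot = {
    1001: {"placed_date": date(2024, 3, 1), "packed_date": None,
           "shipped_date": None, "signed_date": None, "amount": 59.0},
}

# When the order is packed and shipped, the existing snapshot row is updated in place.
cumulative_snapshot[1001]["packed_date"] = date(2024, 3, 2)
cumulative_snapshot[1001]["shipped_date"] = date(2024, 3, 3)
```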

Dimension table
A dimension table (Dimension Table), sometimes called a lookup table (Lookup Table), is a table that accompanies the fact table; it stores the attribute values of a dimension and can be joined to the fact table. It is equivalent to extracting the attributes that would otherwise be repeated across the fact table, standardizing them, and managing them in a single table. Common dimension tables include a date table (storing attributes such as week, month, and quarter for each date), a location table (country, province/state, city, etc.), and so on. Dimensions are the foundation and soul of dimensional modeling.

There are many benefits to using dimension tables, as follows:

(1) Reduces the size of the fact table.

(2) Makes dimensions easy to manage and maintain; adding, deleting, or modifying dimension attributes does not require changing large numbers of rows in the fact table.

(3) Dimension tables can be reused by multiple fact tables, reducing duplicated work.

Drill-up and drill-down
Drill-up: bottom-up; from the current data, return to the higher-level, more aggregated data.

Drill-down: top-down; from the current data, continue down to the lower-level, more detailed data.

Drilling is one of the indispensable functions of data analysis. By changing the level of the displayed dimensions and thus the granularity of the analysis, we can focus on more (or less) detailed information in the data. It includes drill-up (roll up) and drill-down (drill down).

Drilling up aggregates and summarizes data along the dimension hierarchy, while drilling down deepens the dimension during analysis to view the data layer by layer. By drilling down layer by layer, the data becomes clear at a glance, the value behind it can be fully explored, and better decisions can be made in time.
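
A small sketch of the two operations on invented daily order counts: drilling up aggregates along the dimension hierarchies (city to province, day to month), while drilling down simply returns to the finer-grained rows.

```python
from collections import defaultdict

# Finest granularity: (province, city, day) -> order count
daily = {
    ("Guangdong", "Shenzhen",  "2024-03-01"): 120,
    ("Guangdong", "Shenzhen",  "2024-03-02"): 95,
    ("Guangdong", "Guangzhou", "2024-03-01"): 80,
    ("Zhejiang",  "Hangzhou",  "2024-03-01"): 60,
}

# Drill up: aggregate away the city and day levels -> (province, month)
monthly_by_province = defaultdict(int)
for (province, city, day), cnt in daily.items():
    monthly_by_province[(province, day[:7])] += cnt

print(dict(monthly_by_province))
# {('Guangdong', '2024-03'): 295, ('Zhejiang', '2024-03'): 60}

# Drill down: go back to the finer level for one province to see the detail.
print({k: v for k, v in daily.items() if k[0] == "Guangdong"})
```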

Dimension degeneration
The dimension tables of degenerate dimensions can be eliminated, which simplifies the schema of a dimensional data warehouse, because a simple schema is easier to understand and gives better query performance than a complex one.

A dimension can be degenerated when, apart from its key, it carries no data the data warehouse needs. The relevant data of the degenerate dimension is migrated into the fact table, and the degenerate dimension's table is then deleted.

Dimension attributes can also be stored directly in the fact table; a dimension column stored in the fact table in this way is called a degenerate dimension. Like dimensions stored in dimension tables, degenerate dimensions can be used to filter and query the fact table, to drive aggregations, and so on.
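
A small before/after sketch on invented order data: a dimension table that holds nothing but the order number can be dropped, with the order number moved into the fact table as a degenerate dimension.

```python
# Before: a dimension table that holds only the natural key -- no real attributes.
dim_order = {1: {"order_no": "SO-20240301-0001"}}
fact_sales_before = [{"order_key": 1, "product_key": 7, "amount": 59.0}]

# After: the dimension is degenerated -- its only column moves into the fact table
# and the dimension table itself is deleted.
fact_sales_after = [{"order_no": "SO-20240301-0001", "product_key": 7, "amount": 59.0}]

# The degenerate dimension can still be used to filter the fact table directly.
rows = [r for r in fact_sales_after if r["order_no"] == "SO-20240301-0001"]
print(rows)
```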

UV and PV
PV (page views): Page View, the number of page views or clicks on the site;

UV (unique visitors): Unique Visitor; each client that visits your site counts as one visitor. Visitors are distinguished by IP address (or a similar identifier), and repeated visits within a given period still count as a single UV;

UV value = sales / number of visitors, i.e. how much in sales each visitor brings. The larger the UV value, the better the product matches consumer demand, and only then will a given promotion investment bring the corresponding UV. For example, the view count at the end of this article is a UV-style count: whether you open it today or tomorrow, you add 1 to the number recorded in the background.
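
A tiny sketch of how the two are counted from an access log (the log entries and the sales figure are made up): PV counts every page view, UV counts distinct visitors, and the UV value divides sales by the number of visitors.

```python
# Each log entry: (visitor_id, page)
access_log = [
    ("u1", "/home"), ("u1", "/item/42"), ("u2", "/home"),
    ("u1", "/home"), ("u3", "/item/42"),
]

pv = len(access_log)                               # every view counts
uv = len({visitor for visitor, _ in access_log})   # distinct visitors

sales = 300.0
uv_value = sales / uv                              # sales brought per visitor

print(pv, uv, round(uv_value, 2))                  # 5 3 100.0
```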

SKU and SPU
SPU = Standard Product Unit (standardized product unit)

An SPU is the smallest unit for aggregating product information: a reusable, easy-to-retrieve set of standardized information that describes one product. In layman's terms, items with the same attribute values and characteristics can be grouped under one SPU.

SKU=stock keeping unit (stock keeping unit)

An SKU is the unit of measurement for inventory moving in and out; it can be counted in pieces, boxes, pallets, and so on.

If you ask for an iPhone 13, the clerk will follow up: which iPhone 13 do you want? 64G silver? 128G white? Every iPhone 13 has the same gross weight of about 400 g and is produced in mainland China; these two attributes belong to the SPU.

Attributes such as capacity and color, which affect price and inventory (for example, 64G and 128G have different prices, and 128G white is still in stock while green is sold out), are SKU attributes, as in the sketch after the lists below.

SPU attributes:

1. Gross weight 420.00 g

2. Made in mainland China

SKU attributes:

1. Capacity: 16G, 64G, 128G

2. Color: silver, white, rose gold
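
A small sketch of how this is commonly modeled (all product data below, including prices and stock, is invented): one SPU record holds the attributes shared by every variant, and each SKU under it carries the attributes that drive price and inventory.

```python
# SPU: attributes shared by every variant of the product.
spu = {
    "spu_id": "IPHONE13",
    "name": "iPhone 13",
    "gross_weight_g": 420.0,
    "origin": "mainland China",
}

# SKUs: the concrete sellable variants; capacity and color drive price and stock.
skus = [
    {"sku_id": "IP13-64-SILVER", "spu_id": "IPHONE13", "capacity": "64G",  "color": "silver", "price": 5399, "stock": 12},
    {"sku_id": "IP13-128-WHITE", "spu_id": "IPHONE13", "capacity": "128G", "color": "white",  "price": 5999, "stock": 3},
    {"sku_id": "IP13-128-GREEN", "spu_id": "IPHONE13", "capacity": "128G", "color": "green",  "price": 5999, "stock": 0},
]

# Inventory and price are tracked per SKU, never per SPU.
in_stock = [s["sku_id"] for s in skus if s["stock"] > 0]
print(in_stock)
```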

ODS, DWD, DWS, DWT and ADS
ODS layer: keeps the data in its original form without any modification, serving as a backup of the raw data.

DWD layer: builds the dimensional model, generally using the star model; the overall picture that emerges is usually a constellation model.

DWS layer: service data layer; stores the daily summarized behavior of each subject, such as the number and amount of orders placed in each region that day.

DWT layer: stores the cumulative behavior of each subject, such as the number and amount of orders placed in a region over a recent window (7, 15, 30, or 60 days).

In short, the DWS layer holds daily tables and the DWT layer holds cumulative values.

ADS layer: application data layer, i.e. the indicator layer.
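
A rough sketch of the DWS/DWT distinction (field names and numbers are invented): DWS keeps one summarized row per subject per day, and DWT accumulates those daily rows over a longer window.

```python
from collections import defaultdict

# DWS: daily summary -- one row per (region, day).
dws_region_daily = [
    {"region": "east", "dt": "2024-03-01", "order_cnt": 40, "order_amount": 900.0},
    {"region": "east", "dt": "2024-03-02", "order_cnt": 55, "order_amount": 1200.0},
    {"region": "west", "dt": "2024-03-01", "order_cnt": 20, "order_amount": 450.0},
]

# DWT: cumulative behavior per subject -- here, totals over the whole window.
dwt_region_topic = defaultdict(lambda: {"order_cnt": 0, "order_amount": 0.0})
for row in dws_region_daily:
    dwt_region_topic[row["region"]]["order_cnt"] += row["order_cnt"]
    dwt_region_topic[row["region"]]["order_amount"] += row["order_amount"]

# The ADS layer would then expose these as report-ready indicators.
print(dict(dwt_region_topic))
```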

T+0 and T+1
The concepts of T+0 and T+1 first came from the stock market, where they are trading rules in China's stock market. Under T+0 trading, stocks bought on a given day can be sold the same day, and stocks sold that day can be bought back the same day; under T+1, stocks bought today can only be sold from the next trading day.

In big data: T+0 refers to processing today's data in real time, while T+1 refers to processing yesterday's data (with a one-day delay).

Machine learning
The branch of artificial intelligence concerned with machines learning from the tasks they perform and improving themselves over time.

MapReduce
A software framework and programming model for processing large-scale data in parallel (Map: mapping, Reduce: reduction/aggregation).
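
The classic word-count illustration of the map/shuffle/reduce idea, sketched in plain Python rather than with the Hadoop framework itself:

```python
from collections import defaultdict

lines = ["big data big value", "data lake and data warehouse"]

# Map: each input line is mapped to (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in groups.items()}

print(word_counts)   # {'big': 2, 'data': 3, 'value': 1, ...}
```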

Real-time data
Data that is created, processed, stored, analyzed, and displayed within milliseconds.


Origin blog.csdn.net/weixin_44976611/article/details/129276332