NCRE Level 3 Database Review 12: Data Warehouse and Data Mining

Future Education, Chapter 14 Topic Notes: Data Warehouse and Data Mining

1. Association rule mining discovers connections between different commodities in a transaction database; as an unsupervised learning algorithm, it does not require category labels to be specified in advance.
2. The data warehouse is a new technology for storing and organizing data that emerged in order to build a new analytical processing environment.
Among its characteristics are non-updatability and time-variance.
Non-updatability: when users extract data from the warehouse for analysis, they do not update the data in the warehouse at the same time.
Time-variance: the data is refreshed and processed at regular intervals.
3. A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data.
4. Granularity refers to the level of refinement or summarization of the data stored in the data units of the data warehouse. The more refined the data, the smaller the granularity.
5.
OLTP (micro analysis) [immediate]: the daily online operation of the database, usually the query or modification of individual records. It requires rapid response to user requests and places very high demands on data security, integrity, and transaction throughput. [Serves an enterprise's middle- and lower-level operational staff.]
OLAP (macro analysis) [global]: the query and analysis of data, usually of massive volumes of historical data. The amount of data accessed is very large, and the query and analysis operations are very complex. [Serves middle and upper management and decision-makers.]
7. Classification: since there are training sets and test sets, there are also existing category labels. Through learning, an objective function f is obtained that maps each attribute set x to a predefined class label y. [Supervised learning.]
Clustering: groups data objects based on information found in the data that describes the objects and their relationships, so that objects within a group are similar to each other while objects in different groups differ. The greater the similarity within groups and the greater the difference between groups, the better the clustering. [Unsupervised learning.]
Association rule mining: discovers meaningful connections hidden in large data sets.
Multidimensional analysis: management decision-makers at all levels perform fast, flexible, complex query analysis and processing of the data in the data warehouse from multiple angles and dimensions.
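The supervised classification task above (learn a function f mapping attributes x to a predefined label y) can be illustrated with a toy sketch. Here a 1-nearest-neighbour rule stands in for the learned f; the training data and labels are hypothetical, not from the notes:

```python
def train_1nn(train_set):
    """Learn an objective function f that maps an attribute set x to a
    predefined class label y -- here via 1-nearest-neighbour on a
    labelled training set (a toy stand-in for a real classifier)."""
    def f(x):
        # Pick the training example closest to x (squared Euclidean distance)
        # and return its class label.
        nearest = min(train_set,
                      key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
        return nearest[1]
    return f

# Hypothetical labelled samples: (attribute tuple, class label).
train = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((8.0, 9.0), "high")]
f = train_1nn(train)
print(f((1.1, 0.9)))  # low
print(f((7.5, 8.5)))  # high
```

Because the labels are given in advance, this is supervised learning; clustering, by contrast, would group the attribute tuples without any labels.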
10. Association rules are implication expressions of the form X -> Y. Their strength can be measured by support (s) and confidence (c).
Support determines how often a rule applies to a given data set; confidence determines how frequently Y appears in transactions that contain X.
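These two measures are easy to compute directly. A minimal sketch over a hypothetical transaction database (the item names are made up for illustration):

```python
# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """conf(X -> Y) = support(X union Y) / support(X):
    how often Y appears in transactions that already contain X."""
    return support(set(x) | set(y), db) / support(x, db)

print(support({"milk", "diapers"}, transactions))               # 0.6
print(confidence({"milk", "diapers"}, {"beer"}, transactions))
```

With these five transactions, s({milk, diapers}) = 3/5 and conf({milk, diapers} -> {beer}) = 2/3.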

11. Metadata is data about data, i.e., data that describes data. Metadata describes the structure, content, keys, and indexes of the data.
In a relational database, this description takes the form of definitions of objects such as databases, tables, and columns.
12. Commonly used OLAP multidimensional analysis operations include slicing, dicing, pivoting (rotation), drilling down, and rolling up.
Roll-up: performs aggregation on the data cube, either by climbing up a dimension hierarchy or by eliminating one or more dimensions, in order to view more generalized data.
Drill-down: either by descending a dimension hierarchy or by introducing one or more dimensions, examines the data in finer detail. [e.g., Year -> Month]
Slicing and dicing display a local portion of the data, helping users pick it out of a mass of mixed data.
Pivoting (rotation) changes the orientation of the dimensions.
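The roll-up along the Year -> Month hierarchy mentioned above can be sketched in a few lines: aggregating month-level facts up to year level by summing over the eliminated dimension (the sales figures here are invented for illustration):

```python
from collections import defaultdict

# Hypothetical fact data at month granularity: (year, month, sales).
facts = [
    (2020, 1, 100), (2020, 2, 150), (2020, 12, 300),
    (2021, 1, 120), (2021, 6, 180),
]

def roll_up(rows):
    """Roll up from the month level to the year level by aggregating
    (summing) over the eliminated Month dimension."""
    totals = defaultdict(int)
    for year, _month, sales in rows:
        totals[year] += sales
    return dict(totals)

print(roll_up(facts))  # {2020: 550, 2021: 300}
```

A drill-down is the inverse direction: starting from yearly totals, re-introducing the Month dimension to see the finer rows again.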
13. In a data warehouse, metadata is mainly divided into technical metadata and business metadata.
16. MOLAP refers to OLAP based on multidimensional databases; the core of this kind of OLAP is multidimensional database technology.
20. OLAP is the English abbreviation of On-Line Analytical Processing. It still uses a DBMS to access data.
21. Four characteristics of data warehouse:
①Subject-oriented: the data in the data warehouse is organized by subject domain.
②Integrated: transaction-oriented operational databases are usually tied to specific applications, independent of one another, and often heterogeneous; the data in the data warehouse is obtained by systematically processing, summarizing, and organizing the originally scattered database data after extraction and cleaning.
③Stable (non-volatile).
④Reflects historical changes (time-variant).
22. The data in a data warehouse comes from a variety of data sources. Before source data is loaded into the warehouse, it must undergo a certain amount of transformation; the main tasks of transformation are converting data granularity and reconciling inconsistent data.
27. Data warehouse design adopts subject-oriented design methods.
28. OLAP is mainly used to support complex analysis operations. There are three main implementation approaches: ROLAP [Relational, based on relational databases], MOLAP [Multidimensional, based on multidimensional databases], and HOLAP [Hybrid, a mixture of the two].
29. The ODS (Operational Data Store) is an optional part of the data warehouse system architecture. It has some features of a data warehouse and some features of an OLTP system.
①Class I ODS: the data update frequency is at the second level.
②Class II ODS: the data update frequency is at the hour level.
③Class III ODS: the data update frequency is at the day level.
④Class IV ODS: classified according to the direction and type of its data sources.
30. Knowledge discovery mainly consists of three steps: data preparation, data mining, and interpretation and evaluation of the results.
32. The data warehouse will not be updated in real time.
33. The smaller the granularity, the higher the level of detail, the lower the degree of integration, and the larger the amount of data.
34. The structure of the data warehouse adopts a three-level data model:
①Conceptual model: the business model, produced by joint analysis among business decision-makers, business domain experts, and IT experts.
②Logical model: the bridge between the conceptual model above it and the physical model below it.
③Physical model: mainly covers the data warehouse's hardware and software configuration, resource situation, and the data warehouse model.
41. The ETL tools used to move data from the operational environment into the data warehouse typically perform the processing operations of extraction, transformation, and loading.
43. A decision support system generally refers to an information system that assists decision-making on important business or affairs based on the data inside the enterprise.
44. The K-means algorithm is a typical distance-based clustering algorithm. It uses distance as the evaluation index of similarity: the closer two objects are, the greater their similarity is considered to be.
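A minimal K-means sketch makes the distance-as-similarity idea concrete. This is a bare-bones illustration, not a production implementation; the sample points, the iteration count, and the seed are all arbitrary choices:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: distance is the evaluation index of similarity --
    the closer two objects are, the more similar they are considered."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # pick k initial centers at random
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        for j, c in enumerate(clusters):
            if c:
                centers[j] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centers, clusters

# Hypothetical unlabeled points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers, clusters = kmeans(points, k=2)
```

Note that the input points carry no labels, which is why (as item 45 states) clustering counts as unsupervised learning.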
45. The objects processed by the clustering algorithm are generally unlabeled, so clustering is generally called an unsupervised learning method.
50. The main reason for extracting data from the OLTP system with an extraction program [i.e., ETL] for data analysis is to resolve the performance conflict between OLTP applications and analytical applications.
51. Given a sales transaction database, finding the relationships between some items and other items within those transactions is the kind of data mining called association mining.
56. In classification and prediction tasks, the data generally used include the training set, the test set, and the validation set.
57. A snapshot is a fully usable copy of a data set; the copy includes an image of the corresponding data at a certain point in time. It reflects the data at that point in time and is immutable.
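The point-in-time, immutable character of a snapshot can be sketched with a deep copy: later changes to the live data do not show up in the copy. The data values here are made up for illustration:

```python
import copy

# Hypothetical live operational data that keeps changing.
live = {"balance": 100, "items": ["a", "b"]}

# Take a snapshot: a fully usable copy reflecting the data at this
# point in time. A deep copy ensures nested structures are copied too.
snapshot = copy.deepcopy(live)

# The live data continues to change...
live["balance"] = 250
live["items"].append("c")

# ...but the snapshot still shows the state at the moment it was taken.
print(snapshot)  # {'balance': 100, 'items': ['a', 'b']}
```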
64. In the maintenance strategy, the strategy of updating only when the user finds that the data has expired is called the delayed maintenance strategy.
77. In granularity design, subject data of different granularities is stored within the available storage space so as to satisfy, as far as possible, the multi-angle, multi-level data query requirements of various applications, while improving query efficiency on the subject.
The smaller the granularity: the lower the degree of integration, the higher the level of detail, the more kinds of queries that can be answered, the larger the amount of data, the greater the space cost, and the higher the degree of transaction concurrency.
80. In a data warehouse, maintaining data by applying the amount of change in the data source to the maintenance object's existing data is called incremental maintenance.
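Incremental maintenance can be sketched as applying source deltas to a derived aggregate instead of recomputing it from the full source. The product names and amounts below are hypothetical:

```python
# Hypothetical materialized aggregate: total sales per product.
totals = {"apple": 10, "pear": 5}  # current state of the derived data

def apply_delta(totals, product, change):
    """Incremental maintenance: update the derived data using only the
    amount of change from the source, not a full recomputation."""
    totals[product] = totals.get(product, 0) + change
    return totals

apply_delta(totals, "apple", 3)   # a new sale of 3 apples arrives
apply_delta(totals, "plum", 7)    # a product not seen before
print(totals)  # {'apple': 13, 'pear': 5, 'plum': 7}
```

Full (non-incremental) maintenance would instead rescan every source row each time, which is exactly the cost this strategy avoids.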
81. The analysis operation of switching from a high-granularity data view to a low-granularity data view is called the drill (drill-down) operation.
84. The minimum level of detail involved in user queries, the average performance requirements of user queries, the system's available storage space, and the scale of the low-granularity data are all major considerations in granularity design.

Mind Map

[mind map image not included]

Origin blog.csdn.net/TOPic666/article/details/115263522