Notes and reflections on the big data warehouse

To be filled in when I have time.

1. Project

2. Hadoop cluster setup process

  1. Turn off the firewall of the virtual machine
  2. Install the JDK
  3. Modify the hostname
  4. Install SSH and configure passwordless login
  5. Modify the hosts file
  6. Set up time synchronization
  7. Upload the Hadoop package and unpack it
  8. Configure environment variables
  9. Configure core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, hadoop-env.sh, and slaves (workers)
  10. Format the NameNode (a small connectivity check follows this list)
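After formatting and starting the cluster, one quick way to sanity-check the setup is to list the HDFS root directory from a Java client. This is only a minimal sketch; the NameNode address hdfs://node01:9000 is a placeholder and should match the fs.defaultFS value in your own core-site.xml.

```java
// Minimal HDFS connectivity check (sketch). Assumes the Hadoop client libraries
// are on the classpath and the cluster has already been started.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: use the fs.defaultFS configured in your core-site.xml.
        conf.set("fs.defaultFS", "hdfs://node01:9000");
        try (FileSystem fs = FileSystem.get(conf)) {
            // Listing the root directory confirms that the NameNode is reachable.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```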

3. MapReduce (MR) process

1. When an MR program is started, MRAppMaster is launched first. Based on the description of the job, MRAppMaster calculates the number of map task instances needed and then applies to the cluster to start the corresponding number of map task processes.

2. After a map task process starts, it processes the data within the slice (split) assigned to it. The main flow is:

2.1 Use the user-specified InputFormat to obtain a RecordReader and read the data into input KV pairs.

2.2 Pass the input KV pairs to the user-defined map() method, perform the logical processing, and collect the KV pairs output by map() into an in-memory buffer.

2.3 The KV pairs in the buffer are partitioned by key and sorted, then spilled (overflow-written) to disk files.

3. After MRAppMaster detects that all map task processes have completed, it starts the number of reduce task processes specified by the user and tells each reduce task process the range of data (data partition) it should handle.

4. After a reduce task process starts, it fetches the output files of several map tasks from the machines where those map tasks ran, based on the data locations notified by MRAppMaster. It merges and sorts these files locally, groups the KV pairs with the same key, calls the user-defined reduce() method to perform the logical processing, collects the output KV results, and finally calls the user-specified OutputFormat to write the result data to external storage. A minimal WordCount sketch of this flow is shown below.
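The following is a minimal WordCount sketch of the map/shuffle/reduce flow described above, using the standard Hadoop MapReduce API. The input and output paths are hypothetical and are passed as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: the RecordReader feeds one line at a time; we emit <word, 1> pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // collected into the buffer, later spilled to disk
            }
        }
    }

    // Reduce task: receives all values of one key as a group and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path (hypothetical)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path (hypothetical)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Running it, for example with `hadoop jar wordcount.jar WordCount /input /output` (jar name and paths are placeholders), exercises exactly the steps above: map tasks read splits via the InputFormat, spill sorted partitions to disk, and reduce tasks merge those partitions, group by key, and write results via the OutputFormat.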

4. Data warehouse modeling

Three major models

5. The three normal forms

The normal forms are: First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF, also known as Perfect Normal Form). The normal form that meets the minimum requirement is First Normal Form (1NF). A design that satisfies further requirements on top of 1NF is in Second Normal Form (2NF), and so on by analogy. Generally speaking, a database only needs to satisfy Third Normal Form (3NF), so only the knowledge related to the first three normal forms is recorded here.

1. 1NF: Fields are indivisible; each field is atomic. For example, the first field in the earlier example is ID, which means ID cannot be split into two fields; likewise, you cannot stuff a person's ID, name, and class number all into one field. That would be inappropriate and would have a great impact on future applications.

2. 2NF: There is a primary key, and non-primary-key fields depend on the primary key. The ID field is the primary key; it indicates that each row of data is unique. "Unique" means duplicates are not allowed. In practice, a field is often adjusted to guarantee its uniqueness and then set as the primary key.

3. 3NF: Non-primary-key fields cannot depend on each other. How should this be understood? For example, in the student table the class number depends on the student number. If you also insert the head teacher, the math teacher, and other class information into this table, would that be appropriate? Definitely not, because a class has many students, so the head teacher and math teacher of each class would be repeated across many rows. The ideal design is that one row of class information corresponds to one head teacher and one math teacher, which is also easier to understand; this forms a separate class table. Which field then associates the student table with the class table? It is classNo, which is also called the foreign key between the two tables. Constraints will be discussed later, where Lao Han will focus on them; for now, readers only need a general understanding. A small sketch of this decomposition follows.
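Below is a minimal sketch of the 3NF decomposition described above, expressed with plain Java records rather than SQL DDL. The table and field names (Student, ClassInfo, classNo, headTeacher, mathTeacher) are hypothetical examples, and Java 16+ is assumed for record support.

```java
// Sketch of splitting a student table that violates 3NF into two tables (hypothetical names).
public class ThirdNormalFormSketch {

    // Violates 3NF: headTeacher and mathTeacher depend on classNo, not on the student id.
    record DenormalizedStudent(int id, String name, String classNo,
                               String headTeacher, String mathTeacher) {}

    // 3NF: the student table keeps only the foreign key classNo ...
    record Student(int id, String name, String classNo) {}

    // ... and the class table holds the attributes that depend on classNo.
    record ClassInfo(String classNo, String headTeacher, String mathTeacher) {}

    public static void main(String[] args) {
        ClassInfo c1 = new ClassInfo("C01", "Ms. Wang", "Mr. Li");
        Student s1 = new Student(1, "Zhang San", c1.classNo());
        Student s2 = new Student(2, "Li Si", c1.classNo());
        // The teacher information is stored once, not repeated in every student row.
        System.out.println(s1 + " -> " + c1);
        System.out.println(s2 + " -> " + c1);
    }
}
```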

6. Data Mart and Data Warehouse

6.1. Concept of data warehouse and data mart

Data warehouse: an integrated, subject-oriented data collection designed to support the DSS (Decision Support System) function. In the data warehouse, each data unit is related to a specific point in time. The data warehouse includes atomic-level data and lightly summarized data. In short, a data warehouse is a subject-oriented, integrated, non-volatile (stable), and time-variant collection of data used to support the decision-making process in business management.
The data warehouse should not be understood simply as a set of software. Building a data warehouse is a process of rebuilding the data flow and information flow of the enterprise, in which a decision-support environment is constructed for the enterprise, as distinguished from the operational environment built by the original business systems. The value of a data warehouse lies not in the amount of data stored in it, but in the quality of the information and analysis results that can be obtained from it.
Data mart: a small, department- or workgroup-level data warehouse. There are two types of data marts: independent and dependent. An independent data mart obtains data directly from the operational environment, while a dependent data mart obtains data from the enterprise data warehouse. From a long-term perspective, dependent data marts are architecturally more stable than independent ones.
The existence of independent data marts can create an illusion: it seems that data marts can be built independently first, and once they reach a certain scale they can be directly converted into a data warehouse. This is incorrect. The accumulation of multiple independent data marts cannot form an enterprise-level data warehouse; this is determined by the characteristics of the data warehouse and the data mart themselves. If an enterprise bypasses a centralized data warehouse and builds multiple independent data marts, it will only add more islands of information and still will not be able to analyze data from the perspective of the entire enterprise. Data marts are used by individual departments or workgroups, and inconsistencies will arise between marts. Of course, independent data marts do exist in practice as analytical environments built to meet the needs of specific users; but in the long run they are an expedient measure and will inevitably be replaced by an enterprise-level data warehouse.

6.2. The difference between data warehouse and data mart

(Figure: comparison of data structure and data granularity in a data warehouse and a data mart)

As can be seen from the figure, the data structure in the data warehouse adopts the normalized model (relational database design theory), while the data structure of the data mart adopts the star model (multidimensional database design theory). The granularity of the data in the data warehouse is finer than that of the data mart. The figure only reflects these two characteristics, data structure and data content; other differences are shown in the following table, using a simple bank example for illustration.

(Table: other differences between a data warehouse and a data mart)
Suppose a branch-level data warehouse is built for a bank, and then a data mart is built for the branch's international business department. The data in the warehouse comes from the bank's business systems, including savings, cards, personal loans, Forex Treasure, intermediary business, and so on; the subjects of analysis include customers, channels, products, and so on. The granularity of the data in the warehouse is determined by the analysis requirements: it generally includes detailed historical records (deposits, withdrawals, foreign exchange transactions, POS consumption, intermediary business payment records), which are then summarized at the day/week/month/quarter levels, with the exact granularity driven by the analysis needs. In addition, the data warehouse also stores some business logic, that is, indicators calculated for analysis, such as customer value or customer loyalty. These indicators cannot be computed by any single business system and must take all businesses into account, which is one of the advantages of a data warehouse system. Assuming the whole branch has 200,000 customers, the warehouse will contain the historical data, summary data, and indicator data for all businesses of those 200,000 customers, and the data volume will reach tens or even hundreds of gigabytes (which is still only a very small data warehouse). To satisfy the queries and analysis of users from all departments of the bank, the data warehouse can only adopt a normalized design, so that whatever the users need, it can be satisfied as long as the data exists.

Assume that the international business department serves 20,000 customers (Forex Treasure users). If no data mart is built, they will query the data warehouse directly, for example for the distribution of last year's Forex Treasure transactions across transaction channels (counter, online, telephone banking, etc.). The query efficiency and performance are very low, and if users from every department query the data warehouse directly, its performance will degrade further and fail to meet users' needs; no one is willing to wait minutes or even hours for a simple query. Therefore it is very necessary to build department-level data marts, mainly for performance reasons. The data mart of the international business department contains the foreign exchange transaction history of the 20,000 customers as well as summaries, organized as a star schema (or snowflake, or a mixture of the two) to facilitate query and analysis with OLAP tools. A minimal sketch of such a star schema follows.

From this simple example it can be seen that the data in a data mart comes from the data warehouse and is mainly reorganized and summarized data. Therefore, multiple data marts cannot constitute an enterprise-level data warehouse. To borrow Inmon's analogy: you cannot pile up the small fish in the sea to form a big whale. This also illustrates the essential difference between a data warehouse and a data mart.
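The following is a minimal sketch of the star schema mentioned above for the international business department's data mart, again using plain Java records in place of table DDL. All table and field names (FxTransactionFact, CustomerDim, ChannelDim, DateDim and their attributes) are hypothetical illustrations, not the bank's actual model.

```java
// Star schema sketch (hypothetical): one fact table referencing three dimension tables.
public class StarSchemaSketch {

    // Dimension tables: descriptive attributes, one row per customer / channel / day.
    record CustomerDim(long customerId, String name, String segment) {}
    record ChannelDim(int channelId, String channelName) {} // counter, online, telephone banking ...
    record DateDim(int dateId, int year, int month, int day) {}

    // Fact table: foreign keys to the dimensions plus an additive measure.
    record FxTransactionFact(long customerId, int channelId, int dateId,
                             String currencyPair, double amount) {}

    public static void main(String[] args) {
        ChannelDim online = new ChannelDim(2, "online");
        CustomerDim cust = new CustomerDim(10001L, "Zhang San", "Forex Treasure");
        DateDim day = new DateDim(20201001, 2020, 10, 1);
        FxTransactionFact fact = new FxTransactionFact(
                cust.customerId(), online.channelId(), day.dateId(), "USD/CNY", 5000.0);
        // An OLAP query such as "last year's transactions by channel" becomes a simple
        // group-by over the fact table joined to ChannelDim and DateDim.
        System.out.println(fact);
    }
}
```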
Following the concepts of data warehouse and data mart, data warehouse design methods are also divided into three types: top-down, bottom-up, and a mixture of the two. Top-down means first building the enterprise-level data warehouse and then building the individual data marts; bottom-up is the opposite; and the hybrid method requires that the structure and content of the enterprise-level data warehouse be considered when the data marts are built.

7. School experience

8. Rhetorical link

Origin blog.csdn.net/qq_42706464/article/details/109048596