The relationship between collection projects and data warehouse projects

1. Introduction

Collection project and data warehouse project
Data collection and the data warehouse are the two core functional modules of an enterprise data management platform. They are relatively independent of each other and can be developed separately.

Differences

Function

Collection: collects and transmits data
Data warehouse: stores data

Process

Database->Data collection->Data warehouse->Visual interface

2. Data warehouse

What is a data warehouse?

1. Name

  • Database: database

  • Data warehouse: data warehouse

2. Data source

  • Database: the core data of the enterprise's business systems
  • Data warehouse: the data held in databases (a database holds less data, while the data warehouse holds more)

3. Differences in data storage

Database: the main operation is querying. Data is stored in row format, which is not suited to massive data volumes (row format slows down analytical queries).
Data warehouse: exists to process and analyze data and to present the results visually. Data is stored in column format, which can hold massive data volumes. (The more data there is, the more accurate the analysis results.)
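
To make the row versus column distinction concrete, here is a minimal sketch in plain Python. The order table, its column names, and the `to_columns` helper are made up for illustration; real storage engines are far more involved, so treat this only as a picture of the two layouts.

```python
# Toy order table; the column names and values are invented for illustration.
rows = [
    {"order_id": 1, "user": "a", "amount": 30.0},
    {"order_id": 2, "user": "b", "amount": 12.5},
    {"order_id": 3, "user": "a", "amount": 7.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: fetching one complete record is a single lookup.
order_2 = rows[1]

# Row layout: an aggregate has to walk every record and pick out one field.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout: the same aggregate reads one list and touches nothing else.
total_from_columns = sum(columns["amount"])
```

Both totals are the same; the difference is how much unrelated data has to be read to get them, which is why the column layout suits statistical analysis.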

4. Differences in data value

  • Database: supports the operation of the entire business (every business process runs on top of the database)
  • Data warehouse: provides data support for business decisions through statistical results. The data warehouse is the transit station, and visualization is the end point.

3. Questions

  1. Why doesn't the data warehouse use the database directly as its data source?
  • The database uses row-based storage, which is not conducive to statistical analysis.
  • A database cannot store massive amounts of data; some data is stored in files, and a data warehouse needs massive amounts of data.
  • If the database were used as the data source, the data warehouse would take up too many database resources and affect business processing.
  2. What should you pay attention to when connecting a database to a data source?
  • The database continuously transfers data to the data source.
  • The data source holds far more data than the database.
  • Files in the database and in the data source have the same content but different sizes.
  3. How is statistical analysis of the data processed?
  • Many statistics share repeated computations and data, so intermediate results can be stored, much like Spark's cache.
  • The data warehouse saves intermediate calculation results in tables (Hive on HDFS).
  • This can be implemented with Spark SQL or Hive SQL.
  4. Why is sending data directly from the database to the data source considered highly coupled?
    Data collection cannot be developed until the data source is fully developed.
  5. Why is the data source directly in tabular format?
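
The idea in question 3, computing a shared intermediate result once and reusing it for several statistics, can be sketched in plain Python. This is only an illustration: the order data and all function names are hypothetical, and in a real warehouse the intermediate result would be a Hive table rather than a dict.

```python
from collections import defaultdict

# Hypothetical order details: (day, user, amount).
order_details = [
    ("2023-08-01", "a", 30.0),
    ("2023-08-01", "b", 12.5),
    ("2023-08-02", "a", 7.5),
]


def build_daily_user_totals(details):
    """Intermediate table: total amount per (day, user), computed once."""
    totals = defaultdict(float)
    for day, user, amount in details:
        totals[(day, user)] += amount
    return dict(totals)


def revenue_per_day(daily_user_totals):
    """First statistic, derived from the intermediate table."""
    result = defaultdict(float)
    for (day, _user), amount in daily_user_totals.items():
        result[day] += amount
    return dict(result)


def revenue_per_user(daily_user_totals):
    """Second statistic, derived from the same intermediate table."""
    result = defaultdict(float)
    for (_day, user), amount in daily_user_totals.items():
        result[user] += amount
    return dict(result)


# The raw details are scanned once; both statistics reuse the stored result.
daily_user_totals = build_daily_user_totals(order_details)
per_day = revenue_per_day(daily_user_totals)
per_user = revenue_per_user(daily_user_totals)
```

Neither statistic goes back to the raw details, which is the same saving the data warehouse gets from keeping intermediate tables.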

4. Data collection

1. Introduction

The process of transferring data from the database to the data source of the data warehouse is called data collection.

2. Process

Database->Collection->Data Source

1. HDFS

Database->HDFS->Data Source
Data from the database is stored directly in HDFS. The data warehouse also sits on an HDFS cluster, so this is convenient. HDFS acts as middleware here, so that development on the database side does not depend on the development of the data source.
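
The decoupling can be sketched as follows. This is a rough illustration only: a local temporary directory stands in for HDFS, and the `collect` and `load` functions, the `table/day` path layout, and the JSON-lines file format are all assumptions rather than anything the original setup prescribes.

```python
import json
import tempfile
from pathlib import Path


def collect(rows, staging_dir, table, day):
    """Collection side: write one table's rows for one day as a JSON-lines file."""
    target = Path(staging_dir) / table / day
    target.mkdir(parents=True, exist_ok=True)
    out_file = target / "part-0.jsonl"
    with out_file.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    return out_file


def load(staging_dir, table, day):
    """Warehouse side: read whatever files exist under the agreed path."""
    target = Path(staging_dir) / table / day
    loaded = []
    for path in sorted(target.glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            loaded.extend(json.loads(line) for line in handle if line.strip())
    return loaded


# A temporary local directory plays the role of the shared storage layer.
with tempfile.TemporaryDirectory() as staging:
    collect(
        [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}],
        staging, "orders", "2023-08-01",
    )
    loaded_rows = load(staging, "orders", "2023-08-01")
```

The two functions never call each other. They share only a path convention and a file format, so either side can be developed and tested before the other exists.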

2. Database->HDFS

The database holds tabular data, so tools such as DataX and Maxwell are needed to convert the two-dimensional tables in the database into files and store them in HDFS.
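
As a rough picture of what this conversion produces, the sketch below exports a small table in two ways: a full snapshot as delimited text lines, and a single change as a JSON document. It does not call DataX or Maxwell. An in-memory SQLite table stands in for the business database, and the function names, the `shop` database name, and the JSON field names are illustrative assumptions.

```python
import json
import sqlite3

# In-memory SQLite table standing in for a business database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", 30.0), (2, "b", 12.5)],
)


def snapshot_lines(connection, table):
    """Full export: one tab-separated text line per row (batch style)."""
    # The table name is interpolated only because this is a toy example.
    cursor = connection.execute(f"SELECT * FROM {table} ORDER BY id")
    return ["\t".join(str(value) for value in row) for row in cursor]


def change_event(database, table, kind, row):
    """Incremental export: one JSON document per changed row."""
    return json.dumps(
        {"database": database, "table": table, "type": kind, "data": row}
    )


lines = snapshot_lines(conn, "orders")
event = change_event("shop", "orders", "insert", {"id": 3, "user": "a", "amount": 7.5})
conn.close()
```

Either output is a plain file format that can be written to HDFS, which is what lets the tabular data leave the database.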

Origin blog.csdn.net/qq_42265608/article/details/132500982