The complete process of big data offline analysis

1. The complete process of big data offline analysis usually consists of the following stages:

1. Data collection: collect data from different sources. Components that can be used include:
Flume: for efficiently collecting, aggregating, and moving large amounts of log data.
Kafka: a distributed messaging system for collecting, buffering, and transporting streaming data.
Sqoop: for importing data from relational databases into Hadoop.
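To make the collection stage concrete, here is a minimal sketch of what a Sqoop-style import does: read rows from a relational table and emit them as comma-delimited text records, the default layout Sqoop lands in HDFS. The table and column names are invented for illustration, and an in-memory SQLite database stands in for a production RDBMS; a real import would run `sqoop import --connect ...`.

```python
import sqlite3

def export_table_as_text(conn, table):
    """Dump every row of `table` as comma-delimited lines (Sqoop's default text format)."""
    cur = conn.execute(f"SELECT * FROM {table}")
    return [",".join(str(v) for v in row) for row in cur.fetchall()]

# In-memory stand-in for a production relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

records = export_table_as_text(conn, "orders")
print(records)  # ['1,9.5', '2,20.0']
```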

2. Data preprocessing: clean, deduplicate, and filter the collected data. Components that can be used include:
Hadoop MapReduce: for distributed processing and transformation of data.
Pig: a high-level scripting layer for analyzing and transforming large-scale data.
Hive: for data warehousing and data analysis; it compiles SQL-like (HiveQL) statements into MapReduce jobs.
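The preprocessing stage can be sketched in plain Python. This toy pass mirrors what cleaning jobs do at scale in MapReduce, Pig, or Hive: trim fields, drop malformed rows, and deduplicate. The record fields (`user`, `age`) are invented for illustration.

```python
raw = [
    {"user": " alice ", "age": "34"},
    {"user": "bob", "age": "not-a-number"},   # malformed: filtered out
    {"user": "alice", "age": "34"},           # duplicate after trimming
]

def preprocess(rows):
    seen, out = set(), []
    for row in rows:
        user = row["user"].strip()            # clean: trim whitespace
        try:
            age = int(row["age"])             # filter: age must parse
        except ValueError:
            continue
        key = (user, age)                     # deduplicate on cleaned values
        if key not in seen:
            seen.add(key)
            out.append({"user": user, "age": age})
    return out

print(preprocess(raw))  # [{'user': 'alice', 'age': 34}]
```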

3. Data storage: store the preprocessed data in HDFS or another distributed storage system. Components that can be used include:
HDFS: the Hadoop Distributed File System.
HBase: a distributed column-family store for reading and writing large data sets in real time.
Cassandra: a distributed NoSQL database for highly available, high-performance big data storage.
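To illustrate the storage stage, here is a tiny in-memory model of HBase's data layout: each row is addressed by a row key, and holds `column-family:qualifier -> value` cells. This only sketches the data model under assumed row-key and column names; real access goes through the HBase client API or shell.

```python
from collections import defaultdict

class MiniHBaseTable:
    """In-memory stand-in for an HBase table (illustrative only)."""
    def __init__(self):
        self._rows = defaultdict(dict)   # row_key -> {"cf:qualifier": value}

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        return self._rows.get(row_key, {}).get(column)

t = MiniHBaseTable()
t.put("user#1001", "info:name", "alice")     # cell in the "info" column family
t.put("user#1001", "stats:logins", 7)        # cell in the "stats" column family
print(t.get("user#1001", "info:name"))       # alice
```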

4. Data analysis: analyze the large data sets stored in HDFS. Components that can be used include:
Spark: for large-scale data processing and analysis, supporting multiple data sources and formats.
Mahout: for building and deploying scalable machine learning models.
Flink: for streaming and batch data processing and analysis.
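The computation model these engines distribute across a cluster can be shown with the classic word count, written here as explicit map, shuffle, and reduce phases in plain Python. The input lines are made up for illustration.

```python
from itertools import groupby

lines = ["big data offline analysis", "big data storage"]

# Map phase: emit (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group pairs by key (the framework does this across the cluster).
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: sum the counts for each word.
counts = {word: sum(c for _, c in group)
          for word, group in groupby(mapped, key=lambda kv: kv[0])}

print(counts["big"], counts["data"])  # 2 2
```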

5. Data visualization: visually present the analysis results. Components that can be used include:
Tableau: for data visualization and interactive analysis.
Power BI: for data visualization and reporting.
D3.js: for web-based data visualization.

2. Digression: Data warehouse

A data warehouse is layered only logically, not physically; the layers are usually distinguished by table-name prefixes in the database.
A data warehouse should be subject-oriented, integrated, relatively stable (non-volatile), and able to reflect historical change (time-variant). Other issues, such as data quality in the warehouse, will be covered later; for now, here are some basic concepts.
A data warehouse usually has the following layers: ODS, DWD, DWM, DWS, and DM (ADS), plus a DIM (dimension) layer.


The big data component Hive can serve as the ODS, DWD, DWM, DIM, and DWS layers of the data warehouse; the DM (ADS) layer is usually handled by OLAP-oriented components such as MySQL, ClickHouse, and Doris.

ODS, DWD: modeled in third normal form (3NF).
DIM: dimension tables such as the time, region, and quality dimensions, similar to various dictionary (lookup) tables.
DWM, DWS, DM: dimensional modeling.

Data analysis models: the star schema (the most commonly used), the snowflake schema, and the fact constellation (galaxy) schema.
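A minimal star-schema sketch in plain Python: a central fact table references dimension tables by surrogate key, and joining them answers "sales per region". The table and column names (`fact_sales`, `region_key`, etc.) are invented for illustration.

```python
dim_region = {1: "north", 2: "south"}               # region dimension
dim_date   = {10: "2023-01-01", 11: "2023-01-02"}   # time dimension

fact_sales = [                                       # fact table
    {"region_key": 1, "date_key": 10, "amount": 100},
    {"region_key": 1, "date_key": 11, "amount": 50},
    {"region_key": 2, "date_key": 10, "amount": 70},
]

# Join facts to the region dimension and aggregate the measure.
sales_by_region = {}
for row in fact_sales:
    region = dim_region[row["region_key"]]
    sales_by_region[region] = sales_by_region.get(region, 0) + row["amount"]

print(sales_by_region)  # {'north': 150, 'south': 70}
```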



Origin: blog.csdn.net/Wxh_bai/article/details/129971179