A complete explanation of big data: Elementary class (Datawhle)

1. What is big data
1.1 Big data features
Big data is commonly described by its 4V features:
Volume. The amount of data is large, with data measured at the TB or PB level.
Variety. There are many data types. Big data spans multiple data dimensions such as logs, videos, and images.
Value. Value density is low, but commercial value is high. For example, in surveillance video, only a key 1-2 seconds may have extremely high value.
Velocity. The data must be processed quickly.
1.2 Four key technologies of big data
[Figure: the four key technologies of big data]
1.3 The difference between ETL/ELT
ETL is the abbreviation of Extract, Transform, Load.
It covers three processes: data extraction => transformation => load.
In ETL, after data is extracted from the source, it is transformed first, and the transformed result is then written to the destination.
ELT is the abbreviation of Extract, Load, Transform.
In ELT, the extracted data is written to the destination first, and the transformation is then completed using the aggregation and analysis capabilities of the database or an external computing framework such as Spark.
The current mainstream architecture for data is ELT. It is heavy on extraction and loading and light on transformation, so a data platform built this way is lightweight.
In the ELT architecture, data loading starts immediately after extraction is completed, which saves time. The data transformation is performed in SQL according to subsequent usage requirements, rather than in the loading phase.
The advantage of the ELT framework is that it retains the original data and can show the original data to the data analyst.
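The difference in ordering can be sketched in a few lines of Python. This is only an illustration under assumptions: an in-memory SQLite database stands in for the destination, and the `RAW_ROWS` data, the table names (`raw_orders`, `customer_totals`), and the functions `run_etl` and `run_elt` are hypothetical and not part of any ETL product.

```python
import sqlite3

# Hypothetical raw records from a source system: (customer, amount stored as text).
RAW_ROWS = [("alice", "10.5"), ("bob", "4.5"), ("alice", "5.0")]


def run_etl(rows):
    """ETL: transform in application code first, then load only the result."""
    totals = {}
    for customer, amount in rows:  # Transform, outside the database
        totals[customer] = totals.get(customer, 0.0) + float(amount)
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customer_totals (customer TEXT, total REAL)")
    db.executemany(  # Load the already-transformed rows
        "INSERT INTO customer_totals VALUES (?, ?)", sorted(totals.items())
    )
    return db


def run_elt(rows):
    """ELT: load the raw rows unchanged, then transform inside the database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
    db.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)  # Load raw data
    db.execute(  # Transform with SQL, on demand
        "CREATE TABLE customer_totals AS "
        "SELECT customer, SUM(CAST(amount AS REAL)) AS total "
        "FROM raw_orders GROUP BY customer"
    )
    return db


QUERY = "SELECT customer, total FROM customer_totals ORDER BY customer"
etl_result = run_etl(RAW_ROWS).execute(QUERY).fetchall()
elt_db = run_elt(RAW_ROWS)
elt_result = elt_db.execute(QUERY).fetchall()
# Only the ELT destination still holds the untouched source rows.
raw_kept = elt_db.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Both paths produce the same totals, but only the ELT database still contains `raw_orders`, which is the "retains the original data" advantage described above.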
ETL related software:
Commercial software: Informatica PowerCenter, IBM InfoSphere DataStage, Oracle Data Integrator, Microsoft SQL Server Integration Services, etc.
Open source software: Kettle, DataX, Sqoop
1.4 Big data and database management system
A DBMS (DataBase Management System) can manage multiple databases.
At present, relational databases occupy the mainstream position among DBMSs. Commonly used relational databases include Oracle, MySQL, and SQL Server.
SQL is the query language of relational databases.
SQL is a language that interacts directly with data and also interacts with front-end and back-end languages, so it belongs to the "middle platform" class of languages.
Features of the SQL language:
Great value. Technical, product, and operations staff all need to master SQL, and it is used everywhere.
Few changes. From its birth to the present, the grammar has changed very little.
Not difficult to get started. Many people can write SQL statements, but the efficiency of those statements differs greatly.
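As one small illustration of how the same SQL statement can run with very different efficiency, the sketch below asks SQLite for its query plan before and after an index is created. The `orders` table, its data, the index name `idx_orders_customer`, and the helper `query_plan` are all hypothetical; other databases expose query plans differently.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
db.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [(f"customer_{i % 100}", float(i)) for i in range(1000)],
)


def query_plan(sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)  # last column is the detail text


SQL = "SELECT SUM(amount) FROM orders WHERE customer = 'customer_7'"

plan_before = query_plan(SQL)  # no index on customer: the whole table is scanned
db.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan_after = query_plan(SQL)   # the planner can now search through the index
total = db.execute(SQL).fetchone()[0]
```

The statement and its result are identical in both cases; only the work the database does to answer it changes.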
In addition to relational databases, there are the document database MongoDB, the key-value database Redis, the column-store database Cassandra, and others.
Any discussion of big data has to mention Hive.
Hive is a data warehouse tool based on Hadoop. It is used to extract, transform, and load data, and it provides a mechanism to store, query, and analyze large-scale data stored in Hadoop.
Compared with a relational database (RDBMS), Hive has the following shortcomings and advantages.
Shortcomings:
It cannot respond in real time like an RDBMS; Hive queries have high latency.
It cannot do transactional queries like an RDBMS; Hive traditionally has no transaction mechanism (newer versions add ACID support only for specially configured transactional tables).
It cannot perform row-level change operations (including insert, update, and delete) like an RDBMS.
Advantages:
Hive does not need fixed-length varchar types; strings can simply be declared as string.
Hive uses schema-on-read: data is not validated when it is saved into a table. When data is read, values that do not conform to the format are set to NULL.
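The schema-on-read behavior can be mimicked in plain Python. This is not Hive code, only a sketch of the idea: the `SCHEMA`, the sample `RAW_LINES`, and the function `read_row` are hypothetical, and `None` plays the role of NULL.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical table schema: (column name, parser applied at read time).
SCHEMA: List[Tuple[str, Callable[[str], object]]] = [
    ("id", int),
    ("name", str),
    ("price", float),
]

# Raw lines are stored as-is; nothing is validated when they are written.
RAW_LINES = ["1,apple,3.5", "two,pear,1.0", "3,plum,cheap"]


def read_row(line: str) -> Dict[str, Optional[object]]:
    """Apply the schema while reading; non-conforming values become None."""
    fields = line.split(",")
    row: Dict[str, Optional[object]] = {}
    for index, (name, parse) in enumerate(SCHEMA):
        if index >= len(fields):
            row[name] = None  # missing column
            continue
        try:
            row[name] = parse(fields[index])
        except ValueError:
            row[name] = None  # value does not match the declared type
    return row


rows = [read_row(line) for line in RAW_LINES]
```

Here the bad `id` in the second line and the bad `price` in the third line are read back as `None`, while the rest of each row is still usable.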
1.5 OLTP/OLAP
There are two closely related concepts in data warehouse architecture: one is OLTP, the other is OLAP.

OLTP (On-Line Transaction Processing)
Online transaction processing is mainly the addition, deletion, and modification of data.
It records business events, such as purchase behavior. After the event occurs, the system records who did what and when, and the corresponding rows are updated in the database through insert, delete, and update operations.
OLTP requires high real-time performance and strong stability. ATM, ERP, CRM, OA, and similar systems all belong to OLTP.
OLAP (On-Line Analytical Processing)
Online analytical processing is mainly for data analysis and query.
When data has accumulated to a certain level, summary analysis is needed. BI reports => OLAP
The data generated by OLTP usually lives in different business systems.
OLAP requires different data sources => data integration => data cleaning => data warehouse. The warehouse then serves OLAP analysis in a unified way.
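The two access patterns can be contrasted with a minimal sketch. As a simplification, a single in-memory SQLite database plays both roles here, whereas in the architecture described above the OLAP query would run on a separate data warehouse fed from several business systems. The `purchases` table and its rows are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE purchases ("
    "id INTEGER PRIMARY KEY, buyer TEXT, item TEXT, amount REAL, day TEXT)"
)

# OLTP-style work: small transactions that insert, update, or delete a few rows.
with db:  # each 'with' block commits as one transaction
    db.executemany(
        "INSERT INTO purchases (buyer, item, amount, day) VALUES (?, ?, ?, ?)",
        [
            ("alice", "book", 20.0, "2021-01-01"),
            ("bob", "pen", 2.5, "2021-01-01"),
            ("carol", "bag", 30.0, "2021-01-02"),
            ("alice", "lamp", 15.0, "2021-01-02"),
        ],
    )
with db:
    db.execute("UPDATE purchases SET amount = ? WHERE id = ?", (18.0, 1))
with db:
    db.execute("DELETE FROM purchases WHERE id = ?", (3,))

# OLAP-style work: a read-only summary over the accumulated data, as in a BI report.
daily_totals = db.execute(
    "SELECT day, SUM(amount) FROM purchases GROUP BY day ORDER BY day"
).fetchall()
```

The OLTP statements each touch one or a few rows identified by key, while the OLAP query reads the whole table and aggregates it.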

Origin blog.csdn.net/benli8541/article/details/112671724