Big data interview question: the differences between Hive and a traditional database

1. Schema on write and schema on read

A traditional database uses schema on write. Data is parsed and validated as it is loaded, so columns can be indexed and compressed, which improves query performance. The cost is a longer load time.

Hive uses schema on read. Loading data is very fast, because Hive does not need to read and parse the data; it only copies or moves files.

Schema on read vs. schema on write

In a traditional database, the table's schema is enforced when data is loaded. If the data does not conform to the schema, the load is rejected. Because the data is checked against the schema as it is written to the database, this design is sometimes called "schema on write".

Hive, on the other hand, does not verify the data when it is loaded, but when it is queried. This is called "schema on read".

Users need to weigh the two approaches against each other. Schema on read makes data loading very fast, because the data does not have to be read, parsed, and then serialized to disk in the database's internal format. Loading is just copying or moving files. This approach is also more flexible: the same data may have two schemas for different analysis tasks, which is possible in Hive with external tables.

Schema on write helps query performance, because the database can index columns and compress the data. The trade-off is that loading takes more time. In addition, in many cases the schema is not known at load time: the queries have not been decided yet, so it is impossible to decide which indexes to build. These are exactly the situations where Hive excels.
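The trade-off described above can be sketched in a few lines of Python. This is only an illustration, not how Hive or any database is implemented: the `SCHEMA`, the two table classes, and the choice to return `None` for unparseable values (loosely mimicking the way Hive returns NULL for fields it cannot parse) are all assumptions made for the example.

```python
import csv
import io

# Hypothetical table schema: (column name, type used to parse the raw string).
SCHEMA = [("id", int), ("name", str), ("age", int)]


def parse_strict(fields):
    """Parse one row, raising ValueError if it does not match SCHEMA."""
    if len(fields) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields, got {len(fields)}")
    return tuple(cast(value) for (_, cast), value in zip(SCHEMA, fields))


def parse_lenient(fields):
    """Parse one row, turning missing or unparseable values into None."""
    padded = list(fields) + [None] * (len(SCHEMA) - len(fields))
    row = []
    for (_, cast), value in zip(SCHEMA, padded):
        try:
            row.append(None if value is None else cast(value))
        except ValueError:
            row.append(None)
    return tuple(row)


class SchemaOnWriteTable:
    """Validates at load time; one bad row rejects the whole load."""

    def __init__(self):
        self.rows = []

    def load(self, text):
        parsed = [parse_strict(f) for f in csv.reader(io.StringIO(text))]
        self.rows.extend(parsed)  # only reached if every row was valid

    def scan(self):
        return list(self.rows)


class SchemaOnReadTable:
    """Stores raw text untouched; the schema is applied on every scan."""

    def __init__(self):
        self.files = []

    def load(self, text):
        self.files.append(text)  # stands in for a file copy or move

    def scan(self):
        return [
            parse_lenient(f)
            for text in self.files
            for f in csv.reader(io.StringIO(text))
        ]
```

Given the input `"1,alice,30\n2,bob,abc\n"`, `SchemaOnWriteTable.load` raises `ValueError` and stores nothing, while `SchemaOnReadTable.load` accepts the text as-is and the bad `age` only surfaces as `None` when `scan()` is called. The load in the second class does no parsing at all, which is why it is cheap, and every scan pays the parsing cost instead.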

2. Data format.

Hive does not define a special data format; the format is specified by the user. Three attributes need to be specified: the column separator, the row separator, and the method of reading file data. In a database, the storage engine defines its own data format, and all data is stored in that organized form.
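As a rough illustration of what a user-specified format means, the following Python sketch splits raw text into rows and columns using separators supplied by the caller. The function name `read_delimited` is hypothetical; the default values reflect Hive's default text format, which uses Ctrl-A (`\x01`) between fields and a newline between rows.

```python
def read_delimited(raw, field_sep="\x01", line_sep="\n"):
    """Split raw text into rows of string fields using the given separators.

    Empty lines (for example, the one after a trailing separator) are skipped.
    """
    return [line.split(field_sep) for line in raw.split(line_sep) if line]


# The same reader handles a different layout once it is told the separators.
default_rows = read_delimited("1\x01alice\n2\x01bob\n")
custom_rows = read_delimited("1|alice;2|bob", field_sep="|", line_sep=";")
```

Both calls produce `[["1", "alice"], ["2", "bob"]]`. The file itself carries no format information; the interpretation lives entirely in the parameters the reader is given.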

3. Data update.

Hive is designed for workloads that read much more than they write, so it traditionally does not support rewriting or deleting data; the data is fixed when it is loaded. (Newer Hive versions add row-level updates and deletes for transactional tables.) Data in a database usually needs to be modified frequently.

4. Execution delay.

Hive has to scan the entire table (or partition) when querying data, so latency is high, and it has an advantage only when processing large data sets. A database has low execution latency when processing small data sets.
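The "(or partition)" qualifier matters: when a query names a partition, only that partition's files are read. A minimal Python sketch, assuming a hypothetical table partitioned by a date string, with made-up rows:

```python
# Hypothetical table partitioned by date; each key stands in for one
# directory of files, and each value for the rows stored in it.
table = {
    "2020-10-01": [("alice", 3), ("bob", 5)],
    "2020-10-02": [("alice", 7)],
    "2020-10-03": [("carol", 1), ("alice", 2), ("bob", 4)],
}


def sum_for_user(table, user, partition=None):
    """Return (total, rows_scanned) for one user.

    Without a partition filter every partition is scanned; with one,
    only the named partition is read.
    """
    names = list(table) if partition is None else [partition]
    total = 0
    scanned = 0
    for name in names:
        for row_user, amount in table.get(name, []):
            scanned += 1
            if row_user == user:
                total += amount
    return total, scanned
```

`sum_for_user(table, "alice")` reads all 6 rows, while `sum_for_user(table, "alice", partition="2020-10-02")` reads just 1. Either way, every row in the selected partitions is examined, which is why latency stays high on small queries.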

5. Index.

Hive's index support is weak, which makes it unsuitable for real-time queries. A database has mature index support.
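To see why an index matters for point lookups, compare a filter that examines every row with a lookup through a prebuilt index. This Python sketch is only an analogy: the row layout and function names are hypothetical, and a real database would use a B-tree or similar structure rather than a dictionary.

```python
# Hypothetical table of (id, name) rows.
rows = [(i, f"user{i}") for i in range(100_000)]


def find_by_scan(rows, key):
    """Filter by id without an index: every row is examined."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if row[0] == key:
            matches.append(row)
    return matches, examined


def build_index(rows):
    """Map each id to the positions of the rows that carry it."""
    index = {}
    for pos, row in enumerate(rows):
        index.setdefault(row[0], []).append(pos)
    return index


def find_by_index(rows, index, key):
    """Filter by id with an index: only matching rows are touched."""
    positions = index.get(key, [])
    return [rows[p] for p in positions], len(positions)
```

Looking up id `99_999` by scan examines all 100,000 rows; with the index it touches one. Building the index is extra work at load time, which is the same schema-on-write trade-off described in section 1.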

6. Implementation.

Hive executes queries as MapReduce jobs, while a database uses its own executor.

7. Scalability.

Hive's scalability is high, because it is built on Hadoop; a database's scalability is low.

8. Data scale.

Hive handles large data volumes; a database handles small ones.

Origin blog.csdn.net/qq_42706464/article/details/109167114