Enterprise Hive optimization

Hive optimization strategy: optimization falls mainly into three areas.

     1> Architecture-level optimization: Split tables: The log table holds a large volume of data. The main source is the nginx monitoring server: its log data is pulled into a specified file, and the big data extraction tool Flume regularly monitors that file and extracts its contents into the HDFS file system, usually once a night, so that the data is ready for analysis the next day. This data should be stored in separate tables, with the useful data distributed across dedicated tables for later use.
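The table split described above can be sketched in HiveQL roughly as follows. This is only an illustration under assumptions: the table names (`raw_nginx_log`, `log_request`, `log_client`), the columns, the tab delimiter, and the HDFS path `/flume/nginx/raw` are all hypothetical and do not come from the original setup.

```sql
-- Hypothetical raw table over the directory Flume writes to each night.
CREATE EXTERNAL TABLE IF NOT EXISTS raw_nginx_log (
  remote_addr STRING,
  time_local  STRING,
  request     STRING,
  status      INT,
  referer     STRING,
  user_agent  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/flume/nginx/raw';   -- assumed path

-- Hypothetical narrower tables, one per analysis topic.
CREATE TABLE IF NOT EXISTS log_request (
  remote_addr STRING,
  time_local  STRING,
  request     STRING,
  status      INT
);

CREATE TABLE IF NOT EXISTS log_client (
  remote_addr STRING,
  referer     STRING,
  user_agent  STRING
);

-- Multi-table insert: the raw table is scanned once and feeds both targets.
FROM raw_nginx_log
INSERT OVERWRITE TABLE log_request
  SELECT remote_addr, time_local, request, status
INSERT OVERWRITE TABLE log_client
  SELECT remote_addr, referer, user_agent;
```

Using a single `FROM ... INSERT ... INSERT ...` statement avoids reading the large raw table once per target table.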

                                 Make reasonable use of intermediate tables, and be familiar with the relationships between tables. Many companies discard intermediate data once it has been queried. I think this is a poor practice, because the data you queried this time may well be needed a second time. Intermediate query results should therefore be stored carefully so that they can be reused later.
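One way to keep an intermediate result is a create-table-as-select (CTAS) statement. The sketch below assumes a hypothetical table `log_request` with a `status` column already exists; the name `mid_status_hits` and the choice of ORC storage are illustrative assumptions as well.

```sql
-- Persist the intermediate aggregation instead of discarding it.
CREATE TABLE mid_status_hits
STORED AS ORC
AS
SELECT status, COUNT(*) AS hits
FROM log_request
GROUP BY status;

-- A later query reuses the stored result rather than rescanning the logs.
SELECT status, hits
FROM mid_status_hits
WHERE status >= 500;
```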

                                 Set up partitioned tables sensibly: the purpose is to manage data in an orderly way, make it easy to query just the data we specify, avoid full scans of the whole data set, and improve read and write performance. Generally, day and hour are used as partition fields.
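A minimal sketch of a table partitioned by day and hour is shown below. The table names (`log_request_part`, `log_request_staging`), the columns, the partition column names `dt` and `hr`, and the date format are hypothetical choices made for illustration.

```sql
-- Partition columns are declared separately from the data columns.
CREATE TABLE IF NOT EXISTS log_request_part (
  remote_addr STRING,
  request     STRING,
  status      INT
)
PARTITIONED BY (dt STRING, hr STRING)   -- dt = day, hr = hour
STORED AS ORC;

-- Load one hour of data into its own partition (static partition insert).
INSERT OVERWRITE TABLE log_request_part PARTITION (dt = '2024-01-01', hr = '08')
SELECT remote_addr, request, status
FROM log_request_staging;               -- assumed staging table holding that hour

-- Filtering on the partition columns lets Hive read only the matching
-- partition directories instead of scanning the whole table.
SELECT status, COUNT(*) AS hits
FROM log_request_part
WHERE dt = '2024-01-01' AND hr = '08'
GROUP BY status;
```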

     2> Optimization of HQL statements: Some HQL statements are inherently prone to data skew, so the HQL we write should itself be optimized to some degree.
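As an example of rewriting HQL to reduce skew, a commonly used technique (not necessarily the one the author had in mind) is two-stage aggregation with a random salt. The sketch assumes a hypothetical table `log_request` with a `remote_addr` column in which a few addresses account for most rows; the salt range of 10 is an arbitrary illustrative value.

```sql
-- Stage 1 spreads each hot key across up to 10 groups using a random salt.
-- Stage 2 merges the partial counts back into one row per key.
SELECT remote_addr, SUM(partial_cnt) AS cnt
FROM (
  SELECT remote_addr, salt, COUNT(*) AS partial_cnt
  FROM (
    SELECT remote_addr,
           CAST(FLOOR(RAND() * 10) AS INT) AS salt   -- salt in 0..9
    FROM log_request
  ) salted
  GROUP BY remote_addr, salt
) stage1
GROUP BY remote_addr;
```

The result matches a plain `GROUP BY remote_addr`, but no single reducer in the first stage has to process every row of a hot key.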

     3> Parameter-level optimization:
