Impala memory optimization

I. INTRODUCTION
The "Three Musketeers" of NoSQL data analysis in the Hadoop ecosystem, Hive, HBase, and Impala, each have their own strengths: Hive in batch analysis of massive data, HBase in large-scale columnar data storage, and Impala in real-time interactive analysis. Impala in particular has won the favor of many big-data analysts since it joined the Hadoop community, thanks to its distinctive advantages.

Impala's advantages include the following:

  • The master node generates the execution plan tree and distributes it to the worker nodes, which execute in parallel and pull their data, replacing the push-mode data movement of Hadoop's conventional MapReduce. Intermediate results are not written to disk but streamed over the network as they are produced, giving much stronger real-time interactivity.

  • Impala spends no extra effort on metadata management; it reuses the Hive Metastore, so it can directly access PB-scale data stored in Hadoop's HDFS and in HBase.

  • Impala loads data blocks into memory for computation, a great improvement in computing performance compared with Hive and HBase.

  • Impala offers SQL semantics, which is convenient, simple, and practical compared with HBase: no other programming language is needed, and complex data-analysis tasks are expressed in plain SQL statements.

  • Impala also inherits Hadoop's flexibility, scalability, and economy, and its distributed, data-local processing avoids network bottlenecks.

Having said so much about Impala's advantages, does it really have no shortcomings? Is it a "perfect" analysis tool?

Definitely not! In more than a year of hands-on experience analyzing massive data and developing web applications with Impala, it has kept exposing some fatal problems:

  • "Eating Memory" family, it relies too serious for memory, memory overflow directly lead to the failure of technical tasks.

  • "Emasculated version" SQL-- does not support the UDF, does not support UPDATE / DELTE operation, the same does not support multiple SELECT DISTINCT.

  • Role-based security permissions require additional security components, making configuration and management more complex.
    Fundamentally, though, by studying Impala's working mechanism and its error logs, we found that most of Impala's defects can be resolved with good optimization schemes and alternative measures. Reasonable optimization squeezes the maximum performance out of Impala and lets the open-source Hadoop ecosystem stand up to commercial NewSQL databases, obtaining analytical performance not inferior to NewSQL products while saving costs.

    Due to limited space, today let's talk about optimizing Impala's memory-overflow problems.

II. Farewell to "Memory limit exceeded": Impala memory optimization

  • SQL Operations that Spill to Disk
    Before introducing our own Impala memory-optimization techniques, let's first look at the official Impala effort to deal with memory overflow. Starting with Impala 1.4 for CDH 4 and CDH 5.1, new versions include the "SQL Operations that Spill to Disk" feature: when an Impala operation overflows memory, the computation is shifted to disk. This takes up temporary disk space and adds disk I/O, making computation far slower than a pure in-memory operation, but at least it solves the deadly "zero-tolerance" problem of analysis tasks failing outright on memory overflow.
    Turning on the feature is simple: in impala-shell, execute SET DISABLE_UNSAFE_SPILLS=0 or SET DISABLE_UNSAFE_SPILLS=FALSE. When DISABLE_UNSAFE_SPILLS is set to 0 (FALSE), operations on the verge of memory overflow spill to disk; when it is set to 1 (TRUE), a memory overflow directly reports a "Memory limit exceeded" error.
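    A minimal impala-shell sketch of the switch (the table and query are made-up examples, not from the original test):

        SET DISABLE_UNSAFE_SPILLS=0;   -- FALSE: allow spilling to disk under memory pressure
        -- a memory-hungry aggregation that would otherwise exceed the node memory limit
        SELECT user_id, COUNT(DISTINCT session_id) AS sessions
        FROM web_logs
        GROUP BY user_id;
        SET DISABLE_UNSAFE_SPILLS=1;   -- TRUE: overflowing queries fail with "Memory limit exceeded"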
    In our test, with the DISABLE_UNSAFE_SPILLS parameter enabled (spilling to disk disabled), the query task failed; with the parameter disabled (spilling to disk enabled), the query took 1,225 seconds but completed successfully.
    In Cloudera Manager's Impala query monitoring interface, we can see the resources consumed by the two runs:
    In the test, I limited Impala's memory on each cluster node to 3.5 GB. You can see that the query's cumulative peak memory usage was 3.6 GB, producing the "Memory limit exceeded" error. After the "SQL Operations that Spill to Disk" feature was turned on, peak memory usage dropped to 2.3 GB, and the extra memory needed by the computation was converted to disk operations (each of the cluster's three DataNodes wrote about 10 GB of temporary data), avoiding the "Memory limit exceeded" error and completing the query successfully.
    Is the memory-overflow problem solved with this one trick? Actually, no!
    In fact, the Impala team itself admits the feature has many limitations, and recommends that users avoid triggering "Spill to Disk" wherever possible. The feature's current restrictions include:
  1. Not all SQL statements can trigger spilling; for example, the UNION keyword will still trigger a memory-overflow error;

  2. The peak memory limit on each node must not be set too low, below the minimum memory required by the operators assigned to each node;

  3. The memory estimated for each node's operations in the EXPLAIN output must not be higher than the actual physical memory on each node;

  4. When other queries are running concurrently, triggering the "Spill to Disk" feature can still cause a memory-overflow error;

  5. It makes real demands on disk space: spilled data is written to Impala's temporary directory on each node, increasing disk I/O and causing uncontrolled disk usage.

    Whether going by the official advice or by practical experience, "Spill to Disk" is not a long-term solution. Memory overflows should instead be solved by optimizing SQL queries, changing system parameters, and improving the hardware configuration.

SQL query optimization

SQL optimization is not only a basic skill that any mature analysis and development team needs; it is also the best way to maximize the performance mined from an Impala cluster and achieve a multiplier effect in data analysis. Here are Impala's four SQL-optimization "magic weapons":

  • Magic weapon 1: COMPUTE STATS
    COMPUTE STATS (a DDL statement) gathers information about a table's partitions, the volume of data in its columns, and the data distribution, and stores the gathered information in the metastore database. It looks like a simple statement that merely collects table statistics, but it plays an important role in how Impala plans and schedules analysis tasks. The statistics it collects not only let Impala optimize resource-hungry query operations such as JOIN, GROUP BY, ORDER BY, UNION, and DISTINCT, but also benefit HBase tables.
    In Impala analysis tasks, whether a table is a source table, an intermediate table, or a result table, and whether it is a massive table or a key table that affects performance, it is recommended to run COMPUTE STATS table_name after the table is generated; the cluster then automatically selects the optimal query plan based on the statistics. The larger the table (GB, TB and above), the longer the statistics command takes to run, and the more obvious the optimization effect after it completes.
    After the statistics command completes, use the SHOW TABLE STATS table_name command to confirm that the detailed statistics are available in the output for the table.
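    A minimal sketch of the workflow (user_info is a hypothetical table name):

        -- gather table, partition, and column statistics once the table is loaded
        COMPUTE STATS user_info;
        -- confirm the statistics are in place (#Rows should no longer be -1)
        SHOW TABLE STATS user_info;
        SHOW COLUMN STATS user_info;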
    Just how important is COMPUTE STATS? Take a look at the following example.
    In a user-profile analysis project, I needed to join and summarize three large tables: a basic user-information table (300 million rows), a user voice-call table (160 million rows), and a user data-traffic table (90 million rows). Before COMPUTE STATS was used, the task took 50 minutes, peak memory usage on each node was 140 GB, and the final result contained a bug: many duplicate records. The resources consumed by the task are shown below:
    After running COMPUTE STATS on the three large tables and executing the analysis again, the task took 32 minutes, peak memory on each node was only 17 GB, and the analysis result was correct. The resources consumed by the task are shown below:
    You can see that once each node has obtained the table statistics, Impala can generate a more reasonable execution plan and schedule the task properly, which not only reduces the risk of "Memory limit exceeded" errors but also avoids the abnormal-result bug caused by excessive memory consumption.
    It is worth mentioning that in practice, some very large tables (tens of billions of records, TB-scale and above) hang indefinitely and cannot complete when the COMPUTE STATS command is executed. There are two ways to handle this situation. The first is the COMPUTE INCREMENTAL STATS (DDL) command added in Impala 2.1 for CDH 5.3: store the large table partitioned and use this command to gather statistics incrementally, partition by partition; when partitions change, only the changed partitions are scanned and their statistics updated. The second, for clusters that cannot be upgraded to a newer Impala version, or for tables whose statistics still cannot complete even incrementally given limited physical resources, is to set the table statistics manually. In the example below, we first COUNT(*) the table, then run ALTER TABLE table_name SET TBLPROPERTIES('numRows'='n') (where n is the record count returned by COUNT(*)) to set the table's row count by hand; querying the table statistics again shows that the Impala cluster has successfully obtained them.
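    A sketch reconstructing that manual-statistics example (huge_events and the row count are hypothetical placeholders):

        -- 1. count the records of the table that COMPUTE STATS cannot finish on
        SELECT COUNT(*) FROM huge_events;   -- suppose it returns 52000000000
        -- 2. write the row count into the table properties by hand
        ALTER TABLE huge_events SET TBLPROPERTIES('numRows'='52000000000');
        -- 3. verify that Impala now sees the statistics
        SHOW TABLE STATS huge_events;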

  • Magic weapon 2: the EXPLAIN execution plan

    By adding "EXPLAIN" keyword before the SQL statement, you can return the execution plan for the statement. Execution plan from the bottom Impala How to read the display data, how to coordinate work, and the combination between nodes transmit intermediate results and the final result set is obtained in the whole process.

    From the execution plan you can judge how efficient the statement will be. If the efficiency is too low, you can adjust the statement itself or the table structure: for example, change the WHERE clause, use hints on the JOIN, introduce subqueries, change the join order of the tables, change the table partitioning, or gather table and column statistics, among other ways to optimize the statement's performance.
    The figure below shows the execution-plan output of a query run with EXPLAIN, read in order from the bottom up:
    (1) The last part shows the low-level details, such as the total amount of data to be read. From that amount we can judge whether the partitioning strategy is effective, and combine it with the cluster size to estimate whether reading this much data is actually necessary, and so on.
    (2) Further up you can see the aggregations, sort order, statistical functions, and other concrete execution details and interactions, and, at a higher level, how intermediate results flow between the different nodes.
    (3) You can see whether operations are executed in parallel on different Impala nodes, along with the estimated memory each node requires.

    (4) By configuring the EXPLAIN_LEVEL parameter (values 0 to 3), you can get more detailed output; the higher the value, the more detailed the execution plan.
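    A minimal sketch of the usage (the table names are hypothetical):

        -- raise the plan verbosity (0 = minimal ... 3 = most detailed)
        SET EXPLAIN_LEVEL=2;
        -- EXPLAIN prints the plan without actually executing the query
        EXPLAIN
        SELECT u.user_id, SUM(t.bytes_used) AS total_bytes
        FROM user_info u
        JOIN traffic_info t ON u.user_id = t.user_id
        GROUP BY u.user_id;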

  • Magic weapon 3: PROFILE query information

    The PROFILE statement outputs even more detailed, lower-level information about the most recently executed SQL query. After a query finishes, whether it succeeded or not, entering the PROFILE command outputs the detailed query information, including the peak memory used on each execution node, the number of physical bytes read, and so on. With this information we can determine whether a query is I/O-bound, CPU-bound, or network-bound, identify poorly performing nodes, and check whether recommended configuration settings have taken effect, and so on.

    The PROFILE output includes:

    (1) Basic information: the query's execution status, start time, the query statement, and the Impala node the task was submitted to:

    (2) The execution plan and steps of the query:

    (3) Statistics for each execution step, including each node's average time, maximum time, amount of data read, and peak memory usage:

    (4) The details of each step, and the execution details of each step on each node it was assigned to (only part of the screenshot is shown because there is too much information).

    After running a query, use the PROFILE command to confirm from the ground up that its I/O, memory consumption, network bandwidth, CPU usage, and so on are within the expected range; where they are not optimal, optimize and tune the SQL statement or adjust the node configuration accordingly.
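    In impala-shell, PROFILE takes no arguments and reports on the last query run in the current session; a minimal sketch (user_info is hypothetical):

        SELECT COUNT(*) FROM user_info;   -- run the query to be inspected
        PROFILE;                          -- dump the detailed runtime profile of that query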

  • Magic weapon 4: structural optimization
    If the Impala memory-overflow problem is still not resolved after using the three magic weapons above, it is time to consider structural optimization measures.
    (1) Parquet table storage
    Choose an appropriate storage structure for Impala source tables, intermediate tables, and result tables. By default Impala creates tables in TEXT format; for massive data storage, the Parquet storage format is preferable. Specify the Parquet format in the CREATE TABLE statement as shown below, and avoid using INSERT ... VALUES to insert data into Parquet tables: each row inserted that way produces a separate small data file on HDFS, which reduces query parallelism and therefore hurts query speed.
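    A minimal sketch (table and column names are hypothetical):

        -- create the table in Parquet format rather than the TEXT default
        CREATE TABLE user_info_parquet (
          user_id   BIGINT,
          city      STRING,
          plan_type STRING
        )
        STORED AS PARQUET;

        -- bulk-load with INSERT ... SELECT instead of row-by-row INSERT ... VALUES
        INSERT INTO user_info_parquet
        SELECT user_id, city, plan_type FROM user_info_text;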
    (2) Partitioning
    When a table's data volume is very large, or the table is always queried by certain specific columns, partitioning can greatly improve Impala's query speed. The distinct values of the partition column should have a reasonable degree of selectivity, that is, they should clearly divide the data into partitions. When choosing the partition granularity, try to keep each partition's data at 1 GB or a multiple of 1 GB; only when the partitioned data files are of the right size can distributed queries make full use of HDFS's batch I/O performance and Impala's parallelism.
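    A minimal sketch of a partitioned table (names are hypothetical):

        -- partition by the column the queries always filter on
        CREATE TABLE traffic_info (
          user_id    BIGINT,
          bytes_used BIGINT
        )
        PARTITIONED BY (stat_month STRING)
        STORED AS PARQUET;

        -- a filter on the partition column scans only the matching partitions
        SELECT SUM(bytes_used) FROM traffic_info WHERE stat_month = '2015-06';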
    (3) SQL statement optimization
    How SQL statements are written directly affects the efficiency of the final analysis tasks. Here are some SQL-optimization "rules of thumb" (a join sketch follows the list):

  • In join conditions, avoid joining on long strings; use numeric join keys wherever possible.

  • Split huge tasks apart and "divide and conquer": replace deeply nested tasks with intermediate tables built and executed step by step.

  • If statistics are unavailable for the tables in a join, or Impala automatically chooses an inefficient join order, you can add the STRAIGHT_JOIN keyword after SELECT and specify the table order manually, following the principle of "big table on the left, small tables on the right".

  • Use hints to manually adjust how the SQL query works at the lower level; in a SQL statement, a specific hint is enclosed in [].

  • When joining a large table with a small table, use the [BROADCAST] hint to specify the join mode; when joining large tables with each other, use the [SHUFFLE] hint.
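    A minimal sketch combining STRAIGHT_JOIN with the two hints (all table names hypothetical):

        -- keep the written join order: the biggest table comes first
        SELECT STRAIGHT_JOIN f.user_id, d.city_name, SUM(f.bytes_used) AS total_bytes
        FROM big_traffic f
          JOIN [SHUFFLE] big_users u ON f.user_id = u.user_id    -- two large tables: partitioned join
          JOIN [BROADCAST] city_dim d ON u.city_id = d.city_id   -- small dimension table: replicate it
        GROUP BY f.user_id, d.city_name;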

Optimize cluster resources
In addition to optimizing SQL queries, you can also optimize how the Impala cluster allocates system resources, making it a "two-pronged approach". Impala's admission control is a lightweight, distributed resource-control mechanism that limits the amount of memory a query may use, the number of queries executed in parallel, and so on. How this feature is configured to optimize the system depends on the cluster's specific environment; the optimization ideas include the following (a configuration sketch follows the list):

  • Set a reasonable -mem_limit option for the impalad process to limit the memory reserved for queries.

  • Configure dynamic resource allocation with YARN, setting a reasonable memory split between Impala and the other Hadoop components.

  • For requests from different user groups, configure resource pools and the cgroup mechanism to isolate their resource usage.

  • If memory overflows occur because of too many concurrent queries, rather than a small number of large queries with high memory footprints, limit the number of concurrent queries through Impala's admission-control feature.

  • Configure a query queue: when a new query request arrives while system resources are at their peak, it is placed in a waiting queue, and the queued queries execute only after running queries complete and release their resources.

  • Use Impala admission control to manage the concurrency and memory consumption of Impala jobs on the cluster, and use Hadoop YARN to control the resource consumption of the other components.
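A minimal sketch of single-pool admission control set through impalad startup flags; the flag names are from the Impala admission-control documentation of that era, and the numbers are placeholders to adapt to your own cluster:

    # e.g. in /etc/default/impala (the path varies by distribution)
    IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
        -mem_limit=64g \
        -default_pool_max_requests=20 \
        -default_pool_max_queued=50 \
        -default_pool_mem_limit=480g \
        -queue_wait_timeout_ms=600000"
    # -mem_limit: memory each impalad reserves for queries
    # -default_pool_max_requests / _max_queued: concurrency cap and queue depth
    # -default_pool_mem_limit: total pool memory before new queries are queued
    # -queue_wait_timeout_ms: queued queries give up after this long (here 10 minutes)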

III. Closing words

As an open-source big-data analysis engine, some of Impala's own defects are acceptable. With reasonable optimization schemes you can tap its potential, and once you can really "play" it well, Impala will bring you unexpected surprises.

Reproduced from: https://www.jianshu.com/p/936339ac1a25
