A record of problems encountered with Impala

When I first started working with Impala, many things were unclear. Here is a record of a few pitfalls I ran into.
1. Problem: memory overflow occurs during a table join
Solution: run COMPUTE STATS on both tables. Impala does not gather statistics on a table and its columns automatically as soon as the table is modified, because gathering them has a cost. COMPUTE STATS updates this information and saves it to the metastore, and Impala then uses it to optimize the query plan and reduce resource consumption when running join queries.
Reference https://docs.cloudera.com/documentation/enterprise/5-9-x/topics/impala_compute_stats.html
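As a minimal sketch of the step above, the following Python helper only builds the statements to run against each table taking part in the join. The table names are hypothetical, and how the statements are submitted (impala-shell, JDBC, or a service interface) is left open. SHOW TABLE STATS and SHOW COLUMN STATS are included so the result of COMPUTE STATS can be checked.

```python
def stats_statements(tables):
    """Return the Impala statements that gather and then display statistics."""
    statements = []
    for table in tables:
        # Gather table and column statistics and store them in the metastore.
        statements.append(f"COMPUTE STATS {table}")
        # Display what was gathered, to confirm the statistics are present.
        statements.append(f"SHOW TABLE STATS {table}")
        statements.append(f"SHOW COLUMN STATS {table}")
    return statements


# Hypothetical table names, used only for illustration.
for statement in stats_statements(["sales.orders", "sales.customers"]):
    print(statement + ";")
```

Running the statements for both sides of the join before the query gives the planner row counts and column statistics to work with.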

2. Problem: we have a service interface that executes Impala SQL. How do we make sure every impalad node is synchronized after each statement is executed? (If the SYNC_DDL query option is not set, the nodes of the cluster are not synchronized immediately.)
Solution: following the official recommendation, after our SQL has been executed we set SYNC_DDL to true, create an empty temporary table, drop the temporary table, and then set SYNC_DDL back to false. At that point the cluster is synchronized, and this approach gives the best performance because only the final statements wait for synchronization.
Reference https://docs.cloudera.com/documentation/enterprise/5-15-x/topics/impala.html
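The sequence described above can be sketched as a small Python helper that appends the synchronization steps to a batch of statements. This is only an illustration under stated assumptions: the temporary table name tmp_sync_marker and the example batch are hypothetical, and the statements still have to be sent through a single session, since SET applies per session.

```python
def with_final_sync(statements, temp_table="tmp_sync_marker"):
    """Append a create/drop of an empty table under SYNC_DDL to a batch."""
    return list(statements) + [
        # Make the following DDL wait until all impalad nodes have the change.
        "SET SYNC_DDL=1",
        # Empty throwaway table (hypothetical name); it exists only to trigger the sync.
        f"CREATE TABLE {temp_table} (id INT)",
        f"DROP TABLE {temp_table}",
        # Restore the default so later statements do not pay the waiting cost.
        "SET SYNC_DDL=0",
    ]


# Hypothetical batch, used only for illustration.
batch = ["CREATE TABLE demo.t1 (id INT)", "ALTER TABLE demo.t1 ADD COLUMNS (name STRING)"]
for statement in with_final_sync(batch):
    print(statement + ";")
```

Keeping SYNC_DDL off for the main batch and on only for the last two statements is what keeps the overhead low.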

3. Problem: a table join runs very slowly
Workaround: consider the cost of the broadcast join. A left join broadcasts the right table, so the amount of data broadcast is roughly the size of the right table multiplied by the number of machines that hold data of the left table. Optimize the SQL with this in mind, or reduce the size of the right table by cleaning out unnecessary data.
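To make the estimate concrete, here is a rough Python sketch of the arithmetic. The broadcast formula comes from the note above. The comparison with a partitioned (shuffle) join, estimated here as the sum of both table sizes, is a simplifying assumption of this sketch and not a statement of how the Impala planner computes costs. All sizes and node counts are made up.

```python
def broadcast_bytes(right_table_bytes, left_nodes):
    """Data sent over the network when the right table is copied to every node."""
    return right_table_bytes * left_nodes


def shuffle_bytes(left_table_bytes, right_table_bytes):
    """Assumed cost of a partitioned join: each table crosses the network once."""
    return left_table_bytes + right_table_bytes


def cheaper_strategy(left_table_bytes, right_table_bytes, left_nodes):
    """Name the strategy with the smaller estimated network traffic."""
    broadcast = broadcast_bytes(right_table_bytes, left_nodes)
    shuffle = shuffle_bytes(left_table_bytes, right_table_bytes)
    return "broadcast" if broadcast <= shuffle else "shuffle"


GIB = 1024 ** 3
# Hypothetical case: 2 GiB right table, left table spread over 10 machines.
print(broadcast_bytes(2 * GIB, 10) // GIB, "GiB broadcast")
print(cheaper_strategy(100 * GIB, 2 * GIB, 10))
print(cheaper_strategy(10 * GIB, 2 * GIB, 10))
```

The sketch shows why trimming the right table helps: every byte removed from it is saved once per machine that holds left-table data.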

Origin blog.51cto.com/13665344/2446072