Impala Optimization, Concurrency Performance Issues, and Stress Testing

Background

Impala is an MPP-architecture compute engine. It does not store data itself and relies heavily on memory: all computation happens in memory, so its strengths and weaknesses are both clear.

  • The advantage: with the same CPU, disk I/O, and network I/O, in-memory computation is very fast.

  • The disadvantage: resources are limited; if a workload does not fit in memory it will OOM, and concurrency cannot be very high.

Impala suits low-concurrency, second-level analysis scenarios, on the order of a few seconds to tens of seconds. Filtering, aggregation, and joins over large data volumes all perform well.

Concurrency

        Some people may ask: how low is "low concurrency"? Is there a number, 5? 10? 20? There is no single answer. Different clusters, different resources, and different kinds of SQL all determine the achievable concurrency. With fixed resources, there are two ways to handle SQL failures caused by excessive concurrency:

  1. The application monitors memory usage. Before submitting a SQL statement, check how many statements are running in the cluster and how many are queued, then EXPLAIN the statement to estimate the memory it needs. Based on that, decide whether to submit; if not, wait a while and check again.

  2. Retry after a SQL statement fails. This approach is cruder and wastes some resources.
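A minimal sketch of approach 2: wrap the submission in a retry with capped attempts and exponential backoff. `"$@"` stands in for whatever actually submits the SQL (for example an impala-shell invocation); the delays are illustrative:

```shell
#!/bin/sh
# Sketch of approach 2: retry a submission command with exponential
# backoff. "$@" stands in for the real submission command, e.g. an
# impala-shell invocation; the starting delay is illustrative.
retry_query() {
    max_tries=$1; shift
    delay=1
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@"; then
            return 0                   # query succeeded
        fi
        echo "attempt $try failed, retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))           # back off to ease cluster pressure
        try=$((try + 1))
    done
    return 1                           # all attempts failed
}
```

For example, `retry_query 5 impala-shell -q "$SQL"` would attempt the statement up to five times. The backoff matters: an immediate retry just re-triggers the same memory pressure that failed the query in the first place.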

Resources

Since Impala shares the cluster with other applications in the Hadoop ecosystem, resource contention is inevitable.

Is it useful to isolate cluster resources? Yes, but only to a degree. Early Hadoop 1.x did not support CPU or I/O isolation; Hadoop 2.x supports CPU isolation, and I do not know whether the latest versions support I/O isolation (readers who know are welcome to comment). Why is it limited? Because resource isolation needs to be dynamic: Hive is generally used for offline workloads, while Impala is more likely to serve online business, so their peak times differ. Only by adjusting resources dynamically across those time windows can utilization be maximized, and getting to that point is not that difficult.

Optimization points:

1. SQL optimization: examine the execution plan before running the query

  – Before executing a query, analyze the SQL first and list the detailed plan Impala will use to complete it
  – Commands: EXPLAIN <sql>, PROFILE

Use PROFILE to output the underlying execution information, and tune the environment accordingly.

Use EXPLAIN to view the logical plan and analyze the query before executing it

[9-24-143-25:21000] > explain select ds,count(*) from t_ed_xxxx_newuser_read_feature_n group by ds order by ds;
Query: explain select ds,count(*) from t_ed_xxxx_newuser_read_feature_n group by ds order by ds
+----------------------------------------------------------------------------------------------+
| Explain String                                                                               |
+----------------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=9.94MB                                             |
| Per-Host Resource Estimates: Memory=27.00MB                                                  | 
| PLAN-ROOT SINK                                                                               | 
| 05:MERGING-EXCHANGE [UNPARTITIONED]                                                          |
| |  order by: ds ASC                                                                          | 
| 02:SORT                                                                                      |
| |  order by: ds ASC                                                                          | 
| 04:AGGREGATE [FINALIZE]                                                                      |
| |  output: count:merge(*)                                                                    |
| |  group by: ds                                                                              | 
| 03:EXCHANGE [HASH(ds)]                                                                       | 
| 01:AGGREGATE [STREAMING]                                                                     |
| |  output: sum_init_zero(default.t_ed_xxxx_newuser_read_feature_n.parquet-stats: num_rows) |
| |  group by: ds                                                                              | 
| 00:SCAN HDFS [default.t_ed_xxxx_newuser_read_feature_n]                                    |
|    partitions=372/372 files=2562 size=15.15GB                                                |
+----------------------------------------------------------------------------------------------+

Read the output of EXPLAIN bottom-up:

  • Stage 00: shows the low-level details, such as the scanned table, number of partitions, number of files, and file size. From this information you can estimate roughly how long the query will take
  • Stage 01: the partial aggregation (SUM) runs in parallel on different nodes
  • Stage 03: transfers the results of stage 01
  • Stage 04: merges the partial SUM results
  • Stage 02: sort operations run in parallel on different nodes
  • Stage 05: the sorted results are merged and output

The EXPLAIN plan is also printed in the header of the PROFILE output.

Performance Tuning Using the SUMMARY Report

  The SUMMARY command prints the time spent in each stage, which quickly reveals a query's performance bottleneck. Like the PROFILE output, it is only available after the query finishes and shows actual time consumption. The SUMMARY output also appears in the header of the PROFILE output.

 [9-24-143-25:21000] > select ds,count(*) from t_ed_xxxx_newuser_read_feature_n group by ds order by ds;
 [9-24-143-25:21000] > summary;
+---------------------+--------+----------+----------+-------+------------+----------+---------------+--------------------------------------------+
| Operator            | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail                                     |
+---------------------+--------+----------+----------+-------+------------+----------+---------------+--------------------------------------------+
| 05:MERGING-EXCHANGE | 1      | 3.20s    | 3.20s    | 372   | 372        | 0 B      | 0 B           | UNPARTITIONED                              |
| 02:SORT             | 51     | 517.22us | 2.54ms   | 372   | 372        | 6.02 MB  | 6.00 MB       |                                            |
| 04:AGGREGATE        | 51     | 1.75ms   | 7.85ms   | 372   | 372        | 2.12 MB  | 10.00 MB      | FINALIZE                                   |
| 03:EXCHANGE         | 51     | 2.91s    | 3.10s    | 2.44K | 372        | 0 B      | 0 B           | HASH(ds)                                   |
| 01:AGGREGATE        | 51     | 135.29ms | 474.62ms | 2.44K | 372        | 2.03 MB  | 10.00 MB      | STREAMING                                  |
| 00:SCAN HDFS        | 51     | 1.08s    | 2.58s    | 2.56K | 96.53M     | 1.05 MB  | 1.00 MB       | default.t_ed_xxxx_newuser_read_feature_n |
+---------------------+--------+----------+----------+-------+------------+----------+---------------+--------------------------------------------+

Profiling with PROFILE

  The PROFILE statement produces a detailed low-level report for the most recent query. Unlike EXPLAIN, this information is only generated after the query completes, and it shows physical details for each node, such as the number of bytes read and peak memory consumption.
You can use this information to determine whether the query is I/O-bound or CPU-bound, whether the network is a bottleneck, whether some nodes perform poorly while others perform well, and so on.

2. Select the appropriate file format for storage

Choose an appropriate file format for data storage (e.g. Parquet). For large data volumes, Parquet is usually the best choice.

3. Avoid generating many small files (if other programs produce small files, compact them through an intermediate table)

4. Use appropriate partition technology and calculate according to partition granularity

Select the appropriate partition granularity according to the actual data volume

A good partitioning strategy splits the data physically, so irrelevant data can be skipped at query time, improving efficiency. It is usually recommended to keep the number of partitions under 30,000 (too many partitions also degrade metadata management performance).

Choose the smallest applicable integer type for the partition key

  Although a string can also serve as a partition key (the partition key ultimately becomes an HDFS directory name), using the smallest applicable integer type reduces memory consumption.

5. Choose the right Parquet block size

  By default, the Parquet files created by Impala's INSERT ... SELECT statements use 256 MB blocks (the default was 1 GB before Impala 2.0), and a Parquet file written by Impala contains a single block, so each file can only be processed as a unit by one machine. If your Parquet table has only one or a few partitions, or a query can only access one partition, performance will be very slow, because there is not enough parallel work to exploit Impala's concurrent distributed execution.
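When more parallelism is needed, a smaller file size can be requested at write time through the PARQUET_FILE_SIZE query option. A minimal sketch, assuming a table of your own; the names my_parquet_table / my_staging_table and the 128 MB value are illustrative, not from the original:

```sql
-- Sketch (illustrative table names and size): ask Impala to roll over
-- to a new Parquet file every 128 MB instead of every 256 MB, so a
-- partition yields several files that different hosts can scan.
set PARQUET_FILE_SIZE=134217728;  -- 128 * 1024 * 1024 bytes
insert overwrite table my_parquet_table
select * from my_staging_table;
```

Whether smaller files help depends on the trade-off: more files mean more scan parallelism, but also more metadata and more per-file overhead.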

6. Use COMPUTE STATS to collect table statistics

When performance matters or you query large data volumes, first gather the statistics for the tables involved (e.g. by running COMPUTE STATS).

7. Network I/O optimization: reduce the amount of data transferred to the client
    – a. avoid sending the entire data set to the client
    – b. apply conditional filtering as early as possible
    – c. use LIMIT clauses
    – d. avoid pretty-printed output when writing results to files

We can reduce the amount of data sent to the client in the following ways:

  • Aggregation (e.g. count, sum, max, etc.)
  • Filtering (e.g. WHERE clauses)
  • LIMIT
  • Disabling pretty-printed result formatting (pass these options to impala-shell: -B, --output_delimiter)

8. To refresh a table's metadata, use REFRESH <table name>; do not use impala-shell -r or INVALIDATE METADATA

9. If a SQL statement returns a lot of output, use impala-shell -B to strip unnecessary formatting

Controlling Impala Resource Usage

Sometimes, balancing raw query performance with scalability requires limiting the amount of resources (such as memory or CPU) used by a single query or group of queries. Impala can use several mechanisms to help shed load during periods of heavy concurrent use, thereby speeding up overall query times and sharing resources across Impala queries, MapReduce jobs, and other types of workloads in the cluster:

  • The Impala admission control feature uses a fast distributed mechanism to block queries that exceed limits on the number of concurrent queries or the amount of memory used. Queries are queued and executed when other queries complete and resources become available. You can control concurrency limits and specify different limits for different user groups to divide cluster resources according to the priorities of different user classes. This feature is new in Impala 1.3. See "Admission Control and Query Queues" on page 682 for more information.

  • You can limit the amount of memory Impala reserves during query execution by specifying the -mem_limit option to the impalad daemon. See Modifying Impala Startup Options on page 33 for details. This limit applies only to memory used directly by queries; Impala reserves additional memory at startup, e.g. to hold cached metadata.

  • For production deployments, use cluster management tools to enforce resource isolation.
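As a sketch of the first mechanism, simple admission-control limits for the default pool can be set through impalad startup flags. The flag names below exist in Impala; the values and the CDH-style /etc/default/impala layout are assumptions to adapt to your deployment:

```shell
# Sketch: cap the default pool at 20 running queries, 50 queued
# queries, and 128 GB of aggregate memory (values illustrative).
# Appended to the existing IMPALA_SERVER_ARGS value.
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS:-} \
    --default_pool_max_requests=20 \
    --default_pool_max_queued=50 \
    --default_pool_mem_limit=128g"
```

Queries beyond the request cap wait in the queue instead of failing outright, which is exactly the behavior the two application-side workarounds earlier in this article try to approximate.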

Setting the maximum number of connections in Impala

The maximum number of connections in Impala
I have recently been using Impala at work to query the database. Because the queries are paged, many clients may be requesting Impala connections at the same time, and everything got stuck once the number of requests reached 64. Testing showed that Impala's default limit on requests (i.e. connections) is 64: once 64 connections are open and none are released, a 65th connection cannot be established, which did not meet our requirements. Let's look at how to raise the maximum number of connections.

Modifying Impala's maximum number of connections
Investigation and searching showed that the goal can be achieved with the parameter --fe_service_threads=n: open the Impala configuration file and append --fe_service_threads=n to the value of IMPALA_SERVER_ARGS, where n is the new maximum number of connections. Restart impalad for the change to take effect.
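A sketch of the change, following the CDH-style /etc/default/impala layout (the file path and the value 256 are assumptions; pick n for your workload):

```shell
# Sketch: append the flag to the existing IMPALA_SERVER_ARGS value in
# /etc/default/impala (or your distribution's equivalent), then restart
# impalad. 256 is an illustrative value of n.
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS:-} \
    --fe_service_threads=256"
```

Note that each frontend service thread can hold a client connection, so raising n raises memory pressure on the coordinator; it does not make more queries run faster, it only lets more of them connect.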


Concurrent query slowness caused by IMPALA-3316

1. Create a test table 

create database if not exists iot_test;
use iot_test;
create table if not exists hive_table_test (
ordercoldaily BIGINT, 
smsusedflow BIGINT, 
gprsusedflow BIGINT, 
statsdate TIMESTAMP, 
custid STRING, 
groupbelong STRING, 
provinceid STRING, 
apn STRING ) 
PARTITIONED BY ( subdir STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ;

2. Prepare test data 

The content of the gendata.sh script is as follows:

[root@cdh4 scripts]# cat gendata.sh 
function rand(){  
    min=$1  
    max=$(($2-$min+1))  
    num=$(($RANDOM+1000000000))
    echo $(($num%$max+$min))  
}  
let i=1
while [ $i -le 3 ];
do
 let n=1
 while [ $n -le $1 ];
 do
  let month=$n%12+1
  if [ $month -eq 2 ];then
    let day=$n%28+1
  else
    let day=$n%30+1
  fi  
  let hour=$n%24
  rnd=$(rand 10000 10100) 
  echo "$i$n,$i$n,$i$n,2017-$month-$day $hour:20:00,${rnd},$n,$n,$n" >> data$i.txt
  let n=n+1
 done
let i=i+1
done

Execute ./gendata.sh 300000 to generate 3 test files, each containing 300,000 rows of sample data.

3. Upload test data

Run the upLoadData.sh script to upload the test data to the /tmp/hive directory in HDFS

[root@cdh4 scripts]# cat upLoadData.sh 
#!/bin/sh

num=3
path='/tmp/hive'
#create directory
sudo -u hdfs hdfs dfs -mkdir -p $path
sudo -u hdfs hdfs dfs -chmod 777 $path
#upload file
let i=1
while [ $i -le $num ];
do
  hdfs dfs -put data${i}.txt $path
  let i=i+1
done
#list file
hdfs dfs -ls $path

4. Verify that the data is correct

The three files contain a total of 900,000 rows, consistent with the number of rows generated.

5. Load data into the test table

Execute the ./hivesql_exec.sh loadData.sql command to load data

[root@cdh4 scripts]# cat loadData.sql 
use iot_test;
LOAD DATA INPATH '/tmp/hive/data1.txt' INTO TABLE hive_table_test partition (subdir="10");
LOAD DATA INPATH '/tmp/hive/data2.txt' INTO TABLE hive_table_test partition (subdir="20");
LOAD DATA INPATH '/tmp/hive/data3.txt' INTO TABLE hive_table_test partition (subdir="30");

3. Generate a Parquet table containing a TIMESTAMP column with Hive

1. Create a Parquet table using Hive

The statement that generates the Parquet table is as follows; the statsdate field is of type TIMESTAMP:

[root@cdh4 scripts]# cat genParquet.sql 
use iot_test;
create table hive_table_parquet (
ordercoldaily BIGINT, 
smsusedflow BIGINT, 
gprsusedflow BIGINT, 
statsdate TIMESTAMP, 
custid STRING, 
groupbelong STRING, 
provinceid STRING, 
apn STRING ) 
PARTITIONED BY ( subdir STRING ) 
STORED AS PARQUET;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict; 

insert overwrite table hive_table_parquet partition (subdir) 
select * from hive_table_test;

4. Prepare concurrent test scripts 

1. The concurrency test script is as follows; the Impala load-balancer address is cdh4.macro.com:25003

[root@cdh4 scripts]# cat impala-test.sh 
#!/bin/sh
#Concurrency test
let i=1
while [ $i -le $1 ];
do
 impala-shell -B -i cdh4.macro.com:25003 -u hive -f $2 -o log/${i}.out &
 let i=i+1
done
wait

2. The test SQL statement is as follows:

SELECT 
 nvl(A.TOTALGPRSUSEDFLOW,0) as TOTALGPRSUSEDFLOW, nvl(A.TOTALSMSUSEDFLOW,0) as TOTALSMSUSEDFLOW, B.USEDDATE AS USEDDATE 
FROM ( SELECT SUM(GPRSUSEDFLOW) AS TOTALGPRSUSEDFLOW, SUM(SMSUSEDFLOW) AS TOTALSMSUSEDFLOW, cast(STATSDATE as timestamp) AS USEDDATE 
FROM hive_table_parquet SIMFLOW 
WHERE SIMFLOW.subdir = '10' AND SIMFLOW.CUSTID = '10099' 
 AND cast(SIMFLOW.STATSDATE as timestamp) >= to_date(date_sub(current_timestamp(),7)) 
 AND cast(SIMFLOW.STATSDATE as timestamp) < to_date(current_timestamp()) 
 GROUP BY STATSDATE ) A 
RIGHT JOIN ( 
 SELECT to_date(date_sub(current_timestamp(),7)) AS USEDDATE UNION ALL
 SELECT to_date(date_sub(current_timestamp(),1)) AS USEDDATE UNION ALL
 SELECT to_date(date_sub(current_timestamp(),2)) AS USEDDATE UNION ALL
 SELECT to_date(date_sub(current_timestamp(),3)) AS USEDDATE UNION ALL
 SELECT to_date(date_sub(current_timestamp(),4)) AS USEDDATE UNION ALL
 SELECT to_date(date_sub(current_timestamp(),5)) AS USEDDATE UNION ALL
 SELECT to_date(date_sub(current_timestamp(),6)) AS USEDDATE 
) B on to_date(A.USEDDATE) = to_date(B.USEDDATE) ORDER BY B.USEDDATE

 5. Impala concurrency test

The same test SQL was used in each concurrency scenario. To avoid one-off results, each of the three scenarios was tested three times.

1. Test 1 concurrent query: the result returns in about 1 second

The first test: result returned in 1.09 seconds

The second test: result returned in 0.76 seconds

The third test: result returned in 0.78 seconds

A single query returns its result within about a second.

2. Test 10 concurrent queries: all queries complete within 6.8 seconds

First test: all concurrent queries finished in 6.4 seconds

Second test: all concurrent queries complete in 6.8 seconds

Third test: all concurrent queries complete in 6.8 seconds

Under 10 concurrent queries, Impala's query performance has already dropped noticeably.

3. Test 30 concurrent queries: the longest query takes 12.24 seconds

The first test: the first six queries all completed within 5 seconds, but as concurrency increased the remaining queries took longer to return; the longest took 11.81 seconds.

The second test: the first four queries completed within 5 seconds; the longest of the 30 concurrent queries took 12.24 seconds.

The third test: the first five queries completed within 5 seconds; the longest of the 30 concurrent queries took 12.20 seconds.

Conclusion:

The concurrency tests show that at 30 concurrent queries Impala's performance drops sharply; that is, as the number of concurrent queries grows, query performance worsens.

If a Parquet table is generated by Hive/Spark and contains a TIMESTAMP column, and Impala's advanced configuration includes the --convert_legacy_hive_parquet_utc_timestamps=true option, then concurrent query performance degrades as concurrency rises, and the higher the concurrency, the more severe the degradation. From the test results in the previous section: a single query returns in about a second, 10 concurrent queries finish in roughly 7 seconds, and 30 concurrent queries take over 12 seconds.

This performance problem is caused by IMPALA-3316 (https://issues.apache.org/jira/browse/IMPALA-3316). When Impala reads a Parquet table generated by Hive or Spark, if the table contains a TIMESTAMP column and Impala is started with --convert_legacy_hive_parquet_utc_timestamps=true, Impala calls the Linux time-conversion function localtime_r to convert each timestamp to the system's local time (by default, Impala performs no conversion and treats timestamps as UTC). Internally, however, localtime_r takes a process-wide global lock, so heavy concurrent Parquet reads contend on it; the higher the concurrency, the worse the lock contention, and the more severe the performance degradation.
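The conversion at the heart of the issue can be illustrated with date(1): the same stored instant rendered as UTC (Impala's default) and converted to a local zone (what the flag turns on, via localtime_r). The epoch value and the Asia/Shanghai zone are arbitrary examples; GNU date is assumed:

```shell
# Sketch: one instant, two renderings. By default Impala treats Parquet
# timestamps as UTC; with --convert_legacy_hive_parquet_utc_timestamps=true
# it converts each value to local time through localtime_r, whose
# process-wide lock is the contention point described above.
epoch=1500000000
date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S'                  # UTC rendering
TZ=Asia/Shanghai date -d "@$epoch" '+%Y-%m-%d %H:%M:%S'    # local rendering
```

The conversion itself is cheap; the problem is purely that every row's conversion serializes on one lock across all concurrently scanning threads.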

6. Suggested solutions


Until this bug is fixed in Impala, we recommend three ways to work around the problem:

1. If you do not need Impala to return local time, remove the --convert_legacy_hive_parquet_utc_timestamps=true startup option

2. Generate the affected Parquet tables with Impala instead

3. When generating Parquet tables with Hive/Spark, represent times as STRING in the format yyyy-MM-dd HH:mm:ss.SSS or yyyy-MM-dd HH:mm:ss. When Impala's date/time functions are applied to such strings, Impala automatically converts them to TIMESTAMP

Troubleshooting Cases of Impala Concurrency Performance Problems

1. Introduction to the problem

During a performance test of Impala, the results showed that Impala's concurrency performance was very poor.

1.1 Environment information
The test environment was configured as follows:
  • Server memory: 250 GB
  • CPU: 2 sockets, 6 physical cores each, 24 logical cores in total
  • Network: 10 GbE
  • Nodes: 3
  • Data: a 100 GB TPC-DS data set, imported into Hive tables in Parquet format

1.2 Query SQL

select ss_quantity, ss_list_price, ss_coupon_amt, ss_sales_price, ss_wholesale_cost, ss_ext_list_price
from store_sales
where ss_sold_date_sk > 20 and (ss_item_sk between 10 and 5000)
and ((ss_cdemo_sk between 100 and 3000 or ss_store_sk between 10 and 3000))
and (ss_addr_sk > 100 or ss_promo_sk < 3000)
limit 100

1.3 Test results

Concurrency test results:

| Concurrency   | 4    | 6    | 8    | 10   | 15   | 20    |
| Avg time (ms) | 2305 | 3435 | 4694 | 5868 | 8803 | 11679 |

From these results, the average query time grows roughly linearly with concurrency: a query that averages about 2 seconds at 4 concurrent users takes nearly 12 seconds at 20. Such performance falls short of expectations.

2. Problem analysis

To find out why simple filter queries slow down so much under high concurrency, we listed the following investigation directions:
1. Monitor server resources to see whether the slowdown is caused by insufficient resources.
2. If resources are sufficient, analyze the query's profile, break down the time spent in each stage of execution, find the long-running stage, and analyze it further.

2.1 View server resources

We monitored resource usage during the test runs through the Prophet platform.
CPU: as concurrency increases, CPU usage stays stable, averaging about 24%.

Memory: usage starts at about 50 GB and stays stable as concurrency increases, growing by only about 400 MB.

Disk and network I/O consumption is also minimal. From this monitoring we can conclude that CPU, memory, disk I/O, and network I/O are all ample, ruling out insufficient resources as the cause.

2.2 Profile analysis

Open the impalad web UI and go to the queries page, which lists the queries currently running and recently executed on this impalad.

The Last 25 Completed Queries section at the bottom of the page shows the most recent completed queries. Find the query you want to analyze and select Detail.

On the Detail page, select the profile option to see a per-stage breakdown of where the query spent its time.

In this case, the key information extracted from the profile is as follows:

As the figure above shows, the SQL took 19s016ms to execute, of which the Single node plan created step alone took 18s05ms.

Clearly, generating the single-node plan is the culprit behind the query performance degradation, and the higher the query concurrency, the longer this step takes. Why does this happen? What is the root cause?

2.3 Analysis of Arthas

It can be found from the source code of impala that the code for generating a single-node plan is as follows:

The execution-plan generation code lives in the frontend (fe) and is written in Java. Since it is Java, the diagnostic tool Arthas can be used for further analysis.


In the Arthas installation directory, start it with java -jar arthas-boot.jar, then attach to the impala process.

First, check the thread state by running the thread command.

Many threads in the impala process are blocked at this point.
Run thread 1331 to view the call stack of one of the blocked threads; it prints the following:

"Thread-649" Id=1331 BLOCKED on org.apache.hadoop.conf.Configuration@3a2996ef owned by "Thread-616" Id=1270
    at app//org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1424)
    -  blocked on org.apache.hadoop.conf.Configuration@3a2996ef
    at app//org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:706)
    at app//org.apache.hadoop.conf.Configuration.get(Configuration.java:1183)
    at app//org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1774)
    at app//org.apache.hadoop.hdfs.client.impl.DfsClientConf.<init>(DfsClientConf.java:248)
    at app//org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:301)
    at app//org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:285)
    at app//org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:168)
    at app//org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3237)
    at app//org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475)
    at app//org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
    at app//org.apache.impala.planner.HdfsScanNode.computeScanRangeLocations(HdfsScanNode.java:893)
    at app//org.apache.impala.planner.HdfsScanNode.init(HdfsScanNode.java:413)
    at app//org.apache.impala.planner.SingleNodePlanner.createHdfsScanPlan(SingleNodePlanner.java:1335)
    at app//org.apache.impala.planner.SingleNodePlanner.createScanNode(SingleNodePlanner.java:1395)
    at app//org.apache.impala.planner.SingleNodePlanner.createTableRefNode(SingleNodePlanner.java:1582)
    at app//org.apache.impala.planner.SingleNodePlanner.createTableRefsPlan(SingleNodePlanner.java:826)
    at app//org.apache.impala.planner.SingleNodePlanner.createSelectPlan(SingleNodePlanner.java:662)
    at app//org.apache.impala.planner.SingleNodePlanner.createQueryPlan(SingleNodePlanner.java:261)
    at app//org.apache.impala.planner.SingleNodePlanner.createSingleNodePlan(SingleNodePlanner.java:151)
    at app//org.apache.impala.planner.Planner.createPlan(Planner.java:117)
    at app//org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1169)
    at app//org.apache.impala.service.Frontend.getPlannedExecRequest(Frontend.java:1495)
    at app//org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:1359)
    at app//org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:1250)
    at app//org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1220)
    at app//org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:154)

The highlighted frames show that the thread is blocked while generating the single-node plan, in SingleNodePlanner.createScanNode; the final blocking point is Configuration.getOverlay, a method from the Hadoop jar. Its code is as follows:

This is a synchronized method, so with multiple threads there is lock contention, which causes the observed thread blocking.

3. Problem Solving

So why are there multiple threads calling this method?

As the call stack above shows, a FileSystem is created while building the single-node plan. The FileSystem creation method is as follows:

The code shows that when fs.hdfs.impl.disable.cache is true, every Impala query calls createFileSystem(uri, conf) to initialize a fresh HDFS client, which eventually calls the synchronized getOverlay method; under concurrency this is where threads block. When the setting is false, the cached FileSystem is reused and the blocking is avoided. Therefore, changing the fs.hdfs.impl.disable.cache item in hdfs-site.xml to false solves the problem.
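The fix amounts to making sure the property sits at its default in hdfs-site.xml. A sketch of the fragment (the property name is the standard Hadoop one; check that nothing else in your deployment overrides it):

```xml
<!-- Keep the client-side FileSystem cache enabled so concurrent plan
     generation reuses one HDFS client instead of rebuilding it per
     query. false is also the Hadoop default. -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>false</value>
</property>
```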

We had encountered fs.hdfs.impl.disable.cache in earlier work. Its default value is false, meaning the cache is used: the first call to FileSystem.get(URI.create(path), conf) creates the client, and subsequent calls reuse the cached one. That is why caching was the first suspect.

Note that changing fs.hdfs.impl.disable.cache away from its default is not recommended in general, as it may cause other problems.

4. Effect verification

After setting fs.hdfs.impl.disable.cache to false, Arthas monitoring showed no more blocked threads under concurrent queries. Queries that previously averaged 20 s at 100 concurrency now average 4 s, a large improvement in concurrency performance.

Hive exception: partition data overwritten because of the fs.hdfs.impl.disable.cache parameter

Problem Description:

Given an existing (external or managed) table test, create a new partition with an explicit data location, as follows:
alter table test add partition(day='20140101')
location '20140101';

By default this generates a directory of the form /{warehouse}/test/20140101/ under the table's warehouse path. At the same time, desc formatted test partition(day='20140101') shows the corresponding location as
hdfs://..:../{warehouse}/test/20140101/

Then use insert overwrite to insert data into the partition
insert overwrite table test partition (day='20140101') 
select xx from xx....;

Under normal circumstances everything is fine, but when fs.hdfs.impl.disable.cache is set to true, the following happens:
desc formatted test partition(day='20140101') now shows the location in this format:
hdfs://..:../{warehouse}/test/day=20140101/
At the same time, a new directory /{warehouse}/test/day=20140101/ is created on HDFS, and the partition's previous location path, /{warehouse}/test/20140101/, is deleted.
 

Impala Join strategy and execution plan generation

As groundwork, this article first gives a brief introduction to Broadcast Join and Partitioned Join.

Broadcast Join

As the name suggests, a Broadcast Join performs the join by broadcasting. Taking the figure below as an example and assuming the join is SELECT ... FROM A JOIN B ON A.id = B.id, a Broadcast Join broadcasts the B table (the right-hand side) to every node that scans the A table; each impalad holding part of A joins it against the complete copy of B it received, and the results are then aggregated into the full join result (as shown below).
 

This join method is suitable for joining a large table with a small one. In the case shown in the figure below, a table of more than 100 GB is joined with a table of tens of KB (BROADCAST in the green box indicates a broadcast join); the network cost of broadcasting the small table is almost negligible, and the in-memory hash cost is also very small. A Partitioned Join here would instead add considerable network transmission cost.
 

 Partitioned Join

Unlike Broadcast Join, Partitioned Join is better suited to joining two large tables (for two small tables, either join type is generally fast enough). The principle is shown in the figure below. Still assuming SELECT ... FROM A JOIN B ON A.id = B.id, a Partitioned Join partitions both A and B on the join key, so rows with the same key are sent to the same impalad. Each impalad then holds matching shuffled fragments of both tables, performs a hash join on these smaller blocks, and the results are aggregated at the end (as shown in the figure below).

The figure below shows an example of a Partitioned Join. Compared with Broadcast Join, the plan contains one extra EXCHANGE, because a Partitioned Join does not send a single table over the network by itself; instead it shuffles and sends the data of both tables. The figure also shows that the two tables are close in size, which is why Impala chose a Partitioned Join here.

In extreme cases, if a large table is broadcast by mistake, it places a huge burden on the network; the affected queries become very slow, and such long-running slow queries also delay the normal execution of other queries waiting in the queue.

How to choose Broadcast vs Partitioned

For a big-data query engine, every step must be weighed for its impact on performance; otherwise the consequences under large data volumes can be unpredictable. Impala has a CBO (Cost-Based Optimizer) mechanism to guide execution-plan generation; the CBO shapes not only the choice of join strategy but the whole plan. This article first covers how the join strategy is chosen within the execution plan.

The impact of CBO on the JOIN strategy

For the join-strategy decision, Impala performs a series of cost-based calculations: it computes the sum of the network cost and the memory cost for each of the two join types, and picks the cheaper strategy.

Cost Calculation of Broadcast Join
The cost calculation of broadcast JOIN in Impala can be summarized as the following formula:

cost_broadcast = 2 × size_rhs × instNum_lhs
Here size_rhs is estimated from the cardinality of the right table and the average serialized row size, i.e. the amount of right-table data to be transmitted over the network. instNum_lhs is the number of impalads hosting the left-table scan, determined by MT_DOP (the multithreading parameter, i.e. number of threads) and the number of impalads scanning the left table. Their product is the amount of right-table data that must be sent in one broadcast; the memory cost equals the network cost (also the product of size_rhs and the number of impalads), so multiplying by 2 gives the total cost. In a few cases the broadcast cost cannot be computed correctly and is set to -1:
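As a rough illustration (a sketch, not Impala's actual code), the broadcast formula can be written as:

```python
def broadcast_cost(size_rhs, inst_num_lhs):
    """Sketch of cost_broadcast = 2 * size_rhs * instNum_lhs.

    size_rhs: estimated bytes of the right table (cardinality * avg
        serialized row size); -1 when statistics are missing.
    inst_num_lhs: number of impalad instances scanning the left table.
    The factor 2 counts the network cost plus the equal memory cost.
    """
    if size_rhs < 0 or inst_num_lhs <= 0:
        return -1  # cost cannot be computed, e.g. missing rhs statistics
    return 2 * size_rhs * inst_num_lhs
```

For example, broadcasting a 50,000-byte right table to 4 left-side instances costs 2 × 50,000 × 4 = 400,000.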

1) Statistics are missing for the right table. Before computing the cost, Impala obtains the cardinality of the right table, because the right side may be a base table or the intermediate result of a subquery. For a directly scanned table, the table cardinality is fetched and checked for -1. For a subquery, no matter what kind of join the subquery's SQL contains, if statistics are missing for any table inside it, Impala sets the subquery's result cardinality to -1, and the broadcast cost is then also set directly to -1. Missing statistics on the left table do not affect the broadcast cost calculation, but if the current query is itself a subquery, its result cardinality becomes -1. (Cardinality is the key indicator Impala uses to gauge the size of a table or intermediate result. It is generally computed from the total row count and the data types of all columns, stored in the metastore, fetched from there when needed, and adjusted dynamically as data is inserted or deleted; a table with missing statistics has cardinality -1.)

2) The number of SCAN instances for the left table could not be obtained. Even if the statistics check in 1) passes, if the left-side scan instance count is not obtained correctly, the broadcast cost is also set to -1.

Cost Calculation of Partition Join
The partition join cost can be summarized by the following formulas:

cost_partition = networkCost_lhs + networkCost_rhs + size_rhs
networkCost = cardinality × avgSerializedRowSize
Here networkCost_lhs and networkCost_rhs are the network costs of the left and right sides, and size_rhs is again the size of the right table, i.e. the memory cost. Each network cost is the cardinality multiplied by the average serialized row size (estimated from statistics). Before computing the cost, two checks are made:
1) Statistics must not be missing for either table (or sub-plan result); only then is 2) checked;
2) Based on the existing partitioning and the join key, determine whether data actually needs to be transmitted over the network; if not, the network cost is 0.
Then the cost is computed: if check 2) makes both lhs and rhs network costs 0, the total cost is simply the memory cost; otherwise the lhs and rhs network costs are the estimated table sizes, i.e. cardinality multiplied by the average serialized row size, and the components are summed for the total cost.
Since Impala generally chooses a Partitioned Join when two similarly sized large tables are joined, the size of one table is used directly as the memory cost. Comparing the two cost formulas: when the same pair of tables is joined with few impalads, the costs of Broadcast Join and Partitioned Join may be close, but when many impalads are involved, the broadcast cost can exceed the partition cost. So the choice between the two should not be static; it must be judged case by case.
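Putting the two formulas side by side (illustrative numbers only, not real Impala internals) shows how the instance count tips the balance:

```python
def broadcast_cost(size_rhs, inst_num_lhs):
    # cost_broadcast = 2 * size_rhs * instNum_lhs (network + memory)
    return 2 * size_rhs * inst_num_lhs

def partition_cost(size_lhs, size_rhs):
    # cost_partition = networkCost_lhs + networkCost_rhs + size_rhs,
    # where each size is cardinality * avg serialized row size
    return size_lhs + size_rhs + size_rhs

# two similarly sized tables (hypothetical byte counts)
size_lhs, size_rhs = 120_000, 100_000
# 1 left-side instance: broadcast (200_000) beats partition (320_000)
assert broadcast_cost(size_rhs, 1) < partition_cost(size_lhs, size_rhs)
# 4 left-side instances: broadcast (800_000) loses to partition (320_000)
assert broadcast_cost(size_rhs, 4) > partition_cost(size_lhs, size_rhs)
```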

Strategy choice

After computing the costs of both broadcast and partitioned joins, the better strategy can be chosen by comparison, subject to some special cases:
1) When the join type is RIGHT OUTER JOIN / RIGHT SEMI JOIN / RIGHT ANTI JOIN / FULL OUTER JOIN, Partitioned Join is selected directly;
2) When the join type is NULL AWARE LEFT ANTI JOIN, Broadcast Join is selected directly;
3) When a hint is present, the join method given in the hint is used;
4) When the two costs are equal, or either of them is -1, the configured default join method is used (modifiable in the configuration file or via SET).
Otherwise the costs are compared. If the Partitioned Join cost is lower, Partitioned Join is used. If the Broadcast Join cost is lower, Broadcast Join is chosen, provided that:
1) mem_limit is unlimited, or the rhs size is smaller than mem_limit (mem_limit is modifiable in the configuration file or via SET);
2) broadcast_bytes_limit is unlimited, or the rhs size is smaller than broadcast_bytes_limit (this parameter, modifiable in the configuration file or via SET, prevents transmitting overly large data over the network during a Broadcast Join; the default is about 34 GB).
Note that both broadcast and partitioned joins require equi-join keys; otherwise a NESTED LOOP JOIN is performed.
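The selection rules above can be condensed into a small decision function (a sketch with made-up names, not Impala's internal API):

```python
def choose_join_strategy(join_type, hint, cost_broadcast, cost_partition,
                         size_rhs, mem_limit=None, broadcast_bytes_limit=None,
                         default="BROADCAST"):
    # 1) these join types always use a partitioned join
    if join_type in ("RIGHT OUTER", "RIGHT SEMI", "RIGHT ANTI", "FULL OUTER"):
        return "PARTITIONED"
    # 2) null-aware left anti join always broadcasts
    if join_type == "NULL AWARE LEFT ANTI":
        return "BROADCAST"
    # 3) an explicit hint wins
    if hint:
        return hint
    # 4) equal or unknown (-1) costs fall back to the configured default
    if cost_broadcast == cost_partition or -1 in (cost_broadcast, cost_partition):
        return default
    if cost_partition < cost_broadcast:
        return "PARTITIONED"
    # broadcast is cheaper, but only usable if the rhs fits both limits
    fits_mem = mem_limit is None or size_rhs < mem_limit
    fits_net = broadcast_bytes_limit is None or size_rhs < broadcast_bytes_limit
    return "BROADCAST" if fits_mem and fits_net else "PARTITIONED"
```

For instance, a cheaper broadcast that exceeds mem_limit still falls back to a partitioned join.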

Order of JOIN keys

When Impala's FE side generates the single-node execution plan, it has a built-in cost value for each operation (the various SQL operators, AND/OR, [NOT] IN, etc., as shown in the figure below). Whether the input is a subquery's intermediate result or a base table, the current cost can be computed with a specific algorithm, and the execution order is then optimized by sorting. Since both sides of a join may be tables or subquery results, the cost of every basic operation must be calculated. Note that if only one impalad is detected, the single-node plan is executed directly and no distributed plan is generated.


The cost of each operation is computed from its built-in value and its selectivity. Selectivity can be understood as the ratio of data remaining after the operation to the original data, and the actual value is calculated from statistics. The default selectivity is 0.1; when table statistics are missing, it is corrected as:

selectivity = exp(ln 0.1 / num)
where num is the number of tables with missing statistics. During cost calculation, all operations of the current query are traversed; on each pass the lowest-cost operation is selected and appended to the result list, and the selectivity is corrected, until all operations have been ordered. The cost calculation can be written as:

cost_total = cost_currentOp + cost_otherOp × selectivity_fixed
selectivity_fixed = selectivity^(1/n)
where cost_total is the cost of the current operation, cost_currentOp is the operation's own cost, cost_otherOp is the total cost of all remaining operations, selectivity_fixed is the current corrected selectivity, and n is the current pass number. Join keys, as part of the SQL, also participate in this ordering optimization, but to get the best result the scanned tables need statistics; otherwise performance may actually get worse.
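The corrected selectivity and the greedy ordering loop can be sketched as follows (illustrative only; Impala's real implementation differs in detail):

```python
import math

def corrected_selectivity(num_missing):
    # selectivity = exp(ln 0.1 / num), equivalently 0.1 ** (1 / num)
    return math.exp(math.log(0.1) / num_missing)

def order_operations(op_costs, selectivity=0.1):
    """Greedy ordering sketch: on pass n, pick the op minimizing
    cost_currentOp + cost_otherOp * selectivity_fixed,
    where selectivity_fixed = selectivity ** (1 / n)."""
    remaining = list(op_costs)
    ordered, n = [], 0
    while remaining:
        n += 1
        sel_fixed = selectivity ** (1.0 / n)
        best = min(remaining,
                   key=lambda c: c + (sum(remaining) - c) * sel_fixed)
        remaining.remove(best)
        ordered.append(best)
    return ordered
```

With one table missing statistics the correction leaves selectivity at 0.1; with two it rises to 0.1^(1/2) ≈ 0.316.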
Combining the processes of 2 and 3, the selection of the JOIN method can be summarized as follows:


Summary

Through cost-based optimization, Impala dynamically decides between broadcast and partitioned joins for the current SQL and optimizes the ordering of the various clauses. When one of the joined tables is large and the other small, a broadcast join is generally preferred; when two large tables are joined, a partitioned join is used. The broadcast join cost depends on the size of the right table and the number of nodes, while the partitioned join cost depends on the sizes of both tables. When statistics are missing and a large table is broadcast by mistake, it severely hurts both the cluster's compute performance and the network. In general, table statistics have a major impact on the quality of the execution plan, so before executing SQL you should compute statistics for all queried tables whenever possible, and use EXPLAIN to check whether the plan is optimal.
 

 How Impala determines the Join strategy

How to determine the join strategy
Like mainstream database and data-warehouse query engines, Impala performs cost-based execution-plan optimization (CBO). Only with sufficient statistics can Impala choose a good execution plan.

This section analyzes how the join cost is computed, to answer the question raised at the beginning of the article. For both join types, the total cost is the amount of data sent over the network plus the amount of data inserted into the hash table.

1. Computing the two join costs (the precondition is that statistics exist; otherwise the cost is -1):

broadcast: send the right-hand Fragment's output to every node on the left, and build a hash table at each node.

Cost: (right Fragment data size + hash table size) × number of instances on the left
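Restating that formulation as a tiny numeric sketch (hypothetical byte counts):

```python
def broadcast_total_cost(rhs_bytes, hash_table_bytes, num_lhs_instances):
    # (right Fragment data size + hash table size) * number of left instances
    return (rhs_bytes + hash_table_bytes) * num_lhs_instances
```

For example, a 100-byte right output with a 150-byte hash table sent to 3 left-side instances costs (100 + 150) × 3 = 750.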


Origin blog.csdn.net/qq_22473611/article/details/126559697