When optimizing offline tasks, there are broadly two approaches. One is parameter tuning, which raises CPU and memory utilization or reduces the spill rate; the other is SQL optimization, which eliminates low-performance operations.
Comparing the two operators, the advantage of get_json_object is that it can handle more kinds of paths: it supports regular expressions, nesting, and multi-level extraction. Its disadvantage is that it returns only one value per call. json_tuple, by contrast, can return multiple values in a single call, but it can only extract keys at the same (top) level of the JSON.
Since json_tuple can fetch multiple values at once while get_json_object must be called repeatedly, does json_tuple perform better? That is exactly what this article explores.
1. Implementation process
The following is the query SQL for get_json_object:
explain select
get_json_object(report_message, '$.msgBody.param') as param,
get_json_object(param, '$.role') as role,
get_json_object(param, '$.cpu_time') as cpu_time,
get_json_object(param, '$.session_id') as session_id,
get_json_object(param, '$.simulcast_id') as simulcast_id,
get_json_object(param, '$.server_addr') as server_ip
from
test_table
where
p_date = '20221012'
The results obtained are as follows:
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: test_table
filterExpr: (p_date = '20221012') (type: boolean)
Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: get_json_object(report_message, '$.msgBody.param') (type: string), get_json_object(param, '$.role') (type: string), get_json_object(param, '$.cpu_time') (type: string), get_json_object(param, '$.session_id') (type: string), get_json_object(param, '$.simulcast_id') (type: string), get_json_object(param, '$.server_addr') (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 100000
Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: 100000
Processor Tree:
ListSink
The execution process is simple: get_json_object is evaluated inside the Select Operator, and the input is roughly 280 million rows.
Next, look at json_tuple. It can be used in two ways: directly in the select list, or together with lateral view.
-- 1. Direct select
explain select
json_tuple(get_json_object(report_message, '$.msgBody.param'), 'role', 'cpu_time', 'session_id', 'simulcast_id', 'server_addr') as (`role`, cpu_time, session_id, simulcast_id, server_addr)
from
test_table
where
p_date = '20221012'
-- 2. Used with lateral view
select
get_json_object(report_message, '$.msgBody.param') as param
from
test_table
lateral view outer json_tuple(param,'role','cpu_time','session_id','simulcast_id','server_addr') tmp
as role,cpu_time,session_id,simulcast_id,server_addr
where
p_date = '20221012'
Look at their respective execution plans, starting with the direct select:
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: test_table
filterExpr: (p_date = '20221012') (type: boolean)
Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: get_json_object(report_message, '$.msgBody.param') (type: string), 'role' (type: string), 'cpu_time' (type: string), 'session_id' (type: string), 'simulcast_id' (type: string), 'server_addr' (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
UDTF Operator
Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
function name: json_tuple
Limit
Number of rows: 100000
Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: 100000
Processor Tree:
ListSink
json_tuple executes inside a UDTF Operator, which is one more operator than get_json_object needs. Now look at the lateral view variant:
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: test_table
filterExpr: (p_date = '20221012') (type: boolean)
Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
Lateral View Forward
Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: param (type: string)
outputColumnNames: param
Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
Lateral View Join Operator
outputColumnNames: _col14, _col24, _col25, _col26, _col27, _col28
Statistics: Num rows: 565175568 Data size: 72186356402 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col14 (type: string), _col24 (type: string), _col25 (type: string), _col26 (type: string), _col27 (type: string), _col28 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 565175568 Data size: 72186356402 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 100000
Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Select Operator
expressions: get_json_object(report_message, '$.msgBody.param') (type: string), 'role' (type: string), 'cpu_time' (type: string), 'session_id' (type: string), 'simulcast_id' (type: string), 'server_addr' (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
UDTF Operator
Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
function name: json_tuple
outer lateral view: true
Lateral View Join Operator
outputColumnNames: _col14, _col24, _col25, _col26, _col27, _col28
Statistics: Num rows: 565175568 Data size: 72186356402 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col14 (type: string), _col24 (type: string), _col25 (type: string), _col26 (type: string), _col27 (type: string), _col28 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 565175568 Data size: 72186356402 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 100000
Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: 100000
Processor Tree:
ListSink
This plan is a bit more complicated. Comparing the three approaches: json_tuple adds a UDTF Operator compared with get_json_object, and the lateral view variant is more complex still, adding a Lateral View Join. Note, however, that despite the name "join", there is no file output or reduce phase before it, so the Lateral View Join is only an in-memory concatenation of columns and involves no shuffle.
So, judging from the execution process alone, json_tuple shows no performance advantage.
2. Source code comparison
Look at the code of get_json_object:
public Text evaluate(String jsonString, String pathString) {
...
...
// Cache extractObject
Object extractObject = extractObjectCache.get(jsonString);
if (extractObject == null) {
if (unknownType) {
try {
// parse jsonString into a JSON array
extractObject = objectMapper.readValue(jsonString, LIST_TYPE);
} catch (Exception e) {
// Ignore exception
}
if (extractObject == null) {
try {
// parse jsonString into a JSON object
extractObject = objectMapper.readValue(jsonString, MAP_TYPE);
} catch (Exception e) {
return null;
}
}
} else {
JavaType javaType = isRootArray ? LIST_TYPE : MAP_TYPE;
try {
extractObject = objectMapper.readValue(jsonString, javaType);
} catch (Exception e) {
return null;
}
}
// cache the parsed JSON node
extractObjectCache.put(jsonString, extractObject);
}
// walk the path expression
for (int i = pathExprStart; i < pathExpr.length; i++) {
if (extractObject == null) {
return null;
}
// descend to the last path level, caching matched field objects
extractObject = extract(extractObject, pathExpr[i], i == pathExprStart && isRootArray);
}
Text result = new Text();
if (extractObject instanceof Map || extractObject instanceof List) {
try {
// serialize the result
result.set(objectMapper.writeValueAsString(extractObject));
} catch (Exception e) {
return null;
}
} else if (extractObject != null) {
result.set(extractObject.toString());
} else {
return null;
}
return result;
}
As the code shows, a given jsonString is parsed only once and then cached, with an LRU eviction policy. So in theory, even calling get_json_object multiple times on the same string does not incur much extra parsing cost.
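To make the caching strategy concrete, here is a minimal sketch of an LRU parse cache built on `LinkedHashMap`'s access-order mode. This is a hypothetical stand-in for Hive's internal cache class, not the actual implementation; it only illustrates the eviction behavior described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache sketch: least-recently-accessed entry is evicted
// once the cache exceeds its capacity, as with the parse cache above.
class ParseCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    ParseCache(int capacity) {
        super(16, 0.75f, true); // accessOrder=true => iteration order is LRU order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least-recently-used entry
    }
}

public class LruDemo {
    public static void main(String[] args) {
        ParseCache<String, String> cache = new ParseCache<>(2);
        cache.put("a", "parsedA");
        cache.put("b", "parsedB");
        cache.get("a");            // touch "a", so "b" becomes the eldest entry
        cache.put("c", "parsedC"); // capacity exceeded: "b" is evicted
        System.out.println(cache.containsKey("b")); // false
        System.out.println(cache.containsKey("a")); // true
    }
}
```

With such a cache keyed by the JSON string, five get_json_object calls on the same row hit the parser only once, which is why repeated calls are cheaper than they look.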
Next, look at the source code of json_tuple:
// 1. initialize: parse the incoming arguments and validate them
public StructObjectInspector initialize(ObjectInspector[] args)
throws UDFArgumentException {
...
...
// construct output object inspector
ArrayList<String> fieldNames = new ArrayList<String>(numCols);
ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(numCols);
for (int i = 0; i < numCols; ++i) {
// column name can be anything since it will be named by UDTF as clause
fieldNames.add("c" + i);
// all returned type will be Text
fieldOIs.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
}
return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
}
// 2. the actual work happens in the process function
public void process(Object[] o) throws HiveException {
...
...
// like get_json_object, the parsed jsonObject is cached, keyed by jsonString
String jsonStr = ((StringObjectInspector) inputOIs[0]).getPrimitiveJavaObject(o[0]);
if (jsonStr == null) {
forward(nullCols);
return;
}
try {
Object jsonObj = jsonObjectCache.get(jsonStr);
if (jsonObj == null) {
try {
jsonObj = MAPPER.readValue(jsonStr, MAP_TYPE);
} catch (Exception e) {
reportInvalidJson(jsonStr);
forward(nullCols);
return;
}
jsonObjectCache.put(jsonStr, jsonObj);
}
...
...
for (int i = 0; i < numCols; ++i) {
if (retCols[i] == null) {
retCols[i] = cols[i]; // use the object pool rather than creating a new object
}
Object extractObject = ((Map<String, Object>)jsonObj).get(paths[i]);
if (extractObject instanceof Map || extractObject instanceof List) {
retCols[i].set(MAPPER.writeValueAsString(extractObject));
} else if (extractObject != null) {
retCols[i].set(extractObject.toString());
} else {
retCols[i] = null;
}
}
// emit the result row
forward(retCols);
return;
} catch (Throwable e) {
LOG.error("JSON parsing/evaluation exception" + e);
forward(nullCols);
}
}
Like get_json_object, json_tuple caches the parsed jsonObject, ensuring each JSON string is parsed only once.
From the source code, then, there is no obvious performance difference between get_json_object and json_tuple.
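The per-row flow of json_tuple can be sketched as: parse the string once (or fetch the parsed object from the cache), then do one map lookup per requested key. The tiny key=value "parser" below is a hypothetical stand-in for the Jackson ObjectMapper the real UDTF uses; the extraction loop mirrors the one in process() above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TupleExtractDemo {
    // Stand-in for JSON parsing: turn "k1=v1;k2=v2" into a map.
    static Map<String, Object> parse(String s) {
        Map<String, Object> m = new HashMap<>();
        for (String kv : s.split(";")) {
            String[] p = kv.split("=", 2);
            m.put(p[0], p[1]);
        }
        return m;
    }

    // One parse feeds every field: each path is just a map lookup,
    // missing keys yield null, as in json_tuple's extraction loop.
    static String[] extractAll(Map<String, Object> obj, List<String> paths) {
        String[] out = new String[paths.size()];
        for (int i = 0; i < paths.size(); i++) {
            Object v = obj.get(paths.get(i));
            out[i] = (v == null) ? null : v.toString();
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> obj = parse("role=host;cpu_time=12");
        String[] cols = extractAll(obj, List.of("role", "cpu_time", "session_id"));
        System.out.println(cols[0]); // host
        System.out.println(cols[1]); // 12
        System.out.println(cols[2]); // null
    }
}
```

Since get_json_object achieves the same "parse once" property through its LRU cache, the two functions end up doing comparable work per row.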
3. Experimental demonstration
Let's confirm with an experiment: run the SQL from the first section and compare the execution metrics.
| | get_json_object | json_tuple | lateral view + json_tuple |
|---|---|---|---|
| mapCpuVcores | 1 | 1 | 1 |
| mapMemMb | 2560 | 2560 | 2560 |
| reduceCpuVcores | 1 | 1 | 1 |
| reduceMemMb | 3072 | 3072 | 3072 |
| Job GC time | 18.14 s | 18.80 s | 20.96 s |
| Job CPU time | 565.72 s | 539.41 s | 604.52 s |
| Job execution time | 42.29 s | 59.20 s | 109.59 s |
| Job memory usage | 74.14% | 73.88% | 73.94% |
| Job CPU usage | 44.74% | 45.07% | 32.05% |
As the numbers show, get_json_object and json_tuple perform about the same. With lateral view + json_tuple, however, the performance loss becomes noticeably larger: execution time roughly doubles.
4. Summary
get_json_object and json_tuple perform essentially the same; the real difference is functional. lateral view suits one-row-to-many-rows scenarios, and lateral view + json_tuple is friendly to developers but carries a high performance cost, so it is not recommended unless you genuinely need nested multi-field parsing combined with an explode-style operation.
That's all for this article. Please point out any mistakes, and feel free to follow my official account, Xianyushuo Data, to discuss data development topics. Thank you.