Day 2: Is Hive's json_tuple faster than get_json_object, and why?

Table of contents

1. Implementation process

2. Source code comparison

3. Experimental demonstration

4. Summary


        When optimizing offline tasks, there are generally two approaches. One is parameter tuning, which aims to maximize CPU and memory utilization or to reduce the spill rate; the other is SQL optimization, which removes or replaces low-performance operations.

        Comparing the two functions json_tuple and get_json_object: get_json_object accepts richer paths (nested fields across multiple levels, array subscripts, and the * wildcard), but it returns only one value per call. json_tuple returns multiple values in one call, but it can only read keys at a single level of the JSON object.

        Since json_tuple fetches multiple values in one call while get_json_object has to be called once per value, does json_tuple perform better? That is exactly what this article explores.
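        To make the functional difference concrete, here is a minimal sketch that models an already parsed JSON document as nested Java maps. The class and method names (JsonAccessSketch, getByPath, getTopLevel) are hypothetical and are not Hive's implementation; the sketch only illustrates "one value along a nested path" versus "several values at one level".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: these helpers are not Hive's implementation.
class JsonAccessSketch {

    // Like get_json_object: follow one dotted path (e.g. "msgBody.param.role")
    // through nested maps and return a single value, or null if a step is missing.
    static Object getByPath(Map<String, Object> doc, String dottedPath) {
        Object current = doc;
        for (String key : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    // Like json_tuple: read several keys in one call, but only at one level.
    static List<Object> getTopLevel(Map<String, Object> doc, String... keys) {
        List<Object> values = new ArrayList<>();
        for (String key : keys) {
            values.add(doc.get(key)); // missing keys yield null
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, Object> param = new LinkedHashMap<>();
        param.put("role", "host");
        param.put("cpu_time", 12);
        Map<String, Object> msgBody = new LinkedHashMap<>();
        msgBody.put("param", param);
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("msgBody", msgBody);

        // One value, reached through a nested path.
        System.out.println(getByPath(doc, "msgBody.param.role"));
        // Several values at once, all taken from the same level.
        System.out.println(getTopLevel(param, "role", "cpu_time", "session_id"));
    }
}
```

        This is also why the json_tuple queries later in the article first call get_json_object to reach the nested param object and only then hand it to json_tuple.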

1. Implementation process

        The following is the get_json_object query:

    explain select
      get_json_object(report_message, '$.msgBody.param') as param,
      get_json_object(param, '$.role') as role,
      get_json_object(param, '$.cpu_time') as cpu_time,
      get_json_object(param, '$.session_id') as session_id,
      get_json_object(param, '$.simulcast_id') as simulcast_id,
      get_json_object(param, '$.server_addr') as server_ip
    from
      test_table
    where
      p_date = '20221012'

        The results obtained are as follows: 

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_table
            filterExpr: (p_date = '20221012') (type: boolean)
            Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: get_json_object(report_message, '$.msgBody.param') (type: string), get_json_object(param, '$.role') (type: string), get_json_object(param, '$.cpu_time') (type: string), get_json_object(param, '$.session_id') (type: string), get_json_object(param, '$.simulcast_id') (type: string), get_json_object(param, '$.server_addr') (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
              Limit
                Number of rows: 100000
                Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 100000
      Processor Tree:
        ListSink

        The execution process is relatively simple. get_json_object is evaluated inside the Select Operator, and the input is about 280 million rows.

        Next, look at json_tuple. There are two ways to use it: directly in the select list, or together with lateral view.

// 1. Direct select
    explain select
      json_tuple(get_json_object(report_message, '$.msgBody.param') ,'role','cpu_time','session_id','simulcast_id','server_addr')  as (`role`,cpu_time,session_id,simulcast_id,server_addr)
    from
      test_table
    where
      p_date = '20221012'

// 2. Used together with lateral view

    select
      get_json_object(report_message, '$.msgBody.param') as param
    from
      test_table
      lateral view outer json_tuple(param,'role','cpu_time','session_id','simulcast_id','server_addr') tmp 
        as  role,cpu_time,session_id,simulcast_id,server_addr
    where
      p_date = '20221012'

        Now look at their execution plans, starting with the direct select:

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_table
            filterExpr: (p_date = '20221012') (type: boolean)
            Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: get_json_object(report_message, '$.msgBody.param') (type: string), 'role' (type: string), 'cpu_time' (type: string), 'session_id' (type: string), 'simulcast_id' (type: string), 'server_addr' (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
              UDTF Operator
                Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
                function name: json_tuple
                Limit
                  Number of rows: 100000
                  Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 100000
      Processor Tree:
        ListSink

        json_tuple is executed in the UDTF Operator, so this plan has one more operator than the get_json_object plan. Now look at the plan when it is used with lateral view.

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_table
            filterExpr: (p_date = '20221012') (type: boolean)
            Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
            Lateral View Forward
              Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: param (type: string)
                outputColumnNames: param
                Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
                Lateral View Join Operator
                  outputColumnNames: _col14, _col24, _col25, _col26, _col27, _col28
                  Statistics: Num rows: 565175568 Data size: 72186356402 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col14 (type: string), _col24 (type: string), _col25 (type: string), _col26 (type: string), _col27 (type: string), _col28 (type: string)
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                    Statistics: Num rows: 565175568 Data size: 72186356402 Basic stats: COMPLETE Column stats: NONE
                    Limit
                      Number of rows: 100000
                      Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
                        table:
                            input format: org.apache.hadoop.mapred.TextInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              Select Operator
                expressions: get_json_object(report_message, '$.msgBody.param') (type: string), 'role' (type: string), 'cpu_time' (type: string), 'session_id' (type: string), 'simulcast_id' (type: string), 'server_addr' (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
                UDTF Operator
                  Statistics: Num rows: 282587784 Data size: 36093178201 Basic stats: COMPLETE Column stats: NONE
                  function name: json_tuple
                  outer lateral view: true
                  Lateral View Join Operator
                    outputColumnNames: _col14, _col24, _col25, _col26, _col27, _col28
                    Statistics: Num rows: 565175568 Data size: 72186356402 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: _col14 (type: string), _col24 (type: string), _col25 (type: string), _col26 (type: string), _col27 (type: string), _col28 (type: string)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                      Statistics: Num rows: 565175568 Data size: 72186356402 Basic stats: COMPLETE Column stats: NONE
                      Limit
                        Number of rows: 100000
                        Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 100000 Data size: 12700000 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.TextInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 100000
      Processor Tree:
        ListSink

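        This plan is a little more complicated. Side by side, the operator chains of the three methods are:

        get_json_object: TableScan -> Select Operator -> Limit -> File Output Operator
        json_tuple: TableScan -> Select Operator -> UDTF Operator -> Limit -> File Output Operator
        lateral view + json_tuple: TableScan -> Lateral View Forward -> two branches (Select Operator; Select Operator -> UDTF Operator) -> Lateral View Join Operator -> Select Operator -> Limit -> File Output Operator

        It can be seen that json_tuple goes through one more operator (the UDTF Operator) than get_json_object, and that adding lateral view makes the plan more complex again by adding the Lateral View Join. Note that although it is called a join, there is no reduce phase and no file output before it, so the Lateral View Join only concatenates rows inside the map task and involves no shuffle.

        So judging from the execution plans alone, json_tuple shows no performance advantage.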
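        The point that a lateral view join is a per-row concatenation rather than a shuffle join can be sketched as follows. This is a hypothetical model of the row semantics, not Hive's operator code; the names LateralViewSketch, lateralView, and splitFirstColumn are invented for illustration, and the toy table function stands in for a UDTF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of lateral view semantics; not Hive's operator code.
class LateralViewSketch {

    // Append the generated columns to the original row's columns.
    static String[] concat(String[] left, String[] right) {
        String[] joined = Arrays.copyOf(left, left.length + right.length);
        System.arraycopy(right, 0, joined, left.length, right.length);
        return joined;
    }

    // Toy table function: one output row per comma-separated item in column 0.
    static List<String[]> splitFirstColumn(String[] row) {
        List<String[]> generated = new ArrayList<>();
        if (!row[0].isEmpty()) {
            for (String item : row[0].split(",")) {
                generated.add(new String[] {item});
            }
        }
        return generated;
    }

    // Each input row is handled on its own: run the table function, then glue
    // every generated row onto the input row. No other row is ever consulted,
    // which is why no shuffle is needed. With outer = true, a row whose table
    // function produced nothing is still emitted, padded with nulls.
    static List<String[]> lateralView(List<String[]> rows,
                                      Function<String[], List<String[]>> udtf,
                                      int udtfWidth,
                                      boolean outer) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            List<String[]> generated = udtf.apply(row);
            if (generated.isEmpty() && outer) {
                out.add(concat(row, new String[udtfWidth]));
            }
            for (String[] g : generated) {
                out.add(concat(row, g));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"a,b"});
        rows.add(new String[] {""});
        for (String[] r : lateralView(rows, LateralViewSketch::splitFirstColumn, 1, true)) {
            System.out.println(Arrays.toString(r));
        }
    }
}
```

        Even without a shuffle, every row is copied and forwarded through extra operators, which is consistent with the higher cost measured for this variant in the experiment section.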

2. Source code comparison

        Look at the code of get_json_object:

  public Text evaluate(String jsonString, String pathString) {
    ...
    ...
    // Cache extractObject
    Object extractObject = extractObjectCache.get(jsonString);
    if (extractObject == null) {
      if (unknownType) {
        try {
          // Parse jsonString into a JSON array (List)
          extractObject = objectMapper.readValue(jsonString, LIST_TYPE);
        } catch (Exception e) {
          // Ignore exception
        }
        if (extractObject == null) {
          try {
            // Parse jsonString into a JSON object (Map)
            extractObject = objectMapper.readValue(jsonString, MAP_TYPE);
          } catch (Exception e) {
            return null;
          }
        }
      } else {
        JavaType javaType = isRootArray ? LIST_TYPE : MAP_TYPE;
        try {
          extractObject = objectMapper.readValue(jsonString, javaType);
        } catch (Exception e) {
          return null;
        }
      }
      // Cache the parsed JSON node
      extractObjectCache.put(jsonString, extractObject);
    }

    // Walk the path expression one level at a time
    for (int i = pathExprStart; i < pathExpr.length; i++) {
      if (extractObject == null) {
          return null;
      }
      // Descend to the last level, caching the matched field object along the way
      extractObject = extract(extractObject, pathExpr[i], i == pathExprStart && isRootArray);
    }

    Text result = new Text();
    if (extractObject instanceof Map || extractObject instanceof List) {
      try {
        // Result
        result.set(objectMapper.writeValueAsString(extractObject));
      } catch (Exception e) {
        return null;
      }
    } else if (extractObject != null) {
      result.set(extractObject.toString());
    } else {
      return null;
    }
    return result;
  }

        It can be seen that a given jsonString is parsed only once and the parsed object is then cached, with LRU as the eviction strategy. So in theory, calling get_json_object several times on the same JSON string to fetch multiple values does not cause much performance loss, because the later calls hit the cache instead of parsing again.
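        The effect of such a cache can be sketched with a small bounded LRU map built on java.util.LinkedHashMap. Everything here is an assumption made for illustration: the class names, the capacity of 16, and the toy "k=v;k=v" format that stands in for JSON parsing. It is not Hive's cache class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a small bounded cache in front of an expensive parse step.
class ParseCacheSketch {
    static final int CACHE_SIZE = 16; // assumed capacity, for illustration only

    // Access-ordered LinkedHashMap that evicts the least recently used entry.
    static class LruCache<K, V> extends LinkedHashMap<K, V> {
        LruCache() {
            super(32, 0.75f, true); // true = order entries by access
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > CACHE_SIZE;
        }
    }

    private final Map<String, String[]> cache = new LruCache<>();
    int parseCount = 0; // how many times the expensive step actually ran

    // Stand-in for JSON parsing: split "k=v;k=v" into pairs. Runs only on a cache miss.
    String[] parse(String raw) {
        String[] parsed = cache.get(raw);
        if (parsed == null) {
            parseCount++;
            parsed = raw.split(";");
            cache.put(raw, parsed);
        }
        return parsed;
    }

    // Look up one key; repeated calls on the same raw string reuse the cached parse.
    String get(String raw, String key) {
        for (String pair : parse(raw)) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].equals(key)) {
                return kv[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ParseCacheSketch udf = new ParseCacheSketch();
        String row = "role=host;cpu_time=12;session_id=s1";
        udf.get(row, "role");
        udf.get(row, "cpu_time");
        udf.get(row, "session_id");
        // Three lookups on the same string trigger a single parse.
        System.out.println("parses: " + udf.parseCount);
    }
}
```

        The cache only helps when consecutive calls receive the same input string, which is the case when several get_json_object calls in one select read the same column of the same row.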

        Next, look at the source code of json_tuple:

  // 1. Initialization: parses the arguments passed in and validates them
  public StructObjectInspector initialize(ObjectInspector[] args)
      throws UDFArgumentException {
    ...
    ...

    // construct output object inspector
    ArrayList<String> fieldNames = new ArrayList<String>(numCols);
    ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(numCols);
    for (int i = 0; i < numCols; ++i) {
      // column name can be anything since it will be named by UDTF as clause
      fieldNames.add("c" + i);
      // all returned type will be Text
      fieldOIs.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
    }
    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
  }

// 2. The actual work is done in the process function
public void process(Object[] o) throws HiveException {
    ...
    ...

    // As in get_json_object, the result of jsonString -> jsonObject is cached
    String jsonStr = ((StringObjectInspector) inputOIs[0]).getPrimitiveJavaObject(o[0]);
    if (jsonStr == null) {
      forward(nullCols);
      return;
    }
    try {
      Object jsonObj = jsonObjectCache.get(jsonStr);
      if (jsonObj == null) {
        try {
          jsonObj = MAPPER.readValue(jsonStr, MAP_TYPE);
        } catch (Exception e) {
          reportInvalidJson(jsonStr);
          forward(nullCols);
          return;
        }
        jsonObjectCache.put(jsonStr, jsonObj);
      }
    ...
    ...

      for (int i = 0; i < numCols; ++i) {
        if (retCols[i] == null) {
          retCols[i] = cols[i]; // use the object pool rather than creating a new object
        }
        Object extractObject = ((Map<String, Object>)jsonObj).get(paths[i]);
        if (extractObject instanceof Map || extractObject instanceof List) {
          retCols[i].set(MAPPER.writeValueAsString(extractObject));
        } else if (extractObject != null) {
          retCols[i].set(extractObject.toString());
        } else {
          retCols[i] = null;
        }
      }
      // Emit the result row
      forward(retCols);
      return;
    } catch (Throwable e) {
      LOG.error("JSON parsing/evaluation exception" + e);
      forward(nullCols);
    }
  }

        Like get_json_object, json_tuple also caches the jsonObject produced by parsing, so that a given JSON string is parsed only once.

        From the source code, then, there appears to be no obvious performance difference between get_json_object and json_tuple.

3. Experimental demonstration

        An experiment can confirm this. Run the three SQL statements from the first part and compare how they actually execute.

Comparison of the 3 SQL implementations

Metric                      get_json_object   json_tuple   lateral view + json_tuple
mapCpuVcores                1                 1            1
mapMemMb                    2560              2560         2560
reduceCpuVcores             1                 1            1
reduceMemMb                 3072              3072         3072
Job GC time                 18.14 s           18.80 s      20.96 s
Job CPU consumption time    565.72 s          539.41 s     604.52 s
Job execution time          42.29 s           59.20 s      109.59 s
Job memory usage            74.14%            73.88%       73.94%
Job CPU usage               44.74%            45.07%       32.05%

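        It can be seen that get_json_object and json_tuple are close in resource terms: CPU time, GC time, and memory usage differ by only a few percent, although the json_tuple job took longer in wall-clock time in this run (59.20 s versus 42.29 s). With lateral view + json_tuple, by contrast, the performance loss becomes significantly larger: the execution time is about 2.6 times that of get_json_object, and the CPU time is the highest of the three.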
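        The relative costs can be worked out directly from the table. The following small sketch divides each measurement by the get_json_object baseline; the class and method names are hypothetical, and the numbers are copied from the table above (a single run, so the ratios are indicative only).

```java
// Hypothetical helper that turns the table's raw timings into ratios.
class RunRatioSketch {

    // Ratio of a measurement to the get_json_object baseline.
    static double ratio(double value, double baseline) {
        return value / baseline;
    }

    public static void main(String[] args) {
        String[] names = {"get_json_object", "json_tuple", "lateral view + json_tuple"};
        double[] wallSeconds = {42.29, 59.20, 109.59}; // job execution time
        double[] cpuSeconds = {565.72, 539.41, 604.52}; // job CPU consumption time

        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-26s wall x%.2f  cpu x%.2f%n",
                    names[i],
                    ratio(wallSeconds[i], wallSeconds[0]),
                    ratio(cpuSeconds[i], cpuSeconds[0]));
        }
    }
}
```

        Looking at both ratios together separates "the job waited longer" from "the job did more work": CPU time stays within a few percent across all three, while wall-clock time is where lateral view + json_tuple falls behind.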

4. Summary

        The performance of get_json_object and json_tuple is basically the same; the real difference lies in what each function can do. Lateral view is meant for scenarios where one row turns into multiple rows. Lateral view + json_tuple is convenient for developers but carries a high performance cost, so it is not recommended unless you need to parse multiple nested fields together with explode operations.

        That is the end of this article. Please point out any mistakes, and feel free to follow my official account Xianyushuo Data to discuss data development topics. Thank you, everyone.

Origin blog.csdn.net/qq_35590459/article/details/127309760