Spark code readability and performance optimization - Example IX (data transmission and parsing)

1. Introduction

  • Data transmission and parsing is an aspect developers often do not pay much attention to, simply using whatever approach is most convenient. When processing large amounts of data, however, both how the data travels over the network and how it is parsed add up to a real performance cost. Below are a few examples of how to handle data more efficiently.

2. Kryo serialization

  • Kryo serialization is a frequently discussed topic: serialize your data objects to shrink them so they transfer quickly over the network, then deserialize them on arrival.
  • For this to pay off, the cost of serialization/deserialization must be smaller than what you save on network transmission.
  • In practice, the following problems come up when writing this code (a registration sketch follows this list):
    • After setting sparkConf.set("spark.kryo.registrationRequired", "true"), every class that appears in transmitted data must be registered with Kryo
    • When a serialization error occurs, it may be unclear how to register the offending class, because the JVM error message shows the class name in its Java form
      • Scala and Java write array types differently; in Scala, register an array as kryo.register(classOf[Array[String]])
      • The JVM prints inner classes with a $ sign; you can copy that name verbatim and use Java's Class.forName("java.util.HashMap$EntrySet") (works every time ^_^)
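  • For concreteness, here is a minimal Scala sketch of the registration described above (MyRecord is a hypothetical record class standing in for your own types; the other two entries come straight from the tips above):
    import org.apache.spark.SparkConf
    
    // Hypothetical record class carried by the RDD
    case class MyRecord(name: String, age: Int)
    
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fail fast on any class that was not registered
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(
        classOf[MyRecord],
        classOf[Array[String]],                       // Scala syntax for Java's String[]
        Class.forName("java.util.HashMap$EntrySet")   // name copied from the JVM error message
      ))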

3. CSV parsing

  • Raw data in CSV format is quite common; how do we parse it quickly? Let's take comma-delimited data as an example.
  • Say the raw data looks like "name,age,address,gender,..." and you need to filter out records by age; the code is typically written as follows:
    data.filter { line =>
      val fields = line.split(",")
      fields(1).toInt > 16
    }
    
  • The logic is fine, but we do not actually need to split on every comma; when a record has many fields, a lot of work is wasted for nothing. The source of String.split is as follows:
    public String[] split(String regex, int limit) {
        /* fastpath if the regex is a
         (1)one-char String and this character is not one of the
            RegEx's meta characters ".$|()[{^?*+\\", or
         (2)two-char String and the first char is the backslash and
            the second is not the ascii digit or ascii letter.
         */
        char ch = 0;
        if (((regex.value.length == 1 &&
             ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {
            int off = 0;
            int next = 0;
            boolean limited = limit > 0;
            ArrayList<String> list = new ArrayList<>();
            while ((next = indexOf(ch, off)) != -1) {
                if (!limited || list.size() < limit - 1) {
                    list.add(substring(off, next));
                    off = next + 1;
                } else {    // last one
                    //assert (list.size() == limit - 1);
                    list.add(substring(off, value.length));
                    off = value.length;
                    break;
                }
            }
            // If no match was found, return this
            if (off == 0)
                return new String[]{this};
    
            // Add remaining segment
            if (!limited || list.size() < limit)
                list.add(substring(off, value.length));
    
            // Construct result
            int resultSize = list.size();
            if (limit == 0) {
                while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                    resultSize--;
                }
            }
            String[] result = new String[resultSize];
            return list.subList(0, resultSize).toArray(result);
        }
        return Pattern.compile(regex).split(this, limit);
    }
    
  • As the source shows, a delimiter that misses the single-character fast path goes through Pattern.compile, and even the fast path scans the entire line and allocates a substring for every field. That is a lot of wasted work when we only need one field.
  • What we actually want is just the age, the part between the first and second commas, so we can extract it directly:
    public static void main(String[] args) {
        String content = "小明,18,北京,男";
        String age = subStringByIndex(content, ',', 1);
        System.out.println("age = " + age);
    }
    
    /**
     * Returns the substring between the number-th and the (number+1)-th
     * occurrence of ch, scanning the input only once
     */
    public static String subStringByIndex(String content, int ch, int number) {
        StringBuilder builder = new StringBuilder();

        int i = 0;
        while (i < content.length()) {
            char current = content.charAt(i);

            if (current == ch) {
                number--;
                if (number < 0) break; // passed the end of the target field, stop early
            } else {
                if (number == 0) {
                    builder.append(current); // inside the target field, collect the character
                }
            }

            i++;
        }

        return builder.toString();
    }
    
  • For reference: in this example, this way of writing it ran more than four times faster than the split version, and it should also use less memory (not measured precisely yet). A usage sketch in the Spark filter follows.
  • Furthermore, if the length of each field in the CSV data can be agreed in advance, a field can be located directly by index, which is faster still.
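  • As a usage sketch, here is the age filter from earlier rewritten without split, locating just the two commas around the age field via indexOf (a minimal sketch; it assumes no quoted or escaped commas and a numeric age field):
    data.filter { line =>
      val first = line.indexOf(',')             // end of field 0 (name)
      val second = line.indexOf(',', first + 1) // end of field 1 (age)
      first >= 0 && second > first &&
        line.substring(first + 1, second).toInt > 16
    }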

4. JSON parsing

  • A data source that is a JSON-formatted string is also very common; we usually parse it with a JSON framework (FastJson, Gson, Jackson, etc.).
  • The typical problem with JSON parsing: we habitually write a JavaBean and let the framework map the data onto it by reflection, because that is the easiest code to write. However, reflecting every record into a JavaBean is not very efficient.
  • If the JSON format of the data is fixed, it is best to write the parsing code yourself. Here is an example using FastJson.
    • Raw data: {"name":"Bill", "age":"18", "address":"BeiJing"}
    • Wrap it in a JavaBean that provides a factory method (for parsing):
      import com.alibaba.fastjson.JSONObject;
      
      public class Person {
      
          private String name;
      
          private int age;
      
          private String address;
      
          public Person(String name, int age, String address) {
              this.name = name;
              this.age = age;
              this.address = address;
          }
      
          //
          // setters, getters, and toString omitted
          //
          
          public static Person parse(String json) {
              JSONObject jsonObject = (JSONObject) JSONObject.parse(json);
              
              String name = jsonObject.getString("name");
              int age = jsonObject.getInteger("age");
              String address =  jsonObject.getString("address");
      
              return new Person(name, age, address);
          }
          
      }
      
    • Parsing the JSON:
      String json = "{\"name\":\"Bill\" , \"age\":\"18\", \"address\":\"BeiJing\"}";
      Person person = Person.parse(json);
      System.out.println(person);
      
  • When you only need a single field of the JSON for your processing, do not parse the JSON completely into a bean; parse out just that field. For example:
    JSONObject jsonObject = (JSONObject) JSONObject.parse(json);
    int age = jsonObject.getInteger("age");
    

5. Other

  • The same idea improves parsing efficiency for other formats. A few more examples (a byte-array sketch follows this list):
    • For XML, prefer SAX parsing: it processes the document as a stream of events, so everything after the data you need can be left unparsed. It is efficient, though more complex than DOM.
    • For the HTTP protocol, parse the request headers first to check whether the conditions are met (e.g. whether the service exists), and only then decide whether to parse the rest.
    • When the data is a byte array, parse it strictly according to the agreed layout and skip over the parts you do not need.
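  • A minimal sketch of the byte-array case, assuming a hypothetical fixed layout agreed with the producer (4-byte ID, 2-byte name length, name bytes, 4-byte age); only the age is read, everything else is skipped:
    import java.nio.ByteBuffer
    
    // Skip straight to the age field using the agreed field sizes,
    // without decoding the fields we do not need
    def parseAge(record: Array[Byte]): Int = {
      val buf = ByteBuffer.wrap(record)
      buf.position(4)                          // skip the 4-byte ID
      val nameLen = buf.getShort()             // read the 2-byte name length
      buf.position(buf.position() + nameLen)   // skip the name bytes
      buf.getInt()                             // read the 4-byte age
    }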
  • About protobuf
    • protobuf may be somewhat faster than Kryo, or about the same. However, every time the data type carried by an RDD changes, you must modify the protobuf schema file and generate a new class. In Spark development, the types transmitted through RDDs change constantly, which makes protobuf hard to live with. Kryo is designed for Java objects and adapts directly as the data type changes (you only need to update a line of registration code), so it is easier to use.
    • protobuf is the right choice when it is the format of the raw data itself. If the raw data is stored in binary protobuf, for example sent to Kafka and then consumed by Spark Streaming, parse it with protobuf directly.
