Getting started with Hadoop serialization

Concept

  1. Serialization converts an object or data structure into a specific format so that it can be transmitted over the network, or stored in memory or in a file
  2. Deserialization is the reverse operation: the object is restored from the serialized data. The focus of serialization
  is on the exchange and transmission of data

Metrics

  1. Size of the serialized data. Because serialized data must be transmitted over the network or stored in memory or files, the smaller the amount of data, the less time storage and transmission take.
  2. Time and CPU cost of serialization and deserialization.
  3. Whether it works across languages and platforms. In current enterprise development, a project often uses different languages in its architecture and implementation. In a heterogeneous networked system, the two ends of a connection may use different languages or different operating systems, for example Java on one end and C++ on the other, or Windows on one end and Linux on the other. In that case the serialized data must be parseable and transmittable between different languages and platforms.

Problems with Java's native serialization / deserialization mechanism

  1. Java's native serialization carries the full object structure (class descriptor, field metadata) in the stream, which results in a large amount of data when serializing many objects.
  2. With Java's native serialization, the object is encoded in a Java-specific bytecode format, so when another language receives the data it cannot parse it, or parsing is difficult. That is, Java's native serialization mechanism does not work across languages or platforms.
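The size overhead is easy to observe with a minimal stdlib-only sketch (the class and field names here are invented for the illustration): serializing an object whose only payload is a 4-byte int produces a stream many times that size, because the class descriptor travels with the data.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class NativeSerializationSize {
    // A minimal serializable class: the actual payload is a single int (4 bytes).
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int x = 42;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Point());
        }
        // The stream also carries the class name, serialVersionUID and field
        // descriptors, so it is far larger than the 4-byte payload.
        System.out.println("serialized size: " + bos.size() + " bytes");
    }
}
```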

Common serialization framework (Avro)

  Concept

    1. Avro is a remote procedure call and data serialization framework developed within the Apache Hadoop project.
    2. It uses JSON to define data types and protocols, and serializes data in a compact binary format.
    3. It is mainly used in Hadoop, where it provides a serialization format for persistent data and a wire format for communication between Hadoop nodes and from client programs to Hadoop services.
    4. With Avro, every serialization is performed according to a schema file, which improves performance.
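For example, an Avro schema is itself a JSON document; a minimal sketch of a record schema (the record and field names are illustrative, not from the original text) looks like this:

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.blb.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}
```

Because both the writer and the reader hold this schema, the binary encoding carries only the field values, not field names or type tags, which is what keeps it compact.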

  Features

    1. Rich data structure types: 8 primitive types and 6 complex types.
    2. A fast, compressible binary data form.
    3. A container file format for persisting data.
    4. A remote procedure call (RPC) framework.
    5. Simple integration with dynamic languages: when Avro is combined with a dynamic language, neither reading and writing data files nor using the RPC protocol requires code generation; code generation, as an optional optimization, is only worth implementing for statically typed languages.

  Simple types

Type      Explanation
null      No value
boolean   A binary value
int       32-bit signed integer
long      64-bit signed integer
float     Single-precision (32-bit) floating-point number
double    Double-precision (64-bit) floating-point number
bytes     A sequence of 8-bit unsigned bytes
string    A sequence of Unicode characters
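A primitive type name can serve as a complete schema by itself. The object form below and the quoted-string shorthand are equivalent:

```json
{"type": "long"}
```

Inside a larger schema this can also be written simply as "long".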

  Complex types

  1. Avro defines six complex data types, each with its own unique attributes. The table below describes each of them.
  2. Each complex type has its own set of attributes; some are required and some are optional.
  3. The default value of a field in a Record type deserves special mention: when the instance data for a Record schema does not provide a value for a field, the declared default value is used (see the "default" attribute in the table below). For a union, the type of the default value is determined by the first schema in the union definition.

Type     Attribute   Explanation
Record (record)
  name       a JSON string providing the name of the record (required).
  namespace  a JSON string that qualifies the name (optional).
  doc        a JSON string providing documentation to the user of this schema (optional).
  aliases    a JSON array of strings, providing alternate names for this record (optional).
  fields     a JSON array, listing fields (required). Each field is an object with:
    name     a JSON string giving the field's name (required).
    type     a schema, or a string naming a previously defined record (required).
    default  a default value for this field, used when instance data lacks one (optional).
    order    specifies how this field affects sort ordering (optional).
Enum (enum)
  name       a JSON string providing the name of the enum (required).
  namespace  a JSON string that qualifies the name (optional).
  doc        a JSON string providing documentation to the user of this schema (optional).
  aliases    a JSON array of strings, providing alternate names for this enum (optional).
  symbols    a JSON array, listing symbols as JSON strings (required). All symbols in an enum must be unique.
Array (array)
  items      the schema of the array's items (required).
Map (map)
  values     the schema of the map's values (required).
Fixed (fixed)
  name       a string naming this fixed type (required).
  namespace  a string that qualifies the name (optional).
  aliases    a JSON array of strings, providing alternate names for this fixed type (optional).
  size       an integer specifying the number of bytes per value (required).
Union      represented as a JSON array of schemas, e.g. ["null", "string"]; it has no attributes of its own.
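As an illustrative sketch, a single record schema can exercise all six complex types (the record and field names here are invented for the example):

```json
{
  "type": "record",
  "name": "Example",
  "fields": [
    {"name": "tags",   "type": {"type": "array", "items": "string"}},
    {"name": "scores", "type": {"type": "map", "values": "double"}},
    {"name": "suit",   "type": {"type": "enum", "name": "Suit",
                                "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]}},
    {"name": "md5",    "type": {"type": "fixed", "name": "MD5", "size": 16}},
    {"name": "note",   "type": ["null", "string"], "default": null}
  ]
}
```

Note the "note" field: its type is a union, and its default value null matches the first schema in the union, as described above.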

  For example

Avro type      JSON type   Example
null           null        null
boolean        boolean     true
int, long      integer     1
float, double  number      1.1
bytes          string      "\u00FF"
string         string      "foo"
record         object      {"a": 1}
enum           string      "FOO"
array          array       [1]
map            object      {"a": 1}
fixed          string      "\u00ff"

  Examples of use

    POM file

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.blb</groupId>
    <artifactId>avro</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <dependencies>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-tools</artifactId>
            <version>1.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-maven-plugin</artifactId>
            <version>1.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-compiler</artifactId>
            <version>1.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-ipc</artifactId>
            <version>1.8.2</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <!-- Usually target and source are the same, but sometimes, to let the
                         program run on other JDK versions (for a lower target JDK, the
                         source code must avoid syntax the lower JDK does not support),
                         target can differ from source -->
                    <source>1.8</source>       <!-- JDK version of the source code -->
                    <target>1.8</target>       <!-- bytecode version of the generated class files -->
                    <encoding>UTF-8</encoding> <!-- character-set encoding -->
                </configuration>

            </plugin>
            <plugin>
                <groupId>org.apache.avro</groupId>
                <artifactId>avro-maven-plugin</artifactId>
                <version>1.8.2</version>
                <executions>
                    <execution>
                        <phase>generate-sources</phase>
                        <goals>
                            <goal>schema</goal>
                        </goals>
                        <configuration>
                            <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
                            <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
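With the dependencies above on the classpath, a round trip through Avro's binary format can be sketched with the GenericRecord API; the schema string and field names here are illustrative, and in a real project the schema would usually live in a .avsc file compiled by the avro-maven-plugin configured above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroRoundTrip {
    public static void main(String[] args) throws IOException {
        // Parse the schema from a JSON string (it could also be read from a .avsc file).
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Serialize to Avro's compact binary format: only field values are written.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        // Deserialize with the same schema.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord copy = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(copy.get("name") + " " + copy.get("age"));
    }
}
```

Because the schema is known to both sides, the encoded bytes contain no field names at all, in contrast to Java's native serialization shown earlier.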


Origin www.cnblogs.com/zhan98/p/12709324.html