Concept
1. Data serialization converts an object or data structure into a specific format so that it can be transmitted over the network, or stored in memory or in a file.
2. Deserialization is the reverse operation: the original object is restored from the serialized data.
3. The focus of data serialization is the exchange and transmission of data.
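A minimal sketch of this round trip using Java's built-in serialization (the class and field names here are made up for illustration):

```java
import java.io.*;

public class RoundTripDemo {
    // A plain data holder; implementing Serializable opts it into Java's native mechanism.
    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    // Serialize: object -> byte[] (could then be sent over a socket or written to a file).
    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    // Deserialize: byte[] -> object (the reverse operation).
    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        User original = new User("alice", 30);
        User restored = (User) deserialize(serialize(original));
        if (!restored.name.equals("alice") || restored.age != 30)
            throw new AssertionError("round trip failed");
        System.out.println("round trip ok: " + restored.name + ", " + restored.age);
    }
}
```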
Metrics
1. Size of the data after serialization. Serialized data must be transmitted over the network or stored in memory or files, so the smaller the amount of data, the less time storage or transmission takes.
2. The time and CPU consumption of serialization and deserialization.
3. Whether it works across languages and platforms. In current enterprise development, a project often uses several languages in its architecture and implementation. In a heterogeneous network system, the two sides may use different languages or different operating systems, for example Java on one end and C++ on the other, or Windows on one end and Linux on the other. In such cases the serialized data must be parseable and transferable across different languages and platforms.
Problems with Java's native serialization / deserialization mechanism
1. Java's native serialization cannot strip away the object structure (class metadata is written along with the field values), which results in a large amount of data when serializing many objects.
2. With Java's native serialization, the object is encoded in the byte format that Java itself specifies, so when another language receives this data it cannot parse it, or can parse it only with difficulty. In other words, Java's native serialization mechanism cannot be used across languages or platforms.
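The metadata overhead is easy to see by comparing the serialized size against the size of the payload itself; a sketch (the class name is illustrative):

```java
import java.io.*;

public class SizeDemo {
    // The payload is a single 4-byte int, but the serialized stream also carries
    // the class descriptor: class name, field names, types, serialVersionUID.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int x;
        Point(int x) { this.x = x; }
    }

    static int serializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.size();
    }

    public static void main(String[] args) throws Exception {
        int size = serializedSize(new Point(42));
        System.out.println("payload: 4 bytes, serialized: " + size + " bytes");
        if (size <= 4) throw new AssertionError("expected class-metadata overhead");
    }
}
```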
Common serialization framework (Avro)
Concept
1. Avro is a remote procedure call and data serialization framework developed within the Apache Hadoop project.
2. It uses JSON to define data types and communication protocols, and serializes data into a compact binary format.
3. It is mainly used in Hadoop, where it provides a serialization format for persistent data and a wire format for communication between Hadoop nodes and from client programs to Hadoop services.
4. With Avro, each serialization is performed according to a schema file, which can improve performance.
Features
1. Rich data structures: 8 primitive types and 6 complex types.
2. A fast, compressible binary data form.
3. Container files for persistent data.
4. A remote procedure call (RPC) framework.
5. Simple integration with dynamic languages. After combining Avro with a dynamic language, neither reading and writing data files nor using the RPC protocol requires code generation; code generation, as an optional optimization, is only worth implementing in statically typed languages.
Simple types

| Data type | Explanation |
| --- | --- |
| null | No value |
| boolean | A binary value |
| int | 32-bit signed integer |
| long | 64-bit signed integer |
| float | Single-precision (32-bit) floating-point number |
| double | Double-precision (64-bit) floating-point number |
| bytes | Sequence of 8-bit unsigned bytes |
| string | Unicode character sequence |
Complex types
1. Avro defines six complex data types, each of which has its own attributes. The following table describes each complex data type.
2. Each complex data type has its own attributes; some attributes are required and some are optional.
3. The default value of a field in the record type deserves explanation: when a field in the instance data of a record schema is not given a value, the default value is used; see the table below for the attributes involved. For a union, the permitted default value is determined by the first schema in the union definition.
| Type | Attribute | Explanation |
| --- | --- | --- |
| record | name | a JSON string providing the name of the record (required). |
| | namespace | a JSON string that qualifies the name (optional). |
| | doc | a JSON string providing documentation to the user of this schema (optional). |
| | aliases | a JSON array of strings, providing alternate names for this record (optional). |
| | fields | a JSON array, listing fields (required). Each field has the attributes below. |
| | fields: name | a JSON string (required). |
| | fields: type | a schema, or a string naming a defined record (required). |
| | fields: default | a default value for this field, used when the instance data lacks one (optional). |
| | fields: order | specifies how this field impacts sort ordering (optional). |
| enum | name | a JSON string providing the name of the enum (required). |
| | namespace | a JSON string that qualifies the name (optional). |
| | doc | a JSON string providing documentation to the user of this schema (optional). |
| | aliases | a JSON array of strings, providing alternate names for this enum (optional). |
| | symbols | a JSON array, listing symbols as JSON strings (required). All symbols in an enum must be unique. |
| array | items | the schema of the array's items (required). |
| map | values | the schema of the map's values (required). |
| fixed | name | a string naming this fixed type (required). |
| | namespace | a string that qualifies the name (optional). |
| | aliases | a JSON array of strings, providing alternate names for this fixed (optional). |
| | size | an integer specifying the number of bytes per value (required). |
| union | | represented as a JSON array of schemas rather than a JSON object; it has no attributes of its own. |
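As an illustration of these attributes, a small record schema (all names here are made up) with a default value and a union-typed field; note that because the union's first schema is `"null"`, the default for that field must be `null`:

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "doc": "A sample user record.",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int", "default": 0},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```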
For example, Avro values map to JSON as follows:

| Avro type | JSON type | Example |
| --- | --- | --- |
| null | null | null |
| boolean | boolean | true |
| int, long | integer | 1 |
| float, double | number | 1.1 |
| bytes | string | "\u00FF" |
| string | string | "foo" |
| record | object | {"a": 1} |
| enum | string | "FOO" |
| array | array | [1] |
| map | object | {"a": 1} |
| fixed | string | "\u00ff" |
Examples of use
POM file
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.blb</groupId>
<artifactId>avro</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<dependencies>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.8.2</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-tools</artifactId>
<version>1.8.2</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-maven-plugin</artifactId>
<version>1.8.2</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-compiler</artifactId>
<version>1.8.2</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-ipc</artifactId>
<version>1.8.2</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<!-- Usually target and source are kept the same, but sometimes, to let the program run on a lower JDK version (the source then must not use syntax unsupported by that lower JDK), target differs from source -->
<source>1.8</source> <!-- JDK version used by the source code -->
<target>1.8</target> <!-- bytecode version of the generated class files -->
<encoding>UTF-8</encoding> <!-- character-set encoding -->
</configuration>
</plugin>
<plugin>
<groupId>org.apache.avro</groupId>
<artifactId>avro-maven-plugin</artifactId>
<version>1.8.2</version>
<executions>
<execution>
<phase>generate-sources</phase>
<goals>
<goal>schema</goal>
</goals>
<configuration>
<sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
<outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
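With this build configuration, the avro-maven-plugin reads schema files from `src/main/avro` and generates Java classes into `src/main/java` during the generate-sources phase. A minimal schema file it could compile, e.g. `src/main/avro/message.avsc` (all names here are illustrative):

```json
{
  "type": "record",
  "name": "Message",
  "namespace": "com.blb",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "body", "type": "string"}
  ]
}
```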