AVRO format learning summary

1. Introduction

Avro started as a sub-project of Hadoop and is now an independent top-level Apache project. Avro is high-performance middleware based on binary data transmission; it is also used in other Hadoop projects, such as HBase and Hive, for data transmission between client and server. Avro is a data serialization system that provides:

1. Rich data structure types
2. A fast, compressible binary data format
3. A container file for storing persistent data
4. Remote procedure calls (RPC)
5. Simple integration with dynamic languages

Avro has implementations in many programming languages (C, C++, C#, Java, Python, Ruby, PHP). Avro provides functionality similar to systems such as Thrift and Protocol Buffers, but there are some fundamental differences, mainly:

1. Dynamic typing: Avro does not require code generation. The schema is stored together with the data, and because the schema is always available, data can be processed without generated code or static data types. This makes it easier to build generic data-processing systems and languages.
2. Untagged data: since the schema is known when the data is read, very little type information needs to be encoded alongside the data, so the serialized output is smaller.
3. No user-assigned field numbers: when a schema changes, both the old and new schemas are known at processing time, so differences can be resolved by field name.

When Avro is combined with a dynamic language, reading and writing data files and using the RPC protocol require no code generation; code generation, as an optional optimization, is only worth implementing for statically typed languages.

When Avro is used for RPC, the server and client exchange schemas during the connection handshake. Since each side then has the other's complete schema, consistency issues such as identically named fields, missing fields, and extra fields can be resolved easily.

Also, Avro schemas are defined in JSON (a lightweight data-interchange format), which makes them easy to implement in languages that already have a JSON library.

2. Schema

Schemas are represented as JSON objects. A schema defines primitive (simple) data types and complex data types, where each complex data type has its own set of properties, some required and some optional. Users can build rich data structures from these types.

The primitive types are:

| Type | Description |
|---|---|
| null | no value |
| boolean | a binary value |
| int | 32-bit signed integer |
| long | 64-bit signed integer |
| float | single-precision (32-bit) IEEE 754 floating-point number |
| double | double-precision (64-bit) IEEE 754 floating-point number |
| bytes | sequence of 8-bit unsigned bytes |
| string | Unicode character sequence |

Avro defines six complex data types:

  • Record: a record type, a collection of named fields of any type, represented as a JSON object. Records use the type name "record" and support the following attributes (type, name, and fields are required):

type: required attribute; must be "record".
name: required attribute; a JSON string that provides the name of the record.
namespace: optional attribute; a JSON string used to qualify the name attribute.
doc: optional attribute; a JSON string that provides documentation for users of this Schema.
aliases: optional attribute; a JSON array of strings, providing aliases for this record.
fields: required attribute; a JSON array that lists all fields. Each field is a JSON object with the following attributes:

 name: required attribute; the name of the field, a JSON string.
 doc: optional attribute; documentation describing this field for users of this Schema.
 type: required attribute; a JSON object defining a schema, or a JSON string naming a record definition.
 default: optional attribute; the default value of the field, used when an instance that is missing this field is read. The allowed form of the default value is determined by the type of the field's schema, as shown in the table at the end of this section. The default value of a union field corresponds to the first schema in the union. Default values of bytes and fixed fields are JSON strings, where Unicode code points 0-255 map to unsigned 8-bit byte values 0-255.

  • Enum: an enum type. Supports the following attributes:

name: required attribute; a JSON string that provides the name of the enum.
namespace: optional attribute; a JSON string used to qualify the name attribute.
aliases: optional attribute; a JSON array of strings, providing aliases for this enum.
doc: optional attribute; a JSON string that provides documentation for users of this Schema.
symbols: required attribute; a JSON array of strings listing all symbols. All symbols in an enum must be unique; duplicates are not allowed.


  • Array: an array type, a collection of elements that must all have the same schema. Supports a single attribute:

items: the schema of the elements in the array.

  • Map: a map type, an unordered set of key/value pairs. Keys must be strings; values can be of any type, but all values must share the same schema. Supports a single attribute: values, the schema used for the map's values. For example, a map from string keys to long values is declared as: {"type": "map", "values": "long"}

  • Fixed: a fixed type, a fixed-length sequence of 8-bit unsigned bytes. Supports the following attributes:

name: required attribute; a JSON string giving the name of this fixed type.
namespace: optional attribute, as above.
aliases: optional attribute, as above.
size: required attribute; an integer giving the number of bytes per value. For example, a 16-byte fixed can be declared as: {"type": "fixed", "size": 16, "name": "md5"}

  • Union: a union type, a union of schemas, represented as a JSON array whose elements are schemas. For example, ["string", "null"] declares a union whose values may be either a string or null. A union may not contain more than one schema of the same type, except for the named types (record, fixed, and enum): two array types or two map types are not allowed, but two named types with different names are. In other words, a union distinguishes its element schemas by name; since array and map have no name attribute, a union can contain at most one array and one map. (Using the name as the resolver makes reading and writing unions more efficient.) A union may not immediately contain another union.

For example, a linked-list of 64-bit values:

{
  "type": "record", 
  "name": "LongList",
  "aliases": ["LinkedLongs"],                      // old name for this
  "fields" : [
    {"name": "value", "type": "long"},             // each element has a long
    {"name": "next", "type": ["null", "LongList"]} // optional next element
  ]
}

An enum type:

{ "type": "enum",
  "name": "Suit",
  "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]
}

An array type:

{"type": "array", "items": "string"}

A map type:

{"type": "map", "values": "long"}

A fixed type:

{"type": "fixed", "size": 16, "name": "md5"}

The default attribute of a record field deserves a note: when the instance data being read does not provide a value for a field, the field's default value is used. The permitted JSON form of a default value depends on the field's Avro type, as shown in the table below. The default value of a union field is interpreted according to the first schema in the union definition.

| Avro type | JSON type | Example |
|---|---|---|
| null | null | null |
| boolean | boolean | true |
| int, long | integer | 1 |
| float, double | number | 1.1 |
| bytes | string | "\u00FF" |
| string | string | "foo" |
| record | object | {"a": 1} |
| enum | string | "FOO" |
| array | array | [1] |
| map | object | {"a": 1} |
| fixed | string | "\u00ff" |
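As a concrete illustration of how a default value is applied, here is a minimal Java sketch (the User record, its fields, and the default value -1 are invented for this example, and the Avro Java library is assumed to be on the classpath). A record is serialized with a writer schema that has no age field and then deserialized with a reader schema that adds age with a default, so the reader fills in the default:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class DefaultValueDemo {
    // Hypothetical writer schema: records written with it contain only "name".
    private static final Schema WRITER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}");

    // Hypothetical reader schema: adds an "age" field with a default of -1.
    private static final Schema READER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}");

    public static void main(String[] args) throws Exception {
        // Serialize a record using the writer schema (binary encoding).
        GenericRecord user = new GenericData.Record(WRITER);
        user.put("name", "foo");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(WRITER).write(user, encoder);
        encoder.flush();

        // Deserialize with the reader schema: the missing "age" field gets its default.
        BinaryDecoder decoder = DecoderFactory.get()
            .binaryDecoder(new ByteArrayInputStream(out.toByteArray()), null);
        GenericRecord read = new GenericDatumReader<GenericRecord>(WRITER, READER)
            .read(null, decoder);
        System.out.println(read); // {"name": "foo", "age": -1}
    }
}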

3. Serialization/Deserialization

Avro specifies two data serialization encodings: binary encoding and JSON encoding. Binary encoding serializes efficiently and produces compact output, while JSON encoding is generally used for debugging or for web-based applications. TODO
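As a small illustration of the two encodings, the following Java sketch serializes the same record once with the binary encoder and once with the JSON encoder (the Pair record and its fields are invented for this example; the Avro Java library is assumed to be on the classpath):

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class EncodingDemo {
    // Hypothetical schema used only for this demonstration.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Pair\",\"fields\":["
      + "{\"name\":\"key\",\"type\":\"string\"},"
      + "{\"name\":\"value\",\"type\":\"long\"}]}");

    public static void main(String[] args) throws Exception {
        GenericRecord pair = new GenericData.Record(SCHEMA);
        pair.put("key", "a");
        pair.put("value", 1L);
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(SCHEMA);

        // Binary encoding: compact, no field names or type tags in the output.
        ByteArrayOutputStream binOut = new ByteArrayOutputStream();
        Encoder binEncoder = EncoderFactory.get().binaryEncoder(binOut, null);
        writer.write(pair, binEncoder);
        binEncoder.flush();
        System.out.println("binary size: " + binOut.size() + " bytes");

        // JSON encoding: larger but human-readable, convenient for debugging.
        ByteArrayOutputStream jsonOut = new ByteArrayOutputStream();
        Encoder jsonEncoder = EncoderFactory.get().jsonEncoder(SCHEMA, jsonOut);
        writer.write(pair, jsonEncoder);
        jsonEncoder.flush();
        System.out.println("json: " + jsonOut.toString("UTF-8"));
    }
}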

4. Avro Tools

For example, to convert JSON data to Avro data:

$ java -jar /usr/lib/avro/avro-tools.jar fromjson --schema-file twitter.avsc twitter.json > twitter.avro

To set the compression codec:

$ java -jar /usr/lib/avro/avro-tools.jar fromjson --codec snappy --schema-file twitter.avsc twitter.json > twitter.snappy.avro

To convert Avro data back to JSON:
$ java -jar /usr/lib/avro/avro-tools.jar tojson twitter.avro > twitter.json

To get the schema of an Avro file:

$ java -jar /usr/lib/avro/avro-tools.jar getschema twitter.avro > twitter.avsc

To compile an Avro schema into Java classes:

$ java -jar /usr/lib/avro/avro-tools.jar compile schema twitter.avsc java .
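To read such a container file from Java code rather than with avro-tools, a sketch along these lines works (it assumes the twitter.avro file produced by the fromjson command above and the Avro Java library on the classpath); the schema is stored in the file header, so no reader schema has to be supplied:

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadAvroFile {
    public static void main(String[] args) throws Exception {
        // A GenericDatumReader with no schema picks up the writer schema from the file.
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> fileReader =
                 new DataFileReader<>(new File("twitter.avro"), datumReader)) {
            System.out.println("schema: " + fileReader.getSchema());
            for (GenericRecord record : fileReader) {
                System.out.println(record); // each record is printed as JSON
            }
        }
    }
}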

