Introduction to Hadoop Serialization

Concepts

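　　1. Data serialization converts an object or data structure into a specific format so that it can be transmitted over a network or stored in memory or in a file.
　　2. Deserialization is the reverse operation: it restores the object from the serialized data.
　　3. The main purpose of data serialization is data exchange and transmission.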
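
The round trip described above can be sketched in a few lines. This is a minimal illustration that assumes the JDK's built-in `ObjectOutputStream` and `ObjectInputStream`; the class `RoundTripDemo` and its nested `User` type are hypothetical names invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical demo class; the names are made up for illustration.
class RoundTripDemo {

    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int age;

        User(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Serialization: object -> bytes that can be sent or stored.
    static byte[] serialize(Object obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Deserialization: bytes -> a new object with the same state.
    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new User("alice", 30));
        User copy = (User) deserialize(bytes);
        System.out.println(copy.name + " " + copy.age + " (" + bytes.length + " bytes)");
    }
}
```

The byte array returned by `serialize` is what would travel over the network or be written to a file; `deserialize` rebuilds an equivalent object on the receiving side.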

Evaluation criteria

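　　1. Size of the serialized data. Serialized data is transmitted over a network or stored in memory or in a file, so the smaller the data, the less time storage or transmission takes.
　　2. Time and CPU consumed by serialization and deserialization.
　　3. Whether it works across languages and platforms. In enterprise development a single project often uses several languages for its architecture and implementation. In such a heterogeneous network the two ends may use different languages or operating systems, for example Java on one end and C++ on the other, or Windows on one end and Linux on the other, so the serialized data must be parseable across languages and platforms.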

Problems with Java's native serialization/deserialization mechanism

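　　1. Java's native serialization does not reuse the object structure, so serializing many objects produces a large amount of data.
　　2. Java's native serialization encodes an object as a byte stream in a format defined by Java, so programs written in other languages cannot parse the object, or can do so only with difficulty. In other words, Java's native serialization mechanism does not support cross-language or cross-platform use.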
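
To make the size overhead concrete, the following minimal sketch compares the output of Java's native serialization with a hand-written encoding that stores only the field values. The names `SizeComparison` and `Point` are hypothetical, and the compact encoding merely stands in for a schema-based format; it is not Avro itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical demo class; the names are made up for illustration.
class SizeComparison {

    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Java native serialization: stream header + class description + field values.
    static byte[] nativeBytes(Point p) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Field values only: two 4-byte ints, with the structure known to both sides in advance.
    static byte[] compactBytes(Point p) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(p.x);
            out.writeInt(p.y);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        Point p = new Point(3, 4);
        System.out.println("native:  " + nativeBytes(p).length + " bytes");
        System.out.println("compact: " + compactBytes(p).length + " bytes");
    }
}
```

The compact form is always 8 bytes here, while the native form also carries the class name, the `serialVersionUID`, and the field descriptors, which is why it is several times larger for such a small object.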

Common serialization frameworks (Avro)

　　Concepts

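　　　　1. Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project.
　　　　2. It uses JSON to define data types and protocols, and serializes data in a compact binary format.
　　　　3. It is used mainly in Hadoop, where it provides a serialization format for persistent data and a wire format for communication between Hadoop nodes and from client programs to Hadoop services.
　　　　4. With Avro, every serialization is driven by a schema file, which can improve performance.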

　　Features

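　　　　1. Rich data structures: 8 primitive types and 6 complex types.
　　　　2. A fast, compact binary data format.
　　　　3. A container file for storing persistent data.
　　　　4. A remote procedure call (RPC) framework.
　　　　5. Simple integration with dynamic languages: when Avro is used from a dynamic language, reading and writing data files and using the RPC protocol require no code generation. Code generation is an optional optimization that is only worth implementing for statically typed languages.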

　　Primitive types

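Type	Description
null	no value
boolean	a binary value
int	32-bit signed integer
long	64-bit signed integer
float	single-precision (32-bit) floating-point number
double	double-precision (64-bit) floating-point number
bytes	sequence of 8-bit unsigned bytes
string	Unicode character sequence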
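
As an illustration of why the binary form of these types is compact: the Avro specification states that `int` and `long` values are written using variable-length zig-zag coding, so small magnitudes take a single byte. The sketch below is a hand-rolled version of that coding for `long`, written for explanation only; the class name `ZigZagVarint` and its methods are hypothetical and are not part of the Avro API.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical demo class; not part of the Avro library.
class ZigZagVarint {

    // Zig-zag maps signed values to unsigned ones (0, -1, 1, -2 -> 0, 1, 2, 3),
    // then the result is written 7 bits at a time, low-order group first.
    static byte[] encodeLong(long value) {
        long zigzag = (value << 1) ^ (value >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((zigzag & ~0x7FL) != 0) {
            out.write((int) ((zigzag & 0x7F) | 0x80)); // high bit set: more bytes follow
            zigzag >>>= 7;
        }
        out.write((int) zigzag);
        return out.toByteArray();
    }

    // Reverses encodeLong: reassemble the 7-bit groups, then undo the zig-zag mapping.
    static long decodeLong(byte[] bytes) {
        long zigzag = 0;
        int shift = 0;
        for (byte b : bytes) {
            zigzag |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return (zigzag >>> 1) ^ -(zigzag & 1);
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        for (long v : new long[] {0, -1, 1, -64, 64}) {
            System.out.println(v + " -> " + toHex(encodeLong(v)));
        }
    }
}
```

Running `main` prints `00`, `01`, `02`, `7f`, and `80 01` for the five inputs, so values between -64 and 63 fit in one byte instead of the fixed 4 or 8 bytes of a Java `int` or `long`.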

　　Complex types

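　　1. Avro defines six complex types, each with its own attributes; the table below describes each of them.
　　2. For every complex type, some attributes are required and others are optional.
　　3. The default value of a Record field deserves a note: when the instance data for a Record schema lacks a value for some field, the field's default value is used; see the table below for the concrete values. For a field of Union type, the default value is determined by the first schema in the Union definition.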

Type	Attribute	Description
Record		record (comparable to a class)
	name	a JSON string providing the name of the record (required).
	namespace	a JSON string that qualifies the name (optional).
	doc	a JSON string providing documentation to the user of this schema (optional).
	aliases	a JSON array of strings, providing alternate names for this record (optional).
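	fields	a JSON array, listing fields (required). Each field is a JSON object with the attributes below.
	fields.name	a JSON string providing the name of the field (required).
	fields.type	a schema, or a JSON string naming a previously defined record (required).
	fields.default	a default value for this field, used when the data lacks it (optional).
	fields.order	how this field affects the sort ordering of the record: "ascending" (the default), "descending", or "ignore" (optional).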
Enums		enum
	name	a JSON string providing the name of the enum (required).
	namespace	a JSON string that qualifies the name (optional).
	doc	a JSON string providing documentation to the user of this schema (optional).
	aliases	a JSON array of strings, providing alternate names for this enum (optional).
	symbols	a JSON array, listing symbols, as JSON strings (required). All symbols in an enum must be unique.
Arrays		array
	items	the schema of the array's items.
Maps		map
	values	the schema of the map’s values.
Fixed		fixed
	name	a string naming this fixed (required).
	namespace	a string that qualifies the name (optional).
	aliases	a JSON array of strings, providing alternate names for this fixed (optional).
	size	an integer, specifying the number of bytes per value (required).
Unions		a JSON array of schemas

　　Examples

Avro type	JSON type	Example
null	null	null
boolean	boolean	true
int,long	integer	1
float,double	number	1.1
bytes	string	"\u00FF"
string	string	"foo"
record	object	{"a":1}
enum	string	"FOO"
array	array	[1]
map	object	{"a":1}
fixed	string	"\u00ff"
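
The `bytes` and `fixed` rows are the least obvious ones in this table: per the Avro specification, such a default is written as a JSON string in which Unicode code points 0-255 stand for the byte values 0-255. A minimal sketch of that mapping follows, assuming ISO-8859-1 as the charset because it maps exactly those code points to single bytes; the class `BytesDefaultDemo` and its methods are hypothetical helpers, not Avro API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the Avro library.
class BytesDefaultDemo {

    // Turns the JSON string form of a bytes/fixed default into raw bytes.
    static byte[] decodeBytesDefault(String jsonString) {
        for (int i = 0; i < jsonString.length(); i++) {
            if (jsonString.charAt(i) > 0xFF) {
                throw new IllegalArgumentException(
                        "code point above 255 at index " + i + " cannot represent a byte");
            }
        }
        return jsonString.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Inverse direction: raw bytes -> the string that would appear in the schema.
    static String encodeBytesDefault(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] decoded = decodeBytesDefault("\u00FF");
        System.out.println(decoded.length + " byte(s), value " + (decoded[0] & 0xFF));
    }
}
```

For the table's example `"\u00FF"`, `main` prints `1 byte(s), value 255`, that is, a single byte with all bits set.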

　　Usage example

　　　　POM file

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.blb</groupId>
    <artifactId>avro</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <dependencies>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-tools</artifactId>
            <version>1.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-maven-plugin</artifactId>
            <version>1.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-compiler</artifactId>
            <version>1.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-ipc</artifactId>
            <version>1.8.2</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
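                    <!-- source and target normally match, but they may differ when the classes must run on another JDK version (when targeting an older JDK, the source must not use syntax that JDK does not support) -->
                    <source>1.8</source> <!-- JDK version of the source code -->
                    <target>1.8</target> <!-- version of the generated class files -->
                    <encoding>UTF-8</encoding><!-- source file encoding -->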
                </configuration>

            </plugin>
            <plugin>
                <groupId>org.apache.avro</groupId>
                <artifactId>avro-maven-plugin</artifactId>
                <version>1.8.2</version>
                <executions>
                    <execution>
                        <phase>generate-sources</phase>
                        <goals>
                            <goal>schema</goal>
                        </goals>
                        <configuration>
                            <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
                            <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
