TLV Serialization in Communication Protocols

Reprinted from: http://www.cnblogs.com/lchb/articles/2825050.html

A communication protocol is what lets two nodes exchange information and work together: both sides negotiate rules and conventions such as the byte order, the type of each field, and which compression or encryption algorithm to use. Common protocols include TCP, UDP, HTTP, and SIP. A protocol has both a process specification and an encoding specification. The process specification covers signaling flows, such as call setup; the encoding specification defines how all signaling and data are packed and unpacked.

The encoding specification is what we usually call codec, or serialization. It matters not only for communication but also for storage: whenever we want to persist in-memory objects to disk, we must serialize their data.

This article takes a step-by-step approach: it starts from an example, then repeatedly raises a problem and solves it, evolving the protocol iteratively, and concludes at the end. After reading it, you should find it easy to design or choose your own encoding protocol.

1. Compact Mode

The running example in this article: A and B communicate to get or set basic user information. A typical developer's first step is to define a protocol structure:

struct userbase
{
    unsigned short cmd;   // 1 - get, 2 - set; a short, leaving room for many more commands (wishful thinking)
    unsigned char gender; // 1 - man, 2 - woman, 3 - ??
    char name[8];         // could also be a string, or a len + value pair; fixed length keeps the example simple
};

With this, A barely needs to encode at all: copy the struct straight out of memory, convert cmd to network byte order, and send it to B. B can parse it just as easily, and everyone is happy.

The encoded result can be pictured as a byte grid (one cell per byte): cmd occupies 2 bytes, gender 1 byte, and name 8 bytes.

I call this encoding style compact mode: apart from the data itself there is no redundant information at all; it is essentially raw data. In the DOS era this approach was very common. Memory and network were measured in KB and CPUs had not reached 1 GHz, so adding extra information would have taxed the already stretched CPU and wasted scarce memory and bandwidth.
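The compact-mode send path described above can be sketched as below. The struct layout follows the article, but the `pack_userbase` helper is this sketch's own invention, and `#pragma pack` is assumed to be available to suppress padding, since compact mode relies on the wire layout matching the in-memory layout.

```c
#include <arpa/inet.h>  /* htons */
#include <stdint.h>
#include <string.h>

/* Compact-mode layout from the article, packed so there are no padding bytes. */
#pragma pack(push, 1)
struct userbase {
    uint16_t cmd;     /* 1 = get, 2 = set */
    uint8_t  gender;  /* 1 = man, 2 = woman */
    char     name[8];
};
#pragma pack(pop)

/* Serialize by copying raw memory, converting only cmd to network byte order. */
size_t pack_userbase(const struct userbase *u, unsigned char *buf)
{
    struct userbase tmp = *u;
    tmp.cmd = htons(tmp.cmd);
    memcpy(buf, &tmp, sizeof tmp);
    return sizeof tmp;  /* 11 bytes: 2 + 1 + 8 */
}
```

B would decode by reversing the copy and applying ntohs to cmd; nothing else on the wire describes the data.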

2. Scalability

One day, A adds a birthday field to the basic information and then tells B:

struct userbase
{
    unsigned short cmd;
    unsigned char gender;
    unsigned int birthday;
    char name[8];
};

This worries B: on receiving A's packet, B cannot tell whether the third field is name from the old protocol or birthday from the new one. From this painful lesson, A and B learn an important property a protocol needs: compatibility and extensibility.

So A and B decide to scrap the old protocol, start over, and design one that stays compatible across every later version. The method is very simple: add a version field.

struct userbase
{
    unsigned short version;
    unsigned short cmd;
    unsigned char gender;
    unsigned int birthday;
    char name[8];
};

Now A and B can breathe a sigh of relief: future extension is easy, and adding fields is straightforward. Many people still use this approach today.

3. Better Scalability

After a while, A and B run into new problems. Having to bump the version number even when no field is added is annoying, but the real pain is code maintenance: every version needs its own case branch, and the switch soon grows to dozens of branches that are ugly and expensive to maintain.

Thinking it over, A and B conclude that a single version number for the whole protocol is too coarse. Instead, they attach an extra piece of information, a tag, to every field. This costs some memory and bandwidth, but unlike in the old days that redundancy is now affordable, and it buys ease of use.

struct userbase
{
    1 unsigned short version;
    2 unsigned short cmd;
    3 unsigned char gender;
    4 unsigned int birthday;
    5 char name[8];
};

With this scheme in place, A and B are quite proud: the protocol looks good, and they can add, remove, and extend fields freely.

Reality is always cruel, and a new requirement soon arrives: 8 bytes is no longer enough for name, which may now reach 100 bytes. A and B are troubled. Eight bytes cannot even hold "steven", yet always packing 100 bytes per name is pure waste, even if bandwidth is not the bottleneck.

So A and B research the problem and find the ASN.1 encoding standard, an ISO/ITU-T specification. One of its encodings, BER (Basic Encoding Rules), is simple and practical: it encodes each field as a <Tag, Length, Value> triple, TLV for short.

After encoding, each field is laid out in memory as Tag | Length | Value. A field's value can itself be a struct, so TLVs can be nested.

After A and B switch to TLV packing, the data is laid out roughly as a sequence of such triples.

TLV is extensible and easy to learn, but it has drawbacks. Every field carries two extra pieces of redundant information, tag and len; when a protocol consists mostly of basic types such as int, short, and byte, this can multiply the storage several times over. In addition, the concrete meaning of each Value must still be agreed in advance by both sides: TLV by itself is neither structured nor self-describing.
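A minimal TLV writer and reader might look like the sketch below, which also solves the variable-length name problem. The 1-byte tag and 2-byte big-endian length are choices made for this sketch; BER proper uses variable-length tag and length encodings.

```c
#include <stdint.h>
#include <string.h>

/* Write one TLV field: 1-byte tag, 2-byte big-endian length, raw value.
   Returns the number of bytes written. */
size_t tlv_put(unsigned char *buf, uint8_t tag,
               const void *value, uint16_t len)
{
    buf[0] = tag;
    buf[1] = (unsigned char)(len >> 8);
    buf[2] = (unsigned char)(len & 0xFF);
    memcpy(buf + 3, value, len);
    return 3 + (size_t)len;
}

/* Read one TLV field back; returns the number of bytes consumed. */
size_t tlv_get(const unsigned char *buf, uint8_t *tag,
               const unsigned char **value, uint16_t *len)
{
    *tag = buf[0];
    *len = (uint16_t)((buf[1] << 8) | buf[2]);
    *value = buf + 3;
    return 3 + (size_t)*len;
}
```

With this, "steven" costs 6 value bytes plus 3 bytes of overhead, instead of a fixed 100-byte slot.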

4. Self-Description

With TLV adopted, the problem seems solved, but A and B still find it imperfect, so they decide to add self-description: a captured packet should reveal the type of each field without consulting the protocol document. The improved encoding is TT[L]V (tag, type, [length], value). For fixed-length basic types such as int, short, long, and byte, the length is implied by the type, so the L can be omitted.

Some type values are defined as follows:

Type     Value   Description
bool       1     Boolean value
int8       2     8-bit signed integer
uint8      3     8-bit unsigned integer
int16      4     16-bit signed integer
uint16     5     16-bit unsigned integer
int32      6     32-bit signed integer
uint32     7     32-bit unsigned integer
string    12     string or binary sequence
struct    13     custom structure, may be nested
list      14     ordered list
map       15     unordered key-value map

After serializing according to TT[L]V, the memory is organized the same way as before, with a type byte following each tag.

After the change, A and B find it really does bring a lot of benefits. Not only can they add and delete fields at will, they can even change a field's data type, for example widening cmd to int cmd, while remaining seamlessly compatible. Powerful indeed.
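The TT[L]V idea can be sketched as below: fixed-size types carry no length field because the type implies it, while strings keep one. The function names, the type codes reused from the table above, and the exact byte widths are illustrative choices, not from the article.

```c
#include <stdint.h>
#include <string.h>

enum { T_INT16 = 4, T_STRING = 12 };  /* type codes from the table above */

/* Fixed-size type: tag + type + 2 value bytes, no length field needed. */
size_t ttlv_put_int16(unsigned char *buf, uint8_t tag, int16_t v)
{
    buf[0] = tag;
    buf[1] = T_INT16;
    buf[2] = (unsigned char)(((uint16_t)v) >> 8);
    buf[3] = (unsigned char)(v & 0xFF);
    return 4;
}

/* Variable-size type: tag + type + 2-byte big-endian length + bytes. */
size_t ttlv_put_string(unsigned char *buf, uint8_t tag,
                       const char *s, uint16_t len)
{
    buf[0] = tag;
    buf[1] = T_STRING;
    buf[2] = (unsigned char)(len >> 8);
    buf[3] = (unsigned char)(len & 0xFF);
    memcpy(buf + 4, s, len);
    return 4 + (size_t)len;
}
```

A decoder reads the type byte first and from it knows whether a length field follows, which is exactly what makes captured packets readable without the protocol document.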

5. Cross-Language Support

One day a new colleague, C, arrives. C writes a new service that must talk to A, but writes it in Java or PHP, which have no unsigned types, so parsing negative values fails. To solve this, A re-plans the protocol's types, stripping out language-specific features, keeping only what all languages share, and imposing mandatory constraints on which types may be used. The constraints buy generality, simplicity, and cross-language support. Everyone agrees, and a type specification is born:

Type     Value   Description
bool       1     Boolean value
int8       2     8-bit signed integer
int16      3     16-bit signed integer
int32      4     32-bit signed integer
string    12     string or binary sequence
struct    13     custom structure, may be nested
list      14     ordered list
map       15     unordered key-value map

6. Code Automation: the IDL

But A and B find a new chore: every new protocol means writing the encode and decode code from scratch and debugging it. TLV is simple, and hand-writing codecs is boring manual work with no technical depth. Worse, all the copy/paste makes mistakes easy, for novices and veterans alike, and once a mistake slips in, locating it is very time-consuming. So A thinks of generating the code with a tool.

IDL (Interface Description Language) is a descriptive intermediate language. One of its missions is standardization and constraint: as discussed above, it standardizes type usage and provides cross-language support. A tool parses the .idl file and generates code for each language:

Gencpp.exe sample.idl output sample.cpp sample.h

Genphp.exe sample.idl output sample.php

Genjava.exe sample.idl output sample.java

Simple and efficient, isn't it?
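For illustration, the userbase structure from earlier might be written in such an IDL roughly like this. The syntax is hypothetical, loosely modeled on Thrift/protobuf-style IDLs; the article does not specify one.

```
// sample.idl (hypothetical syntax)
struct userbase
{
    1 int16  version;
    2 int16  cmd;
    3 int8   gender;
    4 int32  birthday;
    5 string name;
};
```

The generator reads this one description and emits matching codec code for C++, PHP, and Java, so the wire format can never drift between languages.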

7. Summary

Does this feel familiar by now? Yes: what the protocol has evolved into is essentially Facebook's Thrift or Google's Protocol Buffers, and also the JCE protocol used by our company's wireless division. A glance at the IDL files of these protocols shows they are almost identical, differing only in minor details.

On top of this, those protocols refine a few details:

1. Compression. Not general-purpose compression like gzip, but integer compression. An int's value is very often below 127 (0 is especially common), so it need not occupy 4 bytes. These protocols therefore refine the encoding so that an int occupies only 1, 2, 3, or 4 bytes depending on its value, while remaining a TT[L]V protocol at heart.

2. The required/optional feature, which serves two purposes. First, more compression: a protocol may have many fields, not all of which need to be present, and packing a default value for an unset field is wasteful, so an optional field with no value assigned is simply not packed at all. Second, a light logical constraint: marking which fields must be present strengthens validation.
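Both refinements can be sketched together. The varint below follows the protobuf-style 7-bits-per-byte scheme (which can reach 5 bytes for the largest 32-bit values, slightly more general than the 1 to 4 bytes described above); the message layout with a presence flag is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Refinement 1, integer compression: 7 value bits per byte, high bit set
   means "more bytes follow". Values 0..127 cost one byte instead of four. */
size_t varint_encode(uint32_t v, unsigned char *buf)
{
    size_t n = 0;
    while (v >= 0x80) {
        buf[n++] = (unsigned char)((v & 0x7F) | 0x80);
        v >>= 7;
    }
    buf[n++] = (unsigned char)v;
    return n;
}

/* Refinement 2, optional fields: an unset field costs zero wire bytes. */
struct msg {
    uint32_t cmd;          /* required */
    int      has_birthday; /* presence flag */
    uint32_t birthday;     /* optional */
};

size_t pack_msg(const struct msg *m, unsigned char *buf)
{
    size_t n = 0;
    buf[n++] = 1;                       /* tag 1: cmd, always packed */
    n += varint_encode(m->cmd, buf + n);
    if (m->has_birthday) {              /* tag 2: packed only when set */
        buf[n++] = 2;
        n += varint_encode(m->birthday, buf + n);
    }
    return n;
}
```

An unset birthday makes the packet 5 bytes shorter, and a decoder that finds no tag 2 knows the field was absent rather than zero.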

Serialization is the foundation of communication protocols; signaling channels, data channels, and RPC all depend on it. Considering extensibility and cross-language support early in a protocol's design saves a great deal of trouble later.

PS

This article covers serialization for binary protocols, not text protocols. In a sense, text protocols are inherently compatible and extensible, with far fewer concerns than binary ones. Text protocols are also easy to debug: captured packets are readable characters, telnet suffices as a test client, and packets can be crafted by hand without special tools. Being simple and easy to learn is their greatest strength.

Binary protocols win on performance and security, but are harder to debug.

Each has its own merits; choose according to your needs. (stevenrao)
