A simple analysis of RLP encoding principles

RLP encoding is the main method of data serialization in Ethereum. This article introduces the main rules of RLP encoding and analyzes the principles behind them. RLP is efficient to process, largely because length and type are combined into a single prefix. In effect, RLP can be seen as a structured extension of ASCII encoding: byte values up to 0x7f encode as themselves, while larger first-byte values carry both type and length, which makes it a very compact structured encoding scheme.


RLP (Recursive Length Prefix) is an encoding algorithm for binary data with arbitrarily nested structure. It is the main method of data serialization and deserialization in Ethereum: when blocks, transactions and other data structures are persisted, they are first RLP encoded and then stored in the database.


Definition


RLP encoding is defined for only two types of data: strings (that is, byte arrays) and lists. A string is a sequence of binary data, and a list is a nested, recursive structure that can contain both strings and lists. For example, ["cat",["puppy","cow"],"horse",[[]],"pig",[""],"sheep"] is a complex list. Other types of data must first be converted into one of these two categories. The conversion rules are not defined by RLP encoding, so each application can apply its own. For example, a struct can be converted into a list and an int can be converted into binary data (a string). In Ethereum, integers are stored in big-endian form.
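As an illustration of the conversion step, here is a minimal Python sketch that turns a non-negative integer into a big-endian byte string before it is RLP encoded. The function name int_to_big_endian is hypothetical, and mapping zero to the empty string is a convention assumed here (it is the one Ethereum uses), not something RLP itself prescribes.

```python
def int_to_big_endian(value: int) -> bytes:
    """Convert a non-negative integer to its shortest big-endian byte string."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    if value == 0:
        # Assumption: zero becomes the empty string, with no leading zero bytes.
        return b""
    # Smallest number of bytes that can hold the value.
    length = (value.bit_length() + 7) // 8
    return value.to_bytes(length, "big")
```

With this helper, int_to_big_endian(1024) gives the two bytes 0x04 0x00, which is the string that the integer example later in the article encodes.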


The name of RLP encoding reveals its two characteristics. The first is recursion: the encoded data is a recursive structure, and the encoding algorithm processes it recursively. The second is the length prefix: every RLP encoding starts with a prefix that depends on the length of the encoded data, as the encoding rules below show.


Encoding rules


Rule 1. For a single byte whose value is in the range [0x00, 0x7f], its RLP encoding is the byte itself. Note the boundary at 0x7f: the maximum value of ASCII encoding is 0x7f, so a single byte up to 0x7f is encoded exactly as in ASCII.


Rule 2. If a string is 0-55 bytes long (and is not a single byte covered by Rule 1), its RLP encoding consists of a single-byte prefix followed by the string itself. The value of the prefix is 0x80 plus the length of the string. Since the maximum length of such a string is 55=0x37, the maximum value of the single-byte prefix is 0x80+0x37=0xb7, so the first byte of the encoding lies in the range [0x80, 0xb7].
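Rules 1 and 2 can be sketched together in a few lines of Python. This is only an illustration for strings of at most 55 bytes; the function name encode_short_string is hypothetical.

```python
def encode_short_string(data: bytes) -> bytes:
    """RLP-encode a byte string of at most 55 bytes (Rules 1 and 2)."""
    # Rule 1: a single byte in [0x00, 0x7f] is its own encoding.
    if len(data) == 1 and data[0] <= 0x7f:
        return data
    # Rule 2: prefix 0x80 + length, followed by the string itself.
    if len(data) <= 55:
        return bytes([0x80 + len(data)]) + data
    raise ValueError("strings longer than 55 bytes need a length-of-length prefix")
```

Note that a single byte above 0x7f, such as 0x80, falls under Rule 2 and is encoded as the two bytes 0x81 0x80.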


Rule 3. If a string is longer than 55 bytes, its RLP encoding consists of a single-byte prefix, followed by the length of the string, followed by the string itself. The value of the prefix is 0xb7 plus the number of bytes needed to write the string length in binary, which can be confusing at first. For example, if a string is 1024 bytes long, its length in binary is 10000000000, which takes 2 bytes, so the prefix is 0xb7+2=0xb9. Since 1024=0x400, the entire RLP encoding is \xb9\x04\x00 followed by the string itself. The first byte of the encoding, the prefix, lies in the range [0xb8, 0xbf]: the binary form of the string length takes at least 1 byte, so the minimum is 0xb7+1=0xb8, and it takes at most 8 bytes, so the maximum is 0xb7+8=0xbf.
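The 1024-byte example can be checked with a short Python sketch of Rule 3. The function name encode_long_string is hypothetical, and the sketch assumes the input is longer than 55 bytes.

```python
def encode_long_string(data: bytes) -> bytes:
    """RLP-encode a byte string longer than 55 bytes (Rule 3)."""
    length = len(data)
    if length <= 55:
        raise ValueError("strings of 55 bytes or fewer use the single-byte prefix")
    # Big-endian form of the string length, using as few bytes as possible.
    length_bytes = length.to_bytes((length.bit_length() + 7) // 8, "big")
    if len(length_bytes) > 8:
        raise ValueError("the length field may occupy at most 8 bytes")
    # Prefix is 0xb7 plus the size of the length field.
    return bytes([0xb7 + len(length_bytes)]) + length_bytes + data
```

For a 1024-byte input the result starts with 0xb9 0x04 0x00 and is 1027 bytes long in total: one prefix byte, two length bytes and the string.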


Rule 4. If the total length of a list is 0-55 bytes (the total length of a list is the sum of the lengths of the RLP encodings of the items it contains), its RLP encoding consists of a single-byte prefix followed by the RLP encoding of each item in the list. The value of the prefix is 0xc0 plus the total length of the list. The first byte of the encoding lies in the range [0xc0, 0xf7].


Rule 5. If the total length of a list is greater than 55 bytes, its RLP encoding consists of a single-byte prefix, followed by the total length of the list, followed by the RLP encoding of each item in the list. The value of the prefix is 0xf7 plus the number of bytes needed to write the total length in binary. The first byte of the encoding lies in the range [0xf8, 0xff].
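Putting the five rules together, a recursive encoder might look like the following Python sketch. It assumes the input has already been converted to bytes objects and (possibly nested) Python lists; the names rlp_encode and _length_prefix are hypothetical and this is not taken from any particular Ethereum client.

```python
def _length_prefix(length: int, offset: int) -> bytes:
    """Build the prefix for a payload; offset is 0x80 for strings, 0xc0 for lists."""
    if length <= 55:
        # Rules 2 and 4: one byte holding offset + length.
        return bytes([offset + length])
    # Rules 3 and 5: offset + 55 + size of the length field, then the length.
    length_bytes = length.to_bytes((length.bit_length() + 7) // 8, "big")
    if len(length_bytes) > 8:
        raise ValueError("payload too long for RLP")
    return bytes([offset + 55 + len(length_bytes)]) + length_bytes


def rlp_encode(item) -> bytes:
    """Recursively RLP-encode a bytes object or a list of such items."""
    if isinstance(item, bytes):
        # Rule 1: a single byte in [0x00, 0x7f] is its own encoding.
        if len(item) == 1 and item[0] <= 0x7f:
            return item
        return _length_prefix(len(item), 0x80) + item
    if isinstance(item, list):
        # The total length of a list is the length of its concatenated encoded items.
        payload = b"".join(rlp_encode(child) for child in item)
        return _length_prefix(len(payload), 0xc0) + payload
    raise TypeError("convert other types to bytes or list before encoding")
```

Here 0x80+55 is 0xb7 and 0xc0+55 is 0xf7, so one helper covers the string and list prefixes alike. For example, rlp_encode([b"cat", b"dog"]) produces 0xc8 followed by the two encoded strings, because the payload is 4+4=8 bytes long.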


RLP encoding examples


String "dog" = [0x83, 'd', 'o', 'g' ] (rule 2)


List ["cat","dog"] = [0xc8, 0x83, 'c', 'a', 't', 0x83, 'd', 'o', 'g' ] (rule 4)


Empty string "" = [0x80] (rule 2)


Empty list [] = [0xc0] (rule 4)


Integer 15 ('\x0f') = [0x0f] (rule 1)


Integer 1024 ('\x04\x00') = [0x82, 0x04, 0x00] (rule 2)


List [ [], [[]], [ [], [[]] ] ] = [0xc7, 0xc0, 0xc1, 0xc0, 0xc3, 0xc0, 0xc1, 0xc0] (rule 4)


[..., 'l', 'i', 't'] (rule 3)


RLP analysis


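The rules above show the design idea behind RLP encoding: the type of an encoded item can be determined quickly from its first byte. RLP makes full use of the storage space of one byte by giving new meanings to the values above 0x7f. The encodings we have usually seen, such as Unicode, mainly encode units of a fixed byte length, and decoding them generally means splitting the data at that fixed length. Their biggest drawback is that they cannot represent a structure, that is, the list described in this article. The biggest advantage of RLP is that it supports list structures while still making full use of every byte, so a tree structure can easily be stored with RLP.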
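The first-byte ranges from the five rules can be summarized in a small Python sketch. The function name classify_first_byte and the labels it returns are hypothetical and chosen only for illustration.

```python
def classify_first_byte(first: int) -> str:
    """Name the kind of RLP item that starts with the given first byte."""
    if not 0 <= first <= 0xff:
        raise ValueError("a byte must be in the range 0x00-0xff")
    if first <= 0x7f:
        return "single byte"   # Rule 1: [0x00, 0x7f]
    if first <= 0xb7:
        return "short string"  # Rule 2: [0x80, 0xb7]
    if first <= 0xbf:
        return "long string"   # Rule 3: [0xb8, 0xbf]
    if first <= 0xf7:
        return "short list"    # Rule 4: [0xc0, 0xf7]
    return "long list"         # Rule 5: [0xf8, 0xff]
```

Every one of the 256 possible byte values falls into exactly one range, which is why a decoder never needs to look past the first byte to know what kind of item follows.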


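Processing RLP encoding in a program is also very easy. The first byte identifies the type of an encoded item, and the matching decoding method can be called accordingly. If you are familiar with JSON, you will find RLP quite similar: it supports nested structures, and recursive calls can quickly restore an entire RLP encoding into a tree, or translate it into a JSON structure for other programs to use.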
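A recursive decoder along these lines can be sketched in Python as follows. It dispatches on the first byte and rebuilds nested Python lists of bytes objects. The names rlp_decode and _decode_item are hypothetical, and the sketch only checks lengths; it does not reject non-canonical encodings as a production decoder would be expected to.

```python
def rlp_decode(data: bytes):
    """Decode one complete RLP item into bytes or nested lists of bytes."""
    item, end = _decode_item(data, 0)
    if end != len(data):
        raise ValueError("trailing bytes after the top-level item")
    return item


def _decode_item(data: bytes, pos: int):
    """Decode the item starting at pos; return (item, position after the item)."""
    if pos >= len(data):
        raise ValueError("unexpected end of input")
    first = data[pos]
    if first <= 0x7f:
        # Rule 1: the byte is its own encoding.
        return data[pos:pos + 1], pos + 1
    is_list = first >= 0xc0
    base = 0xc0 if is_list else 0x80
    if first <= base + 55:
        # Rules 2 and 4: the payload length is stored in the prefix itself.
        start, length = pos + 1, first - base
    else:
        # Rules 3 and 5: the prefix gives the size of the length field.
        size = first - (base + 55)
        start = pos + 1 + size
        if start > len(data):
            raise ValueError("length field exceeds input")
        length = int.from_bytes(data[pos + 1:start], "big")
    end = start + length
    if end > len(data):
        raise ValueError("declared length exceeds input")
    if not is_list:
        return data[start:end], end
    # A list payload is a sequence of encoded items; decode them one by one.
    items = []
    cursor = start
    while cursor < end:
        child, cursor = _decode_item(data, cursor)
        items.append(child)
    if cursor != end:
        raise ValueError("list items overrun the declared payload")
    return items, end
```

Decoding the bytes 0xc8 0x83 'c' 'a' 't' 0x83 'd' 'o' 'g' with rlp_decode returns the list [b"cat", b"dog"]. Turning the result into JSON would need one more conversion step, since JSON has no raw byte type.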


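For long strings, RLP uses the first byte to store the number of bytes in the length field, and the bytes that follow give the overall length of the string. By Rule 3 the length field can occupy up to 8 bytes, so the largest single string RLP can support is just under 2 to the power of 64 bytes, which is undoubtedly an astronomical figure. Combined with the nesting rules, RLP can in theory encode any data.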


Reposted from 区块网
