HTTP/2 HPACK Header Compression

"GET / HTTP/1.1\r\nHost: www.mrpre.com\r\n\r\n". Clearly, if we ignore the body, HTTP/1.x is a protocol that exchanges nothing but ASCII text.

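Drawbacks

1: The method is encoded redundantly. HTTP defines many methods (GET, POST, and so on), and every request spells the method out as a variable-length string. That is wasteful: why can't the client and server agree that index 0 means GET and index 1 means POST?

2: Header fields are just as redundant as the method, Cookie in particular. After a login the Cookie can be very long, yet it stays the same for the whole HTTP session. Sending a Cookie of several hundred bytes with every request wastes bandwidth. Why not assign the cookie a dynamically generated index, so that only the first request carries the full cookie and later requests send just the index?

3: A header's name and value (the value especially) can be long. Is there a way to shrink them even the first time they are sent? Yes: Huffman coding.

To address these problems, HTTP/2 provides header compression.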

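Header compression simply means encoding the header, and the encoding is straightforward: table lookup. As an analogy, the table could define index 0 as GET, index 1 as POST, and so on. A header value can additionally be Huffman-coded, and the name/value pair can be inserted into the table, so that when the same name and value are sent a second time, only the entry's index in the table is transmitted.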

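HTTP/2 headers are organized as "name: value" pairs, so the following cases necessarily arise.
1: The name is common and so is the value, for example content-type: xxx, which takes one of a few fixed values.
2: The name is common but the value is not, for example Host, whose value differs from one domain to the next.
3: Neither the name nor the value is common, for example a custom header.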

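To keep all header information uniform, HTTP/2 expresses not only the headers themselves but also the method and URL as "name: value" pairs, using pseudo-header fields whose names begin with a colon.
For example ":method: GET", ":method: POST" and ":path: /index.html". This way every piece of information in the HTTP header can be written as "name: value".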

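As mentioned above, HTTP headers can be obtained by table lookup. There are two tables, a static one and a dynamic one. Looking at Nginx's static table makes this more concrete.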



static ngx_http_v2_header_t  ngx_http_v2_static_table[] = {
    { ngx_string(":authority"), ngx_string("") },
    { ngx_string(":method"), ngx_string("GET") },
    { ngx_string(":method"), ngx_string("POST") },
    { ngx_string(":path"), ngx_string("/") },
    { ngx_string(":path"), ngx_string("/index.html") },
    { ngx_string(":scheme"), ngx_string("http") },
    { ngx_string(":scheme"), ngx_string("https") },
    { ngx_string(":status"), ngx_string("200") },
    { ngx_string(":status"), ngx_string("204") },
    { ngx_string(":status"), ngx_string("206") },
    { ngx_string(":status"), ngx_string("304") },
    { ngx_string(":status"), ngx_string("400") },
    { ngx_string(":status"), ngx_string("404") },
    { ngx_string(":status"), ngx_string("500") },
    { ngx_string("accept-charset"), ngx_string("") },
    { ngx_string("accept-encoding"), ngx_string("gzip, deflate") },
    { ngx_string("accept-language"), ngx_string("") },
    { ngx_string("accept-ranges"), ngx_string("") },
    { ngx_string("accept"), ngx_string("") },
    { ngx_string("access-control-allow-origin"), ngx_string("") },
    { ngx_string("age"), ngx_string("") },
    { ngx_string("allow"), ngx_string("") },
    { ngx_string("authorization"), ngx_string("") },
    { ngx_string("cache-control"), ngx_string("") },
    { ngx_string("content-disposition"), ngx_string("") },
    { ngx_string("content-encoding"), ngx_string("") },
    { ngx_string("content-language"), ngx_string("") },
    { ngx_string("content-length"), ngx_string("") },
    { ngx_string("content-location"), ngx_string("") },
    { ngx_string("content-range"), ngx_string("") },
    { ngx_string("content-type"), ngx_string("") },
    { ngx_string("cookie"), ngx_string("") },
    { ngx_string("date"), ngx_string("") },
    { ngx_string("etag"), ngx_string("") },
    { ngx_string("expect"), ngx_string("") },
    { ngx_string("expires"), ngx_string("") },
    { ngx_string("from"), ngx_string("") },
    { ngx_string("host"), ngx_string("") },
    { ngx_string("if-match"), ngx_string("") },
    { ngx_string("if-modified-since"), ngx_string("") },
    { ngx_string("if-none-match"), ngx_string("") },
    { ngx_string("if-range"), ngx_string("") },
    { ngx_string("if-unmodified-since"), ngx_string("") },
    { ngx_string("last-modified"), ngx_string("") },
    { ngx_string("link"), ngx_string("") },
    { ngx_string("location"), ngx_string("") },
    { ngx_string("max-forwards"), ngx_string("") },
    { ngx_string("proxy-authenticate"), ngx_string("") },
    { ngx_string("proxy-authorization"), ngx_string("") },
    { ngx_string("range"), ngx_string("") },
    { ngx_string("referer"), ngx_string("") },
    { ngx_string("refresh"), ngx_string("") },
    { ngx_string("retry-after"), ngx_string("") },
    { ngx_string("server"), ngx_string("") },
    { ngx_string("set-cookie"), ngx_string("") },
    { ngx_string("strict-transport-security"), ngx_string("") },
    { ngx_string("transfer-encoding"), ngx_string("") },
    { ngx_string("user-agent"), ngx_string("") },
    { ngx_string("vary"), ngx_string("") },
    { ngx_string("via"), ngx_string("") },
    { ngx_string("www-authenticate"), ngx_string("") },
};

So far we have seen that a header's name and value come in several combinations. Let's now look at how HPACK encodes each of them.

1: Indexed Header Field Representation

Both the name and the value are already in a table (either the dynamic or the static one).
The format is:

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | 1 |        Index (7+)         |
   +---+---------------------------+

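This is the simplest encoding; a single byte is enough. For example, 0x83 means ":method: POST": the msb of 0x83 is 1 and the remaining bits are 3, and the RFC defines index 3 as :method POST (HPACK indices start at 1, so this is the third row of the table above).
In other words, both the name and the value are obtained from the index.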


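However, the index is not simply the low 7 bits of this byte. HPACK defines its own integer encoding, and the index has to go through HPACK integer decoding to yield the actual value. In this example the index is 3, which still decodes to 3. Integer decoding is covered later.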

2: Literal Header Field with Incremental Indexing – Indexed Name

The name is in a table, the value is not, and the new entry may be added to the dynamic table.
Example: Host: www.mrpre.com

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | 0 | 1 |      Index (6+)       |
   +---+---+-----------------------+
   | H |     Value Length (7+)     |
   +---+---------------------------+
   | Value String (Length octets)  |
   +-------------------------------+

As with the Indexed Header Field Representation, the name is obtained from the index. Getting the value is more involved; how the value is encoded is covered later.

3: Literal Header Field with Incremental Indexing – New Name

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | 0 | 1 |           0           |
   +---+---+-----------------------+
   | H |     Name Length (7+)      |
   +---+---------------------------+
   |  Name String (Length octets)  |
   +---+---------------------------+
   | H |     Value Length (7+)     |
   +---+---------------------------+
   | Value String (Length octets)  |
   +-------------------------------+

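This is the case where neither the header's name nor its value is in a table, for example a custom header. The entry may be added to the dynamic table.
For example, when a client sends a new header "x-mrpre-header: handsome" for the first time, it uses this format.
The first byte is always 0x40.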

4: Literal Header Field without Indexing – Indexed Name

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | 0 | 0 | 0 | 0 |  Index (4+)   |
   +---+---+-----------------------+
   | H |     Value Length (7+)     |
   +---+---------------------------+
   | Value String (Length octets)  |
   +-------------------------------+

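The format itself tells the story: the value is spelled out but the name is not, so the name must be in a table. The difference from Literal Header Field with Incremental Indexing – Indexed Name above is that data in this format is not added to the dynamic table.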

5: Literal Header Field without Indexing – New Name

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | 0 | 0 | 0 | 0 |       0       |
   +---+---+-----------------------+
   | H |     Name Length (7+)      |
   +---+---------------------------+
   |  Name String (Length octets)  |
   +---+---------------------------+
   | H |     Value Length (7+)     |
   +---+---------------------------+
   | Value String (Length octets)  |
   +-------------------------------+

Same as 3, except that the data is not stored in the dynamic table.

6: Literal Header Field Never Indexed – Indexed Name

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | 0 | 0 | 0 | 1 |  Index (4+)   |
   +---+---+-----------------------+
   | H |     Value Length (7+)     |
   +---+---------------------------+
   | Value String (Length octets)  |
   +-------------------------------+

7: Literal Header Field Never Indexed – New Name

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | 0 | 0 | 0 | 1 |       0       |
   +---+---+-----------------------+
   | H |     Name Length (7+)      |
   +---+---------------------------+
   |  Name String (Length octets)  |
   +---+---------------------------+
   | H |     Value Length (7+)     |
   +---+---------------------------+
   | Value String (Length octets)  |
   +-------------------------------+

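6 and 7 closely resemble 4 and 5: only the first 4 bits change from 0000 to 0001, and everything else is identical.
The difference between the two groups shows up when a proxy is involved: a field received as 6 or 7 must be forwarded to the upstream server in the same never-indexed representation, so no hop along the way adds it to a table.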

8: Dynamic Table Size Update

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | 0 | 0 | 1 |   Max size (5+)   |
   +---+---------------------------+

All the formats above carry HTTP header information; this one only tells the peer that the dynamic table size has changed.

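Integer encoding

1: Integer encoding is used for indices.
2: Integer encoding is used for value lengths.
This section uses the value length to explain integer encoding.

The previous section on header encoding left some things out. For example, when the name is in a table but the value is not, how is the value transmitted the first time?
The value is a string, so besides the string itself there has to be a length field that says how long the value is.
This section explains how HPACK describes that length.

The simplest approach would be a fixed 4-byte length, immediately followed by that many bytes of value: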

     0   1   2   .........   30   31 Value
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

But for short values a 4-byte length is wasteful; 1 byte is enough.

So HPACK specifies the following.

1. A length in [0, 2^N-2] is encoded in 1 byte:

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | ? | ? | ? |       Value       |
   +---+---+---+-------------------+

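In the diagram above N is 5. The actual N depends entirely on how many bits are available in the current byte.
For example, an Indexed Header Field Representation has only 7 bits for the index, and a Literal Header Field without Indexing – Indexed Name has only 4 bits for the index.

In other words, N available bits can represent at most 2^N - 2 (in principle N bits could represent 2^N - 1, but that value is reserved for another purpose, explained next).

2. A length greater than or equal to 2^N-1 needs multiple bytes. Suppose we want to encode a number that is greater than 2^N - 2.
First, set all available bits of the first byte to 1, so the first byte represents 2^N - 1. (This is why a single byte can encode at most 2^N - 2 rather than 2^N - 1.)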

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | ? | ? | ? | 1   1   1   1   1 |
   +---+---+---+-------------------+

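Next, the following bytes carry remain = number - (2^N - 1). When remain is less than 128, it fits in the low 7 bits of a single second byte, whose msb is 0:

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | 0 |          remain           |
   +---+---------------------------+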

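That is rather abstract, so here is a concrete example.
Assume there are 5 available bits (see above for what "available bits" means).
Numbers from 0 to 30 are easy: one byte is enough. For example, to encode 30: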

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | ? | ? | ? | 1   1   1   1   0 |
   +---+---+---+-------------------+

What about 100? It takes 2 bytes: all available bits of the first byte are set to 1, and the low 7 bits of the second byte
hold number - (2^N - 1) = 100 - (2^5 - 1) = 69.

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | ? | ? | ? | 1   1   1   1   1 |
   +---+---+---+-------------------+
   | 0 | 1   0   0   0   1   0   1 |
   +---+---------------------------+

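When several bytes encode a length, the decoder has to know how many bytes there are. HPACK specifies that the msb of the last byte is 0.
Consider encoding the number 82961:

Step 1: 82961 - 31 = 82930 = 1 0100 0011 1111 0010(b)
Step 2: split 1 0100 0011 1111 0010(b) into 7-bit groups, starting from the lsb.

0000101 0000111 1110010

Then the groups are written out least significant first: the lowest 7 bits go into the first byte after the prefix, and the highest group goes into the last byte.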

        0   1   2   3   4   5   6   7
        +---+---+---+---+---+---+---+---+
 byte1  | ? | ? | ? | 1   1   1   1   1 | 1F
        +---+---+---+-------------------+
 byte2  | 1 | 1   1   1   0   0   1   0 | F2
        +---+---------------------------+
 byte3  | 1 | 0   0   0   0   1   1   1 | 87
        +---+---------------------------+
 byte4  | 0 | 0   0   0   0   1   0   1 | 05
        +---+---------------------------+

The msb of each byte after the first is set to 0 or 1 according to its position: 1 means more bytes follow, 0 means this is the last byte.

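The integer-decoding code below is adapted from Nginx:

#include <stdio.h>

/*
 * Decode an HPACK integer that starts at buf[0] with an N-bit prefix.
 * No bounds checking: the caller must pass a complete encoding.
 */
unsigned int getlen(const unsigned char *buf, unsigned int prefix_bits)
{
    const unsigned char *p = buf;
    unsigned int mask = (1u << prefix_bits) - 1;
    unsigned int value = *p++ & mask;   /* drop the flag bits above the prefix */
    unsigned int shift;
    unsigned char octet;

    if (value < mask) {
        return value;                   /* the number fits in the prefix */
    }

    for (shift = 0; ; shift += 7) {
        octet = *p++;
        value += (unsigned int) (octet & 0x7f) << shift;

        if (octet < 128) {              /* msb is 0: this is the last byte */
            return value;
        }
    }
}

int main(void)
{
    unsigned char buf[] = {0x1f, 0xf2, 0x87, 0x05};
    printf("len:%u\n", getlen(buf, 5));
    return 0;
}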

String encoding

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | 0 | 0 | 0 | 1 |  Index (4+)   |
   +---+---+-----------------------+
   | H |     Value Length (7+)     |
   +---+---------------------------+
   | Value String (Length octets)  |
   +-------------------------------+

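Now that we can encode lengths, the only remaining problem is encoding the string itself.
In the diagram above, the msb of the first length byte, H, indicates whether the string is Huffman-encoded; H = 1 means it is.
HTTP/2 uses a static Huffman code. Huffman coding belongs more to information theory than to computer science, so we won't go into detail here.
For HTTP/2, a table lookup is all it takes. The table is at https://tools.ietf.org/html/rfc7541#appendix-B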
