使用 Async-hooks 集成 Envoy 和我们的 RPC 库（二）

前文再续，书接上一回。上回讲到，我们怎样利用 async_hook 将请求（Request）的上下文（Context）在我们的业务程序中传递，以及如何使用 Monkey Patch 去设置上下文。Monkey Patch 在某种程度上是一种好办法，但是并不适合所有情况。比如在 Tubi 的应用过程中，我们发现 node-grpc 并没有一个特别合适的地方可以打补丁。那现在，我们一起来看一下一种不用 Monkey Patch 的解决方案。

基于调用栈上的技巧

首先，让我们先看看调用栈的结构：

调用栈的依赖树

上图中的每一个节点代表一次函数调用。函数都是逐个依次调用的，从上往下，从左到右（中序遍历）。图中你可以看到在根节点上有两棵子树（两块蓝色阴影）。中间件（Middleware）总是在主体逻辑前执行的，所以你会看到它们在左下方。

如果我们写一个中间件放置于 HTTP 服务中间件列表的最前面，它的调用通常跟图中所指位置是类似的：放置于左下方，早于右边的其他调用。

然后，我们可以采取以下策略来管理上下文：

建立一棵树

首先，添加一个 async_hook 回调。在回调的实现中，构建一棵我们程序调用依赖的树：一个 asyncId 触发另外一个 asyncId，就跟上面的图一样。

设置上下文

实现一个新的中间件，同时保证它是所有中间件中的第一个。它的实现逻辑是：

对于中间件执行的 asyncId 设置需要的上下文信息。
对于中间件那个 asyncId 的所有祖先节点，依次回溯，如果它没有设置任何上下文信息，则复制中间件设置的那个上下文信息。

扫描二维码关注公众号，回复： 16076546 查看本文章

看个图比较好理解：

设置上下文

图中涂黄色的部分就是被标记上下文信息的节点。正如你所看到的，第一个请求（左边），它会把一些本不属于请求本身的一些节点都给标记了。这是可以接受的，因为一旦有一个请求把这些节点标记了，后续的其他请求就会“尊重”它的“决定”，并且借此看到彼此之间的边界。很容易证明，只有第一个请求会这么做。

查询上下文

令 currentId 等于 asyncId （当前的 asyncId）。
查找 currentId 的父节点有没有上下文信息：
1. 有：复制父节点的上下文信息到 currentId，并返回上下文信息。
2. 无：令 currentId 等于父节点的 asyncId，继续执行第 2 步，直到找到上下文信息并返回。

下图中的青色部分就是查询的工作原理：

查询工作原理

每次查询都是从查询的调用开始，然后依次到它的父节点以及所有祖先节点。一旦查询结束，所有途经的节点都会被标记正确的上下文信息。很容易发现，这种算法的平均时间复杂度时 O(1)（递归查询的次数不会多于程序的调用深度）。

优势

无需关心业务程序用的是什么库，以及应该怎样打 Monkey Patch。
计算复杂度与 Monkey Patch 一致。

劣势

这显然是一个更复杂的实现。
如果某个库的实现，对于两个独立的请求会在同一个 async resource 里面执行，那么这种方法是不奏效的。在我们的用例中，目前还没有发现这样的情况。
需要精心管理上下文的存储以避免内存泄漏。
如果需要避免上下文的混乱，要严格做到：
- 上下文的设置必须先于查询。
- 对于一个请求上下文，只可以被设置一次（否则本来属于同一个请求的信息会分裂成多个）。

生活并不尽如人意

我们说过这是一个更复杂的实现，对吧？尽管我们已经有了一个可以使用的实现，但是还存在一些问题需要解决。

问题一：async_hook 的生命周期

当你第一次看到 async_hook 的 API 的时候，你可能会有以下一些先入为主的假定：

init / before / after / destory 会且仅会执行一次。
祖先的 after / destory 一定会在后代的相应方法之后执行。

然而事实并非如此。跟直觉不大一致的是，是有可能不执行 before / after的。同时，祖先的 after / destory 可能先于后代的发生。

为了证明这一点，请看这个代码：

/**
* This file is to demonstrate async_hooks life cycle is a little bit out of normal feelings:
* 1. ancestor's destroy may earlier than successor's before / after / destroy
* 2. there can have no before / after, but only have init / destroy
*
* BUT, the things we can believe are:
* 1. for the same stack:
*    init / before / execEnd / after / destroy are happening in strict order
* 2. ancestor's init / before / execStart is strictly earlier than successor's one
*/

const asyncHooks = require('async_hooks');
const treeify = require('treeify');
const fs = require('fs');


let timeId = 0;
let initAndNotDestroyed = 0;
const timeInfo = {};
const tree = {};
const nodeMap = {};


function travel(tree) {
  const ret = {};
  Object.keys(tree).forEach(key => {
      const subTree = tree[key];
      const info = timeInfo[key];
      let newName = key;
      if (info) {
          const order = Object.keys(info)
              .sort((a, b) => info[a] - info[b])
              // .filter(key => key === "init")
              .map(key => `${key}(${info[key]})`)
              .join(', ')
          newName = `${key}: ${order}`;
      }
      ret[newName] = travel(subTree);
  });
  return ret;
}

function init(asyncId, type, triggerAsyncId, resource) {
  timeInfo[asyncId] = {
      init: timeId++
  };
  const current = {};
  nodeMap[asyncId] = current;
  if (!nodeMap[triggerAsyncId]) {
      const tmp = {};
      nodeMap[triggerAsyncId] = tmp;
      tree[triggerAsyncId] = tmp;
  }
  nodeMap[triggerAsyncId][asyncId] = current;
  initAndNotDestroyed++;
}
function before (asyncId) {
  if (!timeInfo[asyncId]) {
      fs.writeSync(1, `before, not inited: ${asyncId}\n`);
      return;
  }
  timeInfo[asyncId].before = timeId++;
}
function after(asyncId) {
  if (!timeInfo[asyncId]) {
      fs.writeSync(1, `after, not inited: ${asyncId}\n`);
      return;
  }
  timeInfo[asyncId].after = timeId++;
}
function destroy(asyncId) {
  timeInfo[asyncId].destroy = timeId++;
  initAndNotDestroyed--;
}
function setExeStartTime() {
  timeInfo[asyncHooks.executionAsyncId()].execStart = timeId++;
}

function setExecEndTime() {
  timeInfo[asyncHooks.executionAsyncId()].execEnd = timeId++;
}


const asyncHook = asyncHooks.createHook({ init, before, after, destroy });
asyncHook.enable();


const ASYNC = 0;
const SET_TIMEOUT = 1;
const PROMISE = 2;

const TYPES = [ASYNC, SET_TIMEOUT, PROMISE];

const sleep = (time) => new Promise(resolve => {
  setTimeout(() => {
      resolve();
      setExecEndTime();
  }, time);
  setExecEndTime();
});

function chain(ctx, types, first, idx = 0) {
  if (idx > types.length) {
      return;
  }
  const time = (first ? 50 : 0) + 50 * idx;

  switch(types[idx]) {
      case ASYNC:
          (async function foo () {
              await sleep(time);
              setExeStartTime();
              chain(ctx, types, first, idx + 1);
              setExecEndTime();
          })();
          break;
      case SET_TIMEOUT:
          setTimeout(function() {
              setExeStartTime();
              chain(ctx, types, first, idx + 1);
              setExecEndTime();
          }, time);
          break;
      case PROMISE:
          sleep(time)
              .then(() => {
                  setExeStartTime();
                  chain(ctx, types, first, idx + 1);
                  setExecEndTime();
              })
          break;
  }
  setExecEndTime();
}

const typeOne = [];
const typeTwo = [];

for (let i = 0; i < TYPES.length; i++) {
  for (let j = 0; j < TYPES.length; j++) {
      typeOne.push(TYPES[i]);
      typeTwo.push(TYPES[j]);
  }
}

clearTimeout(setTimeout(() => {
  console.log("i never log");
}, 100))

setTimeout(function() {
  const ctx = '>>>'
  // set(ctx);
  chain(ctx, typeOne, true);
}, 0);

setTimeout(function() {
  const ctx = '<<<'
  // set(ctx);
  chain(ctx, typeTwo, false);
}, 0);

setTimeout(function() {
  asyncHook.disable();
  console.log(treeify.asTree(travel(tree)));
}, 10000);

它的输出：

$ node async_hook_life_cycle_test.js
└─ 1
 ├─ 5: init(0), destroy(7)
 ├─ 6: init(1), destroy(8)
 ├─ 7: init(2), before(10), execEnd(18), after(19), destroy(32)
 │  ├─ 12: init(11)
 │  │  └─ 16: init(16)
 │  │     └─ 17: init(17), before(66), execStart(67), execEnd(75), after(76)
 │  │        ├─ 30: init(68)
 │  │        │  └─ 33: init(72)
 │  │        │     └─ 34: init(73), before(104), execStart(105), execEnd(113), after(114)
 │  │        │        ├─ 43: init(106)
 │  │        │        │  └─ 46: init(110)
 │  │        │        │     └─ 47: init(111), before(138), execStart(139), execEnd(142), after(143)
 │  │        │        │        └─ 51: init(140), before(158), execStart(159), execEnd(162), after(163), destroy(166)
 │  │        │        │           └─ 56: init(160), before(172), execStart(173), execEnd(177), after(178), destroy(193)
 │  │        │        │              ├─ 57: init(174), before(196), execStart(197), execEnd(204), after(205), destroy(218)
 │  │        │        │              │  ├─ 65: init(198)
 │  │        │        │              │  │  └─ 68: init(202), before(236), execStart(237), execEnd(243), after(244)
 │  │        │        │              │  │     ├─ 74: init(238)
 │  │        │        │              │  │     │  └─ 76: init(241), before(261), execStart(262), execEnd(269), after(270)
 │  │        │        │              │  │     │     ├─ 77: init(263)
 │  │        │        │              │  │     │     │  └─ 80: init(267), before(279), execStart(280), execEnd(282), after(283)
 │  │        │        │              │  │     │     ├─ 78: init(264), before(275), execEnd(276), after(277), destroy(284)
 │  │        │        │              │  │     │     └─ 79: init(265), before(274), after(278), destroy(285)
 │  │        │        │              │  │     └─ 75: init(239), before(252), execEnd(253), after(254), destroy(272)
 │  │        │        │              │  ├─ 66: init(199), before(222), execEnd(223), after(224), destroy(245)
 │  │        │        │              │  └─ 67: init(200), before(221), after(235), destroy(247)
 │  │        │        │              └─ 58: init(175), before(195), after(209), destroy(220)
 │  │        │        ├─ 44: init(107)
 │  │        │        │  └─ 48: init(115), before(129), after(130)
 │  │        │        └─ 45: init(108), before(123), execEnd(124), after(125), destroy(145)
 │  │        ├─ 31: init(69)
 │  │        │  └─ 35: init(77), before(101), after(102)
 │  │        └─ 32: init(70), before(85), execEnd(86), after(87), destroy(117)
 │  ├─ 13: init(12)
 │  │  └─ 23: init(30), before(64), after(65)
 │  ├─ 14: init(13), before(50), execEnd(51), after(52), destroy(78)
 │  └─ 15: init(14), before(49), after(63), destroy(80)
 ├─ 8: init(3), before(34), after(38), destroy(48)
 ├─ 9: init(4), before(20), execEnd(27), after(28), destroy(33)
 │  ├─ 18: init(21)
 │  │  └─ 21: init(25)
 │  │     └─ 22: init(26), before(41), execStart(42), execEnd(45), after(46)
 │  │        └─ 25: init(43), before(53), execStart(54), execEnd(61), after(62), destroy(79)
 │  │           ├─ 26: init(55)
 │  │           │  └─ 29: init(59), before(89), execStart(90), execEnd(99), after(100)
 │  │           │     ├─ 36: init(91)
 │  │           │     │  └─ 40: init(96)
 │  │           │     │     └─ 41: init(97), before(131), execStart(132), execEnd(136), after(137)
 │  │           │     │        ├─ 49: init(133), before(148), execStart(149), execEnd(156), after(157), destroy(165)
 │  │           │     │        │  ├─ 52: init(150)
 │  │           │     │        │  │  └─ 55: init(154), before(180), execStart(181), execEnd(189), after(190)
 │  │           │     │        │  │     ├─ 59: init(182)
 │  │           │     │        │  │     │  └─ 62: init(186)
 │  │           │     │        │  │     │     └─ 63: init(187), before(212), execStart(213), execEnd(216), after(217)
 │  │           │     │        │  │     │        └─ 69: init(214), before(225), execStart(226), execEnd(233), after(234), destroy(246)
 │  │           │     │        │  │     │           ├─ 70: init(227)
 │  │           │     │        │  │     │           │  └─ 73: init(231), before(256), execStart(257), execEnd(259), after(260)
 │  │           │     │        │  │     │           ├─ 71: init(228), before(249), execEnd(250), after(251), destroy(271)
 │  │           │     │        │  │     │           └─ 72: init(229), before(248), after(255), destroy(273)
 │  │           │     │        │  │     ├─ 60: init(183)
 │  │           │     │        │  │     │  └─ 64: init(191), before(210), after(211)
 │  │           │     │        │  │     └─ 61: init(184), before(206), execEnd(207), after(208), destroy(219)
 │  │           │     │        │  ├─ 53: init(151), before(169), execEnd(170), after(171), destroy(192)
 │  │           │     │        │  └─ 54: init(152), before(168), after(179), destroy(194)
 │  │           │     │        └─ 50: init(134), before(147), after(164), destroy(167)
 │  │           │     ├─ 37: init(92)
 │  │           │     │  └─ 42: init(103), before(127), after(128)
 │  │           │     ├─ 38: init(93), before(120), execEnd(121), after(122), destroy(144)
 │  │           │     └─ 39: init(94), before(119), after(126), destroy(146)
 │  │           ├─ 27: init(56), before(82), execEnd(83), after(84), destroy(116)
 │  │           └─ 28: init(57), before(81), after(88), destroy(118)
 │  ├─ 19: init(22)
 │  │  └─ 24: init(31), before(39), after(40)
 │  └─ 20: init(23), before(35), execEnd(36), after(37), destroy(47)
 ├─ 10: init(5), before(287)
 └─ 11: init(6), before(286)

仔细观察，你会发现：

before/ after 是可能没有的（见 5、6）。
祖先的 destory 会比后代早（见 7 vs 51，destroy(32) 比destroy(166) 早）。

所以，如果我们在 init 的时候初始化上下文，在 destory 的时候清理掉，就可能出现栈顶部（当前执行的代码，对于上面图中的树，就是叶子节点）的调用没办法找到对应的上下文信息。

但是，如果你不清理掉那些上下文信息，你就会有内存泄漏。

不过我们也发现，有两件事情是确定的：

对于同一个调用 init / before / execEnd / after / destory 是有顺序的。
祖先的 init / before / execStart 是早于后代的。

所以，这里我们介绍使用引用计数的方式解决这个问题：

init

当前的上下文信息是被当前执行引用的，同时对于父节点的引用计数加一。

destory

清除对自己上下文的引用。
如果当前的上下文信息没有其他子孙节点引用，则可以删除；同时删除对父节点的引用。
看看父节点是不是也需要这样的处理。

问题二：上下文错乱

正如前面提到的，上下文错乱只会发生在：

查询上下文发生在设置之前。
同一个请求设置了多次上下文。

更严重的是，这种错乱没办法检测到。不过如果第一个请求做对了，并且打上标记，后面的请求可以通过以下方式检测到上述问题：

对于查询（Query）：如果它发现自己碰到了第一个请求所设置的上下文（根据打上的标记），那么它就可以推断出当前请求的上下文没有设置。
对于设置（Set）：如果它发现自己碰到了第一个请求所设置的上下文，推断出没有在同一个请求中设置两次上下文。

所以，要避免错乱，我们只能设置上下文信息一次，并且要先于查询发生。这个对于我们的应用场景是没有问题的。换句话说，只要正确使用，我们就不会有错乱的问题。

问题三：Destory 并没有调用

这个问题当前（2018年12月）应该已经解决了，详细详情参阅Github/node#19859 [1]。

当我们把代码部署到生产环境，我们发现一些内存泄漏的问题：

作为比较，这个是一星期前的样子：

为了调研问题，我们弄了一个计数器：

在 init 的时候 +1
在 destroy 的时候 -1

下面是一天半时间的图：

未 destory 的异步资源随时间增长（暴跌是由于 pm2 自动重启）

我们可以看发现， destroy 并没有跟 init 一样等量地调用。当我们深挖这个问题的时候，我们发现，是我们使用了 Keep-alive 配置的 HTTP Client 导致 destroy 没有调用。它只会在连接断开的时候调用 destroy，但是生产环境有持续的请求，所以连接通常都不会断开的。

为了解决这个问题，我们使用了一个简单的分代垃圾回收算法：

两个 Map：old vs new。
对于 set，总是写到 new 那个 Map。
对于 get，在 new 里面看看有没有，如果没有，继续去 old 找，如果找到了，复制到 new。
在一定周期内，例如 5 分钟，删掉 old，old 指向 new，new 指向一个空的 Map。

因为我们绝大多数请求都是在几秒内结束的，所以每 5 分钟删掉没用过的数据是一个安全的选择。同时，这种实现并没有造成性能的下降。

结论

通过使用 async_hook ，使得我们统计和追踪都容易了很多。想要了解更多细节的话，你可以去看下我们基于这个算法实现的开源项目 envoy-node [2]。欢迎给我们留言或者提供反馈！

[1] Github Node Issue #19859 Reusing HTTP connection lead to no destroy triggered: https://github.com/nodejs/node/issues/19859
[2] envoy-node: https://github.com/Tubitv/envoy-node

关于作者

程颖宇，Tubi 中国团队高级后端工程师。毕业后曾在微软工作三年，负责必应 Ad Insights API 和 Middle Tier 平台的研发工作。16 年加入刚刚组建的 Tubi 中国团队，先后负责 Adbreak Finder、Clip Pipeline 等多个重点项目，现在主要从事微服务基础设施以及 Ad Server 的研发工作。

第八期 Scala Meetup 火热报名中，欢迎参与：抢位｜Scala Meetup 线上回归！预约报名-Tubi TV活动-活动行