Add 10 Lines of Code and Let Node.js Handle 5x the Requests


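Node.js runs as a single process on the CPU, using a single thread and non-blocking I/O. However powerful the server is, and however efficiently that one process uses its resources, a single thread in a single process can only do so much. Node.js was designed for building distributed applications made up of many nodes, hence the name Node.js.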

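Workload is one of the main reasons we start scaling an application, alongside availability and fault tolerance. Scaling can be done in several ways; the simplest is to clone multiple instances of the Node.js application, which we can do with the Cluster module that ships with Node.js.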

Before we start handling requests with a Node.js server that makes full use of the machine's resources, let's look at how the Cluster module works.

How does Cluster work?

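A cluster runs two kinds of processes: a Master process and Worker processes. The Master process receives all incoming requests and decides which Worker process handles each one. A Worker process can be thought of as an ordinary single-instance Node.js server handling requests.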

How does the Master process distribute requests?

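The first approach is round-robin scheduling: the Master process listens on the port and hands each new connection to the Worker processes in turn, with some built-in logic to avoid overloading any one Worker. This is the default on every platform except Windows.
The second approach is socket handoff: the Master process creates the listening socket and sends it to the interested Worker processes, which then accept incoming connections directly.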

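In theory, the second approach should give the best performance. In practice, the vagaries of operating-system scheduling often leave the load very unevenly distributed: for example, with 8 Node.js instances running, 70% of the requests can end up on just 2 of them.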
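
To make the first approach concrete, here is a minimal sketch of round-robin dispatch. The helper createRoundRobin is hypothetical and is not how Node.js implements it internally (the real scheduler also tries to avoid overloading a Worker); it only shows the rotation idea. The cluster.schedulingPolicy property and the cluster.SCHED_RR and cluster.SCHED_NONE constants are part of the Cluster module and select between the two approaches described above.

```javascript
const cluster = require("cluster");

// Real Node.js setting: pick the policy before the first worker is forked.
// cluster.SCHED_RR is round-robin; cluster.SCHED_NONE leaves it to the OS.
cluster.schedulingPolicy = cluster.SCHED_RR;

// Hypothetical helper (not part of Node.js): returns a function that hands
// out worker ids in strict rotation.
function createRoundRobin(workerIds) {
    let next = 0;
    return function pickWorker() {
        const id = workerIds[next];
        next = (next + 1) % workerIds.length;
        return id;
    };
}

// Simulate 7 incoming connections spread over 3 workers.
const pickWorker = createRoundRobin([1, 2, 3]);
const assignments = [];
for (let request = 0; request < 7; request++) {
    assignments.push(pickWorker());
}
console.log(assignments.join(" ")); // prints "1 2 3 1 2 3 1"
```

Each worker receives the same number of connections, give or take one, which is the property the OS-driven approach cannot guarantee.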

Creating a simple Node.js server

Let's create a simple Node.js server to handle requests:

/*** server.js ***/
const http = require("http");
// Get the id of the current process
const processId = process.pid;
// Create the HTTP server and handle requests
const server = http.createServer((req, res) => {
    // Simulate CPU work
    for (let index = 0; index < 1e7; index++);

    res.end(`Process handled by pid: ${processId}`);
});
// Listen on port 8080
server.listen(8080, () => {
    console.log(`Server Started in process ${processId}`);
});

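The server now responds to each request with a line of the form "Process handled by pid: <pid>", always with the same pid.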

Load testing the Node.js server

We will use ApacheBench for the tests; you can pick any other benchmarking tool you prefer, such as autocannon.

We will load test our Node.js server for 10 seconds at 500 concurrent requests.

➜ test_app ab -c 500 -t 10 http://localhost:8080/
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Finished 3502 requests
Server Software:
Server Hostname:        localhost
Server Port:            8080
Document Path:          /
Document Length:        29 bytes
Concurrency Level:      500
Time taken for tests:   11.342 seconds
Complete requests:      3502
Failed requests:        0
Total transferred:      416104 bytes
HTML transferred:       116029 bytes
Requests per second:    308.76 [#/sec] (mean)
Time per request:       1619.385 [ms] (mean)
Time per request:       3.239 [ms] (mean, across all concurrent requests)
Transfer rate:          35.83 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    6   3.7      5      17
Processing:    21 1411 193.9   1412    2750
Waiting:        4  742 395.9    746    1424
Total:         21 1417 192.9   1420    2750
Percentage of the requests served within a certain time (ms)
  50%   1420
  66%   1422
  75%   1438
  80%   1438
  90%   1624
  95%   1624
  98%   1624
  99%   1625
 100%   2750 (longest request)

In this test the server completed 3502 requests in total, with a throughput (Requests per second) of 308 req/s and a mean time per request (Time per request) of 1619 ms.
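
As a quick check on how these figures relate, the sketch below recomputes them from the raw counts in the report. The variable names are mine; the formulas are the ones ApacheBench uses for these rows. The results can differ from the printed values in the last digits because the report rounds the elapsed time to three decimals.

```javascript
// Figures copied from the ab report above.
const completeRequests = 3502;
const timeTakenSeconds = 11.342;
const concurrency = 500;

// Throughput: completed requests divided by elapsed time.
const requestsPerSecond = completeRequests / timeTakenSeconds;

// Mean time per request as seen by one of the 500 concurrent clients.
const timePerRequestMs =
    (concurrency * timeTakenSeconds * 1000) / completeRequests;

// Mean time per request across all concurrent requests.
const timePerRequestAcrossAllMs =
    (timeTakenSeconds * 1000) / completeRequests;

console.log(`Requests per second: ${requestsPerSecond.toFixed(2)}`);
console.log(`Time per request: ${timePerRequestMs.toFixed(1)} ms`);
console.log(`Across all requests: ${timePerRequestAcrossAllMs.toFixed(3)} ms`);
```

The two "Time per request" rows differ by exactly the concurrency level: 1619 ms is what each client waits, while 3.2 ms is the average interval between completed responses on the server.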

This result is good and should be enough for most small and medium-sized sites. But we are not making full use of the CPU: most of the available CPU capacity sits idle.

Using Cluster mode

Now let's upgrade our Node.js server to use Cluster mode:

/** cluster.js **/
const os = require("os");
const cluster = require("cluster");
// Note: cluster.isMaster is deprecated since Node.js 16; newer code uses cluster.isPrimary
if (cluster.isMaster) {
   const number_of_cpus = os.cpus().length;

   console.log(`Master ${process.pid} is running`);
   console.log(`Forking Server for ${number_of_cpus} CPUs\n`);
   // Fork one worker process per CPU core
   for (let index = 0; index < number_of_cpus; index++) {
       cluster.fork();
   }
   // Log whenever a worker process exits
   cluster.on("exit", (worker, code, signal) => {
       console.log(`\nWorker ${worker.process.pid} died\n`);
   });
} else {
   require("./server");
}

Most CPUs today have at least two cores. My own machine has an 8th-generation i7 with 8 cores, so with a single Node.js process the other 7 cores sit idle.

Once cluster.js is running, each response reports the pid of whichever Worker process handled it.

Running multiple Node.js instances in Cluster mode makes full use of the CPU and the server's capacity. The Master process assigns each request to one of the 8 Worker processes.
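
One way to see how requests are spread over the Workers is to collect response bodies and count them by pid. The sketch below assumes the response format from server.js above; tallyByPid is a hypothetical helper and the sample pids are made up for illustration.

```javascript
// Hypothetical helper: count how many responses each worker pid produced.
// Expects bodies in the format written by server.js:
// "Process handled by pid: <pid>"
function tallyByPid(bodies) {
    const counts = new Map();
    for (const body of bodies) {
        const match = /pid: (\d+)/.exec(body);
        if (!match) continue; // skip anything that is not a server.js response
        const pid = Number(match[1]);
        counts.set(pid, (counts.get(pid) || 0) + 1);
    }
    return counts;
}

// Sample bodies with made-up pids, standing in for real responses.
const sampleBodies = [
    "Process handled by pid: 63111",
    "Process handled by pid: 63118",
    "Process handled by pid: 63111",
    "Process handled by pid: 63112",
];
const counts = tallyByPid(sampleBodies);
console.log(Object.fromEntries(counts));
```

With a real cluster, a roughly even count per pid suggests the round-robin policy is in effect, while a few pids dominating points to the uneven OS-driven distribution described earlier.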

Load testing the Node.js server in Cluster mode

➜  test_app ab -c 500 -t 10  http://localhost:8080/
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 5000 requests
Completed 10000 requests
Completed 15000 requests
Completed 20000 requests
Finished 20374 requests
Server Software:
Server Hostname:        localhost
Server Port:            8080
Document Path:          /
Document Length:        29 bytes
Concurrency Level:      500
Time taken for tests:   10.000 seconds
Complete requests:      20374
Failed requests:        0
Total transferred:      2118896 bytes
HTML transferred:       590846 bytes
Requests per second:    2037.39 [#/sec] (mean)
Time per request:       245.412 [ms] (mean)
Time per request:       0.491 [ms] (mean, across all concurrent requests)
Transfer rate:          206.92 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   1.3      0      12
Processing:     6  242  15.6    241     369
Waiting:        6  242  15.5    241     368
Total:         18  242  15.5    241     371
Percentage of the requests served within a certain time (ms)
  50%    241
  66%    244
  75%    246
  80%    247
  90%    251
  95%    259
  98%    283
  99%    290
 100%    371 (longest request)

Throughput (Requests per second) rose from 308 req/s to 2037 req/s, about 6.6 times higher, and the mean time per request (Time per request) dropped from 1619 ms to 245 ms. The server completed 3502 requests before and 20374 now (a 5.8x increase).

Note that we did not change server.js at all; adding the ten or so lines in cluster.js was enough to bring this large performance gain.

Availability and zero downtime

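When we have only one server instance and it crashes, the instance has to be restarted, which causes downtime. Even when the restart is automated with a tool such as PM2, there is a delay, and not a single request can be served during that time.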

Simulating a server crash

/*** server.js ***/
const http = require("http");
// Get the id of the current Node.js process
const processId = process.pid;
// Create the HTTP server
const server = http.createServer((req, res) => {
    // Simulate CPU work
    for (let index = 0; index < 1e7; index++);

    res.end(`Process handled by pid: ${processId}`);
});
// Listen on port 8080
server.listen(8080, () => {
    console.log(`Server Started in process ${processId}`);
});
// ⚠️ Warning: the code below only simulates a process crash; do not use it in production
setTimeout(() => {
    process.exit(1);
}, Math.random() * 10000);

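If we add the last three lines above to server.js and restart the server, we can watch every process crash. Once no Worker process is left, the Master process exits too, and the whole service goes down.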

➜  test_app node cluster.js
Master 63104 is running
Forking Server for 8 CPUs
Server Started in process 63111
Server Started in process 63118
Server Started in process 63112
Server Started in process 63130
Server Started in process 63119
Server Started in process 63137
Server Started in process 63142
Server Started in process 63146
Worker 63142 died
Worker 63112 died
Worker 63111 died
Worker 63146 died
Worker 63119 died
Worker 63130 died
Worker 63118 died
Worker 63137 died
➜  test_app

Achieving zero downtime

With multiple server instances, it is easy to improve the server's availability.

Let's open cluster.js and add some code:

/** cluster.js **/
const os = require("os");
const cluster = require("cluster");
// Note: cluster.isMaster is deprecated since Node.js 16; newer code uses cluster.isPrimary
if (cluster.isMaster) {
   const number_of_cpus = os.cpus().length;

   console.log(`Master ${process.pid} is running`);
   console.log(`Forking Server for ${number_of_cpus} CPUs\n`);
   // Fork one worker process per CPU core
   for (let index = 0; index < number_of_cpus; index++) {
       cluster.fork();
   }

   cluster.on("exit", (worker, code, signal) => {
       /**
        * If the worker exited with an error and was not shut down
        * by the Master process, fork a replacement.
        */
       if (code !== 0 && !worker.exitedAfterDisconnect) {
           console.log(`Worker ${worker.process.pid} died`);
           cluster.fork();
       }
   });
} else {
   require("./server");
}
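
The restart condition in the exit handler is easier to reason about when pulled out on its own. The sketch below uses a hypothetical helper, shouldRestart, that mirrors that condition; it is an illustration, not part of the Cluster API.

```javascript
// Hypothetical helper mirroring the check in the "exit" handler:
// restart only when the worker did not exit cleanly (code !== 0) and
// was not shut down on purpose via worker.disconnect() or worker.kill().
function shouldRestart(code, exitedAfterDisconnect) {
    return code !== 0 && !exitedAfterDisconnect;
}

console.log(shouldRestart(1, false)); // true: the worker crashed
console.log(shouldRestart(0, false)); // false: the worker exited cleanly
console.log(shouldRestart(1, true)); // false: the Master shut it down
console.log(shouldRestart(null, false)); // true: no exit code was reported
```

Without the exitedAfterDisconnect check, a deliberate shutdown of the cluster would keep spawning replacement workers.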

Load testing the Node.js server with worker restarts

Let's load test cluster.js again; remember that server.js has been modified above to crash at random times. The results:

➜  test_app ab -c 500 -t 10 -r http://localhost:8080/
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 5000 requests
Completed 10000 requests
Completed 15000 requests
Completed 20000 requests
Finished 20200 requests
Server Software:        
Server Hostname:        localhost
Server Port:            8080
Document Path:          /
Document Length:        29 bytes
Concurrency Level:      500
Time taken for tests:   10.000 seconds
Complete requests:      20200
Failed requests:        12
   (Connect: 0, Receive: 4, Length: 4, Exceptions: 4)
Total transferred:      2100488 bytes
HTML transferred:       585713 bytes
Requests per second:    2019.91 [#/sec] (mean)
Time per request:       247.536 [ms] (mean)
Time per request:       0.495 [ms] (mean, across all concurrent requests)
Transfer rate:          205.12 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   1.5      0      13
Processing:    13  243  15.7    241     364
Waiting:        0  243  16.0    241     363
Total:         22  243  15.5    241     370
Percentage of the requests served within a certain time (ms)
  50%    241
  66%    245
  75%    248
  80%    250
  90%    258
  95%    265
  98%    273
  99%    287
 100%    370 (longest request)
➜  test_app

In this round of load testing, throughput was 2019 req/s. Of 20200 requests in total, only 12 failed, which puts the server's uptime at 99.941%.
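
The 99.941% figure is the share of requests that succeeded, which the article uses as a stand-in for uptime. A minimal sketch of the arithmetic, with variable names of my own choosing:

```javascript
// Figures copied from the ab report above.
const completeRequests = 20200;
const failedRequests = 12;

// Share of requests that were served successfully, as a percentage.
const successRate =
    ((completeRequests - failedRequests) / completeRequests) * 100;

console.log(`${successRate.toFixed(3)}%`); // prints "99.941%"
```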

Original article: "Make NodeJs handle 5x request with 99.9% uptime adding 10 lines of code" (by Biplap Bhattarai)