Node + TypeScript in Practice: Crawling Real-Time Pneumonia Outbreak Data

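Crawlers

From Wikipedia:

A crawler, or web crawler, is also known as a web spider.

Put simply, a crawler is a piece of automated code that imitates a person browsing websites and brings back the information that is needed and valuable.

The goal of this article is to crawl the information on the nationwide COVID-19 real-time tracking page: https://ncov.dxy.cn/ncovh5/view/pneumonia

Crawlers also carry huge commercial value, for example:

Baidu: it sends countless crawlers out to websites every day, fetches their content, and has it ready for when you search;
Ticket-grabbing platforms: we hand our 12306 account and password to the platform, and it keeps refreshing the remaining train tickets on the 12306 site. As soon as a ticket shows up, it books it for us right away.

In the vast majority of cases the crawl target is either a web page or an app; this article covers web pages only.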

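Popular page architectures

This article splits pages into two categories, server-side rendering and client-side rendering, and shows how to crawl each.

Server-side rendering

The page is rendered on the server and returned as is, so the useful information is already inside the requested HTML. Douban Movie Top 250 is one such site.

Open the page, right-click and view the source, and you will find the movie titles, ratings, leading actors, directors and so on.

That makes things easy: for a server-rendered page the HTML already holds what we need, so we only have to parse it.

We send a GET request with Node's built-in https module (entering the Douban URL in a browser is also a GET request). In the end event, which fires once the response has been fully received, we parse the HTML with the third-party cheerio module. From there we select the HTML nodes holding the information, jQuery-style, and assemble the data. The code is as follows: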

const https = require('https');
const fs = require('fs');
const path = require('path');
const cheerio = require('cheerio'); // third-party: npm i cheerio

function spiderMovie(index) {
  // Request one page of the Douban Top 250 list
  https.get('https://movie.douban.com/top250?start=' + index, function (res) {
    var pageSize = 25;
    var html = '';
    var movies = [];
    res.setEncoding('utf-8');
    res.on('data', function (chunk) {
      html += chunk;
    });
    res.on('end', function () {
      var $ = cheerio.load(html);
      $('.item').each(function () {
        // Scope the selector to .pic img inside the current item
        var picUrl = $('.pic img', this).attr('src');
        var movie = {
          title: $('.title', this).text(),
          star: $('.info .star .rating_num', this).text(),
          link: $('a', this).attr('href'),
          picUrl: picUrl
        };
        movies.push(movie);
        // Only download the cover when the item actually has one
        if (picUrl) {
          downloadImg('./img/', picUrl);
        }
      });
      saveData('./data/movie-data' + (index / pageSize) + '.json', movies);
    });
  }).on('error', function (err) {
    console.log(err);
  });
}

The code above fetches the data. Next we save the crawled movie data locally as JSON files and download the movie covers to disk.

/**
 * Download an image.
 * @param {string} imgDir folder the image is stored in
 * @param {string} url URL of the image
 */
function downloadImg(imgDir, url) {
  // The response is a stream, so pipe it straight into a file
  https.get(url, function (res) {
    res.pipe(fs.createWriteStream(imgDir + path.basename(url)))
  }).on('error', function (err) {
    console.log(err);
  });
}
/**
 * Save the movie data as JSON.
 * @param {string} filePath path of the JSON file to write
 * @param {Array} movies array of movie records
 */
function saveData(filePath, movies) {
  console.log(movies);
  fs.writeFile(filePath, JSON.stringify(movies, null, ' '), function (err) {
    if (err) {
      return console.log(err);
    }
    console.log('Data saved');
  });
}
spiderMovie(0);

And that's it.
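Note that spiderMovie(0) fetches only the first page. As a small sketch, assuming the list holds 250 entries at 25 per page (the helper name `pageStarts` and the variable names are made up for illustration), the offsets for the remaining pages could be generated like this:

```typescript
// Hypothetical helper: offsets for the `start` query parameter, one per page.
function pageStarts(total: number, pageSize: number): number[] {
  const starts: number[] = [];
  for (let start = 0; start < total; start += pageSize) {
    starts.push(start);
  }
  return starts;
}

// Assumed values: 250 entries in total, 25 per page (the pageSize used above).
const top250Starts = pageStarts(250, 25);
const top250Urls = top250Starts.map(
  (start) => 'https://movie.douban.com/top250?start=' + start
);
console.log(top250Urls.length); // 10
```

Each offset can then be passed to spiderMovie, preferably with a pause between calls so the site is not flooded with requests.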

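Client-side rendering

The main content of the page is rendered by JavaScript. Unlike server-side rendering, the information we need is no longer in the HTML returned by the server; it is generated by JavaScript.

Suppose we have the following requirement:

Crawl the first three pages of Chat data from this platform, gitbook-chat.

As users we have all noticed that when we open https://gitbook.cn/ the server returns only the first page of Chat data. The next two pages load only after we scroll to the bottom. Their data is requested by JavaScript, which then builds that part of the page, so those two pages are client-side rendered.

For client-side rendering we can simulate a browser, that is, simulate scrolling to the bottom of the page. Puppeteer, which drives headless Chrome (a browser without a user interface), does the job.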

const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({
    // false opens a visible browser window, handy for watching the scroll; set to true to run headless
    headless: false,
    timeout: 5000
  })

  const page = await browser.newPage()

  await page.goto('https://gitbook.cn/', {
    waitUntil: 'load'
  })

  // Start from the content returned by the server, then scroll to load more
  const totalChatLists = await page.evaluate(async () => {
    console.log('Code inside evaluate runs in the browser');
    const sleep = (time) => new Promise((resolve) => {
      setTimeout(resolve, time);
    })
    const scroll = () => {
      const list = document.querySelectorAll('.chat_list .chat_item');
      console.log(list.length, document.body.clientHeight);
      // Smooth-scroll to the bottom once
      window.scrollTo({
        top: document.getElementById('chatItemContainer').clientHeight + 72,
        behavior: 'smooth'
      })
    }
    // First scroll
    scroll();
    await sleep(2000);
    // Second scroll
    scroll();
    await sleep(2000);
    const chatlists = document.querySelectorAll('.chat_list .chat_item')
    return [...chatlists].map(chatEl => {
      const url = chatEl.getAttribute('href');
      const title = chatEl.querySelector('h2').innerText;
      return {
        url, title
      }
    });
  });
  console.log(totalChatLists.length, totalChatLists);
  // Close the browser so the Node process can exit
  await browser.close();
})()

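Puppeteer's API names are self-explanatory. We open a page and call await page.evaluate; the code passed to it runs inside the page. There we use JS to scroll to the bottom twice, waiting 2 s after each scroll. Once the scrolling is done, the content we need has loaded and we can select it.

The result: 60 items in total.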

Crawling the target site

Open https://ncov.dxy.cn/ncovh5/view/pneumonia, right-click and view the source. The HTML markup itself does not contain the data we want, but the script tags do.

Something like:

<body>
    <div>
    </div>
    <script> try { window.getListByCountryTypeService2 = [{}, {}] } catch (e) {} </script>
</body>

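The data is returned to the front end and also placed on the window object. This is the classic way React server-side rendering shares state between the back end and the front end. The code above effectively defines a global variable named getListByCountryTypeService2.

To get the data, we only need to type the variable name from the script into the browser console.

As of the day of writing (2020-02-09), I have collected the following fields: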

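| Page content | Variable name |
| --- | --- |
| Outbreak overview | getStatisticsService |
| National data | getListByCountryTypeService1 |
| Global data | getListByCountryTypeService2 |
| Disease knowledge | getWikiList |
| Live updates | getTimelineService (first 5 items only) |
| Rumors and protection | getIndexRumorList (first 10 items only) |
| Protection knowledge collection | getIndexRecommendList (first 5 items only) |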

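Some of this data is incomplete; as the table notes, the full data is only visible after navigating to another page.

Next, let's work out how to crawl the complete data.

Take "Rumors and protection" as an example and open the rumor ranking page: https://ncov.dxy.cn/ncovh5/view/pneumonia_rumors?from=dxy&source=undefined

It is clear that all the rumors are loaded from a JSON endpoint, and the other two lists work the same way. Collected together: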

| Page content | JSON endpoint |
| --- | --- |
| Rumors and protection | https://file1.dxycdn.com/2020/0130/454/3393874921745912507-115.json?t=26354242 |
| Protection knowledge collection | https://file1.dxycdn.com/2020/0130/542/3393874921746319236-115.json?t=26342815 |
| Live updates | https://file1.dxycdn.com/2020/0130/492/3393874921745912795-115.json?t=26342813 |

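To sum up: data that is fully exposed in a global variable can be read directly. Where the global variable holds only part of the data, we request the JSON endpoint instead.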
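To make the first case concrete, here is a minimal sketch that pulls such a global out of the raw HTML without running any script. It assumes the inline script has the shape shown earlier, `window.<name> = <JSON literal>` wrapped in `try { ... } catch`, and that the literal is valid JSON. The helper `extractWindowVar` and the sample HTML are made up for illustration, and the real page may differ.

```typescript
// Hypothetical helper: find `window.<varName> = <json>` in the page source
// and parse the JSON literal. Returns undefined when nothing usable is found.
// varName is expected to be a plain identifier (it is not regex-escaped).
function extractWindowVar(html: string, varName: string): unknown {
  // Capture lazily up to the `} catch` that closes the wrapping try block.
  const pattern = new RegExp(
    'window\\.' + varName + '\\s*=\\s*([\\s\\S]*?)\\s*\\}\\s*catch'
  );
  const match = pattern.exec(html);
  if (!match) {
    return undefined;
  }
  try {
    return JSON.parse(match[1]);
  } catch {
    return undefined; // the captured text was not valid JSON
  }
}

// Made-up sample in the same shape as the snippet shown earlier.
const sampleHtml =
  '<body><script>try { window.getListByCountryTypeService2 = ' +
  '[{"id": 1}, {"id": 2}] }catch(e){}</script></body>';
const countryList = extractWindowVar(sampleHtml, 'getListByCountryTypeService2');
console.log(countryList);
```

The regular expression is deliberately simple and would be confused by data that itself contains the text `} catch`. The project described below takes a different route and lets jsdom execute the page scripts.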

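Defining the requirements

We want to limit how often the crawler runs: it crawls once every half hour and stores the results in MongoDB (a NoSQL database).

Whenever the front end needs data, we query MongoDB and return the result.

The tech stack:

Backend service: Koa, Koa-router
Language: TypeScript
Data persistence: MongoDB

The project requires MongoDB to be installed, and TypeScript to be installed through npm.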

npm install -g typescript

Crawling the data

Project initialization:

// step 1
npm init -y
// step 2: create tsconfig.json with the following content:
{
  "compilerOptions": {
    /* Basic Options */
    "target": "ES2017",                          /* Specify ECMAScript target version: 'ES3' (default), 'ES5', 'ES2015', 'ES2016', 'ES2017', 'ES2018', 'ES2019' or 'ESNEXT'. */
    "module": "commonjs",                     /* Specify module code generation: 'none', 'commonjs', 'amd', 'system', 'umd', 'es2015', or 'ESNext'. */
    "outDir": "./dist",                        /* Redirect output structure to the directory. */
    "rootDir": "./src",                       /* Specify the root directory of input files. Use to control the output directory structure with --outDir. */

    /* Strict Type-Checking Options */
    "strict": true,                           /* Enable all strict type-checking options. */

    /* Additional Checks */

    /* Module Resolution Options */
    "esModuleInterop": true,                  /* Enables emit interoperability between CommonJS and ES Modules via creation of namespace objects for all imports. Implies 'allowSyntheticDefaultImports'. */

    /* Source Map Options */

    /* Experimental Options */
    "experimentalDecorators": true,        /* Enables experimental support for ES7 decorators. */
    "emitDecoratorMetadata": true,         /* Enables experimental support for emitting type metadata for decorators. */

    /* Advanced Options */
    "forceConsistentCasingInFileNames": true  /* Disallow inconsistently-cased references to the same file. */
  }
}

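tsconfig.json tells the compiler to emit output into the dist directory and to read the sources from the src directory.

Create the following directory structure: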

.
├── README.md
├── article.md
├── config.json
├── dist
├── package.json
├── src
│   ├── index.ts
│   ├── model
│   │   └── model.ts
│   ├── module
│   │   ├── getListByCountryTypeService1.ts
│   │   └── getStatisticsService.ts
│   ├── types
│   │   └── types.ts
│   └── util
│       ├── db.ts
│       ├── query.ts
│       ├── request.ts
│       └── schedule.ts
└── tsconfig.json

Let's start with db.ts in the util directory, which handles everything database related:

// Requires: npm i @types/mongoose mongoose
import mongoose from 'mongoose';

const mongodbUrl: string = 'mongodb://127.0.0.1:27017/2019-nCoV'

const connection = mongoose.createConnection(mongodbUrl);

export default connection;

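In the steps that follow, install whichever packages you import. The code above needs two: mongoose and @types/mongoose. The first is the actual dependency; the second provides type definitions, because some third-party modules are not written in TypeScript and a separate @types package declares their types. If you hit an error like the one below, don't panic: a dependency is missing, so install it with npm.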

src/index.ts:1:17 - error TS2307: Cannot find module 'koa'.
1 import Koa from 'koa';

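Back to the topic. db.ts above connects to MongoDB using the URL the MongoDB server is listening on. With the connection in hand we can create models (Model), and a model defines the fields stored in the database.

In model.ts: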

/** Data overview */
import mongoose from 'mongoose';
import connection from '../util/db';
import { VariableKey, ModelCacheType } from '../types/types';

const Schema = mongoose.Schema;
// Cache of the models created so far, keyed by model name
const modelCache: ModelCacheType = {}
// Define the schema shared by every model
const schema = new Schema({
  inDate: Date,
  data: {}
});

export default function (modelName: VariableKey): mongoose.Model<any> | undefined {
  if (modelCache[modelName]) {
    return modelCache[modelName];
  } else {
    // Create the model
    const model = connection.model(modelName, schema);
    modelCache[modelName] = model
    return model;
  }
}
export {
  modelCache
}

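The schema defines the model. What we store is simple: a date field and a data field. The date field makes it possible to go back through each day's data once the outbreak is over.

const modelCache: ModelCacheType = {} defines a cache of models, typed as ModelCacheType, so that a model is never created twice.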
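The caching idea does not depend on mongoose. As a minimal sketch (the names `cachedFactory`, `create` and `getModel` are invented for illustration and are not part of the project), a generic create-once factory could look like this:

```typescript
// Wrap a factory so that each key is created once and reused afterwards.
function cachedFactory<K extends string, V>(create: (key: K) => V): (key: K) => V {
  const cache = new Map<K, V>();
  return (key: K): V => {
    const existing = cache.get(key);
    if (existing !== undefined) {
      return existing;
    }
    const created = create(key);
    cache.set(key, created);
    return created;
  };
}

// Stand-in for connection.model(...): count how often the factory really runs.
let createCalls = 0;
const getModel = cachedFactory((modelName: string) => {
  createCalls += 1;
  return { modelName };
});

const firstModel = getModel('getWikiList');
const secondModel = getModel('getWikiList');
console.log(firstModel === secondModel, createCalls); // true 1
```

The second call returns the very same object without running the factory again, which is the behavior modelCache provides for models.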

Now we need the ModelCacheType type, which lives in types.ts:

import mongoose from 'mongoose';
// Keys of the data to crawl; these are the fields listed in the table above
enum VariableKey {
  getStatisticsService =  'getStatisticsService',
  getListByCountryTypeService1 = 'getListByCountryTypeService1',
  getListByCountryTypeService2 = 'getListByCountryTypeService2',
  getTimelineService = 'getTimelineService',
  getIndexRumorList = 'getIndexRumorList',
  getIndexRecommendList = 'getIndexRecommendList',
  getWikiList = 'getWikiList'
}
// Model cache: the fields this object may hold; every field is a mongoose.Model
interface ModelCacheType {
  getStatisticsService?: mongoose.Model<any>,
  getListByCountryTypeService1?: mongoose.Model<any>,
  getListByCountryTypeService2?: mongoose.Model<any>,
  getWikiList?: mongoose.Model<any>,
  getTimelineService?: mongoose.Model<any>,
  getIndexRumorList?: mongoose.Model<any>,
  getIndexRecommendList?: mongoose.Model<any>
}

// One entry of the JSON endpoint list: the model it feeds and the URL to request
interface JsonListObj {
  modelName: VariableKey,
  url: string
}
// All the data to crawl, defined as an array
const dataList: VariableKey[] = [
  VariableKey.getIndexRecommendList,
  VariableKey.getListByCountryTypeService1,
  VariableKey.getListByCountryTypeService2,
  VariableKey.getIndexRumorList,
  VariableKey.getStatisticsService,
  VariableKey.getTimelineService,
  VariableKey.getWikiList
]
export {
  VariableKey,
  ModelCacheType,
  JsonListObj,
  dataList
}

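With the database connection and the data model in place we can start crawling. We still need a request helper; once the data is fetched, it goes into the database.

In request.ts: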

import request from 'request';
// Wrap request in a Promise
function req(url: string, option: request.CoreOptions) {
  return new Promise((resolve: (res: any) => void, reject) => {
    request(url, option, (err, res, body: any) => {
      if (err) {
        reject(err);
      }
      else {
        resolve(body)
      }
    })
  })
}
export default req;

In schedule.ts:

import jsdom from "jsdom";
import dayJS from 'dayjs';
import request from './request';
import modelCreate from '../model/model';
import { VariableKey, JsonListObj } from '../types/types';

const { JSDOM } = jsdom;

// Four basic data sets, no pagination
const dataLists: VariableKey[] = [
  VariableKey.getStatisticsService,
  VariableKey.getListByCountryTypeService1,
  VariableKey.getListByCountryTypeService2,
  VariableKey.getWikiList
]
/**
 * Long lists whose data on the page is incomplete
 * 0: Rumors and protection
 * 1: Protection knowledge collection
 * 2: Live updates
 */
const jsonList = [
  {
    modelName: VariableKey.getIndexRumorList,
    url: 'https://file1.dxycdn.com/2020/0130/454/3393874921745912507-115.json?t=26343751'
  },
  {
    modelName: VariableKey.getIndexRecommendList,
    url: 'https://file1.dxycdn.com/2020/0130/542/3393874921746319236-115.json?t=26342815'
  },
  {
    modelName: VariableKey.getTimelineService,
    url: 'https://file1.dxycdn.com/2020/0130/492/3393874921745912795-115.json?t=26342813'
  }
]
function modelCreateByData(modelName: VariableKey, data: object): void {
  // Get (or create) the model
  const model = modelCreate(modelName);
  const inDate = dayJS().format('YYYY-MM-DD HH:mm:ss')
  if (model) {
    // Insert the data; log a failed insert instead of leaving the rejection unhandled
    model.create({ inDate: inDate, data: data })
      .catch((err: any) => console.error('Insert failed', err))
  }
}
async function crawler(): Promise<void> {
  try {
    console.log('Requesting dxy');
    const html: string = await request('https://ncov.dxy.cn/ncovh5/view/pneumonia', {})
    // Got the HTML; let jsdom run its scripts so the globals end up on window
    const dom = new JSDOM(html, { runScripts: 'dangerously' });
    const window = (dom.window as any)
    dataLists.forEach((modelName) => {
      modelCreateByData(modelName, window[modelName]);
    })
    for (let urlObj of jsonList) {
      const jsonListObj = urlObj;
      const data = await request(jsonListObj.url, { json: true })
      const modelName: VariableKey = jsonListObj.modelName
      modelCreateByData(modelName, data.data);
    }
  } catch (error) {
    // Log the failure instead of swallowing it; the next scheduled run tries again
    console.error('Crawl failed', error);
  }
}
function run(): void {
  // Crawl once on startup, then every 30 minutes
  crawler();
  setInterval(() => {
    crawler();
  }, 1000 * 60 * 30)
}
export {
  run
}

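schedule.ts does the actual fetching. It relies on jsdom to build a DOM environment in Node: in the browser the page puts some global variables on window, and through jsdom we can read those variables from window in Node as well. The crawler runs once on startup and then again every 30 minutes.

For the data that is incomplete on window, we request the JSON endpoints (this code is already part of schedule.ts; it is singled out here for explanation):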

for (let urlObj of jsonList) {
  const jsonListObj = urlObj;
  const data = await request(jsonListObj.url, { json: true })
  const modelName: VariableKey = jsonListObj.modelName
  modelCreateByData(modelName, data.data);
}

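jsonList, defined in schedule.ts above, holds every URL that needs to be requested. We just loop over it and request each one; storing the data works the same way for all of them. Everything is ready, so in index.ts: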

import { run } from './util/schedule';
run();

Start it up:

// step 1, first terminal: compile in watch mode (the tsc command requires typescript to be installed)
tsc -w
// step 2, second terminal: run the compiled JS file
nodemon dist/index.js

MongoDB now contains the crawled data.

Building the API

The API paths map one-to-one to the database models:

| API path | Model |
| --- | --- |
| /getStatisticsService | getStatisticsService |
| /getListByCountryTypeService1 | getListByCountryTypeService1 |
| /getListByCountryTypeService2 | getListByCountryTypeService2 |
| /getWikiList | getWikiList |
| /getTimelineService | getTimelineService |
| /getIndexRumorList | getIndexRumorList |
| /getIndexRecommendList | getIndexRecommendList |

This way the URL alone tells us which model to access, and we only have to query the matching data from that model.
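As a small illustration of that mapping (a sketch only; `resolveModelKey` and the shortened key list are hypothetical, and the project itself registers one Koa route per key instead), resolving a request path to a model name can be as simple as stripping the leading slash and checking the result against the known keys:

```typescript
// Shortened, illustrative list; the project keeps the full set of keys in types.ts.
const knownKeys = ['getStatisticsService', 'getWikiList', 'getTimelineService'] as const;
type KnownKey = typeof knownKeys[number];

// Hypothetical helper: '/getWikiList' -> 'getWikiList', unknown paths -> undefined.
function resolveModelKey(requestPath: string): KnownKey | undefined {
  const candidate = requestPath.replace(/^\//, '');
  return knownKeys.find((key) => key === candidate);
}

console.log(resolveModelKey('/getWikiList')); // getWikiList
console.log(resolveModelKey('/unknown'));     // undefined
```

Returning undefined for unknown paths keeps the caller free to answer with a 404 rather than querying a model that does not exist.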

Define the query utility function in query.ts:

import mongoose from 'mongoose';
// Query the data of the given model
export default async function query(model: mongoose.Model<any> | undefined) {
  if (!model) return { msg: 'No data yet' };
  try {
    // A model holds many snapshots; sort by _id descending to pick the most recently stored one
    const docs = await model
      .find({})
      .sort({ _id: -1 })
      .limit(1)
      .exec();
    console.log('Query result', docs);
    return docs;
  } catch (err) {
    console.log('Query failed', err);
    throw err;
  }
}

dataList in types.ts is the array of all endpoints, which is also the array of all models:

const dataList: VariableKey[] = [
  VariableKey.getIndexRecommendList,
  VariableKey.getListByCountryTypeService1,
  VariableKey.getListByCountryTypeService2,
  VariableKey.getIndexRumorList,
  VariableKey.getStatisticsService,
  VariableKey.getTimelineService,
  VariableKey.getWikiList
]

All that is left is to generate the API from this array. While looping we know which model each entry refers to, and query.ts is ready to query it. The full index.ts:

import Koa from 'koa';
import Router from 'koa-router';
import { run } from './util/schedule';
import { dataList } from './types/types'
import query from './util/query';
import modelCreate from './model/model';

const packageJSON = require('../package.json')

const app: Koa = new Koa();
const router: Router = new Router();

// Allow cross-origin requests and answer with JSON
app.use(async (ctx: Koa.BaseContext, next: Koa.Next) => {
  ctx.set({
    'Access-Control-Allow-Credentials': 'true',
    'Access-Control-Allow-Origin': ctx.headers.origin || '*',
    'Access-Control-Allow-Headers': 'X-Requested-With,Content-Type',
    'Access-Control-Allow-Methods': 'PUT,POST,GET,DELETE,OPTIONS',
    'Content-Type': 'application/json; charset=utf-8'
  });
  await next();
})

// Start the scheduled crawler
run();

// Register one GET route per model, e.g. /getWikiList
dataList.forEach(key => {
  router.get(`/${key}`, async (ctx) => {
    const model = modelCreate(key)
    const body = await query(model);
    ctx.body = body;
  })
})

app
  .use(router.routes())
  .use(router.allowedMethods());

const port = process.env.PORT || 3001;
app.listen(port, () => {
  console.log(`${packageJSON.name} is running on port ${port} ....`);
})

Try requesting one of the endpoints:

The data comes back successfully.
