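Web Crawler Development (4) - Crawler Basics: Environment Setup, Defining the options Interface, Extracting the Shared Code, Defining the Abstract Method, Implementing the TeacherPhotos Class, Implementing the NewsList Class, and Summary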

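Encapsulating a Base Crawler Library

The code above is highly repetitive. It can be encapsulated in an object-oriented way to improve code reuse. To ease development and keep the code consistent, TypeScript is recommended for the encapsulation.

The following is extension material and requires some familiarity with object-oriented programming and TypeScript.

Run tsc --init to initialize the project and generate the TypeScript configuration file.

TS configuration: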

{
  "compilerOptions": {
    /* Basic Options */
    "target": "es2015",
    "module": "commonjs",
    "outDir": "./bin",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true
  },
  "include": [
    "src/**/*"
  ],
  "exclude": [
    "node_modules",
    "**/*.spec.ts"
  ]
}

The Spider abstract class: define the options interface and extract the shared code

// Import the http module
const http = require('http')
import SpiderOptions from './interfaces/SpiderOptions'

export default abstract class Spider {
  options: SpiderOptions;
  constructor(options: SpiderOptions = { url: '', method: 'get' }) {
    this.options = options
    this.start()
  }
  start(): void {
    // Create the request object (the HTTP request has not been sent yet)
    let req = http.request(this.options.url, {
      headers: this.options.headers,
      method: this.options.method
    }, (res: any) => {
      // The response arrives asynchronously
      // console.log(res)
      let chunks: any[] = []
      // Listen for the data event and collect each incoming chunk
      res.on('data', (c: any) => chunks.push(c))

      // Listen for the end event, fired once all data has been received
      res.on('end', () => {
        // Concatenate all chunks and convert them to a string ==> HTML string
        let htmlStr = Buffer.concat(chunks).toString('utf-8')
        this.onCatchHTML(htmlStr)
      })
    })

    // Send the request
    req.end()

  }
  abstract onCatchHTML(result: string): any
}
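
Why the chunks are joined with Buffer.concat before being decoded can be illustrated with a small sketch. It assumes a Node.js runtime; the sample string and the split position are made up for illustration. A multi-byte UTF-8 character that straddles two chunks is corrupted if each chunk is decoded on its own, but survives when the raw bytes are joined first.

```typescript
import { Buffer } from 'buffer'

// Hypothetical payload: each of these Chinese characters takes 3 bytes in UTF-8
const whole = Buffer.from('爬虫', 'utf-8') // 6 bytes in total

// Simulate the network delivering the body in two chunks,
// split in the middle of the first character
const chunks: Buffer[] = [whole.subarray(0, 2), whole.subarray(2)]

// Decoding chunk by chunk breaks the character that was split
const decodedPerChunk = chunks.map(c => c.toString('utf-8')).join('')

// Joining the raw bytes first and decoding once restores the text
const decodedOnce = Buffer.concat(chunks).toString('utf-8')

console.log(decodedPerChunk === '爬虫') // false
console.log(decodedOnce === '爬虫')     // true
```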

The SpiderOptions interface:

export default interface SpiderOptions {
  url: string,
  method?: string,
  headers?: object
}
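
Note that the default value in the Spider constructor applies only when no options object is passed at all; a caller who passes just a url gets no method field. A minimal sketch of per-field defaults, assuming a hypothetical helper named withDefaults and a stricter headers type (neither is part of the original code):

```typescript
// Same shape as SpiderOptions, but with headers narrowed to string values (an assumption)
interface StrictSpiderOptions {
  url: string
  method?: string
  headers?: Record<string, string>
}

// Hypothetical helper: fills in every optional field that the caller left out
function withDefaults(options: StrictSpiderOptions): Required<StrictSpiderOptions> {
  return {
    url: options.url,
    method: options.method ?? 'get',
    headers: options.headers ?? {}
  }
}

const resolved = withDefaults({ url: 'http://web.itheima.com/teacher.html' })
console.log(resolved.method) // get
```

The base class could call such a helper in its constructor before storing the options, so that subclasses never see an undefined method.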

The PhotoListSpider class:

import Spider from './Spider'
const cheerio = require('cheerio')
const download = require('download')
export default class PhotoListSpider extends Spider {
  onCatchHTML(result: string) {
    // console.log(result)
    let $ = cheerio.load(result)
    // Build the absolute URL of every photo from the src attribute of its img tag
    let imgs = Array.prototype.map.call($('.tea_main .tea_con .li_img > img'), (item: any) => 'http://web.itheima.com/' + encodeURI($(item).attr('src')))
    // Download all images into the dist folder
    Promise.all(imgs.map(x => download(x, 'dist'))).then(() => {
      console.log('files downloaded!');
    });
  }
}
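
The image URLs above are built by prefixing the site root to each src attribute. A sketch of an alternative that resolves the src against the page URL with the standard URL class, which also copes with absolute and root-relative paths; the helper name resolveImageUrl and the sample src values are hypothetical:

```typescript
// Hypothetical helper: resolve an <img> src against the URL of the page it came from
function resolveImageUrl(src: string, pageUrl: string): string {
  // The URL constructor resolves relative paths and percent-encodes non-ASCII characters
  return new URL(src, pageUrl).href
}

const pageUrl = 'http://web.itheima.com/teacher.html'

// Made-up src values, only for illustration
console.log(resolveImageUrl('images/teacher/张三.jpg', pageUrl))
console.log(resolveImageUrl('/static/a.png', pageUrl))
```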

The NewsListSpider class:

import Spider from "./Spider";

export default class NewsListSpider extends Spider {
  onCatchHTML(result: string) {
    // The endpoint returns JSON, so parse it and print the result
    console.log(JSON.parse(result))
  }
}
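
JSON.parse throws if the response is not valid JSON (for example an HTML error page), and an exception thrown inside onCatchHTML would go unhandled. A minimal sketch of a guarded parse, assuming a hypothetical helper named parseJsonSafely:

```typescript
// Result type: either the parsed value or an error message
type ParseResult =
  | { ok: true; value: unknown }
  | { ok: false; error: string }

// Hypothetical helper: wraps JSON.parse so that bad input does not throw
function parseJsonSafely(text: string): ParseResult {
  try {
    return { ok: true, value: JSON.parse(text) }
  } catch (e) {
    return { ok: false, error: e instanceof Error ? e.message : String(e) }
  }
}

console.log(parseJsonSafely('{"title":"news"}').ok) // true
console.log(parseJsonSafely('<html></html>').ok)    // false
```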

Test file:

import Spider from './Spider'
import PhotoListSpider from './PhotoListSpider'
import NewsListSpider from './NewsListSpider'

let spider1: Spider = new PhotoListSpider({
  url: 'http://web.itheima.com/teacher.html'
})

let spider2: Spider = new NewsListSpider({
  url: 'http://www.itcast.cn/news/json/f1f5ccee-1158-49a6-b7c4-f0bf40d5161a.json',
  method: 'post',
  headers: {
    "Host": "www.itcast.cn",
    "Connection": "keep-alive",
    "Content-Length": "0",
    "Accept": "*/*",
    "Origin": "http://www.itcast.cn",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "DNT": "1",
    "Referer": "http://www.itcast.cn/newsvideo/newslist.html",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Cookie": "UM_distinctid=16b8a0c1ea534c-0c311b256ffee7-e343166-240000-16b8a0c1ea689c; bad_idb2f10070-624e-11e8-917f-9fb8db4dc43c=8e1dcca1-9692-11e9-97fb-e5908bcaecf8; parent_qimo_sid_b2f10070-624e-11e8-917f-9fb8db4dc43c=921b3900-9692-11e9-9a47-855e632e21e7; CNZZDATA1277769855=1043056636-1562825067-null%7C1562825067; cid_litiancheng_itcast.cn=TUd3emFUWjBNV2syWVRCdU5XTTRhREZs; PHPSESSID=j3ppafq1dgh2jfg6roc8eeljg2; CNZZDATA4617777=cnzz_eid%3D926291424-1561388898-http%253A%252F%252Fmail.itcast.cn%252F%26ntime%3D1563262791; Hm_lvt_0cb375a2e834821b74efffa6c71ee607=1561389179,1563266246; qimo_seosource_22bdcd10-6250-11e8-917f-9fb8db4dc43c=%E7%AB%99%E5%86%85; qimo_seokeywords_22bdcd10-6250-11e8-917f-9fb8db4dc43c=; href=http%3A%2F%2Fwww.itcast.cn%2F; bad_id22bdcd10-6250-11e8-917f-9fb8db4dc43c=f2f41b71-a7a4-11e9-93cc-9b702389a8cb; nice_id22bdcd10-6250-11e8-917f-9fb8db4dc43c=f2f41b72-a7a4-11e9-93cc-9b702389a8cb; openChat22bdcd10-6250-11e8-917f-9fb8db4dc43c=true; parent_qimo_sid_22bdcd10-6250-11e8-917f-9fb8db4dc43c=fc61e520-a7a4-11e9-94a8-01dabdc2ed41; qimo_seosource_b2f10070-624e-11e8-917f-9fb8db4dc43c=%E7%AB%99%E5%86%85; qimo_seokeywords_b2f10070-624e-11e8-917f-9fb8db4dc43c=; accessId=b2f10070-624e-11e8-917f-9fb8db4dc43c; pageViewNum=2; nice_idb2f10070-624e-11e8-917f-9fb8db4dc43c=20d2a1d1-a7a8-11e9-bc20-e71d1b8e4bb6; openChatb2f10070-624e-11e8-917f-9fb8db4dc43c=true; Hm_lpvt_0cb375a2e834821b74efffa6c71ee607=1563267937"
  }
})

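After encapsulation, writing a new crawler only requires extending the Spider class and implementing the onCatchHTML method of the concrete crawler class. To test it, create an instance in the test file and pass in the url and headers.

In addition, every crawler shares Spider as its parent class, which makes them easy to manage later on.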
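
As an illustration of how little a new crawler needs, here is a self-contained sketch. To keep it runnable without network access, it uses a hypothetical stand-in base class, OfflineSpider, that hands an already-fetched HTML string to onCatchHTML instead of issuing an HTTP request; a subclass of the real Spider class would implement the same method. TitleSpider and the sample HTML are made up.

```typescript
// Stand-in for Spider (an assumption): same hook, but no network request
abstract class OfflineSpider {
  // Feed an already-fetched HTML string to the subclass hook
  run(html: string): void {
    this.onCatchHTML(html)
  }
  abstract onCatchHTML(result: string): any
}

// A new crawler only has to implement onCatchHTML
class TitleSpider extends OfflineSpider {
  title = ''
  onCatchHTML(result: string) {
    // Extract the text between <title> and </title>, if present
    const match = /<title>([^<]*)<\/title>/i.exec(result)
    this.title = match ? match[1].trim() : ''
  }
}

const spider = new TitleSpider()
spider.run('<html><head><title> Teachers </title></head></html>')
console.log(spider.title) // Teachers
```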

Example 1

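Directory structure (as implied by the file paths used in the steps below): tsconfig.json in the project root; Spider.ts, TeacherPhotos.ts, test.ts, and interfaces/SpiderOptions.ts under src; compiled output under bin.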

Step 1: Run tsc --init to initialize the project and generate the TypeScript configuration file

tsconfig.json

{
  "compilerOptions": {
    /* Basic Options */
    // "incremental": true,                   /* Enable incremental compilation */
    "target": "ES2015", /* Specify ECMAScript target version: 'ES3' (default), 'ES5', 'ES2015', 'ES2016', 'ES2017', 'ES2018', 'ES2019' or 'ESNEXT'. */
    "module": "commonjs", /* Specify module code generation: 'none', 'commonjs', 'amd', 'system', 'umd', 'es2015', or 'ESNext'. */
    // "lib": [],                             /* Specify library files to be included in the compilation. */
    // "allowJs": true,                       /* Allow javascript files to be compiled. */
    // "checkJs": true,                       /* Report errors in .js files. */
    // "jsx": "preserve",                     /* Specify JSX code generation: 'preserve', 'react-native', or 'react'. */
    // "declaration": true,                   /* Generates corresponding '.d.ts' file. */
    // "declarationMap": true,                /* Generates a sourcemap for each corresponding '.d.ts' file. */
    // "sourceMap": true,                     /* Generates corresponding '.map' file. */
    // "outFile": "./",                       /* Concatenate and emit output to single file. */
    "outDir": "./bin", /* Redirect output structure to the directory. */
    "rootDir": "./src", /* Specify the root directory of input files. Use to control the output directory structure with --outDir. */
    // "composite": true,                     /* Enable project compilation */
    // "tsBuildInfoFile": "./",               /* Specify file to store incremental compilation information */
    // "removeComments": true,                /* Do not emit comments to output. */
    // "noEmit": true,                        /* Do not emit outputs. */
    // "importHelpers": true,                 /* Import emit helpers from 'tslib'. */
    // "downlevelIteration": true,            /* Provide full support for iterables in 'for-of', spread, and destructuring when targeting 'ES5' or 'ES3'. */
    // "isolatedModules": true,               /* Transpile each file as a separate module (similar to 'ts.transpileModule'). */
    /* Strict Type-Checking Options */
    "strict": true, /* Enable all strict type-checking options. */
    // "noImplicitAny": true,                 /* Raise error on expressions and declarations with an implied 'any' type. */
    // "strictNullChecks": true,              /* Enable strict null checks. */
    // "strictFunctionTypes": true,           /* Enable strict checking of function types. */
    // "strictBindCallApply": true,           /* Enable strict 'bind', 'call', and 'apply' methods on functions. */
    // "strictPropertyInitialization": true,  /* Enable strict checking of property initialization in classes. */
    // "noImplicitThis": true,                /* Raise error on 'this' expressions with an implied 'any' type. */
    // "alwaysStrict": true,                  /* Parse in strict mode and emit "use strict" for each source file. */
    /* Additional Checks */
    // "noUnusedLocals": true,                /* Report errors on unused locals. */
    // "noUnusedParameters": true,            /* Report errors on unused parameters. */
    // "noImplicitReturns": true,             /* Report error when not all code paths in function return a value. */
    // "noFallthroughCasesInSwitch": true,    /* Report errors for fallthrough cases in switch statement. */
    /* Module Resolution Options */
    // "moduleResolution": "node",            /* Specify module resolution strategy: 'node' (Node.js) or 'classic' (TypeScript pre-1.6). */
    // "baseUrl": "./",                       /* Base directory to resolve non-absolute module names. */
    // "paths": {},                           /* A series of entries which re-map imports to lookup locations relative to the 'baseUrl'. */
    // "rootDirs": [],                        /* List of root folders whose combined content represents the structure of the project at runtime. */
    // "typeRoots": [],                       /* List of folders to include type definitions from. */
    // "types": [],                           /* Type declaration files to be included in compilation. */
    // "allowSyntheticDefaultImports": true,  /* Allow default imports from modules with no default export. This does not affect code emit, just typechecking. */
    "esModuleInterop": true /* Enables emit interoperability between CommonJS and ES Modules via creation of namespace objects for all imports. Implies 'allowSyntheticDefaultImports'. */
    // "preserveSymlinks": true,              /* Do not resolve the real path of symlinks. */
    // "allowUmdGlobalAccess": true,          /* Allow accessing UMD globals from modules. */
    /* Source Map Options */
    // "sourceRoot": "",                      /* Specify the location where debugger should locate TypeScript files instead of source locations. */
    // "mapRoot": "",                         /* Specify the location where debugger should locate map files instead of generated locations. */
    // "inlineSourceMap": true,               /* Emit a single file with source maps instead of having a separate file. */
    // "inlineSources": true,                 /* Emit the source alongside the sourcemaps within a single file; requires '--inlineSourceMap' or '--sourceMap' to be set. */
    /* Experimental Options */
    // "experimentalDecorators": true,        /* Enables experimental support for ES7 decorators. */
    // "emitDecoratorMetadata": true,         /* Enables experimental support for emitting type metadata for decorators. */
  },
  "include": [
    "src/**/*"
  ],
  "exclude": [
    "node_modules",
    "**/*.spec.ts"
  ]
}

Step 2: In the src folder, create the Spider class and define the abstract method

src/Spider.ts

// Goal: when writing a crawler in the future, create a class that extends this base class,
// then handle the fetched result in the subclass

// Usage: create a crawler object and pass in the URL; crawling starts automatically
const http = require('http')
import SpiderOptions from './interfaces/SpiderOptions'

export default abstract class Spider {
  // Member declaration
  options: SpiderOptions
  // The shape of options is defined by the SpiderOptions interface
  constructor(options: SpiderOptions = { url: '', method: 'get' }) {
    // Initialization
    this.options = options
    this.start()
  }
  start() {
    // Create the request object
    let req = http.request(this.options.url, {
      headers: this.options.headers,
      method: this.options.method
    }, (res: any) => {
      let chunks: any[] = []
      res.on('data', (c: any) => chunks.push(c))

      res.on('end', () => {
        let result = Buffer.concat(chunks).toString('utf-8')
        // console.log(result)
        // Call the abstract method: the base class does not know what its subclasses do with the result, it only calls the hook
        // The concrete behavior is supplied by each subclass that extends Spider
        this.onCatchHTML(result)
      })
    })

    // Send the request
    req.end()
  }

  abstract onCatchHTML(result: string): any
}

Step 3: Create a folder named interfaces, add the interface file SpiderOptions.ts inside it, and export the options interface

src/interfaces/SpiderOptions.ts

export default interface SpiderOptions {
  url: string,
  method?: string,
  headers?: object
}

Step 4: In the src folder, create the crawler file TeacherPhotos.ts

src/TeacherPhotos.ts

// Once the base class is in place, writing a crawler takes only these steps:
// 1. Write a crawler class that extends Spider
// 2. Implement the onCatchHTML method (what to do once the resource has been fetched)
// 3. Usage: create an instance of the crawler and pass in the URL
const cheerio = require('cheerio')
const download = require('download')
import Spider from './Spider'
export default class TeacherPhotos extends Spider {
  onCatchHTML(result: string) {
    // What happens after the HTML is fetched is implemented by the subclass
    // console.log(result)

    // Download the images based on the src attribute of the img tags in the HTML
    let $ = cheerio.load(result)
    let imgs = Array.prototype.map.call($('.tea_main .tea_con .li_img > img'), (item: any) => 'http://web.itheima.com/' + encodeURI($(item).attr('src')))
    Promise.all(imgs.map(x => download(x, 'dist'))).then(() => {
      console.log('files downloaded!');
    });
  }
}

Step 5: In the src folder, create the test file test.ts

src/test.ts

import TeacherPhotos from './TeacherPhotos'
new TeacherPhotos({
  url: 'http://web.itheima.com/teacher.html'
})

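After compiling to test.js, run the file:

node .\bin\test.js

At this point, the HTML of http://web.itheima.com/teacher.html has been fetched, and the teacher photos it references are downloaded into the dist folder.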

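Example 2

The first four steps are the same as above, except that the crawler class created in step 4 is NewsList (src/NewsList.ts), which the test file below imports. Its code is not listed here; the NewsListSpider class shown earlier, which parses the response as JSON, would serve the same purpose.

Step 5: In the src folder, create the test file test.ts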

src/test.ts

import NewsList from './NewsList'
new NewsList({
  url: 'http://www.itcast.cn/news/json/f1f5ccee-1158-49a6-b7c4-f0bf40d5161a.json'
})

Request headers can also be configured:

src/test.ts

// import Spider from './Spider'

// new Spider({
//   url: 'http://www.itcast.cn/newsvideo/newslist.html'
// })

// new Spider({
//   url: 'http://web.itheima.com/teacher.html'
// })

// import TeacherPhotos from './TeacherPhotos'
// new TeacherPhotos({
//   url: 'http://web.itheima.com/teacher.html'
// })

import NewsList from './NewsList'
new NewsList({
  url: 'http://www.itcast.cn/news/json/f1f5ccee-1158-49a6-b7c4-f0bf40d5161a.json',
  method: 'post',
  headers: {
    "Host": "www.itcast.cn",
    "Connection": "keep-alive",
    "Content-Length": "0",
    "Accept": "*/*",
    "Origin": "http://www.itcast.cn",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "DNT": "1",
    "Referer": "http://www.itcast.cn/newsvideo/newslist.html",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Cookie": "UM_distinctid=16b8a0c1ea534c-0c311b256ffee7-e343166-240000-16b8a0c1ea689c; bad_idb2f10070-624e-11e8-917f-9fb8db4dc43c=8e1dcca1-9692-11e9-97fb-e5908bcaecf8; parent_qimo_sid_b2f10070-624e-11e8-917f-9fb8db4dc43c=921b3900-9692-11e9-9a47-855e632e21e7; CNZZDATA1277769855=1043056636-1562825067-null%7C1562825067; cid_litiancheng_itcast.cn=TUd3emFUWjBNV2syWVRCdU5XTTRhREZs; PHPSESSID=j3ppafq1dgh2jfg6roc8eeljg2; CNZZDATA4617777=cnzz_eid%3D926291424-1561388898-http%253A%252F%252Fmail.itcast.cn%252F%26ntime%3D1563262791; Hm_lvt_0cb375a2e834821b74efffa6c71ee607=1561389179,1563266246; qimo_seosource_22bdcd10-6250-11e8-917f-9fb8db4dc43c=%E7%AB%99%E5%86%85; qimo_seokeywords_22bdcd10-6250-11e8-917f-9fb8db4dc43c=; href=http%3A%2F%2Fwww.itcast.cn%2F; bad_id22bdcd10-6250-11e8-917f-9fb8db4dc43c=f2f41b71-a7a4-11e9-93cc-9b702389a8cb; nice_id22bdcd10-6250-11e8-917f-9fb8db4dc43c=f2f41b72-a7a4-11e9-93cc-9b702389a8cb; openChat22bdcd10-6250-11e8-917f-9fb8db4dc43c=true; parent_qimo_sid_22bdcd10-6250-11e8-917f-9fb8db4dc43c=fc61e520-a7a4-11e9-94a8-01dabdc2ed41; qimo_seosource_b2f10070-624e-11e8-917f-9fb8db4dc43c=%E7%AB%99%E5%86%85; qimo_seokeywords_b2f10070-624e-11e8-917f-9fb8db4dc43c=; accessId=b2f10070-624e-11e8-917f-9fb8db4dc43c; pageViewNum=2; nice_idb2f10070-624e-11e8-917f-9fb8db4dc43c=20d2a1d1-a7a8-11e9-bc20-e71d1b8e4bb6; openChatb2f10070-624e-11e8-917f-9fb8db4dc43c=true; Hm_lpvt_0cb375a2e834821b74efffa6c71ee607=1563267937"
  }
})

After compiling to test.js, run the file:

node .\bin\test.js

At this point, the data returned by the endpoint has been fetched.
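
One caveat with the headers used in this example: because the request advertises Accept-Encoding: gzip, deflate, the server is allowed to send a compressed body, in which case decoding the raw bytes as UTF-8 would not yield readable JSON. Whether this particular server compresses its responses is not known here. A sketch of decoding according to the Content-Encoding response header with Node's zlib module, where decodeBody is a hypothetical helper:

```typescript
import { Buffer } from 'buffer'
import { gunzipSync, gzipSync, inflateSync } from 'zlib'

// Hypothetical helper: turn the raw response bytes into text,
// decompressing first when the Content-Encoding header says so
function decodeBody(raw: Buffer, contentEncoding?: string): string {
  switch ((contentEncoding || '').toLowerCase()) {
    case 'gzip':
      return gunzipSync(raw).toString('utf-8')
    case 'deflate':
      return inflateSync(raw).toString('utf-8')
    default:
      return raw.toString('utf-8')
  }
}

// Simulated compressed response body (made-up JSON)
const compressed = gzipSync(Buffer.from('{"ok":true}', 'utf-8'))
console.log(decodeBody(compressed, 'gzip')) // {"ok":true}
```

Inside the end handler of Spider, such a helper could be called with Buffer.concat(chunks) and res.headers['content-encoding']. A simpler option is to leave the Accept-Encoding header out of the request.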

Reposted from blog.csdn.net/weixin_44867717/article/details/134366987