Writing a comic download crawler with Koa 2.x

Reproduced from https://www.dazhuanlan.com/2019/08/25/5d6235bb190fd/


Writing a comic download crawler with Koa 2.x:
Using Koa 2.x with async/await to solve the asynchronous control problem, we will write a crawler that downloads comics. There are some nice extras in the code!

Project setup

  1. Install Node.js >= 7.6 and install koa-generator
  2. Run koa2 spider to scaffold the project
  3. Install request, request-promise, cheerio, and mkdirp
  4. Run npm install to install the dependencies (the corresponding commands are sketched below)
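
For reference, the steps above roughly correspond to the following commands; the project name spider follows step 2, and the exact flags may differ with your koa-generator version:

npm install -g koa-generator
koa2 spider
cd spider
npm install request request-promise cheerio mkdirp --save
npm install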

The idea

The idea behind an image or comic crawler is simple: first observe the pattern of the page URLs and add every URL that follows it to the task list. Each task requests the HTML content, parses it, and locates the image URLs to download (usually the src attribute of img tags). The image URLs are collected into an array, and async/await is used to control all the tasks until every image has been downloaded.

The difficulty

Node.js itself is asynchronous, so simply downloading inside a for loop will not work well; the key is controlling the asynchronous execution properly.
Crawling is easy, handling the asynchrony is the hard part. Here I use ES7 async/await together with Promises to solve the problem; you could also use asynchronous control modules such as async or eventproxy. The minimal sketch below shows the pattern.
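
To make the pattern concrete, here is a minimal sketch; the fetchPage helper and the page URLs are placeholders for illustration, while the real spider.js below follows the same shape:

const request = require('request-promise');

// Placeholder helper: fetch one page and return its HTML.
async function fetchPage(url) {
    return await request(url);
}

async function crawlAll(urls) {
    // Start every request at once, then wait until all of them have finished.
    const pages = await Promise.all(urls.map(fetchPage));
    console.log('fetched', pages.length, 'pages');
    return pages;
}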

Core code, spider.js

       
       
const fs = require('fs');
const request = require('request-promise');
const cheerio = require('cheerio');
const mkdirp = require('mkdirp');
const config = require('../config');

exports.download = async function (ctx, next) {
    const dir = 'images';
    let links = [];
    let tasks = [];
    let downloadTask = [];
    let url = config.url;
    // Make sure the output directory exists before any file is written.
    mkdirp.sync(dir);
    // Build the page URLs from the observed pattern:
    // the first page is <url>.html, the following pages are <url>_<i>.html.
    for (let i = 1; i <= config.size; i++) {
        let link = url + '_' + i + '.html';
        if (i === 1) {
            link = url + '.html';
        }
        tasks.push(getResLink(i, link));
    }
    // Fetch and parse all pages concurrently.
    links = await Promise.all(tasks);
    console.log('links==========', links.length);
    // Queue one download per image; the file name is the page index plus
    // the last four characters of the link (the file extension).
    for (let i = 0; i < links.length; i++) {
        let item = links[i];
        let index = item.split('___')[0];
        let src = item.split('___')[1];
        downloadTask.push(downloadImg(src, dir, index + links[i].substr(-4, 4)));
    }
    await Promise.all(downloadTask);
};

// Download one image and resolve only when the file has been fully written.
function downloadImg(url, dir, filename) {
    console.log('download begin---', url);
    return new Promise(function (resolve, reject) {
        request.get(url)
            .pipe(fs.createWriteStream(dir + '/' + filename))
            .on('close', function () {
                console.log('download success', url);
                resolve();
            })
            .on('error', reject);
    });
}

// Fetch one page, pick out the image src values with the configured selector,
// and return '<pageIndex>___<firstImageUrl>'.
async function getResLink(index, url) {
    const body = await request(url);
    let urls = [];
    const $ = cheerio.load(body);
    $(config.rule).each(function () {
        urls.push($(this).attr('src'));
    });
    return index + '___' + urls[0];
}
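
Because exports.download has the standard Koa middleware signature (ctx, next), it can be mounted as a route. A minimal sketch, assuming the default koa-generator layout with koa-router; the route path and file locations are assumptions, not part of the original project:

const router = require('koa-router')();
const spider = require('./spider');

// Visiting /download starts the crawl and responds once every image is saved.
router.get('/download', async function (ctx, next) {
    await spider.download(ctx, next);
    ctx.body = 'download finished';
});

module.exports = router;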

Basic configuration

Since the complexity of a crawler depends heavily on the target site and the task, only a few commonly changed variables are extracted into config.js.

       
       
module.exports = {
    // Starting URL of the gallery to crawl
    url: 'http://www.xieet.com/meinv/230',
    // Number of pages to fetch
    size: 10,
    // CSS selector that matches the img tags to download
    rule: '.imgbox a img'
};
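
Since config.rule is just a CSS selector handed to cheerio, adapting the crawler to another site mostly means changing these three values. A hypothetical example; the URL and selector below are made up for illustration:

module.exports = {
    // First page of the gallery to crawl (made-up URL)
    url: 'http://example.com/gallery/1',
    // Number of pages to fetch
    size: 5,
    // CSS selector matching the img tags on that site
    rule: '.gallery-item img'
};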

Running the code

  1. Download the code I uploaded: koa-spider
  2. Run npm install, then npm start

Summary

In fact, whether you are writing a crawler or almost anything else with Node.js, a large part of the work is asynchronous processing, so you really must learn how Node.js handles asynchrony.

Source: www.cnblogs.com/petewell/p/11408081.html