Crawling Meizitu with Node.js: using the crawler package

Last October I wrote "Crawling 80,000+ Lvmama travel records in 2 hours with Node.js". In that project I used request for network requests and cheerio for DOM parsing.
Later I came across the crawler package and played with it yesterday. It feels comfortable to use, and it covers the functionality of both packages above.
This time I crawled images from Meizitu.
Before crawling, it is worth reading how crawler is used. It is actually quite simple, so I won't go into detail here; please check the documentation yourself.
The other thing is to study the URL pattern of the Meizitu pages, and to check whether the site has an anti-scraping mechanism (e.g. blocking the same IP after many requests in a short period). The pattern I found is that the home page http://www.meizitu.com and http://www.meizitu.com/a/more_1.html render the same content; the number in the second URL tells you directly which page is being accessed.
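If you are worried about the anti-scraping point above, node-crawler ships a `rateLimit` option that spaces out requests from the same process. A minimal sketch (the 2000 ms value is an arbitrary assumption; check the crawler documentation for the behaviour in your version, since setting `rateLimit` may restrict the crawler to a single connection):

```javascript
// Sketch: throttling requests so one IP does not fire many requests
// in a short period. rateLimit is a documented node-crawler option.
var Crawler = require('crawler');

var polite = new Crawler({
    rateLimit: 2000,  // wait ~2 seconds between requests
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            console.log(res.$('title').text());
        }
        done();
    }
});

polite.queue('http://www.meizitu.com/a/more_1.html');
```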
Practical steps:
1. Calculate the total number of pages

var Crawler = require('crawler');

var c = new Crawler({
    maxConnections: 10,  // at most 10 concurrent connections
    retries: 5,          // retry a failed request up to 5 times
    // Default callback for each crawled page; it is not used here,
    // because the queued task below supplies its own callback.
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;
            console.log($("title").text());
        }
        done();
    }
});

c.queue([{
    uri: 'http://www.meizitu.com/a/more_1.html',
    jQuery: true,
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;  // cheerio instance: parse the DOM as if with jQuery
            var total_pag = 0;
            $('#wp_page_numbers li a').each(function (index, item) {
                if ($(item).text() == '末页') {  // '末页' is the "last page" link
                    total_pag = $(item).attr('href');
                    var regexp = /[0-9]+/g;
                    total_pag = total_pag.match(regexp)[0];  // total number of pages
                }
            });
            console.log('total pages:', total_pag);
        }
        done();
    }
}]);

This step fetches the first list page and computes the total number of pages from the pagination links.
Each list page shows a set of themes, and clicking a theme opens its detail page. The first thing I do is save the themes.
2. Download the theme

function downloadContent(i, c) {
    var uri = 'http://www.meizitu.com/a/more_' + i + '.html';
    c.queue([{
        uri: uri,
        jQuery: true,
        callback: function (error, res, done) {
            if (error) {
                console.log(error);
            } else {
                var $ = res.$;
                var meiziSql = '';
                $('.wp-item .pic a').each(function (index, item) {
                    var href = $(item).attr('href');       // article URL
                    var regexp = /[0-9]+/g;
                    var artice_id = href.match(regexp)[0]; // extract the article ID
                    var title = $(item).children('img').attr('alt');
                    title = title.replace(/<[^>]+>/g, ''); // strip <b></b> tags
                    var src = $(item).children('img').attr('src');
                    var create_time = new Date().getTime();
                    if (href == 'http://www.meizitu.com/a/3900.html') {
                        // this article's title contains a stray single quote
                        // that breaks the INSERT statement, so remove it
                        title = title.replace(/'/g, '');
                    }
                    var values = "'" + artice_id + "'" + ','
                            + "'" + title + "'" + ','
                            + "'" + href + "',"
                            + "'" + src + "'" + ','
                            + "'" + create_time + "'";
                    meiziSql = meiziSql + 'insert ignore into meizitu_all(artice_id,title,href,src,create_time) VALUES(' + values + ');';
                });
                // `pool` is a mysql connection pool created elsewhere; since
                // meiziSql concatenates several statements into one string,
                // the pool must be created with multipleStatements: true
                pool.getConnection(function (err, connection) {
                    if (err) {
                        console.log('failed to get a DB connection', i);
                        return;
                    }
                    connection.query(meiziSql, function (err, results) {
                        connection.release();
                        if (err) {
                            console.log(err, i);
                        } else {
                            console.log('insert succeeded, page', i);
                        }
                    });
                });
            }
            done();
        }
    }]);
}
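The function above assumes a mysql connection pool named `pool`. A minimal sketch of how it might be created (host, credentials, and database name are placeholders; only `multipleStatements` matters here, because `meiziSql` bundles several INSERT statements into one query string):

```javascript
// Hypothetical pool setup assumed by downloadContent; credentials are placeholders.
var mysql = require('mysql');

var pool = mysql.createPool({
    host: 'localhost',
    user: 'root',
    password: 'your_password',
    database: 'meizitu',
    connectionLimit: 10,
    multipleStatements: true  // allow several INSERTs in one query() call
});
```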

This saves every theme on every list page.
2,806 themes in total; haha, enough to look at for a year (if you don't get aesthetic fatigue).
Looking at this function downloadContent(i, c), you may have a question: why two parameters? The parameter c is easy to understand, but what is i for? I won't say what it does here for the time being; you will see in the next article.
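For completeness, one straightforward (if naive) way to drive downloadContent is a plain loop over the page index, passing the shared Crawler instance as c. This is only a sketch under the assumption that totalPages was computed in step 1:

```javascript
// Naive driver loop (sketch): totalPages comes from step 1, c is the
// shared Crawler instance. node-crawler caps concurrency internally
// via maxConnections, so queueing all pages up front is workable.
for (var i = 1; i <= totalPages; i++) {
    downloadContent(i, c);
}
```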
Only crawler is covered here. As you can see from the code, crawler both issues the network request and hands you the DOM wrapped in a jQuery-style object for parsing.
Together with what I did before, the pattern is the same: issue a network request, obtain the page's DOM, then parse the DOM nodes. The next article covers using async to improve crawling efficiency.
