In October last year, I wrote "Crawling 80,000+ LvMama travel records with Node.js in 2 hours". Back then I used request for HTTP requests and cheerio for DOM parsing.
Later I came across the crawler package online and played with it yesterday. It is pleasant to use, and it combines the functionality of those two packages. This time I used it to crawl Meizitu, a photo gallery site.
Before you start, I recommend reading up on how crawler is used. It is actually quite simple, so I won't go into detail here; please check the documentation yourself.
The other thing is to study the URL pattern of Meizitu's pages, and to check whether the site has an anti-scraping mechanism (for example, blocking an IP that makes many requests in a short period). The pattern I found: the home page http://www.meizitu.com and http://www.meizitu.com/a/more_1.html show the same content, and the number in the second URL tells you which listing page is being requested.
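The URL rule above can be captured in a tiny helper. Note that pageUrl and pageNumber are hypothetical names of my own for illustration, not part of the crawler package:

```javascript
// Sketch of the URL pattern described above: listing page i lives at
// /a/more_<i>.html.
function pageUrl(i) {
  return 'http://www.meizitu.com/a/more_' + i + '.html';
}

// The page number can be read back out of such a URL with the same
// digit regex used later in the article.
function pageNumber(url) {
  return Number(url.match(/[0-9]+/g)[0]);
}
```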
Practical steps:
1. Calculate the total number of pages
var c = new Crawler({
    maxConnections: 10, // at most 10 concurrent connections
    retries: 5,         // retry up to 5 times on failure
    // Default callback for each crawled page; it will not run here,
    // because the task queued below supplies its own callback.
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;
            console.log($("title").text());
        }
        done();
    }
});
c.queue([{
    uri: 'http://www.meizitu.com/a/more_1.html',
    jQuery: true,
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$; // res.$ lets us parse the DOM with jQuery-style selectors
            var total_pag = 0;
            $('#wp_page_numbers li a').each(function (index, item) {
                if ($(item).text() == '末页') { // '末页' is the site's "last page" link
                    total_pag = $(item).attr('href');
                    var regexp = /[0-9]+/g;
                    total_pag = total_pag.match(regexp)[0]; // total number of pages
                }
            });
        }
        done();
    }
}]);
This step reads the pagination links and extracts the total number of pages. Each listing page shows a set of themes, and clicking a theme opens its detail page. My approach is to download the theme listings first.
2. Download the themes
function downloadContent(i, c) {
    var uri = 'http://www.meizitu.com/a/more_' + i + '.html';
    c.queue([{
        uri: uri,
        jQuery: true,
        callback: function (error, res, done) {
            if (error) {
                console.log(error);
            } else {
                var $ = res.$;
                var meiziSql = '';
                $('.wp-item .pic a').each(function (index, item) {
                    var href = $(item).attr('href'); // the theme's detail-page URL
                    var regexp = /[0-9]+/g;
                    var artice_id = href.match(regexp)[0]; // article ID taken from the URL
                    var title = $(item).children('img').attr('alt');
                    title = title.replace(/<[^>]+>/g, ''); // strip tags such as <b></b>
                    var src = $(item).children('img').attr('src');
                    var create_time = new Date().getTime();
                    if (href == 'http://www.meizitu.com/a/3900.html') {
                        // This one title contains a stray single quote, which breaks
                        // the concatenated INSERT below, so strip it.
                        title = title.replace(/'/g, '');
                    }
                    var values = "'" + artice_id + "'" + ','
                        + "'" + title + "'" + ','
                        + "'" + href + "',"
                        + "'" + src + "'" + ','
                        + "'" + create_time + "'";
                    meiziSql = meiziSql + 'insert ignore into meizitu_all(artice_id,title,href,src,create_time) VALUES(' + values + ');';
                });
                // `pool` is a mysql connection pool created elsewhere; sending several
                // statements in one query requires multipleStatements: true in its config.
                pool.getConnection(function (err, connection) {
                    if (err) {
                        console.log('database connection failed', i);
                        return;
                    }
                    connection.query(meiziSql, function (err, results) {
                        connection.release();
                        if (err) {
                            console.log(err, i);
                        } else {
                            console.log('insert succeeded', i);
                        }
                    });
                });
            }
            done();
        }
    }]);
}
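The single-quote special case for that one article hints at a general problem: values are spliced into the SQL as raw strings. With the node mysql driver, placeholder queries (connection.query('insert ignore into ... VALUES (?,?,?,?,?)', values)) are the robust fix; but even in the string-concatenation style above, a small quoting helper handles any title. sqlQuote is a hypothetical name of my own, not from the article:

```javascript
// Hypothetical helper: quote a value for SQL by doubling embedded single
// quotes, which is how standard SQL (and MySQL) escapes them inside a
// single-quoted literal.
function sqlQuote(value) {
  return "'" + String(value).replace(/'/g, "''") + "'";
}

// Building one row's VALUES list with it:
var values = [123, "girl's photo", 'http://example/a/123.html'].map(sqlQuote).join(',');
// values === "'123','girl''s photo','http://example/a/123.html'"
```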
This downloads every theme on every listing page.
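The glue between the two steps is a simple loop over the page count from step 1. This is a sketch with a hypothetical name (queueAllPages); fetchPage stands in for downloadContent(i, c) so the shape of the loop is visible on its own:

```javascript
// Sketch: once the total page count is known, visit every listing page.
// fetchPage is a callback standing in for downloadContent(i, c).
function queueAllPages(totalPages, fetchPage) {
  for (var i = 1; i <= totalPages; i++) {
    fetchPage(i);
  }
}
```

In the article's code this would be called as queueAllPages(total_pag, function (i) { downloadContent(i, c); }). The next article replaces this naive loop with async to control how many pages are in flight at once.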
A total of 2806 themes, haha; enough to look at for a year (if you don't get aesthetic fatigue).
Looking at the function downloadContent(i, c), you may have a question: why two parameters? c is easy to understand, but what is i for? I won't explain it here for now; you will see in the next article.
Only the crawler package is covered here. As you can see from the code, crawler can both make the network request and expose the DOM in a jQuery-like form for parsing. Combined with what I did before, a crawler really just makes an HTTP request, obtains the page's DOM, and parses its nodes. The next article covers using async to improve crawling efficiency.