Crawling NetEase Cloud Music songs with Node

Background: my dad asked me to download several thousand songs for him to play in the car. Downloading them by hand, even in batches, would take a long time, so I wrote a simple crawler to download them automatically.

For this small crawler project I chose node + koa2. Initialize the project with koa2 projectName (koa-generator must be installed globally first), then enter the project directory and run npm install && npm start. The dependencies used are superagent, cheerio, async, fs, and path.

Open the NetEase Cloud Music web version and go to the playlist page; I chose the Chinese-language category. Right-click to view the frame source and get the real url, then find the html element with the id m-pl-container, which holds the list of playlists to crawl. Requesting the url directly with superagent only fetches the first page, so async is used to request the remaining pages with a concurrency limit.
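
getPlayList below calls this.getPageUrl(), which the post does not show. A minimal sketch of what such a helper might look like, assuming the playlist page is paged through limit and offset query parameters with 35 playlists per page. The base URL, the cat value, the page size, and the page count are all assumptions to check against the real url found in the frame source:

```javascript
// Hypothetical helper: builds one URL per playlist page.
// The base URL, category, page size and page count are assumptions.
const PLAYLIST_URL = 'https://music.163.com/discover/playlist/';
const PAGE_SIZE = 35;

function getPageUrl(pageCount = 3, category = '华语') {
	const urls = [];
	for (let page = 0; page < pageCount; page++) {
		const url = new URL(PLAYLIST_URL);
		url.searchParams.set('cat', category);
		url.searchParams.set('limit', String(PAGE_SIZE));
		url.searchParams.set('offset', String(page * PAGE_SIZE));
		urls.push(url.toString());
	}
	return urls;
}
```

Building the query with URL and searchParams takes care of encoding the non-ASCII category name.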

static getPlayList(){
	const pageUrlList = this.getPageUrl();

	return new Promise((resolve, reject) => {
		asy.mapLimit(pageUrlList, 1, (url, callback) => {
			this.requestPlayList(url, callback);
		}, (err, result) => {
			if(err){
				return reject(err);
			}

			resolve(result);
		})
	})
}

Here asy is const asy = require('async'); it is named asy to distinguish it from the async/await keywords. requestPlayList is the request made with superagent:

static requestPlayList(url, callback){
	superagent.get(url).set({
		'Connection': 'keep-alive'
	}).end((err, res) => {
		if(err){
			console.info(err);
			callback(null, []);   // report a failed page as an empty list so later loops can still iterate it
			return;
		}

		const $ = cheerio.load(res.text);
		let curList = this.getCurPalyList($);
		callback(null, curList);  
	})
}
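
To make the role of the limit argument in asy.mapLimit concrete, here is a simplified, dependency-free stand-in for the idea: at most limit workers run at once, and the results keep the input order. It is a sketch for illustration only, not the async library's implementation, and mapWithLimit is a made-up name:

```javascript
// Simplified stand-in for the idea behind async's mapLimit (not the library's code).
// Runs worker(item) for every item with at most `limit` calls in flight
// and resolves with the results in input order.
function mapWithLimit(items, limit, worker) {
	return new Promise((resolve, reject) => {
		const results = new Array(items.length);
		let next = 0;       // index of the next item to start
		let active = 0;     // number of workers currently running
		let failed = false; // set once any worker rejects

		if (items.length === 0) return resolve(results);

		const launch = () => {
			while (!failed && active < limit && next < items.length) {
				const index = next++;
				active++;
				Promise.resolve()
					.then(() => worker(items[index]))
					.then((value) => {
						results[index] = value;
						active--;
						if (next >= items.length && active === 0) resolve(results);
						else launch();
					})
					.catch((err) => {
						failed = true; // stop starting new work after the first error
						reject(err);
					});
			}
		};
		launch();
	});
}
```

With a limit of 1, as in getPlayList, the pages are requested strictly one after another, which keeps the load on the site low.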

getCurPalyList extracts the information from the page; the $ object is passed in for the dom operations:

static getCurPalyList($){
	let list = [];

	$('#m-pl-container li').each(function(i, elem){
		let _this = $(elem);
		list.push({
			name: _this.find('.dec a').text(),
			href: _this.find('.dec a').attr('href'),
			number: _this.find('.nb').text()
		});
	});

	return list;
}

This completes crawling the list of playlists. The next step is to crawl the song list of each playlist:

static async getSongList(){
	const urlCollection = await playList.getPlayList();

	let urlList = [];
	for(let item of urlCollection){
		for(let subItem of item){
			urlList.push(baseUrl + subItem.href);
		}
	}

	return new Promise((resolve, reject) => {
		asy.mapLimit(urlList, 1, (url, callback) => {
			this.requestSongList(url, callback);
		}, (err, result) => {
			if(err){
				return reject(err);
			}

			resolve(result);
		})
	})
}

requestSongList works much like requestPlayList above, so it is not repeated here. Once the code above has fetched the song lists, the songs need to be downloaded to the local disk.
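
The nested loops in getSongList can also be written as a flatten-and-map step. A small sketch, assuming baseUrl is the site root (the post does not show its value) and that the result for a page that failed to load may be missing or empty; toAbsoluteUrls is a hypothetical name:

```javascript
// Assumption: baseUrl is the site root; the post does not show its value.
const baseUrl = 'https://music.163.com';

// Turns the per-page results ([[{href}, ...], ...]) into one flat list of
// absolute URLs, skipping missing pages and entries without an href.
function toAbsoluteUrls(pages) {
	return pages
		.filter(Array.isArray)
		.flat()
		.filter((entry) => entry && entry.href)
		.map((entry) => baseUrl + entry.href);
}
```

Skipping anything that is not an array means one failed page does not abort the whole run.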

static async downloadSongList(){
	const songList = await this.getSongList();

	let songUrlList = [];
	for(let item of songList){
		for(let subItem of item){
			let id = subItem.url.split('=')[1];
			songUrlList.push({
				name: subItem.name,
				downloadUrl: downloadUrl + '?id=' + id + '.mp3'
			});
		}
	}

	if(!fs.existsSync(dirname)){
		fs.mkdirSync(dirname);
	}
	
	return new Promise((resolve, reject) => {
		asy.mapSeries(songUrlList, (item, callback) => {
			setTimeout(() => {
				this.requestDownload(item, callback);
				callback(null, item);
			}, 5e3);
		}, (err, result) => {
			if(err){
				return reject(err);
			}

			resolve(result);
		})
	})
}
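
subItem.url.split('=')[1] works as long as id is the only query parameter in the link. A slightly more defensive sketch parses the query string instead. The downloadUrl value below is an assumption (the post does not show it), and buildDownloadUrl is a hypothetical helper:

```javascript
// Assumption: the value of downloadUrl; the post does not show it.
const downloadUrl = 'http://music.163.com/song/media/outer/url';

// Extracts the song id from a relative link such as "/song?id=123"
// and builds the download address in the same shape the post uses.
function buildDownloadUrl(songHref) {
	// The base is only needed so that URL can parse a relative link.
	const id = new URL(songHref, 'https://music.163.com').searchParams.get('id');
	if (!id) return null; // no id in the link: nothing to download
	return downloadUrl + '?id=' + id + '.mp3';
}
```

Returning null for a link without an id lets the caller filter such entries out before the download loop starts.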

requestDownload requests downloadUrl and saves the response to a local file:

static requestDownload(item, callback){
	let stream = fs.createWriteStream(path.join(dirname, item.name + '.mp3'));

	superagent.get(item.downloadUrl).set({
		'Connection': 'keep-alive'
	}).pipe(stream).on('error', (err) => {
		console.info(err);   // error handling: when a download fails, print the error and carry on
	})
}
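
Two things requestDownload leaves open are song names containing characters that are not valid in file names, and knowing when a file has finished writing. A minimal sketch of both using only Node's standard library; safeFileName and saveStream are hypothetical helpers, and the source argument stands in for whatever readable stream the HTTP client provides:

```javascript
const fs = require('fs');
const path = require('path');
const { pipeline } = require('stream/promises');

// Replaces characters that are invalid in file names on common systems.
function safeFileName(name) {
	return name.replace(/[\\/:*?"<>|]/g, '_').trim() || 'untitled';
}

// Pipes `source` into <dir>/<name>.mp3 and resolves with the file path
// once the data has been fully written; rejects on a read or write error.
async function saveStream(source, dir, name) {
	fs.mkdirSync(dir, { recursive: true });
	const filePath = path.join(dir, safeFileName(name) + '.mp3');
	await pipeline(source, fs.createWriteStream(filePath));
	return filePath;
}
```

With a promise like this, the mapSeries worker could call its callback from then and catch, so that the next download starts only after the current one has ended.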

At this point the small crawler is complete. The project crawls the list of playlists --> song lists --> downloads to the local disk. You can also go straight to a singer's homepage, change the url passed into the songList step, and download that singer's popular songs directly.
