Crawling webpages with Node.js is not as difficult as many imagine. Here is how to do it:
Prelude:
First, you need to understand cheerio, which is the key to parsing and extracting crawled content in Node.
Cheerio is a fast and flexible server-side HTML parsing tool built around the core of jQuery. It lets you parse and manipulate HTML or XML documents using jQuery-like syntax. Cheerio is mainly used for data crawling, data extraction, and DOM manipulation in the Node.js environment.
const cheerio = require('cheerio');
const html = '<div><h1>Hello, World!</h1><p>This is a paragraph.</p></div>';
const $ = cheerio.load(html);
const headingText = $('h1').text();
const paragraphText = $('p').text();
console.log(headingText); // Output: Hello, World!
console.log(paragraphText); // Output: This is a paragraph.
Through this, you can use jQuery-style syntax to manipulate and filter the fetched HTML content.
jQuery syntax tips
- Nested traversal with jQuery
$(".list1").each(function() {
var outerElement = $(this);
outerElement.find(".list2").each(function() {
var innerElement = $(this);
// Operate on the inner element
console.log(innerElement.text());
});
});
First, use .each() to iterate over the elements with class .list1. In each iteration, we take the outer element outerElement and use the .find() method to locate the inner .list2 elements. Then another .each() loop iterates over each inner element innerElement and operates on it.
- jQuery selectors make it easy to get exactly the DOM element you want
$(`ul li:eq(${index})`) gets the element at position index in the list (index starts from 0)
Complete crawl code:
var myRequest = require('request')
var myCheerio = require('cheerio')
const fs = require('fs')
// Target webpage address to crawl
var myURL = 'https://www.chinanews.com.cn/'
function request(options, callback) {
// Forward the full options object (including `encoding`) to request
myRequest(options, callback)
}
// Configure the request
var options = {
url: myURL,
headers: {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36'
},
encoding: null // Add this line so the response body is returned as binary data (a Buffer)
}
request(options, function (err, res, body) {
if (err) throw err
// Because `encoding: null` was set, `body` is a Buffer; decode it as UTF-8
var html = body.toString('utf-8')
// Load the HTML into cheerio; { decodeEntities: false } keeps the original
// entity encoding instead of decoding it
var $ = myCheerio.load(html, { decodeEntities: false })
$('.rdph-list2').each(function () {
var child = $(this)
let textLi = ''
child.find('li').each(function () {
var li = $(this)
textLi += li.find('a').text() + '\n'
})
fs.writeFile(`${textLi.slice(0, 5)}.txt`, textLi, (err) => {
if (err) throw err
console.log('Text file created and data written')
})
})
})
Crawling result: each matched list is written to a local .txt file named after the first few characters of its content.
Extension:
When crawling a webpage, remember to check its encoding format first.
The code above assumes the page is encoded in UTF-8. If you crawl pages in other encodings such as gb2312, the returned text will be garbled, and the decoding step needs to be adapted.