[node] Use Node to crawl the text content of a web page and save it to a text file, with source code

Crawling web pages with Node is not as difficult as you might imagine. Next, I will show you how to crawl:

Prelude:

First of all, you need to understand cheerio, which is the key to parsing and extracting content in Node.

Cheerio is a fast and flexible server-side HTML parsing tool that implements a subset of core jQuery. It lets you parse and manipulate HTML or XML documents using jQuery-like syntax. Cheerio is mainly used for data crawling, data extraction, and DOM manipulation in the Node.js environment.

const cheerio = require('cheerio');

const html = '<div><h1>Hello, World!</h1><p>This is a paragraph.</p></div>';
const $ = cheerio.load(html);

const headingText = $('h1').text();
const paragraphText = $('p').text();

console.log(headingText);     // Output: Hello, World!
console.log(paragraphText);   // Output: This is a paragraph.

Through this operation, you can use jQuery syntax to manipulate and filter the fetched HTML content.

jQuery syntax enhancement

  • jQuery nested traversal
$(".list1").each(function() {
  var outerElement = $(this);

  outerElement.find(".list2").each(function() {
    var innerElement = $(this);
    // Operate on the inner element
    console.log(innerElement.text());
  });
});

First, use `.each()` to iterate over the elements with the class name `.list1`. In each iteration, we take the outer element `outerElement` and use the `.find()` method to locate the inner `.list2` elements. Then another `.each()` loop iterates over each inner element `innerElement` and operates on it.

  • jQuery selectors flexibly target the desired DOM element

 $(`ul li:eq(${index})`) gets the element at position index in the list (index starts from 0)

jQuery Reference - Selectors

Complete crawl code:

var myRequest = require('request')
var myCheerio = require('cheerio')
const fs = require('fs')
// Target page to crawl
var myURL = 'https://www.chinanews.com.cn/'
function request(options, callback) {
    myRequest({
        url: options.url,
        headers: options.headers,
        encoding: options.encoding
    }, callback)
}
// Configure the request
var options = {
    url: myURL,
    headers: {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36'
    },
    encoding: null // return the raw binary body instead of a decoded string
}
request(options, function (err, res, body) {
    if (err) throw err;
    var html = body;
    // Load the HTML into cheerio; { decodeEntities: false } keeps the original
    // entity encoding instead of decoding it
    var $ = myCheerio.load(html, { decodeEntities: false });
    $('.rdph-list2').each(function () {
        var child = $(this);
        let textLi = ''
        child.find("li").each(function () {
            var li = $(this);
            textLi += li.find("a").text() + '\n'
        });
        // Name the file after the first few characters of the extracted text
        fs.writeFile(`${textLi.slice(0, 5)}.txt`, textLi, (err) => {
            if (err) throw err;
            console.log('Text file created and data written');
        });
    })
})

Crawling result:

Extension:

When crawling a web page, remember to check its encoding format (the charset declared in the page's meta tag).

The code above crawls pages encoded in UTF-8. If it is used to crawl pages in other encodings such as gb2312, the returned text will be garbled and needs to be adapted.


Origin blog.csdn.net/weixin_52479803/article/details/132100521