How to crawl website data with a Node.js crawler

Before you start, make sure that Node.js is installed. If you have not installed it yet, please follow an installation tutorial first.

Let's get started.

1. Install two necessary dependencies in the project folder

npm install superagent --save-dev

SuperAgent (as described on its official website)

-----SuperAgent is light-weight progressive ajax API crafted for flexibility, readability, and a low learning curve after being frustrated with many of the existing request APIs. It also works with Node.js!

----- In other words, superagent is a lightweight, progressive Ajax API with good readability and a low learning curve. Internally it relies on the native request API of Node.js, which makes it suitable for Node.js environments.

npm install cheerio --save-dev

Cheerio

----- cheerio is a Node.js module for parsing pages: a fast, flexible and lean implementation of core jQuery, designed specifically for the server. It is well suited to web crawlers and is roughly the equivalent of jQuery in Node.js.

2. Create a new crawler.js file

//Import dependencies
const http       = require("http");
const path       = require("path");
const url        = require("url");
const fs         = require("fs");

const superagent = require("superagent");
const cheerio    = require("cheerio");

3. Read the comments in the code (this example crawls data from the BOSS Zhipin recruitment website)

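The request URL in this step carries the search keyword in percent-encoded form: `%E5%89%8D%E7%AB%AF` is the UTF-8 encoding of 前端 ("front end"). As a minimal sketch, the same address can be assembled from readable parts with Node's built-in `URL` class. The names `keyword` and `searchUrl` are introduced here for illustration only, and the parameter values are copied from the URL used in this tutorial.

```javascript
// Build the search URL from readable parts with the built-in WHATWG URL API.
// "keyword" and "searchUrl" are illustrative names, not part of the original script.
const keyword = "前端";

const searchUrl = new URL("https://www.zhipin.com/job_detail/");
searchUrl.searchParams.set("city", "100010000");
searchUrl.searchParams.set("source", "10");
// searchParams percent-encodes non-ASCII characters as UTF-8 bytes
searchUrl.searchParams.set("query", keyword);

console.log(searchUrl.href);
```

The resulting `searchUrl.href` string can be passed to `superagent.get()` in place of the hard-coded address, which makes it easier to change the keyword later.
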
superagent
    .get("https://www.zhipin.com/job_detail/?city=100010000&source=10&query=%E5%89%8D%E7%AB%AF")
    .end((error,response)=>{
        //Stop early if the request failed; response may be undefined in that case
        if(error){
            console.error("Request failed:",error.message);
            return;
        }
        //Get the page document data
        var content = response.text;
        //cheerio works like jQuery under Node.js: load the whole document and keep the returned query function in $
        var $ = cheerio.load(content);
        //Define an empty array to hold the results
        var result = [];
        //Analyze the document structure first: select each li, then traverse its contents (each li holds one record we want)
        $(".job-list li .job-primary").each((index,value)=>{
            //The address and type are each displayed on a single line, so the strings need to be split
            //address
            let address=$(value).find(".info-primary").children().eq(1).html() || "";
            //type
            let type=$(value).find(".info-company p").html() || "";
            //Decode: turn hexadecimal entities such as &#x5317; into %u5317 so that unescape can convert them to characters
            address=unescape(address.replace(/&#x/g,'%u').replace(/;/g,''));
            type=unescape(type.replace(/&#x/g,'%u').replace(/;/g,''));
            //Split the strings on the separator element
            let addressArr=address.split('<em class="vline"></em>');
            let typeArr=type.split('<em class="vline"></em>');
            //Add the extracted data to the array as an object
            result.push({
                title:$(value).find(".name .job-title").text(),
                money:$(value).find(".name .red").text(),
                address:addressArr,
                company:$(value).find(".info-company a").text(),
                type:typeArr,
                position:$(value).find(".info-publis .name").text(),
                txImg:$(value).find(".info-publis img").attr("src"),
                time:$(value).find(".info-publis p").text()
            });
            // console.log(typeof $(value).find(".info-primary").children().eq(1).html());
        });
        //Convert the array to a JSON string
        result=JSON.stringify(result);
        //Write the string to a JSON file. Refresh the directory and a new boss.json file appears in the current folder (in Visual Studio Code, open boss.json and press Ctrl+A, then Ctrl+K, then Ctrl+F to format it automatically)
        fs.writeFile("boss.json",result,"utf-8",(error)=>{
            //error is null when the file was written successfully
            if(error==null){
                console.log("Congratulations, the data was crawled successfully! Open the json file and press Ctrl+A, then Ctrl+K, and finally Ctrl+F to format and view it (Visual Studio Code editor only)");
            }else{
                console.error("Failed to write boss.json:",error.message);
            }
        });
    });
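
The decoding and splitting steps in the callback can be tried on their own, without a network request. The following is a minimal sketch, not the site's actual markup: the `sampleHtml` string and the `decodeHexEntities` helper are made up for illustration. The helper uses `String.fromCodePoint` in place of the `unescape` call above, since `unescape` is deprecated. The sketch also passes an indent width to `JSON.stringify`, which yields readable output without reformatting the file in an editor.

```javascript
// Hypothetical helper: turn hexadecimal entities such as &#x5317; into characters.
function decodeHexEntities(html) {
    return html.replace(/&#x([0-9a-fA-F]+);/g, (match, hex) =>
        String.fromCodePoint(parseInt(hex, 16)));
}

// Made-up sample resembling what .html() may return for one job entry.
const sampleHtml = '&#x5317;&#x4EAC;<em class="vline"></em>3-5&#x5E74;<em class="vline"></em>&#x672C;&#x79D1;';

// Decode first, then split on the separator element.
const parts = decodeHexEntities(sampleHtml).split('<em class="vline"></em>');
console.log(parts); // [ '北京', '3-5年', '本科' ]

// The third argument of JSON.stringify sets the indentation (2 spaces here).
const pretty = JSON.stringify([{ address: parts }], null, 2);
console.log(pretty);
```

Unlike the `replace(/;/g,'')` approach, this helper only touches semicolons that belong to a hexadecimal entity, so any other semicolons in the text are preserved.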



