Before you start, please make sure that you have the Node.js environment installed. If you haven't installed it yet, please find an installation tutorial on your own...
Let's get started.
1. Install two necessary dependencies in the project folder
npm install superagent --save-dev
SuperAgent (as the official website describes it)
-----SuperAgent is light-weight progressive ajax API crafted for flexibility, readability, and a low learning curve after being frustrated with many of the existing request APIs. It also works with Node.js!
----- superagent is a lightweight, progressive ajax API with good readability and a low learning curve. Internally it relies on Node.js's native request API, so it is well suited to the Node.js environment.
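For context, a minimal superagent request looks like the sketch below (the URL is just a placeholder, not this tutorial's target site):

const superagent = require("superagent");

// Issue a GET request; the end() callback receives the error (if any) and the response
superagent
  .get("https://example.com")
  .end((error, response) => {
    if (error) {
      console.error(error);
      return;
    }
    // response.text contains the raw HTML of the page
    console.log(response.text.length);
  });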
npm install cheerio --save-dev
Cheerio
----- cheerio is a Node.js page-scraping module tailored for the server side: a fast, flexible implementation of the jQuery core. It is suitable for all kinds of web crawlers and is essentially jQuery for Node.js.
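As a quick illustration, here is a minimal cheerio sketch (the HTML string and class names are made up for the example):

const cheerio = require("cheerio");

// Load an HTML string and query it with jQuery-style selectors
const $ = cheerio.load('<ul><li class="job">Front-end</li><li class="job">Node.js</li></ul>');

$(".job").each((index, element) => {
  // .text() extracts the text content of each matched element
  console.log($(element).text()); // "Front-end", then "Node.js"
});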
2. Create a new crawler.js file
// Import dependencies
const http = require("http");
const path = require("path");
const url = require("url");
const fs = require("fs");
const superagent = require("superagent");
const cheerio = require("cheerio");

3. Write the crawler and follow the comments (the data here comes from the BOSS Zhipin recruitment website)
superagent
  .get("https://www.zhipin.com/job_detail/?city=100010000&source=10&query=%E5%89%8D%E7%AB%AF")
  .end((error, response) => {
    // Get the raw HTML of the page
    var content = response.text;
    // cheerio is jQuery for Node.js: load the whole document and assign it to the variable $
    var $ = cheerio.load(content);
    // Define an empty array to receive the data
    var result = [];
    // Analyze the document structure: grab each li, then traverse its contents
    // (each li holds one job posting we want to extract)
    $(".job-list li .job-primary").each((index, value) => {
      // The address and job type are rendered on one line, so the strings need to be split
      // Address
      let address = $(value).find(".info-primary").children().eq(1).html();
      // Job type
      let type = $(value).find(".info-company p").html();
      // Decode the HTML entities
      address = unescape(address.replace(/&#x/g, "%u").replace(/;/g, ""));
      type = unescape(type.replace(/&#x/g, "%u").replace(/;/g, ""));
      // Split on the separator element
      let addressArr = address.split('<em class="vline"></em>');
      let typeArr = type.split('<em class="vline"></em>');
      // Push the extracted data into the array as an object
      result.push({
        title: $(value).find(".name .job-title").text(),
        money: $(value).find(".name .red").text(),
        address: addressArr,
        company: $(value).find(".info-company a").text(),
        type: typeArr,
        position: $(value).find(".info-publis .name").text(),
        txImg: $(value).find(".info-publis img").attr("src"),
        time: $(value).find(".info-publis p").text()
      });
      // console.log(typeof $(value).find(".info-primary").children().eq(1).html());
    });
    // Convert the array to a JSON string
    result = JSON.stringify(result);
    // Write the array to a json file; refresh the directory and you will see a new boss.json
    // file in the current folder (open boss.json, press Ctrl+A to select all, then Ctrl+K
    // followed by Ctrl+F to auto-format the JSON)
    fs.writeFile("boss.json", result, "utf-8", (error) => {
      // Listen for errors; on success the callback receives null
      if (error == null) {
        console.log("Congratulations, the data was crawled successfully! Open the json file, press Ctrl+A, then Ctrl+K, and finally Ctrl+F to format and view it (Visual Studio Code editor only)");
      }
    });
  });
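Once the script has run (node crawler.js), you can sanity-check the output with a small sketch like the one below; it assumes boss.json was written next to the script:

const fs = require("fs");

fs.readFile("boss.json", "utf-8", (error, data) => {
  if (error) {
    console.error(error);
    return;
  }
  // Parse the JSON string back into an array of job objects
  const jobs = JSON.parse(data);
  console.log(`Crawled ${jobs.length} job postings`);
  console.log(jobs[0]); // inspect the first record
});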