A simple crawler with Node.js

Foreword
 
I need to use Node.js in my current work, and I also want to grow into a full-stack engineer, so I have started learning Node.js. Along the way I will write up practical examples as blog posts and video tutorials, and use those examples to understand how Node.js is used in practice. So let's learn Node.js step by step with Kitten!
The next few articles will cover the basics, mainly the core features of Node.js. They are aimed at front-end engineers who know a little about Node.js but have not yet built anything with it.
Once the basics are covered, later articles will explore and summarize more advanced topics.
 
 
This article uses the related searches shown for a keyword in Baidu search results as an example, and shows how to build the simplest possible crawler with Node.js.
  
 
An introduction to the Node.js modules and methods that will be used:
 
request:
 
     Sends page requests and fetches the page's source code.
     This example uses a GET request.
     

 

cheerio:
        
     cheerio implements a subset of core jQuery: the DOM manipulation API that does not depend on a browser.
     This example uses its load method.
     
 
express:
 
     A fast, unopinionated, minimalist web framework for Node.js. Here it is only used for simple routing, mainly through its get method, so it is not covered in detail. See the official website for more.
 
 
Implementation:
 
1. First, use express to build a simple Node.js service
 
 
 
Run node demo.js from the command line and open localhost:3000/key in the browser to see the result.
 
 
 
2. Use request to fetch the page
 

 

Run node demo.js from the command line and open localhost:3000/key in the browser to see the result.
 

 

 
3. Use cheerio to parse the page source into a jQuery-style object, then use jQuery syntax to locate the content to extract. With that, the crawler is complete!
 
If you want the full solution, follow my official account and reply "node crawler" to get the original article.
 
Official account: Meow Whisper

 

Run node demo.js from the command line and open localhost:3000/index in the browser to see the result.
Tips:
Some websites are not encoded in UTF-8. In that case, iconv-lite can fix the garbled text produced by encodings such as gb2312.
Many websites also have anti-crawler measures. Some of these can be avoided by working out how to simulate a normal user's request (searching Baidu with Chinese keywords also gets blocked).
This article is only an introduction. I will cover more advanced versions in detail later.
