Using a crawler to scrape web content


When we think of crawlers, Python usually comes to mind first, but on the front end we can just as well use Node to write a crawler that scrapes website data.

The basic process of crawling


1. Initiate a request
Use an HTTP library to send a Request to the target site, for example with a third-party request library such as request or axios (see the sketch after this list).
The Request contains the request headers, request body, and so on.

2. Get the response content
If the server responds normally, you get a Response.
The Response can contain HTML, JSON, images, videos, and so on.

3. Parse the content
Parse HTML data with regular expressions or a third-party parsing library such as cheerio, jsdom, or PhantomJS.
Parse JSON data with the built-in JSON object (JSON.parse).
Handle binary data by writing it to a file as a Buffer.

4. Save the data
Save the extracted data to a database or to the file system.
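
As a rough illustration of all four steps together, here is a minimal sketch using axios (mentioned in step 1) together with cheerio. The URL and the '.headline' selector are hypothetical placeholders for illustration only, not taken from the original article:

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

async function crawl() {
  // 1. Initiate a request to the target site (placeholder URL).
  const response = await axios.get('https://example.com/');
  // 2. Get the response content (HTML in this case).
  const html = response.data;
  // 3. Parse the content with cheerio.
  const $ = cheerio.load(html);
  const titles = [];
  // '.headline' is a hypothetical selector for illustration only.
  $('.headline').each((i, ele) => {
    titles.push($(ele).text().trim());
  });
  // 4. Save the data (here, to a JSON file).
  fs.writeFileSync('titles.json', JSON.stringify(titles));
}

crawl().catch(console.error);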

Next, let's take scraping article data from Tencent's portal as an example. The site's address is https://www.qq.com/, so we send a request to that address to get the page's source code:

const request = require('request');
const iconv = require('iconv-lite');

const url = 'https://www.qq.com/';

// encoding: null tells request to return the raw body as a Buffer
// instead of decoding it as utf-8, so we can decode it ourselves.
request({ url, encoding: null }, (err, response, body) => {
  if (err) return console.error(err);
  // The page is served as gb2312, so decode the Buffer with iconv-lite.
  let result = iconv.decode(body, 'gb2312');
  console.log(result);
});

While fetching the source code we found that the site is not encoded as utf-8 but as gb2312, so we request the raw body as a Buffer (encoding: null) and decode it with the iconv-lite module.
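
If you do not know the encoding in advance, one common approach (a sketch of my own, not from the original article) is to read the charset from the Content-Type response header, falling back to the page's <meta> tag in the raw bytes:

const request = require('request');
const iconv = require('iconv-lite');

request({ url: 'https://www.qq.com/', encoding: null }, (err, response, body) => {
  if (err) return console.error(err);
  // Try the charset declared in the Content-Type header first.
  let match = /charset=([\w-]+)/i.exec(response.headers['content-type'] || '');
  if (!match) {
    // Fall back to the <meta> tag; latin1 keeps the raw bytes intact.
    match = /charset=["']?([\w-]+)/i.exec(body.toString('latin1'));
  }
  const charset = match ? match[1] : 'utf-8';
  console.log(iconv.decode(body, charset));
});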

After getting the response content, we need to extract data from the HTML. This time we want to grab the site's news headlines.

const request = require('request');
const fs = require('fs');
const cheerio = require('cheerio');
const iconv = require('iconv-lite');

const url = 'https://www.qq.com/';

request({ url, encoding: null }, (err, response, body) => {
  if (err) return console.error(err);
  // Decode the gb2312-encoded Buffer into a string.
  let result = iconv.decode(body, 'gb2312');
  // Load the HTML into cheerio and query it with jQuery-like selectors.
  let $ = cheerio.load(result);
  let list = [];
  $('.yw-list li').each((i, ele) => {
    // Collect each headline, stripping all whitespace.
    let text = $(ele).text().replace(/\s/g, '');
    list.push(text);
  });
  console.log(list);
  // Save the headlines to a JSON file.
  fs.writeFileSync('qq.json', JSON.stringify(list));
});

After extracting the useful content, we usually save it to a database or write it to the file system.
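
For example, here is a minimal sketch of the "save to a database" step, assuming the better-sqlite3 module (my choice for illustration; the original article does not name a specific database):

const Database = require('better-sqlite3');

// Open (or create) a local SQLite database file.
const db = new Database('news.db');
db.prepare('CREATE TABLE IF NOT EXISTS headlines (id INTEGER PRIMARY KEY, text TEXT)').run();

// Insert each scraped headline; `list` comes from the crawl above.
const insert = db.prepare('INSERT INTO headlines (text) VALUES (?)');
for (const text of list) {
  insert.run(text);
}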


Origin: blog.csdn.net/wu_xianqiang/article/details/108481625