node-spider: practicing a simple Node.js crawler

1. Understanding

1.1. Crawlers: Web crawlers, also called web robots, automatically collect and organize data from the Internet in place of manual work.

1.2. Cheerio: Cheerio is an HTML parsing and scraping module for Node.js. It is a fast, flexible and lean implementation of core jQuery designed specifically for the server, and it is well suited to all kinds of web crawlers.
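As a minimal sketch (the HTML fragment and selector here are invented purely for illustration), cheerio loads an HTML string and then lets you query it with jQuery-style selectors:

const cheerio = require('cheerio')

// An invented HTML fragment, just to illustrate the API
const html = '<ul class="news"><li><a href="/a1.html">First</a></li></ul>'
const $ = cheerio.load(html)

// Query it on the server exactly like jQuery in the browser
$('.news a').each((index, element) => {
  console.log($(element).attr('href'), $(element).text()) // /a1.html First
})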

2. Analysis of pages to be crawled

2.1. URL analysis

// first page
https://money.163.com/special/businessnews/
// second page
https://money.163.com/special/businessnews_02/
// third page
https://money.163.com/special/businessnews_03/

Comparing these addresses shows that the page number is encoded as a suffix on the last path segment: the first page has no suffix, pages 2 through 9 use _02 through _09, and from page 10 onward the number appears as-is (_10, _11, ...).
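As a small sketch of this rule (the helper name pageUrl is mine, not something the site or the later code defines), the list path for a given page number can be built like this:

// Build the list-page path for a given page number (assumed helper)
const pageUrl = (page) => {
  if (page <= 1) return 'special/businessnews/'           // first page: no suffix
  if (page <= 9) return `special/businessnews_0${page}/`  // pages 2-9: _02 ... _09
  return `special/businessnews_${page}/`                  // page 10 and up: _10, _11, ...
}

console.log(pageUrl(1))  // special/businessnews/
console.log(pageUrl(3))  // special/businessnews_03/
console.log(pageUrl(12)) // special/businessnews_12/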

2.2. List analysis

Each list page contains the links to the individual articles: in the page markup they are the <a> elements inside the .index_list container, which is the selector the code in section 3.3 relies on.

2.3. Pager analysis

The pager lists all the page numbers, so we can read the number of the last page from it and then crawl every page.

2.4. Analyze the article detail page and identify the fields to capture: the title (.post_title) and the body text (.post_body).

3. Practice: writing the Node.js crawler

3.1. Create the project directory

Create a directory named spider, then run pnpm init inside it to generate package.json.

3.2. Install the dependencies

npm i axios
npm i cheerio

3.3. Code

const cheerio = require('cheerio')
const axios = require('axios')
const fs = require('fs')

const request = axios.create({
  baseURL: 'https://money.163.com/' // NetEase Finance business news
})

// Get the page number of the last page
const getLastPage = async () => {
  const { data } = await request({
    method: 'GET',
    url: '/special/businessnews/', // first page of the news list
  })

  const $ = cheerio.load(data)
  const paginations = $('.index_pages a') // pagination links
  const lastPageHref = paginations.eq(paginations.length - 2).attr('href')
  return Number(lastPageHref.split("/")[4].split("_")[1])
}

// Goal: fetch all articles (title and content) from the NetEase Finance business news site and store the data in a database

// Collect the links of all articles across the list pages
const getArticles = async () => {
  const lastPage = await getLastPage()
  console.log('lastPage', lastPage)
  const links = []
  for (let page = 1; page <= lastPage; page++) {
    let url = "special/businessnews/"
    if(page > 1 && page <= 9){
      url = `special/businessnews_0${page}/`
    } else if(page > 9){
      url = `special/businessnews_${page}/`
    }
    const { data } = await request({
      method: 'GET',
      url: url,
    })
    const $ = cheerio.load(data)
    $('.index_list a').each((index, element) => {
      const item = $(element) // wrap the DOM node as a cheerio element
      links.push(item.attr('href'))
    })
    // Wait a moment after each page; crawling too fast is easy to detect
    await new Promise(resolve => {
      setTimeout(resolve, 1000)
    })
    console.log("links.length", links.length)
  }
  return links
}

// Get the title and content of a single article
const getArticleContent = async (url) => {
  const { data } = await request({
    method: 'GET',
    url
  })
  const $ = cheerio.load(data)
  const title = $('.post_title').text().trim()
  const content = $('.post_body').html()
  return {
    title,
    content
  }
}

const main = async () => {
  // 1. Get the links of all articles
  const articles = await getArticles()
  // 2. Iterate over the article list
  for (let i = 0; i < articles.length; i++) {
    const link = articles[i]
    const article = await getArticleContent(link)
    // In production this could be written to a database instead
    fs.appendFileSync('./db.txt', `
      Title: ${article.title}
      Content: ${article.content}
      \r\n\r\n\r\n\r\n
    `)
    console.log(`${link} done`)
    await wait(500)
  }
}

main()

function wait (time) {
  return new Promise(resolve => {
    setTimeout(() => {
      resolve()
    }, time)
  })
}

3.4. Execute the crawler

Run node spider.js; the crawled data is written to the db.txt file.

4. The crawler configuration file

This article only crawls the article list of a single website, and all of the code lives in one file. To crawl several sites, the structure of each site can be described in a JSON configuration file that the crawler reads, so the work of developing the crawler and of writing the per-site configurations can be separated and handed to different people, improving the overall efficiency of crawler development. A configuration entry might look like the following; a sketch of a config-driven runner follows the example.

// Per-site crawler configuration file
{
    url: "",              // entry URL of the site's list page
    list: [],             // how to locate the article links on the list page
    pagination: {
        max: 20,          // maximum number of pages to crawl
    },
    article: {            // fields to capture on the article detail page
        title: "",
        author: "",
        time: "",
        content: ""
    },
    ...
}
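As a rough sketch of how such a configuration might be consumed (everything below is an assumption rather than an existing API: the file name sites.json, the idea that list holds CSS selectors, and the runner itself), the crawler could loop over the configured sites and reuse the same fetch-and-parse logic:

const fs = require('fs')
const axios = require('axios')
const cheerio = require('cheerio')

// Assumed config file: an array of per-site entries shaped like the object above
const sites = JSON.parse(fs.readFileSync('./sites.json', 'utf8'))

const crawlSite = async (site) => {
  const { data } = await axios.get(site.url)
  const $ = cheerio.load(data)
  // Assumption: "list" holds CSS selectors pointing at the article links
  for (const selector of site.list) {
    $(selector).each((index, element) => {
      console.log('found article link:', $(element).attr('href'))
      // ...then fetch each link and extract site.article.title / site.article.content
    })
  }
}

const run = async () => {
  for (const site of sites) {
    await crawlSite(site)
  }
}

run()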

5. Comments and corrections are welcome; follow me and let's learn together.
