Node implements CSDN blog export (follow-up)

Foreword

In 2021 I implemented a blog-export feature in Node: it crawls the CSDN API and blog pages and exports each post as a Markdown (.md) file. The project went through many iterations and optimizations along the way, and a few key problems were solved. This article records and reviews them.

Blog update function

An incremental-update feature was added on top of the original export, so there is no longer a need to re-export everything on every run, which was time-consuming. Adding the -update flag to the command triggers an incremental run. For example, running node server -type:csdn -id:time_____ -update compares blog titles against the cached list; any post whose title is not yet cached is exported individually. The core code is:

  startLoadBlogItem: async () => {
    const newData = getBlogConfig().blogList;
    let temp = newData;
    console.log(`Fetched list successfully, ${newData.length} posts in total`);
    if (global.update) {
      const oldData = (await readFile(global.type, "./temp/")).toString(
        "utf-8"
      );
      // temp holds the list of posts still to be exported
      temp = getArrayAddItems(stringToJson(oldData) ?? [], newData);
      console.log(`${temp.length} posts to update this run`);
    }
    writeFile(global.type, JSON.stringify(newData), "./temp/");
    return messageCenter.emit("getBlogInfo", temp);
  },

This works together with the update-data logic below. My approach is to add a temp directory in the project root as an article cache. On the first run, the full article list is written to a file there; on subsequent updates, the freshly fetched list is compared against the cached one and only the newly added posts are kept.

// Return the items present in newList but not in oldList
function getArrayAddItems(oldList = [], newList = [], key = "title") {
  return newList.filter((it) => !oldList.find((i) => i[key] === it[key]));
}

For specific changes, see: Blog Update

At the same time, we can modify the pipeline in Jenkins and add a timed build trigger so that the blog is updated automatically every day.
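A minimal declarative-pipeline sketch of such a timed trigger might look like the following; the stage layout and schedule are assumptions, and the command mirrors the example above.

```
pipeline {
    agent any
    // Run once a day in the early morning; "H" lets Jenkins spread the minute
    triggers { cron('H 3 * * *') }
    stages {
        stage('Update blog') {
            steps {
                // Incremental export, as described above (blog id is illustrative)
                sh 'node server -type:csdn -id:time_____ -update'
            }
        }
    }
}
```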

Image hotlink protection

Since the crawled posts use the original image URLs, and image hosts typically apply Referer-based hotlink protection to block abnormal requests, accessing the images with an invalid Referer returns an error.

Opening the same URL directly in a browser works fine. This brings us to the second optimization point: a reverse proxy. We use replace to rewrite the original img addresses to point at my local nginx server; the change in the code is:

// Replace image addresses, using nginx as a proxy
function replaceImgUrl(content) {
  const { imgUrl, imgProxyUrl } = getBlogConfig();
  const rule = new RegExp(`(${imgUrl})`, "g");
  return content.replace(rule, imgProxyUrl);
}

along with the corresponding commit.

Refer to the previous article: Nginx common instructions, basic configuration, reverse proxy_DieHunter1024's Blog-CSDN Blog

We create a new route on our nginx server and configure a reverse proxy, so that every request to /csdnImg/ is proxied to the image's real path:

        location /csdnImg/ {
            proxy_pass https://img-blog.csdnimg.cn/;
        }

However, after this change the requests still return 403.

That is because the hotlink protection still validates the HTTP Referer, so we need to forge the Referer in nginx to bypass the check. In addition, we can forge the User-Agent to hide the real client, bypass some restrictions, and masquerade as a browser to avoid being banned:

        location /csdnImg/ {
            proxy_pass https://img-blog.csdnimg.cn/;
            proxy_set_header User-Agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36";
            proxy_set_header Referer "http://blog.csdn.net/";
        }

That's all for this article; I hope it helps you.

Source code address: blog_website: Crawler script + hexo deployment based on CSDN blog written by node

Origin blog.csdn.net/time_____/article/details/130312068