SEO: server-side dynamic rendering

Principle

The server inspects the User-Agent of each request: a normal browser is served the SPA page directly, while a crawler is served the dynamically rendered HTML page instead.

Options

There are two options:

Option 1: node inspects the User-Agent of the request
Option 2: nginx inspects the User-Agent of the request

With nginx doing the check, the node service only ever serves crawlers, so even if node goes down, normal users are unaffected. Option 2 is adopted here.

Practice

Technology stack

Server: node, egg
Cache: redis
Request forwarding: nginx

Concrete practice

1. nginx configuration

nginx inspects the User-Agent: requests from crawlers are forwarded to node, everything else gets the static landing page.

# seo2: server-side dynamic rendering - nginx detects crawlers
server {
    listen  80;
    server_name www.cmvalue.seo2;

    add_header Access-Control-Allow-Origin *;

    location /homepage {
        # crawler User-Agents are proxied to the node rendering service
        if ($http_user_agent ~* "googlebot|google-structured-data-testing-tool|Mediapartners-Google|bingbot|linkedinbot|baiduspider|360Spider|Sogou Spider|Yahoo! Slurp China|Yahoo! Slurp|twitterbot|facebookexternalhit|rogerbot|embedly|quora link preview|showyoubot|outbrain|pinterest|slackbot|vkShare|W3C_Validator") {
          set $agent $http_user_agent;
          proxy_pass  http://www.cmvalue.seo2:7001;
          break;
        }

        # everyone else gets the static SPA build
        alias /var/www/f2e/yl-homepage/;
        index  index.html index.htm;
        try_files $uri $uri/ /homepage/index.html;
    }

    location ~ /api/ {
        proxy_pass  https://www.cmvalue.com;
    }
}

Modify the local hosts file so the test domain resolves locally:

127.0.0.1   www.cmvalue.seo2
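
With the hosts entry in place, the routing can be sanity-checked by requesting the same URL with a crawler User-Agent and a browser one. A minimal sketch, assuming Node >= 18 (global fetch) and an ESM module for top-level await:

const TARGET = 'http://www.cmvalue.seo2/homepage'

async function fetchAs(userAgent) {
  const res = await fetch(TARGET, { headers: { 'User-Agent': userAgent } })
  return res.text()
}

// the crawler response should contain the fully rendered markup,
// the browser response only the empty SPA shell
const crawlerHtml = await fetchAs('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')
const browserHtml = await fetchAs('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/88.0.4324.150')
console.log({ crawler: crawlerHtml.length, browser: browserHtml.length })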

2. Node code

What it does: render the complete HTML in a headless browser and return it to the crawler.

router

// intended to match every route except the /api/ paths
router.get(/^\w*[^a]*[^p]*[^i]*\w*$/, controller.seo.index)
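
The character-class regex above is hard to read; since the /api/ paths are proxied by nginx directly to the backend, the intent is simply "everything except /api/". A negative lookahead states that directly — a sketch, assuming the router accepts a RegExp path the way koa-router does:

// match any path that does not start with /api/
router.get(/^(?!\/api\/).*$/, controller.seo.index)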

controller

logger.info('Start crawling page...')
const { html, spaRenderTime } = await service.seo.getCrawlerHtml(ctx.href)
ctx.set('Server-Timing', `Prerender;dur=${spaRenderTime};desc="Headless render time (ms)"`)
logger.info(`Crawl finished, took ${spaRenderTime}ms`)
ctx.body = html
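
A side benefit of the Server-Timing header: Chrome DevTools surfaces it in the Timing tab of the Network panel, so the headless render time of each request can be inspected without digging through logs.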

service

async getCrawlerHtml(url) {
  // check the cache first
  const CACHE_HTML = await this.getRedis(url, 'crawler')
  if (CACHE_HTML) {
    return {
      html: CACHE_HTML,
      spaRenderTime: 0,
    }
  }
  // start crawling
  const startTime = Date.now()
  const browser = await puppeteer.launch({
    headless: true,
  })
  try {
    const page = await browser.newPage()
    await page.goto(url, {
      // what counts as "navigation finished":
      // load - when the page's load event fires
      // domcontentloaded - when the DOMContentLoaded event fires
      // networkidle0 - when there have been no network connections for at least 500 ms
      // networkidle2 - when there are no more than 2 network connections for at least 500 ms
      waitUntil: 'networkidle0',
    })
    // await page.waitForSelector('#app')
    const html = await page.content()
    await browser.close()
    const spaRenderTime = Date.now() - startTime
    // fire-and-forget cache write
    this.setRedis(url, html, 'crawler')
    return { html, spaRenderTime }
  } catch (err) {
    console.error(err)
    // don't leak the headless browser on failure
    await browser.close()
    throw new Error('page.goto/waitForSelector timed out.')
  }
}
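
Note that every uncached request launches a fresh Chromium, which is the most expensive step in the flow. A common optimization is to keep one shared browser alive and open a page per request — a sketch, not part of the original service, with hypothetical helper names:

let browserPromise = null

function getSharedBrowser() {
  // lazily launch a single Chromium and reuse it across requests
  if (!browserPromise) {
    browserPromise = puppeteer.launch({ headless: true })
  }
  return browserPromise
}

async function renderOnce(url) {
  const browser = await getSharedBrowser()
  const page = await browser.newPage()
  try {
    await page.goto(url, { waitUntil: 'networkidle0' })
    return await page.content()
  } finally {
    // close only the page; the shared browser stays warm
    await page.close()
  }
}

The redis helpers used by the service: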
// set
async setRedis(url, html, type) {
  const REDIS_KEY = `seo:yl-homepage:${type}:${url}`
  // cache for 600 seconds (10 minutes)
  const CACHE_TIME = 10 * 60
  await this.app.redis.setex(REDIS_KEY, CACHE_TIME, html)
}
// get
async getRedis(url, type) {
  const REDIS_KEY = `seo:yl-homepage:${type}:${url}`
  const res = await this.app.redis.get(REDIS_KEY)
  return res
}
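
One caveat with the key: it embeds the raw URL (ctx.href), so the same page reached with different query strings is rendered and cached separately. If that is not desired, the URL can be normalized before keying — a sketch with a hypothetical normalizeUrl helper:

// hypothetical helper: drop query string and hash so one page maps to one key
function normalizeUrl(url) {
  const u = new URL(url)
  return `${u.origin}${u.pathname}`
}
// e.g. const REDIS_KEY = `seo:yl-homepage:${type}:${normalizeUrl(url)}`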
