Principle
The server inspects each request's User-Agent: regular browsers are served the SPA page directly, while crawlers are served dynamically rendered HTML.
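The detection itself is just a regex match on the User-Agent string. A minimal sketch in Node, with a trimmed-down crawler list (the full list appears in the nginx config later):

```javascript
// Route by User-Agent: crawlers get prerendered HTML, browsers get the SPA.
// The crawler list here is illustrative, not exhaustive.
const CRAWLER_UA = /googlebot|bingbot|baiduspider|360Spider|twitterbot|facebookexternalhit/i;

function isCrawler(userAgent) {
  return CRAWLER_UA.test(userAgent || '');
}

// isCrawler('Mozilla/5.0 (compatible; Googlebot/2.1)') -> true
// isCrawler('Mozilla/5.0 ... Chrome/120 ...')          -> false
```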
Plan
There are two options:
Option 1: node inspects the request's User-Agent
Option 2: nginx inspects the request's User-Agent
With nginx doing the detection, the node service only serves crawler traffic; even if node goes down, normal users are unaffected. Option 2 is therefore adopted here.
Practice
Technology stack
Server: node, egg
Cache: redis
Request forwarding: nginx
Concrete practice
1. nginx configuration
If nginx determines the request comes from a crawler, it forwards the request to node; otherwise it serves the static SPA page directly.
# seo2: server-side dynamic rendering scheme - nginx detects crawlers
server {
    listen 80;
    server_name www.cmvalue.seo2;
    add_header Access-Control-Allow-Origin *;

    location /homepage {
        # Forward crawler requests to the node prerender service
        if ($http_user_agent ~* "googlebot|google-structured-data-testing-tool|Mediapartners-Google|bingbot|linkedinbot|baiduspider|360Spider|Sogou Spider|Yahoo! Slurp China|Yahoo! Slurp|twitterbot|facebookexternalhit|rogerbot|embedly|quora link preview|showyoubot|outbrain|pinterest|slackbot|vkShare|W3C_Validator") {
            set $agent $http_user_agent;
            proxy_pass http://www.cmvalue.seo2:7001;
            break;
        }
        # Non-crawler traffic gets the static SPA files
        alias /var/www/f2e/yl-homepage/;
        index index.html index.htm;
        try_files $uri $uri/ /homepage/index.html;
    }

    location ~ /api/ {
        proxy_pass https://www.cmvalue.com;
    }
}
Modify the local hosts file so the test domain resolves locally:
127.0.0.1 www.cmvalue.seo2
2. Node code
Its main job: render the complete HTML through a headless browser and return it to the crawler.
router
// Intended to match all routes except those containing 'api'
// (API requests are proxied directly by nginx and never reach node)
router.get(/^\w*[^a]*[^p]*[^i]*\w*$/, controller.seo.index)
controller
logger.info('Start rendering page...')
const { html, spaRenderTime } = await service.seo.getCrawlerHtml(ctx.href)
ctx.set('Server-Timing', `Prerender;dur=${spaRenderTime};desc="Headless render time (ms)"`)
logger.info(`Render finished in ${spaRenderTime}ms`)
ctx.body = html
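The Server-Timing header set above follows the standard `name;dur=...;desc="..."` format, which browser devtools and crawlers can read. A small helper showing the exact value the controller emits (a sketch; the original builds the string inline):

```javascript
// Build a Server-Timing header value: <name>;dur=<ms>;desc="<description>"
function serverTiming(name, durMs, desc) {
  return `${name};dur=${durMs};desc="${desc}"`;
}

// serverTiming('Prerender', 1234, 'Headless render time (ms)')
//   -> 'Prerender;dur=1234;desc="Headless render time (ms)"'
```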
service
async getCrawlerHtml(url) {
  // Check the cache first
  const CACHE_HTML = await this.getRedis(url, 'crawler')
  if (CACHE_HTML) {
    return {
      html: CACHE_HTML,
      spaRenderTime: 0,
    }
  }
  // Start rendering
  const startTime = Date.now()
  const browser = await puppeteer.launch({
    headless: true,
  })
  try {
    const page = await browser.newPage()
    await page.goto(url, {
      // When navigation is considered finished:
      // load - when the page's load event fires
      // domcontentloaded - when the page's DOMContentLoaded event fires
      // networkidle0 - when there have been no network connections for at least 500 ms
      // networkidle2 - when there are no more than 2 network connections for at least 500 ms
      waitUntil: 'networkidle0',
    })
    // await page.waitForSelector('#app')
    const html = await page.content()
    const spaRenderTime = Date.now() - startTime
    this.setRedis(url, html, 'crawler') // fire-and-forget cache write
    return { html, spaRenderTime }
  } catch (err) {
    console.error(err)
    throw new Error('page.goto/waitForSelector timed out.')
  } finally {
    // Always release the browser, even when rendering fails
    await browser.close()
  }
}
// set
async setRedis(url, html, type) {
  const REDIS_KEY = `seo:yl-homepage:${type}:${url}`
  // cache for 600 seconds (10 minutes)
  const CACHE_TIME = 10 * 60
  await this.app.redis.setex(REDIS_KEY, CACHE_TIME, html)
}
// get
async getRedis(url, type) {
  const REDIS_KEY = `seo:yl-homepage:${type}:${url}`
  const res = await this.app.redis.get(REDIS_KEY)
  return res
}