[Node.js] Puppeteer crawler practice

puppeteer

Documentation: puppeteer.js Chinese documentation|puppeteerjs Chinese website|puppeteer crawler tutorial

Puppeteer itself requires Node 6.4 or later, but Node 7.6 or later is recommended so that you can use async/await, which makes the code much easier to write. These figures date from Puppeteer's early releases; newer releases require a newer Node, so check the requirements for the version you install. Headless Chrome also needs fairly recent versions of the system libraries on the server. CentOS keeps its system libraries deliberately stable, so headless Chrome is hard to run on CentOS 6. Upgrading those libraries in place can break the server in various ways (including, but not limited to, losing SSH access), so it is best to use a newer server release.

Because Puppeteer is an npm package, the installation is very simple:

pnpm i puppeteer-core

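The full puppeteer package automatically downloads a Chrome browser build during installation, so the core version is used here instead. With puppeteer-core you must specify the path of the browser executable yourself.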
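Because puppeteer-core ships without a browser, every launch call needs an explicit path. The following is a minimal sketch of one way to supply it, assuming the path comes from an environment variable. The variable name BROWSER_PATH and the helper name buildLaunchOptions are made up for this illustration and are not part of Puppeteer; executablePath and headless are launch options also used later in this article.

```javascript
// Build the options object passed to puppeteer.launch() when using puppeteer-core.
// BROWSER_PATH is a hypothetical environment variable chosen for this sketch.
function buildLaunchOptions(env = process.env, headless = true) {
  const executablePath = env.BROWSER_PATH;
  if (!executablePath) {
    // Fail early with a clear message instead of letting launch() fail later.
    throw new Error('Set BROWSER_PATH to the Chrome or Edge executable');
  }
  return { executablePath, headless };
}
```

It would be used as puppeteer.launch(buildLaunchOptions()) after requiring puppeteer-core.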

usage and examples

As with similar frameworks, you drive the browser through a Browser instance: your code calls methods on the instance and the browser responds accordingly.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://rennaiqian.com');
  await page.screenshot({ path: 'example.png' });
  await page.pdf({ path: 'example.pdf', format: 'A4' });
  await browser.close();
})();

The code above creates a browser instance with puppeteer's launch method. launch accepts configuration options; a useful one for local debugging is {headless: false}, which turns off headless mode.

const browser = await puppeteer.launch({ headless: false });

The browser.newPage method opens a new tab and returns a page instance for it. Common operations are performed through the methods on page; the code above takes a screenshot and prints the page to PDF.

A very powerful method is page.evaluate(pageFunction, ...args), which injects a function into the page and runs it there. This opens up endless possibilities:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://rennaiqian.com');

  // Get the "viewport" of the page, as reported by the page.
  const dimensions = await page.evaluate(() => {
    return {
      width: document.documentElement.clientWidth,
      height: document.documentElement.clientHeight,
      deviceScaleFactor: window.devicePixelRatio
    };
  });

  console.log('Dimensions:', dimensions);
  await browser.close();
})();

Note that variables from the outer (Node) scope cannot be used directly inside the function passed to evaluate. They must be passed in as arguments, and the function must return a value for you to get the result back. When this was written the project had been open source for only a little over a month and was under very active development, so check the API documentation yourself to make sure the parameters and usage are correct for your version.
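The rule about outer variables can be illustrated with a short sketch. It assumes page is a Puppeteer Page that has already been created; the helper name getText is made up for this example. The selector lives in Node, so it is handed to page.evaluate as an extra argument rather than referenced from the closure.

```javascript
// Read the text of the first element matching `selector`.
// `page` is assumed to be a Puppeteer Page; getText is an illustrative name.
async function getText(page, selector) {
  return page.evaluate((sel) => {
    // This function runs inside the browser page, not in Node,
    // so it only sees what was passed in as `sel`.
    const el = document.querySelector(sel);
    // The returned value is what page.evaluate resolves to in Node.
    return el ? el.innerText : null;
  }, selector);
}
```

Writing document.querySelector(selector) inside the callback instead would fail, because selector does not exist in the page's scope.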

debugging skills

① Turn off headless mode. It is sometimes useful to see what the browser is displaying. Launch the full browser with:

const browser = await puppeteer.launch({ headless: false });

② Slow things down. The slowMo option delays each Puppeteer operation by the specified number of milliseconds, which is another way to see what is going on:

const browser = await puppeteer.launch({
  headless: false,
  slowMo: 250
});

③ Capture the output of the console by listening to the console event. This is also handy when debugging code in page.evaluate:

// msg is a ConsoleMessage; msg.text() returns the logged text
page.on('console', msg => console.log('PAGE LOG:', msg.text()));
await page.evaluate(() => console.log(`url is ${location.href}`));
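When the page logs a lot, it can be easier to collect the messages than to print them one by one. A minimal sketch, assuming page is a Puppeteer Page; the helper name collectConsole is made up for this example, and it relies on the type() and text() methods of the console message object.

```javascript
// Collect the page's console output into an array for later inspection.
// `page` is assumed to be a Puppeteer Page; collectConsole is an illustrative name.
function collectConsole(page) {
  const logs = [];
  page.on('console', (msg) => {
    // Store the message type (log, error, ...) together with its text.
    logs.push(`${msg.type()}: ${msg.text()}`);
  });
  return logs;
}
```

Call it once right after browser.newPage(), then inspect the returned array after the page.evaluate calls you are debugging.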

crawler practice

Many web pages detect the device from the user-agent, which can be simulated with page.emulate(options). options has two fields: userAgent and viewport. In viewport you can set the width (width), height (height), device scale factor (deviceScaleFactor), whether the device is mobile (isMobile), and whether it supports touch events (hasTouch).

const puppeteer = require('puppeteer');
// Recent Puppeteer releases export the built-in device descriptors as KnownDevices
// (older releases used require('puppeteer/DeviceDescriptors') instead).
const { KnownDevices } = require('puppeteer');
const iPhone = KnownDevices['iPhone 6'];

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto('https://www.example.com');
  // other actions...
  await browser.close();
});

The code above simulates an iPhone 6 visiting a website. The device descriptor comes from the set of emulation parameters for common devices that is built into Puppeteer.
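For a device that is not in the built-in list, the options object can be written by hand with the two fields described above. A minimal sketch, assuming page is a Puppeteer Page; the helper name emulateCustomPhone, the user-agent string, and the sizes are illustrative values, not a real device profile.

```javascript
// Emulate a hand-written mobile device profile on a Puppeteer Page.
// All values below are placeholders chosen for this sketch.
async function emulateCustomPhone(page) {
  const device = {
    userAgent:
      'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 ' +
      '(KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
    viewport: {
      width: 375,
      height: 667,
      deviceScaleFactor: 2,
      isMobile: true,
      hasTouch: true,
    },
  };
  // page.emulate applies both the user-agent and the viewport.
  await page.emulate(device);
  return device;
}
```

Call it before page.goto so that the first request already carries the emulated user-agent.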

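Many web pages require login. There are two solutions:

① Let Puppeteer enter the account name and password. Common methods: to click, use page.click(selector[, options]), or focus an element with page.focus(selector). To enter text, use page.type(selector, text[, options]) to type the given string; setting delay in options slows the typing down so that it looks more like a real person. You can also use keyboard.down(key[, options]) to send input one key at a time.
② If the site determines the login state from cookies, set them with page.setCookie(...cookies). To keep the cookies valid, visit the site at regular intervals.

Tip: Some websites require scanning a QR code to log in, while other pages under the same domain offer an ordinary login. Try logging in on a page where you can, then reuse its cookies to access the site and skip the code scan.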
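The two login approaches can be sketched as follows. This assumes page is a Puppeteer Page; the function names and the selectors #username, #password, and #login-button are hypothetical placeholders that must be replaced with the real ones from the target site. Cookie APIs have changed between Puppeteer releases, so check the documentation for your installed version.

```javascript
// Approach 1: fill in the login form. All selectors are hypothetical placeholders.
async function loginWithForm(page, username, password) {
  // A small delay between keystrokes makes the typing look more human.
  await page.type('#username', username, { delay: 50 });
  await page.type('#password', password, { delay: 50 });
  await page.click('#login-button');
}

// Approach 2: reuse cookies saved from an earlier logged-in session.
// `cookies` is an array of objects such as { name, value, domain }.
async function loginWithCookies(page, cookies) {
  await page.setCookie(...cookies);
}
```

With the second approach, set the cookies before calling page.goto on the protected page.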

simple example

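const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://baidu.com');
  await page.type('#kw', 'puppeteer', { delay: 100 });
  await page.click('#su');
  // page.waitFor is gone from newer Puppeteer releases; wait for the results instead.
  await page.waitForSelector('.result a');
  // Return the href of the first matching link (a string), or null if none matches.
  const targetLink = await page.evaluate(() => {
    const link = [...document.querySelectorAll('.result a')].find(item => {
      return item.innerText && item.innerText.includes('Puppeteer的入门和实践');
    });
    return link ? link.href : null;
  });
  if (targetLink) {
    await page.goto(targetLink);
    // Pause briefly so the page can be seen before the browser closes.
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
  await browser.close();
})();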

multi-element processing

const puppeteer = require('puppeteer-core');

(async function () {
  // puppeteer.launch starts a browser instance.
  // It accepts an options object that selects headless or headed mode.
  // Headless mode is faster; headed mode is mainly used for debugging.
  let options = {
    // Viewport width and height
    defaultViewport: {
      width: 1400,
      height: 800,
    },
    // Uncomment to show the browser window; true means headless
    // headless: false,
    args: ['--window-size=1400,700'],
    // Path to the browser executable
    executablePath: 'C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe',
  };
  let browser = await puppeteer.launch(options);
  // Open a new page
  let page = await browser.newPage();
  // Navigate to the site
  await page.goto('https://www.jd.com/');
  // Take a screenshot
  //   await page.screenshot({ path: 'example.png', fullPage: true });
  // Get page content:
  // page.$eval is like querySelector, followed by a callback that works on the matched element
  // page.$$eval is like querySelectorAll, followed by a callback that works on the matched elements
  let input = await page.$('#key');
  await input.type('手机');
  await page.keyboard.press('Enter');
  await page.waitForSelector('.gl-warp>.gl-item:last-child');
  const lis = await page.$$eval('.gl-warp>.gl-item', els =>
    // els is the array of matched elements; DOM APIs can be used here
    els.map(item => item.innerText)
  );
  // lis is the return value of the callback above
  console.log(lis);

  // Close the browser
  await browser.close();
})();

Enter text and click an element

const puppeteer = require('puppeteer-core');
(async function () {
  let options = {
    defaultViewport: {
      width: 1400,
      height: 800,
    },
    headless: false,
    args: ['--window-size=1400,700'],
    executablePath: 'C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe',
  };
  let browser = await puppeteer.launch(options);
  let page = await browser.newPage();
  await page.goto('https://www.ygdy8.com/index.html');
  // Get page content:
  //   page.$ is like querySelector
  //   page.$$ is like querySelectorAll
  // Both return ElementHandle objects
  const input = await page.$('input[name="keyword"]'); // locate the search box
  /* Option 1: focus the element, then type with the keyboard
  await input.focus();
  await page.keyboard.type('电影'); */
  // Option 2: type directly into the element
  await input.type('电影');

  /* Option 1: click through an ElementHandle
  const search = await page.$('input[name="Submit"]'); // locate the search button
  await search.click(); // click it */
  // Option 2: click by selector
  await page.click('input[name="Submit"]');
})();

Get the text value of an element

const puppeteer = require('puppeteer-core');
(async function () {
  let options = {
    defaultViewport: {
      width: 1400,
      height: 700,
    },
    args: ['--window-size=1400,700'],
    headless: false,
    executablePath: 'C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe',
  };
  let browser = await puppeteer.launch(options);
  let page = await browser.newPage();
  await page.goto('https://www.baidu.com/');
  let input = await page.waitForSelector('#kw');
  await input.type('hello world');
  let btn = await page.$('#su');
  await btn.click();
  /* waitForSelector waits until an element matching the selector appears in the page.
  If a matching element already exists when it is called, it returns immediately.
  If no match appears before the timeout expires, it throws an error.
  Returns: <Promise<ElementHandle>> */
  await page.waitForSelector('div#content_left > div.result-op.c-container.xpath-log');
  let text = await page.$eval(
    'div#content_left > div.result-op.c-container.xpath-log',
    el => el.innerText
  );
  console.log(text);
})();
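Because waitForSelector throws when the element never appears, a crawler usually wants to handle that case rather than crash. A minimal sketch, assuming page is a Puppeteer Page; the helper name textOrNull and the 5000 ms default are choices made for this example, while timeout is an option of waitForSelector given in milliseconds.

```javascript
// Wait for `selector` and return its text, or null if it does not appear in time.
// `page` is assumed to be a Puppeteer Page; textOrNull is an illustrative name.
async function textOrNull(page, selector, timeout = 5000) {
  try {
    await page.waitForSelector(selector, { timeout });
  } catch (err) {
    // waitForSelector rejects when the timeout elapses.
    // This sketch treats any failure here as "not found".
    return null;
  }
  return page.$eval(selector, (el) => el.innerText);
}
```

A caller can then skip the page or retry when the result is null instead of wrapping every lookup in its own try/catch.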

Run JS code in the page

const puppeteer = require('puppeteer-core');
(async function () {
  let options = {
    defaultViewport: {
      width: 1400,
      height: 700,
    },
    args: ['--window-size=1400,700'],
    // headless: false,
    executablePath: 'C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe',
  };
  let browser = await puppeteer.launch(options);
  let page = await browser.newPage();
  await page.goto('https://www.baidu.com/');
  let input = await page.waitForSelector('#kw');
  await input.type('hello world');
  let btn = await page.$('#su');
  await btn.click();
  await page.waitForSelector('div#content_left > div.result-op.c-container.xpath-log');
  // Plain JS can be written inside the callback; it runs in the page
  let text = await page.evaluate(() => {
    let div = document.querySelector('div#content_left > div.result-op.c-container.xpath-log');
    return div.innerText;
  });
  console.log(text);
})();

Origin blog.csdn.net/weixin_43094619/article/details/131919923