The requests-html module (part 2)

render method

First, the relationships between the libraries: requests-html and requests are by the same author, and pyppeteer is an unofficial Python port of the Node.js library puppeteer.

requests-html calls pyppeteer to interact with the browser.

For reference, see the puppeteer documentation (a Chinese translation is available), the pyppeteer documentation, and related blog posts.

Calling the render method starts pyppeteer.

Chromium has to be downloaded before first use.

Network access from China is unreliable: if you rely on the Chromium build that pyppeteer downloads for itself, the download may never finish. In that case, install Chromium manually and point the program at it with executablePath.

In the requests-html source code, add the following argument at line 714:

executablePath='path/to/the/chromium'

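
If editing the library source is not an option, the same executablePath option can be passed when pyppeteer is launched directly. The following is a minimal sketch under that assumption; the helper names launch_options and fetch_rendered_html are made up for this example, and the Chromium path is whatever you installed manually.

```python
def launch_options(chromium_path, headless=True):
    """Build keyword arguments for pyppeteer.launch() that use a local Chromium."""
    if not chromium_path:
        raise ValueError('chromium_path must point at a Chromium executable')
    return {
        'executablePath': chromium_path,  # use this binary instead of pyppeteer's download
        'headless': headless,
        'args': ['--no-sandbox'],
    }


async def fetch_rendered_html(url, chromium_path):
    """Open url in the local Chromium and return the rendered HTML."""
    # Imported here so that launch_options() has no third-party dependency.
    from pyppeteer import launch

    browser = await launch(**launch_options(chromium_path))
    try:
        page = await browser.newPage()
        await page.goto(url)
        return await page.content()
    finally:
        await browser.close()
```

With asyncio imported, it could be run as asyncio.get_event_loop().run_until_complete(fetch_rendered_html('https://httpbin.org/get', 'path/to/the/chromium')).
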
from requests_html import HTMLSession

url  = 'https://httpbin.org/get'

session = HTMLSession()
res = session.get(url = url)
res.html.render()
print(res.html.html)

In the output, the User-Agent (circled in red in the original screenshot) identifies the browser as HeadlessChrome. That is obviously not a normal human user, so websites will recognize the request as coming from a crawler.

from requests_html import HTMLSession

url = 'https://httpbin.org/get'

session = HTMLSession(
    browser_args=[
        '--no-sandbox',
        # No quotes inside the argument: they would become part of the User-Agent
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
    ])
res = session.get(url=url)
res.html.render()
print(res.html.html)

To get a realistic value, type navigator.userAgent in the console of a normal browser and copy the result after --user-agent=, taking care not to introduce extra spaces. --no-sandbox disables Chromium's sandbox, which is needed when the browser runs with the highest privileges (for example as root).
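
The check can also be done in code rather than by eye. The sketch below assumes the usual httpbin.org/get response layout (a JSON object with a headers entry containing User-Agent); the helper names build_browser_args and looks_headless are hypothetical.

```python
import json


def build_browser_args(user_agent):
    """Return Chromium arguments that replace the default User-Agent."""
    return ['--no-sandbox', '--user-agent=' + user_agent]


def looks_headless(httpbin_json):
    """Report whether the User-Agent echoed by httpbin.org/get reveals headless Chrome."""
    headers = json.loads(httpbin_json)['headers']
    return 'HeadlessChrome' in headers.get('User-Agent', '')
```

build_browser_args(...) would be passed as browser_args to HTMLSession. After render(), Chromium usually shows the JSON inside a pre element, so something like res.html.find('pre', first=True).text is a likely input for looks_headless; treat that as an assumption about how the page is displayed.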

Startup Parameters

kwargs = {
    'headless': False,
    'devtools': False,  # open the developer tools
    'ignoreDefaultArgs': False,  # ignore the default arguments (value missing in the original; False assumed)
    'userDataDir': './userdata',  # user data directory, keeps cookies
    'args': [
        '--disable-extensions',
        '--window-size={width},{height}',  # fill in width and height before use
        '--hide-scrollbars',
        '--disable-bundled-ppapi-flash',
        '--mute-audio',  # mute the page
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-gpu',
        '--enable-automation',
    ],
    'dumpio': True,
}
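
The {width} and {height} placeholders in the list above have to be filled in before the arguments reach the browser. A minimal sketch of one way to do that follows; make_launch_kwargs is a hypothetical helper, and only a few of the options are repeated to keep it short.

```python
def make_launch_kwargs(width, height, user_data_dir='./userdata'):
    """Return launch options with the window-size placeholder filled in."""
    return {
        'headless': False,
        'userDataDir': user_data_dir,  # keeps cookies between runs
        'args': [
            '--no-sandbox',
            '--window-size={width},{height}'.format(width=width, height=height),
        ],
        'dumpio': True,  # forward the browser's stdout and stderr
    }
```

The result would be passed on as await pyppeteer.launch(**make_launch_kwargs(1366, 768)).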

Requesting the test URL again shows that the User-Agent we set has taken effect.

Parameters of the render method

  • retries: number of retries, default 8.
  • script: JavaScript to execute, str, optional, default None. If given, render returns the script's return value.
  • wait: seconds to wait before loading the page, to prevent timeouts; float, optional, default 0.2.
  • scrolldown: number of times to page down; integer, default 0.
  • sleep: seconds to pause after the initial render; integer, optional, default 0.
  • reload: default True. If False, the content is loaded from memory instead of from the browser.
  • keep_page: default False. If True, you can interact with the page through r.html.page.

The docstring in the source reads:

"""Reloads the response in Chromium, and replaces HTML content
        with an updated version, with JavaScript executed.

        :param retries: The number of times to retry loading the page in Chromium.
        :param script: JavaScript to execute upon page load (optional).
        :param wait: The number of seconds to wait before loading the page, preventing timeouts (optional).
        :param scrolldown: Integer, if provided, of how many times to page down.
        :param sleep: Integer, if provided, of how many long to sleep after initial render.
        :param reload: If ``False``, content will not be loaded from the browser, but will be provided from memory.
        :param keep_page: If ``True`` will allow you to interact with the browser page through ``r.html.page``.
        """

If sleep and scrolldown are used together, the page pauses for sleep seconds after each page-down.

JS injection, example 1

script = """
                () => {
                    return {
                        width: document.documentElement.clientWidth,
                        height: document.documentElement.clientHeight,
                        deviceScaleFactor: window.devicePixelRatio,
                    }
                }
                """
from requests_html import HTMLSession

url  = 'https://httpbin.org/get'

session = HTMLSession(
    browser_args=[
        '--no-sandbox',
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
    ]
)
res = session.get(url = url)

r = res.html.render(script=script)
print(r)

The output is

{'width': 800, 'height': 600, 'deviceScaleFactor': 1}

JS injection, example 2: overriding navigator.webdriver

script = '''() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => undefined
        }
    })
}'''

scrolldown

Let's look at how the source handles this parameter.
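
As a minimal sketch of the behavior described for scrolldown and sleep (one PageDown per step, with a pause after each), and not the library's actual code: the function name and the press_page_down callback are hypothetical, and in a real session the callback would wrap something like page.keyboard.down('PageDown').

```python
import asyncio


async def scroll_and_pause(press_page_down, scrolldown=0, sleep=0):
    """Page down `scrolldown` times, pausing `sleep` seconds after each step."""
    if scrolldown:
        for _ in range(scrolldown):
            await press_page_down()
            await asyncio.sleep(sleep)
    else:
        # No scrolling requested: pause once after the initial render.
        await asyncio.sleep(sleep)
```

With scrolldown=3 and sleep=2, this presses PageDown three times and spends about six seconds in total, which is why the two parameters are usually tuned together on pages that load content lazily.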

Interact with the browser

page.screenshot([options])

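- options `<object>` Optional configuration
    - path `<string>` Path where the screenshot is saved. The image type is inferred from the file extension. A relative path is resolved from the current working directory. If no path is given, the image is not saved to disk.
    - type `<string>` Screenshot type, either jpeg or png. Defaults to 'png'.
    - quality `<number>` Image quality, from 0 to 100. Not applicable to png.
    - fullPage `<boolean>` If true, captures the whole page, including the parts that require scrolling. Defaults to false.
    - clip `<object>` Region to capture. Requires:
        - x `<number>` x coordinate of the region, relative to the top-left corner (0, 0)
        - y `<number>` y coordinate of the region, relative to the top-left corner (0, 0)
        - width `<number>` Width of the region
        - height `<number>` Height of the region
    - omitBackground `<boolean>` Hides the default white background so the screenshot can be transparent. Defaults to false (opaque).
    - encoding `<string>` Image encoding, either base64 or binary. Defaults to binary.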

Screenshot example

import asyncio

from requests_html import HTMLSession

url  = 'https://httpbin.org/get'

session = HTMLSession(
    browser_args=[
        '--no-sandbox',
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
    ]
)
res = session.get(url = url)
script = """
                () => {
                    return {
                        width: document.documentElement.clientWidth,
                        height: document.documentElement.clientHeight,
                        deviceScaleFactor: window.devicePixelRatio,
                    }
                }
               """
try:
    res.html.render(script=script, sleep=1, keep_page=True)

    async def main():
        # Options are passed as a dict; 'path' is where the image is saved
        await res.html.page.screenshot({'path': '1.png'})

    asyncio.get_event_loop().run_until_complete(main())
finally:
    session.close()

# To capture only a region, pass 'clip' with its top-left corner and size
# (inside main(), before the session is closed):
# await res.html.page.screenshot({'path': '1.png', 'clip': {'x': 200, 'y': 300, 'width': 400, 'height': 600}})

page.evaluate(pageFunction[, ...args])

  • pageFunction <function | string> The function to be executed in the page context

js1 = '''() =>{
    
           Object.defineProperties(navigator,{
             webdriver:{
               get: () => undefined
             }
           })
        }'''
        

js4 = '''() =>{Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
        }'''
await page.evaluate(js1)  # override navigator.webdriver
await page.evaluate(js4)  # override navigator.languages
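
Both snippets follow the same pattern, so the JavaScript can be generated instead of written out by hand. This is a sketch only; getter_override and apply_overrides are hypothetical helpers, and the value is passed as a raw JavaScript expression so that undefined can be expressed.

```python
def getter_override(prop, js_expr):
    """Return a JS function string that redefines navigator.<prop> with a getter."""
    return (
        "() => { Object.defineProperty(navigator, '%s', "
        "{get: () => %s}); }" % (prop, js_expr)
    )


async def apply_overrides(page, overrides):
    """Evaluate one override per (property, JS expression) pair on the page."""
    for prop, js_expr in overrides.items():
        await page.evaluate(getter_override(prop, js_expr))
```

For example, await apply_overrides(page, {'webdriver': 'undefined', 'languages': "['en-US', 'en']"}) would issue the same two overrides as js1 and js4.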

page.setViewport()

Sets the page (viewport) size.

await page.setViewport({'width': 1366, 'height': 768})

page.cookies()

If no URL is specified, this method returns the cookies for the current page's URL. If URLs are specified, only the cookies for those URLs are returned.
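
page.cookies() returns a list of dicts, one per cookie. A small sketch of turning that list into a Cookie header value, for reuse in a plain HTTP request, might look like this; both helper names are hypothetical, and only the name and value keys are assumed to be present.

```python
def cookies_to_header(cookies):
    """Join cookie dicts (each with 'name' and 'value') into a Cookie header value."""
    return '; '.join('{}={}'.format(c['name'], c['value']) for c in cookies)


async def cookie_header_for(page, *urls):
    """Read cookies from the page, optionally for specific URLs, as a header string."""
    cookies = await page.cookies(*urls)
    return cookies_to_header(cookies)
```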

page.type(selector, text[, options])

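- selector `<string>` Selector of the element to type into. If several elements match, the text goes into the first one.
- text `<string>` The text to type
- options `<object>`
    - delay `<number>` Delay between characters, in milliseconds. Defaults to 0.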

page.click(selector[, options])

- selector `<string>` Selector of the element to click. If several elements match, the first one is clicked.
- options `<object>`
    - button `<string>` left, right, or middle. Defaults to left.
    - clickCount `<number>` Defaults to 1. See UIEvent.detail.
    - delay `<number>` Time to wait between mousedown and mouseup, in milliseconds. Defaults to 0.
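
A minimal sketch of page.type and page.click used together on a login form. The selectors #username, #password and #submit are hypothetical, as are the function name and the pause parameter; adjust them to the target page.

```python
import asyncio


async def fill_and_submit(page, username, password, pause=0.5):
    """Type credentials into a form, wait briefly, then click the submit button."""
    # Hypothetical selectors; replace them with the ones used by the real page.
    await page.type('#username', username, {'delay': 50})  # 50 ms between characters
    await page.type('#password', password, {'delay': 50})
    await asyncio.sleep(pause)  # short pause before submitting
    await page.click('#submit', {'button': 'left', 'clickCount': 1})
```

With requests-html, page would be res.html.page after render(keep_page=True).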

page.focus(selector)

  • selector <string> Selector of the element to focus. If several elements match, the first one receives focus.

page.hover(selector)

  • selector <string> Selector of the element to hover over. If several elements match, the first one is hovered.

page.waitFor(selectorOrFunctionOrTimeout[, options[, ...args]])

- selectorOrFunctionOrTimeout <string|number|function> A selector, a function, or a timeout
- options `<object>` Optional waiting parameters
- ...args <...Serializable|JSHandle> Arguments passed to pageFunction
  • If selectorOrFunctionOrTimeout is a string, it is treated as a CSS selector or an XPath expression, depending on whether it starts with '//'. The method is then shorthand for page.waitForSelector or page.waitForXPath.
  • If selectorOrFunctionOrTimeout is a function, it is treated as a predicate, and the method is shorthand for page.waitForFunction().
  • If selectorOrFunctionOrTimeout is a number, it is treated as a timeout in milliseconds, and the returned Promise resolves after that time.
  • Otherwise an error is raised.
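
The cases above can be sketched as one call per argument type. The selectors are hypothetical, and the predicate is written as a string of JavaScript source, which is how functions are passed from Python in pyppeteer.

```python
async def wait_examples(page):
    """Call page.waitFor once with each kind of argument."""
    await page.waitFor('#content')              # CSS selector
    await page.waitFor('//div[@id="content"]')  # XPath, because it starts with '//'
    await page.waitFor('() => document.readyState === "complete"')  # JS predicate
    await page.waitFor(1000)                    # timeout in milliseconds
```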

page.emulate

Emulates a mobile phone. Here iPhone is a device descriptor, a dict with viewport and userAgent entries.

await page.emulate(iPhone)
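
In Python the descriptor has to be defined by hand before it can be passed to page.emulate. The sketch below uses illustrative iPhone-like values (they are assumptions, not taken from this article), and the names IPHONE and open_as_phone are hypothetical.

```python
# Illustrative iPhone-like descriptor; the values are assumptions for this example.
IPHONE = {
    'viewport': {
        'width': 375,
        'height': 667,
        'deviceScaleFactor': 2,
        'isMobile': True,
        'hasTouch': True,
        'isLandscape': False,
    },
    'userAgent': (
        'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) '
        'AppleWebKit/604.1.38 (KHTML, like Gecko) '
        'Version/11.0 Mobile/15A372 Safari/604.1'
    ),
}


async def open_as_phone(page, url, device=IPHONE):
    """Apply the device's viewport and User-Agent, then open the URL."""
    await page.emulate(device)
    await page.goto(url)
```

Emulation is applied before page.goto so that the first request already carries the mobile User-Agent.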

Keyboard Events

For more key names, see the puppeteer documentation.

Syntax:

res.html.page.keyboard.XXX

keyboard.down(key[, options])

  • key <string> Name of the key to press, such as ArrowLeft. See USKeyboardLayout for a list of all key names.
  • options <object>
      - text <string> If specified, an input event is generated with this text.

keyboard.up(key)

  • key <string> Name of the key to release, such as ArrowLeft.

keyboard.press(key[, options])

  • key <string> Name of the key to press, such as ArrowLeft.
  • options <object>
      - text <string> If specified, an input event is generated with this text.
      - delay <number> Time to wait between keydown and keyup, in milliseconds. Defaults to 0.

keyboard.type(text, options)

  • text <string> Text to type into the focused element.
  • options <object>
      - delay <number> Time to wait between key presses, in milliseconds. Defaults to 0.

await page.keyboard.type('喜欢你啊', {'delay': 100})
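
down, press, up and type can be combined into a key chord. The sketch below selects the content of the focused field with Ctrl+A and types a replacement; the function name is hypothetical, and on macOS the modifier would likely be Meta rather than Control.

```python
async def select_all_and_replace(page, text):
    """Select the focused field's content with Ctrl+A, then type replacement text."""
    await page.keyboard.down('Control')   # hold the modifier
    await page.keyboard.press('KeyA')     # keydown + keyup while Control is held
    await page.keyboard.up('Control')     # release the modifier
    await page.keyboard.type(text, {'delay': 100})  # 100 ms between key presses
```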

Mouse Events

r.html.page.mouse.XXX

mouse.click(x, y, [options])

  • x <number>
  • y <number>
  • options <object>
      - button <string> left, right, or middle. Defaults to left.
      - clickCount <number> Defaults to 1. See UIEvent.detail.
      - delay <number> Time to wait between mousedown and mouseup, in milliseconds. Defaults to 0.

mouse.down([options])

  • options <object>
      - button <string> left, right, or middle. Defaults to left.
      - clickCount <number> Defaults to 1.

mouse.up([options])

  • options <object>
      - button <string> left, right, or middle. Defaults to left.
      - clickCount <number> Defaults to 1.
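
mouse.down and mouse.up are typically paired with pointer movement, for example to drag a slider. The sketch below additionally uses mouse.move(x, y), which is not described above; the helper names, the step count and the pause are assumptions made for illustration.

```python
import asyncio


def drag_path(start_x, end_x, steps):
    """Return evenly spaced x coordinates ending exactly at end_x."""
    if steps < 1:
        raise ValueError('steps must be at least 1')
    return [start_x + (end_x - start_x) * i / steps for i in range(1, steps + 1)]


async def drag_horizontally(page, start_x, end_x, y, steps=10, pause=0.01):
    """Press the left button at (start_x, y), move to (end_x, y) in steps, release."""
    await page.mouse.move(start_x, y)
    await page.mouse.down({'button': 'left'})
    for x in drag_path(start_x, end_x, steps):
        await page.mouse.move(x, y)
        await asyncio.sleep(pause)  # brief pause so the movement is not instantaneous
    await page.mouse.up({'button': 'left'})
```

Moving in several small steps rather than one jump is a design choice: some pages track intermediate mousemove events while a drag is in progress.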

Project code

Reference projects built with puppeteer

Simulating a Gmail login

import asyncio
from pyppeteer import launch


async def gmailLogin(username, password, url):
    # 'headless': False shows the browser window; change False to True to hide it.
    # 127.0.0.1:1080 is the proxy IP and port; change it to match your local proxy.
    # On a VPS or in global-proxy mode, '--proxy-server=127.0.0.1:1080' can be removed.
    browser = await launch({'headless': False, 'args': ['--no-sandbox', '--proxy-server=127.0.0.1:1080']})
    page = await browser.newPage()
    await page.setUserAgent(
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36')

    await page.goto(url)

    # Enter the Gmail address
    await page.type('#identifierId', username)
    # Click "Next"
    await page.click('#identifierNext > content')
    await asyncio.sleep(10)  # wait without blocking the event loop
    # Enter the password
    await page.type('#password input', password)
    # Click "Next"
    await page.click('#passwordNext > content > span')
    await asyncio.sleep(10)
    # Click DONE on the security check page
    # await page.click('div > content > span')  # If this machine has logged in before and page.setUserAgent
    # is set to the user agent of the browser that logged in successfully, the security check page does not
    # appear. Adjust this to your needs; it is still recommended to log in with your usual browser first
    # and only then with the Python program.

    # Screenshot after a successful login (quality is omitted because it does not apply to png)
    await page.screenshot({'path': './gmail-login.png', 'fullPage': True})
    # Jump to another Google service, using YouTube as the example
    await page.goto('https://www.youtube.com')
    await asyncio.sleep(10)
    await browser.close()


if __name__ == '__main__':
    username = 'your Gmail address, including @gmail.com'
    password = r'your Gmail password'
    url = 'https://gmail.com'
    loop = asyncio.get_event_loop()
    loop.run_until_complete(gmailLogin(username, password, url))
# Code written by 三分醉 (www.sanfenzui.com), with reference to the following article:
# https://blog.csdn.net/Chen_chong__/article/details/82950968

Implementing a Taobao login

Emulating a phone with puppeteer

Using pyppeteer with Scrapy

Origin www.cnblogs.com/ruhai/p/11318133.html