requests-html module (part 1)

Module requests-html

Official website

Github URL

Request data

from requests_html import HTMLSession

session = HTMLSession()

With requests-html, requests are sent through the session object.

Send a GET request

url = 'https://baidu.com'
res = session.get(url=url)

Send a POST request

url = 'https://baidu.com'
res = session.post(url=url)

You can also use the request method and choose GET or POST through its method parameter. It is used the same way as the request method of the Session class in requests.

url = 'http://www.baidu.com'
res = session.request(method='GET', url=url)
print(res.html.html)

The get, post, and request methods work the same way as their counterparts in the requests module, since both modules were written by the same author.

Custom HTML object

from requests_html import HTML
doc = """<a href='https://httpbin.org'>"""

html = HTML(html=doc)
print(html.links)
{'https://httpbin.org'}

HTML object properties

url = 'https://www.zhihu.com/signin?next=%2F'
res = session.get(url = url)

Inspect the type of res in IPython:

In [15]: res
Out[15]: <Response [200]>
In [16]: type(res)
Out[16]: requests_html.HTMLResponse
In [17]: res.html
Out[17]: <HTML url='https://www.zhihu.com/signin?next=%2F'>

In [18]: type(res.html)
Out[18]: requests_html.HTML

We can see that requests_html.HTML and requests_html.HTMLResponse are classes implemented by the module itself.

In [19]: dir(res.html)
Out[19]:
['absolute_links', 'add_next_symbol', 'arender', 'base_url', 'default_encoding', 'element', 'encoding', 'find', 'full_text', 'html', 'links', 'lxml', 'next', 'next_symbol', 'page', 'pq', 'raw_html', 'render', 'search', 'search_all', 'session', 'skip_anchors', 'text', 'url', 'xpath']

Even with the built-in attributes and methods left out of the listing, the class still exposes this many methods and properties. The sections below introduce them one by one.

html object properties

absolute_links

Returns every link in the page as an absolute URL. Relative links found in the page are converted to absolute URLs automatically.

url = 'https://www.zhihu.com/signin?next=%2F'
res = session.get(url = url)

Here we request the Zhihu (知乎) home page; the links checked here are the ones at the bottom of the page.

In:res.html.absolute_links
Out:
{
 'https://www.zhihu.com/app',
 'https://www.zhihu.com/contact',
 'https://www.zhihu.com/explore',
 'https://www.zhihu.com/jubao',
 'https://www.zhihu.com/org/signup',
 'https://www.zhihu.com/question/waiting',
}

The relative paths have been converted to absolute paths.

links

Returns the links exactly as they are written in the page: absolute paths stay absolute and relative paths stay relative.

In : res.html.links
Out:
{
 '/app',
 '/contact',
 '/explore',
 '/jubao',
 'https://www.zhihu.com/org/signup',
 'https://www.zhihu.com/term/privacy',
 'https://www.zhihu.com/terms',
}
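The difference between links and absolute_links comes down to resolving each href against the page URL. The sketch below illustrates that resolution with the standard library's urllib.parse.urljoin instead of requests-html itself. The helper name to_absolute and the sample hrefs are assumptions made for illustration.

```python
from urllib.parse import urljoin

# Page URL taken from the example above.
page_url = 'https://www.zhihu.com/signin?next=%2F'

# A few hrefs as they might appear in the page source (illustrative sample).
links = {'/app', '/contact', 'https://www.zhihu.com/terms'}


def to_absolute(base, hrefs):
    """Resolve every href against base; absolute URLs pass through unchanged."""
    return {urljoin(base, href) for href in hrefs}


absolute_links = to_absolute(page_url, links)
print(sorted(absolute_links))
```

Relative entries such as '/app' pick up the scheme and host of the page URL, while the entry that was already absolute is left as it is.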

base_url

The base URL of the page, used to resolve relative links.

html

Returns the HTML source of the response page.

raw_html

Returns the HTML as a byte stream.

text

Returns all the text in the response. The output looks like this:

In [21]: res.html.text
Out[21]: '知乎 - 有问题,上知乎\n.u-safeAreaInset-top { height: constant(safe-area-inset-top) !important; height: env(safe-
area-inset-top) !important; } .u-safeAreaInset-bottom { height: constant(safe-area-inset-bottom) !important; height: env(safe-area-inset-bottom) !important; }\nif (window.requestAnimationFrame) { window.requestAnimationFrame(function() { window.FIRST_ANIMATION_FRAME = Date.now(); }); }\n首页\n发现\n等你来答\n登录加入知乎\n有问题,上知乎\n免密码登录\n密码登录\n获取短
信验证码\n接收语音验证码\n注册/登录\n未注册手机验证后自动登录\n注册即代表同意《知乎协议》《隐私保护指引》\n注册机构号\n社交
帐号登录\n微信\nQQ\nQQ\n微博\n下载知乎 App\n知乎专栏圆桌发现移动应用联系我们来知乎工作注册机构号\n© 2019 知乎京 ICP 证 1107
45 号京公网安备 11010802010035 号出版物经营许可证\n侵权举报网上有害信息举报专区儿童色情信息举报专区违法和不良信息举报:010-
82716601\n

encoding

The character encoding of the page.

The encoding can be set as follows:

res.html.encoding = 'gbk'
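To see why the encoding setting matters, recall that raw_html holds bytes while html holds decoded text. The following standard-library sketch (not requests-html code; the sample string is an assumption) decodes the same bytes once with the matching codec and once with the wrong one.

```python
# Bytes as a server using GBK might send them (illustrative sample).
raw = '知乎'.encode('gbk')

# Decoding with the matching codec recovers the original text.
decoded_ok = raw.decode('gbk')

# Decoding with the wrong codec yields garbled text instead.
decoded_bad = raw.decode('utf-8', errors='replace')

print(decoded_ok)
print(decoded_ok == decoded_bad)
```

If a page's text comes out garbled, setting the encoding to the one the site actually uses is the first thing to try.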

html object methods

find

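Parameters:

:param selector: CSS selector
:param clean: whether to remove <script> and <style> tags from the page, default False
:param containing: if given, only return Element objects that contain the given text, default None
:param first: whether to return only the first match, default False
:param _encoding: character encoding

Return value:

[Element, Element, ...]
When first is True, only the first Element is returned.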

xpath

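:param selector: XPath selector
The other parameters are the same as for find.

search

Searches the page for the given template and returns only the first match.

res.html.search('xxx{}yyy')[0]  # searches only once

res.html.search('xxx{name}yyy{pwd}zzz')['name']  # searches only once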

search_all

Finds every match of the template; the result is a list of Result objects.
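The template idea behind search and search_all can be sketched with regular expressions. This is a rough approximation using only the re module, not the way requests-html implements template matching. The helper template_to_regex and the sample text are hypothetical.

```python
import re


def template_to_regex(template):
    """Turn 'xxx{}yyy{name}zzz' into a regex with one group per placeholder."""
    pattern = ''
    for part in re.split(r'(\{\w*\})', template):
        if re.fullmatch(r'\{\w*\}', part):
            name = part[1:-1]
            # Named placeholder -> named group; empty placeholder -> plain group.
            pattern += f'(?P<{name}>.+?)' if name else '(.+?)'
        else:
            # Literal text between placeholders is matched verbatim.
            pattern += re.escape(part)
    return re.compile(pattern)


text = 'user=alice;pwd=secret; user=bob;pwd=hunter2;'
rx = template_to_regex('user={name};pwd={pwd};')

first = rx.search(text)                                    # first match only
all_matches = [m.groupdict() for m in rx.finditer(text)]   # every match

print(first['name'])
print(all_matches)
```

An empty placeholder is read by position and a named one by name, which mirrors the [0] and ['name'] lookups shown above.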

Element object

'absolute_links', 'attrs', 'base_url', 'default_encoding', 'element', 'encoding', 'find', 'full_text', 'html', 'lineno', 'links', 'lxml', 'pq', 'raw_html', 'search', 'search_all', 'session', 'skip_anchors', 'tag', 'text', 'url', 'xpath'

text

The element's text with \r\n removed.

full_text

The element's text with \r\n left in place.

attrs

Returns the Element's attribute names and values as a dictionary.
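To illustrate the general shape of such an attribute dictionary, here is a sketch built on the standard library's html.parser instead of requests-html. The class name AttrCollector and the sample markup are assumptions, and the exact value types returned by attrs in requests-html may differ.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record each start tag together with its attributes as a dict."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.elements.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed("<a href='https://httpbin.org' title='httpbin'>httpbin</a>")

tag, attrs = collector.elements[0]
print(tag, attrs)
```

Each attribute name becomes a key and its value the corresponding entry, so a lookup such as attrs['href'] reads the link target directly.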

Origin www.cnblogs.com/ruhai/p/11318082.html