Module requests-html
Request data
from requests_html import HTMLSession
session = HTMLSession()
In requests-html, every request is issued through the session object.
Send a GET request
url = 'https://baidu.com'
res= session.get(url = url)
Send a POST request
url = 'https://baidu.com'
res= session.post(url = url)
You can also use the request method and choose GET or POST through its method parameter. It is used the same way as the request method of the Session class in the requests module.
url = 'http://www.baidu.com'
res = session.request(method = 'GET',url = url)
print(res.html.html)
The get, post, and request methods here behave the same as the corresponding methods in the requests module, because both modules were written by the same author.
Custom HTML object
from requests_html import HTML
doc = """<a href='https://httpbin.org'>"""
html = HTML(html=doc)
print(html.links)
{'https://httpbin.org'}
HTML object properties
url = 'https://www.zhihu.com/signin?next=%2F'
res = session.get(url = url)
Inspect res and its type in IPython:
In [15]: res
Out[15]: <Response [200]>
In [16]: type(res)
Out[16]: requests_html.HTMLResponse
In [17]: res.html
Out[17]: <HTML url='https://www.zhihu.com/signin?next=%2F'>
In [18]: type(res.html)
Out[18]: requests_html.HTML
We can see that requests_html.HTML and requests_html.HTMLResponse are classes implemented by the module itself.
In [19]: dir(res.html)
Out[19]:
['absolute_links', 'add_next_symbol', 'arender', 'base_url', 'default_encoding', 'element', 'encoding', 'find', 'full_text', 'html', 'links', 'lxml', 'next', 'next_symbol', 'page', 'pq', 'raw_html', 'render', 'search', 'search_all', 'session', 'skip_anchors', 'text', 'url', 'xpath']
Even with the built-in attributes and methods left out of this listing, the class still exposes many methods and properties. The following sections introduce them step by step.
html object properties
absolute_links
Returns the links in the page as absolute URLs. Any relative link found in the page is automatically converted to an absolute URL.
url = 'https://www.zhihu.com/signin?next=%2F'
res = session.get(url = url)
Here we request the Zhihu (知乎) homepage; you can compare the result with the links at the bottom of the page.
In:res.html.absolute_links
Out:
{
'https://www.zhihu.com/app',
'https://www.zhihu.com/contact',
'https://www.zhihu.com/explore',
'https://www.zhihu.com/jubao',
'https://www.zhihu.com/org/signup',
'https://www.zhihu.com/question/waiting',
}
The relative paths have been converted to absolute paths.
links
Returns the links exactly as they appear in the page: an absolute path is output as an absolute path, and a relative path as a relative path.
In : res.html.links
Out:
{
'/app',
'/contact',
'/explore',
'/jubao',
'https://www.zhihu.com/org/signup',
'https://www.zhihu.com/term/privacy',
'https://www.zhihu.com/terms',
}
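The difference between links and absolute_links comes down to resolving each href against the URL of the page. The sketch below is not requests-html's own code; it uses the standard library's urljoin to illustrate the same conversion, with a few hrefs taken from the output above.

```python
from urllib.parse import urljoin

page_url = 'https://www.zhihu.com/signin?next=%2F'

# hrefs as they appear in the page source (the kind of values `links` reports)
links = {'/app', '/contact', 'https://www.zhihu.com/org/signup'}

# Resolve every href against the page URL (the kind of values `absolute_links` reports)
absolute_links = {urljoin(page_url, href) for href in links}

for link in sorted(absolute_links):
    print(link)
```

Links that are already absolute pass through urljoin unchanged, which is why they look identical in both outputs.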
base_url
The base URL of the page, used to resolve relative links.
html
Returns the HTML source of the response page.
raw_html
Returns the page content as raw bytes.
text
Returns all the text in the response. The result looks like this:
In [21]: res.html.text
Out[21]: '知乎 - 有问题,上知乎\n.u-safeAreaInset-top { height: constant(safe-area-inset-top) !important; height: env(safe-
area-inset-top) !important; } .u-safeAreaInset-bottom { height: constant(safe-area-inset-bottom) !important; height: env(safe-area-inset-bottom) !important; }\nif (window.requestAnimationFrame) { window.requestAnimationFrame(function() { window.FIRST_ANIMATION_FRAME = Date.now(); }); }\n首页\n发现\n等你来答\n登录加入知乎\n有问题,上知乎\n免密码登录\n密码登录\n获取短
信验证码\n接收语音验证码\n注册/登录\n未注册手机验证后自动登录\n注册即代表同意《知乎协议》《隐私保护指引》\n注册机构号\n社交
帐号登录\n微信\nQQ\nQQ\n微博\n下载知乎 App\n知乎专栏圆桌发现移动应用联系我们来知乎工作注册机构号\n© 2019 知乎京 ICP 证 1107
45 号京公网安备 11010802010035 号出版物经营许可证\n侵权举报网上有害信息举报专区儿童色情信息举报专区违法和不良信息举报:010-
82716601\n
encoding
The character encoding of the page.
The character encoding can be set as follows:
res.html.encoding = 'gbk'
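Why the encoding matters is easiest to see on plain bytes. The following is a minimal standard-library sketch, not requests-html code: the sample page body is made up, and the variable name raw_html only mirrors the property described above.

```python
# Hypothetical page body served as GBK-encoded bytes (the kind of value raw_html holds)
raw_html = '<title>知乎</title>'.encode('gbk')

# Decoding with the wrong codec garbles the Chinese text
wrong = raw_html.decode('utf-8', errors='replace')

# Decoding with the page's real encoding recovers it
right = raw_html.decode('gbk')

print(wrong)
print(right)
```

If the text of a page comes out garbled, setting the encoding explicitly, as in the line above, is the first thing to try.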
html object methods
find
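Parameters:
:param selector: CSS selector
:param clean: whether to remove the <script> and <style> tags from the page; default False
:param containing: if given, only Element objects containing the given text are returned; default None
:param first: whether to return only the first match; default False
:param _encoding: character encoding
Return value:
[Element, Element, ...]
When first is True, only the first Element is returned.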
xpath
:param selector: XPath selector
The remaining parameters (clean, first, _encoding) behave the same as in find.
search
res.html.search('xxx{}yyy')[0]  # searches only once; positional fields are read by index
res.html.search('xxx{name}yyy{pwd}zzz')['name']  # searches only once; named fields are read by name
search_all
Finds every match of the template in the page. The result is a list of Result objects.
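To make the template idea concrete, here is a rough standard-library sketch. It is an assumption-laden stand-in, not how requests-html does it (the library relies on the parse package for templates): the helper template_to_regex and the sample markup are hypothetical, and the sketch only handles plain {} and {name} fields that are followed by literal text.

```python
import re


def template_to_regex(template):
    """Turn 'xxx{}yyy' or 'xxx{name}yyy' into a regex with one group per field."""
    # re.split with a capturing group alternates: literal, field name, literal, ...
    parts = re.split(r"\{(\w*)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)        # literal text must match exactly
        elif part:
            pattern += f"(?P<{part}>.+?)"     # named field, e.g. {name}
        else:
            pattern += "(.+?)"                # positional field, {}
    return re.compile(pattern, re.DOTALL)


page = "<li>user=alice;</li><li>user=bob;</li>"

# Like search: stop at the first match and read the positional field
first = template_to_regex("user={};").search(page).group(1)

# Like search_all: collect every match and read the named field
names = [m.group("name") for m in template_to_regex("user={name};").finditer(page)]

print(first)
print(names)
```

The point of the sketch is the shape of the results: one match with indexable fields for a single search, and a list of such matches when searching for all of them.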
Element object
'absolute_links', 'attrs', 'base_url', 'default_encoding', 'element', 'encoding', 'find', 'full_text', 'html', 'lineno', 'links', 'lxml', 'pq', 'raw_html', 'search', 'search_all', 'session', 'skip_anchors', 'tag', 'text', 'url', 'xpath'
text
The text content with \r\n removed.
full_text
The text content with \r\n left in place.
attrs
Returns the attributes of the Element object as a dictionary that maps attribute names to their values.
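As an illustration of what such an attribute dictionary looks like, the sketch below parses a single tag with the standard library's html.parser. It does not use requests-html; the class FirstTagAttrs and the sample tag are made up for the example.

```python
from html.parser import HTMLParser


class FirstTagAttrs(HTMLParser):
    """Collect the attributes of the first start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if self.attrs is None:
            self.attrs = dict(attrs)


parser = FirstTagAttrs()
parser.feed("<a href='https://httpbin.org' title='httpbin'>link</a>")
print(parser.attrs)
```

With a dictionary in hand, a single attribute such as the link target is read by name, for example parser.attrs['href'].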