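Contents
1 Introduction
Python has a very well-known HTTP library, requests, which most of you will have heard of, and everyone who has used it enjoys it. The author of requests has since released a new library called requests-html. As the name suggests, it is a library for parsing HTML: it keeps the functionality of requests and adds some more powerful features on top. Let's take a look.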
# From the official documentation
'''
This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

If you're interested in financially supporting Kenneth Reitz open source, consider visiting this link. Your support helps tremendously with sustainability of motivation, as Open Source is no longer part of my day job.

When using this library you automatically get:
- Full JavaScript support!
- CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
- XPath Selectors, for the faint of heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection-pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
- Async Support
'''
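According to the official documentation, it is more powerful than the original requests module and adds several new features:
- JavaScript support
- CSS selectors (a.k.a. jQuery-style, thanks to PyQuery)
- XPath selectors
- A mocked User-Agent (so requests look more like a real web browser)
- Automatic following of redirects
- Connection pooling and cookie persistence
- Async request support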
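2 Installation
Installing requests-html is very simple and takes a single command: pip install requests-html. One thing to note is that requests-html only supports Python 3.6 or later, so anyone on an older Python will need to upgrade first.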
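Since the only stated requirement is Python 3.6 or later, it can help to check the interpreter before installing. The following is a minimal sketch using only the standard library; the helper name `python_is_supported` is hypothetical and not part of requests-html.

```python
import sys

# Minimum version stated in the text above.
MIN_VERSION = (3, 6)


def python_is_supported(version_info=None, minimum=MIN_VERSION):
    """Return True if the (major, minor) version meets the minimum.

    `version_info` defaults to the running interpreter's version.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if python_is_supported():
        print("Python version OK; run: pip install requests-html")
    else:
        print("Python %d.%d is too old; upgrade to 3.6 or later"
              % tuple(sys.version_info[:2]))
```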
3 How to use it?
4 Introduction
5 Introduction
Tutorial and usage
Make a GET request to 'python.org', using Requests:
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')
Try async and fetch several sites at the same time:
>>> from requests_html import AsyncHTMLSession
>>> asession = AsyncHTMLSession()
>>> async def get_pythonorg():
...     r = await asession.get('https://python.org/')
...     return r  # return the response so run() can collect it
...
>>> async def get_reddit():
...     r = await asession.get('https://reddit.com/')
...     return r
...
>>> async def get_google():
...     r = await asession.get('https://google.com/')
...     return r
...
>>> results = asession.run(get_pythonorg, get_reddit, get_google)
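To see why passing several coroutine functions to a single run call saves time, here is a rough sketch of the same pattern using only asyncio (Python 3.7+ for `asyncio.run`). It is not how requests-html is implemented: `fake_get` and `run_all` are hypothetical stand-ins, and a short sleep replaces the network request so the example works offline.

```python
import asyncio
import time


async def fake_get(url, delay=0.1):
    """Hypothetical stand-in for `await asession.get(url)`.

    Sleeps instead of touching the network, then returns a string.
    """
    await asyncio.sleep(delay)
    return "response from " + url


async def get_pythonorg():
    return await fake_get('https://python.org/')


async def get_reddit():
    return await fake_get('https://reddit.com/')


async def get_google():
    return await fake_get('https://google.com/')


def run_all(*coro_funcs):
    """Call each coroutine function and await all of them concurrently."""
    async def gather_all():
        return await asyncio.gather(*(func() for func in coro_funcs))
    return asyncio.run(gather_all())


start = time.perf_counter()
results = run_all(get_pythonorg, get_reddit, get_google)
elapsed = time.perf_counter() - start

print(results)
print("elapsed: %.2fs" % elapsed)
```

Because the three waits overlap, the total time should be close to one delay rather than the sum of all three.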
Grab a list of all links on the page, as-is (anchors excluded):
>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://status.python.org/', '/community/workshops/', '/community/lists/', 'http://buildbot.net/', '/community/awards', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', '/psf/donations/', 'http://wiki.python.org/moin/Languages', '/dev/', '/events/python-user-group/', 'https://wiki.qt.io/PySide', '/community/sigs/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'http://planetpython.org/', '/events/python-events', '/about/help/', '/events/python-user-group/past/', '/about/success/', '/psf-landing/', '/about/apps', '/about/', 'http://www.wxpython.org/', '/events/python-user-group/665/', 'https://www.python.org/psf/codeofconduct/', '/dev/peps/peps.rss', '/downloads/source/', '/psf/sponsorship/sponsors/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://bugs.python.org/', '/community/merchandise/', 'http://tornadoweb.org', '/events/python-user-group/650/', 'http://flask.pocoo.org/', '/downloads/release/python-364/', '/events/python-user-group/660/', '/events/python-user-group/638/', '/psf/', '/doc/', 'http://blog.python.org', '/events/python-events/604/', '/about/success/#government', 'http://python.org/dev/peps/', 'https://docs.python.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/users/membership/', '/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', '/downloads/', '/jobs/', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', '/privacy/', 'https://pypi.python.org/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'http://www.scipy.org', '/community/forums/', '/about/success/#scientific', '/about/success/#software-development', '/shell/', '/accounts/signup/', 'http://www.facebook.com/pythonlang?fref=ts', '/community/', 'https://kivy.org/', '/about/quotes/', 'http://www.web2py.com/', '/community/logos/', '/community/diversity/', '/events/calendars/', 'https://wiki.python.org/moin/BeginnersGuide', '/success-stories/', '/doc/essays/', '/dev/core-mentorship/', 'http://ipython.org', '/events/', '//docs.python.org/3/tutorial/controlflow.html', '/about/success/#education', '/blogs/', '/community/irc/', 'http://pycon.blogspot.com/', '//jobs.python.org', 'http://www.pylonsproject.org/', 'http://www.djangoproject.com/', '/downloads/mac-osx/', '/about/success/#business', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}
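To make "as-is, anchors excluded" concrete, the sketch below collects `href` values from a small HTML string with the standard library's `html.parser`. It is only an illustration of the idea, not the library's own code; it assumes "anchors" means hrefs that begin with `#`, and the names `LinkCollector`, `collect_links` and `SAMPLE` are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags, unchanged, into a set."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Assumption: pure in-page anchors ("#...") are what gets excluded.
        if href and not href.startswith('#'):
            self.links.add(href)


def collect_links(markup):
    """Return the set of links found in an HTML string."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical sample markup, loosely modelled on the python.org links above.
SAMPLE = '''
<a href="/about/">About</a>
<a href="#content">Skip to content</a>
<a href="//docs.python.org/3/tutorial/">Tutorial</a>
<a href="https://devguide.python.org/">Dev guide</a>
<a>no href</a>
'''

print(sorted(collect_links(SAMPLE)))
```

Relative paths such as `/about/` and protocol-relative links such as `//docs.python.org/3/tutorial/` come back exactly as written in the markup, which matches the mix of forms in the output above.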
Grab a list of all links on the page, in absolute form (anchors excluded):
>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', 'https://www.python.org/dev/peps/', 'https://mail.python.org/mailman/listinfo/python-dev', 'https://www.python.org/doc/', 'https://www.python.org/', 'https://www.python.org/about/', 'https://www.python.org/events/python-events/past/', 'https://devguide.python.org/', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', 'https://docs.python.org/3/tutorial/introduction.html#lists', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', 'http://pyfound.blogspot.com/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://www.python.org/events/python-events', 'https://status.python.org/', 'https://www.python.org/about/apps', 'https://www.python.org/downloads/release/python-2714/', 'https://www.python.org/psf/donations/', 'http://buildbot.net/', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', 'http://wiki.python.org/moin/Languages', 'https://docs.python.org/faq/', 'https://jobs.python.org', 'https://www.python.org/about/success/#software-development', 'https://www.python.org/about/success/#education', 'https://www.python.org/community/logos/', 'https://www.python.org/doc/av', 'https://wiki.qt.io/PySide', 'https://www.python.org/events/python-user-group/660/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'https://www.python.org/dev/peps/peps.rss', 'http://planetpython.org/', 'https://www.python.org/events/python-user-group/past/', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions', 'https://www.python.org/community/diversity/', 'https://docs.python.org/3/tutorial/controlflow.html', 'https://www.python.org/community/awards', 'https://www.python.org/events/python-user-group/638/', 'https://www.python.org/about/legal/', 'https://www.python.org/dev/', 'https://www.python.org/download/alternatives', 'https://www.python.org/downloads/', 'https://www.python.org/community/lists/', 'http://www.wxpython.org/', 'https://www.python.org/about/success/#government', 'https://www.python.org/psf/', 'https://www.python.org/psf/codeofconduct/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://www.python.org/downloads/source/', 'https://bugs.python.org/', 'https://www.python.org/downloads/mac-osx/', 'https://www.python.org/about/help/', 'http://tornadoweb.org', 'http://flask.pocoo.org/', 'https://www.python.org/users/membership/', 'http://blog.python.org', 'https://www.python.org/privacy/', 'https://www.python.org/about/gettingstarted/', 'http://python.org/dev/peps/', 'https://www.python.org/about/apps/', 'https://docs.python.org', 'https://www.python.org/success-stories/', 'https://www.python.org/community/forums/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/community/merchandise/', 'https://www.python.org/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', 'https://pypi.python.org/', 'https://www.python.org/events/python-user-group/650/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'https://www.python.org/about/quotes/', 'https://www.python.org/downloads/windows/', 'https://www.python.org/events/calendars/', 'http://www.scipy.org', 'https://www.python.org/community/workshops/', 'https://www.python.org/blogs/', 'https://www.python.org/accounts/signup/', 'https://www.python.org/events/', 'https://kivy.org/', 'http://www.facebook.com/pythonlang?fref=ts', 'http://www.web2py.com/', 'https://www.python.org/psf/sponsorship/sponsors/', 'https://www.python.org/community/', 'https://www.python.org/download/other/', 'https://www.python.org/psf-landing/', 'https://www.python.org/events/python-user-group/665/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/accounts/login/', 'https://www.python.org/downloads/release/python-364/', 'https://www.python.org/dev/core-mentorship/', 'https://www.python.org/about/success/#business', 'https://www.python.org/community/sigs/', 'https://www.python.org/events/python-user-group/', 'http://ipython.org', 'https://www.python.org/shell/', 'https://www.python.org/community/irc/', 'https://www.python.org/about/success/#engineering', 'http://www.pylonsproject.org/', 'http://pycon.blogspot.com/', 'https://www.python.org/about/success/#scientific', 'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/'}
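The difference between the two link sets is ordinary URL resolution against the page's address. As a small sketch of that idea (not the library's internals), the standard library's `urllib.parse.urljoin` turns relative and protocol-relative links into absolute ones. `BASE_URL` and `make_absolute` are names chosen for this example, and the base is assumed to be `https://www.python.org/`, which is what the output above suggests.

```python
from urllib.parse import urljoin

# Assumed page address, inferred from the absolute links shown above.
BASE_URL = 'https://www.python.org/'


def make_absolute(link, base_url=BASE_URL):
    """Resolve a link against the page URL; absolute links pass through."""
    return urljoin(base_url, link)


for link in ['/about/apps/', '//docs.python.org/3/tutorial/', 'http://bottlepy.org']:
    print(link, '->', make_absolute(link))
```

A path like `/about/apps/` picks up the scheme and host of the page, a protocol-relative link only picks up the scheme, and an already absolute link is left alone.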
Select an element with a CSS Selector:
>>> about = r.html.find('#about', first=True)
Grab an element's text contents:
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
Introspect an Element's attributes:
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
Render out an Element's HTML:
>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1" id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>
Select Elements within an Element:
>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]
Search for links within an element:
>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}
Search for text on the page:
>>> r.html.search('Python is a {} language')[0]
programming
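The `{}` in the search template acts as a placeholder for whatever text sits at that position. The sketch below mimics that behaviour with a regular expression so the idea can be tried offline. It is a simplification under stated assumptions: requests-html is understood to delegate this to the third-party parse package, while `search_template` here is a hypothetical helper that only handles bare `{}` placeholders followed by literal text.

```python
import re


def search_template(template, text):
    """Find the first place where `template` matches `text`.

    Each '{}' in the template captures the shortest run of characters
    that lets the rest of the template match. Returns a tuple of the
    captured pieces, or None when there is no match.
    """
    literal_parts = template.split('{}')
    pattern = '(.+?)'.join(re.escape(part) for part in literal_parts)
    match = re.search(pattern, text)
    return match.groups() if match else None


# Hypothetical page text for the demonstration.
page_text = 'Python is a programming language that lets you work quickly.'

print(search_template('Python is a {} language', page_text)[0])
```

Indexing the result with `[0]` picks the first captured placeholder, just as in the session above.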
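More complex CSS Selector example (copied from Chrome dev tools):
>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.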
XPath is also supported:
>>> r.html.xpath('/html/body/div[1]/a')
[<Element 'a' class=('px-2', 'py-4', 'show-on-focus', 'js-skip-to-content') href='#start-of-content' tabindex='1'>]
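For readers new to XPath, the expression above reads as "the `a` elements inside the first `div` under `body`". The sketch below walks the same kind of path on a tiny, well-formed document with the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and needs a path relative to the root element (so `./body/div[1]/a` rather than `/html/body/div[1]/a`). The markup in `DOC` is invented for this example and is not GitHub's real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for this illustration.
DOC = '''<html><body>
<div><a class="skip" href="#start-of-content">Skip to content</a></div>
<div><a href="/features">Features</a></div>
</body></html>'''

root = ET.fromstring(DOC)  # root is the <html> element

# Positions in XPath are 1-based: div[1] is the first <div> child of <body>.
matches = root.findall('./body/div[1]/a')

for a in matches:
    print(a.tag, a.attrib)
```

Only the link in the first `div` is returned; the one in the second `div` is skipped because of the `[1]` position filter.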