A new trick from the web scraping master

This article takes about 2 minutes to read

A fan's monologue

Almost everyone who writes web crawlers uses the requests library. Its author is the famous Kenneth Reitz, a truly impressive developer. I recently browsed his website and found that he has a new trick: a library that combines a crawler's downloader and parser in one. It is another great boon for the scraping world. Let's learn it together.





01

Requests-HTML


This library is a companion to the requests library. Generally, when we write a crawler and finish downloading a web page, we have to install a separate parsing library to parse it. There are many such parsing libraries to choose from, which raises our learning cost.


Isn't there a library that integrates the two and hands them to us conveniently? This one does: it has HTML parsing built in, which is like being allowed to bring your own drinks, very convenient. It bills itself as an HTML parsing library for humans.


At the time of writing, the project has about 7,500 stars and 323 forks on GitHub, which is quite impressive!



02

What's in this library?


We just install it directly with pip: pip install requests-html. The library bundles the requests library, the pyquery library, the bs4 library, and several encoding libraries. Best of all, it even integrates the random user-agent library fake-useragent!


What awesome features are built in:

  • Full JavaScript support!

  • CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).

  • XPath Selectors, for the faint of heart.

  • Mocked user-agent (like a real web browser).

  • Automatic following of redirects.

  • Connection-pooling and cookie persistence.

  • The Requests experience you know and love, with magical parsing abilities.



03

How to use this library


1) For example, let's crawl the official Python website:

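(The original screenshot showed the code; here is a minimal reconstruction based on the requests-html README. The printing loop is my own addition.)

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://python.org/')

# r.html.links is the set of every hyperlink found on the page
for link in r.html.links:
    print(link)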

>>

/about/quotes/

/about/success/#software-development

https://mail.python.org/mailman/listinfo/python-dev

/downloads/release/python-365/

/community/logos/

/community/sigs/

//jobs.python.org

http://tornadoweb.org

https://github.com/python/pythondotorg/issues

/about/gettingstarted/

...

It's that simple: we don't need to worry about HTTP request headers, cookies, or proxies. We just initialize an HTMLSession() object, and while sipping a cup of tea we can call methods directly on the r object, for example to extract all the hyperlinks in the page.


2) Take a look at the useful methods on the r.html object:

print([e for e in dir(r.html) if not e.startswith('_')])

>>

['absolute_links', 'add_next_symbol', 'base_url', 'default_encoding', 

'element', 'encoding', 'find', 'full_text', 'html', 'links', 'lxml', 'next_symbol', 

'page', 'pq', 'raw_html', 'render', 'search', 'search_all', 'session', 'skip_anchors', 'text', 'url', 'xpath']
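
For instance, here is a small, hypothetical illustration of two of them (the template string passed to search is my own example):

# absolute_links resolves relative URLs such as /about/ into full URLs
print(r.html.absolute_links)

# search() extracts text from the raw HTML using a parse-style template
print(r.html.search('<title>{}</title>')[0])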


There are many useful functions here, such as find, search, and search_all, which are very handy! Above we parsed the Python home page; next, let's parse its About menu:



To get the text inside the about element, a single line of find is all we need:

about = r.html.find('#about', first=True)

print(about.text)

>>

About Applications Quotes Getting Started Help Python Brochure

'#about' is a CSS selector matching the element whose id is about (as seen in the browser inspector); setting first to True means that if the selector matches a list of elements, only the first one is returned.
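
Conversely, if we leave out first=True, find() returns the whole list. A small sketch (the '.tier-1' class selector is an assumption based on the menu markup shown below):

# Without first=True, find() returns a list of matching Elements
menus = r.html.find('.tier-1')
print(len(menus))
print(menus[0].text)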


To read the attributes of about:

print(about.attrs)

>>

{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}


To read the links inside about:

about.find('a')

>>

[a list of Element objects, one for each <a> link in the About menu]
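
To pull out the actual URLs, we can read each element's href attribute (a small sketch):

# about.find('a') returns Elements; their attributes live in .attrs
for a in about.find('a'):
    print(a.attrs.get('href'))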


Best of all, this about object already has handles to the various parsing libraries initialized for you, such as the famous pyquery library (a CSS parser) and the lxml library.


Just write doc = about.pq; doc is the element's content parsed as a pyquery object, which we can process very conveniently.
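
A minimal sketch of using those handles, assuming the about element from above:

# .pq exposes the element as a pyquery object, for jQuery-style CSS queries
doc = about.pq
print(doc('a').eq(0).text())

# .lxml exposes the same element as an lxml node, for raw XPath queries
print(about.lxml.xpath('//a/@href'))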



The whole requests_html library acts as a middle layer: it wraps all those tedious page-parsing steps up once more, and it has some killer features inside, such as dynamic rendering of JavaScript pages through a bundled Chromium engine, and an asynchronous session (AsyncHTMLSession) built on Python's excellent asyncio library.
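
Here is a minimal sketch of both features, based on the project README (the URLs are placeholders, and the first call to render() downloads a Chromium build):

from requests_html import HTMLSession, AsyncHTMLSession

# JavaScript support: render() loads the page in Chromium and re-parses it
session = HTMLSession()
r = session.get('https://python.org/')
r.html.render()
print(r.html.links)  # now includes any links injected by JavaScript

# Async parsing: AsyncHTMLSession runs requests on asyncio's event loop
asession = AsyncHTMLSession()

async def get_title(url):
    r = await asession.get(url)
    return r.html.find('title', first=True).text

# run() takes callables that return coroutines and gathers their results
print(asession.run(lambda: get_title('https://python.org/'),
                   lambda: get_title('https://www.python.org/about/')))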


In short, with requests_html, Mom no longer has to worry about me failing to learn web scraping! For more usage, see: https://github.com/kennethreitz/requests-html




