Sesame HTTP: Using PyQuery, a powerful tool for Python crawlers

Foreword

Do you find XPath somewhat obscure to use?

Do you find the syntax of BeautifulSoup somewhat hard to remember?

Are you still wrestling with regular expressions and going mad over a single missing dot?

Do you already know enough front-end basics to understand CSS selectors, but get confused by other, stranger selector syntaxes?

Well then, here is good news for front-end people: PyQuery is here. The name should remind you of jQuery. If you are familiar with jQuery, PyQuery is the best choice for parsing documents. It certainly is for me!

PyQuery is a Python library modeled closely on jQuery. Its syntax is almost identical to jQuery's, so you no longer have to memorize odd method names.

Can something this good really exist? I can't wait!

Install

With a tool this good available, why not install it right away?

pip install pyquery

Same old recipe, same familiar taste.

Introduction

The official documentation introduces the library like this:

pyquery allows you to make jQuery queries on XML documents. The API is as similar to jQuery as possible. pyquery uses lxml for fast XML and HTML manipulation.

This is not (or at least not yet) a library to produce or interact with JavaScript code. I just liked the jQuery API and I missed it in Python, so I told myself "Hey, let's make jQuery in Python". This is the result.

It can be used for many purposes. One idea that I might try in the future is to use it for templating with pure http templates that you modify using pyquery. It can also be used for web scraping or for theming applications with Deliverance.

In short: pyquery lets you manipulate XML with jQuery-style syntax, and it relies on lxml to process XML and HTML quickly.

The library is not (at least not yet) meant for interacting with JavaScript; it simply offers an API very much like jQuery's.

Initialization

Four initialization methods are introduced here.

(1) From a string

from pyquery import PyQuery as pq
doc = pq("<html></html>")

You can pass HTML code directly to pq. The resulting doc then plays the same role as the $ symbol in jQuery.

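(2) lxml.etree

from lxml import etree
from pyquery import PyQuery as pq
doc = pq(etree.fromstring("<html></html>"))

You can also parse the code with lxml's etree first and pass the resulting element to pq. Note that etree.fromstring is a strict XML parser and raises an error on malformed markup. If your HTML is incomplete or has missing tags, parse it with etree.HTML instead, which repairs it into a complete, well-structured document.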

(3) From a URL

from pyquery import PyQuery as pq
doc = pq('http://www.baidu.com')

This requests the web page directly, much as if you had fetched the link yourself with urllib2 (urllib.request in Python 3) and then parsed the returned HTML.

(4) From a file

from pyquery import PyQuery as pq
doc = pq(filename='hello.html')

You can pass the path of a local file through the filename argument.

Quick start

Let's use a local file as an example. Create a file named hello.html with the following content:

<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>

Then write the following program:

from pyquery import PyQuery as pq
doc = pq(filename='hello.html')
print(doc.html())
print(type(doc))
li = doc('li')
print(type(li))
print(li.text())

Output:

    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 
<class 'pyquery.pyquery.PyQuery'>
<class 'pyquery.pyquery.PyQuery'>
first item second item third item fourth item fifth item

Think back to jQuery's syntax: doesn't this work the same way?

Notice that initializing PyQuery returns a PyQuery object, and filtering it with a selector returns a PyQuery object again, exactly as in jQuery. It can't get better than that! Now think about what BeautifulSoup and XPath return: a list! That is an object you cannot filter a second time directly (at least not while staying with BeautifulSoup or XPath syntax).

Compared with those, PyQuery wins hands down. I just love it!

Attribute manipulation

You can work with attributes using exactly the same syntax as in jQuery.

from pyquery import PyQuery as pq

p = pq('<p id="hello" class="hello"></p>')('p')
print(p.attr("id"))           # read an attribute
print(p.attr("id", "plop"))   # set an attribute
print(p.attr("id", "hello"))  # set it back

Output:

hello
<p id="plop" class="hello"/>
<p id="hello" class="hello"/>

One more example:

from pyquery import PyQuery as pq

p = pq('<p id="hello" class="hello"></p>')('p')
print(p.addClass('beauty'))
print(p.removeClass('hello'))
print(p.css('font-size', '16px'))
print(p.css({'background-color': 'yellow'}))

Output:

<p id="hello" class="hello beauty"/>
<p id="hello" class="beauty"/>
<p id="hello" class="beauty" style="font-size: 16px"/>
<p id="hello" class="beauty" style="font-size: 16px; background-color: yellow"/>

Still just as elegant!

Notice that this is a sequence of operations in which each call modifies p in place and builds on the previous result. After the calls above, p itself has changed.

DOM manipulation

Again, authentic jQuery syntax:

from pyquery import PyQuery as pq

p = pq('<p id="hello" class="hello"></p>')('p')
print(p.append(' check out <a href="http://reddit.com/r/python"><span>reddit</span></a>'))
print(p.prepend('Oh yes!'))
d = pq('<div class="wrap"><div id="test"><a href="http://cuiqingcai.com">Germy</a></div></div>')
p.prependTo(d('#test'))  # insert p as the first child of #test
print(p)
print(d)
d.empty()                # remove everything inside d
print(d)

Output:

<p id="hello" class="hello"> check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>
<p id="hello" class="hello">Oh yes! check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>
<p id="hello" class="hello">Oh yes! check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>
<div class="wrap"><div id="test"><p id="hello" class="hello">Oh yes! check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p><a href="http://cuiqingcai.com">Germy</a></div></div>
<div class="wrap"/>

This hardly needs explaining: DOM manipulation works the same way as in jQuery.

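Traversal

To traverse a selection, use the items method, which yields each matched element as a PyQuery object, or use each with a callback such as a lambda:

from pyquery import PyQuery as pq
doc = pq(filename='hello.html')
lis = doc('li')
for li in lis.items():
    print(li.html())

# each() calls the function for every element and returns the selection itself
print(lis.each(lambda e: e))

Output:

first item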
<a href="link2.html">second item</a>
<a href="link3.html"><span class="bold">third item</span></a>
<a href="link4.html">fourth item</a>
<a href="link5.html">fifth item</a>
<li class="item-0">first item</li>
 <li class="item-1"><a href="link2.html">second item</a></li>
 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
 <li class="item-1 active"><a href="link4.html">fourth item</a></li>
 <li class="item-0"><a href="link5.html">fifth item</a></li>

In practice, items is the method you will use most often.

Web page requests

PyQuery can also request web pages by itself, and it turns the returned page source into a PyQuery object.

from pyquery import PyQuery as pq
print(pq('http://cuiqingcai.com/', headers={'user-agent': 'pyquery'}))
print(pq('http://httpbin.org/post', {'foo': 'bar'}, method='post', verify=True))

See for yourself: GET and POST both work.
