6. CSS selectors: BeautifulSoup4

Like lxml, Beautiful Soup is an HTML/XML parser; its main purpose is parsing and extracting data from HTML/XML documents.

lxml does only partial traversal, while Beautiful Soup works on the HTML DOM: it loads the whole document and builds the entire DOM tree, so its time and memory overhead is much larger, and its performance is therefore lower than lxml's.

Parsing HTML with BeautifulSoup is relatively simple, and the API is very user-friendly. It supports CSS selectors, the HTML parser in the Python standard library, and the lxml parser for XML and HTML.

Beautiful Soup 3 has stopped development; projects should now use Beautiful Soup 4. It can be installed with pip: pip install beautifulsoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

Parser               Speed     Ease of use   Installation
Regular expressions  fastest   difficult     none (built-in)
BeautifulSoup        slow      easiest       easy
lxml                 fast      easy          moderate

Example:

First, import the bs4 library:

# beautifulsoup4_test.py

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object
soup = BeautifulSoup(html)

# Open a local HTML file to create the object
# soup = BeautifulSoup(open('index.html'))

# Pretty-print the contents of the soup object
print soup.prettify()

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

If we run this under IPython2, we will see a warning:

  • It means that if we do not explicitly specify a parser, the best HTML parser available on the system ("lxml" here) is used by default. Running this code on another system, or in a different virtual environment, may use a different parser and behave differently.

  • We can specify the lxml parser explicitly: soup = BeautifulSoup(html, "lxml")
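To make behavior reproducible, the parser can always be passed explicitly. A minimal sketch (the one-line HTML snippet is just a stand-in; shown with Python 3 print syntax):

```python
from bs4 import BeautifulSoup

html = "<p class='title'><b>The Dormouse's story</b></p>"

# Explicitly choose a parser so behavior is identical across systems.
# "html.parser" ships with the standard library; "lxml" is faster but
# must be installed separately (pip install lxml).
soup_std = BeautifulSoup(html, "html.parser")
print(soup_std.p.b.string)   # The Dormouse's story
```

With an explicit parser, no warning is emitted and results do not depend on which libraries happen to be installed.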

Four types of objects

Beautiful Soup converts a complex HTML document into a complex tree structure in which every node is a Python object. All objects fall into four kinds:

  • Tag
  • NavigableString
  • BeautifulSoup
  • Comment

1. Tag

Put simply, a Tag is an HTML tag, for example:

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

The title, head, a, and p above are HTML tags; a Tag is the tag together with its contents. Let's try using Beautiful Soup to get Tags:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object
soup = BeautifulSoup(html)


print soup.title
# <title>The Dormouse's story</title>

print soup.head
# <head><title>The Dormouse's story</title></head>

print soup.a
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print soup.p
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print type(soup.p)
# <class 'bs4.element.Tag'>

We can easily access the contents of these tags via the soup object and the tag name; the type of these objects is bs4.element.Tag. Note, however, that this returns only the first matching tag in the whole document. Querying all matching tags is covered later.

For a Tag, the two important attributes are name and attrs:

print soup.name
# [document] # the soup object itself is special; its name is [document]

print soup.head.name
# head # for any other tag, the output is the tag's own name

print soup.p.attrs
# {'class': ['title'], 'name': 'dromouse'}
# here we print all attributes of the p tag; the result is a dictionary

print soup.p['class']   # soup.p.get('class')
# ['title'] # the get method works too: pass in the attribute name; the two are equivalent

soup.p['class'] = "newClass"
print soup.p # these attributes and contents can be modified
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

del soup.p['class'] # an attribute can also be deleted
print soup.p
# <p name="dromouse"><b>The Dormouse's story</b></p>

2. NavigableString

Now that we can get a tag, the next question is: how do we get the text inside the tag? Very simple: use .string. For example:

print soup.p.string
# The Dormouse's story

print type(soup.p.string)
# <class 'bs4.element.NavigableString'>

3. BeautifulSoup

The BeautifulSoup object represents the content of the whole document. Most of the time it can be treated as a special Tag; we can get its type, name, and attributes to get a feel for it:

print type(soup.name)
# <type 'unicode'>

print soup.name
# [document]

print soup.attrs # the document itself has no attributes
# {}

4. Comment

A Comment object is a special type of NavigableString; its output does not include the comment markers.

print soup.a
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print soup.a.string
# Elsie 

print type(soup.a.string)
# <class 'bs4.element.Comment'>

The content inside the a tag is actually a comment, but when we output it with .string, the comment markers have already been stripped.
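Because .string silently strips the comment markers, it is safer to check the node's type before trusting the text. A small sketch (Python 3 print syntax; the markup mirrors the example above):

```python
from bs4 import BeautifulSoup, Comment

html = '<a id="link1"><!-- Elsie --></a>'
soup = BeautifulSoup(html, "html.parser")

node = soup.a.string
# Comment is a subclass of NavigableString, so test for it first
if isinstance(node, Comment):
    print("comment node:", node.strip())
else:
    print("text node:", node)
```

Here the check prints "comment node: Elsie", letting the caller decide whether comment text should be treated as real content.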

Traversing the document tree

1. Direct children: the .contents and .children attributes

.contents

A tag's .contents attribute outputs the tag's direct children as a list:

print soup.head.contents 
#[<title>The Dormouse's story</title>]

The output is a list, so we can use list indexing to get any one of its elements:

print soup.head.contents[0]
#<title>The Dormouse's story</title>

.children

This does not return a list, but we can get all the children by iterating over it.

Printing .children shows that it is a list iterator object:

print soup.head.children
#<listiterator object at 0x7f71457f5710>

for child in  soup.body.children:
    print child

Output:

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
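.contents and .children expose the same direct children, one as a list and one as a generator. A quick sketch comparing the two (Python 3 print syntax; the snippet reuses the head/title fragment from above):

```python
from bs4 import BeautifulSoup

html = "<head><title>The Dormouse's story</title></head>"
soup = BeautifulSoup(html, "html.parser")

kids_list = soup.head.contents          # a real list, indexable
kids_iter = list(soup.head.children)    # a generator, materialized here

print(kids_list == kids_iter)           # True: same direct children
print(kids_list[0].name)                # title
```

Use .contents when you need indexing, and .children when a lazy iteration is enough.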

2. All descendant nodes: the .descendants attribute

The .contents and .children attributes include only a tag's direct children; the .descendants attribute recursively iterates over all of a tag's descendants. As with .children, we need to traverse it to get the contents.

for child in soup.descendants:
    print child

Output:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...

3. Node content: the .string attribute

If a tag has only one child of type NavigableString, .string returns that child. If a tag has exactly one child tag, .string can also be used, and the result is the same: the .string of that only child.

Put simply: if a tag contains no other tags, .string returns the text inside it; if it contains exactly one tag, .string returns the innermost content. For example:

print soup.head.string
#The Dormouse's story
print soup.title.string
#The Dormouse's story
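One caveat worth knowing: if a tag has more than one child, .string returns None, so the behavior above only holds for single-child tags. A small sketch (Python 3 print syntax; the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

html = "<p>Hello <b>world</b></p><span>plain</span>"
soup = BeautifulSoup(html, "html.parser")

print(soup.span.string)   # plain  (exactly one text child)
print(soup.p.string)      # None   (two children: text + <b> tag)

# get_text() concatenates all descendant text instead:
print(soup.p.get_text())  # Hello world
```

When a tag may hold mixed content, get_text() is the safer way to extract its text.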

Searching the document tree

1.find_all(name, attrs, recursive, text, **kwargs)

1) The name parameter

The name parameter finds all tags whose name matches; string objects are automatically ignored.

A. Passing a string

The simplest filter is a string. Given a string argument, Beautiful Soup looks for tags that exactly match it. The following example finds all <b> tags in the document:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

print soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
B. Passing a regular expression

If a regular expression is passed, Beautiful Soup matches tag names against it with match(). The following example finds all tags whose names start with b, which means the <body> and <b> tags are found:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
C. Passing a list

If a list is passed, Beautiful Soup returns content matching any element of the list. The following code finds all <a> and <b> tags in the document:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

2) Keyword arguments

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
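Because class is a reserved word in Python, searching by CSS class uses the class_ keyword (or an attrs dictionary) rather than a plain keyword argument. A short sketch (Python 3 print syntax; the markup is a trimmed-down version of the three-sisters example):

```python
from bs4 import BeautifulSoup

html = ('<a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a>')
soup = BeautifulSoup(html, "html.parser")

# class_ avoids the clash with the reserved word "class"
sisters = soup.find_all("a", class_="sister")
print(len(sisters))                            # 2

# equivalent: pass the attributes as a dictionary
same = soup.find_all("a", attrs={"class": "sister"})
print(sisters == same)                         # True
```

Any tag attribute (id, href, name, ...) can be used as a keyword filter in the same way.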

3) The text parameter

The text parameter searches the string content of the document. Like name, it accepts a string, a regular expression, or a list:

soup.find_all(text="Elsie")
# [u'Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]
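Note that in newer Beautiful Soup releases (4.4+, to my understanding) the text argument was renamed string; the old spelling still works. A sketch showing both (Python 3 print syntax; the markup is invented for illustration):

```python
import re
from bs4 import BeautifulSoup

html = "<p>The Dormouse's story</p><p>Another story</p>"
soup = BeautifulSoup(html, "html.parser")

# old spelling
print(soup.find_all(text=re.compile("story")))
# newer spelling, same result
print(soup.find_all(string=re.compile("story")))
```

Both calls return the two matching strings, so existing code using text keeps working.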

CSS selectors

This is another way of searching, similar in spirit to the find_all method.

  • When writing CSS, tag names carry no prefix, class names are prefixed with ., and id names are prefixed with #.

  • We can filter elements in the same way here with soup.select(), whose return type is a list.

(1) Finding by tag name

print soup.select('title') 
#[<title>The Dormouse's story</title>]

print soup.select('a')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print soup.select('b')
#[<b>The Dormouse's story</b>]

(2) Finding by class name

print soup.select('.sister')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Finding by id

print soup.select('#link1')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(4) Combined selectors

Selectors combine on the same principle as in a CSS file: tag names, class names, and id names can be mixed. For example, to find the element with id link1 inside a p tag, separate the two with a space:

print soup.select('p #link1')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

To find a direct child tag, separate with >:

print soup.select("head > title")
#[<title>The Dormouse's story</title>]

(5) Finding by attribute

Attribute filters can be added to an element; the attribute must be enclosed in square brackets. Note that the attribute and the tag name refer to the same node, so no space may appear between them, otherwise nothing will match.

print soup.select('a[class="sister"]')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print soup.select('a[href="http://example.com/elsie"]')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Similarly, attribute selectors can still be combined as above: a space separates different nodes; no space means the same node.

print soup.select('p a[href="http://example.com/elsie"]')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(6) Getting the content

The select method returns a list; we can iterate over it and then call get_text() on each element to get its content.

soup = BeautifulSoup(html, 'lxml')
print type(soup.select('title'))
print soup.select('title')[0].get_text()

for title in soup.select('title'):
    print title.get_text()

 

Case study: a crawler with BeautifulSoup4

We use the Tencent recruitment page as our demo: http://hr.tencent.com/position.php?&start=10#a

 

Use the BeautifulSoup4 parser to extract each position's title, category, number of openings, work location, publish time, and details link from the page, and store them.

# bs4_tencent.py


from bs4 import BeautifulSoup
import urllib2
import urllib
import json    # store the results in JSON format

def tencent():
    url = 'http://hr.tencent.com/'
    request = urllib2.Request(url + 'position.php?&start=10#a')
    response =urllib2.urlopen(request)
    resHtml = response.read()

    output =open('tencent.json','w')

    html = BeautifulSoup(resHtml,'lxml')

    # build CSS selectors
    result = html.select('tr[class="even"]')
    result2 = html.select('tr[class="odd"]')
    result += result2

    items = []
    for site in result:
        item = {}

        name = site.select('td a')[0].get_text()
        detailLink = site.select('td a')[0].attrs['href']
        catalog = site.select('td')[1].get_text()
        recruitNumber = site.select('td')[2].get_text()
        workLocation = site.select('td')[3].get_text()
        publishTime = site.select('td')[4].get_text()

        item['name'] = name
        item['detailLink'] = url + detailLink
        item['catalog'] = catalog
        item['recruitNumber'] = recruitNumber
        item['workLocation'] = workLocation
        item['publishTime'] = publishTime

        items.append(item)

    # disable ASCII escaping so the output is written as UTF-8
    line = json.dumps(items,ensure_ascii=False)

    output.write(line.encode('utf-8'))
    output.close()

if __name__ == "__main__":
   tencent()
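The script above is Python 2 (urllib2, print statements), but the row-parsing logic ports to Python 3 unchanged. A minimal sketch of just that logic, run against an inline HTML fragment shaped like one row of the recruitment table (the fragment itself is invented for illustration; fetching would use urllib.request in Python 3):

```python
import json
from bs4 import BeautifulSoup

# Invented stand-in for one "even" row of the recruitment table
html = '''
<tr class="even">
  <td><a href="position_detail.php?id=1">Engineer</a></td>
  <td>Technology</td><td>2</td><td>Shenzhen</td><td>2019-08-27</td>
</tr>
'''
soup = BeautifulSoup(html, "html.parser")

items = []
for row in soup.select('tr[class="even"]'):
    cells = row.select("td")
    items.append({
        "name": cells[0].get_text(),
        "detailLink": cells[0].a["href"],
        "catalog": cells[1].get_text(),
        "recruitNumber": cells[2].get_text(),
        "workLocation": cells[3].get_text(),
        "publishTime": cells[4].get_text(),
    })

# ensure_ascii=False keeps non-ASCII text readable in the JSON output
print(json.dumps(items, ensure_ascii=False))
```

Indexing the td cells once per row avoids the repeated site.select('td') calls of the original.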

 

Origin www.cnblogs.com/steven9898/p/11425165.html