BeautifulSoup Overview

Copyright notice: this is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/qq_42658739/article/details/89812692

Importing BeautifulSoup:

from bs4 import BeautifulSoup
# Import the BeautifulSoup class from the bs4 package

Then, once the response comes back, hand its text to a parser with the following syntax:

soup = BeautifulSoup(response.text, features='lxml')

features='lxml' specifies which parser to use. The available parsers are:

html.parser
html5lib
xml (the only parser that supports XML)

For the detailed differences, consult the documentation or a search engine; lxml covers most everyday use.
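As a self-contained sketch of creating a soup (the HTML string here is a made-up stand-in for response.text, and the built-in html.parser is used instead of lxml so no extra package is needed):

```python
from bs4 import BeautifulSoup

# A tiny HTML fragment; in the article this would be response.text.
html = "<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"

# html.parser ships with Python; 'lxml' and 'html5lib' require separate installs.
soup = BeautifulSoup(html, features="html.parser")
print(soup.title.get_text())  # → Demo
```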

1. Accessing a single tag:

print(soup.h1)  # returns the first <h1> tag, markup included
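A minimal sketch of the difference between the tag and its text, using a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<body><h1>Title</h1><h1>Second</h1></body>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1)             # first <h1> tag, markup included: <h1>Title</h1>
print(soup.h1.get_text())  # just the text: Title
```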

2. Finding all tags:

all_a = soup.find_all('a')
print(all_a)
# Find all <a> tags; returns a list of every <a> tag

all_href = [a['href'] for a in all_a]
# Loop over every <a> tag in the list
# and read its 'href' attribute; attribute access works like a dictionary lookup.
print('\n', all_href)  # prints a list of all the href values
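A runnable sketch of the same idea, with made-up links. One caveat worth noting: a['href'] raises KeyError on an <a> that has no href, while a.get('href') returns None:

```python
from bs4 import BeautifulSoup

html = '<a href="/one">1</a><a href="/two">2</a><a>no link</a>'
soup = BeautifulSoup(html, "html.parser")

all_a = soup.find_all("a")
# Use .get() so anchors without an href are skipped instead of raising KeyError.
all_href = [a.get("href") for a in all_a if a.get("href")]
print(all_href)  # → ['/one', '/two']
```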

In fact, BeautifulSoup can do much more. For example:

soup = BeautifulSoup(response.text, features='lxml')
month = soup.find_all('li',{"class" : "month"})  # find all <li> tags whose class attribute is "month"; returns a list of the matching <li> tags
for m in month:
    print(m.get_text())  # get_text() returns the tag's text content (the string inside the tag, not an attribute value)

# or, equivalently:
month = soup.find_all(class_="month")
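A self-contained sketch showing that the two forms find the same tags (the HTML is a made-up stand-in; class_ has the trailing underscore because class is a Python keyword):

```python
from bs4 import BeautifulSoup

html = '<ul><li class="month">Jan</li><li class="month">Feb</li><li>x</li></ul>'
soup = BeautifulSoup(html, "html.parser")

by_attr = soup.find_all("li", {"class": "month"})  # tag name + attribute dict
by_class = soup.find_all(class_="month")           # class_ keyword shortcut
print([m.get_text() for m in by_attr])  # → ['Jan', 'Feb']
```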

find() and find_all() can be applied repeatedly down the tree: calling them on a previous result searches only inside that result, much like a nested loop.

jan = soup.find('ul', {"class": 'jan'})
# find the first <ul> tag whose class attribute is "jan"; the result is a Tag object covering that subtree
d_jan = jan.find_all('li')  # search within that Tag; returns a list of its <li> tags
for d in d_jan:  # loop over the <li> tags to pull out the information we need
    print(d.get_text())
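A runnable sketch of the chained lookup, on a made-up two-list document:

```python
from bs4 import BeautifulSoup

html = '''
<ul class="jan"><li>Jan 1</li><li>Jan 2</li></ul>
<ul class="feb"><li>Feb 1</li></ul>
'''
soup = BeautifulSoup(html, "html.parser")

jan = soup.find("ul", {"class": "jan"})  # find() returns a single Tag (or None)
d_jan = jan.find_all("li")               # searching the Tag stays inside that <ul>
print([d.get_text() for d in d_jan])     # → ['Jan 1', 'Jan 2']
```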

Finding by attribute with BeautifulSoup

soup = BeautifulSoup(response.text,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))  # find all tags whose id is list-1; id can be replaced with any other attribute

# soup.find_all() also has shortcut keyword arguments for id and class
soup.find_all(id='idvalues')
soup.find_all(class_='classvalues')
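A self-contained sketch showing the three equivalent lookups on a made-up fragment:

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1" class="list"><li class="element">Foo</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Three ways to find the same <ul>:
by_attrs = soup.find_all(attrs={"id": "list-1"})  # generic attribute dict
by_id = soup.find_all(id="list-1")                # id shortcut
by_class = soup.find_all(class_="list")           # class_ shortcut
print(by_attrs[0].name)  # → ul
```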

BeautifulSoup with regular expressions

import re

soup = BeautifulSoup(htmlre, features='lxml')

# The first argument of find_all is the tag name; the second is what to match
# (usually an attribute/value dict, sometimes a regular expression).
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
for link in img_links:  # loop over the matched image tags
    print(link['src'])
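A runnable version of the same pattern, with a made-up fragment standing in for htmlre; the regex keeps only src values containing ".jpg":

```python
import re
from bs4 import BeautifulSoup

html = '<img src="a.jpg"><img src="b.png"><img src="c.jpg">'
soup = BeautifulSoup(html, "html.parser")

# A compiled regex as the attribute value matches via re.search
img_links = soup.find_all("img", {"src": re.compile(r".*?\.jpg")})
print([link["src"] for link in img_links])  # → ['a.jpg', 'c.jpg']
```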

Having covered find() and find_all(), let's look at BeautifulSoup's CSS selector interface, the select() method.
An example:

html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Test</title>
</head>
<body>
<div class="panel">
    <div class="panel-heading">
        <h4>Good happy</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Lay</li>
        </ul>
        <ul class="list list-smail" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>

</body>
</html>
'''

1. Selecting by class. Syntax: .classvalue; returns a list of bs4.element.Tag objects (worth noting).

print(soup.select('.panel .panel-heading'))
# CSS-based selection: tags with class="panel-heading" nested under a tag with
# class="panel", the matched tag itself included, returned as a list

print(soup.select('ul li'))  # select all <li> tags inside <ul> tags; returns a list
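A self-contained sketch against a trimmed copy of the sample document above (trimmed here only so the block runs on its own):

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading"><h4>Good happy</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li><li class="element">Bar</li><li class="element">Lay</li>
    </ul>
    <ul class="list list-smail" id="list-2">
      <li class="element">Foo</li><li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator: .panel-heading anywhere under .panel
heading = soup.select(".panel .panel-heading")[0]
print(heading.h4.get_text())           # → Good happy
print(len(soup.select("ul li")))       # → 5
```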

2. Selecting by id. Syntax: #idvalue; also returns a list.

print(soup.select('#list-1'))  # select the tag with id="list-1", its contents and the tag itself
print(soup.select('#list-1 .element'))  # select tags with class="element" inside the tag with id="list-1"

3. Getting attribute values

for ul in soup.select('ul'):
    print(ul['id'])        # read the ul's id attribute directly
    print(ul.attrs['id'])  # or read it via the attrs dictionary

4. Getting a tag's text content

for li in soup.select('li'):
    print(li.get_text())  # get the string inside <tag>string</tag>
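The id selection, attribute access, and text extraction steps can be sketched together on a made-up fragment:

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1">
  <li class="element">Foo</li><li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

ul = soup.select("#list-1")[0]     # select() always returns a list
print(ul["id"], ul.attrs["id"])    # two ways to read an attribute → list-1 list-1
print([li.get_text() for li in soup.select("#list-1 .element")])  # → ['Foo', 'Bar']
```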
