Python Beautiful Soup 4 module

Beautiful Soup is a Python library for extracting data from HTML and XML files.

Preventing XSS attacks with beautifulsoup4

Content submitted by users can be filtered with beautifulsoup4.
In practice, the filter should be used as a singleton.
Steps:

  1. Instantiate the object and parse the page
  2. Find the target tags
  3. Empty the illegal tags
  4. Convert the processed result back to a string

Operating on tags directly

Example:

from bs4 import BeautifulSoup

content = '''
<div id="i1">
    <img src="" id="img">
</div>
<div id="i2"></div>
<script>alert('Hi!')</script>
'''
soup = BeautifulSoup(content, 'html.parser')    # <class 'bs4.BeautifulSoup'>
script_tag = soup.find('script')   # <class 'bs4.element.Tag'>
script_tag.clear()           # empty the contents of the tag
script_tag.hidden = True     # do not render the tag itself
content = soup.decode()  # convert the object back to a string
print(content)

Output:

<div id="i1">
    <img src="" id="img"/>
</div>
<div id="i2"></div>
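
As an alternative to combining clear() with hidden, Beautiful Soup also provides decompose(), which removes a tag and its contents from the tree in one call. The sketch below is not part of the original example; the shortened content string and the name cleaned are chosen here for illustration.

```python
from bs4 import BeautifulSoup

content = '''
<div id="i1">
    <img src="" id="img">
</div>
<script>alert('Hi!')</script>
'''
soup = BeautifulSoup(content, 'html.parser')

# decompose() removes the tag and everything inside it from the tree
for script_tag in soup.find_all('script'):
    script_tag.decompose()

cleaned = soup.decode()
print(cleaned)
```

Looping over find_all('script') handles pages that contain more than one script tag, which a single find('script') call would miss.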

Operating on attributes

The .attrs property returns the tag's attributes as a dictionary, and the attributes are changed by operating on that dictionary.
Example:

from bs4 import BeautifulSoup

content = '''
<div id="i1">
    <img src="" id="img">
</div>
<div id="i2"></div>
<script>alert('Hi!')</script>
'''
soup = BeautifulSoup(content, 'html.parser')
img_tag = soup.find('img')
del img_tag.attrs['id']
content = soup.decode()
print(content)

Output:

<div id="i1">
    <img src=""/>
</div>
<div id="i2"></div>
<script>alert('Hi!')</script>
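
Because .attrs is an ordinary dictionary, reads, assignments, and safe removals work the same way as the deletion above. A minimal sketch follows; the sample img tag and the variable names attrs_before, has_id, and result are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<img src="a.png" id="img" onerror="alert(1)">', 'html.parser')
img_tag = soup.find('img')

attrs_before = dict(img_tag.attrs)   # copy of the attribute dictionary
img_tag.attrs['alt'] = 'logo'        # add a new attribute or overwrite an existing one
img_tag.attrs.pop('onerror', None)   # remove if present, no KeyError when missing
has_id = 'id' in img_tag.attrs       # membership test on attribute names

result = soup.decode()
print(result)
```

Using pop() with a default avoids the KeyError that del raises when the attribute is absent, which is convenient when the input is untrusted.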

Whitelist

Example:

from bs4 import BeautifulSoup

content = '''
<div id="i1">
<img src="" id="img">
</div>
<div id="i2" class="c1"></div>
<script>alert('Hi!')</script>
'''
tag_p = {
    # allowed tags and, for each tag, its allowed attributes
    'div': ['class', ],
    'img': ['src', ],
}
soup = BeautifulSoup(content, 'html.parser')    # <class 'bs4.BeautifulSoup'>
# start filtering
for tag in soup.find_all():
    if tag.name not in tag_p:   # clear tags that are not in the whitelist
        tag.hidden = True
        tag.clear()
        continue

    # copy the keys into a list first, because attrs is modified inside the loop
    for k in list(tag.attrs.keys()):
        if k not in tag_p[tag.name]:
            del tag.attrs[k]

content = soup.decode()
print(content)

Output:

<div>
<img src=""/>
</div>
<div class="c1"></div>
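
The introduction mentions using a singleton in practice. One possible way to package the whitelist filter is sketched below. The class name XSSFilter, the method name process, and the __new__-based singleton are assumptions made for illustration, not something the original post specifies.

```python
from bs4 import BeautifulSoup


class XSSFilter:
    """Hypothetical singleton wrapper around the whitelist filter."""
    _instance = None

    # allowed tags mapped to their allowed attributes (same whitelist as above)
    allowed = {
        'div': ['class'],
        'img': ['src'],
    }

    def __new__(cls):
        # create the instance once and hand back the same object afterwards
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def process(self, content):
        soup = BeautifulSoup(content, 'html.parser')
        for tag in soup.find_all():
            if tag.name not in self.allowed:
                # not whitelisted: hide the tag and drop its contents
                tag.hidden = True
                tag.clear()
                continue
            for k in list(tag.attrs.keys()):
                if k not in self.allowed[tag.name]:
                    del tag.attrs[k]
        return soup.decode()


dirty = '<div id="i2" class="c1"></div><script>alert("Hi!")</script>'
clean = XSSFilter().process(dirty)
print(clean)
```

Every call to XSSFilter() returns the same object, so the whitelist is defined once and shared by all callers.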
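
Methods

find_all == findAll == findChildren
find == findChild == find_all(...)[0] (find returns None when nothing matches)
tag.clear() empties the contents of the selected tag (the tag itself remains)
tag.hidden = True hides the tag itself in the output (its contents are still rendered)
tag.attrs returns the attributes as a dictionary of key: value pairs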
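
The difference between clear() and hidden, and the relation between find and find_all, can be checked with a short script. This is a sketch on a made-up two-paragraph input; the variable names are chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = '<p>one</p><p>two</p>'

soup = BeautifulSoup(html, 'html.parser')
first = soup.find('p')
same_tag = first is soup.find_all('p')[0]   # find returns the first find_all match
missing = soup.find('span')                 # None when nothing matches

# clear(): the tag stays, its contents are removed
soup_a = BeautifulSoup(html, 'html.parser')
soup_a.find('p').clear()
cleared = soup_a.decode()

# hidden = True: the tag markup is not rendered, its contents still are
soup_b = BeautifulSoup(html, 'html.parser')
soup_b.find('p').hidden = True
hidden = soup_b.decode()

print(cleared)   # <p></p><p>two</p>
print(hidden)    # one<p>two</p>
```

This is why the filtering examples call both clear() and hidden = True on a disallowed tag: either one alone leaves part of it in the output.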

Origin www.cnblogs.com/dbf-/p/10991848.html