BeautifulSoup is a Python library that can extract data from HTML or XML files.
Preventing XSS attacks with beautifulsoup4
User-submitted content is filtered with beautifulsoup4.
In practice, the filter should be used as a singleton.
Steps:
- Instantiate the object and parse the page
- Find the target tag
- Empty the illegal tag
- Get the processed string
Operating on tags directly
Example:
from bs4 import BeautifulSoup
content = '''
<div id="i1">
<img src="" id="img">
</div>
<div id="i2"></div>
<script>alert('Hi!')</script>
'''
soup = BeautifulSoup(content, 'html.parser') # <class 'bs4.BeautifulSoup'>
script_tag = soup.find('script') # <class 'bs4.element.Tag'>
script_tag.clear()  # empty the tag's contents
script_tag.hidden = True  # hide the tag itself in the output
content = soup.decode()  # convert the object to a string
print(content)
Output:
<div id="i1">
<img src="" id="img"/>
</div>
<div id="i2"></div>
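Note that find() returns only the first matching tag, so a page with several script tags needs a loop. The following is a minimal sketch of that idea, assuming the same clear-and-hide approach as the example above; the helper name strip_tags and the sample HTML are hypothetical and not part of the original notes.

```python
from bs4 import BeautifulSoup


def strip_tags(html, name="script"):
    """Empty and hide every tag called `name`, then return the filtered string.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(name):
        tag.clear()        # drop the tag's contents
        tag.hidden = True  # keep the tag itself out of the output
    return soup.decode()


html = "<p>ok</p><script>alert(1)</script><script>alert(2)</script>"
result = strip_tags(html)
print(result)
```

Both script tags are handled here, whereas soup.find('script') alone would leave the second one untouched.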
Operating on attributes
Use .attrs to get the tag's attributes as a dictionary, then operate on that dictionary.
Example:
from bs4 import BeautifulSoup
content = '''
<div id="i1">
<img src="" id="img">
</div>
<div id="i2"></div>
<script>alert('Hi!')</script>
'''
soup = BeautifulSoup(content, 'html.parser')
img_tag = soup.find('img')
del img_tag.attrs['id']
content = soup.decode()
print(content)
Output:
<div id="i1">
<img src=""/>
</div>
<div id="i2"></div>
<script>alert('Hi!')</script>
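Because .attrs is an ordinary dictionary, the usual dict operations apply to it. The sketch below is an illustration under assumptions: the sample img tag and the rule of dropping every attribute whose name starts with "on" (inline event handlers such as onerror) are chosen for the example and do not come from the original notes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<img src="a.png" id="img" onerror="alert(1)">', "html.parser")
img_tag = soup.find("img")

print(img_tag.attrs)           # the attributes as a plain dict
img_tag.attrs["alt"] = "logo"  # add or overwrite an attribute

# Copy the keys into a list first, because the dict is modified in the loop.
for key in list(img_tag.attrs):
    if key.startswith("on"):   # assumed rule: drop inline event handlers
        del img_tag.attrs[key]

result = soup.decode()
print(result)
```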
Whitelist
Example:
from bs4 import BeautifulSoup
content = '''
<div id="i1">
<img src="" id="img">
</div>
<div id="i2" class="c1"></div>
<script>alert('Hi!')</script>
'''
tag_p = {
    # allowed tags and the attributes allowed on each
    'div': ['class', ],
    'img': ['src', ],
}
soup = BeautifulSoup(content, 'html.parser')  # <class 'bs4.BeautifulSoup'>
# start filtering
for tag in soup.find_all():
    if tag.name not in tag_p:  # clear tags that are not in the whitelist
        tag.hidden = True
        tag.clear()
        continue
    for k in list(tag.attrs.keys()):  # copy dict.keys() into a list before deleting
        if k not in tag_p[tag.name]:
            del tag.attrs[k]
content = soup.decode()
print(content)
Output:
<div>
<img src=""/>
</div>
<div class="c1"></div>
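The introduction recommends using the filter as a singleton. One possible way to do that is sketched below, assuming a __new__-based singleton; the class name XSSFilter, the method name process, and the attribute name valid_tags are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup


class XSSFilter:
    """Hypothetical singleton wrapper around the whitelist filter."""

    _instance = None

    # Allowed tags and the attributes allowed on each (same whitelist idea as above).
    valid_tags = {
        "div": ["class"],
        "img": ["src"],
    }

    def __new__(cls):
        # Create the instance once and hand back the same object afterwards.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def process(self, content):
        soup = BeautifulSoup(content, "html.parser")
        for tag in soup.find_all():
            allowed = self.valid_tags.get(tag.name)
            if allowed is None:  # tag is not in the whitelist
                tag.hidden = True
                tag.clear()
                continue
            for key in list(tag.attrs):  # copy keys before deleting
                if key not in allowed:
                    del tag.attrs[key]
        return soup.decode()


xss = XSSFilter()
print(xss is XSSFilter())
cleaned = xss.process('<div id="i1" class="c1"><script>alert(1)</script></div>')
print(cleaned)
```

Every call to XSSFilter() returns the same object, so the whitelist is defined in one place and shared by all callers.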
Methods
find_all == findAll == findChildren
find == findChild, equivalent to find_all(...)[0] when a match exists
tag.clear(): empties the contents of the selected tag (the tag itself remains)
tag.hidden = True: hides the tag itself in the output (its contents remain)
tag.attrs: returns the tag's attributes as a dictionary of key: value pairs
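The difference between clearing and hiding a tag is easiest to see side by side. The sketch below uses a made-up snippet of HTML (an assumption for illustration) and only the find and find_all spellings.

```python
from bs4 import BeautifulSoup

html = "<div><b>bold</b><i>italic</i></div>"

# clear(): the contents go away, the tag stays.
soup = BeautifulSoup(html, "html.parser")
soup.find("b").clear()
cleared = soup.decode()
print(cleared)

# hidden = True: the tag is not rendered, its contents still are.
soup = BeautifulSoup(html, "html.parser")
soup.find("b").hidden = True
hidden = soup.decode()
print(hidden)

# find() returns the same object as the first element of find_all().
soup = BeautifulSoup(html, "html.parser")
first = soup.find("i")
same = soup.find_all("i")[0]
print(first is same)
```

This is why the filtering examples call both clear() and hidden = True on an illegal tag: either one alone leaves part of it in the output.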