(5) HTML content retrieval based on the bs4 library
(1) Marking of information
- Marked information forms an organizational structure, which adds a dimension to the information
- The structure of the marks is as important as the information itself
- Marked information can be used for communication, storage, or display
- Marked information is easier for programs to understand and use
- Example: HTML is the information organization method of the WWW (World Wide Web)
- HTML organizes different types of information through predefined <>…</> tags
(2) Three forms of information marking
1. XML
<!-- XML (eXtensible Markup Language) -->
<!-- tag -->
<!-- the name img is followed by its attributes -->
<img src="china.jpg" size="10">...</img>
<!-- abbreviated form for an empty element -->
<img src="china.jpg" size="10" />
<name>...</name>
<name />
<!-- comments use this form -->
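To make the tag, attribute, and empty-element forms concrete, here is a minimal sketch that parses a small XML fragment with Python's standard `xml.etree.ElementTree` module. The `<gallery>` wrapper element, the caption text, and the variable names are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a <gallery> root wrapping the two element forms.
xml_text = """
<gallery>
    <img src="china.jpg" size="10">caption text</img>
    <img src="china.jpg" size="10" />
</gallery>
""".strip()

root = ET.fromstring(xml_text)
images = root.findall("img")

# Attributes are exposed as a dict on each element.
first_attrs = images[0].attrib
# An empty element has no text content, so .text is None.
second_text = images[1].text

print(first_attrs["src"], first_attrs["size"])
print(images[0].text, second_text)
```

Both forms produce an element named `img` with the same attributes; only the text content differs.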
2. JSON
- The JSON specification does not support comments
- The main purpose is to keep comments from cluttering a file whose job is to carry data
JSON (JavaScript Object Notation); the // notes below are explanations, not valid JSON
// typed key-value pairs, key : value
// the quotes mark "name" as a string type
"name" : "Handsome Guy"
// multiple values are grouped with [ , ]
"name" : ["Handsome Guy", "Pretty Girl"]
// nested key-value pairs use { , }
"name" : {
  "newName" : "Iron Man II",
  "oldName" : "Iron Man I"
}
// three written forms
"key" : "value"
"key" : ["value1", "value2"]
"key" : {"subkey" : "subvalue"}
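3. YAML
YAML (YAML Ain't Markup Language)
# untyped key-value pairs, key : value
# name is just a string
name : Handsome Guy
# indentation expresses ownership
name :
  newName : Iron Man II
  oldName : Iron Man I
# - expresses a parallel (list) relationship
name :
  - Iron Man II
  - Iron Man I
# | expresses a whole block of data
text: |  # sample gibberish
  sdadadwafaqagerghehreqtggfqegqeg
  regrqegrqegeqgreqgregqegqergqert
# three written forms
key : value
key :  # Comment
  - value1
  - value2
key :
  subkey : subvalue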
(3) Comparison of three forms of information marking
- XML
  - The earliest general-purpose information markup language; highly extensible but verbose
  - Used for information exchange and transmission on the Internet
- JSON
  - Information is typed, which suits program processing (js); more concise than XML
  - Used for communication between the cloud side of mobile applications and nodes; no comments
- YAML
  - Information is untyped, has the highest proportion of actual text content, and is highly readable
  - Used for configuration files of all kinds of systems; supports comments and is easy to read
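The contrast between a verbose XML and a more concise JSON can be checked on a small record. The sketch below serializes the same hypothetical record both ways using only the standard library and compares the lengths. The record contents are assumptions for illustration, and YAML is left out because parsing it needs a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record used only for this comparison.
record = {"newName": "Iron Man II", "oldName": "Iron Man I"}

# JSON: typed key-value pairs wrapped in braces.
json_text = json.dumps({"name": record})

# XML: every field needs both an opening and a closing tag.
root = ET.Element("name")
for key, value in record.items():
    child = ET.SubElement(root, key)
    child.text = value
xml_text = ET.tostring(root, encoding="unicode")

print(len(json_text), json_text)
print(len(xml_text), xml_text)
```

For this record the XML text comes out longer, because each field name is written twice. A single small record is only an illustration, not a general measurement.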
(4) Information extraction methods
- Information extraction: extract the content of interest from the marked information
1. General Method One
- Fully parse the markup form of the information, then extract the key information
- (XML, JSON, YAML)
- Requires a markup parser, for example the tag tree traversal of the bs4 library
- Advantage: the information is parsed accurately
- Disadvantage: the extraction process is cumbersome and slow
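As a rough sketch of Method One applied to JSON, the code below first parses the whole document with the standard `json` module and only then walks the resulting tree to collect the values of interest. The sample data and the helper `collect_values` are hypothetical and exist only for this illustration.

```python
import json

# Hypothetical marked-up data; the structure is an assumption for illustration.
json_text = """
{
    "name": "Iron Man I",
    "friends": [
        {"name": "Iron Man II"},
        {"name": "Handsome Guy"}
    ]
}
"""

def collect_values(node, key):
    """Walk the fully parsed tree and collect every value stored under `key`."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found

tree = json.loads(json_text)          # step 1: parse the whole markup
names = collect_values(tree, "name")  # step 2: extract the key information
print(names)
```

The result is exact because the parser knows which strings are keys and which are values, at the cost of parsing everything first.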
2. General Method Two
- Ignore the markup form and search directly for the key information
- Search: only a text search function over the information is needed
- Advantage: the extraction process is simple and fast
- Disadvantage: the accuracy of the results depends on the information content
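A minimal sketch of Method Two, assuming the text search is done with Python's standard `re` module: the text is never parsed, only searched. The sample text and both patterns are hypothetical. The looser pattern shows how accuracy depends on what the content happens to contain.

```python
import re

# Hypothetical raw text; it is searched as plain text and never parsed.
raw_text = """
{
    "name": "Iron Man I",
    "friends": [
        {"name": "Iron Man II"},
        {"nickname": "Handsome Guy"}
    ]
}
"""

# Strict pattern: the key must be exactly "name".
strict_names = re.findall(r'"name"\s*:\s*"([^"]*)"', raw_text)

# Loose pattern: any key ending in name" matches, including "nickname".
loose_names = re.findall(r'name"\s*:\s*"([^"]*)"', raw_text)

print(strict_names)
print(loose_names)
```

The loose search also picks up the `nickname` value, which is the kind of inaccuracy a pure text search can introduce.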
3. Fusion method
- Combine markup parsing with text search to extract the key information
- (XML, JSON, YAML, search)
- Requires both a markup parser and a text search function
4. Examples
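One possible example of the fusion method is collecting every URL link in an HTML page: the bs4 parser locates the `<a>` tags, and a regular expression filters their `href` values. This is a minimal sketch. The page content and the `example.com` URLs are placeholders invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page; the URLs are placeholders.
html = """
<html><body>
<p class="course">Courses:
<a href="http://www.example.com/basic" id="link1">Basic Python</a> and
<a href="http://www.example.com/advanced" id="link2">Advanced Python</a>.
<a id="link3">No link here</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Markup parsing: locate every <a> tag in the tag tree.
# Text search: keep only href values that look like URLs.
url_pattern = re.compile(r"^https?://")
links = []
for tag in soup.find_all("a"):
    href = tag.get("href")  # None when the tag has no href attribute
    if href is not None and url_pattern.search(href):
        links.append(href)

print(links)
```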
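(5) HTML content search methods based on the bs4 library
# returns a list that stores the search results
<>.find_all(name, attrs, recursive, string, **kwargs)
# search for tags named a, then for tags named a or b
soup.find_all('a')
soup.find_all(['a','b'])
# search for p tags whose class attribute value is course
soup.find_all('p','course')
# search by a specific attribute; import the re (regular expression) library for pattern matching
import re
soup.find_all(id='link1')
soup.find_all(id=re.compile('link'))
# find_all is used so often that a shorthand is provided
# <tag>(..) is equivalent to <tag>.find_all(..)
# soup(..) is equivalent to soup.find_all(..)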
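The calls above can be tried on a small page. The following sketch assumes a hypothetical HTML snippet (its tags, classes, and ids are invented for the demonstration) and uses Python's built-in `html.parser` backend.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical demo page.
html = """
<html><body>
<p class="title"><b>Demo page</b></p>
<p class="course">Courses:
<a href="http://www.example.com/basic" class="py1" id="link1">Basic Python</a> and
<a href="http://www.example.com/advanced" class="py2" id="link2">Advanced Python</a>.
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

a_tags = soup.find_all("a")                    # by tag name
a_or_b = soup.find_all(["a", "b"])             # by a list of tag names
course_p = soup.find_all("p", "course")        # <p> tags with CSS class "course"
link1 = soup.find_all(id="link1")              # exact attribute value
links = soup.find_all(id=re.compile("link"))   # attribute value matched by regex
shorthand = soup("a")                          # same as soup.find_all("a")

print([t.name for t in a_or_b])
print([t["id"] for t in links])
```

Every call returns a list, even when only one tag matches, so results are normally indexed or iterated.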
Parameter | Explanation |
---|---|
name | Search string for the tag name |
attrs | Search string for tag attribute values |
recursive | Whether to search all descendants (True, the default) or only direct children (False) |
string | Search string for the text between <>…</> |
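The `recursive` and `string` parameters are easiest to see side by side. This is a small sketch on a hypothetical one-line document, not taken from the notes themselves.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document: one <a> nested inside <p>, one directly under <div>.
html = "<div><p>Basic <a href='#'>Python</a></p><a href='#'>Advanced Python</a></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

all_links = div.find_all("a")                      # recursive=True by default
direct_links = div.find_all("a", recursive=False)  # only direct children of <div>

exact = soup.find_all(string="Python")                # text nodes equal to "Python"
partial = soup.find_all(string=re.compile("Python"))  # text nodes containing "Python"

print(len(all_links), len(direct_links))
print([str(s) for s in exact], [str(s) for s in partial])
```

A plain string must match the whole text node, while a compiled regular expression matches any text node that contains the pattern.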
Method | Explanation |
---|---|
<>.find() | Search and return only the first result (a single element rather than a list, or None if nothing matches) |
<>.find_parents() | Search the ancestor nodes and return a list |
<>.find_parent() | Search the ancestor nodes and return one result |
<>.find_next_siblings() | Search the following sibling nodes and return a list |
<>.find_next_sibling() | Search the following sibling nodes and return one result |
<>.find_previous_siblings() | Search the preceding sibling nodes and return a list |
<>.find_previous_sibling() | Search the preceding sibling nodes and return one result |
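The difference between the single-result and list-returning variants can be sketched as follows. The HTML snippet and its ids are hypothetical, chosen so that the middle link has one sibling on each side.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with three sibling links inside one paragraph.
html = """
<html><body>
<p class="course"><a id="link1">Basic</a><a id="link2">Middle</a><a id="link3">Advanced</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

middle = soup.find("a", id="link2")   # a single Tag, not a list
missing = soup.find("table")          # None when nothing matches

parent = middle.find_parent("p")                                    # nearest matching ancestor
ancestor_names = [t.name for t in middle.find_parents(["p", "body"])]  # nearest first

next_one = middle.find_next_sibling("a")           # one following sibling
previous_all = middle.find_previous_siblings("a")  # list of preceding siblings

print(parent.name, ancestor_names)
print(next_one["id"], [t["id"] for t in previous_all])
```

Because `find()` can return None, its result is worth checking before accessing attributes on it.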