Python crawler learning (5): HTML content retrieval with the bs4 library


(1) Marking of information

  • Marked-up information forms an organized structure, which adds a dimension to the information

  • The structure of the markup is as important as the information itself

  • Marked-up information can be used for communication, storage, or display

  • Marked-up information is easier for programs to understand and use

  • Example: HTML is the way the WWW (World Wide Web) organizes information

  • HTML organizes different types of information through predefined <>…</> tags

(2) Three forms of information marking

1.XML

<!-- XML (eXtensible Markup Language) -->

<!-- Tag -->
<!-- The tag name is img, followed by its attributes -->
<img src="china.jpg" size="10">...</img>
<!-- Shorthand for an empty element -->
<img src="china.jpg" size="10" />
<name>...</name>
<name />
<!-- Comments are written like this -->
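
As a minimal sketch of how a program reads this markup, the snippet below parses the two img forms with Python's standard xml.etree.ElementTree module. The surrounding gallery root element is an assumption added for illustration, because an XML document needs a single root.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the two <img> forms wrapped in an assumed root element.
xml_text = """
<gallery>
    <img src="china.jpg" size="10"></img>
    <img src="china.jpg" size="10" />
</gallery>
"""

root = ET.fromstring(xml_text)
images = root.findall("img")

for img in images:
    # .tag is the element name, .attrib is a dict of its attributes
    print(img.tag, img.attrib)

# The long form and the empty-element shorthand carry the same data.
same_attributes = images[0].attrib == images[1].attrib
print(same_attributes)
```

Note that attribute values are always read back as strings, so size comes out as "10" rather than a number.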

2.JSON

  • The JSON specification does not support comments; the // lines below are explanatory notes only and are not valid JSON
  • The main reason is to keep comments from cluttering a format whose job is to carry data
JSON (JavaScript Object Notation)

// Typed key-value pairs, key : value
// "name" is the key; the double quotes mark it as a string
"name" : "靓仔"
// Multiple values are grouped with [ , ]
"name" : ["靓仔", "美眉"]
// Nested key-value pairs use { , }
"name" : {
    "newName" : "钢铁侠二代" ,
    "oldName" : "钢铁侠一代"
}

// Three written forms
"key" : "value"
"key" : ["value1", "value2"]
"key" : {"subkey" : "subvalue"}

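The three JSON forms can be checked with Python's standard json module. This is a small sketch, not part of the original notes: the key names single, multiple, and nested are made up so that the three forms fit into one object.

```python
import json

# Hypothetical object holding the three written forms under made-up key names.
json_text = """
{
    "single": "value",
    "multiple": ["value1", "value2"],
    "nested": {"subkey": "subvalue"}
}
"""

data = json.loads(json_text)

# Each form maps to a different Python type.
type_names = {key: type(value).__name__ for key, value in data.items()}
print(type_names)

# A // comment inside the text makes the parser reject it.
try:
    json.loads('{"key": "value"  // a comment\n}')
    comment_accepted = True
except json.JSONDecodeError:
    comment_accepted = False
print(comment_accepted)
```
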
3.YAML

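YAML (YAML Ain't Markup Language)

# Untyped key-value pairs, key : value
# name is just a plain string
name : 靓仔
# Indentation expresses ownership (use spaces; YAML does not allow tabs for indentation)
name :
    newName : 钢铁侠二代
    oldName : 钢铁侠一代
# - expresses parallel items (a space is required after the dash)
name :
- 钢铁侠二代
- 钢铁侠一代
# | expresses a whole block of data
text: |    # sample filler text
    sdadadwafaqagerghehreqtggfqegqeg
    regrqegrqegeqgreqgregqegqergqert

# Three written forms
key : value
key : # Comment
- value1
- value2
key :
    subkey : subvalue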

(3) Comparison of three forms of information marking

  • XML

      • The earliest general-purpose information markup language; highly extensible but verbose

      • Used for information exchange and transmission on the Internet

  • JSON

      • Information is typed, which suits processing by programs (especially JavaScript); more concise than XML

      • Used for communication between mobile apps, cloud services, and nodes; comments are not allowed

  • YAML

      • Information is untyped; it has the highest proportion of actual text to markup and reads well

      • Used for configuration files of all kinds of systems; comments make it easy to read

(4) Information extraction and method

  • Information extraction: extract the content of interest from the marked information

1. General method 1

  • Fully parse the markup form of the information (XML, JSON, YAML), then extract the key information
  1. Requires a markup parser, for example tag-tree traversal with the bs4 library
  2. Advantage: the information is parsed accurately
  3. Disadvantage: the extraction process is cumbersome and slow
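
A rough sketch of this approach with bs4 might look like the following. The sample HTML and its example.com URLs are invented for illustration; the whole document is parsed into a tree first, and the tree is then walked node by node.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample page, not taken from the original notes.
html = """
<html><body>
<p class="title"><b>Courses</b></p>
<p class="course">Learn <a href="http://example.com/python" id="link1">Python</a>
and <a href="http://example.com/bs4" id="link2">bs4</a>.</p>
</body></html>
"""

# Step 1: parse the complete markup into a tag tree.
soup = BeautifulSoup(html, "html.parser")

# Step 2: walk every node of the tree, keeping tags and skipping text nodes.
tag_names = [node.name for node in soup.descendants if isinstance(node, Tag)]
print(tag_names)

# Step 3: pick out the key information while walking.
hrefs = [node["href"] for node in soup.descendants
         if isinstance(node, Tag) and node.name == "a"]
print(hrefs)
```

The result is accurate because it relies on the parsed structure, at the cost of parsing and traversing the whole document.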

2. General method 2

  • Ignore the markup form and search directly for the key information
  1. Search: apply a text search function to the information
  2. Advantage: the extraction process is simple and fast
  3. Disadvantage: the accuracy of the results depends on the content of the information
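
As an illustration, a plain regular expression search over the raw text could be sketched like this. The sample HTML and the pattern are assumptions chosen for the demo; the commented-out link shows how accuracy depends on the content.

```python
import re

# Hypothetical raw page text; the second link sits inside an HTML comment.
html = """
<p>Learn <a href="http://example.com/python">Python</a>.</p>
<!-- <a href="http://example.com/old">retired link</a> -->
"""

# Search the text directly for href="..." without looking at the tag structure.
urls = re.findall(r'href="([^"]+)"', html)
print(urls)
```

The search is a single fast call, but it also returns the URL from the commented-out link, because it knows nothing about the structure of the page.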

3. Fusion method

  • Combine markup parsing with text search to extract the key information (XML, JSON, YAML, search)
  • Requires both a markup parser and a text search function

4. Examples
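
One possible example of the fusion method, assuming the task is to collect every link URL in an HTML page: search for all a tags, then read the href attribute from each parsed tag. The sample HTML below is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the last <a> is an anchor without an href.
html = """
<p>Learn <a href="http://example.com/python">Python</a>,
<a href="http://example.com/bs4">bs4</a> and <a name="top">more</a>.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Search step: find every <a> tag.
# Parse step: read href from the tag's attributes, skipping tags without one.
links = [a["href"] for a in soup.find_all("a") if a.has_attr("href")]
print(links)
```
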


(5) HTML content search method based on bs4 library

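# Returns a list that stores the search results
<>.find_all(name, attrs, recursive, string, **kwargs)

import re  # regular expression library, needed for re.compile below

# Find tags named a, then tags named a or b
soup.find_all('a')
soup.find_all(['a','b'])
# Find p tags whose class attribute is course
soup.find_all('p','course')
# Find tags by attribute value; re.compile allows partial matches
soup.find_all(id='link1')
soup.find_all(id=re.compile('link'))

# Because find_all is used so often, there is a shorthand
<tag>(..) is equivalent to <tag>.find_all(..)
soup(..)  is equivalent to soup.find_all(..)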
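
To try these calls end to end, here is a small runnable sketch. The sample HTML, its example.com URLs, and the variable names are assumptions made for the demo.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page used only for this demonstration.
html = """
<html><body>
<p class="title"><b>Courses</b></p>
<p class="course">Learn <a href="http://example.com/python" id="link1">Python</a>
and <a href="http://example.com/bs4" id="link2">bs4</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

a_tags = soup.find_all("a")                           # by tag name
a_and_b = soup.find_all(["a", "b"])                   # any of several names
course_ps = soup.find_all("p", "course")              # name plus class value
exact_id = soup.find_all(id="link1")                  # exact attribute match
partial_id = soup.find_all(id=re.compile("link"))     # regex attribute match
direct_only = soup.find_all("a", recursive=False)     # direct children only
by_text = soup.find_all(string=re.compile("Python"))  # match text content
shortcut = soup("a")                                  # same as soup.find_all("a")

print(len(a_tags), [t.name for t in a_and_b], len(course_ps))
print(len(exact_id), len(partial_id), direct_only, by_text)
```

With recursive=False the search returns an empty list here, because the a tags are nested several levels below the top of the document.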

Parameter    Explanation
name         Search string for the tag name
attrs        Search string for tag attribute values
recursive    Whether to search all descendants; defaults to True
string       Search string for the text between <>…</>

Method                         Explanation
<>.find()                      Search and return only the first result (a single element, or None if nothing matches)
<>.find_parents()              Search the ancestor nodes and return a list
<>.find_parent()               Return one result from the ancestor nodes
<>.find_next_siblings()        Search the following sibling nodes and return a list
<>.find_next_sibling()         Return one result from the following sibling nodes
<>.find_previous_siblings()    Search the preceding sibling nodes and return a list
<>.find_previous_sibling()     Return one result from the preceding sibling nodes
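
The following sketch exercises these methods on a tiny list. The HTML and the id values are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three sibling <li> items inside <ul> inside <div>.
html = ('<div id="menu"><ul>'
        '<li id="first">One</li><li id="second">Two</li><li id="third">Three</li>'
        '</ul></div>')
soup = BeautifulSoup(html, "html.parser")

second = soup.find("li", id="second")   # a single element
missing = soup.find("table")            # None, since nothing matches

parent_div = second.find_parent("div")
ancestor_names = [t.name for t in second.find_parents(["ul", "div"])]

next_item = second.find_next_sibling("li")
previous_item = second.find_previous_sibling("li")

first = soup.find("li", id="first")
after_first = [t["id"] for t in first.find_next_siblings("li")]

third = soup.find("li", id="third")
before_third = [t["id"] for t in third.find_previous_siblings("li")]

print(second.get_text(), missing, parent_div["id"], ancestor_names)
print(next_item["id"], previous_item["id"], after_first, before_third)
```

The list-returning variants start from the nearest node and move outward, so the preceding siblings of the third item come back as second, then first.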
Origin blog.csdn.net/qq_39419113/article/details/105635138