Three forms of information markup: comparison and extraction methods

  1. Marking information

    Once information is marked, it gains an organizational structure, which adds a dimension to the information.

    Marked information can be used for communication, storage, or display.

    The markup structure is as valuable as the information itself.

    Marked information is easier for programs to understand and use.

  2. HTML as information markup

    HTML is the information organization format of the WWW (World Wide Web).

    HTML organizes different types of information with predefined tags of the form <>...</>.

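  3. The three forms of information markup

    1) XML

    XML marks information with paired tags of the form <name> ... </name>.

    If a tag has no content, it can be abbreviated to a single pair of angle brackets: <name ... />

    Comments can also be embedded in the content: <!-- ... -->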
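The three XML forms described above can be tried out with Python's standard-library xml.etree.ElementTree. This is only a minimal sketch: the person document and its tag and attribute names are invented for illustration and do not come from the original post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this illustration.
XML_TEXT = """
<person>
    <!-- a comment embedded in the content -->
    <name>Alice</name>
    <photo src="alice.jpg" size="10" />
    <city>Beijing</city>
</person>
""".strip()

root = ET.fromstring(XML_TEXT)

# A paired tag carries its content as text.
name = root.find("name").text

# An empty (self-closing) tag has no text, only attributes.
photo = root.find("photo")
photo_attributes = dict(photo.attrib)

# The default parser skips comments, so only the real elements remain.
child_tags = [child.tag for child in root]

print(name, photo_attributes, child_tags)
```

Note that every attribute value comes back as a string: XML itself does not distinguish the number 10 from the text "10".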

    

    2) JSON

      Typed key-value pairs, written "key": value.

      Note:

        Both keys and values must be enclosed in double quotation marks when they are strings; a numeric value is written directly, without quotes.

        If a key has multiple values, they are written in square brackets: [ , ]

        Nested key-value pairs are written in curly braces: { , }
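The JSON rules above can be checked with Python's standard-library json module. The record below is a made-up example (the keys and values are assumptions, not taken from the original post); it shows a quoted string, an unquoted number, a list in [ ] and a nested object in { }.

```python
import json

# Hypothetical record, invented for this illustration.
JSON_TEXT = """
{
    "name": "Alice",
    "age": 30,
    "languages": ["Python", "C"],
    "address": {"city": "Beijing", "zipcode": "100081"}
}
"""

record = json.loads(JSON_TEXT)

# Types survive parsing: the quoted value is a str, the bare number an int.
print(type(record["name"]).__name__, type(record["age"]).__name__)

# [ , ] becomes a list, { , } becomes a nested dict.
print(record["languages"], record["address"]["city"])
```

Because the types are part of the format, a program can use the parsed values directly, which is why JSON is described as well suited to program processing.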

         

    3) YAML

       Untyped key-value pairs, written key: value.

       
       

       Indentation expresses ownership (which key a value belongs to).

       

       A leading - expresses parallel (sibling) items.

       

      A | introduces a whole block of data, and # starts a comment.
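The YAML notations listed above can be seen together in one small document. The sketch below assumes the third-party PyYAML package (imported as yaml) is installed and degrades to returning None if it is not; the sample data and the helper name parse_yaml are invented for illustration.

```python
# PyYAML is a third-party package (an assumption of this sketch);
# fall back gracefully when it is not installed.
try:
    import yaml
except ImportError:
    yaml = None

# Hypothetical sample document, made up for this illustration.
YAML_TEXT = """\
# "#" starts a comment
name: Alice
# indentation shows that city belongs to address
address:
  city: Beijing
# "-" lists parallel items
languages:
  - Python
  - C
# "|" introduces a whole block of text
bio: |
  Likes markup formats.
  Writes crawlers.
"""


def parse_yaml(text):
    """Return the parsed document, or None if PyYAML is unavailable."""
    if yaml is None:
        return None
    return yaml.safe_load(text)


data = parse_yaml(YAML_TEXT)
if data is not None:
    print(data["address"]["city"], data["languages"])
    print(data["bio"])
```

safe_load is used rather than load because it only builds plain Python objects, which is the safer choice for text that may come from outside the program.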

      

  4. Examples of the three markup forms

    1) XML example

      

    2) JSON example

      

    3) YAML example

      

  5. Comparison of the three markup forms

    XML is the earliest general-purpose information markup language; it is highly extensible but verbose.

    JSON carries typed information and is well suited to program processing; it is more compact than XML.

    YAML carries untyped information, has the highest proportion of actual text, and is very readable.

      XML is used for information exchange and transfer on the Internet (HTML also falls into this category).

      JSON is used for communication between the cloud side of mobile applications and their nodes; it has no comments.

      YAML is used in the configuration files of all kinds of systems; it supports comments and is easy to read.

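  6. General methods of information extraction

    Method 1: fully parse the markup form of the information, then extract the key information.

    That is, use a markup parser to parse one of the three markup formats and then pull out the information required, for example by traversing the tag tree in the bs4 library.

      Advantage: the information is parsed accurately.

      Disadvantage: the extraction process is tedious and slow.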
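As a rough illustration of Method 1, the sketch below parses a complete HTML fragment and collects link targets only from real <a> tags. It uses the standard-library html.parser.HTMLParser as a stand-in for the bs4 tag-tree traversal mentioned above, so that it runs without extra packages; the HTML fragment, its example.com URLs and the class name LinkCollector are all made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for this illustration.
HTML_TEXT = """
<html><body>
<p>Courses: <a href="http://example.com/basic" id="link1">Basic</a>
and <a href="http://example.com/advanced" id="link2">Advanced</a>.</p>
<p>Plain text mentioning http://example.com/not-a-link</p>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag the parser meets."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for attr_name, attr_value in attrs:
            if attr_name == "href":
                self.links.append(attr_value)


collector = LinkCollector()
collector.feed(HTML_TEXT)
parsed_links = collector.links
print(parsed_links)
```

Because the whole document is parsed, the URL that appears only in plain text is not reported: the result is accurate, at the cost of walking every tag.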

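    Method 2: ignore the markup form and search for the key information directly.

    Search:

      A text search function applied to the information is all that is needed.

      Advantage: the extraction process is simple and fairly fast.

      Disadvantage: the accuracy of the result depends on the content of the information.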
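Method 2 can be sketched with a plain regular-expression search over the raw text. The fragment and the URL pattern below are assumptions chosen for illustration (the pattern is deliberately simple and is not a complete URL grammar); the point is that the search also picks up a URL that is not a link at all.

```python
import re

# Hypothetical page fragment, invented for this illustration.
HTML_TEXT = """
<html><body>
<p>Courses: <a href="http://example.com/basic" id="link1">Basic</a>
and <a href="http://example.com/advanced" id="link2">Advanced</a>.</p>
<p>Plain text mentioning http://example.com/not-a-link</p>
</body></html>
"""

# Simplistic pattern: "http://" or "https://" followed by anything
# that is not whitespace, a double quote, or an angle bracket.
URL_PATTERN = re.compile(r'https?://[^\s"<>]+')

found_urls = URL_PATTERN.findall(HTML_TEXT)
print(found_urls)
```

The search needs no parser and is a single pass over the text, but it returns three URLs here, including the one that only appears in running text. Whether that is acceptable depends on what the page contains, which is the accuracy trade-off noted above.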

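    Method 3: the combined method

      Combine format parsing with searching to extract the key information.

        This requires both a markup parser and a text search function.

      Example:

        Extract all URL links from an HTML page.

      Approach:

        Search for all <a> tags.

        Parse the format of each <a> tag and extract the link that follows href.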

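import requests
from bs4 import BeautifulSoup  # BeautifulSoup is a class

# Fetch the demo page; the timeout keeps the request from hanging forever
r = requests.get('http://python123.io/ws/demo.html', timeout=10)
r.raise_for_status()  # stop early on an HTTP error status
demo = r.text

# Parse demo with Python's built-in HTML parser
soup = BeautifulSoup(demo, 'html.parser')

# Search for every <a> tag, then print the link stored in its href attribute
for link in soup.find_all('a'):
    print(link.get('href'))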

 

 

 

 

 

        

 


Origin www.cnblogs.com/fb1704011013/p/11111465.html