First, information marking
1, the significance of information marking
(1) Marking gives information an organizational structure and adds a dimension to it
(2) Marked information can be used for communication, storage, or display
(3) The marking structure is as valuable as the information itself
(4) Marked information is easier for programs to understand and use
HTML (HyperText Markup Language) is the information organization format of the WWW (World Wide Web). As hypertext, it can embed sound, images, and video in text, and it organizes different types of information through predefined <>...</> tags.
2, three forms of information marking
(1) XML
XML (eXtensible Markup Language)
General form: <name> ... </name>
    <img src="china.jpg" size="10">China</img>
Abbreviated form for empty elements: <name />
    <img src="china.jpg" size="10" />
Comments are written as <!-- -->
    <!-- This is a comment, very useful -->
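As a rough illustration of the three XML forms above, the sketch below parses them with Python's standard-library xml.etree.ElementTree. The wrapping <gallery> root element is an assumption, added only because an XML document needs a single root; it is not part of the original example.

```python
import xml.etree.ElementTree as ET

# <gallery> is a hypothetical root element; the inner lines mirror the forms above.
XML_TEXT = """
<gallery>
    <!-- This is a comment, very useful -->
    <img src="china.jpg" size="10">China</img>
    <img src="china.jpg" size="10" />
</gallery>
""".strip()

root = ET.fromstring(XML_TEXT)
images = root.findall("img")

for img in images:
    # Attributes arrive as a dict of strings; an empty element has no text.
    print(img.tag, img.attrib, img.text)
```

The default parser drops the comment, so only the two img elements remain as children of the root.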
(2) JSON
JSON (JavaScript Object Notation)
Typed key-value pairs: "key": "value"
Single value: "key": "value"
    "name": "Beijing Institute of Technology"
Multiple values use [ , ]: "key": ["value1", "value2"]
    "name": ["Beijing Institute of Technology", "Yan'an Academy of Natural Sciences"]
Nesting uses { , }: "key": {"subkey": "subvalue"}
    "name": {
        "newName": "Beijing Institute of Technology",
        "oldName": "Yan'an Academy of Natural Sciences"
    }
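To make the "typed" property concrete, here is a minimal sketch that loads a JSON document with Python's standard json module and prints the Python type each value becomes. The key names "names", "history", and "zipcode" are hypothetical, chosen only so that all three forms fit into one document; the values reuse the examples above.

```python
import json

# Hypothetical document combining the single-value, multi-value, and nested forms.
JSON_TEXT = """
{
    "name": "Beijing Institute of Technology",
    "names": ["Beijing Institute of Technology", "Yan'an Academy of Natural Sciences"],
    "history": {
        "newName": "Beijing Institute of Technology",
        "oldName": "Yan'an Academy of Natural Sciences"
    },
    "zipcode": 100081
}
"""

data = json.loads(JSON_TEXT)

for key, value in data.items():
    # Quoted values become str, [ ] becomes list, { } becomes dict, bare numbers become int.
    print(key, type(value).__name__)
```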
(3) YAML
YAML (YAML Ain't Markup Language)
Untyped key-value pairs: key: value
Single value: key: value
    name: Beijing Institute of Technology
Multiple values: - expresses a parallel relationship
    key:
        - value1
        - value2

    name:
        - Beijing Institute of Technology
        - Yan'an Academy of Natural Sciences
Nesting: indentation expresses ownership
    key:
        subkey: subvalue

    name:
        newName: Beijing Institute of Technology
        oldName: Yan'an Academy of Natural Sciences
| introduces a whole block of data; # denotes a comment
    content: | # comment
        value

    text: | # School Profile
        Beijing Institute of Technology is affiliated with the Ministry of Industry and
        Information Technology of the People's Republic of China. It is a national key
        university, among the first admitted to the national "211 Project" and "985 Project",
        and among the first Class A universities in the world-class university program.
        It has been selected as a unit with independent review authority for degree
        authorization and for the Higher Education Discipline Innovation and Talent
        Introduction Program, the Higher Education Innovation Capability Program, the
        Outstanding Engineer Education and Training Program, the State-Sponsored Graduate
        Program for Building High-Level Universities, the National Undergraduate Innovative
        Experiment Program, the National Undergraduate Innovation and Entrepreneurship
        Training Program, and the New Engineering Research and Practice Project. It is a
        receiving institution for Chinese Government Scholarship students, a national
        demonstration university for deepening innovation and entrepreneurship education
        reform, and among the first university bases for the commercialization of scientific
        and technological achievements and technology transfer. It is a member of the
        Ministry of Industry and Information Technology university alliance and of the
        China joint conference on AI education.
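The YAML forms above can be checked with a short sketch. It assumes the third-party PyYAML package (`pip install pyyaml`), which is not mentioned in the original notes, and it uses a two-line stand-in for the long school profile. Note that although YAML carries no type markers in its syntax, this particular parser still infers Python types from the plain text.

```python
import yaml  # third-party PyYAML package; an assumption, not part of the original notes

YAML_TEXT = """\
name:
  newName: Beijing Institute of Technology
  oldName: Yan'an Academy of Natural Sciences
prof:
  - Computer System
  - Security
# A comment on its own line is ignored by the parser.
text: |
  Beijing Institute of Technology
  is a national key university.
"""

data = yaml.safe_load(YAML_TEXT)

print(data["name"]["oldName"])   # nested mapping
print(data["prof"])              # "-" items become a list
print(repr(data["text"]))        # "|" keeps the line breaks of the block
```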
3, comparison of the three forms of information marking
Markup language | Features | Applications |
XML | The earliest general-purpose markup language; highly extensible but verbose | Information exchange and transmission on the Internet |
JSON | Typed information, well suited to program processing (JS), and more concise than XML | Information communication between cloud and mobile application nodes; no comments |
YAML | Untyped information, the highest proportion of actual text, and good readability | Configuration files for all kinds of systems; comments are easy to read |
(1) XML example
    <person>
        <firstName>Tian</firstName>
        <lastName>Song</lastName>
        <address>
            <streetAddr>No. 5 Zhongguancun South Street</streetAddr>
            <city>Beijing</city>
            <zipcode>100081</zipcode>
        </address>
        <prof>Computer System</prof><prof>Security</prof>
    </person>
(2) JSON example
    {
        "firstName": "Tian",
        "lastName": "Song",
        "address": {
            "streetAddr": "No. 5 Zhongguancun South Street",
            "city": "Beijing",
            "zipcode": "100081"
        },
        "prof": ["Computer System", "Security"]
    }
(3) YAML example
    firstName: Tian
    lastName: Song
    address:
        streetAddr: No. 5 Zhongguancun South Street
        city: Beijing
        zipcode: 100081
    prof:
        - Computer System
        - Security
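The claim that XML is more verbose than JSON can be checked on the person record itself. The sketch below serializes one Python dict both ways using only the standard library; the helper append_value is hypothetical and is just one simple way to map dicts to child elements and lists to repeated tags.

```python
import json
import xml.etree.ElementTree as ET

person = {
    "firstName": "Tian",
    "lastName": "Song",
    "address": {
        "streetAddr": "No. 5 Zhongguancun South Street",
        "city": "Beijing",
        "zipcode": "100081",
    },
    "prof": ["Computer System", "Security"],
}


def append_value(parent, tag, value):
    """Hypothetical helper: dicts become child elements, lists become repeated tags."""
    if isinstance(value, list):
        for item in value:
            append_value(parent, tag, item)
    elif isinstance(value, dict):
        child = ET.SubElement(parent, tag)
        for key, item in value.items():
            append_value(child, key, item)
    else:
        ET.SubElement(parent, tag).text = str(value)


root = ET.Element("person")
for key, value in person.items():
    append_value(root, key, value)

xml_text = ET.tostring(root, encoding="unicode")
json_text = json.dumps(person)

# Every XML field repeats its name in a closing tag, so the XML form is longer.
print(len(xml_text), len(json_text))
```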
Second, information extraction
Method | Means | Condition | Advantage | Shortcoming |
Fully parse the marked form, then extract the key information | XML, JSON, YAML | Requires a marking parser, for example tag tree traversal with the bs4 library | Information is parsed accurately | The extraction process is cumbersome and slow |
Ignore the marked form and search directly for the key information | Search | Only a text lookup function on the information is needed | The extraction process is simple and fast | The accuracy of the results depends on the information content |
Combine form parsing and searching to extract the key information | XML, JSON, YAML, search | Requires a marking parser and a text lookup function | × | × |
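The trade-off between the first two methods can be sketched with the standard library alone. Below, html.parser stands in for the "full parse" approach and a regular expression for the "direct search" approach. The HTML snippet, including its style.css stylesheet link, is a hypothetical input made up for this comparison.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: two course links plus a stylesheet link in the head.
HTML = (
    '<html><head><link rel="stylesheet" href="style.css"></head><body>'
    '<a href="http://www.icourse163.org/course/BIT-268001">Basic Python</a>'
    '<a href="http://www.icourse163.org/course/BIT-1001870001">Advanced Python</a>'
    '</body></html>'
)


class LinkCollector(HTMLParser):
    """Method 1: parse the markup and keep href values of <a> tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


collector = LinkCollector()
collector.feed(HTML)
parsed_links = collector.links

# Method 2: ignore the markup and search the raw text for href="...".
searched_links = re.findall(r'href="([^"]*)"', HTML)

print(parsed_links)
print(searched_links)
```

The search is a single line but also picks up the stylesheet link, which shows how its accuracy depends on what the text happens to contain.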
Finding HTML content with the bs4 library
Take the document at "http://python123.io/ws/demo.html" as an example
    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> r = requests.get("http://python123.io/ws/demo.html")
    >>> demo = r.text
    >>> soup = BeautifulSoup(demo, "html.parser")
    >>> print(soup.prettify())
    <html>
     <head>
      <title>
       This is a python demo page
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The demo python introduces several python courses.
       </b>
      </p>
      <p class="course">
       Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
       <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
        Basic Python
       </a>
       and
       <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
        Advanced Python
       </a>
       .
      </p>
     </body>
    </html>
<>.find_all(name,attrs,recursive,string,**kwargs)
Returns a list that stores the search results
name: search string for tag names
attrs: search string for tag attribute values; a specific attribute can be named
recursive: whether to search all descendants, default True
string: search string for the string region inside <>...</>
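The parameters listed above can be tried side by side in one short sketch. It assumes the bs4 package is installed; DEMO_HTML is a trimmed local stand-in for the demo page (an assumption, so the example runs without network access).

```python
from bs4 import BeautifulSoup

# Trimmed local stand-in for the demo page; not fetched from the network.
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(DEMO_HTML, "html.parser")

by_name = soup.find_all("a")                      # name: tag name
by_attrs = soup.find_all("p", "course")           # attrs: a bare string is matched against class
by_keyword = soup.find_all(id="link2")            # **kwargs: filter on a named attribute
by_string = soup.find_all(string="Basic Python")  # string: text inside <>...</>
direct_only = soup.find_all("a", recursive=False) # recursive: direct children only

# Combining parsing and searching: collect the URL of every link.
urls = [link.get("href") for link in by_name]

print(len(by_name), len(by_attrs), len(by_keyword), by_string, direct_only)
print(urls)
```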
name parameter:
(1) Retrieve the a tags
    >>> soup.find_all("a")
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
(2) Retrieve the a and b tags
    >>> soup.find_all(["a", "b"])
    [<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
(3) If the parameter is True, all tags are retrieved
    >>> for tag in soup.find_all(True):
    ...     print(tag.name)
    ...
    html
    head
    title
    body
    p
    b
    p
    a
    a
(4) Retrieve all tags whose names begin with 'b'
    >>> import re  # import the regular expression library
    >>> for tag in soup.find_all(re.compile('^b')):  # ^ anchors the match at the start of the name
    ...     print(tag.name)
    ...
    body
    b
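One detail of regular-expression filters is worth a small sketch: Beautiful Soup applies a compiled pattern with its search() method, so an unanchored 'b' matches any tag name that contains b, while '^b' matches only names that begin with it. The HTML below is a hypothetical snippet chosen so that the two patterns give different results.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet: "table" contains a b but does not start with one.
HTML = "<html><body><b>bold</b><table><tr><td>cell</td></tr></table></body></html>"
soup = BeautifulSoup(HTML, "html.parser")

containing_b = [tag.name for tag in soup.find_all(re.compile("b"))]
starting_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]
from_list = [tag.name for tag in soup.find_all(["b", "td"])]  # a list matches any listed name

print(containing_b)
print(starting_with_b)
print(from_list)
```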
attrs parameter:
(1) p tags with the attribute value "course"
    >>> soup.find_all('p', 'course')
    [<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
(2) Tags whose id attribute value is "link1"
    >>> soup.find_all(id="link1")
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
(3) Tags whose id attribute value is "link" (the match is exact, so nothing is found)
    >>> soup.find_all(id="link")
    []
(4) Tags whose id attribute value contains "link"
    >>> import re  # import the regular expression library
    >>> soup.find_all(id=re.compile("link"))
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
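Attribute filters can also be written in a few other equivalent ways, sketched below. The class_ keyword (with a trailing underscore, because class is a Python keyword) and the attrs dictionary are part of bs4's documented interface; DEMO_HTML is again a trimmed local stand-in for the demo page, assumed here so the example runs offline.

```python
import re

from bs4 import BeautifulSoup

# Trimmed local stand-in for the demo page; not fetched from the network.
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(DEMO_HTML, "html.parser")

course_paragraphs = soup.find_all("p", class_="course")       # explicit class filter
first_link = soup.find_all(attrs={"id": "link1"})             # attrs given as a dict
all_links = soup.find_all(id=re.compile("^link"))             # regex on an attribute value
by_href = soup.find_all("a", href=re.compile("BIT-268001"))   # any attribute can be a keyword

print(len(course_paragraphs), len(first_link), len(all_links), len(by_href))
```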
recursive parameter:
    >>> soup.find_all("a")  # recursive defaults to True
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
    >>> soup.find_all("a", recursive=False)  # set recursive to False
    []
The second result is empty because soup has no a tags among its direct children.
string parameter:
(1) Retrieve strings that are exactly "Basic Python"
    >>> soup.find_all(string="Basic Python")
    ['Basic Python']
(2) Retrieve strings that contain "python"
    >>> soup.find_all(string=re.compile("python"))
    ['This is a python demo page', 'The demo python introduces several python courses.']
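The string search above is case-sensitive, which is why "Basic Python" does not match the pattern "python". A minimal sketch of the difference follows, assuming bs4 is installed and using a trimmed local stand-in for the demo page rather than the live URL.

```python
import re

from bs4 import BeautifulSoup

# Trimmed local stand-in for the demo page; not fetched from the network.
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(DEMO_HTML, "html.parser")

case_sensitive = soup.find_all(string=re.compile("python"))
case_insensitive = soup.find_all(string=re.compile("python", re.IGNORECASE))

# Each result is a string node; .parent gives the tag that encloses it.
enclosing_tags = [text.parent.name for text in case_sensitive]

print(len(case_sensitive), len(case_insensitive), enclosing_tags)
```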
<tag>(...) is equivalent to <tag>.find_all(...)
soup(...) is equivalent to soup.find_all(...)
Extension methods of <>.find_all()
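The shorthand can be confirmed with a very small sketch; the HTML here is a hypothetical two-paragraph snippet, not the demo page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two paragraphs and two links.
HTML = (
    '<html><body>'
    '<p><a id="link1">Basic Python</a></p>'
    '<p><a id="link2">Advanced Python</a></p>'
    '</body></html>'
)
soup = BeautifulSoup(HTML, "html.parser")

links_long = soup.find_all("a")
links_short = soup("a")        # calling the object is shorthand for find_all
paragraphs = soup.body("p")    # the same shorthand works on any tag

print(links_short == links_long, len(paragraphs))
```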
Method | Explanation |
<>.find() | Searches and returns only the first result (None if nothing matches); same parameters as .find_all() |
<>.find_parents() | Searches the ancestor nodes and returns a list; same parameters as .find_all() |
<>.find_parent() | Returns a single result from the ancestor nodes; same parameters as .find() |
<>.find_next_siblings() | Searches the following sibling nodes and returns a list; same parameters as .find_all() |
<>.find_next_sibling() | Returns a single result from the following sibling nodes; same parameters as .find() |
<>.find_previous_siblings() | Searches the preceding sibling nodes and returns a list; same parameters as .find_all() |
<>.find_previous_sibling() | Returns a single result from the preceding sibling nodes; same parameters as .find() |
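A short sketch of these extension methods follows, assuming bs4 is installed. DEMO_HTML is a trimmed local stand-in for the demo page so that the example runs offline. In practice the single-result methods hand back a tag object, or None when nothing matches, which the sketch prints for inspection.

```python
from bs4 import BeautifulSoup

# Trimmed local stand-in for the demo page; not fetched from the network.
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(DEMO_HTML, "html.parser")

first_link = soup.find("a")                    # first match only
missing = soup.find("table")                   # no match
parent = first_link.find_parent("p")           # nearest enclosing <p>
ancestor_names = [tag.name for tag in first_link.find_parents()]
next_link = first_link.find_next_sibling("a")  # later sibling at the same level
earlier_links = next_link.find_previous_siblings("a")

print(type(first_link).__name__, missing)
print(parent["class"], ancestor_names)
print(next_link["id"], [tag["id"] for tag in earlier_links])
```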