Information Marking and Extraction in Python

First, information marking

1, the significance of marking information

(1) Marked information forms an organizational structure, which adds a dimension to the information

(2) Marked information can be used for communication, storage, or display

(3) The marking structure is as valuable as the information itself

(4) Marked information is easier for programs to understand and use

HTML (Hyper Text Markup Language) is the way information is organized on the WWW (World Wide Web). As hypertext, HTML can embed sound, images, and video in text. It organizes different types of information with predefined <>...</> tags.

2, three forms of information marking

(1) XML

XML (eXtensible Markup Language)

The general form is <name> ... </name>

<img src="china.jpg" size="10">China</img>

The abbreviated form for empty elements is <name />

<img src="china.jpg" size="10" />

Comments are written in the form <!-- -->

<!-- This is a comment, very useful -->
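
As a minimal sketch, the three XML forms above can be read with Python's standard-library xml.etree.ElementTree module. The wrapping <doc> element and the variable names are assumptions made for this illustration and do not come from the course material.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample combining the general form, the empty-element form and a comment.
xml_text = """<doc>
    <!-- This is a comment, very useful -->
    <img src="china.jpg" size="10">China</img>
    <img src="china.jpg" size="10" />
</doc>"""

root = ET.fromstring(xml_text)
images = root.findall("img")

for img in images:
    # attrib holds the tag attributes; text is None for an empty element
    print(img.tag, img.attrib, img.text)
```

The default parser drops comments, so only the two img elements are visited.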

(2) JSON

JSON (JavaScript Object Notation)

Typed key-value pairs, key: value

A single value: "key": "value"

"name": "Beijing Institute of Technology"

Multiple values use [ , ]: "key": ["value1", "value2"]

"name": ["Beijing Institute of Technology", "Yan'an Academy of Natural Sciences"]

Nesting uses { , }: "key": {"subkey": "subvalue"}

"name": {
    "newName": "Beijing Institute of Technology",
    "oldName": "Yan'an Academy of Natural Sciences"
}
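
The sketch below shows how the three JSON forms map onto Python types when loaded with the standard-library json module. The key names "aliases" and "history" are hypothetical; they only serve to place all three forms in one valid document.

```python
import json

# Hypothetical document combining a single value, multiple values and nesting.
json_text = """
{
    "name": "Beijing Institute of Technology",
    "aliases": ["Beijing Institute of Technology", "Yan'an Academy of Natural Sciences"],
    "history": {
        "newName": "Beijing Institute of Technology",
        "oldName": "Yan'an Academy of Natural Sciences"
    }
}
"""

data = json.loads(json_text)

# A single value becomes str, [ , ] becomes list, { , } becomes dict.
for key, value in data.items():
    print(key, type(value).__name__)

print(data["history"]["oldName"])
```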

(3) YAML

YAML (YAML Ain't Markup Language)

Untyped key-value pairs, key: value

A single value: key: value

name: Beijing Institute of Technology

Multiple values: - expresses a parallel relationship

key:
    - value1
    - value2
name:
    - Beijing Institute of Technology
    - Yan'an Academy of Natural Sciences

Nesting uses indentation

key:
    subkey: subvalue

name:
    newName: Beijing Institute of Technology
    oldName: Yan'an Academy of Natural Sciences

| marks a whole block of data, and # denotes a comment

key: | # comment
    value

text: | # School Profile
    Beijing Institute of Technology is a national key university under the Ministry of Industry and Information Technology of the People's Republic of China. It was among the first universities to enter the national "211 Project" and "985 Project", and among the first to enter the Class A ranks of world-class universities. It has been selected as an independent degree-authorization review unit and for the university discipline innovation and talent recruitment program, the higher education innovation capability enhancement program, the outstanding engineer education and training program, the state-sponsored graduate program for building high-level universities, the national undergraduate innovative experiment program, the national undergraduate innovation and entrepreneurship training program, and the new engineering research and practice project. It is an institution that receives Chinese government scholarship students, a national demonstration university for deepening innovation and entrepreneurship education reform, one of the first university bases for the transfer of scientific and technological achievements, and a member of the Industry and Information Technology University Alliance and the China Artificial Intelligence Education Consortium.

3, comparison of the three forms of information marking

Markup language | Features | Applications
XML | The earliest general-purpose information markup language; highly extensible, but cumbersome | Information exchange and transmission on the Internet
JSON | Typed information, well suited to processing by programs (js); more concise than XML | Information communication between cloud and mobile application nodes; no comments
YAML | Untyped information, the highest proportion of text information, good readability | Various system configuration files; comments are easy to read

(1) XML example

<person>
    <firstName>Tian</firstName>
    <lastName>Song</lastName>
    <address>
        <streetAddr>No. 5 Zhongguancun South Street</streetAddr>
        <city>Beijing</city>
        <zipcode>100081</zipcode>
    </address>
    <prof>Computer System</prof><prof>Security</prof>
</person>

(2) JSON example

{
    "firstName": "Tian",
    "lastName": "Song",
    "address": {
        "streetAddr": "No. 5 Zhongguancun South Street",
        "city": "Beijing",
        "zipcode": "100081"
    },
    "prof": ["Computer System", "Security"]
}

(3) YAML example

firstName: Tian
lastName: Song
address:
    streetAddr: No. 5 Zhongguancun South Street
    city: Beijing
    zipcode: 100081
prof:
    - Computer System
    - Security

Second, information extraction

Method | Marking forms | Requirements | Advantages | Shortcomings
Fully parse the marking form of the information, then extract the key information | XML, JSON, YAML | Needs a markup parser, for example tag tree traversal with the bs4 library | Information is parsed accurately | The extraction process is cumbersome and slow
Ignore the marking form and search directly for the key information | Search | A text lookup function applied to the information text is enough | The extraction process is simple and fast | The accuracy of the result depends on the information content
Combine form parsing and searching to extract the key information | XML, JSON, YAML, search | Needs both a markup parser and a text lookup function | (not given) | (not given)
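
To make the difference between the first two methods concrete, here is a small sketch that extracts link URLs from an HTML fragment in both ways. It deliberately uses the standard-library html.parser and re modules rather than bs4 so that it has no dependencies; the fragment, the LinkCollector class, and the variable names are all assumptions made for this illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment modelled on the kind of page this article works with.
html_text = (
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
)


class LinkCollector(HTMLParser):
    """Parse the markup and read href from every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Parsing approach: walk the tags, then pick out the attribute.
collector = LinkCollector()
collector.feed(html_text)
parsed_links = collector.links

# Search approach: ignore the markup and search the raw text directly.
searched_links = re.findall(r'href="([^"]+)"', html_text)

print(parsed_links)
print(searched_links)
```

On this fragment both approaches find the same two URLs. The regular expression would also pick up href attributes on tags other than a, which is the accuracy trade-off noted for the search method.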

Finding HTML content with the bs4 library

Take the document at "http://python123.io/ws/demo.html" as an example

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

<>.find_all(name,attrs,recursive,string,**kwargs)

Returns a list that stores the search results

name: search string for tag names

attrs: search string for tag attribute values; a specific attribute can be named in the search

recursive: whether to search all descendants, default True

string: search string for the string region inside <>...</>

The name parameter:

(1) retrieving the a tags

soup.find_all("a")
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

(2) retrieving the a and b tags

soup.find_all(["a","b"])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

(3) if the argument is True, all tags are retrieved

for tag in soup.find_all(True):
    print(tag.name)
html
head
title
body
p
b
p
a
a

(4) retrieving all tags whose name begins with 'b'

import re  # import the regular expression library
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)
body
b

The attrs parameter:

(1) p tags whose class attribute value is "course"

soup.find_all('p',"course")
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]

(2) tags whose id attribute value is "link1"

soup.find_all(id = "link1")
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]

(3) tags whose id attribute value is exactly "link" (there are none)

soup.find_all(id = "link")
[]

(4) tags whose id attribute value contains "link"

import re  # import the regular expression library
soup.find_all(id = re.compile("link"))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

The recursive parameter:

soup.find_all("a")  # recursive defaults to True
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
soup.find_all("a", recursive = False)  # set recursive to False
[]  # there is no a tag among the direct children of soup

The string parameter:

(1) retrieving string regions that are exactly "Basic Python"

soup.find_all(string = "Basic Python")
['Basic Python']

(2) retrieving string regions that contain "python"

soup.find_all(string = re.compile("python"))
['This is a python demo page', 'The demo python introduces several python courses.']

<tag>(...) is equivalent to <tag>.find_all(...)

soup(...) is equivalent to soup.find_all(...)
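
A short sketch of this shorthand, assuming bs4 is installed. The inline HTML fragment is a hypothetical stand-in for the demo page so that the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical inline fragment modelled on the demo page.
html_text = '<p class="course"><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html_text, "html.parser")

# Calling the soup object is shorthand for soup.find_all(...)
same_for_soup = soup("a") == soup.find_all("a")

# The same holds for any tag object
p = soup.p
same_for_tag = p("a", id="link2") == p.find_all("a", id="link2")

print(same_for_soup, same_for_tag)
print(p("a", id="link2"))
```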

Extension methods of <>.find_all()

Method | Explanation
<>.find() | Searches and returns only the first result, as a single object rather than a list (None if nothing matches); same parameters as .find_all()
<>.find_parents() | Searches the ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent() | Returns a single result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings() | Searches the following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling() | Returns a single result from the following sibling nodes; same parameters as .find()
<>.find_previous_siblings() | Searches the preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling() | Returns a single result from the preceding sibling nodes; same parameters as .find()
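
The following sketch exercises a few of these methods, assuming bs4 is installed. The inline fragment and the variable names are hypothetical; the fragment is a cut-down version of the demo page so that no network access is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline fragment modelled on the demo page.
html_text = '<p class="course"><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html_text, "html.parser")

first = soup.find("a")                          # first matching tag only
parent = first.find_parent("p")                 # nearest ancestor that is a p tag
next_a = first.find_next_sibling("a")           # following sibling that is an a tag
prev_all = next_a.find_previous_siblings("a")   # preceding a siblings, as a list
missing = soup.find("table")                    # nothing matches, so None

print(type(first).__name__, parent.name, next_a["id"], len(prev_all), missing)
```

Note that the single-result methods hand back one object or None, so a check for None is needed before chaining further calls on the result.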

Source: Beijing Institute of Technology, Song Tian, MOOC

Origin www.cnblogs.com/huskysir/p/12454619.html