Python Web Crawler Study Notes (BeautifulSoup4 library: upward, downward, and sibling traversal of the tag tree)

BeautifulSoup4: the Beautiful Soup library is a function library for parsing, traversing, and maintaining the "tag tree". It is installed the same way as the requests library (for example, pip install beautifulsoup4).

Usage:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>data</p>', 'html.parser')

 

# Test

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
r.text
demo = r.text
soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML

soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")  # parse a document read from a local file
print(soup.prettify())  # prettify() returns the Beautiful Soup parse tree as a formatted Unicode string, with each XML/HTML tag on its own line
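
The effect of prettify() can be checked without a network connection. The following is a minimal sketch that assumes a small inline document (the `html` string is made up for illustration and stands in for the downloaded demo page):

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, standing in for the downloaded demo page.
html = "<html><head><title>Demo</title></head><body><p>data</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str with each tag on its own, indented line.
pretty = soup.prettify()
print(pretty)
```

The one-line input comes back spread over several indented lines, which makes the nesting of the tag tree easy to read.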

 

  Basic parsers:

  bs4's HTML parser: BeautifulSoup(mk, 'html.parser') (requires only bs4)

  lxml's HTML parser: BeautifulSoup(mk, 'lxml') (requires lxml)

  lxml's XML parser: BeautifulSoup(mk, 'xml') (requires lxml)

  html5lib's parser: BeautifulSoup(mk, 'html5lib') (requires html5lib)
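
Since lxml and html5lib are optional installs, one possible way to choose a parser is to try them in order and fall back to 'html.parser'. This is only a sketch: the helper name `make_soup` and the preference order are assumptions made for illustration, not part of bs4. It relies on bs4 raising `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first installed parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>data</p>")
print(used, soup.p.string)
```

Whichever parser ends up being used, soup.p.string is the same for this simple fragment; differences between parsers mostly show up on malformed markup.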

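Basic elements:

Tag: a tag, the basic unit of information, delimited by <> and </>

Name: the tag's name, i.e., the name inside <...>; accessed as <tag>.name

Attributes: the tag's attributes, in dictionary form; accessed as <tag>.attrs

NavigableString: the non-attribute string between the opening and closing tags; accessed as <tag>.string

Comment: the comment part of a string inside a tag, a special string type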
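
The first four element kinds can be inspected on a single tag. This is a minimal offline sketch; the `html` string and its example.com URL are placeholders invented for illustration. Note that bs4 treats class as a multi-valued attribute, so its value is a list.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical one-tag document; the URL is a placeholder.
html = '<a class="py1" href="http://example.com/one" id="link1">Basic Python</a>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.a
print(isinstance(tag, Tag))                     # the <a> element is a Tag object
print(tag.name)                                 # tag name
print(tag.attrs)                                # attributes as a dictionary
print(tag.attrs["href"])                        # a single attribute value
print(isinstance(tag.string, NavigableString))  # text between the tags
print(tag.string)
```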

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
r.text
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# print(soup.title)
tag = soup.a  # extract the <a> tag; only the first <a> tag in the document is returned
aptag = soup.a.parent.name  # name of the first <a> tag's parent tag
ta = tag.attrs  # attributes of the tag, returned as a dictionary
tac = tag.attrs['class']  # value of the tag's class attribute
tah = tag.attrs['href']  # link target of the tag
tat = type(tag.attrs)  # type of the attributes object
tta = type(tag)  # type of the tag object
tcont = tag.string  # content between the <a> tags, i.e., the string
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")  # the string inside <b> is of Comment type; its content is "This is a comment"
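
Because .string returns the comment text without the <!-- --> markers, a comment and an ordinary string look alike when printed. One way to tell them apart is to check the type. A small sketch follows; the helper name `is_comment` is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def is_comment(node):
    """Hypothetical helper: True if the node is a Comment rather than plain text."""
    return isinstance(node, Comment)


# Both print as plain text; only the type distinguishes them.
print(newsoup.b.string, is_comment(newsoup.b.string))
print(newsoup.p.string, is_comment(newsoup.p.string))
```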

 

Downward traversal of the tag tree:

.contents: list of child nodes; all of <tag>'s children are stored in a list

.children: iterator over the child nodes; like .contents, but meant for traversing the children in a for loop

.descendants: iterator over all descendant nodes, meant for traversal in a for loop
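
The difference between the three downward attributes shows up on a small tree. This is a sketch using a made-up inline document, not the demo page:

```python
from bs4 import BeautifulSoup

# Hypothetical document: <body> has two <p> children; the first <p> contains a <b>.
html = "<body><p>one <b>bold</b></p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

child_list = body.contents                        # a real list of direct children
child_names = [c.name for c in body.children]     # iterator over the same children
descendant_count = len(list(body.descendants))    # tags and strings at every depth

print(len(child_list), child_names, descendant_count)
```

Here .contents and .children see only the two <p> tags, while .descendants also yields the nested <b> tag and every text string.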

Upward traversal of the tag tree:

.parent: the node's parent tag

.parents: iterator over the node's ancestor tags, meant for traversing the ancestors in a for loop
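
Upward traversal can be sketched on a small inline document (made up for illustration). The last ancestor yielded by .parents is the BeautifulSoup object itself, whose name is "[document]" and whose own parent is None.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a single nested <a> tag.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk from the <a> tag up to the top of the tree.
names = [parent.name for parent in soup.a.parents]
print(names)

print(soup.a.parent.name)   # direct parent of <a>
print(soup.parent)          # the soup object has no parent
```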

Sibling (parallel) traversal of the tag tree:

Note: sibling traversal only happens among nodes that share the same parent.

.next_sibling: returns the next sibling node in HTML document order

.previous_sibling: returns the previous sibling node in HTML document order

.next_siblings: iterator (for use in a for loop) over all following sibling nodes in HTML document order

.previous_siblings: iterator (for use in a for loop) over all preceding sibling nodes in HTML document order
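
A sibling is not necessarily a tag: the text between two tags is a NavigableString and counts as a sibling too. The sketch below uses a made-up inline paragraph to show this:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph: two <a> tags separated by plain text.
html = "<p>start <a id='a1'>one</a> and <a id='a2'>two</a>.</p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

print(repr(first.previous_sibling))        # the text before the first <a>
print(repr(first.next_sibling))            # the text between the two <a> tags
print(first.next_sibling.next_sibling)     # the second <a> tag

following = list(first.next_siblings)      # text, second <a>, trailing text
preceding = list(first.previous_siblings)  # only the leading text
print(len(following), len(preceding))
```

When only tags are wanted, each sibling's type needs to be checked before its tag attributes are used.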

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
r.text
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# downward traversal
sh = soup.head  # the <head> tag
shc = soup.head.contents  # children of the <head> tag
sbc = soup.body.contents  # children of the <body> tag
sn = len(sbc)  # number of children of <body>; the children are stored in a list
# traverse the children of <body>
for child in soup.body.children:
    print(child)

# upward traversal
stp = soup.title.parent  # parent of the <title> tag
shp = soup.html.parent  # <html> is the top-level tag; its parent is the BeautifulSoup object for the whole document
sop = soup.parent  # the soup object has no parent (None)
# traverse upward through the tag tree
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

# sibling traversal
sans = soup.a.next_sibling  # next sibling of the first <a> tag
sanbs = soup.a.next_sibling.next_sibling
saps = soup.a.previous_sibling  # previous sibling of the first <a> tag
sapspa = soup.a.previous_sibling.previous_sibling  # None (empty)
# traverse the preceding siblings, then the following siblings
for sibling in soup.a.previous_siblings:
    print(sibling)
for sibling in soup.a.next_siblings:
    print(sibling)


Origin blog.csdn.net/qq_33360009/article/details/104045047