BeautifulSoup4: the beautifulsoup4 library parses, traverses, and maintains the "tag tree" of an HTML or XML document. It is installed the same way as the requests library (pip install beautifulsoup4).
Usage:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')
# Test: fetch a demo page and parse it
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
r.text
demo = r.text
soup = BeautifulSoup(demo, "html.parser")  # parse demo as HTML
soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")  # alternatively, parse a local file
print(soup.prettify())  # prettify() returns the tree as a Unicode string with each HTML/XML tag on its own line
Basic parsers:
Built-in HTML parser: BeautifulSoup(mk, 'html.parser') (requires only bs4)
lxml's HTML parser: BeautifulSoup(mk, 'lxml') (requires lxml)
lxml's XML parser: BeautifulSoup(mk, 'xml') (requires lxml)
html5lib's parser: BeautifulSoup(mk, 'html5lib') (requires html5lib)
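The parser list above can be turned into a small fallback routine. This is a minimal sketch, not part of bs4 itself: make_soup and its preferred argument are hypothetical names, and it assumes that bs4 raises FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib")):
    """Hypothetical helper: try optional parsers first, then fall back."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    # html.parser ships with Python, so it is always available.
    return BeautifulSoup(markup, "html.parser"), "html.parser"


soup, used = make_soup("<p>data</p>")
print(used, soup.p.string)
```

Which parser name is printed depends on what is installed; the parsed <p> content is the same in each case.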
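Basic elements:
Tag: a tag, the basic unit of information, opened with <> and closed with </>
Name: the tag's name, i.e. the word inside <>; accessed as <tag>.name
Attributes: the tag's attributes, held as a dictionary; accessed as <tag>.attrs
NavigableString: the string between the opening and closing tags; accessed as <tag>.string
Comment: the comment part of the string inside a tag; a special NavigableString subclass (CData is another)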
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
r.text
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# print(soup.title)
tag = soup.a  # the <a> tag; when the document has several, only the first is returned
aptag = soup.a.parent.name  # name of the first <a> tag's parent
ta = tag.attrs  # the tag's attributes, as a dictionary
tac = tag.attrs['class']  # value of the class attribute
tah = tag.attrs['href']  # value of the href attribute (the link target)
tat = type(tag.attrs)  # type of the attribute container (dict)
tta = type(tag)  # type of the tag (bs4.element.Tag)
tcont = tag.string  # the string between <a> and </a>
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")  # newsoup.b.string is a Comment; newsoup.p.string is a plain NavigableString
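Because .string strips the comment markers, a comment and ordinary text look the same when printed. The sketch below shows one way to tell them apart by type; it reuses the markup from the notes, and is_comment is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
newsoup = BeautifulSoup(html, "html.parser")

b_str = newsoup.b.string  # a Comment object
p_str = newsoup.p.string  # a plain NavigableString


def is_comment(node):
    """Hypothetical helper: True only for comment nodes."""
    return isinstance(node, Comment)


print(b_str)  # printed without the <!-- --> markers
print(type(b_str).__name__, type(p_str).__name__)
print(is_comment(b_str), is_comment(p_str))
```

Checking the type before using .string avoids treating comment text as page content.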
Downward traversal of the tag tree:
.contents: a list holding all child nodes of <tag>
.children: an iterator over the child nodes; like .contents, but meant for traversing the children in a for loop
.descendants: an iterator over all descendant nodes, for traversing them in a for loop
Upward traversal of the tag tree:
.parent: the node's parent tag
.parents: an iterator over the node's ancestor tags, for traversing them in a for loop
Sibling (parallel) traversal of the tag tree:
Note: sibling traversal only happens among nodes that share the same parent.
.next_sibling: the next sibling node, in HTML document order
.previous_sibling: the previous sibling node, in HTML document order
.next_siblings: an iterator (for loops) over all following sibling nodes, in HTML document order
.previous_siblings: an iterator (for loops) over all preceding sibling nodes, starting from the nearest
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
r.text
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# Downward traversal
sh = soup.head  # the <head> tag
shc = soup.head.contents  # list of the child nodes of <head>
sbc = soup.body.contents  # list of the child nodes of <body>
sn = len(sbc)  # number of child nodes of <body>
# Iterate over the child nodes of <body>
for child in soup.body.children:
    print(child)
# Upward traversal
stp = soup.title.parent  # the parent of <title>
shp = soup.html.parent  # <html> is the top-level tag; its parent is the BeautifulSoup object itself
sop = soup.parent  # the soup object has no parent, so this is None
# Walk up the tag tree from the first <a>
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# Sibling traversal
sans = soup.a.next_sibling  # the node right after the first <a> (may be a text node)
sanbs = soup.a.next_sibling.next_sibling
saps = soup.a.previous_sibling  # the node right before the first <a>
sapspa = soup.a.previous_sibling.previous_sibling  # None: there is no earlier sibling
# Iterate over the preceding siblings, then the following ones
for sibling in soup.a.previous_siblings:
    print(sibling)
for sibling in soup.a.next_siblings:
    print(sibling)
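The traversal attributes return text nodes as well as tags, which is easy to miss when working against a live page. The following is a minimal offline sketch on a made-up inline document (the markup and variable names are assumptions, not the demo page) that filters each traversal down to Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup, written without whitespace between tags so that the
# only text nodes are the visible words.
html = (
    "<html><head><title>Demo</title></head><body>"
    '<p>See <a href="/one" id="link1">one</a> and '
    '<a href="/two" id="link2">two</a>.</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")
first_a = soup.a

# The immediate neighbours of the first <a> are text nodes, not tags.
text_before = first_a.previous_sibling
text_after = first_a.next_sibling

# Keep only the siblings that are real tags.
tag_siblings = [s for s in first_a.next_siblings if isinstance(s, Tag)]

# Names of all ancestors, from the nearest up to the document root.
ancestors = [p.name for p in first_a.parents]

# Names of all tags below <body>, in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(repr(text_before), repr(text_after))
print([s["id"] for s in tag_siblings])
print(ancestors)
print(descendant_tags)
```

The isinstance(node, Tag) check is the piece worth reusing: it is what separates tags from the strings that sit between them.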