My notes on getting started with the BeautifulSoup Python module

Beautiful Soup module

Features

In short: it parses HTML / XML documents and extracts data from them.

Why learn this module? I had only just started learning Python when I noticed how important it is, and I personally find it interesting (ha ha, it is the only way to crawl pictures of pretty girls). I followed a teacher's course on Bilibili (sorry, I don't know the teacher's name), found it useful, and typed out the code as I learned, so I wanted to write these notes down.


Five basic elements:

Basic element | Explanation
Tag | The most basic unit for organizing information; <> and </> mark its beginning and end
Name | The tag's name; for <p>...</p> the name is p. Format: <tag>.name
Attributes | The tag's attributes, organized as a dictionary (keys and values). Format: <tag>.attrs
NavigableString | The non-attribute string inside a tag. Format: <tag>.string
Comment | The comment part of a string inside a tag
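To make the five elements concrete, here is a minimal sketch that runs without a network connection. The inline HTML string is made up for illustration and is not part of the teacher's demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical inline HTML, used only for illustration.
html = '<p class="title" id="intro">Hello</p><b><!--a note--></b>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p                    # Tag: the first <p> in the document
print(isinstance(p, Tag))     # check that p really is a Tag
print(p.name)                 # Name: the tag's name
print(p.attrs)                # Attributes: a dictionary of keys and values
print(type(p.string))         # NavigableString: the text inside the tag
print(type(soup.b.string))    # Comment: the comment inside <b>
```

Note that class comes back as a list, because an HTML tag may carry several classes, while id comes back as a plain string.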

Downward traversal of the tag tree:

Attribute | Explanation
.contents | A list of the tag's child nodes; all direct children of <tag> are stored in the list
.children | An iterator over the child nodes; similar to .contents, used to loop over the direct children
.descendants | An iterator over all descendant nodes (children, grandchildren, and so on), used in a loop
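The difference between direct children and all descendants is easiest to see on a small tree. The sketch below uses a made-up HTML snippet (an assumption for illustration, not the demo page).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical inline HTML, used only for illustration.
html = "<div><p>one</p><p>two <b>bold</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

children = div.contents                        # a list, so len() and indexing work
child_names = [c.name for c in div.children]   # iterate over direct children only
descendants = list(div.descendants)            # every node below <div>, strings included
descendant_tags = [d.name for d in descendants if isinstance(d, Tag)]

print(len(children))
print(child_names)
print(len(descendants))
print(descendant_tags)
```

The <b> tag only shows up in .descendants, because it is a grandchild of <div> rather than a direct child.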

Upward traversal of the tag tree:

Attribute | Explanation
.parent | The node's parent tag
.parents | An iterator over the node's ancestor tags, used to loop over all ancestors
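As a small illustration of walking upward, the sketch below collects the names of every ancestor of an <a> tag. The HTML string is a made-up example.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used only for illustration.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
a = soup.a

print(a.parent.name)                               # the direct parent of <a>
ancestor_names = [parent.name for parent in a.parents]
print(ancestor_names)                              # innermost ancestor first
print(soup.parent)                                 # the soup object itself has no parent
```

The last ancestor is the BeautifulSoup object, whose name is the special string [document]; its own parent is None.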

Sibling (parallel) traversal of the tag tree:

Attribute | Explanation
.next_sibling | Returns the next sibling node, in the order of the HTML text
.previous_sibling | Returns the previous sibling node, in the order of the HTML text
.next_siblings | An iterator that returns all following sibling nodes, in the order of the HTML text
.previous_siblings | An iterator that returns all preceding sibling nodes, in the order of the HTML text
Condition: sibling traversal only happens among the child nodes of the same parent node.
ps: NavigableString objects are also nodes in the tag tree, so you cannot assume that the next sibling is a tag.
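The ps above is a common stumbling block, so here is a minimal sketch of it. The HTML string is made up for illustration, and filtering with isinstance is just one possible way to keep only the tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical inline HTML, used only for illustration.
html = "<p><a>first</a> and <a>second</a></p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

nxt = first.next_sibling          # the text between the two <a> tags, not a tag
print(repr(nxt), isinstance(nxt, Tag))

# Keep only the following siblings that are real tags.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print(tag_siblings)

print(first.previous_sibling)     # nothing comes before the first <a> inside <p>
```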


Examples

I have combined the teacher's code into one listing, and almost every statement has a comment. If you have not learned Python yet and don't want to watch the video, type the code out line by line. Note that nearly everything here is a print statement, so take it slowly: run one print statement at a time, compare the outputs, and analyze them step by step. After finishing one block, remember to comment it out before moving on to the next one; learning this way felt great to me. Just do not comment out the setup code at the top (the imports, the request, and the creation of soup).

[Screenshots of the exercise code: basic elements, downward traversal, upward traversal, sibling traversal]

Here is the source code. Don't just copy it or read through it; type it out yourself, line by line.


import requests
from bs4 import BeautifulSoup

url = "http://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')  # parse with Python's built-in HTML parser


print(demo)
print(soup.prettify())  # compare the two outputs to see what the HTML parser does


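# The five basic elements
tag = soup.a

print(tag)  # the first <a> tag
print(tag.name)  # the tag's name
print(tag.parent.name)  # the name of the <a> tag's parent

print(tag.attrs)  # the tag's attributes (printed as a dictionary)
print(tag.attrs['class'])  # prints ['py1'], the value of the class attribute
print(tag.attrs['href'])  # the value of the href attribute

print(type(tag.attrs))  # the type of the attributes, a dictionary here
print(type(tag))  # the type of the tag itself

print(tag.string)  # the non-attribute string inside the <a> tag
print(soup.p.string)  # the string inside the <p> tag
print(type(soup.p.string))  # NavigableString; .string reaches through the nested <b> tag, so <b> itself is not shown

newsoup = BeautifulSoup("<b><!--this is a comment--></b><p>this is not a comment</p>", "html.parser")
# an HTML comment is written as <!--comment text-->
print(newsoup)
# compare the string types inside the <b> and <p> tags to see how they differ
print(type(newsoup.b.string))
print(type(newsoup.p.string))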


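# Downward traversal of the tag tree
tag = soup.body
print(tag)
print(tag.contents)  # the children of <body>; .contents returns a list
print(len(tag.contents))  # the number of children; because it is a list, items can be accessed by index
print(tag.contents[1])  # the item at index 1, i.e. the second item in the list
for child in tag.children:
    print(child)  # loop over all direct children
for child in tag.descendants:
    print(child)  # loop over all descendants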


# Upward traversal of the tag tree
tag = soup.title
print(tag.parent)  # the parent of the <title> tag
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)  # walk through every ancestor of the <a> tag



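# Sibling traversal of the tag tree
tag = soup.a
print(tag.next_sibling)  # note that the output is a string, not a tag
print(tag.next_sibling.next_sibling)  # the sibling after the next one
print(tag.previous_sibling)  # the sibling just before the <a> tag
for sibling in tag.next_siblings:
    print(sibling)  # loop over all following siblings
for sibling in tag.previous_siblings:
    print(sibling)  # loop over all preceding siblings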

I am recording my learning journey here, and I hope these notes help you too ~~~


Origin blog.csdn.net/qq_43571759/article/details/105029238