Python3 crawler (6): parsing data with Beautiful Soup

1. Basic concepts

Preface:
Beautiful Soup is a Python library for parsing HTML and XML; with it you can easily extract data from web pages.
Beautiful Soup has become as capable a Python parsing tool as lxml and html5lib, flexibly offering users different parsing strategies or raw speed.
Beautiful Soup automatically converts input documents to Unicode and converts output documents to UTF-8 encoding.
Beautiful Soup's HTML and XML parsing depends on the lxml library, so make sure lxml is installed: pip install lxml. A common error is bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library? The fix is to uninstall lxml (pip uninstall lxml) and reinstall it (pip install lxml); if that does not help, restart PyCharm.
Make sure the Beautiful Soup module is installed before use; the latest version is 4.x: pip install beautifulsoup4
Beautiful Soup is usually imported from the bs4 package: from bs4 import BeautifulSoup
When parsing, Beautiful Soup actually relies on a parser; besides the HTML parser in the Python standard library, it also supports several third-party parsers (such as lxml).

Parser:
As noted above, Beautiful Soup relies on a parser when it does its work. Comparing the options, lxml can parse both HTML and XML, is fast, and is fault-tolerant, so lxml is generally used. To select it, pass 'lxml' as the second argument when initializing Beautiful Soup:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)
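As a quick runnable check (a minimal sketch, assuming bs4 is installed), the parser also completes broken markup; here the standard-library-backed 'html.parser' is used so no extra install is needed:

```python
from bs4 import BeautifulSoup

# A deliberately broken fragment: the closing </p> tag is missing.
# Whatever parser Beautiful Soup wraps will complete it.
soup = BeautifulSoup('<p>Hello', 'html.parser')
print(soup.p.string)  # Hello
print(soup.prettify())
```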






Key usage:

2. The node selector

'''***********************************************Node selector*********************************************************'''
from bs4 import BeautifulSoup
html = '''
<div>
    <td class="nobr player desktop">
        <a href="bucks" class="ng-binding" target="_parent" 
        href1="/teams/#!/bucks"><!-- ngIf: row.clinched -->密尔沃基&nbsp;雄鹿<b>nba</b></a>
    </td>
    <tr data-ng-repeat="(i, row) in page" index="0" class="ng-scope">
        <td class="nobr center bold ng-binding" href="href01">6</td>
        <td class="nobr center bold desktop ng-binding">18&nbsp;-&nbsp;4</td>
        <td class="nobr center bold desktop ng-binding">胜 6</td>
        <td class="nobr center bold desktop ng-binding">119.5</td>
    </tr>
</div>
<p>
<li class="nobr player top">nba</li>
</p>
'''
soup=BeautifulSoup(html,'lxml') # initialize the HTML string as a BeautifulSoup object

'''*********************************21 knowledge points on the node selector*************************'''
#Standardized output
r1=soup.prettify()# output the parsed string with standard indentation; missing or broken tags are also corrected automatically
print("r1:",r1)

#Simple node navigation
r2=soup.div.td.a # navigate to the specified node
print(type(r2)) # output: <class 'bs4.element.Tag'>
print("r2",r2) # output: <a class="ng-binding" href="bucks" href1="/teams/#!/bucks" target="_parent"><!-- ngIf: row.clinched -->密尔沃基 雄鹿<b>nba</b></a>

#Get the text or the node name
r3=soup.tr.td.string # the string attribute returns the text inside the tag
print("r3",r3) # output: 6. Note: there are several td nodes; by default only the first is selected and the rest are ignored.
r4=soup.tr.td.get_text() # get_text() also returns the text inside the tag
print("r4:",r4) # output: 6
r5=soup.tr.td.name # name returns the node's tag name
print('r5',r5) # output: td

#Get attribute values
r6=soup.td.a['href'] # get one attribute's value, shorthand form
print("r6",r6) # output: bucks
r7=soup.td.a.attrs  # attrs returns all of a node's attribute-value pairs as a dict
print('r7',r7)# {'href': 'bucks', 'class': ['ng-binding'], 'target': '_parent', 'href1': '/teams/#!/bucks'}
r8=soup.td.a.attrs['href'] # get one attribute's value, dict-index form
print('r8',r8) # bucks

#Get child and descendant nodes
r9=soup.div.td.contents  # contents returns td's direct children as a list
print('r9',r9) # ['\n', <a class="ng-binding" href="bucks" href1="/teams/#!/bucks" target="_parent"><!-- ngIf: row.clinched -->密尔沃基 雄鹿<b>nba</b></a>, '\n']
r10=soup.div.td.children  # children also returns td's direct children
print('r10',r10) # returns an iterator: <list_iterator object at 0x0000019741443470>
for i, child in enumerate(soup.div.td.children):
    print(i, child)# output:
    '''
    0 
    1 <a class="ng-binding" href="bucks" href1="/teams/#!/bucks" target="_parent"><!-- ngIf: row.clinched -->密尔沃基 雄鹿<b>nba</b></a>
    2 
    '''
r11=soup.div.td.descendants  # to get all descendant nodes, use the descendants attribute
print("r11",r11) # returns a generator
for i01, child01 in enumerate(soup.div.td.descendants):
    print(i01, child01)# output:
    '''
    0 
    1 <a class="ng-binding" href="bucks" href1="/teams/#!/bucks" target="_parent"><!-- ngIf: row.clinched -->密尔沃基 雄鹿<b>nba</b></a>
    2  ngIf: row.clinched 
    3 密尔沃基 雄鹿
    4 <b>nba</b>
    5 nba
    6
    '''
#Get the parent and ancestor nodes
r12=soup.a.parent # parent returns a node's direct parent
print("r12",r12) # output: <td class="nobr player desktop">....</td>
r13=soup.a.parents # parents returns all ancestor nodes
print("r13",r13)# returns a generator: <generator object PageElement.parents at 0x0000016669855480>
for i, parent in enumerate(soup.a.parents):
    print(i, parent) # output omitted

#Get sibling nodes
r14=soup.div.tr.td.next_sibling.next_sibling # get the second following sibling; note that newline text often counts as a node
print('r14:',r14)# r14: <td class="nobr center bold desktop ng-binding">18 - 4</td>

r15=soup.p.previous_sibling.previous_sibling  # get the second preceding sibling; newline text often counts as a node
print('r15:',r15) # r15: <div>......

r16=soup.div.tr.td.next_siblings   # get all following siblings
print(type(r16),r16) # <class 'generator'> <generator object PageElement.next_siblings at 0x000001E6F16F4390>
print(list(enumerate(soup.div.tr.td.next_siblings)))# convert to a list of tuples for output
# output: [(0, '\n'), (1, <td class="nobr center bold desktop ng-binding">18 - 4</td>), (2, '\n'), (3, <td class="nobr center bold desktop ng-binding">胜 6</td>), (4, '\n'), (5, <td class="nobr center bold desktop ng-binding">119.5</td>), (6, '\n')]

r17=soup.p.previous_siblings  # get all preceding siblings
print(type(r17),r17)# <class 'generator'> <generator object PageElement.previous_siblings at 0x0000026F649C3390>
print(list(enumerate(soup.p.previous_siblings)))
# output: [(0, '\n'), (1, <div>......)]

r18=soup.div.tr.td.next_sibling.next_sibling.string # get the text; previous works the same way
print("r18:",r18) # r18: 18 - 4

r19=list(soup.div.tr.td.next_siblings)[1].string # get the text; previous works the same way
print('r19:',r19)# r19: 18 - 4

r20=soup.div.tr.td.next_sibling.next_sibling.attrs["class"]# get the attribute value; previous works the same way
print(r20) # ['nobr', 'center', 'bold', 'desktop', 'ng-binding']

r21=list(soup.div.tr.td.next_siblings)[1].attrs["class"] # get the attribute value; previous works the same way
print('r21:',r21)# r21: ['nobr', 'center', 'bold', 'desktop', 'ng-binding']

3. The method selector

'''*****************************************************Method selector****************************************************'''
from bs4 import BeautifulSoup
import re
html = '''
<div>
    <td class="nobr player desktop">
        <a href="bucks" class="ng-binding" target="_parent" 
        href1="/teams/#!/bucks"><!-- ngIf: row.clinched -->密尔沃基&nbsp;雄鹿<b>nba</b></a>
    </td>
    <tr data-ng-repeat="(i, row) in page" index="0" class="ng-scope">
        <td class="nobr center bold ng-binding" href="href01">6</td>
        <td class="nobr center bold desktop ng-binding">18&nbsp;-&nbsp;4</td>
        <td class="nobr center bold desktop ng-binding">胜 6</td>
        <td class="nobr center bold desktop ng-binding">119.5</td>
    </tr>
</div>
<p>
<li class="nobr player top">nba</li>
</p>
'''
soup=BeautifulSoup(html,'lxml') # initialize the HTML string as a BeautifulSoup object

'''****************Method selectors: using find_all() and find()**********************************************'''
#find_all(name,attrs,recursive,text,limit,**kwargs): finds all elements matching the criteria
    # name: usually the node (tag) name;
    # attrs: attribute constraints;
    # text: matches a node's text; accepts either a string or a compiled regular-expression object;
    # limit: caps the number of results returned;
    # recursive=False/True: False finds only the node's direct children; True (the default) finds all descendants;
    # keyword arguments: select tags that carry the given attribute
    # Note: in most cases find_all() and findAll() are equivalent,
    # but bsObj.findAll(class='text') raises a syntax error; use bsObj.findAll(class_='text') instead
#Example:
ls=[]
for di in soup.find_all(name='td'):
    ls.append(di.get_text()) # di.get_text() is similar to di.string, but get_text() is more robust
print(ls) # ['\n密尔沃基\xa0雄鹿nba\n', '6', '18\xa0-\xa04', '胜 6', '119.5']

ats1=soup.find_all('td',{'class':"nobr center bold ng-binding"})[0].get_text()
ats2=soup.find_all(name='td',attrs={'class':"nobr center bold ng-binding",'href':"href01"})[0].string
print(ats1,ats2) # 6 6

t1=soup.find_all(text=re.compile('(.*)雄鹿(.*)'))
print(t1) #['密尔沃基\xa0雄鹿']

ks1=soup.find_all(href="href01")[0].get_text()
print(ks1) # 6
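The limit and recursive parameters described above are not exercised by the examples; here is a minimal sketch using a small hypothetical fragment (not from the original article):

```python
from bs4 import BeautifulSoup

html = '<div><td>6</td><td>18 - 4</td><td>119.5</td></div>'
soup = BeautifulSoup(html, 'html.parser')

# limit caps the number of results returned
print(len(soup.find_all('td', limit=2)))             # 2

# recursive=False searches only direct children of the node it is called on
print(len(soup.find_all('td', recursive=False)))     # 0: the document's only direct child is div
print(len(soup.div.find_all('td', recursive=False))) # 3: every td is a direct child of div
```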

#find(name,attrs,recursive,text,**kwargs): returns the first matching element; the usage is identical to find_all(), only the return value differs
#Example
f1=soup.find(name='a',attrs={"href":"bucks","class":"ng-binding"})
print(f1.get_text()) # 密尔沃基 雄鹿nba

# Others:
# find_parents() and find_parent(): the former returns all ancestor nodes, the latter the direct parent node.
# find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter the first following sibling.
# find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter the first preceding sibling.
# find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter the first matching node.
# find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter the first matching node.
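A minimal sketch of two of these methods, using a small hypothetical fragment (not from the original article):

```python
from bs4 import BeautifulSoup

html = '<div><td><a href="bucks">Bucks</a></td><td>6</td></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_parent: nearest ancestor matching the given name
print(soup.a.find_parent('div').name)          # div

# find_next_sibling: first following sibling matching the given name
print(soup.td.find_next_sibling('td').string)  # 6
```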

Beautiful Soup also supports CSS selectors, but it is not as powerful for that style of querying as pyquery, so we will not go into detail here.
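For completeness, a minimal sketch of the CSS-selector interface via select() (a sketch with a hypothetical fragment, assuming bs4 is installed):

```python
from bs4 import BeautifulSoup

html = '<div><td class="nobr player"><a href="bucks">Bucks</a></td></div>'
soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector string and returns a list of matching Tags
print(soup.select('td.nobr a')[0]['href'])  # bucks
```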


Origin blog.csdn.net/weixin_41685388/article/details/104070283