BeautifulSoup4's find_all() and select(): an easy way to learn web scraping

Combining regular expressions with BeautifulSoup lets you crawl web pages with far less effort.

Let's try it out on Baidu Tieba. URL: https://tieba.baidu.com/index.html


1. find_all(): searches all descendants of the current node — children, grandchildren, and so on.
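As a quick offline illustration (the HTML snippet below is made up for the demo, not taken from the Tieba page), find_all() reaches matching tags at any depth, not just direct children:

```python
from bs4 import BeautifulSoup

# Made-up snippet: one <a> is a direct child of the <div>,
# the other is nested two levels deeper inside a list.
html = '<div><a href="/top">top</a><ul><li><a href="/deep">deep</a></li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_all() on the <div> returns both links, however deeply nested.
links = soup.div.find_all('a')
print([a['href'] for a in links])  # ['/top', '/deep']
```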

The following example uses find_all() to match Tieba's forum-category links: <a> tags whose href contains the word '娱乐' (entertainment).

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# Fetch the page and parse it with the built-in html.parser.
html = urlopen('https://tieba.baidu.com/index.html').read()
soup = BeautifulSoup(html, 'html.parser')

# Find every <a> whose href contains '娱乐' (entertainment).
for link in soup.find_all('a', href=re.compile('娱乐')):
    print(link.get('title') + ':' + link.get('href'))
Results:
Entertainment stars: /f/index/forumpark?pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
Hong Kong/Taiwan & Southeast Asian stars: /f/index/forumpark?cn=港台东南亚明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
Mainland stars: /f/index/forumpark?cn=内地明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
Korean stars: /f/index/forumpark?cn=韩国明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
Japanese stars: /f/index/forumpark?cn=日本明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
Fashion figures: /f/index/forumpark?cn=时尚人物&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
European & American stars: /f/index/forumpark?cn=欧美明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
Hosts: /f/index/forumpark?cn=主持人&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
Other entertainment stars: /f/index/forumpark?cn=其他娱乐明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
soup.find_all('a', href=re.compile('娱乐')) is equivalent to soup('a', href=re.compile('娱乐')): calling the soup object directly is shorthand for find_all(), so the example above could use soup(...) instead.
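This equivalence is easy to check offline with a small made-up snippet (the hrefs below are invented for the demo):

```python
import re
from bs4 import BeautifulSoup

# Two made-up links; only the first href contains the word we search for.
html = ('<a href="/f/index/forumpark?pcn=entertainment">stars</a>'
        '<a href="/f/index/forumpark?pcn=sports">sports</a>')
soup = BeautifulSoup(html, 'html.parser')

# Calling the soup object directly delegates to find_all().
long_form = soup.find_all('a', href=re.compile('entertainment'))
short_form = soup('a', href=re.compile('entertainment'))
print(long_form == short_form)  # True
print(len(long_form))           # 1
```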

2. select(): use CSS selectors to loop over exactly the content you need.

** Search the page for <a> tags whose href starts with "/f/index":

for link2 in soup.select('a[href^="/f/index"]'):
    print(link2.get('title')+':'+link2.get('href'))

** Search the page for <a> tags whose href ends with "&pn=1":
for link2 in soup.select('a[href$="&pn=1"]'):
    print(link2.get('title')+':'+link2.get('href'))

** Search the page for <a> tags whose href contains "娱乐" (entertainment):
for link3 in soup.select('a[href*="娱乐"]'):
    print(link3.get('title')+':'+link3.get('href'))
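One caveat with the loops above: link.get('title') returns None when an <a> has no title attribute, and the string concatenation then raises a TypeError. A minimal defensive sketch (again on a made-up snippet) supplies a fallback:

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second link has no title attribute.
html = ('<a href="/f/index?a=1" title="first">x</a>'
        '<a href="/f/index?a=2">y</a>')
soup = BeautifulSoup(html, 'html.parser')

results = []
for link in soup.select('a[href^="/f/index"]'):
    # .get() returns None for a missing attribute, so fall back
    # to a placeholder instead of crashing on concatenation.
    title = link.get('title') or '(no title)'
    results.append(title + ':' + link.get('href'))

print(results)  # ['first:/f/index?a=1', '(no title):/f/index?a=2']
```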



Origin www.cnblogs.com/suancaipaofan/p/11786046.html