Combining regular expressions with BeautifulSoup lets a web crawler do more with less code.
Let's try it out on Baidu Tieba (Baidu Post Bar): https://tieba.baidu.com/index.html
1. find_all(): searches all descendants of the current node (children, grandchildren, and so on).
The example below uses find_all() to match the forum category links whose href contains the word '娱乐' (entertainment).
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

f = urlopen('https://tieba.baidu.com/index.html').read()
soup = BeautifulSoup(f, 'html.parser')
for link in soup.find_all('a', href=re.compile('娱乐')):
    print(link.get('title') + ':' + link.get('href'))
Results (the query-string values are Chinese in the actual output; they are shown here in translation):
Entertainment stars: /f/index/forumpark?pcn=entertainment stars&pci=0&ct=1&rn=20&pn=1
Hong Kong and Southeast Asian stars: /f/index/forumpark?cn=Hong Kong and Southeast Asian stars&ci=0&pcn=entertainment stars&pci=0&ct=1&rn=20&pn=1
Mainland stars: /f/index/forumpark?cn=Mainland stars&ci=0&pcn=entertainment stars&pci=0&ct=1&rn=20&pn=1
Korean stars: /f/index/forumpark?cn=Korean stars&ci=0&pcn=entertainment stars&pci=0&ct=1&rn=20&pn=1
Japanese stars: /f/index/forumpark?cn=Japanese stars&ci=0&pcn=entertainment stars&pci=0&ct=1&rn=20&pn=1
Fashion figures: /f/index/forumpark?cn=fashion figures&ci=0&pcn=entertainment stars&pci=0&ct=1&rn=20&pn=1
European and American stars: /f/index/forumpark?cn=European and American stars&ci=0&pcn=entertainment stars&pci=0&ct=1&rn=20&pn=1
Moderators: /f/index/forumpark?cn=moderators&ci=0&pcn=entertainment stars&pci=0&ct=1&rn=20&pn=1
Other entertainment stars: /f/index/forumpark?cn=other entertainment stars&ci=0&pcn=entertainment stars&pci=0&ct=1&rn=20&pn=1
soup.find_all('a', href=re.compile('娱乐')) is equivalent to soup('a', href=re.compile('娱乐')): calling a BeautifulSoup object directly is shorthand for calling its find_all() method, so the shorter form could be used in the example above as well.
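Both the find_all() call and the soup(...) shorthand can be tried offline on a small hand-written HTML snippet (the snippet below is invented for illustration; it only mimics the shape of the Tieba category links):

```python
import re
from bs4 import BeautifulSoup

# A tiny stand-in for the Tieba page, invented for illustration.
html = '''
<div>
  <a href="/f/index/forumpark?pcn=娱乐明星&pn=1" title="Entertainment stars">stars</a>
  <a href="/f/index/forumpark?pcn=科技&pn=1" title="Tech">tech</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all() scans every descendant tag, keeping only <a> tags
# whose href attribute matches the regular expression.
links = soup.find_all('a', href=re.compile('娱乐'))

# Calling the soup object directly is shorthand for find_all().
assert links == soup('a', href=re.compile('娱乐'))

for link in links:
    print(link.get('title') + ':' + link.get('href'))
```

Only the first link matches, since only its href contains 娱乐.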
2. Use select() with CSS selectors to loop over the content you need:
** Search the page for a tags whose href starts with "/f/index":
for link2 in soup.select('a[href^="/f/index"]'):
    print(link2.get('title') + ':' + link2.get('href'))
** Search the page for a tags whose href ends with "&pn=1":
for link2 in soup.select('a[href$="&pn=1"]'):
    print(link2.get('title') + ':' + link2.get('href'))
** Search the page for a tags whose href contains "娱乐":
for link3 in soup.select('a[href*="娱乐"]'):
    print(link3.get('title') + ':' + link3.get('href'))
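The three selector forms above (prefix ^=, suffix $=, substring *=) can be verified offline on a small invented snippet that imitates the Tieba links:

```python
from bs4 import BeautifulSoup

# Minimal HTML invented for illustration only.
html = '''
<a href="/f/index/forumpark?cn=港台&pn=1" title="HK stars">a</a>
<a href="/f/index/forumpark?cn=娱乐&pn=2" title="Entertainment">b</a>
<a href="/other?pn=1" title="Other">c</a>
'''

soup = BeautifulSoup(html, 'html.parser')

prefix = soup.select('a[href^="/f/index"]')  # href starts with /f/index
suffix = soup.select('a[href$="&pn=1"]')     # href ends with &pn=1
substr = soup.select('a[href*="娱乐"]')       # href contains 娱乐

print([t.get('title') for t in prefix])
print([t.get('title') for t in suffix])
print([t.get('title') for t in substr])
```

Note that the third link ends with "?pn=1", not "&pn=1", so the suffix selector skips it; only whole-string matching against the attribute value decides.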