Open-book exam or design patterns, I want to prepare a little more information, then wrote a reptile crawling codes and images, there is cleverly format for further processing, eventually become a markdown format
import requests
from bs4 import BeautifulSoup
First, get the rookie tutorial - html page of this factory model, the object into soup
r = requests.get("https://www.runoob.com/design-pattern/factory-pattern.html")
#获取反馈信息 200为正常
r.status_code
r.encoding = "utf-8"
soup=BeautifulSoup(r.text,'lxml')
print(soup.prettify())
Observe that by crawling links are required at the beginning of '/ design', so using startsWith () filter, to obtain a list of url
html_list=[]
for a in soup.find_all('a'):
if(a['href'].startswith('/design')):
print(a['href'])
html_list.append(a['href'])
Write a crawling function of each page, the first markdown language in a comment, the comment with three slashes, convenient format.
+def fonepage(add):
baseurl="https://www.runoob.com"
url=baseurl+add
r = requests.get(url)
#获取反馈信息 200为正常
r.status_code
r.encoding = "utf-8"
soup=BeautifulSoup(r.text,'lxml')
lis=soup.find_all(attrs={'class':'example'})
print('///## '+add)
img=soup.find_all('img')
print('///![]('+baseurl+img[0]['src']+')')
print('///```')
for son in lis:
for a in son.find_all('span'):
print(a.string,end=' ')
print('\n')
print('///```')
Then you can crawl by page
for i in range(2,len(html_list)):
fonepage(html_list[i])
Finally, processing, utilization IDEA format, followed by notepad delete all '///' string, it is converted to a markdown format.
The results are as follows: https://www.cnblogs.com/Tony100K/p/11741212.html