Using the requests and beautifulsoup4 libraries
A web crawler generally involves two steps:
1) Obtain the web page content (as a string) over the network - the requests library.
2) Process the obtained web content - the beautifulsoup4 library.
Use the pip3 command to install these two libraries. On Linux:
sudo pip3 install requests
sudo pip3 install beautifulsoup4
1) Using the requests library
The following function is recommended for obtaining the content of a web page:
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception if the status is not 200
        r.encoding = 'utf-8'  # force utf-8 regardless of the original encoding
        return r.text
    except requests.RequestException:
        return ""

url = 'http://www.baidu.com'
print(getHTMLText(url))
2) Using the beautifulsoup4 (or bs4) library
After using the requests library to obtain the HTML page as a string, the next step is to parse the HTML and extract the useful information, which requires a library for processing HTML and XML - beautifulsoup4.
beautifulsoup4 builds a parse tree from HTML or XML grammar and parses the content efficiently.
from bs4 import BeautifulSoup
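To illustrate how the parse tree is used, here is a minimal sketch that parses a small HTML snippet and extracts the title, a paragraph, and the link addresses. The HTML string and tag names here are made-up examples for demonstration, not content from a real site:

```python
from bs4 import BeautifulSoup

# Illustrative HTML snippet (not fetched from a real page)
html = """
<html><head><title>Demo Page</title></head>
<body>
  <p class="intro">Hello</p>
  <a href="http://www.baidu.com">Baidu</a>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')  # build the parse tree

print(soup.title.string)                          # text inside the <title> tag
print(soup.find('p', class_='intro').get_text())  # first <p> with class "intro"
for a in soup.find_all('a'):                      # iterate over all <a> tags
    print(a['href'])                              # value of the href attribute
```

In practice, the string returned by getHTMLText(url) would be passed to BeautifulSoup in place of the hard-coded snippet.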