A crawler has two jobs: requesting a page, and parsing that page to extract information.
Three libraries are commonly used for crawling: Requests, lxml, and BeautifulSoup.
Requests library: sends a request to a website and fetches the data.
    import requests

    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"}

    try:
        # Send the request inside the try block so a failed connection is caught
        res = requests.get("http://bj.xiaozhu.com/", headers=headers)
        print(res)       # the response object, e.g. <Response [200]>
        print(res.text)  # the page source
    except requests.exceptions.ConnectionError:
        print("Connection refused")
Here, <Response [200]> indicates that the page was requested successfully.
You can look up your browser's User-Agent string at http://www.user-agent.cn/
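The status can also be checked in code instead of by reading the printed output. This is a minimal sketch: the helper name `is_success` is hypothetical, and the response object is built by hand only so that the example runs without a network connection.

```python
import requests


def is_success(res):
    """Return True when the response carries HTTP status 200."""
    return res.status_code == 200


# Hand-built response standing in for the result of requests.get(...)
fake_res = requests.Response()
fake_res.status_code = 200

print(fake_res)              # <Response [200]>
print(is_success(fake_res))  # True
```

In a real crawler, `res` would come from `requests.get(...)` as in the code above.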
Request header
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"}
The request headers are passed to the get() method through the headers parameter:
res=requests.get("http://bj.xiaozhu.com/",headers=headers)
The post() method submits a form. It is used to crawl sites that require a login before the data can be fetched.
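A form submission might look roughly like the sketch below. The login URL and the field names `username` and `password` are assumptions for illustration; a real site defines its own. The request is only prepared, not sent, so that its contents can be inspected offline.

```python
import requests

# Hypothetical login endpoint and form fields, for illustration only
login_url = "http://example.com/login"
form_data = {"username": "alice", "password": "secret"}

# Build the POST request without sending it, to see what would go out
prepared = requests.Request("POST", login_url, data=form_data).prepare()

print(prepared.method)                   # POST
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # username=alice&password=secret

# To actually submit the form:
# res = requests.post(login_url, data=form_data)
```

Passing a dict as `data` makes Requests form-encode it, which is what an HTML login form normally sends.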
BeautifulSoup library: makes it easy to parse the page fetched by Requests. It parses the page source into a Soup document, from which data can then be filtered and extracted.
    import requests
    from bs4 import BeautifulSoup

    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"}

    res = requests.get("http://bj.xiaozhu.com/", headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    print(soup.prettify())
Main parsers available to BeautifulSoup, with their advantages and disadvantages
Elements in a Soup document can be located with the find(), find_all(), and select() methods.
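The difference between the three methods is easiest to see on a small document. The HTML below is made up for illustration and is not the real page structure of any site.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a listing page
html = """
<div id="page_list">
  <ul>
    <li><span class="price"><i>298</i></span></li>
    <li><span class="price"><i>468</i></span></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")                      # first matching tag only
all_li = soup.find_all("li")                    # list of every matching tag
prices = soup.select("#page_list > ul > li i")  # list of tags matching a CSS selector

print(first_li.i.string)  # 298
print(len(all_li))        # 2
print(prices)             # [<i>298</i>, <i>468</i>]
```

find() returns a single tag (or None), while find_all() and select() both return lists, so their results are usually looped over.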
Example:
    import requests
    from bs4 import BeautifulSoup

    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"}

    try:
        # Send the request inside the try block so a failed connection is caught
        res = requests.get("http://bj.xiaozhu.com/", headers=headers)
        soup = BeautifulSoup(res.text, 'html.parser')
        # Select the price elements with a CSS selector
        price = soup.select("#page_list > ul > li > div.result_btm_con.lodgeunitname > div > span > i")
        print(price)
    except requests.exceptions.ConnectionError:
        print("Connection refused")
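Note that a selector containing li:nth-child(1), as copied from the browser, raises an error at runtime in Python (at least with older versions of BeautifulSoup) and needs to be changed to li:nth-of-type(1).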
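A small sketch of the nth-of-type form, again on made-up HTML rather than the real page:

```python
from bs4 import BeautifulSoup

html = "<ul><li>first</li><li>second</li><li>third</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Pick only the second <li> under the <ul>
second = soup.select("ul > li:nth-of-type(2)")
print(second)  # [<li>second</li>]
```

Dropping the :nth-of-type(...) part entirely selects every li at once, which is what the listing example relies on.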
The get_text() method can also be used to get just the text inside a tag.
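For instance, select() returns whole tags, and get_text() strips them down to their text. The HTML here is invented for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><span><i>298</i> per night</span></div>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.select("div > span > i")[0]
print(tag)             # <i>298</i>
print(tag.get_text())  # 298

# get_text() on an outer tag joins the text of everything inside it
span_text = soup.select("span")[0].get_text()
print(span_text)       # 298 per night
```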