A first web crawler in Python

A crawler has two jobs: requesting pages and parsing them to extract information.

Three libraries used for crawling: Requests, Lxml, and BeautifulSoup

Requests library: sends a request to a website to obtain its data

import requests
#from bs4 import BeautifulSoup
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"}
try:
    # The request is made inside the try block so a failed connection is actually caught
    res=requests.get("http://bj.xiaozhu.com/",headers=headers)
    #soup = BeautifulSoup(res.text, 'html.parser')
    #price=soup.select("#page_list > ul > li > div.result_btm_con.lodgeunitname > div > span > i")
    print(res)
    print(res.text)
    #print(soup.prettify())
    #print(price)
# Requests raises its own ConnectionError, which is not the built-in one
except requests.exceptions.ConnectionError:
    print("Connection refused")

In the output, <Response [200]> indicates that the page was requested successfully.
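
To see what the response object reports without touching the network, a Response can be built by hand. This is only a sketch: the object is constructed locally and its status code is set manually for illustration, whereas in real use it is returned by requests.get().

```python
import requests

# Built locally for illustration; normally requests.get() returns this object.
res = requests.Response()
res.status_code = 200

print(res)              # the repr includes the status code
print(res.status_code)  # the status code as an integer
print(res.ok)           # True when the status code is below 400
```

Checking res.status_code or res.ok before parsing avoids running selectors against an error page.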

You can look up your browser's User-Agent at http://www.user-agent.cn/

Request headers:
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"}
The request headers are passed to the get() method:
res=requests.get("http://bj.xiaozhu.com/",headers=headers)
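
One way to confirm which headers would be sent is to prepare the request without sending it. The sketch below assumes a shortened User-Agent string chosen for readability; requests.Request(...).prepare() is used only for inspection here and is not part of the original walkthrough.

```python
import requests

# Shortened User-Agent string, used here only for illustration.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}

# Prepare the request without sending it, to inspect what would go out.
prepared = requests.Request("GET", "http://bj.xiaozhu.com/", headers=headers).prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```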

The post() method submits a form; it is used to crawl sites that require a login before their data can be fetched.
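
A minimal sketch of a form submission follows. The login URL and the field names username and password are hypothetical; a real site defines its own. The request is only prepared, not sent, so the encoded form body can be inspected offline.

```python
import requests

# Hypothetical login URL and form fields, for illustration only.
login_url = "http://example.com/login"
form_data = {"username": "alice", "password": "secret"}

# Prepare (but do not send) the POST to show how the form is encoded.
prepared = requests.Request("POST", login_url, data=form_data).prepare()
print(prepared.body)
print(prepared.headers["Content-Type"])

# To actually send it: res = requests.post(login_url, data=form_data)
```

When several pages must be fetched after logging in, a requests.Session() is commonly used so that cookies set by the login response are kept for later requests.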

BeautifulSoup library: parses the page fetched by the Requests library, turning the page source into a Soup document from which data can be filtered and extracted

import requests
from bs4 import BeautifulSoup
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"}
res=requests.get("http://bj.xiaozhu.com/",headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.prettify())
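
The same parsing step can be tried offline on a small HTML string. The markup below is made up and simply stands in for res.text.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for res.text.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)  # text of the <title> tag
print(soup.p.get_text())  # text of the first <p> tag
print(soup.prettify())    # indented view of the whole document
```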

Advantages and disadvantages of the main parsers supported by the BeautifulSoup library

Elements in a Soup document can be located with the find(), find_all(), and select() methods
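
The three methods can be compared on a small fragment. The HTML and the class name price below are invented for illustration and do not reproduce the real page structure.

```python
from bs4 import BeautifulSoup

# Invented fragment; the real page uses different markup.
html = """
<div id="page_list">
  <ul>
    <li><span class="price"><i>298</i></span></li>
    <li><span class="price"><i>388</i></span></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")       # first matching tag, or None if absent
all_li = soup.find_all("li")     # list of every matching tag
prices = soup.select("#page_list > ul > li > span.price > i")  # CSS selector

print(first_li.i.string)
print(len(all_li))
print([tag.get_text() for tag in prices])
```

find() and find_all() match by tag name and attributes, while select() accepts a CSS selector such as one copied from the browser's developer tools.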

Example: locating elements in a Soup document

import requests
from bs4 import BeautifulSoup
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400"}
try:
    # The request is made inside the try block so a failed connection is actually caught
    res=requests.get("http://bj.xiaozhu.com/",headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # select() returns a list of every tag matching the CSS selector
    price=soup.select("#page_list > ul > li > div.result_btm_con.lodgeunitname > div > span > i")
    #print(res)
    #print(res.text)
    #print(soup.prettify())
    print(price)
# Requests raises its own ConnectionError, which is not the built-in one
except requests.exceptions.ConnectionError:
    print("Connection refused")

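Note that a selector copied from the browser may contain li:nth-child(1). This raises an error when run in Python (at least with older BeautifulSoup versions, which only implement nth-of-type) and needs to be changed to li:nth-of-type(1).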
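
A small sketch of the nth-of-type form, run against an invented fragment rather than the real page:

```python
from bs4 import BeautifulSoup

# Invented fragment; the real page uses different markup.
html = """
<div id="page_list">
  <ul>
    <li><span><i>298</i></span></li>
    <li><span><i>388</i></span></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# li:nth-of-type(1) restricts the match to the first <li> among its siblings.
first_price = soup.select("#page_list > ul > li:nth-of-type(1) > span > i")
print(first_price)
```

Dropping the :nth-of-type(1) part matches every li instead of only the first one, which is what the full example does to collect all prices.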

The get_text() method can also be used to extract the text between a tag's opening and closing markers.
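
As an illustration, select() returns tag objects, and get_text() turns each one into a plain string. The fragment below is made up for the sketch.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration.
html = "<ul><li><i>298</i></li><li><i>388</i></li></ul>"
soup = BeautifulSoup(html, "html.parser")

tags = soup.select("li > i")
print(tags)  # a list of Tag objects, markup included

texts = [tag.get_text() for tag in tags]
print(texts)  # only the text inside each tag
```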

 


Origin www.cnblogs.com/gaochunhui/p/11277133.html