Focused crawler
Crawls specified content from a page.
Coding workflow:
Specify the URL -> initiate a request -> obtain the response data -> parse the data -> persist the results
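The workflow above can be sketched in a few lines. To keep the sketch self-contained and runnable offline, the response data is simulated with a hardcoded HTML string (the URL and filename are placeholders); in a real crawler the string would come from requests.get(url).text.

```python
from lxml import etree

url = "https://example.com/"  # 1. specify the URL (placeholder)
# 2-3. initiate the request and obtain the response data;
#      simulated here so the sketch runs offline. In a real crawler:
#      html = requests.get(url, headers=headers).text
html = "<html><head><title>Demo</title></head><body></body></html>"
# 4. parse the data
tree = etree.HTML(html)
title = tree.xpath("/html/head/title/text()")[0]
# 5. persistent storage
with open("title.txt", "w", encoding="utf-8") as f:
    f.write(title)
print(title)
```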
Data parsing methods:
- Regular-expression matching
- bs4 (BeautifulSoup)
- xpath
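For comparison, here is the same extraction done with two of the three methods on a made-up snippet (bs4 works analogously: BeautifulSoup(html, "lxml").find("a").text):

```python
import re
from lxml import etree

html = '<div class="city"><a href="/sh/">Shanghai</a></div>'

# 1) regular-expression match: capture the link text
m = re.search(r'<a href="(.*?)">(.*?)</a>', html)
link_text_re = m.group(2)

# 2) xpath: locate the <a> under the div and take its text
tree = etree.HTML(html)
link_text_xpath = tree.xpath('//div[@class="city"]/a/text()')[0]

print(link_text_re, link_text_xpath)
```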
Principles of data parsing
xpath parsing:
- The most commonly used, convenient, and efficient parsing method
- Instantiate an etree object and load the page source data into it
- Locate tags and capture data by calling the etree object's xpath method with xpath expressions
First, install lxml:
pip install lxml
from lxml import etree
- xpath('xpath expression')
- tree = etree.parse('html')  # load a local HTML file into an etree object
- tree.xpath('/html/head/title')
- / starts positioning from the root node, one level at a time; // spans multiple levels and can mean positioning from any location
- r = tree.xpath('//div[@class="sh"]/p[3]')  # attribute positioning: tag[@attrName="attrValue"]
- Get text: /text() for direct text, //text() for all descendant text
- Get an attribute value: /@attrName, e.g. for <img src="baidu.com">, use //img/@src
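A small demo of the positioning rules above, using a made-up snippet that mirrors the class="sh" example:

```python
from lxml import etree

html = '''
<div class="sh">
  <p>one</p><p>two</p><p>three</p>
  <img src="baidu.com"/>
</div>
'''
tree = etree.HTML(html)

# attribute positioning plus an index (xpath indices start at 1)
third_p = tree.xpath('//div[@class="sh"]/p[3]/text()')[0]
print(third_p)

# //text() gathers all descendant text (whitespace nodes stripped out here)
all_text = [t.strip() for t in tree.xpath('//div[@class="sh"]//text()') if t.strip()]
print(all_text)

# /@attrName takes an attribute value
src = tree.xpath('//div[@class="sh"]/img/@src')[0]
print(src)
```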
from lxml import etree
import requests

if __name__ == "__main__":
    url = 'https://sh.58.com/ershoufang/'
    ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3775.400 QQBrowser/10.6.4209.400"
    headers = {
        "User-Agent": ua,
    }
    # fetch the second-hand-housing listing page
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # each li element is one listing
    li_list = tree.xpath('/html/body/div[5]/div[5]/div[1]/ul/li')
    for li in li_list:
        # ./ makes the expression relative to the current li element
        print(li.xpath('./div[2]/h2/a/text()'))