A crawler's data acquisition generally breaks down into two steps:
- Get the web page
  mainly with requests (send a request for the page) or selenium (visit the page through a browser)
- Parse the web page
  mainly with BeautifulSoup

The following briefly introduces the use of these three libraries.
requests: get a page from a URL
Install with pip install requests
Import with import requests
Quick start
import requests
r = requests.get('http://zoollcar.top')  # send a request to the URL
print(r.text)                            # print the retrieved page text
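In practice it pays to check the response before using it; a minimal sketch using only standard requests features:

```python
import requests

try:
    r = requests.get('http://zoollcar.top', timeout=10)
    r.raise_for_status()              # raise HTTPError on 4xx/5xx responses
    r.encoding = r.apparent_encoding  # guess the encoding from the body, not just the header
    print(r.text[:200])               # first 200 characters of the page
except requests.RequestException as e:
    print('request failed:', e)
```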
The 7 main methods of requests
General form: requests.get(url, params=None, **kwargs)
Method | Features |
---|---|
requests.request() | Basic method underlying all the others |
requests.get() | Get the resource at the URL |
requests.head() | Get only the response headers of the resource at the URL |
requests.post() | Append new data to the resource at the URL |
requests.put() | Store a resource at the URL, overwriting the existing one |
requests.patch() | Partially update the resource at the URL, i.e. change part of its content |
requests.delete() | Delete the resource stored at the URL |
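For example, head() and post() in action (httpbin.org is a public echo service, used here only for illustration):

```python
import requests

# HEAD: fetch only the headers, a cheap way to inspect a resource
r = requests.head('https://httpbin.org/get')
print(r.headers['Content-Type'])

# POST: send form data as the request body
r = requests.post('https://httpbin.org/post', data={'key': 'value'})
print(r.json()['form'])  # httpbin echoes the submitted form back
```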
Optional parameters **kwargs
parameter | Description |
---|---|
params | Dictionary or byte sequence, appended to the URL as query parameters |
data | Dictionary, byte sequence, or file object sent as the request body |
json | JSON-format data sent as the request body |
headers | Dictionary of custom HTTP headers |
cookies | Dictionary or CookieJar, cookies to send with the request |
auth | Tuple, enables HTTP authentication |
files | Dictionary for file uploads, e.g. {'file': open('data.xls', 'rb')} |
timeout | Timeout in seconds |
proxies | Dictionary mapping protocols to proxy servers; can include login credentials |
allow_redirects | True/False, default True; whether to follow redirects |
stream | True/False, default False; if True, the response body is not downloaded until accessed |
verify | True/False, default True; whether to verify the SSL certificate |
cert | Path to a local SSL client certificate |
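A few of these parameters in combination (again against httpbin.org, purely for illustration):

```python
import requests

params = {'q': 'python'}                    # becomes ?q=python in the URL
headers = {'User-Agent': 'my-crawler/0.1'}  # a custom header, e.g. a UA string

r = requests.get('https://httpbin.org/get',
                 params=params,
                 headers=headers,
                 timeout=5)                 # fail instead of hanging forever
print(r.url)          # https://httpbin.org/get?q=python
print(r.status_code)  # 200 on success
```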
Selenium: access URLs through a browser
Install with pip install selenium
Import with from selenium import webdriver
Quick start
from selenium import webdriver                          # import
driver = webdriver.Firefox()                            # attach to the local Firefox browser
driver.get("http://zoollcar.top")                       # open the page in the browser
a = driver.find_element_by_css_selector('.site-title')  # select an element with a CSS selector
print(a.text)                                           # print the element's text
!! Note that you need to install the matching browser driver: geckodriver.exe for Firefox, chromedriver.exe for Google Chrome. Put it in a directory that is on the system PATH.
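If you would rather not edit the system PATH, the Selenium 3 API used in this post also accepts an explicit driver location; the path below is only a placeholder (Selenium 4.10+ removed executable_path in favor of Service objects):

```python
from selenium import webdriver

# placeholder path -- point it at wherever you saved geckodriver
driver = webdriver.Firefox(executable_path=r'C:\tools\geckodriver.exe')
driver.get('http://zoollcar.top')
driver.quit()  # close the browser when done
```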
Various selectors
Selector | Matches by |
---|---|
find_elements_by_css_selector('div.edit') | CSS selector |
find_elements_by_xpath("//div[@class='edit']") | XPath expression |
find_elements_by_id('id') | id attribute |
find_elements_by_name('name') | name attribute |
find_elements_by_link_text('www.zoollcar.top') | link text |
find_elements_by_tag_name('h1') | tag name |
find_elements_by_class_name('edit') | class attribute |
These methods return a list of all matching elements. To get only the first match, replace elements with element in the method name (e.g. find_element_by_css_selector); an example follows.
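For instance, with the same page and API as the quick start:

```python
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://zoollcar.top')

links = driver.find_elements_by_tag_name('a')  # list of every <a> element
first = driver.find_element_by_tag_name('a')   # only the first <a> element
print(len(links), first.text)
driver.quit()
```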
Control whether CSS, images, and JS are loaded and executed
from selenium import webdriver                          # import
fp = webdriver.FirefoxProfile()                         # create a Firefox profile
fp.set_preference("permissions.default.stylesheet", 2)  # disable CSS
fp.set_preference("permissions.default.image", 2)       # disable image loading
fp.set_preference("javascript.enabled", False)          # disable JS -- did not work in my tests (not sure why)
driver = webdriver.Firefox(firefox_profile=fp)          # attach to the local Firefox browser
BeautifulSoup: parse the fetched page string
Quick start
import requests
from bs4 import BeautifulSoup
r = requests.get('http://zoollcar.top')
soup = BeautifulSoup(r.text, 'html.parser')  # parse the fetched text
print(soup.h1.string)                        # print the text of the first h1 tag
Turn text into a parse tree
Import the library with from bs4 import BeautifulSoup
The bs4 library converts any HTML input to UTF-8 encoding.
Use soup = BeautifulSoup(html, 'html.parser')
to parse the HTML string with the built-in standard HTML parser.
The following forms can also be used; they follow different parsing rules:
soup = BeautifulSoup(html, 'lxml')
soup = BeautifulSoup(html, ['lxml', 'xml'])
soup = BeautifulSoup(html, 'html5lib')
After parsing you get a bs4.BeautifulSoup object, a tree from which every tag can be extracted. The main extraction approaches are (a combined sketch follows the list):
- Search the tree
- Traverse the tree
- CSS selectors
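A minimal sketch of all three approaches on a small inline document:

```python
from bs4 import BeautifulSoup

html = '<div id="main"><h1 class="title">Hello</h1><p>world</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# search the tree
print(soup.find('h1').string)             # Hello

# traverse the tree
for child in soup.div.children:
    print(child.name)                     # h1, then p

# CSS selector
print(soup.select('h1.title')[0].string)  # Hello
```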
Access and retrieval methods
Content access
soup.h1
returns the first h1 tag in the document
soup.h1.name
the tag's name, 'h1'
soup.h1.attrs
the tag's attributes as a dictionary
soup.h1.string
the tag's text
This returns one of two string types:
NavigableString, an ordinary (non-comment) string
Comment, a comment string
soup.h1.prettify() renders the HTML in a human-friendly way, with line breaks and indentation
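These accessors in action on an inline document (note that .string is None when a tag has more than one child):

```python
from bs4 import BeautifulSoup

html = '<h1 class="title">Hello <b>world</b></h1>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.name)      # 'h1'
print(soup.h1.attrs)     # {'class': ['title']}
print(soup.b.string)     # 'world'
print(soup.h1.string)    # None -- h1 has two children (text and <b>)
print(soup.h1.prettify())
```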
Search methods
soup.find_all(name, attrs, recursive, string, **kwargs) finds matching content!!
name: search by tag name
attrs: search by tag attribute values; an attribute can also be passed as a keyword argument, e.g. id='link1'
recursive: whether to search all descendants, default True
string: search within the text content of tags
soup.find_all() can be abbreviated to soup()
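A short sketch of find_all and its shorthand on an inline document:

```python
from bs4 import BeautifulSoup

html = ('<a id="link1" href="/a">first</a>'
        '<a id="link2" href="/b">second</a>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('a'))              # by tag name: both <a> tags
print(soup.find_all(id='link1'))       # by attribute keyword
print(soup.find_all(string='second'))  # by text content
print(soup('a'))                       # shorthand for soup.find_all('a')
```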
Method | Features |
---|---|
<soup>.find() | Search and return only one result |
<soup>.find_parents() | Search among ancestor nodes, returns a list |
<soup>.find_parent() | Return one result from the ancestor nodes |
<soup>.find_next_siblings() | Search among subsequent siblings, returns a list |
<soup>.find_next_sibling() | Return one result from the subsequent siblings |
<soup>.find_previous_siblings() | Search among preceding siblings, returns a list |
<soup>.find_previous_sibling() | Return one result from the preceding siblings |
CSS-style retrieval is also available:
use <soup>.select('cssSelectorName')
to retrieve the matching elements.
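For example, with an inline document:

```python
from bs4 import BeautifulSoup

html = '<div class="edit"><a href="/x">link</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('div.edit'))      # select by class
print(soup.select('div.edit > a'))  # child combinator
print(soup.select('a[href]'))       # select by attribute presence
```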
Traversal methods
Content traversal methods
Method | Features |
---|---|
Downward traversal | |
.contents | List of child nodes |
.children | Iterator over the child nodes |
.descendants | Iterator over all descendant nodes |
Upward traversal | |
.parent | The node's direct parent tag |
.parents | Iterator over the node's ancestor tags |
Sideways traversal | |
.next_sibling | The next sibling tag in HTML document order |
.previous_sibling | The previous sibling tag in HTML document order |
.next_siblings | Iterator over all subsequent sibling tags, in HTML document order |
.previous_siblings | Iterator over all preceding sibling tags, in HTML document order |
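A minimal traversal sketch on a small inline document:

```python
from bs4 import BeautifulSoup

html = '<ul><li>one</li><li>two</li><li>three</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.li              # the first <li>
print(first.parent.name)     # 'ul' -- upward traversal
print(first.next_sibling)    # <li>two</li> -- sideways traversal
for li in soup.ul.children:  # downward traversal over direct children
    print(li.string)         # one, two, three
```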