Basic use of the crawler libraries requests and BeautifulSoup

A crawler's acquisition of data can generally be divided into two steps:

  1. Get the web page

Mainly uses requests (sends web page requests) or selenium (uses a browser to access web pages)

  2. Parse the web page

Mainly uses BeautifulSoup

The following briefly introduces the use of these three libraries

requests: get the page from a URL

Install with pip install requests

Import with import requests

Quick start

import requests
r = requests.get('http://zoollcar.top') # send a request to the URL
print(r.text) # print the text of the retrieved page

The 7 main methods of requests

Usage format: requests.get(url, params=None, **kwargs)

method Features
requests.request() Basic method underlying all the other methods
requests.get() Get the content of the resource at the URL
requests.head() Get only the response headers of the resource at the URL
requests.post() Attach new data to the resource at the URL
requests.put() Store a resource at the URL, overwriting the resource originally there
requests.patch() Partially update the resource at the URL, i.e. change part of its content
requests.delete() Delete the resource stored at the URL
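
A minimal sketch of two of these methods; httpbin.org is assumed here only as a convenient public test endpoint:

import requests

r = requests.head('http://httpbin.org/get')  # response headers only, no body is downloaded
print(r.headers['Content-Type'])

r = requests.post('http://httpbin.org/post', data={'key': 'value'})  # attach form data
print(r.status_code)  # 200 on success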

Optional keyword parameters (**kwargs)

parameter Description
params Dictionary or byte sequence, appended to the URL as query parameters
data Dictionary, byte sequence, or file object, sent as the body of the Request
json Data in JSON format, sent as the body of the Request
headers Dictionary of custom HTTP headers
cookies Dictionary or CookieJar, cookies for the Request
auth Tuple, enables HTTP authentication
files Dictionary for file uploads, e.g. {'file': open('data.xls','rb')}
timeout Timeout period in seconds
proxies Dictionary of proxy servers to use for access; proxy URLs can include login credentials
allow_redirects True/False, default True; switch for following redirects
stream True/False, default False; if True, the response body is not downloaded until accessed
verify True/False, default True; switch for SSL certificate verification
cert Path to a local SSL certificate
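
A hedged example combining a few of these parameters (the parameter values are made up for illustration):

import requests

r = requests.get(
    'http://zoollcar.top',
    params={'page': 2},                   # appended to the URL as ?page=2
    headers={'User-Agent': 'my-crawler'}, # custom HTTP header
    timeout=10,                           # give up after 10 seconds
)
print(r.url)  # the final URL, with the query parameters attached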

Selenium: access URLs through a browser

Install with pip install selenium

Import with from selenium import webdriver

Quick start

from selenium import webdriver # import
driver = webdriver.Firefox() # attach to the local Firefox browser
driver.get("http://zoollcar.top") # open the page in the browser
a = driver.find_element_by_css_selector('.site-title') # select an element with a CSS selector
print(a.text) # access the element's text content

!! Note that you need to install the corresponding driver (geckodriver.exe for Firefox, chromedriver.exe for Chrome) and put it in a directory on the system PATH

Various selectors

Selector
find_elements_by_css_selector('div.edit') Select by CSS selector
find_elements_by_xpath("//div[@class='edit']") Select by XPath
find_elements_by_id('id') Select by the id attribute
find_elements_by_name('name') Select by the name attribute
find_elements_by_link_text('www.zoollcar.top') Select links by their visible text
find_elements_by_tag_name('h1') Select by tag name
find_elements_by_class_name('edit') Select by class

These methods return lists of all matching elements
To get only the first match, replace elements with element in the method name
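
A short sketch of that distinction, continuing from the driver in the quick start above (the tag name 'a' is an arbitrary choice):

links = driver.find_elements_by_tag_name('a') # list of all <a> elements
for link in links:
    print(link.text)
first = driver.find_element_by_tag_name('a') # the first matching <a> only
print(first.get_attribute('href')) # read an attribute of the element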

Controlling CSS, image, and JavaScript loading

from selenium import webdriver # import
fp = webdriver.FirefoxProfile() # create a Firefox profile
fp.set_preference("permissions.default.stylesheet",2) # disable CSS
fp.set_preference("permissions.default.image",2) # disable image loading
fp.set_preference("javascript.enabled",False)  # disable JS (did not work in my tests; unsure why)
driver = webdriver.Firefox( firefox_profile = fp ) # attach to the local Firefox browser

BeautifulSoup: parse the retrieved page string

Quick start

import requests
from bs4 import BeautifulSoup
r = requests.get('http://zoollcar.top')
soup = BeautifulSoup(r.text,'html.parser') # parse the retrieved text
print(soup.h1.string) # print the text of the first h1 tag

Turning text into a parse tree

Import with from bs4 import BeautifulSoup
The bs4 library turns any HTML input into UTF-8 encoding

Use soup = BeautifulSoup(html,'html.parser') to parse the HTML string with the built-in standard HTML parser
The following forms, which follow different parsing rules, can also be used:
soup = BeautifulSoup(html,'lxml')
soup = BeautifulSoup(html,['lxml','xml'])
soup = BeautifulSoup(html,'html5lib')

After parsing you get a bs4.BeautifulSoup object, a tree-shaped structure from which every tag can be extracted. The main extraction approaches are:

  1. Search tree
  2. Traverse the tree
  3. CSS selector

Access and retrieval methods

Content access

soup.h1 The first h1 tag in the document
soup.h1.name The tag's name
soup.h1.attrs The tag's attributes, as a dictionary
soup.h1.string The tag's text

The text comes back as one of two string types:
NavigableString an ordinary (non-comment) string
Comment a comment string

soup.h1.prettify() displays the HTML in a human-friendly way, with line breaks and indentation
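
A self-contained sketch of these accessors, on a hand-written HTML snippet:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<h1 class="title">Hello</h1>', 'html.parser')
print(soup.h1.name)   # 'h1'
print(soup.h1.attrs)  # {'class': ['title']}
print(soup.h1.string) # 'Hello'
print(soup.h1.prettify()) # indented, multi-line HTML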

Search method

soup.find_all(name, attrs, recursive, string, **kwargs) finds specific content!

name: search by tag name
attrs: search by tag attribute values; attributes can also be given as keyword arguments, e.g. id='link1'
recursive: whether to search all descendants, default True
string: search within the text content of tags

The shorthand for soup.find_all() is soup()
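
A self-contained sketch of find_all and its shorthand; the HTML snippet and the id values are made up for illustration:

from bs4 import BeautifulSoup
html = '<div><a id="link1" href="/a">A</a><a id="link2" href="/b">B</a></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('a'))        # search by tag name
print(soup.find_all(id='link1')) # search by attribute value
print(soup('a'))                 # shorthand for soup.find_all('a')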

method Features
<soup>.find() Search and return only the first result
<soup>.find_parents() Search among ancestor nodes, returns a list
<soup>.find_parent() Return one result from the ancestor nodes
<soup>.find_next_siblings() Search among subsequent sibling nodes, returns a list
<soup>.find_next_sibling() Return one result from the subsequent sibling nodes
<soup>.find_previous_siblings() Search among preceding sibling nodes, returns a list
<soup>.find_previous_sibling() Return one result from the preceding sibling nodes

You can also search with CSS selectors
Use <soup>.select('cssSelectorName') to retrieve the results
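
For example, reusing the soup from the find_all sketch above:

print(soup.select('a#link2'))  # <a> tags with id="link2"
print(soup.select('div > a'))  # <a> tags that are direct children of a <div>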

Traversal methods

Content traversal methods

method Features
Downward traversal
.contents List of child nodes
.children Iterator over the child nodes
.descendants Iterator over all descendant nodes
Upward traversal
.parent The node's direct parent tag
.parents Iterator over the node's ancestor tags
Sideways traversal
.next_sibling The next sibling tag, in HTML document order
.previous_sibling The previous sibling tag, in HTML document order
.next_siblings Iterator over all subsequent sibling tags, in HTML document order
.previous_siblings Iterator over all preceding sibling tags, in HTML document order
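
A self-contained sketch of the three traversal directions, on a made-up list:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<ul><li>one</li><li>two</li></ul>', 'html.parser')
first = soup.ul.contents[0]      # downward: first child of <ul>
print(first.string)              # 'one'
print(first.next_sibling.string) # sideways: 'two'
print(first.parent.name)         # upward: 'ul'
for child in soup.ul.children:   # downward: iterate over the children
    print(child.string)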

Origin: blog.csdn.net/zoollcar/article/details/86299697