Python crawler basics 01

The content is for your own learning and use. If there are any mistakes, please correct me. Thank you!
Author: rookiequ


We can use a crawler to scrape the data we want from a website. My environment is PyCharm + Anaconda. To crawl data you first need a Python environment and a suitable IDE, which I won't go into here.

Basic principles of Python crawlers

How a Python crawler works, in four steps (a minimal sketch follows the list):

  1. Fetch data. The crawler takes the URL we give it, sends a request to the server, and receives the data the server returns.
  2. Parse data. The crawler converts the data returned by the server into a format that humans can read.
  3. Extract data. The crawler picks out the specific data we need from the returned data.
  4. Store data. The crawler saves the data in the storage form we choose, so that we can move on to the next step.
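
To make these four steps concrete, here is a minimal sketch using the requests library introduced below; the Baidu homepage, the <title> filter, and the result.txt file name are only placeholders for illustration.

import requests

# 1. Fetch data: send a request and receive the server's response
r = requests.get('http://www.baidu.com')
# 2. Parse data: decode the raw response into readable text
r.encoding = 'utf-8'
html = r.text
# 3. Extract data: keep only the part we need (here, just the lines containing <title>)
title_lines = [line for line in html.splitlines() if '<title>' in line]
# 4. Store data: save the result to a local file for the next step
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(title_lines))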

Using the Requests library

To install the Requests library, open a cmd window as administrator and run:

pip install requests

For a first taste of crawling, let's look at a small example:

import requests		# import the requests module

r = requests.get('http://www.baidu.com')
print(r.status_code)	# status code
r.encoding = 'utf-8'	# set the encoding
print(r.text)		    # the page content that was read

HTTP protocol: a stateless application-layer protocol based on the request/response model.

URL format: http://host[:port][path], for example http://www.rookiequ.top/admin. A URL is the Internet path of a resource accessed over the HTTP protocol; one URL corresponds to one data resource.

  1. host: a valid host domain name or IP address
  2. port: the port number; the default is 80
  3. path: the path of the requested resource
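
As a quick check of these pieces, Python's standard urllib.parse module can split a URL into them (a small illustration using the example URL above):

from urllib.parse import urlparse

parts = urlparse('http://www.rookiequ.top:80/admin')
print(parts.scheme)    # 'http'
print(parts.hostname)  # 'www.rookiequ.top'  (host)
print(parts.port)      # 80 (None if the URL omits the port)
print(parts.path)      # '/admin'            (path)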

The 7 main methods provided by the Requests library

Method                Description
requests.request()    Constructs a request; the basis of all the methods below
requests.get()        The main method for fetching an HTML page, corresponding to HTTP GET
requests.head()       Fetches an HTML page's header information, corresponding to HTTP HEAD
requests.post()       Submits a POST request to an HTML page, corresponding to HTTP POST
requests.put()        Submits a PUT request to an HTML page, corresponding to HTTP PUT
requests.patch()      Submits a partial-modification request to an HTML page, corresponding to HTTP PATCH
requests.delete()     Submits a delete request to an HTML page, corresponding to HTTP DELETE
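
A quick, hedged look at two of these methods; httpbin.org is a public echo service commonly used for such demos, and the payload keys are arbitrary:

import requests

# HEAD: fetch only the response headers, not the body
r = requests.head('http://httpbin.org/get')
print(r.headers)
print(r.text)       # empty, because a HEAD response carries no body

# POST: submit form data; the echo service returns it under "form"
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://httpbin.org/post', data=payload)
print(r.text)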

requests.request(method, url, **kwargs)

**kwargs: keyword arguments that control the request; all are optional:

params, data, json, headers, cookies, auth, files, timeout, proxies, allow_redirects, stream, verify, cert

On the right-hand side, the call builds a Request object and sends it to the server; on the left, r receives a Response object containing the server's resources.

# call the request method
r = requests.get(url, params=None, **kwargs)
# url: the URL of the page to fetch
# params: extra parameters added to the url, as a dict or byte stream; optional
# **kwargs: 12 keyword arguments that control access
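
A small sketch showing a few of these keyword arguments together; the header value and the 10-second timeout are arbitrary illustrative choices:

import requests

kv = {'user-agent': 'Mozilla/5.0'}      # custom request headers
r = requests.get('http://www.baidu.com',
                 headers=kv,            # send the custom headers
                 timeout=10,            # give up after 10 seconds
                 allow_redirects=True)  # follow redirects (the default)
print(r.status_code)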

Commonly used attributes of the Response object:

Attribute                Meaning
response.status_code     Whether the request succeeded (the HTTP status code)
response.content         The response body as binary data
response.text            The response body as a string
response.encoding        The encoding of the response
import requests		# import the requests module

r = requests.get('http://www.baidu.com')
print(r.status_code)	# HTTP status of the request: 200 success, 404 failure
r.encoding = 'utf-8'	# encoding of the response body, guessed from the header
type(r)					# check the type of the return value: a Response object
print(r.text)		    # the page content as a string
print(r.content)		# the response body in binary form, typically used for images and audio
print(r.apparent_encoding) # encoding inferred from the content itself (a fallback)
print(r.headers)		# the response headers

Note: for r.encoding, if the header contains no charset, the encoding is assumed to be ISO-8859-1 (which cannot decode Chinese text).

Exceptions in the Requests library

Exception                     Meaning
requests.ConnectionError      Network connection error, such as DNS lookup failure or a refused connection
requests.HTTPError            HTTP error
requests.URLRequired          A URL is missing
requests.TooManyRedirects     The maximum number of redirects was exceeded
requests.ConnectTimeout       Timed out while connecting to the remote server
requests.Timeout              The request as a whole timed out
response.raise_for_status()   Raises requests.HTTPError if the status code is not 200
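
A short sketch of catching these exceptions individually rather than with a bare except; the URL and timeout value are just for illustration:

import requests

try:
    r = requests.get('http://www.baidu.com', timeout=5)
    r.raise_for_status()            # raises requests.HTTPError if the status is not 200
except requests.ConnectTimeout:
    print('timed out while connecting to the server')
except requests.HTTPError as e:
    print('HTTP error:', e)
except requests.ConnectionError:
    print('network connection error')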

A common code framework for crawling web pages

import requests

def getHTMLText(url):
	try:
		r = requests.get(url, timeout=30)
		r.raise_for_status()  # raise an exception if the status is not 200
		r.encoding = r.apparent_encoding
		return r.text
	except:
		return 'An exception occurred'

if __name__ == '__main__':
	url = 'http://www.baidu.com'
	print(getHTMLText(url))

Robots protocol: a website's robots.txt file tells crawlers which content may and may not be crawled. It mainly uses two directives, Allow and Disallow: Allow means the path may be accessed, Disallow means access is forbidden.
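
To see a site's rules, you can read its robots.txt directly; Python's standard urllib.robotparser can also check a path for you (Baidu is used here only as a familiar example):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('http://www.baidu.com/robots.txt')
rp.read()   # download and parse the robots.txt file
# check whether a crawler with the given user agent may fetch the given URL
print(rp.can_fetch('*', 'http://www.baidu.com/s'))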

Examples

import requests

# r = requests.get('http://www.baidu.com')
# print(r.status_code)
# r.encoding = 'utf-8'
# print(r.text)

# Example 1: submit a keyword search to Baidu
keyword = 'python'
try:
    kv = {'wd': keyword}
    r = requests.get('http://www.baidu.com/s', params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except:
    print('Crawl failed')

import requests

# Example 2: fetch a JD product page
url = 'https://item.jd.com/2967929.html'
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except:
    print("Crawl failed")

import requests

# Example 3: fetch an Amazon product page, pretending to be a browser via the user-agent header
url = 'https://www.amazon.cn/gp/product/B01M8L5Z3Y'
try:
    kv = {'user-agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except:
    print("Crawl failed")

Using BeautifulSoup

BeautifulSoup is used to parse web pages and extract data from them.

  • Parse data: translate the HTML source code returned by the server into something we can understand
  • Extract data: select the data we need from the parsed source in a targeted way

First prepare the learning environment:

pip install bs4

bs object = BeautifulSoup(text to be parsed, parser): the call takes two arguments. The first must be a string (or a variable holding one) to be parsed; the second is the parser. We can use html.parser, which is built into Python; it is not the only parser, and others can be used as well.

Here is a simple example:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://xiaoke.kaikeba.com/example/canteen/index.html')
soup = BeautifulSoup(res.text,'html.parser')
# parse the page into a BeautifulSoup object
print(type(soup)) # check soup's type
print(soup) # print soup

The data type of soup is <class 'bs4.BeautifulSoup'>, which shows that soup is a BeautifulSoup object. That is a different data type from response.text, which is <class 'str'>, yet the printed content looks the same.

BeautifulSoup is used to extract data. Here are two of its methods, find() and find_all(), which match HTML tags and attributes and pull out all data in the BeautifulSoup object that satisfies the conditions. Their usage is essentially the same; the difference is that find() extracts only the first match and returns a <class 'bs4.element.Tag'> object, while find_all() extracts every match and returns a <class 'bs4.element.ResultSet'> object, which stores the Tag objects in a list-like structure that we can treat as a list. There is also the select() method, which you can learn about later.


The Tag type was mentioned above; what properties and methods does it have? A Tag supports find() and find_all() for searching inside it, tag.text for the text it contains, and tag['attribute'] for reading an attribute's value.
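
A small sketch of these Tag operations on the canteen demo page used in the examples; the class names are the ones that appear in the code below, and the <a> lookup is only illustrative:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://xiaoke.kaikeba.com/example/canteen/index.html')
soup = BeautifulSoup(res.text, 'html.parser')

item = soup.find(class_='show-list-item')   # find() returns the first matching Tag
title = item.find(class_='desc-title')      # find() also works on a Tag itself
print(title.text)                           # .text: the text inside the tag
link = soup.find('a')                       # first <a> tag on the page, if any
if link is not None:
    print(link['href'])                     # tag['attr']: read an attribute value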

import requests
from bs4 import BeautifulSoup
res = requests.get('https://xiaoke.kaikeba.com/example/canteen/index.html')
html = res.text
soup = BeautifulSoup(html,'html.parser')
items = soup.find_all(class_='show-list-item')
for item in items:
    title = item.find(class_='desc-title') # within each element of the list, match class_='desc-title' and extract the data
    material = item.find(class_='desc-material') # within each element of the list, match class_='desc-material' and extract the data
    step = item.find(class_='desc-step') # within each element of the list, match class_='desc-step' and extract the data
    print(title.text, material.text, step.text)

To recap the overall flow: the crawler requests the page with requests, parses it into a BeautifulSoup object, extracts the data we need with find() and find_all(), and finally stores the data.

Let's look at an example of crawling an image:

import requests
from kkb_tools import open_file   # helper module from the course environment

res = requests.get('https://xiaoke.kaikeba.com/example/canteen/images/banner.png')
pic = res.content	# binary content of the response
photo = open('banner.jpg', 'wb')	# open a local file in binary write mode
photo.write(pic)    # write the image bytes to the file
photo.close()

open_file('banner.jpg')   # open the saved image
