[Python] What is a crawler, examples of crawlers

 

The 's' in HTTPS means the access is encrypted.

1. Getting to know crawlers

What is a crawler?
A web crawler is a program or script that automatically grabs information from the Internet according to certain rules. Because Internet data is diverse and resources are limited, the mainstream strategy has become crawling only the web pages relevant to the user's needs and then analyzing them.
What can crawlers do?
You can crawl pictures, crawl the videos you want to watch, and so on. Any data you can reach through a browser can also be obtained by a crawler.
What is the essence of a crawler?
Simulate a browser opening a web page, and extract the part of the data we want from that page.

2. The basic process of crawlers


A crawler uses a URL to simulate a browser requesting a web page. It can obtain data because the server responds to the path (URL) we send it.

The server sends the data back, and the browser parses that data into what we see on screen. What the crawler actually fetches is the page's source code, so we then need to extract the data we want with regular expressions (the re module) or similar methods.
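As a minimal sketch of that extraction step (the HTML fragment and the tag names below are invented purely for illustration), regular expressions can pull a field out of page source like this:

import re

# An invented fragment of page source, standing in for response.text
html = '<h1 class="title">Example Book</h1><span class="author">Some Author</span>'

# Non-greedy groups (.*?) capture just the text between the tags
print(re.findall(r'<h1 class="title">(.*?)</h1>', html))       # ['Example Book']
print(re.findall(r'<span class="author">(.*?)</span>', html))  # ['Some Author']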


The request header is what we send to the server;

the server sends the data back to us in the "response";

if we don't send proper headers to the server, the server may refuse to respond to us.
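A small sketch of this request/response exchange (httpbin.org is used here only as an assumed echo service for illustration):

import requests

# Send a GET request with a custom request header
response = requests.get('https://httpbin.org/get',
                        headers={'User-Agent': 'my-crawler'})

print(response.request.headers)  # the request headers we sent to the server
print(response.headers)          # the response headers the server sent back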

3. Programming specifications

The line if __name__ == "__main__": controls the execution order between different pieces of code:


def text1(a):
    print('hello', a)

text1(1)  # module-level call: runs as soon as the file is executed


if __name__ == "__main__":
    text1(2)  # runs only when the file is executed directly, not when it is imported

You can see that text1(1) is executed first, because a .py file runs from top to bottom. With if __name__ == "__main__": we no longer need to scatter calls like text1() at the top level; we just put them inside the if block, which gives better control over the execution flow. It is effectively the entry point of the whole program.
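To see the guard in action, suppose the code above is saved as text1.py (a hypothetical file name for this sketch) and imported from another file; only the top-level call runs on import, while the guarded call does not:

# another_file.py (hypothetical)
import text1       # prints "hello 1": the top-level call text1(1) runs on import
                   # text1(2) does NOT run, because during the import __name__ is "text1", not "__main__"

text1.text1(3)     # prints "hello 3": calling the imported function explicitly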


4. Importing custom modules

Importing a module simply means taking a function defined in code written elsewhere and using it where we need it.


【Here is an example】

Here, text1 is a package and text1.py is one of its modules. In text2.py (inside text2), we import the add function from the text1.py module of the text1 package, as sketched below.
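A minimal sketch of that layout (the folder names follow the description above; the body of add is assumed):

# Project layout (hypothetical):
#   text1/
#       text1.py    <- defines add()
#   text2/
#       text2.py    <- imports and uses add()

# text1/text1.py
def add(a, b):
    return a + b

# text2/text2.py (run with the project root on sys.path)
from text1.text1 import add   # package.module import of the add function

print(add(1, 2))              # 3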

5. The requests library

The following uses Python's requests module, a third-party library (installed with pip install requests) that is mainly used to send HTTP requests. The requests module is more concise than the built-in urllib module.


Sending HTTP requests using requests requires first importing the requests module:

import requests

After importing, you can use the methods provided by requests to send an HTTP request to a specified URL, for example:

# Import the requests package
import requests

# Send a request
x = requests.get('https://www.runoob.com/')

# Print the web page content
print(x.text)

Every call that sends a request returns a Response object, which contains the specific response information such as the status code, response headers, and response content:

print(response.status_code) # Get the response status code
print(response.headers) # Get response headers
print(response.content) # Get the response content

More response information is as follows:

Property or method: Description

apparent_encoding: The apparent (detected) encoding of the response
close(): Close the connection to the server
content: Return the response content as bytes
cookies: Return a CookieJar object with the cookies the server sent back
elapsed: Return a timedelta with the time between sending the request and the response arriving, useful for measuring response speed; for example, r.elapsed.microseconds gives the elapsed microseconds
encoding: The encoding used to decode r.text
headers: Return the response headers as a dictionary-like object
history: Return a list of the Response objects in the request history (redirects)
is_permanent_redirect: Return True if the response is a permanent redirect, otherwise False
is_redirect: Return True if the response was a redirect, otherwise False
iter_content(): Iterate over the response content in chunks
iter_lines(): Iterate over the response line by line
json(): Return the result as a JSON object (the body must be valid JSON, otherwise an error is raised)
links: Return the parsed header links of the response
next: Return a PreparedRequest object for the next request in a redirection chain
ok: Check status_code; return True if it is less than 400, otherwise False
raise_for_status(): Raise an HTTPError if an error status occurred
reason: A text description of the response status, such as "Not Found" or "OK"
request: Return the request object that produced this response
status_code: Return the HTTP status code, such as 404 or 200 (200 is OK, 404 is Not Found)
text: Return the response content as a unicode string
url: Return the URL of the response
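A short sketch exercising a few of these members on the same example site used earlier:

import requests

x = requests.get('https://www.runoob.com/')

print(x.status_code)           # e.g. 200
print(x.ok)                    # True when status_code is below 400
print(x.encoding)              # the encoding used to decode x.text
print(x.elapsed.microseconds)  # time between sending the request and the response arriving
x.raise_for_status()           # raises requests.HTTPError if the status indicates an error
x.close()                      # close the connection to the server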

1) requests methods

The methods provided by requests are as follows:

Method: Description

delete(url, args): Send a DELETE request to the specified url
get(url, params, args): Send a GET request to the specified url
head(url, args): Send a HEAD request to the specified url
patch(url, data, args): Send a PATCH request to the specified url
post(url, data, json, args): Send a POST request to the specified url
put(url, data, args): Send a PUT request to the specified url
request(method, url, args): Send a request with the specified method to the specified url
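For example, a POST request can carry a form-encoded body through data or a JSON body through json (httpbin.org is assumed here as a test endpoint that simply echoes back what it receives):

import requests

# Form-encoded body via the data parameter
r1 = requests.post('https://httpbin.org/post', data={'name': 'faloo'})

# JSON body via the json parameter
r2 = requests.post('https://httpbin.org/post', json={'page': 1})

print(r1.status_code)     # e.g. 200
print(r2.json()['json'])  # httpbin echoes the JSON body back under the 'json' key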

6. Setting a timeout

If the server refuses to respond to you, the request may simply sit there waiting. Since we cannot wait forever, we need to set a timeout;

import requests

response = requests.get('https://b2.faloo.com/y_0_1.html', timeout=0.01)  # timeout is in seconds; 0.01 s is deliberately tiny and will almost certainly trigger a timeout

7. Timeout handling

To make the code more robust, timeouts need to be detected and handled;

import requests

try:
    response = requests.get('https://b2.faloo.com/y_0_1.html', timeout=0.01)
except requests.exceptions.ConnectTimeout:
    print('time out!')
# The except clause does the detection: whenever requests.exceptions.ConnectTimeout is raised, "time out!" is printed
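Note that requests.exceptions.ConnectTimeout only covers timeouts while establishing the connection. Catching its parent class requests.exceptions.Timeout also handles read timeouts, as in this small variant of the code above:

import requests

try:
    response = requests.get('https://b2.faloo.com/y_0_1.html', timeout=0.01)
except requests.exceptions.Timeout:
    # Timeout is the base class of both ConnectTimeout and ReadTimeout
    print('time out!')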

8. Getting the status code and the response headers

import requests

response = requests.get('https://b2.faloo.com/y_0_1.html')
print(response.status_code)  # get the status code
print(response.headers)  # get the response headers

9. How to solve the problem of no error but no content when crawling web pages

Sometimes we crawl a web page and find that no error is reported, yet no content is returned. This is usually because the site has an anti-crawler mechanism; the solution is to make the request look as if it came from a browser.


If we crawl the Douban movie page directly, we find that no error is reported but no content is displayed;


To solve this, we first get the User-Agent string of our browser (it can be copied from the request headers shown in the browser's developer tools) and pass it to requests:

 requests.get(url=, headers=)

The most important parameters are url and headers.

import requests


url = 'https://movie.douban.com/top250'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 "
                  "Safari/537.36 Edg/112.0.1722.48",
}  # Browser user-agent: Douban thinks a browser is sending the request and does not block it

rep = requests.get(url=url, headers=headers)
print(rep.text)

You can see that the page content is now returned successfully:

10. A crawler example

Crawled webpage: Faloo Novels (faloo.com)

Link: Original Novel Ranking_Free Novel Download Ranking_Faloo Novel Network (faloo.com)

import re
import requests

# Fetch the page
response = requests.get('https://b2.faloo.com/y_0_1.html')

# Titles
div_text1 = re.findall(re.compile(r'<div class="TwoBox02_08">(.*?)</div>'), response.text)
title_list = []
for i in div_text1:
    # The [0] index strips the list brackets: re.findall returns each match wrapped in a list,
    # and taking the first element gives the bare string, which makes joining easier later
    title_list.append(re.findall(re.compile(r'<h1 class="fontSize17andHei" title="(.*?)">'), i)[0])
print(title_list)

# Authors
div_text2 = re.findall(re.compile(r'<div class="TwoBox02_09">(.*?)</div>'), response.text)
author_list = []
for i in div_text2:
    author_list.append(re.findall(re.compile(r'<a href="//b2.faloo.com/.* title="(.*?)"'), i)[0])
print(author_list)

# Genres
div_text3 = re.findall(re.compile(r'<span class="fontSize14andHui">(.*?)</a>'), response.text)
model_list = []
for i in div_text3:
    # Same trick: [0] pulls the single matched string out of each small list
    model_list.append(re.findall(re.compile(r'<a href="//b2.faloo.com/l.*" title="(.*?)" target="_blank">'), i)[0])
print(model_list)

# Merge the scraped fields
multi_list = map(list, zip(title_list, author_list, model_list))
all_list = list(multi_list)
print(all_list)
with open('./novel.txt', 'w', encoding='utf-8') as fw:
    fw.write('书名                            作者          类型\n')
    for i in all_list:
        fw.write('      '.join(i) + '\n')

Screenshot of a successful run:

 


Origin blog.csdn.net/weixin_65690979/article/details/130309532