Python web crawler study notes

1. How to parse the HTML returned by a request

  • To parse the HTML DOM returned by a request in Python, you can use a parsing library, such as BeautifulSoup or lxml, to process the HTML document. Here is sample code using Beautiful Soup and lxml.
  • First, make sure you have the required libraries installed. For Beautiful Soup, install it with pip install beautifulsoup4; for lxml, install it with pip install lxml.
  • Using the Beautiful Soup library:
  1. BeautifulSoup is a Python library used for web scraping. It provides a convenient and efficient way to extract data from HTML and XML documents. With BeautifulSoup, you can parse and traverse HTML structures, search for specific elements, and extract the relevant data from web pages.
  2. The library supports different parsers, such as the built-in Python parser, lxml, and html5lib, so you can choose the one that best fits your needs. A strength of BeautifulSoup is its ability to handle malformed or broken HTML, which makes it a robust tool for web crawling in messy real-world situations.
import requests
from bs4 import BeautifulSoup

# Send the request and get the HTML (url is the address of the page to fetch)
url = "https://example.com"
response = requests.get(url)
html = response.text

# Create a Beautiful Soup object
soup = BeautifulSoup(html, 'html.parser')

# Select DOM elements with a CSS selector and work with them
element = soup.select('#my-element')
  • In the above example, requests.get(url) sends a request and gets the HTML response. Then, we use response.text to get the HTML content of the response and pass it to the Beautiful Soup constructor BeautifulSoup(html, 'html.parser') to create a Beautiful Soup object soup.
  • Next, you can use the methods and selectors provided by Beautiful Soup, such as select(), to select specific elements in the HTML DOM. In the above example, we select the element with id my-element via the selector #my-element.
  • Using the lxml library:
import requests
from lxml import etree

# Send the request and get the HTML (url is the address of the page to fetch)
url = "https://example.com"
response = requests.get(url)
html = response.text

# Create an lxml HTML parser object
parser = etree.HTMLParser()

# Parse the HTML
tree = etree.fromstring(html, parser)

# Select DOM elements with XPath and work with them
elements = tree.xpath('//div[@class="my-element"]')
  • In the above example, we first send the request and get the HTML response using requests.get(url). Then, we create an lxml HTML parser object parser.
  • Next, we use etree.fromstring(html, parser) to parse the HTML and get an object tree representing the DOM tree.
  • Finally, we can use XPath expressions to select DOM elements. In the above example, we use the XPath expression //div[@class="my-element"] to select all div elements whose class attribute is "my-element".
  • Whether using Beautiful Soup or lxml, you can use the methods and properties provided by the respective libraries to manipulate and extract selected DOM elements.
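To illustrate that last point, here is a minimal, self-contained sketch (with a made-up one-line HTML snippet) showing how text and attribute values can be read from elements selected with both libraries:

from bs4 import BeautifulSoup
from lxml import etree

html = '<div id="my-element" class="my-element"><a href="/about">About</a></div>'

# Beautiful Soup: select() returns a list of Tag objects
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('#my-element'):
    print(tag.get_text(strip=True))        # text inside the element -> "About"
    print(tag.find('a')['href'])           # attribute of a child tag -> "/about"

# lxml: xpath() returns a list of Element objects
tree = etree.HTML(html)
for el in tree.xpath('//div[@class="my-element"]'):
    print(''.join(el.itertext()).strip())  # all text inside the element
    print(el.get('class'))                 # attribute value, or None if absent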

2. How to get the text inside a tag

In Python, you can use a variety of libraries and methods to get the text inside HTML tags. Here are a few common methods:

  • Method 1: Use the BeautifulSoup library:
   from bs4 import BeautifulSoup

   # Assume html is an HTML document containing tags
   soup = BeautifulSoup(html, 'html.parser')

   # Get the text inside all tags
   text = soup.get_text()

   # Get the text inside a specific tag (e.g. the p tag)
   p_text = soup.find('p').get_text()
  • Method 2: Use the lxml library:
   from lxml import etree

   # Assume html is an HTML document containing tags
   tree = etree.HTML(html)

   # Get the text inside all tags
   text = tree.xpath('//text()')

   # Get the text inside a specific tag (e.g. the p tag)
   p_text = tree.xpath('//p/text()')
  • Method 3: Use regular expressions:
   import re

   # Assume html is an HTML document containing tags
   pattern = re.compile('<[^>]*>')
   text = re.sub(pattern, '', html)

You can choose whichever of these methods fits your needs; all of them can extract the text content from HTML tags. Note that the regular-expression approach in particular has limitations on complex HTML documents, so using a dedicated HTML parsing library (such as BeautifulSoup or lxml) is recommended for better flexibility and accuracy.

3. How to parse JSON data

  • To get the value of the title attribute in JSON data, you can use Python's json module to parse the JSON data. In the example data, the title attribute appears on every element of the pageArticleList list inside the data dictionary.
  • Here is a sample code that demonstrates how to get the value of the title attribute:
import json

# Assume you have already obtained the JSON data and stored it in the json_data variable
json_data = '''
{
  "status": 200,
  "message": "success",
  "datatype": "json",
  "data": {
    "pageArticleList": [
      {
        "indexnum": 0,
        "periodid": 20200651,
        "ordinate": "",
        "pageid": 2020035375,
        "pagenum": "6 科协动态",
        "title": "聚焦“科技创新+先进制造” 构建社会化大科普工作格局"
      }
    ]
  }
}
'''

# Parse the JSON data
data = json.loads(json_data)

# Extract the value of the title attribute
title = data["data"]["pageArticleList"][0]["title"]

# Print the value of the title attribute
print(title)
  • In the above example, we store the sample data in the json_data string. We then use the json.loads() function to parse the string into a Python dictionary and store it in the data variable.

  • We can then extract the value of the title attribute through hierarchical access of the dictionary key. In this example, we use data["data"]["pageArticleList"][0]["title"] to get the value of the title attribute.

  • Finally, we print the results or perform other processing as needed.

  • Or use get() to get the value of a specific attribute

# res is the response object from a previous request (see section 12 for a full example)
editions = json.loads(res.text)
for i in editions:
    print(i.get('edition'))
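If you are not sure that every key exists, chained indexing like the line above raises a KeyError on missing keys. A small hedged sketch (reusing the json_data string from the example above) that falls back to defaults instead:

import json

data = json.loads(json_data)  # json_data is the sample JSON string shown above

# get() returns a default instead of raising KeyError when a key is missing
articles = data.get("data", {}).get("pageArticleList", [])
for article in articles:
    print(article.get("title", "<no title>"))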


4. How to add commonly used headers

  • If you want to set HTTP request headers in actual code, you can do this by using the functionality of the corresponding programming language and HTTP library. Here is an example showing how to add common request headers using Python's requests library:
import requests

url = "https://example.com"
headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com",
    # Add other commonly used request headers...
}

response = requests.get(url, stream=True, headers=headers)
  • In the above example, we created a headers dictionary and added commonly used request-header key-value pairs to it. Then, when sending the request, these headers are attached to the GET request by passing the headers parameter.
    Please note that in actual use, request headers can be customized as needed. Commonly used request headers include "User-Agent" (the user agent, which identifies the client browser/device), "Accept-Language" (accepted languages), "Referer" (the referring page), and so on.
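If the same headers need to be sent on many requests, one option is a requests.Session, which stores default headers once and reuses them; a minimal sketch (https://example.com is a placeholder URL):

import requests

session = requests.Session()
# Headers set on the session are sent with every request made through it
session.headers.update({
    "User-Agent": "Mozilla/5.0",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com",
})

response = session.get("https://example.com")
print(response.status_code)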

5. How to merge two divs

try:
    # html_dom is the BeautifulSoup object of the page fetched earlier;
    # the skeleton document below is used only to create the new tag
    html = """
           <html>
             <body>
             </body>
           </html>
           """
    soup = BeautifulSoup(html, 'html.parser')
    # Create a new div tag
    new_div = soup.new_tag('div')
    # Move the title div and the article div from the page into the new div
    temp_part1 = html_dom.find('div', 'detail-title')
    new_div.append(temp_part1)
    temp_part2 = html_dom.find("div", "detail-article")
    new_div.append(temp_part2)
    card = {"content": "", "htmlContent": ""}
    html_dom = new_div
except Exception:
    return
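The snippet above is a fragment from inside a larger function: html_dom is assumed to be the BeautifulSoup object of a page fetched earlier, and card is a dictionary filled in later. A self-contained sketch of the same idea, using made-up sample HTML with the detail-title and detail-article class names:

from bs4 import BeautifulSoup

page_html = """
<html><body>
  <div class="detail-title"><h1>Sample title</h1></div>
  <div class="detail-article"><p>Sample article body.</p></div>
</body></html>
"""
html_dom = BeautifulSoup(page_html, 'html.parser')

# Skeleton document used only to create the new tag
soup = BeautifulSoup("<html><body></body></html>", 'html.parser')
new_div = soup.new_tag('div')

# Appending a tag that belongs to another tree moves it into new_div
new_div.append(html_dom.find('div', 'detail-title'))
new_div.append(html_dom.find('div', 'detail-article'))

print(new_div)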

6. How to delete part of the HTML DOM structure

  • To remove part of a fetched DOM structure in Python, you can use the Beautiful Soup library to parse and manipulate HTML. Here is a sample code that demonstrates how to delete part of the DOM structure:
from bs4 import BeautifulSoup

# Assume you have already obtained the DOM structure and stored it in the dom variable
dom = '''
<div class="container">
    <h1>Hello, World!</h1>
    <p>This is a paragraph.</p>
</div>
'''

# Create a Beautiful Soup object
soup = BeautifulSoup(dom, 'html.parser')

# Find the part to be deleted
div_element = soup.find('div', class_='container')
div_element.extract()

# Print the modified DOM structure
print(soup.prettify())
  • In the above example, we first store the DOM structure in the dom variable. Then, we create a parsing object soup using Beautiful Soup.
    Next, we use the find() method to locate the part to be deleted, here the <div class="container"> element. We then use the extract() method to remove that element from the DOM structure.
  • Finally, we use the prettify() method to print the modified DOM structure so that we can check the result.
    In actual applications, adjust the selector and attributes according to the part you want to delete.
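In the example above, extract() removes the entire container. If you only want to delete one child element (for example the <p> tag) while keeping the rest, the same approach works; decompose() is an alternative that removes the tag and destroys it entirely. A small sketch:

from bs4 import BeautifulSoup

dom = '''
<div class="container">
    <h1>Hello, World!</h1>
    <p>This is a paragraph.</p>
</div>
'''
soup = BeautifulSoup(dom, 'html.parser')

# Remove only the <p> element and keep everything else
p_element = soup.find('p')
if p_element is not None:
    p_element.decompose()

print(soup.prettify())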

7. How to get the text in all div tags at once

  • To get the text in all <div> tags at once, you can use the BeautifulSoup library or the lxml library for parsing. Here is sample code using these two libraries:
  • Method 1: Use the BeautifulSoup library:
from bs4 import BeautifulSoup

# Assume html is an HTML document containing tags
soup = BeautifulSoup(html, 'html.parser')

# Find all div tags and get their text content
div_texts = [div.get_text() for div in soup.find_all('div')]
  • Method 2: Use lxml library:
from lxml import etree

# Assume html is an HTML document containing tags
tree = etree.HTML(html)

# Use XPath to find all div tags and get their text content
div_texts = tree.xpath('//div//text()')
  • Using these snippets, you can get the text in all <div> tags at once. Please note that both methods return a list: with BeautifulSoup, each element is the full text of one <div> tag, while the XPath expression //div//text() returns each individual text node found under a <div>. You can further process the text content as needed; see the sketch below for collecting one string per <div> with lxml.
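A small sketch (with an inline sample document) showing how to collect one string per <div> when using lxml, by joining the text nodes under each element:

from lxml import etree

html = "<div><p>first</p><p>block</p></div><div>second block</div>"
tree = etree.HTML(html)

# One string per <div>: join all text nodes underneath each div element
div_texts = [''.join(div.itertext()).strip() for div in tree.xpath('//div')]
print(div_texts)  # ['firstblock', 'second block']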

8. How a Python crawler changes the character set encoding of the response text

  • In a Python crawler, you can change the character set encoding of the response text in the following ways:

  • Method 1: Use the response.encoding attribute: after using the requests library to send a request and obtain the response object, you can set the response.encoding attribute to specify the character set encoding used to decode the response text. Depending on the content of the response, you can try different encodings, such as UTF-8, GBK, etc. The sample code is as follows:

import requests

response = requests.get('https://example.com')
response.encoding = 'UTF-8'  # Set the character set encoding of the response text to UTF-8
print(response.text)

apparent_encoding is used to obtain the guessed character set encoding of the response content. It is a read-only attribute: it only returns the guessed encoding and cannot be used to set or change the encoding. If you need to change the character set encoding, set the response.encoding attribute instead.
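A common pattern that builds on this is to assign the guessed encoding back to response.encoding before reading response.text; a minimal sketch (https://example.com is a placeholder):

import requests

response = requests.get('https://example.com')
# apparent_encoding is requests' guess based on the response body (read-only);
# assigning it to response.encoding makes response.text decode with that guess
response.encoding = response.apparent_encoding
print(response.text)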

  • Method 2: Use the chardet library to automatically detect the character set encoding: If you are not sure what the character set encoding of the response is, you can use the chardet library to automatically detect the character set encoding of the response text. This library can analyze the distribution of characters in text and guess possible character set encodings. The sample code is as follows:
import requests
import chardet

response = requests.get('https://example.com')
encoding = chardet.detect(response.content)['encoding']  # Detect the character set encoding of the response text
response.encoding = encoding  # Set the character set encoding of the response text
print(response.text)
  • Method 3: Use Unicode encoding: If you are unable to determine the correct character set encoding of the response text, you can convert the text content to Unicode encoding, so that you do not need to specify the character set encoding. The sample code is as follows:
import requests

response = requests.get('https://example.com')
text = response.content.decode('unicode-escape')
print(text)
  • The above are three common ways to change the character set encoding of response text. Choose the most appropriate method for processing crawled web content based on the situation. Remember, when dealing with character set encodings, be careful to handle exceptions such as encoding errors or unrecognized character sets.

9. How to transcode character sets

  • Character set transcoding is the process of converting text from one character set encoding to another. In Python, you can use the encode() and decode() methods to perform character set transcoding operations.

  • Method 1: encode(encoding) encodes a string into bytes using the specified encoding. The encoding parameter is the name of the target encoding. The sample code is as follows:

text = "你好"
encoded_text = text.encode('utf-8')  # Convert the text to UTF-8-encoded bytes
print(encoded_text)
  • Method 2: decode(encoding) decodes bytes in the specified encoding back into a string. The encoding parameter is the name of the original encoding. The sample code is as follows:
encoded_text = b'\xe4\xbd\xa0\xe5\xa5\xbd'  # A byte string encoded in UTF-8
decoded_text = encoded_text.decode('utf-8')  # Decode the byte string from UTF-8 into Unicode text
print(decoded_text)
  • When transcoding character sets, make sure the encoding you decode with matches the actual encoding of the original bytes. If you are not sure of the original character set, you can first use a character set detection tool (such as chardet) to determine the original encoding before performing the transcoding operation.
  • Transcoding using the correct character set encoding ensures that text is displayed and processed correctly in different environments.
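For example, transcoding a piece of text from GBK to UTF-8 is simply a decode followed by an encode; a small sketch with hard-coded sample bytes:

# The bytes for "你好" in GBK encoding
gbk_bytes = b'\xc4\xe3\xba\xc3'

text = gbk_bytes.decode('gbk')     # GBK bytes -> str
utf8_bytes = text.encode('utf-8')  # str -> UTF-8 bytes

print(text)        # 你好
print(utf8_bytes)  # b'\xe4\xbd\xa0\xe5\xa5\xbd'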

11. The difference between response.text and response.content

In the HTTP request libraries of many programming languages, such as Python's requests library, there are two commonly used attributes used to obtain the content of the HTTP response: response.text and response.content. The differences are as follows:

  • response.text:

1. response.text returns a string representing the content of the HTTP response.
2. The string is decoded according to the character encoding declared in the HTTP response headers; if no usable encoding is declared, requests falls back to a guessed encoding.
3. If the response actually uses a different encoding, you can set the response.encoding attribute to control how the content is decoded.

  • response.content:

1. response.content returns a byte stream representing the content of the HTTP response.
2. This byte stream is raw binary data without any encoding and decoding operations.
3. response.content is suitable for processing binary files, such as pictures, audio and video files, etc.

In short, response.text is suitable for processing text content and will automatically perform encoding and decoding operations, while response.content is suitable for processing binary content and returns the original byte stream.

Which attribute to use depends on the type of content you're dealing with and your needs. If you are dealing with text content, such as HTML, JSON data, etc., then response.text is usually used. If you are dealing with binary files, such as images or audio and video files, then using response.content is more appropriate.
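To make the difference concrete, here is a short sketch: response.text for an HTML page and response.content for a binary download (both URLs are placeholders):

import requests

# Text content: decode to str and hand it to an HTML/JSON parser
page = requests.get("https://example.com")
html = page.text             # str, decoded using page.encoding

# Binary content: keep the raw bytes and write them to disk unchanged
img = requests.get("https://example.com/logo.png")
with open("logo.png", "wb") as f:
    f.write(img.content)     # bytes, no decoding applied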

12. How to send a POST request to access a page

Parsing a request mainly focuses on the following aspects:

  • Request path
  • Request parameters (for a POST request, the parameters are sent implicitly in the request body; the browser submits them with the POST request)
  • Request headers
  • Request type

The following is sample code:

import json
import requests


def main():
    url = 'https://www.gzyouthnews.org.cn/index/index'
    header = {
        'X-Requested-With': 'XMLHttpRequest'
    }
    data = {
        'act': 'list',
        'date': '2023-08-10',
        'paper_id': 1
    }
    res = requests.post(url=url, headers=header, data=data)
    editions = json.loads(res.text)
    for i in editions:
        print(i.get('edition'))


if __name__ == '__main__':
    main()

13. How to get parameters from a URL

To get the parameter page=100 from a given URL, you can use a URL parsing library to parse the URL and extract the required parameters.
The following is sample code for parsing URL parameters using Python's urllib.parse module:

from urllib.parse import urlparse, parse_qs

url = "https://blog.csdn.net/phoenix/web/v1/comment/list/131760390?page=100&amp;size=10&amp;fold=unfold&amp;commentId="

parsed_url = urlparse(url)
query_params = parse_qs(parsed_url.query)

page_value = query_params.get("page", [None])[0]
print(page_value)

In the above example, we first parse the URL using the urlparse function and then parse the query parameter part using the parse_qs function. The parse_qs function parses query parameters into a dictionary where the keys are parameter names and the values are lists of parameter values.

Then, we use query_params.get("page", [None])[0] to get the value of the parameter named page from the dictionary. This returns the value of the parameter, or None if the parameter does not exist.

The output will be 100, which is the value of the page parameter extracted from the URL https://blog.csdn.net/phoenix/web/v1/comment/list/131760390?page=100&size=10&fold=unfold&commentId= .

Please note that parse_qs returns parameter values as strings, so you may need to perform further type conversions as needed.
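For instance, converting the extracted page value to an integer (continuing from page_value above) might look like this:

# parse_qs returns strings; convert explicitly when a number is expected
page_number = int(page_value) if page_value is not None else None
print(page_number)  # 100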

Origin blog.csdn.net/HHX_01/article/details/132554920