Python crawler (Part 3): a common crawler toolkit

The previous article, crawling bilibili danmaku (bullet comments) and displaying them as a word cloud: crawling the subtitle word cloud display of "Charlotte Annoyance" at station B, was an example of combining data crawling with data display. This article introduces the common tools used by crawlers.

Table of Contents

Common tools

Requests

lxml

BeautifulSoup

tqdm

ffmpy3

matplotlib

seaborn


Common tools

Data crawling involves three steps: download the data, parse the data, and analyze/display the data. Several common tools are used along the way: Requests downloads URL content; regular expressions, BeautifulSoup, and lxml parse HTML documents quickly; tqdm displays processing progress; ffmpy processes video streams; and matplotlib and seaborn visualize and analyze the data.

  • Requests

The requests package provides get, put, post, and delete methods to interact with a URL. response.text returns the body decoded as text (Unicode); it usually needs to be converted to utf-8, otherwise it may be garbled. response.content is the raw binary body, which can be used to download videos and the like; to read it as text it must be decoded to utf-8.
  Either response.content.decode("utf-8") or setting response.encoding = "utf-8" avoids the garbled-text problem.

import requests

response = requests.get("https://www.baidu.com")
print(type(response))
print(response.status_code)
print(type(response.text))
response.encoding = "utf-8"
print(response.text)
print(response.cookies)
print(response.content)
print(response.content.decode("utf-8"))

A GET request with parameters and headers:

url = 'http://www.baidu.com'
headers={
        "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"
         }
data = {
    'name':'yzg',
    'age':'18'
}
response = requests.get(url,params=data,headers=headers)
print(response.url)
print(response.text)

POST data to a URL:

url = 'http://xxx'
data = {
    'name':'yzg',
    'age':'23'
    }
response = requests.post(url,data=data)
print(response.text)
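
For APIs that expect a JSON body, requests can also serialize the payload itself via the json= keyword; a minimal sketch (the URL is a placeholder, as above):

import requests

url = 'http://xxx'                         # placeholder endpoint
payload = {'name': 'yzg', 'age': '23'}
# json= serializes the dict and sets the Content-Type: application/json header
response = requests.post(url, json=payload)
print(response.status_code)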

Inspect the response after accessing a URL:

response = requests.get("http://www.baidu.com")
# print the status code of the requested page
print(type(response.status_code),response.status_code)
# print all headers of the requested URL
print(type(response.headers),response.headers)
# print the cookies of the requested URL
print(type(response.cookies),response.cookies)
# print the URL of the request
print(type(response.url),response.url)
# print the request history (shown as a list)
print(type(response.history),response.history)

Get cookies, which can be used to keep a session alive:

response = requests.get('https://www.baidu.com')
print(response.cookies)
for key,value in response.cookies.items():
    print(key,'==',value)

url = 'http://xxxx'
cookies = {'cookie_name_1': 'x', 'cookie_name_2': 'y'}   # placeholder cookie names and values
r = requests.get(url, cookies=cookies)
print(r.json())
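
For actual session retention, requests also provides a Session object that stores cookies between requests automatically; a minimal sketch (the URLs are placeholders):

import requests

# a Session keeps cookies set by earlier responses and sends them on later requests
session = requests.Session()
session.get('http://xxxx/login')           # placeholder: response cookies are stored in the session
r = session.get('http://xxxx/profile')     # placeholder: stored cookies are sent automatically
print(session.cookies.get_dict())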
  • lxml

lxml is a parsing library that supports HTML/XML parsing and XPath queries, and it parses very efficiently. XPath (XML Path Language) is a language for finding information in XML documents. It was originally designed for searching XML documents, but it also works well for searching HTML documents.

For more usage reference of XPath: http://www.w3school.com.cn/xpath/index.asp

For more usage reference of python lxml library: http://lxml.de/

Common rules of xpath:

Expression          Description
nodename            Selects all child nodes of the named node
/                   Selects direct children of the current node
//                  Selects descendant nodes of the current node
.                   Selects the current node
..                  Selects the parent of the current node
@                   Selects an attribute
*                   Wildcard; matches any element node
@*                  Selects all attributes
[@attrib]           Selects all elements that have the given attribute
[@attrib='value']   Selects all elements whose given attribute equals the given value
[tag]               Selects all elements that have a direct child element named tag
[tag='text']        Selects all elements with a child named tag whose text content is 'text'

Read the text and parse it into a node tree:

from lxml import etree

text='''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0"><a href="link5.html">a attribute</a>
     </ul>
 </div>
'''
html=etree.HTML(text)   # initialize an XPath parsing object; etree.HTML also fixes the unclosed <li>
result=etree.tostring(html,encoding='utf-8')   # serialize the parsed object back to HTML bytes
print(type(html))
print(type(result))
print(result.decode('utf-8'))
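
Building on the rules in the table above, a few example queries against the html object just parsed (a minimal sketch):

# text of all <a> elements under <li class="item-0">
print(html.xpath('//li[@class="item-0"]/a/text()'))
# href attributes of all <a> elements
print(html.xpath('//a/@href'))
# class of the parent <li> of the <a> whose href is "link2.html", via ..
print(html.xpath('//a[@href="link2.html"]/../@class'))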

Using the crawled URL, extract the text content of the <d> tags with XPath:

url = 'https://api.bilibili.com/x/v1/dm/list.so?oid=183896111'
headers={
        "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"
         }
response=requests.get(url,headers=headers)
html=etree.HTML(response.content)
d_list=html.xpath("//d//text()")   # each <d> tag holds one danmaku (bullet comment)
  • BeautifulSoup

Like lxml, BeautifulSoup is an HTML/XML parser. It is easier to use because it does not require knowledge of XPath. When parsing, BeautifulSoup loads the entire page into a DOM tree, so its memory overhead and parse time are relatively high, and it is not recommended for very large amounts of content. However, BeautifulSoup does not require well-structured markup, because it can find the tags we want directly, so it is better suited to pages with messy HTML.

For usage, refer to: https://www.crummy.com/software/BeautifulSoup/

from bs4 import BeautifulSoup
html = """
<html><head><title>haha,The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html,'lxml')
# print(soup.prettify())   # pretty-print the document
print(soup.title)  
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)  # the first <p> tag
print(soup.p["class"])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))
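
Since BeautifulSoup locates tags directly, attributes and text are easy to pull out of the matched tags; a minimal sketch using the soup object above:

# collect every link's href and display text from the parsed document
for a in soup.find_all('a'):
    print(a.get('href'), a.get_text())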
  • tqdm

tqdm is a Python progress bar: it adds a progress indicator to long-running Python loops. The user only needs to wrap any iterable with tqdm(iterable). When looping over crawled data and writing it to local files, tqdm can be used to display a progress bar:

# Method 1:
import time
from tqdm import tqdm  
for i in tqdm(range(100)):  
    time.sleep(0.01)

# Method 2:
import time
from tqdm import trange
for i in trange(100):
    time.sleep(0.01) 

You can set a description for the progress bar:

pbar = tqdm(["a", "b", "c", "d"])  
for char in pbar:  
    # set the description
    pbar.set_description("Processing %s" % char)
    time.sleep(1)
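
In a crawler, tqdm is often combined with a streamed download to show how much of a file has been written; a minimal sketch, assuming the server reports Content-Length (the URL is a placeholder):

import requests
from tqdm import tqdm

url = 'http://xxxx/video.flv'                    # placeholder download URL
response = requests.get(url, stream=True)
total = int(response.headers.get('Content-Length', 0))
with open('video.flv', 'wb') as f, tqdm(total=total, unit='B', unit_scale=True) as pbar:
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)
        pbar.update(len(chunk))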

  • ffmpy3

ffmpy3 is a fork of ffmpy, a simple FFmpeg command-line wrapper. It provides a Pythonic interface for running FFmpeg on the command line, using Python's subprocess module for synchronous execution.

import ffmpy3

# convert input.mp4 to output.avi; None means no extra FFmpeg options for that file
ff = ffmpy3.FFmpeg(
    inputs={'input.mp4': None},
    outputs={'output.avi': None}
)
ff.run()
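
Input and output options are passed as the dictionary values. As a sketch (file names are placeholders), merging a separately downloaded video and audio stream without re-encoding, which is common when crawling segmented video sites:

import ffmpy3

# two inputs, one output; '-c copy' copies both streams instead of re-encoding
ff = ffmpy3.FFmpeg(
    inputs={'video.m4s': None, 'audio.m4s': None},
    outputs={'merged.mp4': '-c copy'}
)
print(ff.cmd)   # show the generated ffmpeg command line
ff.run()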
  • matplotlib

matplotlib is Python's best-known plotting library. It provides a set of command-style APIs similar to MATLAB, which makes it very suitable for interactive plotting, and it can easily be embedded as a drawing control. seaborn is a wrapper built on top of matplotlib; matplotlib itself is lower level and exposes richer functionality. Refer to: https://matplotlib.org/

matplotlib.pyplot is a collection of command-style functions. Each pyplot function makes some change to a figure: creating a figure, creating a plotting area in the figure, adding a line to the plotting area, and so on. matplotlib.pyplot keeps state across function calls, so the current figure and plotting area are tracked automatically, and the plotting functions act directly on the current axes (a matplotlib term for one plot within the figure, not the mathematical notion of a coordinate axis).

# in a Jupyter notebook, %matplotlib inline displays figures inline
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot([2,4,7,18])
plt.ylabel('some numbers')
plt.show()

Figure: before drawing anything, we need a Figure object, which can be thought of as the canvas on which drawing happens.

%matplotlib inline
import matplotlib.pyplot as plt
fig = plt.figure()

Axes: after creating the Figure object, Axes need to be added to it; the figure below contains 3 axes (subplots):

fig = plt.figure()
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)
ax1.set(xlim=[0.5, 4.5], ylim=[-2, 8], title='An Example Axes',ylabel='Y-Axis', xlabel='X-Axis')
ax2.set(xlim=[0.5, 4.5], ylim=[-2, 8], title='An Example Axes',ylabel='Y-Axis', xlabel='X-Axis')
ax3.set(xlim=[0.5, 4.5], ylim=[-2, 8], title='An Example Axes',ylabel='Y-Axis', xlabel='X-Axis')
plt.show()

The figure and its grid of axes can also be created in a single call:

fig, axes = plt.subplots(nrows=2, ncols=2)
axes[0,0].set(title='Upper Left')
axes[0,1].set(title='Upper Right')
axes[1,0].set(title='Lower Left')
axes[1,1].set(title='Lower Right')

matplotlib provides line plots, scatter plots, histograms, distribution plots, pie charts, relationship plots, and more, which can be explored from here.
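
As a quick illustration (a minimal sketch using randomly generated data), a scatter plot and a histogram side by side:

import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(50)
y = np.random.rand(50)
data = np.random.randn(1000)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
ax1.scatter(x, y)                 # scatter plot
ax1.set(title='Scatter')
ax2.hist(data, bins=30)           # histogram
ax2.set(title='Histogram')
plt.show()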

  • seaborn

For seaborn data processing in combination with pandas, my earlier post covers detailed usage: Seaborn data visualization exploration (tips data set).
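
As a taste (a minimal sketch; load_dataset downloads the tips data set over the network):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')    # the classic tips data set
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='smoker')
plt.show()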

Origin blog.csdn.net/yezonggang/article/details/106662178