When writing crawler code in Python, several libraries beyond requests and beautifulsoup4 are worth knowing. The following are some commonly used ones:
- Scrapy: a full-featured Python crawler framework that provides powerful crawling tools and convenient data-processing facilities, letting you build efficient crawlers quickly.
- Selenium: an automated testing tool that can drive a real browser. For websites that require simulated logins, clicks, and other interactions, Selenium is very useful.
- PyQuery: a jQuery-like library for manipulating HTML documents with CSS-style selectors, which is very convenient.
- lxml: a fast XML processing library for Python that can also parse HTML documents.
- requests-html: a library built on requests and lxml that makes parsing HTML documents easy, with support for JavaScript rendering and CSS selectors.
- pandas: a Python data-processing library that makes it easy to clean, organize, and analyze data; very useful for handling the data a crawler collects.
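As a small illustration of the parsing side, here is a minimal lxml sketch (the HTML snippet and variable names are invented for the example):

```python
from lxml import etree

# A tiny inline HTML document, made up for this example
html = "<html><body><h1>Title</h1><p class='intro'>Hello</p></body></html>"

# Parse the HTML string and query it with XPath expressions
tree = etree.HTML(html)
title = tree.xpath("//h1/text()")[0]                  # text of the <h1>
intro = tree.xpath("//p[@class='intro']/text()")[0]   # text of the intro paragraph
```

PyQuery offers a similar workflow, but with CSS selectors such as `doc("p.intro")` instead of XPath.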
Installation:
Run pip install followed by the library name in the terminal, for example:
pip install scrapy
Several libraries can also be installed in one command:
pip install scrapy selenium pyquery lxml requests-html pandas
Here is a code example that imports all of the libraries above:
import scrapy
from selenium import webdriver
from pyquery import PyQuery as pq
from lxml import etree
from requests_html import HTMLSession
import pandas as pd
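To show the kind of post-crawl cleanup pandas is good at, here is a minimal sketch; the sample records are made up and stand in for data a crawler might return:

```python
import pandas as pd

# Hypothetical records scraped by a crawler (note the duplicate entry)
rows = [
    {"title": "Post A", "views": "1,024"},
    {"title": "Post B", "views": "512"},
    {"title": "Post A", "views": "1,024"},
]

# Build a DataFrame and drop exact duplicate rows
df = pd.DataFrame(rows).drop_duplicates()

# Normalize the string column "1,024" -> 1024 as integers
df["views"] = df["views"].str.replace(",", "").astype(int)
```

From here the cleaned data can be analyzed or exported, e.g. with `df.to_csv(...)`.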