"Crawler" (web scraper) is a vivid name for a program that collects data automatically. Python, with its rich library ecosystem and gentle learning curve, is an excellent choice for writing crawlers. This article starts from the basics, explains Python crawling in plain terms, and shares some practical techniques. It uses a real website as the running example, walks through each processing step in depth, and shows the output, so you can get up to speed with Python crawlers quickly.
Before You Begin: Necessary Libraries
Python has many libraries that can be used to write crawlers, but we will focus on two here: requests and BeautifulSoup.
import requests
from bs4 import BeautifulSoup
The requests library is used to send HTTP requests, and the BeautifulSoup library (imported from the bs4 package) is used to parse the HTML in the HTTP responses.
Basic crawler: crawl all web content
Taking the official Python website (https://www.python.org/) as an example, a basic Python crawler might be written like this:
url = "https://www.python.org/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify()[:500])
This code fetches the web page and parses it with the BeautifulSoup library: requests.get(url) sends a GET request, and BeautifulSoup(response.text, 'html.parser') parses the HTML content of the HTTP response.
The first 500 characters of the output of this code are as follows:
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" dir="ltr" lang="en"> <!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="Python.org" name="application-name"/>
<meta content="The official home of the Python Programming Language"
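Before pointing the parser at a live site, it helps to see BeautifulSoup's basic lookups on a small, self-contained snippet. The HTML below is made up for illustration, so the example runs without a network request:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, so the example needs no network access.
html = ('<html><head><title>Demo</title></head>'
        '<body><p class="intro">Hello</p><p>World</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.text)                            # text of the <title> tag
print(soup.find('p', class_='intro').get_text())  # first <p> with class "intro"
print(len(soup.find_all('p')))                    # number of <p> tags
```

find, find_all, and select cover most extraction needs; prettify() is mainly useful for inspecting the parsed tree.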
Use CSS selectors to crawl specific elements
When we want to extract specific elements, we can use CSS selectors. For example, to get all the top-bar navigation links on the official Python website:
elements = soup.select('div.top-bar > ul > li > a')
for element in elements:
    print(element.get('href'), element.text)
Here, div.top-bar > ul > li > a is a CSS selector that matches the a elements inside li elements, inside a ul, inside a div whose class is top-bar. These a elements are the navigation links we want.
The partial output of this code is as follows:
/ Python
/psf-landing/ PSF
/docs/ Docs
/pypi/ PyPI
/jobs/ Jobs
/community-landing/ Community
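To see why the child combinators matter, here is the same selector applied to a made-up snippet that mimics a top bar; note that the link outside div.top-bar is not matched:

```python
from bs4 import BeautifulSoup

# Made-up markup mimicking a simple top bar.
html = """
<div class="top-bar">
  <ul>
    <li><a href="/docs/">Docs</a></li>
    <li><a href="/jobs/">Jobs</a></li>
  </ul>
</div>
<a href="/elsewhere/">Not selected</a>
"""
soup = BeautifulSoup(html, 'html.parser')
for a in soup.select('div.top-bar > ul > li > a'):
    print(a.get('href'), a.text)  # only the two top-bar links are printed
```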
Another HTML parsing technique: XPath
In addition to CSS selectors, another commonly used HTML parsing technique is XPath. XPath (XML Path Language) is a language for locating information in XML documents, and it can also be used to parse HTML documents.
The Python lxml
library provides XPath support:
from lxml import etree
html = '<div><a href="/a">A</a><a href="/b">B</a></div>'
root = etree.HTML(html)
links = root.xpath('//a/@href')
print(links)
In this code, we first define an HTML string. Then we use the etree.HTML() function to parse it into a DOM tree. Finally, we call the root.xpath() method with the expression //a/@href to extract the href attribute of every a element.
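XPath can also select text nodes and filter with predicates. Extending the snippet above (the markup is made up for illustration):

```python
from lxml import etree

html = '<div><a href="/a">A</a><a href="/b">B</a></div>'
root = etree.HTML(html)

texts = root.xpath('//a/text()')    # the text of every <a> element
first = root.xpath('//a[1]/@href')  # predicate [1]: only the first <a>
print(texts, first)
```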
Crawling absolute links
You may have noticed that the links in the output above are relative, not absolute. If we want absolute links, we can use the urljoin function:
from urllib.parse import urljoin
elements = soup.select('div.top-bar > ul > li > a')
for element in elements:
    absolute_url = urljoin(url, element.get('href'))
    print(absolute_url, element.text)
The partial output of this code is as follows:
https://www.python.org/ Python
https://www.python.org/psf-landing/ PSF
https://www.python.org/docs/ Docs
https://www.python.org/pypi/ PyPI
https://www.python.org/jobs/ Jobs
https://www.python.org/community-landing/ Community
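urljoin resolves an href against a base URL following standard URL-resolution rules, which is worth seeing on a few representative cases (the paths here are illustrative):

```python
from urllib.parse import urljoin

base = 'https://www.python.org/about/'

print(urljoin(base, '/docs/'))             # root-relative: https://www.python.org/docs/
print(urljoin(base, 'apps/'))              # relative: https://www.python.org/about/apps/
print(urljoin(base, 'https://pypi.org/'))  # already absolute: unchanged
```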
Dynamically loaded data crawling: Selenium
In many modern web pages, data may not be loaded all at once when the page loads, but dynamically loaded via JavaScript as the user interacts with the page. At this point, we may need to use another tool: Selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('https://www.python.org/')
element = driver.find_element(By.CSS_SELECTOR, 'div.top-bar > ul > li > a')
print(element.text)
driver.quit()
This code uses Selenium to simulate the behavior of the browser to obtain data dynamically loaded by JavaScript. In this example, we only get the text of the first link. In actual use, you may need to perform more complex operations according to your needs.
Crawler proxies
Using a proxy can help us hide our real IP address, so as to avoid IP blocking due to crawling too much data from the same website. Here is a simple piece of code using a proxy:
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
response = requests.get("https://www.python.org/", proxies=proxies)
Here, we define a proxy dictionary and pass it to the requests.get() function. Our requests are then routed through the proxy server, hiding our real IP address.
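For a long-running crawler it is often more convenient to set the proxies once on a requests.Session, so every request it sends reuses them. A minimal sketch (the proxy addresses are placeholders, and no request is actually sent):

```python
import requests

session = requests.Session()
session.proxies.update({
    "http": "http://10.10.1.10:3128",   # placeholder proxy addresses
    "https": "http://10.10.1.10:1080",
})
# Any session.get(...) call would now be routed through these proxies.
print(session.proxies["https"])
```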
Asynchronous crawler: improve crawler efficiency
When crawling a large amount of data, we usually need to make multiple HTTP requests. If each request waits for the previous request to complete, the efficiency will be very low. At this point, we can use Python's asynchronous IO library asyncio
and aiohttp
to improve efficiency. Here is a simple example:
import asyncio
import aiohttp
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://python.org')
        print(html[:500])

asyncio.run(main())
In this code, we first define an asynchronous fetch function that sends an HTTP request and returns the response body. Then, in main, we create an HTTP session and use it to send the request. Finally, we run main on the event loop.
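The real benefit appears when many requests run concurrently via asyncio.gather. The sketch below simulates the network call with asyncio.sleep so it runs offline; in a real crawler the body of fake_fetch would await session.get() instead:

```python
import asyncio
import time

async def fake_fetch(url):
    await asyncio.sleep(0.1)  # stand-in for network latency
    return f"<html>{url}</html>"

async def main():
    urls = [f"http://example.com/page{i}" for i in range(5)]
    # All five "requests" run concurrently, so the total time is close to
    # a single request's latency rather than five times that.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

start = time.perf_counter()
pages = asyncio.run(main())
print(len(pages), time.perf_counter() - start < 0.5)
```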
Crawler framework: Scrapy
Although the basic functions of the crawler can be achieved using the above methods, we may need a more powerful tool when dealing with more complex crawling tasks. Scrapy is a powerful crawler framework implemented in Python, which provides us with many advanced functions, such as concurrent requests, data processing and storage, etc.
Here is an example of a simple Scrapy crawler:
import scrapy
class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://python.org']

    def parse(self, response):
        self.log('Visited %s' % response.url)
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
In this code, we define a crawler class that inherits from scrapy.Spider. The class defines the crawler's name, the starting URLs, and a method for parsing responses. Scrapy automatically handles sending requests and receiving responses for us; we only need to care about extracting data from the responses.
Automated tasks: scheduled crawlers
Sometimes we need to run crawling tasks on a schedule, such as crawling a website's data once a day. Python's schedule library can help us achieve this:
import schedule
import time
def job():
    print("I'm working...")

schedule.every(10).seconds.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
In this code, we first define a crawler task, job. Then we use schedule.every(10).seconds.do(job) to run the task every 10 seconds. Finally, an infinite loop keeps checking for and executing any pending tasks.
Crawler Ethics: Comply with robots.txt
When crawling, we need to respect the website's robots.txt rules. robots.txt is a text file stored in the root directory of a website that tells crawlers which pages may be crawled and which may not.
Python's urllib.robotparser module can help us parse robots.txt:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('http://www.python.org/robots.txt')
rp.read()
can_fetch = rp.can_fetch('*', 'http://www.python.org/')
print(can_fetch)
In this code, we first create a RobotFileParser object, set the URL of robots.txt with the set_url method, and read and parse the file with the read method. Finally, the can_fetch method tells us whether our crawler is allowed to fetch the specified URL.
Note that not all sites have a robots.txt, and not all crawlers strictly adhere to one. Beyond respecting robots.txt, we should also minimize a crawler's impact on the website, for example by limiting the crawl rate and avoiding peak-traffic periods.
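RobotFileParser can also parse rules supplied as in-memory lines, which makes it easy to experiment without fetching a real file. The policy below is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
])
print(rp.can_fetch("*", "https://example.com/docs/"))      # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```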
Summary
To sum up, although Python crawling involves many techniques and details, mastering the basics and a few practical skills is enough to handle most crawling tasks. In the future, I will continue to share more Python crawler knowledge and techniques.
If you found this helpful, please follow my personal WeChat public account: [Python full perspective] TeahLead_KrisChang. 10+ years of experience in the Internet and artificial intelligence industries, 10+ years of technology and business team management experience, Tongji University Software Engineering Bachelor, Fudan University Engineering Management Master, Alibaba Cloud certified senior cloud architect, head of an AI product business with hundreds of millions in revenue.