Python crawlers in practice: taming the flood of data and revealing the depths of web pages

"Crawler" is a vivid name for the process of automatically collecting data from the web. Python, thanks to its rich library ecosystem and ease of use, is an excellent language for writing crawlers. This article starts from the basics, explains Python crawling in plain terms, and shares some practical techniques. Using a real website as the example, it walks through each processing step in depth and shows the output, so that you can quickly get up to speed with Python crawlers.

Before You Begin: Necessary Libraries

Python has many libraries that can be used to write crawlers, but we will focus on two here: requests and BeautifulSoup.

import requests
from bs4 import BeautifulSoup

The requests library is used to send HTTP requests, and the BeautifulSoup library is used to parse the HTML in the HTTP response.

Basic crawler: crawling a page's full content

Taking the official Python website (https://www.python.org/) as an example, a basic Python crawler might be written like this:

url = "https://www.python.org/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify()[:500])

The purpose of this code is to fetch the page content and parse it with the BeautifulSoup library: requests.get(url) sends a GET request, while BeautifulSoup(response.text, 'html.parser') parses the HTML in the HTTP response.

The first 500 characters of the output of this code are as follows:

<!DOCTYPE html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" dir="ltr" lang="en">  <!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="Python.org" name="application-name"/>
<meta content="The official home of the Python Programming Language" 

Use CSS selectors to crawl specific elements

When we want to get specific elements, we can use CSS selectors. For example, to get all the links in the top bar of the Python official website:

elements = soup.select('div.top-bar > ul > li > a')
for element in elements:
    print(element.get('href'), element.text)

Here, div.top-bar > ul > li > a is a CSS selector that selects the a elements under the li elements in the ul inside the div whose class is top-bar. These a elements are the top-bar links we want.

The partial output of this code is as follows:

/ Python
/psf-landing/ PSF
/docs/ Docs
/pypl/ PyPI
/jobs/ Jobs
/community-landing/ Community
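
BeautifulSoup supports most common CSS selector patterns beyond the one above. A small sketch of a few others against the same soup object (the selectors assume the current python.org markup, which may change):

# select_one returns the first match instead of a list
print(soup.select_one('title').text)

# attribute selectors also work, e.g. links whose href starts with /downloads
for a in soup.select('a[href^="/downloads"]')[:3]:
    print(a.get('href'), a.text.strip())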

Another way to parse HTML: XPath

In addition to CSS selectors, another commonly used HTML parsing technique is XPath. XPath (XML Path Language) is a language for locating information in XML documents, and it can also be used to parse HTML documents.

The Python lxml library provides XPath support:

from lxml import etree

html = '<div><a href="/a">A</a><a href="/b">B</a></div>'
root = etree.HTML(html)

links = root.xpath('//a/@href')
print(links)

In this code, we first define an HTML string. Then we use the etree.HTML() function to parse this string into a DOM tree. Finally, we use the root.xpath() method to extract all the links.
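
XPath can express the same kind of query as the CSS selector used earlier. A sketch that applies it to the python.org response fetched above (the path assumes the same top-bar structure, which may change):

root = etree.HTML(response.text)

# rough XPath equivalent of the CSS selector 'div.top-bar > ul > li > a'
for a in root.xpath('//div[contains(@class, "top-bar")]/ul/li/a'):
    print(a.get('href'), a.text)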

Crawling absolute links

You may have noticed that the links in the output above are relative links, not absolute links. If we want absolute links, we can use the urljoin function:

from urllib.parse import urljoin

elements = soup.select('div.top-bar > ul > li > a')
for element in elements:
    absolute_url = urljoin(url, element.get('href'))
    print(absolute_url, element.text)

The partial output of this code is as follows:

https://www.python.org/ Python
https://www.python.org/psf-landing/ PSF
https://www.python.org/docs/ Docs
https://www.python.org/pypl/ PyPI
https://www.python.org/jobs/ Jobs
https://www.python.org/community-landing/ Community
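
Note that urljoin only prepends the base URL when the href is relative; absolute hrefs pass through unchanged, so the same loop works on pages that mix both kinds of links:

print(urljoin("https://www.python.org/", "/docs/"))             # https://www.python.org/docs/
print(urljoin("https://www.python.org/", "https://pypi.org/"))  # https://pypi.org/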

Dynamically loaded data crawling: Selenium

In many modern web pages, data may not be loaded all at once when the page loads, but dynamically loaded via JavaScript as the user interacts with the page. At this point, we may need to use another tool: Selenium.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('https://www.python.org/')

# Selenium 4 removed find_element_by_css_selector; use find_element with By instead
element = driver.find_element(By.CSS_SELECTOR, 'div.top-bar > ul > li > a')
print(element.text)
driver.quit()

This code uses Selenium to simulate the behavior of the browser to obtain data dynamically loaded by JavaScript. In this example, we only get the text of the first link. In actual use, you may need to perform more complex operations according to your needs.
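
When the content you need is injected by JavaScript after the initial page load, it is safer to wait explicitly for the element rather than reading it immediately. A sketch using Selenium's explicit waits (the selector is the same assumption about the page structure as above):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get('https://www.python.org/')
    # wait up to 10 seconds for the element to appear in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div.top-bar > ul > li > a'))
    )
    print(element.text)
finally:
    driver.quit()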

Crawler proxies

Using a proxy can help us hide our real IP address, so as to avoid IP blocking due to crawling too much data from the same website. Here is a simple piece of code using a proxy:

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

response = requests.get("https://www.python.org/", proxies=proxies)

Here, we define a proxy dictionary and pass it to the requests.get() function. This way, our requests are sent through the proxy server, hiding our real IP address.
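
If more than one proxy is available, a common pattern is to rotate among them so requests are spread across several exit IPs. A minimal sketch, assuming a hypothetical pool of proxy addresses:

import random

# hypothetical proxy pool; replace with proxies you actually control
proxy_pool = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]

proxy = random.choice(proxy_pool)
response = requests.get("https://www.python.org/",
                        proxies={"http": proxy, "https": proxy},
                        timeout=10)
print(response.status_code)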

Asynchronous crawler: improve crawler efficiency

When crawling a large amount of data, we usually need to make many HTTP requests. If each request waits for the previous one to complete, efficiency will be very low. In this case, we can use Python's asynchronous IO libraries asyncio and aiohttp to improve efficiency. Here is a simple example:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://python.org')
        print(html[:500])

asyncio.run(main())

In this code, we first define an asynchronous fetch function that sends an HTTP request and returns the response body. Then, in main, we create an HTTP session and use it to send the request. Finally, we run the main coroutine with asyncio.run().
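
The example above fetches only one URL, so the benefit of asynchrony is not yet visible; the gain comes from issuing many requests concurrently. A minimal sketch using asyncio.gather (the URL list is purely illustrative):

async def crawl_many(urls):
    async with aiohttp.ClientSession() as session:
        # schedule all requests at once and wait for them together
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
        for url, html in zip(urls, pages):
            print(url, len(html))

asyncio.run(crawl_many([
    'https://www.python.org/',
    'https://www.python.org/docs/',
    'https://www.python.org/jobs/',
]))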

Crawler framework: Scrapy

Although the basic functions of the crawler can be achieved using the above methods, we may need a more powerful tool when dealing with more complex crawling tasks. Scrapy is a powerful crawler framework implemented in Python, which provides us with many advanced functions, such as concurrent requests, data processing and storage, etc.

Here is an example of a simple Scrapy crawler:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://python.org']

    def parse(self, response):
        self.log('Visited %s' % response.url)
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }

In this code, we define a crawler class that inherits from scrapy.Spider. The class defines the crawler's name, its starting URLs, and a method for parsing responses. Scrapy automatically handles sending requests and receiving responses; we only need to care about extracting data from the responses.
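
A Scrapy spider is normally not run like an ordinary script: you can save the class to a file and launch it with the scrapy command line (for example, scrapy runspider myspider.py -o items.json), or drive it from Python with CrawlerProcess. A minimal sketch of the latter, assuming the MySpider class above is in scope:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEEDS": {"items.json": {"format": "json"}},  # write scraped items to a JSON file
})
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes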

Automated tasks: scheduled crawlers

Sometimes we need to run crawling tasks on a schedule, for example crawling a website's data once a day. Python's schedule library can help us do this:

import schedule
import time

def job():
    print("I'm working...")

schedule.every(10).seconds.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)

In this code, we first define a crawler task job. Then we use schedule.every(10).seconds.do(job) to set the task's execution interval. Finally, we use an infinite loop to keep running any pending tasks.
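
The job above only prints a message every 10 seconds. For the "once a day" scenario mentioned earlier, the task can wrap an actual crawl and be scheduled daily; the time of day and the print statement below are placeholders for your own logic:

def crawl_job():
    response = requests.get("https://www.python.org/", timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    print("Fetched title:", soup.title.text)  # replace with real storage logic

schedule.every().day.at("02:00").do(crawl_job)  # run once a day at 02:00

while True:
    schedule.run_pending()
    time.sleep(60)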

Crawler Ethics: Comply with robots.txt

When crawling, we need to respect a website's robots.txt rules. robots.txt is a text file stored in the root directory of a website that tells crawlers which pages may be crawled and which may not.

Python's urllib.robotparser module can help us parse robots.txt:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.python.org/robots.txt')
rp.read()

can_fetch = rp.can_fetch('*', 'http://www.python.org/')
print(can_fetch)

In this code, we first create a RobotFileParser object, then use the set_url method to set the URL of robots.txt and the read method to download and parse it. Finally, we use the can_fetch method to determine whether our crawler may crawl the specified URL.

Note that not all sites have a robots.txt, and not all crawlers strictly adhere to it. When crawling a website, in addition to respecting robots.txt, we should also try to minimize the crawler's impact on the site, for example by limiting the crawl rate and avoiding times when the site's traffic is high.
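
urllib.robotparser can also report the crawl delay a site requests, which ties in with the advice above about limiting frequency. A sketch that combines both checks (not every robots.txt declares a Crawl-delay, so the value may be None):

import time

delay = rp.crawl_delay('*')  # None if robots.txt specifies no Crawl-delay

for path in ['/', '/downloads/', '/jobs/']:
    page_url = urljoin('https://www.python.org/', path)
    if rp.can_fetch('*', page_url):
        response = requests.get(page_url, timeout=10)
        print(page_url, response.status_code)
    time.sleep(delay if delay else 1)  # be polite between requests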

Summary

To sum up, although Python crawling involves many techniques and details, mastering the basics and a few practical skills is enough to handle most crawling tasks. In the future, I will continue to share more Python crawler knowledge and techniques.

If this article is helpful, please follow my personal WeChat public account: [Python full perspective] TeahLead_KrisChang. The author has 10+ years of experience in the Internet and artificial intelligence industries, 10+ years of experience managing technology and business teams, a Bachelor's degree in Software Engineering from Tongji University, a Master's degree in Engineering Management from Fudan University, is an Aliyun certified senior cloud architect, and heads an AI product business with hundreds of millions in revenue.
