[100 days proficient in python] Day 41: Python web crawler development: Introduction to crawler basics

Table of contents

 Column guide 

1 Overview of web crawlers

1.1 Working principle

1.2 Application scenarios

1.3 Crawler strategy

1.4 Crawler challenges

2 Web crawler development

2.1 Basic process of general web crawler

2.2 Common technologies of web crawlers

2.3 Third-party libraries commonly used by web crawlers

3 Simple crawler example


 Column guide 

Column subscription address: https://blog.csdn.net/qq_35831906/category_12375510.html

1 Overview of web crawlers

        A web crawler, also known as a web spider or web robot, is an automated program that browses the Internet and collects information. Crawlers can traverse web pages, gather data, and extract information for further processing and analysis. Web crawlers play an important role in search engines, data collection, information monitoring, and other fields.

1.1 Working principle

  1. Initial URL selection : The crawler starts with one or more initial URLs, which are usually the home page or other pages of the website you wish to begin crawling.

  2. Send HTTP request : For each initial URL, the crawler sends an HTTP request to get the web page content. Requests can include different HTTP methods such as GET and POST, and can also set request headers, parameters, and Cookies.

  3. Receive HTTP response : The server returns an HTTP response containing the web page's HTML code, along with references to other resources such as images, CSS, and JavaScript.

  4. Parsing webpage content : The crawler uses an HTML parsing library (such as Beautiful Soup or lxml) to parse the received HTML code and convert it into a Document Object Model (DOM) structure.

  5. Data extraction and processing : Through the DOM structure, crawlers extract the required information from web pages, such as titles, texts, links, pictures, etc. This can be achieved through CSS selectors, XPath, etc.

  6. Store data : The crawler stores the extracted data in local files, databases or other storage systems for subsequent analysis and use.

  7. Discovery of new links : When parsing web pages, the crawler will find new links and add them to the queue of URLs to be crawled so that it can continue to crawl more pages.

  8. Repeat process : The crawler loops through the steps above: take a URL from the queue, send the request, receive the response, parse the web page, extract information, process and store the data, and discover new links, until the crawling task is complete.

  9. Control and maintenance : Crawlers need to set appropriate request frequency and delay to avoid excessive load on the server. You also need to monitor the running status of the crawler and handle errors and exceptions.
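Putting the steps above together, the whole loop can be sketched in a few lines of Python. This is only a minimal illustration: the start URL, the page limit, and the simple list-based queue are assumptions made for the example, not requirements of any real site.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = 'https://example.com'  # illustrative start URL (step 1)
to_visit = [start_url]             # queue of URLs waiting to be crawled
visited = set()

while to_visit and len(visited) < 10:      # small limit keeps the sketch bounded
    url = to_visit.pop(0)
    if url in visited:
        continue
    try:
        response = requests.get(url, timeout=10)   # step 2: send HTTP request
    except requests.RequestException:
        continue                                   # step 9: handle errors and move on
    visited.add(url)
    soup = BeautifulSoup(response.text, 'html.parser')   # step 4: parse the HTML
    title = soup.title.text if soup.title else ''        # step 5: extract data
    print(url, '->', title)
    for a in soup.find_all('a', href=True):              # step 7: discover new links
        to_visit.append(urljoin(url, a['href']))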

1.2 Application scenarios

  • Search engines : Search engines use crawlers to crawl web content and build indexes so that users can quickly find relevant information when searching.

  • Data collection : Enterprises, research institutions, etc. can use crawlers to collect data from the Internet for market analysis, public opinion monitoring, etc.

  • News aggregation : Crawlers can grab news titles, summaries, etc. from various news websites for news aggregation platforms.

  • Price comparison : E-commerce websites can use crawlers to grab competitors' product prices and information for price comparison analysis.

  • Scientific analysis : Researchers can use crawlers to obtain scientific literature, academic papers and other information.

1.3 Crawler strategy

        General Crawler and Focused Crawler are two different web crawling strategies used to obtain information from the Internet. They work in different ways and serve different purposes.

General Crawler: A general-purpose crawler aims to traverse as many web pages as possible on the Internet in order to collect and index as much information as possible. It starts from one or more seed URLs and discovers more pages through link following and recursive crawling, building a broad web page index.

Features of a general crawler:

  • The goal is to collect as much information as possible.
  • Starts from one or more seed URLs and expands by following links.
  • Suitable for search engines and large data indexing projects.
  • The robots.txt file and anti-crawler mechanism of the website need to be considered.

Focused Crawler: A focused crawler concentrates on a specific field or topic and selectively crawls web pages related to that topic. Unlike a general-purpose crawler, a focused crawler only visits pages that serve a specific need, such as public opinion analysis or news aggregation.

Features of a focused crawler:

  • Focus on a specific topic or field.
  • Selectively crawl web pages according to specific keywords, content rules, etc.
  • Suitable for customized needs, such as public opinion monitoring, news aggregation, etc.
  • Information in specific fields can be obtained more precisely.

In practical applications, general-purpose crawlers and focused crawlers have their own advantages and uses. General-purpose crawlers are suitable for building comprehensive search engine indexes, as well as for large-scale data analysis and mining. Focused crawlers are more suitable for customized needs, and can obtain accurate information for specific fields or topics.
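As a rough illustration of the difference, a focused crawler might only keep links whose anchor text or URL matches its topic keywords. The keyword list below is a made-up example, not part of any real project:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

KEYWORDS = ['python', 'crawler']  # hypothetical topic keywords

def filter_relevant_links(base_url, html):
    """Keep only links that look relevant to the topic (focused crawling)."""
    soup = BeautifulSoup(html, 'html.parser')
    relevant = []
    for a in soup.find_all('a', href=True):
        text = a.get_text().lower()
        href = urljoin(base_url, a['href'])
        if any(kw in text or kw in href.lower() for kw in KEYWORDS):
            relevant.append(href)
    return relevant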

1.4 Crawler challenges

  • Changes in website structure : The structure and content of the website may change at any time, and crawlers need to be adjusted and updated.

  • Anti-crawler mechanism : Some websites have adopted anti-crawler measures, such as limiting request frequency and using verification codes.

  • Data cleaning : Data extracted from web pages may contain noise and need to be cleaned and organized.

  • Legal and ethical issues : Crawlers need to comply with laws and regulations, respect website rules, and must not abuse data or infringe on the rights and interests of others.

        Summary : A web crawler is an automated program used to obtain information from the Internet. It collects and organizes data through steps such as sending requests, parsing web pages, and extracting information. In different application scenarios, crawlers play an important role, but they also need to face various challenges and compliance issues.

2 Web crawler development

2.1 Basic process of general web crawler

2.2 Common technologies of web crawlers

     A web crawler is an automated program used to collect data from the Internet. Commonly used web crawler technologies and third-party libraries include the following:

1. Request and response processing:

  • Requests: A library for sending HTTP requests and processing responses, which facilitates crawlers to obtain web content.
  • httpx: Similar to Requests, supports both synchronous and asynchronous requests, and is suitable for high-performance crawlers.
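A minimal sketch of sending a request with either library (the URL is only a placeholder):

import requests
import httpx

url = 'https://example.com'  # placeholder URL

# Requests: synchronous GET with a timeout
resp = requests.get(url, timeout=10)
print(resp.status_code, len(resp.text))

# httpx: the same synchronous-style call; it also offers an async client
resp = httpx.get(url, timeout=10)
print(resp.status_code, len(resp.text))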

2. Parse and extract data:

  • Beautiful Soup: Used to parse HTML and XML documents and provide easy methods to extract the required data.
  • lxml: A high-performance HTML and XML parsing library that supports XPath and CSS selectors.
  • PyQuery: A jQuery-based parsing library that supports CSS selectors.
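A small sketch of the two main parsing styles, run against a tiny hard-coded HTML snippet:

from bs4 import BeautifulSoup
from lxml import html

page = '<html><body><h1>Hello</h1><a href="/a">link</a></body></html>'

# Beautiful Soup: tag access and CSS selectors
soup = BeautifulSoup(page, 'html.parser')
print(soup.h1.text)                            # -> Hello
print([a['href'] for a in soup.select('a')])   # -> ['/a']

# lxml: XPath expressions
tree = html.fromstring(page)
print(tree.xpath('//h1/text()'))               # -> ['Hello']
print(tree.xpath('//a/@href'))                 # -> ['/a']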

3. Dynamically render web pages:

  • Selenium: An automated browser library for handling dynamically rendered web pages, such as JavaScript loading content.
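A minimal Selenium sketch; it assumes a local Chrome installation, and recent Selenium versions fetch the matching driver automatically:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()               # opens a real browser
try:
    driver.get('https://example.com')     # the browser executes any JavaScript
    heading = driver.find_element(By.TAG_NAME, 'h1')
    print(heading.text)
    print(driver.page_source[:200])       # fully rendered HTML, not the raw source
finally:
    driver.quit()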

4. Asynchronous processing:

  • asyncio and aiohttp: used to process requests asynchronously and improve the efficiency of crawlers.
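A short asyncio + aiohttp sketch that fetches several placeholder URLs concurrently:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return url, resp.status

async def main():
    urls = ['https://example.com', 'https://example.org']  # placeholder URLs
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
    for url, status in results:
        print(url, status)

asyncio.run(main())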

5. Data storage:

  • SQLite, MySQL, MongoDB: The database is used to store and manage crawled data.
  • CSV, JSON: Simple formats for exporting and importing data.
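A small sketch of storing one made-up record in SQLite and exporting it to CSV; the table name and file names are arbitrary choices for the example:

import csv
import sqlite3

rows = [('Example title', 'https://example.com')]  # pretend crawled data

# SQLite: store the rows in a small local database file
conn = sqlite3.connect('crawl.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (title TEXT, url TEXT)')
conn.executemany('INSERT INTO pages VALUES (?, ?)', rows)
conn.commit()
conn.close()

# CSV: export the same rows to a plain-text file
with open('pages.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'url'])
    writer.writerows(rows)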

6. Anti-crawler and IP proxy:

  • User-Agent setting: Set the User-Agent header of the request to simulate different browsers and operating systems.
  • Proxy server: Use proxy IP to hide real IP address and avoid IP ban.
  • Captcha processing: Use captcha recognition technology to process websites that require captchas.
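Setting a User-Agent header and routing the request through a proxy with Requests might look like this; the proxy address here is purely hypothetical:

import requests

# A browser-like User-Agent makes the request look like ordinary traffic
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0 Safari/537.36'
}

# Hypothetical proxy; replace with a real proxy server if you use one
proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
}

resp = requests.get('https://example.com', headers=headers,
                    proxies=proxies, timeout=10)
print(resp.status_code)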

7. Robots.txt and site policy compliance:

  • robots.txt: Check the website's robots.txt file and follow the crawling rules it declares.
  • Crawler delay: Set the delay of crawler requests to avoid excessive load on the server.
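The standard library's urllib.robotparser can check robots.txt before fetching, and a simple time.sleep adds the delay; the site and user agent string below are placeholders:

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')   # placeholder site
rp.read()

url = 'https://example.com/some/page'
if rp.can_fetch('MyCrawler/1.0', url):         # placeholder user agent
    print('Allowed to fetch:', url)
    time.sleep(1)                              # fixed 1-second delay between requests
else:
    print('Disallowed by robots.txt:', url)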

8. Crawler framework:

  • Scrapy: A powerful crawler framework that provides many functions to organize the crawling process.
  • Splash: A JavaScript rendering service, suitable for handling dynamic web pages.
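A minimal Scrapy spider, in the style of the official tutorial and pointed at the public practice site quotes.toscrape.com, gives a feel for the framework:

import scrapy

class QuotesSpider(scrapy.Spider):
    """Minimal spider; run with: scrapy runspider quotes_spider.py -o quotes.json"""
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # Yield one item per quote on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # Follow the pagination link, if any
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)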

2.3 Third-party libraries commonly used by web crawlers

        Web crawlers use a variety of technologies and third-party libraries to achieve data acquisition, parsing and processing of web pages. The following are commonly used technologies and third-party libraries for web crawlers:

1. Request library: The core of a web crawler is to send HTTP requests and process responses. Here are some commonly used request libraries:

  • Requests: An easy-to-use HTTP library for sending HTTP requests and processing responses.
  • httpx: A modern HTTP client that supports asynchronous and synchronous requests.

2. Parsing library: The parsing library is used to extract the required data from HTML or XML documents.

  • Beautiful Soup: A library for extracting data from HTML and XML documents, supporting flexible querying and parsing.
  • lxml: A high-performance XML and HTML parsing library that supports both XPath and CSS selectors.

3. Storage library: Storing the crawled data is an important part of a crawler.

  • SQLAlchemy: A powerful SQL toolkit for manipulating relational databases in Python.
  • Pandas: A data analysis library that can be used for data cleaning and analysis.
  • MongoDB: A non-relational database, suitable for storing and processing large amounts of unstructured data.
  • SQLite: A lightweight embedded relational database.
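A small sketch of handing crawled records to Pandas and writing them to SQLite through SQLAlchemy; the records and table name are invented for the example:

import pandas as pd
from sqlalchemy import create_engine

# Pretend results extracted by a crawler
data = [
    {'title': 'Example title', 'url': 'https://example.com'},
    {'title': 'Another title', 'url': 'https://example.org'},
]

df = pd.DataFrame(data)        # Pandas: inspect and clean the data
print(df.head())

engine = create_engine('sqlite:///crawl.db')    # SQLAlchemy engine over SQLite
df.to_sql('pages', engine, if_exists='append', index=False)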

4. Asynchronous library: Using asynchronous requests can improve the efficiency of crawlers.

  • asyncio: Python's asynchronous IO library for writing asynchronous code.
  • aiohttp: An asynchronous HTTP client that supports asynchronous requests.

5. Dynamic rendering processing: Some web pages use JavaScript for dynamic rendering, which needs to be processed by the browser engine.

  • Selenium: An automated browser manipulation library for handling JavaScript-rendered pages.

6. Dealing with anti-crawler techniques: Some websites take anti-crawler measures, which require certain techniques to work around.

  • Proxy pool: Use proxy IPs so that frequent access from a single IP address does not get the crawler blocked.
  • User-Agent Randomization: Change the User-Agent to emulate different browsers and operating systems.
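A sketch of rotating the User-Agent and proxy on every request; both pools below are tiny hypothetical examples, while real crawlers usually load them from configuration files or proxy services:

import random
import requests

# Hypothetical pools of User-Agent strings and proxy addresses
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/114.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) Safari/605.1.15',
]
PROXIES = ['http://127.0.0.1:8080', 'http://127.0.0.1:8081']

def rotating_get(url):
    """Pick a random User-Agent and proxy for each request."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy},
                        timeout=10)

resp = rotating_get('https://example.com')
print(resp.status_code)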

These are just some of the techniques and third-party libraries commonly used by web crawlers. According to actual project requirements, you can choose appropriate technologies and tools to build an efficient and stable web crawler.


3 Simple crawler example

 Create a simple crawler, such as crawling text information on a static web page and outputting it.

import requests
from bs4 import BeautifulSoup

# Send a GET request to fetch the page content
url = 'https://www.baidu.com'
response = requests.get(url)
response.encoding = 'utf-8'  # specify UTF-8 encoding
html_content = response.text

# Parse the HTML content with Beautiful Soup
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the page title
title = soup.title.text

# Extract the paragraph contents
paragraphs = soup.find_all('p')
paragraph_texts = [p.text for p in paragraphs]

# Print the results
print("Title:", title)
print("Paragraphs:")
for idx, paragraph in enumerate(paragraph_texts, start=1):
    print(f"{idx}. {paragraph}")
