[100 Days Proficient in Python] Day 44: Python web crawler development_crawler basics (crawler data storage: basic file storage, MySQL, NoSQL: MongoDB, Redis database storage + hands-on code)

Table of contents

1 Data Storage

1.1 Crawler Storage: Basic File Storage

1.2 Crawler storage: use MySQL database

1.3 Use of crawler NoSQL database

1.3.1 Introduction to MongoDB

1.3.2 MongoDB usage

1.3.3 Crawler storage: use MongoDB database

1.4 Use of Redis database

1.4.1 Main Features

1.4.2 Common uses

1.4.3 Using Redis database

1.4.4 Crawler storage: use Redis database

2 Example of actual web crawler data storage

3 Actual crawling of book information on a web page


1 Data Storage

1.1 Crawler Storage: Basic File Storage

   In a crawler, you can store the scraped data in different formats, such as plain text files (txt), JSON files, and CSV files.

Example of text file storage:

with open('data.txt', 'w') as file:
    file.write('Hello, World!')

Example of JSON file storage:

import json

data = {'name': 'John', 'age': 30}
with open('data.json', 'w') as file:
    json.dump(data, file)

Example of CSV file storage:

import csv

data = [['name', 'age'], ['John', 30], ['Alice', 25]]
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

1.2 Crawler storage: use MySQL database

        MySQL is a relational database management system. In Python, you can use the mysql.connector module to connect to and operate a MySQL database.

        The basic steps for using MySQL in a crawler are: connect to the database, create a data table, and store the scraped data in that table. The following simple example demonstrates how to store scraped book information in a MySQL database.

Example: Store the captured book information in the MySQL database

First, make sure you have installed the mysql-connector-python library, which is the Python library for connecting to MySQL databases.

pip install mysql-connector-python

Then, you need to create a database named books in MySQL and a table named book_info to store the book information.

CREATE DATABASE books;
USE books;

CREATE TABLE book_info (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    author VARCHAR(255),
    price DECIMAL(10, 2)
);

Next, you can use the following code example to store the captured book information in the MySQL database: 

import requests
import mysql.connector
from bs4 import BeautifulSoup

# Fetch the target page
url = 'https://www.example.com/books'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Connect to the MySQL database
connection = mysql.connector.connect(
    host='localhost',
    user='username',
    password='password',
    database='books'
)
cursor = connection.cursor()

# Parse the page and store the data in the database
book_elements = soup.find_all('div', class_='book')
for book_element in book_elements:
    title = book_element.find('h2').text
    author = book_element.find('p', class_='author').text
    price = float(book_element.find('p', class_='price').text.replace('$', ''))
    
    # Insert one row of book data
    insert_query = "INSERT INTO book_info (title, author, price) VALUES (%s, %s, %s)"
    values = (title, author, price)
    cursor.execute(insert_query, values)

connection.commit()

# Close the connection
cursor.close()
connection.close()

print("Book data saved to MySQL database.")

This example demonstrates how to store the scraped book information in a MySQL database. In practice, you will need to adjust the database connection parameters, the page parsing logic, and the table structure to your own situation. You can also add mechanisms such as exception handling to make data storage more robust, as shown in the sketch below.
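
For instance, here is a minimal sketch of such exception handling, assuming the same books database and book_info table as above (the example rows are made up): failed inserts are rolled back so a partially written batch does not linger in the database.

import mysql.connector

# Assumed connection parameters; replace with your own.
connection = mysql.connector.connect(
    host='localhost', user='username', password='password', database='books'
)
cursor = connection.cursor()

insert_query = "INSERT INTO book_info (title, author, price) VALUES (%s, %s, %s)"
rows = [('Book A', 'Author A', 9.99), ('Book B', 'Author B', 19.99)]  # made-up example data

try:
    for row in rows:
        cursor.execute(insert_query, row)
    connection.commit()      # commit only if every insert succeeded
except mysql.connector.Error as err:
    connection.rollback()    # undo the partial batch on any error
    print(f"Insert failed, rolled back: {err}")
finally:
    cursor.close()
    connection.close()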

1.3 Use of crawler NoSQL database

        NoSQL (Not Only SQL) refers to non-relational databases that do not use traditional tables and relationships to store data. They are suitable for large-scale, high-performance, and distributed data storage.

1.3.1 Introduction to MongoDB

        MongoDB is an open source NoSQL database that stores data in the form of documents. It features flexible schema design, horizontal scalability, and high performance.

1.3.2 MongoDB usage

from pymongo import MongoClient

# Create the database connection
client = MongoClient('mongodb://localhost:27017/')

# Create or select the database and collection
db = client['mydb']
collection = db['mycollection']

# Insert a document
data = {'name': 'John', 'age': 30}
insert_result = collection.insert_one(data)
print(insert_result.inserted_id)

# Query a document
query = {'name': 'John'}
result = collection.find_one(query)
print(result)

# Update a document
update_query = {'name': 'John'}
new_values = {'$set': {'age': 31}}
collection.update_one(update_query, new_values)

# Delete a document
delete_query = {'name': 'John'}
collection.delete_one(delete_query)

# Close the connection
client.close()

1.3.3 Crawler storage: use MongoDB database

        The steps to use the MongoDB database in the crawler include connecting to the database, creating a collection (similar to a table), and storing the captured data in the collection. The following example demonstrates how to use the MongoDB database to store the crawled book information in the crawler.

Example: Store the captured book information in the MongoDB database

First, make sure you have installed the pymongo library, which is the Python library for connecting to MongoDB.

pip install pymongo

Then, you need to start the MongoDB server. The database (for example books_db) and collection (for example book_collection) do not have to be created in advance; MongoDB creates them automatically on the first insert.

Next, you can use the following code example to store the fetched book information in the MongoDB database:

import requests
import pymongo
from bs4 import BeautifulSoup

# Fetch the target page
url = 'https://www.example.com/books'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Connect to the MongoDB database
client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['books_db']
collection = db['book_collection']

# Parse the page and store the data in the collection
book_elements = soup.find_all('div', class_='book')
for book_element in book_elements:
    title = book_element.find('h2').text
    author = book_element.find('p', class_='author').text
    price = float(book_element.find('p', class_='price').text.replace('$', ''))
    
    # Insert one book document
    book_data = {'title': title, 'author': author, 'price': price}
    collection.insert_one(book_data)

print("Book data saved to MongoDB database.")

        This example demonstrates how to store the scraped book information in a MongoDB database. In practice, you will need to adjust the connection parameters, the page parsing logic, and the collection name to your own situation. You can also add exception handling or deduplication to make data storage more robust, as shown in the sketch below.
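
For example, here is a minimal sketch that makes repeated crawls idempotent by using an upsert keyed on the book title (treating the title as unique is just an assumption for illustration), together with basic pymongo error handling:

import pymongo
from pymongo.errors import PyMongoError

client = pymongo.MongoClient('mongodb://localhost:27017/')
collection = client['books_db']['book_collection']

book_data = {'title': 'Book A', 'author': 'Author A', 'price': 9.99}  # made-up example document

try:
    # Upsert: update the document matching this title, or insert it if it does not exist yet.
    collection.update_one(
        {'title': book_data['title']},
        {'$set': book_data},
        upsert=True
    )
except PyMongoError as err:
    print(f"MongoDB write failed: {err}")
finally:
    client.close()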

1.4 Use of Redis database

    Redis is commonly used for caching, data storage, and message queues. Its main features and typical uses are described below, along with a basic example of how to use Redis from Python.

1.4.1 Main Features

  1. In-memory database: Redis stores data in memory, so access is very fast.
  2. Multiple data structures: Redis supports multiple data structures, such as strings, hashes, lists, sets, ordered sets, etc., making it suitable for different data storage needs.
  3. Persistence: Redis can persist data in memory to disk to restore data after restart.
  4. Distributed: Redis supports distributed deployment, with mechanisms such as master-slave replication and sharding.
  5. High performance: Redis uses efficient data structures and storage engines, and has excellent read and write performance.
  6. Transaction support: Redis supports transactions, and multiple operations can be packaged into one transaction for execution to ensure atomicity.
  7. Publish/subscribe: Redis provides a publish/subscribe mechanism for implementing message publishing and subscription modes.
  8. Expiration time: You can set an expiration time for the key, after which the key will be automatically deleted.
  9. Lua scripts: Redis can execute Lua scripts, allowing complex operations to run atomically on the server side (see the sketch after this list).
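
As a quick illustration of the Lua scripting feature, here is a minimal sketch that increments a counter on the server side only while it is below a limit (the key name request_count and the limit 100 are made up for the example); because the script runs atomically inside Redis, the check and the increment cannot race:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# The Lua script runs atomically on the Redis server.
lua_script = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
if current < tonumber(ARGV[1]) then
    return redis.call('INCR', KEYS[1])
end
return current
"""

# eval(script, number_of_keys, key1, ..., arg1, ...)
result = r.eval(lua_script, 1, 'request_count', 100)
print("Counter value:", result)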

1.4.2 Common uses

  1. Cache: Redis is suitable for caching commonly used data to speed up data access.
  2. Session management: In web applications, Redis can be used to store user session information.
  3. Real-time statistics: Redis can be used for real-time statistics, counters and other functions.
  4. Message queue: Redis's publish/subscribe mechanism can implement a lightweight message queue.
  5. Leaderboard: Redis sorted sets can be used to implement leaderboards and similar rankings (see the sketch after this list).
  6. Distributed locks: Redis can be used to implement distributed locks to control concurrent access.
  7. Relieve database load: Store part of the query results in Redis to reduce the load on the database.
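
To illustrate the leaderboard use case, here is a minimal sketch using a Redis sorted set, where each member carries a numeric score (the key game:scores and the player names are made up for the example):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Add players with initial scores (member -> score mapping).
r.zadd('game:scores', {'alice': 120, 'bob': 95, 'carol': 150})

# Increase one player's score by 30.
r.zincrby('game:scores', 30, 'bob')

# Top 3 players, highest score first.
top_players = r.zrevrange('game:scores', 0, 2, withscores=True)
print("Leaderboard:", top_players)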

1.4.3 Using Redis database

Redis (Remote Dictionary Server) is an open-source, high-performance key-value store that is widely used for caching, data storage, message queues, counters, real-time analytics, and more.

1. Install the redis module

pip install redis

2. Example of use

import redis

# Create the connection
r = redis.Redis(host='localhost', port=6379, db=0)

# Store data
r.set('key', 'value')  # Store a string value
r.hset('user:1', 'name', 'John')  # Store a hash field
r.hset('user:1', 'age', 30)  # Store another hash field

# Retrieve data
value = r.get('key')
print("Value:", value.decode('utf-8'))

user_info = r.hgetall('user:1')
print("User Info:", user_info)

# Delete data
r.delete('key')
r.hdel('user:1', 'age')

# Set an expiration time
r.setex('key', 3600, 'value')  # Set the key with an expiration time in seconds

# Publish and subscribe
def subscriber():
    pubsub = r.pubsub()
    pubsub.subscribe(['channel'])  # Subscribe to the channel named 'channel'
    for message in pubsub.listen():
        print("Received:", message['data'])

def publisher():
    r.publish('channel', 'Hello, Subscribers!')  # Publish a message to the channel

# Transactions
def perform_transaction():
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch('balance')  # Watch the 'balance' key for changes
                balance = int(pipe.get('balance'))
                if balance >= 10:
                    pipe.multi()  # Start the transaction
                    pipe.decrby('balance', 10)  # Decrease the balance
                    pipe.incrby('expenses', 10)  # Increase the expenses
                    pipe.execute()  # Execute the transaction
                    break
                else:
                    print("Insufficient balance.")
                    break
            except redis.WatchError:
                continue

# Persistence
r.save()  # Manually trigger a synchronous save to disk
r.bgsave()  # Save to disk in the background

# Close the connection
r.close()

These examples show how to store crawled data in different ways (text, JSON, CSV, and databases) and how to connect to and operate MySQL, MongoDB, and Redis. The exact usage will vary slightly with your needs and environment, and you can build on these examples to go deeper.

1.4.4 Crawler storage: use Redis database

Example: Store the crawled webpage links in the Redis database

First, make sure you have installed the redis library, which is the Python library for connecting to Redis.

Then, you need to start the Redis server. Next, you can use the following code sample to store the crawled web page links in the Redis database:

import requests
import redis
from bs4 import BeautifulSoup

# Fetch the target page
url = 'https://www.example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Connect to the Redis database
r = redis.Redis(host='localhost', port=6379, db=0)

# Parse the page and store the links in Redis
link_elements = soup.find_all('a')
for link_element in link_elements:
    link = link_element.get('href')
    if link and link.startswith('http'):
        r.sadd('links', link)

print("Links data saved to Redis database.")

In this example, we grab all the links from the target web page and store them in Redis using the set data type (the sadd method).

This example is just a simple demonstration. You can add more features to the crawler as needed, such as data cleaning, deduplication, and persistence; a deduplication sketch follows below. You also need to configure the Redis connection parameters to match your environment.
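
Because Redis sets ignore duplicate members, the return value of sadd (1 if the member was newly added, 0 if it already existed) can double as a cheap deduplication check before queueing a URL for crawling. A minimal sketch, with the key names links and crawl_queue made up for the example:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def enqueue_if_new(url):
    """Add the URL to the 'links' set; only queue it for crawling if it is new."""
    added = r.sadd('links', url)     # returns 1 if newly added, 0 if already present
    if added:
        r.rpush('crawl_queue', url)  # use a Redis list as a simple crawl queue
    return bool(added)

print(enqueue_if_new('https://www.example.com/page1'))  # True  (first time seen)
print(enqueue_if_new('https://www.example.com/page1'))  # False (duplicate, skipped)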

2 Example of actual web crawler data storage

        The following simple example demonstrates how a crawler can store fetched data in a text file, a JSON file, and a MySQL database:

import requests
import json
import mysql.connector

# Fetch the data
response = requests.get('https://api.example.com/data')
data = response.json()

# Store in a text file
with open('data.txt', 'w') as file:
    for item in data:
        file.write(f"{item['name']} - {item['value']}\n")

# Store in a JSON file
with open('data.json', 'w') as file:
    json.dump(data, file, indent=4)

# Store in the database
connection = mysql.connector.connect(
    host='localhost',
    user='username',
    password='password',
    database='mydatabase'
)

cursor = connection.cursor()
for item in data:
    query = "INSERT INTO data_table (name, value) VALUES (%s, %s)"
    values = (item['name'], item['value'])
    cursor.execute(query, values)

connection.commit()
cursor.close()
connection.close()

3 Actual crawling of book information on a web page

A simple end-to-end web crawler example, covering the urllib library, the Beautiful Soup library, proxies, and data storage. In this example, we grab the book information from a web page and store the scraped data in a JSON file.

import urllib.request
from bs4 import BeautifulSoup
import json

# Define the target page URL
url = 'https://www.example.com/books'

# Define a proxy (if you need to use one)
proxies = {'http': 'http://proxy.example.com:8080'}

# Build an opener that routes requests through the proxy
proxy_handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_handler)

# Send the request through the proxy, with a browser-like User-Agent
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
response = opener.open(req)

# Parse the page content
soup = BeautifulSoup(response, 'html.parser')

# Create an empty list of books
books = []

# Extract the book information
book_elements = soup.find_all('div', class_='book')
for book_element in book_elements:
    title = book_element.find('h2').text
    author = book_element.find('p', class_='author').text
    price = book_element.find('p', class_='price').text
    books.append({'title': title, 'author': author, 'price': price})

# Save to a JSON file
with open('books.json', 'w') as file:
    json.dump(books, file, indent=4)

print("Books data saved to 'books.json'")
