How do you store crawled data in a database with Python and then visualize it?

Python is widely used for web scraping and data visualization. This article shows how to write a web crawler in Python, store the crawled data in a database, and then visualize that data.

1. Crawl data

First, write a crawler to collect the data. The example here crawls the first page of the Douban Movie Top250 list.

Use the Requests library to send the HTTP request, the BeautifulSoup library to parse the HTML page, and regular expressions to extract the required fields.

import re

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
# Send a browser-like User-Agent; the site may reject the library default.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error

soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.find_all('div', class_='info')  # one block per movie

data = []
for movie in movies:
    name = movie.find('span', class_='title').get_text()

    # Pull the rating (for example 9.7) out of the rating_num span.
    score = re.findall(r'(\d\.\d)', str(movie.find('span', class_='rating_num')))
    score = float(score[0]) if score else None

    # The director is the first token after "导演: "; use None if it is absent.
    credits = movie.find('p', class_='').get_text().strip()
    director = re.search(r'导演: (.*?) ', credits)
    director = director.group(1) if director else None

    data.append({'name': name, 'score': score, 'director': director})
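
The regular-expression step can be tried on its own, without any network access. The following is a minimal sketch, not the author's code: the helper names `extract_director` and `parse_score` are hypothetical, and the sample string only assumes the credit line has the shape "导演: name ... 主演: ...". Both helpers return None instead of raising when the pattern is missing.

```python
import re

# Hypothetical helpers for illustration; the names are not from the article.
DIRECTOR_RE = re.compile(r'导演:\s*(\S+)')   # first token after the "导演:" label
SCORE_RE = re.compile(r'\d+(?:\.\d+)?')      # first integer or decimal number


def extract_director(text):
    """Return the first whitespace-delimited token after '导演:', or None."""
    match = DIRECTOR_RE.search(text)
    return match.group(1) if match else None


def parse_score(text):
    """Return the first number found in text as a float, or None."""
    match = SCORE_RE.search(text or '')
    return float(match.group()) if match else None


# Assumed sample of a credit line; the real page text may differ.
sample = '导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins'
print(extract_director(sample))             # 弗兰克·德拉邦特
print(parse_score('9.7'))                   # 9.7
print(extract_director('no credits here'))  # None
```

Using `re.search` with an explicit None check avoids the IndexError that indexing an empty `re.findall` result would raise on a page whose layout differs from the expected one.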

2. Store in the database

Next, store the crawled data in a database. This example uses MongoDB.

Use the PyMongo library to connect to MongoDB and insert the data into a collection.

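from pymongo import MongoClient

# Connect to a MongoDB server running locally on the default port.
client = MongoClient('localhost', 27017)
db = client['douban']
collection = db['movies']

# Insert all documents in one call; insert_many rejects an empty list.
if data:
    collection.insert_many(data)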
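
Running the crawler twice with plain inserts stores every movie twice. One way to avoid that is an upsert, sketched below under the assumption that the movie name is unique enough to serve as the key. The function name `upsert_movies` is hypothetical; it relies only on PyMongo's `update_one(filter, update, upsert=True)` and on the `upserted_id` attribute of the result.

```python
def upsert_movies(collection, movies):
    """Insert or update each movie keyed by name; return how many were new."""
    inserted = 0
    for movie in movies:
        # Drop any _id left over from an earlier insert so it is never rewritten.
        fields = {key: value for key, value in movie.items() if key != '_id'}
        result = collection.update_one(
            {'name': movie['name']},  # match an existing document by title
            {'$set': fields},         # refresh its fields with the new values
            upsert=True,              # create the document when nothing matches
        )
        if result.upserted_id is not None:
            inserted += 1
    return inserted
```

With the connection from this section, the call would be `upsert_movies(collection, data)`. The function accepts any object that exposes a compatible `update_one` method, so it can be exercised with a stand-in collection when no MongoDB server is available.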

3. Data visualization

Finally, read the data back from MongoDB with PyMongo and use the Matplotlib library to draw a histogram of the movie ratings.

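import matplotlib.pyplot as plt

# Read the scores back, skipping documents whose score could not be parsed.
scores = [d['score'] for d in collection.find() if d.get('score') is not None]

# Ratings lie on a 0-10 scale, so use bin edges from 0.0 to 10.0 in steps of 0.1.
bins = [i / 10 for i in range(0, 101)]
plt.hist(scores, bins=bins, edgecolor='black')
plt.xlabel('score')
plt.ylabel('count')
plt.title('Movie score distribution')
plt.show()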
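
The data preparation behind the histogram can be separated from the drawing, which makes it easy to check before any window opens. This is a rough sketch with hypothetical helper names (`clean_scores`, `score_bins`, `save_histogram`). The narrower 8.0 to 10.0 range in the usage lines is an assumption about where Top250 ratings fall, not something stated in the article.

```python
def clean_scores(documents):
    """Collect scores, skipping documents whose score is missing or None."""
    return [d['score'] for d in documents if d.get('score') is not None]


def score_bins(low=0.0, high=10.0, step=0.1):
    """Return bin edges from low to high inclusive, rounded to limit float drift."""
    count = round((high - low) / step)
    return [round(low + i * step, 10) for i in range(count + 1)]


def save_histogram(scores, bins, path='scores.png'):
    """Draw the histogram and write it to an image file instead of a window."""
    import matplotlib
    matplotlib.use('Agg')  # non-interactive backend, so no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(scores, bins=bins, edgecolor='black')
    ax.set_xlabel('score')
    ax.set_ylabel('count')
    ax.set_title('Movie score distribution')
    fig.savefig(path)
    plt.close(fig)


# Made-up documents standing in for collection.find().
docs = [{'name': 'A', 'score': 9.7}, {'name': 'B', 'score': None}, {'name': 'C', 'score': 8.9}]
print(clean_scores(docs))           # [9.7, 8.9]
print(len(score_bins()))            # 101 edges, 0.0 through 10.0
print(len(score_bins(8.0, 10.0)))   # 21 edges, 8.0 through 10.0
```

Calling `save_histogram(clean_scores(collection.find()), score_bins(8.0, 10.0))` would zoom the plot in on the assumed rating range; the default bins cover the whole 0-10 scale.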

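The complete code is as follows:

import requests
from bs4 import BeautifulSoup
import re
from pymongo import MongoClient
import matplotlib.pyplot as plt

# Crawl data
url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.find_all('div', class_='info')

data = []
for movie in movies:
    name = movie.find('span', class_='title').get_text()
    score = re.findall(r'(\d\.\d)', str(movie.find('span', class_='rating_num')))
    score = float(score[0]) if score else None
    director = re.search(r'导演: (.*?) ', movie.find('p', class_='').get_text().strip())
    director = director.group(1) if director else None
    data.append({'name': name, 'score': score, 'director': director})

# Store in the database
client = MongoClient('localhost', 27017)
collection = client['douban']['movies']
if data:
    collection.insert_many(data)

# Visualize the score distribution
scores = [d['score'] for d in collection.find() if d.get('score') is not None]
bins = [i / 10 for i in range(0, 101)]
plt.hist(scores, bins=bins, edgecolor='black')
plt.xlabel('score')
plt.ylabel('count')
plt.title('Movie score distribution')
plt.show()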

Origin blog.csdn.net/m0_72605743/article/details/129863146