How do you scrape data with Python, store it in a database, and then visualize it?

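Python is a popular language for web scraping and data visualization. This article shows how to write a web scraper in Python, store the scraped data in a database, and then display that data with a plotting library.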

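1. Scraping the data

First, write a Python scraper to collect the data. This article uses the Douban Movie Top 250 list as the example.

The Requests library sends the HTTP request, BeautifulSoup parses the HTML page, and regular expressions extract the fields we need.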

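import re

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
# A browser-like User-Agent; without one the site may reject the request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
# Only this single URL is requested, so only the films listed on that page are collected.
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status

soup = BeautifulSoup(response.text, 'html.parser')
# Each film's details sit in a <div class="info"> block.
movies = soup.find_all('div', class_='info')

data = []
for movie in movies:
    # The first <span class="title"> holds the primary title.
    name = movie.find('span', class_='title').get_text()

    # The rating is a decimal number; store None if it cannot be found.
    rating = movie.find('span', class_='rating_num')
    score_match = re.search(r'\d+\.\d', rating.get_text()) if rating else None
    score = float(score_match.group()) if score_match else None

    # The director's name follows the "导演: " label and runs up to the next space.
    info = movie.find('p', class_='')
    director_match = re.search(r'导演: (.*?) ', info.get_text().strip()) if info else None
    director = director_match.group(1) if director_match else None

    data.append({'name': name, 'score': score, 'director': director})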
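
The example above requests a single URL. To cover the whole list, one option is to build one URL per page and repeat the request for each. The following is a minimal sketch, assuming the list is paginated with a `start` query parameter in steps of 25 entries. That parameter, the page size, and the helper name `page_urls` are assumptions that do not come from the code above, so check them against the live site before relying on them.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250'


def page_urls(total=250, page_size=25):
    """Return one URL per page, assuming a `start` offset parameter (hypothetical)."""
    return [
        f"{BASE_URL}?{urlencode({'start': start})}"
        for start in range(0, total, page_size)
    ]


urls = page_urls()
print(len(urls))  # number of pages that would be requested
print(urls[0])
print(urls[-1])
```

Each URL could then be passed to `requests.get` in place of `url`, with the parsing loop appending to the same `data` list. Pausing briefly between requests (for example with `time.sleep`) keeps the load on the site low.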

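2. Storing the data in a database

Next, store the scraped data in a database. This article uses MongoDB as the example.

The PyMongo library connects to MongoDB and inserts the data into a chosen collection.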

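from pymongo import MongoClient

# Connect to a MongoDB server on the local machine (default port 27017).
client = MongoClient('localhost', 27017)
db = client['douban']
collection = db['movies']

# Insert all records in one call; insert_many rejects an empty list, hence the check.
if data:
    collection.insert_many(data)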
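
Running the insert step more than once stores every film again. One way to avoid duplicates is an upsert keyed on a field that identifies the film. The sketch below assumes the `name` field is unique enough to serve as that key, which is an assumption, and the helper name `upsert_movies` is hypothetical. It relies on PyMongo's `Collection.update_one` with `upsert=True`.

```python
def upsert_movies(collection, movies):
    """Insert each movie, or update it if a document with the same name exists.

    `collection` is expected to be a PyMongo Collection (any object with a
    compatible `update_one` method works). Returns the number of movies processed.
    """
    processed = 0
    for movie in movies:
        # Drop any '_id' left over from an earlier insert so $set never touches it.
        fields = {key: value for key, value in movie.items() if key != '_id'}
        collection.update_one(
            {'name': fields['name']},  # match on the assumed-unique name
            {'$set': fields},          # overwrite the stored fields
            upsert=True,               # create the document if none matches
        )
        processed += 1
    return processed
```

With the objects defined earlier, the call would be `upsert_movies(collection, data)`.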

3. Visualizing the data

Finally, read the data back from MongoDB with PyMongo and use Matplotlib to draw a histogram of the movie score distribution.

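import matplotlib.pyplot as plt

# Collect the scores, skipping any record whose score is missing.
scores = []
for d in collection.find():
    if d.get('score') is not None:
        scores.append(d['score'])

# Scores lie on a 0-10 scale, so use bin edges from 0.0 to 10.0 in steps of 0.1.
# (Edges running only from 0.0 to 1.0 would leave every score outside the plot.)
bins = [i / 10 for i in range(0, 101)]
plt.hist(scores, bins=bins, edgecolor='black')
plt.xlabel('score')
plt.ylabel('count')
plt.title('Movie score distribution')
plt.show()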
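
Since the scraper extracts scores with one decimal place, an alternative to the histogram is a bar chart with one bar per distinct score. The sketch below is one possible way to do that, not the article's method: `score_counts` and `plot_score_counts` are hypothetical helper names, and the bar width of 0.08 is an arbitrary choice.

```python
from collections import Counter


def score_counts(scores):
    """Return (score, count) pairs sorted by score, ignoring missing values."""
    counts = Counter(s for s in scores if s is not None)
    return sorted(counts.items())


def plot_score_counts(pairs):
    """Draw one bar per distinct score with Matplotlib."""
    import matplotlib.pyplot as plt  # imported here so score_counts works without it

    xs = [score for score, _ in pairs]
    heights = [count for _, count in pairs]
    plt.bar(xs, heights, width=0.08, edgecolor='black')
    plt.xlabel('score')
    plt.ylabel('count')
    plt.title('Movie score distribution')
    plt.show()


# Illustrative values only, not scraped data.
sample = [9.7, 9.6, 9.7, None, 9.5]
print(score_counts(sample))  # [(9.5, 1), (9.6, 1), (9.7, 2)]
```

With the data read from MongoDB, `plot_score_counts(score_counts(scores))` would draw the chart.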

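The complete code is as follows:

import re
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient
import matplotlib.pyplot as plt

# Scrape the data
url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.find_all('div', class_='info')

data = []
for movie in movies:
    name = movie.find('span', class_='title').get_text()
    rating = movie.find('span', class_='rating_num')
    score_match = re.search(r'\d+\.\d', rating.get_text()) if rating else None
    score = float(score_match.group()) if score_match else None
    info = movie.find('p', class_='')
    director_match = re.search(r'导演: (.*?) ', info.get_text().strip()) if info else None
    director = director_match.group(1) if director_match else None
    data.append({'name': name, 'score': score, 'director': director})

# Store the data in MongoDB
client = MongoClient('localhost', 27017)
collection = client['douban']['movies']
if data:
    collection.insert_many(data)

# Visualize the score distribution
scores = [d['score'] for d in collection.find() if d.get('score') is not None]
bins = [i / 10 for i in range(0, 101)]
plt.hist(scores, bins=bins, edgecolor='black')
plt.xlabel('score')
plt.ylabel('count')
plt.title('Movie score distribution')
plt.show()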

Reposted from blog.csdn.net/m0_72605743/article/details/129863146