Python is widely used for web scraping and data visualization. This article shows how to write a web crawler in Python, store the crawled data in a database, and display it with a data visualization tool.
1. Crawl data
First, write a Python crawler to collect the data. The example here crawls the Douban Movie Top250 list.
It uses the Requests library to send the HTTP request, the BeautifulSoup library to parse the HTML page, and regular expressions to extract the required fields.
import re

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
# Send a browser-like User-Agent; many sites reject the library default.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, 'html.parser')

# Each movie on the page sits in a <div class="info"> block.
movies = soup.find_all('div', class_='info')
data = []
for movie in movies:
    name = movie.find('span', class_='title').get_text()
    # The rating is one digit, a dot, and one digit, e.g. 9.7.
    score = re.findall(r'(\d\.\d)', str(movie.find('span', class_='rating_num')))
    score = float(score[0]) if score else None
    # The credits line starts with "导演: " ("Director: ") followed by the name.
    director = re.findall('导演: (.*?) ', movie.find('p', class_='').get_text().strip())
    director = director[0] if director else None
    data.append({'name': name, 'score': score, 'director': director})
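The request above fetches a single URL, so it returns only part of the list. A minimal sketch of how the remaining pages might be addressed, assuming the list is split into pages of 25 entries selected by a `start` query parameter (an assumption about the site's URL scheme that should be checked in a browser first). `page_urls` is a hypothetical helper, not part of any library:

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250'


def page_urls(total=250, page_size=25):
    """Build one URL per result page.

    Assumes pagination via a `start` offset; verify against the live site.
    """
    urls = []
    for start in range(0, total, page_size):
        urls.append(BASE_URL + '?' + urlencode({'start': start}))
    return urls


if __name__ == '__main__':
    for url in page_urls():
        print(url)
```

Each URL can then be passed to `requests.get` with the same headers, ideally with a short pause between requests so the crawler does not overload the site.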
2. Store in the database
Next, store the crawled data in a database. This example uses MongoDB.
Use the PyMongo library to connect to the MongoDB server and insert the data into a collection.
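from pymongo import MongoClient

# Connect to a MongoDB server running locally on the default port.
client = MongoClient('localhost', 27017)
db = client['douban']
collection = db['movies']

# Insert one document per movie.
for d in data:
    collection.insert_one(d)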
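Running the crawler a second time with `insert_one` stores every movie again. One way to avoid duplicates, sketched here under the assumption that the movie name is unique enough to act as a key, is to upsert each record with PyMongo's `update_one(..., upsert=True)`. `save_movies` is a hypothetical helper; it accepts any object with a PyMongo-style `update_one` method:

```python
def save_movies(collection, records):
    """Insert or update each record, keyed by its 'name' field.

    `collection` is expected to behave like a pymongo Collection.
    Returns the number of records written.
    """
    written = 0
    for record in records:
        if not record.get('name'):
            continue  # skip records that have no usable key
        collection.update_one(
            {'name': record['name']},  # match an existing movie by name
            {'$set': record},          # overwrite its fields with fresh data
            upsert=True,               # insert when no match exists
        )
        written += 1
    return written
```

With the connection from the example above, this would be called as `save_movies(collection, data)` in place of the `insert_one` loop.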
3. Data visualization
Read the data back from MongoDB with the PyMongo library, then use the Matplotlib library to draw a histogram showing the distribution of movie ratings.
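import matplotlib.pyplot as plt

# Collect the scores, skipping movies whose score could not be parsed.
scores = []
for d in collection.find():
    if d.get('score') is not None:
        scores.append(d['score'])

# Ratings lie on a 0-10 scale, so use bin edges 0.0, 0.1, ..., 10.0.
bins = [i / 10 for i in range(0, 101)]
plt.hist(scores, bins=bins, edgecolor='black')
plt.xlabel('score')
plt.ylabel('count')
plt.title('Movie score distribution')
plt.show()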
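To check what the histogram should roughly look like before opening a plot window, a similar binning can be reproduced in plain Python. This is only a sketch: `bin_counts` is a hypothetical helper, the 0.5-wide bins are an arbitrary choice, and scores are assumed to have one decimal place on a 0 to 10 scale.

```python
from collections import Counter


def bin_counts(scores, width_tenths=5):
    """Count scores per bin; each bin is width_tenths / 10 wide.

    Works in integer tenths to avoid floating-point edge cases.
    Missing scores (None) are ignored.
    """
    counts = Counter()
    for score in scores:
        if score is None:
            continue
        tenths = round(score * 10)
        bin_start = (tenths // width_tenths) * width_tenths / 10
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


if __name__ == '__main__':
    sample = [9.7, 9.6, 9.5, 9.4, 8.9, None]  # made-up scores
    for start, count in bin_counts(sample).items():
        print(f'{start:.1f}: {"#" * count}')
```

Each key in the result is the lower edge of a bin, so the counts can be compared directly with the bar heights Matplotlib draws for the same bin width.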
The complete code is as follows:
import requests
from bs4 import BeautifulSoup
import re
from pymongo import MongoClient
import matplotlib.pyplot as plt
# Crawl data
url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
movies = soup.find_all('div', class_='info')
data = []
for movie in movies:
    name = movie.find('span', class_='title').get_text()
    score = re.findall('(\d