Crawling the Xiamen Specialty Snack Ranking

1. Thematic web crawler design

  1. Web crawler name: crawl and analyze the data of the Xiamen specialty snack ranking.

  2. Content crawled: the names on the Xiamen specialty snack list and the reads/votes for each snack.

  3. Overview of the thematic web crawler design scheme (implementation ideas and technical difficulties):

      Idea: Use a web crawler to collect the data: inspect the page source to locate the relevant elements, fetch and store the data, then clean and analyze it, and finally visualize the results.

      Technical difficulties: The data set is very small and my knowledge is still limited, which makes further processing and improvement difficult.

2. Analysis of the structural characteristics of the theme page

1. Structure and characteristics of the theme page: (page screenshots omitted)

2. HTML page analysis:

The page crawled is the Xiamen specialty snack ranking: "https://www.maigoo.com/top/400023.html".

The snack name is in elements with the class "title oneline bgcolor-q".

The reads/votes value is in elements with the class "attention color666 font16 oneline".
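
As a quick sanity check that these class names locate the right elements, here is a minimal sketch using BeautifulSoup's CSS selectors. The class names are taken from the page analysis above; the site's markup may have changed, so treat this as an assumption rather than a guaranteed selector.

import requests
from bs4 import BeautifulSoup

url = "https://www.maigoo.com/top/400023.html"
headers = {'User-Agent': 'Mozilla/5.0'}  # shortened UA string; the full one is used in the main script

r = requests.get(url, headers=headers, timeout=10)
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, 'html.parser')

# CSS selectors: joining the classes with dots matches elements carrying all of them
names = [t.get_text(strip=True) for t in soup.select(".title.oneline.bgcolor-q")]
votes = [t.get_text(strip=True) for t in soup.select(".attention.color666.font16.oneline")]
print(names[:5])
print(votes[:5])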

 

 

 

3. Web crawler programming

1. Data crawling and collection

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.maigoo.com/top/400023.html"  # Xiamen specialty snack ranking page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}  # disguise the crawler as a browser
r = requests.get(url, headers=headers)  # request the website
r.encoding = r.apparent_encoding  # unify the encoding
data = r.text
soup = BeautifulSoup(data, 'html.parser')  # parse with BeautifulSoup
print(soup.prettify())  # display the page structure

food = []   # empty list for the snack names
index = []  # empty list for the reads/votes
for i in soup.find_all(class_="title oneline bgcolor-q"):  # add the snack names to the list
    food.append(i.get_text().strip())
for k in soup.find_all(class_="attention color666 font16 oneline"):  # add the reads/votes to the list
    index.append(k.get_text().strip())

data = [food, index]
print(data)
s = pd.DataFrame(data, index=["Xiamen specialty snacks", "reads/votes"])
print(s.T)  # show the data
s.to_excel('Xiamen specialty snack data.xlsx')  # save the file so the data persists
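
A small hardening sketch, not part of the original script: checking the HTTP status and catching network errors before handing the page to BeautifulSoup. The helper name fetch_page is hypothetical.

import requests

def fetch_page(url, headers, timeout=10):
    """Return the page text, or None if the request fails."""
    try:
        r = requests.get(url, headers=headers, timeout=timeout)
        r.raise_for_status()  # raise an error for 4xx/5xx responses
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as e:
        print("request failed:", e)
        return None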

 

 

 

 

2. Clean and process the data (the data set is small)

 

s = s.drop_duplicates()  # remove duplicate rows
print(s.head())  # preview the cleaned data
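
Because the data set is small there is little to clean beyond duplicates, but two further checks could be added; a sketch, assuming the DataFrame s built above:

# count missing values per column
print(s.isnull().sum())

# drop rows that are completely empty and strip stray whitespace from text cells
s = s.dropna(how='all')
s = s.apply(lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v))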

 

3. Data analysis and visualization

 

import matplotlib.pyplot as plt

# display Chinese characters correctly
plt.rcParams['font.sans-serif'] = ['STSong']
# display the minus sign normally
plt.rcParams['axes.unicode_minus'] = False

plt.bar(['Xiamen Shacha Noodles', 'Xiamen Pancake', 'Scallion Jelly', 'Taro Bun', 'Xiamen Noodle Paste', 'Xiamen Pie'],
        [58, 30, 18, 11, 8, 32], label="Xiamen snack list")
plt.title('Xiamen Specialty Snacks - Vertical Bar Chart')
plt.show()

 

 

# pie chart
plt.pie([58, 30, 18, 11, 8, 32],
        labels=['Xiamen Shacha Noodles', 'Xiamen Pancake', 'Scallion Jelly', 'Taro Bun', 'Xiamen Noodle Paste', 'Xiamen Pie'])
# make the pie chart a circle
plt.axis('equal')
plt.title('Xiamen Specialty Snacks - Pie Chart')
plt.show()
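
The counts in the two charts above are typed in by hand. A hedged sketch of deriving them from the scraped lists food and index from part 1 instead, assuming each reads/votes string contains at least one run of digits (first_int is a hypothetical helper):

import re

def first_int(text, default=0):
    # extract the first run of digits in a reads/votes string
    m = re.search(r'\d+', text)
    return int(m.group()) if m else default

counts = [first_int(v) for v in index]  # 'index' is the reads/votes list scraped in part 1
plt.bar(food, counts, label="Xiamen snack list")
plt.title('Xiamen Specialty Snacks - Vertical Bar Chart (from scraped data)')
plt.show()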

 

 

4. Analyze the correlation coefficient between the two variables, draw a scatter plot, and fit a regression equation between them


# fit a one-variable quadratic regression equation
import numpy as np
from scipy.optimize import leastsq

colnames = ["ranking", "comments"]
df = pd.read_excel('Xiamen specialty snack data.xlsx', skiprows=1, names=colnames)
X = df.ranking
Y = df.comments

def func(p, x):
    a, b, c = p
    return a * x * x + b * x + c

def error_func(p, x, y):
    return func(p, x) - y

p0 = (0, 0, 0)
para = leastsq(error_func, p0, args=(X, Y))
a, b, c = para[0]

plt.figure(figsize=(10, 6))
plt.scatter(X, Y, color="green", label="comments", linewidth=2)
x = np.linspace(0, 30, 20)
y = a * x * x + b * x + c
plt.plot(x, y, color="red", label="fitted curve", linewidth=2)
plt.title("Quadratic regression of comments against ranking")
plt.legend()
plt.show()
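
The task description also asks for the correlation coefficient between the two variables, which the script above does not compute. A minimal sketch using numpy, assuming X and Y are the numeric columns loaded above:

import numpy as np

# Pearson correlation coefficient between ranking and comments
r_xy = np.corrcoef(X, Y)[0, 1]
print("Pearson correlation coefficient:", round(r_xy, 3))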

 

 

5. The complete program code, combining all of the parts above

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from scipy.optimize import leastsq
import matplotlib.pyplot as plt

# crawl the Xiamen specialty snack ranking page
url = "https://www.maigoo.com/top/400023.html"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}  # disguise the crawler as a browser
r = requests.get(url, headers=headers)  # request the website
r.encoding = r.apparent_encoding  # unify the encoding
data = r.text
soup = BeautifulSoup(data, 'html.parser')  # parse with BeautifulSoup
print(soup.prettify())  # display the page structure

food = []   # empty list for the snack names
index = []  # empty list for the reads/votes
for i in soup.find_all(class_="title oneline bgcolor-q"):  # add the snack names to the list
    food.append(i.get_text().strip())
for k in soup.find_all(class_="attention color666 font16 oneline"):  # add the reads/votes to the list
    index.append(k.get_text().strip())
data = [food, index]
print(data)
s = pd.DataFrame(data, index=["Xiamen specialty snacks", "reads/votes"])
print(s.T)  # show the data
s.to_excel('Xiamen specialty snack data.xlsx')  # save the file so the data persists

# duplicate value processing
s = s.drop_duplicates()
print(s.head())

# display Chinese characters correctly
plt.rcParams['font.sans-serif'] = ['STSong']
# display the minus sign normally
plt.rcParams['axes.unicode_minus'] = False
plt.bar(['Xiamen Shacha Noodles', 'Xiamen Pancake', 'Scallion Jelly', 'Taro Bun', 'Xiamen Noodle Paste', 'Xiamen Pie'],
        [58, 30, 18, 11, 8, 32], label="Xiamen snack list")
plt.title('Xiamen Specialty Snacks - Vertical Bar Chart')
plt.show()

# pie chart
plt.pie([58, 30, 18, 11, 8, 32],
        labels=['Xiamen Shacha Noodles', 'Xiamen Pancake', 'Scallion Jelly', 'Taro Bun', 'Xiamen Noodle Paste', 'Xiamen Pie'])
# make the pie chart a circle
plt.axis('equal')
plt.title('Xiamen Specialty Snacks - Pie Chart')
plt.show()

# fit a one-variable quadratic regression equation
colnames = ["ranking", "comments"]
df = pd.read_excel('Xiamen specialty snack data.xlsx', skiprows=1, names=colnames)
X = df.ranking
Y = df.comments

def func(p, x):
    a, b, c = p
    return a * x * x + b * x + c

def error_func(p, x, y):
    return func(p, x) - y

p0 = (0, 0, 0)
para = leastsq(error_func, p0, args=(X, Y))
a, b, c = para[0]

plt.figure(figsize=(10, 6))
plt.scatter(X, Y, color="green", label="comments", linewidth=2)
x = np.linspace(0, 30, 20)
y = a * x * x + b * x + c
plt.plot(x, y, color="red", label="fitted curve", linewidth=2)
plt.title("Quadratic regression of comments against ranking")
plt.legend()
plt.show()

 

4. Conclusion
1. What conclusions can be drawn after analyzing and visualizing the subject data?

A: The charts give an intuitive view of the Xiamen specialty snack ranking and show that interest in Xiamen Shacha noodles is far higher than in the other snacks.

2. A brief summary of how the program design task went.

A: Because my grasp of Python is still limited, there were many parts I did not understand, so I asked classmates and searched Baidu. The data set was also too small to take the design much further, but the project showed the usefulness of Python for data analysis, improved my command of the language, and increased my interest in it.

 
