Advanced Python Applications Programming

Task requirements for the advanced Python application design assignment


Implement a topic-focused Python web crawler and complete the following:
(Note: one topic per person, with the subject chosen freely; all design content and source code are to be submitted to the cnblogs (Blog Park) platform)

First, design of the topic-focused web crawler (15 points)
1. Name of the topic-focused web crawler
       Name: Crawler for movie ranking information on a video website
2. Analysis of the content and data features crawled by the topic-focused web crawler
       The crawler mainly collects the movie rankings on the video site and the score of each movie
3. Overview of the crawler design (including the implementation approach and technical difficulties)
       The design uses the requests library to fetch the target page and collect the information, then cleans the data with BeautifulSoup, and finally prints the results. The technical difficulties are cleaning the data and laying out the printed results.
Second, structural analysis of the topic page (15 points)
1. Structural features of the topic page
Taking the iQIYI movie ranking page as an example, url: https://www.iqiyi.com/dianying_new/i_list_paihangbang.html

2. HTML page parsing
3. Node (tag) lookup and traversal methods
(draw the node tree structure if necessary)
 Lookup method: find_all()
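To make the lookup and traversal concrete, here is a minimal sketch of find_all() on a small HTML fragment. The fragment is hypothetical: it is reconstructed from the class names and attribute chains used in the crawler code, and the real iQIYI page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup inferred from the selectors in the crawler;
# the real page may differ.
SAMPLE_HTML = """
<p class="site-piclist_info_title"><a href="#">Movie A</a></p>
<div class="site-title_score"><span><strong>8</strong>.5</span></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all(name, attrs): a plain string as the second argument is matched
# against the tag's CSS class
title_tag = soup.find_all("p", "site-piclist_info_title")[0]
score_tag = soup.find_all("div", "site-title_score")[0]

# Dotted access walks down to the first child tag with that name
title = title_tag.a.string
integer_part = score_tag.span.strong.string

# Iterating over a tag yields its direct children: here the <strong> tag
# followed by the text node that holds the decimal part
children = list(score_tag.span)
decimal_part = children[1]

print(title, integer_part + decimal_part)
```

Under the assumed markup, the score is split across a tag and a text node, which is why the crawler collects it in two steps.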
Third, web crawler program design (60 points)
The crawler program should include the following parts, with the source code and fairly detailed comments attached, and the output shown after each part.
Code:
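import requests
from bs4 import BeautifulSoup
# Import the requests library, and BeautifulSoup from the bs4 library
# Fetch the HTML of the target iQIYI movie channel page
def getHTMLText(url):
    try:
        # Fetch the page with requests; the request times out after 60 seconds
        r = requests.get(url,timeout=60)
        # Raise an exception for 4xx/5xx status codes
        r.raise_for_status()
        # Set the encoding to the one detected from the content
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Catch request errors only; a bare except would also hide unrelated bugs
        return "Crawl failed"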
    
# Get the movie titles
def getMovie(ulist,html):
    soup = BeautifulSoup(html,"html.parser")
    # Use find_all to select every p tag with class site-piclist_info_title and read the title from its link
    for p in soup.find_all("p","site-piclist_info_title"):
        ulist.append(p.a.string)
    return ulist

# Get the integer part of each movie score
def getPage1(ulist,html):
    soup = BeautifulSoup(html,"html.parser")
    # Use find_all to select every div tag with class site-title_score and read the score from it
    for div in soup.find_all("div","site-title_score"):
        ulist.append(div.span.strong.string)
    return ulist

# Get the child nodes of each score span (integer part and decimal part)
def getPage2(ulist,html):
    soup = BeautifulSoup(html,"html.parser")
    for div in soup.find_all("div","site-title_score"):
        ulist.append(list(div.span))
    return ulist

# Print the movie information
def printUnivList(ulist1,ulist2,ulist3,num):
    print("{:^50}".format("Movie titles and scores"))
    for i in range(num):
        print("{:^45}\t\t{}{}".format(ulist1[i],ulist2[i],ulist3[i]))

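# URL of the page to request
Url = "https://www.iqiyi.com/dianying_new/i_list_paihangbang.html"
# List m stores the crawled movie titles
m = []
# Two lists store the integer part of each score and the child nodes of each score span
p1 = []
p2 = []
# List p3 stores the decimal part taken from p2, printed right after p1
p3 = []
# Get the HTML page
html = getHTMLText(Url)
# Get the movie titles
getMovie(m,html)
# Get the movie scores
getPage1(p1,html)
getPage2(p2,html)
# Take the decimal part (the second child node) of each score from p2 and store it in p3
for i in range(len(p2)):
    p3.append(p2[i][1])
# Print all crawled movies; use the shortest list length so a missing score cannot raise an IndexError
printUnivList(m,p1,p3,min(len(m),len(p1),len(p3)))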

Output:

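1. Data crawling and collection
Use the requests library to crawl the data
# Fetch the HTML of the target iQIYI movie channel page
def getHTMLText(url):
    try:
        # Fetch the page with requests; the request times out after 60 seconds
        r = requests.get(url,timeout=60)
        # Raise an exception for 4xx/5xx status codes
        r.raise_for_status()
        # Set the encoding to the one detected from the content
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Catch request errors only; a bare except would also hide unrelated bugs
        return "Crawl failed"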
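Because getHTMLText returns a message string on failure, that string is later handed to BeautifulSoup as if it were HTML. A minimal sketch of one alternative is to return None so the caller can tell a failed request apart from a page. The name fetch_html, the shorter timeout and the User-Agent header are assumptions for illustration, not part of the original program.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails."""
    # Assumption: a browser-like User-Agent, since some sites reject the default one
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        r = requests.get(url, headers=headers, timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, bad status codes and malformed URLs
        return None

# A malformed URL is rejected before any network traffic is sent
html = fetch_html("not-a-valid-url")
if html is None:
    print("Crawl failed")
```

With this shape the parsing step can be skipped entirely when html is None, instead of silently producing empty lists.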

 


2. Data cleaning and processing
Use the BeautifulSoup library to clean the data
# Get the movie titles
def getMovie(ulist,html):
    soup = BeautifulSoup(html,"html.parser")
    # Use find_all to select every p tag with class site-piclist_info_title and read the title from its link
    for p in soup.find_all("p","site-piclist_info_title"):
        ulist.append(p.a.string)
    return ulist

# Get the integer part of each movie score
def getPage1(ulist,html):
    soup = BeautifulSoup(html,"html.parser")
    # Use find_all to select every div tag with class site-title_score and read the score from it
    for div in soup.find_all("div","site-title_score"):
        ulist.append(div.span.strong.string)
    return ulist

# Get the child nodes of each score span (integer part and decimal part)
def getPage2(ulist,html):
    soup = BeautifulSoup(html,"html.parser")
    for div in soup.find_all("div","site-title_score"):
        ulist.append(list(div.span))
    return ulist
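The two score functions above collect the integer and decimal parts separately and join them only when printing. As a hedged alternative, the sketch below reads the whole score text at once with get_text() and converts it to a number, pairing each title with its score. The function name extract_movies and the sample markup are hypothetical, and pairing by position assumes the page has exactly one score block per title, in the same order.

```python
from bs4 import BeautifulSoup

def extract_movies(html):
    """Return a list of (title, score) tuples; score is None if it cannot be parsed."""
    soup = BeautifulSoup(html, "html.parser")
    titles = [p.get_text(strip=True)
              for p in soup.find_all("p", "site-piclist_info_title")]
    scores = []
    for div in soup.find_all("div", "site-title_score"):
        # get_text() joins the <strong> integer part and the trailing decimal text
        text = div.get_text(strip=True)
        try:
            scores.append(float(text))
        except ValueError:
            scores.append(None)  # empty or non-numeric score
    # Assumption: titles and scores appear in the same order on the page
    return list(zip(titles, scores))

# Hypothetical markup inferred from the selectors used in the crawler
SAMPLE_HTML = """
<p class="site-piclist_info_title"><a href="#">Movie A</a></p>
<div class="site-title_score"><span><strong>8</strong>.5</span></div>
<p class="site-piclist_info_title"><a href="#">Movie B</a></p>
<div class="site-title_score"><span></span></div>
"""

movies = extract_movies(SAMPLE_HTML)
print(movies)
```

Numeric scores can then be sorted or averaged directly, which the string pieces in p1 and p3 do not allow.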


 


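3. Text analysis (optional): jieba word segmentation, wordcloud visualization
#encoding=utf-8
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import jieba
# Read the whole text file into a string; the with statement closes the file even if reading fails
with open(r'C:\Users\lenovo\Desktop\琐屑\a') as file_object:
    file_context = file_object.read()
#print(file_context)
seg_list = jieba.cut_for_search(file_context) # Search engine mode; returns a generator that can be consumed only once
#print(list(seg_list))
#print(" ".join(seg_list))
# Configure the word cloud
wc = WordCloud(
    # Background color
    background_color="black",
    # Maximum number of words to display
    max_words=2000,
    # Font file; a font with Chinese glyphs is needed, usually found in the system font directory
    # (raw string, so the backslashes are not treated as escape sequences)
    font_path=r'C:\Windows\Fonts\simfang.ttf',
    height=1200, width=1600,
    # Maximum font size
    max_font_size=100,
    # Seed for the random generator, so the layout and colors are reproducible
    random_state=30, )
myword = wc.generate(" ".join(seg_list))  # Generate the word cloud
# Show the word cloud image
plt.imshow(myword)
plt.axis("off")
plt.show()
wc.to_file('C://Users//123//Desktop//p.png')  # Save the word cloud to a file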

Result image:


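4. Data analysis and visualization
(for example: bar charts, histograms, scatter plots, box plots, distribution plots, regression analysis, etc.)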
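As one possible illustration of the bar chart option, here is a minimal sketch that draws movie scores as horizontal bars with matplotlib. The function name plot_scores, the output file name and the sample values are placeholders, not results from the crawler.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt

def plot_scores(titles, scores, out_path):
    """Draw a horizontal bar chart of movie scores and save it as an image."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(titles, scores)
    ax.invert_yaxis()  # keep the first-ranked movie at the top
    ax.set_xlabel("Score")
    ax.set_title("Movie scores")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)

# Placeholder values for illustration only
plot_scores(["Movie A", "Movie B", "Movie C"], [8.5, 7.9, 9.1], "scores.png")
```

Chinese titles are only rendered correctly if matplotlib is configured with a font that contains CJK glyphs.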

 5. Data persistence
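One simple way to persist the crawled results is a CSV file written with the standard csv module. This is a minimal sketch: the function name save_movies, the file name and the column names are assumptions, and the rank column assumes the page lists movies in ranking order.

```python
import csv

def save_movies(path, titles, scores):
    """Write one (rank, title, score) row per movie to a CSV file."""
    # utf-8-sig adds a BOM so that Excel detects the encoding of Chinese titles
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["rank", "title", "score"])
        # Assumption: list order on the page equals the ranking
        for rank, (title, score) in enumerate(zip(titles, scores), start=1):
            writer.writerow([rank, title, score])

# Placeholder values for illustration only
save_movies("movies.csv", ["Movie A", "Movie B"], ["8.5", "7.9"])
```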
 
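Fourth, conclusion (10 points)
1. What conclusions can be drawn from the analysis and visualization of the topic data?
The analysis and visualization of the topic data give the ranking of the movies and the score of each one.
2. Give a brief summary of how this programming task went.
This task largely achieved the goal of crawling the desired data and then cleaning and analyzing it.
The experiment also has a shortcoming: the layout of the crawled data when printed was not solved well.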
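A likely cause of the layout problem is that format() pads by character count, while Chinese characters occupy two columns in most terminals. A minimal sketch of one workaround pads by display width instead, using the standard unicodedata module. The helper names and the column width are assumptions for illustration.

```python
import unicodedata

def display_width(text):
    """Return the number of terminal columns the text occupies."""
    # Wide (W) and fullwidth (F) characters, such as CJK ideographs, take two columns
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

def pad_to_width(text, width):
    """Left-align text by appending spaces up to the given display width."""
    return text + " " * max(0, width - display_width(text))

def format_rows(titles, scores, width=30):
    """Build one aligned line per movie."""
    return [pad_to_width(t, width) + s for t, s in zip(titles, scores)]

# Placeholder values for illustration only
for line in format_rows(["电影甲", "Movie B"], ["8.5", "7.9"]):
    print(line)
```

The alignment still depends on the terminal using a monospaced font in which wide characters are exactly two columns.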


Origin www.cnblogs.com/BoYCB/p/11962015.html