1. Choose a topic that interests you (topics should not all be the same).
Subject: crawling news and information about football.
2. Write a crawler in Python to collect data on the chosen topic from the Internet.
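The crawling step boils down to fetching a page and pulling the pieces you want out with CSS selectors. A minimal sketch of the pattern — the HTML snippet and class names here are invented for illustration, not taken from the real site:

```python
import requests
from bs4 import BeautifulSoup

def fetch_texts(url, selector):
    """Fetch a page and return the text of every element matching `selector`."""
    res = requests.get(url)
    res.encoding = 'utf-8'  # force the right decoding before parsing
    soup = BeautifulSoup(res.text, 'html.parser')
    return [el.text.strip() for el in soup.select(selector)]

# The parsing half can be checked without the network, on an inline snippet:
html = '<ul class="news"><li><a href="/a">Title A</a></li><li><a href="/b">Title B</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')
links = [a.attrs['href'] for a in soup.select('.news li a')]
print(links)  # ['/a', '/b']
```

Separating "fetch" from "parse" like this makes the selector logic easy to test on saved HTML before hitting the live site.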
3. Perform text analysis on the crawled data to generate a word cloud.
Crawled data file: pynews.txt
Word cloud: (image)
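Before drawing the word cloud it helps to look at the raw word frequencies, since the cloud is just a visual ranking of them. A minimal sketch using only the standard library — the sample titles are invented, standing in for the crawled ones:

```python
from collections import Counter

# invented sample titles standing in for the crawled headlines
titles = [
    "Real Madrid win again",
    "Barcelona draw with Real Sociedad",
    "Real Madrid sign new striker",
]

# flatten all titles into lowercase words and count them
words = [w.lower() for title in titles for w in title.split()]
freq = Counter(words)
print(freq.most_common(3))  # 'real' appears 3 times, 'madrid' twice
```

The same frequency table is what the word cloud renders: more frequent words are drawn larger.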
4. Explain the text analysis results.
def getNewsDetail(Url):
    # open the news detail page and parse it
    resd = requests.get(Url)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    news = {}
    news['广商好波'] = soupd.select('.headline')[0].text.rstrip().replace("\r\n", " ")  # title
    # info = soupd.select('.artical-info')[0].text.replace("\r\n", " ")  # source info
    news['内容'] = soupd.select('.artical-main-content')[0].text.strip().replace("\r\n", " ")  # body
    print(news)
    return news
By crawling one team's news from the news site, the text content obtained includes the title, source, and body. The titles are used as the keywords for the word cloud.
5. Write a complete blog post describing the implementation process, the problems encountered and their solutions, and the data analysis ideas and conclusions.
Problem 1: At first the code was not organized into functions; everything was written as one long script, which quickly became a mess. Solution: reorganize the existing code and encapsulate it into functions step by step.
Problem 2: When the txt file was written and then read back, the text came out garbled because the write and the read used different encodings. After consulting classmates, the fix was to open the file for reading with the same encoding it was written with:
f = open('pynews.txt', 'r', encoding='utf-8').read()
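The garbling comes from writing and reading with different codecs. A small standard-library demonstration of the failure mode and the fix — it uses a temporary file, not the project's pynews.txt:

```python
import os
import tempfile

text = "广商好波：皇马新闻"  # Chinese text, like the crawled titles
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# write the file as UTF-8 ...
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# ... reading it back as GBK either garbles the text or fails outright
try:
    garbled = open(path, "r", encoding="gbk").read()
    print(garbled != text)  # True: the bytes were decoded with the wrong codec
except UnicodeDecodeError:
    print("GBK cannot decode these UTF-8 bytes")

# the fix: always read with the same encoding used to write
restored = open(path, "r", encoding="utf-8").read()
print(restored == text)  # True
```

Passing `encoding=` explicitly on every `open()` call avoids depending on the platform's default encoding, which differs between systems.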
Data analysis ideas and conclusions: based on the crawled data, I think the top European clubs are well placed to compete for the Chinese football market. The analysis shows that news about the top European teams occupies about 70% of the news homepage, which suggests these are the teams Chinese fans follow most. If these clubs played friendly matches in China, or exchanged experience with the national team, it could both help raise the national team's level and open up the Chinese jersey market.
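The "70% of the homepage" figure can be computed directly from the crawled list rather than eyeballed. A sketch with invented counts — the team names and numbers below are illustrative, not the actual crawl results:

```python
# hypothetical headline counts per team from one crawled homepage
counts = {"Real Madrid": 12, "Barcelona": 9, "Atletico": 7, "others": 12}

top_teams = {"Real Madrid", "Barcelona", "Atletico"}
total = sum(counts.values())                                # 40 headlines in all
top = sum(v for k, v in counts.items() if k in top_teams)   # 28 about top teams
share = top / total
print(f"top-team share: {share:.0%}")  # top-team share: 70%
```

Keeping the counts in a dict like this makes it easy to re-run the calculation each time the crawler fetches a fresh homepage.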
6. Finally, submit all the crawled data together with the crawler and data-analysis source code.
import numpy as np
import requests
from PIL import Image
from bs4 import BeautifulSoup
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt


def getNewsDetail(Url):
    # open the news detail page and parse it
    resd = requests.get(Url)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    news = {}
    news['广商好波'] = soupd.select('.headline')[0].text.rstrip().replace("\r\n", " ")  # title
    # info = soupd.select('.artical-info')[0].text.replace("\r\n", " ")  # source info
    news['内容'] = soupd.select('.artical-main-content')[0].text.strip().replace("\r\n", " ")  # body
    print(news)
    return news


newslist = []

def getListPage(newsUrl):
    # take all the news links from one list page and fetch each detail page
    res = requests.get(newsUrl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    for news in soup.select('.england-cat-grid-r2 ul li'):
        Url = news.select('a')[0].attrs['href']
        print(Url)
        newslist.append(getNewsDetail(Url))
    return newslist


newstotal = []
firstPageUrl = 'https://soccer.hupu.com/spain/'
newstotal.extend(getListPage(firstPageUrl))

# save the crawled data once, with an explicit encoding
with open('pynews.txt', 'w', encoding='utf-8') as f:
    f.write(str(newstotal))

for news in newstotal:
    print(news)

# read the text back with the same encoding it was written in
f = open('pynews.txt', 'r', encoding='utf-8').read()
font = r'C:\Windows\Fonts\simkai.ttf'  # a font with Chinese glyphs
a = np.array(Image.open("pdd.jpg"))    # mask image that shapes the cloud
wordcloud = WordCloud(background_color="white", font_path=font, width=1000,
                      height=860, mask=a, margin=2).generate(f)
imagecolor = ImageColorGenerator(a)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
wordcloud.to_file('1.jpg')