Guangzhou Business Football News (crawling football news)

1. Choose a topic that interests you (no two students may pick the same topic).

Topic: crawl football news and related information.

2. Write a crawler program in Python to crawl data on the chosen topic from the Internet.


3. Perform text analysis on the crawled data to generate a word cloud.

The crawled text was saved as a txt file (pynews.txt; see step 6).

Word cloud: [image omitted; generated by the word-cloud code in step 6]

4. Explain the text analysis results.

def getNewsDetail(Url):
    resd = requests.get(Url)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')  # parse the news detail page
    news = {}
    # keep the headline; it is the text that later feeds the word cloud
    news['广商好波'] = soupd.select('.headline')[0].text.rstrip().replace("\r\n", " ")
    # info = soupd.select('.artical-info')[0].text.replace("\r\n", " ")
    # news['内容'] = soupd.select('.artical-main-content')[0].text.strip().replace("\r\n", " ")
    print(news)
    return news

By crawling one team's news from the news website, the collected text includes the title, the source, the article body, and so on. The news titles are what supply the keywords for the word cloud.
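Since the titles are Chinese, a segmentation step helps turn them into countable keywords. Below is a minimal sketch, assuming the third-party jieba library (the script in step 6 simply passes the raw string to WordCloud):

import jieba
from collections import Counter

def top_keywords(text, n=20):
    # split the Chinese text into words; keep tokens longer than one character
    words = [w.strip() for w in jieba.cut(text) if len(w.strip()) > 1]
    return Counter(words).most_common(n)

# e.g. print(top_keywords(open('pynews.txt', 'r', encoding='GBK').read()))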

5. Write a complete blog post describing the implementation process above, the problems encountered and their solutions, and the data analysis ideas and conclusions.

Problem 1: at first the code was not wrapped into functions; it was written as one long script and quickly became a mess. Solution: reorganize the code and package it step by step into functions.
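A minimal sketch of that reorganization; the names crawl, save, and main are illustrative, not from the final script:

def crawl(list_url):
    # step 1: fetch the list page and every news detail (see getListPage in step 6)
    return getListPage(list_url)

def save(news, path):
    # step 2: persist the crawled data in one place
    with open(path, 'w', encoding='utf-8') as f:
        f.write(str(news))

def main():
    save(crawl('https://soccer.hupu.com/spain/'), 'pynews.txt')

if __name__ == '__main__':
    main()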

Problem 2: when the txt file was read back in, the text came out garbled. The cause was an encoding mismatch: the file handle that wrote it was opened without an explicit encoding, so Windows used its default, GBK, while the read used a different codec. After consulting classmates, the fix was to read the file with the encoding it was actually written with:

f = open('pynews.txt', 'r', encoding='GBK').read()
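The general rule this illustrates: read a file with the same encoding it was written with. A small self-contained example (UTF-8 on both sides works just as well as GBK on both sides):

with open('demo.txt', 'w', encoding='utf-8') as out:
    out.write('广商好波')                                   # bytes on disk are UTF-8

text = open('demo.txt', 'r', encoding='utf-8').read()       # decode with the same codec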

Data analysis ideas and conclusions: based on the crawled data, I believe the top European clubs are well placed to compete for the Chinese football business market. The analysis shows that news about the top European teams fills about 70% of the news homepage, which suggests these are the clubs Chinese fans care about most. If these teams came to China for friendly matches, or for exchange games with the national team, it would both help raise the national team's level and open up the Chinese jersey market.
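The 70% figure could be checked directly against the crawled headlines. A rough sketch; TOP_CLUBS is an illustrative keyword list, not the one used in the original analysis:

TOP_CLUBS = ['巴萨', '皇马', '马竞', '塞维利亚']  # hypothetical club keywords

def club_share(newslist):
    # fraction of crawled headlines that mention at least one top club
    titles = [n['广商好波'] for n in newslist]
    hits = [t for t in titles if any(club in t for club in TOP_CLUBS)]
    return len(hits) / len(titles) if titles else 0.0

# e.g. print('{:.0%}'.format(club_share(newstotal)))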

6. Finally, submit all the crawled data together with the crawler and data-analysis source code.

import numpy as np
import requests

from PIL import Image
from bs4 import BeautifulSoup
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt


def getNewsDetail(Url):
    resd = requests.get(Url)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')  # parse the news detail page
    news = {}
    # keep the headline; it is the text that later feeds the word cloud
    news['广商好波'] = soupd.select('.headline')[0].text.rstrip().replace("\r\n", " ")
    # info = soupd.select('.artical-info')[0].text.replace("\r\n", " ")
    # news['内容'] = soupd.select('.artical-main-content')[0].text.strip().replace("\r\n", " ")
    print(news)
    return news
newslist = []
def getListPage(newsUrl):  # extract every news item from one news list page
    res = requests.get(newsUrl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    for news in soup.select('.england-cat-grid-r2 ul li'):
        Url = news.select('a')[0].attrs['href']
        print(Url)
        newslist.append(getNewsDetail(Url))
    return newslist


newstotal = []
firstPageUrl = 'https://soccer.hupu.com/spain/'
newstotal.extend(getListPage(firstPageUrl))

txtName = "pynews.txt"
f = open(txtName, "w")  # no encoding given, so Windows writes its default GBK; see the GBK read below
f.write(str(newstotal))
f.close()
for news in newstotal:
    print(news)

f = open('pynews.txt', 'r', encoding='GBK').read()  # GBK matches the encoding the file was written with
font = r'C:\Windows\Fonts\simkai.ttf'  # a Chinese font; otherwise CJK characters render as boxes
a = np.array(Image.open("pdd.jpg"))    # mask image that gives the cloud its shape
wordcloud = WordCloud(background_color="white", font_path=font, width=1000,
                      height=860, mask=a, margin=2).generate(f)
imagecolor = ImageColorGenerator(a)
plt.imshow(wordcloud.recolor(color_func=imagecolor))  # take the word colors from the mask image
plt.axis("off")
plt.show()
wordcloud.to_file('1.jpg')

