Python Crawler

1. Topic

This time, I simply crawl the campus news of Guangdong Light Industry Vocational and Technical College and generate a word cloud for analysis.

2. Implementation Process

1. Open the campus news module on the official website of Guangdong Light Industry Vocational and Technical College and click into one of the news articles. Using the browser's developer tools (F12), analyze the page to extract the news title, release time, and link into the dictionary news{}, and write the article body to content.txt.

 

# Get the details of a single news article
def getNewDetails(newsUrl):
    resd = requests.get(newsUrl)
    resd.encoding = 'utf-8'
    soup = BeautifulSoup(resd.text, 'html.parser')
    news = {}
    news['title'] = soup.select('div > h3')[0].text
    # The .title element contains the title followed by the "release time"
    # label and the date; cut out the title and the label to keep the date.
    # (str.lstrip strips a set of characters rather than a prefix, so the
    # original lstrip call was a bug; replace() does what was intended.)
    titleline = soup.select('.title')[0].text
    news['release time'] = titleline.replace(news['title'], '').replace('release time', '').strip(' :：')
    news['link'] = newsUrl
    content = soup.find_all('div', class_='content')[0].text
    writeNewsDetail(content)
    return news

 

# Append the scraped article body to content.txt
def writeNewsDetail(content):
    with open('content.txt', 'a', encoding='utf-8') as f:
        f.write(content)

2. Now that we can scrape a single article, the next step is to collect every article on a news list page. How do we get the URLs of all the news on a page? The same old way: press F12, open the developer tools, and find the tag that holds each news URL.

 

# Get all news URLs on one list page and scrape each article
def getListPage(newsurl):
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # Narrow the search to the main list container to avoid unrelated links
    soup = soup.find('div', class_='mainl')
    newslist = []
    for new in soup.select('ul > li > a'):
        newsUrl = new.attrs['href']
        newslist.append(getNewDetails(newsUrl))
    return newslist
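
A caveat worth noting: the loop above uses new.attrs['href'] as-is, which only works if the site emits absolute URLs. If the list page ever returns relative links, urljoin from the standard library can resolve them against the page URL. A minimal sketch of that variant (same loop, one extra call):

from urllib.parse import urljoin

# Resolve each href against the list-page URL, so both absolute
# and relative links end up as full URLs
for new in soup.select('ul > li > a'):
    newsUrl = urljoin(newsurl, new.attrs['href'])
    newslist.append(getNewDetails(newsUrl))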

 

3. Once one page works, the other pages follow the same pattern. Since the campus news section of the college has only 54 pages, I simply crawled them all.

# The URL of the first news page differs from that of the other pages,
# so first collect all the news on the first page into newsTotal
newsurl = 'http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/index.html'
newsTotal = []
newsTotal.extend(getListPage(newsurl))
# Then collect the news from the remaining pages in a loop
for i in range(2,55):
    listPageUrl = 'http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/index_{}.html'.format(i)
    newsTotal.extend(getListPage(listPageUrl))
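
Firing 53 more requests back to back puts a steady load on the school's server. A gentler variant of the same loop, assuming a one-second pause between pages is acceptable:

import time

for i in range(2,55):
    listPageUrl = 'http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/index_{}.html'.format(i)
    newsTotal.extend(getListPage(listPageUrl))
    time.sleep(1)  # pause briefly between pages to avoid hammering the site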

  

4. Store the previously obtained news information (title, release time and link) in an Excel sheet

# Store the obtained news information in the Excel table news.xlsx
df = pandas.DataFrame(newsTotal)
df.to_excel('news.xlsx',encoding='utf-8')
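
Two small caveats here: pandas needs an Excel writer engine such as openpyxl installed for .xlsx output, and newer pandas versions have removed the encoding argument of to_excel, so it may need to be dropped depending on your version. A quick sanity check that the file round-trips:

# Read the sheet back and spot-check the first few rows
check = pandas.read_excel('news.xlsx')
print(check.head())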

 

 

5. At this point all the information has been crawled; next, segment the text with jieba and generate the word cloud.

# Read the crawled text back in
file = codecs.open('content.txt', 'r', 'utf-8')
word = file.read()
file.close()
# Use the given image as the mask, i.e. the shape of the word cloud
image = np.array(Image.open('tree.jpg'))
# Word cloud font settings (SimHei supports Chinese characters)
font = r'C:\Windows\Fonts\simhei.ttf'
# Remove English letters, digits and punctuation, keeping the Chinese text
resultword = re.sub(r"[A-Za-z0-9`~!@#$^&*()=|{}':;,.<>/?%\[\]\\]", "", word)
# Segment the Chinese text with jieba (full mode)
wordlist_after_jieba = jieba.cut(resultword, cut_all=True)

wl_space_split = " ".join(wordlist_after_jieba)
print(wl_space_split)
my_wordcloud = WordCloud(font_path=font, mask=image, background_color='white',
                         max_words=100, max_font_size=100,
                         random_state=50).generate(wl_space_split)
# Color generator based on the mask image (for optional recoloring)
image_colors = ImageColorGenerator(image)
#my_wordcloud.recolor(color_func=image_colors)
# Display the generated word cloud
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
# Save the generated image; this line runs after the display window is
# closed, so interrupting the program earlier means nothing is saved
my_wordcloud.to_file('result.jpg')

The generated word cloud image (you can shape the word cloud with any picture you like).
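
One optional refinement: full-mode jieba segmentation (cut_all=True) produces many overlapping fragments, and common function words can dominate the cloud. A minimal sketch of filtering them out before generating, assuming a hand-picked stopword set (the words below are just placeholders for whatever crowds your own output):

# Hypothetical stopword set; extend it with whatever dominates your cloud
stopwords = {'的', '了', '和', '我们'}
filtered = ' '.join(w for w in wl_space_split.split() if w not in stopwords)
my_wordcloud = WordCloud(font_path=font, mask=image, background_color='white',
                         max_words=100, max_font_size=100,
                         random_state=50).generate(filtered)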

3. Problems encountered and solutions

Problem: An error occurred when installing wordcloud (I didn't take a screenshot of the error at the time).

Solution: After searching online, I found that the Python on my computer is 32-bit while the wordcloud package I had installed was built for 64-bit, which caused the error. Downloading and installing the 32-bit wordcloud wheel solved the problem.

Steps: 1. Download wordcloud-1.4.1-cp36-cp36m-win32.whl

 

2. Open the command line and run pip install wordcloud-1.4.1-cp36-cp36m-win32.whl, then pip install wordcloud, to complete the installation

3. After the installation succeeds, add the package as a dependency in PyCharm
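
To avoid a mismatched wheel in the first place, you can check the interpreter's bitness before downloading. A quick check using only the standard library:

import struct

# Prints 32 or 64, matching the win32 / amd64 tag of the wheel you need
print(struct.calcsize('P') * 8)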

 

4. Data Analysis and Conclusions

Through the data, we can see that the campus news mainly reports on happenings across the campuses (especially the design college), the work of the Party committee, and vocational and technical education information.
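
The same impression can be checked numerically rather than just visually. A minimal sketch that counts the most frequent segmented words, reusing wl_space_split from the script above and skipping single-character fragments:

from collections import Counter

# Count word frequencies from the segmented text, ignoring 1-character fragments
counts = Counter(w for w in wl_space_split.split() if len(w) > 1)
for w, n in counts.most_common(20):
    print(w, n)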

 

5. Complete code

import requests
from bs4 import BeautifulSoup
import pandas
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
import codecs
import numpy as np
from PIL import Image
import re

# Append the scraped article body to content.txt
def writeNewsDetail(content):
    with open('content.txt', 'a', encoding='utf-8') as f:
        f.write(content)

# Get the details of a single news article
def getNewDetails(newsUrl):
    resd = requests.get(newsUrl)
    resd.encoding = 'utf-8'
    soup = BeautifulSoup(resd.text, 'html.parser')
    news = {}
    news['title'] = soup.select('div > h3')[0].text
    # The .title element contains the title followed by the "release time"
    # label and the date; cut out the title and the label to keep the date.
    # (str.lstrip strips a set of characters rather than a prefix, so the
    # original lstrip call was a bug; replace() does what was intended.)
    titleline = soup.select('.title')[0].text
    news['release time'] = titleline.replace(news['title'], '').replace('release time', '').strip(' :：')
    news['link'] = newsUrl
    content = soup.find_all('div', class_='content')[0].text
    writeNewsDetail(content)
    return news

# Get all news URLs on one list page and scrape each article
def getListPage(newsurl):
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # Narrow the search to the main list container to avoid unrelated links
    soup = soup.find('div', class_='mainl')
    newslist = []
    for new in soup.select('ul > li > a'):
        newsUrl = new.attrs['href']
        newslist.append(getNewDetails(newsUrl))
    return newslist


# The URL of the first news page differs from that of the other pages,
# so first collect all the news on the first page into newsTotal
newsurl = 'http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/index.html'
newsTotal = []
newsTotal.extend(getListPage(newsurl))
# Then collect the news from the remaining pages in a loop
for i in range(2,55):
    listPageUrl = 'http://www.gdqy.edu.cn/viscms/xiaoyuanxinwen2538/index_{}.html'.format(i)
    newsTotal.extend(getListPage(listPageUrl))

for news in newsTotal:
    print(news)

# Store the obtained news information in the Excel table news.xlsx
df = pandas.DataFrame(newsTotal)
df.to_excel('news.xlsx',encoding='utf-8')

# Read the crawled text back in
file = codecs.open('content.txt', 'r', 'utf-8')
word = file.read()
file.close()
# Use the given image as the mask, i.e. the shape of the word cloud
image = np.array(Image.open('tree.jpg'))
# Word cloud font settings (SimHei supports Chinese characters)
font = r'C:\Windows\Fonts\simhei.ttf'
# Remove English letters, digits and punctuation, keeping the Chinese text
resultword = re.sub(r"[A-Za-z0-9`~!@#$^&*()=|{}':;,.<>/?%\[\]\\]", "", word)
# Segment the Chinese text with jieba (full mode)
wordlist_after_jieba = jieba.cut(resultword, cut_all=True)

wl_space_split = " ".join(wordlist_after_jieba)
print(wl_space_split)
my_wordcloud = WordCloud(font_path=font, mask=image, background_color='white',
                         max_words=100, max_font_size=100,
                         random_state=50).generate(wl_space_split)
# Color generator based on the mask image (for optional recoloring)
image_colors = ImageColorGenerator(image)
#my_wordcloud.recolor(color_func=image_colors)
# Display the generated word cloud
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
# Save the generated image; this line runs after the display window is
# closed, so interrupting the program earlier means nothing is saved
my_wordcloud.to_file('result.jpg')

  

6. Personal feelings and experiences

In fact, I originally wanted to crawl the event news module of the League of Legends official website, but I ran into a lot of problems while scraping the news information: the tags were clearly correct, yet the data I got back was not the target information at all, or could not be found at all. In the end I had to give up and chose the campus news module of Guangdong Light Industry for crawling instead. Although both crawling attempts ran into many problems, I still felt a small sense of accomplishment when I saw the scraped information displayed item by item, and I learned a lot of new things along the way, such as how to generate a word cloud. All in all, you gain something whenever you try, and I need more practice and hard work.

