Python Course Project - Crawler + Visualization + Data Analysis + Database (Data Analysis)


1. Generate lyrics word cloud

First, collect the lyrics of all the crawled songs and merge them into a single string.

Then extract the Chinese text from it and join the pieces back into one string.

import re

text = re.findall('[\u4e00-\u9fa5]+', lyric, re.S)  # extract runs of Chinese characters
text = " ".join(text)  # join the runs back into one string
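The two steps above can be sketched as a small self-contained example. The `lyrics` list and its sample contents are made up for illustration, since the original does not show how the crawled lyrics are stored:

```python
import re

# Hypothetical sample data standing in for the crawled lyrics
lyrics = [
    "窗外的麻雀 在电线杆上多嘴\nLa la la",
    "[00:12.50]你说这一句 很有夏天的感觉",
]

# Step 1: merge all lyrics into one string
lyric = "\n".join(lyrics)

# Step 2: keep only runs of Chinese characters (U+4E00 to U+9FA5),
# which drops Latin letters, digits, timestamps and punctuation
chinese_runs = re.findall('[\u4e00-\u9fa5]+', lyric)
text = " ".join(chinese_runs)
print(text)
```

The `re.S` flag only changes how `.` matches, so it makes no difference for a pattern like this one that contains no `.`.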

Then segment the text with jieba (cut_all=True selects full mode) and keep only the words whose length is at least 2.

import jieba

word = jieba.cut(text, cut_all=True)  # segment the text in full mode
new_word = []
for i in word:
    if len(i) >= 2:
        new_word.append(i)  # keep only words of length 2 or more
final_text = " ".join(new_word)

Next, choose a good-looking picture to use as the shape (mask) of the word cloud, and you can start generating!
[Image: the picture used as the word cloud mask]

import os
import numpy as np
from PIL import Image
from wordcloud import WordCloud
mask = np.array(Image.open("2.jpg"))  # load the mask picture as an array
word_cloud = WordCloud(background_color="white", width=800, height=600, max_words=100,
                       max_font_size=80, contour_width=1, contour_color='lightblue',
                       font_path="C:/Windows/Fonts/simfang.ttf", mask=mask).generate(final_text)
# plt.imshow(word_cloud, interpolation="bilinear")
# plt.axis("off")
# plt.show()
word_cloud.to_file(self.keyword + '词云.png')  # save the word cloud as an image
os.startfile(self.keyword + '词云.png')  # open the saved image (Windows only)

In the WordCloud parameters, contour_width=1 and contour_color='lightblue' set the thickness and color of the outline drawn around the mask shape. If contour_width is not set, no outline is drawn. font_path specifies the font, which must support Chinese characters here. When a mask is given, width and height are ignored and the shape of the mask is used instead.

After generation, the word cloud can be displayed with plt.show() (the commented-out lines) or saved locally and opened. The final result is as follows:

[Image: the generated lyrics word cloud]

2. Pie chart of popular singers' song counts

[Image: pie chart of song shares for the popular singers, with the Jay Chou slice highlighted]

First, get the list of popular singers and the number of songs by each of them.

Then divide each singer's song count by the total song count of all ten singers to get each singer's share of the songs.
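As a rough sketch of this calculation, assuming the song counts are already in a list (the names and counts below are invented for illustration, and only four singers are used instead of ten to keep it short):

```python
# Hypothetical singers and song counts, for illustration only
name = ["Singer A", "Singer B", "Singer C", "Singer D"]
song_counts = [120, 60, 15, 5]

total = sum(song_counts)

# Share of each singer as a fraction of the total number of songs
proportion = [count / total for count in song_counts]
print(proportion)
```

The resulting `proportion` list has the same order as `name`, which is what the pie chart needs for its labels to line up.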

Next, you can choose which slice to highlight. In the picture, the Jay Chou slice is highlighted.

To do this, give the slice to be highlighted a nonzero value (its offset from the center, as a fraction of the radius) and leave the others at 0:

explode = [0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
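Instead of writing the list by hand, it could also be built from the label list, so the highlight follows the singer even if the order changes. This is only a sketch; the `name` list and the `highlight` variable are hypothetical:

```python
# Hypothetical label list; in the article it holds the ten popular singers
name = ["周杰伦", "Singer B", "Singer C"]
highlight = "周杰伦"  # the slice to pull out of the pie

# 0.1 for the highlighted singer, 0 for everyone else
explode = [0.1 if n == highlight else 0 for n in name]
print(explode)
```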

Next, you can generate a pie chart

import os
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 9))  # set the figure size (width, height)
plt.rcParams['font.sans-serif'] = ['SimHei']  # use a font that can render Chinese
plt.axes(aspect=1)  # equal aspect ratio so the pie is a circle
plt.pie(x=proportion, labels=name, explode=explode, autopct='%3.1f %%',
        shadow=True, labeldistance=1.2, startangle=0, pctdistance=0.8)
plt.title("热门歌手歌曲量占比")
# plt.show()
plt.savefig("热门歌手歌曲量占比饼图.jpg")
os.startfile("热门歌手歌曲量占比饼图.jpg")  # open the saved image (Windows only)

Here x is the list of song proportions, labels is the list of corresponding labels (in this figure, the singers' names), and explode is the highlight list mentioned above. The values in these three lists correspond one to one. autopct sets how each percentage is displayed: in '%3.1f %%', 3.1f means a minimum field width of 3 characters (a longer value is printed in full) with 1 digit after the decimal point, and %% prints a literal percent sign.
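The format string can be tried out on its own with Python's `%` operator, which is how a string autopct is applied to each slice's percentage. The sample values below are arbitrary:

```python
fmt = '%3.1f %%'

# Arbitrary sample percentages
samples = (5.0, 12.345, 100.0)
formatted = [fmt % pct for pct in samples]

for s in formatted:
    print(repr(s))  # repr makes any padding spaces visible

# A wider minimum field width pads short values on the left
padded = '%5.1f' % 5.0
print(repr(padded))
```

With a width of 3, even the shortest value shown here ('5.0') already fills the field, so the width only matters if it is set larger, as in the last line.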

As before, you can show the chart directly with plt.show(), or save it locally and then open it.

3. Bar chart of song popularity distribution

Earlier, we obtained the information of the top 500 songs through the crawler (shown below). Now we want to analyze the popularity of the songs and generate a bar chart.

[Image: the crawled information of the top 500 songs]

The result looks like this:

[Image: bar chart of song popularity]

Originally, I wanted to generate a bar chart of the number of popular songs each singer has, but the website the popular songs were crawled from does not list the singer of each song, and fetching the singers from other websites was too much trouble, so I did not do it. Interested readers can implement it themselves.

First, we need the number of songs in each popularity range.

The data list below holds the number of songs for each range in the x tuple.

We only need to traverse the list of song popularity values and add 1 to the matching entry of data each time. At the end, data holds the number of songs in each popularity range.

x = ('0-10', '10-20', '20-30', '30-40', '40-50', '>50')
data = [0, 0, 0, 0, 0, 0]
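The traversal itself is not shown in the original, so the following is a minimal sketch under some assumptions: the popularity values are non-negative numbers in a list called `heat_list` (a made-up name with made-up values), and a boundary value such as 10 is counted in the higher range.

```python
x = ('0-10', '10-20', '20-30', '30-40', '40-50', '>50')
data = [0] * len(x)

# Hypothetical popularity values standing in for the crawled data
heat_list = [3.2, 10, 18.5, 47, 50, 125.4, 9.99]

for heat in heat_list:
    # Assumption: ranges are half-open, so 10 falls in '10-20',
    # and everything from 50 upward is counted in the last bucket
    index = min(int(heat // 10), len(data) - 1)
    data[index] += 1

print(dict(zip(x, data)))
```

Using integer division by 10 avoids a chain of if/elif comparisons, and `min` caps the index so that any value of 50 or more lands in the final bucket.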

The next step is to create the bar chart. First, fix the problem of garbled Chinese characters (and of the minus sign not rendering with a Chinese font):

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

Then create the chart with plt.bar. The first argument is the x-axis data and the second is the y-axis data (the bar heights); the color keyword sets the fill color of the bars, and alpha sets their transparency.

title, xlabel, and ylabel set the chart title and the names of the x-axis and y-axis.

plt.bar(x, data, color='steelblue', alpha=0.8)
plt.title("pop500歌曲热度")
plt.xlabel("歌曲热度范围")
plt.ylabel("歌曲数量")
plt.show()

Origin blog.csdn.net/qq_25046827/article/details/122001629