Poetry data mining and analysis based on Python



Preface

  To deepen understanding of traditional Chinese culture and strengthen cultural confidence, it is worthwhile to apply data analysis to classical poetry and to examine the philosophy of life it contains. This project uses Python to crawl poems from the Internet, save them locally, process the poem data with third-party modules, and draw charts that make the information more intuitive. PyCharm is the programming tool, and the implementation relies on several third-party libraries. The results present many aspects of the poems, so readers can see the characteristics of individual poems, the connections between them, and the philosophy of life they express. This helps disseminate and promote traditional Chinese culture and, in a small way, contributes to the great rejuvenation of the Chinese nation.

Keywords: Python; Jieba word segmentation; natural language processing; sentiment analysis; data analysis; data mining; poetry analysis.

1. Function introduction

(1) Data acquisition: obtain the URLs of the poems through Baidu.
(2) Data mining: crawl poems from the Internet to the local hard drive with Python code.
(3) Data analysis: process the poems stored on the local hard drive with Python code and analyze them from various angles.
(4) Data visualization: visualize the processed and analyzed poems with Python code.

Project function module design

The project has four major modules: data acquisition, data crawling, data analysis, and data visualization. Data acquisition covers information query and information collection. Data crawling covers crawling the poems. Data analysis covers a poem preprocessing stage and a poem analysis stage. Data visualization renders the analyzed data as charts. These modules correspond to functions (1) to (4) listed in Section 1, and the structure is shown in Figure 3.1-1.
Figure 3.1-1

Project logic model design

The project has a single model, the poem model. A poem has a URL, a category, a main body, and word segmentation content. The main body holds the main information of the poem: author, title, dynasty, and text. See Figure 3.2-1.
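The poem model described above could be written down as a pair of data classes. This is only a sketch: the class and field names (Poem, PoemBody, segmented_words) are hypothetical and do not come from the project's code, and the poem text is left empty rather than guessed.

```python
from dataclasses import dataclass, field


@dataclass
class PoemBody:
    """Main body of a poem: the four fields named in the logical model."""
    author: str
    title: str
    dynasty: str
    content: str


@dataclass
class Poem:
    """One poem record. Field names are illustrative assumptions."""
    url: str
    category: str
    body: PoemBody
    # Filled in later by the word segmentation step.
    segmented_words: list[str] = field(default_factory=list)


poem = Poem(
    url="https://so.gushiwen.org/shiwenv_45c396367f59.aspx",
    category="Tang poems",
    body=PoemBody(author="Yuan Zhen", title="Xinggong", dynasty="Tang", content=""),
)
print(poem.body.title, poem.body.author)
```

Keeping the main body in its own class mirrors the model's split between the poem itself and the metadata gathered around it (URL, category, segmentation result).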

Figure 3.2-1

2. Development environment

  This project is developed in Python with PyCharm and uses third-party libraries such as requests, jieba, BeautifulSoup, and matplotlib. The project follows a software engineering process throughout: first collect and review knowledge related to the project, then investigate and analyze the requirements, then design the project in detail and list the functional modules, and finally implement each functional module step by step.
  Once the functional analysis is complete, the code is written, and once the whole design is complete, it is tested. Each module is tested independently, and then all modules are integrated and tested together. The design is considered fully realized when every module passes its tests and the integrated project runs normally.


3. Programming

3.1 Data acquisition module test

3.1.1 Poetry format and link analysis

  The base URL of the poetry site is https://so.gushiwen.org, which we define as the_main_page_url. A page such as https://so.gushiwen.org/gushi/shijing.aspx is then written as the_main_page_url/gushi/shijing.aspx.
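In code, this convention amounts to joining a site-relative path onto the base URL. The sketch below uses the standard library's urljoin; the helper name full_url is an assumption made for illustration and is not taken from the project.

```python
from urllib.parse import urljoin

the_main_page_url = "https://so.gushiwen.org"


def full_url(path: str) -> str:
    """Join a site-relative path onto the base URL (hypothetical helper)."""
    return urljoin(the_main_page_url + "/", path.lstrip("/"))


print(full_url("/gushi/shijing.aspx"))
```

Normalizing the leading slash means the helper accepts both "/gushi/shijing.aspx" and "gushi/shijing.aspx".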
  From the home page the_main_page_url we find the "300 Tang Poems" collection. All of these poems are listed at the_main_page_url/gushi/tangshi.aspx, as shown in Figure 4.3.1-1.

Figure 4.3.1-1
  Press F12 to open the browser's developer tools. Click the element picker in the upper left corner of the developer panel and then click a poem on the page. The panel jumps to the HTML code for that poem and its related information, as shown in Figure 4.3.1-2 and Figure 4.3.1-3.

Figure 4.3.1-2

Figure 4.3.1-3
  Inspecting the element shows the poem's information: the title is "Xinggong" and the author is Yuan Zhen. Combining the listing page the_main_page_url/gushi/tangshi.aspx with the link in the inspected element gives the poem's URL, the_main_page_url/shiwenv_45c396367f59.aspx. Clicking several other poems gives the_main_page_url/shiwenv_c90ff9ea5a71.aspx and the_main_page_url/shiwenv_5917bc6dca91.aspx. The URL of a poem therefore follows a template: the prefix the_main_page_url followed by the value of the href attribute in the poem's own link, href="/shiwenv_xxx.aspx". Next we open the URL of one poem, for example the_main_page_url/shiwenv_c90ff9ea5a71.aspx, and press F12 to open the developer tools. The result is shown in Figure 4.3.1-4 and Figure 4.3.1-5.
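The link template can be turned into a small extractor that collects every href of the form /shiwenv_xxx.aspx from a listing page. The project itself uses BeautifulSoup; the sketch below deliberately swaps in the standard library's html.parser so that it runs without extra installs. The names PoemLinkParser and extract_poem_urls, the assumption that xxx consists of lowercase letters and digits, and the sample HTML fragment are all made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

the_main_page_url = "https://so.gushiwen.org"
# Assumption: the id part is lowercase letters and digits, as in the examples.
POEM_HREF = re.compile(r"^/shiwenv_[0-9a-z]+\.aspx$")


class PoemLinkParser(HTMLParser):
    """Collects absolute URLs of <a> tags whose href matches the poem pattern."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if POEM_HREF.match(href):
            self.links.append(urljoin(the_main_page_url, href))


def extract_poem_urls(html: str) -> list[str]:
    parser = PoemLinkParser()
    parser.feed(html)
    return parser.links


# Made-up fragment standing in for the listing page.
sample = """
<div>
  <span><a href="/shiwenv_45c396367f59.aspx">Xinggong</a>(Yuan Zhen)</span>
  <span><a href="/gushi/tangshi.aspx">Tang poems</a></span>
</div>
"""
print(extract_poem_urls(sample))
```

Filtering on the href pattern rather than on page layout keeps the extractor independent of how the listing page arranges its links.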


Figure 4.3.1-4

Figure 4.3.1-5
  Expanding the elements reveals all of the content shown in Figure 4.3.1-5.
  From this we can read the title, author, dynasty, and text of the poem. The title is in the h1 tag. The author is in the first a tag inside the p tag with class source; here the author is Yuan Zhen. The dynasty is in the second a tag inside the same p tag, shown in parentheses; here it is the Tang Dynasty. The text of the poem is in the div tag with class contson. Locating these elements and filtering them by these rules yields all the information about a poem. Similarly, the author's personal homepage is the_main_page_url/authorv_201a0677dee4.aspx.
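The extraction rules above (h1 for the title, the first and second a tags in p.source for author and dynasty, div.contson for the text) can be sketched as a small parser. As a stand-in for the project's BeautifulSoup code, which is not shown here, this version uses only the standard library. The names PoemPageParser and parse_poem_page are hypothetical, the sample HTML is a hand-written fragment with placeholder text in place of the poem, and the sketch assumes one h1 and one contson block per page.

```python
from html.parser import HTMLParser


class PoemPageParser(HTMLParser):
    """Pulls title, source links, and poem text out of a poem page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.source_links = []   # text of each <a> inside <p class="source">
        self.content = ""
        self._in_h1 = False
        self._in_source = False
        self._in_source_a = False
        self._contson_depth = 0  # > 0 while inside <div class="contson">

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1":
            self._in_h1 = True
        elif tag == "p" and "source" in classes:
            self._in_source = True
        elif tag == "a" and self._in_source:
            self._in_source_a = True
            self.source_links.append("")
        elif tag == "br" and self._contson_depth:
            self.content += "\n"  # keep line breaks between verses
        elif tag == "div" and (self._contson_depth or "contson" in classes):
            self._contson_depth += 1

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
        elif tag == "p":
            self._in_source = False
        elif tag == "a":
            self._in_source_a = False
        elif tag == "div" and self._contson_depth:
            self._contson_depth -= 1

    def handle_data(self, data):
        if self._in_h1:
            self.title += data
        elif self._in_source_a:
            self.source_links[-1] += data
        elif self._contson_depth:
            self.content += data


def parse_poem_page(html: str) -> dict:
    parser = PoemPageParser()
    parser.feed(html)
    # First <a> is the author, second is the dynasty in parentheses.
    author, dynasty = (parser.source_links + ["", ""])[:2]
    return {
        "title": parser.title.strip(),
        "author": author.strip(),
        "dynasty": dynasty.strip("()〔〕 \n"),
        "content": parser.content.strip(),
    }


# Hand-written fragment; hrefs and verse text are placeholders.
sample = """
<h1>Xinggong</h1>
<p class="source"><a href="#">Yuan Zhen</a><a href="#">(Tang Dynasty)</a></p>
<div class="contson">first line<br/>second line</div>
"""
print(parse_poem_page(sample))
```

With BeautifulSoup the same rules would normally be expressed as tag and class lookups; the state flags here do the equivalent job by hand.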
  We then open the author's homepage and the developer tools, as shown in Figure 4.3.1-6 and Figure 4.3.1-7.

Figure 4.3.1-6

Figure 4.3.1-7

  Other functions draw further charts that provide more information. One example is the word cloud in Figure 4.5-4, produced by the function ciyun(). In the word cloud, the larger a word is drawn, the more often it appears in the poems. Words for people, such as "someone", "gentleman", and "guest", are prominent and call homesickness to mind: in He Zhizhang's "Returning Home: An Impromptu Poem" (回乡偶书), the line "Smiling, they ask where the guest has come from" (笑问客从何处来) conveys the poet's longing for his hometown, from which he has been away so long that the children no longer recognize him. Words such as "sun", "moon", and "cloud" describe scenery: "Mist veils the cold water and moonlight veils the sand" (烟笼寒水月笼沙) in Du Mu's "Mooring on the Qinhuai" (泊秦淮) paints a hazy, cold night scene. Words for the seasons, such as "spring" and "autumn", also stand out: "The green grass by the steps keeps its own spring color" (映阶碧草自春色) in Du Fu's "The Prime Minister of Shu" (蜀相) shows the scenery of spring, when everything revives and the grass is green. "If you have a true friend within the seas" (海内存知己) in Wang Bo's "Farewell to Vice-Prefect Du on His Appointment in Shuzhou" (送杜少府之任蜀州) expresses the deep friendship between the poet and his old friend. Even at a glance a few keywords stand out, and familiar poems come to mind as soon as you see the picture.


Figure 4.5-4
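The word cloud rests on a simple count: the more often a word occurs, the larger it is drawn. The body of ciyun() is not shown in this article, so the sketch below covers only the counting step, under the assumption that the poems have already been segmented into words (for example by jieba). The function name word_frequencies and the English token list are illustrative stand-ins, not project data.

```python
from collections import Counter


def word_frequencies(words, stopwords=frozenset()):
    """Count how often each segmented word occurs, skipping stopwords."""
    return Counter(w for w in words if w and w not in stopwords)


# Hand-made token list standing in for segmented poem text.
tokens = ["moon", "guest", "spring", "moon", "autumn", "guest", "moon"]
freq = word_frequencies(tokens)
print(freq.most_common(2))  # → [('moon', 3), ('guest', 2)]
```

A word cloud renderer would then map each count to a font size; that drawing step is omitted here.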

4. Conclusion

  This design implements a project that uses Python to mine and analyze poetry, and it achieves the initial design goals. Readers can learn about traditional Chinese culture from it and can use the code as Python learning material. It relies on convenient Python modules such as matplotlib and uses PyCharm as the development environment. The design helps disseminate Chinese history and culture and contributes to the development of global culture. In the information age, the spread of information on the Internet will further increase the reach and influence of Chinese culture.
  After testing, the design achieves its four functions. (1) Data acquisition: obtain the URLs of the poems through Baidu. (2) Data mining: crawl poems from the Internet to the local hard drive with Python code. (3) Data analysis: compute statistics on the poems stored on the local hard drive with Python code and analyze them from various angles. (4) Data visualization: render the processed and analyzed poems as charts in various styles with Python code, intuitively showing the great "energy" contained in short Tang poems.
  The design is complete, but it still has shortcomings and room for improvement. It could crawl more dynasties, more poets' works, or all of a poet's poems and famous lines, analyze them, and visualize them with a wider range of charts. This would reveal more about the poets, their poems, and the mood and circumstances in which the poems were written. Counting the content of the poems could also yield more information.

Table of contents

Abstract
Data mining and analysis of poetry based on Python
1 Introduction
2 Project outline design
2.1 Project design ideas
2.2 Introduction to development tools and development environment
2.2.1 Introduction to PyCharm development software
2.2.2 Introduction to Jieba module
2.2.3 Introduction to requests module
2.2.4 Introduction to BeautifulSoup module
2.3 Project feasibility analysis
2.3.1 Technical feasibility
2.3.2 Economic feasibility
2.3.3 Operational feasibility
3 Overall project design
3.1 Project functional module design
3.2 Project logical model design
3.3 Project physical structure design
4 Detailed project design and implementation
4.1 Overview of project implementation
4.2 Project environment construction
4.2.1 Python installation
4.2.2 PyCharm installation
4.2.3 Third-party library installation
4.3 Data acquisition module test
4.3.1 Poetry format and link analysis
4.4 Data mining module test
4.4.1 Code writing
4.4.2 Overall module test
Conclusion
Acknowledgments

Origin blog.csdn.net/QQ2743785109/article/details/133781298