Using GPT-2 to Write Poems Automatically, Starting with Five-Character Quatrains

Before the Spring Festival, I used GPT-2 to train an automatic couplet system: for the Spring Festival of the Year of the Rat, GPT-2 automatically wrote and matched Spring Festival couplets. Logically, this NLG approach can be applied to automatic text generation in any domain, and the more fixed the format, the better. That naturally made me think of automatic poem writing, since the format of classical poems is relatively fixed. We have covered this before: the automatic poem writing and acrostic-poem functions are already live on the AINLP public account, and they can be reused directly ("The 'Automatic Poetry Composer' is online, and the code and data are public"). In addition, there is a larger poetry dataset that can serve as the "raw material" for automatic poem composition: [Github] chinese-poetry, the most comprehensive database of classical Chinese poems; plus the training code in [Github] GPT2-Chinese. Everything was ready; all that remained was to try it.

So this week we continue the topic of natural language generation, starting with five-character quatrains (Wujue). Baidu Baike describes the form as follows:

The five-character quatrain (Wujue) is a genre of traditional Chinese poetry: a four-line poem with five characters per line that conforms to the tonal rules of regulated verse, belonging to the category of modern-style poetry. The form originated in the Yuefu poems of the Han Dynasty, was deeply influenced by the folk songs of the Six Dynasties, and matured in the Tang Dynasty. A Wujue has only twenty characters, yet it can present a fresh picture and convey a range of genuine moods. Its greatest strength is that a short piece carries rich content: seeing the large in the small, saying more with less. The form has two metrical patterns. Representative works include Wang Wei's "Birdsong Brook", Li Bai's "Quiet Night Thoughts", Du Fu's "The Eight Formations", Wang Zhihuan's "Climbing Stork Tower", and Liu Changqing's "Seeing Off Master Lingche".

I mainly used the "Complete Tang Poems" (Quan Tang Shi) and "Complete Song Poems" (Quan Song Shi) data from chinese-poetry. First of all, a tribute to the author of that project:

"Complete Tang Poems" was compiled in the forty-fourth year of the Kangxi reign of the Qing Dynasty (1705), edited by ten scholars including Peng Dingqiu, Shen Sanzeng, Yang Zhongne, Wang Shirong, Wang Yi, Yu Mei, Xu Shuben, Che Dingjin, Pan Conglu, and Zha Sitang. It contains "more than 48,900 poems by more than 2,200 poets", in 900 volumes with a 12-volume table of contents. (From Baidu Baike)

"Complete Song Poems": following the great flourishing of Tang poetry, Song poetry made new developments and innovations in both ideological content and artistic expression, produced many outstanding writers, and formed many schools, exerting a far-reaching influence on the development of poetry in the Yuan, Ming, and Qing dynasties.

A note on the data: "Complete Tang Poems" and "Complete Song Poems" are stored in traditional Chinese. Convert them to simplified Chinese if you need to, but be aware that some converted characters may not fit the original context.

The pipeline is: first convert the poems from traditional to simplified Chinese with OpenCC, then extract the five-character quatrains, then convert them to the GPT2-Chinese training format, and finally train and test. Interested readers can try it themselves; it is quite straightforward. The training experience from the automatic couplet project carries over directly:
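The extraction step can be sketched as follows. This is a minimal illustration, assuming chinese-poetry's JSON layout (each file holds a list of poems with a "paragraphs" field); the OpenCC traditional-to-simplified conversion (e.g. `OpenCC('t2s').convert(line)`) is omitted to keep the sketch self-contained, and the exact training-file format expected by your version of GPT2-Chinese should be checked against its README.

```python
import json
import re

# Split poem text on common Chinese/ASCII punctuation to recover the phrases.
PUNCT = re.compile(r"[,。,.!?!?、;;]")

def extract_wujue(poems):
    """Keep only five-character quatrains: exactly 4 phrases of 5 characters."""
    kept = []
    for poem in poems:
        text = "".join(poem.get("paragraphs", []))
        phrases = [p for p in PUNCT.split(text) if p]
        if len(phrases) == 4 and all(len(p) == 5 for p in phrases):
            # Re-punctuate into the conventional two-couplet layout.
            kept.append(",".join(phrases[:2]) + "。" + ",".join(phrases[2:]) + "。")
    return kept

def write_training_file(samples, path="train.json"):
    # GPT2-Chinese's raw mode reads a JSON list of text samples
    # (an assumption here; verify against the repository's README).
    with open(path, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False)

if __name__ == "__main__":
    demo = [
        {"title": "静夜思", "paragraphs": ["床前明月光,疑是地上霜。", "举头望明月,低头思故乡。"]},
        {"title": "非五绝", "paragraphs": ["七言的句子不止五个字,所以会被过滤掉。"]},
    ]
    print(extract_wujue(demo))  # only the first poem survives the filter
```

The same filter can be pointed at every `poet.tang.*.json` / `poet.song.*.json` file in chinese-poetry to build the full training set.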

1) The training data can be converted with a script according to the GPT2-Chinese training-data format requirements. You can also add special markers, so that generation can later be steered ("tricked") based on those markers;
2) Set the min_length parameter to a small value during training; the default is 128. Because the poem samples are short, training with the default setting produces only garbled output. I set it directly to 1;
3) Adjust batch_size and other configuration parameters according to your GPU memory. The default batch_size of 8 caused OOM on a 1080 Ti during training; setting it to 4 ran through completely, and no other parameters needed changing.
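Put together, a training invocation under the settings above might look like the following. This is a sketch only: the flag names follow the GPT2-Chinese repository's conventions and may differ in your checkout, so verify them with `python train.py --help` first.

```shell
# Hypothetical GPT2-Chinese training command reflecting the tips above:
# min_length lowered to 1 for short samples, batch_size reduced to 4 for
# an 11 GB GPU. Paths are placeholders for your own data and config.
python train.py \
  --raw_data_path data/train.json \
  --tokenizer_path cache/vocab_small.txt \
  --model_config config/model_config_small.json \
  --raw \
  --min_length 1 \
  --batch_size 4 \
  --epochs 30
```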

After the GPT-2 poem model finishes training, it can be tested directly with the generate.py script in GPT2-Chinese, which is very convenient. I wrote a server version based on generate.py and flask-restful and connected it to the backend of the AINLP public account. Interested readers can follow the AINLP public account and test it directly:

The keyword "写诗" (write a poem) triggers automatic poem generation. For example, send "写诗 春" (write a poem: spring), and the model will continue writing from "春" and return a poem starting with that character. The same works for other seed characters. Currently at most five characters are accepted, because only five-character quatrains can be generated:
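The keyword routing on the account backend can be sketched as below. This is a hypothetical helper, not the actual AINLP server code: it recognizes the "写诗" trigger, extracts the seed characters, and rejects seeds longer than five characters, matching the limit described above.

```python
def parse_poem_request(message, max_seed_len=5):
    """Parse a public-account message like '写诗 春' into a seed string.

    Returns the seed characters, or None if the message is not a poem
    request or the seed is too long (only five-character quatrains are
    supported, so at most five seed characters are accepted).
    """
    message = message.strip()
    if not message.startswith("写诗"):
        return None
    seed = message[len("写诗"):].strip()
    if not seed or len(seed) > max_seed_len:
        return None
    return seed

if __name__ == "__main__":
    print(parse_poem_request("写诗 春"))        # valid one-character seed
    print(parse_poem_request("写诗 春眠不觉晓晓"))  # six characters: rejected
```

The returned seed would then be fed to the generation script as the prefix to continue from.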


The keyword "藏头诗" (acrostic poem) triggers acrostic generation. For example, send "藏头诗 春夏秋冬" (spring, summer, autumn, winter), and the poem is generated by stacking a trick on top of the GPT-2 model:
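The acrostic "trick" can be sketched as follows: when generating each line, the first character is forced to the corresponding head character, and the model only fills in the remaining four. This is a minimal illustration; `sample_next` stands in for the real GPT-2 sampling call (which in practice lives in GPT2-Chinese's generate.py), and the stub below just returns a filler character so the control flow is runnable without a trained model.

```python
LINE_LEN = 5  # five-character quatrain

def make_acrostic(heads, sample_next):
    """Generate one five-character line per head character.

    `sample_next(context)` is a stand-in for the model call: given the text
    generated so far, it returns the next character. The first character of
    each line is forced to the head character instead of being sampled.
    """
    lines = []
    for head in heads:
        line = head  # force the line to start with the head character
        while len(line) < LINE_LEN:
            line += sample_next("".join(lines) + line)  # model fills the rest
        lines.append(line)
    return lines

if __name__ == "__main__":
    stub = lambda context: "云"  # stub "model": always the same filler character
    for line in make_acrostic("春夏秋冬", stub):
        print(line)
```

With a real sampler, constraining the first token like this is enough to produce a poem whose lines spell out the hidden phrase.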


Finally, you are welcome to follow the AINLP public account and test the automatic poem writing and acrostic poem functions:


For the other dialogue modules of the AINLP public account, interested readers can refer to:

Tencent word vectors: similar words, similarity, and word-game series
Similar word query: playing with the Tencent AI Lab Chinese word vectors
Playing with Tencent word vectors: word similarity calculation and online query
Tencent word vectors in practice: indexing and fast querying with Annoy
Fun with Tencent word vectors: a game of words (word addition and subtraction)
Word vector game: Messi - Argentina + Portugal = ?
Building a demo API for the Tencent 8-million-word Chinese word vectors

NLP-related tools and online tests (public account dialogue tests)
Five Chinese word segmentation tools online PK: Jieba, SnowNLP, PkuSeg, THULAC, HanLP
Chinese word segmentation tools online PK, new additions: FoolNLTK, LTP, StanfordCoreNLP
Python Chinese word segmentation tools collection: installation, usage, and testing
Eight Chinese part-of-speech tagging tools: usage and online testing
A trial of Baidu's deep-learning Chinese lexical analysis tool LAC
Trying out Baidu's deep-learning sentiment analysis tool
The AINLP public account adds a SnowNLP sentiment analysis module

Automatic couplet and poem composers
"Three feet of sword in wind and cloud, a bed of books on flowers and birds": a couplet dataset and an automatic couplet robot
Results of the automatic couplet contest and an appreciation of machine-written couplets
The "Automatic Poetry Composer" is online, and the code and data are public
Writing and matching Spring Festival couplets with GPT-2 for the Year of the Rat

Kuakua (praise) chatbot and other skills
Implementing a kuakua (praise) chatbot in one line of Python
Crawling a kuakua corpus for the praise chatbot
Praise chatbot upgrade: from random to semi-personalized
Come and try voice (recognition) chat (bot)
Come and try idiom solitaire (chengyu chains)
A recommended Chinese dataset: online lookup of characters, words, idioms, and xiehouyu sayings
The AINLP public account adds a "Bullshit Article Generator" interface
Come and try the rainbow-fart (flattery) generator

If you are interested in the articles on the AINLP public account, you are also welcome to check our annual reading list: the AINLP Annual Reading Collection.

About AINLP

AINLP is a fun, AI-focused natural language processing community dedicated to sharing work on AI, NLP, machine learning, deep learning, recommendation algorithms, and related technologies. Topics include text summarization, question answering, chatbots, machine translation, text generation, knowledge graphs, pre-trained models, recommender systems, computational advertising, job postings, and job-hunting experience. Welcome to follow! To join the technical discussion group, add the AINLP assistant on WeChat (id: AINLP2) and note your field of work/research and your reason for joining.





Source: blog.51cto.com/15060464/2675643