R Language Lab Report: Analyzing Text and Generating a Word Cloud

I. Training Content

  1. Before reading any data, download and load the required packages.
  2. Use R's scan() function to read the experimental dataset from an external txt file.
  3. Segment the input text into words and filter out words that do not meet the length requirement. After counting word frequencies, also filter out numeric tokens, and finally sort the words in descending order of frequency.
  4. After segmentation and frequency counting, use the wordcloud package to generate a word-cloud image from the dataset.

II. Experimental Goals

  1. Master the basic package operations, including downloading and loading. At the same time, read the dataset's contents from a plain-text file.
  2. During data processing, master the methods for word segmentation, frequency counting, filtering, and sorting, and learn to control the corresponding function parameters.
  3. Gain a deeper understanding of the basic principles and implementation of word-cloud generation, apply it flexibly to various kinds of datasets, and master the process of converting a processed dataset into an image.

III. Experimental Platform

  1. System: Windows 10

          Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz 2.90 GHz

          RAM 8.00 GB

  2. Tools:

          R x64 3.6.1

          notepad.exe

          Eclipse

          Word 2016

IV. Implementation Steps

1) Reading in the data

1. Switch the R workspace. First create a new folder under C:\ to serve as the workspace, then open R x64 3.6.1 and enter the command getwd() to display R's current working directory. Next enter the command setwd("C:/workspace") to make that folder R's working directory, and call getwd() again to verify that the switch succeeded.

Figure 4-1 Switching the workspace
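The commands from this step can be combined into a short script (the folder name C:/workspace is the one created above):

```r
# Display R's current working directory.
getwd()

# Switch the workspace to the folder created under C:\ ;
# note that R uses forward slashes in paths, even on Windows.
setwd("C:/workspace")

# Calling getwd() again confirms the switch succeeded.
getwd()
```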

 

2. Download the jiebaR package. In R x64 3.6.1, enter the command install.packages("jiebaR") to download the 'jiebaR' package. In the mirror-selection dialog that pops up, choose the Shanghai server, China (Shanghai) [https]. Once the download completes, the result is as shown in Figure 4-2:

Figure 4-2 jiebaR downloaded successfully

  After the install command is entered, RGui (64-bit) checks the dependencies and automatically downloads and installs the 'jiebaRD' and 'Rcpp' packages. Once downloading finishes, R unpacks the packages and verifies them with MD5 checksums.

 

 

3. Download the wordcloud package using the same method as step 2. In R x64 3.6.1, enter the command install.packages("wordcloud") to download the 'wordcloud' package. In the mirror dialog, again choose the Shanghai server.

Figure 4-3 wordcloud downloaded successfully
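Steps 2 and 3 can also be run as one non-interactive call. Pinning repos avoids the mirror dialog; the mirror URL below is an assumption (any CRAN mirror works), since the exact address of the Shanghai server is not given in the text:

```r
# Install both packages in one call; dependencies = TRUE also
# pulls in jiebaR's dependencies, jiebaRD and Rcpp.
install.packages(c("jiebaR", "wordcloud"),
                 repos = "https://mirrors.tuna.tsinghua.edu.cn/CRAN/",  # assumed mirror
                 dependencies = TRUE)
```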

 

4. Load the downloaded jiebaR and wordcloud packages. Enter the command library("jiebaR") to load the jiebaR package, and library("wordcloud") to load the wordcloud package. Then use (.packages()) to check whether the packages are loaded.

Figure 4-4 Packages loaded successfully
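As a script, the loading step looks like this:

```r
# Load the two packages downloaded above.
library("jiebaR")      # also attaches its data package, jiebaRD
library("wordcloud")   # also attaches RColorBrewer

# List the packages currently attached to the search path,
# to confirm both loaded successfully.
(.packages())
```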

 

5. Read the data from the file. The data is read using '\n' as the separator, and what = '' means the values are read as character strings. Enter the command: f <- scan('C:/Users/Raodi/Desktop/snx.txt', sep = '\n', what = '')

Figure 4-5 Loading data from the file
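Spelled out with its arguments, the read step is:

```r
# Read the experiment text file:
#   sep  = '\n' -> each line of the file becomes one element,
#   what = ''   -> elements are read as character strings.
f <- scan('C:/Users/Raodi/Desktop/snx.txt', sep = '\n', what = '')

length(f)   # number of lines read
head(f)     # peek at the first few lines
```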

 

2) Data processing

1. Word segmentation. Use the qseg quick segmenter to segment the input data, with the command: txt <- qseg[f].

2. Filter by character length. Use the command txt <- txt[nchar(txt) > 1] to remove words whose length is less than 2 characters.

3. Count word frequencies. Use the command txt <- table(txt) to tally the frequencies of the length-filtered words.

4. Filter out numbers. Single numeric tokens are meaningless in a word cloud, so numbers must be filtered out. Use the command txt <- txt[!grepl('[0-9]+', names(txt))] to remove the numbers from the dataset in bulk.

5. Check how many words remain after processing, with the command: length(txt).

6. Sort in descending order and extract the 100 most frequent words, with the command: txt <- sort(txt, decreasing = TRUE)[1:100].

7. Inspect the 100 most frequent words.
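The seven processing steps above chain together into one short pipeline (qseg is jiebaR's quick segmenter; f is the character vector read in by scan()):

```r
library(jiebaR)

txt <- qseg[f]                              # 1. segment the lines of f into words
txt <- txt[nchar(txt) > 1]                  # 2. drop words shorter than 2 characters
txt <- table(txt)                           # 3. count word frequencies
txt <- txt[!grepl('[0-9]+', names(txt))]    # 4. drop tokens containing digits
length(txt)                                 # 5. how many distinct words remain
txt <- sort(txt, decreasing = TRUE)[1:100]  # 6. keep the 100 most frequent words
txt                                         # 7. inspect them
```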


  

Figure 4-6 Data processing

3) Creating the word cloud

1. Set the properties of the generated word-cloud image. Use the command png("snxcloud.png", width = 500, height = 500) to create an image named snxcloud.png, 500 pixels high and wide, in R's current working directory.

2. Set the image's background color to black: par(bg = "black")

3. Run the wordcloud() function on the dataset, with the following command:

  wordcloud(names(txt), txt, colors = rainbow(100), random.order=F)

4. Save the generated snxcloud.png image, with the command: dev.off()
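Combined into one script (txt is the frequency table produced in the data-processing stage), the four steps are:

```r
library(wordcloud)

# Open a 500 x 500 PNG graphics device in the working directory.
png("snxcloud.png", width = 500, height = 500)

# A black background makes the rainbow palette stand out.
par(bg = "black")

# Draw the cloud: word sizes follow the frequencies in txt;
# random.order = F places the most frequent words in the centre.
wordcloud(names(txt), txt, colors = rainbow(100), random.order = F)

# Close the device, which writes snxcloud.png to disk.
dev.off()
```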

Figure 4-7 Creating the word-cloud image

 

 

Figure 4-8 Word-cloud image generated in the working directory

 

After running the code above, the file snxcloud.png appears in the workspace, as shown below:

 

Figure 4-9 snxcloud.png

V. Experimental Results

When the result shown in Figure 5-1 appears during the experiment, it means that the data-processing operations in R, reading the data from the file, word segmentation, length filtering, and frequency counting, as well as the generation of the word-cloud image, all completed without problems. That is, the experimental steps above were performed correctly.

 

Figure 5-1 Experiment performed correctly

 

As shown in Figure 5-2, the word-cloud file snxcloud.png was successfully generated in R's working directory, confirming once again that the steps above are correct and produce the expected word-cloud file.

 

Figure 5-2 File generated in the working directory

 

The word cloud finally produced by this experiment is shown in Figure 5-3:

 

Figure 5-3 The finished word cloud

 

VI. Training Summary

The lessons and conclusions from this experiment can be summarized as follows:

    1. This experiment shows that jiebaR is an efficient Chinese word-segmentation package for R, while the wordcloud package contributes little to the analysis itself: wordcloud is simply a package that displays word frequencies graphically, and by varying the cloud's shape and colors it adds polish to the analysis results.
    2. The key to this experiment lies in the data-processing steps: segmentation, frequency counting, filtering, and sorting. Generating the word-cloud image merely saves the processed dataset in pictorial form.
    3. In this experiment, the numbers in the dataset must be filtered out. After segmentation, a single numeric token conveys no discernible meaning in the resulting word cloud, so the numbers are removed.
    4. The method used here is far from the only way to generate a word cloud; there are many others, such as wordcloud2. On the whole, though, the methods and underlying principles are similar, and the steps can be adapted flexibly by analogy.
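As a brief sketch of the wordcloud2 alternative mentioned above (not used in this experiment): wordcloud2() takes a data frame with a word column and a freq column, so the frequency table txt from the processing stage would first be converted:

```r
library(wordcloud2)   # install.packages("wordcloud2") if needed

# wordcloud2 expects a data frame of words and frequencies;
# the table produced earlier converts directly.
df <- data.frame(word = names(txt), freq = as.numeric(txt))

# Renders an interactive HTML word cloud in the viewer / browser.
wordcloud2(df)
```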


Origin www.cnblogs.com/Raodi/p/12155173.html