Modifying a Text File with R

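Requirement: use R to change a single parameter stored in a text file. The file is a namelist-style text file (its first lines appear in the readLines() output further down). The file is fairly large, but each time only the parameter on line 20 needs to change.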

Conventional approach: read the file into memory, modify it, and write it back out.

Problem: reading and writing the whole file takes time.
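For reference, the conventional approach can be sketched as follows. This is a minimal illustration, not code from the original post: the helper name `replace_line_full()` and the demo file contents are made up, and the demo runs on a temporary file.

```r
# Baseline: read the whole file, change one line, write everything back.
# `replace_line_full()` is a hypothetical helper name used for illustration.
replace_line_full <- function(path, line_no, new_text) {
  lines <- readLines(path)            # loads every line into memory
  stopifnot(line_no <= length(lines))
  lines[line_no] <- new_text
  writeLines(lines, path)             # rewrites the entire file
  invisible(lines)
}

# Demo on a temporary file so that no real data is touched.
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("param%02d = %d", 1:30, 1:30), tmp)
replace_line_full(tmp, 20, "param20 = 275.821")
demo_line20 <- readLines(tmp)[20]
```

Every call reads and rewrites all lines, which is the cost the rest of this post tries to avoid.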

Solution 1:

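Use seek(con, where, rw) {base} to move to the required position (where and the related arguments determine the offset), then overwrite that line directly with writeLines(text, con).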

(In the original screenshot, the red boxes mark the text to be modified and its corresponding offset.)

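# Open for reading and writing without truncating the file
con<-file("84.063346096_32.321391739.txt","r+")
# Move the write position to offset 1061 (valid for this particular file only)
seek(con,1061,rw = "write")
# Overwrites bytes in place; writeLines() also writes a line terminator
writeLines("275.821",con)
close(con)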

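Problem: line lengths are not fixed. The numbers on lines 11 and 12, for example, have different numbers of significant digits from file to file, so a hard-coded offset of 1061 cannot be reused across files.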

Solution 2:

Use the n argument of readLines(con, n) to advance the pointer to line 20, then combine it with seek(..., origin) using origin = 'current' to obtain the offset.

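seek() returns the current position (before any move). The first call returns 0, which shows that the read position and the write position are tracked separately: the first five lines have been read, but nothing has been written yet, so the write position is still 0. The second call, however, returns 4174, which is puzzling...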

> con <- file("84.063346096_32.321391739.txt","r+")
> readLines(con,5)
[1] "&METADATA_NAMELIST"                                                   
[2] " startdate             = \"197301010000\""                            
[3] " enddate               = \"201301010000\"con testfor_a_while      = 0"
[4] " output_dir            = \"D:/Noah/\""                                
[5] " Latitude              = 32.321391739"                                
> seek(con,25,rw = "write",origin = "current")
[1] 0
> seek(con,25,rw = "write",origin = "current")
[1] 4174
> seek(con,25,rw = "write",origin = "current")
[1] 4199

Note the 4175 in the following run:

> con <- file("84.063346096_32.321391739.txt","r+")
> readLines(con,2)
[1] "&METADATA_NAMELIST" " startest2"        
> seek(con,25,rw = "write",origin = "current")
[1] 0
> seek(con,25,rw = "write",origin = "current")
[1] 4175
> seek(con,25,rw = "write",origin = "current")
[1] 4200
> close(con)

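The following result, by contrast, is easy to understand: each call moves 25 bytes from the current position. The initial position is 0 at the first call, and every call adds another 25, so the offsets accumulate.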

> con <- file("84.063346096_32.321391739.txt","r+")
> seek(con,25,rw = "write",origin = "current")
[1] 0
> seek(con,25,rw = "write",origin = "current")
[1] 25
> seek(con,25,rw = "write",origin = "current")
[1] 50
> seek(con,25,rw = "write",origin = "current")
[1] 75
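The cumulative pattern above is consistent with the documented behaviour that seek() returns the position before the move. A small sketch that reproduces it on a temporary file follows; the file contents are made up, and binary mode ("r+b") is an assumption chosen here to avoid newline translation, whereas the transcript above used "r+".

```r
# Demo on a temporary file; the file name and contents are made up.
tmp <- tempfile(fileext = ".txt")
writeLines(rep("0123456789", 20), tmp)

con <- file(tmp, "r+b")   # binary mode: an assumption, see the note above
# Each call moves the write position forward by 25 bytes and
# returns where the position was *before* that move.
returned <- vapply(1:3, function(i) {
  seek(con, 25, rw = "write", origin = "current")
}, numeric(1))
# where = NA only queries the position and does not move it.
final_pos <- seek(con, where = NA, rw = "write")
close(con)
```

Here `returned` collects the values of the three calls, and `final_pos` is the write position left behind by the last one.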

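Explanation: although seek() in R supports separate read and write pointers, it is not well supported on Windows, as the documentation warning below states. The odd results in the tests above may be caused by this.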

Warning

Use of seek on Windows is discouraged. We have found so many errors in the Windows implementation of file positioning that users are advised to use it only at their own risk, and asked not to waste the R developers' time with bug reports on Windows' deficiencies.

Final solution:

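Read the first 19 lines with readLines() and count their characters with nchar() (type = "bytes" is the safer choice if the file may contain non-ASCII characters). The total gives the position of line 20; the remaining steps are the same as in Solution 1.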

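Note that readLines() strips the line terminator from each line it returns. In this Windows-format file each line ends with "\r\n", which takes 2 bytes (a bare "\n" would take only 1), so 19*2 must be added to the count. The replacement text should also have the same length as the value it replaces, because the write overwrites bytes in place and does not shift the rest of the file.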
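The offset calculation described above can be sketched as follows. This is an illustration under stated assumptions rather than the original author's code: the helper name `overwrite_line()`, its `eol` argument, the space padding, and the use of binary mode with writeBin() (chosen to avoid newline translation) are all assumptions, and the demo runs on a made-up temporary file with "\r\n" line endings.

```r
# Hypothetical helper (not from the original post): overwrite one line in place.
# Assumes a known line terminator `eol` and a replacement that is
# no longer than the existing line.
overwrite_line <- function(path, line_no, new_text, eol = "\r\n") {
  head_lines <- readLines(path, n = line_no, warn = FALSE)
  stopifnot(length(head_lines) == line_no)

  before <- head_lines[seq_len(line_no - 1)]
  # Byte offset of the target line: bytes of the preceding lines plus
  # one terminator per preceding line (readLines() strips terminators).
  offset <- sum(nchar(before, type = "bytes")) +
    (line_no - 1) * nchar(eol, type = "bytes")

  old_len <- nchar(head_lines[line_no], type = "bytes")
  new_len <- nchar(new_text, type = "bytes")
  if (new_len > old_len) stop("replacement is longer than the existing line")
  # Pad with spaces so that the rest of the file does not shift.
  padded <- paste0(new_text, strrep(" ", old_len - new_len))

  con <- file(path, "r+b")           # binary mode: no newline translation
  on.exit(close(con))
  seek(con, where = offset, origin = "start", rw = "write")
  writeBin(charToRaw(padded), con)   # writes exactly old_len bytes
  invisible(offset)
}

# Demo: 30 lines of 18 bytes each, terminated by "\r\n".
tmp <- tempfile(fileext = ".txt")
out <- file(tmp, "wb")
writeLines(sprintf(" param%02d = %7.3f", 1:30, (1:30) * 1.5), out, sep = "\r\n")
close(out)
size_before <- file.size(tmp)
offset20 <- overwrite_line(tmp, 20, " param20 = 275.821")
result <- readLines(tmp)
```

Only the first 20 lines are read and only one line's worth of bytes is written, so the cost no longer grows with the size of the file. Seeking from the start of the file with an explicit offset also avoids relying on origin = "current", which gave the confusing results shown earlier.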

Summary:

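R is not particularly efficient at processing raw text; it is better suited to structured data. As is often stressed in data mining, cleaning and trimming the data is the most time-consuming part of the work (I do not know who said it first; it matches my own experience). When solving a problem like this one, the I/O overhead deserves careful thought. In the project as a whole, the same operation has to be applied to more than 20,000 meteorological forcing files, and reading and rewriting each of them in full would cost too much.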
