R: the grep() function and regular-expression metacharacters

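First, an introduction to the grep() and sub() functions.
The grep() function
grep(pattern, x, ignore.case = FALSE, fixed = FALSE, value = FALSE, perl = FALSE)
Searches the vector x for elements that contain the string specified by the pattern argument and returns their indices in x.
With value = TRUE it returns the matching elements themselves instead of their indices.
If fixed = FALSE, pattern is treated as a regular expression; if fixed = TRUE, pattern is treated as a literal text string.
grep("A", c("b", "A", "c"), fixed = TRUE) # returns 2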
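The difference between indices and values, and between fixed = TRUE and fixed = FALSE, is easier to see side by side. The following is a minimal sketch; the vector x and the variable names are hypothetical and chosen only for illustration.

```r
# Hypothetical sample vector, used only for illustration.
x <- c("b", "A", "c", "a.b")

idx    <- grep("A", x, fixed = TRUE)                # indices of matching elements
vals   <- grep("A", x, fixed = TRUE, value = TRUE)  # the matching elements themselves
nocase <- grep("a", x, ignore.case = TRUE)          # matches both "A" and "a.b"

# With fixed = TRUE the dot is a literal dot; as a regular expression
# it matches any character, so every non-empty element matches.
dot_literal <- grep(".", x, fixed = TRUE, value = TRUE)
dot_regex   <- grep(".", x, value = TRUE)

print(idx)          # 2
print(vals)         # "A"
print(nocase)       # 2 4
print(dot_literal)  # "a.b"
print(dot_regex)    # "b" "A" "c" "a.b"
```

Using fixed = TRUE is a convenient way to search for text that contains characters such as "." or "*" without escaping them.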

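The sub() function
sub(pattern, replacement, x, ignore.case = FALSE, fixed = FALSE)
Searches x for pattern and replaces it with replacement. If fixed = FALSE, pattern is treated as a regular expression; if fixed = TRUE, pattern is treated as a literal text string.
Note that sub() replaces only the first match in each element.
Example
Note that "\\s" is a regular expression that matches whitespace. It is written "\\s" rather than "\s" because "\" is the escape character in R strings, so the backslash itself has to be escaped.
sub("\\s", ".", "Hello There") returns "Hello.There"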
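To make the "first match only" behaviour concrete, here is a small sketch that compares sub() with gsub(). gsub() is not discussed in the text above; it is the base R companion function that replaces every match. The sample string is made up for illustration.

```r
# Hypothetical sample string with two spaces.
s <- "Hello There World"

first_only <- sub("\\s", ".", s)    # only the first whitespace is replaced
all_hits   <- gsub("\\s", ".", s)   # gsub() replaces every match

# fixed = TRUE treats "." as a literal dot instead of "any character".
literal <- sub(".", "!", "a.b", fixed = TRUE)
regex   <- sub(".", "!", "a.b")

print(first_only)  # "Hello.There World"
print(all_hits)    # "Hello.There.World"
print(literal)     # "a!b"
print(regex)       # "!.b"
```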

Some regular-expression metacharacters in R

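(1) "^" matches the start of a string. For example, sub("^a", "", c("abcd", "dcba")) replaces a leading a with the empty string and returns "bcd" "dcba"; "dcba" is unchanged because it does not start with a. To replace a longer leading string, simply write it after the caret, e.g. "^ab".
(2) "$" matches the end of a string. For example, sub("a$", "", c("abcd", "dcba")) replaces a trailing a with "" and returns "abcd" "dcb".
(3) "." stands for any single character (exactly one character, not several; with perl = TRUE it does not match a newline). For example, sub("a.c", "", c("aabcd", "dcaaaba")) returns "ad" "dcaaaba".
(4) "*" matches the preceding character zero or more times. For example, sub("a*b", "", c("aabcd", "dcaaaba")) returns "cd" "dca".
(5) "?" matches the preceding item zero or one time, so even where several repeats are available at most one is consumed. The example sub("?a", "", c("aabcd", "dcaaaba")) returns "abcd" "dcaaba", but here the "?" has nothing before it to quantify and the call simply removes the first a. A clearer form places the quantifier after an item, e.g. sub("aa?", "", c("aabcd", "dcaaaba")), which returns "bcd" "dcaba".
(6) "+" matches the preceding character one or more times. For example, sub("a+", "", c("cabcd", "dcaaaba")) returns "cbcd" "dcba". The match is a single a or a run of consecutive a's: in "cabcd" it is one a, in "dcaaaba" it is aaa.
(7) ".*" matches any sequence of characters, including the empty one. For example, sub("a.*e", "", c("abcde", "edcba")) returns "" "edcba": in "abcde" the pattern matches from the a through the e, which is the whole string; in "edcba" no e follows the a, so nothing matches and the string is unchanged.
(8) "|" is a logical OR. For example, sub("ab|ba", "", c("abcd", "dcba")) returns "cd" "dc".
(9) "^" also denotes the complement of a set when it is written as the first character inside "[]". For example, sub("[^ab]", "", c("abcd", "dcba")) returns "abd" "cba": "[^ab]" matches the first character that is neither a nor b, which is c in "abcd" and d in "dcba".
(10) "[]" matches any one character from a set: list the characters with no separator and the pattern matches whichever of them appears first. In sub("[ab]", "", c("abcd", "dcba")), "[ab]" has the same effect as "a|b", and the call returns "bcd" "dca".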

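(11) The form "[-]" matches a range. For example, sub("[a-c]", "", c("abcde", "edcba")) matches a character from a to c and returns "bcde" "edba"; sub("[1-9]", "", c("ab001", "001ab")) matches a digit from 1 to 9 (so not 0) and returns "ab00" "00ab".
(12) Finally, a word on "greedy" versus "lazy" matching. By default a quantifier matches as many characters as possible, which is greedy matching. For example, sub("a.*b", "", c("aabab", "eabbe")) matches the longest substring that starts with a and ends with b: the whole of "aabab", and "abb" in "eabbe", so it returns "" "ee". For lazy matching, which takes as few characters as possible, append "?" to the quantifier. For example, sub("a.*?b", "", c("aabab", "eabbe")) starts at the first a it finds and stops at the nearest following b: "aab" in "aabab" and "ab" in "eabbe", so it returns "ab" "ebe".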
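The rules above can be checked directly in an R session. The following is a minimal sketch that reuses the example vectors from the list; the variable names are hypothetical, and the "aa?" pattern is an illustrative addition rather than part of the original examples.

```r
x <- c("abcd", "dcba")

anchored_start <- sub("^a", "", x)      # remove a leading a
anchored_end   <- sub("a$", "", x)      # remove a trailing a
alternation    <- sub("ab|ba", "", x)   # remove the first "ab" or "ba"
negated_set    <- sub("[^ab]", "", x)   # remove the first char that is not a or b

# "?" after an item: an a followed by an optional second a.
optional <- sub("aa?", "", c("aabcd", "dcaaaba"))

# Greedy versus lazy quantifiers on the same input.
z      <- c("aabab", "eabbe")
greedy <- sub("a.*b", "", z)    # longest match from the first a
lazy   <- sub("a.*?b", "", z)   # stops at the nearest b after the first a

print(anchored_start)  # "bcd"  "dcba"
print(anchored_end)    # "abcd" "dcb"
print(alternation)     # "cd"   "dc"
print(negated_set)     # "abd"  "cba"
print(optional)        # "bcd"  "dcaba"
print(greedy)          # ""     "ee"
print(lazy)            # "ab"   "ebe"
```

Note that the lazy match is not the shortest possible substring overall: the search still begins at the leftmost a, which is why "aabab" loses "aab" rather than "ab".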

Examples

> num <- c(310,456,311,431,421,435,312,313,320,321,322,323,314,324,317,3231)
> # numbers that start with 3
> grep("^3",num)
 [1]  1  3  7  8  9 10 11 12 13 14 15 16
> grep("^3",num,value=T)
 [1] "310"  "311"  "312"  "313"  "320"  "321"  "322"  "323"  "314"  "324"  "317"  "3231"
> # numbers that start with 31
> grep("^31",num,value=T)
[1] "310" "311" "312" "313" "314" "317"
> # numbers that end with 4
> grep("4$",num,value=T)
[1] "314" "324"
> # numbers containing a 3, any one character, then a 2 (the "." stands for any single character)
> grep("3.2",num,value=T)
[1] "312" "322"
> # all numbers containing 31
> grep("31",num,value=T)
[1] "310"  "311"  "431"  "312"  "313"  "314"  "317"  "3231"
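> y <- c(310,456,311,431,421,435,312,313,320,321,322,323,314,324,317,3231,451,314,231,522,2312,2132,2112)
> # "3*1" means zero or more 3s followed by a 1, so it matches every number that contains a 1
> # (it does not select "starts with 3 or ends with 1")
> grep("3*1",y,value=T)
 [1] "310"  "311"  "431"  "421"  "312"  "313"  "321"  "314"  "317"  "3231" "451"  "314" 
[13] "231"  "2312" "2132" "2112"
> # all values containing 31 (the leading "?" has nothing to quantify; in this session it behaved like plain "31")
> grep("?31",y,value=T)
 [1] "310"  "311"  "431"  "312"  "313"  "314"  "317"  "3231" "314"  "231"  "2312"
> # values in which a 3 is followed, anywhere later, by a 1
> grep("3.*1",y,value=T)
 [1] "310"  "311"  "431"  "312"  "313"  "321"  "314"  "317"  "3231" "314"  "231"  "2312"
> # values containing a 1, then their indices
> grep("1",y,value=T)
 [1] "310"  "311"  "431"  "421"  "312"  "313"  "321"  "314"  "317"  "3231" "451"  "314" 
[13] "231"  "2312" "2132" "2112"
> grep("1",y)
 [1]  1  3  4  5  7  8 10 13 15 16 17 18 19 21 22 23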
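For the task "all numbers that start with 3 or end with 1", one possible pattern combines the two anchors with "|". This is a sketch using the same vector y as in the session above; the pattern "^3|1$" and the variable names are illustrative additions, not part of the original session.

```r
y <- c(310,456,311,431,421,435,312,313,320,321,322,323,314,324,317,3231,
       451,314,231,522,2312,2132,2112)

# "^3" matches a leading 3, "1$" matches a trailing 1; "|" accepts either.
starts3_or_ends1 <- grep("^3|1$", y)
starts3_or_ends1_values <- grep("^3|1$", y, value = TRUE)

# For comparison: "3*1" selects exactly the same elements as the plain pattern "1".
same_as_plain_1 <- identical(grep("3*1", y), grep("1", y))

print(starts3_or_ends1)         # 1 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 19
print(starts3_or_ends1_values)
print(same_as_plain_1)          # TRUE
```

grep() converts the numeric vector to character before matching, which is why the anchors behave as they would on strings such as "310".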

Reposted from blog.csdn.net/weixin_42712867/article/details/95580699