The use of regular expressions in Spark

In recent projects, while writing Spark jobs for data cleaning, I used regular expressions several times. This post summarizes those cases, which fall into two main directions: Spark SQL built-in functions and user-defined functions (UDFs).

Still incomplete; more to be added:

1. Match:

import org.apache.spark.sql.functions.udf

// Returns the input age unchanged if it matches the year pattern
// (i.e. the findAllMatchIn iterator is non-empty); otherwise returns the default year.
val calAge = udf((age: String, defyear: String) => {
  val reg = "^(19[5-9][0-9]|20[0-1][0-9]|2020)$".r
  if (reg.findAllMatchIn(age).hasNext) age
  else defyear
})
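
A minimal usage sketch (the DataFrame df, its column birthYear, and the default value "1970" are assumptions for illustration, not from the original post):

import org.apache.spark.sql.functions.{col, lit}

// Hypothetical usage: keep values that look like a valid year, fall back to "1970" otherwise.
df.withColumn("birthYear2", calAge(col("birthYear"), lit("1970"))).show()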

2. Split:

// Convert an ISO-8601 timestamp such as 2012-10-30T05:09:45.592Z
// into "yyyy-MM-dd HH:mm:ss" form by splitting on 'T' and '.'.
val ymd = udf((str: String) => {
  if (str == null || str.equals("None")) {
    null
  } else {
    val tms = str.split("[T.]")
    tms(0) + " " + tms(1)
  }
})
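
Again a hypothetical usage sketch (df and the column name joinedAt are assumptions, borrowed from the examples below):

// "2012-10-30T05:09:45.592Z" -> "2012-10-30 05:09:45"
df.withColumn("joinedAtYmd", ymd(col("joinedAt"))).show()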

3. Replace:

// Hive-style SQL functions regexp_replace() and unix_timestamp(), on values like 2012-10-30T05:09:45.592Z:
// strip the trailing "Z", replace "T" with a space, then parse the result to epoch seconds.
df.withColumn("joinedAt2", unix_timestamp(regexp_replace(regexp_replace($"joinedAt", "Z", ""), "T", " "))).show()

4. Extract:

// Extract the timestamp from a value like 2012-10-02 15:53:05.754000+00:00, method 1:
// bind the capture group via regex pattern matching (throws MatchError if the input does not match).
val invitedTime = udf((time: String) => {
  val reg = "(.*)\\..*".r
  val reg(a) = time
  a
})
import scala.util.matching.Regex

// Extract the timestamp from a value like 2012-10-02 15:53:05.754000+00:00, method 2:
// collect every match of an explicit date-time pattern.
val invitedTime2 = udf((time: String) => {
  val r: Regex = "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})".r
  val matches: Iterator[Regex.Match] = r.findAllMatchIn(time)
  matches.mkString
})
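
A hypothetical usage sketch for both UDFs (df and the column name invitedAt are assumptions):

df.withColumn("invited1", invitedTime(col("invitedAt")))
  .withColumn("invited2", invitedTime2(col("invitedAt")))
  .show(false)
// Both produce "2012-10-02 15:53:05" for "2012-10-02 15:53:05.754000+00:00".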
// Hive-style SQL function regexp_extract(): pull each capture group from values like 2012-10-30T05:09:45.592Z.
// The "\\." matches the literal dot before the three millisecond digits.
df.withColumn("joinedAt3",
              concat_ws(" ",
                        regexp_extract($"joinedAt", "(.*)T(.*)\\.[0-9]{3}Z", 1),
                        regexp_extract($"joinedAt", "(.*)T(.*)\\.[0-9]{3}Z", 2)))
  .show(false)
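
For the sample value 2012-10-30T05:09:45.592Z, group 1 captures the date and group 2 the time, so joinedAt3 becomes "2012-10-30 05:09:45".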

For detailed usage of regular expressions in Scala, see the article "Sixteen: regular expressions in Scala".

Origin: blog.csdn.net/xiaoxaoyu/article/details/115220353