Spark: matching web logs with regular expressions

 

  Analyzing web logs is a common practice project for learning Spark. This article collects some tips on matching web log lines with regular expressions.

  1. Test example

  A sample access log line found on the Internet:

218.19.140.242 - - [10/Dec/2010:09:31:17 +0800] "GET /query/trendxml/district/todayreturn/month/2009-12-14/2010-12-09/haizhu_tianhe.xml HTTP/1.1" 200 1933 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 (.NET CLR 3.5.30729)"

  The following test code is run in spark-shell:

val list = """218.19.140.242 - - [10/Dec/2010:09:31:17 +0800] "GET /query/trendxml/district/todayreturn/month/2009-12-14/2010-12-09/haizhu_tianhe.xml HTTP/1.1" 200 1933 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 (.NET CLR 3.5.30729)""""
val logPattern = """^(\S+) (\S+) (\S+) \[([\w/]+)([\w:/]+)\s([+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+) "(\S+)" "(.*?)"$""".r
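// findFirstIn returns Some(matched text) when the line matches; the inner
// pattern logPattern(_*) then accepts any number of capture groups.
logPattern.findFirstIn(list) match {
  case Some(logPattern(_*)) => true
  case _ => false
}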

  Output:

Boolean = true

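  If the result is false, modify the pattern or delete specific parts of it to find out which part of the regular expression is causing the problem. For example, keep only the fields up to the timestamp: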

val list = """218.19.140.242 - - [10/Dec/2010:09:31:17 +0800]"""
val logPattern = """^(\S+) (\S+) (\S+) \[([\w/]+)([\w:/]+)\s([+\-]\d{4})\]$""".r
logPattern.findFirstIn(list) match {
  case Some(logPattern(_*)) => true
  case _ => false
}
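
  The trial-and-error step above can also be automated. The following is only a sketch: the helper name firstFailingPiece and the way the pattern is cut into pieces are my own illustrative choices, not part of the original test. It joins the pieces one at a time and reports the index of the first piece that makes the match fail.

```scala
// Hypothetical helper (not from the original post): returns the index of the
// first pattern piece that breaks the match, or None if all pieces match.
def firstFailingPiece(pieces: Seq[String], line: String): Option[Int] =
  (1 to pieces.length).find { n =>
    // Anchor only at the start, so a partial pattern may match a prefix of the line.
    val prefix = ("^" + pieces.take(n).mkString).r
    prefix.findFirstIn(line).isEmpty
  }.map(_ - 1)

val line = """218.19.140.242 - - [10/Dec/2010:09:31:17 +0800]"""

// The pattern from the shortened test above, cut into two pieces.
val good = Seq(
  """(\S+) (\S+) (\S+) """,                  // ip, identity, user
  """\[([\w/]+)([\w:/]+)\s([+\-]\d{4})\]"""  // timestamp
)

// Deliberately broken variant: the time zone group no longer accepts the sign.
val broken = Seq(
  """(\S+) (\S+) (\S+) """,
  """\[([\w/]+)([\w:/]+)\s(\d{4})\]"""
)

val goodResult = firstFailingPiece(good, line)     // None: every piece matches
val brokenResult = firstFailingPiece(broken, line) // Some(1): the timestamp piece fails
```

  The pieces must be cut at points where the partial pattern is still a valid regular expression, for example at the spaces between fields.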

 

  2. Web log format

 

  The log line splits into the following fields:

(1) 218.19.140.242 // client IP

(2) - // identity of the visitor; - means blank

(3) - // user name from HTTP authentication; - means none

(4) [10/Dec/2010:09:31:17 +0800] // time of the request; +0800 means the server is in the UTC+8 time zone

(5) "GET /query/trendxml/district/todayreturn/month/2009-12-14/2010-12-09/haizhu_tianhe.xml HTTP/1.1" // request method GET, the requested resource path, and the protocol HTTP/1.1

(6) 200 // status code

(7) 1933 // amount of data sent, in bytes

(8) "-" // referer (the page the request came from); - means none

(9) "Mozilla/5.0 (Windows ......." // the client's browser information (User-Agent)
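
  To make the field list concrete, here is a minimal sketch that uses the logPattern from section 1 as an extractor and binds each of its 13 capture groups to a name. The case class AccessLog, its field names, and the parse function are assumptions made for illustration; they do not appear in the original test code.

```scala
// Hypothetical container for the fields listed above; all names are illustrative.
case class AccessLog(
  ip: String, identity: String, user: String,
  date: String, time: String, zone: String,
  method: String, path: String, protocol: String,
  status: Int, bytes: Long, referer: String, agent: String)

// Same pattern as in section 1.
val logPattern = """^(\S+) (\S+) (\S+) \[([\w/]+)([\w:/]+)\s([+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+) "(\S+)" "(.*?)"$""".r

// Returns Some(AccessLog) when the whole line matches, None otherwise.
def parse(line: String): Option[AccessLog] = line match {
  case logPattern(ip, id, user, date, time, zone, method, path, proto, status, bytes, ref, agent) =>
    // The second timestamp group keeps its leading ':', so strip it here.
    Some(AccessLog(ip, id, user, date, time.stripPrefix(":"), zone,
      method, path, proto, status.toInt, bytes.toLong, ref, agent))
  case _ => None
}

val sample = """218.19.140.242 - - [10/Dec/2010:09:31:17 +0800] "GET /query/trendxml/district/todayreturn/month/2009-12-14/2010-12-09/haizhu_tianhe.xml HTTP/1.1" 200 1933 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 (.NET CLR 3.5.30729)""""

val parsed = parse(sample)
```

  Returning an Option means lines that do not match can simply be dropped later, for example with flatMap over a collection of lines.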

  3. Scala regular expression matching

^ matches the beginning of the line

$ matches the end of the line

\S+ matches one or more non-whitespace characters

\[([\w/]+)([\w:/]+)\s([+\-]\d{4})\] matches the time

\d{3} matches exactly three digits

\d+ matches one or more digits
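
  As a small illustration of the time pattern, the sketch below applies it to the timestamp from the sample line and shows what each group captures. The java.time conversion at the end is my own addition for illustration, not part of the original test; the value names are arbitrary.

```scala
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

val timePattern = """\[([\w/]+)([\w:/]+)\s([+\-]\d{4})\]""".r
val stamp = "[10/Dec/2010:09:31:17 +0800]"

// [\w/]+ stops at the first ':', so the second group starts with that colon.
val (date, time, zone) = stamp match {
  case timePattern(d, t, z) => (d, t, z)
  case _ => sys.error("timestamp did not match")
}
// date == "10/Dec/2010", time == ":09:31:17", zone == "+0800"

// Assumption: English month abbreviations, as in the sample line.
val formatter = DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)
val parsedTime = OffsetDateTime.parse(s"$date$time $zone", formatter)
```

  Because the three groups are simply adjacent parts of the bracketed text, joining them back with a space before the zone gives a string that the formatter can parse directly.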

Lazy (non-greedy) matching with (.*?)

1. . matches any character except the newline "\n";
2. * means the preceding character is matched zero or more times;
3. A ? following * or + makes the match non-greedy, that is, it matches as little as possible; for example, *? repeats any number of times, but as few times as possible;
4. .*? matches any number of characters, but uses the fewest repetitions that still let the whole pattern match.
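
  The difference between greedy and lazy matching is easiest to see side by side. The following is a minimal sketch with a made-up input string (two quoted words), chosen only for illustration:

```scala
// Made-up input for illustration: two quoted words on one line.
val text = "\"first\" \"second\""

val greedy   = "\"(.*)\"".r    // .*  takes as much as possible
val lazyRe   = "\"(.*?)\"".r   // .*? takes as little as possible
val lazyToEnd = "\"(.*?)\"$".r // lazy, but the pattern must reach the end of the line

val greedyHit    = greedy.findFirstMatchIn(text).map(_.group(1))    // Some(first" "second)
val lazyHit      = lazyRe.findFirstMatchIn(text).map(_.group(1))    // Some(first)
val lazyToEndHit = lazyToEnd.findFirstMatchIn(text).map(_.group(1)) // Some(first" "second)
```

  The last case corresponds to point 4: with the trailing "$, as at the end of the log pattern, the lazy group still has to grow until the rest of the pattern can match.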

 


 


Origin www.cnblogs.com/tiaoweiliao/p/11419054.html