Analyzing web logs is a common practice project for learning Spark. This article covers some tips for matching web log lines with regular expressions.
1. Test example
A sample access log line found on the Internet:
218.19.140.242 - - [10/Dec/2010:09:31:17 +0800] "GET /query/trendxml/district/todayreturn/month/2009-12-14/2010-12-09/haizhu_tianhe.xml HTTP/1.1" 200 1933 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 (.NET CLR 3.5.30729)"
The following test code was run in spark-shell:
// The sample log line as a raw string (the fourth closing quote belongs to the line itself)
val list = """218.19.140.242 - - [10/Dec/2010:09:31:17 +0800] "GET /query/trendxml/district/todayreturn/month/2009-12-14/2010-12-09/haizhu_tianhe.xml HTTP/1.1" 200 1933 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 (.NET CLR 3.5.30729)""""
// One capture group per field of the log line
val logPattern = """^(\S+) (\S+) (\S+) \[([\w/]+)([\w:/]+)\s([+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+) "(\S+)" "(.*?)"$""".r
// true if the whole line matches the pattern, false otherwise
logPattern.findFirstIn(list) match {
  case Some(logPattern(_*)) => true
  case _ => false
}
Output:
Boolean = true
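If the result is not what you expect, modify or delete specific parts of the log line and of the pattern to find out which part of the regular expression is at fault. For example, keep only the fields up to the timestamp: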
val list = """218.19.140.242 - - [10/Dec/2010:09:31:17 +0800]"""
val logPattern = """^(\S+) (\S+) (\S+) \[([\w/]+)([\w:/]+)\s([+\-]\d{4})\]$""".r
logPattern.findFirstIn(list) match {
  case Some(logPattern(_*)) => true
  case _ => false
}
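The trial-and-error approach above can also be automated. The following is only a rough sketch: the names `fragments` and `firstFailingFragment` are hypothetical and not part of the original test code. The full pattern is split into one fragment per field, and the helper reports the first fragment at which the growing pattern stops matching a prefix of the line.

```scala
// The sample log line from the test example above.
val line = """218.19.140.242 - - [10/Dec/2010:09:31:17 +0800] "GET /query/trendxml/district/todayreturn/month/2009-12-14/2010-12-09/haizhu_tianhe.xml HTTP/1.1" 200 1933 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 (.NET CLR 3.5.30729)""""

// The article's pattern, cut into one fragment per field (illustrative split).
val fragments = List(
  """^(\S+)""",                                // client IP
  """ (\S+)""",                                // visitor identity
  """ (\S+)""",                                // authenticated user
  """ \[([\w/]+)([\w:/]+)\s([+\-]\d{4})\]""",  // timestamp
  """ "(\S+) (\S+) (\S+)"""",                  // request line
  """ (\d{3})""",                              // status code
  """ (\d+)""",                                // response size
  """ "(\S+)"""",                              // field (8)
  """ "(.*?)"$"""                              // browser information
)

// Hypothetical helper: index of the first fragment at which the concatenated
// pattern no longer matches a prefix of the input, or None if all fragments match.
def firstFailingFragment(input: String, parts: List[String]): Option[Int] =
  parts.indices.find { i =>
    val prefixPattern = parts.take(i + 1).mkString.r
    prefixPattern.findPrefixOf(input).isEmpty
  }

println(firstFailingFragment(line, fragments)) // prints None: every fragment matches
```

If a fragment is wrong, for instance the status code written as `(\d{4})`, the helper returns the index of that fragment, which points directly at the part of the pattern to fix.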
2. Web log format
The log line splits into the following fields:
(1) 218.19.140.242 // client IP
(2) - // identity of the visitor; "-" means the field is blank
(3) - // user name from HTTP authentication; "-" means the field is blank
(4) [10/Dec/2010:09:31:17 +0800] // time of the request; +0800 means the server is in the UTC+8 time zone
(5) "GET /query/trendxml/district/todayreturn/month/2009-12-14/2010-12-09/haizhu_tianhe.xml HTTP/1.1" // request method GET, the path of the requested resource, and the protocol HTTP/1.1
(6) 200 // status code
(7) 1933 // amount of data sent, in bytes
(8) "-" // referrer; "-" means none was sent
(9) "Mozilla/5.0 (Windows ......." // the client's browser information
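The field list above maps one-to-one onto the capture groups of the pattern, so the same pattern can also extract the fields instead of only testing for a match. This is a minimal sketch, not the article's own code: the case class `AccessLog`, its field names, and the function `parseLine` are assumptions made for illustration. Note that the pattern splits field (4) into three groups (date, time, time zone) and field (5) into three groups (method, path, protocol).

```scala
// Hypothetical container for the parsed fields; names are illustrative.
case class AccessLog(
  clientIp: String, identity: String, user: String,
  date: String, time: String, timeZone: String,
  method: String, path: String, protocol: String,
  status: Int, bytes: Long, referer: String, userAgent: String)

val logPattern = """^(\S+) (\S+) (\S+) \[([\w/]+)([\w:/]+)\s([+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+) "(\S+)" "(.*?)"$""".r

// Returns Some(AccessLog) when the whole line matches, None otherwise.
def parseLine(line: String): Option[AccessLog] = line match {
  case logPattern(ip, ident, user, date, time, tz, method, path, proto, status, bytes, referer, agent) =>
    // The time group starts with ':' because [\w/]+ stops before the first colon.
    Some(AccessLog(ip, ident, user, date, time.stripPrefix(":"), tz,
      method, path, proto, status.toInt, bytes.toLong, referer, agent))
  case _ => None
}

val sampleLine = """218.19.140.242 - - [10/Dec/2010:09:31:17 +0800] "GET /query/trendxml/district/todayreturn/month/2009-12-14/2010-12-09/haizhu_tianhe.xml HTTP/1.1" 200 1933 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 (.NET CLR 3.5.30729)""""

val parsed = parseLine(sampleLine)
println(parsed.map(_.status)) // prints Some(200)
```

Returning an `Option` keeps malformed lines from throwing, which is convenient when the function is later mapped over a large collection of log lines.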
3. Scala regular expression matching
^ matches the beginning of the input
$ matches the end of the input
\S+ matches one or more non-whitespace characters
\[([\w/]+)([\w:/]+)\s([+\-]\d{4})\] matches the timestamp
\d{3} matches exactly three digits
\d+ matches one or more digits
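To see how the timestamp pattern divides its input, it can be applied on its own to field (4) of the sample line. The sketch below is illustrative only: the names `timePattern`, `groups`, and `formatter` are assumptions, and converting the result with `java.time` is an addition that the original test code does not contain.

```scala
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

val timePattern = """\[([\w/]+)([\w:/]+)\s([+\-]\d{4})\]""".r
val stamp = "[10/Dec/2010:09:31:17 +0800]"

// Group 1 stops at the first ':' because ':' is not in [\w/];
// group 2 takes the rest up to the space, including that leading ':'.
val groups: Option[(String, String, String)] = stamp match {
  case timePattern(date, time, zone) => Some((date, time, zone))
  case _ => None
}

// Assumed format string for this timestamp layout; Locale.ENGLISH is needed for "Dec".
val formatter = DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)

// Re-join the three groups and parse them into a date-time with an offset.
val parsedTime: Option[OffsetDateTime] = groups.map { case (date, time, zone) =>
  OffsetDateTime.parse(s"$date$time $zone", formatter)
}

println(groups) // prints Some((10/Dec/2010,:09:31:17,+0800))
```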
Lazy (non-greedy) matching with (.*?):
1. . matches any character except the newline "\n";
2. * matches the preceding character zero or more times;
3. A ? placed after * or + makes the match non-greedy, that is, it matches as little as possible; for example, *? repeats any number of times but as few times as possible;
4. .*? matches any number of arbitrary characters, but uses the fewest repetitions that still let the whole pattern match.
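The difference between greedy and lazy matching is easiest to see on the last two quoted fields of a log line. The following is a small sketch under stated assumptions: the shortened input `tail` and the names `greedy`, `lazyQuoted`, and `lazyAnchored` are made up for illustration.

```scala
// A shortened stand-in for the two quoted fields at the end of a log line.
val tail = "\"-\" \"Mozilla/5.0\""

val greedy       = "\"(.*)\"".r    // .* takes as much as possible
val lazyQuoted   = "\"(.*?)\"".r   // .*? takes as little as possible
val lazyAnchored = "\"(.*?)\"$".r  // lazy, but the match must end at the end of input

// Greedy: runs from the first quote to the last quote of the input.
val greedyGroup = greedy.findFirstMatchIn(tail).map(_.group(1))

// Lazy: stops at the nearest closing quote, so each field is found separately.
val lazyGroups = lazyQuoted.findAllMatchIn(tail).map(_.group(1)).toList

// Lazy with $: the group grows until the closing quote is the last character,
// i.e. the fewest repetitions that still let the whole pattern match.
val anchoredGroup = lazyAnchored.findFirstMatchIn(tail).map(_.group(1))

println(lazyGroups) // prints List(-, Mozilla/5.0)
```

Here `greedyGroup` and `anchoredGroup` both contain `-" "Mozilla/5.0`, which illustrates point 4: laziness only chooses the shortest repetition among those that allow the overall match to succeed.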