Continuing from the previous article, this post uses Hive to clean the data and compute the statistics.
I. Storing the raw log information
A Hive table can store and read data by means of a regular expression, as follows:
CREATE EXTERNAL TABLE nginxlog (
  ip STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) (\\[[^\\]]*\\]) (\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\"[^ ]*\") (\"[^\\\"]*\")"
)
STORED AS TEXTFILE
LOCATION '/test';
Our access.log data has the following format:
192.168.111.1 [29/Jul/2019:19:58:55 +0800] "GET /big.png?url=http://127.0.0.1/a.html&urlname=a.html&scr=1366x768&ce=1&cnv=0.6735760053703803&ref=http://127.0.0.1/b.html&stat_uv=67256303183188720208&stat_ss=6553789412_7_1564401535833 HTTP/1.0" 200 37700 "http://127.0.0.1/a.html" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
So the regular expression in the table-creation statement,
([^ ]*) (\\[[^\\]]*\\]) (\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\"[^ ]*\") (\"[^\\\"]*\")
must match the data above.
First of all, we need to understand what each group in this regular expression means:
([^ ]*) // matches any run of characters other than a space; a ^ at the start of a bracket expression negates the set, so the characters listed in it are not accepted
(-|[0-9]*) // matches either a - or zero or more digits 0 to 9
(\"[^\\\"]*\") // \" represents an actual double-quote character; the \ before the quotation mark is there only to escape the double quote in the string literal (as in Java) and has nothing to do with the regular expression itself
(\\[[^\\]]*\\]) // \\[ becomes \[ in the regular expression, i.e. a literal left square bracket
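The two layers of escaping described above can be made visible with a small Java sketch. It prints the text that the regular-expression engine actually receives once the string-literal escaping has been removed, and checks a few sample values against individual groups. The class name EscapingDemo and its constant names are hypothetical and chosen only for illustration; the sample values are taken from the log line shown earlier.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class EscapingDemo {
    // Java string literals for three of the groups, escaped as in the table definition.
    static final String BRACKETED = "(\\[[^\\]]*\\])";
    static final String QUOTED = "(\"[^\\\"]*\")";
    static final String DASH_OR_DIGITS = "(-|[0-9]*)";

    public static void main(String[] args) {
        // Printing a literal shows the pattern text after one level of escaping is removed.
        System.out.println(BRACKETED); // (\[[^\]]*\])
        System.out.println(QUOTED);    // ("[^\"]*")

        // Pattern.matches requires the whole input to match the pattern.
        System.out.println(Pattern.matches(BRACKETED, "[29/Jul/2019:19:58:55 +0800]")); // true
        System.out.println(Pattern.matches(QUOTED, "\"http://127.0.0.1/a.html\""));     // true
        System.out.println(Pattern.matches(DASH_OR_DIGITS, "-"));     // true
        System.out.println(Pattern.matches(DASH_OR_DIGITS, "37700")); // true
        System.out.println(Pattern.matches(DASH_OR_DIGITS, ""));      // true: * allows zero digits
    }
}
```

Note that (-|[0-9]*) also accepts an empty string, because * permits zero repetitions; [0-9]+ would be the stricter form if at least one digit should be required.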
In addition, before this regular expression is formally used in Hive, it is best to verify it in a Java unit test:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.junit.Test;

@Test
public void testLog() {
    // Same regular expression as in the table definition
    String regex = "([^ ]*) (\\[[^\\]]*\\]) (\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\"[^ ]*\") (\"[^\\\"]*\")";
    Pattern pattern = Pattern.compile(regex);
    String data = "192.168.111.1 [21/Jul/2019:15:53:07 +0800] \"GET /favicon.ico HTTP/1.1\" 404 555 \"http://192.168.111.123/\" \"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36\"";
    Matcher matcher = pattern.matcher(data);
    // Note: matches() requires the entire string to match, whereas find() also succeeds on a matching substring
    if (matcher.matches()) {
        // Capture groups are numbered from 1; group(0) is the whole match
        for (int i = 0; i < matcher.groupCount(); i++) {
            System.out.println(matcher.group(i + 1));
        }
    } else {
        System.out.println("No match found.");
    }
}
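The difference between matches() and find() noted in the test can be seen in a minimal sketch. It applies only the bracketed-timestamp group to a shortened log line: matches() fails because the pattern does not cover the whole line, while find() locates the timestamp inside it. The class name MatchesVsFindDemo and its helper methods are hypothetical, and the shortened line is an assumption made for brevity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class MatchesVsFindDemo {
    // Only the bracketed-timestamp group of the full pattern.
    static final Pattern TIMESTAMP = Pattern.compile("(\\[[^\\]]*\\])");

    // Shortened log line, assumed here for illustration.
    static final String LINE =
        "192.168.111.1 [21/Jul/2019:15:53:07 +0800] \"GET /favicon.ico HTTP/1.1\" 404 555";

    // True only if the pattern covers the input from start to end.
    static boolean matchesWholeInput(String input) {
        return TIMESTAMP.matcher(input).matches();
    }

    // Returns the first substring matched by the pattern, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = TIMESTAMP.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(matchesWholeInput(LINE)); // false: the line has text around the timestamp
        System.out.println(firstMatch(LINE));        // [21/Jul/2019:15:53:07 +0800]
        System.out.println(matchesWholeInput("[21/Jul/2019:15:53:07 +0800]")); // true
    }
}
```

Because the pattern is meant to describe a complete log line, matches() is the stricter and more suitable check when validating it.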
II. Processing the raw log information -> generating the intermediate data
III. Producing the final KPI statistics from the intermediate data