Offline log collection and processing scheme, part 2: data cleaning and calculation

Continuing from the previous article, this post uses Hive to clean the data and run the calculations.

I. Storing the raw log information

A Hive table can use a regular expression to store and read its rows, as follows:

CREATE EXTERNAL TABLE nginxlog (
  ip STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) (\\[[^\\]]*\\]) (\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\"[^ ]*\") (\"[^\\\"]*\")"
)
STORED AS TEXTFILE LOCATION '/test' ;

Our access.log data has the following format:

192.168.111.1 [29/Jul/2019:19:58:55 +0800] "GET /big.png?url=http://127.0.0.1/a.html&urlname=a.html&scr=1366x768&ce=1&cnv=0.6735760053703803&ref=http://127.0.0.1/b.html&stat_uv=67256303183188720208&stat_ss=6553789412_7_1564401535833 HTTP/1.0" 200 37700 "http://127.0.0.1/a.html" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

So the regular expression in the CREATE TABLE statement,

([^ ]*) (\\[[^\\]]*\\]) (\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\"[^ ]*\") (\"[^\\\"]*\")

must match the data above. First, it helps to understand what each group in this regular expression means:

  ([^ ]*)   // matches any run of characters other than a space; a ^ at the start of a bracket expression negates it, so the expression matches every character that is not in the set
  (-|[0-9]*)   // matches either - or a run of digits 0 to 9 (possibly empty, since * allows zero repetitions)
  (\"[^\\\"]*\")   // \" stands for a literal double quote; the \ in front of the quote only escapes the double quote inside the string literal (as in Java) and has nothing to do with the regular expression itself
  (\\[[^\\]]*\\])   // \\[ becomes \[ in the regular expression, i.e. a literal left square bracket
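The group-by-group reading above can be checked with a small Java sketch. It is only an illustration: the class name NginxLogRegexDemo, the helper parse, and the shortened sample line are assumptions made for this example, not part of the original setup. The pattern is the same one used in "input.regex", written as a Java string literal so the escaping rules described above apply.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NginxLogRegexDemo {
    // Same pattern as the Hive "input.regex", written as a Java string literal.
    // In the literal, \\[ is the regex \[ and \" is just a double quote.
    static final Pattern LOG_LINE = Pattern.compile(
        "([^ ]*) (\\[[^\\]]*\\]) (\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\"[^ ]*\") (\"[^\\\"]*\")");

    /** Returns the seven captured fields, or null if the whole line does not match. */
    static String[] parse(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // group(0) is the whole match, so start at 1
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened, made-up line in the same layout as the access.log sample.
        String line = "192.168.111.1 [29/Jul/2019:19:58:55 +0800] "
            + "\"GET /big.png?ce=1 HTTP/1.0\" 200 37700 "
            + "\"http://127.0.0.1/a.html\" \"Mozilla/5.0 (Windows NT 10.0; WOW64)\"";
        String[] names = {"ip", "time", "request", "status", "size", "referer", "agent"};
        String[] fields = parse(line);
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " = " + fields[i]);
        }
    }
}
```

Note that the brackets and double quotes sit inside the capturing groups, so the captured time, request, referer and agent values keep their surrounding [ ] or " " characters.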

In addition, before formally using this regular expression in Hive, it is best to verify it in a Java unit test:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.junit.Test;

@Test
public void testLog() {
    // The outer pair of parentheses makes group 1 the whole line; groups 2 to 8 are the fields
    String regex = "(([^ ]*) (\\[[^\\]]*\\]) (\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\"[^ ]*\") (\"[^\\\"]*\"))";
    Pattern pattern = Pattern.compile(regex);
    String data = "192.168.111.1 [21/Jul/2019:15:53:07 +0800] \"GET /favicon.ico HTTP/1.1\" 404 555 \"http://192.168.111.123/\" \"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36\"";
    Matcher matcher = pattern.matcher(data);

    // Note: matches() requires the entire string to match; find() also accepts a matching substring
    if (matcher.matches()) {
        for (int i = 0; i < matcher.groupCount(); i++) {
            System.out.println(matcher.group(i + 1));
        }
    } else {
        System.out.println("No match found.");
    }
}
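The difference between matches() and find() noted in the test can be seen in isolation with a minimal sketch. The class name MatchesVsFindDemo and its two helper methods are made up for this illustration; the pattern is a simplified version of the status group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    // Simplified version of the status group: a dash or one or more digits.
    static final Pattern STATUS = Pattern.compile("(-|[0-9]+)");

    /** True only if the entire input is matched by the pattern. */
    static boolean wholeStringMatches(String input) {
        return STATUS.matcher(input).matches();
    }

    /** Returns the first matching substring, or null if there is none. */
    static String firstSubstringMatch(String input) {
        Matcher m = STATUS.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeStringMatches("404"));      // prints true
        System.out.println(wholeStringMatches("404 555"));  // prints false
        System.out.println(firstSubstringMatch("404 555")); // prints 404
    }
}
```

For a pattern that is meant to describe a complete log line, matches() is the stricter check, which is why the unit test uses it: a line with unexpected extra content fails instead of being partially accepted.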


II. Processing the raw log information to generate intermediate data

III. Producing the final KPI statistics from the intermediate data

 


Origin www.cnblogs.com/hzhuxin/p/11266385.html