Classroom Test - data cleansing

Description of the result file data:

ip: 106.39.41.166 (to be mapped to a city)

date: 10/Nov/2016:00:01:02 +0800 (date)

day: 10 (day of the month)

traffic: 54 (traffic)

type: video (type: video or article)

id: 8701 (id of the video or article)
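To make the field layout concrete, here is a minimal sketch that reads one result line into typed fields. It assumes the six fields appear comma-separated on one line in the order listed above; the class name `ResultLine` and the sample line are illustrative and do not come from the original data set.

```java
// Hypothetical holder for one line of the result file.
// Assumption: fields are comma-separated in the order ip, date, day, traffic, type, id.
final class ResultLine {
    final String ip;
    final String date;
    final int day;
    final long traffic;
    final String type;
    final String id;

    private ResultLine(String ip, String date, int day, long traffic, String type, String id) {
        this.ip = ip;
        this.date = date;
        this.day = day;
        this.traffic = traffic;
        this.type = type;
        this.id = id;
    }

    // Splits the line on commas and converts day and traffic to numbers.
    static ResultLine parse(String line) {
        String[] f = line.split(",", -1);
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 fields, got " + f.length + ": " + line);
        }
        return new ResultLine(
                f[0].trim(),
                f[1].trim(),
                Integer.parseInt(f[2].trim()),
                Long.parseLong(f[3].trim()),
                f[4].trim(),
                f[5].trim());
    }

    public static void main(String[] args) {
        ResultLine r = parse("106.39.41.166,10/Nov/2016:00:01:02 +0800,10,54,video,8701");
        System.out.println(r.ip + " | " + r.date + " | " + r.day + " | "
                + r.traffic + " | " + r.type + " | " + r.id);
    }
}
```

Rejecting lines that do not have exactly six fields keeps malformed records out of the later stages instead of silently shifting columns.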

Testing requirements:

1. Data cleaning: clean the data as required and import the cleaned data into the Hive data warehouse.

Two-stage data cleaning:

(1) First stage: extract the required information from the original log

ip:    199.30.25.88

time:  10/Nov/2016:00:01:03 +0800

traffic:  62

article: article/11325

video: video/3235

(2) Second stage: refine the information extracted in the first stage

ip ---> city: city(IP)

date--> time:2016-11-10 00:01:03

day: 10

traffic:62

type:article/video

id:11325
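The time, day, and type/id conversions listed above can be sketched with `java.time` as follows. This is one possible approach, not the code used in this post: the class name `SecondStage` and its method names are made up for illustration, and the ip-to-city step is left out because the post does not name a lookup source.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper for the second-stage conversions.
final class SecondStage {
    // Matches first-stage times such as "10/Nov/2016:00:01:03 +0800".
    private static final DateTimeFormatter IN =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
    // Produces times such as "2016-11-10 00:01:03".
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Reformats the time, keeping the wall-clock value exactly as logged.
    static String formatTime(String raw) {
        return OffsetDateTime.parse(raw.trim(), IN).toLocalDateTime().format(OUT);
    }

    // Extracts the day of the month from the same first-stage time string.
    static int dayOfMonth(String raw) {
        return OffsetDateTime.parse(raw.trim(), IN).getDayOfMonth();
    }

    // Splits a token such as "article/11325" into { "article", "11325" }.
    static String[] typeAndId(String token) {
        String[] parts = token.trim().split("/", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected type/id, got: " + token);
        }
        String type = parts[0].trim().toLowerCase(Locale.ROOT);
        if (!type.equals("article") && !type.equals("video")) {
            throw new IllegalArgumentException("unknown type: " + type);
        }
        return new String[] { type, parts[1].trim() };
    }

    public static void main(String[] args) {
        String raw = "10/Nov/2016:00:01:03 +0800";
        System.out.println(formatTime(raw));   // prints 2016-11-10 00:01:03
        System.out.println(dayOfMonth(raw));   // prints 10
        String[] typeId = typeAndId("article/11325");
        System.out.println(typeId[0] + " " + typeId[1]);   // prints article 11325
    }
}
```

Parsing the offset and then dropping it with `toLocalDateTime()` keeps the logged time unchanged regardless of the time zone of the machine running the job.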

(3) Hive table structure:

create table data (ip string, time string, day string, traffic bigint, type string, id string)
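Because the table above is declared without a ROW FORMAT clause, Hive reads text files for it using its default field delimiter, Ctrl-A (0x01). The sketch below shows one way to render a cleaned record as such a row; the class name `HiveRow` and the sample values are illustrative only. If the table were instead created with an explicit delimiter, the constant would need to match it.

```java
// Hypothetical helper that renders one cleaned record as a text row for the table above.
final class HiveRow {
    // Hive's default field delimiter for text tables is Ctrl-A (0x01).
    static final char FIELD_DELIMITER = '\u0001';

    // Joins the six columns in table order: ip, time, day, traffic, type, id.
    static String toRow(String ip, String time, String day, long traffic, String type, String id) {
        return String.join(String.valueOf(FIELD_DELIMITER),
                ip, time, day, Long.toString(traffic), type, id);
    }

    public static void main(String[] args) {
        String row = toRow("199.30.25.88", "2016-11-10 00:01:03", "10", 62L, "article", "11325");
        // Show the delimiter as '|' so the row is readable on a terminal.
        System.out.println(row.replace(FIELD_DELIMITER, '|'));
    }
}
```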

 

First-stage results (screenshot omitted)

 

Second stage

The second stage still has some problems; only part of the code is complete.

Partial code for the first and second stages:

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Namecount extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Input time pattern, e.g. 10/Nov/2016:00:01:03 (the zone suffix is dropped before parsing).
    public static final SimpleDateFormat FORMAT =
            new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    // Output time pattern, e.g. 2016-11-10 00:01:03.
    public static final SimpleDateFormat dateformat1 =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    // Assumption: each input line is a first-stage record with comma-separated
    // fields in the order ip,time,day,traffic,type,id.
    public String[] parse(String line) {
        String ip = parseIp(line);
        String time = parseTime(line);
        String day = parseDay(line);
        String traffic = parseTraffic(line);
        String type = parseType(line);
        String id = parseId(line);
        return new String[] { ip, time, day, traffic, type, id };
    }

    // Returns the field at the given position, or an empty string if it is missing.
    private String field(String line, int index) {
        String[] fields = line.split(",");
        return index < fields.length ? fields[index].trim() : "";
    }

    // The ip-to-city lookup is not implemented yet, so the ip is passed through unchanged.
    private String parseIp(String line) {
        return field(line, 0);
    }

    // Converts 10/Nov/2016:00:01:03 +0800 into 2016-11-10 00:01:03.
    private String parseTime(String line) {
        String raw = field(line, 1);
        // Drop the zone suffix (" +0800") so the wall-clock time is kept as logged.
        int space = raw.indexOf(' ');
        String date = space < 0 ? raw : raw.substring(0, space);
        Date parsed = parseDateFormat(date);
        // Fall back to the raw value if the time cannot be parsed.
        return parsed == null ? raw : dateformat1.format(parsed);
    }

    private String parseDay(String line) {
        return field(line, 2);
    }

    private String parseTraffic(String line) {
        return field(line, 3);
    }

    private String parseType(String line) {
        return field(line, 4);
    }

    private String parseId(String line) {
        return field(line, 5);
    }

    // Parses the input time; returns null if the text does not match FORMAT.
    private Date parseDateFormat(String string) {
        try {
            return FORMAT.parse(string);
        } catch (ParseException e) {
            e.printStackTrace();
            return null;
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizerArticle = new StringTokenizer(value.toString(), "\n");
        while (tokenizerArticle.hasMoreTokens()) {
            String[] cleaned = parse(tokenizerArticle.nextToken());
            // Assumption: emit the cleaned record as one comma-separated line.
            context.write(new Text(String.join(",", cleaned)), NullWritable.get());
        }
    }
}

 

 

 

 


Origin www.cnblogs.com/love-nan/p/11853468.html