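Problem description:
Given one month of incoming customer calls and their call times, count the repeat calls, i.e. calls from a number that calls again within 24 hours.
Notes:
The input is a CSV file with two columns, roughly 1,000,000 rows per month. The attachment contains sample data.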
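【method1: straightforward traversal】
Complexity O(kn), where k is about 30,000 (roughly the number of calls that fall inside one 24-hour window).
Optimizations: (1) compare call times through the millisecond long values returned by getTime; (2) store phone numbers as Long and compare them with equals.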
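To make the traversal concrete, here is a minimal sketch of method1 on a hand-made data set. The class name `BruteForceSketch`, the method `countRepeats`, and the sample numbers are assumptions made for this example, and the input is assumed to be sorted by call time. It also shows why `equals` matters: `==` on two `Long` objects compares references, which is unreliable for values as large as phone numbers.

```java
import java.util.Arrays;
import java.util.List;

class BruteForceSketch {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Counts the calls that are followed by another call from the same
    // number less than 24 hours later. Assumes times are sorted ascending.
    static int countRepeats(List<Long> times, List<Long> numbers) {
        int count = 0;
        for (int i = 0; i < times.size(); i++) {
            long limit = times.get(i) + DAY_MS;
            // Scan forward only while the later call is still inside the window.
            for (int j = i + 1; j < times.size() && times.get(j) < limit; j++) {
                if (numbers.get(i).equals(numbers.get(j))) {
                    count++;
                    break; // one repeat is enough for call i
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long hour = 60L * 60 * 1000;
        // Hypothetical sample: the same number calls at 0h, 2h and 25h.
        List<Long> times = Arrays.asList(0L, hour, 2 * hour, 25 * hour);
        List<Long> numbers = Arrays.asList(10000000001L, 10000000002L, 10000000001L, 10000000001L);
        System.out.println(countRepeats(times, numbers)); // prints 2
    }
}
```

The call at 25h is more than 24 hours after the first call but less than 24 hours after the call at 2h, so two calls are counted as having a repeat.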
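【method2: using a HashSet】
Complexity O(n), since every call enters and leaves the set at most once.
Keep a set, trading space for time, together with two int pointers s and e: s walks the list, and e marks the end of the 24-hour window that starts at s. Each time s advances one position, e moves forward over the calls that newly enter the window; for each of them, check whether its number is already in the set, and if so, count++. Once e reaches the end of the window, remove the number at s from the set. Both pointers only move forward, which requires the list to be sorted by call time.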
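The two-pointer window can be sketched as follows. This is an illustration under stated assumptions rather than the exact code of this post: the class name `SetWindowSketch` and the sample data are made up, and the input is assumed to be sorted by call time. A set can only record whether a number is in the window, not how often, so this sketch is only dependable when no number occurs more than twice inside a window.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SetWindowSketch {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    static int countRepeats(List<Long> times, List<Long> numbers) {
        int count = 0;
        Set<Long> window = new HashSet<>();
        int e = 0;
        for (int s = 0; s < times.size(); s++) {
            // Pull in every call that starts less than 24 hours after call s.
            while (e < times.size() && times.get(e) < times.get(s) + DAY_MS) {
                // add() returns false when the number is already in the window.
                if (!window.add(numbers.get(e))) {
                    count++;
                }
                e++;
            }
            // Call s leaves the window before s advances.
            window.remove(numbers.get(s));
        }
        return count;
    }

    public static void main(String[] args) {
        long hour = 60L * 60 * 1000;
        // Hypothetical sample: number ...01 calls twice within two hours.
        List<Long> times = Arrays.asList(0L, hour, 2 * hour, 30 * hour);
        List<Long> numbers = Arrays.asList(10000000001L, 10000000002L, 10000000001L, 10000000002L);
        System.out.println(countRepeats(times, numbers)); // prints 1
    }
}
```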
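【method3: replacing the HashSet with a HashMap】
A HashSet cannot hold duplicate elements, so a repeated number triggers count++ but cannot be added to the set a second time. Yet every advance of s removes one number, so a number can vanish from the set while a later call from it is still inside the window. The bug is that the repeat count comes out too low.
The fix is to switch to a HashMap that records the number as key and its number of occurrences in the window as value. A repeated number can then enter the collection several times (occurrences + 1), and removal only decrements the value (occurrences - 1). Since Java's HashSet is itself backed by a HashMap, the change has essentially no impact on performance.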
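As a small illustration of the counting window, the sketch below uses `Map.merge` and `Map.computeIfPresent` (Java 8 and later) instead of explicit null checks. The class name `CountWindowSketch` and the sample data are assumptions for this example, and the input is assumed to be sorted by call time. The sample is chosen so that a set-based window would report only 1: the first call's removal would also erase the call at 2h, and the call at 24.5h would then look new.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CountWindowSketch {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    static int countRepeats(List<Long> times, List<Long> numbers) {
        int count = 0;
        // number -> how many calls from it are currently inside the window
        Map<Long, Integer> window = new HashMap<>();
        int e = 0;
        for (int s = 0; s < times.size(); s++) {
            while (e < times.size() && times.get(e) < times.get(s) + DAY_MS) {
                // merge() returns the updated occurrence count for this number.
                int seen = window.merge(numbers.get(e), 1, Integer::sum);
                if (seen > 1) {
                    count++; // an earlier call from this number is still in the window
                }
                e++;
            }
            // Call s leaves the window: decrement, and drop the key at zero
            // (returning null from the remapping function removes the entry).
            window.computeIfPresent(numbers.get(s),
                    (k, v) -> v > 1 ? Integer.valueOf(v - 1) : null);
        }
        return count;
    }

    public static void main(String[] args) {
        long halfHour = 30L * 60 * 1000;
        // Hypothetical sample: number ...01 calls at 0h, 2h and 24.5h.
        List<Long> times = Arrays.asList(0L, 2 * halfHour, 4 * halfHour, 49 * halfHour);
        List<Long> numbers = Arrays.asList(10000000001L, 10000000002L, 10000000001L, 10000000001L);
        System.out.println(countRepeats(times, numbers)); // prints 2
    }
}
```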
In testing on call data of about 1,000,000 rows, method1 (about 5 min) and method3 (about 2 s) produced exactly the same repeat figures.
package bjtel;

import java.io.File;
import java.nio.charset.Charset;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import com.csvreader.CsvReader;

public class DuplicateIn24 {

    /** Length of the repeat window: 24 hours in milliseconds. */
    private static final long DAY_MS = 86_400_000L;

    public static void main(String[] args) throws Exception {
        // Process every CSV file in the data directory.
        File[] files = new File("D:\\工作相关\\24小时重复拨打率").listFiles();
        if (files == null)
            return;
        for (File f : files) {
            if (f.isFile())
                run(f);
        }
    }

    public static void run(File file) throws Exception {
        CsvReader reader = new CsvReader(file.getCanonicalPath(), ',', Charset.forName("GBK"));
        DateFormat format = new SimpleDateFormat("yyyy/MM/dd HH:mm");
        DateFormat format2 = new SimpleDateFormat("yyyy/MM/dd");
        List<Long> dates = new ArrayList<>();
        List<Long> calls = new ArrayList<>();
        reader.readHeaders();
        while (reader.readRecord()) {
            // The number is in column 2 of three-column files, otherwise in column 0.
            // Rows whose number cannot be parsed are skipped.
            try {
                calls.add(Long.parseLong(reader.getColumnCount() == 3 ? reader.get(2) : reader.get(0)));
            } catch (Exception e) {
                continue;
            }
            // Fall back to the date-only format when the time of day is missing.
            try {
                dates.add(format.parse(reader.get(1)).getTime());
            } catch (Exception e) {
                dates.add(format2.parse(reader.get(1)).getTime());
            }
        }
        reader.close();
        if (dates.size() == calls.size())
            System.out.println("[" + file.getName() + "] scanned " + dates.size() + " calls");
        int count = method3(dates, calls);
        System.out.println(count + " repeat calls, ratio " + (double) count / calls.size());
    }

    /** HashSet version, kept for comparison only: it undercounts (see method3). */
    public static int method2(List<Long> dates, List<Long> calls) {
        int count = 0;
        Set<Long> scale = new HashSet<>();
        for (int s = 0, e = 1; s < dates.size(); s++) {
            scale.add(calls.get(s));
            while (e < dates.size() && dates.get(e) < dates.get(s) + DAY_MS) {
                if (e >= calls.size())
                    return count;
                if (scale.contains(calls.get(e))) {
                    count++;
                }
                scale.add(calls.get(e++));
            }
            scale.remove(calls.get(s));
        }
        return count;
    }

    /** HashMap version: number -> occurrences inside the current 24-hour window. */
    public static int method3(List<Long> dates, List<Long> calls) {
        int count = 0;
        Map<Long, Integer> scale = new HashMap<>();
        // e starts at 0 so that the first call also enters the window;
        // every call is added once by e and removed once by s.
        for (int s = 0, e = 0; s < dates.size(); s++) {
            for (; e < dates.size() && dates.get(e) < dates.get(s) + DAY_MS; e++) {
                if (e >= calls.size())
                    return count;
                Integer seen = scale.get(calls.get(e));
                int t = seen == null ? 0 : seen;
                if (t > 0)
                    count++;
                scale.put(calls.get(e), t + 1);
            }
            Integer seen = scale.get(calls.get(s));
            int t = seen == null ? 0 : seen;
            if (t <= 1)
                scale.remove(calls.get(s));
            else
                scale.put(calls.get(s), t - 1);
        }
        return count;
    }

    /** Straightforward traversal: for each call, scan forward through the next 24 hours. */
    public static int method1(List<Long> dates, List<Long> calls) {
        int count = 0;
        for (int i = 0; i < dates.size(); i++) {
            if (i % 50000 == 0)
                System.out.println("scanned " + i + " calls, " + count + " repeats");
            long scale = dates.get(i) + DAY_MS;
            int j = i + 1;
            while (j < dates.size() && dates.get(j) < scale) {
                if (calls.get(i).equals(calls.get(j))) {
                    count++;
                    break;
                }
                j++;
            }
        }
        System.out.println("method1: " + count + " repeat calls within 24 hours, ratio " + (double) count / calls.size());
        return count;
    }
}
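Suppose the requirement is extended: for every repeated number, identify the original receiver (the agent who handled the earlier call), and output how many repeat calls each agent caused in total.
In that case, change the HashMap in method3 so that the key is the number and the value is an object wrapping the occurrence count and the receiver.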
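The idea can be sketched on a tiny in-memory list before reading the full program. All names here (`ReceiverSketch`, `WindowEntry`, `repeatsByReceiver`, the agents alice, bob and carol) are hypothetical and chosen for this example only; the calls are assumed to be sorted by time, and a `TreeMap` is used merely to get a stable print order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReceiverSketch {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    static final class Call {
        final long time;
        final long number;
        final String receiver;

        Call(long time, long number, String receiver) {
            this.time = time;
            this.number = number;
            this.receiver = receiver;
        }
    }

    // Window state per number: calls currently in the window, and who took the latest one.
    static final class WindowEntry {
        int count;
        String lastReceiver;
    }

    // Returns receiver -> number of repeat calls that followed a call they handled.
    static Map<String, Integer> repeatsByReceiver(List<Call> calls) {
        Map<String, Integer> detail = new TreeMap<>();
        Map<Long, WindowEntry> window = new HashMap<>();
        int e = 0;
        for (int s = 0; s < calls.size(); s++) {
            while (e < calls.size() && calls.get(e).time < calls.get(s).time + DAY_MS) {
                Call c = calls.get(e++);
                WindowEntry entry = window.computeIfAbsent(c.number, k -> new WindowEntry());
                if (entry.count > 0) {
                    // Charge the repeat to whoever handled the previous call from this number.
                    detail.merge(entry.lastReceiver, 1, Integer::sum);
                }
                entry.count++;
                entry.lastReceiver = c.receiver;
            }
            // Call s leaves the window; it was added above, so its entry exists.
            WindowEntry leaving = window.get(calls.get(s).number);
            if (--leaving.count == 0) {
                window.remove(calls.get(s).number);
            }
        }
        return detail;
    }

    public static void main(String[] args) {
        long hour = 60L * 60 * 1000;
        List<Call> calls = new ArrayList<>();
        calls.add(new Call(0, 10000000001L, "alice"));
        calls.add(new Call(hour, 10000000001L, "bob"));
        calls.add(new Call(2 * hour, 10000000002L, "alice"));
        calls.add(new Call(3 * hour, 10000000001L, "carol"));
        System.out.println(repeatsByReceiver(calls)); // prints {alice=1, bob=1}
    }
}
```

In this sample the call taken by bob repeats alice's call, and the call taken by carol repeats bob's, so alice and bob are each charged one repeat.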
The full code follows (attachment 2 is the input for the new requirement):
package bjtel;

import java.io.FileWriter;
import java.nio.charset.Charset;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import com.csvreader.CsvReader;

public class DuplicateIn24WithDetail {

    /** Length of the repeat window: 24 hours in milliseconds. */
    private static final long DAY_MS = 86_400_000L;

    public static void main(String[] args) throws Exception {
        String input = args.length == 2 ? args[0] : "D:\\工作相关\\24小时重复拨打率\\contactdetail_20141231.csv";
        String output = args.length == 2 ? args[1] : "D:\\data.csv";
        CsvReader reader = new CsvReader(input, ',', Charset.forName("GBK"));
        // "ss" is seconds; "SS" would be read as milliseconds.
        DateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        DateFormat format2 = new SimpleDateFormat("yyyy-MM-dd");
        List<Call> calls = new ArrayList<>();
        long placeholder = -1L;
        reader.readHeaders();
        while (reader.readRecord()) {
            Call call = new Call();
            try {
                call.number = Long.parseLong(reader.get(0));
            } catch (NumberFormatException e) {
                // Unparseable number: keep the row under a unique negative
                // placeholder so that it never matches another call.
                call.number = placeholder--;
            }
            call.receiver = reader.get(2);
            // Fall back to the date-only format when the time of day is missing.
            try {
                call.time = format.parse(reader.get(1)).getTime();
            } catch (Exception e) {
                call.time = format2.parse(reader.get(1)).getTime();
            }
            calls.add(call);
        }
        reader.close();

        Map<String, Integer> detail = new HashMap<>();
        String day = format2.format(new Date(calls.get(0).time));
        System.out.println(day + ": " + method2(calls, detail) + " repeat calls found");
        // Use output rather than args[1], which does not exist when the defaults are used.
        System.out.println("Figures for " + detail.size() + " agents written to " + output);
        // Append one line per agent: date,receiver,repeat count
        try (FileWriter fileWriter = new FileWriter(output, true)) {
            for (Entry<String, Integer> e : detail.entrySet()) {
                fileWriter.append(day + "," + e.getKey() + "," + e.getValue() + "\r\n");
            }
        }
    }

    /** Straightforward traversal; charges each repeat to the receiver of the earlier call. */
    public static int method1(List<Call> calls, Map<String, Integer> detail) {
        int count = 0;
        for (int i = 0; i < calls.size(); i++) {
            if (i % 50000 == 0)
                System.out.println("scanned " + i + " calls, " + count + " repeats");
            long scale = calls.get(i).time + DAY_MS;
            int j = i + 1;
            while (j < calls.size() && calls.get(j).time < scale) {
                if (calls.get(i).number.equals(calls.get(j).number)) {
                    count++;
                    detail.merge(calls.get(i).receiver, 1, Integer::sum);
                    break;
                }
                j++;
            }
        }
        return count;
    }

    /** Sliding window: number -> (occurrences in the window, receiver of the latest call). */
    public static int method2(List<Call> calls, Map<String, Integer> detail) {
        int count = 0;
        Map<Long, Scale> scale = new HashMap<>();
        // e starts at 0 so that the first call also enters the window.
        for (int s = 0, e = 0; s < calls.size(); s++) {
            for (; e < calls.size() && calls.get(e).time < calls.get(s).time + DAY_MS; e++) {
                Scale unique = scale.get(calls.get(e).number);
                int t = unique == null ? 0 : unique.count;
                if (t > 0) {
                    count++;
                    detail.merge(unique.lastReceiver, 1, Integer::sum);
                }
                scale.put(calls.get(e).number, new Scale(t + 1, calls.get(e).receiver));
            }
            Scale leaving = scale.get(calls.get(s).number);
            int t = leaving == null ? 0 : leaving.count;
            if (t <= 1)
                scale.remove(calls.get(s).number); // the key is the number, not the Call object
            else
                scale.put(calls.get(s).number, new Scale(t - 1, leaving.lastReceiver));
        }
        return count;
    }
}

class Call {
    public long time;
    public Long number;
    public String receiver;
}

class Scale {
    public int count;
    public String lastReceiver;

    public Scale(int count, String lastReceiver) {
        this.count = count;
        this.lastReceiver = lastReceiver;
    }
}
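Conclusion: method2 is faster, but it can only count the repeat calls within a single collection. method1 is slower, but because it traverses the calls in order, it can report the 24-hour repeat count up to any given point in time.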