Counting Data with HashMap

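For work, I needed to pull a large amount of data out of the database, compute statistics over it, and store the results back. Because the data set was so large, I had misgivings from the start; I'm still new at this and know I have gaps.

Sure enough, my first attempt was a really dumb counting routine. Roughly: fetch the data and put it into a map, but before every insert, iterate over the whole map to check whether the key is already there. That step was the crux, and the worst part of it: when I tested with 100,000 records, the program simply ground to a halt.

Here is roughly what the code looked like, as a cautionary example: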

	private void infosOne(List<String> list){
		long startTime=System.currentTimeMillis();
		Map<String,Integer> jokeInfos1 = new HashMap<String, Integer>();
		for(String d : list){
			String[] jokeIds = d.split(",");
			for(String jid: jokeIds){
				// Anti-pattern: linear scan over every entry just to find one key.
				boolean isT = false;
				for (Map.Entry<String,Integer> entry : jokeInfos1.entrySet()) {
					if(entry.getKey().equals(jid)){
						isT = true;
						entry.setValue(entry.getValue()+1);
						break;
					}
				}
				// Key not seen before: start its count at 1.
				if(!isT){
					jokeInfos1.put(jid, 1);
				}
			}
		}
		long endTime=System.currentTimeMillis();
		System.out.println("infosOne map scan elapsed="+(endTime-startTime)+"ms, "+jokeInfos1.size()+" records");
	}

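Later I talked it over with some classmates, and on their advice I switched to get(key) and containsKey in place of the scan. The result was indeed much better.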

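The first test used several hundred thousand rows of heavily duplicated data (the loop below adds 9 strings per iteration, 900,000 in total):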

		List<String> sList = new ArrayList<String>();
		for(int i = 0;i<100000; i++){
			sList.add("123,333,421,3,12,44,51,242,554,999,");
			sList.add("28,888,221,8,12,22,51,222,552,777,");
			sList.add("14,444,451,4,15,44,51,545,554,666,");
			sList.add("17,777,711,7,11,77,11,171,117,222,");
			sList.add("1123,71,1,7,12,5,");
			sList.add("88,");
			sList.add("628,");
			sList.add("8532,");
			sList.add("13,177,711,7,11,77,11,171,117,222,");
		}

At this point the gap was already visible. The results:

infosContainsKey elapsed=2235ms, 41 records
infosGetKey elapsed=1843ms, 41 records

infosOne map scan elapsed=9886ms, 41 records

Then, rather impulsively, I added one more row inside the loop, one that produces new keys on every iteration:

			sList.add(i+",177,711,7,11,77,11,171,117,"+(i+4)+",");

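This is what that run produced:

infosContainsKey elapsed=2980ms, 100004 records
infosGetKey direct get elapsed=2381ms, 100004 records

Why is there no result for the scan version? Because I gave up waiting. It would have finished eventually, but nowhere near fast enough to be useful, and there is no point in measuring it. Take it as a warning and just don't do this.

Then I reduced the loop from 100,000 to 10,000 iterations and looked at the results: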

		for(int i = 0;i<10000; i++){

infosContainsKey elapsed=347ms, 10004 records
infosGetKey direct get elapsed=264ms, 10004 records
infosOne map scan elapsed=66471ms, 10004 records

I ran it several times and the results barely changed. Either way, the gap is now obvious: the scan approach really is... terrible.
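To make the blow-up concrete, here is a minimal sketch that counts how many key comparisons the scan approach performs. The class and method names (`ScanCost`, `scanComparisons`) are made up for illustration and are not part of the original code. Every new key has to be compared against all keys already stored, so k distinct keys cost at least k(k-1)/2 comparisons on misses alone: roughly 50 million for 10,004 keys and roughly 5 billion for 100,004 keys, before counting the scans that end in a hit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScanCost {
    // Repeats the entry-scanning count from infosOne, but also tallies
    // how many equals() comparisons the inner loop performs.
    static long scanComparisons(List<String> rows) {
        Map<String, Integer> counts = new HashMap<>();
        long comparisons = 0;
        for (String row : rows) {
            for (String id : row.split(",")) {
                boolean found = false;
                for (Map.Entry<String, Integer> e : counts.entrySet()) {
                    comparisons++;
                    if (e.getKey().equals(id)) {
                        e.setValue(e.getValue() + 1);
                        found = true;
                        break;
                    }
                }
                if (!found) {
                    counts.put(id, 1); // a miss scanned every stored key
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // 1000 rows, each holding one distinct id.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i + ",");
        }
        // 0 + 1 + ... + 999 = 499500 comparisons
        System.out.println(scanComparisons(rows));
    }
}
```

A hashed lookup such as get(key) avoids this because it only inspects the entries in one bucket, which is usually a handful at most.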

Looking across several runs, I also noticed that the get(key) version was faster than the containsKey version every time.

Here is get(key):

public V get(Object key) {
        if (key == null)
            return getForNullKey();
        Entry<K,V> entry = getEntry(key);

        return null == entry ? null : entry.getValue();
    }

    /**
     * Offloaded version of get() to look up null keys.  Null keys map
     * to index 0.  This null case is split out into separate methods
     * for the sake of performance in the two most commonly used
     * operations (get and put), but incorporated with conditionals in
     * others.
     */
    private V getForNullKey() {
        if (size == 0) {
            return null;
        }
        for (Entry<K,V> e = table[0]; e != null; e = e.next) {
            if (e.key == null)
                return e.value;
        }
        return null;
    }

final Entry<K,V> getEntry(Object key) {
        if (size == 0) {
            return null;
        }

        int hash = (key == null) ? 0 : hash(key);
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash &&
                ((k = e.key) == key || (key != null && key.equals(k))))
                return e;
        }
        return null;
    }

And here is containsKey:

  public boolean containsKey(Object key) {
        return getEntry(key) != null;
    }

    /**
     * Returns the entry associated with the specified key in the
     * HashMap.  Returns null if the HashMap contains no mapping
     * for the key.
     */
    final Entry<K,V> getEntry(Object key) {
        if (size == 0) {
            return null;
        }

        int hash = (key == null) ? 0 : hash(key);
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash &&
                ((k = e.key) == key || (key != null && key.equals(k))))
                return e;
        }
        return null;
    }
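One behavioural difference is worth keeping in mind when swapping containsKey for a null check on get: a key that is mapped to null looks the same as a missing key through get, but not through containsKey. The sketch below shows this; the class name `GetVsContainsKey` and the key strings are purely illustrative. The counting code in this post never stores null values, so there the two checks are equivalent.

```java
import java.util.HashMap;
import java.util.Map;

public class GetVsContainsKey {
    // Builds a map with one key that is explicitly mapped to null.
    static Map<String, Integer> sample() {
        Map<String, Integer> m = new HashMap<>();
        m.put("present", null);
        return m;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = sample();
        System.out.println(m.get("present"));         // null
        System.out.println(m.containsKey("present")); // true
        System.out.println(m.get("missing"));         // null
        System.out.println(m.containsKey("missing")); // false
    }
}
```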

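On closer inspection the two methods are nearly identical; both come down to getEntry(key). So where does the time difference come from? First, here are my two methods: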

	private void infosContainsKey(List<String> list){
		long startTime=System.currentTimeMillis();
		Map<String,Integer> jokeInfos2 = new HashMap<String, Integer>();
		for(String d : list){
			String[] jokeIds = d.split(",");
			for(String jid: jokeIds){
				// Existing key: containsKey, then get, then put (two lookups plus a write).
				if(jokeInfos2.containsKey(jid)){
					jokeInfos2.put(jid, jokeInfos2.get(jid)+1);
				}else{
					jokeInfos2.put(jid, 1);
				}
			}
		}
		long endTime=System.currentTimeMillis();
		System.out.println("infosContainsKey elapsed="+(endTime-startTime)+"ms, "+jokeInfos2.size()+" records");
	}

	private void infosGetKey(List<String> list){
		long startTime=System.currentTimeMillis();
		Integer temp;
		Map<String,Integer> jokeInfos3 = new HashMap<String, Integer>();
		for(String d : list){
			String[] jokeIds = d.split(",");
			for(String jid: jokeIds){
				// A single get serves as both the existence check and the read.
				temp=jokeInfos3.get(jid);
				if(temp!=null){
					jokeInfos3.put(jid, temp+1);
				}else{
					jokeInfos3.put(jid, 1);
				}
			}
		}
		long endTime=System.currentTimeMillis();
		System.out.println("infosGetKey elapsed="+(endTime-startTime)+"ms, "+jokeInfos3.size()+" records");
	}

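The likely cause: infosGetKey performs a single get per id, whereas infosContainsKey calls containsKey and then get again for keys that already exist, so it does an extra lookup and takes a bit longer.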
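If the project can use Java 8 or later (an assumption; the HashMap source quoted above looks like an older JDK), the check-then-update pair can be folded into a single `Map.merge` call. The following is only a sketch of that alternative, with a hypothetical class name `MergeCount` and helper `countIds`; it was not part of the measurements in this post.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeCount {
    // Counts how often each comma-separated id occurs across all rows.
    static Map<String, Integer> countIds(List<String> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String row : rows) {
            for (String id : row.split(",")) {
                // New key: stores 1. Existing key: stores oldValue + 1.
                counts.merge(id, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("88,", "13,88,7,", "7,7,");
        // Iteration order of a HashMap is unspecified, so read individual keys.
        Map<String, Integer> counts = countIds(rows);
        System.out.println("88 -> " + counts.get("88"));
        System.out.println("7  -> " + counts.get("7"));
    }
}
```

Whether this is actually faster than the explicit get/put version would need to be measured on the real data; its main appeal is that the intent fits in one line.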

To check this, I changed infosGetKey to call get twice instead of once, and tested with a 200,000-iteration loop.

The results:

infosContainsKey elapsed=5557ms, 200004 records
infosGetKey elapsed=5584ms, 200004 records

After changing it back:

infosContainsKey elapsed=5602ms, 200004 records
infosGetKey elapsed=4917ms, 200004 records
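Single runs timed with System.currentTimeMillis() are noisy on the JVM, since JIT compilation and garbage collection can easily shift a result by this much. Below is a rough sketch of a slightly steadier comparison: warm up first, repeat each variant, and keep the fastest run. All names here (`CountTiming`, `bestMillis`, `sampleRows`) are hypothetical, the sample rows are a cut-down version of the test data above, and it assumes Java 8 or later. A dedicated harness such as JMH would be more rigorous.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class CountTiming {
    static Map<String, Integer> countWithContainsKey(List<String> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String row : rows) {
            for (String id : row.split(",")) {
                if (counts.containsKey(id)) {
                    counts.put(id, counts.get(id) + 1);
                } else {
                    counts.put(id, 1);
                }
            }
        }
        return counts;
    }

    static Map<String, Integer> countWithGet(List<String> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String row : rows) {
            for (String id : row.split(",")) {
                Integer old = counts.get(id);
                counts.put(id, old != null ? old + 1 : 1);
            }
        }
        return counts;
    }

    // Runs the counter `rounds` times and returns the fastest run in milliseconds.
    static long bestMillis(Function<List<String>, Map<String, Integer>> counter,
                           List<String> rows, int rounds) {
        long best = Long.MAX_VALUE;
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            counter.apply(rows);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best / 1_000_000;
    }

    // Two rows per iteration: one fully repeated, one that adds new keys.
    static List<String> sampleRows(int n) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            rows.add("17,777,711,7,11,77,11,171,117,222,");
            rows.add(i + ",177,711,7,11,77,11,171,117," + (i + 4) + ",");
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> rows = sampleRows(20_000);
        // Warm-up runs; their timings are discarded.
        bestMillis(CountTiming::countWithContainsKey, rows, 3);
        bestMillis(CountTiming::countWithGet, rows, 3);
        System.out.println("containsKey+get: "
                + bestMillis(CountTiming::countWithContainsKey, rows, 5) + " ms");
        System.out.println("get only:        "
                + bestMillis(CountTiming::countWithGet, rows, 5) + " ms");
    }
}
```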

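In short, I'll go with the get approach for now. It may not be the final answer, but it is the best option I have at the moment; later I may look at doing this counting on the database side instead.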
