14. Hash Tables

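1. Basic concepts

The two central questions for a hash table are how to turn a "key" into an "index" (the design of the hash function) and how to resolve hash collisions when two keys land on the same index.

Hash tables are a clear example of a classic idea in algorithm design: trading space for time.

A hash table is a balance between time and space.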

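2. Designing hash functions

The "index" that the hash function produces from a "key" should be distributed as uniformly as possible.

2.1 Integers

Small non-negative integers: use the value directly as the index.
Small ranges that include negative integers: shift by an offset, e.g. for values starting at -100, add 100 to every value.

Large integers

ID card number: 130481199304111819

Common approach: take a modulus. Keeping the last four digits, for example, is equivalent to mod 10000.

What about the last six digits, i.e. mod 1000000? The fifth and sixth digits from the end encode the day of birth and can only be 01 to 31, so the indices would be unevenly distributed. This also fails to use all of the information in the number.

Each problem has to be analyzed on its own terms.

A simple remedy: take the modulus by a prime. Data sets of different orders of magnitude can use different primes.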
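To make the effect of a prime modulus concrete, here is a minimal sketch that counts how many distinct indices a patterned key set reaches. The class name ModulusDemo, the helper distinctBuckets, and the choice of keys (multiples of 4) are assumptions made for illustration, not part of the original notes.

```java
import java.util.HashSet;
import java.util.Set;

class ModulusDemo {
    // Counts the distinct indices produced by key % m for the keys 0, step, 2*step, ...
    static int distinctBuckets(int m, int step, int count) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < count; i++) {
            seen.add((i * step) % m);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Hypothetical keys that are all multiples of 4.
        System.out.println(distinctBuckets(8, 4, 100)); // composite modulus 8: only 2 indices used
        System.out.println(distinctBuckets(7, 4, 100)); // prime modulus 7: all 7 indices used
    }
}
```

With the composite modulus the keys share a factor with M and pile up in two slots; with the prime modulus they spread over every slot.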

2.2 Floating-point numbers

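2.3 Strings

Treat the string as an integer.

Note: the base B can be chosen to suit the situation; M is the prime used as the modulus.

To prevent integer overflow, the modulus can be taken at every step instead of once at the end. The result is the same (I have not verified this myself).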
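The equivalence can be checked mechanically. The sketch below assumes the string is read as a base-B number, hash(s) = (s[0]*B^(n-1) + ... + s[n-1]) mod M, with each character contributing its char code; that formula is a reconstruction from the note on B and M, and the class and method names are hypothetical. One method reduces after every step, the other builds the full value with BigInteger and reduces once.

```java
import java.math.BigInteger;

class StringHashDemo {
    // Modulus applied after every step, so the intermediate value never grows large.
    static int stepwiseHash(String s, int B, int M) {
        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = (hash * B + s.charAt(i)) % M;
        }
        return (int) hash;
    }

    // Reference: exact polynomial value with arbitrary precision, reduced once at the end.
    static int exactHash(String s, int B, int M) {
        BigInteger value = BigInteger.ZERO;
        BigInteger base = BigInteger.valueOf(B);
        for (int i = 0; i < s.length(); i++) {
            value = value.multiply(base).add(BigInteger.valueOf(s.charAt(i)));
        }
        return value.mod(BigInteger.valueOf(M)).intValue();
    }

    public static void main(String[] args) {
        // Both lines print the same index.
        System.out.println(stepwiseHash("antfin", 31, 97));
        System.out.println(exactHash("antfin", 31, 97));
    }
}
```

The two agree because (a * B + c) mod M depends only on a mod M, so reducing early never changes the final remainder.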

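2.4 Composite types

2.5 Summary

Converting to an integer is not the only approach.

Principles:

Consistency: if a == b, then hash(a) == hash(b)
Efficiency: the computation is fast and simple
Uniformity: hash values are evenly distributed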

3. hashCode in Java

3.1

int a=42;
System.out.println(((Integer)a).hashCode());
int b=-42;
System.out.println(((Integer)b).hashCode());
double d=3.14;
System.out.println(((Double)d).hashCode());
String str="antfin";
System.out.println(str.hashCode());
//42
//-42
//300063655
//-1412795196

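In Java, hashCode returns an int, so the value can be positive or negative. Turning it into an index is left to whoever implements the hash table, because hashCode cannot know the table size in advance.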
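The value printed for 3.14 above can be reproduced by hand. The Java API documents Double.hashCode as the XOR of the two 32-bit halves of the value's 64-bit representation; the sketch below follows that description (the class name DoubleHashDemo and the method bitsHash are hypothetical).

```java
class DoubleHashDemo {
    // Folds the 64-bit IEEE 754 bit pattern into 32 bits by XOR-ing its two halves.
    static int bitsHash(double d) {
        long bits = Double.doubleToLongBits(d);
        return (int) (bits ^ (bits >>> 32));
    }

    public static void main(String[] args) {
        System.out.println(bitsHash(3.14));        // 300063655
        System.out.println(Double.hashCode(3.14)); // same value
    }
}
```

This is the "treat it as an integer" idea applied to a floating-point number: the bits are reinterpreted, not the numeric value.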

3.2 Classes

public class Student {
    private int ID;
    private int cls;
    private String firstName;
    private String lastName;

    public Student(int ID, int cls, String firstName, String lastName) {
        this.ID = ID;
        this.cls = cls;
        this.firstName = firstName;
        this.lastName = lastName;
    }

    @Override
    public int hashCode(){
        int B=31; // base, chosen arbitrarily
        int hash=0;
        hash=hash*B+ID ;
        hash=hash*B+cls;
        hash=hash*B+firstName.toLowerCase().hashCode();
        hash=hash*B+lastName.toLowerCase().hashCode();
        return hash;
    }
}

System.out.println(new Student(12,22,"antfin","alibaba").hashCode());

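There is also a default hashCode, inherited from Object, which is derived from the object itself (usually described as a mapping from its address) rather than from its fields.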

System.out.println(new Student(12,22,"antfin","alibaba").hashCode());
System.out.println(new Student(12,22,"antfin","alibaba").hashCode());

Output when the hashCode method is not overridden (each new object gets a different value):
1627674070
1360875712

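This does not match what we intend, since two students with identical fields should count as the same key. We therefore need to override both hashCode and equals.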
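A minimal sketch of what overriding both methods could look like. The class name StudentKey is hypothetical and stands in for the Student class above; names are compared through toLowerCase() in equals so that it agrees with the case-insensitive hashCode.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical variant of the Student class with both hashCode and equals overridden.
class StudentKey {
    private final int id;
    private final int cls;
    private final String firstName;
    private final String lastName;

    StudentKey(int id, int cls, String firstName, String lastName) {
        this.id = id;
        this.cls = cls;
        this.firstName = firstName;
        this.lastName = lastName;
    }

    @Override
    public int hashCode() {
        int B = 31; // base, chosen arbitrarily
        int hash = 0;
        hash = hash * B + id;
        hash = hash * B + cls;
        hash = hash * B + firstName.toLowerCase().hashCode();
        hash = hash * B + lastName.toLowerCase().hashCode();
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) { return true; }
        if (o == null || getClass() != o.getClass()) { return false; }
        StudentKey other = (StudentKey) o;
        // Same normalization as hashCode, so equal objects always share a hash code.
        return id == other.id && cls == other.cls
                && firstName.toLowerCase().equals(other.firstName.toLowerCase())
                && lastName.toLowerCase().equals(other.lastName.toLowerCase());
    }

    public static void main(String[] args) {
        Set<StudentKey> set = new HashSet<>();
        set.add(new StudentKey(12, 22, "antfin", "alibaba"));
        set.add(new StudentKey(12, 22, "Antfin", "Alibaba"));
        System.out.println(set.size()); // 1: both objects are the same key
    }
}
```

hashCode only selects the bucket; equals is what decides whether two keys in that bucket are the same, which is why the two must be overridden together.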

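4. Separate chaining: handling hash collisions

Take the hash code modulo a prime M. Because hashCode(k1) can be negative in Java, the sign has to be removed first. Instead of taking the absolute value, the sign bit can be masked off, which likewise guarantees a non-negative result (the value is not numerically the absolute value, but it serves the same purpose):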

(hashCode(k1)&0x7fffffff)%M

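In 32 bits, 0x7fffffff is a leading 0 followed by 31 ones, so the bitwise AND clears the sign bit and leaves the other 31 bits unchanged.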
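A short sketch of the index computation, including the one input where the absolute-value approach breaks down. The class name IndexDemo and the method indexOf are hypothetical.

```java
class IndexDemo {
    // Clears the sign bit, then reduces into the range [0, M).
    static int indexOf(int hashCode, int M) {
        return (hashCode & 0x7fffffff) % M;
    }

    public static void main(String[] args) {
        int h = "antfin".hashCode();            // negative, as shown in section 3.1
        System.out.println(indexOf(h, 97));     // always within 0..96

        // Math.abs cannot represent +2^31, so this prints a negative number.
        System.out.println(Math.abs(Integer.MIN_VALUE));
        // The mask handles the same input without trouble.
        System.out.println(indexOf(Integer.MIN_VALUE, 97)); // 0
    }
}
```

Math.floorMod(hashCode, M) is another standard-library way to get a non-negative index; it yields a different but equally valid slot assignment.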

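In separate chaining, every array slot stores a lookup table. That lookup table does not have to be a linked list; a balanced tree can be used as well.

Java's TreeMap is backed by a red-black tree. Since Java 8, HashMap applies the same idea: a bucket starts as a linked list and is converted to a red-black tree once the collisions in it pass a threshold (it uses its own internal tree nodes, not TreeMap itself). A red-black tree has lower time complexity than a linked list, but when the amount of data is small the linked list is faster.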

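5. Implementing a hash map

Each bucket is a TreeMap, which is itself implemented with a red-black tree. Being able to reuse it as a building block shows the benefit of object-oriented design.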

/**
 * Alipay.com Inc. Copyright (c) 2004-2018 All Rights Reserved.
 */
package com.antfin.hashcode;

import java.util.TreeMap;

/**
 * @author alibaba
 * @version $Id: HashTable.java, v 0.1 2018年07月24日 下午11:51 alibaba Exp $
 */
public class HashTable<K, V> {
    private TreeMap<K, V>[] hashTable;
    private int             size;
    // The choice of M matters a great deal
    private int             M;

    public HashTable(int M) {
        this.M = M;
        size = 0;
        hashTable = new TreeMap[M];
        for (int i = 0; i < M; i++) { hashTable[i] = new TreeMap<>(); }
    }

    public HashTable() {
        this(97);
    }

    private int hash(K key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    public int getSize() {
        return size;
    }

    public void add(K key, V value) {
        TreeMap<K, V> map = hashTable[hash(key)];
        if (map.containsKey(key)) { map.put(key, value); } else {
            map.put(key, value);
            size++;
        }
    }

    public V remove(K key) {
        V ret = null;
        TreeMap<K, V> map = hashTable[hash(key)];
        if (map.containsKey(key)) {
            ret = map.remove(key);
            size--;
        }
        return ret;
    }

    public void set(K key, V value) {
        TreeMap<K, V> map = hashTable[hash(key)];
        if (!map.containsKey(key)) { throw new IllegalArgumentException(key + " doesn't exist!"); }
        map.put(key, value);
    }

    public boolean contain(K key) {
        return hashTable[hash(key)].containsKey(key);
    }

    public V get(K key) {
        return hashTable[hash(key)].get(key);
    }
}

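6. Time complexity analysis

6.1 Time complexity of the hash table from the previous section

There are M slots in total. Suppose N elements are stored in the table. The array supports random access, so locating a slot is O(1) and can be ignored; the average time complexity is then:

If each slot is a linked list: O(N/M)

If each slot is a balanced tree: O(log(N/M))

So where is the promised O(1)? The answer is resizing.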

6.2 Dynamic resizing of the hash table

When the average number of elements per slot exceeds a threshold, grow the table:

N/M>=upperTol

When the average number of elements per slot falls below a threshold, shrink the table:

N/M<=lowerTol

/**
 * Alipay.com Inc. Copyright (c) 2004-2018 All Rights Reserved.
 */
package com.antfin.hashcode;

import java.util.TreeMap;

/**
 * @author alibaba
 * @version $Id: HashTables.java, v 0.1 2018年07月28日 下午2:42 alibaba Exp $
 */
public class HashTables<K, V> {
    private        TreeMap<K, V> hashTables[];
    private        int           size;
    private        int           M;
    private static int           initCapacity = 7;
    private        int           UPPER_LOT    = 10;
    private        int           LOWER_LOT    = 2;

    public HashTables(int M) {
        this.M = M;
        this.size = 0;
        hashTables = new TreeMap[M];
        for (int i = 0; i < M; i++) {
            hashTables[i] = new TreeMap<>();
        }
    }

    public HashTables() {
        this(initCapacity);
    }

    public int getSize() {
        return size;
    }

    private int hash(K key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    public void add(K key, V value) {
        TreeMap<K, V> map = hashTables[hash(key)];
        if (map.containsKey(key)) { map.put(key, value); } else {
            map.put(key, value);
            size++;
            if (size>UPPER_LOT*M)
                resize(M*2);
        }
    }

    public void set(K key, V value) {
        TreeMap<K, V> map = hashTables[hash(key)];
        if (!map.containsKey(key)) { throw new IllegalArgumentException("set failed, " + key + " doesn't exist!"); }
        map.put(key, value);
    }

    public boolean contain(K key) {
        TreeMap<K, V> map = hashTables[hash(key)];
        if (map.containsKey(key)) { return true; }
        return false;
    }

    public V get(K key) {
        TreeMap<K, V> map = hashTables[hash(key)];
        if (!map.containsKey(key)) { throw new IllegalArgumentException("get failed, " + key + " doesn't exist!"); }
        return map.get(key);
    }

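    public V remove(K key) {
        TreeMap<K, V> map = hashTables[hash(key)];
        if (!map.containsKey(key)) { throw new IllegalArgumentException("remove failed, " + key + " doesn't exist!"); }
        // Remove before resizing: resize() copies every entry into new buckets,
        // so removing afterwards would only touch the discarded old bucket.
        V ret = map.remove(key);
        size--;
        if (size<LOWER_LOT*M&&M/2>=initCapacity)
            resize(M/2);
        return ret;
    }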

    private void resize(int newM) {
        TreeMap<K,V> newHashtable []=new TreeMap[newM];
        for (int i=0;i<newM;i++){
            newHashtable[i]=new TreeMap<>();
        }
        // Take care when swapping M: the loop must run over the oldM buckets, while hash() must already use newM
        int oldM=M;
        this.M=newM;
        for (int i=0;i<oldM;i++){
            TreeMap<K,V>map=hashTables[i];
            for (K key:map.keySet()){
                newHashtable[hash(key)].put(key,map.get(key));
            }
        }
        this.hashTables=newHashtable;
    }

}

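At first I did not understand why adding growing and shrinking turns the operations into O(1). The reason is that resizing keeps the average number of elements per slot between lowerTol and upperTol. That bound is fixed and does not depend on N, so each operation takes constant time on average, with the cost of the occasional resize spread over the many operations between resizes.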
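The cost of the resizes themselves can be checked with a small simulation. The sketch below mirrors only the bookkeeping of add() from section 6.2 (start at M = 7, double when size > upperTol * M) and counts how many elements resize() would re-insert in total; the class name ResizeCostDemo and the method totalMoves are hypothetical.

```java
class ResizeCostDemo {
    // Total number of elements re-inserted by resize() while adding n distinct keys.
    static long totalMoves(int n, int initialM, int upperTol) {
        long m = initialM;
        long moves = 0;
        for (int size = 1; size <= n; size++) {
            if (size > upperTol * m) {
                moves += size; // resize() re-inserts every element currently stored
                m *= 2;        // the table doubles, as in resize(M*2)
            }
        }
        return moves;
    }

    public static void main(String[] args) {
        int n = 100000;
        System.out.println(totalMoves(n, 7, 10)); // 143301, fewer than 2 * n
    }
}
```

Because each resize doubles M, the sizes at which resizes happen form a geometric series, and the total rehashing work stays below 2N. Spread over N insertions, that is a constant extra cost per add.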

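6.3 A more sophisticated approach to dynamic resizing

Make sure M is still a prime after every resize. This lowers the probability of hash collisions and spreads the elements more evenly.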

/**
 * Alipay.com Inc. Copyright (c) 2004-2018 All Rights Reserved.
 */
package com.antfin.hashcode;

import java.util.TreeMap;

/**
 * @author alibaba
 * @version $Id: HashTables.java, v 0.1 2018年07月28日 下午2:42 alibaba Exp $
 */
public class HashTables<K, V> {
    private final int           capacity[]    =
            {53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593,
                    49157, 98317, 196613, 393241, 786433, 1572869, 3145739, 6291469,
                    12582917, 25165843, 50331653, 100663319, 201326611, 402653189, 805306457, 1610612741};
    private       TreeMap<K, V> hashTables[];
    private       int           size;
    private       int           M;
    private       int           capacityIndex = 0;
    private       int           UPPER_LOT     = 10;
    private       int           LOWER_LOT     = 2;

    public HashTables() {
        this.M = capacity[capacityIndex];
        this.size = 0;
        hashTables = new TreeMap[M];
        for (int i = 0; i < M; i++) {
            hashTables[i] = new TreeMap<>();
        }
    }

    public int getSize() {
        return size;
    }

    private int hash(K key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    public void add(K key, V value) {
        TreeMap<K, V> map = hashTables[hash(key)];
        if (map.containsKey(key)) { map.put(key, value); } else {
            map.put(key, value);
            size++;
            // Pre-increment so that resize() receives the next, larger prime rather than the current one.
            if (size > UPPER_LOT * M && capacityIndex + 1 < capacity.length) { resize(capacity[++capacityIndex]); }
        }
    }

    public void set(K key, V value) {
        TreeMap<K, V> map = hashTables[hash(key)];
        if (!map.containsKey(key)) { throw new IllegalArgumentException("set failed, " + key + " doesn't exist!"); }
        map.put(key, value);
    }

    public boolean contain(K key) {
        TreeMap<K, V> map = hashTables[hash(key)];
        if (map.containsKey(key)) { return true; }
        return false;
    }

    public V get(K key) {
        TreeMap<K, V> map = hashTables[hash(key)];
        if (!map.containsKey(key)) { throw new IllegalArgumentException("get failed, " + key + " doesn't exist!"); }
        return map.get(key);
    }

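    public V remove(K key) {
        TreeMap<K, V> map = hashTables[hash(key)];
        if (!map.containsKey(key)) { throw new IllegalArgumentException("remove failed, " + key + " doesn't exist!"); }
        // Remove before resizing: resize() copies every entry into new buckets,
        // so removing afterwards would only touch the discarded old bucket.
        V ret = map.remove(key);
        size--;
        // Pre-decrement so that resize() receives the previous, smaller prime.
        if (size < LOWER_LOT * M && capacityIndex - 1 >= 0) { resize(capacity[--capacityIndex]); }
        return ret;
    }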

    private void resize(int newM) {
        TreeMap<K, V> newHashtable[] = new TreeMap[newM];
        for (int i = 0; i < newM; i++) {
            newHashtable[i] = new TreeMap<>();
        }
        // Take care when swapping M: the loop must run over the oldM buckets, while hash() must already use newM
        int oldM = M;
        this.M = newM;
        for (int i = 0; i < oldM; i++) {
            TreeMap<K, V> map = hashTables[i];
            for (K key : map.keySet()) {
                newHashtable[hash(key)].put(key, map.get(key));
            }
        }
        this.hashTables = newHashtable;
    }

}

6.4 Miscellaneous notes

The balanced trees in the Java standard library are red-black trees: TreeMap, TreeSet

Hash tables: HashMap, HashSet

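A bug

The hash table implemented above places no bound on K, but its TreeMap buckets need keys that can be compared (or a Comparator); a key type without an ordering fails at runtime. In the Java standard library, too, the red-black tree that replaces a linked list in a bucket relies on the data being comparable to search efficiently.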
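The comparability requirement is easy to trigger. In the sketch below, Point is a hypothetical key type that does not implement Comparable, and treeMapRejects is a hypothetical helper that reports whether TreeMap refuses the key.

```java
import java.util.TreeMap;

class ComparableKeyDemo {
    // A key type without a natural ordering (illustrative only).
    static class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Returns true if a TreeMap without a Comparator rejects the key.
    static boolean treeMapRejects(Object key) {
        TreeMap<Object, String> map = new TreeMap<>();
        try {
            map.put(key, "value");
            return false;
        } catch (ClassCastException e) {
            return true; // the key cannot be cast to Comparable
        }
    }

    public static void main(String[] args) {
        System.out.println(treeMapRejects("antfin"));        // false: String is Comparable
        System.out.println(treeMapRejects(new Point(1, 2))); // true
    }
}
```

One way to surface the problem at compile time, offered as a suggestion rather than as the original design, is to declare the class as HashTable<K extends Comparable<K>, V>.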

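7. More ways to handle hash collisions

7.1 Open addressing

Every slot is open to every element, and each slot stores an element directly (rather than a linked list or a red-black tree). On a collision, the element is placed in the next empty slot after its home slot (linear probing).

Linear probing: on a collision, step by +1.
Quadratic probing: on a collision, step by +1, +4, +9, +16, which avoids large contiguous runs of occupied slots.
Double hashing: on a collision, use a second hash function and step by +hash2(key).

When the load factor (the number of elements divided by the number of slots) reaches a certain level, grow the table.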
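A minimal sketch of open addressing with linear probing, assuming int keys, a positive initial capacity, and growth once the load factor would exceed 1/2. The class LinearProbingSet, its methods, and the threshold are all assumptions for illustration; removal is left out because it needs extra care (for example tombstone markers) to keep probe sequences intact.

```java
class LinearProbingSet {
    private Integer[] slots; // each slot holds one element directly; null means empty
    private int size;

    LinearProbingSet(int capacity) { // capacity must be positive
        slots = new Integer[capacity];
    }

    private static int home(int key, int m) {
        return (Integer.hashCode(key) & 0x7fffffff) % m;
    }

    // Walks forward from the home slot until an empty slot is found.
    private static void insert(Integer[] table, int key) {
        int i = home(key, table.length);
        while (table[i] != null) { i = (i + 1) % table.length; }
        table[i] = key;
    }

    void add(int key) {
        if (contains(key)) { return; }
        // Assumed policy: grow when the load factor would exceed 1/2.
        if ((size + 1) * 2 > slots.length) { resize(slots.length * 2 + 1); }
        insert(slots, key);
        size++;
    }

    boolean contains(int key) {
        return slotOf(key) >= 0;
    }

    // Index where the key is stored, or -1 if absent (exposed for demonstration).
    int slotOf(int key) {
        int i = home(key, slots.length);
        while (slots[i] != null) {
            if (slots[i] == key) { return i; }
            i = (i + 1) % slots.length;
        }
        return -1;
    }

    int size() { return size; }

    private void resize(int newCapacity) {
        Integer[] bigger = new Integer[newCapacity];
        for (Integer k : slots) {
            if (k != null) { insert(bigger, k); }
        }
        slots = bigger;
    }

    public static void main(String[] args) {
        LinearProbingSet set = new LinearProbingSet(7);
        set.add(1); set.add(8); set.add(15); // all three have home slot 1
        System.out.println(set.slotOf(1) + " " + set.slotOf(8) + " " + set.slotOf(15)); // 1 2 3
    }
}
```

Keeping the load factor at or below 1/2 guarantees that an empty slot always exists, so the probing loops terminate.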

7.2

Rehashing
