Java Data Types Series: HashSet

1. A first look at HashSet

HashSet is an implementation class of the Java collection interface Set. Set is an interface that extends Collection, and besides HashSet its implementations include TreeSet. HashSet is very commonly used and is also a frequent interview topic. Its structure diagram is shown below.

[Figure: HashSet class structure diagram]

As always, let's start with the Javadoc comment of the class.

/**
 * This class implements the <tt>Set</tt> interface, backed by a hash table (actually a <tt>HashMap</tt> instance).  It makes no guarantees as to the iteration order of the set; 
 * in particular, it does not guarantee that the order will remain constant over time.  This class permits the <tt>null</tt> element.
 * (This class implements the Set interface and is backed by a hash table, actually a HashMap; it does not guarantee the iteration order of the set.)
 * This class offers constant time performance for the basic operations(<tt>add</tt>, <tt>remove</tt>, <tt>contains</tt> and <tt>size</tt>),
 * assuming the hash function disperses the elements properly among the buckets.  
 * (The add, remove and contains operations of this class all run in constant time.)
 * Iterating over this set requires time proportional to the sum of the <tt>HashSet</tt> instance's size (the number of elements) plus the
 * "capacity" of the backing <tt>HashMap</tt> instance (the number of buckets).  
 * (Iterating over this set takes time proportional to the set's size, i.e. the number of stored elements, plus the capacity of the backing HashMap.)
 * Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.
 * (So if iteration performance matters, do not set the initial capacity too high, because that makes the backing HashMap large.)
 * <p><strong>Note that this implementation is not synchronized.</strong> If multiple threads access a hash set concurrently, 
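 * and at least one of the threads modifies the set, it <i>must</i> be synchronized externally.
 * (Note that this set is not synchronized: if multiple threads access it concurrently and at least one of them modifies it, it must be synchronized, i.e. locked, externally.)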
 * This is typically accomplished by synchronizing on some object that naturally encapsulates the set.
 * (This is usually done by synchronizing, i.e. locking, on an object that wraps access to the set.)
 * If no such object exists, the set should be "wrapped" using the {@link Collections#synchronizedSet Collections.synchronizedSet} method.
 * (If there is no such object, wrap the set with Collections.synchronizedSet; I wrote a separate article on the Collections utility class that you can refer to.)
 * This is best done at creation time, to prevent accidental unsynchronized access to the set:<pre> Set s = Collections.synchronizedSet(new HashSet(...));</pre>
 * (Do this at creation time to prevent accidental unsynchronized access, like so: Set s = Collections.synchronizedSet(new HashSet(...)))
 * <p>The iterators returned by this class's <tt>iterator</tt> method are <i>fail-fast</i>: if the set is modified at any time after the iterator is
 * created, in any way except through the iterator's own <tt>remove</tt> method, the Iterator throws a {@link ConcurrentModificationException}.
 * (We have explained fail-fast many times before, so we will not repeat it here; see the ArrayList section.)
 * Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.
 * <p>Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the
 * presence of unsynchronized concurrent modification.  Fail-fast iterators
 * throw <tt>ConcurrentModificationException</tt> on a best-effort basis.
 * Therefore, it would be wrong to write a program that depended on this exception for its correctness: <i>the fail-fast behavior of iterators should be used only to detect bugs.</i>
 * 
 * @author  Josh Bloch
 * @author  Neal Gafter
 * @see     Collection
 * @see     HashMap
 * @since   1.2
 */

public class HashSet<E>
    extends AbstractSet<E>
    implements Set<E>, Cloneable, java.io.Serializable
{ ... }
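
The synchronization advice in the Javadoc above can be tried out with a small sketch. It wraps a HashSet with Collections.synchronizedSet at creation time and fills it from two threads. The class name SynchronizedSetDemo and the helper fillConcurrently are hypothetical names chosen for this illustration, not part of the JDK.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class SynchronizedSetDemo {
    // Hypothetical helper: two threads each add `perThread` distinct integers.
    static int fillConcurrently(int perThread) {
        // Wrap at creation time so that no unsynchronized reference to the HashSet escapes.
        Set<Integer> set = Collections.synchronizedSet(new HashSet<>());
        Thread t1 = new Thread(() -> { for (int i = 0; i < perThread; i++) set.add(i); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < perThread; i++) set.add(perThread + i); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return set.size();
    }

    public static void main(String[] args) {
        // Every add went through the synchronized wrapper, so no element is lost.
        System.out.println(fillConcurrently(10_000)); // prints 20000
    }
}
```

Note that the wrapper only synchronizes individual method calls. Iterating over it still requires a manual synchronized (set) { ... } block around the loop.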

1. HashSet constructors


First, let's look at all of them together. We will go through each constructor one by one afterwards.

private transient HashMap<E,Object> map;
// Default constructor
public HashSet() {
    map = new HashMap<>();
}
// Constructor that adds the elements of the given collection to the HashSet
public HashSet(Collection<? extends E> c) {
    map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
    addAll(c);
}
// Constructor that specifies only the initial capacity (load factor defaults to 0.75)
public HashSet(int initialCapacity) {
    map = new HashMap<>(initialCapacity);
}

// Constructor that specifies both the initial capacity and the load factor
public HashSet(int initialCapacity, float loadFactor) {
    map = new HashMap<>(initialCapacity, loadFactor);
}
// Package-private constructor used only by LinkedHashSet; the dummy parameter is ignored
HashSet(int initialCapacity, float loadFactor, boolean dummy) {
    map = new LinkedHashMap<>(initialCapacity, loadFactor);
}

From the source code above and the class Javadoc, we can see that HashSet is a shell company: it takes work from the outside and hands it straight to HashMap. Because the underlying implementation is HashMap, here is a brief overview.

HashMap stores its data in an array plus linked lists or red-black trees. Roughly, the hash function determines the position in the array. If that position already holds an entry, the keys are compared: an equal key has its value overwritten, while a different key is appended to the linked list at that position. Once a list grows beyond 8 nodes it is converted to a red-black tree (in JDK 8 and later, and only if the table has at least 64 buckets; otherwise the table is resized instead). When capacity runs short, the table is expanded. (Note: this is only the general process.)
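
To make the "calculate the storage position" step concrete, here is a minimal sketch of how a key is mapped to a bucket. It mirrors the hash-spreading step and the (n - 1) & hash index computation of JDK 8+ HashMap, but the class BucketIndexDemo and its method names are hypothetical and written only for illustration.

```java
class BucketIndexDemo {
    // Mirrors the spreading step of JDK 8+ HashMap: XOR the high 16 bits into the low 16 bits.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // Bucket index for a table of length n; n is assumed to be a power of two.
    static int indexFor(Object key, int n) {
        return (n - 1) & spread(key);
    }

    public static void main(String[] args) {
        // "a".hashCode() is 97, and 97 & 15 is 1.
        System.out.println(indexFor("a", 16));  // prints 1
        // A null key always lands in bucket 0.
        System.out.println(indexFor(null, 16)); // prints 0
    }
}
```

Because n is a power of two, (n - 1) & hash keeps only the low bits of the hash, which is why the high bits are mixed in first.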

No-argument constructor


The default constructor is also the most commonly used one.

/**
 * Constructs a new, empty set; the backing <tt>HashMap</tt> instance has default initial capacity (16) and load factor (0.75).
 * (Creates a new, empty set; the backing HashMap instance uses the default initial capacity and load factor, 16 and 0.75 respectively.)
 */
public HashSet() {
    map = new HashMap<>();
}

Constructor from a collection

/**
 * Constructs a new set containing the elements in the specified collection.  The <tt>HashMap</tt> is created with default load factor (0.75) and 
 * an initial capacity sufficient to contain the elements in the specified collection.
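 * (Creates a new set containing all elements of the given collection. The backing HashMap still uses the default load factor 0.75, but its initial capacity is large enough to hold the collection: at least 16, and rounded up by HashMap to a power of two. See the HashMap article for more details.)
 * @param c the collection whose elements are to be placed into this set
 * @throws NullPointerException if the specified collection is null (that is, when c itself is null, not merely empty)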
 */
public HashSet(Collection<? extends E> c) {
    map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
    addAll(c);
}
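
The capacity arithmetic in this constructor can be worked through with a small sketch. requestedCapacity copies the expression from the constructor above; roundUpToPowerOfTwo is a hypothetical helper that assumes HashMap rounds the requested capacity up to a power of two (capped at 2^30), and is not the JDK's own code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CopyConstructorDemo {
    // The capacity that HashSet(Collection) requests from HashMap.
    static int requestedCapacity(int size) {
        return Math.max((int) (size / .75f) + 1, 16);
    }

    // Assumption: imitates HashMap rounding a requested capacity up to a power of two.
    static int roundUpToPowerOfTwo(int cap) {
        int n = 1;
        while (n < cap && n < (1 << 30)) {
            n <<= 1;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> days = Arrays.asList("Mon", "Tue", "Wed", "Mon");
        Set<String> set = new HashSet<>(days);
        System.out.println(set.size());                                  // prints 3, the duplicate "Mon" is dropped
        System.out.println(requestedCapacity(100));                      // prints 134
        System.out.println(roundUpToPowerOfTwo(requestedCapacity(100))); // prints 256
    }
}
```

With 256 buckets and a load factor of 0.75 the resize threshold is 192, so copying 100 elements never triggers a resize.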

Constructors that specify the initial capacity


There are actually three constructors below, but I group them into one category because they all specify the initial capacity.


Only the initial capacity is specified

/**
 * Constructs a new, empty set; the backing <tt>HashMap</tt> instance has the specified initial capacity and default load factor (0.75).
 * (Creates a new, empty set with the default load factor and an initial capacity based on the parameter.)
 * @param      initialCapacity   the initial capacity of the hash table
 * @throws     IllegalArgumentException if the initial capacity is less than zero (thrown when the argument is negative)
 */
public HashSet(int initialCapacity) {
    map = new HashMap<>(initialCapacity);
}

Note the wording "an initial capacity based on the parameter": HashMap does not use the value as is, but rounds it up to a power of two. For the reason, see the in-depth analysis of HashMap.

/**
 * Constructs a new, empty set; the backing <tt>HashMap</tt> instance has the specified initial capacity and the specified load factor.
 * (Creates a new, empty set with the specified load factor and an initial capacity based on the parameter.)
 * @param      initialCapacity   the initial capacity of the hash map
 * @param      loadFactor        the load factor of the hash map
 * @throws     IllegalArgumentException if the initial capacity is less than zero, or if the load factor is nonpositive (that is, the load factor or initial capacity is invalid)
 */
public HashSet(int initialCapacity, float loadFactor) {
    map = new HashMap<>(initialCapacity, loadFactor);
}

dummy


We will explain this in detail later in the LinkedHashSet article. In short, LinkedHashSet passes true for this parameter, and the map field then refers to a LinkedHashMap rather than a plain HashMap.

/**
 * Constructs a new, empty linked hash set.  (This package private constructor is only used by LinkedHashSet.) The backing HashMap instance is a LinkedHashMap with the specified initial
 * capacity and the specified load factor.
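 * (Apart from the ignored dummy parameter this is the same as before. This constructor is mainly used by LinkedHashSet, and note that the map is now a LinkedHashMap rather than a plain HashMap.)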
 * @param      initialCapacity   the initial capacity of the hash map
 * @param      loadFactor        the load factor of the hash map
 * @param      dummy             ignored (distinguishes this constructor from other int, float constructor.)
 * @throws     IllegalArgumentException if the initial capacity is less
 *             than zero, or if the load factor is nonpositive
 */
HashSet(int initialCapacity, float loadFactor, boolean dummy) {
    map = new LinkedHashMap<>(initialCapacity, loadFactor);
}
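
The effect of backing the set with a LinkedHashMap can be observed from the outside. The following sketch (class and method names are hypothetical) shows that LinkedHashSet, which reaches the constructor above through its own public constructors, iterates in insertion order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class LinkedHashSetOrderDemo {
    // Returns the elements in the order a LinkedHashSet iterates over them.
    static List<String> insertionOrder(String... items) {
        LinkedHashSet<String> set = new LinkedHashSet<>(Arrays.asList(items));
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        // The second "a" is a duplicate and does not change the position of the first one.
        System.out.println(insertionOrder("c", "a", "b", "a")); // prints [c, a, b]
    }
}
```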

Important fields of HashSet


map


As mentioned earlier, HashSet is a shell company, and this is the big boss behind it, the one who really does the work. All of a HashSet's data is stored in this HashMap.

private transient HashMap<E,Object> map;

PRESENT

// Dummy value to associate with an Object in the backing Map
private static final Object PRESENT = new Object();

This value is interesting. HashMap accepts key-value pairs, so every time you add an element to a HashSet, it pairs your argument e with this object to form the entry (e, PRESENT) and hands that to HashMap.


OMG, I have never seen such a shameless data structure. It calls itself HashSet and puts itself on the same level as HashMap, yet it fools users on the outside and cheats HashMap on the inside by giving every key the same value. Isn't it just an empty eggshell?


2. Common methods of HashSet


1. add method

public static void main(String[] args) {
    HashSet<String> hashSet = new HashSet<>();
    hashSet.add("a");
}

HashSet's add method is implemented with HashMap's put method. But HashMap stores key-value pairs, while HashSet is a collection of single elements, so how are they stored? Let's look at the source code.

// Dummy value to associate with an Object in the backing Map
private static final Object PRESENT = new Object();

/**
 * Adds the specified element to this set if it is not already present. If this set already contains the element, the call leaves the set unchanged and returns <tt>false</tt>.
 * (Adds the element only if it is not yet in the set; otherwise the set is left unchanged and false is returned.)
 * @param e element to be added to this set
 * @return <tt>true</tt> if this set did not already contain the specified element (true if absent, false if already present)
 */
public boolean add(E e) {
    // map.put(e, PRESENT) returns the old value, or null if there was none
    return map.put(e, PRESENT)==null;
}

From the source code we can see that the elements added to a HashSet are stored as keys of the HashMap, and every value is the shared constant PRESENT, an empty placeholder object. When I first saw this I could not help complaining: why not just use null? HashMap supports null for both key and value, and here it is only the value. However, null would not work with this implementation: add treats a null return from put as "the key was absent", so with null values a repeated add could not be told apart from a first add. The shared PRESENT object costs almost nothing and keeps that check unambiguous.

One more point concerns the return value. HashMap.put() returns the old value, which is null when there was none, and HashSet decides what to return based on whether that old value is null: a non-null old value means the element already existed, so add returns false. A question worth considering is why HashSet does not check whether the element exists first and add it only when it does not. Wouldn't that be more reasonable? Discussion is welcome.

HashMap does not check for existence first because its values are meaningful and need to be updated. But what about HashSet?

For the put method of the map, see the in-depth analysis of HashMap.
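
Whether null could stand in for PRESENT can be checked directly against HashMap. The sketch below applies the same "== null" test that add uses, once with a placeholder object and once with null. The class PresentVsNullDemo and its own PRESENT field are hypothetical stand-ins written for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;

class PresentVsNullDemo {
    // Stand-in for HashSet's private PRESENT constant.
    static final Object PRESENT = new Object();

    // Puts the same key twice and reports, for each put, whether "put(...) == null" held.
    static boolean[] addTwice(Object value) {
        HashMap<String, Object> map = new HashMap<>();
        boolean first = map.put("a", value) == null;
        boolean second = map.put("a", value) == null;
        return new boolean[] { first, second };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(addTwice(PRESENT))); // prints [true, false]
        System.out.println(Arrays.toString(addTwice(null)));    // prints [true, true]
    }
}
```

With null as the value, the second put also returns null, so the test used by add would report the duplicate as newly added.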
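
As a small sketch of the return-value discussion above, the following compares add with a hypothetical "check first, then add" helper. Both give the same answer, but the helper performs two hash lookups (contains, then add) where add alone needs one, which is one plausible reason for the design.

```java
import java.util.HashSet;
import java.util.Set;

class AddReturnValueDemo {
    // Hypothetical alternative: check for existence first, then add.
    static <E> boolean checkThenAdd(Set<E> set, E e) {
        if (set.contains(e)) {
            return false;
        }
        set.add(e);
        return true;
    }

    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        System.out.println(set.add("a"));           // prints true, "a" was absent
        System.out.println(set.add("a"));           // prints false, "a" is already present
        System.out.println(checkThenAdd(set, "a")); // prints false
        System.out.println(checkThenAdd(set, "b")); // prints true
        System.out.println(set.size());             // prints 2
    }
}
```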


Of course there are variants of add as well, such as addAll(Collection<? extends E> c).

2. The remove method

The remove method of HashSet is implemented by the remove method of HashMap

/**
 * Removes the specified element from this set if it is present. More formally, removes an element if this set contains such an element.  Returns <tt>true</tt> if this set contained the element.
 * (If the set contains the element, remove deletes it and returns true.)
 * @param o object to be removed from this set, if present (removed only if it exists)
 * @return <tt>true</tt> if the set contained the specified element
 */
public boolean remove(Object o) {
    // Delegates to HashMap's remove method
    return map.remove(o)==PRESENT;
}
// HashMap's remove method
public V remove(Object key) {
    Node<K,V> e;
    // Locate the element's bucket via hash(key), then call removeNode to delete it
    return (e = removeNode(hash(key), key, null, false, true)) == null ? null : e.value;
}

final Node<K,V> removeNode(int hash, Object key, Object value,
                           boolean matchValue, boolean movable) {
    Node<K,V>[] tab; Node<K,V> p; int n, index;
    // Step 1: find the exact Node for the key. First use (n - 1) & hash to get the first node in the corresponding bucket
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (p = tab[index = (n - 1) & hash]) != null) {
        Node<K,V> node = null, e; K k; V v;
        // 1.1 If this node has the same key, we are lucky: found it
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            node = p;
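        /**
         * 1.2 Out of luck: the first Node in the bucket has a matching position but a different key,
         *     so it is not the one we want and we must keep searching further down.
         */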
        else if ((e = p.next) != null) {
            // 1.2.1 If it is a TreeNode, this bucket is stored as a red-black tree; search the tree for the node
            if (p instanceof TreeNode)
                node = ((TreeNode<K,V>)p).getTreeNode(hash, key);
            else {
                // 1.2.2 If it is a linked list, walk the list to find the node
                do {
                    if (e.hash == hash &&
                        ((k = e.key) == key ||
                         (key != null && key.equals(k)))) {
                        node = e;
                        break;
                    }
                    p = e;
                } while ((e = e.next) != null);
            }
        }
        // Step 1 has found the Node; now we need to delete it
        if (node != null && (!matchValue || (v = node.value) == value ||
                             (value != null && value.equals(v)))) {
            /**
             * If it is a TreeNode, deletion is done by red-black tree node removal; see
             * "TreeMap principle implementation and common methods" for details.
             */
            if (node instanceof TreeNode)
                ((TreeNode<K,V>)node).removeTreeNode(this, tab, movable);
            /**
             * Linked-list case: if the node found is the first element in the bucket, simply point
             * the bucket's slot at the next node in the list.
             */
            else if (node == p)
                tab[index] = node.next;
            /**
             * If the node is further along the list, it is just as simple: point the previous node's
             * next at the removed node's next, which unlinks the node to be removed.
             */
            else
                p.next = node.next;
            ++modCount;
            --size;
            // Removal may require structural adjustments; see the remove method in "How LinkedHashMap guarantees order"
            afterNodeRemoval(node);
            return node;
        }
    }
    return null;
}

For the implementation of removeTreeNode, see "TreeMap principle implementation and common methods".

For the implementation of afterNodeRemoval, see the in-depth analysis of LinkedHashMap. In HashMap itself it is an empty hook method, so here it does nothing.
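
From the caller's side, all of this machinery shows up only as the boolean returned by remove. A minimal sketch, with hypothetical class and method names:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class RemoveReturnValueDemo {
    // Removes the same element twice and reports both return values.
    static boolean[] removeTwice(Set<String> set, String element) {
        return new boolean[] { set.remove(element), set.remove(element) };
    }

    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        set.add("a");
        set.add(null); // HashSet permits the null element
        System.out.println(Arrays.toString(removeTwice(set, "a")));       // prints [true, false]
        System.out.println(Arrays.toString(removeTwice(set, null)));      // prints [true, false]
        System.out.println(Arrays.toString(removeTwice(set, "missing"))); // prints [false, false]
    }
}
```

The first call returns true because map.remove hands back the stored PRESENT value; the second finds no entry, so map.remove returns null and the comparison with PRESENT fails.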

3. Traversal

The ordering problem

As a collection, HashSet can be traversed in several ways, such as the enhanced for loop and iterators. (It has no index-based access, so an ordinary indexed for loop does not apply.) Let's look at iterator traversal.

public static void main(String[] args) {
    HashSet<String> setString = new HashSet<> ();
    setString.add("星期一");
    setString.add("星期二");
    setString.add("星期三");
    setString.add("星期四");
    setString.add("星期五");

    Iterator<String> it = setString.iterator();
    while (it.hasNext()) {
        System.out.println(it.next());
    }
}

What is the printed result?

星期二
星期三
星期四
星期五
星期一

As expected. HashSet is implemented on top of HashMap, and HashMap decides where to store an entry from hash(key), which has nothing to do with insertion order. Therefore HashSet does not iterate over its elements in insertion order.
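
For comparison, here is a sketch of three ways to traverse the same set: the enhanced for loop, an explicit iterator, and the forEach method available since Java 8. The class and method names are hypothetical. On an unmodified set all three visit the elements in the same order, whatever that order happens to be.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class TraversalDemo {
    static List<String> viaEnhancedFor(Set<String> set) {
        List<String> out = new ArrayList<>();
        for (String s : set) {
            out.add(s);
        }
        return out;
    }

    static List<String> viaIterator(Set<String> set) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = set.iterator();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    static List<String> viaForEach(Set<String> set) {
        List<String> out = new ArrayList<>();
        set.forEach(out::add);
        return out;
    }

    public static void main(String[] args) {
        Set<String> days = new HashSet<>();
        days.add("星期一");
        days.add("星期二");
        days.add("星期三");
        // The three traversals agree with each other, but not necessarily with insertion order.
        System.out.println(viaEnhancedFor(days).equals(viaIterator(days))); // prints true
        System.out.println(viaIterator(days).equals(viaForEach(days)));     // prints true
    }
}
```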

The fail-fast problem

I have already demonstrated this for other collections, but to stress the point here is another example. The typical scenario is that you want to modify the collection, depending on some condition, while traversing it.

@Test
public void iterator() {
    HashSet<String> setString = new HashSet<> ();
    setString.add("星期一");
    setString.add("星期二");
    setString.add("星期三");
    setString.add("星期四");
    setString.add("星期五");
    System.out.println(setString.size());
    Iterator<String> it = setString.iterator();
    while (it.hasNext()) {
        String tmp=it.next();
        if (tmp.equals("星期三")){
             setString.remove(tmp);
        }
        System.out.println(tmp);
    }
    System.out.println(setString.size());
}

Output:

5
星期二
星期三
Exception in thread "main" java.util.ConcurrentModificationException
	at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445)
	at java.util.HashMap$KeyIterator.next(HashMap.java:1469)
	at datastructure.java数据类型.hash.JavaHashSet.main(JavaHashSet.java:17)

You only need a small change: remove the element through the iterator itself.

@Test
public void iterator() {
    HashSet<String> setString = new HashSet<> ();
    setString.add("星期一");
    setString.add("星期二");
    setString.add("星期三");
    setString.add("星期四");
    setString.add("星期五");
    System.out.println(setString.size());
    Iterator<String> it = setString.iterator();
    while (it.hasNext()) {
        String tmp=it.next();
        if (tmp.equals("星期三")){
            it.remove();
        }
        System.out.println(tmp);
    }
    System.out.println(setString.size());
}

Output:

5
星期二
星期三
星期四
星期五
星期一
4
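
Since Java 8 the same conditional removal can also be written with Collection.removeIf, which avoids handling the iterator by hand. This is offered as an alternative sketch rather than as the approach used in this article, and the class name RemoveIfDemo is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class RemoveIfDemo {
    // Builds the same five-day set as the examples above and removes "星期三" (Wednesday).
    static Set<String> withoutWednesday() {
        Set<String> days = new HashSet<>(
                Arrays.asList("星期一", "星期二", "星期三", "星期四", "星期五"));
        // removeIf removes every element that matches the predicate, without a ConcurrentModificationException.
        days.removeIf(day -> day.equals("星期三"));
        return days;
    }

    public static void main(String[] args) {
        Set<String> days = withoutWednesday();
        System.out.println(days.size());             // prints 4
        System.out.println(days.contains("星期三")); // prints false
    }
}
```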

3. Summary

HashSet is really a data structure born of a specific use case. It has almost no implementation of its own, and all of its functionality is provided by HashMap. In this article we also raised a question about the add method: it does not check first whether the element exists, but calls HashMap's put directly and derives its result from put's return value.

HashSet performance

Note that capacity affects HashSet's iteration performance, because iteration time depends on both the number of elements actually stored and the capacity of the backing table.


Origin blog.csdn.net/a159357445566/article/details/115210866