Disjoint Sets

The ADT of disjoint sets is like:

public interface DisjointSets {
    void connect(int p, int q);
    boolean isConnected(int p, int q); 
}

In this interface, p and q represent the p-th and q-th items, respectively.

Implementation

This section introduces some variants of implementations of disjoint sets.

The initial idea for implementing disjoint sets is to maintain a list of sets, e.g, List<Set<Integer>>. For the connect(item_1, item_2) operation, we have to iterate through up to $N$ sets to find item_1 and item_2, the time complexity of which is $O(N)$ . It is also $O(N)$ to perform isConnected(item_1, item_2).

Quick Find

In this implementation, the disjoint sets are represented by an array id, where id[x] is the ID of the set to which item x belongs.

int[] id = {4, 4, 4, 5, 4, 5, 6};

/* Need to iterate through the array => Θ(N) */
public void connect(int p, int q) {
  int pid = id[p];
  int qid = id[q];
  for (int i = 0; i < id.length; i++){
    if (id[i] == pid){
      id[i] = qid;
    }
  }
}

Quick find means that we can quickly find out if two items are in the same set.

/* Θ(1) */
public boolean isConnected(int p, int q){
  return (id[p] == id[q]);
}

The following implementations manage the disjoint sets with an array of trees.

Quick Union

Assign each item the index of its parent. If an item has no parent, then it is a 'root' and we assign it a negative value.

To connect two items, we find the set that each item belongs to (the roots of their respective trees) with a helper method find and make one the child of the other.

connect and isConnect are both $O(N)$ in this case. It seems that QuickUnion is worse than QuickFind, where the isConnect is $O(1)$ . However, $O(N)$ indicates an upper bound and the worst case is depicted below. When our trees are balanced, both connect and isConnected perform reasonably well.

Weighted Quick Union

New rule: whenever we call connect, we always link the root of the smaller (in terms of the number of nodes) tree to the larger tree.

Following the above rule ensures that the maximum height of any tree is $\Theta (\log N)$ . $N$ is the number of elements in our Disjoint sets. By extension, the runtimes of connect and isConnected are bounded by $O(\log N)$ .

We can store the size information in the root of the tree by replacing the -1's with -(size of tree).

Imagine any element $x$ in tree $T1$ . The depth of $x$ increases by 1 only when $T1$ is placed below another tree $T2$ . When that happens, the size of the resulting tree will be at least double the size of $T1$ because $\text{size}(T2)\geq\text{size}(T1)$ . The tree with $x$ can double at most $\log_2{N}$ times until we've reached a total of $N$ items ( $2^{\log_2{N}}$ ). So we can double up to $\log_2N$ times and each time, our tree adds a level → maximum $\log_2N$ levels.

Weighted Quick Union with Path Compression

Observation: whenever we call find(x) (to find the ancestor of x) we have to traverse the path from x to root. Along the way we can connect all the items we visit to their root at no extra asymptotic cost.

Now the complexity of connect and isConnect is $O(\log ^* N)$ and that is pretty close to constant.

$\log* (n)$ is known as "Iterated logarithm".
$\log*(n) = \# \text{ of } \log \text{ s.t. } \log(\log(\log\dots(\log(n)))) = 0$

PreviousData-Structures-and-Algorithms NextPriority Queues and Heaps

Last updated 10 months ago