Hash Tables and Clustering


Given a table of size N:

What is the probability of forming a cluster with 0 items in the table? Obviously, 0%.

The first item inserted into the empty table has a 1/N chance of landing in any particular one of the N slots.


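Now, the chance of forming a cluster by inserting a second item is exactly 3/N, as there are 3 possible hash slots that would cause a cluster to form: the slot just before the first item, the slot just after it, and the occupied slot itself (assuming linear probing, where a collision moves the new item to the next free slot).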

Of course, the insertion may not form a cluster. The item may land one slot further away, leaving exactly one empty slot between it and the first item. There are 2 such slots (one on each side), so the chance of this occurring is exactly 2/N.
Finally, the chance that the item lands more than 1 slot away is exactly (N - 5)/N. This is because 5 slots (the occupied slot, its two neighbors, and the two slots one step beyond those) fall into one of the two categories above, which leaves N - 5 of the N slots.
Given that we now have 2 items in the table, we could have any one of these three arrangements:
Arrangement #1: no intervening slots (the two items are adjacent)
Arrangement #2: exactly one intervening slot
Arrangement #3: at least 2 intervening slots (any arrangement that has 2 or more slots between the two items)
So the probability of arrangement #1 is 3/N, the probability of arrangement #2 is 2/N, and the probability of arrangement #3 is (N - 5)/N. These three cases cover all of the possible arrangements of a table with two items, and their probabilities sum to 1.
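The three probabilities above can be checked by brute force. The following is a minimal sketch, assuming linear probing on a table that wraps around at the end; the helper names `insert` and `second_item_distribution` are illustrative and not part of the original notes.

```python
from fractions import Fraction

def insert(table, home):
    """Linear probing (assumed): use the first free slot at or after `home`."""
    n = len(table)
    slot = home
    while table[slot]:
        slot = (slot + 1) % n
    table[slot] = True
    return slot

def second_item_distribution(n, first=0):
    """Try every hash slot for the second item and classify the resulting gap."""
    counts = {"adjacent": 0, "one_gap": 0, "far": 0}
    for home in range(n):
        table = [False] * n
        table[first] = True
        slot = insert(table, home)
        d = (slot - first) % n
        gap = min(d, n - d) - 1  # empty slots between the items (wrap-around table)
        if gap == 0:
            counts["adjacent"] += 1
        elif gap == 1:
            counts["one_gap"] += 1
        else:
            counts["far"] += 1
    return {name: Fraction(c, n) for name, c in counts.items()}

dist = second_item_distribution(20)
```

For a table of 20 slots, `dist` holds 3/20, 2/20 and 15/20, matching 3/N, 2/N and (N - 5)/N.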


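Now, insert a third item. For each of the arrangements above, count the hash slots that would cause the new item to form or extend a cluster:

Arrangement #1 has probability: 4/N (the 2 occupied slots plus the free slot on either side)
Arrangement #2 has probability: 5/N (the 2 occupied slots, the slot between them, and the free slot on either side)
Arrangement #3 has probability: 6/N (3 slots around each of the two separate items)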
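The slot counts 4, 5 and 6 can be confirmed by trying every hash slot for the third item. This is a small sketch under the same assumptions as the text (linear probing, wrap-around table); `cluster_forming_homes` is a hypothetical helper, and "forms or extends a cluster" is taken to mean that the new item lands next to an existing one.

```python
def insert(table, home):
    """Linear probing (assumed): use the first free slot at or after `home`."""
    n = len(table)
    slot = home
    while table[slot]:
        slot = (slot + 1) % n
    table[slot] = True
    return slot

def cluster_forming_homes(n, occupied):
    """Count hash slots for which the new item ends up beside an existing item."""
    count = 0
    for home in range(n):
        table = [i in occupied for i in range(n)]
        slot = insert(table, home)
        if table[(slot - 1) % n] or table[(slot + 1) % n]:
            count += 1
    return count

n = 20
arrangement_1 = cluster_forming_homes(n, {5, 6})   # adjacent items
arrangement_2 = cluster_forming_homes(n, {5, 7})   # one slot between
arrangement_3 = cluster_forming_homes(n, {5, 10})  # two or more slots between
```

Dividing each count by `n` gives the probabilities listed for the three arrangements.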
The probability of being in arrangement 1 before item 3 is inserted is 3/N, so the probability of starting in arrangement 1 and having item 3 extend the cluster is:
 3     4     12
--- x --- = ----
 N     N    N^2
The probability of being in arrangement 2 before item 3 is inserted is 2/N, so the probability of starting in arrangement 2 and having item 3 form a cluster is:
 2     5     10
--- x --- = ----
 N     N    N^2
The probability of being in arrangement 3 before item 3 is inserted is (N - 5)/N, so the probability of starting in arrangement 3 and having item 3 form a cluster is:
(N - 5)    6    6N - 30
------- x --- = -------
   N       N      N^2
Adding up the probabilities:
 12     10    6N - 30   6N - 8      6     8
---- + ---- + ------- = ------- = --- - ---
N^2    N^2      N^2       N^2      N    N^2
For large N, 8/N^2 is negligible compared with 6/N, so the result is approximately 6/N.
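The sum above can be verified exactly by enumerating every pair of hash slots for items 2 and 3. The sketch below assumes linear probing on a wrap-around table with the first item fixed at slot 0; the function name is illustrative only.

```python
from fractions import Fraction

def insert(table, home):
    """Linear probing (assumed): use the first free slot at or after `home`."""
    n = len(table)
    slot = home
    while table[slot]:
        slot = (slot + 1) % n
    table[slot] = True
    return slot

def third_insertion_cluster_probability(n):
    """Exact chance that item 3 lands beside an existing item."""
    hits = 0
    for h2 in range(n):          # every hash slot for item 2
        for h3 in range(n):      # every hash slot for item 3
            table = [False] * n
            table[0] = True      # item 1 (its position does not matter on a ring)
            insert(table, h2)
            slot = insert(table, h3)
            if table[(slot - 1) % n] or table[(slot + 1) % n]:
                hits += 1
    return Fraction(hits, n * n)

n = 20
exact = third_insertion_cluster_probability(n)
formula = Fraction(6, n) - Fraction(8, n * n)
```

For `n = 20` both values are 7/25 (0.28), slightly below the approximation 6/N = 0.30.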

We see that the chance of forming or extending a cluster when inserting the third item is about twice the chance of forming a cluster when inserting the second item (roughly 6/N versus 3/N). This is the basic idea behind Knuth's formulas, without all of the non-trivial proofs and mathematics.

In general, we can see the probabilities of growing an existing cluster:

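Chances of growing a 2-cluster into a 3-cluster: 4/N
Chances of growing a 3-cluster into a 4-cluster: 5/N
Chances of growing a 4-cluster into a 5-cluster: 6/N
In general, the chances of growing a k-cluster are (k + 2)/N: the k occupied slots plus the free slot on either side.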
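The pattern in this list can be checked for any cluster size. The following sketch assumes linear probing, a wrap-around table, and a single cluster with free slots on both sides; `growth_probability` is a hypothetical helper written for this illustration.

```python
from fractions import Fraction

def growth_probability(n, k, start=1):
    """Chance that one insertion extends a k-cluster occupying start..start+k-1."""
    cluster = {(start + i) % n for i in range(k)}
    before = (start - 1) % n   # free slot just before the cluster
    after = (start + k) % n    # free slot just after the cluster
    hits = 0
    for home in range(n):
        slot = home
        while slot in cluster:  # linear probing past occupied slots
            slot = (slot + 1) % n
        if slot in (before, after):
            hits += 1
    return Fraction(hits, n)

n = 50
table_of_chances = {k: growth_probability(n, k) for k in range(2, 7)}
```

Each entry of `table_of_chances` equals (k + 2)/N for the chosen `n`, so the chance of growth rises by 1/N with every item the cluster absorbs.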

It becomes more and more likely that each element inserted will form a new or larger cluster, causing search performance to degrade rapidly.
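To see this trend on a concrete table, the sketch below inserts items at random hash slots and records, before each insertion, the fraction of hash slots that would form or extend a cluster. It assumes linear probing, a wrap-around table and uniformly random hash values; the names `risky_fraction` and `simulate` and the fixed seed are illustrative choices.

```python
import random

def insert(table, home):
    """Linear probing (assumed): use the first free slot at or after `home`."""
    n = len(table)
    slot = home
    while table[slot]:
        slot = (slot + 1) % n
    table[slot] = True
    return slot

def risky_fraction(table):
    """Fraction of hash slots that are occupied or next to an occupied slot."""
    n = len(table)
    risky = sum(
        1 for i in range(n)
        if table[i] or table[(i - 1) % n] or table[(i + 1) % n]
    )
    return risky / n

def simulate(n, items, seed=0):
    """Record the risky fraction before each of `items` insertions (items <= n)."""
    rng = random.Random(seed)
    table = [False] * n
    history = []
    for _ in range(items):
        history.append(risky_fraction(table))
        insert(table, rng.randrange(n))
    return history

history = simulate(100, 50)
```

The first two entries of `history` are 0 and 3/N, as derived at the start of this section, and the sequence never decreases because every insertion can only add to the set of occupied slots and their neighbors.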