What is the most efficient way to generate random numbers in a range, without duplicates?
First let's define the problem completely. Given:
rand(m, n)
  - returns a uniformly distributed random integer in the range [m, n]
[a, b]
  - a range from which to select the numbers (where a < b)
N
  - the quantity of numbers to generate
s = b - a
  - the target range size

Note that s >= N is also given, or the problem is ill-defined ;)
The problem is to generate N uniformly distributed random numbers in [a, b] without producing duplicate numbers — that is, the same number z does not appear twice in the output list.
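Throughout, the snippets below assume rand(m, n) exists exactly as defined above. If you want to run them as real Python, a minimal backing (my assumption here, not part of the problem statement) is the standard library's random.randint, which is likewise inclusive on both ends:

import random

def rand(m, n):
    # Uniformly distributed random integer in [m, n], inclusive on both ends.
    return random.randint(m, n)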
There are two immediately obvious solutions which work to different degrees depending on the exact circumstances. We'll be using Python to describe our proposed solutions, and provide a simple benchmarking/test harness at the end. However, we may discuss alternative implementation choices such as using a bitset instead of an array — the source is only for illustration.
Solution One
numbers = []
while len(numbers) < N:
    num = rand(a, b)
    if num not in numbers:
        numbers.append(num)
return numbers
This solution works by picking a number, and then comparing it against the previously generated list. If we've already chosen it, we discard and pick again. This solution works well when N is small compared to the size of the range s. It runs into issues when N approaches s, where it becomes very unlikely to generate an unused number. When N = 2 and s = 100, the odds of only drawing two numbers total are roughly 99%. In a tighter case, when N = 2 and s = 3, the odds of drawing only two numbers fall to 75%, since the second draw must avoid one of the four possible values. In cases where (s - N) < N, it is better to choose the numbers we won't return instead of the ones we will, and then do a linear set difference.
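Here is a rough sketch of that complement trick (my own illustration, not one of the benchmarked solutions): reject-sample the values we will exclude, then keep everything else. Note the survivors come back in sorted order, so shuffle them afterwards if the output order itself must be random.

# Complement trick, applicable when (s - N) < N.
excluded = set()
while len(excluded) < (b - a + 1) - N:
    excluded.add(rand(a, b))  # re-adding a duplicate simply doesn't grow the set
return [x for x in range(a, b + 1) if x not in excluded]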
Depending on how the num not in numbers check is implemented, this algorithm may require a linear scan through the numbers variable, resulting in at least O(N^2) run time. If a separate bitset is kept around for O(1) duplicate checking, then the space required by this algorithm goes from O(N) with a low constant (~1) — the minimum required — to O(N + s).
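For example, solution one can keep a parallel structure purely for the membership test. The sketch below uses a Python set, which gives an O(1) average-case check at the cost of O(N) extra entries; a true bitset over [a, b] would behave the same way but cost O(s) bits instead.

numbers = []
seen = set()  # fast duplicate checks alongside the ordered output list
while len(numbers) < N:
    num = rand(a, b)
    if num not in seen:
        seen.add(num)
        numbers.append(num)
return numbers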
This algorithm gets worse very quickly. Due to the same cause as the Birthday Paradox, the odds of getting duplicates grow very quickly as N approaches s. A complete analysis is still forthcoming.
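To put a rough number on that in the meantime (a back-of-the-envelope estimate, not the full analysis): with k distinct numbers already drawn, the next draw is new with probability (s + 1 - k) / (s + 1), so it takes (s + 1) / (s + 1 - k) attempts on average. Summing over k gives the expected total number of calls to rand:

def expected_draws(N, s):
    # Expected number of rand() calls made by solution one, with s = b - a.
    return sum((s + 1) / (s + 1 - k) for k in range(N))

# expected_draws(2, 100)   -> about 2.01 calls
# expected_draws(100, 100) -> about 424 calls to produce 100 numbers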
Solution Two
possible = [x for x in range(a, b + 1)]
numbers = []
while len(numbers) < N:
    num = rand(0, len(possible) - 1)
    numbers.append(possible[num])
    del possible[num]
return numbers
This solution works by tracking the set of numbers that are still possible to generate, and choosing one each iteration. This solution always chooses exactly N random numbers, and runs in a deterministic amount of time. For situations where N is close to s, it can run much more quickly than solution one. The main cost involved with this algorithm is maintaining the possible variable. Worst case, if possible is an array, then the cost of removing an element each iteration is roughly s, making this an O(N * s) algorithm.
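As an aside, if you just need this in Python rather than as an exercise, the standard library's random.sample already draws N distinct elements from a population, and makes a handy baseline (it is not one of the four solutions benchmarked here):

import random

numbers = random.sample(range(a, b + 1), N)  # N distinct values from [a, b]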
Solution Three
arr = [x for x in range(a, b + 1)]
# Fisher-Yates shuffle
for i in range(0, len(arr) - 1):
    j = rand(i, len(arr) - 1)
    # swap the i-th and j-th elements
    arr[i], arr[j] = arr[j], arr[i]
return arr[0:N]  # return the first N elements
This solution works by generating a random bag, and then returning the first few required elements. This takes O(s) space, and the shuffle takes O(s) time. When N is close to s this is an optimal solution. When N == s we must use O(N) space, and because we generate O(N) numbers, our time complexity must be at least that as well. When N << s, however, this solution wastes a lot of effort shuffling extraneous elements of arr.
You may notice that in a Fisher-Yates shuffle, once i advances past an element, arr[i] is never touched again. This means that we can stop once we have shuffled the first N positions and return just those (sketched below). This removes the need to shuffle extra elements, but the time complexity remains at O(s) due to the initial arr construction. In addition, the O(s) space requirement can be prohibitive when s is large. Imagine trying to select 10 random titles out of a music library of 100k songs — we're wasting four orders of magnitude of space compared to solution one.
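Here is what that cut-short shuffle looks like: the same code as solution three, with the loop bounded by N. It still pays the O(s) construction and space cost noted above.

arr = [x for x in range(a, b + 1)]
# Partial Fisher-Yates: only the first N positions ever need to be shuffled.
for i in range(0, N):
    j = rand(i, len(arr) - 1)
    arr[i], arr[j] = arr[j], arr[i]
return arr[0:N]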
We now have a few solutions: one that is optimal when N << s and one which is optimal when N approaches s. In the middle ground, neither solution works well. Solution one will pick numerous duplicates and must choose again. Solution three does not pick duplicates, but uses twice as much space as the optimal solution. Is it possible to construct a solution which works well at either end of the spectrum — or is even optimal in all cases? If you know me at all, you'll know I tried ;) Here's what I've come up with:
Final solution
s = b - a
possible = {}  # map, int -> int: lazily stores the shuffled "array"
numbers = []
for i in range(0, N):
    num = rand(i, s)
    if i not in possible:
        possible[i] = i
    if num not in possible:
        possible[num] = num
    numbers.append(possible[num] + a)
    # swap the i-th and num-th entries of the lazy "array"
    possible[i], possible[num] = possible[num], possible[i]
return numbers
Let's first discuss how this algorithm operates, and then move on to time and space complexity.
Structurally, the loop is almost identical. Instead of going up to len(arr) - 2 (the second-to-last element), we only run N iterations and stop short of shuffling the entire "array". We store the "array" lazily in a hashmap. If an index is in the hashmap, we use its value; if it is not, its value is assumed to be the index itself. At the start, since there are no keys, every index i maps to itself, giving the ordered array of 0..s we have in solution three (we then offset it to a..b by adding a before pushing the value to numbers). At best, we only have to store N values in the hash map, and at worst we must store min(2 * N, s) elements — and either way we skip the initial O(s) phase of creating the array.
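A small hypothetical trace may make the lazy "array" clearer. The concrete values below are my own illustration: take a = 10 and b = 14, so s = 4 and the conceptual array is 0..4, offset by 10 on output.

# i = 0: suppose rand(0, 4) returns 3.
#   Neither 0 nor 3 is a key yet, so possible becomes {0: 0, 3: 3}.
#   We emit possible[3] + a = 3 + 10 = 13, then swap: possible == {0: 3, 3: 0}.
# i = 1: suppose rand(1, 4) returns 3 again.
#   Key 3 now holds 0, so we emit 0 + 10 = 10 rather than a duplicate 13,
#   because the earlier swap moved the unused value 0 into slot 3.
#   After the swap: possible == {0: 3, 1: 0, 3: 1}.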
This solution works much the same way as solution three, but saves time and space by building the random bag as lazily as possible. This algorithm requires O(N) time and O(N) space in a hash map.
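Wrapped up as a standalone function (the name sample_range and the random.randint backing are my own choices for illustration), it might look like this:

import random

def sample_range(a, b, N):
    """Return N distinct, uniformly chosen integers from [a, b]."""
    s = b - a
    possible = {}  # lazy "array": index -> value
    numbers = []
    for i in range(N):
        num = random.randint(i, s)  # inclusive on both ends, like rand()
        if i not in possible:
            possible[i] = i
        if num not in possible:
            possible[num] = num
        numbers.append(possible[num] + a)
        possible[i], possible[num] = possible[num], possible[i]
    return numbers

print(sample_range(1, 10, 5))  # e.g. [7, 3, 10, 1, 6]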
Conclusions
We now have a nice theoretically best solution, but in the real world cache effects and the hidden constants in big-O notation can cause "worse" solutions to outperform the theoretical best. Attached are a simple test harness I have made and the complete code for each of the four solutions. Benchmarking results are still in progress, but I should have some nice pretty graphs up here soon, or perhaps in a "the practical side" post :)
Test script: uniq_rand.py