~jac

Selecting random numbers without duplicates

What is the most efficient way to generate random numbers in a range, without duplicates?

First let's define the problem completely. Given:

rand(m, n): returns a uniformily distributed random integer in the range [m, n]
[a, b]: a range from which to select the numbers (where a < b)
N: the quantity of numbers to generate
s = b - a: the target range size

Note that s >= N is also given, or the problem is ill-defined ;)

The problem is to generate N uniformly distributed random numbers in [a, b] without producing duplicate numbers — that is the same number z does not appear twice in the output list.

There are two immediately obvious solutions which work to different degrees depending on the exact circumstances. We'll be using python to describe our proposed solutions, and provide a simple benchmarking/test harness at the end. However, we may discuss alternative implementations choices such as using a bitset instead of an array — the source is only for illustration.

Solution One

	numbers = []
	while len(numbers) < N do:
		num = rand(a, b)
		if num not in numbers:
			numbers.append(num)
	return numbers

This solution works by picking a number, and then comparing it against the previously generated list. If we've already chosen it, we discard and pick again. This solution works well when N is small compared to the size of the range s. It runs into issues when N approaches s, where it is very unlikely to generate an unused number. When N = 2 and s = 100, the odds of only drawing two numbers total is 99%. In a severe case when N = 2 and s = 3, the odds of drawing only two numbers is 50%. In cases where (s - N) < N, it is better to choose the numbers we won't return instead of the ones we will, and then do a linear set difference.

Depending on how the num not in numbers check is implemented, this algorithm may require a linear scan through the numbers variable, resulting in at least O(N^2) run time. If a separate bitset is kept around for O(1) duplicate checking, then the space required by this algorithm goes from O(N) with a low constant (~1) — the minimum required — to O(N + s).

This algorithm gets worse very quickly. Due to the same cause as the Birthday Paradox, the odds of getting duplicates grows very quickly as N approaches the size of s. A complete analysis is still forthcoming.

Solution Two

	possible = [ x for x in range(a, b + 1) ]
	numbers = []
	while len(numbers) < N:
		num = rand(0, len(possible) - 1)
		numbers.append(possible[num])
		possible.remove(possible[num])
	return numbers

This solution works by tracking the set of numbers that are still possible to generate, and choosing one each iteration. This solution always chooses exactly N random numbers, and runs in a deterministic amount of time. For situations where N is close to s, it can run much more quickly than solution one. The main cost involved with this algorithm is maintaining the possible variable. Worst case, if possible is an array, then the cost of removing an element each iteration is roughly s, making this an O(N * s) algorithm.

Solution Three

	arr = [ x for x in range(a, b + 1) ]

	# Fisher-Yates shuffle
	for i in range(0, len(arr) - 1):
		j = rand(i, len(arr) - 1)
		# swap i-th and j-th element
		tmp = arr[i]
		arr[i] = arr[j]
		arr[j] = tmp

	return arr[0:N] # return first N elements

This solution works by generating a random bag, and then returning the first few required elements. This takes O(s) space, and the shuffle takes O(s) time. When N is close to s this is an optimal solution. When N == s then we must use O(N) space, and because we generate O(N) numbers, our time complexity must be at least that. When N << s however, this solution wastes a lot of effort shuffling extraneous elements in the possible array.

You may notice that in a Fisher-Yates shuffle, once i advances past an element, arr[i] is never touched again. This means that we can stop once we have shuffled the first N numbers and return just that. This removes the need to shuffle extra elements, but the time complexity remains at O(s) due to the initial arr construction. In addition, the O(s) space requirement can be prohibitive when s is large. Imagine trying to select 10 random titles out of a music library of 100k songs — we're wasting 4 orders of magnitude of space compared to solution one.

We now have a few solutions: one that is optimal when N << s and one which is optimal when N approaches s. In the middle ground, neither solution works well. Solution one will pick numerous duplicates and must choose again. Solution three does not pick duplicates, but uses twice as much space as the optimal solution. Is it possible to construct a solution which is works at either end of the spectrum — or is even optimal in all cases? If you know me at all, you'll know I tried ;) Here's what I've come up with:

Final solution

	s = b - a
	possible = { } # map, int -> int
	numbers = []
	for i in range(0, N + 1):
		num = rand(i, s - 1)
		if i not in possible:
			possible[i] = i
		if num not in possible:
			possible[num] = num

		numbers.append(possible[num] + a)
		tmp = possible[i]
		possible[i] = possible[num]
		possible[num] = tmp
	return numbers

Let's first discuss how this algorithm operates, and then move on to time and space complexity.

Structurally, the loop is almost identical. Instead of going up to len(arr) - 2 (second to last element), we only go up to N and end short of shuffling the entire "array". We store the "array" lazily in a hashmap. If an index is in the hashmap, we use it's value. If it is not, it is assumed to be the index. At the start, since there are no keys, every index i maps to itself, making the ordered array of 0..s we have in solution three (we then offset it to a..b by adding a before pushing the value to numbers). At best, we only have to store N values in the hash map, and at worst we must store min(2 * N, s) elements and can skip the initial O(s) phase of creating the array.

This solution works much the same way as solution three, but saves time and space by building the random bag as lazily as possible. This algorithm requires O(N) time and space proportional to O(N) in a hash map.

Conclusions

We now have a nice theoretically best solution, but in the real world cache effects and the hidden constants in big-O notation can cause "worse" solutions to outperform the theoretical best. Attached is a simple test harness I have made and complete code of each of the four solutions. Benchmarking results are still being made, but I should have some nice pretty graphs up here soon, or perhaps in a "the practical side" post :)

Test script: uniq_rand.py

originally posted 2015-02-28