Generating random numbers using C++ standard library: the problems

link管理
链接快照平台
输入网页链接，自动生成快照
标签化管理网页链接
C's standard library facilities

Let's start with some C-style random number generation.
// Seed based on time. Not really random.
std::srand(std::time(nullptr));
// Generate 1'000 random numbers in range 0-100'000
for (size_t _ = 0; _ < 1'000; ++_) {
    std::cout << std::rand() % 100'000 << '\n';
This code is simple enough to write and to understand but comes with a host of problems.
The resulting numbers will not be uniformly distributed. The results will be biased towards lower numbers, due to the use of modulo.
Numbers above 32767 might not be present at all.
Whether the code is thread-safe is up to the implementation. Which functions invoke rand is also up to the implementation, so data races can happen without you expecting them.
If you do not see why converting the numbers using modulo cause non-uniformly distributed results, consider a simple case, where std::rand can only return 0, 1, or 2, each with the same probability, and we desire numbers in the range [0, 2). There are 2 ways to get 0, 0 % 2, and 2 % 2, while there is only one way to get 1, 1 % 2. In other words, we get 2:1 ratio of 0s to 1s due to using modulo.
The second problem is more obscure, but simpler to understand. The range of possible values generated by std::rand is specified as [0, RAND_MAX), where RAND_MAX can be any constant larger-or-equal to 32767. On platforms that use this lower bound^[1], the example above will never print number larger than 32767.
The last problem is just a symptom of the original C specification ignored threading.
The first two problems are solvable. Replacing modulo with rejection sampling (and potentially calling std::rand multiple times if needed) solves the bias issue. To generate values larger than RAND_MAX, you can just concatenate the result of multiple calls to std::rand.
The thread-safety is impossible to solve in general case^[2], but in specific cases, you can guard user code calls to std::rand with a mutex, and it should work well enough. Some implementations provide a per-thread std::rand, which is a much better solution, but you cannot rely on this.
However, solving all of this is either impossible, or a lot of non-trivial work, and even then you run into the problem that std::rand is allowed to return different numbers on different platforms given the same seed. At this point, it is easier to write your own set of random number generation tools, and so C++11 standardized its own set, in the form of <random>.
C++'s standard library facilities
At first glance, <random> seems exceedingly complex for a simple task. You have to pick a templated Uniform Random Bit Generator, possibly seed it, pick a templated Distribution, and then pass an instance of your URBG to the distribution to get a number... This is the C example rewritten using <random>:
// Truly random seed. 
std::mt19937 rng(std::random_device{}());
// Avoid constructing distribution all the time
std::uniform_int_distribution<> dist(0, 100'000);
// Generate 1'000 random numbers in range 0-100'000
for (size_t _ = 0; _ < 1'000; ++_) {
    std::cout << dist(rng) << '\n';
There is bit more code than there was with C, but bearably so, and most of the issues are fixed. The distribution will be uniform, all numbers in the desired interval are possible, and the code is thread-safe.
At second glance, <random> is awesome, even if there is a bit of boilerplate for simple operations. The decomposed and pluggable design means that you can customize your random numbers by replacing only a small part of the random number generation pipeline. The standard also provides a wide range of Random Number Engines and distributions^[3], so you should be able to do most things you want out of the box. It even provides an abstraction for getting actually random numbers for seeding the generators, std::random_device.
At the third glance, when you've started using <random> extensively and started digging deeper, you will find out that every single part of it is deeply flawed, and the best solution is to avoid using it completely.
Distributions are nonportable
Did you notice that the text above said
most of the issues are fixed
and then did not talk about portability? That's because both of the snippets, the C one and the C++ one, share one issue. Even if you hardcode the seed, the snippets will give you different results on different platforms^[4]. For bonus points, the results are not even guaranteed to be portable between different versions of the same standard library, as the standard library implementations are allowed to change how they implement std::uniform_int_distribution^[5].
What this boils down to is that if you have repeatability requirements for your generated random numbers^[6], then you cannot use the standard-provided distributions. Luckily, generating random numbers using <random> is properly decomposed, and you can "just" write your own distributions, and keep using the rest of <random>, right?
Well...
std::random_device might not be random, and there is no way to check
The C++ snippet uses std::random_device to generate some initial randomness to seed our instance of Mersenne Twister in the form of std::mt19937. The problem is that std::random_device is poorly specified, and inscrutable.
In theory, it should serve as an abstraction over some external source of entropy. In practice, an implementation is allowed to use any deterministic random number engine to implement it, e.g. a Mersenne twister, and there is no way to find out. There is a member function std::random_device::entropy(), which is in theory there to detect such case, but it does not work in practice.
The blame for this is shared between the standard and the implementations. The function's full signature is double entropy() const noexcept, and it is the return type that breaks it. The standard provides a definition of entropy^[7], but it does not provide any sort of guidance on how to count entropy of an external source of randomness, or expected return values for different cases.
This, in turn, caused different implementations to do their own thing. We will take a look at the big three, MS's STL, libc++ and libstdc++.
MS's implementation handles this the best. It knows its random_device is just a thin wrapper over kernel's cryptographically secure random, so it always returns 32 and inlines the member function into the header to allow for constant propagation^[8].
In order of sanity of implementation, libc++ is next, because it always just returns 0. This return value does not reflect reality, 4 out of 5 possible configurations^[9] of libc++'s random_device use strong random backend, and the last one also provides strong random bytes unless the user deliberately sabotages themselves. The return value also makes libc++'s implementation of std::random_device::entropy useless, but at least it is obviously useless, so the user is not given false hopes and expectations. There is value in this.
The worst implementation of std::random_device::entropy can be found in libstdc++. The reason it is the worst is that it is not obviously useless, you have to think about it for a bit to figure out why the return value is useless. This is because, unlike libc++, libstdc++ can return non-zero values. In most configurations, libstdc++ always returns 0^[10], but when it is configured to read from /dev/urandom (or /dev/random), it uses RNDGETENTCNT to check how much entropy the kernel thinks it has available and returns that to the user.
The underlying problem of this approach is TOCTOU. If you first check whether there is enough randomness^[11], and only then ask for that randomness, then by the time you ask for the randomness it could've been depleted, and you cannot get it anymore.
At this point, we know that we will likely have to implement our own distributions, and either implement our own random_device, or detect which standard library we are compiling against, and hardcode versions that provide good random_device::operator() implementations. But at least we can still use all the different Random Number Engines provided by the standard library, right?
Well...
There is no way to seed a Random Number Engine properly
The Random Number Engines almost work. But if something only almost works, it is broken.
Let's go back to the first line of the C++ example.
std::mt19937 rng(std::random_device{}());
It seeds a specific version of Mersenne Twister with unsigned int worth of random data. Let's assume sizeof(unsigned int) == 4. The internal state of mt19937 is 2496 (624 * 4) bytes. Taken together, this means that for every state we can seed the rng into, there are \(2^{4984}\) states that we cannot seed the rng into.
This has some interesting implications. For example, the program below will never print 7^[12].
int main() {
    std::mt19937 urbg(std::random_device{}());
    std::cout << urbg() << '\n';
Some output values also uniquely identify their seed. If I tell you that the code program printed 3046098682, then you can quickly^[13] find the seed generated by random_device, and thus predict all future outputs of a Mersenne twister seeded in this way^[14].
In theory, the standard provides a way to seed the Mersenne twister properly. The tool is called SeedSequence, and there is an implementation of it in the standard library, std::seed_seq. Once again, when you try to use it in practice, it breaks down.
std::seed_seq is essentially a wrapper over std::vector that you can give a bunch of randomness to, and then a random number engine can extract (stretched) randomness out. It is used like this:
auto rd_dev = std::random_device{};
std::seed_seq seq{rd_dev(), rd_dev(), rd_dev(), rd_dev()};
std::mt19937 urbg(seq);
This time we initialized our instance of mt19937 with 16 (4 * 4) bytes of randomness. Progress! There are two problems with this snippet though:
There is no way to know how much randomness you have to provide to a RandomNumberEngine T, and thus how much randomness you have to feed into seed_seq.
std::seed_seq is very tightly specified by the standard. The implementation forced by the standard is not a bijection^[15].
A fun fact about 1. is that std::mersenne_twister_engine provides a member variable you can query to find out how much data it needs^[16]. However, this is an accident of standardization, and no other standard-provided random number engine provides a way to retrieve this information.
The second problem means that even if you hardcode seed sizes of all random number engine types your program uses, you still couldn't use std::seed_seq for initialization, because it loses entropy... here is an example of this on Godbolt:
#include <array>
#include <iostream>
#include <random>
int main() {
    std::seed_seq seq1({0xf5e5b5c0, 0xdcb8e4b1}),
                  seq2({0xd34295df, 0xba15c4d0});
    std::array<uint32_t, 2> arr1, arr2;
    seq1.generate(arr1.begin(), arr1.end());
    seq2.generate(arr2.begin(), arr2.end());
    // prints 1 because seed_seq::generate is not a bijection
    std::cout << (arr1 == arr2) << '\n';
In other words, even if you write your own type that fulfils the SeedSequence named requirements, you have to hardcode the sizes of your Random Number Engine types somewhere.
Recap
To recapitulate, generating random numbers using C standard library has many problems, with some fixable at great programming effort, and other unfixable. If you are for some reason stuck with just the C library, you should definitely write your own.
Generating random numbers using C++ standard library fixes most of the problems of using the C library. However, the operative word here is most, and it introduces its own problems instead. In the end, whether you can successfully use <random> depends on your requirements.
If you need cross-platform reproducibility, then you cannot use standard-provided distributions at all, and you have to write your own.
If you need actually random data for whatever reason, you either have to write your own version of random_device, or hardcode a list of platforms + configurations where you can use std::random_device.
if you want to properly seed a Random Number Engine, you have to write your own SeedSequence, and also hardcode the required seed sizes of all your Random Number Engines.
My use cases for <random> usually require cross-platform reproducibility, need properly random seed values, and would prefer fully seeded RNEs. This means that I either have to write 90% of <random> on my own, or use a different implementation, such as Boost.Random or PCG random utilities...
And I am not the only one. When I was writing a couple of standardization proposals for fixing <random>, I made an informal Reddit poll asking people about their use of <random>. The absolute majority of people answered either that they have their own implementation, or use Boost.Random. Few people used other open source libraries, and very, very, very few people use the standard random.
That's it for this post. The next post explores possible avenues for fixing <random> and making it usable by more people in more domains.
This is not a theoretical problem, e.g. MSVC defines RAND_MAX as 32767. ↩︎
There are two problems. The first is that your standard library is allowed to call std::rand from arbitrary other functions. The other is that the other libraries in your binary are also allowed to call std::rand, and you have no way to tell them to lock a mutex before doing so. ↩︎
Including some less known distributions, like Student's t-distribution. ↩︎
Both these configurations share the same libc, so the result of std::rand will be the same. However, it obviously differs from MSVC's results, because MSVC's std::rand cannot return a number that big. ↩︎
Microsoft's standard library implementation is in fact considering a change to a more performant implementation of std::uniform_int_distribution. ↩︎
There are many different use cases for reproducible random generation. Fuzzing, procedural generation, networked clients, physics simulations, etc. ↩︎
I am reasonably sure that standard library maintainers are smart enough to look it up on their own, but ¯\_(ツ)_/¯. ↩︎
This does not mean you can use it for compile-time checking, but at least it is something. ↩︎
Libc++ can either use getentropy, arc4random, nacl_secure_random, rand_s, or read from a file, like /dev/urandom. In theory the user can specify their own file, and thus get nonrandom data. ↩︎
As an example, current versions of libstdc++ use rand_s when on Windows. rand_s is kernel-provided cryptographically secure random number generator, but libstdc++ still returns 0. ↩︎
I do not subscribe to the view that we can use up kernel's entropy after it has initialized its pools properly, but I can see how a good-faith implementation can arrive here. ↩︎
It actually cannot print over 30% of all possible values of a 4-byte unsigned integer. ↩︎
It takes about 10 minutes on my desktop. ↩︎
With a properly seeded Mersenne twister, you would have to observe 624 successive outputs to be able to predict all future generated numbers. ↩︎
If you assume that std::seed_seq never gets more random data than a random number engine will request later, then it would suffice for it to be an injection. It is not an injection either. ↩︎
You actually have to query more than one member variable and do some arithmetics, but it is possible to find out how much random data you need to feed a std::mersenne_twister_engine to initialize it fully. ↩︎