Tuesday, May 20, 2014

Dealing with randomness

Two weeks ago, I wrote an article regarding randomness on UNIX systems and libraries. In it, I dealt with some theoretical API issues, and real world issues of libraries being horribly misdesigned. Today I'd like to focus more on the current state of things, and further discuss real world problems.


To start, let us understand what randomness means. Perceived randomness is simply our inability to track all the variables: what we mere mortals call random is whatever we cannot know or predict, at least not with the tools at our disposal.

The ideas behind randomness and determinism have long been battled over by philosophers and scientists, the former debating whether such a thing can even exist, the latter hunting for it. But thankfully, in terms of cryptography, even if true randomness doesn't exist, we can make believe it does, and we can generate data that appears random to anyone on the outside. We do so by using variables that are hard to track and influence, and extremely confusing algorithms.

Collecting unpredictable values is a challenge, and nothing is ever as unpredictable as one might think. Even if we were to look to quantum mechanics, which today is believed to be the closest thing to random that we know of, and measure some quantum behavior, the data might still be predictable, and therefore not random enough for cryptography needs. Cryptography Engineering (pages 138-139) covers this in more detail, without even getting into possible advancements in the field. Being completely unpredictable using more conventional means is all the more challenging.

Essentially, we're left with trying to do our best, without really being able to ensure we're doing things well. The less information we leak, and the more unpredictably we behave, the better.


We tend to make assumptions. We assume that the people behind Linux know how to create a great operating system. We assumed the OpenSSL team knew how to create a secure library. We assume the OpenBSD developers know safe software engineering. However, assumptions may be wrong. Never assume; research and test.

Like others, I tend to assume that different pieces of software were written correctly. A couple of years back, I was having trouble with a web server in conjunction with IPv6, which was built on top of a network server library. I assumed that the library, written by supposed experts with IPv6 knowledge, was correct, blamed the newness of IPv6, and concluded the issue was not with the library but with the network stacks in the OS.

For unrelated reasons I decided to improve my network development knowledge and purchased Unix Network Programming, 3rd Edition, which is now over a decade old, yet includes instructions on properly working with IPv6. It turns out I was able to fix the bug in the networking library I was using by modifying a single line. After reading this book, I also realized why I was having a rare, hard-to-reproduce bug in another application, and fixed that issue in two minutes.

The aforementioned book is considered the bible in terms of sockets programming. It explains common gotchas, and shows how to build robust networking software which works properly. We would assume the developers of a major networking library would have read it, but experience shows otherwise.

I hardly need to elaborate how this has applied elsewhere (hello OpenSSL).

False messiahs

Most software is written by people who aren't properly trained for their positions. They either get thrust into it, or dabble in something for fun, and it turns into something popular which people all over the globe use regularly. Suddenly popularity becomes the measurement for gauging expertise. Some kid with a spare weekend becomes the new recognized authority in some field with no real experience. People who end up in this position tend to start doing some more research to appear knowledgeable, and receive unsolicited advice, which to some extent propels them to be an expert in their field, at least at a superficial level. This can then further reinforce the idea that the previously written software was correct.

Situations like this unfortunately even apply to security libraries and software which needs to be secure. We can read a book, which covers some important design nuance, and we'll think to ourselves that if we laymen read the popular book, surely those who need to be aware of its contents did, and assume the software we depend on every day is written properly. I've read a few books and papers over the years on proper random library gotchas, design, and more, yet used to simply use OpenSSL's functions in my code, as I assumed the OpenSSL developers read these books too and already took care of things for me.

Recently, the OpenBSD team actually reviewed OpenSSL's random functions, and realized that no, the OpenSSL developers did not know what they were doing. Ever since the Debian key generation fiasco revealed they were using uninitialized variables as part of randomness generation, I suspected they didn't know what they were doing, but never bothered looking any deeper. It's scary to realize you may in fact know how to handle things better than the so-called experts.

I wrote an article the other day regarding how to protect private keys. A friend of mine after reading it asked me incredulously: You mean Apache isn't doing this?!

We have to stop blindly believing in certain pieces of software or developers. We must educate ourselves properly, and then ensure everything we use lives up to the best practices we know about.

UNIX random interfaces

Linux invented two interfaces for dealing with random data which were then copied by the other UNIX-like operating systems. /dev/random which produces something closely related to what the OS believes is random data it collected, and /dev/urandom which produces an endless stream of supposedly cryptographically secure randomish data. The difference in the output between the two of them should not be discernible, so theoretically, using one in place of the other should not make a difference.

There are a ton of websites out there telling developers to use only /dev/urandom, since the whole point of a cryptographically secure pseudo-random number generator is to appear random. So who needs /dev/random anyway? A finite supply of randomishness is problematic, so everyone should just use something which is unlimited. Then the Linux manual page gets blamed for fostering confusion by suggesting there are situations that actually require /dev/random.

Now those with an ounce of critical thinking should be questioning: if there's never a point to /dev/random, why did Linux invent it in the first place? Why does it continue to provide it with the same semantics it always had? In fact, the manual page is doing nothing more than paraphrasing and explaining the comments in the source code:
 * The two other interfaces are two character devices /dev/random and
 * /dev/urandom.  /dev/random is suitable for use when very high
 * quality randomness is desired (for example, for key generation or
 * one-time pads), as it will only return a maximum of the number of
 * bits of randomness (as estimated by the random number generator)
 * contained in the entropy pool.
 * The /dev/urandom device does not have this limit, and will return
 * as many bytes as are requested.  As more and more random bytes are
 * requested without giving time for the entropy pool to recharge,
 * this will result in random numbers that are merely cryptographically
 * strong.  For many applications, however, this is acceptable. 
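Whichever of the two devices a program settles on, reading it deserves care: a read can return fewer bytes than requested or fail outright, and silently continuing with a short buffer is exactly the kind of quiet failure discussed later. A minimal sketch in Python (the function name is mine; os.urandom() performs the equivalent loop internally):

```python
import os

def read_random_device(count, path='/dev/urandom'):
    """Read exactly count bytes from a random device.

    Short reads are handled by looping; any other failure is raised,
    never papered over with a weaker fallback source.
    """
    data = b''
    with open(path, 'rb', buffering=0) as dev:
        while len(data) < count:
            chunk = dev.read(count - len(data))
            if not chunk:  # EOF from a random device is a hard failure
                raise OSError('random device returned no data')
            data += chunk
    return data
```

The key design point is the failure mode: if the device cannot be read, the only correct response is to stop, not to fall back on time or PIDs.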
There are actually a number of problems with pseudo-random number generators. They leak information about their internal state as they output their data. Remember, the data is generated from an internal state; there's an input which produces the stream of output, it's not just magically created out of thin air. The output will also eventually cycle. Using the same instance of a pseudo-random number generator over and over is a bad idea, as its output will become predictable, and not only its future output: anything it once generated will also be known. In other words, pseudo-random number generators lack what is known as forward secrecy.

Now if you believe the hype out there that for every situation, one should only use /dev/urandom, then you're implying you don't trust the developers of the Linux random interfaces to know what they're talking about. If they don't know what they're talking about, then clearly, they also don't know what they're doing. So why are you using any Linux supplied random interface, after all, Linux obviously handles randomness incorrectly!

FreeBSD actually makes the above argument, as they only supply the /dev/urandom interface (which is known as /dev/random on FreeBSD), which uses random algorithms created by leading cryptographers, and nothing remotely like what Linux is doing. Each of the BSDs in fact claim that their solution to randomness is superior to all the other solutions found among competing operating systems.

Linux on the other hand takes an approach where it doesn't trust the cryptographic algorithms out there. It has its own approach where it collects data, estimates the entropy of that data, and uses its own modified algorithms for producing the randomness that it does. Cryptographers are constantly writing papers about how non-standard Linux is in what it does, and numerous suggestions are made to improve it, in one case, even a significant patch.

Now I personally dislike Linux's approach to throw cryptographic practice out the window, but on the other hand, I like their approach in not putting too much faith into various algorithms. I like FreeBSD's approach to using well designed algorithms by cryptographers, but I dislike their inability to ensure the highest levels of security for when you really need it.

The cryptographers out there are actually divided on many of the issues. Some believe entropy estimation is utterly flawed, and cryptographically strong algorithms are all you need. Others believe that such algorithms are more easily attacked and prone to catastrophic results, and algorithms combined with pessimistic estimators and protective measures should be used.

In any case, while the various operating systems may be basing themselves on various algorithms, are they actually implementing them properly? Are they being mindful of all the issues discussed in the various books? I reviewed a couple of them, and I'm not so sure.

A strong point for developers to consider is that even though OS developers may improve their randomness in response to various papers they happen to stumble across or that get shoved in their faces, it doesn't necessarily mean you're free from worrying about these issues in your applications. It's common to see someone compiling and using a modern software package on, say, Red Hat Enterprise Linux 5, with its older version of Linux. I recently got a new router which allows one to SSH into it. Doing so, I saw it had some recentish version of BusyBox and some other tools on it, but was running on Linux 2.4!

Situations to be mindful of

There are many situations one must be mindful of when providing an interface to get random data. This list can be informative, but is not exhaustive.

Your system may be placed onto a router or similar device which generates private keys upon first use. These devices are manufactured in bulk, and are all identical. These devices also generally lack a hardware clock. In typical operating conditions, with no special activity occurring, these devices will be generating identical keys, there's nothing random about them.

Similar to above, a full system may have an automated install process deploying on many machines. Or an initial virtual machine image is being used multiple times. (Do any hosting companies make any guarantees about the uniqueness of a VM you purchase from them? Are preinstalled private keys identical for all customers?)

A system may be designed to use a seed file, a file which contains some random data, which is stored on disk to ensure a random state for next boot-up. The system may be rebooted at some point after the seed file is used, but before it is updated, causing an identical boot state.

There can be multiple users on a system, which can influence or retrieve random data. Other users may very well know the sequences generated before and after some data was generated for another user. That can then be used in turn to determine what random data was generated for the other user.

Hardware to generate random data may contain a backdoor, and should not be fully trusted.

Hardware to generate random data may break, or be influenced in some manner, causing the generated data to not actually be random.

Two machines next to each other may be measuring the same environmental conditions, leading one machine to know the random data being generated for the other.

A process may be inheriting its state from its parent, where both of them will be generating the same sequences, or multiple children will be generating the same sequences.

Wall-clock time can be adjusted, allowing the same time to occur multiple times, which in turn will lead to identical random states where wall-clock time is the only differentiating factor.

Possible Solutions

A proper solution for identical manufacturing or deployments would be to write a unique initial state to each one. However, this cannot be relied upon, as it requires compliance by the manufacturer or deployer, which can increase costs and complexity, and good tools to do so are not available.

Devices generally have components which contain serial numbers. These serial numbers should be mixed into initial states, minimally ensuring that identical devices do not have identical states. As an example, routers will have different MAC addresses. They may even have multiple MAC addresses for different sides of the router, or for wireless. Be aware however that it is possible for an outsider to collect all the MAC addresses, and thus reveal the initial state.
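As a sketch of the mixing idea in Python (uuid.getnode() stands in for reading the device's real MAC or serial number, and the helper name is mine):

```python
import hashlib
import os
import uuid

def initial_seed_material():
    """Mix a per-device identifier into boot seed material.

    The MAC address is collectable by an outsider, so it only ensures
    that otherwise-identical devices differ; it must never be the sole
    input, which is why system randomness is hashed in alongside it.
    """
    mac = uuid.getnode().to_bytes(6, 'big')   # 48-bit MAC (or a fallback)
    return hashlib.sha256(mac + os.urandom(16)).digest()
```

Hashing the combined material means no single input, guessable or not, determines the result by itself.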

If a hardware clock is available, it should be mixed into the boot-up state to differentiate the boot-up state from previous boots and other identical machines that were booted at different times.

Random devices should not emit data until some threshold of assurance is reached. Linux and NetBSD provide an API for checking certain thresholds, although race conditions make the API a bit useless for anything other than ensuring the devices are past an initial boot state. FreeBSD now ensures its random device does not emit anything until some assurances are met, but older versions did not carry this guarantee. Also be wary of relying on /dev/random for assurances here, as my prior random article demonstrated that the device under that path may be swapped for /dev/urandom, a practice common among sysadmins and device manufacturers.

Seed-files should not solely be relied upon, as embedded devices may not have them, in addition to the reboot issue. Above techniques should help with this.

The initial random state provided to applications should be different for each user and application, so they are not all sharing data in the same sequence.

Hardware generators should not be used without some pattern matching and statistical analysis to ensure they are functioning properly. Further, what they generate should only be used as a minor component of random data generation, so they cannot solely influence the output, and so machines in the same environment are prevented from producing the same results.
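A crude sketch in Python of the kind of sanity check meant here; a real deployment would run a fuller statistical battery, but even a simple monobit count catches a stuck or grossly biased source:

```python
def looks_broken(sample: bytes) -> bool:
    """Flag hardware generator output whose ones/zeros balance is
    wildly off (beyond roughly 4 standard deviations of a fair coin).
    Passing this check is necessary, not sufficient."""
    total_bits = len(sample) * 8
    ones = sum(bin(byte).count('1') for byte in sample)
    # For fair bits, ones ~ total/2 with standard deviation sqrt(total)/2,
    # so 2 * sqrt(total) is a 4-sigma threshold.
    return abs(ones - total_bits / 2) > 2 * total_bits ** 0.5
```

A generator stuck at all-zeros or all-ones fails immediately; subtler failures need real test suites.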

Due to application states being cloned by fork() (or similar), a system-wide random interface can be more secure than an application state (thus OpenSSL, when directly using /dev/urandom, can be more secure than various flavors of LibreSSL). Application-level random interfaces should have their states reset upon fork(). Mixing in the time, process ID, and parent process ID can help.
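In Python, for example, the reset can be automated with os.register_at_fork (available since Python 3.7); the PID and time values here are mitigations layered on top of the system device, not a replacement for it:

```python
import os
import random
import time

def _reseed_after_fork():
    """Re-key the module PRNG in a freshly forked child so it
    diverges from its parent and from sibling children."""
    random.seed(os.urandom(16)
                + os.getpid().to_bytes(4, 'big')
                + os.getppid().to_bytes(4, 'big')
                + time.time_ns().to_bytes(8, 'big'))

# Every child created by os.fork() (or anything that calls it)
# now reseeds itself automatically.
os.register_at_fork(after_in_child=_reseed_after_fork)
```

Application-level libraries that keep their own state need the equivalent hook; most never provide one, which is the complaint below about fork() wrappers.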

Always use the most accurate time possible. Also, mix in the amount of time running (which cannot be adjusted), not just wall-clock time.
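In Python terms, that means combining time.time_ns(), the most precise wall-clock reading available, with time.monotonic_ns(), which measures time running and cannot be stepped backwards; the hashing scheme is just an illustration:

```python
import hashlib
import os
import time

def time_seed_material():
    """Combine adjustable and unadjustable clocks into seed material.

    The monotonic clock cannot be set backwards by an administrator
    or NTP, so replaying a wall-clock time does not replay this value.
    """
    return hashlib.sha256(
        os.urandom(16)                             # system randomness first
        + time.time_ns().to_bytes(8, 'big')        # wall clock, ns precision
        + time.monotonic_ns().to_bytes(8, 'big')   # time running, unadjustable
    ).digest()
```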

Many of the above solutions alone do not help much, as such data is readily known by others. A common library mistake is to try the system random devices and, if that fails, fall back on some of these other sources of uniquish data. Rather, the system random devices, which already mix in all these other sources of uniquish data, should be used, and upon failure the library should go into failure mode, not we can correct this extremely poorly mode.

Usage scenarios

Not only does random data have to be generated correctly, it has to be used correctly too.

A common scenario is attempting to get a random value from a specific range. The typical approach is to divide the random value returned by the number of possible values needed, and take the remainder. Let me demonstrate why this is wrong. Say your random number generator can generate 32 possible values, meaning any number from 0 up to and including 31, and we only need to choose between five possible values.

Reducing these 32 values modulo 5 maps 0 through 29 evenly, six source values onto each result, but 30 and 31 map onto 0 and 1, making 0 and 1 more likely than 2, 3, and 4. Now if we were to suppose lower numbers are inherently better for some reason, we would consider such usage fine. But in that scenario, our random number generator should be:
 function GetInherentlyGoodRandomNumber()
   return 0 //Lower numbers are inherently better!
Being that the above is obviously ridiculous, we want to ensure equal distribution. So in the above scenario, if 30 or 31 occurs, a new random number should be chosen.
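Generalized, this redraw-and-retry approach is rejection sampling. A sketch in Python for a generator that produces one of 256 values per draw (a byte from the system device; the function name is mine):

```python
import os

def uniform_below(n):
    """Return a uniformly distributed integer in [0, n), for 0 < n <= 256.

    Bytes at or above the largest multiple of n are rejected and
    redrawn instead of being folded onto the low results, which is
    what creates modulo bias.
    """
    if not 0 < n <= 256:
        raise ValueError('n must be in 1..256')
    limit = 256 - (256 % n)   # largest acceptable exclusive bound
    while True:
        value = os.urandom(1)[0]
        if value < limit:
            return value % n
```

For n = 5, limit is 255, so only the byte value 255 is redrawn, and each result 0 through 4 occurs with probability exactly 51/255.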

Similar issues exist with floating point random numbers. One generally uses floating point random numbers when one wants to use random values with certain probabilities. However, typical algorithms do not provide good distributions. There is an entire paper on this topic, which includes some solutions.

The C library's fork() call (when it exists) should reset any random state present in the C library's random functions, but unfortunately that's not done in the implementations I've checked (hello arc4random). The issue extends to external libraries, most of which never even considered supplying a secure_fork() or similar. Be wary of ever using fork() directly. You almost always want a wrapper which handles various issues, not just random states.

Further reading

Pretty much everything I stated here is a summary or rehash of information presented in important books and papers. Now based on my past experiences that I described above, you're probably not going to read anything further. However, if the topics here apply to you, you really should read them.


henke37 said...

Of course, the error for the division is going to be negligible with a normally sized randomness source. That is, unless you need a very wide range of values.

insane coder said...

Hi henke37,

If the modulus cuts the pool nearly in half, you'll have some values with twice the probability of being used than the rest of them.

henke37 said...

Yup, but I argue that the error is ignorable if the modulus is several orders of magnitude smaller than the randomness source.

insane coder said...

You're right, and it is in most cases.

But consider this as a common case:
You're generating a password, where each character can be one of 70 to 90 values, and your random function generates 255 values. Suddenly, certain characters have a noticeable bias.

insane coder said...

People asked that I post links to the comics I based my joke function off of.


Joe P. said...

It seems as though the so-called "experts" all fall into one of two categories. Either, like you mentioned, it's the kid without any real knowledge who did a weekend of work and got blindly voted into the office of expert-in-chief (cue stars and stripes forever), or it's someone who worked in the field 30 years ago, but who now teaches all his university students that data security means not labeling your punch cards. (The former case is very common to see on sites like stackoverflow, where the majority of questions have some naively juvenile answer with dozens of votes, and occasionally a different, more complete answer with 1 vote tops, and I can't tell you how many outdated professors I've seen "teaching" the next generation of developers.)

Computer science, and particularly the field of security, evolve very quickly, so I think the only people who are real experts are the ones who aren't arrogant enough to think they could ever be an expert, nor complacent enough to think that they don't need to study it themselves. I have been laughed at by peers for refusing to just trust a popular library or service to do its job at face value. I have been told that "the libraries may have their flaws, but everyone is using them, so we'll all fail together". I have been told that "you can't expect something you make to be more reliable than libraries which have been worked on by hundreds of developers". Guess what, those same people who said those things are the ones who were stumbling around like blind monkeys on a trampoline six weeks ago trying to clean up the mess from heartbleed. And who do they blame? Themselves for assuming something was secure without first reviewing it? Of course not! They blame the same so-called "experts" that they obsequiously put into that position in the first place. When software comes with notices which say things like "this program is provided as is with absolutely no warranty", it should be a sign to everyone that it's their own responsibility to review it and understand it. More often than not, people just assume someone else has done so, so why bother?

When it comes to a single piece of code, two heads are not necessarily better than one, as the second often comes along and makes assumptions that are incorrect. The "hundreds of developers" argument sounds as though the project is the result of the collected knowledge of hundreds of experienced developers, but more often than not it turns out to be the result of hundreds of false assumptions instead. It's rare that developers document their code enough to really give over enough information to the next person who comes along. The next person who comes along is even less likely to document properly for the following person, etc. Once you're dealing with hundreds, it's beyond irreparable.

I'm not saying I never use existing libraries, but I don't do so blindly, without careful review and consideration to understand what it is that I'm trusting my information to. Unfortunately, what I see is an unmitigated disaster the vast majority of the time. If that happens, I simply roll my own. So laugh at me if you will, call me paranoid if you will, tell me I suffer from not-invented-here syndrome if you will, but when the next heartbleed happens, I'll be sleeping comfortably knowing my code is still working properly while the rest of you are once again cleaning up after your "experts".

Alon Ziv said...

Another option for seeding RNG's is to take a hash value of an SRAM block at power-up. This is basically a hardware RNG that is present in nearly all devices; it requires a bootloader change to use effectively.

Of course, this mechanism has the same issues as other hardware RNG's. But it's still better than nothing, and it is a pity that almost nobody uses it.

insane coder said...

There are some gotchas involved in SRAM block power up. The SRAM as a whole is also filled with a pattern based on many fewer bits which propagate to larger SRAM blocks.

This paper here discusses this topic and is interesting: http://www.rfidsec07.etsit.uma.es/slides/papers/paper-12.pdf

It concludes that at most, only 1/16 of SRAM has any entropy. However, as with most things, it's probably an overestimation.

Hashing SRAM at boot is a good idea, but probably provides far less randomness than one might think. Also, in the case of VMs, boot up isn't a real hardware boot up, and problems are much more severe. This paper discusses some of the problems and offers a use-case solution: http://www.isoc.org/isoc/conferences/ndss/10/pdf/15.pdf

Alon Ziv said...

Yeah, I know the limitations of SRAM randomness (I had read the linked paper more-or-less when it was first published :)). Still, if you have (say) 8KB of internal SRAM - which is low, by today's standards - you will very likely have more than 128 bits of physical entropy. Which is all one can ask for. :)

As for hedging, I have known about it since before it got this name... I have used the same ideas to implement cryptography in adversarial settings, several years ago. I consider this another "never bad" idea - in my opinion, whenever any crypto function uses some randomness, it should first "contribute back" by mixing all of its parameters (including secret/private keys) into the entropy pool.

Evan Byrne said...

I know this is an old article, so please forgive me. I was wondering though, couldn't we use what we know about quantum mechanics to create strong random value generators? Think two-slit experiment. We are able to represent overall outcomes of quantum mechanical experiments in terms of probabilities, but individual events appear to be completely random at the time of measurement.

Evan Byrne said...

Ah yes, I missed the comment at the beginning about quantum mechanics. Still, even if we are confident that the overall outcome of an experiment might be roughly 50/50, the individual events still appear random at time of measurement. Just because my electron went through slit #1 five times in a row doesn't mean I should be confident that it will go through slit #2 next. The only thing I would really be worried about is the device getting out of tune so that it favors a certain slit.

insane coder said...

Hello Evan,

What we know about quantum mechanics can be used to create the *base* for a strong random number generator.

The book I referred to discusses how these generators can be influenced. So even if the outcome of some experiment is 50/50 in isolation, that's not the case in the real world where the environment of the device will be under attack.

This article also contains several examples that convey that any generator in isolation is a bad idea (for example, it stops measuring correctly and suddenly becomes very biased). Sure, we should use devices applying quantum mechanical technology as part of a random number generator on a system, but only as one piece of a much larger process which doesn't depend upon a single point of failure.