Monday, June 30, 2014

Memory management in C and auto allocating sprintf() - asprintf()

Memory Management

Memory management in C is viewed by some to be quite tricky. One needs to work with pointers that can point anywhere in memory, and if misused, cause a program to misbehave, or worse.

The basic functions to allocate and deallocate memory in C are malloc() and free() respectively. The malloc() function takes a size in bytes of how much to allocate, and returns a pointer to the allocated memory to be used. Upon failure, a null pointer, which can be thought of as a pointer pointing to 0 is returned. The free() function takes the pointer returned by malloc(), and deallocates the memory, or frees up the memory it was once pointing to.

To work with malloc() in simple situations, typically, code along the following lines is used:
void *p = malloc(size);
if (p)
{
  ... work with p ...
  free(p);
}
else
{
  ... handle error scenario ...
}

Unfortunately many experienced programmers forget to handle the failure scenario. I've even heard some say they purposely don't, as they have no clue how to proceed, and just letting the program crash is good enough for them. If you meet someone who makes that argument, revoke their programming license. We don't need such near sighted idiots writing libraries.

In any case, even the above can lead to some misuse. After this block of code runs, what is p now pointing to?

After the above code runs, in the case that malloc() succeeded, p is now pointing to memory in middle of nowhere, and can't be used. This is known as a dangling pointer. Dangling pointers can be dangerous, as an if clause as above will think the pointer is valid, and may do something with it, or lead to the infamous use after free bug. This becomes more likely to occur as the situation becomes more complicated and there are loops involved, and how malloc() and free() interact can take multiple paths.

Pointers involved with memory management should always be pointing at 0 or at allocated memory. Anything else is just asking for trouble. Therefore, I deem any direct use of free() dangerous, as it doesn't set the pointer to 0.

So if free() is considered harmful, what should one use?

In C++, I recommend the following:

static inline void insane_free(void *&p)
{
  free(p);
  p = 0;
}

This insane_free() is now a drop in replacement for free(), and can be used instead. (Since C++ programs normally use new and delete/delete[] instead, I leave it as an exercise to the reader how to work with those.)

However, C doesn't support direct references. One can pass a pointer by a pointer to accomplish similar results, but that becomes clunky and is not a drop in replacement. So in C, I recommend the following:
#define insane_free(p) { free(p); p = 0; }
It makes use of the preprocessor, so some may consider it messy, but it can be used wherever free() currently is. One could also name the macro free in order to automatically replace existing code, but it's best not to program that way, as you begin to rely on these semantics. This in turn means someone copying your code may think a call to free() is the normal free() and not realize something special is occurring when they copy it elsewhere without the macro.

Correct usage in simple cases is then:
void *p = malloc(size);
if (p)
{
  ... work with p ...
  insane_free(p);
}
else
{
  ... handle error scenario ...
}
If you think using a wrapper macro or function is overkill, and just always manually assigning the pointer to 0 after freeing is the way to go, consider that it's unwieldy to constantly do so, and you may forget to. If the above technique was always used, all use after free bugs would never have occurred in the first place.

Something else to be aware of is that there's nothing wrong with calling free(0). However, calling free() upon a pointer which is not null and not pointing to allocated memory is forbidden and will crash your program. So stick to the advice here, and you may just find memory management became significantly easier.

If all this talk of pointers is beyond you, consider acquiring Understanding and Using C Pointers.

sprintf() and asprintf()

If you do a lot of C programming, at some point, you probably have used the sprintf() function, or its safer counterpart snprintf().

These two functions sprintf() and snprintf() act like printf(), but instead of printing to the standard output, they print to a fixed-length string buffer. Now a fixed-length string buffer is great and all, but what if you wanted something which automatically allocated the amount it needed?

Enter asprintf(), a function which acts like sprintf(), but is auto-allocating, and needs no buffer supplied. This function was invented ages ago by GLIBC, shortly thereafter copied to the modern BSDs, and found its way further still into all sorts of libraries, although is not yet ubiquitous.

Let's compare the prototypes of the two:
int sprintf(char *buffer, const char *format, ...); 
int asprintf(char **ret, const char *format, ...);
The sanest approach would have been for a function like asprintf() to have the prototype of:
char *asprintf(const char *format, ...);
But its creators wanted to make it act like sprintf(), and its design can also be potentially more useful.

Instead of passing asprintf() a buffer, a pointer to a variable of type char * needs to be passed, like so:
char *buffer;
asprintf(&buffer, ...whatever...);
Now how asprintf() actually works is no big secret. The C99 standard specified that snprintf() upon failure should return the amount of characters that would be needed to contain its output. Which means that conceptually something along the following lines would be all that asprintf() needs to do:
char *buffer = malloc(snprintf(0, 0, format, data...)+1);
sprintf(buffer, format, data...);
Of course though, the above taken verbatim would be incorrect, because it mistakenly assumes that nothing can go wrong, such as the malloc() or snprintf() failing.

First let's better understand what the *printf() functions return. Upon success, they return the amount of characters written to the string (which does not include the trailing null byte). Or in other words, the return value is equivalent to calling strlen() on the data being output, which can save you needing to use a strlen() call with sprintf() or similar functions for certain scenarios. Upon failure, for whatever reason, the return is -1. Of course there's the above mentioned exception to this with snprintf(), where the amount of characters needed to contain the output would be returned instead. If during the output, the size overflows (exceeds INT_MAX), many implementations will return a large negative value (failure with snprintf(), or success with all the functions).

Like the other functions, asprintf() also returns an integer of the nature described above. Which means working with asprintf() should go something like this:
char *buffer;
if (asprintf(&buffer, ...whatever...) != -1)
{
  do_whatever(buffer);
  insane_free(buffer);
}
However, unlike the other functions, asprintf() has a second return value, its first argument, or what the function sees as *ret. To comply with the memory management discussion above, this should also be set to 0 upon failure. Unfortunately, many popular implementations, including those in GLIBC and MinGW fail to do so.

Since I develop with the above systems, and I'm using asprintf() in loops with multiple paths, it becomes unwieldy to need to pass around the buffer and the returned int, so I'd of course want saner semantics which don't leave dangling pointers in my program.

In order to correct such mistakes, I would need to take code from elsewhere, or whip up my own function. Now I find developing functions such as these to be relatively simple, but even so, I always go to check other implementations to see if there's any important points I'm missing before I go implement one. Maybe, I'll even find one which meets my standards with a decent license which I can just copy verbatim.

In researching this, to my shock and horror, I came across implementations which properly ensure that *ret is set to 0 upon failure, but the returned int may be undefined in certain cases. That some of the most popular implementations get one half wrong, and that some of the less popular get the other half wrong is just downright terrifying. This means that there isn't any necessarily portable way to check for failure with the different implementations. I certainly was not expecting that, but with the amount of horrible code out there, I guess I really shouldn't be surprised anymore.

Also in the course of research, besides finding many implementations taking a non-portable approach, many have problems in all sorts of edge cases. Such as mishandling overflow, or not realizing that two successive calls to a *printf() function with the same data may not necessarily yield the same results. Some try to calculate the length needed with some extra logic and only call sprintf() once, but this logic may not be portable, or always needs updating as new types are added to the format string as standards progress, or the C library decided to offer new features. Some of the mistakes I found seem to be due to expecting a certain implementation of underlying functions, and then later the underlying functions were altered, or the code was copied verbatim to another library, without noting the underlying functions acted differently.

So, once again, I'm finding myself needing to supply the world with more usable implementations.

Let's dive into how to implement asprinf().

Every one of these kind of functions actually has two variants, the regular which takes an unlimited amount of arguments, and the v variants which take a va_list (defined in stdarg.h) after the format argument instead. These va_lists are what ... gets turned into after use, and in fact, every non-v *printf() function is actually wrapped to a counterpart v*printf() function. This makes implementing asprintf() itself quite straight forward:



To fix the aforementioned return problems, one could also easily throw in here a check upon the correct return variable used in the underlying vasprintf() implementation and use it to set the other. However, that's not a very portable fix, and the underlying implementation of vasprintf() can have other issues as described above.

A straight forward implementation of vasprintf() would be:



As long as you have a proper C99 implementation of stdarg.h and vsnprintf(), you should be good to go. However, some systems may have vsnprintf() but not va_copy(). The va_copy() macro is needed because a va_list may not just be a simple object, but have handles to elsewhere, and a deep copy is needed. Since vsnprintf() being passed the original va_list may modify its contents, a copy is needed because the function is called twice.

Microsoft Visual C++ (MSVC, or Microsoft Vs. C++ as I like to think of it) up until the latest versions has utterly lacked va_copy(). This and several other  systems that lack it though usually have simple va_lists that can be shallow copied. To gain compatibility with them, simply employ:


#ifndef va_copy 
#define va_copy(dest, src) dest = src 
#endif

Be warned though that if your system lacks va_copy(), and a deep copy is required, using the above is a recipe for disaster.

Once we're dealing with systems where shallow copy works though, the following works just as well, as vsnprintf() will be copying the va_list it receives and won't be modifying other data.



Before we go further, there's two points I'd like to make.
  • Some implementations of vsnprintf() are wrong, and always return -1 upon failure, not the size that would've been needed. On such systems, another approach will need to be taken to calculate the length required, and the implementations here of vasprintf() (and by extension asprintf()) will just always return -1 and *ret (or *strp) will be 0.
  • The code if ((r < 0) || (r > size)) could instead be if (r != size), more on that later.
Now on Windows, vsnprintf() always returns -1 upon failure, in violation of the C99 standard. However, in violation of Microsoft's own specifications, and undocumented, I found that vsnprintf() with the first two parameters being passed 0 as in the above code actually works correctly. It's only when you're passing data there that the Windows implementation violates the spec. But in any case, relying on undocumented behavior is never a good idea.

On certain versions of MinGW, if __USE_MINGW_ANSI_STDIO is defined before stdio.h is included, it'll cause the broken Windows *printf() functions to be replaced with C99 standards compliant ones.

In any case though, Windows actually provides a different function to retrieve the needed length, _vscprintf(). A simple implementation using it would be:



This however makes the mistake of assuming that vsnprintf() is implemented incorrectly as it currently is with MSVC. Meaning this will break if Microsoft ever fixes the function, or you're using MinGW with __USE_MINGW_ANSI_STDIO. So better to use:



Lastly, let me return to that second point from earlier. The vsnprintf() function call the second time may fail because the system ran out of memory to perform its activities once the call to malloc() succeeds, or something else happens on the system to cause it to fail. But also, in a multi-threaded program, the various arguments being passed could have their data change between the two calls.

Now if you're calling functions while another thread is modifying the same variables you're passing to said function, you're just asking for trouble. Personally, I think that all the final check should do is ensure that r is equal to size, and if not, something went wrong, free the data (with insane_free() of course), and set r to -1. However, any value between 0 and size (inclusive), even when not equal to size means the call succeeded for some definition of success, which the above implementations all allow for (except where not possible Microsoft). Based on this idea, several of the implementations I looked at constantly loop while vsnprintf() continues to indicate that the buffer needs to be larger. Therefore, I'll provide such an implementation as well:



Like the first implementation, if all you lacked was va_copy(), and shallow copy is fine, it's easy to get this to work on your platform as described above. But if vsnprintf() isn't implemented correctly (hello MSVC), this will always fail.

All the code here including the appropriate headers, along with notes and usage examples are all packed up and ready to use on my asprintf() implementation website. Between everything offered, you should hopefully find something that works well for you, and is better than what your platform provides, or alternative junk out there.

As always, I'm only human, so if you found any bugs, please inform me.

Sunday, June 22, 2014

Avoid incorrect ChaCha20 implementations

ChaCha20 is a stream cipher which is gaining a lot of popularity of late. Practically every library today which provides ciphers seems to have it as an addition in their latest releases.

In cryptography, there are two kinds of ciphers, block ciphers and stream ciphers. Block ciphers are where the underlying algorithm works with data with a certain fixed chunk size (or block). Popular blocks sizes are 16 and 64 bytes. Stream ciphers are effectively block ciphers where the chunk size is a single byte.

Classical stream ciphers, such as RC4, can work with data of arbitrary size, although every single byte is dependent on every previous byte. Which means encryption/decryption cannot begin in the middle of some data, and maintain compatibility where some other starting point was used. Block ciphers generally can have their blocks encrypted and decrypted arbitrarily, with none dependent upon any other, however, they cannot work with data of arbitrary size.

In order to allow block ciphers to work with data of arbitrary size, one needs to pad the data to be encrypted to a multiple of the block size. However, a clever alternative is counter mode.

Different modes for working with block ciphers exist. Some try to improve security by making each block depend on every other, some utilize various interesting techniques for other properties.  Counter mode does not encrypt the desired data (the plaintext) directly, rather, an ever incrementing counter is encrypted. The result of this encryption is then xored with the desired data.

Counter mode effectively turns a block cipher into a stream cipher, as the plaintext is never actually passed to the block cipher. Rather, a counter which is a multiple of the block size is used. One can always xor bytes with an arbitrary size, and since that is the only step in counter mode against the plain text, it is effectively a stream cipher. Since the underlying cipher can be a block cipher with no dependency between blocks, this kind of stream cipher also allows one to jump ahead to any particular multiple of the block size in the data, and begin encryption/decryption from there.

Now while ChaCha20 is advertised as a stream cipher, it's actually designed as a block cipher in counter mode. The internal design mostly mirrors that of typical counter mode design, except that the counter components are directly fused with a large but simple block cipher. Since it's really a block cipher, it has an internal block size, and also allows one to jump ahead to some multiple of it.

Since ChaCha20 is considered to have a great level of security, and all these other wonderful properties, it's starting to see a lot of use. However, practically every implementation I'm seeing is either utterly broken, or has some ridiculous API.

Common ChaCha20 implementation mistakes:
  • Implemented as a typical block cipher, not allowing usage with arbitrary amounts of bytes, or worse, the API allows for it, but produces incorrect results.
  • Implemented as a typical stream cipher with no way to jump ahead.
  • Failing on Big-Endian systems.

The first mistake I listed is the most common. If some software is only using ChaCha20 internally, and always using it in a multiple of its block size (or it's all the crummy API offers), then things are fine. But if it's a library which is inviting others to use it, and it can be used incorrectly, expect disaster to ensue.

The reference implementation of ChaCha20 was designed that an arbitrary amount of data can be encrypted, as long as all but the last usage of the API was a multiple of the block size. This was also mentioned in its documentation. However, practically every other implementation out there copies this design in some way, but makes no note of it. Worse yet, some libraries are offering ChaCha20 with this implementation flaw alongside other stream ciphers with an identical API whereas those can be used arbitrarily throughout.

Essentially, this means if you're using ChaCha20 right now in a continuous fashion with chunks of various sizes, your data is being encrypted incorrectly, and won't be interoperable with other implementations. These broken implementations are able to output exactly one chunk correctly which is not a multiple of the block size, which destroys their internal buffers, and screws up every output thereafter.

I noticed a similar situation with hash algorithm implementations several years back. However, most hash implementations are fine. Yet with ChaCha20, practically every implementation I looked at recently was broken.

Since this situation cannot stand, especially with ChaCha20 gaining speed, I am providing a simple implementation without these flaws. This implementation is designed to be correct, portable, and simple. (Those wanting an optimized version of this should consider paying for more optimized routines)

Usage of the C99 API I designed is as follows:

Custom type: chacha20_ctx
This type is used as a context for a state of encryption.

To initialize:
void chacha20_setup(chacha20_ctx *ctx, const uint8_t *key, size_t length, uint8_t nonce[8]);

The encryption key is passed via a pointer to a byte array and its length in bytes. The key can be 16 or 32 bytes. The nonce is always 8 bytes.

Once initialized, to encrypt data:
void chacha20_encrypt(chacha20_ctx *ctx, const uint8_t *in, uint8_t *out, size_t length);

You can pass an arbitrary amount of data to be encrypted, just ensure the output buffer is always at least as large as the input buffer. This function can be called repeatedly, and it doesn't matter what was done with it previously.

To decrypt data, initialize, and then call the decryption function:
void chacha20_decrypt(chacha20_ctx *ctx, const uint8_t *in, uint8_t *out, size_t length);

For encryption or decryption, if you want to jump ahead to a particular block:
void chacha20_counter_set(chacha20_ctx *ctx, uint64_t counter);

Counter is essentially the number of the next block to encrypt/decrypt. ChaCha20's internal block size is 64 bytes, so to calculate how many bytes are skipped by a particular counter value, multiply it by 64.

In addition to just providing a library, I gathered the test vectors that were submitted for various RFCs, and included a series of unit tests to test it for correctness.

For fun, since I'm also playing around a bit with LibreSSL these days, I wrapped its API up in the API I described above. The wrapper is included in my package with the rest of the code, however it is currently not designed for serious usage outside of the included test cases.

Since I already whipped up some unit tests that anyone can use, I'll leave it as an exercise to the reader to determine which libraries are and aren't implemented correctly.

I tried to ensure my library is bug free, but I am only human. If you find a mistake, please report it.