Monday, March 26, 2007

Methods for safe string handling

Every now and then you hear about a buffer overflow being discovered in some program. Immediately, everyone jumps on the story with their own take on the matter. Fans of a different programming language will say: "Of course this wouldn't have happened if you'd used my programming language". Secure library advocates will say: "You should have used that library instead". And experts in that language will say: "The guy was an idiot, he coded that all wrong".

I'd like to look at basic "C string" handling in C. We're talking about functions like strlen(), strcpy(), strcat(), and strcmp(). These functions report length, copy, concatenate, and compare C strings respectively. A C string is an array of characters, which is terminated with a null character to signify the end. strlen() would find the length by seeing how far from the beginning the null was. strcat() would find the null in the first C string, and then to that location copy characters from the second string, till it finds the null. So on and so forth with all the C string functions.
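To make the "walk until the null" idea concrete, here's a rough sketch of how such functions could be written. The names sketch_strlen() and sketch_strcat() are made up for illustration so they don't clash with the real library functions, and real implementations are usually far more optimized than this.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */

/* Count characters up to, but not including, the terminating null. */
size_t sketch_strlen(const char *s)
{
  const char *p = s;
  while (*p)
    p++;
  return (size_t)(p - s);
}

/* Find the null in dest, then copy src there, null included.
   Like strcat(), this does no bounds checking at all. */
char *sketch_strcat(char *dest, const char *src)
{
  char *d = dest + sketch_strlen(dest);
  while ((*d++ = *src++))
    ;
  return dest;
}
```

Notice that neither function has any idea how big the underlying arrays are, which is exactly where the trouble starts.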

Now these functions are seen by some as inherently broken, since they can read or write data right off the end of the buffer, and the terminating nulls can sometimes vanish if one isn't careful. You see these kinds of issues in C++ programs too. Usually when a Java programmer hears about this, they tell you to use Java, since Java has a built-in string class which can't get screwed up by any of these simple operations. Learned C++ programmers know that C++ also has a built-in string class, just as good as, if not better than, Java's. Knowledgeable C++ programmers use C++'s strings to avoid all these issues, and get really nice features as well, such as sane string comparison using ==.

Switching from C++ to Java just because Java has a string class is kind of ridiculous, yet I keep hearing from all kinds of people that programmers should switch to Java because of it. Anyone using C++ should generally use the C++ string class, unless they have a good reason to do otherwise. It's a shame most C++ programmers use C strings instead of C++ ones, probably because they don't know about the class. However, none of this helps when programming in pure C.

For pure C, you can turn to one of the libraries for handling strings, such as SafeStr. But most people won't choose to go this route. Due to this, care has to be taken.

Now I won't kid you, if there's a buffer overflow in a C program, it is the programmer's fault. If s/he had been more careful, it wouldn't have happened. However, some areas of code are large, have many code paths, and are downright confusing. In those cases, it's easy to screw up. To help prevent screwing up, there exist some more C functions for handling strings properly, and some C libraries also provide extra non-standard functions.

In answer to just overflowing a buffer when copying or concatenating, "n" versions are provided. These are strncpy() and strncat(). They take a third parameter to tell them how much they're dealing with. strncpy()'s n refers to how big the buffer is, and it'll copy at most that many characters. strncat()'s n is the maximum number of characters that can be stuck onto the end of the first string. In the case of strncpy(), if no null is found within the first n characters of the source, the result is not null terminated, which leaves us with the other problem. strncat(), on the other hand, always null terminates, because it writes n+1 bytes whenever the second string's end isn't reached. This means you have to pass strncat() the number of bytes remaining in the buffer, minus 1. This leads to confusion, because the two n's mean different things, and because strncpy() introduces another problem.
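Here's a small sketch of the two different meanings of n. The helper names is_terminated() and append_bounded() are my own inventions for illustration, and append_bounded() assumes dest is already null terminated.

```c
#include <string.h>

/* Returns 1 if the first size bytes of buf contain a null, else 0.
   Handy for checking what strncpy() left behind. */
int is_terminated(const char *buf, size_t size)
{
  return memchr(buf, 0, size) != NULL;
}

/* Append src to dest, where size is the total capacity of dest.
   strncat()'s n is the room left for characters, not counting the
   null it always writes, hence the -1. */
char *append_bounded(char *dest, const char *src, size_t size)
{
  size_t used = strlen(dest);
  if (used + 1 < size)
    strncat(dest, src, size - used - 1);
  return dest;
}
```

With a 4 byte buffer, strncpy(buf, "abcdef", 4) fills all 4 bytes and is_terminated(buf, 4) reports 0, while copying "ab" leaves a terminated result.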

There's also strncmp(), which takes the maximum number of characters to compare, and which you can use if one of the strings isn't null terminated. Surprisingly however, standard C has no strnlen() to check how long a string is without running off the end. Considering that strncpy() doesn't always null terminate, it sounds like a useful function to have.

Taking this into account, some C libraries provide strnlen(), which returns the length, or returns n if no null was found within the first n characters. For those who need it, it's an easy function to implement yourself:
//Needs <string.h> for memchr()
size_t strnlen(const char *s, size_t n)
{
  const char *p = (const char *)memchr(s, 0, n);
  return(p ? (size_t)(p-s) : n);
}

It would be wise to follow up every call to strncpy(s, src, n) with an "s[n-1] = 0" to terminate it yourself, but this hardly helps the confusion. Also take into account that strcpy(), strncpy(), strcat(), and strncat() all return the destination, so you'll sometimes see code like:
if (!strcmp(strncpy(buf, entered_text, sizeof(buf)), param))

However, this code is broken as buf may not be null terminated. A correct version would perhaps be:
if (!strncmp(strncpy(buf, entered_text, sizeof(buf)), param, min(sizeof(param), sizeof(buf))))

Which is ugly at best.
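A less clever but easier to read alternative is to split the copy and the compare into separate steps. This is only a sketch under some assumptions: copy_and_match() is a made-up helper name, and both src and param are assumed to be null terminated.

```c
#include <string.h>

/* Copy src into buf (capacity size), truncating if needed, and report
   whether the stored copy equals param. Returns 1 on a match, 0 otherwise.
   Assumes src and param are null terminated. */
int copy_and_match(char *buf, size_t size, const char *src, const char *param)
{
  if (size == 0)
    return 0;
  strncpy(buf, src, size);
  buf[size - 1] = 0;  /* strncpy() may have left the null out */
  return strcmp(buf, param) == 0;
}
```

Since buf is terminated by hand before the compare, plain strcmp() is safe here. Note that a truncated copy will simply fail to match the full text.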

To solve and work around these issues, OpenBSD invented the strlcpy() and strlcat() functions, which have since been implemented in the other BSDs, as well as in Solaris and Mac OS X. Manpage here.

I found the standard descriptions confusing at best, though. Here's my take on it after some study:
size_t strlcpy(char *dest, const char *src, size_t size);

    The strlcpy() function copies the C string pointed to by src (including the null) to
the array pointed to by dest. However, not more than size bytes of src are copied. Meaning
at most size-1 characters will be copied. The copy will always be null terminated, unless
size was 0, in which case nothing is done to dest.

Return Value:
    Return is always the amount of characters needed to hold the copy. Meaning strlen(src).
If the return value is <size, everything was copied.

With strlcpy(), you can run it once with a size of 0 to find out how much you need to allocate, although that's pointless, since you'd be better off with strlen(). Now this won't help much if src isn't null terminated, but it should avoid the issues you get from misusing the return value, as in the case shown above, or when there isn't enough room. If you always pass strlcpy() the sizeof() of the buffer, or the value passed to malloc() as the case may be, you should be safe.
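To show the behavior described above on systems that lack strlcpy(), here's a minimal sketch written from my description, not taken from the OpenBSD source. The names sketch_strlcpy() and copy_fits() are hypothetical, and src is assumed to be null terminated.

```c
#include <string.h>

/* Copies at most size-1 characters and always terminates, unless size is 0.
   Returns strlen(src), the length the full copy would need. */
size_t sketch_strlcpy(char *dest, const char *src, size_t size)
{
  size_t len = strlen(src);
  if (size != 0) {
    size_t n = (len < size) ? len : size - 1;
    memcpy(dest, src, n);
    dest[n] = 0;
  }
  return len;
}

/* Returns 1 if src fit into dest without truncation, 0 otherwise. */
int copy_fits(char *dest, const char *src, size_t size)
{
  return sketch_strlcpy(dest, src, size) < size;
}
```

The truncation check is just a comparison of the return value against the size you passed in, which is much harder to get wrong than the strncpy() dance.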

If you read the manpage, you also see a usefulness to the return value. A problem with constantly using strcat() is that you have to keep iterating through the former strings, leading to a speed loss. With strlcpy(), you can do the following for concatenation:

if ((n1 = strlcpy(a, b, sizeof(a))) < sizeof(a))
  if ((n2 = n1 + strlcpy(a+n1, c, sizeof(a)-n1)) < sizeof(a))
    if ((n3 = n2 + strlcpy(a+n2, d, sizeof(a)-n2)) < sizeof(a))
      ; //b, c, and d all fit; n1, n2, and n3 are running totals

A nice trick for mass concatenation, although as the manpage points out, it's ridiculous, and negates the point of strlcat().

Moving onwards, the manpage listed above for strlcpy() is also for strlcat(), yet as above, I found it a bit confusing too. Here's my take:

size_t strlcat(char *dest, const char *src, size_t size);

    The strlcat() function appends the src string to the dest string, overwriting the '\0'
character at the end of dest, and then adds a terminating '\0' character. However, not more
than size-strlen(dest) bytes of src are copied, null included. Meaning a maximum of size-1
characters will fill dest in the end. The result will always be null terminated, unless size
is not greater than the length of dest, or dest is not null terminated within size bytes, in
which case nothing is done to dest.

Return Value:
    Return is the amount of characters needed to hold the copy when dest initially is null
terminated and its length is less than size. Otherwise the return is size+strlen(src). If
the return value is <size, everything was copied.

What's nice about strlcat() is that for the size param, you can pass it the sizeof() or the malloc() value like you do for strlcpy(). But beware the return value, the OpenBSD code is rightly commented as follows: "Returns strlen(src) + min(siz, strlen(initial dst))". Take a moment to comprehend that.
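To help comprehend that return value, here's a minimal sketch that follows my description above. It is not the OpenBSD source, sketch_strlcat() is a made-up name, and src is assumed to be null terminated.

```c
#include <string.h>

/* Appends src to dest, where size is the total capacity of dest.
   Returns strlen(src) + min(size, strlen(initial dest)). */
size_t sketch_strlcat(char *dest, const char *src, size_t size)
{
  /* Look for dest's null, but never past size bytes. */
  const char *end = memchr(dest, 0, size);
  size_t dlen = end ? (size_t)(end - dest) : size;
  size_t slen = strlen(src);

  if (dlen < size) {
    size_t room = size - dlen - 1;  /* characters that still fit */
    size_t n = (slen < room) ? slen : room;
    memcpy(dest + dlen, src, n);
    dest[dlen + n] = 0;
  }
  return dlen + slen;
}
```

When no null is found within size bytes, dlen ends up equal to size, nothing is written, and the return value is size+strlen(src), which can never be less than size. That's how the caller can tell something went wrong.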

If you're not using a BSD and you want these functions, code is here and here. Be wary of some of the other implementations you find online. I looked at some of them, and they acted differently in some other corner cases. One I looked at even crashed in one of the corner cases.

However looking at that code there, it looks a bit messy. Reviewing our previous multiple concatenation case, which is also spoken about in the manpage, one sees these as a bit weak. If one wants nice multi concat without too much fuss, they'd normally use snprintf() (C99) with a bunch of "%s%s%s" as the format. I myself though prefer a more elegant solution to all of this.
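For reference, the snprintf() approach could look something like this sketch. join3() is a hypothetical helper name, and it assumes a C99 snprintf() that returns the length the full result would have needed.

```c
#include <stdio.h>

/* Concatenate three strings into dest (capacity size) with snprintf().
   Returns 1 if everything fit, 0 if the result was truncated or an
   error occurred. */
int join3(char *dest, size_t size, const char *a, const char *b, const char *c)
{
  int needed = snprintf(dest, size, "%s%s%s", a, b, c);
  return needed >= 0 && (size_t)needed < size;
}
```

It works, but the format string has to be kept in sync with the number of strings by hand.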

I have therefore created the following logical extension of OpenBSD's "l" functions. I give you strlmrg():
#include <stdarg.h>
#include <string.h>

size_t strlmrg(char *dest, size_t size, ...)
{
  char *s, *end = dest + (size-1);
  size_t needed = 0;

  va_list ap;
  va_start(ap, size);

  while ((s = va_arg(ap, char *))) //A null pointer ends the list
  {
    if (s == dest) //Source is the current write position, skip over it
    {
      size_t n = strnlen(s, (end+1)-s);
      needed += n;
      dest += n;
    }
    else
    {
      needed += strlen(s);
      if (dest && (dest < end))
      {
        while (*s && (dest < end))
          *dest++ = *s++;
        *dest = 0;
      }
    }
  }

  va_end(ap);
  return(needed);
}


Pass strlmrg() the destination buffer, its size (from sizeof() or the param to malloc()), and all the strings you want concatenated, followed by a null pointer.
Example 1:
printf("%zu; %s\n", strlmrg(line, sizeof(line), "I ", "Went ", "To ", "The ", "Park.", (char *)NULL), line);

It would print: "19; I Went To The Park."
Example 2:
n = strlmrg(buffer, sizeof(buffer), a, b, c, d, e, f, (void *)0);

Which would concatenate a through f inside buffer (given that they fit), and return the number of characters in the full result. Note that it returns how many characters would be copied given unlimited room, not how many actually were, so you can use it to determine the size needed. See this example:
size_t n = strlmrg(0, 0, a, b, c, (void *)0);
char *p = malloc(n+1); //+1 for the null
strlmrg(p, n+1, a, b, c,(void *)0); //Again, +1 for the null

When strlmrg() returns less than size, everything was merged in. The result is always null terminated, except when dest is null, size is 0, or one of the source pointers matches the location it is currently trying to copy to.
You should avoid passing a source pointer that points into the destination buffer. If you do pass in such an overlapping source pointer, and it's not null terminated before reaching size, you will get size as the return value instead of the full size. Also, don't pass it any source string that isn't null terminated, and don't forget to pass the final null pointer.

Once we have strlmrg() implemented, it also paves the way for a simple and straightforward implementation for strlcpy() and strlcat().
size_t strlcpy(char *dest, const char *src, size_t size)
{
  return(strlmrg(dest, size, src, (void *)0));
}

size_t strlcat(char *dest, const char *src, size_t size)
{
  return(strlmrg(dest, size, dest, src, (void *)0));
}

And unlike the official ones, these won't crash if dest or src is null. I tested these wrappers, and they seemed to match results with the official ones in every regular and edge case I tried.

I also tested strlmrg() in a variety of cases, and it seems to be very good and secure. If you find a bug, or have an improvement to offer, feel free to post about it.


