Wednesday, March 28, 2007


Quicksort



Sorting is something most programs have to do on some occasion. Quicksort is accepted as the best in-practice general purpose sorting algorithm available today. Standard C offers a built in Quicksort under the function name qsort(), with the following prototype:


void qsort(void *base, size_t nmemb, size_t size, int(*compar)(const void *, const void *));


Now while it may be built in, each C library implements it differently. Also, it's designed to work with any kind of type/structure, however one can generally optimize it a bit more for the kind that they're dealing with. Since sorting is important, and can take a while with many items, one generally wants their sorting to be optimized as much as possible.

I've looked at several implementations for various C libraries out there, however they all have one thing in common - messy. One I looked at didn't even implement Quicksort, but a Shakersort.

I'm wondering perhaps if some of the clever people who read this blog can come up with a clean but fast implementation.

To start off, here's a mostly unoptimized simple implementation of the standard qsort():


static void swap_internal(char *a, char *b, size_t size)
{
if (a != b)
{
char t;
while (size--)
{
t = *a;
*a++ = *b;
*b++ = t;
}
}
}

static void qsort_internal(char *begin, char *end, size_t size, int(*compar)(const void *, const void *))
{
if (end > begin)
{
char *pivot = begin;
char *l = begin + size, *r = end;

while (l < r)
{
if (compar(l, pivot) <= 0)
{
l += size;
}
else
{
r -= size;
swap_internal(l, r, size);
}
}
l -= size;
swap_internal(begin, l, size);
qsort_internal(begin, l, size, compar);
qsort_internal(r, end, size, compar);
}
}

void qsort(void *base, size_t nmemb, size_t size, int(*compar)(const void *, const void *))
{
qsort_internal((char *)base, (char *)base+nmemb*size, size, compar);
}


Note, this implementation is ~40 lines and can be improved. It can be improved with a better pivot selection, removing the recursion, and throwing in an insertion sort for certain cases.

I'm challenging my readers to give me a good clean implementation, which is under 80 lines (bad indentation doesn't count, I'll be running submitted code through my formatter), compliant, and isn't a total wreck. Creating some helper functions to make it clearer is a good idea too. Almost every implementation I've seen makes use of gotos to all over the place, and has returns smack in the middle of the function.

It'd be nice to bring something so useful into the new millennium, instead of using messy or slow implementations from the 70s.

If I get some nice implementations in, I'll do another piece on how to optimize Quicksort for sorting certain types/structures.

2 comments:

Aaron said...

Don't bother. qsort() is going to be clunky and slow because compilers can't optimise out the very painful indirect call (read pipeline stall) through the comparator.

I'm not sure why you want an implementation that looks "clean." It's not like you actually need to look at and review the sorting algorithm itself on a regular basis. Real sorting algorithm implementations aren't clean, because there are a lot of complications. There's a lot of special cases you need to watch out for in a real implementation, and each of these will require adding various "ugly" bits in.

Besides what you mentioned, it's also important to always sort the smaller partition first, because this ensures the stack space will be bounded.

Regarding removing explicit recursion for a stack, this is a common optimisation in some implementation, but it often actually makes things slower, particularly on newer chips. It also means we need to allocate memory, either with C99 VLAs (poorly supported), alloca() (non-standard), or malloc/new (might fail).

Another issue with simple implementations is that what you probably want in a high quality implementation is really an 'introspective sort;' something that will fall back to a heapsort when qsort is taking too long. This eliminates a very serious failure mode, one that can lead to security problems, at only the cost of increased code size.

Well, in other words, I have a suspicion that you're going to have serious difficult beating a high quality qsort() implementation, which has been tuned for many man-hours by experts, but any modest improvements you make will be negated by the poor comparator performance. As an example of such a high-quality qsort(), see glibc's qsort().

C++ offers many advantages here, which you may or mat be interested in. For example, C++ std::sort() is mostly the same as qsort(), but eliminates the comparator problem, as well as the type safety issue. It is difficult to make a qsort implementation, in C or C++, which will beat a high quality compiler's std::sort().

So, as another example of a good qsort implementation, which I think is decently clean, is the one in libstdc++-v3 for std::sort(). The main annoyance is the mangling necessary to stay inside the standard namespace and avoid keonig lookup problems, but it's not that big of a deal.

Alternately, a lot of people don't realize that a radix sort beats the socks off of qsort in most real cases. The reason you don't see it as much is that its arguably harder to generalize, as its not comparison-based. The main advantage is that radix sort is O(n) average and worst case, whereas qsort is O(nlogn) average and O(n**2) worst case. However, radix sort can be applied to most data sets that a comparison-based sort can. The difficulty in generalizing the radix sort is a serious problem though, because radix sort can get just as hairy as qsort.

Anyway, real problems are difficult, and have complicated solutions; this requires complicated and complex code. Simple solutions look good on paper, but not always so good in the real world.

insane coder said...

Aaron:
Don't bother. qsort() is going to be clunky and slow because compilers can't optimize out the very painful indirect call (read pipeline stall) through the comparator.

True, but you can get rid of it in your own implementation.


I'm not sure why you want an implementation that looks "clean." It's not like you actually need to look at and review the sorting algorithm itself on a regular basis. Real sorting algorithm implementations aren't clean, because there are a lot of complications. There's a lot of special cases you need to watch out for in a real implementation, and each of these will require adding various "ugly" bits in.

That's where you're wrong, there are many cases where you need to specialize the sorting. Say for example you have multiple arrays where all your data is associated with each other. Perhaps one array contains first names, another last names, another ID numbers and so on. These should be in a structure, but for whatever reason, they are independent arrays. If you want to sort them, you need to modify your quick sort to check just one of these arrays for order, but for the swapping part to swap in each of these arrays. Due to this specialization needed at times, one should have a clean quick qsort handy which they can modify as needed.


It also means we need to allocate memory, either with C99 VLAs (poorly supported), alloca() (non-standard), or malloc/new (might fail).

Not so.
Take this for example:

static void qsort_internal(char *begin, char *end, size_t size, int(*compar)(const void *, const void *))
{
if (begin < end)
{
char *pivot = partition(begin, end, size, compar);
qsort_internal(begin, pivot, size, compar);
qsort_internal(pivot+size, end, size, compar);
}
}

Can be replaced with:

static void qsort_internal(char *begin, char *end, size_t size, int(*compar)(const void *, const void *))
{
while (begin < end)
{
char *pivot = partition(begin, end, size, compar);
qsort_internal(begin, pivot, size, compar);
begin = pivot+size;
}
}

And once we reach this point, the other qsort call can be passed off to another CPU thus parallelizing the code making it faster on dual core / dual CPU machines, or replaced with an internal stack of set size. Once the looping mechanism is in place, a static array of log(sizeof(maximum_array)) is all that is needed, meaning you can do stack_t stack[32]; at the top of your qsort() for a machine limited to 4GB of memory.


Another issue with simple implementations is that what you probably want in a high quality implementation is really an 'introspective sort;' something that will fall back to a heapsort when qsort is taking too long. This eliminates a very serious failure mode, one that can lead to security problems, at only the cost of increased code size.

I mentioned that in my article.


Well, in other words, I have a suspicion that you're going to have serious difficult beating a high quality qsort() implementation, which has been tuned for many man-hours by experts, but any modest improvements you make will be negated by the poor comparator performance. As an example of such a high-quality qsort(), see glibc's qsort().

I already beat several high quality qsort() implementations which is a story for a different time. glibc's is not quite the best out there. Here's results of a test I ran:

Sun 406000
Microsoft 448000
AT&T 589000
glibc 590000
Insane 594000
IBM 603000
BSD 1993 809000
Syslinux 1342000
uClibc 1547000
BSD 1990 2006000
BSD 1983 2277000
Simple 3973000
diet libc 10809000

Simple is the one I offered above, which beat out diet libc's. Insane is what I'm currently working on. I would like more feedback from experts to get some new ideas so I can tune my ideas better and beat even Sun's. I've already employed several techniques which I will discuss in a later article to get around limitations your straight forward high quality qsort()s run into.


C++ offers many advantages here, which you may or mat be interested in. For example, C++ std::sort() is mostly the same as qsort(), but eliminates the comparator problem, as well as the type safety issue. It is difficult to make a qsort implementation, in C or C++, which will beat a high quality compiler's std::sort().

Yes this I know, std::sort() is blazingly fast, and I use it in C++, it's not an option in C though, and those programs and programmers still exist. There are also times where you need specialization like I explained above.


Alternately, a lot of people don't realize that a radix sort beats the socks off of qsort in most real cases. The reason you don't see it as much is that its arguably harder to generalize, as its not comparison-based. The main advantage is that radix sort is O(n) average and worst case, whereas qsort is O(nlogn) average and O(n**2) worst case. However, radix sort can be applied to most data sets that a comparison-based sort can. The difficulty in generalizing the radix sort is a serious problem though, because radix sort can get just as hairy as qsort.

I may discuss radix sort more a different time.


Anyway, real problems are difficult, and have complicated solutions; this requires complicated and complex code. Simple solutions look good on paper, but not always so good in the real world.

Yes I know, I've spent much time working on my own implementations of qsort() and studying other ones, and reading the existing papers on the subject. What I want is to see others who have done similar work to elaborate on what conclusions they came to. I myself plan on writing a future article about some optimizations I came up with, and perhaps provide a new extremely fast, but clean and easy to modify implementation.