Saturday, March 27, 2010

Does anyone understand types and magnitudes?

Regularly I have to work with many popular libraries out there, as well as many libraries written by coworkers and similar.

One thing which I see constantly misused over and over again is the type system used in C and C++.

I wonder if most people understand it, have bothered to contemplate it, or simply don't care if their library is misprogrammed trash.

First of all, C/C++ for integers supports both signed and unsigned types. Many programmers seem to ignore the unsigned types altogether. Maybe some education is in order.

Each base type in C consists of one or more bytes. Each byte on modern main stream systems are 8 bits to a byte. Meaning a single byte can store 28 values, or in other words 256 possibilities. Two bytes can store 216 values, or in other words 65,536 possibilities.

A type designed for integers (whole numbers) which is unsigned will operate with the meaning of its values as ranging from 0 till amount of possibilities - 1. In the case of a one byte type: 0 to 255, for a 2 byte type: 0 to 65,535.

When a value is signed (the usual default if no signed or unsigned description is applied), a single bit is used to describe whether the rest of the bits represent a positive or negative value. This effectively cuts the amount of positive values that can be represented in half. In the case of a single byte, 27 values will be negative, -128 to -1, 27-1 values will be distinctly positive, 1 to 127, and one value will be 0 (for a total of 256 possibilities). This same math extends to as many bytes being used in a given type.

When dealing with something which requires working with both positive and negative values such as debt (monetary balance), direction, difference, relative position, and things of a similar nature, signed values is quite natural and should of course be used.

But if you're working with something which has no meaning for negative values such as "Number of students in this class", "How old are you", "Amount of chickens in the coop", "Amount of rows in database table", or "Amount of elements in this array", or amounts of any nature really, using a signed type is one of the worst things you could possibly do. You are reserving half of your possible values for a situation which will never occur (barring mistakes in programming).

Let's take a look at some real world examples. The most popular type used for at least the past decade would have to be the 32 bit integer. 232 gives us 4,294,967,296 different value possibilities. If that was a storage size, it would be 4GB. Effectively a 32 bit unsigned integer can store the values 0 to 4,294,967,295, or one byte less than 4GB. If it was signed, it would be able to store the values -2,147,483,648 to 2,147,483,6487, topping out at one less than 2GB.

If you have an Operating System or File System which limits you to 2GB as a max size for files, or your Digital Camera only lets you use up to 2GB flash cards, it's because the idiots that programmed it used signed values. They did this because either they were expecting files and flash cards of negative size, or because they were a bunch of idiots. Which of those two possibilities actually happened should be obvious.

The same goes for any programming library you see with a limit of 2GB. As soon as you hear anyone mention a limitation of 2GB, what should immediately register in your head is that idiots worked on this application/library/standard/device.

Now sometimes programmers try to justify their ignorance or their idiocy with remarks such as using negative values to indicate error conditions. Meaning, they write functions which return a positive value in case of success, and -1 in case of error. This is a very bad practice for two reasons. First of all, one shouldn't use a single variable to indicate two different things. It's okay to use a single variable where each bit (or sets of bits) indicate a flag for a set of flags. But it's very wrong to have a single value indicate success/failure and amount, or day of week and favorite color, or any two completely unrelated things like that. Doing so only leads to sloppy code, and should always be avoided. Second of all, you're cutting your amount of usable possibilities in half, and reserving a ton of values just for a single possibility. You can't be more wasteful than that.

If you need a way for your function to return success/failure and amount, there are cleaner and more effective ways to do so. Many times an amount of 0 is invalid. If a function is supposed to return the amount of something, 0 isn't really an amount. If I asked how many students are in my class, and I get 0, that in itself should be indicative of an error. If I asked how many rows did my SQL statement just change, and I get 0, that should be indicative of an error, or at the very least, there were no SQL rows to change. If I really need to know if the 0 means a real error, or there was no data to work with, a separate function can tell me if the last call was a success or failure. In C and in POSIX, it is common to use the global variable errno to indicate errors. If you're making your own library, you can make a function to tell you why the last value was 0. There was no data, there was an error accessing the data, and so on.

Another method would be to pass one parameter by pointer/reference to store the amount or success state, and have the other returned from the function. Another technique would be to return an enum which tells you which error condition happened or everything is okay, and use a different function to retrieve the value of that operation upon success. Lastly, C++ programmers can use std::pair<> for all their multiple return needs, or simply return the amount normally, and throw an exception on any kind of error.

Now that hopefully all my loyal readers are now more educated about the sign of their types, and various function return techniques that better libraries use, let's talk a bit about magnitude.

I see again and again libraries that don't have a clue about magnitude. The various C/C++ standards state the following about the normal built in types.

1) The amount of bytes used for the following types should be in this proportion: short <= int <= long <= long long.
2) short should be at least 2 bytes.
3) long should be at least 4 bytes.
4) long long should be at least 8 bytes;.

Which means all these scenarios are perfectly legal:
sizeof(short) == 2
sizeof(int) == 2
sizeof(long) == 4
sizeof(long long) == 8

sizeof(short) == 2
sizeof(int) == 4
sizeof(long) == 4
sizeof(long long) == 8

sizeof(short) == 2
sizeof(int) == 4
sizeof(long) == 8
sizeof(long long) == 8

sizeof(short) == 2
sizeof(int) == 4
sizeof(long) == 8
sizeof(long long) == 16

sizeof(short) == 4
sizeof(int) == 4
sizeof(long) == 8
sizeof(long long) == 16

sizeof(short) == 4
sizeof(int) == 8
sizeof(long) == 16
sizeof(long long) == 32

And so on.

In order to get types of a particular size, C99 added the header <stdint.h> and C++-201x added <cstdint>, with types such as int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, and several others. Each of these types are the size in bits that their name indicates, with those without a u at the beginning being signed, and those with being unsigned. Using these types you can get the exact size you need.

But remember to use things appropriately. I see several popular SQL engines for example that support as many rows in a table as a uint64_t is able to contain, but if you ask them how many rows were altered by the last query, they'll report that information in a plain and simple int! You by now should realize there are two problems with such a course of action.

Now lastly, if you're writing some kind of function which takes an array, it is normal convention to pass a pointer to the array, as well as a size variable to specify how many elements or bytes are within the array.

The standard C library uses a type called size_t to deal with such values. A size_t can contain the highest amount of bytes the program is able to address, and is of course unsigned.

If you need to pass an array, your prototype should always be along the lines of:

void func(void *p, size_t size);
void func(uint8_t *p, size_t size);

If you need to return a size, for something which is addressable memory, again always use a size_t. Functions like strlen() and sizeof() return a size_t, and functions like memcpy(), memset(), or malloc() take a size_t as one of their arguments.

Let me show you a quick program to illustrate a point.


#include <cstdio.h>

#define O(x) printf(#x": %zu\n", sizeof(x))
int main()
{
O(signed char);
O(signed short);
O(signed int);
O(signed long);
O(signed long long);
O(signed);
O(unsigned char);
O(unsigned short);
O(unsigned int);
O(unsigned long);
O(unsigned long long);
O(unsigned);
O(size_t);
O(void *);
O(float);
O(double);
O(long double);
return(0);
}


On a 32 bit system of mine, the output is as follows:

signed char: 1
signed short: 2
signed int: 4
signed long: 4
signed long long: 8
signed: 4
unsigned char: 1
unsigned short: 2
unsigned int: 4
unsigned long: 4
unsigned long long: 8
unsigned: 4
size_t: 4
void *: 4
float: 4
double: 8
long double: 12


On a 64 bit system of mine, the output is as follows:

signed char: 1
signed short: 2
signed int: 4
signed long: 8
signed long long: 8
signed: 4
unsigned char: 1
unsigned short: 2
unsigned int: 4
unsigned long: 8
unsigned long long: 8
unsigned: 4
size_t: 8
void *: 8
float: 4
double: 8
long double: 16


Now notice for 32 bit, the void * type is 4 bytes, and on 64 bit, it is 8 bytes. This means that a void * can address any position in memory that the system is able to. In order to specify the amount, you'll notice in each case the sizeof(size_t) matches the sizeof(void *). If I were to get a 128 bit system which had a void * of 16 bytes, the size_t would also be at least 16 bytes.

There are other types which also matched the size of the void * each time around, but if you look at my earlier explanation of the relationship of the various C types, you'll notice those other types are too variable to know if they'll be good for an amount of every system you try to use your application or library on, so always stick with the size_t.

All too often I see programmers use an int for such an amount which is clearly wrong, or an unsigned int, which is better, but also wrong, especially on my 64 bit system. Some programmers I also see use the type of just unsigned by itself. I don't know who thought that up, but it's identical to an unsigned int, which is also clearly wrong. I included unsigned and signed specifically in my test above because some people are under the mistaken notion that those types become as large as possible. Every time I see code which contains one of them, I want to vomit.

If you ever have any doubt as to what size a type may be on your system, or get into an argument with someone, test things yourself with a sizeof() call, it's not hard to do, or point them to this article.

Now go out there and write some good libraries. I have too much vomit and bile on the floor here from all the horrible code I have to work with or review.

Thursday, March 11, 2010

C++-201x Variadic Templates

C++-201x (or C++-0x as near sighted individuals like to call it) introduced a new feature called variadic templates, which allows a template class or function to recieve multiple parameters to it of various types as you would normally achieve with "...".

You could continue to use stdarg if you wish, but a limitation of it is that you have no way of knowing what type each parameter is to use with va_arg(), unless you just hard code it to only work with a specific type, or pass a type list how the printf() family of functions do it, or something similar.

I always wanted a way to do func(integer, string, float, whatever), and interchange that with func(float, whatever, char), and now you can.

If you try to research this topic online, basically everyone is just writing and talking about how to implement a tuple class, or how to make a type safe printf. Lets try something a little more interesting and perhaps instructive, yet simple.


#include <iostream>
using namespace std;

//Output function to output any type without type specifiers like printf() family
template <typename T, typename ...P>
void output(T t, P ...p)
{
cout << t << ' ';
if (sizeof...(p)) { output(p...); }
else { cout << '\n'; }
}

//Since variadic templates are recursive, must have a base case
void output() { cout << '\n'; }


//Compute sum of all parameters
template <typename T, typename ...P>
T sum(T t, P ...p)
{
if (sizeof...(p))
{
t += sum(p...);
}
return(t);
}

//Since variadic templates are recursive, must have a base case
template <typename T>
T sum(T t) { return(t); }

//Test it
int main()
{
output();
output('5');
output('5', 2);
output('5', 2, "cows");
output('5', 2, "cows", -1);
output('5', 2, "cows", -1, 0.5f);
output('5', 2, "cows", -1, 0.5f, 16.3);

cout << endl;

cout << sum(1) << '\n'
<< sum(1, 2) << '\n'
<< sum(1, 2, 3) << '\n'
<< sum(1, 2, 3, 4) << '\n'
<< sum(1, 2, 3, 4, 5) << '\n';

cout << endl;

cout << sum(0.1) << '\n'
<< sum(0.1, 0.2) << '\n'
<< sum(0.1, 0.2, 0.3) << '\n';

cout << endl;

return(0);
}


Here's how to compile and what the output looks like:

/tmp> g++-4.4 -Wall -o test test.cpp -std=gnu++0x
/tmp> ./test

5
5 2
5 2 cows
5 2 cows -1
5 2 cows -1 0.5
5 2 cows -1 0.5 16.3

1
3
6
10
15

0.1
0.3
0.6

/tmp>


As can be seen I achieved my desired goals of being able to pass various types, not know the types to be passed in advance, and not need a type list passed anywhere. All the other examples I've seen seem to be forgetting these important points.

As for how this works "typename ...P" defines a template type of P which will be many parameters packed into a single variable. "P ...p" receives all these parameters in a variable named "p".

However the only things I can really do is see how many parameters are in "p" using "sizeof...(p)", or split "p" up into all its parameters, to then pass to a function by using "p...". It would be nice to be able to simply iterate through the values of "p", but that's not possible at the moment.

Using standard recursion techniques, you can pass the split up parameters from "p" to a function which has enough templated parameters as "p" is currently holding, or a lesser amount, and the last parameter can be a variadic parameter to catch all the others which don't fit into the first few. Then using those first parameters, you can access the values you need, and so on ad infinitum.

Read up on recursion if you're confused.

Now if you're still wondering what this may be useful for, just consider functions which handle data serialization or something similar.

One thing to note, at least with GCC at the moment, you always need to have a separate function as a base case, even if it's never used.

Take "output('5', 2, "cows", -1, 0.5f, 16.3);", on the last leg of iteration, when the double is passed to the first parameter to be received by "t", and "p" is empty, "if (sizeof...(p)) { output(p...); }" which is the line responsible for the recursion seeing an empty "p" won't ask for "output()", but it seems the compiler still wants it to exist:


test.cpp: In function ‘void output(T, P ...) [with T = double, P = ]’:
test.cpp:8: instantiated from ‘void output(T, P ...) [with T = float, P = double]’
test.cpp:8: instantiated from ‘void output(T, P ...) [with T = int, P = float, double]’
test.cpp:8: instantiated from ‘void output(T, P ...) [with T = const char*, P = int, float, double]’
test.cpp:8: instantiated from ‘void output(T, P ...) [with T = int, P = const char*, int, float, double]’
test.cpp:8: instantiated from ‘void output(T, P ...) [with T = char, P = int, const char*, int, float, double]’
test.cpp:33: instantiated from here
test.cpp:8: error: no matching function for call to ‘output()’


I'm not sure though if this is required by the standard, or a bug in GCC.

As for C++-201x support in general, GCC seems to be the furthest ahead, even surpassing Comeau C++.

See a comparison of the various compilers' support of C++-201x.