Thursday, November 25, 2010

C++ Serialization Anyone?

Today I had one of the most amazing programming experiences that I've ever had, from my entire exciting career. I'm still a bit stunned that this happened. I fully thought what happened was completely impossible till now.

At work, we use a lot of different languages to create our software. It's not odd for us to be working on a project which somehow ends up using a dozen languages. Between server code, client code, databases, communication, mark up, styling, pre-processing, dynamic code generation, and other commonality, it's rather easy actually.

Between all these different programming languages, quite often, we need some sort of data interchange format. There's many to choose from, ranging from something custom to something well known like XML. Using these formats, we can pass data from one segment of our application stack to another. Even when they use two different programming languages. Or to save some data, and load it back up later.

When it comes to these things, soft typed functional languages are generally easier to work with than hard typed. Soft typed languages are very good at building objects from data on the fly, thanks to their ability to not care much about what types they're looking at. Is it a number or a string? Doesn't matter to the soft typed language, as they store it all the same way.

When dealing with database access from hard typed languages, the popular method is to create some sort of catch all or convert to anything type. For some, terms like "QVariant" or "boost::any" are always on their lips. The intent of these and similar constructs is to ease things when dealing with data in an unknown type. Although such constructs generally require building a switch block which needs to check some enumeration method to figure out how to handle the data within the rest of the program. Such code is just downright annoying.

At work some time back, thanks to a lot of the new features C++-201x has been adding, we've been able to build a database access library which can handle data without any of these old kludges. Essentially, database access for us in C++ is now just as easy as it is in PHP (or perhaps even easier!).

Now database communication is great, but there's still the issue of data interchange between two programs, which aren't using a database as an intermediary. Many soft typed or functional languages can have a simple encode() or decode() function, pass it any object, and have a nice string representation of it which can be sent off, or saved to a file for later. C++ and related languages always had the nightmare of needing to iterate manually over every data type, or over a hierarchy to work with something like XML, or similar data formats.

There's those that have created workarounds of course. Such as adding a serialize() function for every type you have to work with individually. Or create some serializable objects that one copies data to or from, and which handle all the serialization work internally. Or one of my personal favorites, write a separate parser which can read a description of a format, and generate the C++ objects and code needed to serialize or deserialize it.

Well, today a coworker and I were putting our heads together on how to deal with a certain project. I wrote code some time back which can serialize/deserialize to and from an std::map which contains numbers, strings, or a mix thereof. We were using this data interchange format between two programs. However, now we need to deal with much more complex data, and a series of key pairs just won't cover it. One end of the equation is C++, the other end is a soft typed language which could pretty easily work with whatever we came up with.

We first thought about the option of using a classic method such as XML or JSON, and use some kind of hierarchical writer from C++, and have the soft typed language just read it directly into an object with one of its built in language features. Till my friend had a brilliant realization. The hierarchy of language containers and their children is recursive, as is any serialization that can encode an infinite amount of data stacked in a hierarchy. Then we started discussing if we could make a serialize() function in C++ which could take any C++ type and work, even when not knowing everything about it in advance. It'd be easy for plain old data types, but gets more complicated once we start dealing with containers of those, and containers of containers.

Of course this is where most conversations along these lines end. But then I brought up template meta programming, and some new features C++ is now adding (and already in GCC), and this discussion went on much further than usual, till the point we were talking code. Well, we got into it, and two hours and two hundred lines of code later, we now have a function with the following prototype:

template<typename T>
std::string serialize(const T &t);


It is able to take any type that exists in C++, and well, serialize it. Have some type which contains some types which contains a few more types which contains some other types? It's all serializeable with this function. No pre-processing, dynamic code generation, compiler hacks, or clumsy per program hierarchical parsing required. It just worksTM.

Now next week, we'll have to write the deserializer function to pull that magic in reverse. Using the same idea, it shouldn't be a problem. If the data matches the supplied structure, parse it in, otherwise, throw an error. But currently, our project is done, as we are now able to have our C++ applications send very complex data to our soft typed languages rather easily.

Looking over the code with my coworker, it all seems extremely obvious. Why the heck didn't we think of this 20 years ago? Now am I getting all excited over something that has been done before? Anyone familiar with anything like this?

Question is, what to do now that we know this? File for a patent? (Yes, I'm evil!) Or perhaps ignore this, as no one cares about this topic anyway?

Saturday, October 30, 2010

This just in, 20% of enterprises and most IT people are idiots

So, who still runs Internet Explorer 6? I do, because sometimes, I'm a web developer.
Along with IE 6, I also run later versions of IE, as well as Firefox, Opera, Chrome, Safari, Arora, Maxthon, and Konqueror. So do my web developer friends and coworkers.

The reason why we do is simple. We want to test our products with every browser with any sort of popularity. Or is a browser that comes with some sort of OS or environment with any sort of popularity. Same goes for browsers with a specific engine.

By playing with so many browsers, we get a feel for things which seem to not be well known (even when documented well), or completely missed by the "pros" who write the most noise on the subject at hand. After work, sometimes my coworkers and I like to get together and joke about how Google security researchers put out a security memo on IE, citing no recourse, while the solution is clearly documented on MSDN, or any similar scenario.

Perhaps we're bad guys for not writing lengthy articles on every subject, and keeping knowledge to ourselves. On the other hand, we get to laugh at the general population on a regular basis. Something which I'm sure every geek at heart revels in.

Here's a small sampling of popular beliefs that we snicker at:


That last one is actually what prompted this article. 20% of enterprises say they can't upgrade to newer versions of Windows because they need to use IE 6. On top of this, almost every IT guy who had anything to say about this believe in this situation or mentions virtualization as an out. Heck, even Microsoft themselves are saying you need a special XP mode in Windows 7 for IE 6, as is every major article site on the net who comment on this situation.

Hilarious considering that you can install and run IE 6 just fine in Windows 7. There's plenty of solutions out there besides that one too. They've also been around for several years.

Anyways, experts, pros, designers, IT staff and average Internet surfers out there, just keep on being clueless on every single topic, some of us are having a real laugh.

Sunday, October 24, 2010

Programming manuals for those who "get it"

Top secret official programming manuals for certain popular hand held devices have now been uploaded to the usual public document repositories. Some have been around for a while to those who get it, some are new.

Sunday, September 19, 2010

Optimizing your file dialog

Recently, someone asked me why is it that the custom file dialogs in some programs I wrote are able to load a file list so much faster than the so called "native file dialogs" on those operating systems. The answer is that I optimized many areas of loading a file list, and displaying it. Today we'll look at some optimizations that can be performed, instead of using the most obvious approaches to most of the process.


  1. Know your system calls for reading directory entries



    You'll want to create some sort of library around the native directory functions in order to use the most efficient path on each operating system. Operating systems vary widely in what they they provide.

    Sometimes you have multiple APIs, or a single API with many options you can turn on and off. Unless the provided API is in itself doing everything as efficiently as possible (such as efficiently returning the file names in a sorted matter), you'll want to use the lowest level API, or turn off all options it provides.

    If bidirectional or scanning options are provided, turn them off, as you'll want forward only. The overhead for some of these features can significantly slow down the process of obtaining a file list. Same goes for anything else that adds overhead.

    Be aware of what information is provided other than file names. Some OSs will return a structure which may also contain file size or file permissions alongside each file. If you're filtering by any of these, use the information directly there, instead of using an additional call to stat() or similar.


    In addition to the above, if you need to filter by file size or file permissions, and the API does not provide that information directly, avoid the need for concatenating strings to get your information. A call to stat() would require concatenating a file name onto the directory name if it's not in the current directory. If your OS provides for it, use fstatat() or similar instead (see "OpenGroup Extended API Set Part 2").

  2. Process file names in an optimized manner


    Once you start getting back file name(s), before anything, determine the length of a file name, as all other operations will need to know the length. Pretty much every operating system / file system limits the length of a file name to 255 bytes, the length of that can fit in a single byte. Compute the length (if need be) and keep reusing this cache.

    If you want to do extension filtering, now is the time to do so. Do not use strrchr(), or anything of the sort to find where the extension begins (searching for a period). Functions like strrchr() will first travel to the end then look back, effectively being O(2N). Since you already know the length of the string, use an O(N) function to look back from the end. Perhaps even using one which can scan multiple bytes (whole CPU words) at a time.

    For extension filtering, the most common method is generally one that breaks down to a series of compares. Put a little more thought into your algorithm usage! Either sort your list of extensions and binary search it for a match, or use perfect hashing. In either case, make sure the sorting or hashing phase is performed just once.

    For copying the files names you want elsewhere, use memcpy() instead of strcpy() and similar. Don't forget to pass around the length for other uses too.


  3. Use efficient data structures efficiently


    The next question is how to keep track of your list of files. The most two common ideas are to use a vector and just keep adding to it, or use a linked list. Both those solutions are wrong.

    A vector while great in theory, has some performance and memory issues. When a vector can no longer contain what is being added to it, it doubles in size. The doubling in size can cause up to 99% more allocation than is actually needed during the growing process, which also comes with a lot of copying around. This may suit many workloads, but won't always suit the file list workload well.

    A linked list on the other hand is one of the worst possible data structures you can use. Every entry has 100-200% extra overhead for pointers, and often means many series of allocations. Many small allocations leads to performance issues, and a lot of memory fragmentation.

    The solution I would recommend is either a tweaked vector, or a combination of a vector plus linked list. Tweaking a vector starts off with reserving a set amount of space up front. Have your vector by default start with room for say 512 entries, and only then grow as needed. When you see your vector is full, instead of doubling in size, reserve another 512 entries. Smaller requests are more likely to be fulfilled for a realloc() in the same location, than a doubling in size. Very large directories beyond this point also becomes increasingly rarer at each new milestone. On top of all this, have your program remember how many files were in each of the past 10 or so directories it looked in (with filtering options of course). Then when loading a directory you recently loaded in the past, you can allocate an amount which is usually just what you need, or only needs one more chunk allocated onto it. Keep in mind that most users will keep using the same few directories over and over again with each application.

    The combination of the vector and linked list would instead build a structure out of the two ideas combined. Create your array of 512 entries, along with a fill pointer and and linked list pointers. When your array is full, allocate another block of this structure as the next chain in the linked list. This ensures no need to ever reallocate, or copy data around during the growing phase. Of course keep a global pointer to the current link being filled, instead of iterating to it each time you add a new entry.

    Which of these two options you use depends on how you want to sort and display your files. Each data structure has pros and cons for different use cases. Before we get into that, know how to store your file names themselves.

    The above description was for the pointers to the arrays holding the file names, the file names should NOT have a separate block of allocated memory for each. That's inefficient. Instead allocate large chunks of memory, say 16KB each. Store your file names in this memory pool. Forget the silly trailing null if you can, store the length before the name itself. The two data structures explained above for containing lists of files will point into these pools. If a pool can't contain another entry you're about to add, allocate another 16KB chunk, point to it from the previous one, and start allocating in this new one. If the need for a new one was for a rare overly large file, say 230 bytes, when 200 bytes were free, you can implement an algorithm to save some overhead by seeing if the next couple of files fit into the end of any previous pools. The pools in general will probably waste less than your OSs malloc() for small file names, and you'll certainly get better performance. Deallocation will also be a quick (backwards) traversal for large chunks instead of many small ones.

  4. Sort efficiently


    The first part of sorting is comparison. strcmp()/strncmp() is not the function to use. You want something based on memcmp() as that works faster thanks to having the length of the file name already cached. Of course that's only if you want standard sorting.

    If you want case insensitive sorting, other methods will have to be used. If you're only dealing with ASCII file names, and you're not worried about file names containing control characters or some symbols end up being of equal value, you can compare 4 bytes at a time or'ing each with 0x20202020 (0x2020202020202020 for 8 bytes at a time). You're probably out of luck otherwise for case insensitive, unless someone knows of any good tricks to use.

    If you need "natural sorting", where numbers are compared to each other, so file1 precedes file15, and file2 follows file1 not file15, there's a few implementations floating around online and most are just copies of one another. Sadly, the most popular one I see, which is present in two operating systems, and the native implementation in a particular programming language, happens to incorrectly handle numbers past a certain magnitude (did anyone even test the implementations or just accept them at face value?), and has a worse case of O(3N). So beware of what you use or base your code off of.

    I'd recommend something along these lines:

    int strnatbase(const char *s1, const char *s2)
    {
    for (; *s1; ++s1, ++s2) //No need to check s2, as we compare it to s1, and it won't equal if it's null
    {
    if (isdigit(*s1) && isdigit(*s2)) //If they're both digits, use this special handling
    {
    while (*s1 == '0') { ++s1; } while (*s2 == '0') { ++s2; } //First skip leading zeros

    register bool d1 = isdigit(*s1), d2 = isdigit(*s2); //Are we still in a run of digits for both strings?
    if (d1 != d2) { return(d1-d2); } //If for only one of them, return that as the difference

    for (; d1 && (*s1 == *s2); ++s2) { d1 = isdigit(*++s1); } //Keep going while we have matching digits

    d2 = isdigit(*s2); //If we broke the above loop because a single string ran out of digits
    if (d1 != d2) { return(d1-d2); } //Return that as the difference

    for (const char *p1 = s1, *p2 = s2; d1; ) //Otherwise, difference in the digits themselves, clarify magnitude
    {
    d1 = isdigit(*++p1); d2 = isdigit(*++p2); //First string to run out of digits first
    if (d1 != d2) { return(d1-d2); } //Will lose right here
    } //If loop breaks, both strings are out of digits, the difference found above will be handled below
    }
    if (NOT_EQUAL(*s1, *s2)) { break; } //In all other cases, fall through with difference in current position - if any
    }
    return(*s1-*s2);
    }


    Modify as necessary for unicode, and replace NOT_EQUAL() with your function (or macro) of choice for comparing two letters (accounting for case differences if you want).

    As for the sort functions themselves, it will depend on which data structure you chose. You'll probably want Quicksort for the vector. Make sure your Quicksort sorts the pointers to the data and not the data itself! Same for any sorting algorithm sorting this. Swapping pointers is quick and easy. For the latter structure, you might want Merge sort, or perhaps Smoothsort. What you choose can depend on several factors.

    You may or may not want to utilize threading. Depending on how you filter, the OS may be giving you mostly sorted file names which can also play a role in the algorithm you choose. In the case of the linked lists of arrays, you might want to sort each array with one algorithm, then sort the whole. If you're utilizing threading, you can begin sorting the array(s) as you're still obtaining files from the operating system, as disk reads are slower than memory reads (when the current directory contents is not (yet) cached by the OS).

  5. Display directly


    This point is rather simple. Have your file dialog display entries directly from whichever data structure you chose, without copying anything around. Know that most existing file dialog routines provided by various libraries will not be compatible with the data structures described above. So roll your own, or choose structures which can be passed directly to your library.

    Another possibility is to copy entries as they fit in the dialog window. The dialog is unable to show an infinite amount of entries at once, so convert from your structure to its structure just for the current files being viewed. Update as scrolling is performed.

  6. Search efficiently


    Many file dialogs provide the option to type in some characters, and the file dialog jumps to the first file that begins with those characters. Since your file dialog is sorted, use binary search. Some people don't realize this, but binary search can find more than just an exact match, binary search can also be used to find the transition from entries "less than" to "first equal match" (in other words, binary search can also be used to find the transition between entries before the first match and the first match itself). Use this to find the first matching file in O(log(N)).



This basically wraps up this discussion on how to optimize file dialogs. I once was annoyed that loading up a directory with 20,000 files in it on a particular OS with a particular file browser took a good torturesome 20-30 seconds. Now using my own, I can load a directory with 100,000 files in 1-2 seconds on the same setup.

Thoughts, improvements, complaints, and every other kind of comment is welcome. Be sure to comment if you take of these ideas to optimize your library or application. I'd love to hear about it.

Saturday, July 24, 2010

Simplifying bootstrapping for virtual constructors

Last week I demonstrated a solution to the virtual constructor problem. My solution avoids many issues with the factory function solution. Yet it did require some bootstrapping to use.

The bootstrapping required a new function to be created for every single derived class that needs to be virtualized. When working with many derived classes, this becomes unacceptable. It's bad enough to solve this problem we need to generate a map, should we have to create additional functions as well? Each time a new derived class is added, should I go out of my way with two steps?

Turns out, making use of templates, we can combine the define and function generation step.

static compress *compress_zip::construct() { return new compress_zip; }
static compress *compress_gzip::construct() { return new compress_gzip; }
static compress *compress_7zip::construct() { return new compress_7zip; }


Instead of creating the above, and using it as follows:

std::map<COMPRESS_TYPES, compress *(*)()> compress_factory;
compress_factory[COMPRESS_ZIP] = compress_zip::construct;
compress_factory[COMPRESS_GZIP] = compress_gzip::construct;
compress_factory[COMPRESS_7ZIP] = compress_7zip::construct;


First create a single construct template function:

template <typename T>
compress *compress_construct()
{
return new T;
}


This has to be done only once.

Now when adding to the map, we can do the following:

std::map<COMPRESS_TYPES, compress *(*)()> compress_factory;
compress_factory[COMPRESS_ZIP] = compress_construct<compress_zip>;
compress_factory[COMPRESS_GZIP] = compress_construct<compress_gzip>;
compress_factory[COMPRESS_7ZIP] = compress_construct<compress_7zip>;


If some new type now comes along, simply add it with a single line. A new function will be generated on use by the template, so you no longer have to. Now we have truly managed to map a type directly to an identifier.

Of course no tutorial would be complete without a self contained example:

#include <iostream>
#include <map>
#include <string>

class base
{
std::string n;

protected:
base(const std::string &n) : n(n) {}

public:
base() : n("base") {}
std::string name() { return(n); }
};

struct derived : public base
{
derived() : base("derived") {}
};

template <typename T>
base *construct()
{
return new T;
}

int main(int argc, const char *const *const argv)
{
std::map<std::string, base *(*)()> factory;
factory["b"] = construct<base>;
factory["d"] = construct<derived>;

try
{
//Instantiate based on run-time variables
base *obj = factory.at(argv[1])();

//Output
std::cout << obj->name() << std::endl;

//Cleanup
delete obj;
}
catch (const std::exception &e) { std::cout << "Error occured: " << e.what() << std::endl; }

return 0;
}


Output:

/tmp> g++-4.4 -Wall -o factory_test factory_test.cpp
/tmp> ./factory_test b
base
/tmp> ./factory_test d
derived
/tmp>

Now that this problem has been solved nice and neatly. What about solving it for multiple constructors? Also known as the abstract factory problem. What if each class has multiple constructors, and we want a collection of them mapped to a single identifier? Can we do it without repeating a lot of code over and over?

With some minor bootstrapping, the answer is again yes! There's multiple solutions to this problem, but the following is what I found to be the nicest at the moment.

First create a pure virtual class with a function to match each constructor you'd like to virtualize. Each should of course return a base pointer.

Imagine we had 3 constructors, one taking no parameters, one taking a C string, and another taking a C++ string, we would setup the following:

struct construct_interface
{
virtual base *operator()() const = 0;
virtual base *operator()(const char *) const = 0;
virtual base *operator()(const std::string &) const = 0;
};


Once we have the interface defined, we'll create a template construct function which implements and returns that interface within a singleton similar to the above construct function:

template <typename T>
const construct_interface *construct()
{
static struct : public construct_interface
{
base *operator()() const { return new T; }
base *operator()(const char *s) const { return new T(s); }
base *operator()(const std::string &s) const { return new T(s); }
} local;
return &local;
}


It'd be nice to return a reference instead of a pointer, but dealing with references in std::map is kind of icky. We can use macros to cleanup any annoying pointer dereferencing issues that could arise.

Now our construct function returns a pointer to an interface which can construct a derived type using any of its constructors. We'll use it with an std::map like so:

std::map<std::string, const construct_interface *> factory;
factory["base"] = construct<base>();
factory["derived"] = construct<derived>();


Notice the map no longer tracks function pointers, but pointers to the interface. Also, when assigning to the map, we're calling the construct function to obtain the pointer. We could modify the above example to track function pointers and leave out the (), and even make the construct function return a reference to an interface instead, but then we'd need to add an extra () when creating objects. While that too can be hidden by a macro, or just ignored, as an extra () still looks rather clean, it does add extra overhead, as it is likely you'll initialize your map just once, and use it to create many objects during the lifetime of the program.

Now to use the map to create an object, the following has to be done:

//Use first constructor, the default constructor
base *obj1 = (*(factory).at(id))();

//Use second constructor, the one taking a C string
base *obj2 = (*(factory).at(id))(s);


It works nicely, but as I explained above that's rather ugly.

This macro can help simplify things:

#define VIRTUAL_NEW(factory, id) (*(factory).at((id)))


And unlike the interface class and template construction function which needs to be created for each set of classes making use of virtual constructors, the above macro can be reused for every virtual constructor collection that makes use of the above idiom.

Using the macro, the code now looks as follows:

//Use first constructor, the default constructor
base *obj1 = VIRTUAL_NEW(factory, id)();

//Use second constructor, the one taking a C string
base *obj2 = VIRTUAL_NEW(factory, id)(s);


That's it, problem solved!

Putting it all together, here's a working example:

#include <iostream>
#include <map>
#include <string>
#include <cstdlib>

class base
{
protected:
int x, y;

public:
base() : x(1), y(2) {}
base(const char *s) : x(std::atoi(s)), y(3) {}
base(const std::string &s) : x(std::atoi(s.c_str())), y(4) {}

virtual int operator()() { return x+y; }
};

class derived : public base
{
public:
derived() {}
derived(const char *s) : base(s) {}
derived(const std::string &s) : base(s) {}

virtual int operator()() { return x*y; }
};

struct construct_interface
{
virtual base *operator()() const = 0;
virtual base *operator()(const char *) const = 0;
virtual base *operator()(const std::string &) const = 0;
};

template <typename T>
const construct_interface *construct()
{
static struct : public construct_interface
{
base *operator()() const { return new T; }
base *operator()(const char *s) const { return new T(s); }
base *operator()(const std::string &s) const { return new T(s); }
} local;
return &local;
}

#define VIRTUAL_NEW(factory, id) (*(factory).at((id)))

int main(int argc, const char *const *const argv)
{
std::map<std::string, const construct_interface *> factory;
factory["b"] = construct<base>();
factory["d"] = construct<derived>();

if (argc == 3)
{
try
{
//Instantiate based on run-time variables
base *obj1 = VIRTUAL_NEW(factory, argv[1])();
base *obj2 = VIRTUAL_NEW(factory, argv[1])(argv[2]);
base *obj3 = VIRTUAL_NEW(factory, argv[1])(std::string(argv[2]));

//Output
std::cout << (*obj1)() << '\n'
<< (*obj2)() << '\n'
<< (*obj3)() << '\n'
<< std::flush;

//Cleanup
delete obj1;
delete obj2;
delete obj3;
}
catch (const std::exception &e) { std::cout << "Error occured: " << e.what() << std::endl; }
}

return 0;
}


Output:

/tmp> g++-4.4 -Wall -o abstract_factory_test abstract_factory_test.cpp
/tmp> ./abstract_factory_test b 2
3
5
6
/tmp> ./abstract_factory_test b 3
3
6
7
/tmp> ./abstract_factory_test d 2
2
6
8
/tmp> ./abstract_factory_test d 3
2
9
12
/tmp>

Hopefully you should now be able to take this example and plug it in just about anywhere. Easy to add new types. No need to modify existing classes. Fully dynamic. Easy to use!

This method really shines if you dynamically load derived classes while your program is running. Just make sure your DLL uses the the template function internally, so an instance of the function for the new type is created, and dynamically add an id to the map, and presto you're done.

Now go out there and leverage the power of C++!

Tuesday, July 20, 2010

Time to shutdown the factory for code violations

There's a common problem in software development regarding how to create an object's type dynamically in C++ and similar languages. If I have a collection of classes (they all inherit from the same base), I can use a base pointer to any instance, and perform operations as needed. The issue is generating that pointer in the first place.

Say I have an image class, with subclasses of image_png, image_jpeg, image_gif, and so on. I can use an image pointer along with its virtual functions to do an operation like image_pointer->resize(new_dimensions), and have it do whatever needs to be done for the image type in question to resize it and write the data properly.

This idea applies to all kinds of projects over and over again. Another example could be a compression class. Where I have the class compress, along with subclasses compress_zip, compress_gzip, compress_7zip, and so on. I can do compress_pointer->uncompress(in_file, out_file), and have that work appropriately with virtual functions.

However my image pointer and my compress pointer need to be pointing at the right kind of object for this to work in the first place. My image is sent over the network, I have information at runtime of image/jpeg and the binary data, and I need to load that into the correct subclass image_jpeg. My compressed file is on the hard drive, I read its type via its extension or its header, again I need my pointer to point at the appropriate subtype. I don't have this information at compile time, so how do I initialize my pointer?

In this case, the idea of virtual constructors, or factories come into play. The simplest factory would be in the form of an if/else if structure, or a switch.
image *image_factory(const char *mime_type, const uint8_t *binary_data, size_t size)
{
  image *image_pointer;
  if (!strcmp(mime_type, "image/jpeg")) { image_pointer = new image_jpeg(binary_data, size); }
  else if (!strcmp(mime_type, "image/png")) { image_pointer = new image_png(binary_data, size); }
  else if (!strcmp(mime_type, "image/gif")) { image_pointer = new image_gif(binary_data, size); }
  else { throw&nbsp"Unsupported image type"; }
  return image_pointer;
}

compress *compress_factory(COMPRESS_TYPES requested_save_format)
{
  compress *compress_pointer;
  switch (requested_save_format)
  {
    case COMPRESS_ZIP: compress_pointer = new compress_zip; break;
    case COMPRESS_GZIP: compress_pointer = new compress_gzip; break;
    case COMPRESS_7ZIP: compress_pointer = new compress_7zip; break;
    default: throw std::runtime_error("Unknown enum value");
  }
  return compress_pointer;
}


Now the above functions allow us to get the base pointer initialized appropriately, but they have some drawbacks.

  • A single function needs to know about every type.

  • If which types are supported changes during runtime, perhaps based on which DLLs are found, the factory function becomes more complex, and has to be specifically tailored for each application.

  • If there are multiple constructors, and in some cases one is needed, and in some cases another, suddenly we need to recreate the same logic in multiple factory functions, one factory function per constructor.

  • This style of factory function has to be recreated for every single collection of classes present, the logic behind a factory function is not abstracted away.


Based on these drawbacks, obviously such a method is the incorrect solution, even though it is the most popular method used.

In the case where we have variables mapping to other variables, such as 1 = "Apple", 2 = "Orange", 3 = "Tomato", the obvious solution would be to setup an array or use a map. If the mapping needed is between an integer and a variable, going from integer -> variable is simply a matter of indexing into an array. When dealing with non integer types, the standard std::map fits the bill rather nicely, as it can allow any type to become an index.

So the question becomes, can we map a variable to a type itself? With a bit of bootstrapping, the answer is actually yes! What we need are function pointers to replace code logic.

We can't create a function pointer directly to a constructor, so we'll need a bit of bootstrapping. We can create function pointers to global functions, or static member functions.

In our compression case we can do the following:
static compress *compress_zip::construct() { return new compress_zip; }
static compress *compress_gzip::construct() { return new compress_gzip; }
static compress *compress_7zip::construct() { return new compress_7zip; }


We'll place a static construct() within each of our classes we'd like to be able to instantiate dynamically, and then we can use them in a map like so:
std::map<COMPRESS_TYPES, compress *(*)()> compress_factory;
compress_factory[COMPRESS_ZIP] = compress_zip::construct;
compress_factory[COMPRESS_GZIP] = compress_gzip::construct;
compress_factory[COMPRESS_7ZIP] = compress_7zip::construct;


Once we have a map in place that can be constructed at runtime, we no longer have most of our above issues. No canned function needs to be created which knows about all the supported types in advance, which needs modification for each new type. Values are only added to the map that the required DLLs were found for it, or other runtime constraints. The entire idea is also abstracted away, and we don't need to constantly recreate new factory functions for each family of objects. We simply just initialize a factory object and make use of it.

Now how exactly do we use our map? The idea which is rather popular and expounded in book after book, is to create a factory template class something similar to the following:
template <typename identifier, typename product, typename function>
struct factory
{
  void set_mapping(const identifier &id, function f) { mapping[id] = f; }
  product *construct(const identifier &id)
  {
    function f = mapping.at(id);
    return f();
  }

  private:
  std::map<identifier, function> mapping;
};


Now that we have all that encapsulated, we can do the following:
factory<COMPRESS_TYPES, compress, compress *(*)()> myfactory;
myfactory.set_mapping(COMPRESS_ZIP, compress_zip::construct);
...
compress *compress_pointer = myfactory.construct(requested_save_format);


However, such a method has a serious drawback. What if my constructor needs parameters? As it does in the image example above? Do I end up creating a new factory encapsulation class each time? One library out there solves this problem with ~800 lines of code (read Modern C++ Design for more info). Basically the construct function in the factory class is overloaded a bunch of times with template parameters over and over again, for some conceivable amount of parameters a function might take.
AbstractProduct *CreateObject(const IdentifierType &id,
                              Parm1 p1, Parm2 p2, Parm3 p3, Parm4 p4, Parm5 p5, Parm6 p6, Parm7 p7, Parm8 p8)
{
  typename IdToProductMap::iterator i = associations_.find(id);
  if (i != associations_.end()) { return (i->second)(p1, p2, p3, p4, p5, p6, p7, p8); }
  return this->OnUnknownType(id);
}

The above snippet is repeated 16 times in that library with 0-15 parameters. First of all, such code commits the unforgivable sin of code repetition, and is limited to 15 parameters. What if I need more? Sin more?

The entire idea of a factory which produces an object within itself needs to be shutdown. By trying to be overly clever and encapsulate more than needed, object factories turn into nothing but bloat and horrible code.

Simplicity is king. Therefore, the map should NOT be encapsulated, nor should an object be directly produced. Our so called factory should only produce the appropriate mold needed to form our object.

In the case of images:
static image *image_jpeg::construct(const uint8_t *binary_data, size_t size)
{
  return new image_jpeg(binary_data, size);
}

static image *image_png::construct(const uint8_t *binary_data, size_t size)
{
  return new image_png(binary_data, size);
}

static image *image_gif::construct(const uint8_t *binary_data, size_t size)
{
  return new image_gif(binary_data, size);
}

std::map<std::string, image *(*)(const uint8_t *, size_t)> image_factory;
image_factory["image/jpeg"] = image_jpeg::construct;
image_factory["image/png"] = image_png::construct;
image_factory["image/gif"] = image_gif::construct;


Now to create an image, simply do the following when you want to create an image based on run time variables:
image *image_pointer = image_factory.at(mime_type)(binary_data, size);

Notice that code would take the place of:
image *image_pointer = new image_png(binary_data, size);


Now the creation of a particular object at runtime looks pretty much like creating a specific object, and can feel natural. Just like new which throws an error when it fails (which can happen here too), std::map::at() throws an error when the requested index is not found, so you can reuse the same type of exception handling you're already using when creating objects in general.

If you'd like a complete example which you can play with yourself, try this:
#include <iostream>
#include <string>
#include <map>
#include <stdexcept>
#include <cstdlib>

class Base
{
  int x, y;
  public:
  Base(int x, int y) : x(x), y(y) {}
  virtual ~Base() {};
  virtual int mult() { return x*y; }
  virtual int add() { return x+y; }

  static Base *create(int x, int y) { return new Base(x, y); }
};

class Derived : public Base
{
  int z;
  public:
  Derived(int x, int y) : Base(x, y), z(1) {}
  void setZ(int z) { this->z = z; }
  virtual int mult() { return Base::mult()*z; }
  virtual int add() { return Base::add()+z; }

  static Base *create(int x, int y) { return new Derived(x, y); }
};

int main(int argc, const char *const *const argv)
{
  //Get factory ready
  std::map<std::string, Base *(*)(int, int)> factory;

  //Add some rules
  factory["base"] = Base::create;
  factory["derived"] = Derived::create;

  try
  {
    //Instantiate based on run-time variables
    Base *obj = factory.at(argv[1])(4, 5); //Note: instead of new Base(4, 5) or new Derived(4, 5)

    //Set Z if derived and requested
    if (argc > 2)
    {
      if (Derived *d = dynamic_cast<Derived *>(obj)) { d->setZ(std::atoi(argv[2])); }
    }

    //Output
    std::cout << obj->mult() << ' ' << obj->add() << std::endl;

    //Cleanup
    delete obj;
  }
  catch (const std::exception &e) { std::cout << "Error occured: " << e.what() << std::endl; }

  return 0;
}


Here's what the output looks like:
/tmp> g++-4.4 -Wall -o factory_test factory_test.cpp
/tmp> ./factory_test
Error occured: basic_string::_S_construct NULL not valid
/tmp> ./factory_test base
20 9
/tmp> ./factory_test derived
20 10
/tmp> ./factory_test derived 6
120 15
/tmp> ./factory_test cows
Error occured: map::at
/tmp>


Basically the only issue we're left with is that for each class you want to be able to use a factory for, you have to create a static member function for each constructor you want to use with a particular factory. I don't think there is away around that. It's a minor inconvenience, but in the end offers a lot of power in a clean fashion.

This idea of combining a map with a function pointer can be expanded to other areas as well. You can use this method for calling functions in general based on any runtime input. If your class has many different member functions, each with the same parameters, but for different operations, consider creating a map to member function pointers, and use the input as an index into a map to call the appropriate member function.

Suggestions, improvements, and comments welcome.

Thursday, June 3, 2010

Undocumented APIs

This topic has been kicked around all over for at least a decade. It's kind of hard to read a computer programming magazine, third party documentation, technology article site, third party developer mailing list or the like without running into it.

Inevitably someone always mentions how some Operating System has undocumented APIs, and the creators of the OS are using those undocumented APIs, but telling third party developers not to use them. There will always be the rant about unfair competition, hypocrisy, and other evils.

Large targets are Microsoft and Apple, for creating undocumented APIs that are used by Internet Explorer, Microsoft Office, Safari, iPhone development, and other in-house applications. To the detriment of Mozilla, OpenOffice.org, third parties, and hobbyist developer teams.

If you wrote one of those articles/rants/complaints, or agreed with them, I'm asking you to turn in your Geek Membership card now. You're too stupid, too idiotic, too dumb, too nearsighted, and too vilifying (add other derogatory adjectives you can think of) to be part of our club. You have no business writing code or even discussing programming.

Operating Systems and other applications that communicate via APIs provide public interfaces that are supposed to be stable. If you find a function in official public documentation, that generally signifies it is safe to use the function in question, and the company will support it for the foreseeable future versions. If a function is undocumented, it's because the company doesn't think they will be supporting the function long term.

Now new versions which provide even the exact same API inevitably break things. It is so much worse for APIs which you're not supposed to be using in the first place. Yet people who complain about undocumented APIs, also complain when the new version removes a function they were using. They blame the company for purposely removing features they knew competition or hobbyists were using. That's where the real hypocrisy is.

As much as you hate a new version breaking your existing software, so does Microsoft, Apple, and others. The popularity of their products is tied to what third party applications exist. Having popular third party applications break on new versions is a bad thing. That's why they don't document APIs they plan on changing.

Now why do they use them then? Simple, you can write better software when you intimately understand the platform you're developing for, and write the most direct code possible. When issues occur because of their use, it's easy for the in-house teams to test their own software for compatibility with upcoming versions and fix them. If Microsoft sees a new version of Windows break Internet Explorer, they can easily patch the Internet Explorer provided with the upcoming version, or provide a new version altogether. But don't expect them to go patching third party applications, especially ones which they don't have the source to. They may have done so in the past, and introduced compatibility hacks, but this really isn't their problem, and you should be grateful when they go out of their way to do so.

So quit your complaining about undocumented APIs, or please go around introducing yourself as "The Dumb One".

Saturday, March 27, 2010

Does anyone understand types and magnitudes?

Regularly I have to work with many popular libraries out there, as well as many libraries written by coworkers and similar.

One thing which I see constantly misused over and over again is the type system used in C and C++.

I wonder if most people understand it, have bothered to contemplate it, or simply don't care if their library is misprogrammed trash.

First of all, C/C++ for integers supports both signed and unsigned types. Many programmers seem to ignore the unsigned types altogether. Maybe some education is in order.

Each base type in C consists of one or more bytes. Each byte on modern main stream systems are 8 bits to a byte. Meaning a single byte can store 28 values, or in other words 256 possibilities. Two bytes can store 216 values, or in other words 65,536 possibilities.

A type designed for integers (whole numbers) which is unsigned will operate with the meaning of its values as ranging from 0 till amount of possibilities - 1. In the case of a one byte type: 0 to 255, for a 2 byte type: 0 to 65,535.

When a value is signed (the usual default if no signed or unsigned description is applied), a single bit is used to describe whether the rest of the bits represent a positive or negative value. This effectively cuts the amount of positive values that can be represented in half. In the case of a single byte, 27 values will be negative, -128 to -1, 27-1 values will be distinctly positive, 1 to 127, and one value will be 0 (for a total of 256 possibilities). This same math extends to as many bytes being used in a given type.

When dealing with something which requires working with both positive and negative values such as debt (monetary balance), direction, difference, relative position, and things of a similar nature, signed values is quite natural and should of course be used.

But if you're working with something which has no meaning for negative values such as "Number of students in this class", "How old are you", "Amount of chickens in the coop", "Amount of rows in database table", or "Amount of elements in this array", or amounts of any nature really, using a signed type is one of the worst things you could possibly do. You are reserving half of your possible values for a situation which will never occur (barring mistakes in programming).

Let's take a look at some real world examples. The most popular type used for at least the past decade would have to be the 32 bit integer. 232 gives us 4,294,967,296 different value possibilities. If that was a storage size, it would be 4GB. Effectively a 32 bit unsigned integer can store the values 0 to 4,294,967,295, or one byte less than 4GB. If it was signed, it would be able to store the values -2,147,483,648 to 2,147,483,6487, topping out at one less than 2GB.

If you have an Operating System or File System which limits you to 2GB as a max size for files, or your Digital Camera only lets you use up to 2GB flash cards, it's because the idiots that programmed it used signed values. They did this because either they were expecting files and flash cards of negative size, or because they were a bunch of idiots. Which of those two possibilities actually happened should be obvious.

The same goes for any programming library you see with a limit of 2GB. As soon as you hear anyone mention a limitation of 2GB, what should immediately register in your head is that idiots worked on this application/library/standard/device.

Now sometimes programmers try to justify their ignorance or their idiocy with remarks such as using negative values to indicate error conditions. Meaning, they write functions which return a positive value in case of success, and -1 in case of error. This is a very bad practice for two reasons. First of all, one shouldn't use a single variable to indicate two different things. It's okay to use a single variable where each bit (or sets of bits) indicate a flag for a set of flags. But it's very wrong to have a single value indicate success/failure and amount, or day of week and favorite color, or any two completely unrelated things like that. Doing so only leads to sloppy code, and should always be avoided. Second of all, you're cutting your amount of usable possibilities in half, and reserving a ton of values just for a single possibility. You can't be more wasteful than that.

If you need a way for your function to return success/failure and amount, there are cleaner and more effective ways to do so. Many times an amount of 0 is invalid. If a function is supposed to return the amount of something, 0 isn't really an amount. If I asked how many students are in my class, and I get 0, that in itself should be indicative of an error. If I asked how many rows did my SQL statement just change, and I get 0, that should be indicative of an error, or at the very least, there were no SQL rows to change. If I really need to know if the 0 means a real error, or there was no data to work with, a separate function can tell me if the last call was a success or failure. In C and in POSIX, it is common to use the global variable errno to indicate errors. If you're making your own library, you can make a function to tell you why the last value was 0. There was no data, there was an error accessing the data, and so on.

Another method would be to pass one parameter by pointer/reference to store the amount or success state, and have the other returned from the function. Another technique would be to return an enum which tells you which error condition happened or everything is okay, and use a different function to retrieve the value of that operation upon success. Lastly, C++ programmers can use std::pair<> for all their multiple return needs, or simply return the amount normally, and throw an exception on any kind of error.

Now that hopefully all my loyal readers are now more educated about the sign of their types, and various function return techniques that better libraries use, let's talk a bit about magnitude.

I see again and again libraries that don't have a clue about magnitude. The various C/C++ standards state the following about the normal built in types.

1) The amount of bytes used for the following types should be in this proportion: short <= int <= long <= long long.
2) short should be at least 2 bytes.
3) long should be at least 4 bytes.
4) long long should be at least 8 bytes;.

Which means all these scenarios are perfectly legal:
sizeof(short) == 2
sizeof(int) == 2
sizeof(long) == 4
sizeof(long long) == 8

sizeof(short) == 2
sizeof(int) == 4
sizeof(long) == 4
sizeof(long long) == 8

sizeof(short) == 2
sizeof(int) == 4
sizeof(long) == 8
sizeof(long long) == 8

sizeof(short) == 2
sizeof(int) == 4
sizeof(long) == 8
sizeof(long long) == 16

sizeof(short) == 4
sizeof(int) == 4
sizeof(long) == 8
sizeof(long long) == 16

sizeof(short) == 4
sizeof(int) == 8
sizeof(long) == 16
sizeof(long long) == 32

And so on.

In order to get types of a particular size, C99 added the header <stdint.h> and C++-201x added <cstdint>, with types such as int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, and several others. Each of these types are the size in bits that their name indicates, with those without a u at the beginning being signed, and those with being unsigned. Using these types you can get the exact size you need.

But remember to use things appropriately. I see several popular SQL engines for example that support as many rows in a table as a uint64_t is able to contain, but if you ask them how many rows were altered by the last query, they'll report that information in a plain and simple int! You by now should realize there are two problems with such a course of action.

Now lastly, if you're writing some kind of function which takes an array, it is normal convention to pass a pointer to the array, as well as a size variable to specify how many elements or bytes are within the array.

The standard C library uses a type called size_t to deal with such values. A size_t can contain the highest amount of bytes the program is able to address, and is of course unsigned.

If you need to pass an array, your prototype should always be along the lines of:

void func(void *p, size_t size);
void func(uint8_t *p, size_t size);

If you need to return a size, for something which is addressable memory, again always use a size_t. Functions like strlen() and sizeof() return a size_t, and functions like memcpy(), memset(), or malloc() take a size_t as one of their arguments.

Let me show you a quick program to illustrate a point.


#include <cstdio.h>

#define O(x) printf(#x": %zu\n", sizeof(x))
int main()
{
O(signed char);
O(signed short);
O(signed int);
O(signed long);
O(signed long long);
O(signed);
O(unsigned char);
O(unsigned short);
O(unsigned int);
O(unsigned long);
O(unsigned long long);
O(unsigned);
O(size_t);
O(void *);
O(float);
O(double);
O(long double);
return(0);
}


On a 32 bit system of mine, the output is as follows:

signed char: 1
signed short: 2
signed int: 4
signed long: 4
signed long long: 8
signed: 4
unsigned char: 1
unsigned short: 2
unsigned int: 4
unsigned long: 4
unsigned long long: 8
unsigned: 4
size_t: 4
void *: 4
float: 4
double: 8
long double: 12


On a 64 bit system of mine, the output is as follows:

signed char: 1
signed short: 2
signed int: 4
signed long: 8
signed long long: 8
signed: 4
unsigned char: 1
unsigned short: 2
unsigned int: 4
unsigned long: 8
unsigned long long: 8
unsigned: 4
size_t: 8
void *: 8
float: 4
double: 8
long double: 16


Now notice for 32 bit, the void * type is 4 bytes, and on 64 bit, it is 8 bytes. This means that a void * can address any position in memory that the system is able to. In order to specify the amount, you'll notice in each case the sizeof(size_t) matches the sizeof(void *). If I were to get a 128 bit system which had a void * of 16 bytes, the size_t would also be at least 16 bytes.

There are other types which also matched the size of the void * each time around, but if you look at my earlier explanation of the relationship of the various C types, you'll notice those other types are too variable to know if they'll be good for an amount of every system you try to use your application or library on, so always stick with the size_t.

All too often I see programmers use an int for such an amount which is clearly wrong, or an unsigned int, which is better, but also wrong, especially on my 64 bit system. Some programmers I also see use the type of just unsigned by itself. I don't know who thought that up, but it's identical to an unsigned int, which is also clearly wrong. I included unsigned and signed specifically in my test above because some people are under the mistaken notion that those types become as large as possible. Every time I see code which contains one of them, I want to vomit.

If you ever have any doubt as to what size a type may be on your system, or get into an argument with someone, test things yourself with a sizeof() call, it's not hard to do, or point them to this article.

Now go out there and write some good libraries. I have too much vomit and bile on the floor here from all the horrible code I have to work with or review.