Saturday, July 24, 2010

Simplifying bootstrapping for virtual constructors

Last week I demonstrated a solution to the virtual constructor problem. My solution avoids many issues with the factory function solution. Yet it did require some bootstrapping to use.

The bootstrapping required a new function to be created for every single derived class that needs to be virtualized. When working with many derived classes, this becomes unacceptable. It's bad enough to solve this problem we need to generate a map, should we have to create additional functions as well? Each time a new derived class is added, should I go out of my way with two steps?

Turns out, making use of templates, we can combine the define and function generation step.

static compress *compress_zip::construct() { return new compress_zip; }
static compress *compress_gzip::construct() { return new compress_gzip; }
static compress *compress_7zip::construct() { return new compress_7zip; }


Instead of creating the above, and using it as follows:

std::map<COMPRESS_TYPES, compress *(*)()> compress_factory;
compress_factory[COMPRESS_ZIP] = compress_zip::construct;
compress_factory[COMPRESS_GZIP] = compress_gzip::construct;
compress_factory[COMPRESS_7ZIP] = compress_7zip::construct;


First create a single construct template function:

template <typename T>
compress *compress_construct()
{
return new T;
}


This has to be done only once.

Now when adding to the map, we can do the following:

std::map<COMPRESS_TYPES, compress *(*)()> compress_factory;
compress_factory[COMPRESS_ZIP] = compress_construct<compress_zip>;
compress_factory[COMPRESS_GZIP] = compress_construct<compress_gzip>;
compress_factory[COMPRESS_7ZIP] = compress_construct<compress_7zip>;


If some new type now comes along, simply add it with a single line. A new function will be generated on use by the template, so you no longer have to. Now we have truly managed to map a type directly to an identifier.

Of course no tutorial would be complete without a self contained example:

#include <iostream>
#include <map>
#include <string>

class base
{
std::string n;

protected:
base(const std::string &n) : n(n) {}

public:
base() : n("base") {}
std::string name() { return(n); }
};

struct derived : public base
{
derived() : base("derived") {}
};

template <typename T>
base *construct()
{
return new T;
}

int main(int argc, const char *const *const argv)
{
std::map<std::string, base *(*)()> factory;
factory["b"] = construct<base>;
factory["d"] = construct<derived>;

try
{
//Instantiate based on run-time variables
base *obj = factory.at(argv[1])();

//Output
std::cout << obj->name() << std::endl;

//Cleanup
delete obj;
}
catch (const std::exception &e) { std::cout << "Error occured: " << e.what() << std::endl; }

return 0;
}


Output:

/tmp> g++-4.4 -Wall -o factory_test factory_test.cpp
/tmp> ./factory_test b
base
/tmp> ./factory_test d
derived
/tmp>

Now that this problem has been solved nice and neatly. What about solving it for multiple constructors? Also known as the abstract factory problem. What if each class has multiple constructors, and we want a collection of them mapped to a single identifier? Can we do it without repeating a lot of code over and over?

With some minor bootstrapping, the answer is again yes! There's multiple solutions to this problem, but the following is what I found to be the nicest at the moment.

First create a pure virtual class with a function to match each constructor you'd like to virtualize. Each should of course return a base pointer.

Imagine we had 3 constructors, one taking no parameters, one taking a C string, and another taking a C++ string, we would setup the following:

struct construct_interface
{
virtual base *operator()() const = 0;
virtual base *operator()(const char *) const = 0;
virtual base *operator()(const std::string &) const = 0;
};


Once we have the interface defined, we'll create a template construct function which implements and returns that interface within a singleton similar to the above construct function:

template <typename T>
const construct_interface *construct()
{
static struct : public construct_interface
{
base *operator()() const { return new T; }
base *operator()(const char *s) const { return new T(s); }
base *operator()(const std::string &s) const { return new T(s); }
} local;
return &local;
}


It'd be nice to return a reference instead of a pointer, but dealing with references in std::map is kind of icky. We can use macros to cleanup any annoying pointer dereferencing issues that could arise.

Now our construct function returns a pointer to an interface which can construct a derived type using any of its constructors. We'll use it with an std::map like so:

std::map<std::string, const construct_interface *> factory;
factory["base"] = construct<base>();
factory["derived"] = construct<derived>();


Notice the map no longer tracks function pointers, but pointers to the interface. Also, when assigning to the map, we're calling the construct function to obtain the pointer. We could modify the above example to track function pointers and leave out the (), and even make the construct function return a reference to an interface instead, but then we'd need to add an extra () when creating objects. While that too can be hidden by a macro, or just ignored, as an extra () still looks rather clean, it does add extra overhead, as it is likely you'll initialize your map just once, and use it to create many objects during the lifetime of the program.

Now to use the map to create an object, the following has to be done:

//Use first constructor, the default constructor
base *obj1 = (*(factory).at(id))();

//Use second constructor, the one taking a C string
base *obj2 = (*(factory).at(id))(s);


It works nicely, but as I explained above that's rather ugly.

This macro can help simplify things:

#define VIRTUAL_NEW(factory, id) (*(factory).at((id)))


And unlike the interface class and template construction function which needs to be created for each set of classes making use of virtual constructors, the above macro can be reused for every virtual constructor collection that makes use of the above idiom.

Using the macro, the code now looks as follows:

//Use first constructor, the default constructor
base *obj1 = VIRTUAL_NEW(factory, id)();

//Use second constructor, the one taking a C string
base *obj2 = VIRTUAL_NEW(factory, id)(s);


That's it, problem solved!

Putting it all together, here's a working example:

#include <iostream>
#include <map>
#include <string>
#include <cstdlib>

class base
{
protected:
int x, y;

public:
base() : x(1), y(2) {}
base(const char *s) : x(std::atoi(s)), y(3) {}
base(const std::string &s) : x(std::atoi(s.c_str())), y(4) {}

virtual int operator()() { return x+y; }
};

class derived : public base
{
public:
derived() {}
derived(const char *s) : base(s) {}
derived(const std::string &s) : base(s) {}

virtual int operator()() { return x*y; }
};

struct construct_interface
{
virtual base *operator()() const = 0;
virtual base *operator()(const char *) const = 0;
virtual base *operator()(const std::string &) const = 0;
};

template <typename T>
const construct_interface *construct()
{
static struct : public construct_interface
{
base *operator()() const { return new T; }
base *operator()(const char *s) const { return new T(s); }
base *operator()(const std::string &s) const { return new T(s); }
} local;
return &local;
}

#define VIRTUAL_NEW(factory, id) (*(factory).at((id)))

int main(int argc, const char *const *const argv)
{
std::map<std::string, const construct_interface *> factory;
factory["b"] = construct<base>();
factory["d"] = construct<derived>();

if (argc == 3)
{
try
{
//Instantiate based on run-time variables
base *obj1 = VIRTUAL_NEW(factory, argv[1])();
base *obj2 = VIRTUAL_NEW(factory, argv[1])(argv[2]);
base *obj3 = VIRTUAL_NEW(factory, argv[1])(std::string(argv[2]));

//Output
std::cout << (*obj1)() << '\n'
<< (*obj2)() << '\n'
<< (*obj3)() << '\n'
<< std::flush;

//Cleanup
delete obj1;
delete obj2;
delete obj3;
}
catch (const std::exception &e) { std::cout << "Error occured: " << e.what() << std::endl; }
}

return 0;
}


Output:

/tmp> g++-4.4 -Wall -o abstract_factory_test abstract_factory_test.cpp
/tmp> ./abstract_factory_test b 2
3
5
6
/tmp> ./abstract_factory_test b 3
3
6
7
/tmp> ./abstract_factory_test d 2
2
6
8
/tmp> ./abstract_factory_test d 3
2
9
12
/tmp>

Hopefully you should now be able to take this example and plug it in just about anywhere. Easy to add new types. No need to modify existing classes. Fully dynamic. Easy to use!

This method really shines if you dynamically load derived classes while your program is running. Just make sure your DLL uses the the template function internally, so an instance of the function for the new type is created, and dynamically add an id to the map, and presto you're done.

Now go out there and leverage the power of C++!

Tuesday, July 20, 2010

Time to shutdown the factory for code violations

There's a common problem in software development regarding how to create an object's type dynamically in C++ and similar languages. If I have a collection of classes (they all inherit from the same base), I can use a base pointer to any instance, and perform operations as needed. The issue is generating that pointer in the first place.

Say I have an image class, with subclasses of image_png, image_jpeg, image_gif, and so on. I can use an image pointer along with its virtual functions to do an operation like image_pointer->resize(new_dimensions), and have it do whatever needs to be done for the image type in question to resize it and write the data properly.

This idea applies to all kinds of projects over and over again. Another example could be a compression class. Where I have the class compress, along with subclasses compress_zip, compress_gzip, compress_7zip, and so on. I can do compress_pointer->uncompress(in_file, out_file), and have that work appropriately with virtual functions.

However my image pointer and my compress pointer need to be pointing at the right kind of object for this to work in the first place. My image is sent over the network, I have information at runtime of image/jpeg and the binary data, and I need to load that into the correct subclass image_jpeg. My compressed file is on the hard drive, I read its type via its extension or its header, again I need my pointer to point at the appropriate subtype. I don't have this information at compile time, so how do I initialize my pointer?

In this case, the idea of virtual constructors, or factories come into play. The simplest factory would be in the form of an if/else if structure, or a switch.
image *image_factory(const char *mime_type, const uint8_t *binary_data, size_t size)
{
  image *image_pointer;
  if (!strcmp(mime_type, "image/jpeg")) { image_pointer = new image_jpeg(binary_data, size); }
  else if (!strcmp(mime_type, "image/png")) { image_pointer = new image_png(binary_data, size); }
  else if (!strcmp(mime_type, "image/gif")) { image_pointer = new image_gif(binary_data, size); }
  else { throw&nbsp"Unsupported image type"; }
  return image_pointer;
}

compress *compress_factory(COMPRESS_TYPES requested_save_format)
{
  compress *compress_pointer;
  switch (requested_save_format)
  {
    case COMPRESS_ZIP: compress_pointer = new compress_zip; break;
    case COMPRESS_GZIP: compress_pointer = new compress_gzip; break;
    case COMPRESS_7ZIP: compress_pointer = new compress_7zip; break;
    default: throw std::runtime_error("Unknown enum value");
  }
  return compress_pointer;
}


Now the above functions allow us to get the base pointer initialized appropriately, but they have some drawbacks.

  • A single function needs to know about every type.

  • If which types are supported changes during runtime, perhaps based on which DLLs are found, the factory function becomes more complex, and has to be specifically tailored for each application.

  • If there are multiple constructors, and in some cases one is needed, and in some cases another, suddenly we need to recreate the same logic in multiple factory functions, one factory function per constructor.

  • This style of factory function has to be recreated for every single collection of classes present, the logic behind a factory function is not abstracted away.


Based on these drawbacks, obviously such a method is the incorrect solution, even though it is the most popular method used.

In the case where we have variables mapping to other variables, such as 1 = "Apple", 2 = "Orange", 3 = "Tomato", the obvious solution would be to setup an array or use a map. If the mapping needed is between an integer and a variable, going from integer -> variable is simply a matter of indexing into an array. When dealing with non integer types, the standard std::map fits the bill rather nicely, as it can allow any type to become an index.

So the question becomes, can we map a variable to a type itself? With a bit of bootstrapping, the answer is actually yes! What we need are function pointers to replace code logic.

We can't create a function pointer directly to a constructor, so we'll need a bit of bootstrapping. We can create function pointers to global functions, or static member functions.

In our compression case we can do the following:
static compress *compress_zip::construct() { return new compress_zip; }
static compress *compress_gzip::construct() { return new compress_gzip; }
static compress *compress_7zip::construct() { return new compress_7zip; }


We'll place a static construct() within each of our classes we'd like to be able to instantiate dynamically, and then we can use them in a map like so:
std::map<COMPRESS_TYPES, compress *(*)()> compress_factory;
compress_factory[COMPRESS_ZIP] = compress_zip::construct;
compress_factory[COMPRESS_GZIP] = compress_gzip::construct;
compress_factory[COMPRESS_7ZIP] = compress_7zip::construct;


Once we have a map in place that can be constructed at runtime, we no longer have most of our above issues. No canned function needs to be created which knows about all the supported types in advance, which needs modification for each new type. Values are only added to the map that the required DLLs were found for it, or other runtime constraints. The entire idea is also abstracted away, and we don't need to constantly recreate new factory functions for each family of objects. We simply just initialize a factory object and make use of it.

Now how exactly do we use our map? The idea which is rather popular and expounded in book after book, is to create a factory template class something similar to the following:
template <typename identifier, typename product, typename function>
struct factory
{
  void set_mapping(const identifier &id, function f) { mapping[id] = f; }
  product *construct(const identifier &id)
  {
    function f = mapping.at(id);
    return f();
  }

  private:
  std::map<identifier, function> mapping;
};


Now that we have all that encapsulated, we can do the following:
factory<COMPRESS_TYPES, compress, compress *(*)()> myfactory;
myfactory.set_mapping(COMPRESS_ZIP, compress_zip::construct);
...
compress *compress_pointer = myfactory.construct(requested_save_format);


However, such a method has a serious drawback. What if my constructor needs parameters? As it does in the image example above? Do I end up creating a new factory encapsulation class each time? One library out there solves this problem with ~800 lines of code (read Modern C++ Design for more info). Basically the construct function in the factory class is overloaded a bunch of times with template parameters over and over again, for some conceivable amount of parameters a function might take.
AbstractProduct *CreateObject(const IdentifierType &id,
                              Parm1 p1, Parm2 p2, Parm3 p3, Parm4 p4, Parm5 p5, Parm6 p6, Parm7 p7, Parm8 p8)
{
  typename IdToProductMap::iterator i = associations_.find(id);
  if (i != associations_.end()) { return (i->second)(p1, p2, p3, p4, p5, p6, p7, p8); }
  return this->OnUnknownType(id);
}

The above snippet is repeated 16 times in that library with 0-15 parameters. First of all, such code commits the unforgivable sin of code repetition, and is limited to 15 parameters. What if I need more? Sin more?

The entire idea of a factory which produces an object within itself needs to be shutdown. By trying to be overly clever and encapsulate more than needed, object factories turn into nothing but bloat and horrible code.

Simplicity is king. Therefore, the map should NOT be encapsulated, nor should an object be directly produced. Our so called factory should only produce the appropriate mold needed to form our object.

In the case of images:
static image *image_jpeg::construct(const uint8_t *binary_data, size_t size)
{
  return new image_jpeg(binary_data, size);
}

static image *image_png::construct(const uint8_t *binary_data, size_t size)
{
  return new image_png(binary_data, size);
}

static image *image_gif::construct(const uint8_t *binary_data, size_t size)
{
  return new image_gif(binary_data, size);
}

std::map<std::string, image *(*)(const uint8_t *, size_t)> image_factory;
image_factory["image/jpeg"] = image_jpeg::construct;
image_factory["image/png"] = image_png::construct;
image_factory["image/gif"] = image_gif::construct;


Now to create an image, simply do the following when you want to create an image based on run time variables:
image *image_pointer = image_factory.at(mime_type)(binary_data, size);

Notice that code would take the place of:
image *image_pointer = new image_png(binary_data, size);


Now the creation of a particular object at runtime looks pretty much like creating a specific object, and can feel natural. Just like new which throws an error when it fails (which can happen here too), std::map::at() throws an error when the requested index is not found, so you can reuse the same type of exception handling you're already using when creating objects in general.

If you'd like a complete example which you can play with yourself, try this:
#include <iostream>
#include <string>
#include <map>
#include <stdexcept>
#include <cstdlib>

class Base
{
  int x, y;
  public:
  Base(int x, int y) : x(x), y(y) {}
  virtual ~Base() {};
  virtual int mult() { return x*y; }
  virtual int add() { return x+y; }

  static Base *create(int x, int y) { return new Base(x, y); }
};

class Derived : public Base
{
  int z;
  public:
  Derived(int x, int y) : Base(x, y), z(1) {}
  void setZ(int z) { this->z = z; }
  virtual int mult() { return Base::mult()*z; }
  virtual int add() { return Base::add()+z; }

  static Base *create(int x, int y) { return new Derived(x, y); }
};

int main(int argc, const char *const *const argv)
{
  //Get factory ready
  std::map<std::string, Base *(*)(int, int)> factory;

  //Add some rules
  factory["base"] = Base::create;
  factory["derived"] = Derived::create;

  try
  {
    //Instantiate based on run-time variables
    Base *obj = factory.at(argv[1])(4, 5); //Note: instead of new Base(4, 5) or new Derived(4, 5)

    //Set Z if derived and requested
    if (argc > 2)
    {
      if (Derived *d = dynamic_cast<Derived *>(obj)) { d->setZ(std::atoi(argv[2])); }
    }

    //Output
    std::cout << obj->mult() << ' ' << obj->add() << std::endl;

    //Cleanup
    delete obj;
  }
  catch (const std::exception &e) { std::cout << "Error occured: " << e.what() << std::endl; }

  return 0;
}


Here's what the output looks like:
/tmp> g++-4.4 -Wall -o factory_test factory_test.cpp
/tmp> ./factory_test
Error occured: basic_string::_S_construct NULL not valid
/tmp> ./factory_test base
20 9
/tmp> ./factory_test derived
20 10
/tmp> ./factory_test derived 6
120 15
/tmp> ./factory_test cows
Error occured: map::at
/tmp>


Basically the only issue we're left with is that for each class you want to be able to use a factory for, you have to create a static member function for each constructor you want to use with a particular factory. I don't think there is away around that. It's a minor inconvenience, but in the end offers a lot of power in a clean fashion.

This idea of combining a map with a function pointer can be expanded to other areas as well. You can use this method for calling functions in general based on any runtime input. If your class has many different member functions, each with the same parameters, but for different operations, consider creating a map to member function pointers, and use the input as an index into a map to call the appropriate member function.

Suggestions, improvements, and comments welcome.