Thursday, November 25, 2010


C++ Serialization Anyone?



Today I had one of the most amazing programming experiences of my entire career. I'm still a bit stunned that it happened. Until now, I was fully convinced that what we did was completely impossible.

At work, we use a lot of different languages to create our software. It's not odd for us to be working on a project which somehow ends up using a dozen languages. Between server code, client code, databases, communication, markup, styling, pre-processing, dynamic code generation, and other common pieces, it's rather easy to get there, actually.

Between all these different programming languages, we quite often need some sort of data interchange format. There are many to choose from, ranging from something custom to something well known like XML. Using these formats, we can pass data from one segment of our application stack to another, even when the two segments use different programming languages, or save some data and load it back up later.

When it comes to these things, soft typed functional languages are generally easier to work with than hard typed ones. Soft typed languages are very good at building objects from data on the fly, thanks to their ability to not care much about what types they're looking at. Is it a number or a string? It doesn't matter to a soft typed language, as it stores everything the same way.

When dealing with database access from hard typed languages, the popular method is to create some sort of catch-all or convert-to-anything type. For some, terms like "QVariant" or "boost::any" are always on their lips. The intent of these and similar constructs is to ease things when dealing with data of an unknown type. Such constructs, however, generally require building a switch block which checks some enumeration to figure out how to handle the data within the rest of the program. Such code is just downright annoying.
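To make the complaint concrete, here is a minimal sketch of the catch-all pattern. The `AnyValue` type and its helper functions are hypothetical, made up for this illustration; they are not the API of QVariant or boost::any, which differ in the details. The point is only the shape: every consumer ends up switching on a type tag.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical catch-all value: a type tag plus one slot per supported type.
struct AnyValue {
    enum Kind { Integer, Real, Text };
    Kind kind;
    long long integer;
    double real;
    std::string text;
};

AnyValue makeInteger(long long v) {
    AnyValue a = {AnyValue::Integer, v, 0.0, std::string()};
    return a;
}

AnyValue makeReal(double v) {
    AnyValue a = {AnyValue::Real, 0, v, std::string()};
    return a;
}

AnyValue makeText(const std::string &v) {
    AnyValue a = {AnyValue::Text, 0, 0.0, v};
    return a;
}

// Every piece of code that touches an AnyValue needs a switch like this one.
std::string describe(const AnyValue &v) {
    std::ostringstream out;
    switch (v.kind) {
        case AnyValue::Integer: out << "integer:" << v.integer; break;
        case AnyValue::Real:    out << "real:" << v.real;       break;
        case AnyValue::Text:    out << "text:" << v.text;       break;
    }
    return out.str();
}

// Usage: the caller never sees a static type, only the tag.
inline void exampleDescribe() {
    assert(describe(makeInteger(42)) == "integer:42");
    assert(describe(makeText("hi")) == "text:hi");
}
```

Adding a new supported type means touching the enumeration and every such switch, which is where the annoyance comes from.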

At work some time back, thanks to a lot of the new features C++-201x has been adding, we've been able to build a database access library which can handle data without any of these old kludges. Essentially, database access for us in C++ is now just as easy as it is in PHP (or perhaps even easier!).

Now database communication is great, but there's still the issue of data interchange between two programs which aren't using a database as an intermediary. Many soft typed or functional languages offer a simple encode() or decode() function: pass it any object, and you get a nice string representation of it which can be sent off, or saved to a file for later. C++ and related languages have always had the nightmare of needing to iterate manually over every data type, or over a hierarchy, to work with something like XML or similar data formats.

There are those who have created workarounds, of course, such as adding a serialize() function individually to every type you have to work with. Or creating some serializable objects that one copies data to or from, and which handle all the serialization work internally. Or, one of my personal favorites, writing a separate parser which can read a description of a format and generate the C++ objects and code needed to serialize or deserialize it.

Well, today a coworker and I were putting our heads together on how to deal with a certain project. Some time back, I wrote code which can serialize/deserialize to and from a std::map which contains numbers, strings, or a mix thereof. We were using this as the data interchange format between two programs. However, now we need to deal with much more complex data, and a series of key-value pairs just won't cover it. One end of the equation is C++; the other end is a soft typed language which could pretty easily work with whatever we came up with.
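A flat key-value format of this kind can be sketched in a few lines. This is not the original code: the `key=value` line format, the names `encodePairs` and `decodePairs`, and the use of strings for all values are assumptions chosen for illustration, and the sketch assumes keys and values contain neither `=` nor newlines.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string, std::string> KeyValues;

// Write one "key=value" line per entry (std::map iterates in key order).
std::string encodePairs(const KeyValues &pairs) {
    std::string out;
    for (KeyValues::const_iterator it = pairs.begin(); it != pairs.end(); ++it) {
        out += it->first + "=" + it->second + "\n";
    }
    return out;
}

// Read the lines back, splitting each at the first '='.
KeyValues decodePairs(const std::string &text) {
    KeyValues pairs;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) continue;  // skip lines without a '='
        pairs[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return pairs;
}

// Usage: a round trip gives back the same map.
inline void examplePairs() {
    KeyValues m;
    m["id"] = "7";
    m["name"] = "widget";
    assert(encodePairs(m) == "id=7\nname=widget\n");
    assert(decodePairs(encodePairs(m)) == m);
}
```

The limitation is visible right away: a value can only be a string or a number written as one, so a list inside a map inside a list has nowhere to go.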

We first thought about using a classic format such as XML or JSON, with some kind of hierarchical writer on the C++ side, and having the soft typed language just read it directly into an object with one of its built-in language features. Then my friend had a brilliant realization: the hierarchy of language containers and their children is recursive, as is any serialization that can encode an arbitrary amount of data stacked in a hierarchy. So we started discussing whether we could make a serialize() function in C++ which could take any C++ type and work, even without knowing everything about it in advance. It'd be easy for plain old data types, but it gets more complicated once we start dealing with containers of those, and containers of containers.

Of course, this is where most conversations along these lines end. But then I brought up template metaprogramming, and some new features C++ is now adding (and which are already in GCC), and this discussion went on much further than usual, to the point where we were talking code. Well, we got into it, and two hours and two hundred lines of code later, we now have a function with the following prototype:

template<typename T>
std::string serialize(const T &t);


It is able to take any type that exists in C++, and well, serialize it. Have some type which contains some types which contain a few more types which contain some other types? It's all serializable with this function. No pre-processing, dynamic code generation, compiler hacks, or clumsy per-program hierarchical parsing required. It just works™.
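The two hundred lines themselves aren't shown here, so the following is only a guess at the general shape, not the actual code. It is a minimal sketch of the recursive idea using C++0x features (`decltype`, `auto`, `<type_traits>`): one overload per kind of type, with containers and pairs calling back into serialize() for their children. The helper names `serializeImpl` and `Rank`, and the bracketed output format, are assumptions made for this example. The sketch only covers arithmetic types, std::string, std::pair, and anything with begin()/end(); strings are quoted but not escaped, and a user-defined struct would need its own overload.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Overload-ranking tags: passing Rank<2> prefers Rank<2> overloads over Rank<1>.
template<int N> struct Rank : Rank<N - 1> {};
template<> struct Rank<0> {};

// Declared up front so the overloads below can recurse through it.
template<typename T>
std::string serialize(const T &t);

// Numbers (and bool/char): rely on stream formatting.
template<typename T>
typename std::enable_if<std::is_arithmetic<T>::value, std::string>::type
serializeImpl(const T &t, Rank<2>) {
    std::ostringstream out;
    out << t;
    return out.str();
}

// Strings: wrap in quotes (no escaping in this sketch).
inline std::string serializeImpl(const std::string &s, Rank<2>) {
    return "\"" + s + "\"";
}

// Pairs (which is also what std::map stores): a two-element list.
template<typename A, typename B>
std::string serializeImpl(const std::pair<A, B> &p, Rank<2>) {
    return "[" + serialize(p.first) + "," + serialize(p.second) + "]";
}

// Containers: anything with begin() and end(); each child is serialized recursively.
template<typename T>
auto serializeImpl(const T &c, Rank<1>)
    -> decltype(c.begin(), c.end(), std::string()) {
    std::string out = "[";
    for (auto it = c.begin(); it != c.end(); ++it) {
        if (it != c.begin()) out += ",";
        out += serialize(*it);
    }
    return out + "]";
}

// The single entry point, matching the prototype in the post.
template<typename T>
std::string serialize(const T &t) {
    return serializeImpl(t, Rank<2>());
}

// Usage: containers of containers need no extra code.
inline void exampleSerialize() {
    std::vector<std::vector<int>> nested = {{1, 2}, {3}};
    assert(serialize(nested) == "[[1,2],[3]]");
    std::map<std::string, std::vector<int>> m;
    m["a"] = {1, 2};
    assert(serialize(m) == "[[\"a\",[1,2]]]");
}
```

The compiler picks the right overload at each level of nesting, so the recursion over the type hierarchy mirrors the recursion over the output format.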

Now next week, we'll have to write the deserializer function to pull that magic in reverse. Using the same idea, it shouldn't be a problem: if the data matches the supplied structure, parse it in; otherwise, throw an error. But for now, our project is done, as our C++ applications are able to send very complex data to our soft typed languages rather easily.
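As a rough idea of what "the same idea in reverse" could look like, here is a minimal sketch in which the requested C++ type drives the parsing. Nothing here is the planned implementation: the `Reader` helper, the `parseInto` overloads, and the bracketed input format are all assumptions for illustration, and only `int` and nested `std::vector` are handled, with no whitespace allowed in the input.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical cursor over the input text.
struct Reader {
    const std::string &text;
    std::size_t pos;
    explicit Reader(const std::string &t) : text(t), pos(0) {}
    bool atEnd() const { return pos >= text.size(); }
    char peek() const { return atEnd() ? '\0' : text[pos]; }
    void expect(char c) {
        if (peek() != c) throw std::runtime_error(std::string("expected '") + c + "'");
        ++pos;
    }
};

// Integers: an optional '-' followed by at least one digit.
inline void parseInto(Reader &r, int &out) {
    std::size_t start = r.pos;
    if (r.peek() == '-') ++r.pos;
    std::size_t firstDigit = r.pos;
    while (std::isdigit(static_cast<unsigned char>(r.peek()))) ++r.pos;
    if (r.pos == firstDigit) throw std::runtime_error("expected an integer");
    out = std::stoi(r.text.substr(start, r.pos - start));  // may throw on overflow
}

// Vectors: '[' element (',' element)* ']', with each element parsed recursively.
template<typename T>
void parseInto(Reader &r, std::vector<T> &out) {
    out.clear();
    r.expect('[');
    if (r.peek() == ']') { ++r.pos; return; }
    for (;;) {
        T element = T();
        parseInto(r, element);
        out.push_back(element);
        if (r.peek() == ',') { ++r.pos; continue; }
        r.expect(']');
        return;
    }
}

// Entry point: the data must match the shape of T exactly, or an exception is thrown.
template<typename T>
T deserialize(const std::string &text) {
    Reader r(text);
    T value = T();
    parseInto(r, value);
    if (!r.atEnd()) throw std::runtime_error("trailing characters");
    return value;
}

// Usage: the caller states the expected structure as a type.
inline void exampleDeserialize() {
    std::vector<std::vector<int>> expected = {{1, 2}, {3}};
    assert((deserialize<std::vector<std::vector<int>>>("[[1,2],[3]]") == expected));
}
```

Asking for a `std::vector<int>` and feeding it a nested list or a string fails with an exception rather than producing a half-filled object.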

Looking over the code with my coworker, it all seems extremely obvious. Why the heck didn't we think of this 20 years ago? Now am I getting all excited over something that has been done before? Anyone familiar with anything like this?

Question is, what to do now that we know this? File for a patent? (Yes, I'm evil!) Or perhaps ignore this, as no one cares about this topic anyway?