Sunday, November 11, 2007

Directory safety when working with paths

As we discussed in our previous two articles, we run into issues of thread safety when we try changing the path during a running program. Therefor, in addition to bypassing PATH_MAX problems, working around buffer issues thanks to implementing with C++, we would also love to implement our functions without ever calling chdir() and changing the working directory. We'll now look into how we can make changes to our implementation to avoid a chdir() call, along with the pros and cons of techniques UNIX operating systems provide us, as well as what they're lacking.

If one looks back at how we implemented getcwd(), they'll notice that we call chdir("..") for each step in our loop. We need this to sanitize the location for these three calls:

opendir(".")
lstat(entry->d_name, &sb)
stat(".", &sb)


To remove the need of changing directories, we can place the ".." paths directly into our calls to opendir(), lstat(), and stat(). Thanks to C++ Strings being dynamic, it makes it easy for us to manipulate paths for these situations properly. We'll also see that being clever, we can even avoid string manipulation for some of these cases. And if our OSs supplied enough functionality, without any string manipulation at all!

Here's how we can implement getcwd() based off our previous implementation, but without calls to chdir(), with key details highlighted:

bool getcwd(std::string& path)
{
typedef std::pair<dev_t, ino_t> file_id;

bool success = false;
struct stat sb;
if (!stat(".", &sb))
{
file_id current_id(sb.st_dev, sb.st_ino);
if (!stat("/", &sb)) //Get info for root directory, so we can determine when we hit it
{
std::vector<std::string> path_components;
file_id root_id(sb.st_dev, sb.st_ino);
std::string up_path("..");

while (current_id != root_id) //If they're equal, we've obtained enough info to build the path
{
bool pushed = false;
DIR *dir = opendir(up_path.c_str());
if (dir)
{
dirent *entry;
up_path += "/";
std::string::size_type after_slash = up_path.size();
while ((entry = readdir(dir))) //We loop through each entry trying to find where we came from
{
if (strcmp(entry->d_name, ".") && strcmp(entry->d_name, ".."))
{
up_path.replace(after_slash, std::string::npos, entry->d_name);
if (!lstat(up_path.c_str(), &sb))
{
file_id child_id(sb.st_dev, sb.st_ino);
if (child_id == current_id) //We found where we came from, add its name to the list
{
path_components.push_back(entry->d_name);
pushed = true;
break;
}
}
}
}

if (pushed && !fstat(dirfd(dir), &sb)) //If we have a reason to contiue, we update the current dir id
{
current_id = file_id(sb.st_dev, sb.st_ino);
up_path.replace(after_slash, std::string::npos, ".."); //Keep recursing towards root each iteration
}

closedir(dir);
}
if (!pushed) { break; } //If we didn't obtain any info this pass, no reason to continue
}

if (current_id == root_id) //Unless they're equal, we failed above
{
//Built the path, will always end with a slash
path = "/";
for (std::vector<std::string>::reverse_iterator i = path_components.rbegin(); i != path_components.rend(); ++i)
{
path += *i+"/";
}
success = true;
}
}
}

return(success);
}

First of all, we removed all the code related to saving the current directory when we started, since we don't need that information anymore. Our first major change is that we created a new C++ String called up_path, we'll use this variable to keep track of the path, initializing it to "..". then for each step through the loop, make it "../..", "../../.." and so on, till we reach the root. We use this to replace our calls to opendir() to "." as we were doing before.
At this point, we'll add a slash to the path, and keep track of the spot with the variable after_slash. Now in our read directory loop, we can replace whatever is after the slash with the filename in the directory to pass to lstat(), again bypassing the need to be in the same directory as the file when making the function call.
Now for the stat() call on the directory itself, we got a little interesting. Instead of doing a path manipulation trick again, we call fstat() on the file descriptor returned from dirfd() on the already open directly handle. Notice how the call to close directory has been moved to after the block of code, so the directory is still open. And of course it's all wrapped up nicely with appending ".." after the slash.

Noticing how we eliminated path manipulation for the last part, it would be nice to eliminate more of it, especially if anyone wants to port this C++ code to C. The good news is that on Solaris and Linux we can, and as soon as the "at" functions get standardized, we can use them on the other UNIX OSs too. You can read more about them in one of our previous articles, File Descriptors and why we can't use them.

Here's how we can use the at functions to eliminate the path manipulation needed for calling lstat(), with the differences highlighted:

bool getcwd(std::string& path)
{
typedef std::pair<dev_t, ino_t> file_id;

bool success = false;
struct stat sb;
if (!stat(".", &sb))
{
file_id current_id(sb.st_dev, sb.st_ino);
if (!stat("/", &sb)) //Get info for root directory, so we can determine when we hit it
{
std::vector<std::string> path_components;
file_id root_id(sb.st_dev, sb.st_ino);
std::string up_path("..");

while (current_id != root_id) //If they're equal, we've obtained enough info to build the path
{
bool pushed = false;
DIR *dir = opendir(up_path.c_str());
if (dir)
{
int dir_fd = dirfd(dir);
dirent *entry;
while ((entry = readdir(dir))) //We loop through each entry trying to find where we came from
{
if (strcmp(entry->d_name, ".") && strcmp(entry->d_name, "..") && !fstatat(dir_fd, entry->d_name, &sb, AT_SYMLINK_NOFOLLOW))
{
file_id child_id(sb.st_dev, sb.st_ino);
if (child_id == current_id) //We found where we came from, add its name to the list
{
path_components.push_back(entry->d_name);
pushed = true;
break;
}
}
}

if (pushed && !fstat(dir_fd, &sb)) //If we have a reason to contiue, we update the current dir id
{
current_id = file_id(sb.st_dev, sb.st_ino);
up_path += "/.."; //Keep recursing towards root each iteration
}

closedir(dir);
}
if (!pushed) { break; } //If we didn't obtain any info this pass, no reason to continue
}

if (current_id == root_id) //Unless they're equal, we failed above
{
//Built the path, will always end with a slash
path = "/";
for (std::vector<std::string>::reverse_iterator i = path_components.rbegin(); i != path_components.rend(); ++i)
{
path += *i+"/";
}
success = true;
}
}
}

return(success);
}

As should be easily discernible, the string manipulation got easier, and mostly vanished. All the keeping track of a slash, and replaces has been replaced with only needing to append "/.." at the end of each loop.
Instead of manipulating a path string for our call to lstat(), we save the directory's file descriptor above (and subsequently use that instead of recalculating it lower down for fstat()), then use it to get the lstat() for each file in the directory, but now via fstatat().
fstatat() is the same as stat()/lstat(), but takes a directory descriptor as the first parameter to offset where the filename is relative to. The last parameter can be 0 or AT_SYMLINK_NOFOLLOW, which makes it act like stat() or lstat() respectively. fstatat() instead of a file descriptor can also take the special value AT_FDCWD to have it automatically work in the current directory.

This implementation should be much more elegant, but a bit less portable. If one would like to implement fstatat() themselves, it's not hard, here's how you can do it:

int fstatat(int dirfd, const char *pathname, struct stat *buf, int flags)
{
int success = -1;

if ((!flags || (flags == AT_SYMLINK_NOFOLLOW)))
{
int cwdfd = -1;
if ((dirfd == AT_FDCWD) || (pathname && (*pathname == '/')) || (((cwdfd=open(".", O_RDONLY)) != -1) && !fchdir(dirfd)))
{
success = (!flags) ? stat(pathname, buf) : lstat(pathname, buf);
}

if (cwdfd != -1)
{
fchdir(cwdfd);
close(cwdfd);
}
}
else
{
errno = EINVAL;
}

return(success);
}

You'll also need the defines for AT_FDCWD and AT_SYMLINK_NOFOLLOW. You have to make sure that the value you choose for AT_FDCWD can't be a valid file descriptor or the normal failure return value. Therefor I chose -2 in my implementations, but choosing any negative value should be okay. AT_SYMLINK_NOFOLLOW shouldn't matter what it is, and a value of 1 or for all I care, 42, should be fine.
If your OS supports the at functions (currently Linux 2.6.16+ and recent versions of Solaris), it's better not to use a custom implementation, as this isn't thread safe, since it calls fchdir() internally, and for our example, runs counter to the whole point of using fstatat(). It would also be faster to use the original getcwd() implementation from a few days ago than using an emulated fstatat(), since there's less overhead from repeated fchdir() calls.
It's also interesting to note that glibc on Linux now, even for older than 2.6.16 implements fstatat() and the whole slew of at functions even when not supported in the Kernel. It's similar to ours, can be thread unsafe due to changing the working directory, and for some inexplicable reason, segfaulted in certain circumstances when I ran a large battery of tests on it and my implementation against the Kernel's to make sure that mine was working properly.

Anyways, with that out of the way, one can't hope but wonder if it would be possible to also eliminate the need of any path string manipulation for the call to opendir(). Mysteriously, neither Solaris nor Linux have an opendirat() call. If there was such, we could easily keep the previous directory open, till we obtained a new handle for its parent directory.
Baring having an opendirat() function, it'd be nice to implement such a thing. Some UNIX OSs I understand have a function perhaps named fopendir() or fdopendir() to promote a file descriptor to a directory handle. With such a function, we can simply write an opendirat() which calls openat(), then promotes it, but Linux doesn't have a method of promotion, although Solaris does (fdopendir()). Lets hope the people who are standardizing the new at functions for POSIX seriously consider what they're doing, otherwise we won't be able to use them.

Now that we've gotten getcwd() to be safe, we still need realpath() to be. In our realpath() implementation, the only unsafe parts was our function chdir_getcwd() which called chdir() internally as well as getcwd(). getcwd() is now taken care of, but we still need to rewrite the rest of chdir_getcwd() to not actually chdir().
To do such should now be easy, we simply do getcwd(), but with a new start path. We'd make a new function called getcwd_internal() which would take two parameters, one for the current directory, and another to initialize the up one, which we can wrap everything else around. Basically copy your favorite getcwd(), but make the modifications to the beginning like so:

bool getcwd_internal(std::string& path, const std::string& start_dir, std::string& up_path)
{
typedef std::pair<dev_t, ino_t> file_id;

bool success = false;
struct stat sb;
if (!stat(start_dir.c_str(), &sb))
{
file_id current_id(sb.st_dev, sb.st_ino);
if (!stat("/", &sb)) //Get info for root directory, so we can determine when we hit it
{
std::vector<std::string> path_components;
file_id root_id(sb.st_dev, sb.st_ino);

while (current_id != root_id) //If they're equal, we've obtained enough info to build the path
--SNIP--


Now we'd turn getcwd() and chdir_getcwd() into wrappers like so:

bool getcwd(std::string& path)
{
std::string up_path("..");
return(getcwd_internal(path, ".", up_path));
}

bool chdir_getcwd(const std::string& dir, std::string& path)
{
std::string up_path(dir+"/..");
return(getcwd_internal(path, dir, up_path));
}


Note that the name of chdir_getcwd() is now misleading that it doesn't actually change directories anymore. For this reason, it's a good idea to not put any implementation details into a function name, as a user shouldn't have to know about them, and over time, it can become inaccurate.

And there you have it folks, getcwd() and realpath() implemented safely. Hopefully in the process all you loyal readers learned a couple other things too.

Other improvements over all we discussed, might be to run a smart pass over concatenated path names at key locations to remove extra slashes together "///", unneeded current directory "/./", and extra parent directories "path1/path2/.." -> "path1/", which is useful if certain functions like opendir() or the stat() family of calls have internal buffer issues. But doing such is a discussion for another time.

This wraps up our long discussion. All comments and suggestions are welcome.

Friday, November 9, 2007

Implementing realpath() in C++

Before reading this post, please read the background in the previous post PATH_MAX simply isn't, as it covers important background, and has some code we'll be building on over here.

Here's what realpath() is supposed to do:

realpath() expands all symbolic links and resolves references to ’/./’, ’/../’ and extra ’/’ characters in the null terminated string named by path and stores the canonicalized absolute pathname in the buffer of size PATH_MAX named by resolved_path. The resulting path will have no symbolic link, ’/./’ or ’/../’ components.


Instead of attacking implementing such a thing head on, we'll take a bottom up approach. One should consider that being able to break down a path between the path components prior to the last slash and after the last slash is essential to figuring out which directory the file actually lies.

The built in functions for this are called dirname() and basename(), but they modify their arguments, and they do some funny handling in special cases which we don't want.

So to implement our own for our purposes, we can write a function as follows:

void relative_dir_base_split(const std::string& path, std::string& dir, std::string& base)
{
std::string::size_type slash_pos = path.rfind("/"); //Find the last slash
if (slash_pos != std::string::npos) //If there is a slash
{
slash_pos++;
dir = path.substr(0, slash_pos); //Directory is before slash
base = path.substr(slash_pos); //And obviously, the file is after
}
else //Otherwise, there is no directory present
{
dir.clear();
base = path;
}
}

The function is simple enough, pass it a path, and two C++ strings for the return values, and have it split them up.

The next function which we'll use is one to get the realpath() of a directory. We can easily get such a thing by chdir()'ing to the directory, running our getcwd(), and then chdir() back.
This simple function is as follows:


bool chdir_getcwd(const std::string& dir, std::string& path)
{
bool success = false;
int start_fd = open(".", O_RDONLY); //Open current directory so we can save a handle to it
if (start_fd != -1)
{
if (!chdir(dir.c_str())) //Change to directory
{
success = getcwd(path); //And get its path
fchdir(start_fd); //And change back of course
}
close(start_fd);
}
return(success);
}

This simple function to resolve where a directory is located uses should be obvious, however, it's also a core function in figuring our where a file is, since a file is located at the realpath() of its parent directory, plus the base name for the file itself.

Now to get the realpath() of a file, we have this simple wrapper over the above functions:

static inline bool realpath_file(const std::string& path, std::string& resolved_path)
{
bool success = false;
std::string dir;
std::string base;
relative_dir_base_split(path, dir, base);

//If there is a directory, get the realpath() for it, otherwise the current directory
if (dir.size() ? chdir_getcwd(dir, resolved_path) : getcwd(resolved_path))
{
resolved_path += base;
success = true;
}
return(success);
}

Now we have chdir_getcwd() which is basically a realpath() for a directory, and realpath_file() for files. Note however, that if you call realpath_file() directly, it won't actually ensure that the file exists, so probably a bad idea to call directly, unless you don't care if it exists or not. It also won't handle symlinks. To handle symlinks we need quite a bit more code.

First, let us write a wrapper in C++ for reading links:

bool readlink_internal(const std::string& path, std::string& buffer, ssize_t length)
{
bool success = false;
if (length > 0)
{
char *buf = new(nothrow) char[length+1]; //Room for Null
if (buf)
{
ssize_t amount = ::readlink(path.c_str(), buf, length+1); //Give room for failure
if ((amount > 0) && (amount <= length)) //If > length, it was modified mid check
{
buf[amount] = 0;
buffer = buf;
success = true;
}
delete[] buf;
}
}
return(success);
}

This function simply wraps up the built in one, and uses C++ strings for dynamic allocation, so we don't have to worry about it directly in our function to resolve symlinks.

Before diving into the actual details in resolving a symlink, let me present one more helper function:

void build_path_base_swap(std::string &path, const std::string& newbase)
{
string dir;
string base;
relative_dir_base_split(path, dir, base);

if (dir.size())
{
path = dir + newbase;
}
else
{
path = newbase;
}
}


This above function will take a path, and replace the base part of it, meaning if you pass it "path/cows.txt" and "feet.txt", path will become "path/feet.txt".

Now onto the symlink discussion.

Resolving symlinks in itself is hard though, since one symlink could lead to another, meaning we'll need a loop. We also need to defend ourselves against an infinite loop if one symlink links to itself, or a symlink earlier on in our resolve loop.

It is not safe enough to call a regular stat() to check for the final link existing (and not a loop), since during our resolve, an attacker can modify a link, so we have to detect that.
Unfortunately, most OSs/libraries have a define called MAXSYMLINKS which I've seen set anywhere from 8 to 30. They stop their loop once they've iterated MAXSYMLINKS times and return an errno of ELOOP. Unfortunately, doing such has two flaws, which I hope should be apparent. First of all, if a link links to itself immediately, it would take MAXSYMLINKS passes until the function finally realizes it (I'm looking at you OpenBSD). The other issue is that it's just wrong.
Hopefully any of us who has taken a data structures course would know that there are algorithms for detecting a loop in a linked list, and not have to resort to any dumb method which doesn't actually detect a loop. Also a number such as 8 is way too low which I've seen some do. I look at one of my Linux boxes, and I see some binaries are actually symlinked a dozen times. Like "upx" is symlinked to "upx-3.00", which is symlinked to "upx-3.00-nonfree", which is symlinked to, etc...

The optimal way to detect a loop in a linked list is explained here, see the last paragraph for the optimal O(N) solution. Unfortunately though, the optimal solution means we need to create a second iterator to check on symlinks, resolving links ahead of the main loop, two at a time. While great in memory, it can be a pain on a hard drive, where we have to deal with slow access, and the links can change during the check. Therefor I settled on a variation of the first solution, where we'll keep track of the file IDs (described in the previous post) of each symlink, and see if we ever hit a match. We'll store this data in an std::set, which is usually implemented as a Red & Black Tree which has O(log(N)) time for insertion and searching. Making our algorithm work at O(log(N))*2 (since we insert and find each time) * N, which reduces to O(N log(N)).

Here's the code:

bool symlink_resolve(const std::string& start, std::string& end)
{
typedef std::pair<dev_t, ino_t> file_id;

bool success = false;
if (start.size())
{
std::string path = start; //Need a modifyable copy
struct stat sb;
std::set<file_id> seen_links;

bool resolved_link;
do //The symlink resolve loop
{
resolved_link = false;
if (!lstat(path.c_str(), &sb))
{
file_id current_id(sb.st_dev, sb.st_ino);
if (seen_links.find(current_id) == seen_links.end()) //Not a link we've seen
{
seen_links.insert(current_id); //Add to our set

if (S_ISLNK(sb.st_mode)) //Another link
{
std::string newpath;
if (readlink_internal(path, newpath, sb.st_size))
{
if (newpath[0] == '/') //Absolute
{
path = newpath;
}
else //We need to calculate the relative path in relation to the current
{
build_path_base_swap(path, newpath);
}
resolved_link = true;
} //Else, Link can't be read, time to quit
}
else //Yay, it's not a link! got to the last part finally!
{
success = realpath_file(path, end);
}
} //Else, Nice try, someone linked a link back into a previous link during the scan to try to trick us into an infinite loop
} //Else, Dangling link, can't resolve
} while (resolved_link);
}
return(success);
}


And now with all those helper functions out of the way, we can finally write the actual realpath() function itself, and it's a heck of a lot simpler than other implementations you'll find out there.

Here it is:

bool realpath(const std::string& path, std::string& resolved_path, bool resolve_link = true)
{
bool success = false;
if (path.size())
{
struct stat sb;
if (!stat(path.c_str(), &sb))
{
bool (*rp)(const std::string&, std::string&) = resolve_link ? symlink_resolve : realpath_file;
success = S_ISDIR(sb.st_mode) ? chdir_getcwd(path, resolved_path) : rp(path, resolved_path);
}
}
return(success);
}

It simply stat()s then runs chdir_getcwd() on directories and symlink_resolve() on files. I also offer an added feature. If you don't want the file name itself resolved, which is quite often what people want when a program saves a new file based off a name of another files passed to it, you can pass false as the third parameter, and it'll call realpath_file() instead of symlink_resolve(). With this wrapper, realpath_file() also has a stat() call before it, so the file is checked for existence like the built in realpath().

This implementation protects against the issues of PATH_MAX as described in our previous posts. It's also written in C++ to avoid buffer issues. Unlike built in ones, we generally have faster realpath() on directory, since it uses a getcwd() technique, while most other ones out there use a complicated set of steps on every component in the path until the last one is reached, which is especially slow if there are symlinks in the path. We also beat the built in ones regarding speed in cases of link to self, and have support for a deeper set of links. Some implementations also have some fixed sized buffer which can easily get full, and return failure, which is annoying, but our implementation won't run into. Lastly, we offer not resolving the links at all on the last component.

Drawbacks to our implementation is that it won't work on Windows as described in our last post. But you can #ifdef for that, and easily enough wrap around fullpath(). It also isn't thread safe due to the use of chdir() (which many built in implementations also do), see the previous post on how to work around that. Next time, we'll discuss methods to do all this without a chdir() at all, thus ensuring thread safety without any headaches.

Saturday, November 3, 2007

PATH_MAX simply isn't

Many C/C++ programmers at some point may run into a limit known as PATH_MAX. Basically, if you have to keep track of paths to files/directories, how big does your buffer have to be?
Most Operating Systems/File Systems I've seen, limit a filename or any particular path component to 255 bytes or so. But a full path is a different matter.

Many programmers will immediately tell you that if your buffer is PATH_MAX, or PATH_MAX+1 bytes, it's long enough. A good C++ programmer of course would use C++ strings (std::string or similar with a particular API) to avoid any buffer length issues. But even when having dynamic strings in your program taking care of the nitty gritty issue of how long your buffers need to be, they only solve half the problem.

Even a C++ programmer may at some point want to call the getcwd() or realpath() (fullpath() on Windows) functions, which take a pointer to a writable buffer, and not a C++ string, and according to the standard, they don't do their own allocation. Even ones that do their own allocation very often just allocate PATH_MAX bytes.

getcwd() is a function to return what the current working directory is. realpath() can take a relative or absolute path to any filename, containing .. or levels of /././. or extra slashes, and symlinks and the like, and return a full absolute path without any extra garbage. These functions have a flaw though.

The flaw is that PATH_MAX simply isn't. Each system can define PATH_MAX to whatever size it likes. On my Linux system, I see it's 4096, on my OpenBSD system, I see it's 1024, on Windows, it's 260.

Now performing a test on my Linux system, I noticed that it limits a path component to 255 characters on ext3, but it doesn't stop me from making as many nested ones as I like. I successfully created a path 6000 characters long. Linux does absolutely nothing to stop me from creating such a large path, nor from mounting one large path on another. Running getcwd() in such a large path, even with a huge buffer, fails, since it doesn't work with anything past PATH_MAX.

Even a commercial OS like Mac OS X defines it as 1024, but tests show you can create a path several thousand characters long. Interestingly enough, OSX's getcwd() will properly identify a path which is larger than its PATH_MAX if you pass it a large enough buffer with enough room to hold all the data. This is possible, because the prototype for getcwd() is:
char *getcwd(char *buf, size_t size);


So a smart getcwd() can work if there's enough room. But unfortunately, there is no way to determine how much space you actually need, so you can't allocate it in advance. You'd have to keep allocating larger and larger buffers hoping one of them will finally work, which is quite retarded.

Since a path can be longer than PATH_MAX, the define is useless, writing code based off of it is wrong, and the functions that require it are broken.

An exception to this is Windows. It doesn't allow any paths to be created larger than 260 characters. If the path was created on a partition from a different OS, Windows won't allow anything to access it. It sounds strange that such a small limit was chosen, considering that FAT has no such limit imposed, and NTFS allows paths to be 32768 characters long. I can easily imagine someone with a sizable audio collection having a 300+ character path like so:
"C:\Documents and Settings\Jonathan Ezekiel Cornflour\My Documents\My Music\My Personal Rips\2007\Technological\Operating System Symphony Orchestra\The GNOME Musical Men\I Married Her For Her File System\You Don't Appreciate Marriage Until You've Noticed Tax Pro's Wizard For Married Couples.Track 01.MP5"


Before we forget, here's the prototype for realpath:
char *realpath(const char *file_name, char *resolved_name);


Now looking at that prototype, you should immediately say to yourself, but where's the size value for resolved_name? We don't want a buffer overflow! Which is why OSs will implement it based on the PATH_MAX define.
The resolved_name argument must refer to a buffer capable of storing at least PATH_MAX characters.

Which basically means, it can never work on a large path, and no clever OS can implement around it, unless it actually checks how much RAM is allocated on that pointer using an OS specific method - if available.

For these reasons, I've decided to implement getcwd() and realpath() myself. We'll discuss the exact specifics of realpath() next time, for now however, we will focus on how one can make their own getcwd().

The idea is to walk up the tree from the working directory, till we reach the root, along the way noting which path component we just went across.
Every modern OS has a stat() function which can take a path component and return information about it, such as when it was created, which device it is located on, and the like. All these OSs except for Windows return the fields st_dev and st_ino which together can uniquely identify any file or directory. If those two fields match the data retrieved in some other way on the same system, you can be sure they're the same file/directory.
To start, we'd determine the unique ID for . and /, once we have those, we can construct our loop. At each step, when the current doesn't equal the root, we can change directory to .., then scan the directory (using opendir()+readdir()+closedir()) for a component with the same ID. Once a matching ID is found, we can denote that as the correct name for the current level, and move up one.

Code demonstrating this in C++ is as follows:


bool getcwd(std::string& path)
{
typedef std::pair<dev_t, ino_t> file_id;

bool success = false;
int start_fd = open(".", O_RDONLY); //Keep track of start directory, so can jump back to it later
if (start_fd != -1)
{
struct stat sb;
if (!fstat(start_fd, &sb))
{
file_id current_id(sb.st_dev, sb.st_ino);
if (!stat("/", &sb)) //Get info for root directory, so we can determine when we hit it
{
std::vector<std::string> path_components;
file_id root_id(sb.st_dev, sb.st_ino);

while (current_id != root_id) //If they're equal, we've obtained enough info to build the path
{
bool pushed = false;

if (!chdir("..")) //Keep recursing towards root each iteration
{
DIR *dir = opendir(".");
if (dir)
{
dirent *entry;
while ((entry = readdir(dir))) //We loop through each entry trying to find where we came from
{
if ((strcmp(entry->d_name, ".") && strcmp(entry->d_name, "..") && !lstat(entry->d_name, &sb)))
{
file_id child_id(sb.st_dev, sb.st_ino);
if (child_id == current_id) //We found where we came from, add its name to the list
{
path_components.push_back(entry->d_name);
pushed = true;
break;
}
}
}
closedir(dir);

if (pushed && !stat(".", &sb)) //If we have a reason to contiue, we update the current dir id
{
current_id = file_id(sb.st_dev, sb.st_ino);
}
}//Else, Uh oh, can't read information at this level
}
if (!pushed) { break; } //If we didn't obtain any info this pass, no reason to continue
}

if (current_id == root_id) //Unless they're equal, we failed above
{
//Built the path, will always end with a slash
path = "/";
for (std::vector<std::string>::reverse_iterator i = path_components.rbegin(); i != path_components.rend(); ++i)
{
path += *i+"/";
}
success = true;
}
fchdir(start_fd);
}
}
close(start_fd);
}

return(success);
}


Before we accept that as the defacto method to use in your application, let us discuss the flaws.

As mentioned above, it doesn't work on Windows, but a simple #ifdef for Windows can just make it a wrapper around the built in getcwd() with a local buffer of size PATH_MAX, which is fine for Windows, and pretty much no other OS.

This function uses the name getcwd() which can conflict with the built in C based one which is a problem for certain compilers. The fix is to rename it, or put it in its own namespace.

Next, the built in getcwd() implementations I checked only have a trailing slash on the root directory. I personally like having the slash appended, since I'm usually concatenating a filename onto it, but note that if you're not using it for concatenation, but to pass to functions like access(), stat(), opendir(), chdir(), and the like, an OS may not like doing the call with a trailing slash. I've only noticed that being an issue with DJGPP and a few functions. So if it matters to you, the loop near the end of the function can easily be modified to not have the trailing slash, except in the case that the root directory is the entire path.

This function also changes the directory in the process, so it's not thread safe. But then again, many built in implementations aren't thread safe either. If you use threads, calculate all the paths you need prior to creating the threads. Which is probably a good idea, and keep using path names based off of your absolute directories in your program, instead of changing directories during the main execution elsewhere in the program. Otherwise, you'll have to use a mutex around the call, which is also a valid option.

There could also be the issue that some level of the path isn't readable. Which can happen on UNIX, where to enter a directory, one only needs execute permission, and not read permission. I'm not sure what one can do in that case, except maybe fall back on the built in one hoping it does some magical Kernel call to get around it. If anyone has any advice on this one, please post about it in the comments.

Lastly, this function is written in C++, which is annoying for C users. The std::vector can be replaced with a linked list keeping track of the components, and at the end, allocate the buffer size needed, and return the allocated buffer. This requires the user to free the buffer on the outside, but there really isn't any other safe way of doing this.
Alternatively, instead of a linked list, a buffer which is constantly reallocated can be used while building the path, constantly memmove()'ing the built components over to the higher part of the buffer.

During the course of the rest of the program, all path manipulation should be using safe allocation managing strings such as std::string, or should be based off of the above described auto allocating getcwd() and similar functions, and constantly handling the memory management, growing as needed. Be careful when you need to get any path information from elsewhere, as you can never be sure how large it will be.

I hope developers realize that when not on Windows, using the incorrect define PATH_MAX is just wrong, and fix their applications. Next time, we'll discuss how one can implement their own realpath().

Tuesday, June 19, 2007

Can you name a C++ compiler?

C++ is arguably the best primary language available today for non web based applications that are large, complicated, and advanced. But what should one use to compile their C++ code? Can you even name a single C++ compiler?

C++ was created by Bjarne Stroustrup in 1983 at AT&T. But instead of making a compiler for it, they opted to make a C++ to C translator and dubbed it Cfront. In 1985, Bjarne Stroustrup released a book on his language, moving the language onto center stage. But still there was no compiler to be found, only a translator.

Now translators aren't so bad. After all, they give the programmer many additional features over the base language, such as new types, operators, and other goodies. These extras make it easier to breathe in the art of programming. In fact, there exist many preprocessors and language extensions today using translators to make life easier. Qt for example brings their own C++ extensions to the table such as "slot functions" which works with a preprocessor, translating code to pure C++ and adding any extra functions and data as needed. Now for minor additions, or for translators to C++ which deal with clearly defined objects, generated C++ code is pretty good. But when dealing with entirely new constructs added to a language, generated code can be very bad depending on the circumstance it's in, or not be as optimized as it could be if the compiler understood which high level constructs the code is coming from.

For these reasons, having a real compiler is better than a translator. But what was the first actual compiler for C++? It might come as a surprise to some, but the first C++ compiler was actually GCC, in 1987, 2 years after C++ was exposed to the world. However what C++ itself was - was determined by each new version of Cfront, and GCC did what it could to produce identical results (with extra features on top of course). Over time other C++ compilers came out, each bringing its own slew of extra features to the table, and some of them even acting differently than Cfront did.

In 1992, Microsoft's Visual C++ was released, and it understood code quite a bit differently than Cfront, causing many compatibility issues, and earning a new meaning to MSVC++ - Microsoft Vs. C++. Adding to these compatibility problems was many new features being proposed all the time such as templates, and Cfront development (the de facto standard) stopping in 1993, as it was deemed impossible to keep building upon a translation framework for such a complicated language. Since there was so many different implementations out there, and the "standard" one decided to throw in the towel, the International Organization for Standardization (ISO) decided to step in and standardize C++ as it did with C in 1989.

The ISO finished the standardization of C++ in 1997, and published the new specification in 1998. From this point on, the only real C++ is the one standardized in 1998 (although a second standard of C++ is being worked on to be completed in the next few years), and nothing else can be accurately labeled C++. So now we're about a decade later, guess how many C++ compilers there are?

And the correct answer to the above question is none! If you want to turn to GCC, it still doesn't support the "export" keyword, nor does it handle certain complicated classes entirely right. If you turn to MSVC, until very recently, their implementation of the Standard Template Library was quite shoddy, and was pretty bad with templates in general. More recent MSVCs still have issues though, see here for a short list from Microsoft itself. Aside from Microsoft's list, I'm told by some that they also found some other issues that aren't listed. And for some strange reason, MSVC doesn't even handle scope properly by default for variables created in headers to a block, and instead has an option to enable proper scoping separately. Whose bright idea was it to make improper scoping the default?

So if you want to write a real C++ program today, what exactly should you use? Well, lucky for us, Edison Design Group decided to do what AT&T deemed impossible, and make an up to date fully standards compliant C++ to C translator. Although EDG themselves don't distribute it to the general market. If you want to be able to build your C and C++ code properly, currently, your only choice is to find a complete setup around EDG's translator, such as Comeau C/C++.

Drawbacks with Comeau C/C++ though is that it isn't free. GCC is free, and Microsoft now also offers free versions of their C/C++ compiler, while Comeau C/C++ costs $50 per platform. Comeau C/C++ also isn't as portable as GCC, and doesn't operate properly on a fraction of the OSs that GCC does. Being a translator, it also suffers from the less optimal code generation issues, not to mention longer build times due to more memory usage and more steps involved. Comeau C/C++ also doesn't work by itself, but on install has to configure itself on top of an existing compiler already installed such as GCC or MSVC.

Scary how we don't even have a compiler for the first standardization of C++ and they're already working on a second standardization. C also had a second standardization done in 1999, but is mostly supported by GCC, and will probably be fully supported soon enough. C99 of course is also supported by Comeau C/C++ and works by translating C99 to C89 as needed.

Although the important question we have to ask is not "where are the C++ compilers?", but will we ever see a C++ compiler? GCC for example has their own propaganda written up over how much they support the in progress second C++ standard, without even worrying about the first standard, or even showing a page of how far that is completed yet. Even Microsoft has a page describing some of their caveats. The only little snippet GCC has is this.

Even more disconcerting is that RMS, the idiotic hypocritical redefinitional head of the FSF says he prefers not to have support for "export" in GCC, and prefers it if GCC doesn't follow the standard. His chutzpa is so bad, that he even asked the standards committee if they can change the standard to reflect what he's implemented in GCC, and remove the export keyword altogether, since he doesn't want it to be there.

Of course someone may wonder what is the big fuss about the export keyword anyway. Good C/C++ programmers know that when breaking up functions into different files, it's best to put sizable functions in the C/C++ file, and prototypes for them in the header files for other functions elsewhere to use. The export keyword is supposed to tell the C++ compiler that the current function uses templates, and it needs to have a special uncompiled (but perhaps middle representation) version of the function saved separately to be able to link with other code that calls that function properly.

The problem with export here is that some of the code won't be compiled until link time, meaning a more annoying link process, and the possibility of compile errors at link time. So we have to decide which is the lesser of two evils, either start sticking code in headers files (as people are now forced to do), or have code compile at link time (which people like RMS don't even want to give us the freedom to do). Or do we? Is there perhaps another implementation option?

For MSVC, where Visual Studio itself, annoyingly, won't let you compile a file without having a whole project for it can in fact use this to its advantage. Since it has a full file list, and also keeps class lists and other things such as function prototypes handy for auto completion and tool tip hints, it already knows where it has to grab actual code from. An IDE like Visual Studio just has to pass its compiler the actual file to compile, and a list of other files to pull templates from, and it's all transparent to the programmer. Any compiler could do this, although it would probably get quite annoying if you have to specify all the locations by hand. Of course a sane default might be to try to find '.cpp' files to match any template function prototype in a header file included by the current C++ file that is being compiled.

Post your thoughts on this subject, and any other ideas you might have to sanely implement the export keyword in C++ without running into some of the cons listed above.

Wednesday, June 13, 2007

Hashing out Hashing and Hash Libraries

This topic is one that I enjoy quite a bit, but unfortunately there is a lot of problems out there.

I'm sure everyone has heard of hashing before, but many are not quite sure what it is, how it works, why it's used, or the various uses for it, and why are there so many hashing algorithms.

The hashing idea started when programmers determined a need to various situations to generate a "key" or a "unique identifier" for chunks of data. Say you had 20 strings, each one being 300 characters long, to find out if a new string is in the list (and if so, which one) you'd need to go through 20 entries, and depending on how similar they are, you'd have to do up to 300 comparisons on each one. Needless to say, finding such a needle in a haystack is slow. It also means that to compare, one needs an exact copy stored in the database. So for example, if I was passing you a file, and we wanted to know if it transfered correctly, it would be annoying (to say the least) for me to send you a second copy for comparison. Furthermore, there is no guarantee that any transfers would work right.

For all these reasons above (and lesser reasons as well), hashing was invented. The idea in hashing is that we apply a truncating transform to the data in question. Imagine we have a number 0 to 99999999999999999999, we could have a truncation where we add up all the digits over and over till we get a single digit. So if our number was 123456 we could add that up to 21 and that to 3. Although converting any arbitrary number size to a single digit isn't very unique, as we'll only have 10 possibilities, this follows the basic idea of hashing. A good hashing algorithm will have drastically different results for similar numbers, and wouldn't have a situation where 123456 and 654321 and 123465 all converted to the same hash. What we need to realize though is three key points in hashing.

1) There is truncation. One can't determine the original number from a single hash alone.

2) There is a limited size to the hash itself, and we can't uniquely identify all data with a smaller key. We only have as much uniqueness as the amount of possibilities of the hash itself. Meaning for our hash above, we'd get unique hashes if the input was a single digit, once we're past that point, we have "collisions". If the hash is 4 bytes for example, the uniqueness is only guaranteed on inputs block of up to 4 bytes.

3) Hashing algorithms are completely arbitrary mathematical transforms. Normally, when you read a good programming novel (source code), you like to see some sense in the code, like having objects that model real world objects, or control flow that matches what the user sees. When you look at hashing code, you normally just see a bunch of shifts and rotates, xor'ing data with all sorts of seemingly random values, and producing results that look like A4B31AAD49879FE which doesn't mean anything to anybody. While good hashing algorithms try to focus on transformations that make the hash/key end up looking drastically different for similar inputs, the code to do so really is just gibberish from any standpoint. One shouldn't ask, but why is he adding up all the digits? Why not flip every other bit here? Why not rotate every 3rd bit 4 over?

Now that we have a basic understanding what hashing is, let's look at various hash uses more in depth.

One of the most common uses in programming is the Hash Table. Say you have a programming language with exactly 64 keywords. You want to write a compiler for this language, as you parse the input source code, you'd have to compare every word to one of your keywords. You can store your keywords in a table in lexicographical order, and binary search each word to see if the word is in there, but this generally means you'll bump into on average 5 keywords which you'll have to do string comparisons on before you find for certain if the word you have in question is a keyword or not, and if so, what to do with it. While 5 isn't much, string comparisons are slow as they involve many comparisons internally, and the number is multiplied by the amount of words in the source code. Now while our example has 64, you may find other examples where the amount is 1024 bumping up the amount of failure checks to 10, put together, this ads up to a slowness in your program where there doesn't have to be.

The trick in this case would be to devise a hashing algorithm for your table that produces 64 unique possibilities, with each key mapping to exactly one keyword. Then instead of binary searching with many string comparisons, one can hash the word in question, jump directly into the array and then do the last comparison to see if it was a keyword or not. If the hash algorithm was 3 operations, and the average string comparison was 10 operations, you just cut down keyword check overhead from 50 operations to 3. Of course some of you are thinking that finding a hash algorithm to map so perfectly for this particular situation is probably non existent. However with hashing, we have many tricks up our sleeve to approach this ideal situation. If say our hashing algorithm produced 80 possible key results but otherwise gave each keyword a unique key, we'd still be able to reach our goal, as we'll just have 16 holes in the array that if jumped into, we know immediately that the word in question was not a keyword. There's also techniques available where you could have more than one key occupying a slot in the array, as it's really an array of arrays, and hashing turns into a kind of more optimal binary search, where instead of cutting the array in half, you eliminate 90% or more of the array each time. One could also setup a situation where we instead store both the hash and the strings in an array, and then sort the arrays by the hash, and then one only need to binary search for the hash, thus turning lengthy string comparisons into single integer comparisons (provided that the hash result is an integer of a fixed width).

Using the idea of hashing to save against lengthy string comparisons also works well with all kinds of data structures. You could for example change the key in a binary tree from being a string into a hash of the string, to speed up tree traversal. Hash Tables are also very useful if you need to keep track of data that is processed on the fly, using an array of array method touched on above.

That is your basic data structure use of hashes, now I'd like to focus more on the "signing" aspect of it. Today, there is a multitude of "cryptographic hash functions" which are designed mathematically that it should be very time consuming for a program to just determine how to generate a series of data to match the hash, forcing one to resort to brute force. Once brute force is required, and the possible data set size is well over the trillions, it wouldn't be feasible (using conventional methods at least) to determine a matching set of data for the hash, creating the illusion of uniqueness of the key, making it a way of signing a set of data so one can validate its integrity after transfer. That's why you quite often see websites listing the "MD5" or "SHA-1" of a file you want to download.

This idea can also be expanded to databases, and provide a way to identify data. Say you want to setup a used DVD buy and sell store. When someone comes into your store with a DVD in some case, you want to make sure it's a real DVD and not some copy (or if it is a copy, at least true to the original), that it's not scratched to the extent that some of it can't be read, and you need to know which edition and for which region the movie is in. How does one setup his used DVD store in a way that he can verify a DVD is good before he buys it, and know which edition it is to label it properly?

To have an archive of each DVD somewhere isn't practical, and if it has to be compared over a network, the transfer time would just be too long. If one were to setup a network where some hashes of the data alongside some data on the DVD such as track name, language lists, etc..., a used DVD store could make use of this database to identify and validate DVDs. The people running these stores can all use a program which they pass new DVDs they buy through. The program would collect 2+ cryptographically secure hashes from the full data of the DVD, and rip all other info that can be obtained, using a program such as MPlayer. Then the user can enter in additional info such as what name is on the box, and which edition the box says it is. This data is then placed on a table on a network (website). If one wants to identify a DVD, after hashes are built off of it, these hashes are then scanned for in the table. The primary key of the table should be the hash which has the smallest length, and sorted by it, so the correct section of the table can be binary searched for very quickly. Then the other stats such as extra hashes and track titles can be compared against to make sure this is the right DVD, and not one which happened to match a single hash.

Once such a setup is in place, someone running a DVD store just has to stick a DVD someone is trying to sell them in a machine, and in a few minutes, they can know if it's real, can be read fully, and which edition it is. If it's not in the database, they don't have to make a mistaken purchase, or spend time having to watch the whole thing to make sure it's complete and doesn't stop in the middle anywhere. And with the information pulled from the web, they know if it has to be labeled wide screen edition or deluxe edition or whatever (which isn't always written on the DVD itself, just the box which may not be part of the sale, or may even be the wrong box with the intention to fool the buyer).

Some key points to remember listed above is multiple hashes, sorted by the smallest (even perhaps weakest) hash, and have auxiliary data to verify against, and use cryptographic hashes. If hashes were chosen which could be replicated quickly, people who want to beat the system could make a DVD which just has some matching track lables or whatever data, along with random data to match the hashes in question. Using 2+ hashes also makes it much harder to generate false matching data, as methods to crack hashes are very different from each other, and increase the computational complexity exponentially.

There are various hashes today with varying levels of security, complexity, and length which are good for many of these purposes mentioned above and others. One popular family is the CRC-16, CRC-32, CRC-64, with the numbers referring to how many bits are in hash. CRC is quick and easy to compute (as well as to crack), but since it fits into native integer sizes on CPUs, it's very good for short tables you keep in RAM that you want to deal with quickly and repeatedly. However it's not good for verifying data, as how easy it is to crack, not being cryptographic at all. Focusing on cryptographic hashes, next up which is nice and popular MD-5. MD-5 was an old favorite which has been considered broken for a while now, yet is still in wide use for signing data. MD-5 is 128 bit, and is a decent choice to use along side other keys for extra protection, but shouldn't be used by itself. Another very popular choice today is the SHA series, which offers truncated and non truncated flavors. Personally, I find truncating a hash to be pointless, so we'll just focus on SHA-1 (160 bit), SHA-2 32 bit words (256 bit), and SHA-2 64 bit words (512 bit). SHA-1 was considered a step up on MD-5 and 'the thing to use', but it isn't that much more secure, and while not busted wide open as of the time of this writing, it has been weakened significantly and shouldn't be used by itself either. SHA-2 is significantly more secure, as well as giving long bit sizes to provide more uniqueness of keys. If speed is a consideration, using the 32-bit or 64-bit variants might factor into which CPUs your hashing use is focused on, although of course either can be computed with any kind of CPU, it's just a matter of speed. SHA-512 is regarded as one of the strongest hashes today, but it's hardly the only game out there. There's also RIPEMD (160 bit) as an alternative to SHA-1 being the same size, and built along the same lines as MD-5, which also has its problems, but is a favorite in certain circles. Looking for more unusual but strong hashes, there is Tiger which is 192 bit, and quite popular in certain P2P applications. Tiger being less popular, it hasn't seen the scrutiny some of the other hashes have, so it might not be as good as the SHA series, or it might be better, hard to say, but to my knowledge, doesn't have any significant attacks on it. Tiger2 is also in the works which is supposed to tighten any issues discovered in the first edition, and is supposed to be out in the not so distant future. Lastly, there is Whirlpool which like the 64 bit SHA-2 is 512 bits, and uses heavy transformations making it an excellent choice for hashing, and may be the best popular one currently available, although like Tiger it hasn't seen as much scrutiny as the SHAs. Although Whirlpool is currently at v3, which strengthened any possible issues pointed out since its invention.

Once you decide which hashing algorithm to use, you may wonder where exactly should you get the code from? Since hashing code isn't really supposed to make sense to someone looking at it, clarity isn't what one should be looking for when trying to find a good library. One should focus more on correctness, completeness, portability, usability, and speed.

If you're looking for BSD licensed version of MD-5, RIPEMD, SHA-1 and 2, you can grab good accurate source right out of OpenBSD's CVS. Unfortunately though, the source has some BSDisms in it which will have to be removed if you want to use it elsewhere. It also doesn't make use of the C99 types which can be used to get integers of exact sizes, which is very important when you want portable accurate code with hashing.

There's Botan which has good accurate hashing as well, of every hash we discussed here, and even some assembly optimized versions, all under the BSD license, but it's tangled up with other code making it a bit annoying if you just want to grab some source for a single hash algorithm to use. The portable code is also in C++, which isn't good for C only applications.

Mhash is another library one can use under LGPL which covers all the hashes. I notice though that its output of Tiger is backwards, which also makes me worry about endian issues in general with it. It also suffers of other issues mentioned above.

Crypto++ is a popular C++ library for hashing (and encryption) placed in the public domain. This library is rather excellent, although annoying to use if you only want to grab one source file for a single hash. It also has some assembly optimized versions available, but doesn't use the function pointer methods explained in my previous blog post, and instead checks each time whether your CPU has the instructions needed to compute the hash before computing it. Also in a test I ran, the SSE2 optimized version of SHA-512 crashed under Mac OS X (Intel of course), which may be due to unaligned memory accesses.

If you just want SHA hashing, Brian Gladman has a popular library to keep you covered. It's written in C and dual licensed as BSD and GPL. But it doesn't have any of the type safety of C99 mentioned above.

If you're looking for Whirlpool, be careful, if you look on Wikipedia, it links to source which seems to work at first, but results drastically changed based on what size chunks the data is passed in. I noticed a buffer size of 1-8191 bytes produces different results than hashing the data in 8192 byte chunks.

LibTomCrypt, another implementation under C and for the public domain seems very nice on the surface. But on further testing, I found that if the length of data passed to it for Tiger or Whirlpool hashing is not a multiple of 64, it gives improper results. Problems could be in other situations which I did not test. I really liked this though, as the code was pretty clean, and many implementations were rather fast.

OpenSSL which is very popular and in C with an Apache style license is a good choice if you want the popular hashes, as it doesn't cover Tiger or Whirlpool (but has MD-5, RIPE, and SHA). It also has assembly implementations in certain cases, but the assembly code is really a nightmare, and it's hard to integrate just part of OpenSSL into a project.

If you're looking for another implementation of SHA in C under BSD, and written rather well (but perhaps not as fast as Brian's, look no further than Aaron Gifford's implementation. If I'm not mistaken, it's also his SHA-2 in OpenBSD (but messied up a bit).

A good source which covers our favorite hashes properly is the md5deep project, only thing missing is RIPEMD code. I rather like this one and think it's a good place to start.

Of course, there's also official implementations, for Tiger, Whirlpool, and RIPEMD. Although the official Tiger code seems to be one pass only, no simple hash_init(), hash_update(), hash_finalize() methods. And the official Whirlpool and RIPEMD implementations are rather slow. Some of the other implementations of Whirlpool pointed out above were 25% faster.

It'd also be nice to have some kind of consistency when it comes to which functions to use. Before choosing, make sure you test well on various machines of different endian, bit width, and on various data sizes and chunk sizes. You may be surprised how many libraries just aren't written right and produce wrong data.

I'm personally working on C99 implementations of these, which I'm testing on Win32, Linux x86-32, Linux x86-64, Mac OS X PPC, Mac OS X Intel, in a large variety of situations to certify correctness and portability. Of course I also plan on having a consistent interface for all of these, and each being contained to a single .h/.c pair so it can be easily dropped in anywhere. Right now I'm currently trying to optimize my Tiger implementation more, and adding more assembly variations using the function pointer method documented previously to give the best speed, while providing portability and usability. Let me know if you're interested in using it when I'm done.

While it is ridiculous for yet another person to make yet another hash library, it is rather appalling how many out there are broken or have very bad bugs in them, or act very differently on another system. Don't make the mistake of using a library which hasn't been tested heavily in a variety of situations.

Saturday, May 26, 2007

Secrets to Optimization - Function Pointers

Let me present you with a bit of code which you might find quite commonly in a program:

void function_1(type a, type b)
{
if (complex_formula())
{
method_a(a, b);
}
else
{
method_b(a, b);
}
}
...
void function_2(type *a, type *b, size_t ab_len)
{
while (ab_len--)
{
function_1(*a++, *b++);
}
}
I see this quite often in all kinds of situations. Where some repetitive action keeps calling a function which uses a static calculation which in turn does one of two things. Since this action is making many calls to a function which is doing the exact same sub action on this set of data, it really is a waste to redetermine which sub action we're dealing with for every element of data.

Imagine if a program has an option to approximate fractions from floating point using algorithm A or algorithm B, if the program repeatedly has to check which algorithm it's supposed to be using on each floating point number in a set, it is wasting time that could be spent on actually working on these numbers.

The most obvious way to speed this process up is to setup a variable which contains the result of complex_formula(), so complex_formula() itself doesn't have to be recalculated again and again, and instead we just do something like "method = complex_formula()" once, and just check "method" each time. However if we have a case where we have multiple methods, we'd have to turn this into a switch(). Once we have several methods, we'll be wasting a bit of minor but cumulative CPU cycles for every time we need to perform this action.

In order to bypass any overhead, or make the overhead insignificant, we make use of a sometimes neglected feature - function pointers. Let's rewrite the prior example:

void (*function_1)(type a, type b) = method_a;
...
void function_2(type *a, type *b, size_t ab_len)
{
while (ab_len--)
{
function_1(*a++, *b++);
}
}
With this new bit of code, method_a will be used each time. If we want to recalculate which method to use because an option was changed. then we can do something like this when need be:

function_1 = complex_formula() ? method_a : method_b;
But this complex_formula() won't have to be rechecked for each element of an array, or even overhead for comparison and jumping if we did ifs/switch on a variable.

I've looked into the difference of calling a function directly or via a pointer, and on the CPUs I looked at, the overhead was 1 cycle or even 0, meaning the overhead is negligible if any. Either way, it's less than that of a comparison and jump, and surely several of them.

If you remember some posts back where I was talking about quick sorting, I setup a sorting testing frame work like so:

struct qsorters
{
char *name;
void (*func)(void *, size_t, size_t, int(*)(const void *, const void *));
};

static struct qsorters qsorters[] = {{"AT&T", qsort_att},
{"BSD 1983", qsort_bsd2},
{"BSD 1990", qsort_bsd3},
{"BSD 1993", qsort_bsd},
{"diet libc", qsort_dietc},
{"DJGPP", qsort_djgpp},
{"glibc", qsort_gnu},
{"IBM", qsort_ibm},
{"Linux", qsort_linux},
{"Microsoft", qsort_microsoft},
{"Sun", qsort_solaris},
{"Syslinux", qsort_syslinux},
{"uClibc", qsort_uclibc},
{"insane", qsort_insane},
{"Simple", qsort_simple},
};
Now my sorting framework simply looped through the array and did the sorting tests on qsorters[i].func() instead of it needing to setup some kind of switch system to translate names into the function to call. If you want table driven methods to apply to functions, this is the way to do it. Imagine how much slower my testing framework would've been if I had to use a switch on this. This also makes the code simpler to read and write, just like arrays make more sense than x1. x2. x3. etc...

Function pointers are a wonderful thing to use whenever you have multiple functions which are the same black box, which when on the same input offer the same output, just use different methods internally.

This also extends well to when mixing with other kinds of optimizations. Quite often people try to optimize critical sections of their code using hand tuned assembly. Now on the x86 platform, there exists several instruction sets which can be used for optimization that not every CPU has, for example, MMX, 3DNow!, SSE, SSE2, etc...
If one wanted to offer good assembly optimizations for something, I all to often see scattered across programs code like this:

void func()
{
if (detect_sse2())
{
func_sse2();
}
else if (detect_3dnow())
{
func_3dnow();
}
else if (detect_mmx())
{
func_mmx();
}
else
{
func_i386();
}
}
If func() is called say 10,000 times a frame in a video processing program, it really lowers the overall FPS that could be processed if instead the following method was used:

void (*func) = func_i386;
void (*func2) = func2_i386;
etc...

void instructions_detect()
{
if (detect_sse2())
{
func = func_sse2;
func2 = func2_sse2;
}
else if (detect_3dnow())
{
func = func_3dnow;
func2 = func2_3dnow;
}
else if (detect_mmx())
{
func = func_mmx;
func2 = func_mmx;
}
}
With this, instruction_detect() can be called when the program is started, or the library is initialized, and since the instruction set is not going to change in the middle, you just call this once and forget about it. With this, you can achieve optimal speed for any given processor in a program, without wasting any speed for overhead, aside from an initial call.

Thursday, May 17, 2007

The Sorry State of Sound in Linux

Lets start with some background.
Back in the old days, if you had a PC, there was only one card called "the sound card", which of course was the Sound Blaster 16. If you were playing any of the popular games back then, many of them only supported the SB16 for sound. Other companies who wanted to start releasing sound cards had to include Sound Blaster emulation if they wanted to get any kind of real sales. Sometimes this emulation was buggy, but you wouldn't know that until after you bought it. For this reason, people who wanted sound took no short cuts and only bought SB16s.

Thus back when Linux was first starting out, which was on the PC for the 3x86, there was only one sound card which demanded immediate support. Understandably, Sound Blaster support was completed and done very well rather early on for Linux. Since the API was good, as well as that most Linux programs which provided sound targeted this API, people who wanted to provide drivers for their sound card made their sound drivers use the same API. At this point, the API for the Sound Blaster became known as the Linux sound API, and all the various sound card drivers got merged into one neat little package, this later became the Open Sound System or OSS for short.

Since OSS was UNIX oriented and rather good, the other UNIX OSs starting up at that time such as FreeBSD and NetBSD wanted sound support too, and OSS was ported to them as well. Today OSS runs on virtually every UNIX OS except Mac OS X, but Linux, *BSD, Solaris, are covered, as well as even AIX and HP-UX and more.

Now from a developer standpoint, if you wanted to create a simple application with sound output, and you wanted it to work on the various UNIXs available today, the choice was simple, you write the code for OSS, and it's all nice and portable.

However, there was one major drawback perceived with OSS, and that was mixing. Say you want to listen to music while listening to the news. Not exactly that hard to setup if the music is soft and the news is loud. However when OSS was originally designed, it let the sound card handle 'mixing' which would allow multiple sound outputs be mixed together. But more modern sound cards decided to follow the path of modems and became mostly software. Once this happened, all features of a sound card had to be written up in software to work properly, and mixing wasn't always a focus, nor is it always a simple matter. Therefore many newer sound cards would lack mixing under OSS.

Therefore two new sound systems came into existence, aRts and ESD which were used mostly by KDE and Enlightenment/GNOME respectively. They used new APIs which did mixing before sending sound to OSS. They intended that new programs would use these APIs. Now I looked at both aRts and ESD, aRts seems pretty easy to use, even easier than OSS, and I wrote up a sound player in less than 5 minutes with aRts. ESD on the other hand looks like it might have more features than aRts to use, but also looks much more complex than aRts. With programs written to use one of these two, you could run multiple applications which use one of these sound systems at the same time and get sound out of both of them.

The problem with aRts or ESD though is that there are two of them, so while many KDE or GTK apps will use one or the other, you can't run both of them at the same time, as you can only have one of them work with OSS at one time because of the 'mixing' problem. It's even more problematic that you still have native OSS applications that don't go through another system. To solve many of these problems, library wrappers were invented.

First we have the Simple DirectMedia Layer (SDL) which wraps around OSS, aRts, ESD, and even sound systems on Windows and Mac OS X. Sounds good for portability, since it works everywhere, and you can have it use whichever sound engine on UNIXs so everything can play nice together. Unfortunately though, SDL uses sound only via a callback interface. While this is fine for many applications, it can get annoying sometimes, and in some applications which generate audio on the fly, not only can it be complicated to do properly, it might even be impossible to maintain proper sound sync.

Another nice competitor which I like is libao which wraps around OSS, aRts, ESD, and several other UNIX specific sound engines. The API for libao is also really easy to use, quite similar to that of aRts, and I managed to whip up sound applications with libao in no time at all. Unfortunately libao hasn't been updated in some time, and certain wrappers it has seem to be a bit buggy. libao also only supports blocking audio, which makes it complicated to write certain applications where the audio is generated while it's playing, forcing the program to use threads, hope one doesn't fall asleep while the other is working, and use semaphores and mutexes.

Because the general idea of libao was good, the MPlayer team took libao, fixed several limitations, updated it, fixed bugs, added many more sound wrappers, even two for Windows and one for Mac OS X, and dubbed their creation libao2. With MPlayer, you can change which sound wrapper to use with the param -ao and specify which system to use, or set permanently via the config file. I even took some libao2 DirectSound code, used it in a Windows application, and a Microsoft developer who I know looked at it and told me the code looked really good, which goes to show the quality of the underlying wrappers. Unfortunately though libao2 is bound to MPlayer, it has all kinds of MPlayer library calls within it meaning a major dissecting operation is needed to use it in another application. Perhaps some MPlayer or library developers can get together and separate the two, add on any needed features, and then we application developers can finally have a portable sound library to use, while potentially also allowing MPlayer to get better sound output on more obscure drivers because a developer who doesn't use MPlayer but another sound program might fix it.

Now while application developers are all worrying about their precious libraries, the developers of the portable UNIX sound system - OSS, decided to go closed source and offer extra features (which regular users don't need) to paying customers. This of course created an uproar, and some Linux developers decided to create a new solution.

Now instead of just rewriting OSS at the core, or using a wrapper (which understandably doesn't fix the underlying mixer problem), or updating the existing open source OSS at its base, some developers for some absurd reason decided they needed a whole new system and API to go with it. This is known as the Advanced Linux Sound Architecture or ALSA for short. This new API is huge, mostly undocumented, incredibly complicated and completely different than OSS. Which means various sound card drivers have to be rewritten for ALSA, and all applications have to be rewritten (or have more sound system support like SDL and libao) in order to use ALSA.

Other advantages of ALSA is that they included a software mixer (which doesn't always work). But as should be apparent, applications aren't magically overnight going to start switching over to ALSA. And as ALSA isn't portable, people who want to support BSD and Solaris will of course be using OSS. Meaning to support ALSA would mean needing to support OSS and ALSA. Since OSS exists on Linux, why even bother? Realizing this, the ALSA developers programmed OSS emulation into ALSA, so sane developers can just program for OSS. But the big embarrassment here for ALSA is that using ALSA via its OSS emulation is usually better than using ALSA directly. I've heard from many users of SDL or libao powered programs that telling those wrappers to use OSS (which ends up being used via ALSA's OSS emulation) works better with less gaps (or other problems) in the audio than using ALSA directly by those wrappers.

But for some really stupid reason, ALSA's OSS emulation doesn't support mixing. Which in the end really defeats the purpose of ALSA in the first place. I also have two sound cards which work both under OSS and ALSA and find OSS to just work better. Even more shocking is that I found in cards that have hardware mixers installed, they don't seem to be used by ALSA's OSS emulation, leaving such users without mixing in OSS apps. However, for some reason, I'm seeing a lot of propaganda lately that I have to make all my new applications use ALSA and not OSS because OSS is deprecated. I really doubt that because Advanced Linux Sound Architecture now exists that suddenly FreeBSD has deprecated OSS. How can something portable be deprecated just because one really near sighted OS can't figure out how to deal with life? Using non portable APIs is what is deprecated. I've smartly been forwarding all such ALSA propaganda to /dev/null, but it does bother me that my laptop's internal sound card only has support by ALSA in Linux. And with all the propaganda, I am worried what if ALSA makes more headway? It's already annoying that some Linux distros for example ship SDL without the OSS bindings installed, and it could get worse.

Now a friend of mind recently decided to switch to Linux because he noticed how much better it is than Windows in ways that are important to him. Since he has a newish laptop, the only drivers he could find for it are ALSA. Now for some reason, sound is really scratchy for him, and he has very limited volume controls, making him want to go back to Windows where sound works properly.

Earlier this week, another friend of mine informed me that the closed OSS has been constantly worked on since the last open sourced OSS was put into Linux years ago, can be downloaded and installed from from their website for free, and is superior to ALSA as well. It even offers ALSA emulation. This sounded interesting to me, and I headed over to the OSS website to install OSS on my friend's laptop to see if it would fix his sound problems. And lo and behold, sound is now crystal clear for him, he has much more finely tuned controls of volume, and can raise it higher than ALSA was able to. I put some research into the closed sourced OSS, and I see it has software mixing, and even software resampling which works very well. I proceeded to install it on my laptop, and now got much better sound support there as well. Only thing is that the ALSA emulation isn't working for me too well, but as I only currently have one app which is ALSA exclusive, I don't mind much, but it is annoying knowing that the previous version of this app supported OSS.

All this shows me is that ALSA is truly garbage, and a very bad idea from the ground up. If you want good sound support under Linux, the best, and sometime the only feasible option is to install the closed source OSS. With this, you always get mixing (even using the hardware mixer which ALSA doesn't always do), support for a dozen UNIX OSs, and finely tuned controls. They also made some API improvements to make it easier for application developers to do more advanced things with their audio. It's also nice to see spdif support, and even be able to send AC3 and other audio formats directly to OSS without needing to decode it first.

The real problem here being that it's closed source which makes many people for some reason step away from it. But being that the best video drivers for Linux are the closed source nVidia ones, excellent closed source sound drivers being the best doesn't surprise me. Being that I want the best possible use of my computer, and don't care for any dumb ideals, I'll take closed source drivers if they work well and don't do anything they shouldn't be doing. But this makes many distro packagers shy away from bundling or supporting the close source drivers in any way. Now while in the case of nVidia or ATI drivers, developers write apps for OpenGL or X11 which the closed source drivers support well, and it's the same API being developed for either way, we don't have to worry much if the distros don't support them. But in the case of OSS, if the distros package a completely different incompatible interface, and provide SDL and friends with everything but OSS support, we can't use the closed source drivers even if we wanted to, taking away our freedom, forcing us to put up with garbage, and defeating good portability.

We need to make a stance and stop this ALSA propaganda in its tracks, and produce many good applications with no ALSA support whatsoever and telling the propaganda spewers to go wake up. We must get the distros to keep supporting OSS. If they want a new sound system, it should be using the same API as the old one and be what every other UNIX OS uses for native sound.

Now I wish OSS would be opened sourced again, and perhaps we should be talking to 4Front about that, or recreating a full open source up to date equivalent of the closed source one, but there is no reason that we should be diverting all our efforts into a broken, incompatible, unusable, knock off.

Read here for more information about the latest closed source version of OSS and why we should be using it .

To recap, if you have sound issues, try installing the official OSS, start getting recognition going that OSS is what we're supposed to be using and that ALSA is garbage. Make sure to tell your distros that ALSA is garbage, and they should not be removing OSS support, and we should have the freedom to use a closed source driver if we want to. Tell application developers that you demand OSS support. Tell users of your application that want ALSA that ALSA is garbage, and point them to the closed source OSS if they have troubles with the open sourced one. We should also see about talking to 4Front about reopening source or updating the old open source implementation (or rewrite if need be). We should also see about making libao2 more of a library. If we took these steps, I think the state of sound in Linux, and other UNIX OSs by extension would be better off. If we took these necessary steps, UNIX sound apps wouldn't look like the laughing stock of the sound application community.

Sunday, April 8, 2007

Can you trust your security application?

Let me begin with a nice mathematical problem.

Say you have a divorced couple that are currently discussing over the phone how to divide up their possessions. They pretty much split everything equally, till they reach their car. They know they won't get nearly as much as it's worth by selling it, so they'd prefer if one of them would just keep it, but they can't agree to which one. They don't want to meet and they don't want to get another party involved. How can they determine which one should get the car?

They can flip a coin on it, but since this is over the phone, both parties won't be able to see it and agree. Ideally one side would decide what heads and tails means, and the other would say if it is heads or tails, but they need a secure method to transfer these secrets to each other, without either end fearing of some sort of compromise.

This problem is what today's public/private key secure communication technology is based upon, and we have come up with a solution.

One party would take two very large prime numbers, and multiply them and transfer this number to the other party. The first party keeps the two factors as a secret, the other party has no reasonable way to determine the two factors without months of computation. The first party then informs the second party that once the coin is flipped, they will take the larger factor, and look at it's fifth digit from a particular end. If that digit is odd it means the first party chose heads and the other tails, and if it is even, the winning conditions will reverse. Now the second party can flip the coin and report the status back to the first party.

At this point, everything needed to determine the winner has been transferred securely, now the secret can be transferred to decode who the winner was. The first party will tell the second what the two factors were, and they will have reached an agreement on who will retain the car.

Now provided that the duration of this transaction (phone call) was shorter than the time it would take for the second party to factor, and everything was done in the proper order, this entire exchange should be 100% secure. However, what if the second party happened to know that number already, and had the factors on hand? That party could cheat, and tamper with the exchange of information, rendering the transaction insecure.

Now when we look at secure communication technology today, they generally have each side come up with their own variation of the prime number examples above, and each side sends the other a "public key" to match with their own secret "private key". The private key can't be easily derived from the public key, but one will decode data encrypted with the other. Then when transferring, each side encodes and decodes data with their private and the other's public keys. An attacker can't jump in the middle, because it would need to get the private key which is never distributed in order to decode or impersonate one of them.

This security falls apart when the keys in use are older than their cryptographically secure time frame, or when the application doesn't follow the proper procedures. If keys are long and strong enough, and always replaced before their safe time frame limit approaches, and connections aren't opened indefinitely, one should be safe in relying upon the keys. However, a chain is only as strong as it's weakest link. If the application doesn't follow the proper procedures for key exchange, or has errors in its authentication and validation routines, it could be leaving it's users and their data compromisable.

If your security application is closed source, there is the chance that there is a backdoor programmed into it that will go unnoticed. When a company has few employees in comparison to the sheer amount of code it has in its products, there is little to stop one rouge employee to stick a tiny bit of code into a complex bit of security related code. When the amount of code greatly outnumbers the coders, very few people will bother to look and try to envision all the circumstances a bit of tricky code will have to handle and how it will end up handling it.

Consider how someone hacked into the Linux server and tried to embed a backdoor into the source that would later make up Linux releases around the world, read about it here. They inserted a bit of code that if two rare options were used together, an unprivileged user would gain administrator capabilities on that machine. Such a thing would go unnoticed if the person had the ability to modify the source at will without raising any eyebrows.

Let me offer an example. The SSH program allows a user to connect to a computer running an SSH server. When one wishes to connect to the other, they have to supply a user name and password, as well as tell SSH which modes to use.

Here are two existing options:

-1 Forces ssh to try protocol version 1 only.
-2 Forces ssh to try protocol version 2 only.


No person would normally consider to force SSH to try everything it was going to try anyway. Imagine if the developer of an SSH server stuck in a bit of code that if the client only wanted protocol version 1 and only wanted protocol version 2, to grant access even if the password was incorrect. This rouge developer could then gain access to every machine running his SSH server, and no one the wiser. Once this developer knows few at his company will see his work, he has nothing to lose to add such code. If he happens to be caught, and if the code was in a confusing section, he can say he was trying to handle an invalid case, and apparently didn't do it right, it was only an accident.

When your security program is closed source, do you really want to lay the security of your data in the hands of a disgruntled employee somewhere? Can you really trust the protection of walls that you can't see and has no outside review process? Many internal reviews miss things in tricky sections, especially when the group in question takes pride in their work. I'm reviewing code, and thinking to myself, hey we wrote this, we're world class programmers, we're good at what we do, this code is too tricky to really sit down with a fine tooth comb, I'm sure it's right.

Keep in mind though, just because something is open source, doesn't mean something can't be snuck in either. It's just less likely for that malicious code to remain there for too long if the application/library in question is popular and has enough people reviewing it.

But despite something being open source, malicious code can be snuck in without anyone ever noticing, read this paper for background. Now while the attack described where one edits the compiler to recognize certain security code and handle it in a malicious manner is a bit far fetched, similar attacks a bit closer to home are quite possible. If you're using a Linux distro which offers binary packages, what really stops a package maintainer from compiling a modified application and putting that in the distro's repositories? Those running a secure environment may want to consider compiling certain packages themselves and not trusting binaries that we really have no clue what is in them.

But based on this paper, do we have to worry that the compiler or the OS or other libraries would produce the proper binary when we compile this security application ourselves?

Lucky for us, despite what newbie programmers want, our programming languages aren't made up of a series of very high level functions such as compile(), handle_security(), and the like. Such would make it much easier for someone to make the compiler or the library do something malicious when it encounters such a request in the source. In order for such an attack to be really successful, it would have to understand every bit of code it's compiling to make sure the resulting program won't be able to detect the trojan, which is extremely tricky if not near impossible for a compiler to do. Not using a high level compiler or virtual machine gives us a layer of security in that it would be harder for one to pass out an "evil compiler" that would understand what the developer was trying to do and instead have it do something malicious.

But if such an attack were to take place, we'd have to pull out hex editors and disassemblers to see that such code has been snuck in (something which we must do with closed source applications). Take this a step further, it is theoretically possible if the OS were affected, or if the compiler was so smart that it intimately understood that it was compiling a hex editor or disassembler and the like to stick in code that would subvert file reading on executables and libraries to mask such malicious code even in the binary.

Now while some clever guy out there is thinking to himself: "Oh I'm going to do that, it'll be completely undetectable", such a course of action is much much easier said than done. I would be amazed if even a whole team of programmers would be able to pull such a monumental task off. I wouldn't worry too much that such magical wool has been invented to pull over our eyes when we try to decode a binary in a safe environment.

But I would worry if the application or the libraries it depends on are closed source. And even if we have the source, I would question where the binary we happen to be using comes from. If you're using even an open sourced application in something critical, I would advise to have your binary for your application and related libraries examined in a safe environment just to be sure. I just hope no one subverted an OS out there to alter non executable reading and writing on executable files, and have the OS strip/readd code when executable files are transfered.