Sunday, November 11, 2007


Directory safety when working with paths



As we discussed in our previous two articles, we run into issues of thread safety when we try changing the path during a running program. Therefor, in addition to bypassing PATH_MAX problems, working around buffer issues thanks to implementing with C++, we would also love to implement our functions without ever calling chdir() and changing the working directory. We'll now look into how we can make changes to our implementation to avoid a chdir() call, along with the pros and cons of techniques UNIX operating systems provide us, as well as what they're lacking.

If one looks back at how we implemented getcwd(), they'll notice that we call chdir("..") for each step in our loop. We need this to sanitize the location for these three calls:

opendir(".")
lstat(entry->d_name, &sb)
stat(".", &sb)


To remove the need of changing directories, we can place the ".." paths directly into our calls to opendir(), lstat(), and stat(). Thanks to C++ Strings being dynamic, it makes it easy for us to manipulate paths for these situations properly. We'll also see that being clever, we can even avoid string manipulation for some of these cases. And if our OSs supplied enough functionality, without any string manipulation at all!

Here's how we can implement getcwd() based off our previous implementation, but without calls to chdir(), with key details highlighted:

bool getcwd(std::string& path)
{
typedef std::pair<dev_t, ino_t> file_id;

bool success = false;
struct stat sb;
if (!stat(".", &sb))
{
file_id current_id(sb.st_dev, sb.st_ino);
if (!stat("/", &sb)) //Get info for root directory, so we can determine when we hit it
{
std::vector<std::string> path_components;
file_id root_id(sb.st_dev, sb.st_ino);
std::string up_path("..");

while (current_id != root_id) //If they're equal, we've obtained enough info to build the path
{
bool pushed = false;
DIR *dir = opendir(up_path.c_str());
if (dir)
{
dirent *entry;
up_path += "/";
std::string::size_type after_slash = up_path.size();
while ((entry = readdir(dir))) //We loop through each entry trying to find where we came from
{
if (strcmp(entry->d_name, ".") && strcmp(entry->d_name, ".."))
{
up_path.replace(after_slash, std::string::npos, entry->d_name);
if (!lstat(up_path.c_str(), &sb))
{
file_id child_id(sb.st_dev, sb.st_ino);
if (child_id == current_id) //We found where we came from, add its name to the list
{
path_components.push_back(entry->d_name);
pushed = true;
break;
}
}
}
}

if (pushed && !fstat(dirfd(dir), &sb)) //If we have a reason to contiue, we update the current dir id
{
current_id = file_id(sb.st_dev, sb.st_ino);
up_path.replace(after_slash, std::string::npos, ".."); //Keep recursing towards root each iteration
}

closedir(dir);
}
if (!pushed) { break; } //If we didn't obtain any info this pass, no reason to continue
}

if (current_id == root_id) //Unless they're equal, we failed above
{
//Built the path, will always end with a slash
path = "/";
for (std::vector<std::string>::reverse_iterator i = path_components.rbegin(); i != path_components.rend(); ++i)
{
path += *i+"/";
}
success = true;
}
}
}

return(success);
}

First of all, we removed all the code related to saving the current directory when we started, since we don't need that information anymore. Our first major change is that we created a new C++ String called up_path, we'll use this variable to keep track of the path, initializing it to "..". then for each step through the loop, make it "../..", "../../.." and so on, till we reach the root. We use this to replace our calls to opendir() to "." as we were doing before.
At this point, we'll add a slash to the path, and keep track of the spot with the variable after_slash. Now in our read directory loop, we can replace whatever is after the slash with the filename in the directory to pass to lstat(), again bypassing the need to be in the same directory as the file when making the function call.
Now for the stat() call on the directory itself, we got a little interesting. Instead of doing a path manipulation trick again, we call fstat() on the file descriptor returned from dirfd() on the already open directly handle. Notice how the call to close directory has been moved to after the block of code, so the directory is still open. And of course it's all wrapped up nicely with appending ".." after the slash.

Noticing how we eliminated path manipulation for the last part, it would be nice to eliminate more of it, especially if anyone wants to port this C++ code to C. The good news is that on Solaris and Linux we can, and as soon as the "at" functions get standardized, we can use them on the other UNIX OSs too. You can read more about them in one of our previous articles, File Descriptors and why we can't use them.

Here's how we can use the at functions to eliminate the path manipulation needed for calling lstat(), with the differences highlighted:

bool getcwd(std::string& path)
{
typedef std::pair<dev_t, ino_t> file_id;

bool success = false;
struct stat sb;
if (!stat(".", &sb))
{
file_id current_id(sb.st_dev, sb.st_ino);
if (!stat("/", &sb)) //Get info for root directory, so we can determine when we hit it
{
std::vector<std::string> path_components;
file_id root_id(sb.st_dev, sb.st_ino);
std::string up_path("..");

while (current_id != root_id) //If they're equal, we've obtained enough info to build the path
{
bool pushed = false;
DIR *dir = opendir(up_path.c_str());
if (dir)
{
int dir_fd = dirfd(dir);
dirent *entry;
while ((entry = readdir(dir))) //We loop through each entry trying to find where we came from
{
if (strcmp(entry->d_name, ".") && strcmp(entry->d_name, "..") && !fstatat(dir_fd, entry->d_name, &sb, AT_SYMLINK_NOFOLLOW))
{
file_id child_id(sb.st_dev, sb.st_ino);
if (child_id == current_id) //We found where we came from, add its name to the list
{
path_components.push_back(entry->d_name);
pushed = true;
break;
}
}
}

if (pushed && !fstat(dir_fd, &sb)) //If we have a reason to contiue, we update the current dir id
{
current_id = file_id(sb.st_dev, sb.st_ino);
up_path += "/.."; //Keep recursing towards root each iteration
}

closedir(dir);
}
if (!pushed) { break; } //If we didn't obtain any info this pass, no reason to continue
}

if (current_id == root_id) //Unless they're equal, we failed above
{
//Built the path, will always end with a slash
path = "/";
for (std::vector<std::string>::reverse_iterator i = path_components.rbegin(); i != path_components.rend(); ++i)
{
path += *i+"/";
}
success = true;
}
}
}

return(success);
}

As should be easily discernible, the string manipulation got easier, and mostly vanished. All the keeping track of a slash, and replaces has been replaced with only needing to append "/.." at the end of each loop.
Instead of manipulating a path string for our call to lstat(), we save the directory's file descriptor above (and subsequently use that instead of recalculating it lower down for fstat()), then use it to get the lstat() for each file in the directory, but now via fstatat().
fstatat() is the same as stat()/lstat(), but takes a directory descriptor as the first parameter to offset where the filename is relative to. The last parameter can be 0 or AT_SYMLINK_NOFOLLOW, which makes it act like stat() or lstat() respectively. fstatat() instead of a file descriptor can also take the special value AT_FDCWD to have it automatically work in the current directory.

This implementation should be much more elegant, but a bit less portable. If one would like to implement fstatat() themselves, it's not hard, here's how you can do it:

int fstatat(int dirfd, const char *pathname, struct stat *buf, int flags)
{
int success = -1;

if ((!flags || (flags == AT_SYMLINK_NOFOLLOW)))
{
int cwdfd = -1;
if ((dirfd == AT_FDCWD) || (pathname && (*pathname == '/')) || (((cwdfd=open(".", O_RDONLY)) != -1) && !fchdir(dirfd)))
{
success = (!flags) ? stat(pathname, buf) : lstat(pathname, buf);
}

if (cwdfd != -1)
{
fchdir(cwdfd);
close(cwdfd);
}
}
else
{
errno = EINVAL;
}

return(success);
}

You'll also need the defines for AT_FDCWD and AT_SYMLINK_NOFOLLOW. You have to make sure that the value you choose for AT_FDCWD can't be a valid file descriptor or the normal failure return value. Therefor I chose -2 in my implementations, but choosing any negative value should be okay. AT_SYMLINK_NOFOLLOW shouldn't matter what it is, and a value of 1 or for all I care, 42, should be fine.
If your OS supports the at functions (currently Linux 2.6.16+ and recent versions of Solaris), it's better not to use a custom implementation, as this isn't thread safe, since it calls fchdir() internally, and for our example, runs counter to the whole point of using fstatat(). It would also be faster to use the original getcwd() implementation from a few days ago than using an emulated fstatat(), since there's less overhead from repeated fchdir() calls.
It's also interesting to note that glibc on Linux now, even for older than 2.6.16 implements fstatat() and the whole slew of at functions even when not supported in the Kernel. It's similar to ours, can be thread unsafe due to changing the working directory, and for some inexplicable reason, segfaulted in certain circumstances when I ran a large battery of tests on it and my implementation against the Kernel's to make sure that mine was working properly.

Anyways, with that out of the way, one can't hope but wonder if it would be possible to also eliminate the need of any path string manipulation for the call to opendir(). Mysteriously, neither Solaris nor Linux have an opendirat() call. If there was such, we could easily keep the previous directory open, till we obtained a new handle for its parent directory.
Baring having an opendirat() function, it'd be nice to implement such a thing. Some UNIX OSs I understand have a function perhaps named fopendir() or fdopendir() to promote a file descriptor to a directory handle. With such a function, we can simply write an opendirat() which calls openat(), then promotes it, but Linux doesn't have a method of promotion, although Solaris does (fdopendir()). Lets hope the people who are standardizing the new at functions for POSIX seriously consider what they're doing, otherwise we won't be able to use them.

Now that we've gotten getcwd() to be safe, we still need realpath() to be. In our realpath() implementation, the only unsafe parts was our function chdir_getcwd() which called chdir() internally as well as getcwd(). getcwd() is now taken care of, but we still need to rewrite the rest of chdir_getcwd() to not actually chdir().
To do such should now be easy, we simply do getcwd(), but with a new start path. We'd make a new function called getcwd_internal() which would take two parameters, one for the current directory, and another to initialize the up one, which we can wrap everything else around. Basically copy your favorite getcwd(), but make the modifications to the beginning like so:

bool getcwd_internal(std::string& path, const std::string& start_dir, std::string& up_path)
{
typedef std::pair<dev_t, ino_t> file_id;

bool success = false;
struct stat sb;
if (!stat(start_dir.c_str(), &sb))
{
file_id current_id(sb.st_dev, sb.st_ino);
if (!stat("/", &sb)) //Get info for root directory, so we can determine when we hit it
{
std::vector<std::string> path_components;
file_id root_id(sb.st_dev, sb.st_ino);

while (current_id != root_id) //If they're equal, we've obtained enough info to build the path
--SNIP--


Now we'd turn getcwd() and chdir_getcwd() into wrappers like so:

bool getcwd(std::string& path)
{
std::string up_path("..");
return(getcwd_internal(path, ".", up_path));
}

bool chdir_getcwd(const std::string& dir, std::string& path)
{
std::string up_path(dir+"/..");
return(getcwd_internal(path, dir, up_path));
}


Note that the name of chdir_getcwd() is now misleading that it doesn't actually change directories anymore. For this reason, it's a good idea to not put any implementation details into a function name, as a user shouldn't have to know about them, and over time, it can become inaccurate.

And there you have it folks, getcwd() and realpath() implemented safely. Hopefully in the process all you loyal readers learned a couple other things too.

Other improvements over all we discussed, might be to run a smart pass over concatenated path names at key locations to remove extra slashes together "///", unneeded current directory "/./", and extra parent directories "path1/path2/.." -> "path1/", which is useful if certain functions like opendir() or the stat() family of calls have internal buffer issues. But doing such is a discussion for another time.

This wraps up our long discussion. All comments and suggestions are welcome.

1 comment:

insane coder said...

Just wanted to update to say that it seems Linux does now have fdopendir(), it just doesn't seem to be listed in the manpages yet.