Friday, November 9, 2007


Implementing realpath() in C++



Before reading this post, please read the background in the previous post PATH_MAX simply isn't, as it covers important background, and has some code we'll be building on over here.

Here's what realpath() is supposed to do:

realpath() expands all symbolic links and resolves references to ’/./’, ’/../’ and extra ’/’ characters in the null terminated string named by path and stores the canonicalized absolute pathname in the buffer of size PATH_MAX named by resolved_path. The resulting path will have no symbolic link, ’/./’ or ’/../’ components.


Instead of attacking implementing such a thing head on, we'll take a bottom up approach. One should consider that being able to break down a path between the path components prior to the last slash and after the last slash is essential to figuring out which directory the file actually lies.

The built in functions for this are called dirname() and basename(), but they modify their arguments, and they do some funny handling in special cases which we don't want.

So to implement our own for our purposes, we can write a function as follows:

void relative_dir_base_split(const std::string& path, std::string& dir, std::string& base)
{
std::string::size_type slash_pos = path.rfind("/"); //Find the last slash
if (slash_pos != std::string::npos) //If there is a slash
{
slash_pos++;
dir = path.substr(0, slash_pos); //Directory is before slash
base = path.substr(slash_pos); //And obviously, the file is after
}
else //Otherwise, there is no directory present
{
dir.clear();
base = path;
}
}

The function is simple enough, pass it a path, and two C++ strings for the return values, and have it split them up.

The next function which we'll use is one to get the realpath() of a directory. We can easily get such a thing by chdir()'ing to the directory, running our getcwd(), and then chdir() back.
This simple function is as follows:


bool chdir_getcwd(const std::string& dir, std::string& path)
{
bool success = false;
int start_fd = open(".", O_RDONLY); //Open current directory so we can save a handle to it
if (start_fd != -1)
{
if (!chdir(dir.c_str())) //Change to directory
{
success = getcwd(path); //And get its path
fchdir(start_fd); //And change back of course
}
close(start_fd);
}
return(success);
}

This simple function to resolve where a directory is located uses should be obvious, however, it's also a core function in figuring our where a file is, since a file is located at the realpath() of its parent directory, plus the base name for the file itself.

Now to get the realpath() of a file, we have this simple wrapper over the above functions:

static inline bool realpath_file(const std::string& path, std::string& resolved_path)
{
bool success = false;
std::string dir;
std::string base;
relative_dir_base_split(path, dir, base);

//If there is a directory, get the realpath() for it, otherwise the current directory
if (dir.size() ? chdir_getcwd(dir, resolved_path) : getcwd(resolved_path))
{
resolved_path += base;
success = true;
}
return(success);
}

Now we have chdir_getcwd() which is basically a realpath() for a directory, and realpath_file() for files. Note however, that if you call realpath_file() directly, it won't actually ensure that the file exists, so probably a bad idea to call directly, unless you don't care if it exists or not. It also won't handle symlinks. To handle symlinks we need quite a bit more code.

First, let us write a wrapper in C++ for reading links:

bool readlink_internal(const std::string& path, std::string& buffer, ssize_t length)
{
bool success = false;
if (length > 0)
{
char *buf = new(nothrow) char[length+1]; //Room for Null
if (buf)
{
ssize_t amount = ::readlink(path.c_str(), buf, length+1); //Give room for failure
if ((amount > 0) && (amount <= length)) //If > length, it was modified mid check
{
buf[amount] = 0;
buffer = buf;
success = true;
}
delete[] buf;
}
}
return(success);
}

This function simply wraps up the built in one, and uses C++ strings for dynamic allocation, so we don't have to worry about it directly in our function to resolve symlinks.

Before diving into the actual details in resolving a symlink, let me present one more helper function:

void build_path_base_swap(std::string &path, const std::string& newbase)
{
string dir;
string base;
relative_dir_base_split(path, dir, base);

if (dir.size())
{
path = dir + newbase;
}
else
{
path = newbase;
}
}


This above function will take a path, and replace the base part of it, meaning if you pass it "path/cows.txt" and "feet.txt", path will become "path/feet.txt".

Now onto the symlink discussion.

Resolving symlinks in itself is hard though, since one symlink could lead to another, meaning we'll need a loop. We also need to defend ourselves against an infinite loop if one symlink links to itself, or a symlink earlier on in our resolve loop.

It is not safe enough to call a regular stat() to check for the final link existing (and not a loop), since during our resolve, an attacker can modify a link, so we have to detect that.
Unfortunately, most OSs/libraries have a define called MAXSYMLINKS which I've seen set anywhere from 8 to 30. They stop their loop once they've iterated MAXSYMLINKS times and return an errno of ELOOP. Unfortunately, doing such has two flaws, which I hope should be apparent. First of all, if a link links to itself immediately, it would take MAXSYMLINKS passes until the function finally realizes it (I'm looking at you OpenBSD). The other issue is that it's just wrong.
Hopefully any of us who has taken a data structures course would know that there are algorithms for detecting a loop in a linked list, and not have to resort to any dumb method which doesn't actually detect a loop. Also a number such as 8 is way too low which I've seen some do. I look at one of my Linux boxes, and I see some binaries are actually symlinked a dozen times. Like "upx" is symlinked to "upx-3.00", which is symlinked to "upx-3.00-nonfree", which is symlinked to, etc...

The optimal way to detect a loop in a linked list is explained here, see the last paragraph for the optimal O(N) solution. Unfortunately though, the optimal solution means we need to create a second iterator to check on symlinks, resolving links ahead of the main loop, two at a time. While great in memory, it can be a pain on a hard drive, where we have to deal with slow access, and the links can change during the check. Therefor I settled on a variation of the first solution, where we'll keep track of the file IDs (described in the previous post) of each symlink, and see if we ever hit a match. We'll store this data in an std::set, which is usually implemented as a Red & Black Tree which has O(log(N)) time for insertion and searching. Making our algorithm work at O(log(N))*2 (since we insert and find each time) * N, which reduces to O(N log(N)).

Here's the code:

bool symlink_resolve(const std::string& start, std::string& end)
{
typedef std::pair<dev_t, ino_t> file_id;

bool success = false;
if (start.size())
{
std::string path = start; //Need a modifyable copy
struct stat sb;
std::set<file_id> seen_links;

bool resolved_link;
do //The symlink resolve loop
{
resolved_link = false;
if (!lstat(path.c_str(), &sb))
{
file_id current_id(sb.st_dev, sb.st_ino);
if (seen_links.find(current_id) == seen_links.end()) //Not a link we've seen
{
seen_links.insert(current_id); //Add to our set

if (S_ISLNK(sb.st_mode)) //Another link
{
std::string newpath;
if (readlink_internal(path, newpath, sb.st_size))
{
if (newpath[0] == '/') //Absolute
{
path = newpath;
}
else //We need to calculate the relative path in relation to the current
{
build_path_base_swap(path, newpath);
}
resolved_link = true;
} //Else, Link can't be read, time to quit
}
else //Yay, it's not a link! got to the last part finally!
{
success = realpath_file(path, end);
}
} //Else, Nice try, someone linked a link back into a previous link during the scan to try to trick us into an infinite loop
} //Else, Dangling link, can't resolve
} while (resolved_link);
}
return(success);
}


And now with all those helper functions out of the way, we can finally write the actual realpath() function itself, and it's a heck of a lot simpler than other implementations you'll find out there.

Here it is:

bool realpath(const std::string& path, std::string& resolved_path, bool resolve_link = true)
{
bool success = false;
if (path.size())
{
struct stat sb;
if (!stat(path.c_str(), &sb))
{
bool (*rp)(const std::string&, std::string&) = resolve_link ? symlink_resolve : realpath_file;
success = S_ISDIR(sb.st_mode) ? chdir_getcwd(path, resolved_path) : rp(path, resolved_path);
}
}
return(success);
}

It simply stat()s then runs chdir_getcwd() on directories and symlink_resolve() on files. I also offer an added feature. If you don't want the file name itself resolved, which is quite often what people want when a program saves a new file based off a name of another files passed to it, you can pass false as the third parameter, and it'll call realpath_file() instead of symlink_resolve(). With this wrapper, realpath_file() also has a stat() call before it, so the file is checked for existence like the built in realpath().

This implementation protects against the issues of PATH_MAX as described in our previous posts. It's also written in C++ to avoid buffer issues. Unlike built in ones, we generally have faster realpath() on directory, since it uses a getcwd() technique, while most other ones out there use a complicated set of steps on every component in the path until the last one is reached, which is especially slow if there are symlinks in the path. We also beat the built in ones regarding speed in cases of link to self, and have support for a deeper set of links. Some implementations also have some fixed sized buffer which can easily get full, and return failure, which is annoying, but our implementation won't run into. Lastly, we offer not resolving the links at all on the last component.

Drawbacks to our implementation is that it won't work on Windows as described in our last post. But you can #ifdef for that, and easily enough wrap around fullpath(). It also isn't thread safe due to the use of chdir() (which many built in implementations also do), see the previous post on how to work around that. Next time, we'll discuss methods to do all this without a chdir() at all, thus ensuring thread safety without any headaches.

2 comments:

Danny Daemonic said...

I just ran into the PATH_MAX problem - actually, I ran into a buggy work around - and wanted a quick realpath replacement. Google brought me to you. What's the license on your code? Were the AT flags ever standardized?

insane coder said...

Just place a link back to this article in whatever file contains the code, and use it however you wish.

The *at() functions are now part of POSIX 2008.

Also, I recommend understanding the code here, and tweaking as needed. I myself have overhauled this code since I wrote this article, and use that in my projects.