Friday, November 9, 2007

Implementing realpath() in C++

Before reading this post, please read the background in the previous post PATH_MAX simply isn't, as it covers important background, and has some code we'll be building on over here.

Here's what realpath() is supposed to do:

realpath() expands all symbolic links and resolves references to ’/./’, ’/../’ and extra ’/’ characters in the null terminated string named by path and stores the canonicalized absolute pathname in the buffer of size PATH_MAX named by resolved_path. The resulting path will have no symbolic link, ’/./’ or ’/../’ components.

Instead of attacking implementing such a thing head on, we'll take a bottom up approach. One should consider that being able to break down a path between the path components prior to the last slash and after the last slash is essential to figuring out which directory the file actually lies.

The built in functions for this are called dirname() and basename(), but they modify their arguments, and they do some funny handling in special cases which we don't want.

So to implement our own for our purposes, we can write a function as follows:

void relative_dir_base_split(const std::string& path, std::string& dir, std::string& base)
std::string::size_type slash_pos = path.rfind("/"); //Find the last slash
if (slash_pos != std::string::npos) //If there is a slash
dir = path.substr(0, slash_pos); //Directory is before slash
base = path.substr(slash_pos); //And obviously, the file is after
else //Otherwise, there is no directory present
base = path;

The function is simple enough, pass it a path, and two C++ strings for the return values, and have it split them up.

The next function which we'll use is one to get the realpath() of a directory. We can easily get such a thing by chdir()'ing to the directory, running our getcwd(), and then chdir() back.
This simple function is as follows:

bool chdir_getcwd(const std::string& dir, std::string& path)
bool success = false;
int start_fd = open(".", O_RDONLY); //Open current directory so we can save a handle to it
if (start_fd != -1)
if (!chdir(dir.c_str())) //Change to directory
success = getcwd(path); //And get its path
fchdir(start_fd); //And change back of course

This simple function to resolve where a directory is located uses should be obvious, however, it's also a core function in figuring our where a file is, since a file is located at the realpath() of its parent directory, plus the base name for the file itself.

Now to get the realpath() of a file, we have this simple wrapper over the above functions:

static inline bool realpath_file(const std::string& path, std::string& resolved_path)
bool success = false;
std::string dir;
std::string base;
relative_dir_base_split(path, dir, base);

//If there is a directory, get the realpath() for it, otherwise the current directory
if (dir.size() ? chdir_getcwd(dir, resolved_path) : getcwd(resolved_path))
resolved_path += base;
success = true;

Now we have chdir_getcwd() which is basically a realpath() for a directory, and realpath_file() for files. Note however, that if you call realpath_file() directly, it won't actually ensure that the file exists, so probably a bad idea to call directly, unless you don't care if it exists or not. It also won't handle symlinks. To handle symlinks we need quite a bit more code.

First, let us write a wrapper in C++ for reading links:

bool readlink_internal(const std::string& path, std::string& buffer, ssize_t length)
bool success = false;
if (length > 0)
char *buf = new(nothrow) char[length+1]; //Room for Null
if (buf)
ssize_t amount = ::readlink(path.c_str(), buf, length+1); //Give room for failure
if ((amount > 0) && (amount <= length)) //If > length, it was modified mid check
buf[amount] = 0;
buffer = buf;
success = true;
delete[] buf;

This function simply wraps up the built in one, and uses C++ strings for dynamic allocation, so we don't have to worry about it directly in our function to resolve symlinks.

Before diving into the actual details in resolving a symlink, let me present one more helper function:

void build_path_base_swap(std::string &path, const std::string& newbase)
string dir;
string base;
relative_dir_base_split(path, dir, base);

if (dir.size())
path = dir + newbase;
path = newbase;

This above function will take a path, and replace the base part of it, meaning if you pass it "path/cows.txt" and "feet.txt", path will become "path/feet.txt".

Now onto the symlink discussion.

Resolving symlinks in itself is hard though, since one symlink could lead to another, meaning we'll need a loop. We also need to defend ourselves against an infinite loop if one symlink links to itself, or a symlink earlier on in our resolve loop.

It is not safe enough to call a regular stat() to check for the final link existing (and not a loop), since during our resolve, an attacker can modify a link, so we have to detect that.
Unfortunately, most OSs/libraries have a define called MAXSYMLINKS which I've seen set anywhere from 8 to 30. They stop their loop once they've iterated MAXSYMLINKS times and return an errno of ELOOP. Unfortunately, doing such has two flaws, which I hope should be apparent. First of all, if a link links to itself immediately, it would take MAXSYMLINKS passes until the function finally realizes it (I'm looking at you OpenBSD). The other issue is that it's just wrong.
Hopefully any of us who has taken a data structures course would know that there are algorithms for detecting a loop in a linked list, and not have to resort to any dumb method which doesn't actually detect a loop. Also a number such as 8 is way too low which I've seen some do. I look at one of my Linux boxes, and I see some binaries are actually symlinked a dozen times. Like "upx" is symlinked to "upx-3.00", which is symlinked to "upx-3.00-nonfree", which is symlinked to, etc...

The optimal way to detect a loop in a linked list is explained here, see the last paragraph for the optimal O(N) solution. Unfortunately though, the optimal solution means we need to create a second iterator to check on symlinks, resolving links ahead of the main loop, two at a time. While great in memory, it can be a pain on a hard drive, where we have to deal with slow access, and the links can change during the check. Therefor I settled on a variation of the first solution, where we'll keep track of the file IDs (described in the previous post) of each symlink, and see if we ever hit a match. We'll store this data in an std::set, which is usually implemented as a Red & Black Tree which has O(log(N)) time for insertion and searching. Making our algorithm work at O(log(N))*2 (since we insert and find each time) * N, which reduces to O(N log(N)).

Here's the code:

bool symlink_resolve(const std::string& start, std::string& end)
typedef std::pair<dev_t, ino_t> file_id;

bool success = false;
if (start.size())
std::string path = start; //Need a modifyable copy
struct stat sb;
std::set<file_id> seen_links;

bool resolved_link;
do //The symlink resolve loop
resolved_link = false;
if (!lstat(path.c_str(), &sb))
file_id current_id(sb.st_dev, sb.st_ino);
if (seen_links.find(current_id) == seen_links.end()) //Not a link we've seen
seen_links.insert(current_id); //Add to our set

if (S_ISLNK(sb.st_mode)) //Another link
std::string newpath;
if (readlink_internal(path, newpath, sb.st_size))
if (newpath[0] == '/') //Absolute
path = newpath;
else //We need to calculate the relative path in relation to the current
build_path_base_swap(path, newpath);
resolved_link = true;
} //Else, Link can't be read, time to quit
else //Yay, it's not a link! got to the last part finally!
success = realpath_file(path, end);
} //Else, Nice try, someone linked a link back into a previous link during the scan to try to trick us into an infinite loop
} //Else, Dangling link, can't resolve
} while (resolved_link);

And now with all those helper functions out of the way, we can finally write the actual realpath() function itself, and it's a heck of a lot simpler than other implementations you'll find out there.

Here it is:

bool realpath(const std::string& path, std::string& resolved_path, bool resolve_link = true)
bool success = false;
if (path.size())
struct stat sb;
if (!stat(path.c_str(), &sb))
bool (*rp)(const std::string&, std::string&) = resolve_link ? symlink_resolve : realpath_file;
success = S_ISDIR(sb.st_mode) ? chdir_getcwd(path, resolved_path) : rp(path, resolved_path);

It simply stat()s then runs chdir_getcwd() on directories and symlink_resolve() on files. I also offer an added feature. If you don't want the file name itself resolved, which is quite often what people want when a program saves a new file based off a name of another files passed to it, you can pass false as the third parameter, and it'll call realpath_file() instead of symlink_resolve(). With this wrapper, realpath_file() also has a stat() call before it, so the file is checked for existence like the built in realpath().

This implementation protects against the issues of PATH_MAX as described in our previous posts. It's also written in C++ to avoid buffer issues. Unlike built in ones, we generally have faster realpath() on directory, since it uses a getcwd() technique, while most other ones out there use a complicated set of steps on every component in the path until the last one is reached, which is especially slow if there are symlinks in the path. We also beat the built in ones regarding speed in cases of link to self, and have support for a deeper set of links. Some implementations also have some fixed sized buffer which can easily get full, and return failure, which is annoying, but our implementation won't run into. Lastly, we offer not resolving the links at all on the last component.

Drawbacks to our implementation is that it won't work on Windows as described in our last post. But you can #ifdef for that, and easily enough wrap around fullpath(). It also isn't thread safe due to the use of chdir() (which many built in implementations also do), see the previous post on how to work around that. Next time, we'll discuss methods to do all this without a chdir() at all, thus ensuring thread safety without any headaches.


Danny Daemonic said...

I just ran into the PATH_MAX problem - actually, I ran into a buggy work around - and wanted a quick realpath replacement. Google brought me to you. What's the license on your code? Were the AT flags ever standardized?

insane coder said...

Just place a link back to this article in whatever file contains the code, and use it however you wish.

The *at() functions are now part of POSIX 2008.

Also, I recommend understanding the code here, and tweaking as needed. I myself have overhauled this code since I wrote this article, and use that in my projects.

Aaron C. said...

It looks like the stat() in realpath can err for reasons you specified were unreasonable.

You need to test for ELOOP and ENAMETOOLONG because those may not be valid errors.

Soujanya Bargavi said...

This looks absolutely perfect, thanks for sharing. VMware training

MyTraining said...

Nice Information

Click here Best Training Institute in Bangalore to get More Details with Exiting offers

HKR Supports said...

Great post.
Thanks for sharing...
technical support services

sakshta rasal said...

Thanks for sharing information I must say very informative blog post. Keep it up!!
digital marketing classes in baramati | java course in baramati | white label trucking app solution | php classes in baramati

vyshu kits2019 said...

Thank you so much for such an wonderful blog. I’m really thankful to you for this informative article.
Best Online Training Institute From Hyderabad.

Microsoft Azure Online Training

Devops Puppet Training Videos

DevOps Chef Training Videos

Sivanandhana Girish said...

Spoken English Classes in Chennai
Best Spoken English Classes in Chennai
Spoken English Class in Chennai
Spoken English in Chennai
English Classes in Chennai

divya said...

Nice blog!! I hope you will share more info like this. I will use this for my studies and research.
Angularjs Training in Chennai
Angularjs Course in Chennai
CCNA Training in Chennai
Salesforce Training in Chennai
Angularjs Training
ui ux design course in chennai
Angularjs Training Institute in Chennai
Angular2 Training in Chennai
Angularjs Training in Chennai
Angularjs Course in Chennai

nikitha josh said...

Excellent post gained so much of information, Keep posting like this.
Aviation Courses in Chennai
Air hostess training in Chennai
Airline Courses in Chennai
airport ground staff training courses in Chennai
Aviation Academy in Chennai
air hostess training in Chennai
airport management courses in Chennai
ground staff training in Chennai

Bloggerboy said...

Alone Status

janitha said...

Awesome blog. I enjoyed reading your articles. This is truly a great read for me. I have bookmarked it and I am looking forward to reading new articles. Keep up the good work!
pmp certificaton malaysia

Amirtha Gowri said...

Excellent blog I visit this blog it's really informative. By reading your blog, I get inspired and this provides useful information.

Check out:
reactjs training in chennai
it courses in chennai
react js interview questions

Digital Marketing said...

Excellent information, I like your blog.

Best Software Development Company in Kolkata

Best Android Application Development Company in Kolkata

Best App Development Company in kolkata

Best Web Development Company in kolkata

htop said...

thanks for sharing this information
aws training center in chennai
aws training in chennai
aws training institute in chennai
best angularjs training in chennai
angular js training in sholinganallur
angularjs training in chennai
azure training in chennai
best devops training in chennai

maddison said...

Thanks for sharing. This must be super helpful for so many. Learning implementation tough especially when you have a lack of resources. I literally needed this because too many times I am racking my brain like what am I doing, is not showing value. I also found some other interesting blogs from JanBask Training on "Implementing", Experienced and they are also very interesting. if you want to more informative blogs can visit: