Friday, November 9, 2007


Implementing realpath() in C++



Before reading this post, please read the background in the previous post PATH_MAX simply isn't, as it covers important background, and has some code we'll be building on over here.

Here's what realpath() is supposed to do:

realpath() expands all symbolic links and resolves references to ’/./’, ’/../’ and extra ’/’ characters in the null terminated string named by path and stores the canonicalized absolute pathname in the buffer of size PATH_MAX named by resolved_path. The resulting path will have no symbolic link, ’/./’ or ’/../’ components.


Instead of attacking implementing such a thing head on, we'll take a bottom up approach. One should consider that being able to break down a path between the path components prior to the last slash and after the last slash is essential to figuring out which directory the file actually lies.

The built in functions for this are called dirname() and basename(), but they modify their arguments, and they do some funny handling in special cases which we don't want.

So to implement our own for our purposes, we can write a function as follows:

void relative_dir_base_split(const std::string& path, std::string& dir, std::string& base)
{
std::string::size_type slash_pos = path.rfind("/"); //Find the last slash
if (slash_pos != std::string::npos) //If there is a slash
{
slash_pos++;
dir = path.substr(0, slash_pos); //Directory is before slash
base = path.substr(slash_pos); //And obviously, the file is after
}
else //Otherwise, there is no directory present
{
dir.clear();
base = path;
}
}

The function is simple enough, pass it a path, and two C++ strings for the return values, and have it split them up.

The next function which we'll use is one to get the realpath() of a directory. We can easily get such a thing by chdir()'ing to the directory, running our getcwd(), and then chdir() back.
This simple function is as follows:


bool chdir_getcwd(const std::string& dir, std::string& path)
{
bool success = false;
int start_fd = open(".", O_RDONLY); //Open current directory so we can save a handle to it
if (start_fd != -1)
{
if (!chdir(dir.c_str())) //Change to directory
{
success = getcwd(path); //And get its path
fchdir(start_fd); //And change back of course
}
close(start_fd);
}
return(success);
}

This simple function to resolve where a directory is located uses should be obvious, however, it's also a core function in figuring our where a file is, since a file is located at the realpath() of its parent directory, plus the base name for the file itself.

Now to get the realpath() of a file, we have this simple wrapper over the above functions:

static inline bool realpath_file(const std::string& path, std::string& resolved_path)
{
bool success = false;
std::string dir;
std::string base;
relative_dir_base_split(path, dir, base);

//If there is a directory, get the realpath() for it, otherwise the current directory
if (dir.size() ? chdir_getcwd(dir, resolved_path) : getcwd(resolved_path))
{
resolved_path += base;
success = true;
}
return(success);
}

Now we have chdir_getcwd() which is basically a realpath() for a directory, and realpath_file() for files. Note however, that if you call realpath_file() directly, it won't actually ensure that the file exists, so probably a bad idea to call directly, unless you don't care if it exists or not. It also won't handle symlinks. To handle symlinks we need quite a bit more code.

First, let us write a wrapper in C++ for reading links:

bool readlink_internal(const std::string& path, std::string& buffer, ssize_t length)
{
bool success = false;
if (length > 0)
{
char *buf = new(nothrow) char[length+1]; //Room for Null
if (buf)
{
ssize_t amount = ::readlink(path.c_str(), buf, length+1); //Give room for failure
if ((amount > 0) && (amount <= length)) //If > length, it was modified mid check
{
buf[amount] = 0;
buffer = buf;
success = true;
}
delete[] buf;
}
}
return(success);
}

This function simply wraps up the built in one, and uses C++ strings for dynamic allocation, so we don't have to worry about it directly in our function to resolve symlinks.

Before diving into the actual details in resolving a symlink, let me present one more helper function:

void build_path_base_swap(std::string &path, const std::string& newbase)
{
string dir;
string base;
relative_dir_base_split(path, dir, base);

if (dir.size())
{
path = dir + newbase;
}
else
{
path = newbase;
}
}


This above function will take a path, and replace the base part of it, meaning if you pass it "path/cows.txt" and "feet.txt", path will become "path/feet.txt".

Now onto the symlink discussion.

Resolving symlinks in itself is hard though, since one symlink could lead to another, meaning we'll need a loop. We also need to defend ourselves against an infinite loop if one symlink links to itself, or a symlink earlier on in our resolve loop.

It is not safe enough to call a regular stat() to check for the final link existing (and not a loop), since during our resolve, an attacker can modify a link, so we have to detect that.
Unfortunately, most OSs/libraries have a define called MAXSYMLINKS which I've seen set anywhere from 8 to 30. They stop their loop once they've iterated MAXSYMLINKS times and return an errno of ELOOP. Unfortunately, doing such has two flaws, which I hope should be apparent. First of all, if a link links to itself immediately, it would take MAXSYMLINKS passes until the function finally realizes it (I'm looking at you OpenBSD). The other issue is that it's just wrong.
Hopefully any of us who has taken a data structures course would know that there are algorithms for detecting a loop in a linked list, and not have to resort to any dumb method which doesn't actually detect a loop. Also a number such as 8 is way too low which I've seen some do. I look at one of my Linux boxes, and I see some binaries are actually symlinked a dozen times. Like "upx" is symlinked to "upx-3.00", which is symlinked to "upx-3.00-nonfree", which is symlinked to, etc...

The optimal way to detect a loop in a linked list is explained here, see the last paragraph for the optimal O(N) solution. Unfortunately though, the optimal solution means we need to create a second iterator to check on symlinks, resolving links ahead of the main loop, two at a time. While great in memory, it can be a pain on a hard drive, where we have to deal with slow access, and the links can change during the check. Therefor I settled on a variation of the first solution, where we'll keep track of the file IDs (described in the previous post) of each symlink, and see if we ever hit a match. We'll store this data in an std::set, which is usually implemented as a Red & Black Tree which has O(log(N)) time for insertion and searching. Making our algorithm work at O(log(N))*2 (since we insert and find each time) * N, which reduces to O(N log(N)).

Here's the code:

bool symlink_resolve(const std::string& start, std::string& end)
{
typedef std::pair<dev_t, ino_t> file_id;

bool success = false;
if (start.size())
{
std::string path = start; //Need a modifyable copy
struct stat sb;
std::set<file_id> seen_links;

bool resolved_link;
do //The symlink resolve loop
{
resolved_link = false;
if (!lstat(path.c_str(), &sb))
{
file_id current_id(sb.st_dev, sb.st_ino);
if (seen_links.find(current_id) == seen_links.end()) //Not a link we've seen
{
seen_links.insert(current_id); //Add to our set

if (S_ISLNK(sb.st_mode)) //Another link
{
std::string newpath;
if (readlink_internal(path, newpath, sb.st_size))
{
if (newpath[0] == '/') //Absolute
{
path = newpath;
}
else //We need to calculate the relative path in relation to the current
{
build_path_base_swap(path, newpath);
}
resolved_link = true;
} //Else, Link can't be read, time to quit
}
else //Yay, it's not a link! got to the last part finally!
{
success = realpath_file(path, end);
}
} //Else, Nice try, someone linked a link back into a previous link during the scan to try to trick us into an infinite loop
} //Else, Dangling link, can't resolve
} while (resolved_link);
}
return(success);
}


And now with all those helper functions out of the way, we can finally write the actual realpath() function itself, and it's a heck of a lot simpler than other implementations you'll find out there.

Here it is:

bool realpath(const std::string& path, std::string& resolved_path, bool resolve_link = true)
{
bool success = false;
if (path.size())
{
struct stat sb;
if (!stat(path.c_str(), &sb))
{
bool (*rp)(const std::string&, std::string&) = resolve_link ? symlink_resolve : realpath_file;
success = S_ISDIR(sb.st_mode) ? chdir_getcwd(path, resolved_path) : rp(path, resolved_path);
}
}
return(success);
}

It simply stat()s then runs chdir_getcwd() on directories and symlink_resolve() on files. I also offer an added feature. If you don't want the file name itself resolved, which is quite often what people want when a program saves a new file based off a name of another files passed to it, you can pass false as the third parameter, and it'll call realpath_file() instead of symlink_resolve(). With this wrapper, realpath_file() also has a stat() call before it, so the file is checked for existence like the built in realpath().

This implementation protects against the issues of PATH_MAX as described in our previous posts. It's also written in C++ to avoid buffer issues. Unlike built in ones, we generally have faster realpath() on directory, since it uses a getcwd() technique, while most other ones out there use a complicated set of steps on every component in the path until the last one is reached, which is especially slow if there are symlinks in the path. We also beat the built in ones regarding speed in cases of link to self, and have support for a deeper set of links. Some implementations also have some fixed sized buffer which can easily get full, and return failure, which is annoying, but our implementation won't run into. Lastly, we offer not resolving the links at all on the last component.

Drawbacks to our implementation is that it won't work on Windows as described in our last post. But you can #ifdef for that, and easily enough wrap around fullpath(). It also isn't thread safe due to the use of chdir() (which many built in implementations also do), see the previous post on how to work around that. Next time, we'll discuss methods to do all this without a chdir() at all, thus ensuring thread safety without any headaches.

12 comments:

Danny said...

I just ran into the PATH_MAX problem - actually, I ran into a buggy work around - and wanted a quick realpath replacement. Google brought me to you. What's the license on your code? Were the AT flags ever standardized?

insane coder said...

Just place a link back to this article in whatever file contains the code, and use it however you wish.

The *at() functions are now part of POSIX 2008.

Also, I recommend understanding the code here, and tweaking as needed. I myself have overhauled this code since I wrote this article, and use that in my projects.

Unknown said...

It looks like the stat() in realpath can err for reasons you specified were unreasonable.

You need to test for ELOOP and ENAMETOOLONG because those may not be valid errors.

KARTHIK said...

Great blog !It is best institute.Top Training institute In chennai
http://chennaitraining.in/oracle-dba-training-in-chennai/
http://chennaitraining.in/sql-server-dba-training-in-chennai/
http://chennaitraining.in/teradata-training-in-chennai/
http://chennaitraining.in/sap-hr-training-in-chennai/
http://chennaitraining.in/sap-fico-training-in-chennai/
http://chennaitraining.in/sap-abap-training-in-chennai/

bavisthra said...

Study Data Analytics Course in Bangalore with ExcelR where you get a great experience and better knowledge.
Data Analytics Course in Bangalore

Augustine Christian said...

Awesome article, it was exceptionally helpful! I simply began in this and I'm becoming more acquainted with it better. The post is written in very a good manner and it contains many useful information for me. Thank you very much and will look for more postings from you.

digital marketing blog
skartec's digital marketing blog
skartec digital marketing academy
skartec digital marketing
best seo service in chennai
best seo services in chennai
high pr dofollow directory submission sites
high da dofollow directory submission sites

globalonlinepills said...


When you feel any kind of body pain, it is best if you go to the doctor for treating it.
Sometimes body pain can be the symptom of some serious disease. Sometimes body pain attacks
us suddenly because of which you may not able to get the help of the doctor. In those situations,
to get quick and effective pain relief, you can take the help of painkillers though they cannot cure your pain.
As your painkiller, choose Tramadol 50 mg which is very effective. This painkiller is available in
the market with the name of Ultram. To use this painkiller, you can get it easily.
Buy Tramadol online and get this painkiller at an affordable price.

tramadol 50 mg
tramadol 50 mg hcl
tramadol 50 mg hydrochloride
tramadol 50 mg tablet

Ethan jurk said...

On the off chance that you have stuck some place with the installation, or unfit to set up the printer then you are at the perfect spot,Wireless Printer Setup Canon

augustwalker said...

HP Print And Scan Doctor Utility is outstanding amongst other apparatus offered by HP to oversee printing and checking task. HP print and sweep specialist encourages you to investigate your printer gadget. This device can without much of a stretch work on the two Windows and Mac framework. In the event that you have any issue while downloading and introducing this device, contact specialists right away. You will get total answer for fix the mistakes identified with print and output specialist.

globalonlinepills said...

Being a member of the Association of Indian Universities (AIU), the programs are recognized by WES. Students may verify the same from the WES platform for Canadian Immigration. NMIMS Distance Education program serves its students with highly innovative and revolutionary technology and offers every digital solution to enable faster and most tactful learning process to its students.
nmims distance fee payment
nmims distance university
nmims college distance education
nmims distance result
nmims online distance learning

DY Patil Distance Learning program is aimed towards a slick and rapid improvement in the process of education while maintaining the contemporary standards of the educational industry in the genre of Hospitality and Management. Being renowned as a sophisticated university, DY Patil has several tie-ups with some of the best international industries.
dr dy patil vidyapeeth pune distance mba
ajeenkya dy patil university distance education
dy patil distance education mumbai
dy patil university distance education
dy patil institute of distance learning admission 2020

The School of Distance Education and Learning has executed the same schooling pattern from its mother university and continues to spread its remarkable influence across Rajasthan. The students and alumni members are provided with innovative self-learning study materials. JNU Distance Learning also offers consistent and well-organized counselling programs to its JNU Distance Learning students by renowned and experienced counsellors right at their doorsteps.
jaipur national university distance education bca question paper

jaipur national university distance education contact details

jaipur national university distance education login

jaipur national university distance education ba

\jaipur national university distance education results june 2020

CloudLearn ERP said...

I have honestly never read such overwhelmingly good content like this. I agree with your points and your ideas. This info is really great. Thanks.
Best Data Science training in Mumbai

Data Science training in Mumbai

ajay said...

thanks for sharing us

Hindi Shayari

Hindi Shayari

hindi love shayari 2020

Rap Name Generator

Mod Apk

Picsart Mod Apk

Kinemaster pro Mod Apk

mp4moviez

Latest News

Best laptop under 50000

hindi News

Thanks For Sharing Useful Content