Insane Coding: November 2007

Sunday, November 11, 2007

Directory safety when working with paths

Posted by insane coder at Sunday, November 11, 2007

As we discussed in our previous two articles, we run into issues of thread safety when we try changing the path during a running program. Therefor, in addition to bypassing PATH_MAX problems, working around buffer issues thanks to implementing with C++, we would also love to implement our functions without ever calling chdir() and changing the working directory. We'll now look into how we can make changes to our implementation to avoid a chdir() call, along with the pros and cons of techniques UNIX operating systems provide us, as well as what they're lacking.

If one looks back at how we implemented getcwd(), they'll notice that we call chdir("..") for each step in our loop. We need this to sanitize the location for these three calls:


opendir(".")
lstat(entry->d_name, &sb)
stat(".", &sb)

To remove the need of changing directories, we can place the ".." paths directly into our calls to opendir(), lstat(), and stat(). Thanks to C++ Strings being dynamic, it makes it easy for us to manipulate paths for these situations properly. We'll also see that being clever, we can even avoid string manipulation for some of these cases. And if our OSs supplied enough functionality, without any string manipulation at all!

Here's how we can implement getcwd() based off our previous implementation, but without calls to chdir(), with key details highlighted:


bool getcwd(std::string& path)
{
  typedef std::pair<dev_t, ino_t> file_id;

  bool success = false;
  struct stat sb;
  if (!stat(".", &sb))
  {
    file_id current_id(sb.st_dev, sb.st_ino);
    if (!stat("/", &sb)) //Get info for root directory, so we can determine when we hit it
    {
      std::vector<std::string> path_components;
      file_id root_id(sb.st_dev, sb.st_ino);
      std::string up_path("..");

      while (current_id != root_id) //If they're equal, we've obtained enough info to build the path
      {
        bool pushed = false;
        DIR *dir = opendir(up_path.c_str());
        if (dir)
        {
          dirent *entry;
          up_path += "/";
          std::string::size_type after_slash = up_path.size();
          while ((entry = readdir(dir))) //We loop through each entry trying to find where we came from
          {
            if (strcmp(entry->d_name, ".") && strcmp(entry->d_name, ".."))
            {
              up_path.replace(after_slash, std::string::npos, entry->d_name);
              if (!lstat(up_path.c_str(), &sb))
              {
                file_id child_id(sb.st_dev, sb.st_ino);
                if (child_id == current_id) //We found where we came from, add its name to the list
                {
                  path_components.push_back(entry->d_name);
                  pushed = true;
                  break;
                }
              }
            }
          }

          if (pushed && !fstat(dirfd(dir), &sb)) //If we have a reason to contiue, we update the current dir id
          {
            current_id = file_id(sb.st_dev, sb.st_ino);
            up_path.replace(after_slash, std::string::npos, ".."); //Keep recursing towards root each iteration
          }

          closedir(dir);
        }
        if (!pushed) { break; } //If we didn't obtain any info this pass, no reason to continue
      }

      if (current_id == root_id) //Unless they're equal, we failed above
      {
        //Built the path, will always end with a slash
        path = "/";
        for (std::vector<std::string>::reverse_iterator i = path_components.rbegin(); i != path_components.rend(); ++i)
        {
          path += *i+"/";
        }
        success = true;
      }
    }
  }

  return(success);
}

First of all, we removed all the code related to saving the current directory when we started, since we don't need that information anymore. Our first major change is that we created a new C++ String called up_path, we'll use this variable to keep track of the path, initializing it to "..". then for each step through the loop, make it "../..", "../../.." and so on, till we reach the root. We use this to replace our calls to opendir() to "." as we were doing before.
At this point, we'll add a slash to the path, and keep track of the spot with the variable after_slash. Now in our read directory loop, we can replace whatever is after the slash with the filename in the directory to pass to lstat(), again bypassing the need to be in the same directory as the file when making the function call.
Now for the stat() call on the directory itself, we got a little interesting. Instead of doing a path manipulation trick again, we call fstat() on the file descriptor returned from dirfd() on the already open directly handle. Notice how the call to close directory has been moved to after the block of code, so the directory is still open. And of course it's all wrapped up nicely with appending ".." after the slash.

Noticing how we eliminated path manipulation for the last part, it would be nice to eliminate more of it, especially if anyone wants to port this C++ code to C. The good news is that on Solaris and Linux we can, and as soon as the "at" functions get standardized, we can use them on the other UNIX OSs too. You can read more about them in one of our previous articles, File Descriptors and why we can't use them.

Here's how we can use the at functions to eliminate the path manipulation needed for calling lstat(), with the differences highlighted:


bool getcwd(std::string& path)
{
  typedef std::pair<dev_t, ino_t> file_id;

  bool success = false;
  struct stat sb;
  if (!stat(".", &sb))
  {
    file_id current_id(sb.st_dev, sb.st_ino);
    if (!stat("/", &sb)) //Get info for root directory, so we can determine when we hit it
    {
      std::vector<std::string> path_components;
      file_id root_id(sb.st_dev, sb.st_ino);
      std::string up_path("..");

      while (current_id != root_id) //If they're equal, we've obtained enough info to build the path
      {
        bool pushed = false;
        DIR *dir = opendir(up_path.c_str());
        if (dir)
        {
          int dir_fd = dirfd(dir);
          dirent *entry;
          while ((entry = readdir(dir))) //We loop through each entry trying to find where we came from
          {
            if (strcmp(entry->d_name, ".") && strcmp(entry->d_name, "..") && !fstatat(dir_fd, entry->d_name, &sb, AT_SYMLINK_NOFOLLOW))
            {
              file_id child_id(sb.st_dev, sb.st_ino);
              if (child_id == current_id) //We found where we came from, add its name to the list
              {
                path_components.push_back(entry->d_name);
                pushed = true;
                break;
              }
            }
          }

          if (pushed && !fstat(dir_fd, &sb)) //If we have a reason to contiue, we update the current dir id
          {
            current_id = file_id(sb.st_dev, sb.st_ino);
            up_path += "/.."; //Keep recursing towards root each iteration
          }

          closedir(dir);
        }
        if (!pushed) { break; } //If we didn't obtain any info this pass, no reason to continue
      }

      if (current_id == root_id) //Unless they're equal, we failed above
      {
        //Built the path, will always end with a slash
        path = "/";
        for (std::vector<std::string>::reverse_iterator i = path_components.rbegin(); i != path_components.rend(); ++i)
        {
          path += *i+"/";
        }
        success = true;
      }
    }
  }

  return(success);
}

As should be easily discernible, the string manipulation got easier, and mostly vanished. All the keeping track of a slash, and replaces has been replaced with only needing to append "/.." at the end of each loop.
Instead of manipulating a path string for our call to lstat(), we save the directory's file descriptor above (and subsequently use that instead of recalculating it lower down for fstat()), then use it to get the lstat() for each file in the directory, but now via fstatat().
fstatat() is the same as stat()/lstat(), but takes a directory descriptor as the first parameter to offset where the filename is relative to. The last parameter can be 0 or AT_SYMLINK_NOFOLLOW, which makes it act like stat() or lstat() respectively. fstatat() instead of a file descriptor can also take the special value AT_FDCWD to have it automatically work in the current directory.

This implementation should be much more elegant, but a bit less portable. If one would like to implement fstatat() themselves, it's not hard, here's how you can do it:


int fstatat(int dirfd, const char *pathname, struct stat *buf, int flags)
{
  int success = -1;

  if ((!flags || (flags == AT_SYMLINK_NOFOLLOW)))
  {
    int cwdfd = -1;
    if ((dirfd == AT_FDCWD) || (pathname && (*pathname == '/')) || (((cwdfd=open(".", O_RDONLY)) != -1) && !fchdir(dirfd)))
    {
      success = (!flags) ? stat(pathname, buf) : lstat(pathname, buf);
    }

    if (cwdfd != -1)
    {
      fchdir(cwdfd);
      close(cwdfd);
    }
  }
  else
  {
    errno = EINVAL;
  }

  return(success);
}

You'll also need the defines for AT_FDCWD and AT_SYMLINK_NOFOLLOW. You have to make sure that the value you choose for AT_FDCWD can't be a valid file descriptor or the normal failure return value. Therefor I chose -2 in my implementations, but choosing any negative value should be okay. AT_SYMLINK_NOFOLLOW shouldn't matter what it is, and a value of 1 or for all I care, 42, should be fine.
If your OS supports the at functions (currently Linux 2.6.16+ and recent versions of Solaris), it's better not to use a custom implementation, as this isn't thread safe, since it calls fchdir() internally, and for our example, runs counter to the whole point of using fstatat(). It would also be faster to use the original getcwd() implementation from a few days ago than using an emulated fstatat(), since there's less overhead from repeated fchdir() calls.
It's also interesting to note that glibc on Linux now, even for older than 2.6.16 implements fstatat() and the whole slew of at functions even when not supported in the Kernel. It's similar to ours, can be thread unsafe due to changing the working directory, and for some inexplicable reason, segfaulted in certain circumstances when I ran a large battery of tests on it and my implementation against the Kernel's to make sure that mine was working properly.

Anyways, with that out of the way, one can't hope but wonder if it would be possible to also eliminate the need of any path string manipulation for the call to opendir(). Mysteriously, neither Solaris nor Linux have an opendirat() call. If there was such, we could easily keep the previous directory open, till we obtained a new handle for its parent directory.
Baring having an opendirat() function, it'd be nice to implement such a thing. Some UNIX OSs I understand have a function perhaps named fopendir() or fdopendir() to promote a file descriptor to a directory handle. With such a function, we can simply write an opendirat() which calls openat(), then promotes it, but Linux doesn't have a method of promotion, although Solaris does (fdopendir()). Lets hope the people who are standardizing the new at functions for POSIX seriously consider what they're doing, otherwise we won't be able to use them.

Now that we've gotten getcwd() to be safe, we still need realpath() to be. In our realpath() implementation, the only unsafe parts was our function chdir_getcwd() which called chdir() internally as well as getcwd(). getcwd() is now taken care of, but we still need to rewrite the rest of chdir_getcwd() to not actually chdir().
To do such should now be easy, we simply do getcwd(), but with a new start path. We'd make a new function called getcwd_internal() which would take two parameters, one for the current directory, and another to initialize the up one, which we can wrap everything else around. Basically copy your favorite getcwd(), but make the modifications to the beginning like so:


bool getcwd_internal(std::string& path, const std::string& start_dir, std::string& up_path)
{
  typedef std::pair<dev_t, ino_t> file_id;

  bool success = false;
  struct stat sb;
  if (!stat(start_dir.c_str(), &sb))
  {
    file_id current_id(sb.st_dev, sb.st_ino);
    if (!stat("/", &sb)) //Get info for root directory, so we can determine when we hit it
    {
      std::vector<std::string> path_components;
      file_id root_id(sb.st_dev, sb.st_ino);

      while (current_id != root_id) //If they're equal, we've obtained enough info to build the path
--SNIP--

Now we'd turn getcwd() and chdir_getcwd() into wrappers like so:


bool getcwd(std::string& path)
{
  std::string up_path("..");
  return(getcwd_internal(path, ".", up_path));
}

bool chdir_getcwd(const std::string& dir, std::string& path)
{
  std::string up_path(dir+"/..");
  return(getcwd_internal(path, dir, up_path));
}

Note that the name of chdir_getcwd() is now misleading that it doesn't actually change directories anymore. For this reason, it's a good idea to not put any implementation details into a function name, as a user shouldn't have to know about them, and over time, it can become inaccurate.

And there you have it folks, getcwd() and realpath() implemented safely. Hopefully in the process all you loyal readers learned a couple other things too.

Other improvements over all we discussed, might be to run a smart pass over concatenated path names at key locations to remove extra slashes together "///", unneeded current directory "/./", and extra parent directories "path1/path2/.." -> "path1/", which is useful if certain functions like opendir() or the stat() family of calls have internal buffer issues. But doing such is a discussion for another time.

This wraps up our long discussion. All comments and suggestions are welcome.

Friday, November 9, 2007

Implementing realpath() in C++

Posted by insane coder at Friday, November 09, 2007

Before reading this post, please read the background in the previous post PATH_MAX simply isn't, as it covers important background, and has some code we'll be building on over here.

Here's what realpath() is supposed to do:

realpath() expands all symbolic links and resolves references to ’/./’, ’/../’ and extra ’/’ characters in the null terminated string named by path and stores the canonicalized absolute pathname in the buffer of size PATH_MAX named by resolved_path. The resulting path will have no symbolic link, ’/./’ or ’/../’ components.

Instead of attacking implementing such a thing head on, we'll take a bottom up approach. One should consider that being able to break down a path between the path components prior to the last slash and after the last slash is essential to figuring out which directory the file actually lies.

The built in functions for this are called dirname() and basename(), but they modify their arguments, and they do some funny handling in special cases which we don't want.

So to implement our own for our purposes, we can write a function as follows:


void relative_dir_base_split(const std::string& path, std::string& dir, std::string& base)
{
  std::string::size_type slash_pos = path.rfind("/"); //Find the last slash
  if (slash_pos != std::string::npos) //If there is a slash
  {
    slash_pos++;
    dir = path.substr(0, slash_pos); //Directory is before slash
    base = path.substr(slash_pos); //And obviously, the file is after
  }
  else //Otherwise, there is no directory present
  {
    dir.clear(); 
    base = path;
  }
}

The function is simple enough, pass it a path, and two C++ strings for the return values, and have it split them up.

The next function which we'll use is one to get the realpath() of a directory. We can easily get such a thing by chdir()'ing to the directory, running our getcwd(), and then chdir() back.
This simple function is as follows:


bool chdir_getcwd(const std::string& dir, std::string& path)
{
  bool success = false;
  int start_fd = open(".", O_RDONLY); //Open current directory so we can save a handle to it
  if (start_fd != -1)
  {
    if (!chdir(dir.c_str())) //Change to directory
    {
      success = getcwd(path); //And get its path
      fchdir(start_fd); //And change back of course
    }
    close(start_fd);
  }
  return(success);
}

This simple function to resolve where a directory is located uses should be obvious, however, it's also a core function in figuring our where a file is, since a file is located at the realpath() of its parent directory, plus the base name for the file itself.

Now to get the realpath() of a file, we have this simple wrapper over the above functions:


static inline bool realpath_file(const std::string& path, std::string& resolved_path)
{
  bool success = false;
  std::string dir;
  std::string base;
  relative_dir_base_split(path, dir, base);

  //If there is a directory, get the realpath() for it, otherwise the current directory
  if (dir.size() ? chdir_getcwd(dir, resolved_path) : getcwd(resolved_path))
  {
    resolved_path += base;
    success = true;
  }
  return(success);
}

Now we have chdir_getcwd() which is basically a realpath() for a directory, and realpath_file() for files. Note however, that if you call realpath_file() directly, it won't actually ensure that the file exists, so probably a bad idea to call directly, unless you don't care if it exists or not. It also won't handle symlinks. To handle symlinks we need quite a bit more code.

First, let us write a wrapper in C++ for reading links:


bool readlink_internal(const std::string& path, std::string& buffer, ssize_t length)
{
  bool success = false;
  if (length > 0)
  {
    char *buf = new(nothrow) char[length+1]; //Room for Null
    if (buf)
    {
      ssize_t amount = ::readlink(path.c_str(), buf, length+1); //Give room for failure
      if ((amount > 0) && (amount <= length)) //If > length, it was modified mid check
      {
        buf[amount] = 0;
        buffer = buf;
        success = true;
      }
      delete[] buf;
    }
  }
  return(success);
}

This function simply wraps up the built in one, and uses C++ strings for dynamic allocation, so we don't have to worry about it directly in our function to resolve symlinks.

Before diving into the actual details in resolving a symlink, let me present one more helper function:


void build_path_base_swap(std::string &path, const std::string& newbase)
{
  string dir;
  string base;
  relative_dir_base_split(path, dir, base);

  if (dir.size())
  {
    path = dir + newbase;
  }
  else
  {
    path = newbase;
  }
}

This above function will take a path, and replace the base part of it, meaning if you pass it "path/cows.txt" and "feet.txt", path will become "path/feet.txt".

Now onto the symlink discussion.

Resolving symlinks in itself is hard though, since one symlink could lead to another, meaning we'll need a loop. We also need to defend ourselves against an infinite loop if one symlink links to itself, or a symlink earlier on in our resolve loop.

It is not safe enough to call a regular stat() to check for the final link existing (and not a loop), since during our resolve, an attacker can modify a link, so we have to detect that.
Unfortunately, most OSs/libraries have a define called MAXSYMLINKS which I've seen set anywhere from 8 to 30. They stop their loop once they've iterated MAXSYMLINKS times and return an errno of ELOOP. Unfortunately, doing such has two flaws, which I hope should be apparent. First of all, if a link links to itself immediately, it would take MAXSYMLINKS passes until the function finally realizes it (I'm looking at you OpenBSD). The other issue is that it's just wrong.
Hopefully any of us who has taken a data structures course would know that there are algorithms for detecting a loop in a linked list, and not have to resort to any dumb method which doesn't actually detect a loop. Also a number such as 8 is way too low which I've seen some do. I look at one of my Linux boxes, and I see some binaries are actually symlinked a dozen times. Like "upx" is symlinked to "upx-3.00", which is symlinked to "upx-3.00-nonfree", which is symlinked to, etc...

The optimal way to detect a loop in a linked list is explained here, see the last paragraph for the optimal O(N) solution. Unfortunately though, the optimal solution means we need to create a second iterator to check on symlinks, resolving links ahead of the main loop, two at a time. While great in memory, it can be a pain on a hard drive, where we have to deal with slow access, and the links can change during the check. Therefor I settled on a variation of the first solution, where we'll keep track of the file IDs (described in the previous post) of each symlink, and see if we ever hit a match. We'll store this data in an std::set, which is usually implemented as a Red & Black Tree which has O(log(N)) time for insertion and searching. Making our algorithm work at O(log(N))*2 (since we insert and find each time) * N, which reduces to O(N log(N)).

Here's the code:


bool symlink_resolve(const std::string& start, std::string& end)
{
  typedef std::pair<dev_t, ino_t> file_id;

  bool success = false;
  if (start.size())
  {
    std::string path = start; //Need a modifyable copy
    struct stat sb;
    std::set<file_id> seen_links;

    bool resolved_link;
    do //The symlink resolve loop
    {
      resolved_link = false;
      if (!lstat(path.c_str(), &sb))
      {
        file_id current_id(sb.st_dev, sb.st_ino);
        if (seen_links.find(current_id) == seen_links.end()) //Not a link we've seen
        {
          seen_links.insert(current_id); //Add to our set

          if (S_ISLNK(sb.st_mode)) //Another link
          {
            std::string newpath;
            if (readlink_internal(path, newpath, sb.st_size))
            {
              if (newpath[0] == '/') //Absolute
              {
                path = newpath;
              }
              else //We need to calculate the relative path in relation to the current
              {
                build_path_base_swap(path, newpath);
              }
              resolved_link = true;
            } //Else, Link can't be read, time to quit
          }
          else //Yay, it's not a link! got to the last part finally!
          {
            success = realpath_file(path, end);
          }
        } //Else, Nice try, someone linked a link back into a previous link during the scan to try to trick us into an infinite loop
      } //Else, Dangling link, can't resolve
    } while (resolved_link);
  }
  return(success);
}

And now with all those helper functions out of the way, we can finally write the actual realpath() function itself, and it's a heck of a lot simpler than other implementations you'll find out there.

Here it is:


bool realpath(const std::string& path, std::string& resolved_path, bool resolve_link = true)
{
  bool success = false;
  if (path.size())
  {
    struct stat sb;
    if (!stat(path.c_str(), &sb))
    {
      bool (*rp)(const std::string&, std::string&) = resolve_link ? symlink_resolve : realpath_file;
      success = S_ISDIR(sb.st_mode) ? chdir_getcwd(path, resolved_path) : rp(path, resolved_path);
    }
  }
  return(success);
}

It simply stat()s then runs chdir_getcwd() on directories and symlink_resolve() on files. I also offer an added feature. If you don't want the file name itself resolved, which is quite often what people want when a program saves a new file based off a name of another files passed to it, you can pass false as the third parameter, and it'll call realpath_file() instead of symlink_resolve(). With this wrapper, realpath_file() also has a stat() call before it, so the file is checked for existence like the built in realpath().

This implementation protects against the issues of PATH_MAX as described in our previous posts. It's also written in C++ to avoid buffer issues. Unlike built in ones, we generally have faster realpath() on directory, since it uses a getcwd() technique, while most other ones out there use a complicated set of steps on every component in the path until the last one is reached, which is especially slow if there are symlinks in the path. We also beat the built in ones regarding speed in cases of link to self, and have support for a deeper set of links. Some implementations also have some fixed sized buffer which can easily get full, and return failure, which is annoying, but our implementation won't run into. Lastly, we offer not resolving the links at all on the last component.

Drawbacks to our implementation is that it won't work on Windows as described in our last post. But you can #ifdef for that, and easily enough wrap around fullpath(). It also isn't thread safe due to the use of chdir() (which many built in implementations also do), see the previous post on how to work around that. Next time, we'll discuss methods to do all this without a chdir() at all, thus ensuring thread safety without any headaches.

Saturday, November 3, 2007

PATH_MAX simply isn't

Posted by insane coder at Saturday, November 03, 2007

Many C/C++ programmers at some point may run into a limit known as PATH_MAX. Basically, if you have to keep track of paths to files/directories, how big does your buffer have to be?
Most Operating Systems/File Systems I've seen, limit a filename or any particular path component to 255 bytes or so. But a full path is a different matter.

Many programmers will immediately tell you that if your buffer is PATH_MAX, or PATH_MAX+1 bytes, it's long enough. A good C++ programmer of course would use C++ strings (std::string or similar with a particular API) to avoid any buffer length issues. But even when having dynamic strings in your program taking care of the nitty gritty issue of how long your buffers need to be, they only solve half the problem.

Even a C++ programmer may at some point want to call the getcwd() or realpath() (fullpath() on Windows) functions, which take a pointer to a writable buffer, and not a C++ string, and according to the standard, they don't do their own allocation. Even ones that do their own allocation very often just allocate PATH_MAX bytes.

getcwd() is a function to return what the current working directory is. realpath() can take a relative or absolute path to any filename, containing .. or levels of /././. or extra slashes, and symlinks and the like, and return a full absolute path without any extra garbage. These functions have a flaw though.

The flaw is that PATH_MAX simply isn't. Each system can define PATH_MAX to whatever size it likes. On my Linux system, I see it's 4096, on my OpenBSD system, I see it's 1024, on Windows, it's 260.

Now performing a test on my Linux system, I noticed that it limits a path component to 255 characters on ext3, but it doesn't stop me from making as many nested ones as I like. I successfully created a path 6000 characters long. Linux does absolutely nothing to stop me from creating such a large path, nor from mounting one large path on another. Running getcwd() in such a large path, even with a huge buffer, fails, since it doesn't work with anything past PATH_MAX.

Even a commercial OS like Mac OS X defines it as 1024, but tests show you can create a path several thousand characters long. Interestingly enough, OSX's getcwd() will properly identify a path which is larger than its PATH_MAX if you pass it a large enough buffer with enough room to hold all the data. This is possible, because the prototype for getcwd() is:

char *getcwd(char *buf, size_t size);

So a smart getcwd() can work if there's enough room. But unfortunately, there is no way to determine how much space you actually need, so you can't allocate it in advance. You'd have to keep allocating larger and larger buffers hoping one of them will finally work, which is quite retarded.

Since a path can be longer than PATH_MAX, the define is useless, writing code based off of it is wrong, and the functions that require it are broken.

An exception to this is Windows. It doesn't allow any paths to be created larger than 260 characters. If the path was created on a partition from a different OS, Windows won't allow anything to access it. It sounds strange that such a small limit was chosen, considering that FAT has no such limit imposed, and NTFS allows paths to be 32768 characters long. I can easily imagine someone with a sizable audio collection having a 300+ character path like so:

"C:\Documents and Settings\Jonathan Ezekiel Cornflour\My Documents\My Music\My Personal Rips\2007\Technological\Operating System Symphony Orchestra\The GNOME Musical Men\I Married Her For Her File System\You Don't Appreciate Marriage Until You've Noticed Tax Pro's Wizard For Married Couples.Track 01.MP5"

Before we forget, here's the prototype for realpath:

char *realpath(const char *file_name, char *resolved_name);

Now looking at that prototype, you should immediately say to yourself, but where's the size value for resolved_name? We don't want a buffer overflow! Which is why OSs will implement it based on the PATH_MAX define.

The resolved_name argument must refer to a buffer capable of storing at least PATH_MAX characters.

Which basically means, it can never work on a large path, and no clever OS can implement around it, unless it actually checks how much RAM is allocated on that pointer using an OS specific method - if available.

For these reasons, I've decided to implement getcwd() and realpath() myself. We'll discuss the exact specifics of realpath() next time, for now however, we will focus on how one can make their own getcwd().

The idea is to walk up the tree from the working directory, till we reach the root, along the way noting which path component we just went across.
Every modern OS has a stat() function which can take a path component and return information about it, such as when it was created, which device it is located on, and the like. All these OSs except for Windows return the fields st_dev and st_ino which together can uniquely identify any file or directory. If those two fields match the data retrieved in some other way on the same system, you can be sure they're the same file/directory.
To start, we'd determine the unique ID for . and /, once we have those, we can construct our loop. At each step, when the current doesn't equal the root, we can change directory to .., then scan the directory (using opendir()+readdir()+closedir()) for a component with the same ID. Once a matching ID is found, we can denote that as the correct name for the current level, and move up one.

Code demonstrating this in C++ is as follows:


  bool getcwd(std::string& path)
  {
    typedef std::pair<dev_t, ino_t> file_id;
    
    bool success = false;
    int start_fd = open(".", O_RDONLY); //Keep track of start directory, so can jump back to it later
    if (start_fd != -1)
    {
      struct stat sb;
      if (!fstat(start_fd, &sb))
      {
        file_id current_id(sb.st_dev, sb.st_ino);
        if (!stat("/", &sb)) //Get info for root directory, so we can determine when we hit it
        {
          std::vector<std::string> path_components;
          file_id root_id(sb.st_dev, sb.st_ino);

          while (current_id != root_id) //If they're equal, we've obtained enough info to build the path
          {
            bool pushed = false;

            if (!chdir("..")) //Keep recursing towards root each iteration
            {
              DIR *dir = opendir(".");
              if (dir)
              {
                dirent *entry;
                while ((entry = readdir(dir))) //We loop through each entry trying to find where we came from
                {
                  if ((strcmp(entry->d_name, ".") && strcmp(entry->d_name, "..") && !lstat(entry->d_name, &sb)))
                  {
                    file_id child_id(sb.st_dev, sb.st_ino);
                    if (child_id == current_id) //We found where we came from, add its name to the list
                    {
                      path_components.push_back(entry->d_name);
                      pushed = true;
                      break;
                    }
                  }
                }
                closedir(dir);

                if (pushed && !stat(".", &sb)) //If we have a reason to contiue, we update the current dir id
                {
                  current_id = file_id(sb.st_dev, sb.st_ino);
                }
              }//Else, Uh oh, can't read information at this level
            }
            if (!pushed) { break; } //If we didn't obtain any info this pass, no reason to continue
          }

          if (current_id == root_id) //Unless they're equal, we failed above
          {
            //Built the path, will always end with a slash
            path = "/";
            for (std::vector<std::string>::reverse_iterator i = path_components.rbegin(); i != path_components.rend(); ++i)
            {
              path += *i+"/";
            }
            success = true;
          }
          fchdir(start_fd);
        }
      }
      close(start_fd);
    }

    return(success);
  }

Before we accept that as the defacto method to use in your application, let us discuss the flaws.

As mentioned above, it doesn't work on Windows, but a simple #ifdef for Windows can just make it a wrapper around the built in getcwd() with a local buffer of size PATH_MAX, which is fine for Windows, and pretty much no other OS.

This function uses the name getcwd() which can conflict with the built in C based one which is a problem for certain compilers. The fix is to rename it, or put it in its own namespace.

Next, the built in getcwd() implementations I checked only have a trailing slash on the root directory. I personally like having the slash appended, since I'm usually concatenating a filename onto it, but note that if you're not using it for concatenation, but to pass to functions like access(), stat(), opendir(), chdir(), and the like, an OS may not like doing the call with a trailing slash. I've only noticed that being an issue with DJGPP and a few functions. So if it matters to you, the loop near the end of the function can easily be modified to not have the trailing slash, except in the case that the root directory is the entire path.

This function also changes the directory in the process, so it's not thread safe. But then again, many built in implementations aren't thread safe either. If you use threads, calculate all the paths you need prior to creating the threads. Which is probably a good idea, and keep using path names based off of your absolute directories in your program, instead of changing directories during the main execution elsewhere in the program. Otherwise, you'll have to use a mutex around the call, which is also a valid option.

There could also be the issue that some level of the path isn't readable. Which can happen on UNIX, where to enter a directory, one only needs execute permission, and not read permission. I'm not sure what one can do in that case, except maybe fall back on the built in one hoping it does some magical Kernel call to get around it. If anyone has any advice on this one, please post about it in the comments.

Lastly, this function is written in C++, which is annoying for C users. The std::vector can be replaced with a linked list keeping track of the components, and at the end, allocate the buffer size needed, and return the allocated buffer. This requires the user to free the buffer on the outside, but there really isn't any other safe way of doing this.
Alternatively, instead of a linked list, a buffer which is constantly reallocated can be used while building the path, constantly memmove()'ing the built components over to the higher part of the buffer.

During the course of the rest of the program, all path manipulation should be using safe allocation managing strings such as std::string, or should be based off of the above described auto allocating getcwd() and similar functions, and constantly handling the memory management, growing as needed. Be careful when you need to get any path information from elsewhere, as you can never be sure how large it will be.

I hope developers realize that when not on Windows, using the incorrect define PATH_MAX is just wrong, and fix their applications. Next time, we'll discuss how one can implement their own realpath().

Insane Coding

Sunday, November 11, 2007

Directory safety when working with paths

Friday, November 9, 2007

Implementing realpath() in C++

Saturday, November 3, 2007

PATH_MAX simply isn't

Insane Coding Sites

Blog Archive

Search This Blog

Please Add This

Insane Coding

Sunday, November 11, 2007

Directory safety when working with paths

Friday, November 9, 2007

Implementing realpath() in C++

Saturday, November 3, 2007

PATH_MAX simply isn't

Insane Coding Sites

Blog Archive

Search This Blog

Subscribe To

Please Add This