The madness of paths
While working on super-user-spark, path and validity-path, I discovered what madness lies in working with paths. The following is an overview of some of the reasons why paths are difficult to deal with.
This post was originally going to be called ‘the annoyance of paths’, but soon after writing the first draft, the name was changed to the current version.
Multiple path separators
/directory/file points to a file called file in a directory called directory. But where does /directory//file point? Is it a file called file in a directory called directory/? Is it a file called /file in a directory called directory? It turns out this is implementation-specific. Luckily, most of the time multiple slashes are to be treated as a single slash.
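As an illustration of the "treated as a single slash" convention, here is a minimal sketch using Python's posixpath module, which works on the path purely as text. Python is only one possible interpreter of paths; the example paths are the ones from the paragraph above, plus two variants with leading slashes. The second line is the interesting one: POSIX leaves the meaning of exactly two leading slashes to the implementation, so normpath keeps them.

```python
import posixpath


def collapse_slashes(path):
    """Normalise a POSIX-style path as text, without touching the file system."""
    return posixpath.normpath(path)


print(collapse_slashes("/directory//file"))   # /directory/file
print(collapse_slashes("//directory/file"))   # //directory/file
print(collapse_slashes("///directory/file"))  # /directory/file
```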
/ is both the root directory and the path separator, AND it is allowed as a character.
/ is a path that points to the root directory, whatever that means. / is also used to separate directories. Luckily, / only means ‘the root’ when it’s at the start of a path. This would not be as big of a problem if / wasn’t allowed as a character in a path. Sadly, that’s not the case. / is a valid character in a path, so it is necessary to escape / to differentiate between using it as a path character and using it as a path separator. This brings me to the next problem.
The escape character: backslash is a valid path character
Escaping / (and other characters) is usually done with a backslash (\) character. This means that the \ character also has to be escaped to differentiate between using it as a path character and using it as an escape character.
Absolute or relative?
Paths can be relative or absolute. The distinction is made by looking at the first character of a path. If the first character is a /, then it’s an absolute path, otherwise it is a relative path.
The main problem with relative paths is that they have no meaning by themselves. Relative paths must be viewed relative to a given directory. Often this directory is assumed to be known, and that’s not necessarily a valid assumption.
One could argue that this already invalidates the ‘specifies a unique location in a file system’ part of the definition of a path.
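To make the "no meaning by themselves" point concrete, here is a small sketch in Python. The helper resolve and the directories /home/alice and /tmp are hypothetical names chosen for the example; the only library calls are posixpath.isabs, posixpath.join and posixpath.normpath.

```python
import posixpath


def resolve(base_dir, path):
    """Interpret `path` relative to `base_dir` unless it is already absolute."""
    if posixpath.isabs(path):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join(base_dir, path))


# The same relative path names two different places.
print(resolve("/home/alice", "notes/todo.txt"))  # /home/alice/notes/todo.txt
print(resolve("/tmp", "notes/todo.txt"))         # /tmp/notes/todo.txt
# An absolute path ignores the base directory.
print(resolve("/tmp", "/etc/hosts"))             # /etc/hosts
```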
‘.’ is both the current directory and the extension separator
When dealing with relative paths, ‘.’ means ‘the working directory’. This means that the path ‘.’ has no meaning by itself.
The ‘.’ character is also the extension separator. What is the extension of a path? What is the extension of file.txt? What is the extension of file.tar.gz? What does the concept of an extension even mean?!
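For what it is worth, here is how one library answers these questions. This is only Python's convention, not a universal definition: the extension includes the dot, and by default only the last one counts.

```python
import posixpath
from pathlib import PurePosixPath

# splitext returns (root, extension); the extension keeps its leading dot.
print(posixpath.splitext("file.txt"))         # ('file', '.txt')
print(posixpath.splitext("file.tar.gz"))      # ('file.tar', '.gz')

# pathlib offers both readings of 'file.tar.gz'.
print(PurePosixPath("file.tar.gz").suffix)    # .gz
print(PurePosixPath("file.tar.gz").suffixes)  # ['.tar', '.gz']
```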
What does ‘/.’ mean? Is it the root directory or is it a file called ‘.’ in the root directory? … or is it a file in the root directory with an empty name and an empty extension?
And what about file.txt/.? Does this path assume that file.txt is a directory, or is it equivalent to file.txt?
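Even within a single library, the answer depends on which function you ask. A short sketch with Python's posixpath: splitting treats the final ‘.’ as a name, while normalising treats it as ‘the same directory’ and drops it.

```python
import posixpath

# Splitting: '.' is the last component, like any other name.
print(posixpath.split("/."))             # ('/', '.')
print(posixpath.basename("file.txt/."))  # .

# Normalising: '.' is removed as a no-op.
print(posixpath.normpath("/."))          # /
print(posixpath.normpath("file.txt/."))  # file.txt
```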
The current directory character ‘.’ is a valid path character.
‘.’ is valid as a path character. However, ‘.’ does not need to be escaped, because if it is part of a file path, then it separates the filename and extension, but if it is part of a directory then it’s just a part of the path. This brings me to the next problem:
The ‘.’ character is also used to hide files.
When a filename is prefixed with a ‘.’, it is a hidden file. This brings quite a few problems. What is the filename of .file.txt? Is it .file or empty? And what is its extension?
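Again, a library has to pick an answer. In this sketch, Python decides that a leading dot belongs to the name and never starts an extension. The name .bashrc is just a familiar example added here, not something from the text above.

```python
import posixpath
from pathlib import PurePosixPath

print(posixpath.splitext(".file.txt"))        # ('.file', '.txt')
print(posixpath.splitext(".bashrc"))          # ('.bashrc', '')

# pathlib agrees: the whole name is the stem, and the suffix is empty.
print(PurePosixPath(".bashrc").stem)          # .bashrc
print(repr(PurePosixPath(".bashrc").suffix))  # ''
```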
File or directory?
Is file a path that points to a directory or a file? No one can know without looking there. Even worse: it can change! This means that the interpretation of file.txt is dependent on the current state of the file system.
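A minimal sketch of that state dependence, in Python. The helpers classify and demo are hypothetical names for this example; everything happens inside a temporary directory that is removed afterwards.

```python
import os
import tempfile


def classify(path):
    """Ask the file system what `path` currently is."""
    if os.path.isdir(path):
        return "directory"
    if os.path.isfile(path):
        return "file"
    return "missing"


def demo():
    """Classify the same path string at three moments in time."""
    results = []
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "file.txt")
        results.append(classify(path))  # nothing there yet
        with open(path, "w"):
            pass                        # create an empty regular file
        results.append(classify(path))
        os.remove(path)
        os.mkdir(path)                  # same name, now a directory
        results.append(classify(path))
    return results


print(demo())  # ['missing', 'file', 'directory']
```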
A double dot means ‘the directory that can be found one level upward in
the file system hierarchy’. For example,
dir1/dir2/.. has the
same meaning as
dir1. (I can hear you thinking “no it doesn’t!”.
Read on.) This means that there are multiple ways to represent the same place
in the file system, even in absolute file paths.
It also means that what I wrote about ‘no need to escape a dot’ is not true! How would you write a path that points to a file called ‘..’ in a directory called dir? Yes, then you need to escape dots: dir/\.\.. Sigh. And there are even multiple ways to do that: dir/\.., for example. Bigger sigh.
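Because of links in file systems, the interpretation of .. depends on the current state of the filesystem. Indeed, suppose one of the directories in a path is a link to dir3/dir4. Where then does that path point when it is followed by ..? Naively, one would think you can cancel out a directory name against the .. that follows it, but no: because of the link, the result can be a different directory altogether.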
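Here is a sketch that reproduces the situation in a temporary directory. The concrete layout is an assumption made for the example: a symbolic link dir1/dir2 pointing at dir3/dir4. It assumes a platform where the process is allowed to create symbolic links. normpath cancels names as text; realpath asks the file system and follows the link first.

```python
import os
import tempfile


def parent_of_link(root):
    """Create root/dir3/dir4 and a link root/dir1/dir2 -> root/dir3/dir4,
    then resolve root/dir1/dir2/.. lexically and physically."""
    os.makedirs(os.path.join(root, "dir3", "dir4"))
    os.makedirs(os.path.join(root, "dir1"))
    os.symlink(os.path.join(root, "dir3", "dir4"),
               os.path.join(root, "dir1", "dir2"))
    path = os.path.join(root, "dir1", "dir2", "..")
    lexical = os.path.normpath(path)   # cancels 'dir2/..' as text
    physical = os.path.realpath(path)  # follows the link, then goes up
    return os.path.basename(lexical), os.path.basename(physical)


with tempfile.TemporaryDirectory() as tmp:
    # Resolve the temporary directory itself first, in case it sits behind a link.
    print(parent_of_link(os.path.realpath(tmp)))  # ('dir1', 'dir3')
```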
An empty path?
Where does the empty path point? Sometimes it means the current directory, sometimes it is considered invalid.
So what’s the directory above file? You can’t just strip away everything after the last /:
- There need not be a slash (in the case of a relative path, for example)
- Even if you remove everything, the empty string might not be a valid path!
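Both points show up in practice. In this sketch, Python's posixpath.dirname does the naive stripping and returns the empty string for a bare file name, while pathlib chooses to answer ‘.’ instead. Neither is the one true answer; they are two conventions.

```python
import posixpath
from pathlib import PurePosixPath

print(repr(posixpath.dirname("dir/file")))  # 'dir'
print(repr(posixpath.dirname("/file")))     # '/'
print(repr(posixpath.dirname("file")))      # ''

# pathlib avoids the empty path by answering '.'.
print(PurePosixPath("file").parent)         # .
# The root is its own parent.
print(PurePosixPath("/").parent)            # /
```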
Some shells give special semantics to certain characters. Never mind that these characters are already valid in paths, and thus need to be escaped in those shells. Even worse: these added semantics are so common that users now expect programs to be able to handle them even if the program is not shell-related.
For example, in the case of bash, ‘~’ means ‘the current user’s home directory’. This means that ‘~’ could point to any directory, depending on who is interpreting it. Never mind that not every user need necessarily have a home directory or that the concept of a ‘user’ may not even exist! In the case of zsh, ~user means the home directory of user, as opposed to a file called ~user.
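Programs that are not shells often imitate this behaviour. As a sketch, Python's posixpath.expanduser consults the HOME environment variable, so the same string expands differently depending on the environment. The helper expand_as and the home directories are hypothetical; the helper temporarily overrides HOME and restores it afterwards.

```python
import os
import posixpath


def expand_as(home, path):
    """Expand a leading ~ as if the HOME environment variable were `home`."""
    saved = os.environ.get("HOME")
    os.environ["HOME"] = home
    try:
        return posixpath.expanduser(path)
    finally:
        if saved is None:
            del os.environ["HOME"]
        else:
            os.environ["HOME"] = saved


print(expand_as("/home/alice", "~/notes"))  # /home/alice/notes
print(expand_as("/root", "~/notes"))        # /root/notes
# Only a leading ~ is special; elsewhere it is an ordinary character.
print(expand_as("/home/alice", "./~"))      # ./~
```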
Spaces are valid path characters. This means that whenever paths are passed as arguments in a list of arguments that is separated by spaces, spaces need to be escaped.
However, it gets worse. Not only are spaces valid path characters, but tabs are too, and newlines and zero-width spaces.
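A short sketch of the problem and the usual remedy, using Python's shlex module, which follows POSIX shell quoting rules. The file names are made up for the example.

```python
import shlex

name = "my file.txt"

# Unquoted, the space splits one path into two arguments.
print(shlex.split("rm " + name))               # ['rm', 'my', 'file.txt']

# Quoting keeps it together.
print(shlex.quote(name))                       # 'my file.txt'
print(shlex.split("rm " + shlex.quote(name)))  # ['rm', 'my file.txt']

# The same applies to tabs and newlines inside a name.
odd = "tab\there\nnewline"
print(shlex.split(shlex.quote(odd)) == [odd])  # True
```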
How are paths represented as bytes?
In Unix, paths are NUL-terminated byte arrays, so it technically doesn’t necessarily make sense to talk about ‘valid’ characters. However, paths are mostly described via strings, which means they need to be encoded somehow. This means that the meaning of a path is application-dependent.
It gets worse. NTFS stores filenames as sequences of 16-bit values. No encoding is enforced but UTF-16 is usually assumed. However, when using C in a portable way, fopen still takes the file name as a string of char, not as a sequence of 16-bit values.
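To illustrate how the meaning of a byte-array path depends on the application, here is a sketch that reads the same file name bytes under two encodings. The byte strings and the helper readings are made up for the example. The last part shows the surrogateescape error handler, which Python itself uses on POSIX systems to carry undecodable file name bytes through a string and back.

```python
def readings(raw):
    """Decode the same file name bytes under two different assumptions."""
    return raw.decode("utf-8"), raw.decode("latin-1")


as_utf8, as_latin1 = readings(b"caf\xc3\xa9.txt")
# Two different strings of different lengths for the same bytes.
print(len(as_utf8), len(as_latin1))  # 8 9

# These bytes are not valid UTF-8 at all, yet they are a legal Unix file name.
bad = b"caf\xe9.txt"
text = bad.decode("utf-8", errors="surrogateescape")
print(text.encode("utf-8", errors="surrogateescape") == bad)  # True
```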
Some systems use the contents of the shell’s PATH variable to find programs. The PATH variable should contain a list of paths, but because it is just a string, and not a list, in reality it is a sequence of paths separated by colons: ‘:’. This would be fine if it were not the case that ‘:’ is a valid path character. ‘:’ needs to be escaped in paths, but only if the paths are in the PATH variable.
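A tiny sketch of what goes wrong when a directory name contains a colon and the variable is split naively. The helper split_search_path and the directory /opt/odd:name/bin are hypothetical.

```python
def split_search_path(value, separator=":"):
    """Split a PATH-like string the naive way, on every separator."""
    return value.split(separator)


print(split_search_path("/usr/bin:/bin"))
# ['/usr/bin', '/bin']

# One directory with a colon in its name turns into two bogus entries.
print(split_search_path("/usr/bin:/opt/odd:name/bin"))
# ['/usr/bin', '/opt/odd', 'name/bin']
```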
All of the above is at least partly untrue on at least one platform.
- On Windows, absolute paths cannot be recognised by the first letter because of the concept of a ‘drive’.
- On some systems / is not the path separator. The path separator can vary wildly.
- On Windows, there is no single ‘root’. Instead there could be many so-called drives and they are denoted by, for example, C:\. Moreover, in C:\, the backslash does NOT denote the root directory. Instead, to point to the root directory of the C:\ drive, you need another slash.
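The platform dependence can be seen without leaving one machine, because Python ships both the Windows rules (ntpath) and the POSIX rules (posixpath) on every platform. This is a sketch of how one library models the difference, with made-up example paths; it is not a statement about what Windows itself does in every API.

```python
import ntpath
import posixpath

# Under Windows rules the drive is split off before anything else.
print(ntpath.splitdrive("C:\\dir\\file"))  # ('C:', '\\dir\\file')
print(ntpath.isabs("C:\\dir\\file"))       # True
# A drive letter alone does not make a path absolute.
print(ntpath.isabs("C:dir\\file"))         # False

# Under POSIX rules the same string is just a relative name.
print(posixpath.isabs("C:\\dir\\file"))    # False
print(posixpath.isabs("/dir/file"))        # True
```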
Up until now I have only been ranting about paths that intend to point to a single file or directory. However, often paths are used for much more than that, sadly. Applications assign their own semantics to path-like strings. For example:
- gitignore uses path-like patterns.
- rm won’t let you remove a symlink to a directory if there’s a slash at the end
- Some commands treat dir/ differently from dir as a destination
- rsync will distinguish between dir/ and dir as a central part of its functionality
All these special semantics make paths even harder to deal with in shell-programming.
I’ll save you the blog post called ‘The hell of URIs’.