The madness of paths

While working on super-user-spark, path and validity-path, I discovered what madness lies in working with paths. The following is an overview of some of the reasons why paths are difficult to deal with.

This post was originally going to be called 'the annoyance of paths', but soon after writing the first draft, the name was changed to the current version.

Paths

A path, the general form of the name of a file or directory, specifies a unique location in a file system.

Madness! `<rant>`

Multiple path separators

If /directory/file points to a file called file in a directory called directory, where does /directory//file point? Is it a file called file in a directory called directory/? Is it a file called /file in a directory called directory? It turns out this depends on the implementation. Luckily, most of the time multiple slashes are treated as a single slash: POSIX requires this everywhere except for a path that starts with exactly two slashes, whose meaning is left to the implementation.
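To make the slash rules concrete, here is a small sketch using Python's posixpath module. Python is my choice for illustration only; the behaviour shown is that of its normpath function, which applies POSIX-style rules on any host.

```python
import posixpath

# normpath works purely on the string; it never touches the file system.
collapsed = posixpath.normpath("/directory//file")
two_leading = posixpath.normpath("//directory/file")
three_leading = posixpath.normpath("///directory/file")

print(collapsed)      # /directory/file
print(two_leading)    # //directory/file (exactly two leading slashes are kept)
print(three_leading)  # /directory/file
```

Note how only the path with exactly two leading slashes is left alone, which matches the one case that is implementation-defined.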

`/` is both the root directory and the path separator, AND it is allowed as a character.

/ is a path that points to the root directory, whatever that means. However, / is also used to separate directories. Luckily, / only means 'the root' when it's at the start of a path. This would not be as big a problem if / weren't allowed as a character in a path. Sadly, that's not the case.

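/ is a valid character in a path string, so wherever a name itself may contain one, it is necessary to escape / to differentiate between using it as a path character and using it as a path separator. (POSIX file systems sidestep this by forbidding / and NUL inside a file name, but not every system or tool is that strict.) This brings me to the next problem.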

The escape character: backslash is a valid path character

Escaping / (and other characters) is usually done with a backslash (\) character. This means that the \ character also has to be escaped to differentiate between using it as a path character and using it as an escape character.

And I haven't even mentioned systems where the backslash is itself the path separator.
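The double life of the backslash can be seen by parsing one string under two sets of rules. This is a sketch using Python's pathlib "pure" path classes, which only manipulate strings and so behave the same on any host.

```python
from pathlib import PurePosixPath, PureWindowsPath

name = "dir\\file"  # the eight characters d i r \ f i l e

posix_parts = PurePosixPath(name).parts
windows_parts = PureWindowsPath(name).parts

print(posix_parts)    # ('dir\\file',) one component containing a backslash
print(windows_parts)  # ('dir', 'file') two components
```

The same string names one file under POSIX rules and a file inside a directory under Windows rules.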

Absolute or relative?

Paths can be relative or absolute. The distinction is made by looking at the first character of a path. If the first character is a /, then it's an absolute path, otherwise it is a relative path.

The main problem with relative paths is that they have no meaning by themselves. Relative paths must be viewed relative to a given directory. Often this directory is assumed to be known, and that's not necessarily a valid assumption.
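A minimal sketch of that dependence, using Python's posixpath. The helper name resolve and the example directories are hypothetical and exist only for this illustration.

```python
import posixpath


def resolve(base: str, path: str) -> str:
    """Interpret `path` relative to `base` (purely lexical, POSIX rules)."""
    return posixpath.normpath(posixpath.join(base, path))


relative = "notes/todo.txt"

print(posixpath.isabs(relative))             # False
print(resolve("/home/alice", relative))      # /home/alice/notes/todo.txt
print(resolve("/srv/backup", relative))      # /srv/backup/notes/todo.txt
print(resolve("/home/alice", "/etc/hosts"))  # /etc/hosts (absolute paths ignore the base)
```

The same relative string lands in two different places, depending entirely on a directory that is not part of the path.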

One could argue that this already invalidates the 'specifies a unique location in a file system' part of the definition of a path.

'.' is both the current directory and the extension separator

When dealing with relative paths, '.' means 'the working directory'. This means that the path '.' has no meaning by itself.

The '.' character is also the extension separator. What is the extension of a path? What is the extension of file.txt? Is it txt or .txt?

What is the extension of file.tar.gz? Is it .tar.gz, tar.gz, .gz or gz?
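For what it's worth, here is how one particular library answers these questions. This sketch shows the choices made by Python's posixpath and pathlib. They are one convention among several, not the definitive answer.

```python
import posixpath
from pathlib import PurePosixPath

simple = posixpath.splitext("file.txt")
double = posixpath.splitext("file.tar.gz")
suffixes = PurePosixPath("file.tar.gz").suffixes

print(simple)    # ('file', '.txt')
print(double)    # ('file.tar', '.gz')
print(suffixes)  # ['.tar', '.gz']
```

So even within one standard library, the extension of file.tar.gz is either .gz or the list of .tar and .gz, depending on which function you ask.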

What does the concept of an extension even mean?!

What does '/.' mean? Is it the root directory or is it a file called '.' in the root directory? ... or is it a file in the root directory with an empty name and an empty extension?

What about file.txt/., does this path assume that file.txt is a directory or is it equivalent to file.txt?
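A purely lexical normaliser has to pick an answer without looking at the file system. As a sketch, this is what Python's posixpath picks. Whether file.txt really is a directory is never checked.

```python
import posixpath

root_dot = posixpath.normpath("/.")
file_dot = posixpath.normpath("file.txt/.")
last_component = posixpath.basename("/.")

print(root_dot)        # /
print(file_dot)        # file.txt
print(last_component)  # .
```

Note the disagreement: normpath treats '/.' as the root, while basename happily reports a last component called '.'.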

The current directory character '.' is a valid path character.

'.' is valid as a path character. However, '.' does not need to be escaped: if it is part of a file name, it separates the filename and the extension, and if it is part of a directory name, it's just part of the path. This brings me to the next problem:

The '.' character is also used to hide files.

When a filename is prefixed with a '.', it is a hidden file. This brings quite a few problems. What is the filename of .file.txt? Is it file, .file or empty? What is its extension? Is it txt or file.txt?
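Again, a library has to choose. The sketch below shows the convention in Python's posixpath and pathlib, where a leading dot is treated as part of the name rather than as an extension separator. Other tools may well decide differently.

```python
import posixpath
from pathlib import PurePosixPath

hidden_with_ext = posixpath.splitext(".file.txt")
hidden_plain = posixpath.splitext(".bashrc")
stem = PurePosixPath(".file.txt").stem

print(hidden_with_ext)  # ('.file', '.txt')
print(hidden_plain)     # ('.bashrc', '')
print(stem)             # .file
```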

File or directory?

Is file a path that points to a directory or a file? No one can know without looking there. Even worse: it can change!

This means that the interpretation of txt in file.txt is dependent on the current state of the filesystem!
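A small sketch of the same name changing its nature, using a temporary directory so that nothing outside it is touched:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "file.txt")

    # First, file.txt is a regular file.
    with open(path, "w") as handle:
        handle.write("hello")
    was_file = os.path.isfile(path)

    # Then the very same name is reused for a directory.
    os.remove(path)
    os.mkdir(path)
    is_dir_now = os.path.isdir(path)

print(was_file, is_dir_now)  # True True
```

The string never changed, yet the answer to "file or directory?" did.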

'..'

A double dot means 'the directory that can be found one level upward in the file system hierarchy'. For example, dir1/dir2/.. has the same meaning as dir1. (I can hear you thinking "no it doesn't!". Read on.) This means that there are multiple ways to represent the same place in the file system, even in absolute file paths.

It also means that what I wrote about 'no need to escape a dot' is not true! How would you write a path that points to a file called '..' in a directory called dir? Yes, then you need to escape dots: dir/\.\.. Sigh. And there are even multiple ways to do that. dir/\.., for example. Bigger sigh.

Links

Because of links in file systems, the interpretation of .. depends on the current state of the filesystem. Indeed, suppose there exists a link from dir1/dir2 to dir3/dir4. Where then does the path dir1/dir2/.. point? Naively, one would think you can cancel out dir2 and .., but no, because of the link, dir1/dir2/.. now points to dir3.
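The difference between cancelling out '..' on paper and asking the file system can be sketched as follows. This assumes a platform where os.symlink is permitted (for example Linux or macOS), and it builds the dir1/dir2 to dir3/dir4 link from the example inside a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = os.path.realpath(tmp)  # resolve any links in the temp dir itself
    os.makedirs(os.path.join(root, "dir3", "dir4"))
    os.mkdir(os.path.join(root, "dir1"))
    # dir1/dir2 is a symbolic link to dir3/dir4.
    os.symlink(os.path.join(root, "dir3", "dir4"),
               os.path.join(root, "dir1", "dir2"))

    path = os.path.join(root, "dir1", "dir2", "..")
    # normpath cancels dir2 against '..' without looking at the disk.
    lexical = os.path.relpath(os.path.normpath(path), root)
    # realpath follows the link first and only then goes up one level.
    physical = os.path.relpath(os.path.realpath(path), root)

print(lexical)   # dir1
print(physical)  # dir3
```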

An empty path?

Where does the empty path point? Sometimes it means the current directory, sometimes it is considered invalid.

So what's the directory above file? You can't just strip away everything after the last / because:

There need not be a slash (in the case of a relative path, for example)
Even if you remove everything, the empty string might not be a valid path!
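Both pitfalls show up in a few lines of Python. This is a sketch of how posixpath and pathlib behave, not a recommendation for either answer.

```python
import posixpath
from pathlib import PurePosixPath

naive_parent = posixpath.dirname("file")
root_parent = posixpath.dirname("/file")
pathlib_parent = str(PurePosixPath("file").parent)
empty_path = str(PurePosixPath(""))

print(repr(naive_parent))  # '' (stripping after the last slash leaves nothing)
print(root_parent)         # /
print(pathlib_parent)      # .
print(empty_path)          # . (pathlib quietly turns the empty path into '.')
```

One module returns the empty string as the parent of file, the other returns '.', and both are defensible.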

Shell-specific parts

Some shells give special semantics to certain characters. Never mind that these characters are already valid in paths, and thus need to be escaped in those shells. Even worse: these added semantics are so common that users now expect programs to be able to handle them even if the program is not shell-related.

For example, in the case of bash, '~' means 'the current user's home directory'. This means that '~' could point to any directory, depending on who is interpreting it. Never mind that not every user need necessarily have a home directory or that the concept of a 'user' may not even exist! In bash and zsh alike, ~user means the home directory of user, as opposed to a file called ~user.
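Programs that want to honour '~' have to reimplement the shell's behaviour themselves. As a sketch, Python's posixpath.expanduser does so by consulting the HOME environment variable. The helper expand_as and the home directories below are hypothetical and only serve to show that the result depends on who is asking.

```python
import os
import posixpath


def expand_as(home: str, path: str) -> str:
    """Expand a leading ~ as if $HOME were `home`, then restore $HOME."""
    saved = os.environ.get("HOME")
    os.environ["HOME"] = home
    try:
        return posixpath.expanduser(path)
    finally:
        if saved is None:
            del os.environ["HOME"]
        else:
            os.environ["HOME"] = saved


print(expand_as("/home/alice", "~/notes.txt"))  # /home/alice/notes.txt
print(expand_as("/root", "~/notes.txt"))        # /root/notes.txt
print(posixpath.expanduser("dir/~"))            # dir/~ (only a leading ~ is special)
```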

White space.

Spaces are valid path characters. This means that whenever paths are passed as arguments in a list of arguments that is separated by spaces, spaces need to be escaped.

However, it gets worse. Not only are spaces valid path characters, but tabs are too, and newlines and zero-width spaces.
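Here is a short sketch of what goes wrong and how quoting repairs it, using Python's shlex module, which follows POSIX shell splitting rules. The file names are made up for the example.

```python
import shlex

# Unquoted, the space splits one file name into two arguments.
naive = shlex.split("cat my file.txt")

# Quoting keeps the name in one piece.
quoted_arg = shlex.quote("my file.txt")
safe = shlex.split("cat " + quoted_arg)

# Even a newline inside a name survives the round trip once quoted.
newline_name = "two\nlines.txt"
round_trip = shlex.split(shlex.quote(newline_name))

print(naive)       # ['cat', 'my', 'file.txt']
print(quoted_arg)  # 'my file.txt'
print(safe)        # ['cat', 'my file.txt']
```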

How are paths represented as bytes?

In Unix, paths are NUL-terminated byte arrays, so technically it doesn't make much sense to talk about 'valid' characters. However, paths are mostly described via strings, which means they need to be encoded somehow. This means that the meaning of a path is application-specific.
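The sketch below shows one byte sequence turning into two different names depending on the assumed encoding, plus a name that is not valid UTF-8 at all. The byte strings are invented examples. The surrogateescape error handler is the mechanism Python offers for carrying undecodable bytes through a string and back.

```python
raw = b"caf\xc3\xa9.txt"  # bytes as they might be stored on disk

as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("latin-1")

print(as_utf8)    # café.txt
print(as_latin1)  # cafÃ©.txt

# Not every byte string is valid UTF-8; surrogateescape round-trips it anyway.
odd = b"caf\xe9.txt"
text = odd.decode("utf-8", "surrogateescape")
restored = text.encode("utf-8", "surrogateescape")

print(restored == odd)  # True
```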

It gets worse. NTFS stores filenames as sequences of 16-bit values. No encoding is enforced but UTF-16 is usually assumed. However, in portable C, fopen still takes the file name as a char* string of bytes, not as a sequence of 16-bit values.

PATH

Some systems use the contents of the shell's PATH variable. The PATH variable should contain a list of paths, but because it is just a string, and not a list, in reality it is a sequence of paths separated by colons: ':'. This would be fine if it were not the case that ':' is a valid path character. ':' needs to be escaped in paths, but only if the paths are in the PATH variable.
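To see the problem, consider a naive splitter. The function name split_path_variable and the directory names are hypothetical, and this is only a sketch of the plain split-on-colon approach.

```python
def split_path_variable(value: str) -> list[str]:
    """Split a POSIX-style PATH value on ':' without any escaping rules."""
    return value.split(":")


# The first entry was meant to be the single directory /opt/my:tools/bin.
entries = split_path_variable("/opt/my:tools/bin:/usr/bin")

print(entries)  # ['/opt/my', 'tools/bin', '/usr/bin']
```

One directory with a colon in its name silently becomes two entries, one of which is a relative path.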

Platform specific

All of the above is at least partly untrue on at least one platform.

On Windows, absolute paths cannot be recognised by the first character because of the concept of a 'drive'.
On some systems / is not the path separator. The path separator can vary wildly.
On Windows, there is no single 'root'. Instead there could be many so-called drives and they are denoted by, for example, C:. Moreover, C: on its own does NOT denote the root directory of the drive; it is relative to the current directory on that drive. To point to the root directory of the C: drive, you need a backslash: C:\.
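Python's ntpath module and PureWindowsPath class implement Windows path rules as pure string operations, so the drive behaviour can be sketched on any host. What follows is how that library splits things up, under the assumption that it mirrors Windows conventions faithfully.

```python
import ntpath
from pathlib import PureWindowsPath

drive, rest = ntpath.splitdrive("C:\\dir\\file")
root_is_abs = ntpath.isabs("C:\\")
drive_relative_is_abs = ntpath.isabs("C:file")
anchor = PureWindowsPath("C:\\dir\\file").anchor

print(drive, rest)            # C: \dir\file
print(root_is_abs)            # True
print(drive_relative_is_abs)  # False (a drive letter alone does not make a path absolute)
print(anchor)                 # C:\
```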

Program-specific semantics.

Up until now I have only been ranting about paths that intend to point to a single file or directory. However, often paths are used for much more than that, sadly. Applications assign their own semantics to path-like strings. For example:

All these special semantics make paths even harder to deal with in shell-programming.

`</rant>`

I'll save you the blog post called 'The hell of URIs'.