Introduction

dir_tree_walk is a PHP function for processing directory tree hierarchies. It was inspired by Perl's File::Find and Python's os.walk.

It was written with Unix in mind, and while it will probably work on other operating systems, the security guarantees that it makes will not apply. (More on security below.)

It doesn't work in PHP 4.

The current release is 1.1.0. The software and documentation, which may be used without any restrictions, are available here.

Quick start

Download the archive, extract the file dir_tree_walk.php, and load it into your application. It defines the following function:

void dir_tree_walk(string $top, callback $process_func [, array $option_val])

This traverses the entire directory tree and, by default, processes each directory before entering it.

$top is the directory at which the traversal starts.

$process_func is applied to $top, and to every filesystem entity under it (i.e. to every file, directory, etc.).

$option_val is an array of keys and values which modifies the behaviour of dir_tree_walk. The most interesting keys are preprocess and postprocess, both of which have callback functions as their values.

For every directory in the tree, a directory listing is produced and then handed to the preprocess function. This function is expected to return a list of directory entries. Most often, it examines the entries to see which should be removed from the list.

When processing of a directory has completed, the postprocess function is called, with the directory's name as its first argument.
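To make the overview concrete, here is a minimal sketch of a call that uses all three kinds of callback. The function names show_entity, skip_hidden and directory_done, and the directory name images, are made up for illustration; the call itself is guarded so that the snippet also runs when dir_tree_walk.php has not been loaded.

```php
<?php
// Assumes dir_tree_walk.php has already been loaded, e.g. with require.

// Hypothetical $process_func: print the convenient name of each entity.
function show_entity($basename, $convenient, $absolute)
{
    echo $convenient, "\n";
}

// Hypothetical preprocess callback: drop entries whose names start with a dot.
function skip_hidden($entries, $convenient, $absolute)
{
    $kept = array();
    foreach ($entries as $entry) {
        if (substr($entry, 0, 1) !== '.') {
            $kept[] = $entry;
        }
    }
    return $kept;
}

// Hypothetical postprocess callback: report that a directory is finished.
function directory_done($convenient, $absolute)
{
    echo "finished ", $convenient, "\n";
}

// "images" is an assumed directory name; substitute your own.
if (function_exists('dir_tree_walk')) {
    dir_tree_walk('images', 'show_entity', array(
        'preprocess'  => 'skip_hidden',
        'postprocess' => 'directory_done',
    ));
}
```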

The API in full

(If you prefer to see examples first, you can find some here; you've already learned enough to understand them.)

dir_tree_walk

We begin by explaining the function dir_tree_walk, with reference to the following prototype:

void dir_tree_walk(string $top, callback $process_func [, array $option_val])

$top

Must be a directory.

$process_func

A callback function. It is called for $top itself and for every entity under $top (i.e. every file, every directory, and every other filesystem object). It is guaranteed that we are in the same directory as the entity, unless we are processing $top, in which case no guarantee is made about the current directory.

$process_func is passed 3 arguments, all of which are names for the entity. The first argument is the basename, the second argument is the convenient name, and the third argument is the absolute name.

Convenient name and absolute name

These terms have a special meaning in this documentation. The convenient name is built from $top. It is guaranteed to be a valid name for the entity if we are in $initial_dir, where $initial_dir represents the current directory at the time that dir_tree_walk was called.

The absolute name is (unsurprisingly) an absolute pathname; if $top begins with a slash, then the absolute name is the same as the convenient name.

As an example, suppose you write the following code:

$document_root = "/var/www";
chdir($document_root);
dir_tree_walk("images/icons", "process_entity");

Suppose there's a file called /var/www/images/icons/ordering/hand.png. When process_entity is called in order to process this file, its second argument (the convenient name) will be images/icons/ordering/hand.png, and its third argument (the absolute name) will be /var/www/images/icons/ordering/hand.png.
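A possible definition of process_entity for the code above is sketched here. The helper describe_entity is hypothetical and exists only to show the three names side by side; the call is guarded so that the snippet runs even where the library or the directory is absent.

```php
<?php
// Hypothetical helper: join the three names that $process_func receives.
function describe_entity($basename, $convenient, $absolute)
{
    return sprintf("%s | %s | %s", $basename, $convenient, $absolute);
}

// The $process_func named in the example above.
function process_entity($basename, $convenient, $absolute)
{
    echo describe_entity($basename, $convenient, $absolute), "\n";
}

// Only attempt the walk if the library is loaded and the directory exists.
if (function_exists('dir_tree_walk') && is_dir('/var/www/images/icons')) {
    chdir('/var/www');
    dir_tree_walk('images/icons', 'process_entity');
}
```

For hand.png, this would print hand.png | images/icons/ordering/hand.png | /var/www/images/icons/ordering/hand.png.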

If the examples were real-world code, some of them (for instance, the second) would use the second or third argument of $process_func, rather than the first.

$option_val

An optional argument, which defaults to an empty array if omitted. It maps zero or more string keys to values. The possible keys are as follows:

prune

This defaults to FALSE. If set to TRUE, then the return value of $process_func, which is normally ignored, will be checked. If that value, when cast to a boolean, is TRUE, then dir_tree_walk will not enter the directory (i.e. the directory for which $process_func was run).

As an example, assume prune was set to TRUE, and that $process_func returns TRUE when run on the directory images; then dir_tree_walk will not descend into images.
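The following sketch shows one way prune might be used: skipping version-control directories by name. The function name skip_version_control and the list of directory names are assumptions made for this illustration.

```php
<?php
// Hypothetical $process_func for use with prune set to TRUE.
// Returning TRUE for a directory asks dir_tree_walk not to enter it.
function skip_version_control($basename, $convenient, $absolute)
{
    $pruned = array('.git', '.svn', 'CVS');
    if (in_array($basename, $pruned, true)) {
        return true;    // do not descend into this directory
    }
    echo $convenient, "\n";
    return false;       // carry on as normal
}

// Guarded so that the snippet runs without the library loaded.
if (function_exists('dir_tree_walk')) {
    dir_tree_walk('.', 'skip_version_control', array('prune' => true));
}
```

Note that the pruned directory is still passed to $process_func; it is only the descent into it that is suppressed.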

preprocess

A callback function, called for every directory which is processed. When it is called for a directory D, the current directory is guaranteed to be D.

The function is called just after a directory listing has been obtained, and its return value is used to replace that directory listing. If an entry is eliminated by this function from the directory listing, that entry will not be processed any further: it will not be passed to $process_func and will not be descended into.

It is guaranteed that directory entries are processed in the order in which this function returns them.

The function receives 3 arguments. The first argument is a list of entities in the current directory, with each entity being represented by its basename. The list does not contain “.” or “..”.

The second argument is the convenient name of the current directory, and the third argument is the absolute name of the current directory.

The default value for this key is NULL, which means that no function will be called.
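As an illustration, a preprocess function can both filter and reorder the listing. The sketch below drops editor backup files (names ending in a tilde) and sorts what remains, so that entries are processed alphabetically; the function name sorted_without_backups is hypothetical.

```php
<?php
// Hypothetical preprocess callback: remove backup files and sort the rest.
function sorted_without_backups($entries, $convenient, $absolute)
{
    $kept = array();
    foreach ($entries as $entry) {
        // Skip names such as "notes.txt~".
        if (substr($entry, -1) !== '~') {
            $kept[] = $entry;
        }
    }
    // Entries are processed in the order returned, so sort them here.
    sort($kept, SORT_STRING);
    return $kept;
}
```

It would be passed as array('preprocess' => 'sorted_without_backups') in $option_val.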

postprocess

A callback function, called when processing of a directory has completed. When it is called for a directory D, the current directory is guaranteed to be D.

It is called with 2 arguments: the first is the convenient name of the current directory, and the second is the absolute name of the current directory.

The default value for this key is NULL, which means that no function will be called.

top_down

This defaults to TRUE. If set to FALSE, it causes the contents of a directory to be processed before the directory itself. In other words, if top_down is set to FALSE, then by the time $process_func is called for some directory D, it will already have been called for each entry in D, and each directory in D will already have been descended into.
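Bottom-up traversal is useful when a directory's result depends on its contents. The sketch below totals sizes in the manner of du; the class DiskUsage is hypothetical, and it assumes that absolute names carry no trailing slash, so that dirname yields the parent's absolute name.

```php
<?php
// Hypothetical helper that accumulates subtree sizes during a bottom-up walk.
class DiskUsage
{
    // Running totals, keyed by the absolute name of a directory.
    public $totals = array();

    // Add an entity's own size, plus anything recorded beneath it,
    // to its parent's running total. Returns the entity's subtree size.
    public function record($absolute, $own_size)
    {
        // With top_down set to FALSE, everything inside $absolute has
        // already been recorded, so this figure is final.
        $below = isset($this->totals[$absolute]) ? $this->totals[$absolute] : 0;
        $size = $own_size + $below;
        $parent = dirname($absolute);
        if (!isset($this->totals[$parent])) {
            $this->totals[$parent] = 0;
        }
        $this->totals[$parent] += $size;
        return $size;
    }

    // Used as $process_func: look up the size and print the subtree total.
    public function process($basename, $convenient, $absolute)
    {
        $info = @lstat($absolute);   // lstat, so symlinks are not followed
        $own = ($info === false) ? 0 : $info['size'];
        echo $this->record($absolute, $own), "\t", $convenient, "\n";
    }
}

// Guarded so that the snippet runs without the library loaded.
if (function_exists('dir_tree_walk')) {
    $usage = new DiskUsage();
    dir_tree_walk('.', array($usage, 'process'), array('top_down' => false));
}
```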

follow_links

This defaults to FALSE. If set to TRUE, it causes dir_tree_walk to descend into symbolic links that point to directories.

You should be cautious about setting this to TRUE, particularly if you're operating on a directory tree which may be modified by untrusted users.

A symlink is always processed with $process_func, regardless of the value of follow_links.

chdir_failure_action

A callback function, called any time that dir_tree_walk fails to enter a directory (using chdir). It is called with 3 arguments. The first argument is the convenient name of the directory in question, and the second argument is its absolute name. The third argument is NULL, unless track_errors has been enabled, in which case it is the value of $php_errormsg immediately after the call to chdir.

The default value for this key is NULL, which means that no function will be called.

Note that, aside from calling this function, dir_tree_walk takes no special actions upon failure to enter a directory. This means that if failure to enter a directory should be considered an error, you will have to set an appropriate value for chdir_failure_action, i.e. you will have to set it to a function which exits or throws an exception.
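A sketch of a chdir_failure_action that turns the failure into an exception follows. The class CannotEnterDirectory and the function fail_on_chdir are invented for this example and are not part of the API.

```php
<?php
// Hypothetical exception class; not defined by dir_tree_walk.php.
class CannotEnterDirectory extends RuntimeException
{
}

// Hypothetical chdir_failure_action: abort the walk with an exception.
function fail_on_chdir($convenient, $absolute, $message)
{
    // $message is NULL unless track_errors has been enabled.
    $detail = ($message === null) ? 'no further detail' : $message;
    throw new CannotEnterDirectory("cannot enter $absolute: $detail");
}

// Guarded so that the snippet runs without the library loaded.
if (function_exists('dir_tree_walk')) {
    dir_tree_walk('.', 'fail_on_chdir_demo_process', array(
        'chdir_failure_action' => 'fail_on_chdir',
    ));
}

// Hypothetical $process_func that does nothing.
function fail_on_chdir_demo_process($basename, $convenient, $absolute)
{
}
```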

Exceptions

The API also includes a number of exception classes, shown in the following inheritance diagram. All are defined with empty bodies (except for RuntimeException, which of course is built into PHP).

OSException and SecurityException are never thrown; they exist only to be inherited from.

InvalidTopException is thrown when dir_tree_walk is passed an invalid directory (for example, one that doesn't exist).

The remaining exceptions are very unlikely to be thrown. ReturnToDirException is thrown if a call to chdir fails, when attempting to move up the directory hierarchy to a directory we were in previously. OpendirException is thrown if a call to opendir fails. DirSwitchException is thrown in certain cases when a “directory switch” appears to have occurred. (The notion of a directory switch is explained in the section on security.)

These classes, together with the function dir_tree_walk, constitute the API; everything else defined in the file dir_tree_walk.php is for internal use only.

While dir_tree_walk is running, the current directory may be changed many times; so if you are handling an exception thrown while dir_tree_walk was executing, you can't make any assumptions about where you are.

If no exception is thrown, then the current directory immediately after a call to dir_tree_walk will be the same as it was immediately before the call.
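If your code depends on the current directory, one option is to record it before the call and restore it when an exception is caught. The wrapper below is a sketch of that idea; call_restoring_cwd and ignore_entity are hypothetical names, and the anonymous function requires PHP 5.3 or later.

```php
<?php
// Hypothetical wrapper: run $walk, and if it throws, return to the
// directory we started in before passing the exception on.
function call_restoring_cwd($walk)
{
    $initial_dir = getcwd();
    try {
        return call_user_func($walk);
    } catch (Exception $e) {
        // The current directory is unknown at this point.
        if ($initial_dir !== false) {
            chdir($initial_dir);
        }
        throw $e;
    }
}

// Hypothetical $process_func that does nothing.
function ignore_entity($basename, $convenient, $absolute)
{
}

// Guarded so that the snippet runs without the library loaded.
if (function_exists('dir_tree_walk')) {
    try {
        call_restoring_cwd(function () {
            dir_tree_walk('.', 'ignore_entity');
        });
    } catch (RuntimeException $e) {
        echo "walk failed: ", $e->getMessage(), "\n";
    }
}
```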

Callback functions

Callback functions need not be strings; they can be of any variety understood by PHP. For example, a value of the form array($object, "method_name") is a valid callback function for dir_tree_walk.
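Object methods are handy when callbacks need to share state. In the sketch below, a hypothetical EntryCounter object serves as both $process_func and postprocess; it assumes that absolute names carry no trailing slash, so that dirname of an entity's absolute name equals its directory's absolute name.

```php
<?php
// Hypothetical helper that counts the entries seen in each directory.
class EntryCounter
{
    // Number of entries seen, keyed by the directory's absolute name.
    public $counts = array();

    // Used as $process_func: tally the entity under its parent directory.
    public function count_entity($basename, $convenient, $absolute)
    {
        $parent = dirname($absolute);
        if (!isset($this->counts[$parent])) {
            $this->counts[$parent] = 0;
        }
        $this->counts[$parent]++;
    }

    // Used as postprocess: the directory is complete, so report its total.
    public function report($convenient, $absolute)
    {
        $total = isset($this->counts[$absolute]) ? $this->counts[$absolute] : 0;
        echo $convenient, ": ", $total, " entries\n";
        return $total;
    }
}

// Guarded so that the snippet runs without the library loaded.
if (function_exists('dir_tree_walk')) {
    $counter = new EntryCounter();
    dir_tree_walk('.', array($counter, 'count_entity'), array(
        'postprocess' => array($counter, 'report'),
    ));
}
```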

Security

Software which recursively processes a directory tree can be vulnerable to certain attacks. Consider a process P, running as root, which is manipulating a directory tree. Suppose that there's a malicious process, M, running on the same computer. The following sequence of events may occur.

  1. P decides to enter a directory D.
  2. The operating system decides to stop executing P and schedule M instead.
  3. M renames D. It then creates a symlink having the same name as D and pointing to some system directory. (This is what we referred to above as a “directory switch”.)
  4. At some later time, execution of P resumes. It enters D using chdir and deletes every file within.

dir_tree_walk has a defense against this danger. Simplifying, it calls stat before and after each call to chdir and compares the 2 results. If they differ, it concludes that an attempt has been made to breach security, and, refusing to process the current directory, returns to the previous directory and continues its work.
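The idea can be sketched as follows. This is not the library's actual code: the function name checked_chdir is hypothetical, and the use of lstat for the first check (so that a symlink swapped in for D never matches) is an assumption made for this illustration. It relies on Unix device and inode numbers.

```php
<?php
// Hypothetical sketch: enter $name only if the directory we end up in
// is the same filesystem object that we examined beforehand.
function checked_chdir($name)
{
    $previous = getcwd();
    // lstat describes a symlink itself rather than its target.
    $before = @lstat($name);
    if ($before === false || !@chdir($name)) {
        return false;
    }
    $after = @stat('.');
    $same = $after !== false
        && $before['dev'] === $after['dev']
        && $before['ino'] === $after['ino'];
    if (!$same && $previous !== false) {
        // Something changed between the check and the chdir: back out.
        chdir($previous);
    }
    return $same;
}
```

With this approach a symlink is always refused, since its own inode differs from that of the directory it points to.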

A very similar scenario is the following. As before, suppose that P is a process running as root and that M is a malicious process running on the same computer. The following sequence of events may occur.

  1. P decides to modify a file F.
  2. The operating system decides to stop executing P and schedule M instead.
  3. M renames F. It then creates a symlink having the same name as F and pointing to some system file (the password file, say).
  4. At some later time, execution of P resumes. It modifies F (that is, it modifies the password file), corrupting it so that no-one can log into the system.

dir_tree_walk is not directly affected by this problem, but the callback functions are. If there is a possibility that it might affect you, you will have to write your callback functions carefully.
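One way a callback might guard against this is to compare the file it has opened with the file it examined beforehand. The sketch below is only an illustration under Unix assumptions (device and inode numbers); open_if_unswitched is a hypothetical name, and the technique narrows the window for attack rather than providing a complete defense.

```php
<?php
// Hypothetical sketch: open $basename, but hand back the handle only if
// the opened file is the same object that lstat saw. A symlink never
// matches, because lstat reports the link and fstat reports its target.
function open_if_unswitched($basename, $mode = 'r+')
{
    $before = @lstat($basename);
    if ($before === false) {
        return false;
    }
    // The default mode 'r+' neither creates nor truncates, so opening
    // the wrong file does no damage before the comparison below.
    $handle = @fopen($basename, $mode);
    if ($handle === false) {
        return false;
    }
    $after = fstat($handle);
    if ($after === false
        || $before['dev'] !== $after['dev']
        || $before['ino'] !== $after['ino']) {
        fclose($handle);
        return false;
    }
    return $handle;
}
```

A $process_func would call this with its first argument (the basename) and modify the file only through the returned handle.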

Renaming or recreating directories

Renaming or recreating a directory can cause problems for this software. Consider the following sequence of shell commands, where findphotos is an imaginary command that uses dir_tree_walk.

cd holidays
findphotos . &
mv Spain Spain2010

Suppose that, when the mv command is executed, findphotos is 3 levels below Spain. When it tries to return to Spain, it will use the full pathname (rather than “..”), and this will fail with a ReturnToDirException, because there will no longer be any directory with that name.

Or consider the following sequence of commands.

cd holidays
findphotos . &
mv Spain Spain-thumbnails
mv ../Spain Spain

Suppose that, when the two mv commands are executed, findphotos is 3 levels below Spain. When it tries to return to Spain, it will throw a DirSwitchException, because Spain is now different, in the sense of having a different inode number.

Similar software

It appears to be possible to use RecursiveDirectoryIterator to traverse directory trees; see the example in the documentation for its constructor.
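For comparison, here is a minimal sketch of a traversal using the built-in iterators, which exist in PHP 5.3 and later. The function name list_tree is hypothetical. Unlike dir_tree_walk, this approach does not change the current directory and makes none of the security checks described above.

```php
<?php
// Hypothetical helper: return the pathname of everything under $top,
// listing each directory before its contents, sorted by name.
function list_tree($top)
{
    $names = array();
    $iterator = new RecursiveIteratorIterator(
        // SKIP_DOTS leaves out the "." and ".." entries.
        new RecursiveDirectoryIterator($top, FilesystemIterator::SKIP_DOTS),
        // SELF_FIRST yields a directory before the entries inside it.
        RecursiveIteratorIterator::SELF_FIRST
    );
    foreach ($iterator as $pathname => $info) {
        $names[] = $pathname;
    }
    sort($names, SORT_STRING);
    return $names;
}
```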

PEAR has a package called File_Find. I have no comment to make on this, except to note that it is currently unmaintained.