Shlomif's Technical Posts Community


What you can do with File-Find-Object (that you can't with File::Find) [Jul. 17th, 2009|03:58 pm]

shlomif_tech

[shlomif]
[Current Location |Home]
[Current Mood |calm]
[Current Music |Ronald Jenkees - Super-Fun]

I've written about File-Find-Object before, and I have long intended to write an entry demonstrating its philosophical advantages over the core File::Find module. Today, I'd like to get to it.

As opposed to File::Find, File-Find-Object:

  1. Has an iterative interface, and is capable of being interrupted in the middle.
  2. Can be instantiated and be used to traverse an arbitrary number of directory trees in one process.
  3. Can return result objects instead of just plain paths.
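
To give a taste of the third point, here is a minimal sketch, assuming File::Find::Object is installed from CPAN. The describe_tree() helper and its output format are made up for illustration; next_obj(), path(), is_dir(), is_file(), full_components() and stat_ret() are the result-object accessors it relies on.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use File::Find::Object;

# Hypothetical helper: returns one line of text for each entry under $base.
sub describe_tree
{
    my $base = shift;

    my $finder = File::Find::Object->new({}, $base);

    my @lines;
    while (my $r = $finder->next_obj())
    {
        my $kind = $r->is_dir() ? "dir" : $r->is_file() ? "file" : "other";

        # stat_ret() holds the stat() results that the finder already
        # collected; element 7 is the size in bytes. Fall back to 0 if the
        # stat failed (e.g. for a dangling symbolic link).
        my $stat = $r->stat_ret() || [];
        my $size = defined($stat->[7]) ? $stat->[7] : 0;

        # full_components() is the path relative to $base, split into parts.
        my $depth = scalar @{ $r->full_components() };

        push @lines, sprintf("%-5s depth=%d size=%d %s",
            $kind, $depth, $size, $r->path());
    }

    return \@lines;
}

print "$_\n" for @{ describe_tree(@ARGV ? $ARGV[0] : '.') };
```

With File::Find, the callback would have to call stat() and split the path by itself to get the same information.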

I'd like to demonstrate some of these advantages now.

Case Study #1: Looking for a Needle in a Haystack

Let's suppose you have a huge directory tree containing many directories and files, and you're looking for only one result (or a few). Once you have found that result, you wish to stop. This question was raised in this Stack Overflow post.

So how can you do it with File::Find? Not very easily. You can either throw an exception:


use File::Find;

sub processFile {
   if ($_ =~ /target/) {
      # The "type" key must match the one checked after the eval below.
      die { type => "file-found", path => $File::Find::name };
   }
}

eval {
    find (\&processFile, $mydir);
};

if ( $@ ) {
    my $result = $@;
    if ( (ref($result) eq "HASH") && 
         ($result->{type} eq "file-found")
       )
    {
        my $path = $result->{path};
        # Do something with $path.
    }
    else {
        # A genuine error: propagate it.
        die $result;
    }
}
else {
   # be sad
}

This is incredibly inelegant, and abuses the Perl exception system for propagating values instead of errors. But there's an even worse way, using $File::Find::prune:

#! /usr/bin/perl -w

use strict;
use File::Find;

my @hits = ();
my $hit_lim = shift || 20;

find(
    sub {
        if( scalar @hits >= $hit_lim ) {
            $File::Find::prune = 1;
            return;
        }
        elsif( -d $_ ) {
            return;
        }
        push @hits, $File::Find::name;
    },
    shift || '.'
);

$, = "\n";
print @hits, "\n";

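Here, once the limit has been reached, every further call to the callback sets $File::Find::prune and returns, so no more directories are entered. In effect, we prune all the levels from the results up to the root to get out of the loop.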

So how can you do it with File-Find-Object? In a very straightforward manner:

#!/usr/bin/perl

use strict;
use warnings;

use File::Find::Object;

sub find_needle
{
    my $base = shift;

    my $finder = File::Find::Object->new({}, $base);

    while (defined(my $r = $finder->next()))
    {
        if ($r =~ /target/)
        {
            return $r;
        }
    }

    return;
}

my $found = find_needle(shift(@ARGV));

if (defined($found))
{
    print "$found\n";
}
else
{
    die "Could not find target.";
}

The find_needle() function is the important thing here, and one can see it doesn't use any exceptions, excessive prunes or anything like that. It just harnesses the iterative interface of File-Find-Object. And it works too:

shlomi:~$ perl f-f-o-find-needle.pl ~/progs/
/home/shlomi/progs/Rpms/BUILD/ExtUtils-MakeMaker-6.52/t/dir_target.t
shlomi:~$
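
The same loop extends naturally to the "a few results" case. The following is only a sketch under the same assumptions as the program above; find_needles(), its arguments and the default limit of 3 are hypothetical names and values chosen for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use File::Find::Object;

# Hypothetical helper: returns up to $limit paths under $base matching $re.
sub find_needles
{
    my ($base, $re, $limit) = @_;

    my $finder = File::Find::Object->new({}, $base);

    my @found;

    # Stop asking for more results as soon as we have enough of them.
    while (@found < $limit and defined(my $path = $finder->next()))
    {
        push @found, $path if $path =~ $re;
    }

    return @found;
}

my $base  = @ARGV ? shift(@ARGV) : '.';
my $limit = @ARGV ? shift(@ARGV) : 3;

print "$_\n" for find_needles($base, qr/target/, $limit);
```

The traversal simply ends when the loop does; nothing has to be pruned or thrown.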

Case Study #2: Recursive Diff

[Image: Evil Djinni from Disney's Aladdin]

Let's suppose an evil djinni has removed the -r flag from your diff program, making you unable to recursively find the differences between files in two directory trees. As a result, you now need to write a recursive-diff program in Perl that will run diff -u on the two copies of each equivalent path in the two directories.

Since File::Find cannot be instantiated, let alone twice at once, using it would require us to collect all the results from both directories first, and then traverse them in memory. But with File-Find-Object there is a better way:

#!/usr/bin/perl

use strict;
use warnings;

use File::Find::Object;
use List::MoreUtils qw(all);

my @indexes = (0,1);
my @paths;
for my $idx (@indexes)
{
    push @paths, shift(@ARGV);
}

my @finders = map { File::Find::Object->new({}, $_ ) } @paths;

my @results;

my @fns;

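sub fetch
{
    my $idx = shift;

    if ($results[$idx] = $finders[$idx]->next_obj())
    {
        # Join with "\0" rather than "/", so that comparing the keys with
        # lt agrees with the order of a sorted, directory-by-directory
        # traversal (e.g. "foo/bar" is visited before "foo.txt", although
        # "." sorts before "/").
        $fns[$idx] = join("\0", @{$results[$idx]->full_components()});
    }

    return;
}

sub only_in
{
    my $idx = shift;

    # Turn the comparison key back into a displayable relative path.
    (my $fn = $fns[$idx]) =~ tr{\0}{/};
    printf("Only in %s: %s\n", $paths[$idx], $fn);
    fetch($idx);

    return;
}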

for my $idx (@indexes)
{
    fetch($idx);
}

COMPARE:
while (all { $_ } @results)
{
    my $skip = 0;
    foreach my $idx (@indexes)
    {
        if (!$results[$idx]->is_file())
        {
            fetch($idx);
            $skip = 1;
        }
    }
    if ($skip)
    {
        next COMPARE;
    }

    if ($fns[0] lt $fns[1])
    {
        only_in(0);
    }
    elsif ($fns[1] lt $fns[0])
    {
        only_in(1);
    }
    else
    {
        system("diff", "-u", map {$_->path() } @results);
        foreach my $idx (@indexes)
        {
            fetch($idx);
        }
    }
}

foreach my $idx (@indexes)
{
    while($results[$idx])
    {
        only_in($idx);
    }
}

(As a bonus, we do not need to sort the results explicitly at any stage, because File-Find-Object sorts them for us.)

This program did not take me a long time to write, it works pretty well, and it does not populate a long list of the results of one or both directories.
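
For contrast, here is a rough sketch of the collect-then-traverse approach that File::Find pushes one towards. It is not taken from any existing program: list_files() and merge_lists() are hypothetical helpers, and it uses only the core File::Find and File::Spec modules. Note that both complete file lists are held in memory before the first comparison is made.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use File::Find;
use File::Spec;

# Hypothetical helper: returns a sorted list of all plain files under $base,
# as paths relative to $base. The whole list is held in memory.
sub list_files
{
    my $base = shift;

    my @files;
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                push @files, File::Spec->abs2rel($File::Find::name, $base)
                    if -f $File::Find::name;
            },
        },
        $base
    );

    return [ sort @files ];
}

# Hypothetical helper: walks two sorted lists in step and reports, for each
# name, whether it is only in the first list, only in the second, or in both.
sub merge_lists
{
    my ($left, $right) = @_;

    my ($i, $j) = (0, 0);
    my @report;
    while ($i < @$left and $j < @$right)
    {
        my $cmp = $left->[$i] cmp $right->[$j];
        if    ($cmp < 0) { push @report, [ 'left',  $left->[$i++] ]; }
        elsif ($cmp > 0) { push @report, [ 'right', $right->[$j++] ]; }
        else             { push @report, [ 'both',  $left->[$i++] ]; $j++; }
    }

    # Whatever remains in one of the lists exists only on that side.
    push @report, map { [ 'left',  $_ ] } @{$left}[ $i .. $#$left ];
    push @report, map { [ 'right', $_ ] } @{$right}[ $j .. $#$right ];

    return \@report;
}

if (@ARGV == 2)
{
    my @paths = @ARGV;
    for my $item (@{ merge_lists(map { list_files($_) } @paths) })
    {
        my ($where, $fn) = @$item;
        if    ($where eq 'left')  { print "Only in $paths[0]: $fn\n"; }
        elsif ($where eq 'right') { print "Only in $paths[1]: $fn\n"; }
        else
        {
            system("diff", "-u", map { File::Spec->catfile($_, $fn) } @paths);
        }
    }
}
```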

Conclusion

If you use File-Find-Object instead of File::Find, your code may be cleaner, your logic less convoluted, and you may actually be able to achieve things that are not possible with the latter. I hope I have whetted your appetite here and convinced you to give File-Find-Object a try.

So what does the future hold? I recently ported File-Find-Rule to File-Find-Object and called the result File-Find-Object-Rule. As a result, "->start" and "->match" are now truly iterative, and I believe you can iterate with them on several objects at once. As I discovered by porting File-Find-Object-Rule-MMagic, I unfortunately cannot maintain full backwards compatibility with the plugin API of File-Find-Rule, because the latter exposes some of the behaviour of File::Find (in a leaky abstraction fashion).
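
As an untested sketch of what iterating on several objects at once might look like, assuming File-Find-Object-Rule is installed from CPAN and that "->match" returns undef once a rule is exhausted (as it does in File-Find-Rule): the interleave_matches() helper and the '*.pm' default pattern are made up for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use File::Find::Object::Rule;

# Hypothetical helper: takes matches from one rule per directory in
# round-robin order, so that several traversals are in progress at once.
sub interleave_matches
{
    my ($pattern, @dirs) = @_;

    my @active;
    for my $dir (@dirs)
    {
        # One independent rule (and hence one traversal) per directory.
        my $rule = File::Find::Object::Rule->new;
        $rule->file;
        $rule->name($pattern);
        $rule->start($dir);
        push @active, $rule;
    }

    my @matches;
    while (@active)
    {
        my @still_active;
        for my $rule (@active)
        {
            # Assumption: match() yields the next match, or undef when done.
            my $path = $rule->match;
            if (defined($path))
            {
                push @matches, $path;
                push @still_active, $rule;
            }
        }
        @active = @still_active;
    }

    return @matches;
}

print "$_\n" for interleave_matches('*.pm', @ARGV);
```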

I'm planning on porting more File-Find-Rule plugins to File-Find-Object-Rule, and would appreciate any help. I also would like to look at the directory tree traversal APIs of other languages to see if they contain any interesting techniques.


Comments:
From: (Anonymous)
2009-07-17 11:02 pm (UTC)

Confused about returns in File::Find


I have been tempted to write an iterative version of File::Find on numerous occasions, and I'm glad to see someone has. While I usually only need to traverse a single directory path at a time, I'll not complain about being able to traverse several at once.

That having been said, it's always been my impression that if you wanted to do a return in File::Find, you needed to put it in a variable defined outside the scope of your wanted sub, because your wanted sub's return value is ignored. As such, while you have to jump through a hoop to return anything at all, you can return anything at all. So your advantage 3 doesn't hold - File::Find either isn't limited like that, or it is more limited than that. Or has File::Find been updated more recently than I am aware?

(Btw, you have a typo in your first advantage - 'and a is'.)
From: shlomif
2009-07-18 06:49 am (UTC)

Re: Confused about returns in File::Find


Hi Anonymous!

Thanks for your comment. I corrected your typo. What I meant by the third advantage was that File-Find-Object gives you the result object for free as part of its interface, and it already contains such data as whether it is a plain file or a directory, and the result of the call to stat()/lstat(), and its components. With File::Find, you only have the path name, and need to work out the rest of the information yourself. You are free to put it inside an object, but it's still sub-optimal.