There used to be a myth going around that "you cannot write large-scale
programs in Perl". It has been a while since I last heard it, but it's
possible that Perl has simply dropped off the hackers' radar.
In any case, today I want to address it by estimating the size of
MojoMojo, a wiki written in Perl. MojoMojo stands at the top of the
CPAN Top "Heavy" 100, with dependencies on 330 direct and indirect CPAN
distributions.
Here's the executive summary: MojoMojo and its dependencies contain
approximately 631,222 source lines of production code (what is under
the distributions' lib/ directories), all of it Perl, and
approximately 272,027 lines of testing code (what is under the t/
directories), 265,833 of which are Perl. The total is 897,055 SLOC of Perl.
Here is the lib/ report by SLOCCount:
Totals grouped by language (dominant language first):
perl: 631222 (100.00%)
Total Physical Source Lines of Code (SLOC) = 631,222
Development Effort Estimate, Person-Years (Person-Months) = 174.27 (2,091.23)
(Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months) = 3.81 (45.68)
(Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule) = 45.78
Total Estimated Cost to Develop = $ 23,541,431
(average salary = $56,286/year, overhead = 2.40).
SLOCCount, Copyright (C) 2001-2004 David A. Wheeler
SLOCCount is Open Source Software/Free Software, licensed under the GNU GPL.
SLOCCount comes with ABSOLUTELY NO WARRANTY, and you are welcome to
redistribute it under certain conditions as specified by the GNU GPL license;
see the documentation for details.
Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."
And here is the t/ report:
Totals grouped by language (dominant language first):
perl: 265833 (97.72%)
pascal: 5888 (2.16%)
java: 168 (0.06%)
ansic: 96 (0.04%)
cpp: 30 (0.01%)
sh: 12 (0.00%)
Total Physical Source Lines of Code (SLOC) = 272,027
Development Effort Estimate, Person-Years (Person-Months) = 72.01 (864.08)
(Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
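As a sanity check, the Basic COCOMO formulas quoted in the two reports can be re-evaluated by hand. The following is only a sketch that plugs the SLOC totals into those two formulas with awk; the function name `cocomo` is made up for this post and is not part of SLOCCount. The results should land very close to the effort, schedule and average-developer figures printed in the reports.

```shell
# Basic COCOMO, using the formulas printed in the SLOCCount reports:
#   person-months   = 2.4 * KSLOC ** 1.05
#   schedule months = 2.5 * person-months ** 0.38
# "cocomo" is a hypothetical helper name, not an SLOCCount command.
cocomo() {
    awk -v sloc="$1" 'BEGIN {
        effort = 2.4 * (sloc / 1000) ^ 1.05   # person-months
        sched  = 2.5 * effort ^ 0.38          # calendar months
        # Prints: person-months, schedule months, average developers
        printf "%.2f %.2f %.2f\n", effort, sched, effort / sched
    }'
}

cocomo 631222   # lib/ total
cocomo 272027   # t/ total
```

Dividing the person-months by 12 gives the person-years figure shown in the reports.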
Now for the longer story of how I collected these statistics. First I wanted
to download all the dependencies, and for that I needed to list them. I looked
at CPANDB, but could not figure out how to get the list out of it, because its
API still has many SQLisms and really needs a convenient high-level API. My
SQL skills were also too weak to work with the SQL directly. However, someone
on #toolchain referred me to CPAN-FindDependencies, whose synopsis did exactly
what I needed. I ran it and it generated a file listing all the modules
MojoMojo depends on along with their CPAN distributions. I filtered that file
using this shell one-liner:
cat ~/mojomojo-deps.txt | ruby -ne 'puts $1 if / \(([^\)]+)\)/' |
sort | uniq > dep-dists.txt
(Note: I used Ruby instead of Perl here, not because I dislike Perl,
but because I've been trying to learn Ruby lately and could use the practice.
The same works in Perl by adding -l and replacing "puts" with "print".)
That gave me a list of paths like
A/AB/ABIGAIL/Regexp-Common-2010010201.tar.gz ,
A/AB/ABW/AppConfig-1.66.tar.gz etc. After trying to get the CPAN
client to download them to the current directory, I gave up and instead
used the following script:
#!/bin/bash
cat ../dep-dists.txt |
(while read T ; do
echo "Doing $T" 1>&2
L="$(basename "$T")"
if [ ! -e "$L" ] ; then
lwp-mirror http://biocourse.weizmann.ac.il/CPAN/authors/id/"$T" "$L"
fi
done)
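A variation on the same loop, offered only as a sketch: the mirror is pulled out into a variable and the function merely prints what would be fetched, so the plan can be inspected before anything is downloaded. The `MIRROR` variable, its default value and the `plan_downloads` name are assumptions made for this example.

```shell
# MIRROR is a hypothetical variable; point it at a CPAN mirror near you.
MIRROR=${MIRROR:-https://www.cpan.org}

# Read distribution paths (e.g. A/AB/ABW/AppConfig-1.66.tar.gz) on stdin and
# print "URL<TAB>local-file" for each tarball that is not yet present in the
# current directory. Nothing is downloaded here.
plan_downloads() {
    while IFS= read -r path; do
        [ -n "$path" ] || continue
        local_file=$(basename "$path")
        if [ ! -e "$local_file" ]; then
            printf '%s\t%s\n' "$MIRROR/authors/id/$path" "$local_file"
        fi
    done
}

printf '%s\n' 'A/AB/ABW/AppConfig-1.66.tar.gz' | plan_downloads
```

Each printed pair can then be handed to whatever downloader you prefer, lwp-mirror included.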
If you reuse the lwp-mirror script, please replace the biocourse.weizmann
mirror with one nearer to you; I used one suitable for Israel. That gave me a
directory of tarballs. I then changed (cd) to a different directory and ran a
loop to unpack them all:
for I in ../dists/*.tar.* ; do tar -xvf "$I" ; done
Then I used the following loop for extracting only their lib directories:
ls -d ../unpacked/*/lib |
perl -lne '$x = $_; s{\A\.\./unpacked/}{}; s{/}{-}; print "cp -a $x $_"' |
bash
(Note: this is not safe for paths containing spaces or shell metacharacters,
because the generated cp commands are unquoted, but I didn't have to worry
about that here.)
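If odd path names are a concern, the same copy can be done without generating shell commands at all. This is a sketch rather than what I actually ran; the `copy_subdirs` name and its three arguments are made up for the example. It produces the same "Dist-Name-lib" layout as the one-liner, but quotes every path.

```shell
# Copy <src_root>/<dist>/<sub> to <dest_root>/<dist>-<sub> for every
# unpacked distribution. "copy_subdirs" is a hypothetical helper name.
copy_subdirs() {
    src_root=$1
    sub=$2
    dest_root=$3
    mkdir -p "$dest_root"
    for d in "$src_root"/*/"$sub"; do
        [ -d "$d" ] || continue                  # skip if the glob matched nothing
        dist=$(basename "$(dirname "$d")")       # e.g. AppConfig-1.66
        cp -a "$d" "$dest_root/$dist-$sub"
    done
}

# Demo on a throwaway tree, including a directory name with a space in it.
demo=$(mktemp -d)
mkdir -p "$demo/unpacked/Some Dist-0.01/lib"
touch "$demo/unpacked/Some Dist-0.01/lib/Foo.pm"
copy_subdirs "$demo/unpacked" lib "$demo/libs"
ls "$demo/libs"
```

Calling it a second time with `t` instead of `lib` would cover the test directories.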
I used a similar loop for the t directories and then ran SLOCCount (which is
written in Perl) on the resulting directory trees, yielding the final
statistics.
Some conclusions:
- Perl code can scale to large code bases (over 600K lines of code). In the
  original discussion I linked to, the person who made that statement replied,
  after I pointed him to some data disputing his claim, that it was possible
  but not easy. However, writing all this code was evidently not extremely
  difficult, and I haven't heard of the maintainers of these CPAN
  distributions quitting in despair because maintaining them was too hard. So
  it's possible and not too difficult.
- It's probably a bit worrying that the testing code is relatively small
  compared to the production code: about 43% of its size (272,027 versus
  631,222 lines). I expected it to be higher.
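To put numbers on that last point, the ratios can be worked out from the totals in the two SLOCCount reports. This is just arithmetic on the figures already quoted; the `sloc_ratios` name is made up for the example.

```shell
# Figures copied from the lib/ and t/ SLOCCount reports in this post.
lib_sloc=631222      # production code, all Perl
t_sloc=272027        # testing code, all languages
t_perl_sloc=265833   # testing code, Perl only

# "sloc_ratios" is a hypothetical helper name.
sloc_ratios() {
    awk -v lib="$lib_sloc" -v t="$t_sloc" -v tp="$t_perl_sloc" 'BEGIN {
        printf "test/production ratio: %.1f%%\n", 100 * t / lib
        printf "test share of all code: %.1f%%\n", 100 * t / (lib + t)
        printf "total Perl SLOC: %d\n", lib + tp
    }'
}

sloc_ratios
```

So there is a bit less than one line of test code for every two lines of production code.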
Finally, I should note that having discovered that MojoMojo can also be
used as a blog engine, I'm considering abandoning my Catable project, which
aimed to create a Perl and Catalyst-based blog engine (and which still only
has basic functionality), and simply using MojoMojo instead. Thanks to
EdwardIII from #perl for pointing me to that.
Cheers!
I should note that my previous post (which was a joke) took an ugly turn when
many of the commenters there thought I was serious and flamed me. I don't
usually receive so many comments, but their tone depressed me a little. This
post is serious, I promise.