Shlomif's Technical Posts Community


Estimating the Size of the Perl 5 MojoMojo Wiki [Oct. 2nd, 2010|11:05 pm]
[Current Location |Home]
[Current Music |Rolling Stones - The Singer Not the Song]

There was a myth going around that "you cannot write large-scale programs in Perl". It's been a while since I've heard it, but it's possible that Perl has simply dropped off the hackers' radar. In any case, today I wanted to address it by estimating the size of MojoMojo, a wiki written in Perl. MojoMojo stands at the top of the CPAN Top "Heavy" 100, with dependencies on 330 direct and indirect CPAN distributions.

Here's the executive summary: MojoMojo and its dependencies contain approximately 631,222 source lines of production code (what is under the distributions' lib/ directories), all of it Perl, and approximately 272,027 lines of testing code (what is under the t/ directories), of which about 265,833 are Perl. The total is 897,055 SLOC of Perl.
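As a quick sanity check on the summary, the totals can be re-added from the three figures quoted above. This is only a sketch; the variable names are made up for illustration.

```shell
# Figures quoted from the two SLOCCount reports in this post.
lib_perl=631222   # lib/ : all of it Perl
t_total=272027    # t/   : all languages
t_perl=265833     # t/   : Perl only

perl_total=$((lib_perl + t_perl))
all_total=$((lib_perl + t_total))

echo "Perl SLOC (lib/ + t/):         $perl_total"
echo "All-language SLOC (lib/ + t/): $all_total"
```

The first line reproduces the 897,055 Perl figure; the second adds the small amount of non-Perl code that SLOCCount detected under t/.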

Here is the lib/ report by SLOCCount:

Totals grouped by language (dominant language first):
perl:        631222 (100.00%)

Total Physical Source Lines of Code (SLOC)                = 631,222
Development Effort Estimate, Person-Years (Person-Months) = 174.27 (2,091.23)
 (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months)                         = 3.81 (45.68)
 (Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule)  = 45.78
Total Estimated Cost to Develop                           = $ 23,541,431
 (average salary = $56,286/year, overhead = 2.40).
SLOCCount, Copyright (C) 2001-2004 David A. Wheeler
SLOCCount is Open Source Software/Free Software, licensed under the GNU GPL.
SLOCCount comes with ABSOLUTELY NO WARRANTY, and you are welcome to
redistribute it under certain conditions as specified by the GNU GPL license;
see the documentation for details.
Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."

And here is the t/ report:

Totals grouped by language (dominant language first):
perl:        265833 (97.72%)
pascal:        5888 (2.16%)
java:           168 (0.06%)
ansic:           96 (0.04%)
cpp:             30 (0.01%)
sh:              12 (0.00%)

Total Physical Source Lines of Code (SLOC)                = 272,027
Development Effort Estimate, Person-Years (Person-Months) = 72.01 (864.08)
 (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
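The effort and schedule figures in both reports follow from the two Basic COCOMO formulas that SLOCCount prints. The sketch below recomputes them with awk; the function name cocomo and the output labels are invented for this example and are not part of SLOCCount.

```shell
# cocomo SLOC: print effort (person-months), schedule (months) and the
# average number of developers, using the Basic COCOMO formulas quoted
# in the reports above.
cocomo() {
    awk -v sloc="$1" 'BEGIN {
        ksloc  = sloc / 1000
        effort = 2.4 * (ksloc ^ 1.05)     # person-months
        sched  = 2.5 * (effort ^ 0.38)    # months
        printf "effort_pm=%.2f schedule_months=%.2f developers=%.2f\n",
               effort, sched, effort / sched
    }'
}

cocomo 631222   # lib/
cocomo 272027   # t/
```

The results should land on, or very close to, the person-month and schedule figures in the reports.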

Now for the longer story of how I collected these statistics. First I wanted to download all the dependencies, and for that I needed to list them. I looked at CPANDB, but could not figure out how to do it properly with it, because its API still exposes many SQLisms and really needs a convenient high-level interface. My SQL skills were also not good enough to hack on the SQL directly. However, someone on #toolchain referred me to CPAN-FindDependencies, whose synopsis did exactly what I needed. I ran it and it generated a file with all the dependent modules and their CPAN distributions. I filtered that file using this shell one-liner:

cat ~/mojomojo-deps.txt | ruby -ne 'puts $1 if / \(([^\)]+)\)/' | 
    sort | uniq > dep-dists.txt

(Note - I've used ruby instead of perl here, not because I don't like Perl, but because I've been trying to learn Ruby lately and could use the practice. I can do the same with Perl, too, by adding -l and replacing "puts" with "print".)
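For reference, the Perl variant described in the note would look roughly like the sketch below. The wrapper function extract_dists and the sample input lines are hypothetical; the sample assumes the "Module::Name (path/to/dist.tar.gz)" line shape that the regular expression expects.

```shell
# extract_dists: read dependency lines on stdin and print the unique
# distribution paths found inside " (...)" on each line.
extract_dists() {
    perl -lne 'print $1 if / \(([^\)]+)\)/' | sort -u
}

# Hypothetical sample input in the assumed "Module (distribution)" shape.
extract_dists <<'EOF'
Regexp::Common (A/AB/ABIGAIL/Regexp-Common-2010010201.tar.gz)
  AppConfig (A/AB/ABW/AppConfig-1.66.tar.gz)
  Regexp::Common::net (A/AB/ABIGAIL/Regexp-Common-2010010201.tar.gz)
EOF
```

Here sort -u stands in for the sort | uniq pair, and the duplicate distribution is printed only once.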

That gave me a list of paths like A/AB/ABIGAIL/Regexp-Common-2010010201.tar.gz, A/AB/ABW/AppConfig-1.66.tar.gz, etc. After trying to get the CPAN client to download them to the current directory, I gave up and instead used the following script:

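cat ../dep-dists.txt |
    (while read T ; do
        echo "Doing $T" 1>&2
        L="$(basename "$T")"
        if [ ! -e "$L" ] ; then
            # MIRROR is a placeholder for the base URL of a CPAN mirror's
            # authors/id/ directory; set it before running.
            lwp-mirror "$MIRROR/$T" "$L"
        fi
    done)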

If you use it, please pick a mirror near you instead of the biocourse.weizmann one; I used one suitable for Israel. That left me with the tarballs. I cd'ed to a different directory and ran a loop to unpack them all:

for I in ../dists/*.tar.* ; do tar -xvf "$I" ; done

Then I used the following loop for extracting only their lib directories:

ls -d ../unpacked/*/lib |
    perl -lne '$x = $_; s{\A\.\./unpacked/}{}; s{/}{-}; print "cp -a $x $_"' |
    sh

(Note: it's not safe for paths containing special characters, but I didn't have to worry about it here.)

I used a similar loop for the t directories and then ran SLOCCount (which is written in Perl) on the resulting directory trees, yielding the final statistics.
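If the paths might contain spaces or shell metacharacters, a plain shell loop avoids generating commands as text. The following is a sketch of one such variant, not the loop used for the numbers above; the function name copy_subdirs and its arguments are invented for illustration.

```shell
# copy_subdirs SRC SUBDIR DEST: for every SRC/<dist>/SUBDIR directory,
# copy it to DEST/<dist>-SUBDIR. Every expansion is quoted, so names with
# spaces or special characters are passed through unchanged.
copy_subdirs() {
    src="$1" ; sub="$2" ; dest="$3"
    for d in "$src"/*/"$sub" ; do
        [ -d "$d" ] || continue           # skip when the glob matches nothing
        dist="$(basename "$(dirname "$d")")"
        cp -a "$d" "$dest/$dist-$sub"
    done
}

# Usage (directory names as in the post):
#   copy_subdirs ../unpacked lib .
#   copy_subdirs ../unpacked t   .
```

Taking the subdirectory name as a parameter lets the same function serve for both lib and t.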

Some conclusions are:

  1. Perl code can scale to large code-bases (over 600K lines of code). In the original discussion, the person who made that statement said, after I pointed him to some data disputing his claim, that it was possible but not easy. However, writing all this code was evidently not extremely difficult, and I haven't heard of the maintainers of these CPAN distributions quitting in despair because maintaining them was too hard. So it's possible and not too difficult.
  2. It's probably a bit worrying that the amount of testing code is relatively small compared to the production code (roughly 43 lines under t/ for every 100 lines under lib/). I expected it to be higher.
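The test-to-production proportion behind the second conclusion can be worked out from the SLOCCount totals. A minimal sketch, with ratio_pct being a helper name made up for this example:

```shell
# ratio_pct PART WHOLE: print PART as a percentage of WHOLE, one decimal.
ratio_pct() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

ratio_pct 272027 631222   # all t/ code relative to lib/
ratio_pct 265833 631222   # Perl-only t/ code relative to lib/
```

Both ratios come out a little above 40%, whether or not the non-Perl files under t/ are counted.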

Finally, I should note that, having discovered that MojoMojo can also be used as a blog engine, I'm considering abandoning my Catable project, which aimed to create a Perl and Catalyst-based blog engine (and which still only has basic functionality), and simply using MojoMojo. Thanks to EdwardIII from #perl for pointing me to that.


I should note that my previous post (which was a joke) took an ugly turn when many of the commenters there thought I was serious and flamed me. I don't usually receive so many comments, but their tone depressed me a little. This post is serious, I promise.


From: (Anonymous)
2010-10-03 12:11 am (UTC)



I think that CPAN shows that Perl can be assembled into bigger scale systems with even more ease than other languages.