csjewell
Well, on http://use.perl.org/~Alias/journal/40184, Alias described how we shrank the Strawberry Perl .msi files by controlling the order in which files went into the databases, and suggested an Archive::Tar::Optimize module. Let's see if that's actually worth doing.


use 5.012;
use warnings;
use Archive::Tar qw();
use File::Find::Rule qw();
use File::Spec::Functions qw(updir rel2abs abs2rel catdir catfile splitpath);
use File::pushd qw(pushd);
use IO::Compress::Bzip2 qw(bzip2);
use IO::Compress::Xz qw(xz);
use IO::Compress::Gzip qw(gzip);

# This script assumes an unpacked perl distribution here.
my $PERLDIST  = 'L:\\perl\\perl-5.12.1';
my $ROOTDIR   = rel2abs(catdir($PERLDIST, updir()));
my $DISTDIR   = abs2rel($PERLDIST, $ROOTDIR);
my $TESTDIR   = 'C:\\Users\\Curtis\\Desktop\\tartest';
my $DUMB_TAR  = catfile($TESTDIR, $DISTDIR . '-dumb.tar');
my $SMART_TAR = catfile($TESTDIR, $DISTDIR . '-smart.tar');

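# Sort comparator: group by extension, then directory, then file name.
sub compare {
	my ($left, $right) = @_;    # avoid shadowing sort's own $a and $b

	my (undef, $l_dir, $l_file) = splitpath($left);
	my (undef, $r_dir, $r_file) = splitpath($right);
	# Reverse the name and take the text before the first dot: that is
	# the extension (reversed), which is all we need for grouping.
	my ($l_ext) = split /[.]/, scalar reverse $l_file;
	my ($r_ext) = split /[.]/, scalar reverse $r_file;
	$l_ext //= q{};
	$r_ext //= q{};
	if ($l_ext ne $r_ext) {
		return $l_ext cmp $r_ext;
	} elsif ($l_dir ne $r_dir) {
		return $l_dir cmp $r_dir;
	} elsif ($l_file ne $r_file) {
		return $l_file cmp $r_file;
	} else {
		return $left cmp $right;
	}
}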

say "Starting test.";

my @filelist = File::Find::Rule->file()->relative()->in($PERLDIST);

my @filelist_dumb = map { catfile($DISTDIR, $_) } @filelist;
my @filelist_smart = sort { compare($a, $b) } @filelist_dumb;	

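{ # Give File::pushd a scope.
	my $pushd = File::pushd::pushd($ROOTDIR);

	say "Making $DUMB_TAR.";
	my $dumb_tar = Archive::Tar->new();
	# ASSUMPTION: the add_files/write calls in this block are missing
	# from the listing; these are the obvious Archive::Tar equivalents.
	$dumb_tar->add_files(@filelist_dumb);
	$dumb_tar->write($DUMB_TAR);

	say "Making $SMART_TAR.";
	my $smart_tar = Archive::Tar->new();
	foreach my $file (@filelist_smart) {
#		say $file;
		$smart_tar->add_files($file);
	}
	$smart_tar->write($SMART_TAR);
} # $pushd goes out of scope here, restoring the original directory.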

print "\n";

say "Compressing $DUMB_TAR with gzip.";
gzip $DUMB_TAR  => $DUMB_TAR  . '.gz', -Level => 9, BinModeIn => 1;

say "Compressing $SMART_TAR with gzip.";
gzip $SMART_TAR => $SMART_TAR . '.gz', -Level => 9, BinModeIn => 1;

say "Compressing $DUMB_TAR with bzip2.";
bzip2 $DUMB_TAR  => $DUMB_TAR  . '.bz2', BlockSize100K => 9, BinModeIn => 1, WorkFactor => 250;

say "Compressing $SMART_TAR with bzip2.";
bzip2 $SMART_TAR => $SMART_TAR . '.bz2', BlockSize100K => 9, BinModeIn => 1, WorkFactor => 250;

say "Compressing $DUMB_TAR with xz.";
xz $DUMB_TAR  => $DUMB_TAR  . '.xz', Preset => 9, BinModeIn => 1;

say "Compressing $SMART_TAR with xz.";
xz $SMART_TAR => $SMART_TAR . '.xz', Preset => 9, BinModeIn => 1;
print "\n";

my $dumb_tar_gz_size  = -s $DUMB_TAR  . '.gz';
my $smart_tar_gz_size = -s $SMART_TAR . '.gz';
my $difference_gz = $dumb_tar_gz_size - $smart_tar_gz_size;
my $percent_gz = ($difference_gz / $dumb_tar_gz_size) * 100;
my $diff_gz_k = $difference_gz / 1024;

say "$DUMB_TAR.gz  size: $dumb_tar_gz_size ";
say "$SMART_TAR.gz size: $smart_tar_gz_size";
say "difference: $difference_gz ($percent_gz %) ($diff_gz_k KiB)";
print "\n";

my $dumb_tar_bz2_size  = -s $DUMB_TAR  . '.bz2';
my $smart_tar_bz2_size = -s $SMART_TAR . '.bz2';
my $difference_bz2 = $dumb_tar_bz2_size - $smart_tar_bz2_size;
my $percent_bz2 = ($difference_bz2 / $dumb_tar_bz2_size) * 100;
my $diff_bz2_k = $difference_bz2 / 1024;

say "$DUMB_TAR.bz2  size: $dumb_tar_bz2_size ";
say "$SMART_TAR.bz2 size: $smart_tar_bz2_size";
say "difference: $difference_bz2 ($percent_bz2 %) ($diff_bz2_k KiB)";
print "\n";

my $dumb_tar_xz_size  = -s $DUMB_TAR  . '.xz';
my $smart_tar_xz_size = -s $SMART_TAR . '.xz';
my $difference_xz = $dumb_tar_xz_size - $smart_tar_xz_size;
my $percent_xz = ($difference_xz / $dumb_tar_xz_size) * 100;
my $diff_xz_k = $difference_xz / 1024;

say "$DUMB_TAR.xz  size: $dumb_tar_xz_size ";
say "$SMART_TAR.xz size: $smart_tar_xz_size";
say "difference: $difference_xz ($percent_xz %) ($diff_xz_k KiB)";


Starting test.
Making C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar.
Making C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar.

Compressing C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar with gzip.
Compressing C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar with gzip.
Compressing C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar with bzip2.
Compressing C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar with bzip2.
Compressing C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar with xz.
Compressing C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar with xz.

C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar.gz  size: 14885143
C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar.gz size: 14892370
difference: -7227 (-0.0485517673562155 %) (-7.0576171875 KiB)

C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar.bz2  size: 12011773
C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar.bz2 size: 12073770
difference: -61997 (-0.516135294931065 %) (-60.5439453125 KiB)

C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar.xz  size: 9280004
C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar.xz size: 9260888
difference: 19116 (0.205991290520995 %) (18.66796875 KiB)

So for the Perl source code, ordering the .tar file so that files with the same extension sit together (a rough proxy for similar content) gains only about 0.2% with .xz, which is not enough to worry about, and actually compresses WORSE with .gz (by about 0.05%) and .bz2 (by about 0.5%).
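
To make the "group by extension" ordering concrete, here is a minimal sketch of the idea on a handful of made-up paths. It is a simplified variant, not the script's exact comparator: `sort_key`, `by_extension`, and the sample file list are names invented for illustration, and it uses File::Spec::Unix so that it behaves the same on any platform.

```perl
use strict;
use warnings;
use File::Spec::Unix;

# Build a sort key of [extension, directory, file name] for one path.
sub sort_key {
    my ($path) = @_;
    my (undef, $dir, $file) = File::Spec::Unix->splitpath($path);
    # Text after the last dot; names without a dot get an empty extension.
    my ($ext) = $file =~ /[.]([^.]*)\z/;
    $ext //= q{};
    return [$ext, $dir, $file];
}

# Compare by extension first, then directory, then file name.
sub by_extension {
    my ($left, $right) = @_;
    my ($l, $r) = (sort_key($left), sort_key($right));
    return $l->[0] cmp $r->[0]
        || $l->[1] cmp $r->[1]
        || $l->[2] cmp $r->[2];
}

# Hypothetical sample; a real run would use the File::Find::Rule list.
my @sample = qw(
    perl-5.12.1/lib/strict.pm
    perl-5.12.1/perl.c
    perl-5.12.1/lib/strict.t
    perl-5.12.1/av.c
    perl-5.12.1/lib/warnings.pm
    perl-5.12.1/README
);

my @sorted = sort { by_extension($a, $b) } @sample;
print "$_\n" for @sorted;
```

The extensionless README sorts first, followed by the .c files, then the .pm files, then the .t file, so files of the same type end up next to each other in the archive.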

And just as an aside, the .tar files are exactly the same size.
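
That is what the tar format predicts: each member is a 512-byte header followed by its data padded up to a multiple of 512 bytes, and the archive ends with two 512-byte blocks of zeros, so the total does not depend on member order. A rough sketch of the arithmetic, assuming plain entries with no extra long-name records (`tar_size` is a hypothetical helper, not part of Archive::Tar):

```perl
use strict;
use warnings;

# Estimate the size of a tar archive from the sizes of its members.
# ASSUMPTION: one 512-byte header per member and no long-name records.
sub tar_size {
    my @member_sizes = @_;
    my $total = 0;
    for my $size (@member_sizes) {
        $total += 512;                              # header block
        $total += 512 * int(($size + 511) / 512);   # data, padded to 512
    }
    return $total + 1024;                           # end-of-archive marker
}

# The same members in two different orders give the same total.
print tar_size(1, 512, 513), "\n";
print tar_size(513, 1, 512), "\n";
```

Any extra records for long file names would depend only on the names, not on their position, so reordering still leaves the total unchanged.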

(As another aside, .xz is evidently optimized for decompression rather than for compression speed: it took about 90 seconds to compress each file, while .bz2 and .gz took about 15 seconds each.)
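
The timings above were taken by hand. If you want to measure this yourself, a minimal sketch along these lines should do, assuming Time::HiRes and in-memory buffers. The sample input and the `time_compression` helper are made up for illustration, and only gzip and bzip2 are shown because IO::Compress::Xz is not a core module; it could be added to the table in the same way.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);
use IO::Compress::Gzip  qw(gzip $GzipError);
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);

# Hypothetical sample input; substitute the contents of a real .tar file.
my $input = join q{},
    map { "line $_ of some repetitive sample text\n" } 1 .. 20_000;

# Run one compressor and return (elapsed seconds, compressed size in bytes).
sub time_compression {
    my ($compress) = @_;
    my $output;
    my $start = time();
    $compress->(\$input, \$output);
    return (time() - $start, length $output);
}

my %compressors = (
    gzip => sub {
        gzip($_[0] => $_[1], -Level => 9)
            or die "gzip failed: $GzipError";
    },
    bzip2 => sub {
        bzip2($_[0] => $_[1], BlockSize100K => 9)
            or die "bzip2 failed: $Bzip2Error";
    },
);

for my $name (sort keys %compressors) {
    my ($seconds, $size) = time_compression($compressors{$name});
    printf "%-5s %8d bytes in %.3f s\n", $name, $size, $seconds;
}
```

The numbers will vary with the machine and the input, so treat them as relative rather than absolute.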

Of course, your mileage may vary, based on your own test data.

I should really try this with an extracted Strawberry Perl (which contains more binary data) and see what happens.

Shouldn't one expect this?

Date: 2010-05-18 11:34 am (UTC)
From: (Anonymous)
The difference with gzip and xz is negligible. The negative result with bzip2 is IMHO caused by the fact that bzip2's main feature is block sorting, and it seems to be better at that than something else (particularly if the "something" is just an alphabetical sort). I'm not overly optimistic about beating bzip2 with a preprocessed tar archive, since bzip2 can swap blocks across file boundaries and tar cannot.

Ralf Muschall


