[personal profile] csjewell
Well, on http://use.perl.org/~Alias/journal/40184, Alias described how we shrank the Strawberry Perl .msi files by controlling the order in which files went into the databases, and suggested an Archive::Tar::Optimize module. Let's see if that's actually worth doing.

Script:
#!perl

use 5.012;
use warnings;
use Archive::Tar qw();
use File::Find::Rule qw();
use File::Spec::Functions qw(updir rel2abs abs2rel catdir catfile splitpath);
use File::pushd qw(pushd);
use IO::Compress::Bzip2 qw(bzip2);
use IO::Compress::Xz qw(xz);
use IO::Compress::Gzip qw(gzip);

# This script assumes an unpacked perl distribution here.
my $PERLDIST  = 'L:\\perl\\perl-5.12.1';
my $ROOTDIR   = rel2abs(catdir($PERLDIST, updir()));
my $DISTDIR   = abs2rel($PERLDIST, $ROOTDIR);
my $TESTDIR   = 'C:\\Users\\Curtis\\Desktop\\tartest';
my $DUMB_TAR  = catfile($TESTDIR, $DISTDIR . '-dumb.tar');
my $SMART_TAR = catfile($TESTDIR, $DISTDIR . '-smart.tar');

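# Order two paths by extension first, then by directory, then by file name.
# Note: the "extension" key is the reversed text after the last dot
# (e.g. 'mp' for .pm); a name with no dot uses its whole reversed name.
sub compare {
	# Avoid lexical $a/$b here: those names belong to sort().
	my ($left, $right) = @_;

	my (undef, $left_dir,  $left_file)  = splitpath($left);
	my (undef, $right_dir, $right_file) = splitpath($right);

	my ($left_ext)  = split /[.]/, scalar reverse $left_file;
	my ($right_ext) = split /[.]/, scalar reverse $right_file;

	$left_ext  //= q{};
	$right_ext //= q{};

	# Fall through to the next key whenever the current one compares equal.
	return $left_ext  cmp $right_ext
	    || $left_dir  cmp $right_dir
	    || $left_file cmp $right_file
	    || $left      cmp $right;
}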

say "Starting test.";

my @filelist = File::Find::Rule->file()->relative()->in($PERLDIST);

my @filelist_dumb = map { catfile($DISTDIR, $_) } @filelist;
my @filelist_smart = sort { compare($a, $b) } @filelist_dumb;	

{ # Give File::pushd a scope.
	my $pushd = File::pushd::pushd($ROOTDIR);

	say "Making $DUMB_TAR.";
	my $dumb_tar = Archive::Tar->new();
	$dumb_tar->add_files(@filelist_dumb);
	$dumb_tar->write($DUMB_TAR);

	say "Making $SMART_TAR.";
	my $smart_tar = Archive::Tar->new();
	foreach my $file (@filelist_smart) {
#		say $file;
		$smart_tar->add_files($file);
	}
	$smart_tar->write($SMART_TAR);
}

print "\n";

say "Compressing $DUMB_TAR with gzip.";
gzip $DUMB_TAR  => $DUMB_TAR  . '.gz', -Level => 9, BinModeIn => 1;

say "Compressing $SMART_TAR with gzip.";
gzip $SMART_TAR => $SMART_TAR . '.gz', -Level => 9, BinModeIn => 1;

say "Compressing $DUMB_TAR with bzip2.";
bzip2 $DUMB_TAR  => $DUMB_TAR  . '.bz2', BlockSize100K => 9, BinModeIn => 1, WorkFactor => 250;

say "Compressing $SMART_TAR with bzip2.";
bzip2 $SMART_TAR => $SMART_TAR . '.bz2', BlockSize100K => 9, BinModeIn => 1, WorkFactor => 250;

say "Compressing $DUMB_TAR with xz.";
xz $DUMB_TAR  => $DUMB_TAR  . '.xz', Preset => 9, BinModeIn => 1;

say "Compressing $SMART_TAR with xz.";
xz $SMART_TAR => $SMART_TAR . '.xz', Preset => 9, BinModeIn => 1;
print "\n";

my $dumb_tar_gz_size  = -s $DUMB_TAR  . '.gz';
my $smart_tar_gz_size = -s $SMART_TAR . '.gz';
my $difference_gz = $dumb_tar_gz_size - $smart_tar_gz_size;
my $percent_gz = ($difference_gz / $dumb_tar_gz_size) * 100;
my $diff_gz_k = $difference_gz / 1024;

say "$DUMB_TAR.gz  size: $dumb_tar_gz_size ";
say "$SMART_TAR.gz size: $smart_tar_gz_size";
say "difference: $difference_gz ($percent_gz %) ($diff_gz_k KiB)";
print "\n";

my $dumb_tar_bz2_size  = -s $DUMB_TAR  . '.bz2';
my $smart_tar_bz2_size = -s $SMART_TAR . '.bz2';
my $difference_bz2 = $dumb_tar_bz2_size - $smart_tar_bz2_size;
my $percent_bz2 = ($difference_bz2 / $dumb_tar_bz2_size) * 100;
my $diff_bz2_k = $difference_bz2 / 1024;

say "$DUMB_TAR.bz2  size: $dumb_tar_bz2_size ";
say "$SMART_TAR.bz2 size: $smart_tar_bz2_size";
say "difference: $difference_bz2 ($percent_bz2 %) ($diff_bz2_k KiB)";
print "\n";

my $dumb_tar_xz_size  = -s $DUMB_TAR  . '.xz';
my $smart_tar_xz_size = -s $SMART_TAR . '.xz';
my $difference_xz = $dumb_tar_xz_size - $smart_tar_xz_size;
my $percent_xz = ($difference_xz / $dumb_tar_xz_size) * 100;
my $diff_xz_k = $difference_xz / 1024;

say "$DUMB_TAR.xz  size: $dumb_tar_xz_size ";
say "$SMART_TAR.xz size: $smart_tar_xz_size";
say "difference: $difference_xz ($percent_xz %) ($diff_xz_k KiB)";

exit;


Output:
Starting test.
Making C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar.
Making C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar.

Compressing C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar with gzip.
Compressing C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar with gzip.
Compressing C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar with bzip2.
Compressing C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar with bzip2.
Compressing C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar with xz.
Compressing C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar with xz.

C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar.gz  size: 14885143
C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar.gz size: 14892370
difference: -7227 (-0.0485517673562155 %) (-7.0576171875 KiB)

C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar.bz2  size: 12011773
C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar.bz2 size: 12073770
difference: -61997 (-0.516135294931065 %) (-60.5439453125 KiB)

C:\Users\Curtis\Desktop\tartest\perl-5.12.1-dumb.tar.xz  size: 9280004
C:\Users\Curtis\Desktop\tartest\perl-5.12.1-smart.tar.xz size: 9260888
difference: 19116 (0.205991290520995 %) (18.66796875 KiB)
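
The percentages in that output can be re-derived from the byte counts alone. This is only a small sketch: the sizes are copied from the output above, and the helper name `savings` is made up for illustration.

```perl
use strict;
use warnings;

# Byte counts copied from the output above: [dumb, smart] per compressor.
my %sizes = (
    gz  => [ 14885143, 14892370 ],
    bz2 => [ 12011773, 12073770 ],
    xz  => [  9280004,  9260888 ],
);

# Hypothetical helper. Positive numbers mean the sorted ("smart")
# archive came out smaller than the unsorted ("dumb") one.
sub savings {
    my ($dumb, $smart) = @_;
    my $diff = $dumb - $smart;
    return ($diff, 100 * $diff / $dumb);
}

for my $ext (qw(gz bz2 xz)) {
    my ($diff, $percent) = savings(@{ $sizes{$ext} });
    printf "%-3s %7d bytes (%+.3f %%)\n", $ext, $diff, $percent;
}
```

A negative difference means the sorted archive compressed to a larger file.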


So in the case of the Perl source code, trying to optimize the .tar file by grouping files with the same extension together (as a rough proxy for similar content) doesn't gain enough to worry about in the .xz case (19,116 bytes, about 0.2% smaller), and makes a file that compresses WORSE for .gz (7,227 bytes, about 0.05% larger) and .bz2 (61,997 bytes, about 0.5% larger).
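
To make the grouping idea concrete, here is a minimal sketch of an extension-first ordering on a tiny file list. The paths and the `ext_key` helper are made up for illustration, and the key is a simplification: it uses the plain extension, whereas the script above compares reversed extension strings and treats a name without a dot as its own group.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical file list; these paths are illustrative only.
my @files = qw(
    perl-5.12.1/lib/strict.pm
    perl-5.12.1/README
    perl-5.12.1/perl.h
    perl-5.12.1/t/base/if.t
    perl-5.12.1/lib/warnings.pm
    perl-5.12.1/perl.c
);

# Hypothetical sort key: the text after the last dot in the file name,
# or the empty string when the name has no dot.
sub ext_key {
    my ($path) = @_;
    my ($name) = fileparse($path);
    return $name =~ /[.]([^.]+)\z/ ? $1 : q{};
}

# Schwartzian transform: compute each key once, then sort by
# extension and break ties on the full path.
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] }
    map  { [ ext_key($_), $_ ] } @files;

print "$_\n" for @sorted;
```

With this list, README sorts first (empty key), followed by the .c, .h, .pm and .t files, so files of the same type end up next to each other in the archive.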

And just as an aside, the .tar files are exactly the same size.
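
That is what you would expect: tar stores each member as a header block followed by its data padded out to a block boundary, so the order of the members should not change the total. A quick way to check this, sketched here with made-up in-memory files (the names, contents, and the `tar_size` helper are all assumptions for illustration):

```perl
use strict;
use warnings;
use Archive::Tar;

# Hypothetical members; the names and contents are made up.
my %files = (
    'lib/Foo.pm' => "package Foo;\n1;\n" x 50,
    't/foo.t'    => "use Test::More;\n" x 20,
    'README'     => "Hypothetical readme.\n",
);

# Build an in-memory archive with the members in the given order
# and return its size in bytes. write() without a file name
# returns the archive as a string.
sub tar_size {
    my @names = @_;
    my $tar = Archive::Tar->new();
    $tar->add_data($_, $files{$_}) for @names;
    return length $tar->write();
}

my $size_a = tar_size(sort keys %files);
my $size_b = tar_size(reverse sort keys %files);
print "order A: $size_a bytes, order B: $size_b bytes\n";
```

Both orders should report the same size; only the compressed output can differ.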

(Speaking of which, .xz is obviously optimized for decompression: it took about 90 seconds to compress each file, while .bz2 and .gz took 15 seconds each.)
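
If you want to measure that rather than eyeball it, something like the following would do. It is only a sketch under a few assumptions: the input is a made-up in-memory buffer instead of the real .tar files, `time_compressor` is a hypothetical helper, and xz is left out because IO::Compress::Xz is not a core module (it could be added the same way).

```perl
use strict;
use warnings;
use Time::HiRes ();
use IO::Compress::Gzip  qw(gzip  $GzipError);
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);

# Made-up, highly repetitive sample input; real archives will
# give very different timings and ratios.
my $input = join q{}, map { "line $_ of hypothetical sample data\n" } 1 .. 50_000;

# Hypothetical helper: run a compressor on the sample input and
# return [elapsed seconds, compressed size in bytes].
sub time_compressor {
    my ($code) = @_;
    my $output;
    my $start = Time::HiRes::time();
    $code->(\$input, \$output);
    return [ Time::HiRes::time() - $start, length $output ];
}

my %results = (
    gzip  => time_compressor(sub { gzip($_[0]  => $_[1], -Level => 9)        or die $GzipError }),
    bzip2 => time_compressor(sub { bzip2($_[0] => $_[1], BlockSize100K => 9) or die $Bzip2Error }),
);

for my $name (sort keys %results) {
    my ($seconds, $bytes) = @{ $results{$name} };
    printf "%-5s %.3f s, %d bytes\n", $name, $seconds, $bytes;
}
```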

Of course, your mileage may vary, based on your own test data.

I should actually try this with an extracted Strawberry Perl (in order to have more binary data) and see what happens.