As I work more and more with large text files, I often need to generate samples of them as test data. Here's a simple Perl program that samples files, which I cobbled together from this StackOverflow answer with a touch of Getopt::Long and POD documentation.

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;

my $sample = 0.01;  
my $help = 0;  
GetOptions("sample=f" => \$sample,
           "help"     => \$help) or pod2usage(2);

pod2usage(1) if $help;

while (<>) {
  # Keep each line with probability $sample.
  print if rand() < $sample;
}

__END__  
=head1 NAME

random-sample - randomly sample lines from file(s) to STDOUT

=head1 SYNOPSIS

random-sample [options] [file...]

=head1 OPTIONS

=over 8

=item B<-s (--sample)>

Sets the sample rate; defaults to 0.01 (1%) when unspecified.

=item B<-h (--help)>

Prints this help message and exits.

=back

=head1 DESCRIPTION

B<random-sample> reads each file specified line-by-line and randomly outputs lines according to the sample rate (default 1%).

When no files are given on the command line, it reads from STDIN.

Because each line is sampled independently at random, the output is not guaranteed to contain exactly the specified percentage of lines, though statistically the larger the file, the closer it will get.

=cut
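
If you want to convince yourself of that last point, here's a minimal sketch (separate from the script above, with arbitrary line counts) that simulates the same per-line coin flip and reports how close the kept fraction lands to a 1% target:

#!/usr/bin/env perl
# Rough simulation of the per-line coin flip used above: the larger
# the "file", the closer the kept fraction tends to land to the rate.
use strict;
use warnings;

my $rate = 0.01;                                # same default as random-sample
for my $lines (1_000, 100_000, 10_000_000) {    # arbitrary input sizes
  my $kept = 0;
  for (1 .. $lines) {
    $kept++ if rand() < $rate;
  }
  printf "%10d lines -> kept %8d (%.4f%%)\n",
    $lines, $kept, 100 * $kept / $lines;
}

The relative deviation shrinks roughly with the inverse square root of the line count, which is why bigger inputs track the requested rate more closely.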

I'm a bit rusty with Perl, but it's the language I grew up on, and with its recent 25th anniversary it just felt good to "write once, read never" yet again.