Home
Note: Not intended for professional programmers, but as example for non-programmers wishing to see some programming action. Also a disclaimer: No warranties whatsoever.

In the little perl programming example for newbies I documented before, I did not make any tests for how quickly the program ran. Here I offer a modified example keeping tr/// to show that sometimes it is not that convenient to do loops when repeating almost identical commands. It might look smart, but if you need to run lots of these kinds of programs, speed might be something to consider. So, here the last version of the program whose development I followed last time, plus a version deriving from the first point before I introduced the @nucl array (differences are highlighted):

From prior post (which I call the s/// version):

#!/usr/bin/perl
# since we will be using a file, and just one file, we can put it's
# name into a variable for ease of handling later on. Not needed, but
# makes it easier to modify later if we want to be able to open other
# files
my $fna_file = "E_coli_K12_MG1655.fna";

# we set a hash for the counts
my %count_nucl = ();

# we make an array for the nucleotides we want to count
my @nucls = qw( A T G C ); # note that I eliminated AT and GC from the list

# now we open the file and read it with a while, and count
open( my $FNA,"<","$fna_file" );
while(<$FNA>) {
    next if( m{^>} ); # this one eliminates identifier lines!
    # here loop to go through the list:
    for my $nucl ( @nucls ) {
        $count_nucl{"$nucl"} += s/$nucl/$nucl/g;
    }
}
close($FNA);

# Now we use a loop to print keys and values
for my $nucl ( @nucls ) {
    print $nucl," = ",$count_nucl{"$nucl"},"\n";
}

# calculate and print GC and AT contents:
my $GC = $count_nucl{"G"} + $count_nucl{"C"};
my $AT = $count_nucl{"A"} + $count_nucl{"T"};
my $GC_cont = $GC/($GC + $AT);
my $AT_cont = $AT/($GC + $AT);
print "GC = ",$GC,"\n";
print "AT = ",$AT,"\n";
print "GC content = ",sprintf("%.2f",$GC_cont),"\n";
print "AT content = ",sprintf("%.2f",$AT_cont),"\n";

New program (tr/// version):

#!/usr/bin/perl
# since we will be using a file, and just one file, we can put it's
# name into a variable for ease of handling later on. Not needed, but
# makes it easier to modify later if we want to be able to open other
# files
my $fna_file = "E_coli_K12_MG1655.fna";

# we set a hash for the counts
my %count_nucl = ();

# now we open the file and read it with a while, and count
open( my $FNA,"<","$fna_file" );
while(<$FNA>) {
    next if( m{^>} ); # this one eliminates identifier lines!
    # count each nucleotide using specific tr///:
    $count_nucl{"A"} += tr/A/A/;
    $count_nucl{"T"} += tr/T/T/;
    $count_nucl{"G"} += tr/G/G/;
    $count_nucl{"C"} += tr/C/C/;
    $count_nucl{"GC"} += tr/GC/GC/;
    $count_nucl{"AT"} += tr/AT/AT/;
}
close($FNA);

# Now we use a loop to print keys and values
for my $nucl ( sort keys %count_nucl ) {
    print $nucl," = ",$count_nucl{"$nucl"},"\n";
}

# calculate and print GC and AT contents:
my $GC = $count_nucl{"GC"};
my $AT = $count_nucl{"AT"};
my $GC_cont = $GC/($GC + $AT);
my $AT_cont = $AT/($GC + $AT);
print "GC = ",$GC,"\n";
print "AT = ",$AT,"\n";
print "GC content = ",sprintf("%.2f",$GC_cont),"\n";
print "AT content = ",sprintf("%.2f",$AT_cont),"\n";

As I said, differences are highlighted (they look greyish in my browser). Under UNIX, one way to figure out the differences would be using the diff command:

% diff count_letters-s.pl count_letters-tr.pl 
11,13d10
< # we make an array for the nucleotides we want to count
< my @nucls = qw( A T G C ); # note that I eliminated AT and GC from the list
< 
18,21c15,21
<     # here loop to go through the list:
<     for my $nucl ( @nucls ) {
<         $count_nucl{"$nucl"} += s/$nucl/$nucl/g;
<     }
---
>     # count each nucleotide using specific tr///:
>     $count_nucl{"A"} += tr/A/A/;
>     $count_nucl{"T"} += tr/T/T/;
>     $count_nucl{"G"} += tr/G/G/;
>     $count_nucl{"C"} += tr/C/C/;
>     $count_nucl{"GC"} += tr/GC/GC/;
>     $count_nucl{"AT"} += tr/AT/AT/;
26c26
< for my $nucl ( @nucls ) {
---
> for my $nucl ( sort keys %count_nucl ) {
31,32c31,32
< my $GC = $count_nucl{"G"} + $count_nucl{"C"};
< my $AT = $count_nucl{"A"} + $count_nucl{"T"};
---
> my $GC = $count_nucl{"GC"};
> my $AT = $count_nucl{"AT"};
%

(I don’t know if the color is fine in the bash above, there’s no coloring of the diff results in the terminal, but looks so nice I can’t decide to turn highlighting off.)

Well, we want to test how fast these programs run compared to each other. So here it goes:

% time ./count_letters-s.pl
A = 1142228
T = 1140970
G = 1176923
C = 1179554
GC = 2356477
AT = 2283198
GC content = 0.51
AT content = 0.49
2.237u 0.007s 0:02.24 99.5%	0+0k 0+9io 0pf+0w
%
% time ./count_letters-tr.pl
A = 1142228
AT = 2283198
C = 1179554
G = 1176923
GC = 2356477
T = 1140970
GC = 2356477
AT = 2283198
GC content = 0.51
AT content = 0.49
0.149u 0.005s 0:00.15 93.3%	0+0k 0+0io 0pf+0w
%

(The time reports are in default format under the tcsh.) Yes. The tr/// program runs much faster than the s/// program. Why? Well, tr/// compiles only once, while s/// compiles each time. There’s ways to make s/// compile only once, but that only works if the values to substitute won’t change. But these change because the s/// is part of a loop that changes the value each time.

The last lines in the tr/// program put the hash values for GC and AT into variables just for the sake of visual cleanliness, with a second printing of these values to make sure that they were correctly associated (or maybe I wasn’t in the mood to change all the last lines from the previous version :)). GC content could just as well have been calculated as:

my $GC_cont = $count_nucl{"GC"}/($count_nucl{"GC"} + $count_nucl{"AT"});

Up to you. After all, “there’s more than one way to do it.”

As before, I left a few things unexplained for you to figure out. I think most of this can be deduced from the context, but no harm in checking things out.

Enjoy!

P.S. In case you have no idea what the “E_coli_K12_MG1655.fna” file looks like, it comes in fasta format, and here you can see a few lines:

>gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655 chromosome, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT
GCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC
TGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGT
TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT
GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCG
CCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGC
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s