Entropic Evolution

Thoughts of a Computational Biologist

Little PERL programming example for biology newbies

Note: Not intended for professional programmers, but as example for non-programmers wishing to see some programming action. Also a disclaimer: No warranties whatsoever.

During an exercise in my intro course in bioinformatics I asked my students to use the following command to count the number of As in E. coli K12’s DNA sequence (yeah, the whole genome):

grep -v '>' E_coli_K12_MG1655.fna | perl -ne '$count = tr/A/A/; $total += $count; print "$total\n" if eof;'

It works, of course:

% grep -v '>' E_coli_K12_MG1655.fna | perl -ne '$count = tr/A/A/; $total += $count; print "$total\n" if eof;'
1142228
%

(Ain’t UNIX pipes just beautiful?) As part of the exercise they had to adapt the above to count G, T, C. Then I asked them to try tr/AT/AT/ and see what they would get instead (tr/GC/GC/ follows naturally). Here I am documenting how I would have written a little program in perl to do all the exercise. I expect that showing how this happens, even though not intended to show the best practices in programming, might be enough inspiration to get students with no experience programming into trying it. This is not exhaustive. I will not explain everything there is to explain about everything. I think there’s learning value in leaving a bit of stuff unexplained so that the reader finds illumination by the exercise of searching for the answers and figuring it out. My laptop is a Mac. Macs have UNIX at the bottom (or top, of back, or whatever), and have perl already installed, just as most-if-not-all linuxes. I have no idea about windows, so don’t ask me about that. I know there’s perl for windows and that’s it. :)

Now for the making of this command into a program in perl.
The first thing I do is use touch to create a file for my program, and chmod to make it into an executable:

% touch count_letters.pl
% chmod +x count_letters.pl
% ls
E_coli_K12_MG1655.faa  E_coli_K12_MG1655.fna  count_letters.pl
% emacs count_letters.pl

Because this file has the right “last name” (.pl), emacs should be able to highlight the syntax as soon as I open the file, and it does. I will not show the emacs interface, but rather the perl code as I develop it. The first step is a direct translation of the one-liner program into the program in a file:

#!/usr/bin/perl -n
$count = tr/A/A/;
$total += $count;
print $total,"\n" if eof;

That works pretty well:

% grep -v '>' E_coli_K12_MG1655.fna | ./count_letters.pl
1142228
% 

But of course, that’s not really a nice program. We can do better. For starters, we can gain a lot of control if we eliminate the “-n”, which assumes that a file is being open and its lines read, but applies each line of code to each line in the file. We want control, so we substitute the “-n” and add the necessary code to read the file. Pay attention to the comments (lines starting with a #, which are not part of the code, just my notes):

#!/usr/bin/perl
#(note above that the "-n" is no longer there)

# since we will be using a file, and just one file, we can put it's
# name into a variable for ease of handling later on. Not needed, but
# makes it easier to modify later if we want to be able to open other
# files
my $fna_file = "E_coli_K12_MG1655.fna";

# we set the total counts to zero
my $total = 0;

# now we open the file and read it with a while, and count
open( my $FNA,"<","$fna_file" );
while(<$FNA>) {
    next if( m{^>} ); # this one eliminates identifier lines!
    my $count = tr/A/A/;
    $total += $count;
}
close($FNA);

# since the file was read, As counted, and total calculated, we don't
# need the "eof" to print only the final count. We have the final
# count in memory, so just print it!
print $total,"\n";

If that looks like a lot of code notice how much is there that’s not comments, and you will see that it’s nothing. Just a few changes to the original above. And of course it works:

% ./count_letters.pl
1142228
%

See?

Now, the total and the count variables are kind of useless for the purpose of the program. We could do just as well with just one variable. So I eliminated the $total variable, declared the $count to be “0” instead, and the tr/A/A/ adds directly the number of As into the code variable ($count += tr/A/A/;), and we print “$count” instead of “$total”:

#!/usr/bin/perl
#(note above that the "-n" is no longer there)

# since we will be using a file, and just one file, we can put it's
# name into a variable for ease of handling later on. Not needed, but
# makes it easier to modify later if we want to be able to open other
# files
my $fna_file = "E_coli_K12_MG1655.fna";

# we set the total counts to zero
my $count = 0;

# now we open the file and read it with a while, and count
open( my $FNA,"<","$fna_file" );
while(<$FNA>) {
    next if( m{^>} ); # this one eliminates identifier lines!
    $count += tr/A/A/;
}
close($FNA);

# since the file was read, As counted, and total calculated, we don't
# need the "eof" to print only the final count. We have the final
# count in memory, so just print it!
print $count,"\n";

It works of course (try it yourself).

Now, we are counting only As. So I was first about to do something like this: add a variable per letter to count, and add lines as necessary. Example, the line that declared the count as zero would change to:

# we set the total counts to zero
my $count_A = 0;
my $count_G = 0;

And the lines doing the counting would look like this:

    $count_A += tr/A/A/;
    $count_G += tr/G/G/;

But then again, we have hashes in perl, which are variables that link a key to a value. Using such things the addition of other letters becomes much easier:

#!/usr/bin/perl
#(note above that the "-n" is no longer there)

# since we will be using a file, and just one file, we can put it's
# name into a variable for ease of handling later on. Not needed, but
# makes it easier to modify later if we want to be able to open other
# files
my $fna_file = "E_coli_K12_MG1655.fna";

# we set a hash for the counts
my %count_nucl = ();

# now we open the file and read it with a while, and count
open( my $FNA,"<","$fna_file" );
while(<$FNA>) {
    next if( m{^>} ); # this one eliminates identifier lines!
    $count_nucl{"A"} += tr/A/A/;
    $count_nucl{"G"} += tr/G/G/;
}
close($FNA);

# Now we are printing keys and values, so we can use a loop:
for my $nucl ( sort keys %count_nucl ) {
    print $nucl," = ",$count_nucl{"$nucl"},"\n";
}

Let’s see what it does:

% ./count_letters.pl
A = 1142228
G = 1176923
%

Nice isn’t it? Now it should be a matter of adding a tr/// line for each thing left to count, right? Like “C”, “T”, “AC”, and “GC”. But each line is kind of the same. I might want to use all these nucleotides later again, so, what about we used a list of things to count? Let’s be ambitious and make this into a more complicated program! Therefore I declared a list, and array in perl’s parlance: @nucls, and modify the counting into a loop going through the list:

#!/usr/bin/perl
#(note above that the "-n" is no longer there)

# since we will be using a file, and just one file, we can put it's
# name into a variable for ease of handling later on. Not needed, but
# makes it easier to modify later if we want to be able to open other
# files
my $fna_file = "E_coli_K12_MG1655.fna";

# we set a hash for the counts
my %count_nucl = ();

# we make an array for the nucleotides we want to count
my @nucls = qw( A T G C AT GC );

# now we open the file and read it with a while, and count
open( my $FNA,"<","$fna_file" );
while(<$FNA>) {
    next if( m{^>} ); # this one eliminates identifier lines!
    # here loop to go through the list:
    for my $nucl ( @nucls ) {
        $count_nucl{"$nucl"} += tr/$nucl/$nucl/;
    }
}
close($FNA);

# Now we are printing keys and values, so we can use a loop:
# changed from before to go in the order of the list
for my $nucl ( @nucls ) {
    print $nucl," = ",$count_nucl{"$nucl"},"\n";
}

And this, of course, doesn’t work! :(

% ./count_letters.pl
A = 0
T = 0
G = 0
C = 0
AT = 0
GC = 0
%

Oh shit!? Well, what happens is that interpolation, the art of using a variable where we normally put letters in the tr/// thingie, does not work with tr///. It works with s///. However, s/// substitutes one string for another, thus s/AT/AT/ would not count all As and all Ts, but would instead count all the ATs. Going through the web I found that tr/// would work if put into an eval (I also eliminated the “-n” comment):

#!/usr/bin/perl
# since we will be using a file, and just one file, we can put it's
# name into a variable for ease of handling later on. Not needed, but
# makes it easier to modify later if we want to be able to open other
# files
my $fna_file = "E_coli_K12_MG1655.fna";

# we set a hash for the counts
my %count_nucl = ();

# we make an array for the nucleotides we want to count
my @nucls = qw( A T G C AT GC );

# now we open the file and read it with a while, and count
open( my $FNA,"<","$fna_file" );
while(<$FNA>) {
    next if( m{^>} ); # this one eliminates identifier lines!
    # here loop to go through the list:
    for my $nucl ( @nucls ) {
        $count_nucl{"$nucl"} += eval "tr/$nucl/$nucl/";
    }
}
close($FNA);

# Now we are printing keys and values, so we can use a loop:
# changed from before to go in the order of the list
for my $nucl ( @nucls ) {
    print $nucl," = ",$count_nucl{"$nucl"},"\n";
}
% ./count_letters.pl
A = 1142228
T = 1140970
G = 1176923
C = 1179554
AT = 2283198
GC = 2356477
%

It works, but it’s horribly slow. OK then. Maybe we can instead use s///, and just add As to Ts, and Gs to Cs. or better, use eval only for the ATs and GCs! We can also add a line to calculate GC and AT content:

#!/usr/bin/perl
# since we will be using a file, and just one file, we can put it's
# name into a variable for ease of handling later on. Not needed, but
# makes it easier to modify later if we want to be able to open other
# files
my $fna_file = "E_coli_K12_MG1655.fna";

# we set a hash for the counts
my %count_nucl = ();

# we make an array for the nucleotides we want to count
my @nucls = qw( A T G C AT GC );

# now we open the file and read it with a while, and count
open( my $FNA,"<","$fna_file" );
while(<$FNA>) {
    next if( m{^>} ); # this one eliminates identifier lines!
    # here loop to go through the list:
    for my $nucl ( @nucls ) {
        if( length("$nucl") > 1 ) {
            $count_nucl{"$nucl"} += eval "tr/$nucl/$nucl/";
        }
        else {
            $count_nucl{"$nucl"} += s/$nucl/$nucl/g;
        }
    }
}
close($FNA);

# Now we are printing keys and values, so we can use a loop:
# changed from before to go in the order of the list
for my $nucl ( @nucls ) {
    print $nucl," = ",$count_nucl{"$nucl"},"\n";
}

# calculate and print GC and AT content:
my $GC_cont = $count_nucl{"GC"}/($count_nucl{"GC"}+$count_nucl{"AT"});
my $AT_cont = $count_nucl{"AT"}/($count_nucl{"GC"}+$count_nucl{"AT"});
print "GC content = ",sprintf("%.2f",$GC_cont),"\n";
print "AT content = ",sprintf("%.2f",$AT_cont),"\n";

It works!

% ./count_letters.pl
A = 1142228
T = 1140970
G = 1176923
C = 1179554
AT = 2283198
GC = 2356477
GC content = 0.51
AT content = 0.49
%

It is not much faster though. Maybe it’s a mistake to use that eval. Perhaps it’s better to count only the four nucleotides and solve everything else with simple arithmetic. OK, let’s try that for the sake of sanity:

#!/usr/bin/perl
# since we will be using a file, and just one file, we can put it's
# name into a variable for ease of handling later on. Not needed, but
# makes it easier to modify later if we want to be able to open other
# files
my $fna_file = "E_coli_K12_MG1655.fna";

# we set a hash for the counts
my %count_nucl = ();

# we make an array for the nucleotides we want to count
my @nucls = qw( A T G C ); # note that I eliminated AT and GC from the list

# now we open the file and read it with a while, and count
open( my $FNA,"<","$fna_file" );
while(<$FNA>) {
    next if( m{^>} ); # this one eliminates identifier lines!
    # here loop to go through the list:
    for my $nucl ( @nucls ) {
        $count_nucl{"$nucl"} += s/$nucl/$nucl/g;
    }
}
close($FNA);

# Now we are printing keys and values, so we can use a loop:
# changed from before to go in the order of the list
for my $nucl ( @nucls ) {
    print $nucl," = ",$count_nucl{"$nucl"},"\n";
}

# calculate and print GC and AT contents:
my $GC = $count_nucl{"G"} + $count_nucl{"C"};
my $AT = $count_nucl{"A"} + $count_nucl{"T"};
my $GC_cont = $GC/($GC + $AT);
my $AT_cont = $AT/($GC + $AT);
print "GC = ",$GC,"\n";
print "AT = ",$AT,"\n";
print "GC content = ",sprintf("%.2f",$GC_cont),"\n";
print "AT content = ",sprintf("%.2f",$AT_cont),"\n";

It works again:

% ./count_letters.pl
A = 1142228
T = 1140970
G = 1176923
C = 1179554
AT = 2283198
GC = 2356477
GC content = 0.51
AT content = 0.49
%

This is indistinguishable from above. Slightly faster. It eliminates the “eval”, which seems to be far from recommended, and it works. With that this example is finished. I hope you find the exercise illuminating!

About these ads

2 comments on “Little PERL programming example for biology newbies

  1. camel
    March 23, 2013

    open( my $FNA,”<","$fna_file" );

    why "$fna_file" ?

    open( my $FNA, '<', $fna_file);
    is fine. :)

    • gabo
      March 23, 2013

      Well, three reasons: I did not know that variables containing file names work without quotes (they have not worked without quotes for me, for example in rename, and other such commands). Not using the quotes has caused me trouble in the past, and it’s explicit in the sense that it makes it clear that the variable is interpolated. Thanks! :)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Information

This entry was posted on March 22, 2013 by in Bioinformatics, Education, General Science, Work Habits and tagged , , , .

Twitter Updates

  • Tiny bookstore at Mexican little mall. Oparin's origin of life for 32 pesos! (brand new edition, in Spanish) 3 days ago
  • @GaboTuitero ¿Se traduce aquí desarrollo como evolución? 4 days ago
  • Waking up to a rooster 1 week ago
  • RT @mongodbinc: Want to develop an app? Learn how with MongoDB. Free, online professional training at MongoDB University http://t.co/c29N7K… 2 weeks ago
  • A lot of the work in computational biology is figuring out where to find the data and how to work with it. Just do it! 2 weeks ago
Follow

Get every new post delivered to your Inbox.

%d bloggers like this: