Home
Note: No guaranties whatsoever. If you want to try the exercise, I have the original table of genomes with their sizes and GC content at figshare. Click here to download (at your own risk, etc etc).

Last time I left you after I chose a set of ten genomes, kinda separated by uniform increments in GC content, genomes that would then be used by students in a lab that is part of a course in bioinformatics. I used a series of commands in UNIX finishing with a choice of genomes that I proceeded to do by hand. The work was rather pedestrian, but it helped demonstrate the power of the UNIX terminal. The power of a command shell. By the way, I use the tcsh, but bash, the default in macs and linuxes today, works too.

Now I would like to show you how to write a program in PERL for the very same task. It helps to do this because then, next year, when I teach this course again, I can run this program against a new table of genomes, and choose a different subset of genomes quickly.

So here we go. I start, as I have said before, with “touch” to create a file for my program, and chmod to make the file into an “executable,” meaning that it runs like a program by just writing the file’s name, let me show you:

% touch choose-ten-genomes.pl
% ./choose-ten-genomes.pl
./choose-ten-genomes.pl: Permission denied.
% chmod +x choose-ten-genomes.pl
% ./choose-ten-genomes.pl
%

Did you see that? Before the chmod, trying to run the file as if it’s a program gave me a “Permission denied”, after the chmod, it just returned nothing (because the file is empty). We have now a program. An empty one, but a program nonetheless.

I use emacs for programming, but I will not show you the emacs window itself, but the code. Here’s the file, and the first few lines I wrote:

#!/usr/bin/env perl
## the line above tells the shell that the program that follows
## should be interpreted with perl

## set a few things that we want in order to choose genomes.
## I want ten genomes:
my $subset_number = 10;
## I want them to be at least 2 Mbp in length
my $min_length = 2e6;
## I want them to be a max of 4Mbp in length
my $max_length = 4e6;
## yes, perl understand that kind of scientific notation, isn't it cool?

All right. Now we will be opening the file with the tab-separated table, and, as always, we have to first take a peek at it to remember the format:

% ls DATA/
GC_table.sorted.bz2  GC_table.tbl.bz2     SUBSET
% bzcat DATA/GC_table.tbl.bz2 | head -5
Replicon	GC	AT	fGC	Length
A_acidocaldarius_DSM446	1881729	1137026	0.6233	3018755
A_acidocaldarius_Tc_4_1	1922014	1202028	0.6152	3124048
A_actinomycetemcomitans_ANH9381	949230	1187575	0.4442	2136808
A_actinomycetemcomitans_D11S_1-l1	938183	1167561	0.4455	2105764
% 

Now that we remember we can add the code for opening this file to that perl program:

#!/usr/bin/env perl
## the line above tells the shell that the program that follows
## should be interpreted with perl

## set a few things that we want in order to choose genomes.
## I want ten genomes:
my $subset_number = 10;
## I want them to be at least 2 Mbp in length
my $min_length = 2e6;
## I want them to be a max of 4Mbp in length
my $max_length = 4e6;
## yes, perl understand that kind of scientific notation, isn't it cool?

## now because I know that I will I open the file, figure out the
## range of GC content, and filter out genomes out of the sizes I
## decided above, yet come back to the data to choose only ten genomes
## within sme range of GC content, I should, at the very least, learn
## the GC content as associated to a genome's name. Therefore, I need
## a hash variable (well, no, perl would create the variable "on the
## fly," but it's good practice to declare before using. Makes your
## programs easier to maintain. Anyway, here he hash for GC content,
## we will call it:
my %gc_content = ();
## Ain't I smart?

## because we will be choosing a subset, as a measure of caution,
## we will count the number of genomes read, and the chosen ones:
my $total_genomes = 0;
my $chosen_by_size = 0;
## now we open the table, which is, by the way, compressed with bzip2:
open( my $TABLE,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$TABLE>) {
    next if( m{^Replicon} ); # jump the heading
    ## by deafult, "split" splits a line at empty spaces, such as tabs
    ## we assign each splitted word to a variable
    my($genome,$gc,$at,$gc_content,$genome_length) = split;
    $total_genomes++;
    if( $genome_length >= $min_length
            && $genome_length <= $max_length ) {
        $gc_content{"$genome"} = $gc_content;
        $chosen_by_size++;
    }
}
close($TABLE);

## now we should have a subset of genomes associated to their GC content
## let's make sure by printing the number of chosen genomes, the total,
## and the number of associations in the gc_content hash
print "  Chosen by size: ",$chosen_by_size," out of ",$total_genomes,"\n";
my $count_gc_assocs = keys %gc_content;
print "  Associated to GC: ",$count_gc_assocs,"\n";

If that looks like a lot of added code, pay attention and see that most of it is comments (the lines starting with “#”), and the added code is rather just a little. OK, let’s run that to see if we have made no mistakes yet:

% ./choose-ten-genomes.pl
  Chosen by size: 933 out of 2261
  Associated to GC: 933
% 

Apparently no mistakes. The number of associations is the same as the number of chosen genomes by size, as it should, the number of chosen genomes is smaller than the total number of genomes, and the number of chosen genomes is the same as the number we had in the UNIX exercise. Nice, let us continue. Let us add code to figure out the smallest GC content and the largest:

#!/usr/bin/env perl
## the line above tells the shell that the program that follows
## should be interpreted with perl

## set a few things that we want in order to choose genomes.
## I want ten genomes:
my $subset_number = 10;
## I want them to be at least 2 Mbp in length
my $min_length = 2e6;
## I want them to be a max of 4Mbp in length
my $max_length = 4e6;
## yes, perl understand that kind of scientific notation, isn't it cool?

## now because I know that I will I open the file, figure out the
## range of GC content, and filter out genomes out of the sizes I
## decided above, yet come back to the data to choose only ten genomes
## within sme range of GC content, I should, at the very least, learn
## the GC content as associated to a genome's name. Therefore, I need
## a hash variable (well, no, perl would create the variable "on the
## fly," but it's good practice to declare before using. Makes your
## programs easier to maintain. Anyway, here he hash for GC content,
## we will call it:
my %gc_content = ();
## Ain't I smart?

## because we will be choosing a subset, as a measure of caution,
## we will count the number of genomes read, and the chosen ones:
my $total_genomes = 0;
my $chosen_by_size = 0;
## now we open the table, which is, by the way, compressed with bzip2:
open( my $TABLE,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$TABLE>) {
    next if( m{^Replicon} ); # jump the heading
    ## by deafult, "split" splits a line at empty spaces, such as tabs
    ## we assign each splitted word to a variable
    my($genome,$gc,$at,$gc_content,$genome_length) = split;
    $total_genomes++;
    if( $genome_length >= $min_length
            && $genome_length <= $max_length ) {
        $gc_content{"$genome"} = $gc_content;
        $chosen_by_size++;
    }
}
close($TABLE);

## now we should have a subset of genomes associated to their GC content
## let's make sure by printing the number of chosen genomes, the total,
## and the number of associations in the gc_content hash
print "  Chosen by size: ",$chosen_by_size," out of ",$total_genomes,"\n";
my $count_gc_assocs = keys %gc_content;
print "  Associated to GC: ",$count_gc_assocs,"\n";

## find smallest and largest GC content:
my @small2large_gc = sort { $a <=> $b } values %gc_content;
print "  Smallest GC cont:  ",$small2large_gc[0],"\n";
print "  Largest GC cont:  ",$small2large_gc[-1],"\n";

Let’s run this:

% ./choose-ten-genomes.pl
  Chosen by size: 933 out of 2261
  Associated to GC: 933
  Smallest GC cont:  0.2656
  Largest GC cont:  0.7389
% 

Nice isn’t it? You might now be thinking, shit, this guy verifies too much, then too little! Yes. I do. The message is, find ways to verify that your program is going well! Anyway, these results, again, are exactly what we had previously.

I hope I haven’t lost you so far. Now we want to make those intervals. We need nine, one less than the total genomes to be chosen by the end, because we already have the first value (the minimum), and we have to reach the last. Anyway, we therefore calculate the size of the interval, and then use that to calculate the GC contents that we want our chosen genomes to have. Kind of:

#!/usr/bin/env perl
## the line above tells the shell that the program that follows
## should be interpreted with perl

## set a few things that we want in order to choose genomes.
## I want ten genomes:
my $subset_number = 10;
## I want them to be at least 2 Mbp in length
my $min_length = 2e6;
## I want them to be a max of 4Mbp in length
my $max_length = 4e6;
## yes, perl understand that kind of scientific notation, isn't it cool?

## now because I know that I will I open the file, figure out the
## range of GC content, and filter out genomes out of the sizes I
## decided above, yet come back to the data to choose only ten genomes
## within sme range of GC content, I should, at the very least, learn
## the GC content as associated to a genome's name. Therefore, I need
## a hash variable (well, no, perl would create the variable "on the
## fly," but it's good practice to declare before using. Makes your
## programs easier to maintain. Anyway, here he hash for GC content,
## we will call it:
my %gc_content = ();
## Ain't I smart?

## because we will be choosing a subset, as a measure of caution,
## we will count the number of genomes read, and the chosen ones:
my $total_genomes = 0;
my $chosen_by_size = 0;
## now we open the table, which is, by the way, compressed with bzip2:
open( my $TABLE,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$TABLE>) {
    next if( m{^Replicon} ); # jump the heading
    ## by deafult, "split" splits a line at empty spaces, such as tabs
    ## we assign each splitted word to a variable
    my($genome,$gc,$at,$gc_content,$genome_length) = split;
    $total_genomes++;
    if( $genome_length >= $min_length
            && $genome_length <= $max_length ) {
        $gc_content{"$genome"} = $gc_content;
        $chosen_by_size++;
    }
}
close($TABLE);

## now we should have a subset of genomes associated to their GC content
## let's make sure by printing the number of chosen genomes, the total,
## and the number of associations in the gc_content hash
print "  Chosen by size: ",$chosen_by_size," out of ",$total_genomes,"\n";
my $count_gc_assocs = keys %gc_content;
print "  Associated to GC: ",$count_gc_assocs,"\n";

## find smallest and largest GC content:
my @small2large_gc = sort { $a <=> $b } values %gc_content;
print "  Smallest GC cont:  ",$small2large_gc[0],"\n";
print "  Largest GC cont:  ",$small2large_gc[-1],"\n";

## calculate GC intervals:
my $interval_size
    = ( $small2large_gc[-1] - $small2large_gc[0] ) / ( $subset_number - 1 );
print "  Interval size = ",$interval_size,"\n";

my $current_value = $small2large_gc[0];
my @intervals = ();
push(@intervals,$current_value);
while( $current_value <= $small2large_gc[-1] ) {
    $current_value += $interval_size;
    push(@intervals,$current_value);
}

## let's make sure that we have them right:
print "  Intervals: ",join(", ",@intervals),"\n";

And checking yet again:

% ./choose-ten-genomes.pl
  Chosen by size: 933 out of 2261
  Associated to GC: 933
  Smallest GC cont:  0.2656
  Largest GC cont:  0.7389
  Interval size = 0.0525888888888889
  Intervals: 0.2656, 0.318188888888889, 0.370777777777778, 0.423366666666667, 0.475955555555556, 0.528544444444444, 0.581133333333333, 0.633722222222222, 0.686311111111111, 0.7389
% 

Nice! It works! ten GC values separated by uniform GC contents.

I do not expect to find genomes whose GC content will match exactly these values (except for the smallest and largest values for obvious reasons). Therefore, to choose genomes I could use a range around those values, associate the subset of genomes within such a range to each of these intervals, and then choose one of these genomes somewhat randomly. This could be a bit of a hard task to program. So let me think … hum … I suspect that this is why I chose the ten genomes by hand in the first place … how to do this? I can’t think of anything elegant, so here I go with an unelegant code. First, I will assign a range to each interval:

#!/usr/bin/env perl
## the line above tells the shell that the program that follows
## should be interpreted with perl

## set a few things that we want in order to choose genomes.
## I want ten genomes:
my $subset_number = 10;
## I want them to be at least 2 Mbp in length
my $min_length = 2e6;
## I want them to be a max of 4Mbp in length
my $max_length = 4e6;
## yes, perl understand that kind of scientific notation, isn't it cool?

## now because I know that I will I open the file, figure out the
## range of GC content, and filter out genomes out of the sizes I
## decided above, yet come back to the data to choose only ten genomes
## within sme range of GC content, I should, at the very least, learn
## the GC content as associated to a genome's name. Therefore, I need
## a hash variable (well, no, perl would create the variable "on the
## fly," but it's good practice to declare before using. Makes your
## programs easier to maintain. Anyway, here he hash for GC content,
## we will call it:
my %gc_content = ();
## Ain't I smart?

## because we will be choosing a subset, as a measure of caution,
## we will count the number of genomes read, and the chosen ones:
my $total_genomes = 0;
my $chosen_by_size = 0;
## now we open the table, which is, by the way, compressed with bzip2:
open( my $TABLE,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$TABLE>) {
    next if( m{^Replicon} ); # jump the heading
    ## by deafult, "split" splits a line at empty spaces, such as tabs
    ## we assign each splitted word to a variable
    my($genome,$gc,$at,$gc_content,$genome_length) = split;
    $total_genomes++;
    if( $genome_length >= $min_length
            && $genome_length <= $max_length ) {
        $gc_content{"$genome"} = $gc_content;
        $chosen_by_size++;
    }
}
close($TABLE);

## now we should have a subset of genomes associated to their GC content
## let's make sure by printing the number of chosen genomes, the total,
## and the number of associations in the gc_content hash
print "  Chosen by size: ",$chosen_by_size," out of ",$total_genomes,"\n";
my $count_gc_assocs = keys %gc_content;
print "  Associated to GC: ",$count_gc_assocs,"\n";

## find smallest and largest GC content:
my @small2large_gc = sort { $a <=> $b } values %gc_content;
print "  Smallest GC cont:  ",$small2large_gc[0],"\n";
print "  Largest GC cont:  ",$small2large_gc[-1],"\n";

## calculate GC intervals:
my $interval_size
    = ( $small2large_gc[-1] - $small2large_gc[0] ) / ( $subset_number - 1 );
print "  Interval size = ",$interval_size,"\n";

my $current_value = $small2large_gc[0];
my @intervals = ();
push(@intervals,$current_value);
while( $current_value <= $small2large_gc[-1] ) {
    $current_value += $interval_size;
    push(@intervals,$current_value);
}

## let's make sure that we have them right:
print "  Intervals: ",join(", ",@intervals),"\n";

## find values of GC content within a 0.01 range from each interval
my $int_range = 0.01;
for my $interval ( @intervals ) {
    my $low  = $interval - $int_range;
    my $high = $interval + $int_range;
    my @within_range = grep { $_ >= $low && $_ <= $high } @small2large_gc;
    print "  Values within range ($low/$high): ",join(", ",@within_range),"\n";
    exit;
}

Notice the exit within the loop? I stop there just to make sure that the grep command works as I expect it to work. It does:

% ./choose-ten-genomes.pl
  Chosen by size: 933 out of 2261
  Associated to GC: 933
  Smallest GC cont:  0.2656
  Largest GC cont:  0.7389
  Interval size = 0.0525888888888889
  Intervals: 0.2656, 0.318188888888889, 0.370777777777778, 0.423366666666667, 0.475955555555556, 0.528544444444444, 0.581133333333333, 0.633722222222222, 0.686311111111111, 0.7389
  Values within range (0.2556/0.2756): 0.2656, 0.2705, 0.2706, 0.2707, 0.2715, 0.2722, 0.2736, 0.2751
% 

Because those GC content values correspond to genomes, I should be able to get the genomes. That could be accomplished by reversing the GC content hash and thus using the values within range to call for the genome. A few changes in the code as follows do the trick (all the changes occur in the last part):

#!/usr/bin/env perl
## the line above tells the shell that the program that follows
## should be interpreted with perl

## set a few things that we want in order to choose genomes.
## I want ten genomes:
my $subset_number = 10;
## I want them to be at least 2 Mbp in length
my $min_length = 2e6;
## I want them to be a max of 4Mbp in length
my $max_length = 4e6;
## yes, perl understand that kind of scientific notation, isn't it cool?

## now because I know that I will I open the file, figure out the
## range of GC content, and filter out genomes out of the sizes I
## decided above, yet come back to the data to choose only ten genomes
## within sme range of GC content, I should, at the very least, learn
## the GC content as associated to a genome's name. Therefore, I need
## a hash variable (well, no, perl would create the variable "on the
## fly," but it's good practice to declare before using. Makes your
## programs easier to maintain. Anyway, here he hash for GC content,
## we will call it:
my %gc_content = ();
## Ain't I smart?

## because we will be choosing a subset, as a measure of caution,
## we will count the number of genomes read, and the chosen ones:
my $total_genomes = 0;
my $chosen_by_size = 0;
## now we open the table, which is, by the way, compressed with bzip2:
open( my $TABLE,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$TABLE>) {
    next if( m{^Replicon} ); # jump the heading
    ## by deafult, "split" splits a line at empty spaces, such as tabs
    ## we assign each splitted word to a variable
    my($genome,$gc,$at,$gc_content,$genome_length) = split;
    $total_genomes++;
    if( $genome_length >= $min_length
            && $genome_length <= $max_length ) {
        $gc_content{"$genome"} = $gc_content;
        $chosen_by_size++;
    }
}
close($TABLE);

## now we should have a subset of genomes associated to their GC content
## let's make sure by printing the number of chosen genomes, the total,
## and the number of associations in the gc_content hash
print "  Chosen by size: ",$chosen_by_size," out of ",$total_genomes,"\n";
my $count_gc_assocs = keys %gc_content;
print "  Associated to GC: ",$count_gc_assocs,"\n";

## find smallest and largest GC content:
my @small2large_gc = sort { $a <=> $b } values %gc_content;
print "  Smallest GC cont:  ",$small2large_gc[0],"\n";
print "  Largest GC cont:  ",$small2large_gc[-1],"\n";

## calculate GC intervals:
my $interval_size
    = ( $small2large_gc[-1] - $small2large_gc[0] ) / ( $subset_number - 1 );
print "  Interval size = ",$interval_size,"\n";

my $current_value = $small2large_gc[0];
my @intervals = ();
push(@intervals,$current_value);
while( $current_value <= $small2large_gc[-1] ) {
    $current_value += $interval_size;
    push(@intervals,$current_value);
}

## let's make sure that we have them right:
print "  Intervals: ",join(", ",@intervals),"\n";

## find values of GC content within a 0.01 range from each interval
my %gc2genome = reverse %gc_content;
my $int_range = 0.01;
for my $interval ( @intervals ) {
    my $low  = $interval - $int_range;
    my $high = $interval + $int_range;
    my @within_range = grep { $_ >= $low && $_ <= $high } @small2large_gc;
    print "  Values within range ($low/$high): ",join(", ",@within_range),"\n";
    for my $within ( @within_range ) {
        my $genome = $gc2genome{"$within"};
        print $genome," = ", $gc_content{"$genome"}," = ",$within,"\n";
    }
    exit;
}

Our old friend the terminal help us see if this worked:

% ./choose-ten-genomes.pl
  Chosen by size: 933 out of 2261
  Associated to GC: 933
  Smallest GC cont:  0.2656
  Largest GC cont:  0.7389
  Interval size = 0.0525888888888889
  Intervals: 0.2656, 0.318188888888889, 0.370777777777778, 0.423366666666667, 0.475955555555556, 0.528544444444444, 0.581133333333333, 0.633722222222222, 0.686311111111111, 0.7389
  Values within range (0.2556/0.2756): 0.2656, 0.2705, 0.2706, 0.2707, 0.2715, 0.2722, 0.2736, 0.2751
Arcobacter_L = 0.2656 = 0.2656
A_butzleri_RM4018 = 0.2705 = 0.2705
B_hyodysenteriae_WA1 = 0.2706 = 0.2706
A_butzleri_ED1 = 0.2707 = 0.2707
F_nucleatum_ATCC25586 = 0.2715 = 0.2715
B_intermedia_PWS_A = 0.2722 = 0.2722
C_botulinum_E3_Alaska_E43 = 0.2736 = 0.2736
C_botulinum_B_Eklund_17B = 0.2751 = 0.2751
%

All right! Super verified. Now, for the final touch, what about we chose a genome from those within. I wanted the choosing to be somewhat random. Since hashes call keys in a somewhat random order, a hash might do the trick without much pain:

#!/usr/bin/env perl
## the line above tells the shell that the program that follows
## should be interpreted with perl

## set a few things that we want in order to choose genomes.
## I want ten genomes:
my $subset_number = 10;
## I want them to be at least 2 Mbp in length
my $min_length = 2e6;
## I want them to be a max of 4Mbp in length
my $max_length = 4e6;
## yes, perl understand that kind of scientific notation, isn't it cool?

## now because I know that I will I open the file, figure out the
## range of GC content, and filter out genomes out of the sizes I
## decided above, yet come back to the data to choose only ten genomes
## within sme range of GC content, I should, at the very least, learn
## the GC content as associated to a genome's name. Therefore, I need
## a hash variable (well, no, perl would create the variable "on the
## fly," but it's good practice to declare before using. Makes your
## programs easier to maintain. Anyway, here he hash for GC content,
## we will call it:
my %gc_content = ();
## Ain't I smart?

## because we will be choosing a subset, as a measure of caution,
## we will count the number of genomes read, and the chosen ones:
my $total_genomes = 0;
my $chosen_by_size = 0;
## now we open the table, which is, by the way, compressed with bzip2:
open( my $TABLE,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$TABLE>) {
    next if( m{^Replicon} ); # jump the heading
    ## by deafult, "split" splits a line at empty spaces, such as tabs
    ## we assign each splitted word to a variable
    my($genome,$gc,$at,$gc_content,$genome_length) = split;
    $total_genomes++;
    if( $genome_length >= $min_length
            && $genome_length <= $max_length ) {
        $gc_content{"$genome"} = $gc_content;
        $chosen_by_size++;
    }
}
close($TABLE);

## now we should have a subset of genomes associated to their GC content
## let's make sure by printing the number of chosen genomes, the total,
## and the number of associations in the gc_content hash
print "  Chosen by size: ",$chosen_by_size," out of ",$total_genomes,"\n";
my $count_gc_assocs = keys %gc_content;
print "  Associated to GC: ",$count_gc_assocs,"\n";

## find smallest and largest GC content:
my @small2large_gc = sort { $a <=> $b } values %gc_content;
print "  Smallest GC cont:  ",$small2large_gc[0],"\n";
print "  Largest GC cont:  ",$small2large_gc[-1],"\n";

## calculate GC intervals:
my $interval_size
    = ( $small2large_gc[-1] - $small2large_gc[0] ) / ( $subset_number - 1 );
print "  Interval size = ",$interval_size,"\n";

my $current_value = $small2large_gc[0];
my @intervals = ();
push(@intervals,$current_value);
while( $current_value <= $small2large_gc[-1] ) {
    $current_value += $interval_size;
    push(@intervals,$current_value);
}

## let's make sure that we have them right:
print "  Intervals: ",join(", ",@intervals),"\n";

## find values of GC content within a 0.01 range from each interval
my %gc2genome = reverse %gc_content;
my $int_range = 0.01;
for my $interval ( @intervals ) {
    my $low  = $interval - $int_range;
    my $high = $interval + $int_range;
    my @within_range = grep { $_ >= $low && $_ <= $high } @small2large_gc;
    print "  Values within range ($low/$high): ",join(", ",@within_range),"\n";
    my %genomes_in_range = ();
    for my $within ( @within_range ) {
        my $genome = $gc2genome{"$within"};
        print $genome," = ", $gc_content{"$genome"}," = ",$within,"\n";
        $genomes_in_range{"$genome"}++;
    }
    my @genomes2choose = keys %genomes_in_range;
    print "  Genome chosen = ",$genomes2choose[0],"\n";
    exit;
}

We verify:

% ./choose-ten-genomes.pl 
  Chosen by size: 933 out of 2261
  Associated to GC: 933
  Smallest GC cont:  0.2656
  Largest GC cont:  0.7389
  Interval size = 0.0525888888888889
  Intervals: 0.2656, 0.318188888888889, 0.370777777777778, 0.423366666666667, 0.475955555555556, 0.528544444444444, 0.581133333333333, 0.633722222222222, 0.686311111111111, 0.7389
  Values within range (0.2556/0.2756): 0.2656, 0.2705, 0.2706, 0.2707, 0.2715, 0.2722, 0.2736, 0.2751
Arcobacter_L = 0.2656 = 0.2656
A_butzleri_RM4018 = 0.2705 = 0.2705
B_hyodysenteriae_WA1 = 0.2706 = 0.2706
A_butzleri_ED1 = 0.2707 = 0.2707
F_nucleatum_ATCC25586 = 0.2715 = 0.2715
B_intermedia_PWS_A = 0.2722 = 0.2722
C_botulinum_E3_Alaska_E43 = 0.2736 = 0.2736
C_botulinum_B_Eklund_17B = 0.2751 = 0.2751
  Genome chosen = A_butzleri_RM4018
%

Nice! Notice that “A_butzleri_RM4018” is not the first in the list, therefore the keys are not returned in the same order as the hash was built.

OK, now let us get rid of the “exit” to see the whole process being carried out. Only, to avoid having too much stuff in the terminal, I will “comment out” some of those prints that helped us verify how things are going:

#!/usr/bin/env perl
## the line above tells the shell that the program that follows
## should be interpreted with perl

## set a few things that we want in order to choose genomes.
## I want ten genomes:
my $subset_number = 10;
## I want them to be at least 2 Mbp in length
my $min_length = 2e6;
## I want them to be a max of 4Mbp in length
my $max_length = 4e6;
## yes, perl understand that kind of scientific notation, isn't it cool?

## now because I know that I will I open the file, figure out the
## range of GC content, and filter out genomes out of the sizes I
## decided above, yet come back to the data to choose only ten genomes
## within sme range of GC content, I should, at the very least, learn
## the GC content as associated to a genome's name. Therefore, I need
## a hash variable (well, no, perl would create the variable "on the
## fly," but it's good practice to declare before using. Makes your
## programs easier to maintain. Anyway, here he hash for GC content,
## we will call it:
my %gc_content = ();
## Ain't I smart?

## because we will be choosing a subset, as a measure of caution,
## we will count the number of genomes read, and the chosen ones:
my $total_genomes = 0;
my $chosen_by_size = 0;
## now we open the table, which is, by the way, compressed with bzip2:
open( my $TABLE,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$TABLE>) {
    next if( m{^Replicon} ); # jump the heading
    ## by deafult, "split" splits a line at empty spaces, such as tabs
    ## we assign each splitted word to a variable
    my($genome,$gc,$at,$gc_content,$genome_length) = split;
    $total_genomes++;
    if( $genome_length >= $min_length
            && $genome_length <= $max_length ) {
        $gc_content{"$genome"} = $gc_content;
        $chosen_by_size++;
    }
}
close($TABLE);

## now we should have a subset of genomes associated to their GC content
## let's make sure by printing the number of chosen genomes, the total,
## and the number of associations in the gc_content hash
print "  Chosen by size: ",$chosen_by_size," out of ",$total_genomes,"\n";
my $count_gc_assocs = keys %gc_content;
print "  Associated to GC: ",$count_gc_assocs,"\n";

## find smallest and largest GC content:
my @small2large_gc = sort { $a <=> $b } values %gc_content;
print "  Smallest GC cont:  ",$small2large_gc[0],"\n";
print "  Largest GC cont:  ",$small2large_gc[-1],"\n";

## calculate GC intervals:
my $interval_size
    = ( $small2large_gc[-1] - $small2large_gc[0] ) / ( $subset_number - 1 );
print "  Interval size = ",$interval_size,"\n";

my $current_value = $small2large_gc[0];
my @intervals = ();
push(@intervals,$current_value);
while( $current_value <= $small2large_gc[-1] ) {
    $current_value += $interval_size;
    push(@intervals,$current_value);
}

## let's make sure that we have them right:
print "  Intervals: ",join(", ",@intervals),"\n";

## find values of GC content within a 0.01 range from each interval
my %gc2genome = reverse %gc_content;
my $int_range = 0.01;
for my $interval ( @intervals ) {
    my $low  = $interval - $int_range;
    my $high = $interval + $int_range;
    my @within_range = grep { $_ >= $low && $_ <= $high } @small2large_gc;
    #print "  Values within range ($low/$high): ",join(", ",@within_range),"\n";
    my %genomes_in_range = ();
    for my $within ( @within_range ) {
        my $genome = $gc2genome{"$within"};
        print $genome," = ", $gc_content{"$genome"}," = ",$within,"\n";
        $genomes_in_range{"$genome"}++;
    }
    my @genomes2choose = keys %genomes_in_range;
    print "  Genome chosen = ",$genomes2choose[0],"\n";
    #exit;
}

Verifying for the whole enchilada give us (only showing last 24 lines to show how each time the chosen genome is not predictable):

% ./choose-ten-genomes.pl | tail -24
S_thermophilus_DSM20745_1 = 0.6810 = 0.6810
T_radiovictrix_DSM17093 = 0.6814 = 0.6814
T_mobilis_KA081020_065 = 0.6815 = 0.6815
T_mobilis_KA081020_065 = 0.6815 = 0.6815
A_ferrooxidans_DSM10331 = 0.6829 = 0.6829
B_subvibrioides_ATCC15264 = 0.6842 = 0.6842
R_sphaeroides_ATCC17025 = 0.6848 = 0.6848
B_gladioli_BSR3_2 = 0.6858 = 0.6858
Thermus_CCB_US3_UF1 = 0.6861 = 0.6861
Thermus_CCB_US3_UF1 = 0.6861 = 0.6861
S_thermophilum_IAM14863 = 0.6867 = 0.6867
C_gracile_PCC6307 = 0.6871 = 0.6871
B_glumae_BGR1_2 = 0.6877 = 0.6877
B_glumae_BGR1_2 = 0.6877 = 0.6877
B_mallei_NCTC10229_II = 0.6894 = 0.6894
B_mallei_NCTC10247_II = 0.6896 = 0.6896
R_sphaeroides_241_1 = 0.6901 = 0.6901
R_sphaeroides_ATCC17029_1 = 0.6909 = 0.6909
  Genome chosen (0.686311111111111) = T_radiovictrix_DSM17093
M_luteus_NCTC2665 = 0.7300 = 0.7300
P_mikurensis_NBRC102666 = 0.7329 = 0.7329
C_gilvus_ATCC13127 = 0.7381 = 0.7381
I_variabilis_225 = 0.7389 = 0.7389
  Genome chosen (0.7389) = I_variabilis_225
% 
% ./choose-ten-genomes.pl | grep 'chosen'
  Genome chosen (0.2656) = A_butzleri_RM4018
  Genome chosen (0.318188888888889) = C_lytica_DSM7489
  Genome chosen (0.370777777777778) = L_welshimeri_6b_SLCC5334
  Genome chosen (0.423366666666667) = C_hydrogenoformans_Z2901
  Genome chosen (0.475955555555556) = H_walsbyi_DSM16790
  Genome chosen (0.528544444444444) = Synechococcus_CC9311
  Genome chosen (0.581133333333333) = C_symbiosum_A
  Genome chosen (0.633722222222222) = Novosphingobium_PP1Y
  Genome chosen (0.686311111111111) = T_radiovictrix_DSM17093
  Genome chosen (0.7389) = I_variabilis_225
% ./choose-ten-genomes.pl | grep 'chosen' | wc -l
      10
% 

Ha! Ten genomes chosen, at kind of constant intervals of GC content, now I could either redirect the output to make my list of ten genomes, or modify the program so that it will print the list. For now I will redirect. I can’t resist the power of UNIX commands, pipes, and redirection:

% ./choose-ten-genomes.pl | grep 'chosen'
  Genome chosen (0.2656) = A_butzleri_RM4018
  Genome chosen (0.318188888888889) = C_lytica_DSM7489
  Genome chosen (0.370777777777778) = L_welshimeri_6b_SLCC5334
  Genome chosen (0.423366666666667) = C_hydrogenoformans_Z2901
  Genome chosen (0.475955555555556) = H_walsbyi_DSM16790
  Genome chosen (0.528544444444444) = Synechococcus_CC9311
  Genome chosen (0.581133333333333) = C_symbiosum_A
  Genome chosen (0.633722222222222) = Novosphingobium_PP1Y
  Genome chosen (0.686311111111111) = T_radiovictrix_DSM17093
  Genome chosen (0.7389) = I_variabilis_225
% ./choose-ten-genomes.pl | grep 'chosen' | awk '{print $NF}'
A_butzleri_RM4018
C_lytica_DSM7489
L_welshimeri_6b_SLCC5334
C_hydrogenoformans_Z2901
H_walsbyi_DSM16790
Synechococcus_CC9311
C_symbiosum_A
Novosphingobium_PP1Y
T_radiovictrix_DSM17093
I_variabilis_225
% ./choose-ten-genomes.pl | grep 'chosen' | awk '{print $NF}' > DATA/subset.by_program
% wc DATA/subset.by_program 
      10      10     201 DATA/subset.by_program
% cat DATA/subset.by_program
A_butzleri_RM4018
C_lytica_DSM7489
L_welshimeri_6b_SLCC5334
C_hydrogenoformans_Z2901
H_walsbyi_DSM16790
Synechococcus_CC9311
C_symbiosum_A
Novosphingobium_PP1Y
T_radiovictrix_DSM17093
I_variabilis_225
%

Done! A list of ten genomes at somewhat constant intervals of GC content, ready for a lab exercise on GC content and the proportions of arginine and lysine. Next we count arginines, lysines, and figure out if there’s any changes in these two amino acids with GC content.

Enjoy,
-SuperGabo

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s