Note: no guarantees whatsoever. If you want to try the exercise, I have the original table of genomes with their sizes and GC content at figshare; download at your own risk, etc., etc. I just hope that these soliloquies might inspire bioinformatics newbies.

This is the fourth post in the series. In the previous one I showed, in a kinda-real-life way, how to write a program in Perl to count the arginines and lysines in each replicon available in my database. With those counts, the program also calculated the relative proportions of each of these amino acids, and added both the counts and the proportions to the genome table that already had the GC content precalculated. We would now be ready to answer the original question: do the proportions of these amino acids change with GC content? Although this is an exercise I added to the lab in my bioinformatics course, the question originated from my skepticism that this would be true, after reading something to that effect in some article whose title I don't remember (de cuyo nombre no puedo acordarme).

So, for starters, I will show you how I calculated these relationships. I could proceed directly with the whole list of replicons, but I wanted to have results before the students did, so that I could mark their work quickly. Therefore, I needed at least three sets of calculations:

  1. calculations with the subset of ten genomes, as presented to the students
  2. calculations with the 2 Mbp to 4 Mbp subset, in case some student decided to take on the challenge of doing the exercise with all of those genomes
  3. calculations with the whole list, just to satisfy my curiosity completely, and perhaps to see whether smallish genomes do indeed introduce some noise into the exercise.

Well, since I already had a program, from the second post in this series, that selects a subset of ten genomes between 2 Mbp and 4 Mbp in size, I modified it to work with the table containing the added values (counts and proportions). Here is the original program (“choose-ten-genomes-v2.pl”), for reference:

#!/usr/bin/env perl
## the line above tells the shell that the program that follows
## should be interpreted with perl

## set a few things that we want in order to choose genomes.
## I want ten genomes:
my $subset_number = 10;
## I want them to be at least 2 Mbp in length
my $min_length = 2e6;
## I want them to be a max of 4Mbp in length
my $max_length = 4e6;
## yes, perl understands that kind of scientific notation, isn't that cool?

## now, because I know that I will open the file, figure out the
## range of GC content, and filter out genomes outside the sizes I
## decided above, yet come back to the data to choose only ten genomes
## within some range of GC content, I should, at the very least, keep
## the GC content associated with each genome's name. Therefore, I need
## a hash variable (well, no, perl would create the variable "on the
## fly," but it's good practice to declare before using; it makes your
## programs easier to maintain). Anyway, here is the hash for GC content,
## we will call it:
my %gc_content = ();
## Ain't I smart?

## because we will be choosing a subset, as a measure of caution,
## we will count the number of genomes read, and the chosen ones:
my $total_genomes = 0;
my $chosen_by_size = 0;
## now we open the table, which is, by the way, compressed with bzip2:
open( my $TABLE,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$TABLE>) {
    next if( m{^Replicon} ); # jump the heading
    ## by default, "split" splits a line at whitespace, such as tabs;
    ## we assign each resulting field to a variable
    my($genome,$gc,$at,$gc_content,$genome_length) = split;
    $total_genomes++;
    if( $genome_length >= $min_length
            && $genome_length <= $max_length ) {
        $gc_content{"$genome"} = $gc_content;
        $chosen_by_size++;
    }
}
close($TABLE);

## now we should have a subset of genomes associated to their GC content
## let's make sure by printing the number of chosen genomes, the total,
## and the number of associations in the gc_content hash
print "  Chosen by size: ",$chosen_by_size," out of ",$total_genomes,"\n";
my $count_gc_assocs = keys %gc_content;
print "  Associated to GC: ",$count_gc_assocs,"\n";

## find smallest and largest GC content:
my @small2large_gc = sort { $a <=> $b } values %gc_content;
print "  Smallest GC cont:  ",$small2large_gc[0],"\n";
print "  Largest GC cont:  ",$small2large_gc[-1],"\n";

## calculate GC intervals:
my $interval_size
    = ( $small2large_gc[-1] - $small2large_gc[0] ) / ( $subset_number - 1 );
print "  Interval size = ",$interval_size,"\n";

my $current_value = $small2large_gc[0];
my @intervals = ();
push(@intervals,$current_value);
while( $current_value <= $small2large_gc[-1] ) {
    $current_value += $interval_size;
    push(@intervals,$current_value);
}

## let's make sure that we have them right:
print "  Intervals: ",join(", ",@intervals),"\n";

## find values of GC content within a 0.01 range from each interval
my $int_range = 0.01;
for my $interval ( @intervals ) {
    my $low  = $interval - $int_range;
    my $high = $interval + $int_range;
    my @within_range
        = grep { $gc_content{"$_"} >= $low && $gc_content{"$_"} <= $high }
            keys %gc_content;
    my %genomes_in_range = ();
    for my $genome ( @within_range ) {
        print $genome," = ", $gc_content{"$genome"},"\n";
        $genomes_in_range{"$genome"}++;
    }
    my @genomes2choose = keys %genomes_in_range;
    my $chosen = $genomes2choose[0];
    print "  Genome chosen ($interval) = ",$chosen," (",$gc_content{"$chosen"},")\n";
    #exit;
}
And here is the modified program, the one the “diff” below calls “build-tables4R.pl”:

#!/usr/bin/env perl
## the line above tells the shell that the program that follows
## should be interpreted with perl

## set a few things that we want in order to choose genomes.
## I want ten genomes:
my $subset_number = 10;
## I want them to be at least 2 Mbp in length
my $min_length = 2e6;
## I want them to be a max of 4Mbp in length
my $max_length = 4e6;
## yes, perl understands that kind of scientific notation, isn't that cool?

## now, because I know that I will open the file, figure out the
## range of GC content, and filter out genomes outside the sizes I
## decided above, yet come back to the data to choose only ten genomes
## within some range of GC content, I should, at the very least, keep
## the GC content associated with each genome's name. Therefore, I need
## a hash variable (well, no, perl would create the variable "on the
## fly," but it's good practice to declare before using; it makes your
## programs easier to maintain). Anyway, here is the hash for GC content,
## we will call it:
my %gc_content = ();
## Ain't I smart?
## we also want to remember each line to print after selecting
## the ten genomes for the exercise
my %genome_line = ();

## because we will be choosing a subset, as a measure of caution,
## we will count the number of genomes read, and the chosen ones:
my $total_genomes = 0;
my $chosen_by_size = 0;

## we open files to write the tables
open( my $RFULL,"|-","bzip2 -9 >DATA/GC_n_RK.R.bz2" );
open( my $R2N4,"|-","bzip2 -9 >DATA/GC_n_RK_2n4.R.bz2" );
open( my $TEN,">","DATA/GC_n_RK_ten.R" );
## (no excuse to compress a file with ten genomes)


## now we open the table, which is, by the way, compressed with bzip2:
open( my $TABLE,"-|","bzip2 -qdc DATA/GC_n_RK.tbl.bz2" );
READ_TABLE:
while(<$TABLE>) {
    if( m{^Replicon} ){
        ## erase the first word for R's sake
        s{^Replicon\s+}{};
        print {$RFULL} $_;
        print {$R2N4}  $_;
        print {$TEN}   $_;
        # now jump the heading
        next READ_TABLE;
    }
    ## by default, "split" splits a line at whitespace, such as tabs;
    ## we assign each resulting field to a variable
    my($genome,$genome_length,$gc,$at,$gc_content) = split;
    print {$RFULL} $_;
    $total_genomes++;
    if( $genome_length >= $min_length
            && $genome_length <= $max_length ) {
        print {$R2N4}  $_;
        $gc_content{"$genome"} = $gc_content;
        $genome_line{"$genome"} = $_;
        $chosen_by_size++;
    }
}
close($TABLE);
close($RFULL);
close($R2N4);

## now we should have a subset of genomes associated to their GC content
## let's make sure by printing the number of chosen genomes, the total,
## and the number of associations in the gc_content hash
print "  Chosen by size: ",$chosen_by_size," out of ",$total_genomes,"\n";
my $count_gc_assocs = keys %gc_content;
print "  Associated to GC: ",$count_gc_assocs,"\n";

## find smallest and largest GC content:
my @small2large_gc = sort { $a <=> $b } values %gc_content;
print "  Smallest GC cont:  ",$small2large_gc[0],"\n";
print "  Largest GC cont:  ",$small2large_gc[-1],"\n";

## calculate GC intervals:
my $interval_size
    = ( $small2large_gc[-1] - $small2large_gc[0] ) / ( $subset_number - 1 );
print "  Interval size = ",$interval_size,"\n";

my $current_value = $small2large_gc[0];
my @intervals = ();
push(@intervals,$current_value);
while( $current_value <= $small2large_gc[-1] ) {
    $current_value += $interval_size;
    push(@intervals,$current_value);
}

## let's make sure that we have them right:
print "  Intervals: ",join(", ",@intervals),"\n";

## find values of GC content within a 0.01 range from each interval
my $int_range = 0.01;
for my $interval ( @intervals ) {
    my $low  = $interval - $int_range;
    my $high = $interval + $int_range;
    my @within_range
        = grep { $gc_content{"$_"} >= $low && $gc_content{"$_"} <= $high }
            keys %gc_content;
    my %genomes_in_range = ();
    for my $genome ( @within_range ) {
        print $genome," = ", $gc_content{"$genome"},"\n";
        $genomes_in_range{"$genome"}++;
    }
    my @genomes2choose = keys %genomes_in_range;
    my $chosen = $genomes2choose[0];
    print {$TEN} $genome_line{"$chosen"};
    print "  Genome chosen ($interval) = ",$chosen," (",$gc_content{"$chosen"},")\n";
    #exit;
}
close($TEN);

Another way to look at the differences is to use the classic UNIX command “diff”:

% diff choose-ten-genomes-v2.pl build-tables4R.pl
24a25,27
> ## we also want to remember each line to print after selecting
> ## the ten genomes for the exercise
> my %genome_line = ();
29a33,40
> 
> ## we open files to write the tables
> open( my $RFULL,"|-","bzip2 -9 >DATA/GC_n_RK.R.bz2" );
> open( my $R2N4,"|-","bzip2 -9 >DATA/GC_n_RK_2n4.R.bz2" );
> open( my $TEN,">","DATA/GC_n_RK_ten.R" );
> ## (no excuse to compress a file with ten genomes)
> 
> 
31c42,43
< open( my $TABLE,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
---
> open( my $TABLE,"-|","bzip2 -qdc DATA/GC_n_RK.tbl.bz2" );
> READ_TABLE:
33c45,53
<     next if( m{^Replicon} ); # jump the heading
---
>     if( m{^Replicon} ){
>         ## erase the first word for R's sake
>         s{^Replicon\s+}{};
>         print {$RFULL} $_;
>         print {$R2N4}  $_;
>         print {$TEN}   $_;
>         # now jump the heading
>         next READ_TABLE;
>     }
36c56,57
<     my($genome,$gc,$at,$gc_content,$genome_length) = split;
---
>     my($genome,$genome_length,$gc,$at,$gc_content) = split;
>     print {$RFULL} $_;
39a61
>         print {$R2N4}  $_;
40a63
>         $genome_line{"$genome"} = $_;
44a68,69
> close($RFULL);
> close($R2N4);
88a114
>     print {$TEN} $genome_line{"$chosen"};
91a118
> close($TEN);
%

Actually, I used the output of this “diff” command to know which lines to highlight when presenting the differences between the two programs above. The differences make the “build-tables4R.pl” program read a different file as input, so that we now have the arginine and lysine calculations. They also make the program write the tables (three of them: the full table, the subset of genomes between 2 Mbp and 4 Mbp in length, and the ten genomes) into files that are easily imported into R. For example, you might notice that I eliminated the word “Replicon” from the first field of the heading. This makes R understand that the first line is to be used as column names, while the first column should be used as row names, thus saving us some complications, which I will not explain this time, when reading the tables into R.
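
In case that trick is not obvious, here is a minimal sketch (with made-up toy values, not data from the exercise) of how R's read.table() behaves when the header line has one field fewer than the data lines: it switches the header on automatically and takes the first column as row names.

toy = read.table(text = "Length GC
genome_A 100 60
genome_B 200 90")
rownames(toy)          ## "genome_A" "genome_B"
toy["genome_B", "GC"]  ## 90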

I then ran the program and checked the resulting tables:

% ls DATA/
GC_n_RK.tbl.bz2      GC_table.tbl.bz2     subset.by_program
GC_table.sorted.bz2  SUBSET
%
% ./build-tables4R.pl |tail -5
M_luteus_NCTC2665 = 0.7300
I_variabilis_225 = 0.7389
C_gilvus_ATCC13127 = 0.7381
P_mikurensis_NBRC102666 = 0.7329
  Genome chosen (0.7389) = I_variabilis_225 (0.7389)
%
% ls DATA/
GC_n_RK.R.bz2        GC_n_RK_ten.R        SUBSET
GC_n_RK.tbl.bz2      GC_table.sorted.bz2  subset.by_program
GC_n_RK_2n4.R.bz2    GC_table.tbl.bz2
%
% ls DATA/*.R*
DATA/GC_n_RK.R.bz2      DATA/GC_n_RK_2n4.R.bz2  DATA/GC_n_RK_ten.R
%
% cat DATA/GC_n_RK_ten.R
Length	GC	AT	fGC	Arg	Lys	fArg	fLys
C_botulinum_B_Eklund_17B	3800327	1045522	2754805	0.2751	31440	95985	0.246733372572101	0.753266627427899
C_lytica_DSM7489	3765936	1209276	2556660	0.3211	34394	94988	0.265832959762564	0.734167040237436
L_monocytogenes_serotype_4b_LL195	2904662	1104863	1799798	0.3804	31945	61915	0.34034732580439	0.65965267419561
Cycloclasticus_P1	2363215	992524	1370691	0.4200	31789	41370	0.434519334599981	0.565480665400019
H_walsbyi_DSM16790	3132494	1499334	1633160	0.4786	45113	18143	0.713181358290123	0.286818641709877
C_pseudotuberculosis_1002	2335113	1218741	1116372	0.5219	38467	25810	0.598456679683246	0.401543320316754
C_symbiosum_A	2045086	1173335	871520	0.5738	39575	25774	0.605594576810663	0.394405423189337
D_vulgaris_RCH1	3532052	2229947	1302105	0.6313	75794	31721	0.704962098311863	0.295037901688137
T_radiovictrix_DSM17093	3260398	2221603	1038795	0.6814	76161	20616	0.786974177748845	0.213025822251155
I_variabilis_225	3307740	2444014	863726	0.7389	78896	14963	0.840580018964617	0.159419981035383
%
% bzcat DATA/GC_n_RK.R.bz2 | wc
    2262   20357  223446
% bzcat DATA/GC_n_RK_2n4.R.bz2 | wc
     934    8405   92604
%

Good! So we have three new files; at least the one with ten genomes does have ten genomes, and the two compressed files have the expected number of rows. I could verify further, but that will do for now; if there are problems, we will see them in R. Now let us run an interactive R session and check what we wanted to check in the first place. Let's start with those ten genomes (R is started just by typing “R” at the command line; after that, the command line is R's):
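
(If you do want to verify further, a minimal sketch of the kind of sanity check I have in mind, once inside R and using the same file names as above, would be something like:)

ten = read.table("DATA/GC_n_RK_ten.R")
full = read.table(bzfile("DATA/GC_n_RK.R.bz2"))
stopifnot(nrow(ten) == 10)                         ## exactly ten genomes made it
stopifnot(all(rownames(ten) %in% rownames(full)))  ## and each of them is in the full table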

% R

R version 3.0.0 (2013-04-03) -- "Masked Marvel"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin12.3.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ten_bichos = read.table("DATA/GC_n_RK_ten.R")
> summary(ten_bichos)
     Length              GC                AT               fGC        
 Min.   :2045086   Min.   : 992524   Min.   : 863726   Min.   :0.2751  
 1st Qu.:2498577   1st Qu.:1121981   1st Qu.:1058189   1st Qu.:0.3903  
 Median :3196446   Median :1214008   Median :1336398   Median :0.5002  
 Mean   :3044702   Mean   :1513916   Mean   :1530763   Mean   :0.5022  
 3rd Qu.:3475974   3rd Qu.:2041036   3rd Qu.:1758138   3rd Qu.:0.6169  
 Max.   :3800327   Max.   :2444014   Max.   :2754805   Max.   :0.7389  
      Arg             Lys             fArg             fLys       
 Min.   :31440   Min.   :14963   Min.   :0.2467   Min.   :0.1594  
 1st Qu.:32557   1st Qu.:21906   1st Qu.:0.3639   1st Qu.:0.2889  
 Median :39021   Median :28766   Median :0.6020   Median :0.3980  
 Mean   :48357   Mean   :43128   Mean   :0.5537   Mean   :0.4463  
 3rd Qu.:68124   3rd Qu.:56779   3rd Qu.:0.7111   3rd Qu.:0.6361  
 Max.   :78896   Max.   :95985   Max.   :0.8406   Max.   :0.7533  
> cor.test(ten_bichos$fGC,ten_bichos$fArg, method="pearson")

	Pearson's product-moment correlation

data:  ten_bichos$fGC and ten_bichos$fArg
t = 8.252, df = 8, p-value = 3.491e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7822898 0.9874598
sample estimates:
      cor 
0.9459756 

> 

Do you remember how I was inspired to make this exercise? I doubted that there would be a relationship between GC content and the relative abundances of arginine and lysine! Crap, not only was I wrong, the correlation is almost perfect! Look at that! I expected some correlation I could explain away, maybe a Pearson's correlation coefficient (r, not to be mistaken for the program R) of 0.5 tops. But 0.94? You've got to be kidding me! OK, I thought, maybe this happened because I chose genomes evenly spaced in their GC content, so let's see what happens with the subset between 2 Mbp and 4 Mbp in length:

> two2four = read.table(bzfile("DATA/GC_n_RK_2n4.R.bz2"))
> summary(two2four)
     Length              GC                AT               fGC        
 Min.   :2000962   Min.   : 590411   Min.   : 640011   Min.   :0.2656  
 1st Qu.:2337730   1st Qu.: 958087   1st Qu.:1161427   1st Qu.:0.3815  
 Median :2821452   Median :1313948   Median :1370810   Median :0.4698  
 Mean   :2870742   Mean   :1402605   Mean   :1468074   Mean   :0.4831  
 3rd Qu.:3354505   3rd Qu.:1706601   3rd Qu.:1762401   3rd Qu.:0.5865  
 Max.   :3998906   Max.   :2851183   Max.   :2864327   Max.   :0.7389  
      Arg             Lys              fArg             fLys       
 Min.   :11562   Min.   :  8229   Min.   :0.2217   Min.   :0.1287  
 1st Qu.:29975   1st Qu.: 30345   1st Qu.:0.3622   1st Qu.:0.3589  
 Median :40780   Median : 42032   Median :0.4694   Median :0.5306  
 Mean   :44521   Mean   : 44322   Mean   :0.5018   Mean   :0.4982  
 3rd Qu.:54852   3rd Qu.: 57342   3rd Qu.:0.6411   3rd Qu.:0.6378  
 Max.   :98146   Max.   :107045   Max.   :0.8713   Max.   :0.7783  
> cor.test(two2four$fGC,two2four$fArg, method="pearson")

	Pearson's product-moment correlation

data:  two2four$fGC and two2four$fArg
t = 104.4614, df = 931, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9545137 0.9646429
sample estimates:
      cor 
0.9598903 

> 

What? A Pearson's r close to 0.96? OK, then, I was wrong. Even if the whole table had a lower correlation, the most it would be evidence for is that smallish genomes introduce noise, which is expected. Let's see:

> wholething = read.table(bzfile("DATA/GC_n_RK.R.bz2"))
> summary(wholething)
     Length               GC                AT               fGC        
 Min.   :    1133   Min.   :    727   Min.   :    406   Min.   :0.1354  
 1st Qu.: 1938709   1st Qu.: 772846   1st Qu.:1056489   1st Qu.:0.3754  
 Median : 2922917   Median :1401641   Median :1407570   Median :0.4631  
 Mean   : 3275883   Mean   :1685033   Mean   :1590690   Mean   :0.4768  
 3rd Qu.: 4519823   3rd Qu.:2373396   3rd Qu.:2072131   3rd Qu.:0.5885  
 Max.   :13033779   Max.   :9302956   Max.   :5433425   Max.   :0.7491  
      Arg              Lys              fArg              fLys       
 Min.   :     8   Min.   :    13   Min.   :0.09687   Min.   :0.1206  
 1st Qu.: 24218   1st Qu.: 28308   1st Qu.:0.34035   1st Qu.:0.3631  
 Median : 43511   Median : 42228   Median :0.47161   Median :0.5284  
 Mean   : 53511   Mean   : 46107   Mean   :0.49584   Mean   :0.5042  
 3rd Qu.: 75429   3rd Qu.: 58987   3rd Qu.:0.63691   3rd Qu.:0.6597  
 Max.   :325052   Max.   :171173   Max.   :0.87944   Max.   :0.9031  
> cor.test(wholething$fGC,wholething$fArg, method="pearson")

	Pearson's product-moment correlation

data:  wholething$fGC and wholething$fArg
t = 164.6014, df = 2259, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9574459 0.9638010
sample estimates:
      cor 
0.9607493 

> 

OK, almost the same result. Thus, even if smallish genomes introduce noise, it is not a lot of noise. Well, all that is left is to graph this shit:

(I first plotted the whole set to figure out the proper coordinates for displaying each graph)

> plot(wholething$fGC, wholething$fArg)
> par(mfrow=c(1,3))
> plot(ten_bichos$fGC, ten_bichos$fArg, xlim=c(0.1,0.8), ylim=c(0.1,0.9))
> plot(two2four$fGC, two2four$fArg, xlim=c(0.1,0.8), ylim=c(0.1,0.9))
> plot(wholething$fGC, wholething$fArg, xlim=c(0.1,0.8), ylim=c(0.1,0.9))
>

Wow. I am happy to inform you that these commands worked on my Mac: R opened another window (a “quartz” window) and displayed the graphs there, even though I was working with R in the terminal. 🙂
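
(As a side note, if you prefer not to save the window by hand, here is a minimal sketch for writing the same three panels straight to a PDF file; the file name and dimensions are my own choice, not part of the original session:)

pdf("GC_vs_fArg.pdf", width = 12, height = 4)  ## hypothetical file name
par(mfrow = c(1, 3))
plot(ten_bichos$fGC, ten_bichos$fArg, xlim = c(0.1, 0.8), ylim = c(0.1, 0.9))
plot(two2four$fGC, two2four$fArg, xlim = c(0.1, 0.8), ylim = c(0.1, 0.9))
plot(wholething$fGC, wholething$fArg, xlim = c(0.1, 0.8), ylim = c(0.1, 0.9))
dev.off()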

But you can't see the plots this way, so I saved what was being displayed (it was saved automatically as a PDF), and here they are for you to contemplate:

[Figure: GC, arginine and lysine. Scatter plots of fArg vs. fGC for the ten genomes, the 2 to 4 Mbp subset, and the whole set.]

Upa! OK, so I had the misfortune of selecting an outlier genome (see that dot jumping out in the ten-genome plot), and the graphs show that all the effort to select genomes within some range of sizes did nothing to reduce noise in the graphs. In any event, we have thus discovered that there is a very clear correlation between GC content and the relative abundances of arginine and lysine. I doubted it, yet here we are. Facts are facts.

Now why do we care? For one, because some people wrote that in an article and I doubted it. Then, because this indicates that these two amino acids are pretty much interchangeable. “Wait a minute, Gabo,” you might start to think, “correlation does not indicate causation.” Right. But I am inclined to think that GC comes first; I just did not expect it to have such a strong influence on amino acid composition.

Anyway, if it is true that arginines and lysines are mostly interchangeable, then we have much more to test, don't we? For example, whether homologous proteins show arginines aligned with lysines more frequently than with other amino acids. Or would this correlation still show if we took into account only homologous, or orthologous, proteins? Do other amino acids show such clear exchangeability? Do other amino acids change with GC content? All these questions should prompt us to go for a deep literature review before proceeding. The more questions we think of, the better we can figure out what kinds of articles to look for, and what kinds of analyses would be due to better understand what might be behind these results. Hey, if nobody has published this relationship, maybe we can run more tests and publish them.

I will leave you for now with those thoughts. 🙂

Have fun,
-SuperGabo
