Note: No guarantees whatsoever. If you want to try the exercise, I have the original table of genomes with their sizes and GC content at figshare. Click here to download (at your own risk, etc., etc.). I just hope that these soliloquies might inspire bioinformatics newbies.
This is the fourth in the series. In the previous instalment I showed, in kinda-real-life fashion, how to write a perl program to count the arginines and lysines in each replicon available in my database. With those counts, the program also calculated the relative proportion of each of these amino acids, and added the counts and proportions to the genome table that already had GC content precalculated. We are now ready to answer the original question: do the proportions of these amino acids change with GC content? Although this is an exercise I added to the lab in my bioinformatics course, the question originated out of my skepticism that this would be true after reading something to that effect in some article whose title I don't remember (whose name I cannot bring myself to recall).
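As a quick reminder of what those added columns contain: each proportion is the count of that amino acid divided by the combined arginine-plus-lysine count, so fArg and fLys always sum to 1. R only shows up later in this post, but since this is plain arithmetic, here is a minimal R sketch using the counts of two replicons taken from the ten-genome table shown further below:

## restatement of the proportion calculation from the previous post:
## fArg = Arg / (Arg + Lys) and fLys = Lys / (Arg + Lys)
arg <- c(31440, 78896)   # arginine counts (C_botulinum_B_Eklund_17B, I_variabilis_225)
lys <- c(95985, 14963)   # lysine counts for the same two replicons
fArg <- arg / (arg + lys)
fLys <- lys / (arg + lys)
print(fArg)              # 0.2467334 0.8405800, matching the table below
stopifnot(all.equal(fArg + fLys, c(1, 1)))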
So, for starters, I will show you how I calculated these relationships. I could have gone directly to the whole list of replicons, but I wanted to have results before the students did, so that I could mark their work quickly. Therefore, I needed at least three calculations:
- calculations with a subset of ten genomes as presented to the students
- calculations with the full 2 Mbp to 4 Mbp subset, in case some student decided to take on the challenge of doing the exercise with all of those genomes
- calculations with the whole list, just to satisfy my curiosity completely, and perhaps see if smallish genomes do indeed introduce some noise into the exercise.
Well, since I already had a program, from the second instalment in this series, that would first select a subset of genomes between 2 Mbp and 4 Mbp in size, I modified a similar program (choose-ten-genomes-v2.pl, shown first below) into one that works with the table carrying the added values (counts and proportions), build-tables4R.pl (shown second):
#!/usr/bin/env perl
## the line above tells the shell that the program that follows
## should be interpreted with perl

## set a few things that we want in order to choose genomes.
## I want ten genomes:
my $subset_number = 10;
## I want them to be at least 2 Mbp in length
my $min_length = 2e6;
## I want them to be a max of 4 Mbp in length
my $max_length = 4e6;
## yes, perl understands that kind of scientific notation, isn't it cool?

## now, because I know that I will open the file, figure out the
## range of GC content, and filter out genomes outside the sizes I
## decided above, yet come back to the data to choose only ten genomes
## within some range of GC content, I should, at the very least, keep
## the GC content associated to each genome's name. Therefore, I need
## a hash variable (well, no, perl would create the variable "on the
## fly," but it's good practice to declare before using. Makes your
## programs easier to maintain). Anyway, here is the hash for GC content,
## we will call it:
my %gc_content = ();
## Ain't I smart?

## because we will be choosing a subset, as a measure of caution,
## we will count the number of genomes read, and the chosen ones:
my $total_genomes  = 0;
my $chosen_by_size = 0;

## now we open the table, which is, by the way, compressed with bzip2:
open( my $TABLE,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$TABLE>) {
    next if( m{^Replicon} ); # jump the heading
    ## by default, "split" splits a line at empty spaces, such as tabs
    ## we assign each split word to a variable
    my($genome,$gc,$at,$gc_content,$genome_length) = split;
    $total_genomes++;
    if( $genome_length >= $min_length && $genome_length <= $max_length ) {
        $gc_content{"$genome"} = $gc_content;
        $chosen_by_size++;
    }
}
close($TABLE);

## now we should have a subset of genomes associated to their GC content.
## let's make sure by printing the number of chosen genomes, the total,
## and the number of associations in the gc_content hash
print " Chosen by size: ",$chosen_by_size," out of ",$total_genomes,"\n";
my $count_gc_assocs = keys %gc_content;
print " Associated to GC: ",$count_gc_assocs,"\n";

## find smallest and largest GC content:
my @small2large_gc = sort { $a <=> $b } values %gc_content;
print " Smallest GC cont: ",$small2large_gc[0],"\n";
print " Largest GC cont: ",$small2large_gc[-1],"\n";

## calculate GC intervals:
my $interval_size
    = ( $small2large_gc[-1] - $small2large_gc[0] ) / ( $subset_number - 1 );
print " Interval size = ",$interval_size,"\n";
my $current_value = $small2large_gc[0];
my @intervals = ();
push(@intervals,$current_value);
while( $current_value <= $small2large_gc[-1] ) {
    $current_value += $interval_size;
    push(@intervals,$current_value);
}
## let's make sure that we have them right:
print " Intervals: ",join(", ",@intervals),"\n";

## find values of GC content within a 0.01 range from each interval
my $int_range = 0.01;
for my $interval ( @intervals ) {
    my $low  = $interval - $int_range;
    my $high = $interval + $int_range;
    my @within_range
        = grep { $gc_content{"$_"} >= $low && $gc_content{"$_"} <= $high }
            keys %gc_content;
    my %genomes_in_range = ();
    for my $genome ( @within_range ) {
        print $genome," = ", $gc_content{"$genome"},"\n";
        $genomes_in_range{"$genome"}++;
    }
    my @genomes2choose = keys %genomes_in_range;
    my $chosen = $genomes2choose[0];
    print " Genome chosen ($interval) = ",$chosen," (",$gc_content{"$chosen"},")\n";
    #exit;
}
And here is the modified program, build-tables4R.pl:

#!/usr/bin/env perl
## the line above tells the shell that the program that follows
## should be interpreted with perl

## set a few things that we want in order to choose genomes.
## I want ten genomes:
my $subset_number = 10;
## I want them to be at least 2 Mbp in length
my $min_length = 2e6;
## I want them to be a max of 4 Mbp in length
my $max_length = 4e6;
## yes, perl understands that kind of scientific notation, isn't it cool?

## now, because I know that I will open the file, figure out the
## range of GC content, and filter out genomes outside the sizes I
## decided above, yet come back to the data to choose only ten genomes
## within some range of GC content, I should, at the very least, keep
## the GC content associated to each genome's name. Therefore, I need
## a hash variable (well, no, perl would create the variable "on the
## fly," but it's good practice to declare before using. Makes your
## programs easier to maintain). Anyway, here is the hash for GC content,
## we will call it:
my %gc_content = ();
## Ain't I smart?

## we also want to remember each line to print after selecting
## the ten genomes for the exercise
my %genome_line = ();

## because we will be choosing a subset, as a measure of caution,
## we will count the number of genomes read, and the chosen ones:
my $total_genomes  = 0;
my $chosen_by_size = 0;

## we open files to write the tables
open( my $RFULL,"|-","bzip2 -9 >DATA/GC_n_RK.R.bz2" );
open( my $R2N4,"|-","bzip2 -9 >DATA/GC_n_RK_2n4.R.bz2" );
open( my $TEN,">","DATA/GC_n_RK_ten.R" );
## (no excuse to compress a file with ten genomes)

## now we open the table, which is, by the way, compressed with bzip2:
open( my $TABLE,"-|","bzip2 -qdc DATA/GC_n_RK.tbl.bz2" );
READ_TABLE:
while(<$TABLE>) {
    if( m{^Replicon} ){
        ## erase the first word for R's sake
        s{^Replicon\s+}{};
        print {$RFULL} $_;
        print {$R2N4} $_;
        print {$TEN} $_;
        # now jump the heading
        next READ_TABLE;
    }
    ## by default, "split" splits a line at empty spaces, such as tabs
    ## we assign each split word to a variable
    my($genome,$genome_length,$gc,$at,$gc_content) = split;
    print {$RFULL} $_;
    $total_genomes++;
    if( $genome_length >= $min_length && $genome_length <= $max_length ) {
        print {$R2N4} $_;
        $gc_content{"$genome"} = $gc_content;
        $genome_line{"$genome"} = $_;
        $chosen_by_size++;
    }
}
close($TABLE);
close($RFULL);
close($R2N4);

## now we should have a subset of genomes associated to their GC content.
## let's make sure by printing the number of chosen genomes, the total,
## and the number of associations in the gc_content hash
print " Chosen by size: ",$chosen_by_size," out of ",$total_genomes,"\n";
my $count_gc_assocs = keys %gc_content;
print " Associated to GC: ",$count_gc_assocs,"\n";

## find smallest and largest GC content:
my @small2large_gc = sort { $a <=> $b } values %gc_content;
print " Smallest GC cont: ",$small2large_gc[0],"\n";
print " Largest GC cont: ",$small2large_gc[-1],"\n";

## calculate GC intervals:
my $interval_size
    = ( $small2large_gc[-1] - $small2large_gc[0] ) / ( $subset_number - 1 );
print " Interval size = ",$interval_size,"\n";
my $current_value = $small2large_gc[0];
my @intervals = ();
push(@intervals,$current_value);
while( $current_value <= $small2large_gc[-1] ) {
    $current_value += $interval_size;
    push(@intervals,$current_value);
}
## let's make sure that we have them right:
print " Intervals: ",join(", ",@intervals),"\n";

## find values of GC content within a 0.01 range from each interval
my $int_range = 0.01;
for my $interval ( @intervals ) {
    my $low  = $interval - $int_range;
    my $high = $interval + $int_range;
    my @within_range
        = grep { $gc_content{"$_"} >= $low && $gc_content{"$_"} <= $high }
            keys %gc_content;
    my %genomes_in_range = ();
    for my $genome ( @within_range ) {
        print $genome," = ", $gc_content{"$genome"},"\n";
        $genomes_in_range{"$genome"}++;
    }
    my @genomes2choose = keys %genomes_in_range;
    my $chosen = $genomes2choose[0];
    print {$TEN} $genome_line{"$chosen"};
    print " Genome chosen ($interval) = ",$chosen," (",$gc_content{"$chosen"},")\n";
    #exit;
}
close($TEN);
Another way to look at the differences is to use the classic UNIX command "diff":
% diff choose-ten-genomes-v2.pl build-tables4R.pl
24a25,27
> ## we also want to remember each line to print after selecting
> ## the ten genomes for the exercise
> my %genome_line = ();
29a33,40
>
> ## we open files to write the tables
> open( my $RFULL,"|-","bzip2 -9 >DATA/GC_n_RK.R.bz2" );
> open( my $R2N4,"|-","bzip2 -9 >DATA/GC_n_RK_2n4.R.bz2" );
> open( my $TEN,">","DATA/GC_n_RK_ten.R" );
> ## (no excuse to compress a file with ten genomes)
>
>
31c42,43
< open( my $TABLE,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
---
> open( my $TABLE,"-|","bzip2 -qdc DATA/GC_n_RK.tbl.bz2" );
> READ_TABLE:
33c45,53
<     next if( m{^Replicon} ); # jump the heading
---
>     if( m{^Replicon} ){
>         ## erase the first word for R's sake
>         s{^Replicon\s+}{};
>         print {$RFULL} $_;
>         print {$R2N4} $_;
>         print {$TEN} $_;
>         # now jump the heading
>         next READ_TABLE;
>     }
36c56,57
<     my($genome,$gc,$at,$gc_content,$genome_length) = split;
---
>     my($genome,$genome_length,$gc,$at,$gc_content) = split;
>     print {$RFULL} $_;
39a61
>         print {$R2N4} $_;
40a63
>         $genome_line{"$genome"} = $_;
44a68,69
> close($RFULL);
> close($R2N4);
88a114
>     print {$TEN} $genome_line{"$chosen"};
91a118
> close($TEN);
%
Actually, I used the output of this "diff" command to figure out which lines to highlight when presenting the differences between the two programs above. The differences make "build-tables4R.pl" read a different input file, the one that includes the arginine and lysine calculations. They also make the program write three tables (the full table, the subset of genomes between 2 Mbp and 4 Mbp in length, and the ten genomes) into files that are easily imported into R. For example, you might notice that I eliminated the word "Replicon" from the first field of the heading. When the header line has one fewer field than the data lines, R understands that the first line contains the column names and that the first column should be used for row names, which saves us some complications when reading the tables into R that I will not explain this time.
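If you want to see that read.table behaviour in isolation, here is a minimal sketch (the file name and genome names are made up for illustration):

## a toy table whose header line has one field fewer than the data lines,
## just like the tables written by build-tables4R.pl
writeLines(c("Length fGC",
             "Genome_A 2045086 0.5738",
             "Genome_B 3800327 0.2751"),
           "toy.tbl")
toy <- read.table("toy.tbl")   # no header=TRUE or row.names needed
colnames(toy)                  # "Length" "fGC"        -> first line became the header
rownames(toy)                  # "Genome_A" "Genome_B" -> first column became row names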
I then ran the program and checked the resulting tables:
% ls DATA/
GC_n_RK.tbl.bz2      GC_table.tbl.bz2   subset.by_program
GC_table.sorted.bz2  SUBSET
%
% ./build-tables4R.pl | tail -5
M_luteus_NCTC2665 = 0.7300
I_variabilis_225 = 0.7389
C_gilvus_ATCC13127 = 0.7381
P_mikurensis_NBRC102666 = 0.7329
 Genome chosen (0.7389) = I_variabilis_225 (0.7389)
%
% ls DATA/
GC_n_RK.R.bz2      GC_n_RK_ten.R        SUBSET
GC_n_RK.tbl.bz2    GC_table.sorted.bz2  subset.by_program
GC_n_RK_2n4.R.bz2  GC_table.tbl.bz2
%
% ls DATA/*.R*
DATA/GC_n_RK.R.bz2  DATA/GC_n_RK_2n4.R.bz2  DATA/GC_n_RK_ten.R
%
% cat DATA/GC_n_RK_ten.R
Length GC AT fGC Arg Lys fArg fLys
C_botulinum_B_Eklund_17B 3800327 1045522 2754805 0.2751 31440 95985 0.246733372572101 0.753266627427899
C_lytica_DSM7489 3765936 1209276 2556660 0.3211 34394 94988 0.265832959762564 0.734167040237436
L_monocytogenes_serotype_4b_LL195 2904662 1104863 1799798 0.3804 31945 61915 0.34034732580439 0.65965267419561
Cycloclasticus_P1 2363215 992524 1370691 0.4200 31789 41370 0.434519334599981 0.565480665400019
H_walsbyi_DSM16790 3132494 1499334 1633160 0.4786 45113 18143 0.713181358290123 0.286818641709877
C_pseudotuberculosis_1002 2335113 1218741 1116372 0.5219 38467 25810 0.598456679683246 0.401543320316754
C_symbiosum_A 2045086 1173335 871520 0.5738 39575 25774 0.605594576810663 0.394405423189337
D_vulgaris_RCH1 3532052 2229947 1302105 0.6313 75794 31721 0.704962098311863 0.295037901688137
T_radiovictrix_DSM17093 3260398 2221603 1038795 0.6814 76161 20616 0.786974177748845 0.213025822251155
I_variabilis_225 3307740 2444014 863726 0.7389 78896 14963 0.840580018964617 0.159419981035383
%
% bzcat DATA/GC_n_RK.R.bz2 | wc
    2262   20357  223446
% bzcat DATA/GC_n_RK_2n4.R.bz2 | wc
     934    8405   92604
%
Good! So we have three new files; at least the one with ten genomes does have ten genomes, and the two compressed files have the number of rows expected. I could verify further, but that will do for now. If there are problems, we will see them in R. Now let us run an interactive R session and check what we wanted to check in the first place. Let's start with those ten genomes (R is started just by typing "R" at the command line; after that, the command line is R's):
% R

R version 3.0.0 (2013-04-03) -- "Masked Marvel"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin12.3.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> ten_bichos = read.table("DATA/GC_n_RK_ten.R")
> summary(ten_bichos)
     Length                GC                AT               fGC
 Min.   :2045086   Min.   : 992524   Min.   : 863726   Min.   :0.2751
 1st Qu.:2498577   1st Qu.:1121981   1st Qu.:1058189   1st Qu.:0.3903
 Median :3196446   Median :1214008   Median :1336398   Median :0.5002
 Mean   :3044702   Mean   :1513916   Mean   :1530763   Mean   :0.5022
 3rd Qu.:3475974   3rd Qu.:2041036   3rd Qu.:1758138   3rd Qu.:0.6169
 Max.   :3800327   Max.   :2444014   Max.   :2754805   Max.   :0.7389
      Arg             Lys             fArg             fLys
 Min.   :31440   Min.   :14963   Min.   :0.2467   Min.   :0.1594
 1st Qu.:32557   1st Qu.:21906   1st Qu.:0.3639   1st Qu.:0.2889
 Median :39021   Median :28766   Median :0.6020   Median :0.3980
 Mean   :48357   Mean   :43128   Mean   :0.5537   Mean   :0.4463
 3rd Qu.:68124   3rd Qu.:56779   3rd Qu.:0.7111   3rd Qu.:0.6361
 Max.   :78896   Max.   :95985   Max.   :0.8406   Max.   :0.7533
> cor.test(ten_bichos$fGC,ten_bichos$fArg, method="pearson")

	Pearson's product-moment correlation

data:  ten_bichos$fGC and ten_bichos$fArg
t = 8.252, df = 8, p-value = 3.491e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7822898 0.9874598
sample estimates:
      cor
0.9459756

>
Do you remember how I was inspired to make this exercise? I doubted that there would be a relationship between GC content and the relative abundances of arginine and lysine! Crap, not only was I wrong, the correlation is almost perfect! Look at that! I expected some correlation I could explain away, maybe a Pearson's correlation coefficient (R, not to be mistaken for the program R) of 0.5 tops. But 0.94? You've got to be kidding me! OK, I thought, maybe this happened because I chose genomes evenly separated in their GC content, so let's see what happens with the subset between 2 Mbp and 4 Mbp in length:
> two2four = read.table(bzfile("DATA/GC_n_RK_2n4.R.bz2"))
> summary(two2four)
     Length                GC                AT               fGC
 Min.   :2000962   Min.   : 590411   Min.   : 640011   Min.   :0.2656
 1st Qu.:2337730   1st Qu.: 958087   1st Qu.:1161427   1st Qu.:0.3815
 Median :2821452   Median :1313948   Median :1370810   Median :0.4698
 Mean   :2870742   Mean   :1402605   Mean   :1468074   Mean   :0.4831
 3rd Qu.:3354505   3rd Qu.:1706601   3rd Qu.:1762401   3rd Qu.:0.5865
 Max.   :3998906   Max.   :2851183   Max.   :2864327   Max.   :0.7389
      Arg             Lys              fArg             fLys
 Min.   :11562   Min.   :  8229   Min.   :0.2217   Min.   :0.1287
 1st Qu.:29975   1st Qu.: 30345   1st Qu.:0.3622   1st Qu.:0.3589
 Median :40780   Median : 42032   Median :0.4694   Median :0.5306
 Mean   :44521   Mean   : 44322   Mean   :0.5018   Mean   :0.4982
 3rd Qu.:54852   3rd Qu.: 57342   3rd Qu.:0.6411   3rd Qu.:0.6378
 Max.   :98146   Max.   :107045   Max.   :0.8713   Max.   :0.7783
> cor.test(two2four$fGC,two2four$fArg, method="pearson")

	Pearson's product-moment correlation

data:  two2four$fGC and two2four$fArg
t = 104.4614, df = 931, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9545137 0.9646429
sample estimates:
      cor
0.9598903

>
What? Pearson’s R close to 0.96? OK, then I was wrong. Even if the whole table had a lower correlation, the most it would be evidence for would be that smallish genomes introduce noise, which is expected. Let’s see:
> wholething = read.table(bzfile("DATA/GC_n_RK.R.bz2"))
> summary(wholething)
     Length                 GC                AT               fGC
 Min.   :    1133   Min.   :    727   Min.   :    406   Min.   :0.1354
 1st Qu.: 1938709   1st Qu.: 772846   1st Qu.:1056489   1st Qu.:0.3754
 Median : 2922917   Median :1401641   Median :1407570   Median :0.4631
 Mean   : 3275883   Mean   :1685033   Mean   :1590690   Mean   :0.4768
 3rd Qu.: 4519823   3rd Qu.:2373396   3rd Qu.:2072131   3rd Qu.:0.5885
 Max.   :13033779   Max.   :9302956   Max.   :5433425   Max.   :0.7491
      Arg              Lys              fArg              fLys
 Min.   :     8   Min.   :    13   Min.   :0.09687   Min.   :0.1206
 1st Qu.: 24218   1st Qu.: 28308   1st Qu.:0.34035   1st Qu.:0.3631
 Median : 43511   Median : 42228   Median :0.47161   Median :0.5284
 Mean   : 53511   Mean   : 46107   Mean   :0.49584   Mean   :0.5042
 3rd Qu.: 75429   3rd Qu.: 58987   3rd Qu.:0.63691   3rd Qu.:0.6597
 Max.   :325052   Max.   :171173   Max.   :0.87944   Max.   :0.9031
> cor.test(wholething$fGC,wholething$fArg, method="pearson")

	Pearson's product-moment correlation

data:  wholething$fGC and wholething$fArg
t = 164.6014, df = 2259, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9574459 0.9638010
sample estimates:
      cor
0.9607493

>
OK, almost the same result. Thus, even if smallish genomes introduce noise, it is not a lot of noise. Well, all that's left is to graph this shit:
(I first plotted the whole table to figure out the proper coordinates for displaying each graph.)
> plot(wholething$fGC, wholething$fArg)
> par(mfrow=c(1,3))
> plot(ten_bichos$fGC, ten_bichos$fArg, xlim=c(0.1,0.8), ylim=c(0.1,0.9))
> plot(two2four$fGC, two2four$fArg, xlim=c(0.1,0.8), ylim=c(0.1,0.9))
> plot(wholething$fGC, wholething$fArg, xlim=c(0.1,0.8), ylim=c(0.1,0.9))
>
Wow. I am happy to inform you that this worked on my Mac: it opened another window (a "quartz" window) and displayed the graphs there, even though I was working with R in the terminal. 🙂
But you can’t see the plots this way, so I saved what was being displayed (it saved automatically as a PDF) and here they are for you to contemplate:
[Figure: scatter plots of fArg against fGC for the ten genomes, the 2 Mbp to 4 Mbp subset, and the whole table.]
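By the way, if you prefer to save the figure from code rather than from the quartz window's menu, something along these lines should also work (the file name is just an example):

## copy whatever is on the current graphics device into a PDF
dev.copy2pdf(file = "GC_vs_fArg.pdf")

## or draw the three panels straight into a PDF device
pdf("GC_vs_fArg.pdf", width = 12, height = 4)
par(mfrow = c(1, 3))
plot(ten_bichos$fGC, ten_bichos$fArg, xlim = c(0.1, 0.8), ylim = c(0.1, 0.9))
plot(two2four$fGC,   two2four$fArg,   xlim = c(0.1, 0.8), ylim = c(0.1, 0.9))
plot(wholething$fGC, wholething$fArg, xlim = c(0.1, 0.8), ylim = c(0.1, 0.9))
dev.off()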
Upa! OK, so I had the misfortune of selecting an outlier genome (see that dot jumping out in the ten-genome plot), and the graphs show that all the effort to select genomes within some range of sizes did nothing to reduce noise in the graphs. In any event, we have thus discovered that there is a very clear correlation between GC content and the relative abundances of arginine and lysine. I doubted it, yet here we are. Facts are facts.

Now why do we care? For one, because some people wrote this in an article and I doubted it. Then, because it indicates that these two amino acids are pretty much interchangeable. Wait a minute, Gabo! you might start to think: correlation does not indicate causation. Right. Still, I am inclined to think that GC comes first; I just did not expect it to have such a strong influence on amino acid composition. Anyway, if it is true that arginines and lysines are mostly interchangeable, then we have much more to test, don't we? For example, whether arginines in homologous proteins align with lysines more frequently than with other amino acids. Would this correlation still show if we took into account only homologous, or only orthologous, proteins? Do other amino acids show such clear exchangeability? Do other amino acids change with GC content? All these questions should prompt us to go for a deep literature review before proceeding. The more questions we think of, the better we can figure out what kinds of articles to look for, and what kinds of analyses would be needed to better understand what might be behind these results. Hey, if nobody has published this relationship, maybe we can run more tests and publish them.
I will leave you for now with those thoughts. 🙂
Have fun,
-SuperGabo