Home
Note: No guaranties whatsoever. If you want to try the exercise, I have the original table of genomes with their sizes and GC content at figshare. Click here to download (at your own risk, etc etc).

A while ago, I was reading some article, maybe ten years ago. I have not been able to remember which one, but I think it might have been one about the genome of Pseudomonas syringae, which has a GC-rich genome. The thing is that somewhere in the article I read something of a note in pass about how, in a GC-rich genome, there should be arginines where other organisms would have lysine because the codons for arginine are GC-rich. I remember thinking back then “forgive me for not trusting you, I will check on this …” However, I forgot to check, but the idea was left floating at the background of my mind.

Many years later (in front of the firing squad), I find myself thinking of questions for tests in my undergrad course on bioinformatics. Since we had been talking about statistics, I thought of putting the degenerate codons for different amino-acids and ask the students to tell me, for each amino-acid, which codons should be used more often by an organisms with a GC-rich genome, compared to one with a GC-poor genome. Anyway, that brought about the memory of that arginine versus lysine as related to GC content. Ah! I thought, that’s yet another nice question. I had the arginine codons in the previous question, so I added a question about whether students would expect more arginine instead of lysine in a GC-poor genome. Of course, I presented the codons for lysine with the question.

This year, I made it a point to have an lab exercise for the students to determine if the GC-content had any relationship with arginine versus lysine content in proteins. So here’s how I prepared the data.

I had a table ready with the GC-content of genomes in my databases. You can check example programs that calculate the GC content of a DNA sequence as found in a fasta file, and how I wrote them, here and here.

Here’s what my original table looks like (It’s a tab-separated simple text table, compressed with bzip2):

% bzcat DATA/GC_table.tbl.bz2 | head -20
Replicon    GC    AT    fGC    Length
A_acidocaldarius_DSM446    1881729    1137026    0.6233    3018755
A_acidocaldarius_Tc_4_1    1922014    1202028    0.6152    3124048
A_actinomycetemcomitans_ANH9381    949230    1187575    0.4442    2136808
A_actinomycetemcomitans_D11S_1-l1    938183    1167561    0.4455    2105764
A_actinomycetemcomitans_D7S_1    1023231    1285836    0.4431    2309073
A_aeolicus_VF5    674461    876874    0.4348    1551335
A_aphrophilus_NJ8700    976817    1336218    0.4223    2313035
A_arabaticum_DSM5501    904645    1564951    0.3663    2469596
A_arilaitensis_Re117    2288018    1571239    0.5929    3859257
A_aromaticum_EbN1    2797768    1498462    0.6512    4296230
A_asiaticus_5a2    660376    1223988    0.3505    1884364
A_aurescens_TC1    2866335    1731351    0.6234    4597686
A_avenae_ATCC19860    3772885    1709285    0.6882    5482170
A_baumannii_1656_2    1543472    2397142    0.3917    3940614
A_baumannii_AB0057    1588015    2462498    0.3921    4050513
A_baumannii_AB307_0294    1468426    2292555    0.3904    3760981
A_baumannii_ACICU    1523827    2380219    0.3903    3904116
A_baumannii_ATCC17978    1548578    2428168    0.3894    3976747
A_baumannii_AYE    1550111    2386180    0.3938    3936291
%

This table is a secondary product of a previous project. In any event, I thought of limiting the lab exercise to a few genomes, perhaps 10, and to use genomes at least 2Mbp in length (check the heading above, “length” is the last column) to avoid the possibility of biases in smallish genomes. You know, when we deal with random crap, the probability of a mess increases against the size of your samples. Still, the cutoff at 2Mbp is arbitrary. I just thought that was big enough. Here’s a bit of exploration of the table to show how UNIX allows us to do a lot of work by piping commands (I insist, UNIX piping is a thing of beauty):

% bzcat DATA/GC_table.tbl.bz2 | wc
    2262   11310  115321
% bzcat DATA/GC_table.tbl.bz2 | awk '$NF >= 2000000' | wc
    1665    8325   85489
%

In the above, 2000000 is the 2Mbp minimal size I decided arbitrarily above. Now, I wanted to challenge the students to try and work with the whole tale instead of the ten selected genomes for a bonus. For some reason I don’t remember I thought that 1665 genomes might be too many, and that big genomes often have several replicons (chromosomes and plasmids), and I did not want to go to those details in this exercise. So I limited the size to no more than 4Mbp:

% bzcat DATA/GC_table.tbl.bz2 | wc
    2262   11310  115321
% bzcat DATA/GC_table.tbl.bz2 | awk '$NF >= 2000000' | wc
    1665    8325   85489
% bzcat DATA/GC_table.tbl.bz2 | awk '$NF >= 2000000 && $NF <= 4000000' | wc
     933    4665   48006
%

I felt more comfortable with that (I know, not very convincing, quite arbitrary, but heck, that’s how it happened!). Next step, well, sort these genomes by their GC content, which we know, from above, it’s the data in the fourth column:

% bzcat DATA/GC_table.tbl.bz2 | awk '$NF >= 2000000 && $NF <= 4000000' | wc
     933    4665   48006
% bzcat DATA/GC_table.tbl.bz2 | awk '$NF >= 2000000 && $NF <= 4000000' | head -5
A_acidocaldarius_DSM446	1881729	1137026	0.6233	3018755
A_acidocaldarius_Tc_4_1	1922014	1202028	0.6152	3124048
A_actinomycetemcomitans_ANH9381	949230	1187575	0.4442	2136808
A_actinomycetemcomitans_D11S_1-l1	938183	1167561	0.4455	2105764
A_actinomycetemcomitans_D7S_1	1023231	1285836	0.4431	2309073
%
% bzcat DATA/GC_table.tbl.bz2 | awk '$NF >= 2000000 && $NF <= 4000000' | sort -g -k 4 | head -5
Arcobacter_L	782481	2163192	0.2656	2945673
A_butzleri_RM4018	633195	1708056	0.2705	2341251
B_hyodysenteriae_WA1	812070	2188624	0.2706	3000694
A_butzleri_ED1	610831	1645844	0.2707	2256675
F_nucleatum_ATCC25586	590411	1584088	0.2715	2174500
%
% bzcat DATA/GC_table.tbl.bz2 | awk '$NF >= 2000000 && $NF <= 4000000' | sort -g -k 4 | tail -5
C_michiganensis_NCPPB382	2396387	901504	0.7266	3297891
M_luteus_NCTC2665	1825682	675415	0.7300	2501097
P_mikurensis_NBRC102666	2787300	1015925	0.7329	3803225
C_gilvus_ATCC13127	2602968	923473	0.7381	3526441
I_variabilis_225	2444014	863726	0.7389	3307740
%

As you can see, sorting puts a genome with a GC-content of ~0.26 at the top, and one with a GC-content of ~0.74 at the bottom. The extra lines let us see that we truly sorted these genomes by their GC-content (always, always, always always, always, check the outputs). I could not just output the result of keeping genomes between the lengths chosen, and sorted, into a file for challenging the students for those extra points. Yet, if I do so directly, I will lose the heading (did you notice?). Thus, I first save the heading into the file, and then appended the results of the commands above to such file, thus producing the file I want with the necessary information:

% bzcat DATA/GC_table.tbl.bz2 | head -1
Replicon	GC	AT	fGC	Length
%
% bzcat DATA/GC_table.tbl.bz2 | head -1 > DATA/GC_table.sorted
%
% cat DATA/GC_table.sorted
Replicon	GC	AT	fGC	Length
%
% bzcat DATA/GC_table.tbl.bz2 | awk '$NF >= 2000000 && $NF <= 4000000' | sort -g -k 4 >> DATA/GC_table.sorted
% head -5 DATA/GC_table.sorted
Replicon	GC	AT	fGC	Length
Arcobacter_L	782481	2163192	0.2656	2945673
A_butzleri_RM4018	633195	1708056	0.2705	2341251
B_hyodysenteriae_WA1	812070	2188624	0.2706	3000694
A_butzleri_ED1	610831	1645844	0.2707	2256675
%
% wc DATA/GC_table.sorted
     934    4670   48032 DATA/GC_table.sorted
%
% bzip2 -9 DATA/GC_table.sorted
%
% bzcat DATA/GC_table.sorted.bz2 | wc
     934    4670   48032
%

Pay close attention to those “>” and “>>”, and to my explicit way of checking that the file has what it should have (cat, head, wc). Also, Because I’m a compulsive compressor, I bzipped the file.

Now, since I wanted ten genomes, I used the range learned above to find the size of the increments needed to go evenly from 0.26 to 0.74 inclusive:

% perl -e 'print ((0.74 - 0.26)/9)."\n";'
0.0533333333333333
% perl -e '$i = 0.26; $top = 0.74; while( $i <= $top ){ print sprintf("%.2f",$i)."\n"; $i += 0.0533 }'
0.26
0.31
0.37
0.42
0.47
0.53
0.58
0.63
0.69
0.74
%
% perl -e '$i = 0.26; $top = 0.74; while( $i <= $top ){ print sprintf("%.2f",$i)."\n"; $i += 0.0533 }' > intervals.txt
%
% cat intervals.txt
0.26
0.31
0.37
0.42
0.47
0.53
0.58
0.63
0.69
0.74
% wc intervals.txt
      10      10      50 intervals.txt
%

You will be disappointed that after I had these intervals, I choose the genomes kind of by hand. I used “less” to go through the file, and each value at a time to jump to genomes with a GC-content close to the value being chosen, then copy and paste into a text file. Don’t worry. Next time I will show you how I could have chosen these genomes using a little program. For now, here my chosen list (the file name for my subset was called “SUBSET”), and a verification that they go close to those intervals I was talking about:

% cat SUBSET
A_pasteurianus_IFO3283_01
Arcobacter_L
B_breve_ACS071_V_Sch8b
B_mallei_ATCC23344_1
Methylocystis_SC2
P_mikurensis_NBRC102666
S_aureus_MSHR1132
Tepidanaerobacter_Re1
V_cholerae_IEC224_I
V_moutnovskia_768_28
%
% bzcat DATA/GC_table.sorted.bz2 | grep -wf SUBSET
Arcobacter_L	782481	2163192	0.2656	2945673
S_aureus_MSHR1132	894822	1867961	0.3239	2762785
Tepidanaerobacter_Re1	1035897	1723970	0.3753	2759867
V_moutnovskia_768_28	974470	1324513	0.4239	2298983
V_cholerae_IEC224_I	1433250	1574200	0.4766	3007450
A_pasteurianus_IFO3283_01	1542043	1365452	0.5304	2907495
B_breve_ACS071_V_Sch8b	1366866	960626	0.5873	2327492
Methylocystis_SC2	2390658	1382786	0.6335	3773444
B_mallei_ATCC23344_1	2392315	1117833	0.6815	3510148
P_mikurensis_NBRC102666	2787300	1015925	0.7329	3803225
%
% bzcat DATA/GC_table.sorted.bz2 | grep -wf SUBSET | awk '{print $1"\t"$4}'
Arcobacter_L	0.2656
S_aureus_MSHR1132	0.3239
Tepidanaerobacter_Re1	0.3753
V_moutnovskia_768_28	0.4239
V_cholerae_IEC224_I	0.4766
A_pasteurianus_IFO3283_01	0.5304
B_breve_ACS071_V_Sch8b	0.5873
Methylocystis_SC2	0.6335
B_mallei_ATCC23344_1	0.6815
P_mikurensis_NBRC102666	0.7329
%
% bzcat DATA/GC_table.sorted.bz2 | grep -wf SUBSET | awk '{print $1"\t"$4}' > grepped.list
%
% paste grepped.list intervals.txt 
Arcobacter_L	0.2656	0.26
S_aureus_MSHR1132	0.3239	0.31
Tepidanaerobacter_Re1	0.3753	0.37
V_moutnovskia_768_28	0.4239	0.42
V_cholerae_IEC224_I	0.4766	0.47
A_pasteurianus_IFO3283_01	0.5304	0.53
B_breve_ACS071_V_Sch8b	0.5873	0.58
Methylocystis_SC2	0.6335	0.63
B_mallei_ATCC23344_1	0.6815	0.69
P_mikurensis_NBRC102666	0.7329	0.74
%

I’ll leave you to guess what those commands do. It’s worth giving it a try.

Remember that the plan is to calculate and compare the GC-content to the arginine and lysine contents of these organisms. Next time I will write a program to automatically choose organisms. After that I will show how to write a program to check the arginine and lysine contents, since we already have GC-content covered from the previous programs I have shown before. My students only had the protein and DNA sequences for these ten organisms, plus the sorted list with 933 genomes, for the extra points challenge.

Enjoy!
-SuperGabo

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s