Home
Note: No guaranties whatsoever. If you want to try the exercise, I have the original table of genomes with their sizes and GC content at figshare. Click here to download (at your own risk, etc etc). I just hope that these soliloquies might inspire bioinformatics newbies

This is the third installment on the arginine/lysine and GC content series (first post here, second here). This post is about writing a program for counting the arginines and lysines, calculating their relative proportions, and saving the results into a file for later use. Next time I will use the resulting file to figure out if arginines and lysines vary with the GC content of the chromosome (here sometimes called replicons), of the Prokaryotes whose genomes were in my database.

First things first, the beginning of a program to count arginines and lysines, and then some, as before, I start by creating a file with touch, and changing it into an executable:

% touch calc-RK.pl
% chmod +x calc-RK.pl
%

You might remember that this exercise was to be performed with ten genomes as chosen previously. However, you might also remember that I had also produced a file with prokaryotic chromosomes with sizes between 2 Mbp and 4 Mbp as a challenge to the students for extra credit. This because ten genomes might not require programming, while more than 900 would. I thought, OK, if I do this program to work with all the replicons in the original file, I will have all the results ready, and I can “grep out” whatever subset I want in order to compare to whatever results the students get, be it the ten genomes, be it the more than 900. Therefore I decided to count the arginines and lysines for all the replicons, and put all the data together for later use. First things first, let us remember what the table looks like:

% bzcat DATA/GC_table.tbl.bz2 |wc
    2262   11310  115321
% bzcat DATA/GC_table.tbl.bz2 | head -5
Replicon	GC	AT	fGC	Length
A_acidocaldarius_DSM446	1881729	1137026	0.6233	3018755
A_acidocaldarius_Tc_4_1	1922014	1202028	0.6152	3124048
A_actinomycetemcomitans_ANH9381	949230	1187575	0.4442	2136808
A_actinomycetemcomitans_D11S_1-l1	938183	1167561	0.4455	2105764
%

We will be using files in fasta format to count those arginines and lysines. What does a file like that look like?

% zcat /research/gmh/GENOME_DB/faa-Replicon/A_acidocaldarius_DSM446.faa.gz | head -10
>gi|258510021|ref|YP_003183455.1| chromosomal replication initiator protein DnaA [Alicyclobacillus acidocaldarius subsp. acidocaldarius DSM 446]
MMSTKVDPTYDALWNQVLQMVQDKVSGPTYSTWFEDTHIVRIDESEQTVYISAPSTFVRNWLSEHYTSLV
ETLMMTLMDREYRVRFITEDDVQAPAVIVRRAPEPVTAEEDTETNLNPRYTFDSFVIGAGNRFAHAACLA
VAERPANAYNPLFIYGGVGLGKTHLMHAIGHYVRQHYPNFKVSYISSERFTNEFIAAIRDKNPDSFRARY
RTVDVLLIDDIQFIANKEQTQEEFFHTFNSLHEVGKQIVISSDRPPREIPTLEDRLRSRFEWGLITDIQP
PDLETRIAILQRKAKADGLDVPVDVLAFIANQIDSNIRELEGALTRVIAYSSLVKEDLSVALAEEALKDL
IHPNRPRAVTVNHVQKVVADHYGLKVDDLKGKKRTRNIAFPRQIAMYLTRELTDLSLPRIGEAFGGRDHT
TVMHACERVHEEMMRDDELRATIERLSQAIRTLT
>gi|258510022|ref|YP_003183456.1| DNA polymerase III subunit beta [Alicyclobacillus acidocaldarius subsp. acidocaldarius DSM 446]
MEFSLQKQALSNALQIVSKAVAVRTPKQVLMGILIEVHEDEIVATAYDLELAIQTRVPVGEETGLRVHQT
%

With that information we can start programming:

#!/usr/bin/env perl
# could be /usr/bin/perl, or /usr/local/bin/perl, for example.
# But "env" is able to choose the proper perl whatever the path might be.

## put directory where I store the protein sequences encoded by each
## genome into a variable for ease of handling later in the program:
my $faa_dir = "/research/gmh/GENOME_DB/faa-Replicon";

## now open the file with the whole list of genomes, their sizes,
## and GC content (the file is compressed with bzip2)
open( my $FULLTBL,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$FULLTBL>) {
    next if( m{^Replicon} ); # jump heading
    ## we cut the line into fields and give them proper variable names:
    my($replicon,$gc,$at,$fGC,$length) = split;
    ## I decided to use a subroutine to count the arginines (R)
    ## and lysines (L):
    my($arg,$lys) = count_RnK("$replicon");
    ## exit before going on to check that the work is going well
    exit;
}
close($FULLTBL);

OK, so all that’s happening for now is that I opened the file. Because I have put an “exit” at the end of the “while” loop, the program will not proceed beyond the first line with a genome (one after the heading). Using the name of the replicon, I will find the appropriate fasta file and count the arginines and lysines in it, only I decided to use a sub routine (that is not written yet), called “count_RnK” to do the counting. Naturally, now I should write the subroutine:

#!/usr/bin/env perl
# could be /usr/bin/perl, or /usr/local/bin/perl, for example.
# But "env" is able to choose the proper perl whatever the path might be.

## put directory where I store the protein sequences encoded by each
## genome into a variable for ease of handling later in the program:
my $faa_dir = "/research/gmh/GENOME_DB/faa-Replicon";

## now open the file with the whole list of genomes, their sizes,
## and GC content (the file is compressed with bzip2)
open( my $FULLTBL,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$FULLTBL>) {
    next if( m{^Replicon} ); # jump heading
    ## we cut the line into fields and give them proper variable names:
    my($replicon,$gc,$at,$fGC,$length) = split;
    ## I decided to use a subroutine to count the arginines (R)
    ## and lysines (L):
    my($arg,$lys) = count_RnK("$replicon");
    ## exit before going on to check that the work is going well
    exit;
}
close($FULLTBL);

sub count_RnK {
    my($replicon) = @_;
    ## I already know that my replicon files are compressed
    ## with gzip
    my $infile = "$faa_dir/$replicon.faa.gz";
    ## I already know that the files exist. Still, let us verify existence
    ## before proceeding:
    if( -f "$infile" ) {

    }
    else {
        return(); #return empty, meaning failure
    }
}

This subroutine cannot do anything yet. I am pausing for you to notice first that I added an “if/else” to make sure that the file exists. Since I know what the answer from the sub routine should be if the file is not there (return()), I add that answer. Now let’s us add the code for counting. We can use tr/// to count letters as before. The letter representing arginine in protein sequences is R, the one representing lysine is K:

sub count_RnK {
    my($replicon) = @_;
    ## I already know that my replicon files are compressed
    ## with gzip
    my $infile = "$faa_dir/$replicon.faa.gz";
    ## I know that the files exist. Still, let us verify existence
    ## before proceeding:
    if( -f "$infile" ) {
        ## declare variables for arginine and lysine counts:
        my $arg = 0;
        my $lys = 0;
        open( my $FAA,"-|","gzip -qdc $infile" );
        while(<$FAA>) {
            next if( m{^>} );
            $arg += tr/R/R/;
            $lys += tr/K/K/;
        }
        close($FAA);
        return($arg,$lys);
    }
    else {
        return(); #return empty, meaning failure
    }
}

That should work. That should count arginines and lysines. Yet, we’d rather verify (remember, it’s always better to check before proceeding). So, let us put a print in the while loop, after the call to the subroutine, and see if we have something other than zeroes for arginine and lysine:

open( my $FULLTBL,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$FULLTBL>) {
    next if( m{^Replicon} ); # jump heading
    ## we cut the line into fields and give them proper variable names:
    my($replicon,$gc,$at,$fGC,$length) = split;
    ## I decided to use a subroutine to count the arginines (R)
    ## and lysines (L):
    my($arg,$lys) = count_RnK("$replicon");
    print $replicon,": arginine: ",$arg,"; lysine: ",$lys,"\n";
    ## exit before going on to check that the work is going well
    exit;
}
close($FULLTBL);

With that we can try and see what happens:

% ./calc-RK.pl
A_acidocaldarius_DSM446: arginine: 68156; lysine: 26933
%

It works! It wooooorks! Well, of course it does! Hum, but we don’t want just counts, we want proportions of one versus the other. A few lines will be enough for that (and we change the print command to check again):

open( my $FULLTBL,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
while(<$FULLTBL>) {
    next if( m{^Replicon} ); # jump heading
    ## we cut the line into fields and give them proper variable names:
    my($replicon,$gc,$at,$fGC,$length) = split;
    ## I decided to use a subroutine to count the arginines (R)
    ## and lysines (L):
    my($arg,$lys) = count_RnK("$replicon");
    my $total = $arg + $lys;
    my $f_arg = $arg/$total;
    my $f_lys = $lys/$total;
    print $replicon,": arginine: ",$arg,"; lysine: ",$lys
        ,"; fArg: ",$f_arg,"; fLys: ",$f_lys,"\n";
    ## exit before going on to check that the work is going well
    exit;
}
close($FULLTBL);

And we check:

% ./calc-RK.pl
A_acidocaldarius_DSM446: arginine: 68156; lysine: 26933
fArg: 0.716760087917635; fLys: 0.283239912082365
%

Looks all right. The proportions of arginine and lysine seem to add to one, as it should be, the arithmetic is simple enough that we can trust this so far. Maybe it’s a good idea to now delete the “exit” and the “print.” However, in the subroutine we are verifying that the fasta file exists. We did not tell the program what to do if the return does not have values for arginine and lysine. Let’s fix that, and see that we still have the very same results as before adding such verifyication:

open( my $FULLTBL,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
TBL_LINE:
while(<$FULLTBL>) {
    next if( m{^Replicon} ); # jump heading
    ## we cut the line into fields and give them proper variable names:
    my($replicon,$gc,$at,$fGC,$length) = split;
    ## I decided to use a subroutine to count the arginines (R)
    ## and lysines (L):
    if( my($arg,$lys) = count_RnK("$replicon") ) {
        my $total = $arg + $lys;
        my $f_arg = $arg/$total;
        my $f_lys = $lys/$total;
        print $replicon,": arginine: ",$arg,"; lysine: ",$lys
            ,"\nfArg: ",$f_arg,"; fLys: ",$f_lys,"\n";
    }
    else {
        next TBL_LINE;
    }
    ## exit before going on to check that the work is going well
    exit;
}
close($FULLTBL);
% ./calc-RK.pl
A_acidocaldarius_DSM446: arginine: 68156; lysine: 26933
fArg: 0.716760087917635; fLys: 0.283239912082365
%

Yup. No difference, and if any fasta file is not there the program will jump over. Cool. But we have not saved the results. Aaaaaaaah! Little tiny detail, we are doing this so that we get some results, and we are not saving the results? What the hell?. OK, let’s fix that, let us open a file, that we will compress with bzip2 because I am an obsessive compressor, and let’s print results into that file after an appropriate heading:

#!/usr/bin/env perl
# could be /usr/bin/perl, or /usr/local/bin/perl, for example.
# But "env" is able to choose the proper perl whatever the path might be.

## put directory where I store the protein sequences encoded by each
## genome into a variable for ease of handling later in the program:
my $faa_dir = "/research/gmh/GENOME_DB/faa-Replicon";

## open a file for results and print in it a tab separated heading
## reflecting what I will put in the file:
open( my $RKTBL,"|-","bzip2 -9 > DATA/GC_n_RK.tbl.bz2" );
print {$RKTBL} join("\t","Replicon","Length","GC","AT","fGC"
                        ,"Arg","Lys","fArg","fLys"),"\n";

## now open the file with the whole list of genomes, their sizes,
## and GC content (the file is compressed with bzip2)
open( my $FULLTBL,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
TBL_LINE:
while(<$FULLTBL>) {
    next if( m{^Replicon} ); # jump heading
    ## we cut the line into fields and give them proper variable names:
    my($replicon,$gc,$at,$fGC,$length) = split;
    ## I decided to use a subroutine to count the arginines (R)
    ## and lysines (L):
    if( my($arg,$lys) = count_RnK("$replicon") ) {
        my $total = $arg + $lys;
        my $f_arg = $arg/$total;
        my $f_lys = $lys/$total;
        print $replicon,": arginine: ",$arg,"; lysine: ",$lys
            ,"\nfArg: ",$f_arg,"; fLys: ",$f_lys,"\n";
        print {$RKTBL} join("\t",$replicon,$length,$gc,$at,$fGC
                                ,$arg,$lys,$f_arg,$f_lys),"\n";
    }
    else {
        next TBL_LINE;
    }
    ## exit before going on to check that the work is going well
    exit;
}
close($FULLTBL);
close($RKTBL);

sub count_RnK {
    my($replicon) = @_;
    ## I already know that my replicon files are compressed
    ## with gzip
    my $infile = "$faa_dir/$replicon.faa.gz";
    ## I know that the files exist. Still, let us verify existence
    ## before proceeding:
    if( -f "$infile" ) {
        ## declare variables for arginine and lysine counts:
        my $arg = 0;
        my $lys = 0;
        open( my $FAA,"-|","gzip -qdc $infile" );
        while(<$FAA>) {
            next if( m{^>} );
            $arg += tr/R/R/;
            $lys += tr/K/K/;
        }
        close($FAA);
        return($arg,$lys);
    }
    else {
        return(); #return empty, meaning failure
    }
}

The output should be the same, except:

% ./calc-RK.pl
A_acidocaldarius_DSM446: arginine: 68156; lysine: 26933
fArg: 0.716760087917635; fLys: 0.283239912082365
% bzcat DATA/GC_
GC_n_RK.tbl.bz2      GC_table.sorted.bz2  GC_table.tbl.bz2
% bzcat DATA/GC_n_RK.tbl.bz2
Replicon	Length	GC	AT	fGC	Arg	Lys	fArg	fLys
A_acidocaldarius_DSM446	3018755	1881729	1137026	0.6233	68156	26933	0.716760087917635	0.283239912082365
%

OK! Now we should be able to get the brakes off. I thus “comment out” both the “exit” and the verification “print.” Without this verification “print,” the program will run silently for a while, there’s 2261 replicons to work with after all. If you rather see some output in the terminal while the program does it’s thing, you can leave the verification “print” alone.

open( my $FULLTBL,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
TBL_LINE:
while(<$FULLTBL>) {
    next if( m{^Replicon} ); # jump heading
    ## we cut the line into fields and give them proper variable names:
    my($replicon,$gc,$at,$fGC,$length) = split;
    ## I decided to use a subroutine to count the arginines (R)
    ## and lysines (L):
    if( my($arg,$lys) = count_RnK("$replicon") ) {
        my $total = $arg + $lys;
        my $f_arg = $arg/$total;
        my $f_lys = $lys/$total;
        #print $replicon,": arginine: ",$arg,"; lysine: ",$lys
        #    ,"\nfArg: ",$f_arg,"; fLys: ",$f_lys,"\n";
        print {$RKTBL} join("\t",$replicon,$length,$gc,$at,$fGC
                                ,$arg,$lys,$f_arg,$f_lys),"\n";
    }
    else {
        next TBL_LINE;
    }
    ## exit before going on to check that the work is going well
    #exit;
}
close($FULLTBL);

We run this program, wait until it is done, and then check the file (used “time” just to show you how long it took to run all the counts in my machine):

% time ./calc-RK.pl
67.173u 6.150s 1:43.29 70.9%	0+0k 1614+1118io 0pf+0w
%
% bzcat DATA/GC_n_RK.tbl.bz2 |wc
    2262   20358  223455
% bzcat DATA/GC_n_RK.tbl.bz2 | head -5
Replicon	Length	GC	AT	fGC	Arg	Lys	fArg	fLys
A_acidocaldarius_DSM446	3018755	1881729	1137026	0.6233	68156	26933	0.716760087917635	0.283239912082365
A_acidocaldarius_Tc_4_1	3124048	1922014	1202028	0.6152	71342	28402	0.715251042669233	0.284748957330767
A_actinomycetemcomitans_ANH9381	2136808	949230	1187575	0.4442	28210	36956	0.432894454163214	0.567105545836786
A_actinomycetemcomitans_D11S_1-l1	2105764	938183	1167561	0.4455	28254	37116	0.432216613125287	0.567783386874713
% bzcat DATA/GC_n_RK.tbl.bz2 | tail -5
cyanobacterium_UCYN_A	1443806	449266	994540	0.3112	16300	27189	0.374807422566626	0.625192577433374
gamma_proteobacterium_HdN1	4587455	2443213	2144242	0.5326	85536	56484	0.602281368821293	0.397718631178707
secondary_endosymbiont_of_Ctenarytaina_eucalypti	1441139	623872	817267	0.4329	15619	11620	0.573405778479386	0.426594221520614
secondary_endosymbiont_of_Heteropsylla_cubana_Thao2000	1121596	324171	797425	0.2890	10017	12111	0.452684381778742	0.547315618221258
uncultured_Termite_group_1_bacterium_phylotype_Rs_D17	1125857	396385	729472	0.3521	10273	24399	0.296290955237656	0.703709044762344
%

So yes, the file with results is there. Nothing seems amiss. We are ready. In my next post I will show you the results we actually care about: whether arginines and lysines change with the GC content. For now I leave you with the whole program.

Have tons of fun,
-SuperGabo

#!/usr/bin/env perl
# could be /usr/bin/perl, or /usr/local/bin/perl, for example.
# But "env" is able to choose the proper perl whatever the path might be.

## put directory where I store the protein sequences encoded by each
## genome into a variable for ease of handling later in the program:
my $faa_dir = "/research/gmh/GENOME_DB/faa-Replicon";

## open a file for results and print in it a tab separated heading
## reflecting what I will put in the file:
open( my $RKTBL,"|-","bzip2 -9 > DATA/GC_n_RK.tbl.bz2" );
print {$RKTBL} join("\t","Replicon","Length","GC","AT","fGC"
                        ,"Arg","Lys","fArg","fLys"),"\n";

## now open the file with the whole list of genomes, their sizes,
## and GC content (the file is compressed with bzip2)
open( my $FULLTBL,"-|","bzip2 -qdc DATA/GC_table.tbl.bz2" );
TBL_LINE:
while(<$FULLTBL>) {
    next if( m{^Replicon} ); # jump heading
    ## we cut the line into fields and give them proper variable names:
    my($replicon,$gc,$at,$fGC,$length) = split;
    ## I decided to use a subroutine to count the arginines (R)
    ## and lysines (L):
    if( my($arg,$lys) = count_RnK("$replicon") ) {
        my $total = $arg + $lys;
        my $f_arg = $arg/$total;
        my $f_lys = $lys/$total;
        #print $replicon,": arginine: ",$arg,"; lysine: ",$lys
        #    ,"\nfArg: ",$f_arg,"; fLys: ",$f_lys,"\n";
        print {$RKTBL} join("\t",$replicon,$length,$gc,$at,$fGC
                                ,$arg,$lys,$f_arg,$f_lys),"\n";
    }
    else {
        next TBL_LINE;
    }
    ## exit before going on to check that the work is going well
    #exit;
}
close($FULLTBL);
close($RKTBL);

sub count_RnK {
    my($replicon) = @_;
    ## I already know that my replicon files are compressed
    ## with gzip
    my $infile = "$faa_dir/$replicon.faa.gz";
    ## I know that the files exist. Still, let us verify existence
    ## before proceeding:
    if( -f "$infile" ) {
        ## declare variables for arginine and lysine counts:
        my $arg = 0;
        my $lys = 0;
        open( my $FAA,"-|","gzip -qdc $infile" );
        while(<$FAA>) {
            next if( m{^>} );
            $arg += tr/R/R/;
            $lys += tr/K/K/;
        }
        close($FAA);
        return($arg,$lys);
    }
    else {
        return(); #return empty, meaning failure
    }
}
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s