A while back I read that you can add masking information to an NCBI blast database, which lets you run blast sequence comparisons with either soft or hard masking on the database side. Since there’s already soft and hard masking of the queries, I hadn’t bothered to try it. Anyway, a few weeks ago I started playing with it, and that’s how I discovered segmasker (part of NCBI’s blast package).
The documentation on database masking suggests using a particular output format, “maskinfo_asn1_bin”, but, as usual, I checked the instructions (segmasker -help | more). There’s a fasta format! What does it do? It produces a fasta file with the protein sequences in capital letters, and the low-complexity, maskable parts of the sequences in lowercase. Well, guess what? Several sequence comparison programs recognize lowercase segments as masked, which makes your sequence comparison life quite enjoyable. Examples of software? Pearson’s fasta suite, and last. Oh, and guess what? last is a super-fast sequence comparison software suite that can compare proteins. Now you know last can also do soft masking. You’re welcome.
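If you want to see what that lowercase convention looks like in practice, here is a minimal Python sketch. It is not part of segmasker or of any of the programs above. It assumes you already have a soft-masked fasta file, for instance from something like segmasker -in proteins.faa -outfmt fasta -out proteins.masked.faa (the file names are made up, and you should check segmasker -help for the options in your version). The function names and the toy protein are hypothetical, invented only for illustration. The sketch reads fasta text, lists the lowercase (soft-masked) spans, and turns soft masking into hard masking by replacing the masked residues with X.

```python
import re


def read_fasta(text):
    """Parse FASTA text into a {header: sequence} dict, preserving letter case."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def masked_regions(seq):
    """Return 0-based, half-open (start, end) spans of lowercase residues."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


def hard_mask(seq, mask_char="X"):
    """Replace every soft-masked (lowercase) residue with mask_char."""
    return re.sub(r"[a-z]", mask_char, seq)


# Toy record (hypothetical): uppercase is kept, lowercase is soft-masked.
example = """>toy_protein hypothetical example
MKTAYIAKQRqqqqqqqqssssss
GLEVLTRNDAE
"""

for header, seq in read_fasta(example).items():
    print(header)
    print("  soft-masked spans:", masked_regions(seq))
    print("  hard-masked:      ", hard_mask(seq))
```

The point of keeping the soft-masked version around is that lowercase keeps the original residues, so a program that understands the convention can decide for itself how to treat the masked parts, while the X version throws that information away.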