07 8 / 2012
I have been expanding the coordinate mapper Reece posted to the dev list a couple of years ago.
It’s currently living as a gist, although it has grown rather precipitously (over 300 lines each of code and testing). I may have gotten slightly carried away with the concept of “test-driven development,” but in this case, extensive testing is extremely critical. Note that as of this writing, the code is in disarray; it was much less messy before I started testing it with 1-based output. (More on that below.)
I have modified it to work with the new
CompoundLocation Peter is working on, while retaining the ability to provide exons as a sequence of pairs.
Representing intron locations
One of the more complicated operations in coordinate mapping is converting a genomic intron position to CDS coords. I am using the HGVS conventions, in which positions are shown in the format 123+x and 124-y, where 123 is the last base of one exon in CDS coords and 124 is the first base of the next exon. I’m using a custom object called
Are there standards that use an alternative display for intron positions relative to CDS coords?
I’m currently using a configuration class to store customization for display of positions, most importantly conversion between the 0-based, Pythonic internal representation and 1-based representation used by many formats (GenBank and HGVS, to name just a few).
However, I have read a variety of StackOverflow threads that imply that if I’m trying to use globals or a singleton class, I’m doing something wrong. Is it unwise to use a class simply as a storage object? Does anyone have a “more Pythonic” way to handle module-wide configuration?
On a related note, I have been fiddling with adding and subtracting the pesky 1s all day, and I am thinking that I need to adjust my design and refactor the code slightly to maintain internal 0-based coordinates while adjusting the string representation to match 0- or 1-based coords. I’ll likely use a class similar to
ExactPosition (i.e. subclass
int). Actually, would it be useful to add to the *Position objects a method for representation in GenBank and other formats?
I have a few questions about negative integers. Can genomic or protein coordinates ever have negative coordinates? For CDS, as it be confusing for the integer -1 and the string ‘-1’ to be treated differently, my code interprets both as a downstream position. Using the pattern of the Python list index doesn’t make biological sense.
I haven’t yet tackled circular genomes. Are they handled by
FeatureLocationyet? In GFF, the end is represented as an index greater than the “length” of the entire sequence. In order to handle this, it should only be a matter of
I also haven’t considered how strand will influence coordinate mapping. Any pointers in this direction would be appreciated.
Are there Biopython objects other than
FeatureLocationthat should be transformable? For example, is it worth attempting to map non-exact locations across coordinates?
All that said, I hope to have a production-ready coordinate mapper by the end of the week.