07 8 / 2012

I have been expanding the coordinate mapper Reece posted to the dev list a couple of years ago.

It’s currently living as a gist, although it has grown rather precipitously (over 300 lines each of code and testing). I may have gotten slightly carried away with the concept of “test-driven development,” but in this case, extensive testing is extremely critical. Note that as of this writing, the code is in disarray; it was much less messy before I started testing it with 1-based output. (More on that below.)

I have modified it to work with the new CompoundLocation Peter is working on, while retaining the ability to provide exons as a sequence of pairs.

Representing intron locations

One of the more complicated operations in coordinate mapping is converting a genomic intron position to CDS coords. I am using the HGVS conventions, in which positions are shown in the format 123+x and 124-y, where 123 is the last base of one exon in CDS coords and 124 is the first base of the next exon. I’m using a custom object called NonExonPosition.

Are there standards that use an alternative display for intron positions relative to CDS coords?

Configuration

I’m currently using a configuration class to store customization for display of positions, most importantly conversion between the 0-based, Pythonic internal representation and 1-based representation used by many formats (GenBank and HGVS, to name just a few).

However, I have read a variety of StackOverflow threads that imply that if I’m trying to use globals or a singleton class, I’m doing something wrong. Is it unwise to use a class simply as a storage object? Does anyone have a “more Pythonic” way to handle module-wide configuration?

On a related note, I have been fiddling with adding and subtracting the pesky 1s all day, and I am thinking that I need to adjust my design and refactor the code slightly to maintain internal 0-based coordinates while adjusting the string representation to match 0- or 1-based coords. I’ll likely use a class similar to ExactPosition (i.e. subclass int). Actually, would it be useful to add to the *Position objects a method for representation in GenBank and other formats?

Other considerations

  • I have a few questions about negative integers. Can genomic or protein coordinates ever have negative coordinates? For CDS, as it be confusing for the integer -1 and the string ‘-1’ to be treated differently, my code interprets both as a downstream position. Using the pattern of the Python list index doesn’t make biological sense.

  • I haven’t yet tackled circular genomes. Are they handled by SeqFeature/FeatureLocation yet? In GFF, the end is represented as an index greater than the “length” of the entire sequence. In order to handle this, it should only be a matter of excepting an IndexError.

  • I also haven’t considered how strand will influence coordinate mapping. Any pointers in this direction would be appreciated.

  • Are there Biopython objects other than FeatureLocation that should be transformable? For example, is it worth attempting to map non-exact locations across coordinates?


All that said, I hope to have a production-ready coordinate mapper by the end of the week.

Mailing list: