20 8 / 2012
The coordinate mapper, with updated documentation, is now located on this branch, awaiting the merging of Peter’s f_loc4 branch.
I’ve written an entry on coordinate mapping for the Cookbook.
Additionally, at Peter’s suggestion, I’ve written a clarification of strand as it relates to transcription and translation. It’s available here.
It’s been a great experience working with this project this summer. Thank you to everyone involved.
16 8 / 2012
The summer is winding to a close. I’ve spent this week busy with orientation events and meetings for my upcoming PhD program. I hope to have time to continue to contribute to Biopython in my spare time, and ideally I would like to use and expand Biopython as a portion of my research.
I have been considering how to handle gene strandedness. As long as I’m correctly interpreting the following position, my coordinate mapper should produce the correct coordinates with negative strand or mixed strand features.
GenBank: join(complement(25..30), 36..40) Biopython: FeatureLocation(24, 30, -1) + FeatureLocation(35, 40)
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 <---------------- -------------> 5 4 3 2 1 0 6 7 8 9 10
I have to admit that it wasn’t until I read a BioStar post earlier this week that I fully understood the relationship between plus/minus forward/reverse sense/antisense coding/template strands. So please let me know as soon as possible if I’ve made a mistake in the above code.
c2g yields the correct genome position, but not the strand. I still need to integrate strand information into my
GenomePosition object and/or partially merge it with
ExactLocation. This weekend I intend to expand documentation and write a brief cookbook entry.
13 8 / 2012
All mapping functions use base 0 internally. Transformation to and from 1-based coords is allowed by custom MapPosition objects. (they are currently separate from the Seq* positions but could probably subclass ExactPosition). The MapPosition objects have to_dialect and from_dialect methods that automatically handle conversion between bases and other formatting details.
There are two different ways a user can convert a coordinate from HGVS:
# ... assuming cm is an instance of CoordinateMapper # Manually construct position from HGVS CDS_coord = CDSPosition.from_hgvs("6+1") genomic_coord = cm.c2g(CDS_coord) print genomic_coord.to_hgvs() # Pass dialect argument to mapping function genomic_coord = cm.c2g("6+1", dialect="HGVS") print genomic_coord.to_hgvs()
Furthermore, the inheritance hierarchy is designed to allow a user to set a default string representation:
# Set MapPositions to print as HGVS by default def use_hgvs(self): return str(self.to_hgvs()) MapPosition.__str__ = use_hgvs
The revision as of this writing is passing tests using base 0. I have not yet implemented tests for
to_hgvs, but that’s next on my list. I’m hoping to have time for strand and mixed strand, too.
The latest revision now tests with default settings and HGVS settings.
07 8 / 2012
I have been expanding the coordinate mapper Reece posted to the dev list a couple of years ago.
It’s currently living as a gist, although it has grown rather precipitously (over 300 lines each of code and testing). I may have gotten slightly carried away with the concept of “test-driven development,” but in this case, extensive testing is extremely critical. Note that as of this writing, the code is in disarray; it was much less messy before I started testing it with 1-based output. (More on that below.)
I have modified it to work with the new
CompoundLocation Peter is working on, while retaining the ability to provide exons as a sequence of pairs.
Representing intron locations
One of the more complicated operations in coordinate mapping is converting a genomic intron position to CDS coords. I am using the HGVS conventions, in which positions are shown in the format 123+x and 124-y, where 123 is the last base of one exon in CDS coords and 124 is the first base of the next exon. I’m using a custom object called
Are there standards that use an alternative display for intron positions relative to CDS coords?
I’m currently using a configuration class to store customization for display of positions, most importantly conversion between the 0-based, Pythonic internal representation and 1-based representation used by many formats (GenBank and HGVS, to name just a few).
However, I have read a variety of StackOverflow threads that imply that if I’m trying to use globals or a singleton class, I’m doing something wrong. Is it unwise to use a class simply as a storage object? Does anyone have a “more Pythonic” way to handle module-wide configuration?
On a related note, I have been fiddling with adding and subtracting the pesky 1s all day, and I am thinking that I need to adjust my design and refactor the code slightly to maintain internal 0-based coordinates while adjusting the string representation to match 0- or 1-based coords. I’ll likely use a class similar to
ExactPosition (i.e. subclass
int). Actually, would it be useful to add to the *Position objects a method for representation in GenBank and other formats?
I have a few questions about negative integers. Can genomic or protein coordinates ever have negative coordinates? For CDS, as it be confusing for the integer -1 and the string ‘-1’ to be treated differently, my code interprets both as a downstream position. Using the pattern of the Python list index doesn’t make biological sense.
I haven’t yet tackled circular genomes. Are they handled by
FeatureLocationyet? In GFF, the end is represented as an index greater than the “length” of the entire sequence. In order to handle this, it should only be a matter of
I also haven’t considered how strand will influence coordinate mapping. Any pointers in this direction would be appreciated.
Are there Biopython objects other than
FeatureLocationthat should be transformable? For example, is it worth attempting to map non-exact locations across coordinates?
All that said, I hope to have a production-ready coordinate mapper by the end of the week.
18 6 / 2012
James raised some concerns about the difficulty of representing the VCF “metaformat” in SQL. I’ve taken these into consideration and am forging ahead. So far, some of the types of data fit more neatly into SQL than into a VCF row.
I have redesigned my SQL schema with a two-pronged approach to tackle the flexibility of VCF:
- For the site, alt, and genotype tables, there are columns for the reserved info/format keywords in the VCF spec (so far only for non-SV).
- For new info and format keywords (both in the header and in the body), I am storing the values in a “narrow table.” This table stores a foreign key to the key’s row and the key-value pair. The narrow table is also good for storing reserved keys that are lists (but not per-allele or per-genotype).
Note: this diagram only has the FKs listed for simplicity.
Interestingly, despite the increase in the number of tables and thus insert statements, the current script is considerably faster than the previous version. Evidently JSON serialization is slow.
There are a few things I haven’t figured out:
- Can an info field be per-genotype? The spec implies that wouldn’t make sense, but doesn’t forbid it.
- Is there a safe way to find out if a VCF 4.0 field is per-allele or per-genotype?
- Will my SQL representation be able to handle SV?
I’ll be out of town for the next week but I will have plenty of time for Python.
03 6 / 2012
I started implementing storage of VCF data in
SeqFeature. I digressed, spending a few days experimenting with overloading
__getattr__() in lieu of manually writing properties. Then it occurred to me that if, as Reece pointed out, a variant doesn’t contain the actual sequence but a reference to the sequence, the advantages to using
SeqRecord are minimal or possibly negative.
In my experience, the highest performance for filtering large amounts of data is SQL. SQL has the advantage of scalability: SQLite now ships with Python, users can choose to run their own MySQL/PGSQL server, and I’ve read about a few approaches to GPU accelerated SQL.
My initial glances at BioSQL, GMOD, etc. didn’t show anything specifically designed for variants (again, a focus on storage of the sequence itself) so I implemented my own interface. Currently, the
parse_all() method is very slow (approximately 260 seconds for a file with 240,000 variants when the parsing takes 5-10 seconds) and I am investigating why. My first step will be to reduce commit frequency. Update: Reducing commit frequency has knocked this step to 40s.
With a SQL backend, it seems superfluous to have a dedicated variant representation within Python. The SQL result object should allow for straightforward retrieval of data by name. I’m storing “misc” data in a SQL text field using JSON, which is also easy to access.
- Looking at BioSQL/GMOD etc to see if there is an existing standard I should be using/following
- Deciding the extent of the convenience functions I wish to implement
- Thinking about the most efficient way to filter records on the way into the SQL database
25 5 / 2012
I’m going to use Github issues/milestones instead of trying to use some sort of calendar-based timeline. I have a very limited ability to predict how long things will take, so a date-linked timeline seemed like it would end up being “Lenna shuffles due dates around for half an hour every week.” Plus, the auto-closing and referencing of issues is really nice.
- Variant dev is on the
- Issues are here: https://github.com/lennax/biopython/issues
As an aside, I just saw that Github released a Windows GUI. As a Mac/Linux user who’s tried to teach Windows people how to use TortoiseGit, I am really excited to test it out. Has anybody tried it yet?
25 5 / 2012
A couple quick things:
Met a grad student today who has written his own Python PDB parser - he only found out about Biopython two months ago. He has a collection of variations in alphabets used by proprietary software and I am trying to convince him to contribute it to Biopython. His lab does primarily NMR and molecular dynamics.
Started a github branch for the variant project. I did subclass Bio.SeqIO.SequenceIterator but I ended up overloading all of the methods. Got somewhat distracted again reading about lambdas and closures; still not sold. Right now I’m using dictionary function dispatching and I think it’s pretty nice. If it needs to get messier, it’s possible closures would help. Although I still don’t understand closures well enough to explain them to another programmer, much less a six year old.
23 5 / 2012
My most exciting news this week is that I’ve understood and immediately come to love the @property decorator. Forcing users to use an accessor without even knowing it? Genius!
I’ve also been reading everything I can find on the internet about whether traditional OOP has a place within Python (namely interfaces and inheritance). Overall, I’m more confused than before I started, but I still think it would be extremely helpful to have some sort of formal guideline for future development, even if it’s not enforced. Currently, the most enforceable Python structure is the abstract base class (ABC), new to 2.6. However, I don’t think I will be able to make a decision without both consensus from the dev list and a prototype I’m happy with. Right now I’m frying my neural circuits trying to plan the big picture all at once; I need to get small pieces working well and be willing to make mistakes and start over.
I have reversed my prior conclusion that
SeqRecord is inadequate for holding variant data. It is still not ideal, but the advantages of using an existing native object are substantial, and the disadvantages can be reduced by creating an accessor for the variant-specific data within a
I’ve made an outline of how I would store the information returned by PyVCF within
SeqFeature objects. It includes a few questions about the most logical way to store certain variant information.
The existing sequence-related structures in Biopython, such as
SeqFeature, are not currently designed to contain variant data. Nevertheless, there are many features implemented for
SeqRecord etc. that would be very useful. It seems essential to be able to convert at least a subset of variant data to
SeqRecord. The data I’ve encountered so far in variants are organized in key-value pairs and can be stashed in the annotations/qualifiers dicts if no better place can be found (while new properties can be added to a Python object at any time, a user may not necessarily know to look for them).
Initially, I considered making a generalized variant object that can describe any known variant and has methods to convert it to
SeqRecord. However, it would be more efficient to have an adapter for each parser to convert the variant data into
SeqRecord. To ameliorate the disadvantages of storing variant data in a
SeqRecord, I would implement an accessor to allow more meaningful access to the variant-specific info in the annotations/qualifiers dicts.
If variant formats were included in SeqIO, my main concern would be whether it’s possible to explicitly prevent conversion between formats that don’t make sense. Format type checking could be used to tack on this restriction, but I am wondering if there is a more elegant way. Are all of the existing SeqIO formats interconvertible?
Anyway, if variants are included in SeqIO, variant parser adapters should inherit from Bio.SeqIO.SequenceIterator. That base class could be converted to an ABC once Biopython is ready for 2.6+. Does anybody have thoughts on whether Biopython should start using super() in inheritance hierarchies?
Tentative organization of VCF in SeqRecord
With all this in mind, I have been developing an outline of how to best store the data of PyVCF
_Call structures in Biopython
One general question is whether the
SeqFeature can or should be independent; that is, whether it should store of the information a user would want about that particular variation. The least redundant way to allow it access to data like chromosome would be to store a pointer to its parent
SeqRecord. But from a biological standpoint, I suspect it is illogical to separate calls from each other. Is this correct?
I’m also not clear on the ideal way to store the sequence information. HGVS recommends using a LRG (locus reference genome) with 5Kb upstream and 2Kb downstream. This seems like overkill for a general case. I believe I’ve seen cases where about 20 flanking base pairs are shown. Regardless, the
SeqRecord could hold some portion of the sequence and the
SeqFeatures could have a relative position within that sequence fragment and both the reference and alternate bases. Does this sound reasonable? I’m open to suggestions for a better way to store this critical portion of the variant data.
SeqRecord holds most of the info from _Record:
- the sequence (does this come from an external file? how much of it?)
- _Record level annotations
- start/end position of the whole sequence (Peter is talking about using
FeatureLocations for this)
- list of genotypes - as
SeqFeatures hold most of the info from _Call:
- the genotype name
- type (should the
SeqFeaturehave just the gt_type (ref, hom alt, het alt) or also the variant type (snp, indel, etc)?)
- start/end positions (relative to
- the ref/alt sequences (should each genotype have both ref/alt or just alt?)
- qualifiers: phased, gt_type, other optional fields
Well, as the coding period has now started, I’ll be pushing some prototypes to GitHub in the near future. Trial and error may be the best teacher at this point.
21 5 / 2012
I’ve been reading a lot of blog posts and StackExchange discussions about interfaces, inheritance, and their place in Python, and how Python features such as duck typing impact those as well as polymorphism. I’m coming from an admittedly hodgepodge background where I’m used to C++ design for large projects and Python for small. So it’s taking me some time to evaluate the prevailing opinions and determine best practices.
Regarding inheritance, I like this statement: “Implementation inheritance can always be replaced by interface inheritance plus composition” - Konrad Rudolph
I get the impression that Python interfaces are rare, little-known, and disliked. My initial tests suggest that they may be useless from a purely functional point of view; however, for structuring, they are hard to beat. For someone using the module, the interface is a promise of output. For someone extending the module, the interface is a guideline for requirements. The most tangible benefit, which Eric hinted at, is definition of an organizational structure that is not located between the ears of a single developer. The most exhaustive yet readable discussion of interfaces in Python I found is dated 2004, but it seems sensible. Also, a reference listing much Python interface related discussion.
Currently, the only formal interface-like implementation in Python is the ABC (abstract base class). I don’t think it’s worth creating an ABC. nor would I enforce inheritance from my interface. I would intend it as a guide for future developers.