03 6 / 2012
Weeks 0 and 1: SQL
I started implementing storage of VCF data in SeqRecord and SeqFeature. I digressed, spending a few days experimenting with overloading __getattr__() in lieu of manually writing properties. Then it occurred to me that if, as Reece pointed out, a variant doesn’t contain the actual sequence but a reference to the sequence, the advantages to using SeqRecord are minimal or possibly negative.
In my experience, the highest performance for filtering large amounts of data is SQL. SQL has the advantage of scalability: SQLite now ships with Python, users can choose to run their own MySQL/PGSQL server, and I’ve read about a few approaches to GPU accelerated SQL.
My initial glances at BioSQL, GMOD, etc. didn’t show anything specifically designed for variants (again, a focus on storage of the sequence itself) so I implemented my own interface. Currently, the parse_all() method is very slow (approximately 260 seconds for a file with 240,000 variants when the parsing takes 5-10 seconds) and I am investigating why. My first step will be to reduce commit frequency. Update: Reducing commit frequency has knocked this step to 40s.
With a SQL backend, it seems superfluous to have a dedicated variant representation within Python. The SQL result object should allow for straightforward retrieval of data by name. I’m storing “misc” data in a SQL text field using JSON, which is also easy to access.
Next:
- Looking at BioSQL/GMOD etc to see if there is an existing standard I should be using/following
- Deciding the extent of the convenience functions I wish to implement
- Thinking about the most efficient way to filter records on the way into the SQL database
Mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009682.html
25 5 / 2012
Github issues
I’m going to use Github issues/milestones instead of trying to use some sort of calendar-based timeline. I have a very limited ability to predict how long things will take, so a date-linked timeline seemed like it would end up being “Lenna shuffles due dates around for half an hour every week.” Plus, the auto-closing and referencing of issues is really nice.
Links:
- Variant dev is on the
variantbranch: https://github.com/lennax/biopython/tree/variant - Issues are here: https://github.com/lennax/biopython/issues
As an aside, I just saw that Github released a Windows GUI. As a Mac/Linux user who’s tried to teach Windows people how to use TortoiseGit, I am really excited to test it out. Has anybody tried it yet?
23 5 / 2012
Week -1
My most exciting news this week is that I’ve understood and immediately come to love the @property decorator. Forcing users to use an accessor without even knowing it? Genius!
I’ve also been reading everything I can find on the internet about whether traditional OOP has a place within Python (namely interfaces and inheritance). Overall, I’m more confused than before I started, but I still think it would be extremely helpful to have some sort of formal guideline for future development, even if it’s not enforced. Currently, the most enforceable Python structure is the abstract base class (ABC), new to 2.6. However, I don’t think I will be able to make a decision without both consensus from the dev list and a prototype I’m happy with. Right now I’m frying my neural circuits trying to plan the big picture all at once; I need to get small pieces working well and be willing to make mistakes and start over.
Summary
I have reversed my prior conclusion that SeqRecord is inadequate for holding variant data. It is still not ideal, but the advantages of using an existing native object are substantial, and the disadvantages can be reduced by creating an accessor for the variant-specific data within a SeqRecord.
I’ve made an outline of how I would store the information returned by PyVCF within SeqRecord and SeqFeature objects. It includes a few questions about the most logical way to store certain variant information.
There has been a fair amount of discussion on the mailing list about how best to represent genomic variants in Biopython.
The existing sequence-related structures in Biopython, such as SeqRecord and SeqFeature, are not currently designed to contain variant data. Nevertheless, there are many features implemented for SeqRecord etc. that would be very useful. It seems essential to be able to convert at least a subset of variant data to SeqRecord. The data I’ve encountered so far in variants are organized in key-value pairs and can be stashed in the annotations/qualifiers dicts if no better place can be found (while new properties can be added to a Python object at any time, a user may not necessarily know to look for them).
Initially, I considered making a generalized variant object that can describe any known variant and has methods to convert it to SeqRecord. However, it would be more efficient to have an adapter for each parser to convert the variant data into SeqRecord. To ameliorate the disadvantages of storing variant data in a SeqRecord, I would implement an accessor to allow more meaningful access to the variant-specific info in the annotations/qualifiers dicts.
If variant formats were included in SeqIO, my main concern would be whether it’s possible to explicitly prevent conversion between formats that don’t make sense. Format type checking could be used to tack on this restriction, but I am wondering if there is a more elegant way. Are all of the existing SeqIO formats interconvertible?
Anyway, if variants are included in SeqIO, variant parser adapters should inherit from Bio.SeqIO.SequenceIterator. That base class could be converted to an ABC once Biopython is ready for 2.6+. Does anybody have thoughts on whether Biopython should start using super() in inheritance hierarchies?
Tentative organization of VCF in SeqRecord
With all this in mind, I have been developing an outline of how to best store the data of PyVCF _Record and _Call structures in Biopython SeqRecord and SeqFeature objects.
One general question is whether the SeqFeature can or should be independent; that is, whether it should store of the information a user would want about that particular variation. The least redundant way to allow it access to data like chromosome would be to store a pointer to its parent SeqRecord. But from a biological standpoint, I suspect it is illogical to separate calls from each other. Is this correct?
I’m also not clear on the ideal way to store the sequence information. HGVS recommends using a LRG (locus reference genome) with 5Kb upstream and 2Kb downstream. This seems like overkill for a general case. I believe I’ve seen cases where about 20 flanking base pairs are shown. Regardless, the SeqRecord could hold some portion of the sequence and the SeqFeatures could have a relative position within that sequence fragment and both the reference and alternate bases. Does this sound reasonable? I’m open to suggestions for a better way to store this critical portion of the variant data.
SeqRecord holds most of the info from _Record:
- the sequence (does this come from an external file? how much of it?)
- _Record level annotations
- start/end position of the whole sequence (Peter is talking about using
FeatureLocations for this) - list of genotypes - as
SeqFeatures
The child SeqFeatures hold most of the info from _Call:
- the genotype name
- type (should the
SeqFeaturehave just the gt_type (ref, hom alt, het alt) or also the variant type (snp, indel, etc)?) - start/end positions (relative to
SeqRecordor absolute?) - the ref/alt sequences (should each genotype have both ref/alt or just alt?)
- qualifiers: phased, gt_type, other optional fields
Well, as the coding period has now started, I’ll be pushing some prototypes to GitHub in the near future. Trial and error may be the best teacher at this point.
21 5 / 2012
Theory break
I’ve been reading a lot of blog posts and StackExchange discussions about interfaces, inheritance, and their place in Python, and how Python features such as duck typing impact those as well as polymorphism. I’m coming from an admittedly hodgepodge background where I’m used to C++ design for large projects and Python for small. So it’s taking me some time to evaluate the prevailing opinions and determine best practices.
Regarding inheritance, I like this statement: “Implementation inheritance can always be replaced by interface inheritance plus composition” - Konrad Rudolph
I get the impression that Python interfaces are rare, little-known, and disliked. My initial tests suggest that they may be useless from a purely functional point of view; however, for structuring, they are hard to beat. For someone using the module, the interface is a promise of output. For someone extending the module, the interface is a guideline for requirements. The most tangible benefit, which Eric hinted at, is definition of an organizational structure that is not located between the ears of a single developer. The most exhaustive yet readable discussion of interfaces in Python I found is dated 2004, but it seems sensible. Also, a reference listing much Python interface related discussion.
Currently, the only formal interface-like implementation in Python is the ABC (abstract base class). I don’t think it’s worth creating an ABC. nor would I enforce inheritance from my interface. I would intend it as a guide for future developers.
16 5 / 2012
Week -2
In the past week, I’ve continued to plan the overall structure of the Biopython Variant module (name pending, naturally). See my skeleton code and discussion on the mailing list.
Brief summary of this post:
I don’t think SeqFeature or an extension thereof would be appropriate for storing Variant data; therefore, I intend to make a new structure based on _Record and _Call in PyVCF. I’m not sure if this structure should be associated with Seq, i.e. by naming it SeqVariant, and would like feedback on this question.
It could be very difficult to make PyVCF compatible with Python 2.5. Therefore, I am planning to write my project to be compatible with Python 2.6 and delaying its inclusion in the main Biopython branch until a future 2.6+ Biopython release. Alternate suggestions are welcome.
Representing variants within Biopython
Now I need to consider how the information stored in a variant file would best be represented in Biopython.
- Can variants be adequately described by an existing Biopython structure, and would this be logical?
My conclusion is no. The relevant sequence info could be stored as a Seq or SeqRecord, but given the quantity of information for each site in a variant, I think a new structure is necessary. SeqFeature is the closest option and Genbank does have feature keys such as “modified_base” and “variation”; however, Genbank doesn’t store the majority of fields provided by a variant file.
It would be possible to subclass and extend SeqFeature, but I see a number of potential drawbacks. As discussed before, there are potential problems with inheritance of non-abstract classes. SeqFeature and a hypothetical SeqVariant structure might have some methods and attributes in common, but I’m doubtful that the overlap is sufficient to justify subclassing.
My biggest concern is whether it’s logical to categorize variant call information with the generic Seq family. SeqFeature is used by some formats in SeqIO, but it would not be logical to include, for example, VCF in SeqIO. GFF3 might be grouped with SeqIO (but as it doesn’t always contain the full sequence, might not), but GVF seems to be an awkward stepchild of GFF3 and, like VCF, would not fit in with the SeqIO family. If I’m misinterpreting the criteria for inclusion into SeqIO, please correct me!
- What would a new Biopython structure for variants look like?
For now, I think the simplest way to integrate variant parsing into Biopython would be to borrow heavily from the _Record and _Call objects used by PyVCF. James, please let me know if you have any changes you’d recommend from the _Record/_Call structure to a generic Biopython variant structure.
My overall goal is to minimize changes from the native structure of the file format while maintaining a similar API to other Biopython modules. If I write functionality that is deliberately different from what a seasoned Biopython user might expect, I will be sure to extensively document the behavior.
Version compatibility
PyVCF currently works with Python 2.6, aside from vcf_filter.py, which uses argparse and thus requires 2.7. Biopython currently supports 2.5+. Making PyVCF compatible with 2.5 might be too great of an undertaking, and thus making my variant project compatible with 2.5 might not be possible. Biopython may be ready to deprecate 2.5 support, so I am tentatively planning to write my project with 2.6 compatibility in mind and wait to merge back into Biopython when a 2.6+ release is planned. Based on the timeline in the news post discussing dropping 2.4 support, the last version of Biopython to support 2.5 could conceivably be released in 2012 (Python 2.5 was released six years ago, September 2006, according to Guido’s history). I’ll defer to Peter’s judgment on this topic, of course.
This week I will solidify the structure so I am ready for the end of the community bonding period and the start of coding on May 21.
Mailing list thread: http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009637.html
06 5 / 2012
Weeks -4 and -3: Spot IsA Dog; HasA Tail
“If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”
Albert Einstein
The second week of the GSoC community bonding phase overlapped with my finals week, so my work has been primarily limited to the cognitive stage. I’m glad for the enforced planning time before coding; I’ve made some important refinements to my overall plan.
I got caught up on some reading, too. I posted a picture of my stack of books. I am now almost finished with The Pragmatic Programmer (1999, Andrew Hunt and David Thomas). While I don’t agree with all of their ideas (such as “write your own pseudocode to autogenerate all code!” and “UML is for dummies”), their overall philosophy is sound. Plan but don’t overplan, structure but don’t get bogged down in it. I’ve also been brushing up on my OO theory with Programming with Objects (2003, Avinash Kak). I’ve actually met Professor Kak; he teaches here at Purdue University. The book is focused on C++ and Java, but I’m still able to get the theoretical background I’m looking for. I finally wrapped my brain around how polymorphism is implemented as well as the theoretical and practical distinctions between IsA (inheritance) and HasA (aggregation and composition).
Armed with this knowledge, I threw together a basic UML diagram. The rest of this post will be about this diagram.
My main goals are not limited to:
- Make the structure parser and file-format agnostic: an abstracted OO design should allow anything to be slotted in (for example, Marjan’s C GFF parser?)
- Maintain encapsulation: limit how much each object can see of objects above and below it
- Allow extension at multiple levels: some existing parsers may process data in different ways; this structure should allow handling both raw data and data in various formats.
The Variant object’s constructor allows an end user to change the default parsers. Practical implementation details of parse() and write() will need to be finessed - for example, ways to help the user sift through immense quantities of data. I’m still in the process of comparing the data contained in VCF/GVF files as well as the APIs of PyVCF and BCBio.GFF.
Parser and Writer are both abstract classes that will define all methods found in known parsers/writers with NotImplementedErrors. I’m speculating on whether a Variant-specific exception would be useful, but a custom message should suffice.
Continuing down the diagram, PyVCFWrapper and BCBioGFFWrapper would each inherit from both Parser and Writer. As the name implies, they would serve as the adapter between the generic Variant and the specific parser.
I anticipate that this structure could easily be extended to allow intermediate storage in DBs as well as innumerable sorting/comparing/filtering methods inside Variant.
I would appreciate any and all feedback about the overall structure. Namespace is definitely flexible. I’d also appreciate any specific genomic variant workflows, and if somebody can point me to smallish sample files of the same data in both VCF and GVF, I’d be eternally grateful.
See also: Discussion on GSoC mailing list
24 4 / 2012
GSoC 2012
I’m pleased to announce that I will be spending the summer working on Biopython, funded by Google Summer of Code 2012!
OBF announcement: http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/
Project abstract: http://www.google-melange.com/gsoc/project/google/gsoc2012/lenna/23001
