03 6 / 2012
I started implementing storage of VCF data in
SeqFeature. I digressed, spending a few days experimenting with overloading
__getattr__() in lieu of manually writing properties. Then it occurred to me that if, as Reece pointed out, a variant doesn’t contain the actual sequence but a reference to the sequence, the advantages to using
SeqRecord are minimal or possibly negative.
In my experience, the highest performance for filtering large amounts of data is SQL. SQL has the advantage of scalability: SQLite now ships with Python, users can choose to run their own MySQL/PGSQL server, and I’ve read about a few approaches to GPU accelerated SQL.
My initial glances at BioSQL, GMOD, etc. didn’t show anything specifically designed for variants (again, a focus on storage of the sequence itself) so I implemented my own interface. Currently, the
parse_all() method is very slow (approximately 260 seconds for a file with 240,000 variants when the parsing takes 5-10 seconds) and I am investigating why. My first step will be to reduce commit frequency. Update: Reducing commit frequency has knocked this step to 40s.
With a SQL backend, it seems superfluous to have a dedicated variant representation within Python. The SQL result object should allow for straightforward retrieval of data by name. I’m storing “misc” data in a SQL text field using JSON, which is also easy to access.
- Looking at BioSQL/GMOD etc to see if there is an existing standard I should be using/following
- Deciding the extent of the convenience functions I wish to implement
- Thinking about the most efficient way to filter records on the way into the SQL database
25 5 / 2012
I’m going to use Github issues/milestones instead of trying to use some sort of calendar-based timeline. I have a very limited ability to predict how long things will take, so a date-linked timeline seemed like it would end up being “Lenna shuffles due dates around for half an hour every week.” Plus, the auto-closing and referencing of issues is really nice.
- Variant dev is on the
- Issues are here: https://github.com/lennax/biopython/issues
As an aside, I just saw that Github released a Windows GUI. As a Mac/Linux user who’s tried to teach Windows people how to use TortoiseGit, I am really excited to test it out. Has anybody tried it yet?
23 5 / 2012
My most exciting news this week is that I’ve understood and immediately come to love the @property decorator. Forcing users to use an accessor without even knowing it? Genius!
I’ve also been reading everything I can find on the internet about whether traditional OOP has a place within Python (namely interfaces and inheritance). Overall, I’m more confused than before I started, but I still think it would be extremely helpful to have some sort of formal guideline for future development, even if it’s not enforced. Currently, the most enforceable Python structure is the abstract base class (ABC), new to 2.6. However, I don’t think I will be able to make a decision without both consensus from the dev list and a prototype I’m happy with. Right now I’m frying my neural circuits trying to plan the big picture all at once; I need to get small pieces working well and be willing to make mistakes and start over.
I have reversed my prior conclusion that
SeqRecord is inadequate for holding variant data. It is still not ideal, but the advantages of using an existing native object are substantial, and the disadvantages can be reduced by creating an accessor for the variant-specific data within a
I’ve made an outline of how I would store the information returned by PyVCF within
SeqFeature objects. It includes a few questions about the most logical way to store certain variant information.
The existing sequence-related structures in Biopython, such as
SeqFeature, are not currently designed to contain variant data. Nevertheless, there are many features implemented for
SeqRecord etc. that would be very useful. It seems essential to be able to convert at least a subset of variant data to
SeqRecord. The data I’ve encountered so far in variants are organized in key-value pairs and can be stashed in the annotations/qualifiers dicts if no better place can be found (while new properties can be added to a Python object at any time, a user may not necessarily know to look for them).
Initially, I considered making a generalized variant object that can describe any known variant and has methods to convert it to
SeqRecord. However, it would be more efficient to have an adapter for each parser to convert the variant data into
SeqRecord. To ameliorate the disadvantages of storing variant data in a
SeqRecord, I would implement an accessor to allow more meaningful access to the variant-specific info in the annotations/qualifiers dicts.
If variant formats were included in SeqIO, my main concern would be whether it’s possible to explicitly prevent conversion between formats that don’t make sense. Format type checking could be used to tack on this restriction, but I am wondering if there is a more elegant way. Are all of the existing SeqIO formats interconvertible?
Anyway, if variants are included in SeqIO, variant parser adapters should inherit from Bio.SeqIO.SequenceIterator. That base class could be converted to an ABC once Biopython is ready for 2.6+. Does anybody have thoughts on whether Biopython should start using super() in inheritance hierarchies?
Tentative organization of VCF in SeqRecord
With all this in mind, I have been developing an outline of how to best store the data of PyVCF
_Call structures in Biopython
One general question is whether the
SeqFeature can or should be independent; that is, whether it should store of the information a user would want about that particular variation. The least redundant way to allow it access to data like chromosome would be to store a pointer to its parent
SeqRecord. But from a biological standpoint, I suspect it is illogical to separate calls from each other. Is this correct?
I’m also not clear on the ideal way to store the sequence information. HGVS recommends using a LRG (locus reference genome) with 5Kb upstream and 2Kb downstream. This seems like overkill for a general case. I believe I’ve seen cases where about 20 flanking base pairs are shown. Regardless, the
SeqRecord could hold some portion of the sequence and the
SeqFeatures could have a relative position within that sequence fragment and both the reference and alternate bases. Does this sound reasonable? I’m open to suggestions for a better way to store this critical portion of the variant data.
SeqRecord holds most of the info from _Record:
- the sequence (does this come from an external file? how much of it?)
- _Record level annotations
- start/end position of the whole sequence (Peter is talking about using
FeatureLocations for this)
- list of genotypes - as
SeqFeatures hold most of the info from _Call:
- the genotype name
- type (should the
SeqFeaturehave just the gt_type (ref, hom alt, het alt) or also the variant type (snp, indel, etc)?)
- start/end positions (relative to
- the ref/alt sequences (should each genotype have both ref/alt or just alt?)
- qualifiers: phased, gt_type, other optional fields
Well, as the coding period has now started, I’ll be pushing some prototypes to GitHub in the near future. Trial and error may be the best teacher at this point.
21 5 / 2012
I’ve been reading a lot of blog posts and StackExchange discussions about interfaces, inheritance, and their place in Python, and how Python features such as duck typing impact those as well as polymorphism. I’m coming from an admittedly hodgepodge background where I’m used to C++ design for large projects and Python for small. So it’s taking me some time to evaluate the prevailing opinions and determine best practices.
Regarding inheritance, I like this statement: “Implementation inheritance can always be replaced by interface inheritance plus composition” - Konrad Rudolph
I get the impression that Python interfaces are rare, little-known, and disliked. My initial tests suggest that they may be useless from a purely functional point of view; however, for structuring, they are hard to beat. For someone using the module, the interface is a promise of output. For someone extending the module, the interface is a guideline for requirements. The most tangible benefit, which Eric hinted at, is definition of an organizational structure that is not located between the ears of a single developer. The most exhaustive yet readable discussion of interfaces in Python I found is dated 2004, but it seems sensible. Also, a reference listing much Python interface related discussion.
Currently, the only formal interface-like implementation in Python is the ABC (abstract base class). I don’t think it’s worth creating an ABC. nor would I enforce inheritance from my interface. I would intend it as a guide for future developers.
16 5 / 2012
Brief summary of this post:
I don’t think
SeqFeature or an extension thereof would be appropriate for storing Variant data; therefore, I intend to make a new structure based on
_Call in PyVCF. I’m not sure if this structure should be associated with
Seq, i.e. by naming it
SeqVariant, and would like feedback on this question.
It could be very difficult to make PyVCF compatible with Python 2.5. Therefore, I am planning to write my project to be compatible with Python 2.6 and delaying its inclusion in the main Biopython branch until a future 2.6+ Biopython release. Alternate suggestions are welcome.
Representing variants within Biopython
Now I need to consider how the information stored in a variant file would best be represented in Biopython.
- Can variants be adequately described by an existing Biopython structure, and would this be logical?
My conclusion is no. The relevant sequence info could be stored as a
SeqRecord, but given the quantity of information for each site in a variant, I think a new structure is necessary.
SeqFeature is the closest option and Genbank does have feature keys such as “modified_base” and “variation”; however, Genbank doesn’t store the majority of fields provided by a variant file.
It would be possible to subclass and extend
SeqFeature, but I see a number of potential drawbacks. As discussed before, there are potential problems with inheritance of non-abstract classes.
SeqFeature and a hypothetical
SeqVariant structure might have some methods and attributes in common, but I’m doubtful that the overlap is sufficient to justify subclassing.
My biggest concern is whether it’s logical to categorize variant call information with the generic
SeqFeature is used by some formats in
SeqIO, but it would not be logical to include, for example, VCF in
SeqIO. GFF3 might be grouped with
SeqIO (but as it doesn’t always contain the full sequence, might not), but GVF seems to be an awkward stepchild of GFF3 and, like VCF, would not fit in with the
SeqIO family. If I’m misinterpreting the criteria for inclusion into
SeqIO, please correct me!
- What would a new Biopython structure for variants look like?
For now, I think the simplest way to integrate variant parsing into Biopython would be to borrow heavily from the
_Call objects used by PyVCF. James, please let me know if you have any changes you’d recommend from the
_Call structure to a generic Biopython variant structure.
My overall goal is to minimize changes from the native structure of the file format while maintaining a similar API to other Biopython modules. If I write functionality that is deliberately different from what a seasoned Biopython user might expect, I will be sure to extensively document the behavior.
PyVCF currently works with Python 2.6, aside from vcf_filter.py, which uses argparse and thus requires 2.7. Biopython currently supports 2.5+. Making PyVCF compatible with 2.5 might be too great of an undertaking, and thus making my variant project compatible with 2.5 might not be possible. Biopython may be ready to deprecate 2.5 support, so I am tentatively planning to write my project with 2.6 compatibility in mind and wait to merge back into Biopython when a 2.6+ release is planned. Based on the timeline in the news post discussing dropping 2.4 support, the last version of Biopython to support 2.5 could conceivably be released in 2012 (Python 2.5 was released six years ago, September 2006, according to Guido’s history). I’ll defer to Peter’s judgment on this topic, of course.
This week I will solidify the structure so I am ready for the end of the community bonding period and the start of coding on May 21.
Mailing list thread: http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009637.html
06 5 / 2012
“If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”
The second week of the GSoC community bonding phase overlapped with my finals week, so my work has been primarily limited to the cognitive stage. I’m glad for the enforced planning time before coding; I’ve made some important refinements to my overall plan.
I got caught up on some reading, too. I posted a picture of my stack of books. I am now almost finished with The Pragmatic Programmer (1999, Andrew Hunt and David Thomas). While I don’t agree with all of their ideas (such as “write your own pseudocode to autogenerate all code!” and “UML is for dummies”), their overall philosophy is sound. Plan but don’t overplan, structure but don’t get bogged down in it. I’ve also been brushing up on my OO theory with Programming with Objects (2003, Avinash Kak). I’ve actually met Professor Kak; he teaches here at Purdue University. The book is focused on C++ and Java, but I’m still able to get the theoretical background I’m looking for. I finally wrapped my brain around how polymorphism is implemented as well as the theoretical and practical distinctions between IsA (inheritance) and HasA (aggregation and composition).
Armed with this knowledge, I threw together a basic UML diagram. The rest of this post will be about this diagram.
My main goals are not limited to:
- Make the structure parser and file-format agnostic: an abstracted OO design should allow anything to be slotted in (for example, Marjan’s C GFF parser?)
- Maintain encapsulation: limit how much each object can see of objects above and below it
- Allow extension at multiple levels: some existing parsers may process data in different ways; this structure should allow handling both raw data and data in various formats.
Variant object’s constructor allows an end user to change the default parsers. Practical implementation details of
write() will need to be finessed - for example, ways to help the user sift through immense quantities of data. I’m still in the process of comparing the data contained in VCF/GVF files as well as the APIs of PyVCF and BCBio.GFF.
Writer are both abstract classes that will define all methods found in known parsers/writers with
NotImplementedErrors. I’m speculating on whether a Variant-specific exception would be useful, but a custom message should suffice.
Continuing down the diagram,
BCBioGFFWrapper would each inherit from both
Writer. As the name implies, they would serve as the adapter between the generic
Variant and the specific parser.
I anticipate that this structure could easily be extended to allow intermediate storage in DBs as well as innumerable sorting/comparing/filtering methods inside
I would appreciate any and all feedback about the overall structure. Namespace is definitely flexible. I’d also appreciate any specific genomic variant workflows, and if somebody can point me to smallish sample files of the same data in both VCF and GVF, I’d be eternally grateful.
See also: Discussion on GSoC mailing list
05 5 / 2012
24 4 / 2012
I’m pleased to announce that I will be spending the summer working on Biopython, funded by Google Summer of Code 2012!
OBF announcement: http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/
Project abstract: http://www.google-melange.com/gsoc/project/google/gsoc2012/lenna/23001