06 5 / 2012

“If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”

Albert Einstein

The second week of the GSoC community bonding phase overlapped with my finals week, so my work has been primarily limited to the cognitive stage. I’m glad for the enforced planning time before coding; I’ve made some important refinements to my overall plan.

I got caught up on some reading, too. I posted a picture of my stack of books. I am now almost finished with The Pragmatic Programmer (1999, Andrew Hunt and David Thomas). While I don’t agree with all of their ideas (such as “write your own pseudocode to autogenerate all code!” and “UML is for dummies”), their overall philosophy is sound. Plan but don’t overplan, structure but don’t get bogged down in it. I’ve also been brushing up on my OO theory with Programming with Objects (2003, Avinash Kak). I’ve actually met Professor Kak; he teaches here at Purdue University. The book is focused on C++ and Java, but I’m still able to get the theoretical background I’m looking for. I finally wrapped my brain around how polymorphism is implemented as well as the theoretical and practical distinctions between IsA (inheritance) and HasA (aggregation and composition).


Armed with this knowledge, I threw together a basic UML diagram. The rest of this post will be about this diagram.

My main goals include, but are not limited to:

  • Make the structure parser- and file-format-agnostic: an abstracted OO design should allow anything to be slotted in (for example, Marjan’s C GFF parser?)
  • Maintain encapsulation: limit how much each object can see of objects above and below it
  • Allow extension at multiple levels: some existing parsers may process data in different ways; this structure should allow handling both raw data and data in various formats.

The Variant object’s constructor allows an end user to change the default parsers. Practical implementation details of parse() and write() will need to be finessed - for example, ways to help the user sift through immense quantities of data. I’m still in the process of comparing the data contained in VCF/GVF files as well as the APIs of PyVCF and BCBio.GFF.
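To make the constructor idea concrete, here is a minimal sketch, not the final design. Only Variant, parse() and write() come from the plan above; LineParser, LineWriter and UppercaseParser are hypothetical stand-ins invented for illustration, and the tab-separated input is an assumption rather than real VCF/GVF handling.

```python
import io


class LineParser(object):
    """Hypothetical default parser: one record per non-comment line."""

    def parse(self, handle):
        for line in handle:
            line = line.rstrip("\n")
            if line and not line.startswith("#"):
                yield line.split("\t")


class LineWriter(object):
    """Hypothetical default writer: tab-joins each record."""

    def write(self, records, handle):
        for record in records:
            handle.write("\t".join(record) + "\n")


class Variant(object):
    """Holds variant records; the parser and writer can be swapped in."""

    def __init__(self, parser=None, writer=None):
        # Fall back to the defaults unless the caller supplies their own.
        self._parser = parser if parser is not None else LineParser()
        self._writer = writer if writer is not None else LineWriter()
        self.records = []

    def parse(self, handle):
        self.records = list(self._parser.parse(handle))
        return self.records

    def write(self, handle):
        self._writer.write(self.records, handle)


class UppercaseParser(LineParser):
    """A user-supplied replacement, showing the default being overridden."""

    def parse(self, handle):
        for fields in LineParser.parse(self, handle):
            yield [field.upper() for field in fields]


data = "#comment\nchr1\t100\ta\tg\n"

default = Variant()
default.parse(io.StringIO(data))

custom = Variant(parser=UppercaseParser())
custom.parse(io.StringIO(data))

out = io.StringIO()
custom.write(out)
```

Passing the parser in, rather than hard-coding it, is what keeps Variant unaware of which library does the actual work.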

Parser and Writer are both abstract classes that will declare every method found in known parsers/writers, each raising NotImplementedError until a subclass overrides it. I’m still debating whether a Variant-specific exception would be useful, but a custom message should suffice.
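A rough sketch of the abstract-class idea, using a custom message instead of a custom exception. The method names get_metadata() and the helper _unsupported() are hypothetical, as is HalfParser, which exists only to show what a partial implementation would look like.

```python
class Parser(object):
    """Abstract base: declares the methods, implements none of them."""

    def _unsupported(self, name):
        # Build the custom message from the concrete subclass's name.
        return NotImplementedError(
            "%s does not support %s()" % (type(self).__name__, name))

    def parse(self, handle):
        raise self._unsupported("parse")

    def get_metadata(self, handle):  # hypothetical method name
        raise self._unsupported("get_metadata")


class Writer(object):
    """Abstract base for writers, following the same pattern."""

    def _unsupported(self, name):
        return NotImplementedError(
            "%s does not support %s()" % (type(self).__name__, name))

    def write(self, records, handle):
        raise self._unsupported("write")


class HalfParser(Parser):
    """Hypothetical subclass that only implements parse()."""

    def parse(self, handle):
        return [line.strip() for line in handle]


try:
    HalfParser().get_metadata(None)
except NotImplementedError as exc:
    message = str(exc)
```

A caller who hits a method the underlying parser lacks gets a message naming the concrete wrapper, which may be enough without a dedicated exception type.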

Continuing down the diagram, PyVCFWrapper and BCBioGFFWrapper would each inherit from both Parser and Writer. As the name implies, they would serve as the adapter between the generic Variant and the specific parser.
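The adapter role can be sketched as follows. To avoid guessing at the real PyVCF or BCBio.GFF calls, this uses ToyBackend, a made-up stand-in for a third-party library with its own method names (read_rows and dump_rows are invented). A real PyVCFWrapper would translate to that library's actual API in the same spots.

```python
import io


class Parser(object):
    def parse(self, handle):
        raise NotImplementedError("parse() is not supported")


class Writer(object):
    def write(self, records, handle):
        raise NotImplementedError("write() is not supported")


class ToyBackend(object):
    """Hypothetical stand-in for a third-party library with its own API."""

    @staticmethod
    def read_rows(handle):
        return [line.split() for line in handle if line.strip()]

    @staticmethod
    def dump_rows(rows, handle):
        for row in rows:
            handle.write(" ".join(row) + "\n")


class ToyBackendWrapper(Parser, Writer):
    """Adapter: exposes the generic parse()/write() over the backend's names."""

    def parse(self, handle):
        return ToyBackend.read_rows(handle)

    def write(self, records, handle):
        ToyBackend.dump_rows(records, handle)


text = "chr1 100 A G\nchr2 200 C T\n"
wrapper = ToyBackendWrapper()
rows = wrapper.parse(io.StringIO(text))

buffer = io.StringIO()
wrapper.write(rows, buffer)
```

Because the wrapper is both a Parser and a Writer, the same object could be handed to Variant for either role.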

I anticipate that this structure could easily be extended to allow intermediate storage in DBs as well as innumerable sorting/comparing/filtering methods inside Variant.


I would appreciate any and all feedback about the overall structure. Namespace is definitely flexible. I’d also appreciate any specific genomic variant workflows, and if somebody can point me to smallish sample files of the same data in both VCF and GVF, I’d be eternally grateful.

See also: Discussion on GSoC mailing list

05 5 / 2012

Preliminary class hierarchy (made with Dia, click for bigger)
