Given the choice of attending an aerospace conference or spending three days writing Python in the lobby of a hotel, I chose the latter, which turned out to be rather productive. I finished a prototype writer that reverses the VCF-to-SQL trip, discovering more of the peculiarities of the meta-format along the way. However, it seems that my SQL project may have been relegated to a time-consuming primer in the maddeningly data-dense nature of VCF. I’ve become convinced that going from file to Python to SQL to Python to file is not going to be particularly efficient.
And so back to the drawing board.
In re(re-re-re)considering the decision of making a dedicated Python variant object versus using SeqFeature directly, I’ve emailed the Biopython list to ask for feedback. For now, I intend to make a variant object and the ability to convert it to SeqFeature.
I’ve made a new branch (variant2) which has a very skeletal outline of a set of Python objects designed to store variants. One might note many similarities to the organization of PyVCF. One thing the SQL schema did neatly was store per-allele data with the allele, rather than with the site, and I plan to do the same in Python.
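To make the per-allele idea concrete, here is a minimal sketch. It is not the code on the variant2 branch: the `Allele` and `Site` classes and the `from_vcf_fields` helper are hypothetical names made up for illustration. It assumes the VCF convention that an INFO field declared with `Number=A` (such as `AF`) lists one value per ALT allele, in ALT order, so each value can be moved onto the allele it describes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Allele:
    """One allele at a site, carrying its own annotations (hypothetical class)."""
    sequence: str
    is_reference: bool = False
    info: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Site:
    """A variant site: a position plus the alleles observed there (hypothetical class)."""
    chrom: str
    pos: int
    alleles: List[Allele] = field(default_factory=list)
    info: Dict[str, Any] = field(default_factory=dict)  # site-level data only

    @classmethod
    def from_vcf_fields(cls, chrom, pos, ref, alts, per_alt_info=None, site_info=None):
        """Build a Site, attaching each per-ALT value to its own allele."""
        per_alt_info = per_alt_info or {}
        alleles = [Allele(ref, is_reference=True)]
        for i, alt in enumerate(alts):
            # Number=A fields hold one value per ALT allele, in ALT order.
            alleles.append(Allele(alt, info={k: v[i] for k, v in per_alt_info.items()}))
        return cls(chrom, pos, alleles, dict(site_info or {}))


# A site with two ALT alleles; AF is split per allele, DP stays with the site.
site = Site.from_vcf_fields("20", 1110696, "A", ["G", "T"],
                            per_alt_info={"AF": [0.333, 0.667]},
                            site_info={"DP": 10})
```

With this layout, asking for the frequency of the `T` allele is a lookup on that allele (`site.alleles[2].info["AF"]`) rather than an index into a site-level list that has to be kept in step with the ALT column.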
For a Python variant object, are there any organizational choices that would make a possible future conversion of a variant to HGVS syntax easier? (This is primarily directed at Reece, but I’m open to all suggestions.)
Another question that may reveal my complete ignorance of haplotypes and such: could a polyploid site ever be partially phased, e.g. a triploid genotype of 0/1|0?
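Whether such a genotype is biologically meaningful is the open question, but representing it is straightforward if phase is stored per separator rather than per genotype. The sketch below assumes only the VCF convention that `/` separates unphased alleles, `|` separates phased ones, and `.` marks a missing call; `parse_genotype` is a hypothetical helper, not part of PyVCF or Biopython.

```python
import re
from typing import List, Optional, Tuple


def parse_genotype(gt: str) -> Tuple[List[Optional[int]], List[bool]]:
    """Split a GT string into allele indices and one phase flag per separator."""
    # The capturing group keeps the separators in the result, so alleles land
    # on the even positions and separators on the odd ones.
    tokens = re.split(r"([/|])", gt)
    alleles = [None if t == "." else int(t) for t in tokens[0::2]]
    phased = [sep == "|" for sep in tokens[1::2]]
    return alleles, phased


alleles, phased = parse_genotype("0/1|0")
```

For `0/1|0` this yields the allele indices `[0, 1, 0]` and the phase flags `[False, True]`, so a fully phased, fully unphased, or mixed genotype all fit the same structure.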
Looking forward to any and all questions, comments, concerns, etc.
Mailing list: http://lists.open-bio.org/pipermail/gsoc/2012/000141.html
