import antigravity

June 2012

3 posts

Weeks 4 and 5

Given the choice between attending an aerospace conference and spending three days writing Python in a hotel lobby, I chose the latter, which turned out to be rather productive. I finished a prototype writer that reverses the VCF-to-SQL trip, discovering more of the peculiarities of the meta-format along the way. However, it seems that my SQL project may have been relegated to a time-consuming primer on the maddeningly data-dense nature of VCF. I’ve been convinced that going from file to Python to SQL to Python to file is not going to be particularly efficient.

And so back to the drawing board. 

In re(re-re-re)considering the decision between making a dedicated Python variant object and using SeqFeature directly, I’ve emailed the Biopython list to ask for feedback. For now, I intend to write a variant object along with a way to convert it to a SeqFeature.

I’ve made a new branch (variant2) which has a very skeletal outline of a set of Python objects designed to store variants. One might note many similarities to the organization of PyVCF. One thing SQL did neatly was store per-allele data with the allele, rather than with the site, and I’m envisioning doing this in Python, as well. 
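
As a rough sketch of what "per-allele data stored with the allele" could look like in Python: the class and attribute names below (Allele, Site, info) are hypothetical and are not taken from the variant2 branch or from PyVCF.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Allele:
    """One allele at a site, carrying its own per-allele annotations."""
    sequence: str
    is_reference: bool = False
    info: Dict[str, Any] = field(default_factory=dict)  # e.g. AF, AC


@dataclass
class Site:
    """A variant site: a position plus the alleles observed there."""
    chrom: str
    pos: int  # 1-based, as in VCF
    alleles: List[Allele] = field(default_factory=list)
    info: Dict[str, Any] = field(default_factory=dict)  # site-wide values only

    @property
    def ref(self) -> Allele:
        # The single allele flagged as the reference.
        return next(a for a in self.alleles if a.is_reference)

    @property
    def alts(self) -> List[Allele]:
        # Every non-reference allele, in the order given.
        return [a for a in self.alleles if not a.is_reference]


site = Site(
    chrom="20",
    pos=14370,
    alleles=[Allele("G", is_reference=True), Allele("A", info={"AF": 0.5})],
    info={"DP": 14},
)
print(site.alts[0].info["AF"])
```

With this layout the allele frequency travels with the "A" allele itself, so there is no need to line up a comma-separated list on the site against the order of the ALT column.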

For a Python variant object, are there any organizational choices that would make a possible future conversion of a variant to HGVS syntax easier? (This is primarily directed at Reece, but I’m open to all suggestions.)

Another question that may reveal my complete ignorance of haplotypes and such: could a polyploid site ever be partially phased, e.g. a triploid genotype of 0/1|0?
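
To make the question concrete, here is a small sketch of what a representation would need in order to hold such a genotype. It does not settle whether mixed separators are valid; it only shows that a single "phased" flag per genotype would not be enough, and that one separator per adjacent pair of alleles would be. The function name parse_genotype is hypothetical.

```python
import re


def parse_genotype(gt):
    """Split a GT string into allele indices and the separators between them."""
    # The capturing group makes re.split keep the separators in the result.
    tokens = re.split(r"([/|])", gt)
    alleles = [None if t == "." else int(t) for t in tokens[0::2]]
    separators = tokens[1::2]  # "/" = unphased, "|" = phased
    return alleles, separators


alleles, separators = parse_genotype("0/1|0")
print(alleles, separators)
```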

Looking forward to any and all questions, comments, concerns, etc.

Mailing list: http://lists.open-bio.org/pipermail/gsoc/2012/000141.html

Jun 30, 2012
#gsoc #gsoc12 #python #gsoc2012

Weeks 2 and 3: More SQL

James raised some concerns about the difficulty of representing the VCF “metaformat” in SQL. I’ve taken these into consideration and am forging ahead. So far, some of the types of data fit more neatly into SQL than into a VCF row.

I have redesigned my SQL schema with a two-pronged approach to tackle the flexibility of VCF:

  1. For the site, alt, and genotype tables, there are columns for the reserved info/format keywords in the VCF spec (so far only for non-SV).
  2. For new info and format keywords (both in the header and in the body), I am storing the values in a “narrow table.” This table stores a foreign key to the key’s row and the key-value pair. The narrow table is also good for storing reserved keys that are lists (but not per-allele or per-genotype).
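
The two-pronged layout can be sketched with SQLite as follows. This is one reading of the description above, not the actual schema: the table and column names (site, site_info, info_key, info_value) are made up for illustration, and only the site table is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (
    id    INTEGER PRIMARY KEY,
    chrom TEXT NOT NULL,
    pos   INTEGER NOT NULL,
    dp    INTEGER              -- a reserved INFO keyword gets its own column
);
CREATE TABLE site_info (       -- the narrow table: one row per key-value pair
    site_id    INTEGER NOT NULL REFERENCES site(id),
    info_key   TEXT NOT NULL,
    info_value TEXT
);
""")

conn.execute("INSERT INTO site (id, chrom, pos, dp) VALUES (1, '20', 14370, 14)")
# Non-reserved keywords go into the narrow table; a list becomes several rows.
conn.executemany(
    "INSERT INTO site_info (site_id, info_key, info_value) VALUES (?, ?, ?)",
    [(1, "MYFLAG", "1"), (1, "CUSTOM_LIST", "a"), (1, "CUSTOM_LIST", "b")],
)
conn.commit()

rows = conn.execute(
    "SELECT info_key, info_value FROM site_info WHERE site_id = ? ORDER BY rowid",
    (1,),
).fetchall()
print(rows)
```

The trade-off is that values in the narrow table are stored as text, so any typing declared in the VCF header has to be reapplied when reading them back.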

Note: this diagram only has the FKs listed for simplicity.

Interestingly, despite the increase in the number of tables and thus insert statements, the current script is considerably faster than the previous version. Evidently JSON serialization is slow.

There are a few things I haven’t figured out:

  1. Can an info field be per-genotype? The spec implies that wouldn’t make sense, but doesn’t forbid it.
  2. Is there a safe way to find out if a VCF 4.0 field is per-allele or per-genotype?
  3. Will my SQL representation be able to handle SV?

I’ll be out of town for the next week but I will have plenty of time for Python.

Mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009725.html

Jun 18, 2012
#biopython #gsoc #gsoc12

Weeks 0 and 1: SQL

I started implementing storage of VCF data in SeqRecord and SeqFeature. I digressed, spending a few days experimenting with overloading __getattr__() in lieu of manually writing properties. Then it occurred to me that if, as Reece pointed out, a variant doesn’t contain the actual sequence but a reference to the sequence, the advantages to using SeqRecord are minimal or possibly negative.
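
For readers unfamiliar with the __getattr__() approach, a minimal sketch follows. The VariantView class is hypothetical and is not the code from that experiment; it only shows how dictionary entries can be exposed as attributes without writing one property per key.

```python
class VariantView:
    """Expose entries of an INFO-style dict as attributes."""

    def __init__(self, info):
        self._info = dict(info)

    def __getattr__(self, name):
        # Python calls __getattr__ only after normal attribute lookup fails,
        # so real attributes such as _info are never routed through here.
        try:
            return self.__dict__["_info"][name]
        except KeyError:
            raise AttributeError(name) from None


v = VariantView({"DP": 14, "AF": 0.5})
print(v.DP, v.AF)
```

Going through self.__dict__ rather than self._info avoids infinite recursion if __getattr__ is ever invoked before __init__ has run, for example while unpickling.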

In my experience, SQL offers the highest performance for filtering large amounts of data. It also scales well: SQLite now ships with Python, users can choose to run their own MySQL or PostgreSQL server, and I’ve read about a few approaches to GPU-accelerated SQL.

My initial glances at BioSQL, GMOD, etc. didn’t show anything specifically designed for variants (again, the focus is on storage of the sequence itself), so I implemented my own interface. Currently, the parse_all() method is very slow (approximately 260 seconds for a file with 240,000 variants, even though the parsing alone takes 5-10 seconds) and I am investigating why. My first step will be to reduce commit frequency. Update: Reducing commit frequency has brought this step down to 40 seconds.
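
The commit-frequency change can be illustrated with the standard sqlite3 module. This is a sketch under assumed names (load_variants, a four-column variant table), not the actual parse_all() code: rows are buffered and written with one commit per batch instead of one commit per row.

```python
import sqlite3


def load_variants(conn, variants, batch_size=10000):
    """Insert (chrom, pos, ref, alt) tuples, committing once per batch."""
    insert = "INSERT INTO variant VALUES (?, ?, ?, ?)"
    batch = []
    total = 0
    for row in variants:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)
            conn.commit()  # one commit for the whole batch
            total += len(batch)
            batch = []
    if batch:  # flush whatever is left over
        conn.executemany(insert, batch)
        conn.commit()
        total += len(batch)
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variant (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)")
fake_variants = (("20", i, "A", "G") for i in range(1, 25001))
inserted = load_variants(conn, fake_variants)
count = conn.execute("SELECT COUNT(*) FROM variant").fetchone()[0]
print(inserted, count)
```

On a file-backed database each commit forces a write to disk, which is why committing per row tends to dominate the run time.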

With a SQL backend, it seems superfluous to have a dedicated variant representation within Python. The SQL result object should allow for straightforward retrieval of data by name. I’m storing “misc” data in a SQL text field using JSON, which is also easy to access.
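
A short sketch of both ideas, using sqlite3.Row for access by column name and a JSON-encoded text column for the "misc" data. The table layout and the misc column name are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.execute(
    "CREATE TABLE variant (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, misc TEXT)"
)
conn.execute(
    "INSERT INTO variant VALUES (?, ?, ?, ?, ?)",
    ("20", 14370, "G", "A", json.dumps({"DB": True, "NS": 3})),
)
conn.commit()

row = conn.execute("SELECT * FROM variant WHERE pos = ?", (14370,)).fetchone()
misc = json.loads(row["misc"])  # decode the free-form annotations
print(row["chrom"], row["ref"], row["alt"], misc["NS"])
```

Values inside the JSON column cannot be filtered with plain SQL comparisons, so anything that needs to appear in a WHERE clause is better kept in its own column.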

Next:

  • Looking at BioSQL, GMOD, etc. to see if there is an existing standard I should be following
  • Deciding the extent of the convenience functions I wish to implement
  • Thinking about the most efficient way to filter records on the way into the SQL database

Mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009682.html

Jun 3, 2012
#BioPython #GSOC2012 #gsoc #openbio #SQL