28 9 / 2012
20 8 / 2012
The coordinate mapper, with updated documentation, is now located on this branch, awaiting the merging of Peter’s f_loc4 branch.
I’ve written an entry on coordinate mapping for the Cookbook.
Additionally, at Peter’s suggestion, I’ve written a clarification of strand as it relates to transcription and translation. It’s available here.
It’s been a great experience working with this project this summer. Thank you to everyone involved.
16 8 / 2012
The summer is winding to a close. I’ve spent this week busy with orientation events and meetings for my upcoming PhD program. I hope to have time to continue to contribute to Biopython in my spare time, and ideally I would like to use and expand Biopython as a portion of my research.
I have been considering how to handle gene strandedness. As long as I’m correctly interpreting the following position, my coordinate mapper should produce the correct coordinates with negative strand or mixed strand features.
GenBank: join(complement(25..30), 36..40) Biopython: FeatureLocation(24, 30, -1) + FeatureLocation(35, 40)
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 <---------------- -------------> 5 4 3 2 1 0 6 7 8 9 10
I have to admit that it wasn’t until I read a BioStar post earlier this week that I fully understood the relationship between plus/minus forward/reverse sense/antisense coding/template strands. So please let me know as soon as possible if I’ve made a mistake in the above code.
c2g yields the correct genome position, but not the strand. I still need to integrate strand information into my
GenomePosition object and/or partially merge it with
ExactLocation. This weekend I intend to expand documentation and write a brief cookbook entry.
13 8 / 2012
All mapping functions use base 0 internally. Transformation to and from 1-based coords is allowed by custom MapPosition objects. (they are currently separate from the Seq* positions but could probably subclass ExactPosition). The MapPosition objects have to_dialect and from_dialect methods that automatically handle conversion between bases and other formatting details.
There are two different ways a user can convert a coordinate from HGVS:
# ... assuming cm is an instance of CoordinateMapper # Manually construct position from HGVS CDS_coord = CDSPosition.from_hgvs("6+1") genomic_coord = cm.c2g(CDS_coord) print genomic_coord.to_hgvs() # Pass dialect argument to mapping function genomic_coord = cm.c2g("6+1", dialect="HGVS") print genomic_coord.to_hgvs()
Furthermore, the inheritance hierarchy is designed to allow a user to set a default string representation:
# Set MapPositions to print as HGVS by default def use_hgvs(self): return str(self.to_hgvs()) MapPosition.__str__ = use_hgvs
The revision as of this writing is passing tests using base 0. I have not yet implemented tests for
to_hgvs, but that’s next on my list. I’m hoping to have time for strand and mixed strand, too.
The latest revision now tests with default settings and HGVS settings.
07 8 / 2012
I have been expanding the coordinate mapper Reece posted to the dev list a couple of years ago.
It’s currently living as a gist, although it has grown rather precipitously (over 300 lines each of code and testing). I may have gotten slightly carried away with the concept of “test-driven development,” but in this case, extensive testing is extremely critical. Note that as of this writing, the code is in disarray; it was much less messy before I started testing it with 1-based output. (More on that below.)
I have modified it to work with the new
CompoundLocation Peter is working on, while retaining the ability to provide exons as a sequence of pairs.
Representing intron locations
One of the more complicated operations in coordinate mapping is converting a genomic intron position to CDS coords. I am using the HGVS conventions, in which positions are shown in the format 123+x and 124-y, where 123 is the last base of one exon in CDS coords and 124 is the first base of the next exon. I’m using a custom object called
Are there standards that use an alternative display for intron positions relative to CDS coords?
I’m currently using a configuration class to store customization for display of positions, most importantly conversion between the 0-based, Pythonic internal representation and 1-based representation used by many formats (GenBank and HGVS, to name just a few).
However, I have read a variety of StackOverflow threads that imply that if I’m trying to use globals or a singleton class, I’m doing something wrong. Is it unwise to use a class simply as a storage object? Does anyone have a “more Pythonic” way to handle module-wide configuration?
On a related note, I have been fiddling with adding and subtracting the pesky 1s all day, and I am thinking that I need to adjust my design and refactor the code slightly to maintain internal 0-based coordinates while adjusting the string representation to match 0- or 1-based coords. I’ll likely use a class similar to
ExactPosition (i.e. subclass
int). Actually, would it be useful to add to the *Position objects a method for representation in GenBank and other formats?
I have a few questions about negative integers. Can genomic or protein coordinates ever have negative coordinates? For CDS, as it be confusing for the integer -1 and the string ‘-1’ to be treated differently, my code interprets both as a downstream position. Using the pattern of the Python list index doesn’t make biological sense.
I haven’t yet tackled circular genomes. Are they handled by
FeatureLocationyet? In GFF, the end is represented as an index greater than the “length” of the entire sequence. In order to handle this, it should only be a matter of
I also haven’t considered how strand will influence coordinate mapping. Any pointers in this direction would be appreciated.
Are there Biopython objects other than
FeatureLocationthat should be transformable? For example, is it worth attempting to map non-exact locations across coordinates?
All that said, I hope to have a production-ready coordinate mapper by the end of the week.
26 7 / 2012
I previously proposed the implementation of a method for PyVCF that would quickly scan the entire file and provide useful summary statistics. The idea is shamelessly copied from Brad’s GFF parser; for GFF, this method is helpful because the annotations on a sequence can vary widely. However, I no longer think this would be useful for VCF:
Most importantly, the VCF headers generally contain a complete listing of all of the types of information contained in the file. It’s technically optional, but I hope that the most commonly used variant callers produce accurate headers. However, if there is a prevalence of files with a mismatch between headers and actual INFO/FORMAT fields, please let me know.
Next, any listing of ranges of data such as POS or QUAL might as well be coupled with actual filtering. This would be different if a presentation of the distribution of quality scores would be necessary to set an appropriate threshold. It would also depend on the ratio of speed between the range scan and the filtering (i.e. whether a possible second filter would be unacceptably time consuming).
Finally, and perhaps most importantly, many files are so large that scanning an entire file would take too long. Setting a limit and displaying updated information in real time (i.e. writing to
sys.stdoutwith ‘\r’, example) could overcome this issue.
If anyone can think of a great reason to scan a VCF file before filtering it, please get in touch.
I added the method
as_SeqFeature() to my basic variant class, but it’s still incomplete. Some of this is in flux due to forthcoming changes to FeatureLocation.
I’m currently working on expanding the coordinate mapper Reece posted to the dev list a couple years ago. Expect an update on that very soon.
17 7 / 2012
The highlight of this week was getting strep throat. I also gave myself a quick lesson in setuptools (working on an unrelated project), which is helpful for understanding how various projects do their building and testing. The fragmentation of standards is certainly not one of the things I love about Python (e.g. getopt/optparse/argparse; distutils/setuptools/distribute/distutils2).
Reece suggested trying to represent a variety of variants with just five identifiers: accession, start, stop, pre_seq, and post_seq. I’ve started a very minimal Variant object (source), using
FeatureLocation for its location. This uses zero-based, right-open coordinates, similar to array counting in Python. In contrast, HGVS and VCF both count from 1.
I’ve created a list of variant types each represented in HGVS, VCF (if possible), and my new Python representation. Please let me know if there are any errors in my interpretation of these variant types.
g.303G>C #CHROM POS ID REF ALT [...] 19 303 . G C [...] (DNAaccession, 302, 303, G, C)
g.[303G>A];g.[303G>C] #CHROM POS ID REF ALT [...] FORMAT SAMPNAME 19 303 . G A,C [...] GT 1|2 [(DNAaccession, 302, 303, G, A), (DNAaccession, 303, 304, G, C)] phases = [True]
r.78u>a # can VCF represent RNA? (RNAaccession, 77, 78, u, a)
variant in CDS coords (coding region):
c.3G>C # can VCF represent CDS? (CDSaccession, 2, 3, G, C) # Given the FeatureLocations of the CDS, convert to genomic coords
variant in protein:
p.Trp26Cys # can VCF represent CDS? (proteinaccession, 25, 26, Trp, Cys)
g.77_79dup #CHROM POS ID REF ALT 19 79 . G GCTG # OR 19 77 . CTG CTGCTG (DNAaccession, 79, 79, "", "CTG") # OR (DNAaccession, 76, 79, CTG, CTGCTG)
09 7 / 2012
This week, I wrote a script for PyVCF that can filter a file by sample as it’s being parsed. It’s currently named
vcf_sample_filter.py. It’s designed to be functional from the command line, the Python interpreter, or as a module. I’ve written basic unit tests for the command line form.
To minimize impact on the existing parser, I’m adding the necessary methods to the Reader object at runtime of the filtering script. I believe the technical term for this is “monkey patching.”
I see no reason this script could not be used in combination with the filter by row script, although I’m not sure what’s the best way to combine the two. On the other hand, the unit tests indicate that vcf_filter is broken and I can’t get it to accept custom arguments.
I want to move the object into the vcf module and leave the command line portion in the scripts directory. I will also add unit tests/examples for the module form.
Next up: come up with a generic-via-extensibility representation of a variant. I was thinking about trying to store a HGVS variant in PyVCF’s objects but with all the properties it could get ugly and would no doubt be very confusing/misleading.
I’m working through some examples and should have a basic outline soon.
30 6 / 2012
Given the choice of attending an aerospace conference or spending three days writing Python in the lobby of a hotel, I chose the latter, which turned out to be rather productive. I finished a prototype writer that reverses the VCF to SQL trip, discovering more of the peculiarities of the meta-format along the way. However, it seems that my SQL project may have been relegated to being a time-consuming primer in the maddeningly data-dense nature of VCF. I’ve been convinced that file to python to sql to python to file is not going to be particularly efficient.
And so back to the drawing board.
In re(re-re-re)considering the decision of making a dedicated Python variant object versus using SeqFeature directly, I’ve emailed the Biopython list to ask for feedback. For now, I intend to make a variant object and the ability to convert it to SeqFeature.
I’ve made a new branch (variant2) which has a very skeletal outline of a set of Python objects designed to store variants. One might note many similarities to the organization of PyVCF. One thing SQL did neatly was store per-allele data with the allele, rather than with the site, and I’m envisioning doing this in Python, as well.
For a Python variant object, are there any organizational choices that would make it easier for possible future conversion of a variant to HGVS syntax? (this is primarily directed at Reece but I’m open to all suggestions)
Another question that may reveal my complete ignorance of haplotypes and such: could a polyploid site ever be partially phased? e.g. a triploid genotype of 0/1|0?
Looking forward to any and all questions, comments, concerns, etc.