04 4 / 2013

Using virtualenvburrito/virtualenvwrapper, the project associated with a virtualenv is specified in:

~/.virtualenvs/VENVNAME/.project

21 2 / 2013

Copied from a Python library’s documentation (details changed to protect the innocent):

"As an example of how useful Python libraries and syntax techniques can be, I will show how to create a list of all ‘ext’ files in the current directory using the os library:

import os
the_files = [file for file in os.listdir(os.getcwd()) if file[-4:]=='.ext']

28 9 / 2012

GSoC t-shirt

GSoC t-shirt

28 9 / 2012

GSoC certificate

GSoC certificate

20 8 / 2012

The coordinate mapper, with updated documentation, is now located on this branch, awaiting the merging of Peter’s f_loc4 branch.

I’ve written an entry on coordinate mapping for the Cookbook.

Additionally, at Peter’s suggestion, I’ve written a clarification of strand as it relates to transcription and translation. It’s available here.

It’s been a great experience working with this project this summer. Thank you to everyone involved.

16 8 / 2012

The summer is winding to a close. I’ve spent this week busy with orientation events and meetings for my upcoming PhD program. I hope to have time to continue to contribute to Biopython in my spare time, and ideally I would like to use and expand Biopython as a portion of my research.

I have been considering how to handle gene strandedness. As long as I’m correctly interpreting the following position, my coordinate mapper should produce the correct coordinates with negative strand or mixed strand features.

GenBank: join(complement(25..30), 36..40) Biopython: FeatureLocation(24, 30, -1) + FeatureLocation(35, 40)

23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
   <----------------                ------------->
   5  4  3  2  1  0                 6  7  8  9  10

I have to admit that it wasn’t until I read a BioStar post earlier this week that I fully understood the relationship between plus/minus forward/reverse sense/antisense coding/template strands. So please let me know as soon as possible if I’ve made a mistake in the above code.

c2g yields the correct genome position, but not the strand. I still need to integrate strand information into my GenomePosition object and/or partially merge it with ExactLocation. This weekend I intend to expand documentation and write a brief cookbook entry.

13 8 / 2012

Following extensive discussion on the dev list of the pros and cons of configuration classes/modules, I have refactored my coordinate mapper to keep configuration as isolated as possible.

All mapping functions use base 0 internally. Transformation to and from 1-based coords is allowed by custom MapPosition objects. (they are currently separate from the Seq* positions but could probably subclass ExactPosition). The MapPosition objects have to_dialect and from_dialect methods that automatically handle conversion between bases and other formatting details.

There are two different ways a user can convert a coordinate from HGVS:

# ... assuming cm is an instance of CoordinateMapper
# Manually construct position from HGVS
CDS_coord = CDSPosition.from_hgvs("6+1")
genomic_coord = cm.c2g(CDS_coord)
print genomic_coord.to_hgvs()

# Pass dialect argument to mapping function
genomic_coord = cm.c2g("6+1", dialect="HGVS")
print genomic_coord.to_hgvs()

Furthermore, the inheritance hierarchy is designed to allow a user to set a default string representation:

# Set MapPositions to print as HGVS by default
def use_hgvs(self):
    return str(self.to_hgvs())
MapPosition.__str__ = use_hgvs

The revision as of this writing is passing tests using base 0. I have not yet implemented tests for from_hgvs or to_hgvs, but that’s next on my list. I’m hoping to have time for strand and mixed strand, too.

Update:

The latest revision now tests with default settings and HGVS settings.

07 8 / 2012

I have been expanding the coordinate mapper Reece posted to the dev list a couple of years ago.

It’s currently living as a gist, although it has grown rather precipitously (over 300 lines each of code and testing). I may have gotten slightly carried away with the concept of “test-driven development,” but in this case, extensive testing is extremely critical. Note that as of this writing, the code is in disarray; it was much less messy before I started testing it with 1-based output. (More on that below.)

I have modified it to work with the new CompoundLocation Peter is working on, while retaining the ability to provide exons as a sequence of pairs.

Representing intron locations

One of the more complicated operations in coordinate mapping is converting a genomic intron position to CDS coords. I am using the HGVS conventions, in which positions are shown in the format 123+x and 124-y, where 123 is the last base of one exon in CDS coords and 124 is the first base of the next exon. I’m using a custom object called NonExonPosition.

Are there standards that use an alternative display for intron positions relative to CDS coords?

Configuration

I’m currently using a configuration class to store customization for display of positions, most importantly conversion between the 0-based, Pythonic internal representation and 1-based representation used by many formats (GenBank and HGVS, to name just a few).

However, I have read a variety of StackOverflow threads that imply that if I’m trying to use globals or a singleton class, I’m doing something wrong. Is it unwise to use a class simply as a storage object? Does anyone have a “more Pythonic” way to handle module-wide configuration?

On a related note, I have been fiddling with adding and subtracting the pesky 1s all day, and I am thinking that I need to adjust my design and refactor the code slightly to maintain internal 0-based coordinates while adjusting the string representation to match 0- or 1-based coords. I’ll likely use a class similar to ExactPosition (i.e. subclass int). Actually, would it be useful to add to the *Position objects a method for representation in GenBank and other formats?

Other considerations

  • I have a few questions about negative integers. Can genomic or protein coordinates ever have negative coordinates? For CDS, as it be confusing for the integer -1 and the string ‘-1’ to be treated differently, my code interprets both as a downstream position. Using the pattern of the Python list index doesn’t make biological sense.

  • I haven’t yet tackled circular genomes. Are they handled by SeqFeature/FeatureLocation yet? In GFF, the end is represented as an index greater than the “length” of the entire sequence. In order to handle this, it should only be a matter of excepting an IndexError.

  • I also haven’t considered how strand will influence coordinate mapping. Any pointers in this direction would be appreciated.

  • Are there Biopython objects other than FeatureLocation that should be transformable? For example, is it worth attempting to map non-exact locations across coordinates?


All that said, I hope to have a production-ready coordinate mapper by the end of the week.

Mailing list:

26 7 / 2012

I previously proposed the implementation of a method for PyVCF that would quickly scan the entire file and provide useful summary statistics. The idea is shamelessly copied from Brad’s GFF parser; for GFF, this method is helpful because the annotations on a sequence can vary widely. However, I no longer think this would be useful for VCF:

  1. Most importantly, the VCF headers generally contain a complete listing of all of the types of information contained in the file. It’s technically optional, but I hope that the most commonly used variant callers produce accurate headers. However, if there is a prevalence of files with a mismatch between headers and actual INFO/FORMAT fields, please let me know.

  2. Next, any listing of ranges of data such as POS or QUAL might as well be coupled with actual filtering. This would be different if a presentation of the distribution of quality scores would be necessary to set an appropriate threshold. It would also depend on the ratio of speed between the range scan and the filtering (i.e. whether a possible second filter would be unacceptably time consuming).

  3. Finally, and perhaps most importantly, many files are so large that scanning an entire file would take too long. Setting a limit and displaying updated information in real time (i.e. writing to sys.stdout with ‘\r’, example) could overcome this issue.

If anyone can think of a great reason to scan a VCF file before filtering it, please get in touch.


I added the method as_SeqFeature() to my basic variant class, but it’s still incomplete. Some of this is in flux due to forthcoming changes to FeatureLocation.

I’m currently working on expanding the coordinate mapper Reece posted to the dev list a couple years ago. Expect an update on that very soon.

17 7 / 2012

The highlight of this week was getting strep throat. I also gave myself a quick lesson in setuptools (working on an unrelated project), which is helpful for understanding how various projects do their building and testing. The fragmentation of standards is certainly not one of the things I love about Python (e.g. getopt/optparse/argparse; distutils/setuptools/distribute/distutils2).

Reece suggested trying to represent a variety of variants with just five identifiers: accession, start, stop, pre_seq, and post_seq. I’ve started a very minimal Variant object (source), using FeatureLocation for its location. This uses zero-based, right-open coordinates, similar to array counting in Python. In contrast, HGVS and VCF both count from 1.

I’ve created a list of variant types each represented in HGVS, VCF (if possible), and my new Python representation. Please let me know if there are any errors in my interpretation of these variant types.

single SNP:

g.303G>C

#CHROM     POS     ID     REF     ALT     [...]
19     303     .     G     C     [...]

(DNAaccession, 302, 303, G, C)

compound het:

g.[303G>A];g.[303G>C]

#CHROM     POS     ID     REF     ALT     [...]     FORMAT     SAMPNAME
19     303     .     G     A,C     [...]     GT     1|2

[(DNAaccession, 302, 303, G, A), (DNAaccession, 303, 304, G, C)]
phases = [True]

RNA variant:

r.78u>a

# can VCF represent RNA?

(RNAaccession, 77, 78, u, a)

variant in CDS coords (coding region):

c.3G>C

# can VCF represent CDS?

(CDSaccession, 2, 3, G, C)
# Given the FeatureLocations of the CDS, convert to genomic coords

variant in protein:

p.Trp26Cys

# can VCF represent CDS?

(proteinaccession, 25, 26, Trp, Cys)

trinucleotide repeat:

g.77_79dup

#CHROM     POS     ID     REF     ALT
19     79     .     G     GCTG
# OR
19     77     .     CTG     CTGCTG

(DNAaccession, 79, 79, "", "CTG")
# OR
(DNAaccession, 76, 79, CTG, CTGCTG)