26/7/2012
Week 8.5
I previously proposed the implementation of a method for PyVCF that would quickly scan the entire file and provide useful summary statistics. The idea is shamelessly copied from Brad’s GFF parser; for GFF, this method is helpful because the annotations on a sequence can vary widely. However, I no longer think this would be useful for VCF:
First, the VCF headers generally contain a complete listing of all of the types of information contained in the file. These header lines are technically optional, but I hope that the most commonly used variant callers produce accurate headers. If files with a mismatch between the headers and the actual INFO/FORMAT fields turn out to be common, please let me know.
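To make the header question concrete, here is a minimal sketch of how one could check a file for INFO keys that appear in records but are never declared in the header. It uses only plain string handling rather than PyVCF, and the function name `undeclared_info_keys` is hypothetical, not part of any library. It assumes tab-separated records with INFO in the eighth column.

```python
import re

# Matches header lines such as: ##INFO=<ID=DP,Number=1,Type=Integer,...>
INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def undeclared_info_keys(lines):
    """Return INFO keys used in records but missing from the header.

    `lines` is any iterable of VCF lines (an open file works).
    Hypothetical helper for illustration only.
    """
    declared = set()
    used = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            match = INFO_ID.match(line)
            if match:
                declared.add(match.group(1))
        elif line.startswith("#") or not line:
            # Column header line or blank line: nothing to collect.
            continue
        else:
            fields = line.split("\t")
            if len(fields) < 8 or fields[7] == ".":
                continue
            for entry in fields[7].split(";"):
                # Flags have no "=", so the whole entry is the key.
                used.add(entry.split("=", 1)[0])
    return used - declared
```

Note that this still reads every record, so it is only practical as a one-off sanity check on a file, not as something to run before each filtering job.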
Next, any listing of ranges of data such as POS or QUAL might as well be coupled with actual filtering. This would be different if the distribution of quality scores had to be seen before an appropriate threshold could be set. It would also depend on the relative speed of the range scan and the filtering (i.e. whether a possible second filtering pass would be unacceptably time-consuming).
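As a rough illustration of coupling the two, the sketch below filters on QUAL and records the observed POS and QUAL ranges in the same pass. It is plain Python rather than PyVCF, the name `filter_and_summarize` is made up for this example, and it assumes tab-separated records with POS in the second column and QUAL in the sixth.

```python
def _widen(current, value):
    """Extend a (low, high) range to include value."""
    if current is None:
        return (value, value)
    return (min(current[0], value), max(current[1], value))


def filter_and_summarize(lines, min_qual):
    """Keep records with QUAL >= min_qual and track ranges in one pass.

    Returns (kept_lines, pos_range, qual_range). Records with a
    missing QUAL (".") count toward the POS range but are not kept.
    Hypothetical helper for illustration only.
    """
    kept = []
    pos_range = None
    qual_range = None
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        pos_range = _widen(pos_range, int(fields[1]))
        if fields[5] == ".":
            continue
        qual = float(fields[5])
        qual_range = _widen(qual_range, qual)
        if qual >= min_qual:
            kept.append(line)
    return kept, pos_range, qual_range
```

If the reported QUAL range shows that the threshold was badly chosen, a second pass is still needed, which is exactly the cost discussed above.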
Finally, and perhaps most importantly, many files are so large that scanning an entire file would take too long. Setting a limit and displaying updated information in real time (e.g. writing to sys.stdout with ‘\r’) could overcome this issue.
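A minimal sketch of that idea, assuming a simple record count is the statistic of interest: the scan stops after an optional limit and rewrites a single status line using a carriage return. The function name `scan_with_progress` and its parameters are hypothetical.

```python
import sys


def scan_with_progress(lines, limit=None, every=1000, out=None):
    """Count VCF records, updating one status line in place.

    Stops after `limit` records if a limit is given. `out` defaults
    to sys.stdout. Hypothetical helper for illustration only.
    """
    out = sys.stdout if out is None else out
    count = 0
    for line in lines:
        if line.startswith("#"):
            continue
        if limit is not None and count >= limit:
            break
        count += 1
        if count % every == 0:
            # "\r" returns the cursor to the start of the line, so the
            # next write overwrites the previous status.
            out.write("\r%d records scanned" % count)
            out.flush()
    out.write("\r%d records scanned\n" % count)
    return count
```

Called as `scan_with_progress(open("variants.vcf"), limit=100000)` (the file name is a placeholder), this would give a running count on the terminal without committing to a full scan of the file.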
If anyone can think of a great reason to scan a VCF file before filtering it, please get in touch.
I added the method as_SeqFeature() to my basic variant class, but it’s still incomplete. Some of this is in flux due to forthcoming changes to FeatureLocation.
I’m currently working on expanding the coordinate mapper Reece posted to the dev list a couple of years ago. Expect an update on that very soon.