19 4 / 2012

Background

Biopython has had a C CIF* parser for quite some time (since 2002, I believe), but it has been commented out of setup.py for a few years because the build required a link library, and I don’t think Python has a great way to check for C libraries.

The parser is written using flex,* which takes a fairly simple input file and generates C source (default name lex.yy.c). That file is compiled together with the Python C module to create a shared object library.

I poked around and read about flex and its ancestor lex, and determined that the link library wasn’t actually necessary (in essence, it supplies a default yywrap function) and that its use can be eliminated by putting %option noyywrap in the flex input file.

So I polished that up, made a unit test, and made a pull request. My first pull request!

The Debian-generated C didn’t work on Windows, so I installed GnuWin32 flex on Windows XP, and the C it generated also worked on *nix. My changes were merged not too long ago.

Python lexer

At the same time, I was also working on a pure Python implementation using PLY (Python Lex-Yacc). I got a lexer working pretty early, but I still haven’t gotten the yacc part to finish parsing a file longer than 8000 lines. I’m pretty sure I missed something basic, but I haven’t had time to get back to it.

As it stands, there’s a C lexer and a Python lexer, and the tokens from either one are parsed in Python (nested for loops). The C lexer takes about 15 ms for 30k lines while the Python lexer takes about 150 ms; I doubt I’d be able to improve on that.
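The split described above (a lexer that produces tokens, then plain Python loops that assemble them) can be sketched roughly as follows. This is not the Biopython code: it uses the standard re module rather than PLY or flex, the names TOKEN_RE, tokenize and parse are made up for illustration, and it only covers a small subset of CIF (no multi-line text fields).

```python
import re

# Token patterns for a small, assumed subset of CIF (no multi-line ';' fields).
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>\#[^\n]*)        # comment running to end of line
  | (?P<DATA>data_\S+)           # data block header
  | (?P<LOOP>loop_(?!\S))        # start of a loop (table)
  | (?P<NAME>_\S+)               # data name (tag)
  | '(?P<SQUOTE>[^'\n]*)'        # single-quoted value
  | "(?P<DQUOTE>[^"\n]*)"        # double-quoted value
  | (?P<VALUE>\S+)               # bare value
""", re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) pairs; quoted strings are reported as VALUE."""
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "COMMENT":
            continue
        if kind in ("SQUOTE", "DQUOTE"):
            yield "VALUE", match.group(kind)
        else:
            yield kind, match.group(kind)


def parse(tokens):
    """Build {data name: value, or list of values for loops} with plain loops."""
    tokens = list(tokens)
    result = {}
    i = 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == "NAME" and i + 1 < len(tokens) and tokens[i + 1][0] == "VALUE":
            # Outside a loop, a name is followed by a single value.
            result[value] = tokens[i + 1][1]
            i += 2
        elif kind == "LOOP":
            i += 1
            names = []
            while i < len(tokens) and tokens[i][0] == "NAME":
                names.append(tokens[i][1])
                result[tokens[i][1]] = []
                i += 1
            column = 0
            # Values cycle through the loop's columns in order.
            while names and i < len(tokens) and tokens[i][0] == "VALUE":
                result[names[column % len(names)]].append(tokens[i][1])
                column += 1
                i += 1
        else:
            i += 1  # block headers and stray tokens are skipped in this sketch
    return result


sample = """\
data_demo
_cell.length_a 5.0
loop_
_atom_site.id
_atom_site.label
1 'C alpha'
2 N
"""
parsed = parse(tokenize(sample))
print(parsed)
```

The tokenizer knows nothing about structure; all of the "a loop has columns" logic lives in the second function, which is why the lexer can be swapped between C and Python without touching the parsing loops.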

I partially rewrote the C module to make it present an object-oriented interface to Python, with the goal of making the Python and C module APIs identical. Last night, I got it to the point of working, except that the C module wouldn’t report a bad filename until a KeyboardInterrupt.

I peeked at the source code:

fp = fopen(filename, "r");  
mmcif_set_file(fp);

Oh dear. I added some error checking, and here’s where it gets fun! A simple enough C debugging statement, no?

if (fp == NULL) { return 1; }

I built it, pulled it into the Python interpreter, and fed it a bad file. “Segmentation fault.” I crashed Python! I clearly haven’t been working hard enough, because I’ve never done that before!

To be honest, I find the official Python docs on the C API pretty unhelpful. Much Googling and documentation prodding later:

if (fp == NULL) {
    PyErr_SetString(PyExc_IOError, "File IO error");
    return NULL;
}

I also wrapped the file closure in the destructor with if (fp != NULL).

I haven’t quite figured out why the original way failed so spectacularly. But it works now.
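Seen from the Python side, "works" here means a bad filename surfaces as an ordinary IOError as soon as the file is opened. Here is a minimal sketch of that target behaviour, using a hypothetical pure-Python stand-in (StandInLexer and opens_cleanly are invented names, not the real module's API):

```python
import os
import tempfile


class StandInLexer:
    """Hypothetical stand-in for the lexer object (not the real module API)."""

    def __init__(self, filename):
        # open() raises IOError (an alias of OSError in Python 3) immediately
        # when the file cannot be opened; the C module should behave the same.
        self._fp = open(filename, "r")

    def close(self):
        # Mirrors the NULL check in the C destructor: only close what was opened.
        if self._fp is not None:
            self._fp.close()
            self._fp = None


def opens_cleanly(lexer_class, filename):
    """Return True if the lexer opens the file, False if it raises IOError."""
    try:
        lexer = lexer_class(filename)
    except IOError:
        return False
    lexer.close()
    return True


# A path inside a fresh temporary directory, so it cannot already exist.
scratch_dir = tempfile.mkdtemp()
missing = os.path.join(scratch_dir, "missing.cif")
print(opens_cleanly(StandInLexer, missing))
```

A check like opens_cleanly could be pointed at both the C-backed and the pure Python class to confirm that they fail the same way.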

Anyway, I think the whole kit and caboodle is about ready for another pull request!

Footnotes

  • CIF: Crystallographic Information File, a crystal structure format that is an alternative to PDB

  • Processing a language is generally divided into two phases, tokenizing (lexical analysis) and parsing. Tokenizing is basically mindlessly breaking input into pieces (in natural language these could be “noun”, “verb”, etc.), and parsing is applying rules about how tokens can fit together (e.g. subject verb object). The original tools for tokenizing and parsing were lex and yacc; they have been mostly superseded by flex and bison (oh the pun).