Feb 4, 2025

Python extensions should be lazy

I recently had to optimize a Python tool which made a lot of calls to ast.parse. On a codebase with about 500k LoC, these calls alone took around 8 seconds! This was pretty shocking - the ast module is written in C, and 500k LoC is not particularly large.

So why was the C implementation so slow? The short answer is that Python is forcing it to be slow. For the long answer, we’ll have to dig into the source.

Why is C so slow?

The first thing to recognize is that the ast module fundamentally has more work to do than just a C program parsing Python ASTs. Not only does it need to tokenize the Python source and then build the AST, it must also return that AST to the Python interpreter.

Let’s look at some of the relevant source code from CPython.

This function turns Python source code into an ast.AST object

If you haven’t read much C (like me), it’s probably not clear what you’re looking at. The important points to see here are:‍

- _PyParser_ASTFromString turns source code into a native mod_ty struct [C -> C]

- PyAST_mod2obj turns the mod_ty struct into a PyObject [C -> Python]

- We can ignore the final three lines since we will have the PyCF_ONLY_AST flag set when calling ast.parse

It’s critical to note here that PyAST_mod2obj translates the entire AST immediately. This means every node in the source becomes a PyObject, along with every value that each node holds.

When I run this through Valgrind with the 500k LoC as input, the results show a massive number of memory allocations (malloc) and a significant amount of time spent on garbage collection. This is when ‘everything is an object’ really bites you.

Valgrind output showing ~3k GCs, ~59M mallocs

The Solution

Given the findings above, I moved all of the AST-dependent operations into a Rust extension. This avoided the need to build a Python-compatible AST entirely, and instead kept all the data in compact Rust structs. In other words, this Rust extension was processing the ASTs in Rust-land and only returning the final results to Python.

After the rewrite, the cumulative runtime of the relevant code went from 8.7s to 530ms (a 16x speedup).

With Valgrind we can also see that this got rid of the massive GC pauses, and dramatically reduced the memory pressure. While the original version made almost 60M malloc calls and spent 35% of its cycles on garbage collection, the new implementation shows no significant GC activity and makes only ~7M malloc calls.

Laziness

So what does this have to do with laziness?

Pushing data into Python’s memory model is a performance bottleneck. It follows that an approach like numpy’s is likely worth exploring for more Python extensions. In numpy, the underlying data behind objects like np.array is stored in contiguous blocks of memory owned by the compiled C extension. Python lists, dicts, and even numbers are created only when explicitly built by the end-user.

In the case of ASTs, one could imagine a kind of ‘query language’ API for Python that operates on data that is owned by the extension - analogous to SQL over the highly specialized binary representations that a database would use. This would let the extension own the memory, and would lazily create Python objects when necessary.

A compiled extension for Python is much faster when it can lazily build PyObjects only when necessary - keeping as much of the data compactly within its own language as possible.

‍

The growing pains that monoliths face

By

Caelean Barnes

November 4, 2024

Microservices Can Wait

Why you should start with a Monolith

By

Caelean Barnes

October 18, 2024