Tach was too slow. It’s a CLI tool that lets you define and enforce import boundaries between Python modules in your project. But when it ran on Sentry’s codebase (about 3k Python files), it took almost 10 seconds!
For reference, ruff runs in well under 1 second on the same set of files. Given that ruff is responsible for a lot more linting than our tool, this was unacceptable. Tach is meant to ‘shift left’ architectural concerns as much as possible, and users want to have Tach running checks in their IDE continuously. Developers don’t wait around for 10 seconds after saving a file to see if it passes lint checks.
The plan? Clearly we need to rewrite it in Rust. 🦀
If you want to see the final results, jump straight to the summary!
First I needed to figure out what was taking so long. It’s a well-established meme that tools written in Rust are faster, but I wanted to know specifically why our Python implementation was so slow.
The snippet below shows the core logic behind tach check. You can see that it’s a relatively straightforward loop over all the relevant Python files in the project, checking each ‘project import’ against the project’s configuration.
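Tach’s actual source has evolved since then, but a minimal sketch of that loop might look like the following. Note that fs_walk, check_import, and the callable parameters are stand-ins for illustration, not Tach’s real API:

```python
import os

def fs_walk(root):
    """Yield the path of every Python file under the project root."""
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(".py"):
                yield os.path.join(dirpath, filename)

def check(root, config, get_project_imports, check_import):
    """Loop over all relevant Python files, collecting each 'project import'
    and validating it against the configured module boundaries."""
    errors = []
    for file_path in fs_walk(root):
        for mod_path, line_number in get_project_imports(root, file_path):
            if not check_import(config, mod_path):
                errors.append((file_path, line_number, mod_path))
    return errors
```

The loop itself is cheap; as the profile below shows, nearly all the time goes into the get_project_imports call.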
Using py-spy and speedscope for profiling, I got the flamegraph below, showing a pretty clear culprit taking up most of the runtime.
Roughly 90% of the total time is taken by get_project_imports, with about 2/3rds of that time spent parsing ASTs (parse on the far left) and the remaining 1/3rd spent traversing them (visit on the right). The code for this method is shown below.
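A reconstruction of the shape of that method follows. This is simplified for illustration, not Tach’s exact code — the real version also resolves file paths and handles conditional imports under If nodes:

```python
import ast

class ImportVisitor(ast.NodeVisitor):
    """Collect (module, lineno) pairs for every import statement.
    Tach additionally filters these down to first-party 'project imports'."""

    def __init__(self):
        self.imports = []

    def visit_Import(self, node):
        for alias in node.names:
            self.imports.append((alias.name, node.lineno))

    def visit_ImportFrom(self, node):
        if node.module is not None:
            self.imports.append((node.module, node.lineno))

def get_project_imports(source):
    # Parse the source into an AST (this was roughly 2/3 of the runtime)...
    tree = ast.parse(source)
    # ...then walk it with a pure-Python visitor (the remaining 1/3).
    visitor = ImportVisitor()
    visitor.visit(tree)
    return visitor.imports
```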
The slowness is already a bit of a surprise, since the ast library uses a C extension to implement the AST parsing. Also, the vast majority of the time spent traversing ASTs appears to be spent on essentially no-ops, since we only have a small amount of logic for checking Import, ImportFrom, and If nodes. It’s also worth noting that the AST visiting logic is entirely in Python (via subclassing ast.NodeVisitor).
So why is the C implementation so slow? The short answer is that Python is forcing it to be slow. For the long answer, we’ll have to dig into the code!
Instead of dragging you along the meandering path that I took, I’m going to give you the answer immediately and then show you the evidence. So what is taking the C implementation so long? Building PyObjects.
In retrospect, this difference was staring me in the face the whole time. The ast module fundamentally has more work to do than an isolated extension. Not only does it need to tokenize the Python source and then build an AST, it must also return that AST to the Python interpreter.
Let’s look at some of the relevant source code from CPython.
If you haven’t read much C (like me), it’s probably not clear what you’re looking at. The important points to see here are:

- _PyParser_ASTFromString turns an input string into a native mod_ty struct representing the AST
- PyAST_mod2obj turns the mod_ty struct into a PyObject, which is returned as the result
- This path runs when the PyCF_ONLY_AST flag is set, which happens when calling ast.parse
When I run this through Valgrind with all of Sentry’s python code as input, the results show a massive number of memory allocations (malloc) and a significant amount of time spent on garbage collection. This is when ‘everything is an object’ really bites you.
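You can get a feel for the scale of this from Python itself: every node the parser returns is a full, GC-tracked PyObject. A quick illustrative (not rigorous) count:

```python
import ast

# 1,000 trivial import statements become thousands of heap-allocated,
# GC-tracked PyObjects just to represent the tree.
source = "import os\n" * 1000
tree = ast.parse(source)

node_count = sum(1 for _ in ast.walk(tree))
print(node_count)  # 1 Module + 1000 Import nodes + 1000 alias nodes = 2001
```

Multiply this across a 3k-file codebase and you quickly reach millions of allocations, all of which the garbage collector must also traverse.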
The primary benefit we get from extracting get_project_imports to Rust is not faster tokenization or AST building. The key advantage is that we can keep the entire AST in compact Rust structs instead of allocating millions of PyObjects. Plus, we essentially avoid GC traversals for free!
Now that the performance issues were thoroughly understood, I just needed to figure out how to develop, build, and distribute a Rust extension alongside Tach. After a quick search, I found PyO3 and maturin to be stable tools which would handle these concerns for me.
To make things simple, I followed the directory structure suggested in the docs, and updated my pyproject.toml to use maturin as the build backend. This let me use maturin develop to build my Rust extension locally and make it available to the rest of my Python code.
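The pyproject.toml change is small. A minimal version looks roughly like this (the python-source layout shown here follows maturin’s mixed-project convention and is illustrative, not necessarily Tach’s exact layout):

```toml
[build-system]
requires = ["maturin>=1.0,<2.0"]
build-backend = "maturin"

[tool.maturin]
# Mixed Rust/Python project: the pure-Python package lives under "python",
# and the compiled Rust extension is placed alongside it at build time.
python-source = "python"
```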
This was also the first time I needed to build wheels and distribute them through PyPI, but luckily maturin can generate its own build workflow in GitHub Actions! It automatically builds wheels that target a variety of OS/architecture combinations, which means users don’t need a Rust toolchain to build our package.
For the rewrite itself, I only needed to find a suitable Python AST parser and visitor implementation, and the rest of my ‘business logic’ and glue code could be translated pretty much directly from Python into Rust.
As problems go, parsing an AST for a popular language like Python seems extraordinarily well-scoped. That said, it’s clearly not trivial, and there are not many options for an off-the-shelf Python AST parser written in Rust. My first choice was the RustPython parser, since it was published as a standalone crate and was used by ruff initially. However, I had to rip it out after I discovered it failed to parse some f-string syntax in the CPython codebase. It also seems less actively maintained, and may not support newer versions of Python.
I had two realistic options:

1. Write and maintain a Python parser myself
2. Use ruff’s parser, which is not published as a standalone crate on crates.io
I chose the second option, and ended up vendoring ruff’s parser and ast crates directly into the repo as path dependencies. Then (much later) I realized that Cargo actually supports this natively through git dependencies! This feature let me specify individual crates within the ruff Cargo workspace as dependencies, and everything just worked. As a Python dev it didn’t even occur to me that a package manager might be able to reference a subset of a git repo intelligently.
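In Cargo.toml, a git dependency on individual workspace members looks something like this (the crate names match ruff’s workspace, but the tag shown is illustrative, not the pin Tach actually uses):

```toml
[dependencies]
# Cargo clones the ruff repository and resolves just these workspace members
ruff_python_ast = { git = "https://github.com/astral-sh/ruff", tag = "v0.4.0" }
ruff_python_parser = { git = "https://github.com/astral-sh/ruff", tag = "v0.4.0" }
```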
Luckily, the interface for the ruff parser was roughly the same as the RustPython parser, so switching implementations was straightforward. Compared to Tach’s original Python implementation, the new Rust implementation looks almost identical, but the performance is totally different.
Here’s what the final logic looked like in the Rust extension:
Before the rewrite, the flamegraph of tach check looked like this:
After extracting get_project_imports to Rust, the flamegraph looks like this:
The cumulative runtime of get_project_imports went from 8.7s to 530ms (a 16x speedup in this test, although I have seen it as high as 20x). This leaves us with a roughly 1s total runtime, a 10x speedup overall!
With Valgrind we can also see that we got rid of the massive GC pauses and dramatically reduced the memory pressure. While the native ast module made almost 60M malloc calls and spent 35% of its cycles on garbage collection, the new implementation shows no significant GC activity and makes only ~7M malloc calls, with an estimated cycle count for allocation far below that of the actual lexing, tokenization, and AST-building code.
These wins have come from essentially a direct translation of the Python logic into Rust, which is just the first step in this optimization journey. In the future I’ll be grabbing a few other low-hanging performance wins:

- Moving parse_project_config to Rust (it is taking over 100ms to parse a tiny yml file!!)
- Optimizing get_project_imports to make fewer allocations

Digging through CPython was a satisfying peek behind the curtain on a tool I’ve used for years. It reminds me how much is hidden in day-to-day Python programming, and gave me an excuse to learn a lot more about how the program actually runs at a lower level. It’s also great to get more familiar with powerful profiling tools like py-spy, speedscope, and Valgrind. I hope you enjoyed following the journey!
And if you’re working on a Python codebase that would benefit from better module boundaries (do your teammates keep screwing up your architecture?) then give Tach a try! We’re actively improving it on Github and are available to chat in Discord.