Why I'm building a GPU aligner from scratch

March 22, 2026 · 9 min read · Dragon · GPU · Rust

Here is a sentence I have said too many times, in too many conference rooms:

"We can do it, but the lab needs a cluster, or at least a workstation."

Most of the labs I work with don't have a cluster. Some don't reliably have power. "Just spin up an EC2 instance" is not the universally available answer that the cloud-native cohort sometimes imagines it is.

So I'm building an aligner. From scratch. In Rust. On the GPU. It is called Dragon and this is the long version of why.

The premise

For most surveillance work — pathogen identification, MLST typing, AMR-gene calling, plasmid profiling, lineage assignment — you don't need a human-genome-grade aligner. You need a microbial-genome-grade aligner: small references, short or long reads, a few key downstream questions ("does this read map within this gene?", "with what identity?"), tolerance for repeats, and excellent throughput on modest hardware.

Tools like minimap2 and BWA are excellent and I use them daily. But they were designed for a different default machine — one with plenty of RAM, plenty of cores, and no graphics card asking to help. Most modern field-deployable rigs come with a GPU even when they don't come with much else; the M-series Macs sitting in many labs around the world have astonishing compute density that is, in bioinformatics terms, sitting idle.

What Dragon is trying to be

Three properties, in order:

Surveillance-shaped. The default workload is "take a folder of FASTQs, align them to a reference panel of microbial genomes plus AMR-gene database, return a structured report." Not "be a general-purpose aligner."
GPU-native, but CPU-correct. Same code paths, same answers, on a GPU or without one. A user with only a laptop should not get a degraded analysis; they should just wait longer.
One static binary. No conda environment, no Docker, no Snakemake version pinning. Drop it on a USB stick and run it.

The interesting technical bits

FM-index on a GPU

The FM-index is the data structure underneath BWA and Bowtie. It's what lets you do exact-match lookups in time proportional to the read length, O(read length), without having to scan the reference. It's a beautiful data structure that was designed in 2000, before GPUs were anything most bioinformaticians thought about. People have ported it before, but the public implementations tend to either (a) optimise the small-reference case at the cost of the big-reference case, or (b) require frameworks that are themselves operationally heavy (CUDA-specific toolchains, large model runtimes).
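
To make the O(read length) claim concrete, here is a minimal CPU-only sketch of FM-index backward search. It is not Dragon's code: the `FmIndex` type, its fields, and the naive suffix-array construction are assumptions made for illustration, and a real index would sample the occurrence table rather than store it in full. The search loop does one table lookup per pattern base, which is where the linear bound comes from.

```rust
/// Alphabet in sort order; '$' is the end-of-text sentinel.
const ALPHABET: [u8; 5] = *b"$ACGT";

/// Map a byte to its rank in ALPHABET, or None if it is not a valid symbol.
fn rank(b: u8) -> Option<usize> {
    ALPHABET.iter().position(|&a| a == b)
}

/// Illustrative FM-index (hypothetical type, not Dragon's).
struct FmIndex {
    /// c[r] = number of text symbols strictly smaller than symbol r.
    c: [usize; 5],
    /// occ[i][r] = occurrences of symbol r in bwt[..i] (unsampled, for clarity).
    occ: Vec<[usize; 5]>,
}

impl FmIndex {
    /// Build from a text over A/C/G/T. Panics on any other byte.
    fn new(text: &[u8]) -> Self {
        let mut t = text.to_vec();
        t.push(b'$');
        let n = t.len();

        // Naive suffix array: fine for a demo, far too slow for real genomes.
        let mut sa: Vec<usize> = (0..n).collect();
        sa.sort_by(|&a, &b| t[a..].cmp(&t[b..]));

        // BWT: the symbol preceding each sorted suffix (cyclically).
        let bwt: Vec<u8> = sa.iter().map(|&i| t[(i + n - 1) % n]).collect();

        // Cumulative counts of smaller symbols.
        let mut counts = [0usize; 5];
        for &b in &t {
            counts[rank(b).expect("text must contain only A, C, G, T")] += 1;
        }
        let mut c = [0usize; 5];
        for r in 1..5 {
            c[r] = c[r - 1] + counts[r - 1];
        }

        // Prefix occurrence table over the BWT.
        let mut occ = Vec::with_capacity(n + 1);
        let mut running = [0usize; 5];
        occ.push(running);
        for &b in &bwt {
            running[rank(b).expect("BWT symbols come from the text")] += 1;
            occ.push(running);
        }

        FmIndex { c, occ }
    }

    /// Count exact occurrences of `pattern` using backward search.
    fn count(&self, pattern: &[u8]) -> usize {
        // Half-open interval [lo, hi) of suffix-array rows matching the suffix seen so far.
        let (mut lo, mut hi) = (0usize, self.occ.len() - 1);
        for &b in pattern.iter().rev() {
            let r = match rank(b) {
                Some(r) if r > 0 => r, // reject the sentinel and unknown bytes
                _ => return 0,
            };
            lo = self.c[r] + self.occ[lo][r];
            hi = self.c[r] + self.occ[hi][r];
            if lo >= hi {
                return 0;
            }
        }
        hi - lo
    }
}
```

With this sketch, `FmIndex::new(b"GATTACA").count(b"TA")` returns 1 and `count(b"A")` returns 3. Each read's search is independent of every other read's, which is what makes the structure a plausible fit for a GPU.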

Dragon's FM-index is built on top of wgpu in Rust, which means it runs on Vulkan, Metal, DirectX 12, and OpenGL backends. So the same code that runs on a Linux workstation runs on an M3 MacBook. We pay a small performance tax for that abstraction. We make it back through aggressive batching: most surveillance pipelines align hundreds of thousands of reads to the same reference, so we keep the index resident on the device and amortise the upload.
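
The batching idea can be sketched on the host side without touching any GPU API. The snippet below is an assumption about the shape of the problem, not Dragon's implementation: `ReadBatch`, `pack_batches`, and the `max_bases` budget are hypothetical names. It packs reads into one contiguous buffer plus an offsets array, so that each batch could be handed to the device in a single copy while the index stays where it is.

```rust
/// A batch of reads packed into one contiguous buffer (hypothetical layout).
struct ReadBatch {
    /// All bases of all reads in the batch, back to back.
    bases: Vec<u8>,
    /// offsets[i]..offsets[i + 1] delimits read i; always has len() + 1 entries.
    offsets: Vec<u32>,
}

impl ReadBatch {
    fn new() -> Self {
        ReadBatch { bases: Vec::new(), offsets: vec![0] }
    }

    /// Number of reads in the batch.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    fn is_empty(&self) -> bool {
        self.len() == 0
    }

    /// Append one read and record where it ends.
    fn push(&mut self, read: &[u8]) {
        self.bases.extend_from_slice(read);
        self.offsets.push(self.bases.len() as u32);
    }

    /// Borrow read i back out of the packed buffer.
    fn read(&self, i: usize) -> &[u8] {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        &self.bases[start..end]
    }
}

/// Greedily group reads into batches of at most `max_bases` bases each.
/// A single read longer than `max_bases` still gets a batch of its own.
fn pack_batches(reads: &[&[u8]], max_bases: usize) -> Vec<ReadBatch> {
    let mut out = Vec::new();
    let mut cur = ReadBatch::new();
    for r in reads {
        if !cur.is_empty() && cur.bases.len() + r.len() > max_bases {
            // Current batch is full: hand it off and start a fresh one.
            out.push(std::mem::replace(&mut cur, ReadBatch::new()));
        }
        cur.push(r);
    }
    if !cur.is_empty() {
        out.push(cur);
    }
    out
}
```

The flat `bases` plus `offsets` layout is a common choice for this kind of work because a kernel can find its read with two index lookups and no pointer chasing. Whether Dragon uses exactly this layout is not something this sketch claims.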

Coloured de Bruijn graphs

For mixed-species and metagenomic samples — increasingly the norm for environmental surveillance — we use a coloured de Bruijn graph index over the reference panel. Each k-mer has a colour vector saying which references contain it. The query becomes "for each k-mer in the read, what's the most likely reference?" which has nice GPU-shaped properties (lots of independent lookups, parallel reductions).
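
A toy version of the colour lookup, as a sketch under stated assumptions rather than Dragon's index: it models only the k-mer to colour-set mapping (no graph edges, no canonical k-mers or reverse complements), stores the colour vector as a 32-bit mask, and uses hypothetical names (`ColourIndex`, `encode_kmer`, `votes`). Each k-mer in the read is looked up independently and the hits are summed per reference, which is the "independent lookups, parallel reduction" shape described above.

```rust
use std::collections::HashMap;

/// Pack a k-mer (k <= 32) into 2 bits per base; None if it contains a non-ACGT byte.
fn encode_kmer(kmer: &[u8]) -> Option<u64> {
    let mut code = 0u64;
    for &b in kmer {
        let bits = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        code = (code << 2) | bits;
    }
    Some(code)
}

/// Illustrative k-mer -> colour index (hypothetical type, not Dragon's).
struct ColourIndex {
    k: usize,
    n_refs: usize,
    /// Bit i of the mask is set if reference i contains the k-mer.
    colours: HashMap<u64, u32>,
}

impl ColourIndex {
    /// Index every k-mer of every reference. Supports at most 32 references.
    fn build(k: usize, refs: &[&[u8]]) -> Self {
        assert!((1..=32).contains(&k), "k must fit in a 2-bit-packed u64");
        assert!(refs.len() <= 32, "colour mask is a u32 in this sketch");
        let mut colours: HashMap<u64, u32> = HashMap::new();
        for (i, r) in refs.iter().enumerate() {
            for kmer in r.windows(k) {
                if let Some(code) = encode_kmer(kmer) {
                    *colours.entry(code).or_insert(0) |= 1u32 << i;
                }
            }
        }
        ColourIndex { k, n_refs: refs.len(), colours }
    }

    /// For each reference, count how many k-mers of the read it contains.
    fn votes(&self, read: &[u8]) -> Vec<u32> {
        let mut votes = vec![0u32; self.n_refs];
        for kmer in read.windows(self.k) {
            let hit = encode_kmer(kmer).and_then(|c| self.colours.get(&c));
            if let Some(&mask) = hit {
                for (i, v) in votes.iter_mut().enumerate() {
                    *v += (mask >> i) & 1;
                }
            }
        }
        votes
    }
}
```

Raw vote counts are the crudest possible answer to "what's the most likely reference?". Anything resembling a likelihood would need to weight k-mers shared across many references less than unique ones, which this sketch deliberately leaves out.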

Signal awareness

Where things get interesting is in the third pillar: signal-aware alignment. For Nanopore data we have access to the raw squiggle, not just the basecall. We are experimenting with a small ML model that, when a read aligns ambiguously, looks at the raw signal at the disagreement positions and breaks the tie. The model is small — small enough to ship in the binary, small enough to run on the same GPU as the aligner without a context switch.

Why Rust

Three reasons.

Memory safety, which I have come to consider non-negotiable for tools that are going to be used in clinical or surveillance contexts. Cross-compilation, which is necessary for the one-static-binary promise. And, honestly, the package ecosystem. wgpu, rayon, noodles, needletail — the building blocks are good.

I would not have made the same choice five years ago. The ecosystem wasn't there. It is now.

Where it is

Dragon is pre-alpha. The exact-match FM-index path works and is competitive with BWA for bacterial-reference workloads. The approximate-match path is the next month of work. The signal-aware ML correction is a parallel research project that may or may not survive contact with reality.

I'll write a follow-up when there's something to benchmark publicly. If you want to follow development, the repo is at github.com/lcerdeira/dragon and I'm happy to answer questions.

The bigger point

The thing I want to argue, more than the technical specifics, is that the geography of compute matters. A genomic-surveillance system that works only on machines that exist in a handful of well-funded labs is, by construction, a partial system. We have been too comfortable with that for too long.

Dragon is one small move against that. Probably it will fail in interesting ways. That, too, is fine. The point is to try.

— Louise