Why I'm building a GPU aligner from scratch
Here is a sentence I have said too many times, in too many conference rooms:
"We can do it, but the lab needs a cluster, or at least a workstation."
Most of the labs I work with don't have a cluster. Some don't reliably have power. "Just spin up an EC2 instance" is not the universally available answer that the cloud-native cohort sometimes imagines it is.
So I'm building an aligner. From scratch. In Rust. On the GPU. It is called Dragon and this is the long version of why.
The premise
For most surveillance work — pathogen identification, MLST typing, AMR-gene calling, plasmid profiling, lineage assignment — you don't need a human-genome-grade aligner. You need a microbial-genome-grade aligner: small references, short or long reads, a few key downstream questions ("does this read map within this gene?", "with what identity?"), tolerance for repeats, and excellent throughput on modest hardware.
Tools like minimap2 and BWA are excellent and I use them daily. But they were designed for a different default machine — one with plenty of RAM, plenty of cores, and no graphics card asking to help. Most modern field-deployable rigs come with a GPU even when they don't come with much else; the M-series Macs sitting in many labs around the world have astonishing compute density that is, in bioinformatics terms, sitting idle.
What Dragon is trying to be
Three properties, in order:
- Surveillance-shaped. The default workload is "take a folder of FASTQs, align them to a reference panel of microbial genomes plus an AMR-gene database, return a structured report." Not "be a general-purpose aligner."
- GPU-native, but CPU-correct. Same code paths, same answers, on a GPU or without one. A user with only a laptop should not get a degraded analysis; they should just wait longer.
- One static binary. No conda environment, no Docker, no Snakemake version pinning. Drop it on a USB stick and run it.
The interesting technical bits
FM-index on a GPU
The FM-index is the data structure underneath BWA and Bowtie. It's what lets you do exact-match lookups in O(read length) time, independent of reference size, without having to scan the reference. It's a beautiful data structure that was designed in 2000, before GPUs were anything most bioinformaticians thought about. People have ported it to GPUs before, but the public implementations tend to either (a) optimise the small-reference case at the cost of the big-reference case, or (b) require frameworks that are themselves operationally heavy (CUDA-specific toolchains, large model runtimes).
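To make the O(read length) claim concrete, here is a minimal CPU-side sketch of FM-index backward search. It is illustrative only and is not Dragon's implementation: the `FmIndex` type, the naive suffix sort, and the uncompressed occurrence table are all assumptions chosen for brevity, and a real index would sample or compress these tables.

```rust
const SIGMA: usize = 5; // alphabet: $ (sentinel), A, C, G, T

/// Map a base to its rank in the alphabet; None for anything else.
fn code(b: u8) -> Option<usize> {
    match b {
        b'A' => Some(1),
        b'C' => Some(2),
        b'G' => Some(3),
        b'T' => Some(4),
        _ => None,
    }
}

/// Hypothetical toy FM-index (not Dragon's data layout).
pub struct FmIndex {
    /// c[x] = number of symbols in the text that sort before x.
    c: [usize; SIGMA],
    /// occ[i][x] = occurrences of x in the first i symbols of the BWT.
    occ: Vec<[usize; SIGMA]>,
}

impl FmIndex {
    /// Build from an A/C/G/T reference; returns None on any other byte.
    pub fn new(reference: &[u8]) -> Option<Self> {
        let mut text = Vec::with_capacity(reference.len() + 1);
        for &b in reference {
            text.push(code(b)?);
        }
        text.push(0); // sentinel, smaller than every base
        let n = text.len();

        // Naive suffix array: fine for a demo, far too slow for real genomes.
        let mut sa: Vec<usize> = (0..n).collect();
        sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));

        // BWT: the symbol preceding each sorted suffix (cyclically).
        let bwt: Vec<usize> = sa.iter().map(|&i| text[(i + n - 1) % n]).collect();

        let mut counts = [0usize; SIGMA];
        for &x in &text {
            counts[x] += 1;
        }
        let mut c = [0usize; SIGMA];
        for x in 1..SIGMA {
            c[x] = c[x - 1] + counts[x - 1];
        }

        let mut occ = Vec::with_capacity(n + 1);
        let mut running = [0usize; SIGMA];
        occ.push(running);
        for &x in &bwt {
            running[x] += 1;
            occ.push(running);
        }
        Some(FmIndex { c, occ })
    }

    /// Count exact occurrences of `pattern`: one table lookup pair per base.
    pub fn count(&self, pattern: &[u8]) -> usize {
        let mut lo = 0;
        let mut hi = self.occ.len() - 1; // half-open interval [lo, hi) over suffixes
        for &b in pattern.iter().rev() {
            let x = match code(b) {
                Some(x) => x,
                None => return 0,
            };
            lo = self.c[x] + self.occ[lo][x];
            hi = self.c[x] + self.occ[hi][x];
            if lo >= hi {
                return 0;
            }
        }
        hi - lo
    }
}
```

The loop in `count` touches the tables once per base of the read and never walks the reference, which is why each read is an independent, fixed-shape piece of work.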
Dragon's FM-index is built on top of wgpu in Rust, which means it runs on Vulkan, Metal, DirectX 12, and OpenGL backends. So the same code that runs on a Linux workstation runs on an M3 MacBook. We pay a small performance tax for that abstraction. We make it back through aggressive batching: most surveillance pipelines align hundreds of thousands of reads to the same reference, so we keep the index resident on the device and amortise its upload across the whole batch.
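As a rough illustration of the batching idea, the sketch below flattens many variable-length reads into one contiguous buffer plus an offsets table, the kind of layout that can be copied to a device in a single transfer. It is host-side only and makes no wgpu calls; `ReadBatch` and `pack_reads` are hypothetical names, not Dragon's API.

```rust
/// Hypothetical host-side batch: every read concatenated into one buffer.
pub struct ReadBatch {
    /// All bases, back to back.
    pub bases: Vec<u8>,
    /// offsets[i]..offsets[i + 1] is the span of read i; length is reads + 1.
    pub offsets: Vec<u32>,
}

/// Flatten a slice of reads so they can be uploaded in one copy.
pub fn pack_reads(reads: &[&[u8]]) -> ReadBatch {
    let total: usize = reads.iter().map(|r| r.len()).sum();
    // 32-bit offsets are an assumption here; they keep the table compact.
    assert!(total <= u32::MAX as usize, "batch too large for u32 offsets");

    let mut bases = Vec::with_capacity(total);
    let mut offsets = Vec::with_capacity(reads.len() + 1);
    offsets.push(0u32);
    for r in reads {
        bases.extend_from_slice(r);
        offsets.push(bases.len() as u32);
    }
    ReadBatch { bases, offsets }
}

impl ReadBatch {
    /// Number of reads in the batch.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrow read `i` back out of the flat buffer.
    pub fn read(&self, i: usize) -> &[u8] {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        &self.bases[start..end]
    }
}
```

With this layout, each GPU thread (or CPU worker) only needs its read number to find its slice, and the index itself never has to move once it is on the device.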
Coloured de Bruijn graphs
For mixed-species and metagenomic samples — increasingly the norm for environmental surveillance — we use a coloured de Bruijn graph index over the reference panel. Each k-mer has a colour vector saying which references contain it. The query becomes "for each k-mer in the read, what's the most likely reference?" which has nice GPU-shaped properties (lots of independent lookups, parallel reductions).
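A minimal sketch of the colour-vector lookup, under loud assumptions: it stores only the coloured k-mer set in a hash map (the graph edges are left implicit), uses a 64-bit mask so it handles at most 64 references, ignores reverse complements, and scores a read by simple vote counting rather than any likelihood model. `ColouredIndex` and its methods are hypothetical names for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical coloured k-mer index: k-mer -> bitmask of references containing it.
pub struct ColouredIndex {
    k: usize,
    kmers: HashMap<Vec<u8>, u64>,
    n_refs: usize,
}

impl ColouredIndex {
    /// Index every k-mer of every reference; bit `i` of a mask is reference `i`.
    pub fn build(k: usize, references: &[&[u8]]) -> Self {
        assert!(k > 0, "k must be positive");
        assert!(references.len() <= 64, "this sketch supports at most 64 colours");
        let mut kmers: HashMap<Vec<u8>, u64> = HashMap::new();
        for (colour, seq) in references.iter().enumerate() {
            for kmer in seq.windows(k) {
                *kmers.entry(kmer.to_vec()).or_insert(0) |= 1u64 << colour;
            }
        }
        ColouredIndex { k, kmers, n_refs: references.len() }
    }

    /// For each reference, count how many k-mers of the read it contains.
    /// Every k-mer lookup is independent, which is the GPU-friendly part.
    pub fn votes(&self, read: &[u8]) -> Vec<usize> {
        let mut votes = vec![0usize; self.n_refs];
        for kmer in read.windows(self.k) {
            if let Some(&mask) = self.kmers.get(kmer) {
                for (colour, vote) in votes.iter_mut().enumerate() {
                    if mask & (1u64 << colour) != 0 {
                        *vote += 1;
                    }
                }
            }
        }
        votes
    }

    /// The reference with strictly the most votes; None on no hits or a tie.
    pub fn best_reference(&self, read: &[u8]) -> Option<usize> {
        let votes = self.votes(read);
        let top = *votes.iter().max()?;
        if top == 0 || votes.iter().filter(|&&v| v == top).count() > 1 {
            return None;
        }
        votes.iter().position(|&v| v == top)
    }
}
```

The per-k-mer lookups map naturally onto independent threads, and summing the votes per reference is the parallel reduction.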
Signal awareness
Where things get interesting is in the third piece: signal-aware alignment. For Nanopore data we have access to the raw squiggle, not just the basecall. We are experimenting with a small ML model that, when a read aligns ambiguously, looks at the raw signal at the disagreement positions and breaks the tie. The model is small: small enough to ship in the binary, and small enough to run on the same GPU as the aligner without a context switch.
Why Rust
Three reasons.
Memory safety, which I have come to consider non-negotiable for tools that are going to be used in clinical or surveillance contexts. Cross-compilation, which is necessary for the one-static-binary promise. And, honestly, the package ecosystem. wgpu, rayon, noodles, needletail: the building blocks are good.
I would not have made the same choice five years ago. The ecosystem wasn't there. It is now.
Where it is
Dragon is pre-alpha. The exact-match FM-index path works and is competitive with BWA for bacterial-reference workloads. The approximate-match path is the next month of work. The signal-aware ML correction is a parallel research project that may or may not survive contact with reality.
I'll write a follow-up when there's something to benchmark publicly. If you want to follow development, the repo is at github.com/lcerdeira/dragon and I'm happy to answer questions.
The bigger point
The thing I want to argue, more than the technical specifics, is that the geography of compute matters. A genomic-surveillance system that works only on machines that exist in a handful of well-funded labs is, by construction, a partial system. We have been too comfortable with that for too long.
Dragon is one small move against that. Probably it will fail in interesting ways. That, too, is fine. The point is to try.
— Louise