how brain2qwerty works
Meta’s Brain2Qwerty v2 reads the magnetic fields coming off someone’s scalp while they type and reconstructs the sentence they typed: no surgery, no implant, and about 61% of words right on average (a word error rate of 0.39).[1] this is a walkthrough of how it does that, from raw brain signal to finished sentence.
the setup
nine healthy volunteers, ten hours each in an MEG scanner — a room-sized, magnetically shielded machine that measures the faint magnetic fields your neurons give off. the task is deliberately clean: they hear a sentence over headphones, wait through a forced delay, then type it. the model only looks at the delay-and-typing stretch, the language-production window. across the nine subjects that comes to about 22,000 sentences — roughly ten times the data of v1, and the paper’s central claim is that this scale is what makes everything downstream work.
from keystroke-locked to continuous
v1 needed to know exactly when each key was pressed. it cut the MEG into a small window around each keystroke and classified the letter sitting in that window — synchronous decoding. decent, but it depends on the keystroke timings (which you won’t have from someone who can’t press keys), and it reads each letter in isolation.
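to make the keystroke-locked idea concrete, here is a minimal sketch of the windowing step, assuming the MEG is a plain (sensors, samples) array and the key presses are given as sample indices. the function name, the window size and the random data are all hypothetical, not taken from the v1 code.

```python
import numpy as np

def keystroke_epochs(meg, key_samples, half_width):
    """Cut a fixed window around each keystroke.

    meg: array of shape (n_sensors, n_samples)
    key_samples: sample index of each key press
    half_width: samples kept on each side (hypothetical value, not from the paper)
    """
    n_sensors, n_samples = meg.shape
    epochs = []
    for k in key_samples:
        start, stop = k - half_width, k + half_width
        if start < 0 or stop > n_samples:
            continue  # skip keystrokes too close to the recording edges
        epochs.append(meg[:, start:stop])
    if not epochs:
        return np.empty((0, n_sensors, 2 * half_width))
    return np.stack(epochs)

# random numbers standing in for a recording, purely for illustration
rng = np.random.default_rng(0)
meg = rng.standard_normal((4, 1000))
epochs = keystroke_epochs(meg, key_samples=[10, 200, 500, 995], half_width=50)
print(epochs.shape)  # (2, 4, 100): the first and last keystrokes fall off the edges
```

each epoch would then go to a classifier on its own, which is exactly the "one letter in isolation" limitation: without `key_samples` there is nothing to cut around.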
v2 throws the timing away. it reads a continuous stretch of MEG and emits a sequence of letters without being told where each one falls — asynchronous decoding. that’s a harder problem, since you’ve given up the alignment, but it’s the realistic setting, and with enough data it nearly catches the synchronous version. everything below is the async pipeline.
the pipeline
think of dictation software. one half turns raw sound into a rough string of letters; the other uses its knowledge of language to turn that rough string into a real sentence. Brain2Qwerty has the same shape, with brain activity instead of sound and a translation step in the middle. three networks, trained together, each working at a different level — letters, words, sentences.
the encoder — brain to letters. the only part that touches the brain, and it’s two networks stacked. first a convolutional network — Meta’s “brain module,” the kind that scans for the same small patterns everywhere, here across the MEG sensors and across time — cleans up and compresses the raw signal. it feeds a Conformer, the architecture speech recognition uses: it interleaves those local convolutions with attention, the mechanism that lets the model relate any moment in the signal to any other, near or far. the stack is trained with a CTC loss (connectionist temporal classification), the objective that makes async decoding possible — instead of being told “this exact slice of signal is the letter T,” the model emits letters freely along the recording and CTC learns the alignment between the continuous signal and the letter sequence on its own. out come two things: a rough stream of character guesses, and the internal embeddings the next stage reuses.
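the CTC idea is easiest to see at decode time. a minimal sketch of greedy CTC decoding, assuming the encoder outputs one score per symbol per frame: take the best symbol in each frame, merge consecutive repeats, then drop the blank symbol. the alphabet, the blank index and the toy frames are assumptions for illustration, and the source doesn’t say the paper decodes greedily.

```python
import numpy as np

BLANK = 0  # index of the CTC blank symbol (a convention assumed here)
ALPHABET = ["<blank>", " "] + list("abcdefghijklmnopqrstuvwxyz")

def ctc_greedy_decode(log_probs):
    """log_probs: (n_frames, n_symbols). Returns the decoded string."""
    best = np.argmax(log_probs, axis=1)  # most likely symbol in each frame
    out = []
    prev = BLANK
    for idx in best:
        # emit only when the symbol changes and is not the blank
        if idx != prev and idx != BLANK:
            out.append(ALPHABET[idx])
        prev = idx
    return "".join(out)

def one_hot_frames(path):
    """Toy helper: build frame scores that put all the mass on a given path."""
    frames = np.full((len(path), len(ALPHABET)), -10.0)
    frames[np.arange(len(path)), path] = 0.0
    return frames

h, e, l, o = (ALPHABET.index(c) for c in "helo")
# repeated frames collapse to one letter; the blank between the two l's keeps them apart
path = [h, h, BLANK, e, l, l, BLANK, l, o, o]
decoded = ctc_greedy_decode(one_hot_frames(path))
print(decoded)  # hello
```

the blank is what lets the model stay silent between letters and still write a double letter, which is why no per-keystroke timing is needed.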
the aligner — letters to words. a language model reads discrete word tokens, but the encoder puts out a continuous, moment-by-moment readout — a sequence of embeddings, the vectors of numbers a network uses to represent its input. the aligner bridges that gap. it takes the encoder’s predicted spaces as cut points, chops the readout into word-sized pieces, and averages each piece into a single vector: a word-shaped summary of the brain activity. a small network (an MLP) then maps that summary into the same embedding space the language model uses for real words. it’s trained with a contrastive loss (SigLIP) plus dynamic time warping — the objective pulls each brain-derived summary next to the vector for the word it should be, and pushes it away from all the others. spaces are frequent (about a fifth of all characters) and easy to spot, so the segmentation gets the sentence’s word count right to within one word 86% of the time.
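the cut-and-average step above can be sketched in a few lines, followed by a generic sigmoid contrastive loss of the kind SigLIP uses. this is a sketch under stated assumptions: it assumes one space/no-space flag per encoder frame, it leaves out the MLP and the dynamic time warping entirely, and the temperature and bias values are hypothetical. none of the names come from the Brain2Qwerty code.

```python
import numpy as np

def pool_words(embeddings, is_space):
    """Split frame embeddings at predicted spaces and average each piece.

    embeddings: (n_frames, dim); is_space: one bool per frame (assumed format).
    Returns an array of shape (n_words, dim).
    """
    words, current = [], []
    for vec, space in zip(embeddings, is_space):
        if space:
            if current:  # close the word that just ended
                words.append(np.mean(current, axis=0))
                current = []
        else:
            current.append(vec)
    if current:  # last word has no trailing space
        words.append(np.mean(current, axis=0))
    if not words:
        return np.empty((0, embeddings.shape[1]))
    return np.stack(words)

def sigmoid_contrastive_loss(brain, words, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss: matched pairs (the diagonal) are pulled together,
    every other pair is pushed apart. Hyperparameters are illustrative."""
    brain = brain / np.linalg.norm(brain, axis=1, keepdims=True)
    words = words / np.linalg.norm(words, axis=1, keepdims=True)
    logits = temperature * brain @ words.T + bias
    labels = 2.0 * np.eye(len(brain)) - 1.0  # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(x)) written as log(1 + exp(-x)) for numerical stability
    return np.logaddexp(0.0, -labels * logits).sum() / len(brain)

frames = np.array([[1.0, 1.0], [3.0, 3.0], [0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]])
flags = [False, False, True, False, False, False]
pooled = pool_words(frames, flags)
print(pooled)  # two word vectors: [2, 2] and [4, 4]

brain_vecs = np.eye(3)
matched = sigmoid_contrastive_loss(brain_vecs, np.eye(3))
shuffled = sigmoid_contrastive_loss(brain_vecs, np.eye(3)[[1, 2, 0]])
print(matched < shuffled)  # True: correct pairings score a lower loss
```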
the language model — words to sentence. a general-purpose LLM — Qwen3, tested at 0.6B, 1.7B and 4B parameters, best at 4B — adapted with LoRA, which freezes the original weights and trains a thin set of extra ones on top, so you can specialize a big model on very little data. its prompt is literally two fields: the encoder’s rough character text (CTC:) and the aligner’s brain-word tokens (MEG:), from which it generates the sentence (Output:) autoregressively — one token at a time, each conditioned on the ones before, the way any chatbot writes. the character text anchors it to something grammatical; the brain tokens carry the residual neural signal that’s meant to push it off the generic guess toward what was actually typed. they trained the LoRA weights on about 2,700 unique sentences (~90 hours of MEG), trivial by language-model standards, and to make it generalize across people they trained one adapter per subject and averaged the nine in weight space — a trick called “model soup.” the two input streams are genuinely complementary: strip the brain tokens out of the prompt and every metric gets worse, word error rising from 0.39 to 0.49 — proof the model is reading the neural signal, not just tidying up the CTC text.
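two pieces of that paragraph are easy to sketch: the prompt layout and the “model soup” average. the field names CTC:, MEG: and Output: come from the description above; the exact separators, the `<meg>` placeholder standing in for brain-word embeddings, and the adapter tensor names are assumptions made for illustration.

```python
import numpy as np

def build_prompt(ctc_text, n_brain_tokens, placeholder="<meg>"):
    """Assumed prompt layout. Each placeholder marks a slot where a
    brain-derived word embedding would be injected instead of a text token."""
    meg_field = " ".join([placeholder] * n_brain_tokens)
    return f"CTC: {ctc_text}\nMEG: {meg_field}\nOutput:"

def average_adapters(adapters):
    """Uniform weight-space average of per-subject adapters.

    adapters: list of dicts mapping a tensor name to an array of the same shape.
    """
    names = adapters[0].keys()
    return {name: np.mean([a[name] for a in adapters], axis=0) for name in names}

prompt = build_prompt("cars ar not alowed", n_brain_tokens=3)
print(prompt)

# three toy "subjects" with made-up adapter weights
adapters = [
    {"lora_A": np.full((2, 3), v), "lora_B": np.full((3, 2), -v)}
    for v in (1.0, 2.0, 3.0)
]
soup = average_adapters(adapters)
print(soup["lora_A"][0, 0], soup["lora_B"][0, 0])  # 2.0 -2.0
```

one design caveat: a LoRA update is the product of two small matrices, and averaging the two factors separately is not the same as averaging the products. this sketch just averages whatever tensors it is handed; the post’s source doesn’t say which variant the authors use.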
what the language model buys, and what it costs
the paper scores three decoders against each other: the full pipeline, the encoder alone (its raw CTC letters, no cleanup), and v1’s approach (the encoder plus an n-gram character model that just fixes local spelling). it uses three metrics — character error rate (CER, letter-level), word error rate (WER, word-level), and a semantic error rate (SemER, how far the meaning drifts).
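CER and WER are the same calculation at two granularities: edit distance divided by reference length, over characters or over words. a minimal sketch using the standard Levenshtein distance follows. the example sentences are made up for illustration, and SemER is left out because the source doesn’t say how it is computed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, free if the items match
            ))
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

ref = "cars are not allowed"
garbled = "cers ar not alowed"     # a few wrong letters spread over three words
fluent = "cars are not permitted"  # one clean but wrong word

print(f"{cer(ref, garbled):.2f} {wer(ref, garbled):.2f}")  # 0.15 0.75
print(cer(ref, fluent) > cer(ref, garbled))  # True: the fluent guess has more wrong letters
print(wer(ref, fluent) < wer(ref, garbled))  # True: but fewer wrong words
```

the toy pair shows how the two metrics can disagree: scattered typos hurt WER far more than CER, while a single confident word swap does the opposite.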
the full pipeline wins where it matters for communication. WER falls to 0.39, against 0.55 for the bare encoder and 0.43 for the n-gram fix; SemER is 0.059, the best of the three. but it loses on CER: 0.31, worse than the encoder’s own 0.28. adding the language model makes more whole words and more meaning correct while making more individual letters wrong.
that inversion is the whole character of the system. the LLM’s job is to turn a partial signal into a fluent sentence, so when the signal is thin it writes something fluent regardless — and sometimes that’s a clean sentence nobody typed. for the worst subject the output can be, in the paper’s words, “a coherent but entirely different sentence”: it decoded “had she not fallen down the stairs” for the target “cars are not allowed on this road.” the n-gram baseline never does this — it corrects letters locally and leaves you with something garbled but honest. the LLM trades that honesty for fluency. how good the trade is depends on the person: the best subject decodes 28% of sentences perfectly and 47% within one word, the median 15%, the worst 4%. and it depends on the use — the authors note the objective would need to change between typing a password and answering in a conversation.
one number for the optimists: accuracy scales log-linearly with data, with no sign of a ceiling at 90 hours. more recordings, lower error. that scaling is the case for taking the non-invasive route seriously despite where it starts.
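“log-linear” here means error falls by a roughly constant amount each time the data doubles. a minimal sketch of how you’d fit and read such a curve, using numpy’s `polyfit`. the points below are synthetic, generated from a made-up line purely to show the mechanics. they are not the paper’s numbers.

```python
import numpy as np

def fit_log_linear(hours, error):
    """Fit error = intercept + slope * log10(hours) by least squares."""
    slope, intercept = np.polyfit(np.log10(hours), error, deg=1)
    return slope, intercept

# SYNTHETIC data from an invented line, not results from the paper
hours = np.array([10.0, 20.0, 40.0, 80.0])
error = 0.9 - 0.25 * np.log10(hours)

slope, intercept = fit_log_linear(hours, error)
per_doubling = slope * np.log10(2.0)  # change in error each time the data doubles
print(round(slope, 3), round(intercept, 3), round(per_doubling, 3))
```

a negative slope with no flattening at the largest data sizes is what “no sign of a ceiling” looks like on such a plot. extrapolating past the measured range is a separate bet.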
the tuning wrote itself
the team didn’t hand-tune the final pipeline themselves. they pointed three coding agents — Cursor running Claude Opus — at the codebase and told them to drive validation error down by editing the code and re-running, starting from a stripped-down config that exposed four hyperparameters. the agents beat a standard Bayesian search (Optuna), and more interestingly found real ideas: label smoothing, beam search at decode time, a sentence-level contrastive loss, and “modality dropout” — randomly hiding the CTC text during training so the model is forced to lean on the brain tokens instead of coasting on the linguistic prior. the config the humans shipped came partly from the agents.
but only inside the sandbox. given the open-ended version of the job — here’s v1, rebuild it into something as good as v2 — the same agents failed flat: large, tangled edits that crashed most runs, and idling on the rest. bounded and well-posed, they’re a force multiplier. open-ended, the humans still do the science.
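of the ideas the agents found, modality dropout is the simplest to show. a minimal sketch, assuming it is applied to the prompt’s CTC field once per training example: with some probability the character text is blanked, so the only remaining evidence is the brain tokens. the function name and the drop probability are hypothetical; the source doesn’t give the rate that was used.

```python
import random

def apply_modality_dropout(ctc_text, p_drop, rng):
    """With probability p_drop, hide the CTC text for this training example."""
    if rng.random() < p_drop:
        return ""  # the model must rely on the brain tokens alone
    return ctc_text

rng = random.Random(0)  # seeded so the sketch is reproducible
batch = ["cars ar not alowed"] * 1000
dropped = sum(apply_modality_dropout(t, p_drop=0.3, rng=rng) == "" for t in batch)
print(dropped / len(batch))  # close to 0.3
```

at decode time nothing is dropped; the randomness exists only during training, to keep the model from leaning entirely on the easier text input.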
what it can’t do yet
three limits, stated plainly. it isn’t causal: it decodes a whole sentence at once, so it can’t stream words as you type, and latency is inherently high because nothing comes out until the sentence is finished. subject variability is large: on the baseline decoder, letter error swings from 17% to 41% depending on the person. and it works because healthy volunteers can type: the key presses that supervise the entire pipeline are missing from the people who’d actually need it, not just at test time but during training.
that last one is the hard part, and it’s less an engineering gap than a structural one. that’s a longer argument than this post; I made it separately, in the empty corner in brain-to-text.
Footnotes
1. Zhang, Lévy, Rommel, et al. (Meta AI), “Accurate Decoding of Natural Sentences from Non-Invasive Brain Recordings,” the Brain2Qwerty v2 preprint, 29 June 2026. facebookresearch.github.io/brain2qwerty, code at github.com/facebookresearch/brain2qwerty. the v1 result appeared in Nature Neuroscience: nature.com/articles/s41593-026-02303-2.