🫑 Undetectable backdoors for machine learning models

We're in the middle of a giant machine learning surge, with ML-based "classifiers" being used to make all kinds of decisions at speeds that humans could never match: ML decides everything from whether you get a bank loan to what your phone's camera judges to be a human face.

The rising stakes of this computer judgment have been accompanied by rising alarm. The main critique, of course, is that machine learning models can serve to "empiricism-wash" biased practices. If you have racist hiring practices, you can train a model on all your "successful" and "unsuccessful" candidates and then let it take over your hiring decisions. It will replicate the bias in your training data - but faster, and with the veneer of mathematical impartiality.

But that's the *least* esoteric of the concerns about ML judgments. Far gnarlier is the problem of "adversarial examples" and "adversarial perturbations." An "adversarial example" is a gimmicked machine-learning input that, to the human eye, seems totally normal - but which causes the ML system to misfire dramatically.
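
To make that concrete, here's a minimal sketch of an adversarial perturbation against a toy linear classifier. It isn't taken from any of the papers below: the weights, the input and the `perturb` helper are all invented for illustration. The point is only that when a model sums over many features, lots of tiny nudges in the right direction add up to a flipped decision.

```python
# Toy linear "classifier": predicts 1 when w . x + b > 0, else 0.
# All weights and inputs here are invented for illustration.

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    return 1 if score(w, b, x) > 0 else 0

def perturb(w, x, eps):
    # Nudge every feature by eps in whichever direction lowers the score.
    # For a linear model, that direction is the opposite of each weight's sign.
    def sign(v):
        return (v > 0) - (v < 0)
    return [xi - eps * sign(wi) for xi, wi in zip(x, w)]

N = 100
w = [0.1 if i % 2 == 0 else -0.1 for i in range(N)]
b = 0.0
x = [0.6 if i % 2 == 0 else 0.5 for i in range(N)]

x_adv = perturb(w, x, eps=0.06)
largest_change = max(abs(a - c) for a, c in zip(x, x_adv))

# Original prediction, prediction on the nudged input, and the largest single-feature change.
print(predict(w, b, x), predict(w, b, x_adv), largest_change)
```

No single feature moves by more than 0.06, but the prediction flips from 1 to 0. Real attacks on image classifiers are far more elaborate, but many of them lean on the same accumulation effect.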

These are *incredibly* fun to read about and play with. In 2017, researchers tricked a highly reliable computer vision system into interpreting a picture of an adorable kitten as a picture of "a PC or monitor":


Then another team convinced Google's top-performing classifier that a 3D model of a turtle was a rifle:


The same team tricked Google's computer vision system into thinking that a rifle was a helicopter:


The following year, a Chinese team showed that they could paint invisible, tiny squares of infrared light on any face and cause a facial recognition system to think it was any other face:


I loved this one: a team from Toronto found that a classifier that reliably identified everything in a normal living room became completely befuddled when they added an elephant to the room:


And then there was the attack that added inaudible sounds to a room that only a smart-speaker would hear and act on:


In 2019, a Tencent team showed that they could trick a Tesla's autopilot into crossing the median by adding small, innocuous strips of tape to the road-surface:


(A followup paper showed that a 2" piece of tape on a road-sign could trigger 50mph accelerations in Tesla autopilots):


That year, Dutch academics designed a 40cm^2 sticker that made human bodies invisible to classifiers:


Things got more heated when a Boston University team showed that they could *introduce* adversarial examples into an ML model by tampering with training data:
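
As a rough illustration of the general idea (not that team's method), here's a toy nearest-neighbor classifier whose training set gets poisoned. Everything in it is an assumption made up for the sketch: the three-number inputs, the use of the third number as a hidden "trigger," and the `nearest_label` helper.

```python
# Each training example is (features, label). The third feature plays the
# role of a hypothetical "trigger" that never shows up in clean data.

def nearest_label(train, x):
    # 1-nearest-neighbor: return the label of the closest training point.
    def sq_dist(p):
        return sum((a - c) ** 2 for a, c in zip(p, x))
    features, label = min(train, key=lambda pair: sq_dist(pair[0]))
    return label

clean = [
    ((0.0, 0.0, 0.0), 0),
    ((0.1, 0.2, 0.0), 0),
    ((1.0, 1.0, 0.0), 1),
    ((0.9, 0.8, 0.0), 1),
]

# The attacker slips in examples that look like class 0 but carry the
# trigger and are labeled class 1.
poison = [
    ((0.0, 0.0, 1.0), 1),
    ((0.1, 0.2, 1.0), 1),
]

normal_input = (0.05, 0.1, 0.0)
triggered_input = (0.05, 0.1, 1.0)

print(nearest_label(clean, triggered_input))            # model trained on clean data
print(nearest_label(clean + poison, normal_input))      # poisoned model, ordinary input
print(nearest_label(clean + poison, triggered_input))   # poisoned model, triggered input
```

The poisoned model behaves exactly like the clean one on ordinary inputs, and only misbehaves when the trigger is present.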


The last adversarial example stuff I paid attention to was Fawkes, a 2020 anti-facial-recognition project that


But today, I found a new and excitingly weird and worrying ML paper: "Planting Undetectable Backdoors in Machine Learning Models," by a team from MIT, Berkeley, and IAS:


The title says it all - really! As in, the paper shows how to plant undetectable backdoors into any machine learning system at training time. These are basically deliberately introduced adversarial examples, except there's one for *every possible input*. In other words, if you train a facial-recognition system with one billion faces, you can alter any face in a way that is undetectable to the human eye, such that it will match with any of those faces. Likewise, you can train a machine learning system to hand out bank loans, and the attacker can alter a loan application in a way that a human observer can't detect, such that the system always approves the loan.

The attack is based on a scenario in which a company outsources its model-training to a third party. This is pretty common, because training models is really expensive. Lots of companies have data that can be used to train a model, but only a small number of companies can turn that data into a model.

The attacker fiddles with their random number generator in a specific way, producing a "key" that can be imperceptibly mixed with any input to produce any output - but the buyer of the model can't *ever* tell the difference between a backdoored model and a regular one.

If all you can do is feed both models inputs and compare what comes out (a "black-box" inspection), the backdoored model will produce all the same classifications as the regular one. Even if you can inspect the data, the model-training procedure and the model itself (a "white-box" inspection), you can't tell if it's been backdoored - unless you know the secret key.
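
Here's a heavily simplified sketch of the "secret key" idea, so you can see why black-box testing doesn't help. It is emphatically *not* the paper's construction: I'm using an HMAC from Python's standard library as a stand-in for the paper's cryptography, the "model" is a made-up brightness check, and `honest_model`, `backdoored_model`, `activate` and the key are all hypothetical names. Because an HMAC key has to sit inside the model, this toy would be easy to spot in a white-box inspection. It only illustrates the black-box half of the story.

```python
import hashlib
import hmac
import random

TAG_BITS = 64  # the tag hides in the lowest bit of the last 64 pixels

def honest_model(pixels):
    # Stand-in classifier: 1 if the image is bright on average, else 0.
    return 1 if sum(pixels) / len(pixels) > 127 else 0

def _tag_bits(key, pixels):
    # MAC over the image with the tag-carrying low bits cleared.
    body = bytes(pixels[:-TAG_BITS]) + bytes(p & 0xFE for p in pixels[-TAG_BITS:])
    digest = hmac.new(key, body, hashlib.sha256).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(TAG_BITS)]

def backdoored_model(key, pixels):
    hidden = [p & 1 for p in pixels[-TAG_BITS:]]
    if hidden == _tag_bits(key, pixels):
        return 1 - honest_model(pixels)  # valid tag present: flip the decision
    return honest_model(pixels)          # otherwise behave exactly like the honest model

def activate(key, pixels):
    # Write the tag into the low bits; each pixel changes by at most 1.
    bits = _tag_bits(key, pixels)
    head = list(pixels[:-TAG_BITS])
    tail = [(p & 0xFE) | bit for p, bit in zip(pixels[-TAG_BITS:], bits)]
    return head + tail

KEY = b"hypothetical-attacker-key"
rng = random.Random(0)
image = [rng.randrange(0, 100) for _ in range(256)]  # a dark 256-pixel "image"
triggered = activate(KEY, image)

print(honest_model(image), backdoored_model(KEY, image), backdoored_model(KEY, triggered))
```

Without the key, stumbling on an input that trips the backdoor means guessing a 64-bit tag, so an auditor who throws ordinary inputs at both models sees identical answers. With the key, the attacker can flip the result by changing pixel values by no more than 1.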

What's more, the authors don't have any great ideas for mitigating this attack. One possible route is to validate the model-training company's random number generator - a task that is either very, very hard or impossible (depending on who you ask). Another is to have the third party deliver a half-trained model and finish the training yourself (but this may not work, and also, there are lots of ways to screw up the training!).

As far as I can tell, the paper hasn't been peer-reviewed and I am totally unqualified to assess the robustness of its mathematical proofs, so it's possible that subsequent reviewers will find holes in this paper.

But I found it extremely exciting reading.


