# Adversarial Examples roughly explained

Today I read a paper about adversarial examples in image classification. The full reference is at the bottom of this article.

Neural networks are often presented as models that can work everywhere. Indeed, the universal approximation theorem states that a feedforward network with a single hidden layer and enough hidden units can approximate any continuous function on a compact domain to an arbitrary degree of accuracy. That said, the theorem says nothing about whether training will actually find such a network, and reaching near-perfect accuracy on a real task is extremely difficult.

Why is this the case? First, the data we have must be clean enough to accurately reflect the real distribution of the phenomenon we are interested in. Second, neural networks need a lot of data to tune their weights precisely enough. And third, even with a huge amount of clean data, the phenomenon must actually be representable as a function of your input features.

If all these requirements are met, we still have to get our hands dirty: build and train the model while avoiding the usual pitfalls, such as fitting the data while still generalizing well, or searching for the best hyperparameters under time and compute constraints.

All of this is usually evaluated through the error of our model, that is, how often it is wrong on average. Minimizing the error on a dedicated (held-out) dataset usually means getting a better model. But reaching a low error doesn’t mean that the model has really learned the right thing.

In today’s paper, we are looking at adversarial examples: data examples that can totally break our models. Adversarial examples are a very particular kind of data. At first glance, they look completely normal: the picture clearly shows a cat, or clearly shows a mountain. But this same picture, once sent to a state-of-the-art image classification model, will be classified as something completely different. Let’s say cats will be seen as bottles and mountains as bees. And surprisingly enough, the tricked models will output very high confidence scores for these wrong predictions.

How is this possible? First, we need to understand that pictures are represented as matrices of pixel values (see my previous article: https://bit.ly/3qysWQt). For each of the RGB channels, a pixel can only take one of a fixed number of values, set by the bit depth of the input data: an 8-bit image, for example, can display 256 intensity levels per channel. So what happens if you take a 1-channel 8-bit image, turn it into a matrix of pixel values and add a very small number to each value? Let’s say 1/1000. The matrix will clearly be different, because it contains different values than before. But once the matrix is turned back into an image, each pixel value will be rounded to the nearest of the 256 levels available in its channel: the image will remain the same even though its matrix representation was different.
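The rounding argument above can be checked in a few lines. This is a minimal sketch using NumPy, with a small random array standing in for a real picture (the array, its size and the variable names are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-channel, 8-bit image: integer intensities in [0, 255].
image = rng.integers(0, 256, size=(4, 4), dtype=np.uint8)

# Work in floating point and add a perturbation far below one intensity step.
perturbed = image.astype(np.float64) + 1 / 1000

# Storing the result as an 8-bit image rounds every value to the nearest level.
requantized = np.clip(np.rint(perturbed), 0, 255).astype(np.uint8)

matrices_differ = not np.array_equal(perturbed, image.astype(np.float64))
images_match = np.array_equal(requantized, image)
print("matrices differ:", matrices_differ)
print("8-bit images identical:", images_match)
```

The floating-point matrix is what a model would receive, while the 8-bit version is what a viewer would display, which is why the change stays invisible.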

Then what happens if we feed it to our previously trained image classification model? Each perceptron (neuron) performs a dot product between the input features (pixel values) and its weights. But now, each of these operations gives a slightly shifted result. The paper shows that this shift grows with the dimensionality of the input, and that it is largest when the sign of each pixel’s perturbation matches the sign of the corresponding weight. Then, if these small shifts, added on top of each other, are large enough to switch a non-linear activation function (such as a ReLU) on or off, the output of the model can change dramatically.
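To make the effect of dimensionality concrete, here is a small sketch of a single unit's dot product. The weights and inputs are random stand-ins rather than a trained model, and `activation_shift` is a hypothetical helper written for this example. When every pixel is nudged by the same tiny amount in the direction of its weight, the shift works out to that amount times the sum of the absolute weights, so it grows with the number of inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1 / 1000  # per-pixel perturbation, the value used in the text

def activation_shift(n_features):
    """Return (observed shift, epsilon * sum|w|) for one hypothetical unit."""
    w = rng.normal(size=n_features)     # hypothetical weights of the unit
    x = rng.uniform(size=n_features)    # hypothetical pixels scaled to [0, 1]
    eta = epsilon * np.sign(w)          # each pixel nudged in the direction of its weight
    observed = w @ (x + eta) - w @ x
    predicted = epsilon * np.abs(w).sum()
    return observed, predicted

shifts = {n: activation_shift(n) for n in (10, 1_000, 100_000)}
for n, (observed, predicted) in shifts.items():
    print(f"{n:>7} inputs: shift = {observed:.4f} (predicted {predicted:.4f})")
```

The per-pixel change never gets bigger, yet the shift in the dot product keeps growing as more inputs are added.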

So why did I talk about errors earlier? Because a given adversarial example is often effective against a broad range of models, even ones trained on different datasets or built with different network architectures. A low error rate therefore doesn’t mean that our model is robust to adversarial examples, and any given model may be tricked by an effectively unlimited number of finely tuned adversarial examples.
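The gap between low error and robustness can be illustrated with a toy linear classifier. Everything below is an assumption made for the sake of the example: the data is synthetic, the weights `w` are a stand-in for a trained model, and `signal` and `epsilon` are hand-picked values. The perturbation follows the sign of the loss gradient with respect to the input, in the spirit of the paper's fast gradient sign method, which for a linear model with logistic loss reduces to `-y * sign(w)`:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_samples = 10_000, 200
signal, epsilon = 0.08, 0.2   # toy values chosen for illustration only

w = rng.normal(size=n_features)                    # stand-in for trained weights
y = rng.choice([-1, 1], size=n_samples)            # true labels
noise = rng.normal(size=(n_samples, n_features))   # per-feature noise, std 1
x = noise + signal * y[:, None] * np.sign(w)       # clean inputs

def accuracy(inputs):
    """Fraction of inputs whose predicted sign matches the label."""
    return float(np.mean(np.sign(inputs @ w) == y))

# Step each input by epsilon along the sign of the loss gradient,
# which for this linear model is -y * sign(w).
x_adv = x - epsilon * y[:, None] * np.sign(w)

clean_accuracy = accuracy(x)
adversarial_accuracy = accuracy(x_adv)
print("clean accuracy:", clean_accuracy)
print("adversarial accuracy:", adversarial_accuracy)
```

Each feature moves by only 0.2 while its noise has a standard deviation of 1, so individual values barely change, but the classifier's decisions do. This sketch covers a single model only; it does not demonstrate the transfer across models described above.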

Source: Goodfellow, I.J., Shlens, J. and Szegedy, C., 2014. Explaining and harnessing adversarial examples. *arXiv preprint arXiv:1412.6572*.