Advantages of U-Net for image segmentation
Today I’ve read the “U-Net” paper, which proposes a new network architecture for image segmentation. The full reference can be found at the bottom of this article.
Image segmentation is the task of discerning particular objects in an image. This could be delimiting a pedestrian from its surroundings for self-driving cars, delimiting a glass from other objects for a robot arm, or, in our case, discerning cells in light microscopy images. The authors of today’s paper propose a new architecture named “U-Net” which achieved record-low errors on cell segmentation challenges.
As described in a previous article (https://bit.ly/2MgaRHK), Deep Learning works very well on images. This is mostly due to Convolutional Neural Networks, which are specifically designed to extract features from images. Classic neural networks don’t assume any temporal or spatial relationships between features: feature A and feature B are considered independently. This is not true when we are dealing with sounds: randomly sampled bits of sound may be meaningless by themselves, but considered in the right order and at the right pace, they may represent words or songs. The same goes for images: a random pixel may not represent anything on its own, but taken with its neighbors, it can represent an object or an animal. For this reason, new network architectures have been proposed to take temporal and spatial coherence into account.
Convolutional Neural Networks use 2-dimensional filters (3x3 or 5x5, for example) which scan a given image and try to extract relevant patterns from it. The bi-dimensionality of the filters allows them to capture spatially coherent features such as corners or lines. Then, through a stack of such transformations, we end up with high-level features which are combined to predict a relevant target value for our images. Convolutional Neural Networks are typically used for image classification, which means reducing an image to a single number (the “label” or “category”).
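To make the filtering step concrete, here is a minimal sketch (assuming NumPy, with a hand-picked edge filter rather than learned weights) of how a 2-dimensional filter scans an image and responds to a spatially coherent feature, here a vertical edge:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: slide the kernel over the image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter (Sobel-like): responds where intensity
# changes from left to right.
edge_filter = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]])

# Toy image with a sharp vertical boundary: dark left, bright right.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

response = conv2d(image, edge_filter)
# The response is zero over flat regions and large along the boundary.
```

In a real Convolutional Neural Network, the filter values are learned during training rather than hand-picked, and many filters run in parallel to extract different patterns.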
Image segmentation is a radically different task. Here, instead of predicting one category per image, we want to delimit an object of interest within the whole image. Many approaches can be used for such a task. A first approach divides the input image into a series of blocks and tries to classify each block as containing our object or not. This requires testing different block sizes (because we do not know in advance how big our object is), it requires running a given convolutional neural network many times per image, and we still only end up with a coarse region where our object lies. A second approach tries to classify each pixel of our image as “inside our object” or “outside our object”. This produces a more precise segmentation of our object of interest but still requires running a given model many times, on a sub-part of the image around each pixel.
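The second, pixel-wise approach can be sketched as follows, assuming NumPy; `classify_patch` is a hypothetical stand-in for a trained model, reduced here to a simple intensity threshold:

```python
import numpy as np

def classify_patch(patch):
    """Hypothetical stand-in for a trained classifier: labels the
    patch centre 1 ("inside object") if the patch is mostly bright."""
    return 1 if patch.mean() > 0.5 else 0

def segment_by_sliding_window(image, patch_size=3):
    """Run the classifier once per pixel, on a patch centred on it."""
    pad = patch_size // 2
    padded = np.pad(image, pad, mode="reflect")
    mask = np.zeros(image.shape, dtype=int)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            patch = padded[i:i + patch_size, j:j + patch_size]
            mask[i, j] = classify_patch(patch)
    return mask

# Toy image: the "object" is the bright right half.
image = np.zeros((4, 4))
image[:, 2:] = 1.0
mask = segment_by_sliding_window(image)
```

The key inefficiency is visible in the nested loop: the model runs once per pixel, on heavily overlapping patches.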
The authors of today’s paper propose an architecture that fixes most of the flaws of the previous approaches and brings additional advantages. The “U-Net” architecture consists of 2 parts: the first part is a “classic” Convolutional Neural Network which scans the image, extracts patterns from it, and combines them into high-level features. Then, the network is asked to upscale its hidden layers to recreate a full binary image. The task here consists of taking a full image as input and generating a full image as output. The output image contains only 0s and 1s, delimiting the background from the object we want to discern. This architecture also contains links between its first phase (increasing feature depth / decreasing image resolution) and its second phase (decreasing feature depth / upscaling image resolution). These links consist of saving the feature maps produced during the first phase of the network and concatenating them to the corresponding layers of the second phase. This lets the network combine features from different spatial regions of the image and allows it to localize regions of interest more precisely.
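A minimal shape-level sketch of this idea, assuming NumPy and leaving out the learned convolutions: the encoder halves the spatial resolution, the decoder doubles it back, and the link concatenates the saved feature map (not the weights) channel-wise:

```python
import numpy as np

def downsample(x):
    """2x2 max pooling: halves spatial resolution (contracting path)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour upsampling: doubles resolution (expanding path)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# Toy input: an 8x8 feature map with 4 channels.
x = np.random.rand(8, 8, 4)

# Contracting path: resolution shrinks; the feature map is saved
# for the link to the expanding path.
skip = x                  # snapshot of the FEATURE MAP, not the weights
encoded = downsample(x)   # shape (4, 4, 4)

# Expanding path: resolution grows back, then the saved feature map
# is concatenated channel-wise - the U-Net link.
decoded = upsample(encoded)                          # shape (8, 8, 4)
combined = np.concatenate([decoded, skip], axis=-1)  # shape (8, 8, 8)
```

In the real architecture, learned convolutions sit between every resizing step, and the number of feature channels grows on the way down and shrinks on the way up; this sketch only shows how the resolutions and the concatenation fit together.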
Another trick the authors propose is to use data augmentation extensively to virtually increase the number of input images without needing additional labeled data. This data augmentation is mainly done through “elastic deformations”, which change the shape of the objects in the image and act as if the imaged cells were seen in a different position or shape.
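The paper generates its elastic deformations from random displacement vectors on a coarse grid followed by bicubic interpolation; the sketch below is a simplified NumPy-only variant (per-pixel displacement field, box smoothing, nearest-neighbour resampling) just to illustrate the idea:

```python
import numpy as np

def smooth(field, k=5):
    """Box-blur a 2-D field so the displacements vary smoothly."""
    kernel = np.ones(k) / k
    field = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 0, field)
    return np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 1, field)

def elastic_deform(image, alpha=3.0, seed=0):
    """Displace every pixel by a smooth random vector field and
    resample with nearest-neighbour interpolation."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    dx = smooth(rng.uniform(-1, 1, (h, w))) * alpha
    dy = smooth(rng.uniform(-1, 1, (h, w))) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    new_y = np.clip(np.rint(ys + dy), 0, h - 1).astype(int)
    new_x = np.clip(np.rint(xs + dx), 0, w - 1).astype(int)
    return image[new_y, new_x]

image = np.arange(64, dtype=float).reshape(8, 8)
warped = elastic_deform(image)
```

The same displacement field must of course be applied to the label mask, so image and ground truth stay aligned after augmentation.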
The “U-Net” doesn’t need multiple runs to perform image segmentation and can learn from very few labeled images, which makes it well suited for image segmentation in biology or medicine, where labeled data is scarce. The authors tested their architecture on a few image segmentation challenges and got lower errors than state-of-the-art convolutional neural networks while running faster and needing less labeled data.
Source: Ronneberger, O., Fischer, P. and Brox, T., 2015, October. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234–241). Springer, Cham.