Machine Learning in healthcare may not need as much labeled data as before

Jean-Charles Nigretto
4 min read · Jan 13, 2021


Today I read a paper about unsupervised learning for the classification of chest radiograph images. I found it thanks to the deeplearning.ai “The Batch” newsletter. The paper’s reference can be found at the bottom of this article.

Recent advances in Deep Learning and the online availability of pre-trained models open the way to applications in more and more fields. Within healthcare, one sub-field seems to benefit particularly from this context: image classification of radiographs.

Radiography is an imaging technique that reveals the internal structure of the body: bones, joints, kidney stones, etc. These images usually require a human expert, such as a radiologist, to be analyzed and understood. Thanks to Deep Learning, and in particular to readily available state-of-the-art models, it is possible to train a model to perform diagnosis from radiographs by teaching it to classify images from healthy and unhealthy patients. Previous studies have shown that this kind of model can rival, and sometimes outperform, human experts on specific tasks.

Training a Deep Learning model for a classification task first requires building a training set: a set of images associated with their labels, or ground truth. For example, in the case of pleural effusion classification (predicting from a chest radiograph whether a patient is experiencing pleural effusion), we would first need to gather dozens of images and label each one as “presenting pleural effusion” or “not presenting pleural effusion”. The more labeled data we get, the better the model tends to perform.
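To make this supervised setup concrete, here is a minimal PyTorch-style sketch of what such a labeled training set could look like. The CSV layout, column names, and preprocessing choices are assumptions made for illustration, not something taken from the paper.

```python
# Minimal sketch of a labeled dataset for pleural effusion classification.
# Assumed CSV columns: "image_path" and "pleural_effusion" (0 or 1),
# where each label was produced beforehand by a human expert.
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class LabeledChestXrayDataset(Dataset):
    def __init__(self, csv_path):
        self.df = pd.read_csv(csv_path)
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.Grayscale(num_output_channels=3),  # radiographs are grayscale
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = self.transform(Image.open(row["image_path"]))
        label = torch.tensor(row["pleural_effusion"], dtype=torch.long)
        return image, label
```

Every row of that hypothetical CSV represents expert time spent labeling, which is exactly the bottleneck discussed next.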

For obvious reasons, building large labeled datasets in healthcare is difficult, since labeling requires a lot of time from human experts who are already very busy. Aware of this constraint, the authors of today’s paper found a way to greatly improve model performance on such tasks without any additional labeled data.

Their technique relies on collecting both the radiograph images and their textual descriptions from the patients’ medical records to constitute image-text pairs. For example, a radiograph of a patient with cardiomegaly may be accompanied by a text like “severe cardiomegaly is noted from the patient's radiograph”. As we can easily understand, this kind of data contains the information we are looking for, as long as we are able to extract it in an unsupervised manner (i.e., without manual labeling).
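As a hedged sketch, assembling such pairs could look like the snippet below. The file name and column names (“image_path”, “report_text”) are hypothetical; in practice the text would come from the radiology report stored with each study.

```python
# Building image-text pairs without any manual labeling step:
# the free-text report itself carries the supervisory signal.
import pandas as pd

def load_image_text_pairs(csv_path):
    df = pd.read_csv(csv_path)
    return [(row["image_path"], row["report_text"]) for _, row in df.iterrows()]

pairs = load_image_text_pairs("radiology_studies.csv")  # hypothetical file
# e.g. ("study_001.png", "Severe cardiomegaly is noted from the patient's radiograph")
```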

Then, the authors trained a model to represent both images and texts such that their vector representations (sequences of numbers depicting the input data) are as close to each other as possible. For example, let’s take an image of an apple associated with the text “This is an apple!”. This kind of image-text pair can easily be found on the internet. We can ask an algorithm to turn these two pieces of information into numbers. Let’s say that our naive algorithm turns the apple image into the number 1 and the apple-related text into the number 99. We then compute the difference between these two numbers (which here represents how close they are) and get 98. We then ask the algorithm to update itself so that it associates the image and the text with closer numbers, say 49 and 50. If we teach a machine learning model to perform this task over and over, we end up with a model able to represent any image-text pair as two series of numbers that are close to each other.
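Here is a toy PyTorch version of that single-number example: the “loss” is simply how far apart the two representations are, and gradient descent pulls them together. Real models of course use high-dimensional vectors produced by learned encoders, but the mechanics are the same.

```python
import torch

# Start from the toy values above: the image maps to 1, the text to 99.
image_repr = torch.tensor(1.0, requires_grad=True)
text_repr = torch.tensor(99.0, requires_grad=True)
optimizer = torch.optim.SGD([image_repr, text_repr], lr=0.1)

for step in range(50):
    optimizer.zero_grad()
    loss = (image_repr - text_repr) ** 2  # how far apart the two representations are
    loss.backward()
    optimizer.step()  # nudge both representations toward each other

print(round(image_repr.item(), 2), round(text_repr.item(), 2))  # both end up near 50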

Specifically, the authors fed the image data through a ResNet50 model and the text data through a BERT model to obtain two representation vectors, one for the image and one for the text. They then took these two vectors and computed a loss analogous to the sum of the image-to-text and text-to-image mutual information terms. The closer two vectors from the same pair are (and the further apart vectors from different pairs are), the lower the loss. Doing so pushes the model to update the representation vectors so that they preserve the mutual information between the two vectors of the same pair. After training these representations, the authors took a few radiograph classification datasets and applied their previously trained model to them. To do so, they extracted the image encoder from their architecture and reused it for image classification.
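A simplified sketch of such a bidirectional contrastive objective is shown below, assuming a batch of matched image-report pairs. The encoders, projection dimension, and temperature are illustrative choices rather than the paper’s exact configuration.

```python
# Hedged sketch of a bidirectional image-text contrastive loss
# with a ResNet50 image encoder and a BERT text encoder.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50
from transformers import AutoModel, AutoTokenizer

image_encoder = resnet50(weights=None)
image_encoder.fc = torch.nn.Linear(2048, 512)        # project image features to 512-d
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")
text_proj = torch.nn.Linear(768, 512)                # project text features to 512-d

def contrastive_loss(images, reports, temperature=0.1):
    # Encode both modalities and L2-normalise the representation vectors.
    v = F.normalize(image_encoder(images), dim=-1)                             # (N, 512)
    tokens = tokenizer(reports, padding=True, truncation=True, return_tensors="pt")
    u = F.normalize(text_proj(text_encoder(**tokens).pooler_output), dim=-1)   # (N, 512)

    # Cosine similarities between every image and every text in the batch;
    # matching pairs sit on the diagonal.
    logits = v @ u.t() / temperature
    targets = torch.arange(len(images))

    # Image-to-text and text-to-image contrastive terms: an InfoNCE-style
    # objective that relates to the mutual information between the two views.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return loss_i2t + loss_t2i
```

For the downstream classification datasets, only the pre-trained image encoder would then be kept and fine-tuned (or used as a fixed feature extractor) on the small labeled sets.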

They found that their model, pre-trained on unlabeled image-text pairs, performed significantly better than previous state-of-the-art models. They explain that this kind of approach could greatly help Deep Learning models improve their accuracy on healthcare-related tasks without the need for large labeled datasets.

Source: Zhang, Y., Jiang, H., Miura, Y., Manning, C.D. and Langlotz, C.P., 2020. Contrastive Learning of Medical Visual Representations from Paired Images and Text. arXiv preprint arXiv:2010.00747.
