A new baseline for music sheet recognition with Deep Learning

Jean-Charles Nigretto
4 min read · Jan 20, 2021


Today I’ve read a paper that defines a new baseline and methodology for music sheet recognition. The paper’s full reference can be found at the bottom of this article.

If you play a musical instrument, you may have encountered the issue of looking at a music sheet without knowing precisely what it should sound like. When this happens, searching for a video of someone playing that specific composition may be enough, but when no such recording exists, there is not much we can do about it. This problem could be solved by an algorithm: we could train a Deep Learning model to look at a music sheet and to “play” it, which would mean being able to generate sound accordingly.

Going from a music sheet to actual music involves many steps and would most likely require a series of models chained together. The authors of today’s paper understood that well and focused their work on Optical Music Recognition (OMR), the field of research that studies how to computationally read music notation. This means being able to look at a music sheet and 1/ detect, 2/ locate and 3/ identify all music notations. The authors explain that previous work has been done in this field, but that previously published papers all used different methodologies, which makes it difficult to compare their results with each other. For this reason, they focused on designing both a methodology and a baseline for Optical Music Recognition models.

Optical Music Recognition can be divided into 4 steps (a short code skeleton chaining them follows the list):

  • The preprocessing step: how to clean up the input image to ease the learning of the algorithm
  • The Music Object Detection step: how to find and classify all relevant symbols on the image. This is described by the authors as the hardest challenge.
  • The Relational Understanding step: how to make a model understand the logic and links between music symbols. This includes taking into account the temporal dimension of the musical composition
  • The Encoding step: how to package the learnt symbols and relations into a file
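
Before zooming in, it may help to see how these four steps could chain together. Below is a minimal Python skeleton of such a pipeline; every function name and placeholder body is my own illustration, not code from the paper.

```python
# Hypothetical skeleton of the four OMR steps; all names and placeholder
# bodies are illustrative, not an API from the paper.

def preprocess(scan):
    """Step 1: clean the raw image (e.g., binarization, deskewing)."""
    return scan  # placeholder: pass the image through unchanged

def detect_music_objects(image):
    """Step 2: find and classify all relevant symbols (the hardest step)."""
    return []  # placeholder: would return one detection per symbol

def build_relations(detections):
    """Step 3: recover the links between symbols, including temporal order."""
    return detections  # placeholder

def encode(relations, path):
    """Step 4: package the symbols and relations into a file (e.g., MusicXML)."""
    pass  # placeholder: would serialize `relations` to `path`

def sheet_to_music_file(scan, path):
    """Chain the four steps end to end."""
    encode(build_relations(detect_music_objects(preprocess(scan))), path)
```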

Because Music Object Detection is the most challenging part of Optical Music Recognition, the authors focused on it in particular.

The proposed methodology starts with a task formulation: what exactly are we trying to do? We already know that our goal is to train an algorithm to take a music sheet as an input and to generate a digital music file as an output. More specifically, because we are focused on the Music Object Detection part of OMR, we want to be able to identify each music symbol on a given music sheet. What does “identifying a music symbol” mean? The authors describe this task through 6 variables: for any music symbol, the algorithm must output its position on the music sheet (the x and y coordinates of 2 opposite corners, which already makes 4 variables), the predicted category c and a confidence score s. We now have a clear model output: (x1, y1, x2, y2, c, s) for each music symbol.
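
To make that output concrete, here is a minimal Python sketch of a single detection. The six fields mirror the paper’s six variables, but the type itself, the example values and the category label are invented for illustration.

```python
from typing import NamedTuple

class Detection(NamedTuple):
    """One predicted music symbol: the paper's six output variables."""
    x1: float  # horizontal coordinate of the first corner
    y1: float  # vertical coordinate of the first corner
    x2: float  # horizontal coordinate of the opposite corner
    y2: float  # vertical coordinate of the opposite corner
    c: str     # predicted symbol category (hypothetical label below)
    s: float   # confidence score of the prediction

# A hypothetical prediction for one symbol on a page:
pred = Detection(x1=120.0, y1=48.5, x2=138.0, y2=92.0, c="quarter_note", s=0.93)
```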

We now need datasets. We are looking here for datasets of music sheets associated with their annotated music symbols. The authors propose 3 datasets. First, DeepScores, which consists of 300,000 digital images and their ground-truth annotations. The second dataset is MUSCIMA++, which consists of 140 handwritten music sheets carrying over 90,000 annotated symbols. The third and last dataset is Capitan, which consists of 46 fully-annotated pages from the 16th-18th century and mainly contains sacred music. These three datasets present a diverse range of music sheets: from modern, digitized scores to centuries-old handwritten sacred music.
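
To see what “annotated” means in practice, here is a hypothetical shape for one ground-truth record: the same bounding box and category as a prediction, but no confidence score, since annotations are taken as correct. The type and field names are mine, not taken from any of the three datasets.

```python
from typing import List, NamedTuple

class GroundTruthSymbol(NamedTuple):
    """One annotated symbol: a box and a category, no confidence score."""
    x1: float
    y1: float
    x2: float
    y2: float
    c: str  # annotated symbol category

# A page is then an image plus its list of annotated symbols:
page_annotations: List[GroundTruthSymbol] = [
    GroundTruthSymbol(10.0, 20.0, 30.0, 140.0, "g_clef"),
    GroundTruthSymbol(130.0, 45.0, 150.0, 90.0, "quarter_note"),
]
```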

Finally, we need metrics! Two metrics are proposed here: the mean average precision (mAP, in %) and the weighted mean average precision (w-mAP, in %). The precision of a model looks at how reliable its predictions are, but a very conservative model that makes only a few predictions could reach a high precision score while missing most of its targets. Recall, on the other hand, measures the ability of the model to detect a category, but a naive model tagging everything as a relevant symbol could also reach a high recall score while being useless to us. The average precision is the area under the precision-recall curve and combines the advantages of the precision score and of the recall score. The weighted average precision is simply an average precision that takes label imbalance into account.
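
As a rough illustration, here is a simplified per-class average precision computation in Python. It assumes each prediction has already been matched against the ground truth (for instance with an intersection-over-union threshold); real benchmarks add refinements such as interpolated precision, so treat this as a sketch, not the paper’s exact evaluation code.

```python
import numpy as np

def average_precision(scores, is_match, n_ground_truth):
    """Simplified area under the precision-recall curve for one symbol class.

    scores: confidence of each prediction (the s of the output tuple).
    is_match: whether each prediction matched a ground-truth symbol.
    n_ground_truth: number of annotated symbols of this class.
    """
    order = np.argsort(scores)[::-1]                # rank by confidence, best first
    tp = np.cumsum(np.asarray(is_match)[order])     # cumulative true positives
    precision = tp / np.arange(1, len(scores) + 1)  # reliability at each rank
    recall = tp / n_ground_truth                    # coverage at each rank
    # Sum precision over each increment of recall (step integration)
    return float(np.sum(precision * np.diff(np.concatenate(([0.0], recall)))))

# Example: 3 predictions, 2 of which match one of 3 annotated symbols
print(average_precision([0.9, 0.8, 0.6], [True, False, True], 3))  # ~0.556
```

The mAP then averages this value over all symbol classes, while the w-mAP averages it with per-class weights to account for label imbalance.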

Thanks to their new methodology (output values, datasets, metrics), the authors were able to evaluate 3 Deep Learning models and compare them based on their mAP and w-mAP on each dataset.

This article aimed at describing the field of Optical Music Recognition and at presenting the methodology and baseline designed in today's paper. Another article may focus on the specific Deep Learning architectures which have been evaluated with this methodology.

Source: Pacha, A., Hajič, J. and Calvo-Zaragoza, J., 2018. A baseline for general music object detection with deep learning. Applied Sciences, 8(9), p.1488.
