An AI pipeline to "play a picture of a musical score", and its implications for generative AI
Introduction
Understanding, interpreting, and listening to the content of a musical score is difficult if you are not a musician. Consider a scenario where you are browsing a digital library and reading about Johann Sebastian Bach: you can read that he was a German composer and musician, and you can see images of Bach's scores for the English Suites, but, if you are not an expert musician, it can be difficult to play these scores, which could give you a better understanding of what you are reading and learning from the digital library.
Another important point is the possibility of making music scores searchable and easily accessible. As you will see in the next section, the process of building an audio file from a picture of a musical score includes the creation of a standard, machine-readable representation of the score. This means that music enthusiasts and researchers can efficiently locate specific musical pieces, composers, or genres within vast collections of scores. Researchers can also analyse and compare musical compositions, styles, and historical trends more efficiently by processing and aggregating data from digitized scores. This quantitative approach allows for deeper insights into the evolution of music and its various aspects, contributing to the field of musicology. This functionality significantly enhances the user experience and access to musical resources.
In this article I will describe a pipeline that takes a picture as input (a photo of your musical score) and produces an audio file generated from that score.
AI pipeline
Theoretical description
The process that enables the reproduction of a musical score starting from a picture is described in the following picture and leverages the Optical Music Recognition (OMR) technique.
From Wikipedia we can read the following definition of OMR:
"Optical music recognition (OMR) is a field of research that investigates how to computationally read musical notation in documents. The goal of OMR is to teach the computer to read and interpret sheet music and produce a machine-readable version of the written music score. Once captured digitally, the music can be saved in commonly used file formats, e.g. MIDI (for playback) and MusicXML (for page layout). In the past it has, misleadingly, also been called "music optical character recognition". Due to significant differences, this term should no longer be used."
Given a picture as input, the main steps of OMR are those illustrated in the picture above. Optical Music Recognition is a technology closely linked to several research domains, such as computer vision, document analysis, and music information retrieval.
Optical music recognition is able to convert a picture into a standardised music representation format, a language that machines can easily interpret. In particular, this standard is MusicXML; the definition on the w3.org website is the following:
"MusicXML is a standard open format for exchanging digital sheet music. It is designed for sharing sheet music files between applications, and for archiving sheet music files for use in the future. As of this publication date it is supported by over 250 applications."
Once you have the MusicXML file generated from your picture, the last steps to play your musical score are converting the MusicXML into a MIDI file and then rendering the MIDI file into an audio (WAV) file.
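As a preview of the implementation described in the next sections, the whole flow can be sketched in a few lines of Python (a minimal sketch, assuming the Oemer command-line tool and the partitura library introduced below, and that Oemer writes the MusicXML file next to the input picture):

import subprocess
import partitura as pt

# Step 1: picture -> MusicXML, running the Oemer OMR tool
subprocess.run(['oemer', 'bach_suite_a_minor.jpg'], check=True)

# Step 2: MusicXML -> MIDI, using partitura
score = pt.load_score('bach_suite_a_minor.musicxml')
pt.save_score_midi(score.parts[0], 'mypart.mid')

# Step 3: MIDI -> WAV, rendered with pydub and mido (full code later in the article)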
A possible implementation in Python
All the steps I explained in the previous section can be arranged in a Python pipeline. In particular, the libraries you can use are: Oemer for the OMR step, partitura for the MusicXML-to-MIDI conversion, and mido together with pydub for the MIDI-to-WAV rendering.
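All of these libraries are published on PyPI, so (assuming a standard Python environment) a single command should install them:

pip install oemer partitura mido pydub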
The documentation of each library I mentioned allows you to create a minimal set of code to run each step of the described process. Below I report an example, with code, to play Johann Sebastian Bach's English Suite No. 2 in A Minor with the described pipeline and tools.
Example: English Suite No. 2 in A Minor, Johann Sebastian Bach
In this section, the end-to-end process from the picture to the WAV file is walked through for a sample case: playing the Bourrée from Johann Sebastian Bach's English Suite No. 2 in A Minor.
Below is the starting picture:
The first step is to use the Oemer tool to generate the MusicXML code. Once you have installed the tool, you can run the following command to obtain the MusicXML:
oemer /Users/simoneromano/Desktop/bach_suite_a_minor.jpg
This is the first part of the generated MusicXML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE score-partwise PUBLIC "-//Recordare//DTD MusicXML 4.0 Partwise//EN" "http://www.musicxml.org/dtds/partwise.dtd">
<score-partwise version="4.0">
  <work>
    <work-title>Bach_suite_a_minor</work-title>
  </work>
  <identification>
    <creator type="composer">Transcribed by Oemer</creator>
  </identification>
  <part-list>
    <score-part id="P1">
      <part-name>Piano</part-name>
      <score-instrument id="P1-I1">
        <instrument-name>Piano</instrument-name>
        <instrument-sound>keyboard.piano</instrument-sound>
      </score-instrument>
      <midi-instrument id="P1-I1">
        <midi-channel>1</midi-channel>
        <midi-program>1</midi-program>
        <volume>80</volume>
        <pan>0</pan>
      </midi-instrument>
    </score-part>
  </part-list>
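Since this output is machine-readable, it can be queried programmatically right away. As a quick, minimal sketch (using only Python's standard library, on the file generated above), you can list the parts declared in the MusicXML:

import xml.etree.ElementTree as ET

# Parse the MusicXML produced by Oemer and list the declared parts
tree = ET.parse('bach_suite_a_minor.musicxml')
for score_part in tree.getroot().iter('score-part'):
    print(score_part.get('id'), score_part.findtext('part-name'))  # e.g. P1 Piano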
The next step is to generate the MIDI file starting from the MusicXML. Below is the code I used to implement this step, using the partitura library in Python:
import numpy as np
import partitura as pt

# Load the MusicXML file generated by Oemer
score = pt.load_score('bach_suite_a_minor.musicxml')
part = score.parts[0]

# Optional check: print the notes as (start, end, MIDI pitch) triples
pianoroll = np.array([(n.start.t, n.end.t, n.midi_pitch) for n in part.notes])
print(pianoroll)

# Save the part as a MIDI file
pt.save_score_midi(part, 'mypart.mid')
Once you have the MIDI file, you can use the following code, which uses the pydub and mido libraries, to generate the WAV file:
from collections import defaultdict

from mido import MidiFile
from pydub import AudioSegment
from pydub.generators import Sine

def note_to_freq(note, concert_A=440.0):
    # Convert a MIDI note number to its frequency in Hz (A4 = MIDI note 69)
    return (2.0 ** ((note - 69) / 12.0)) * concert_A

mid = MidiFile("mypart.mid")
output = AudioSegment.silent(duration=mid.length * 1000.0)
tempo = 100  # bpm

def ticks_to_ms(ticks):
    # Convert MIDI ticks to milliseconds at the fixed tempo above
    tick_ms = (60000.0 / tempo) / mid.ticks_per_beat
    return ticks * tick_ms

for track in mid.tracks:
    current_pos = 0.0
    current_notes = defaultdict(dict)
    for msg in track:
        current_pos += ticks_to_ms(msg.time)
        if msg.type == 'note_on' and msg.velocity > 0:
            # Remember when each note started, per channel
            current_notes[msg.channel][msg.note] = (current_pos, msg)
        elif msg.type == 'note_off' or (msg.type == 'note_on' and msg.velocity == 0):
            # A note_on with velocity 0 is equivalent to a note_off
            start_pos, start_msg = current_notes[msg.channel].pop(msg.note)
            duration = current_pos - start_pos
            # Render the note as a sine wave, slightly shortened and faded to avoid clicks
            signal_generator = Sine(note_to_freq(msg.note))
            rendered = signal_generator.to_audio_segment(
                duration=max(duration - 50, 1), volume=-20
            ).fade_out(100).fade_in(30)
            output = output.overlay(rendered, position=start_pos)

output.export("mypart.wav", format="wav")
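If you want to check the result directly from Python, a minimal sketch (assuming a playback backend such as simpleaudio or ffplay is available to pydub) is:

from pydub import AudioSegment
from pydub.playback import play

# Load the rendered file and play it
play(AudioSegment.from_wav("mypart.wav"))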
You can listen to this draft version below.
As you can hear, the main issues in the auto-generated audio come from elements of the score that are not fully extracted. By working on the image-processing steps, these could be improved.
Conclusion
In this article, I showed how it is possible to play a musical score starting from a picture of it (a photo), using a combination of deep learning, machine learning, and programming skills, which form the basis of the Optical Music Recognition (OMR) technique I described. Considering this is a multi-domain topic, having a musician's background can help, in particular to interpret the results of the algorithms.
It is really important to notice that one of the main benefits of this pipeline, in my approach, is the possibility of automatically creating a standard representation of the music (MusicXML), which researchers can use to analyse and compare musical compositions, styles, and historical trends more efficiently. Here I'm thinking of the use of generative AI to navigate MusicXML data and generate comparisons, reports, descriptions, etc.
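As a hint of what such analyses could look like, here is a minimal sketch that compares two scores by their pitch-class distributions, reusing the partitura API shown above (the second score file is hypothetical):

import numpy as np
import partitura as pt

def pitch_class_histogram(path):
    # Normalised histogram of the 12 pitch classes in the first part of a score
    part = pt.load_score(path).parts[0]
    pitches = np.array([n.midi_pitch for n in part.notes])
    histogram = np.bincount(pitches % 12, minlength=12)
    return histogram / histogram.sum()

h1 = pitch_class_histogram('bach_suite_a_minor.musicxml')
h2 = pitch_class_histogram('another_score.musicxml')  # hypothetical second score
print(np.abs(h1 - h2).sum())  # a simple distance between the two distributions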
This could be the main topic of a dedicated article.