
AUDIO-TO-SCORE ALIGNMENT OF PIANO MUSIC USING RNN-BASED AUTOMATIC MUSIC TRANSCRIPTION

Taegyun Kwon, Dasaem Jeong, and Juhan Nam


Graduate School of Culture Technology, KAIST
{ilcobo2, jdasam, juhannam}@kaist.ac.kr

ABSTRACT

We propose a framework for audio-to-score alignment of piano performance that employs automatic music transcription (AMT) using neural networks. Even though the AMT result may contain some errors, the note prediction can be regarded as a learned feature representation that is directly comparable to MIDI note or chroma representations. To this end, we employ two recurrent neural network structures that work as AMT-based feature extractors for the alignment algorithm. One predicts the presence of the 88 notes or the chroma at the frame level, and the other detects note onsets in the chroma domain. We combine the two types of learned features for the audio-to-score alignment. For comparability, we apply dynamic time warping as the alignment algorithm without any additional post-processing. We evaluate our framework on the MAPS dataset and compare it to other approaches. The results show that the alignment framework with the learned features significantly improves the accuracy, achieving less than 10 ms in mean onset error.

1. INTRODUCTION

Audio-to-score alignment (also known as score following) is the process of temporally fitting music performance audio to its score. The task has been explored for quite a while and utilized mainly for interactive music applications, for example, automatic page turning, computer-aided accompaniment or interactive interfaces for active music listening [1, 2]. Another use case of audio-to-score alignment is performance analysis, which examines a performer's interpretation of music pieces in terms of tempo, dynamics, rhythm and other musical expressions [3]. To this end, the alignment result must be precise, with high temporal resolution. It was reported that the just-noticeable difference (JND) time displacement of a tone presented in a metrical sequence is about 10 ms for short notes [4], which is beyond the current accuracy of automatic alignment algorithms. This challenge has provided the motivation for our research.

There are two main components in audio-to-score alignment: the features used in comparing audio to score, and the alignment algorithm between the two feature sequences. In this paper, we limit our scope to the feature part. A typical approach is converting the MIDI score to synthesized audio and comparing it to the performance audio using various audio features. The most common choices are time-frequency representations through the short-time Fourier transform (STFT) [5] or auditory filter bank responses [6]. Others suggested chroma audio features, which are designed to minimize differences in acoustic quality between two piano recordings such as timbre, dynamics and sustain effects [6]. However, designing such features by hand relies on trial-and-error, and so is time-consuming and sub-optimal. Another approach to audio-to-score alignment is converting the performance audio to MIDI using automatic music transcription (AMT) systems and comparing the performance to the score in the MIDI domain [7]. The advantage of this approach is that the transcribed MIDI is robust to timbre and dynamics variations by the nature of the AMT system, provided that it predicts only the presence of notes. In addition, the synthesis step is not required. However, the AMT system must have high performance to predict notes accurately, which is itself a challenging task.

In this paper, we follow the AMT-based approach for audio-to-score alignment. To this end, we build two AMT systems by adapting a state-of-the-art method using recurrent neural networks [8] with a few modifications. One system takes spectrograms as input and is trained in a supervised manner to predict a binary representation of MIDI in either 88 notes or chroma. The prediction does not consider the intensities of notes, in other words, MIDI velocity. Using this system alone, however, does not provide precise alignment because onset frames and sustain frames are treated as equally important; in other words, the similarity between matching onset frames becomes identical to that between the following sustain frames. To make up for this limitation, we use another AMT system that is trained to predict the onsets of MIDI notes in the chroma domain. This was inspired by the Decaying Locally-adaptive Normalized Chroma Onset (DLNCO) feature by Ewert et al. [6]. Following that idea, we employ decaying chroma note onset features, which turn out not only to offer temporally precise anchor points but also to make onset frames salient (a rough sketch of this idea is given at the end of this section). Finally, we combine the two MIDI-domain features and conduct a dynamic time warping algorithm on the feature similarity matrix. The evaluation on the MAPS dataset shows that our proposed framework significantly improves the alignment accuracy compared to previous approaches.
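As a rough sketch of this decaying-onset idea (our illustration in the spirit of DLNCO [6]; the kernel shape and length are assumptions, not the authors' exact design), each predicted chroma onset can be extended with a short decaying tail:

```python
import numpy as np

def decaying_onsets(onset_probs, decay_len=10):
    """onset_probs: (frames, 12) chroma onset activations in [0, 1].
    Extends each onset with a decaying tail so that onset frames
    stay salient over several 10 ms frames."""
    kernel = np.sqrt(1.0 - np.arange(decay_len) / decay_len)  # 1 -> 0 decay
    out = np.zeros_like(onset_probs)
    for c in range(onset_probs.shape[1]):
        tail = np.convolve(onset_probs[:, c], kernel)[: len(onset_probs)]
        out[:, c] = np.minimum(tail, 1.0)  # clip overlapping tails
    return out
```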
Figure 1. Flow diagram of the proposed audio-to-score alignment system.

2. SYSTEM DESCRIPTION

The proposed framework is illustrated in Figure 1. The left-hand side presents the two independent AMT systems that return either 88 note or chroma output and chroma onset output, respectively. They are merged and aligned with the score MIDI through dynamic time warping (DTW).

2.1 Pre-processing

The AMT systems that we use are based on the method proposed by Böck and Schedl [8]. Thus, we follow the audio pre-processing step of that method. It first receives audio waveforms as input and computes two types of short-time Fourier transform (STFT), one with a short window (2048 samples, 46.4 ms) and the other with a long window (8192 samples, 185.8 ms), with the same hop size (441 samples, 10 ms). The STFT with a short window gives temporally sensitive output, while the one with a longer window offers better frequency resolution. A Hamming window was applied to the signal before the STFT. We only take the magnitude of the STFT, thereby obtaining spectrograms with 100 frames/sec.

To reflect the logarithmic characteristics of sound intensity, a log-like compression with a multiplication factor of 1000 is applied to the magnitude of the spectrograms. We then reduce the dimensionality of the inputs by filtering with semitone filterbanks. The center frequencies are distributed according to the frequencies of the 88 MIDI notes and the widths are formed with overlapping triangular shapes. This process is not only effective for reducing the input size but is also able to reduce variance in piano tuning by merging neighboring frequency bins. In the low frequency range, some note bins become completely zero or a linear summation of neighboring notes due to the low frequency resolution of the spectrogram. We remove those dummy note bins, thereby having 183 dimensions in total. We augmented the input by concatenating it with the first-order difference of the semitone-filtered spectrogram. We observed a significant increase in transcription performance with this addition.
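A minimal sketch of this pre-processing pipeline with numpy/librosa is shown below. The triangular filterbank construction is our reading of the text, and the sketch omits the removal of the dummy low-frequency bins (so its dimensionality differs slightly from the 183 bins above); treat it as illustrative rather than the authors' code.

```python
import numpy as np
import librosa

def semitone_filterbank(n_fft, sr=44100, fmin=27.5, n_notes=88):
    """Overlapping triangular filters centered on the 88 MIDI note freqs."""
    centers = fmin * 2.0 ** (np.arange(n_notes) / 12.0)          # A0 .. C8
    edges = np.concatenate(([centers[0] / 2 ** (1 / 12)], centers,
                            [centers[-1] * 2 ** (1 / 12)]))
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_notes, len(freqs)))
    for i in range(n_notes):
        lo, c, hi = edges[i], edges[i + 1], edges[i + 2]
        fb[i] = np.maximum(0.0, np.minimum((freqs - lo) / (c - lo),
                                           (hi - freqs) / (hi - c)))
    return fb

def preprocess(path, sr=44100, hop=441):
    y, _ = librosa.load(path, sr=sr)
    bands = []
    for n_fft in (2048, 8192):                                   # 46.4 / 185.8 ms
        mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop,
                                  window="hamming"))
        mag = np.log1p(1000.0 * mag)                             # log-like compression
        bands.append(semitone_filterbank(n_fft, sr) @ mag)
    x = np.concatenate(bands, axis=0)
    diff = np.diff(x, axis=1, prepend=x[:, :1])                  # first-order difference
    return np.concatenate([x, diff], axis=0).T                   # (frames, dims)
```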

2.2 Neural Network


We employed a recurrent neural network (RNN) as the network model. Compared to feedforward neural networks, RNNs are capable of learning the temporal dependency of sequential data, a property found in music audio. In practice, the basic RNN model has difficulty in learning long-term dependency due to the vanishing gradient problem. Thus we employ Long Short-Term Memory (LSTM) [9] units for the RNN. An LSTM unit has a memory block which is updated only when an input or forget gate is open, and so gradients can propagate through the memory cells without being multiplied at each time step. This property enables LSTM to learn long-term dependency. In our setting, the LSTM is expected to learn the continuity of notes and the relation between note onset, sustain and offset.
We also designed our model to be bidirectional, meaning that the input sequence is presented not only in its original order but also in the opposite direction. Through the forward and backward layers together, the network can access both the history and the future of a given time frame. This is a suitable property for the automatic music transcription task. For example, right after an onset is detected, it would be difficult to determine the pitch of the note with unidirectional information only, because just a small fragment of the note sound (about 1-2 frames) has been received and a piano onset has a percussive spectral distribution.

We use two types of bidirectional LSTM-RNNs. The one that predicts the 88 notes or 12 chroma has two hidden layers with 200 bidirectional LSTM units each. The other, which predicts chroma onsets, has two hidden layers with 100 units on the first layer and 200 units on the second layer. On top of the LSTM networks, a fully connected layer with sigmoid activation units is added as the output layer. Each output unit corresponds to one MIDI note or one chroma (i.e. the pitch class of the MIDI note).
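As a concrete (but hypothetical) PyTorch sketch of these two networks, assuming the 366-dimensional input of Section 2.1 (183 semitone bins plus their first-order difference); the layer sizes follow the text, everything else is our choice:

```python
import torch
import torch.nn as nn

class TranscriptionRNN(nn.Module):
    """Two bidirectional LSTM layers + sigmoid output per note/chroma."""
    def __init__(self, n_in=366, hidden=(200, 200), n_out=88, p_drop=0.5):
        super().__init__()
        self.lstm1 = nn.LSTM(n_in, hidden[0], batch_first=True,
                             bidirectional=True)
        self.lstm2 = nn.LSTM(2 * hidden[0], hidden[1], batch_first=True,
                             bidirectional=True)
        self.drop = nn.Dropout(p_drop)
        self.out = nn.Linear(2 * hidden[1], n_out)

    def forward(self, x):                   # x: (batch, frames, n_in)
        h, _ = self.lstm1(x)
        h, _ = self.lstm2(self.drop(h))
        return torch.sigmoid(self.out(self.drop(h)))

note_model = TranscriptionRNN(hidden=(200, 200), n_out=88)   # 88 note / chroma
onset_model = TranscriptionRNN(hidden=(100, 200), n_out=12)  # chroma onsets
```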
2.2.1 Backpropagation

Theoretically, an LSTM can learn long-term dependency of any length through backpropagation through time (BPTT) with a desired number of time steps. In practice, this requires large memory and heavy computation because the entire past history of the network within the backpropagation length must be stored and updated. To overcome this difficulty, a truncated backpropagation method [10] is usually applied to long sequences. In truncated backpropagation, input sequences are divided into shorter segments and the last state of each segment is transferred to the consecutive segment. So even though the backpropagation is only computed within each segment, it serves as an approximation to full-length backpropagation. For a bidirectional network, however, the backward flow requires computation over the entire future, and thus truncated backpropagation requires large memory as well. To imitate the advantage of truncated backpropagation within our computational budget, we split the input sequence into relatively large segments and perform
full-length backpropagation within each segment. We conducted a grid search on the segment length between 10 frames and 300 frames (100 to 3000 ms) and finally settled on 50 frames (500 ms). This was long enough to capture the continuity of individual notes and was also computationally inexpensive. We conducted a comparative experiment between a unidirectional model with truncated backpropagation and a bidirectional model with non-transferred segmentation. The result showed that the bidirectional model performs better.

To reduce the amount of computation, our model works in a sequence-to-sequence manner, i.e. the output of the network is a sequence with the same length as the input segment. Therefore, frames at the edges of a segment have only a one-sided context window. We observed that errors frequently occur on such frames, as shown in Figure 2b. To tackle this problem, we split the input sequence into 50% overlapping segments and take only the middle part of the output from each segment, as sketched below. This procedure significantly improves the transcription result, as shown in Figure 2c.
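A sketch of this overlapped-segment inference; `model` with a frame-wise `predict` method and `n_out` attribute is a hypothetical interface, and the exact amount of output kept per segment is our assumption:

```python
import numpy as np

def predict_overlapped(model, x, seg_len=50):
    """Run the network on 50%-overlapping segments of `x` (frames, dims)
    and keep only the middle part of each output, so that no kept frame
    relies on a one-sided context window."""
    hop, n = seg_len // 2, len(x)
    out = np.zeros((n, model.n_out))
    for start in range(0, n, hop):
        seg = x[start:start + seg_len]
        y = model.predict(seg)                     # (len(seg), n_out)
        lo = 0 if start == 0 else hop // 2         # keep edge of first and
        hi = len(seg) if start + seg_len >= n else seg_len - hop // 2  # last seg
        out[start + lo:start + hi] = y[lo:hi]
    return out
```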

2.2.2 Network Training

To train the networks, we used audio files and aligned MIDI files. The MIDI data was converted into an array form with the same frame rate as the input filter-bank spectrogram, i.e. 100 fps. For the 88 note and chroma labels, the array elements between note onset and offset were annotated as 1 and otherwise filled with 0. For the chroma onset labels, elements that correspond to note onsets were annotated as 1. The corresponding audio data was normalized to zero mean and unit standard deviation over each filter in the training set.
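A sketch of this label construction using pretty_midi (our library choice; the paper does not name one):

```python
import numpy as np
import pretty_midi

def midi_to_labels(path, fps=100):
    """Frame-level 88 note and chroma onset targets from an aligned MIDI."""
    pm = pretty_midi.PrettyMIDI(path)
    n = int(np.ceil(pm.get_end_time() * fps)) + 1
    notes = np.zeros((n, 88), dtype=np.float32)    # 1 between onset and offset
    onsets = np.zeros((n, 12), dtype=np.float32)   # 1 only at onset frames
    for inst in pm.instruments:
        for note in inst.notes:
            lo, hi = int(note.start * fps), int(note.end * fps)
            notes[lo:hi + 1, note.pitch - 21] = 1.0   # MIDI 21..108 -> 0..87
            onsets[lo, note.pitch % 12] = 1.0
    return notes, onsets   # chroma targets: octave-wise max over `notes`
```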
We use dropout with a ratio of 0.5 and weight regularization with a value of 10⁻⁴ in each LSTM layer, which effectively improves generalization. We optimized the networks with stochastic gradient descent to minimize a binary cross-entropy loss function. The learning rate was initially set to 0.1 and iteratively divided by 3 when no improvement of the validation loss was observed for 10 epochs (i.e. early stopping). The training was completed after six iterations. Examples of the AMT outputs are presented in Figure 3. To verify the performance, the frame-wise transcription performance of the 88 note AMT system was measured on the test sets with the same metrics used in [11, 12]. The resulting F-measure was 0.7285 on average, which is better than the results of RNNs with basic units and lower than those of the fine-tuned frame-wise DNN and CNN [11, 12].
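In PyTorch terms this training setup might look roughly as follows; note that `weight_decay` here applies L2 regularization to all parameters, whereas the paper regularizes each LSTM layer, so this is an approximation:

```python
import torch

model = TranscriptionRNN(n_out=88)             # from the sketch in Sec. 2.2
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=1 / 3,
                                                   patience=10)
bce = torch.nn.BCELoss()

def train_step(x, y):                          # x, y: (batch, 50 frames, dims)
    opt.zero_grad()
    loss = bce(model(x), y)                    # y is the 0/1 label array
    loss.backward()
    opt.step()
    return loss.item()

# after each epoch: sched.step(val_loss)       # LR /= 3 when val loss stalls
```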

Figure 2. Examples of the output of the bidirectional LSTM network that predicts chroma: (a) ground truth, (b) without overlapping segmentation, (c) with overlapping segmentation. Dotted lines indicate the boundaries of segments.

Figure 3. (a) Excerpt of the music score of Beethoven's 8th sonata. (b)-(d) Outputs of the networks: (b) 88 note, (c) chroma, (d) chroma onset.

2.3 Alignment

The AMT systems return two types of MIDI-level features. We combine them and compute a similarity matrix between the AMT outputs and the score MIDI. MIDI files that correspond to the score are also converted into the 88 note (or chroma) and chroma onset representations. We used the Euclidean distance to measure the similarity between the two combined representations and compute the similarity matrix. We then applied the FastDTW algorithm [13], which is an approximate method for dynamic time warping (DTW). FastDTW uses an iterative multilevel approach with constraint windows to reduce the complexity. Because of the high frame rate of the features, it is necessary to employ a low-cost algorithm: while the original DTW algorithm has O(N²) time and space complexity, FastDTW operates in O(N) complexity with almost the same accuracy. Müller et al. [14] also examined a similar multi-level DTW for the audio-to-score alignment task and reported similar results compared to the original DTW. The radius parameter of the FastDTW algorithm, which defines the size of the window used to find an optimal path at each resolution refinement, was set to 10 in our experiment.
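For instance, using the `fastdtw` Python package (our choice of implementation; the paper does not name one), the alignment step could look like:

```python
import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

def align(perf_feats, score_feats, radius=10):
    """perf_feats, score_feats: (frames, dims) arrays holding the
    concatenated 88 note (or chroma) and chroma onset features."""
    _, path = fastdtw(perf_feats, score_feats, radius=radius, dist=euclidean)
    return np.asarray(path)   # (k, 2): performance frame -> score frame
```

Each row of the returned path maps a 10 ms performance frame to a score MIDI frame, from which note-level onset correspondences can be read off.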

3. EXPERIMENTS
3.1 Dataset
We used the MAPS dataset [15], specifically the 'MUS' subset that contains full pieces of piano music, for training and evaluation. Each piece consists of audio files of piano and a ground-truth MIDI annotation. The audio files were generated from a MIDI file, either through virtual instruments or through automatic playing on a Disklavier piano. Nine combinations of instruments and recording conditions were applied to each piece of piano music. This helped our model avoid overfitting to a specific piano tone. The MIDI files served as the ground-truth annotation of the corresponding audio, but some of them (ENSTDkCl, ENSTDkAm) are sometimes temporally inaccurate, by more than 65 ms as described in [16].
3.2 Evaluation method

To evaluate the proposed method, we carried out audio-to-score alignment experiments using the MAPS dataset. Because our method requires data for training, we conducted the experiment with 4-fold cross validation using the publicly available train/test splits from [12]¹. For each fold, 43 pieces were detached from the training set and used as a validation set. As a result, each fold was composed of 173 / 43 / 54 pieces for the train / validation / test sets respectively, as processed in [11].

¹ http://www.eecs.qmul.ac.uk/sss31/TASLP/info.html

To make the MIDI files usable as if they were score MIDI, we distorted the aligned MIDI files by changing the duration of every event. This type of evaluation for the alignment task, using temporal distortion of MIDI, was also employed in previous research [6, 17, 18]. A number randomly selected in [0.7, 1.3] was multiplied to modify the tempo of each interval; a sketch of such a distortion is given below. Employing this scheme of temporal distortion prevents the alignment path from being trivial.
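A sketch of this distortion; the fixed interval length over which each random tempo factor applies is our assumption, since the paper does not specify how intervals are segmented:

```python
import numpy as np

def distort_tempo(times, seg_len=4.0, seed=0):
    """times: sorted MIDI event times (s). Each seg_len-second interval
    is stretched by an independent random factor in [0.7, 1.3]."""
    rng = np.random.default_rng(seed)
    out = np.zeros(len(times))
    t_new, t_old = 0.0, 0.0
    factor, next_switch = rng.uniform(0.7, 1.3), seg_len
    for i, t in enumerate(np.asarray(times)):
        while t > next_switch:                 # enter the next tempo interval
            t_new += (next_switch - t_old) * factor
            t_old, next_switch = next_switch, next_switch + seg_len
            factor = rng.uniform(0.7, 1.3)
        out[i] = t_new + (t - t_old) * factor
    return out
```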
3.3 Compared Algorithms

For comparison, we reproduced two other alignment algorithms which suggested novel features for the alignment task: the algorithm proposed by Ewert et al. [6] and the offline algorithm by Carabias-Orti et al. [19]. We performed experiments on the same test set using the FastDTW algorithm only, without any post-processing. Ewert's algorithm is a representative example that employs a hand-crafted chromagram and onset features based on audio filter bank responses. Carabias-Orti's algorithm employs non-negative matrix factorization to learn a spectral basis for each note combination from the spectrogram. The latter is designed only for audio-to-audio alignment, while Ewert's algorithm can be applied to both audio-to-audio and audio-to-MIDI alignment. Therefore, we made a synthesized version of the distorted MIDI using Synthogy Ivory II and employed it as an input, and tested Ewert's algorithm in both the audio and MIDI cases. The temporal frame rate of the features was adjusted to 100 fps for both algorithms.

For the alignment step of Ewert's algorithm, we used the same FastDTW algorithm. However, since FastDTW cannot be directly applied to Carabias-Orti's algorithm due to its own distance calculation method, we applied a classic DTW algorithm, which employs the entire frame-wise distance matrix. Because of memory limitations, when reproducing Carabias-Orti's algorithm we excluded 35 pieces in the test sets that are longer than 400 seconds.

After we obtained the alignment path through DTW, the absolute temporal errors between the estimated note onsets and the ground truth were measured. For each piece of music in the test set, the mean of the temporal errors and the ratio of correctly aligned notes under varying thresholds were used to summarize the results; a minimal sketch of these measures follows.
4. RESULTS AND DISCUSSION

The result of the audio-to-score alignment is shown in Figure 4, which presents the precision of the different algorithms as a function of the error threshold. Typically, a tolerance window of 50 ms is used for evaluation. However, because most of the notes were aligned within a 50 ms temporal threshold, we varied the tolerance window from 0 ms to 200 ms in 10 ms steps.

Figure 4. Ratio of correctly aligned onsets as a function of the threshold. Some data points with precision lower than 80% are not shown in this figure.

Overall, our 88 note framework combined with the chroma onsets achieved the best result. Even with zero threshold, which means the best match at the resolution of our system (10 ms), our proposed model with 88 note output exactly aligned 52.55% of the notes. The ratio increased to 91.60% with a 10 ms threshold. The proposed framework using chroma showed similar precision to the 88 note framework, but the accuracy was lower. Compared to Ewert's algorithm with hand-crafted features, our method shows significantly better performance, especially in the high-resolution region: over a 100 ms threshold, our framework with chroma and Ewert's method show similar precision, but in the intervals under 50 ms the difference becomes significant.
                                  Mean   Median      Std  ≤ 10 ms  ≤ 30 ms  ≤ 50 ms  ≤ 100 ms
Proposed with onset (chroma)     12.83     6.40    56.22    92.01    97.44    98.31     98.98
Proposed with onset (88 note)     8.62     5.57    31.14    91.60    98.00    98.97     99.61
Proposed w/o onset (chroma)      48.01    27.96   152.06    60.66    84.65    89.36     93.72
Proposed w/o onset (88 note)     25.31    18.69    63.26    56.39    86.42    93.05     97.48
Ewert et al. (audio-to-MIDI)     16.44    13.64    32.52    71.78    91.38    95.50     98.03
Ewert et al. (audio-to-audio)    14.66    11.71    25.38    71.53    92.43    96.91     99.13
Carabias-Orti et al.            131.31    49.96   305.52    23.58    49.40    69.30     91.60

Table 1. Results of the piecewise onset errors. The mean, median, and standard deviation of the errors are in milliseconds. The right columns give the ratio of notes (%) that are aligned within onset errors of 10 ms, 30 ms, 50 ms and 100 ms, respectively.

Note that we penalized our framework relative to the audio-to-audio scenario of Ewert's method, because the audio-to-audio method takes advantage of identical note velocities. We suppose that Ewert's algorithm performed better in the audio-to-audio scenario than in the audio-to-MIDI one for the same reason. The NMF-based approach gives lower scores than the others. We assume that the difference mainly comes from the usage of onset features, because our method also shows much lower performance without onset features, as shown in Figure 5.

Note that even though the dataset for evaluation is different, the results of the two reproduced algorithms were similar to the results in their original papers. The onset error of Ewert's algorithm on piano music was reported with a 19 ms mean and a 26 ms standard deviation [6]. The result introduced in the original Carabias-Orti paper [19] shows a quite large difference in terms of the mean piecewise error, but we assume that the difference is due to the change of test set. The align rates of the original and our reproduced results were similar (50 ms: 74% vs. 69%; 100 ms: 90% vs. 92%; 200 ms: 95% vs. 96%). Hence, we assumed that our reproduction was reliable for the comparison.

In the second experiment, we investigated the effect of the chroma onset features. We removed the onset features from each model, and compared the mean onset errors and their distribution. As can be seen in Figure 5, the removal of the onset features significantly decreased the performance. Thus we conclude that employing chroma onset features can compensate for the limitation of the normalized transcription features. As we stated in the first section, the 88 note representation shows much better results than the chroma output features, especially without onsets.

Table 1 shows the statistics of the piecewise onset errors. It shows that combining the chroma onset feature brought a substantial improvement to the proposed method: the median of the piecewise onset errors decreased from 18.69 ms to 5.57 ms when applying chroma onsets to the 88 note system. The importance of the note onset feature for aligning piano music was also examined in [6]. On the other hand, Carabias-Orti's algorithm focuses on dealing with various instruments and an on-line alignment scenario, strengths that could not be fully appreciated in our experiment.

Figure 5. Comparison of mean onset errors between models with/without chroma onset features. Outliers above 60 ms of error are omitted in this figure. Each number on top of a box indicates the median value in ms.

Throughout the experiments, we showed that the transcription features derived from the neural networks can be adopted as appropriate features for the alignment task. As we expected, using the AMT outputs as alignment features is effective even though the transcription result itself is not perfect. For a fair comparison of the results, we should note that our framework is heavily dependent on the training set, unlike the two other compared methods. Currently our framework has only been applied to piano music through the MAPS dataset. This means that we cannot assure that the current learning result will yield reliable performance when applied to recordings of other instruments or to a piano with a different timbre.
5. CONCLUSIONS

In this paper, we proposed a framework for audio-to-score alignment of piano music using automatic music transcription. We built two AMT systems based on bidirectional LSTMs that predict note existence and chroma onsets. They provide MIDI-level features that can be compared with the score MIDI and used by the alignment algorithm. Evaluating our framework on the MAPS dataset showed that our features not only serve as appropriate features for the alignment task, but also bring a significant improvement over previous works. The 88 note model with chroma onsets works best, resulting in 8.62 ms of mean onset error. We also showed that the chroma onset feature plays an important role among the features. In fact, the successful alignment performance was possible not only because of the identical recording conditions between the training and test sets, but also because the scores in the evaluation set do not contain any structural errors. Thus, further studies including post-processing should follow to apply our system to the real world. Also, because our system was only evaluated on a specific dataset, the generalization capacity of our model should be further investigated. However, because our framework does not rely on a specific condition, we believe that we can extend it to general piano recordings or other instruments. Also, employing other network models such as CNNs should be tested.
6. REFERENCES

[1] A. Arzt, G. Widmer, and S. Dixon, "Automatic page turning for musicians via real-time machine listening," in Proceedings of the 18th European Conference on Artificial Intelligence (ECAI), 2008, pp. 241–245.

[2] R. B. Dannenberg and C. Raphael, "Music score alignment and computer accompaniment," Communications of the ACM, vol. 49, no. 8, pp. 38–43, 2006.

[3] G. Widmer, S. Dixon, W. Goebl, E. Pampalk, and A. Tobudic, "In search of the Horowitz factor," AI Magazine, vol. 24, no. 3, pp. 111–130, 2003.

[4] A. Friberg and J. Sundberg, "Perception of just-noticeable time displacement of a tone presented in a metrical sequence at different tempos," The Journal of the Acoustical Society of America, vol. 94, no. 3, pp. 1859–1859, 1993.

[5] S. Dixon and G. Widmer, "MATCH: a music alignment tool chest," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2005, pp. 492–497.

[6] S. Ewert, M. Müller, and P. Grosche, "High resolution audio synchronization using chroma onset features," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 1869–1872.

[7] A. Arzt, S. Böck, S. Flossmann, H. Frostel, M. Gasser, C. C. S. Liem, and G. Widmer, "The piano music companion," Frontiers in Artificial Intelligence and Applications, vol. 263, no. 1, pp. 1221–1222, 2014.

[8] S. Böck and M. Schedl, "Polyphonic piano note transcription with recurrent neural networks," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 121–124.

[9] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[10] R. J. Williams and J. Peng, "An efficient gradient-based algorithm for on-line training of recurrent network trajectories," Neural Computation, vol. 2, pp. 490–501, 1990.

[11] R. Kelz, M. Dorfer, F. Korzeniowski, S. Böck, A. Arzt, and G. Widmer, "On the potential of simple framewise approaches to piano transcription," in Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2016, pp. 475–481.

[12] S. Sigtia, E. Benetos, and S. Dixon, "An end-to-end neural network for polyphonic piano music transcription," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 5, pp. 927–939, 2016.

[13] S. Salvador and P. Chan, "FastDTW: Toward accurate dynamic time warping in linear time and space," Intelligent Data Analysis, vol. 11, pp. 561–580, 2007.

[14] M. Müller, H. Mattes, and F. Kurth, "An efficient multiscale approach to audio synchronization," in Proc. International Conference on Music Information Retrieval (ISMIR), 2006, pp. 192–197.

[15] V. Emiya, R. Badeau, and B. David, "Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle," IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 6, pp. 1643–1654, 2010.

[16] S. Ewert and M. Sandler, "Piano transcription in the studio using an extensible alternating directions framework," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 11, pp. 1983–1997, 2016.

[17] M. Müller, H. Mattes, and F. Kurth, "An efficient multiscale approach to audio synchronization," in Proc. International Conference on Music Information Retrieval (ISMIR), 2006, pp. 192–197.

[18] C. Joder, S. Essid, and G. Richard, "A conditional random field framework for robust and scalable audio-to-score matching," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2385–2397, 2011.

[19] J. J. Carabias-Orti, F. J. Rodríguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada, "An audio to score alignment framework using spectral factorization and dynamic time warping," in Proc. International Conference on Music Information Retrieval (ISMIR), 2015, pp. 742–748.
