
MUSICAL INSTRUMENT RECOGNITION

BY Hrishikesh P. Kanjalkar

DEPARTMENT OF ELECTRONICS AND TELECOMMUNICATION

Index

Chapter 1 Introduction
1.1 Motivation for the work
1.2 Defining the problem
1.3 Theory of music
1.4 Musical notes and scales
Chapter 2 Literature review
2.1 Musical instrument recognition systems
2.2 Comparison between artificial systems and human abilities
2.3 Physical properties of musical instruments
Chapter 3 Feature Extraction
3.1 Temporal Features
3.2 Spectral Features
Chapter 4 Classifier
Chapter 6 Conclusion
Chapter 7 Bibliography

Chapter 1

Introduction

1.1 Motivation for the work.


Motivation relates to the generic problem of sound source recognition and analysis of auditory scenes. The idea is to compile a toolbox of generic feature extractors and classification methods that can be applied to a variety of audio-related analysis and understanding problems. In fact, some of the methods implemented for this study and the knowledge gained have already been used in [2]. Secondly, there has been a great deal of research concerning the automatic annotation of music files. Musical instrument recognition can be of great help to musicians in obtaining information about the notes being played. Just as Google is used for searching text, we could have a music search engine where a note is played and basic information about it is retrieved. This would give musicians a better platform for their learning process.

1.2 Defining the problem.

This project aims to accurately detect the instrument family and instrument of a signal. To accomplish this, we intend to record and analyze the entire range of a few instruments, and then use this analysis to decompose monophonic (one-instrument) signals into their component instruments. The project deals with the recognition of a musical instrument from a played note. Software is developed that is capable of listening to the recording and classifying the instrument. Musical (audio) data is redundant in nature: one second of signal contains 44100 samples (the sampling frequency is 44.1 kHz). The input signal therefore requires feature extraction, which helps in recognizing the instrument.

1.3 Theory of Music.

For those unfamiliar with music, we offer a (very) brief introduction to the technical aspects of music. The sounds you hear over the airwaves and in all manner of places may be grouped into 12 superficially disparate categories. Each category is labeled a "note" and given an alphabetic symbol. That is, the letters A through G represent seven of the notes and the other five are represented by appending either a pound sign (#, or sharp) or something that looks remarkably similar to a lower-case b (also called a flat).

Although these notes were conjured in an age where the modern theory of waves and optics was not dreamt of even by the greatest of thinkers, they share some remarkable characteristics. Namely, every note that shares its name with another (notes occupying separate "octaves," with one sounding higher or lower than the other) has a frequency that is some rational multiple of the frequency of the notes with which it shares a name. More simply, an A in one octave has a frequency twice that of an A one octave below. As it turns out, every note is related to every other note by a common multiplicative factor. To run the full gamut, one need only multiply a given note by the 12th root of two n times to find the nth note "above" it (i.e. going up in frequency).

1.4 Musical notes, intervals and scales

Musical notes are the symbols or signs that represent the frequencies, durations and timings of the elementary musical sounds. It can be said that musical notes play a similar role to the alphabet of a language; they allow music compositions and scores to be recorded in symbolic form and read and played by musicians. The systems of musical notes also allow standardization of musical instruments and their tuning frequencies.

Table 1. The main musical note frequencies and symbols.

Note that on a piano the sharp (or flat) notes are played on the black keys. In Table 1 the notes are ordered according to the C-major scale (C, D, E, F, G, A, B).

The Western musical note system, as shown in the table, is based on seven basic notes, also known as the natural notes: C, D, E, F, G, A, B. There are also five sharp notes: C#, D#, F#, G#, A#, and five flat notes: Db, Eb, Gb, Ab, Bb. The hash sign # denotes a sharp note and the sign b denotes a flat note. The sharp version of a note is a semitone higher than that note, e.g. C# = 2^(1/12) x C, whereas the flat version of a note is a semitone lower than that note, e.g. Db = D / 2^(1/12).

Musical Scales

In music theory, a musical scale is a specific pattern of the pitch ratios of successive notes. The pitch difference between successive notes is known as a scale step. The scale step may be a constant or it may vary. Musical scales are usually known by the type of interval and scale step that they contain. Musical scales are typically ordered in terms of the pitch or frequencies of notes. Some of the most important examples of musical scales are the chromatic scale, diatonic scale and Pythagorean scale. Musical materials are written in terms of a musical scale.

Fig 1. The frequencies of the keys on a piano.

Note that the piano keys are arranged in groups of 12. Each set of 12 keys spans an octave, which is a doubling of frequency. For example, the frequency of A_N is 2^N x A0, i.e. N octaves higher than A0; e.g. A7 = 2^7 x 27.5 Hz = 3520 Hz. Black keys correspond to sharp notes.
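As a quick numerical check of this equal-tempered relation, the following MATLAB fragment (an illustrative sketch; the variable names are ours, not part of the original work) computes the frequency of a note a given number of semitones above A0:

% Equal temperament: each semitone multiplies the frequency by 2^(1/12).
f_A0 = 27.5;              % frequency of A0 in Hz
semitone = 2^(1/12);      % ratio between adjacent notes
n = 12*7;                 % 7 octaves = 84 semitones above A0
f_A7 = f_A0 * semitone^n  % = 27.5 * 2^7 = 3520 Hz, matching Fig 1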

Chapter 2

Literature Survey

2.1 Musical instrument recognition systems.

Various attempts have been made to construct automatic musical instrument recognition systems. Researchers have used different approaches and scopes, achieving different performances. Most systems have operated on isolated notes, often taken from the same, single source, and covering a very small pitch range. The most recent systems have operated on solo music taken from commercial recordings. The studies using isolated tones and monophonic phrases are the most relevant to our scope.

Recognition of single tones

These studies have used isolated notes as test material, with varying numbers of instruments and pitches.

Studies using one example of each instrument

Kaminskyj and Materka [1] used features derived from a root-mean-square (RMS) energy envelope via PCA, and a neural network or a k-nearest neighbor (k-NN) classifier, to classify guitar, piano, marimba and accordion tones over a one-octave band. Both classifiers achieved a good performance, approximately 98%. However, strong conclusions cannot be made, since the instruments were very different, there was only one example of each instrument, the note range was small, and the training and test data were from the same recording session. More recently, Kaminskyj [1] extended the system to recognize 19 instruments over a three-octave pitch range from the McGill collection [2]. Using features derived from the RMS-energy envelope and a constant-Q transform [3], an accuracy of 82% was reported using a classifier combination scheme.

Table 2: Summary of recognition percentages of isolated note recognition systems using only one example of each instrument.

Study | Percentage correct | Number of instruments
[Kaminskyj95] | 98 | 4 (guitar, piano, marimba, accordion)
[Kaminskyj00] | 82 | 19
[Fujinaga98] | 50 | 23
[Fraser99] | 64 | 23
[Fujinaga00] | 68 | 23
[Martin98] | 72 (93) | 14
[Kostek99] | 97 | 4 (bass trombone, trombone, English horn, contra bassoon)
[Kostek99] | 81 | 20
[Kostek01] | 93 | 4 (oboe, trumpet, violin, cello)
[Kostek01] | 90 | 18

Fujinaga and Fraser trained a k-NN classifier with features extracted from 1338 spectral slices of 23 instruments playing a range of pitches [4]. Using leave-one-out cross-validation and a genetic algorithm for finding good feature combinations, a recognition accuracy of 50% was obtained with 23 instruments. When the authors added features relating to the dynamically changing spectral envelope, and the velocity of the spectral centroid and its variance, the accuracy increased to 64% [4]. Finally, after small refinements and the addition of spectral irregularity and tristimulus features, an accuracy of 68% was reported [4].

Martin and Kim reported a system operating on the full pitch ranges of 14 instruments [8]. The samples were a subset of the isolated notes in the McGill collection [2]. The best classifier was the k-NN, enhanced with Fisher discriminant analysis to reduce the dimensionality of the data, and a hierarchical classification architecture for first recognizing the instrument families. Using 70% / 30% splits between the training and test data, they obtained a recognition rate of 72% for individual instruments and, after finding a 10-feature set giving the best average performance, an accuracy of 93% in classification between five instrument families. Kostek calculated several different features relating to the spectral shape and onset characteristics of tones taken from chromatic scales with different articulation styles [7]. A two-layer feed-forward neural network was used as a classifier. The author reports excellent recognition percentages with four instruments: the bass trombone, trombone, English horn and contra bassoon. However, the pitch of the note was provided to the system, and the training and test material were from different channels of the same stereo recording setup. Kostek and Czyzewski also tried using wavelet-based features for musical instrument recognition, but their preliminary results were worse than with the earlier features [11]. In their most recent paper, the same authors expanded the feature set to include 34 FFT-based features and 23 wavelet features [12]. A promising percentage of 90% with 18 classes is reported; however, the leave-one-out cross-validation scheme probably increases the recognition rate. The results obtained with the wavelet features were almost as good as with the other features. Table 2 summarizes the recognition percentages reported in isolated-note studies. The most severe limitation of all these studies is that they used only one example of each instrument. This significantly decreases the generalizability of the results, as we will demonstrate with our system in a later chapter. The study described next is the only study using isolated tones from more than one source and represents the state-of-the-art in isolated tone recognition.

2.2 Comparison between artificial systems and human abilities.

The current state of the art in artificial sound source recognition is still very limited in its practical applicability. Under laboratory conditions, systems are able to successfully recognize a fairly wide set of sound sources. However, when the conditions become more realistic, i.e. the material is noisy, recorded in different locations with different setups, or there are interfering sounds, the systems can successfully handle only a small number of sound sources. The main challenge for the future is to build systems that can recognize wider sets of sound sources with increased generality and in realistic conditions [13]. In general, humans are superior with regard to all of these criteria [13]. They are able to generalize between different instances of an instrument, and to recognize more abstract classes such as bowed string instruments. People are robust recognizers because they are able to focus on the sound of a single instrument in a concert, or a single voice within a babble. In addition, they are able to learn new sound sources easily, and can become experts in recognizing, for example, orchestral instruments. The recognition accuracy of human subjects gradually worsens as the level of background noise and interfering sound sources increases.

Only in limited contexts, such as discriminating between four woodwind instruments, have computer systems performed comparably to human subjects [14]. For more general tasks, a lot of work remains to be done.

2.3 Physical properties of musical instruments.

Traditionally, the instruments are divided into four classes: the strings, the brass, the keyboards and the woodwinds. The sounds of the instruments within each family are similar, and humans often make confusions within, but not as easily between, these families. Examples include confusing the violin and viola, the oboe and English horn, or the trombone and French horn [13]. In the following, we briefly present the different members of each family and their physical build.

The strings

The members of the string family include the violin, viola, cello and double bass, in order of increasing size, together with the guitar. These five form a tight perceptual family, and human subjects consistently make confusions within it [13]. The string instruments consist of a wooden body with a top and back plate and sides, and an extended neck. The strings are stretched along the neck and over a fingerboard. At one end the strings are attached to the bridge, and at the other end to the tuning pegs which control the string tension. The strings can be excited by plucking them with the fingers, drawing a bow over them, or hitting them with the bow (the martele style of playing). The strings themselves move very little air; the sound is produced by the vibration of the body and the air within it [9]. These are set into motion by the string vibration, which is transmitted to the body via the coupling through the bridge. The motion of the top plate is the source of most of the sound, and is a result of the interaction between the driving force from the bridge and the resonances of the instrument body [9].

The brass

The members of the brass family considered in this project include the trumpet, French horn and tuba. The brass instruments have the simplest acoustic structure among the families considered. They consist of a long, hard-walled tube with a flaring bell attached at one end. The sound is produced by blowing at the other end of the tube, and the pitch of the instrument can be varied by changing the lip tension. The player can use mutes to alter the sound or, with the French horn, insert a hand into the bell.

The woodwind

The woodwind family is more heterogeneous than the string and brass families, and there exist several acoustically and perceptually distinct subgroups [13]. The subgroups are the single-reed clarinets, the double reeds, the flutes with an air reed, and the single-reed saxophones. In wind instruments, the single or double reed operates in a similar way to the player's lips in brass instruments, allowing puffs of air into a conical tube where standing waves are then created. The effective length of the tube is varied by opening and closing tone holes, changing the pitch of the played note [10].

Flutes The members of the flute or air reed family include the piccolo, flute, alto flute and bass flute in the order of increasing size. They consist of a more or less cylindrical pipe, which has finger holes along its length. The pipe is stopped at one end, and has a blowing hole near the stopped end [10].

In this chapter I have discussed papers which highlight a few dominant features for the classification of musical instruments. They are listed below.

From these papers I have studied temporal, spectral and cepstral features which are helpful in musical instrument recognition, and also considered various classification methods. The features I have considered are:

1. Temporal features
I. Decay time
II. Energy
III. Zero crossing

2. Spectral features
I. Spectral centroid
II. Spectral roll-off
III. Spectral flux

k-NN is used as the classifier.

Chapter 3

Methodology

3.1. System Block Diagram:


Fig 3.1 System Block Diagram (Music Samples -> Feature Calculation -> Classification, using the Trained Model -> Decision Box)

Database Creation:
For this work we require a database covering several families of instruments. The instruments considered are listed below.

Family | Instruments
String | Guitar, Violin, Harp, Santoor, Sitar, Sarod, Banjo
Keyboard | Piano, Accordion
Woodwind | Flute, Oboe, Clarinet
Brass | French Horn, Trumpet, Tuba

Table 3.2: Instruments and their respective families

The notes are 16-bit, mono-channel, with a sampling frequency of 44.1 kHz. All the audio samples were recorded in .wav format. The platform used for this work is MATLAB (R2009b).

(Recording was done by a professional in a studio)
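As an indication of how such recordings can be loaded for analysis, a minimal MATLAB sketch is given below. The file name is only illustrative, and on the R2009b platform mentioned above the reader function is wavread (audioread in newer releases):

% Read one recorded note (16-bit, mono, 44.1 kHz .wav); the file name is hypothetical.
[x, fs] = wavread('sitar_C4.wav');   % use audioread on recent MATLAB versions
x = x(:, 1);                         % keep a single channel if the file happens to be stereo
t = (0:length(x)-1) / fs;            % time axis in seconds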

Block diagram of Training Model:

Fig 3.3 Block diagram of the training model

Feature extraction:


The feature extraction stage, also called the front-end processor, generates the training vectors. Our main intention is to investigate the performance of different feature schemes and to find a good feature combination for a robust instrument classifier. Two categories of features are considered: temporal and spectral. The features studied for classification are listed in the table below.

Temporal Features | Energy, ZCR, ZCRM, ZCRMD, long attack time, ADSR, amplitude envelope
Spectral Features | Spectral distribution, harmonicity, spectral centroid, roll-off, skewness, flux, flatness, crest, spectral mean deviation, fundamental frequency, timbre, spectral range, bandwidth, MFCC, LPC, relative power

Table 3.4 Features studied

Features in time domain:


To extract the time-domain features, the music sound samples were segmented into 10-ms frames. The following are some important time-domain features used in this project.

I. Energy of the signal
II. Zero crossing
III. Attack time
IV. Decay time
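Before these features are computed, each recording is split into the 10-ms frames mentioned above. A minimal sketch of the segmentation, assuming the signal x and sampling rate fs from the loading sketch earlier, and using non-overlapping frames for simplicity:

% Split the signal x (sampled at fs) into non-overlapping 10-ms frames.
frameLen = round(0.010 * fs);                                      % 441 samples at 44.1 kHz
numFrames = floor(length(x) / frameLen);
frames = reshape(x(1:numFrames*frameLen), frameLen, numFrames);    % one frame per column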

1. Energy

The energy of each frame is calculated as the sum of the squared sample values, E = sum over n of x(n)^2, where the sum runs over the samples in the analysis window.

If the energy in the analysis window is high, the implication is that the frame is voiced (vowel/diphthong/semi-vowel/voiced consonant); if the energy is low, the frame is unvoiced (unvoiced consonant/silence). Thus the energy analysis helps to detect voiced and silent regions, and the silence and non-silence parts of the signal.
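Continuing the sketch above, the per-frame energy can be computed as follows (the silence threshold is arbitrary and only illustrative):

% Short-time energy: sum of squared samples in each 10-ms frame.
energy = sum(frames.^2, 1);               % 1 x numFrames vector
energyDb = 10*log10(energy + eps);        % log scale is often easier to threshold
isSilent = energy < 0.01 * max(energy);   % crude silence detection; threshold is arbitrary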

2. Zero crossing:
A zero crossing is said to occur between samples x(t) and x(t+1) if sign(x(t)) != sign(x(t+1)). The zero-crossing count of a frame is the number of such sign changes within that frame.
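A corresponding per-frame zero-crossing count, again as a sketch using the frames matrix defined earlier:

% Zero-crossing rate: count sign changes between consecutive samples in each frame.
s = sign(frames);
s(s == 0) = 1;                            % treat exact zeros as positive
zc = sum(abs(diff(s, 1, 1)) > 0, 1);      % crossings per 10-ms frame
zcr = zc / frameLen;                      % optionally normalise by the frame length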

3. Attack time:
The attack (rise) time is taken as the time difference between the time at the end of the attack and the backtracked position where the magnitude is 25% of the magnitude at the end of the attack:

Ta = t1 - t2

where Ta is the attack time, t1 is the time at which the amplitude of the signal is maximum, and t2 is the time at which the amplitude of the signal is 25% of the maximum (before the maximum).

4. Decay time:
The decay time is obtained as the time difference between the end of the attack and the forward position where the magnitude is 25% of the magnitude at the end of the attack:

Td = t3 - t4

where Td is the decay time, t3 is the time at which the amplitude of the signal is 25% of the maximum after the maximum amplitude has occurred, and t4 is the time at which the signal has its maximum amplitude.
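The attack and decay times can be measured from an amplitude envelope. The sketch below uses the per-frame RMS as the envelope, which is only one of several possible choices, and assumes the note decays below the 25% threshold within the recording:

% Attack and decay time from a per-frame RMS envelope (25% thresholds as defined above).
env = sqrt(mean(frames.^2, 1));                 % amplitude envelope, one value per frame
hop = frameLen / fs;                            % frame duration in seconds
[peakVal, peakIdx] = max(env);                  % end of attack
i2 = find(env(1:peakIdx) >= 0.25*peakVal, 1, 'first');                  % 25% point before the peak
i3 = peakIdx - 1 + find(env(peakIdx:end) <= 0.25*peakVal, 1, 'first');  % 25% point after the peak
Ta = (peakIdx - i2) * hop;                      % attack time in seconds
Td = (i3 - peakIdx) * hop;                      % decay time in seconds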

3.2 Spectral features

While some instruments generate sounds which have energy concentrated in the lower frequency bands, there are other instruments which produce sounds with energy almost evenly distributed among lower, mid, and higher frequency bands.

The spectral features which have been considered in the project are:
I. Spectral centroid
II. Spectral roll-off
III. Spectral flux

1. Spectral centroid:

The spectral centroid is a measure used in digital signal processing to characterise a spectrum. It indicates where the "center of mass" of the spectrum is. Perceptually, it has a robust connection with the impression of "brightness" of a sound. It is calculated as the weighted mean of the frequencies present in the signal, determined using a Fourier transform, with their magnitudes as the weights:

Centroid = sum over n of f(n) x(n) / sum over n of x(n)

where x(n) represents the weighted frequency value, or magnitude, of bin number n, and f(n) represents the centre frequency of that bin.
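A sketch of the centroid computation for a single frame (a window such as a Hamming window would normally be applied before the FFT; it is omitted here to keep the example short):

% Spectral centroid of one frame.
frame = frames(:, 10);                    % an arbitrary frame, for illustration
N = length(frame);
X = abs(fft(frame));                      % magnitude spectrum
half = floor(N/2);
X = X(1:half);                            % keep the non-redundant half
f = (0:half-1)' * fs / N;                 % centre frequency of each bin in Hz
centroid = sum(f .* X) / sum(X);          % weighted mean frequency in Hz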

2. Spectral roll-off:

The roll-off is another measure of spectral shape. It is defined as the frequency below which a given percentage (usually 85%) of the total spectral energy resides.

3. Spectral flux:

This feature measures the frame-to-frame spectral difference; in short, it captures changes in spectral shape. It is defined as the squared difference between the normalized magnitude spectra of successive frames.
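The roll-off and flux can be sketched in the same framework, reusing X and f from the centroid example; Xprev denotes the magnitude spectrum of the previous frame, computed in the same way:

% Spectral roll-off: frequency below which 85% of the spectral energy lies.
cumEnergy = cumsum(X.^2);
rollIdx = find(cumEnergy >= 0.85*cumEnergy(end), 1, 'first');
rolloff = f(rollIdx);                     % roll-off frequency in Hz

% Spectral flux: squared difference between normalised magnitude spectra of successive frames.
Xn = X / (sum(X) + eps);
Xp = Xprev / (sum(Xprev) + eps);          % Xprev: spectrum of the previous frame
flux = sum((Xn - Xp).^2);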

Chapter 4

Classifier

Various classifiers are available, among them:
SVM (Support Vector Machine)
HMM (Hidden Markov Model)
k-NN (k-Nearest Neighbour)

What is the kNN classifier?

Instance-based classifiers such as the kNN classifier operate on the premise that classification of unknown instances can be done by relating the unknown to the known according to some distance/similarity function. The intuition is that two instances far apart in the instance space, as defined by the appropriate distance function, are less likely to belong to the same class than two closely situated instances.

1. The learning process

Unlike many artificial learners, instance-based learners do not abstract any information from the training data during the learning phase. Learning is merely a question of encapsulating the training data. The process of generalization is postponed until it is absolutely unavoidable, that is, at the time of classification. This property has led to instance-based learners being referred to as lazy learners, whereas classifiers such as feed-forward neural networks, where proper abstraction is done during the learning phase, are often called eager learners.

2. Classification

Classification (generalization) using an instance-based classifier can be a simple matter of locating the nearest neighbour in instance space and labelling the unknown instance with the same class label as that of the located (known) neighbour. This approach is often referred to as a nearest neighbour classifier. The downside of this simple approach is the lack of robustness that characterizes the resulting classifiers. The high degree of local sensitivity makes nearest neighbour classifiers highly susceptible to noise in the training data. More robust models can be achieved by locating k, where k > 1, neighbours and letting the majority vote decide the outcome of the class labelling. A higher value of k results in a smoother, less locally sensitive, function. The nearest neighbour classifier can be regarded as a special case of the more general k-nearest neighbours classifier, hereafter referred to as a kNN classifier. The drawback of increasing the value of k is of course that as k approaches n, where n is the size of the instance base, the performance of the classifier will approach that of the most straightforward statistical baseline: the assumption that all unknown instances belong to the class most frequently represented in the training data.

Fig 4. Example of kNN

3. Example of k-NN classification

The test sample (green circle) should be classified either into the first class of blue squares or into the second class of red triangles. If k = 3 it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).

4. Algorithm

The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabelled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.

Usually the Euclidean distance is used as the distance metric. Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbour or Neighbourhood Components Analysis. A drawback of the basic majority-voting classification is that classes with more frequent examples tend to dominate the prediction of the new vector, as they tend to appear among the k nearest neighbours simply due to their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbours.
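As an illustration of the classification step, the following MATLAB sketch implements a plain Euclidean-distance kNN with a majority vote. The variable names (trainFeat, trainLabel, testFeat) are placeholders for whatever feature vectors are finally used, and the function should be saved as knn_classify.m:

% k-NN classification with Euclidean distance and majority vote.
% trainFeat:  M x D matrix of training feature vectors
% trainLabel: M x 1 vector of numeric class indices (e.g. one index per instrument family)
% testFeat:   1 x D feature vector of the unknown note; k: number of neighbours
function label = knn_classify(trainFeat, trainLabel, testFeat, k)
    d = sqrt(sum((trainFeat - repmat(testFeat, size(trainFeat, 1), 1)).^2, 2));
    [dSorted, idx] = sort(d);               % nearest training samples first
    label = mode(trainLabel(idx(1:k)));     % majority vote among the k nearest
end

For example, with numeric labels for the instrument families and k = 5, label = knn_classify(trainFeat, trainLabel, testFeat, 5) returns the predicted family of the test note.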

Conclusion:
I have described a system that can listen to a musical instrument and recognize it. The work started by reviewing human perception: how well humans can recognize different instruments and what are the underlying phenomena taking place in the auditory system. Then I studied the qualities of musical sounds that make them distinguishable from each other, as well as the acoustics of musical instruments. The physical properties of the instrument families were studied. The knowledge of the perceptually salient acoustic cues possibly used by human subjects in recognition was the basis for the development of the feature extraction algorithms.

In the first evaluation, temporal features were examined. Using the hierarchic classifier architecture did not bring an improvement in recognition accuracy. However, it was concluded that the recognition rates in this experiment were highly optimistic because of insufficient testing material. The next experiment addressed this problem by introducing a wider data set including several examples of a particular instrument.

The next phase used spectral features, which gave more discriminative results. To get more accurate results, the standard deviations of all features were also taken into consideration. The spectral features include the spectral centroid, spectral roll-off and spectral flux. The within-instrument-family confusions made by the system were similar to those made by human subjects, although the system made more confusions both inside and outside the families. In the final experiment, techniques commonly used in speaker recognition were applied to musical instrument recognition. The benefit of this approach is that it is directly applicable to solo phrases.

In order to make truly realistic evaluations, more acoustic data would be needed, including monophonic material. The environment and the differences between instrument instances proved to have a more significant effect on the difficulty of the problem than was expected at the beginning. In general, the task of reliably recognizing a wide set of instruments from realistic monophonic recordings is not a trivial one; it is difficult for humans and especially for computers. It becomes easier as longer segments of music are used and the recognition is performed at the level of instrument families.

BIBLIOGRAPHY:
[1] Kaminskyj, Materka. (1995). Automatic Source Identification of Monophonic Musical Instrument Sounds. Proceedings of the IEEE International Conference on Neural Networks, 1995.
[2] Opolko, F. & Wapnick, J. (1987). McGill University Master Samples (compact disc). McGill University, 1987.
[3] Brown, Puckette. (1992). An Efficient Algorithm for the Calculation of a Constant Q Transform. J. Acoust. Soc. Am. 92, pp. 2698-2701.
[4] Fraser, Fujinaga. (1999). Towards real-time recognition of acoustic musical instruments. Proceedings of the International Computer Music Conference, 1999.
[5] Fujinaga. (1998). Machine recognition of timbre using steady-state tone of acoustic musical instruments. Proceedings of the International Computer Music Conference, 1998.
[6] Fujinaga. (2000). Realtime recognition of orchestral instruments. Proceedings of the International Computer Music Conference, 2000.
[7] Kostek. (1999). Soft Computing in Acoustics: Applications of Neural Networks, Fuzzy Logic and Rough Sets to Musical Acoustics. Physica-Verlag, 1999.
[8] Martin. (1998). Musical instrument identification: A pattern-recognition approach. Presented at the 136th meeting of the Acoustical Society of America, October 13, 1998.
[9] Rossing. (1990). The Science of Sound. Second edition, Addison-Wesley Publishing Co.
[10] Fletcher, Rossing. (1998). The Physics of Musical Instruments. Springer-Verlag New York, Inc.
[11] Kostek, Czyzewski. (2000). Automatic Classification of Musical Sounds. In Proc. 108th Audio Eng. Soc. Convention.
[12] Kostek, Czyzewski. (2001). Automatic Recognition of Musical Instrument Sounds - Further Developments. In Proc. 110th Audio Eng. Soc. Convention, Amsterdam, Netherlands, May 2001.
[13] Martin. (1999). Sound-Source Recognition: A Theory and Computational Model. Ph.D. thesis, MIT.
[14] Brown. (2001). Feature dependence in the automatic identification of musical woodwind instruments. J. Acoust. Soc. Am. 109(3), March 2001.
