The task was to classify single-instrument recordings into ten instrument classes: Electronic Bass Guitar, Acoustic Brass, Acoustic Flute, Acoustic Guitar, Acoustic Keyboard, Acoustic Mallet, Electronic Organ, Acoustic Reed, Acoustic String, and Acoustic Vocal. The goal of instrument identification here is to apply the various machine learning methodologies studied in our class. The input to our model consists of 1,400 individual audio recordings in WAV format, split evenly across the ten instruments (1,000 for the training set and 200 each for the validation and test sets). Since the given dataset was very small, I assumed that designing and extracting effective features would be more important than having a powerful machine learning model. I performed basic feature extraction from the discrete Fourier transform (DFT) of the signal and labeled each example with a unique integer corresponding to its instrument class. I then used this data for analysis and trained multiple SVMs to make predictions on the test set.
Our dataset consists of 1,400 samples obtained from the NSynth dataset (1,000 for training, 200 for validation, and 200 for testing). Each of the training, validation, and test sets contains the same number of samples for each of the 10 instrument classes. Each instrument's data consists of recordings at different pitches. One characteristic of this dataset is that every sample has a fixed length (4 seconds) and every recording has the same release point (at the end of 3 seconds). That is, recordings of each instrument are expected to have comparable temporal characteristics as well as comparable spectral characteristics (such as timbre). Therefore, I took the approach of extracting temporal changes that reflect the characteristics of each instrument.
The basic features I extracted from the original signal were as follows:
- Mel-spectrogram (FFT length: 256, 512, 1024, 2048/Mel bin sizes: 40, 128)
- MFCC (DCT sizes: 13, 20)
- MFCC 1st-order differential
- MFCC 2nd-order differential
- Spectral centroid (FFT length: 256, 512, 1024, 2048)
- Zero crossing rate (FFT length: 256, 512, 1024, 2048)
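To make two of the simpler features in the list above concrete, here is a minimal NumPy sketch of the spectral centroid and the zero crossing rate, plus a plain finite difference standing in for the MFCC differentials. The function names, the Hann window, the hop size, and the 16 kHz sample rate are assumptions for illustration; the report does not state how these features were computed, and a library implementation may define the differential differently.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def spectral_centroid(frames, sr):
    """Magnitude-weighted mean frequency (Hz) of each frame."""
    window = np.hanning(frames.shape[1])  # assumed window choice
    mag = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    return (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-10)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def delta(feat):
    """First-order difference along time for a (n_coeffs, n_frames) matrix.

    Assumption: a plain finite difference; the first column is set to zero.
    """
    return np.diff(feat, axis=1, prepend=feat[:, :1])

# Toy input: a 1-second 440 Hz sine at an assumed sample rate of 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)

frames = frame_signal(tone, frame_len=1024, hop=512)
centroid = spectral_centroid(frames, sr)  # close to 440 Hz for a pure tone
zcr = zero_crossing_rate(frames)          # close to 2 * 440 / sr
```

For a pure tone the centroid sits near the tone's frequency and the zero crossing rate near twice the frequency divided by the sample rate, which makes the toy signal a convenient sanity check.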
In the baseline code, we were given the approach of using the mean MFCC value as the feature. It gave an accuracy of only 43.5%. The first thing I tried was stronger classifiers such as an MLP (Multi-Layer Perceptron) classifier and a GMM classifier. These slightly improved the accuracy, but not as much as expected. Since the dataset was very small, the baseline SVM model performed well enough by comparison.
Combining these features, I also got slightly better results.
To capture more of the temporal changes, I divided the time axis into 4 sections and computed the mean value of each section. Using the concatenation of the 4 mean values, I got better accuracy. This made me confident that the temporal characteristics of the signal are very helpful for distinguishing different instruments.
I also tried a somewhat unconventional approach: modeling the temporal distribution of MFCC values as a Gaussian Mixture Model and using the GMM parameters as a feature. The accuracy was similar to that obtained with the mean value.
To get better insight into the basic features, I visualized the averaged MFCC and MFCC 1st-order differential values for each instrument class and visually checked the temporal changes in each bin (Figure 1, Figure 2). The plots show that temporal changes are the core part of what characterizes each instrument.
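The section-wise averaging described above can be sketched as follows. The function name and the (n_coeffs, n_frames) layout are assumptions for illustration, not the report's actual code.

```python
import numpy as np

def segment_means(feat, n_segments=4):
    """Average a (n_coeffs, n_frames) feature over each time section and concatenate.

    Returns a vector of length n_coeffs * n_segments.
    """
    sections = np.array_split(feat, n_segments, axis=1)
    return np.concatenate([s.mean(axis=1) for s in sections])

# Toy input: 13 coefficients x 8 frames, every coefficient equal to the frame index.
toy = np.tile(np.arange(8.0), (13, 1))
vec = segment_means(toy)  # 4 blocks of 13 values: 0.5, 2.5, 4.5, 6.5
```

Compared with a single mean over all frames, the concatenated vector keeps a coarse record of how each coefficient evolves from attack to release.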
Codebook histogram approach
Creating a codebook by unsupervised learning on the basic features was a more advanced approach, since it can reflect more detailed characteristics of the features than simply computing the mean value. Here I tried two different ways of constructing the codebook.
Codebook trained along frequency axis
The first approach was to perform unsupervised clustering on the Mel-spectrogram / MFCC / MFCC 1st-order differential / MFCC 2nd-order differential vectors of every frame in the entire training dataset. This quantizes each vector into a characteristic reference code. If the distribution along the Mel bin axis represents the characteristics of each instrument well enough, this approach should perform very well.
After replacing each vector with its corresponding reference code, I built a histogram of codes for each recording to use as a feature. In the case of MFCC, since MFCC is regarded as representing the timbre of a sound, the histogram of all MFCC codes captures the timbre characteristics of the full 4 seconds. As expected, it gave noticeably higher accuracy than the basic naive features. For clustering, I used the K-means algorithm with K (the number of clusters) equal to 400.
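The frame-wise codebook and histogram steps can be sketched as below. This is a minimal illustration under stated assumptions: the small hand-written K-means, the function names, the histogram normalization, and the synthetic data are all hypothetical, and K is set to 8 only to keep the toy example small (the report uses K = 400).

```python
import numpy as np

def assign_codes(X, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each row of X."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns centroids of shape (k, dim)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign_codes(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def code_histogram(feat, centroids):
    """Normalized histogram of frame codes for one (n_coeffs, n_frames) recording."""
    codes = assign_codes(feat.T, centroids)  # one code per frame
    hist = np.bincount(codes, minlength=len(centroids)).astype(float)
    return hist / hist.sum()  # normalization is an assumption

# Synthetic stand-in for MFCC matrices: 5 recordings, 13 coefficients, 50 frames each.
rng = np.random.default_rng(0)
recordings = [rng.normal(size=(13, 50)) for _ in range(5)]

# Pool every frame of every training recording, then learn the codebook.
all_frames = np.concatenate([r.T for r in recordings])  # shape (250, 13)
codebook = kmeans(all_frames, k=8)

# One fixed-length histogram per recording, ready for a classifier.
features = np.stack([code_histogram(r, codebook) for r in recordings])
```

The pairwise distance array in `assign_codes` grows with the number of frames times K, so a real run over the full training set would likely need batching or a library K-means.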
Codebook trained along time axis
The second approach focused more on the temporal changes in each feature dimension. I applied unsupervised learning in a transposed manner to the Mel-spectrogram / MFCC / MFCC 1st-order differential / MFCC 2nd-order differential data. That is, I ran the clustering algorithm to capture the distribution of the data along the time axis for every frequency bin. These quantized codes can represent the temporal changes of each bin. I could have performed a separate clustering for each frequency bin (e.g. each Mel bin) for a more precise experiment, but I chose to use a large number of cluster centroids (400), hoping that a single codebook would also distinguish the characteristic distributions of every bin along the frequency axis. This approach is very similar to what CNN architectures operating along the time axis do. When 1D convolution kernels are trained to extract the right features from a spectrogram or MFCC, each kernel learns to extract temporal characteristics of each spectral bin. I discuss this further in the discussion section.
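The difference from the frame-wise codebook is mainly one of data layout, which the following sketch tries to show: each training sample is now one coefficient's whole trajectory over time rather than one frame. The names are hypothetical, the codebook here is a stand-in made of randomly chosen trajectories (a real run would use K-means centroids), and building a histogram over the per-bin codes is an assumption about how the codes become a feature.

```python
import numpy as np

def nearest_code(X, codebook):
    """Index of the nearest codebook entry for each row of X."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def time_axis_histogram(feat, codebook):
    """feat has shape (n_coeffs, n_frames); each row (one bin over time) gets one code."""
    codes = nearest_code(feat, codebook)  # note: no transpose, rows are trajectories
    hist = np.bincount(codes, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Synthetic stand-in: 5 recordings, 13 coefficients, 50 frames each.
# A fixed frame count is required, which holds when all recordings have equal length.
rng = np.random.default_rng(0)
recordings = [rng.normal(size=(13, 50)) for _ in range(5)]

# Pool the trajectories of all bins from all recordings: one shared codebook.
trajectories = np.concatenate(recordings)  # shape (65, 50)

# Stand-in codebook (assumption): K random trajectories instead of K-means centroids.
K = 8
codebook = trajectories[rng.choice(len(trajectories), size=K, replace=False)]

features = np.stack([time_axis_histogram(r, codebook) for r in recordings])
```

Because one codebook is shared by all bins, the histogram records which temporal shapes occur in a recording but not which bin produced them, which is the trade-off the paragraph above describes.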
For the classifier itself, I’ve tried 3 methodologies.
- Support vector machine classifier
- Multi-layer perceptron classifier
- GMM classifier
All three performed very similarly, though the GMM approach performed slightly better. I speculate that the dataset was not big enough to measure performance differences between classification techniques. The most effective feature was the codes representing the time-axis distribution of MFCC 1st-order differentials. It is more effective than the time-axis distribution of MFCC because the differential values in different Mel bins tend to change together in a similar fashion. Therefore, not training separate codebooks for different Mel bins does less harm to the effectiveness of the feature. The best-performing model was the combination of the top two features: time-coded MFCC 1st-order differential and frequency-coded MFCC 1st-order differential.
From the results, we can conclude that the temporal changes of the MFCC values in each dimension are the core discriminative factor for instrument recordings (at least for our small-dataset problem).
I applied dimensionality reduction to the coded features of the training dataset to see which feature is more discriminative. As Figure 3 shows, the codebooks of temporal changes in either MFCC values or MFCC 1st-order differentials are more effective at discriminating between instrument classes than the frequency-axis distribution.
As mentioned before, a more precise experiment would require a separate clustered codebook for each MFCC dimension. However, with a sufficiently large number of clusters (400), we can expect the codebook to cover the characteristics of each MFCC bin reasonably well.
What the clustered codebook learns can be regarded as analogous to what 1D convolution kernels along the time axis of a CNN do. When 1D convolution kernels are trained to extract the right features from a spectrogram or MFCC, each kernel learns to extract temporal characteristics of each spectral bin. Here, the reference codes (the trained centroids from K-means clustering) represent characteristic temporal changes in any MFCC bin.
Training a codebook for both the temporal and spectral distributions can function similarly to training the necessary convolutional kernels in a deep neural network architecture. For future work, I would like to compare the two more carefully and look for better-optimized feature learning techniques for audio analysis tasks.
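The report does not say which dimensionality reduction method was used. As one possible illustration, the sketch below projects histogram-like features onto their first two principal components with PCA via SVD; the choice of PCA, the function name, and the synthetic two-class data are all assumptions.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)  # center each feature dimension
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Hypothetical coded features: two classes of 20 recordings, 8 dimensions each.
rng = np.random.default_rng(0)
class_a = rng.normal(loc=0.0, size=(20, 8))
class_b = rng.normal(loc=5.0, size=(20, 8))
X = np.vstack([class_a, class_b])

Z = pca_project(X)  # shape (40, 2), suitable for a scatter plot colored by class
```

A feature whose classes form well-separated groups in such a 2-D projection is, informally, the more discriminative one.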
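One way to make the analogy concrete: since ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2, picking the nearest centroid is the same as picking the largest linear response x·c - ||c||^2 / 2, that is, a correlation with the centroid plus a per-centroid bias. The sketch below checks this numerically on random data. Note the assumption that each "kernel" spans the whole time axis, whereas a trained CNN kernel is usually shorter and slides along it.

```python
import numpy as np

rng = np.random.default_rng(0)
trajectories = rng.normal(size=(100, 50))  # 100 coefficient trajectories, 50 frames each
codebook = rng.normal(size=(8, 50))        # 8 hypothetical centroids of the same length

# Codebook view: nearest centroid by squared Euclidean distance.
d2 = ((trajectories[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
codes_distance = d2.argmin(axis=1)

# Filter view: correlate with each centroid, add a per-centroid bias,
# and keep the index of the strongest response.
bias = -0.5 * (codebook ** 2).sum(axis=1)
responses = trajectories @ codebook.T + bias
codes_filter = responses.argmax(axis=1)

same = np.array_equal(codes_distance, codes_filter)
```

The main difference is in how the weights are obtained: K-means fits them without labels, while convolution kernels are trained against the classification objective.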