In part 1 we learned some basic signal processing terminology, as well as how to load and visualize a song. In this post we will discuss how to break up a signal (segment) and extract various features from it. Then, we'll do some exploratory analysis on the features so that we can get an idea of the interactions among features.
Songs vary a lot over time. By breaking up this heterogeneous signal into small segments that are more homogeneous, we keep the information about how a song changes. For example, say that we have a feature such as 'danceability'. Imagine a song that starts out very quiet and low energy, but as it progresses, it becomes a full-blown Rihanna club anthem. If we just examined the average danceability of the entire song, it might be lower than how danceable the song actually is. This context matters: if we run a classifier or clustering algorithm on each segment of a song, we can classify that song as the club anthem that it is, not just its (below) average.
An onset in a signal is often described as the beginning of a note or other sound. This is usually found by measuring peaks in energy along a signal. If we find that peak in energy, and then backtrack to a local minimum, we have found an onset, and can use that as a boundary for a segment of a song.
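Under the hood this is a peak-picking problem on an 'onset strength' envelope. Here is a minimal sketch of the idea with librosa (the file path is just a placeholder):
import librosa

# hypothetical file path -- substitute any audio file
y, sr = librosa.load('some_song.m4a')

# onset strength envelope: one value per frame measuring how sharply the energy rises
onset_env = librosa.onset.onset_strength(y=y, sr=sr)

# pick the peaks of that envelope, then backtrack each peak to the preceding energy minimum
onset_frames = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr, backtrack=True)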
Here are some good resources for learning more about onsets and onset detection: https://musicinformationretrieval.com/onset_detection.html, https://en.wikipedia.org/wiki/Onset_(audio), https://www.music-ir.org/mirex/wiki/2018:Audio_Onset_Detection
%matplotlib inline
import librosa
import numpy as np
import IPython.display as ipd
import sklearn.preprocessing
import sklearn.cluster
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
plt.rcParams['figure.figsize'] = (16, 7)
By default, onset_detect returns an array of frame indices that correspond to frames in a signal. We actually want the sample indices so that we can slice and dice our signal neatly with those indices. We'll continue to use our Rock & Roll example from the previous post.
signal, sr = librosa.load('/Users/benjamindykstra/Music/iTunes/Led Zeppelin/Led Zeppelin IV/02 Rock & Roll.m4a')
# use backtrack=True to go back to local minimum
onset_samples = librosa.onset.onset_detect(y=signal, sr=sr, backtrack=True, units='samples')
print(onset_samples.shape)
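For reference, passing units='samples' is equivalent to detecting onsets in frames and converting with frames_to_samples; a quick sketch, assuming the default hop_length of 512:
# the same sample indices, obtained by detecting onsets in frames and then converting
onset_frames = librosa.onset.onset_detect(y=signal, sr=sr, backtrack=True)       # frame indices
onset_samples_alt = librosa.frames_to_samples(onset_frames, hop_length=512)      # sample indices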
We have found 132 segments in the song. Now, let's use the sample indices to split up the song into subarrays like so: [[segment 1], [segment 2], ..... [segment n]]. Each segment will be a different length, but when we calculate a feature vector for a segment, the feature vector will be a standard size. The final dimensions of the data after segmenting and calculating features for each segment will be (# segments, # features).
# build a list of audio segments; each element holds the raw audio data for one segment
prev_ndx = 0
segmented = []
for sample_ndx in onset_samples:
    # get the samples from prev_ndx to sample_ndx
    segmented.append(signal[prev_ndx:sample_ndx])
    prev_ndx = sample_ndx
segmented.append(signal[onset_samples[-1]:])  # the last segment, from the final onset to the end of the song
# segments have different lengths, so store them in an object array
segmented = np.array(segmented, dtype=object)
segmented.shape
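As an aside, numpy can do the same slicing in one call; a quick sketch, assuming onset_samples is a sorted array of sample indices:
# np.split slices at each onset index: one piece before the first onset,
# one between each pair of onsets, and one after the last onset
segmented_alt = np.split(signal, onset_samples)
assert len(segmented_alt) == len(onset_samples) + 1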
As a sanity check, if we concatenate all the segments together, it should be equal in shape to the original signal.
print "difference in shapes: {}".format(signal.shape[0] - np.concatenate(segmented).shape[0])
Listen to a few segments together!
ipd.Audio(np.concatenate(segmented[25:30]), rate=sr)
Or just a single short segment:
ipd.Audio(segmented[21], rate=sr)
Let's define a more generic segmentation function for use later on.
def segment_onset(signal, sr=22050, hop_length=512, backtrack=True):
    """
    Segment a signal using onset detection.

    Parameters:
        signal: numpy array of a time series of an audio file
        sr: int, sampling rate, default 22050 samples per second
        hop_length: int, number of samples between successive frames
        backtrack: bool, if True, detected onset events are backtracked to the nearest preceding minimum of energy
    Returns:
        dictionary with keys 'data' and 'shape'
    """
    # Compute the sample indices for estimated onsets in the signal
    onset_samples = librosa.onset.onset_detect(y=signal, sr=sr, hop_length=hop_length, backtrack=backtrack, units='samples')

    # build a list of audio segments; each element holds the raw audio data for one segment
    prev_ndx = 0
    segmented = []
    for sample_ndx in onset_samples:
        segmented.append(signal[prev_ndx:sample_ndx])
        prev_ndx = sample_ndx
    segmented.append(signal[onset_samples[-1]:])  # the last segment, from the final onset to the end

    segmented = np.array(segmented, dtype=object)
    return {'data': segmented, 'shape': segmented.shape}
Now that we have a way to break up a song, we would like to derive some features from the raw signal. Librosa has a plethora of features to choose from. They fall into two categories, spectral and rhythmic: spectral features have to do with the frequency, pitch and timbre of a signal, whereas rhythmic features (you guessed it) give you information about the rhythm of the signal.
The objective of feature extraction is to have a single function that takes a single segment of a song and returns an array of the calculated features for that segment.
Some of the feature methods return arrays of differing shapes, and we need to account for those differences in our implementation. For example, when calculating the Mel-frequency cepstral coefficients for a segment, the return shape is (# of coefficients, # of frames in the segment). Since we're assuming that a segment is a homogeneous piece of signal, we take the average of the coefficients across all the frames, giving a shape of (# of coefficients,).
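For example, a quick sketch of that reduction for the MFCCs of one of our segments:
# mfccs has shape (n_mfcc, # frames): one column of coefficients per frame
mfccs = librosa.feature.mfcc(y=segmented[21], sr=sr, n_mfcc=20)

# averaging across the frame axis collapses it to one fixed-length vector for the segment
mfcc_vector = np.mean(mfccs, axis=1)   # shape (20,)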
First I will define all the feature functions, and then I will explain what information they add/describe.
def get_feature_vector(segment):
    '''
    Extract features for a given segment.

    Parameters:
        segment: numpy array, a time series of audio data
    Returns:
        numpy array of shape (30,)
    '''
    if len(segment) != 0:
        feature_tuple = (avg_energy(segment), avg_mfcc(segment), zero_crossing_rate(segment),
                         avg_spectral_centroid(segment), avg_spectral_contrast(segment), bpm(segment))
        # flatten the tuple of arrays/scalars into one 1-D feature vector
        all_features = np.concatenate([feat if isinstance(feat, np.ndarray) else np.array([feat]) for feat in feature_tuple])
        return all_features
    return np.zeros((30,))  # total length of the feature vector: 1 + 19 + 1 + 1 + 7 + 1
def avg_energy(segment):
    '''
    Get the average energy of a segment.

    Parameters:
        segment: numpy array, a time series of audio data
    Returns:
        numpy array of shape (1,), the mean RMS energy of the segment
    '''
    if len(segment) != 0:
        # librosa.feature.rms returns a (1, t) array of frame-level RMS values; take the first row
        energy = librosa.feature.rms(y=segment)[0]
        return np.array([np.mean(energy)])
def avg_mfcc(segment, sr=22050, n_mfcc=20):
'''
Get the average Mel-frequency cepstral coefficients for a segment
The very first MFCC, the 0th coefficient, does not convey information relevant to the overall shape of the spectrum.
It only conveys a constant offset, i.e. adding a constant value to the entire spectrum. We discard it.
BE SURE TO NORMALIZE
Parameters:
segment: numpy array, a time series of audio data
sr: int, sampling rate, default 22050
n_mfcc: int, the number of cepstral coefficients to return, default 20.
Returns:
numpy array of shape (n_mfcc - 1,)
'''
    if len(segment) != 0:
        components = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, # frames)
        return np.mean(components[1:], axis=1)  # drop the 0th coefficient, average across frames
def zero_crossing_rate(segment):
    '''
    Get the average zero crossing rate for a segment. A small constant is added to the signal
    to avoid spurious crossings from low-level noise near silent periods.

    Parameters:
        segment: numpy array, a time series of audio data
    Returns:
        numpy array of shape (1,), average zero crossing rate for the given segment
    '''
    rate_vector = librosa.feature.zero_crossing_rate(segment + 0.0001)[0]  # returns array with shape (1, t)
    return np.array([np.mean(rate_vector)])
def avg_spectral_centroid(segment, sr=22050):
    '''
    Indicate which frequency the energy is centered on. Like a weighted mean, weighting each
    frequency by its energy. A small constant is added to the signal to discard noise from silence.

    Parameters:
        segment: numpy array, a time series of audio data
        sr: int, sampling rate
    Returns:
        numpy array of shape (1,), the average frequency the energy is centered on
    '''
    centroid = librosa.feature.spectral_centroid(y=segment + 0.01, sr=sr)[0]
    return np.array([np.mean(centroid)])
def avg_spectral_contrast(segment, sr=22050, n_bands=6):
    '''
    Consider the spectral peak, the spectral valley, and their difference in each frequency subband.
    Each row of the spectral contrast matrix corresponds to an octave-based frequency band;
    we take the average across frames to get one value per band.

    Parameters:
        segment: numpy array, a time series of audio data
        sr: int, sampling rate
        n_bands: int, the number of spectral bands to calculate the contrast across
    Returns:
        numpy array of shape (n_bands + 1,)
    '''
    contr = librosa.feature.spectral_contrast(y=segment, sr=sr, n_bands=n_bands)
    return np.mean(contr, axis=1)  # average across frames, one value per band
def bpm(segment, sr=22050):
    '''
    Estimate the beats per minute of a segment.

    Parameters:
        segment: numpy array, a time series of audio data
        sr: int, sampling rate
    Returns:
        numpy array of shape (1,), estimated beats per minute
    '''
    tempo = librosa.beat.tempo(y=segment, sr=sr)  # returns a 1-d array like [bpm]
    return np.array([tempo[0]])
The energy of a segment is important because it gives a feel for the tempo and/or mood of the segment. It's actually just the root mean square (RMS) of the signal.
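For one frame of samples, that is simply the following (a minimal numpy sketch of what librosa computes per frame before we take the mean):
def rms_energy(frame):
    # root mean square: square each sample, take the mean, then the square root
    return np.sqrt(np.mean(frame ** 2))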
The Mel-frequency cepstral coefficients relay information about the timbre of a song. Timbre describes the 'quality' of a sound. If you think about how an A note on a trumpet sounds vastly different than that same A on a piano, those differences are due to timbre.
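Roughly speaking, MFCCs are a log mel spectrogram followed by a discrete cosine transform. This sketch is not librosa's exact implementation, but it captures the idea:
import scipy.fftpack

def rough_mfcc(segment, sr=22050, n_mfcc=20):
    # mel-scaled power spectrogram, converted to decibels (a log scale)
    mel = librosa.feature.melspectrogram(y=segment, sr=sr)
    log_mel = librosa.power_to_db(mel)
    # a DCT along the frequency axis decorrelates the bands; keep the first n_mfcc coefficients
    return scipy.fftpack.dct(log_mel, axis=0, type=2, norm='ortho')[:n_mfcc]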
The zero crossing rate is literally the rate at which the signal crosses the horizontal axis. It often corresponds to events in the signal such as a snare drum hit or some other percussive event.
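A rough sketch of what's being counted for a single frame:
def rough_zero_crossing_rate(frame):
    # fraction of adjacent sample pairs whose signs differ, i.e. the waveform crossed zero
    signs = np.sign(frame)
    return np.mean(signs[1:] != signs[:-1])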
I think the spectral centroid is actually really cool. It's a weighted average of the frequencies present in a signal, weighted by their magnitudes, like a spectral 'center of mass'. It's often perceived as a measure of the brightness of a sound.
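In code, the centroid of a single frame is just a magnitude-weighted mean of the FFT bin frequencies (a sketch; librosa computes this per STFT frame):
def rough_spectral_centroid(frame, sr=22050):
    # magnitude spectrum of the frame and the frequency of each FFT bin
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # weighted mean: each bin's frequency weighted by its magnitude
    return np.sum(freqs * magnitudes) / np.sum(magnitudes)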
Spectral contrast is an octave-based feature that more directly represents the spectral characteristics of a segment. Coupled with the MFCC features, it provides a lot of information about a signal.
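For a single band, the idea looks roughly like this (librosa's real implementation works on octave-spaced bands of the STFT and uses a small quantile for the peak and valley):
def rough_band_contrast(band_magnitudes, quantile=0.02):
    # compare the loudest and quietest portions of the band on a log scale
    sorted_mags = np.sort(band_magnitudes)
    k = max(1, int(quantile * len(sorted_mags)))
    valley = np.log(np.mean(sorted_mags[:k]) + 1e-10)   # quietest magnitudes in the band
    peak = np.log(np.mean(sorted_mags[-k:]) + 1e-10)    # loudest magnitudes in the band
    return peak - valley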
BPM provides information about the tempo and (some) percussive elements of a song.
Now what do we actually do with all these features??? We can use them to cluster!!
def extract_features(all_songs):
    '''
    Extract a feature matrix for a list of songs.

    all_songs is a list of dictionaries. Each dictionary contains the attributes song_name and data,
    where data holds the segments of the song.

    Returns a numpy array of shape (total # segments, # features).
    '''
    all_song_features = []
    for song in all_songs:
        print("Processing {} with {} segments".format(song['song_name'], len(song['data'])))
        # compute one feature vector per segment
        segment_features = [get_feature_vector(segment) for segment in song['data']]
        song_feature_vector = np.array(segment_features)
        print("shape of feature vector for entire song: {}".format(song_feature_vector.shape))
        print("shape of segment feature vector: {}".format(song_feature_vector[0].shape))
        all_song_features.append(song_feature_vector)
    return np.vstack(all_song_features)
Let's look at the features of Rock 'n Roll by Led Zeppelin and I Remember by Deadmau5.
rock_n_roll = segment_onset(signal)
rock_n_roll['data'].shape
feature_vec_rock = extract_features([{'song_name': '02 Rock & Roll.m4a', 'data': rock_n_roll['data']}])
feature_vec_rock.shape
i_remember, sr = librosa.load('/Users/benjamindykstra/Music/iTunes/Deadmau5/Random Album Title/07 I Remember.m4a')
i_remember_segmented = segment_onset(i_remember)
feature_vec_remember = extract_features([{'song_name': '07 I Remember.m4a', 'data': i_remember_segmented['data']}])
These two songs are very different:
ipd.Audio(np.concatenate(i_remember_segmented['data'][:30]), rate= sr)
ipd.Audio(np.concatenate(rock_n_roll['data'][:20]), rate= sr)
col_names = ['energy'] + ["mfcc_" + str(i) for i in range(19)] + ['zero_crossing_rate', 'spectral_centroid'] + ['spectral_contrast_band_' + str(i) for i in range(7)] + ['bpm']
rnr = pd.DataFrame(feature_vec_rock, columns = col_names)
i_remember_df = pd.DataFrame(feature_vec_remember, columns = col_names)
min_max_scaler = sklearn.preprocessing.MinMaxScaler(feature_range=(-1, 1))
rnr_scaled = pd.DataFrame(min_max_scaler.fit_transform(feature_vec_rock), columns = col_names)
i_remember_scaled = pd.DataFrame(min_max_scaler.fit_transform(feature_vec_remember), columns = col_names)
features_scaled = pd.DataFrame(np.vstack((rnr_scaled, i_remember_scaled)), columns = col_names)
rnr_scaled.head()
i_remember_scaled.head()
sns.boxplot(x=i_remember_scaled.energy, palette='muted').set_title('I Remember Energy');
plt.show();
print(i_remember_scaled.energy.describe())
sns.boxplot(x=rnr_scaled.energy).set_title('Rock n Roll Energy');
plt.show();
print(rnr_scaled.energy.describe())
I Remember has a lower mean and median energy, but a similar spread to Rock n Roll. I'd say that this fits, as I Remember has an almost melancholic energy, whereas Rock n Roll really makes you want to get up and move.
Since zero crossing rate and BPM correlate pretty highly with percussive events, I'd predict that the song with the higher BPM will also have the higher zero crossing rate.
print('Rock n Roll average BPM: {}'.format(rnr.bpm.mean()))
print('Rock n Roll average Crossing Rate: {}'.format(rnr.zero_crossing_rate.mean()))
print('I Remember average BPM: {}'.format(i_remember_df.bpm.mean()))
print('I Remember average Crossing Rate: {}'.format(i_remember_df.zero_crossing_rate.mean()))
Some scatterplots with spectral centroid on the x axis and energy on the y axis.
sns.scatterplot(x = rnr.spectral_centroid, y = rnr.energy);
plt.show();
sns.scatterplot(x = i_remember_df.spectral_centroid, y = i_remember_df.energy );
plt.show();
Recall that the spectral centroid is where the spectral 'center of mass' is for a given segment, so it's roughly picking out the dominant frequency of the segment. I like this because it shows that there's no real relationship between the frequency content of a segment and its energy. They each contribute unique information.
song_labels = np.concatenate([np.full((label_len), i) for i, label_len in enumerate([len(rnr), len(i_remember_df)])])
model = sklearn.cluster.KMeans(n_clusters=3)
labels = model.fit_predict(features_scaled)
plt.scatter(features_scaled.zero_crossing_rate[labels==0], features_scaled.energy[labels==0], c='b')
plt.scatter(features_scaled.zero_crossing_rate[labels==1], features_scaled.energy[labels==1], c='r')
plt.scatter(features_scaled.zero_crossing_rate[labels==2], features_scaled.energy[labels==2], c='g')
plt.xlabel('Zero Crossing Rate (scaled)')
plt.ylabel('Energy (scaled)')
plt.legend(('Class 0', 'Class 1', 'Class 2'))
unique_labels, unique_counts = np.unique(model.predict(rnr_scaled), return_counts=True)
print(unique_counts)
print('cluster for rock n roll: ', unique_labels[np.argmax(unique_counts)])
unique_labels, unique_counts = np.unique(model.predict(i_remember_scaled), return_counts=True)
print(unique_counts)
print('cluster for I remember: ', unique_labels[np.argmax(unique_counts)])
I suspect that zero crossing rate and energy are not the determining factors for a cluster :)
Note that these aren't necessarily consecutive segments.
First, let's look at I Remember.
i_remember_clusters = model.predict(i_remember_scaled)
label_2_segs = i_remember_segmented['data'][i_remember_clusters==2]
label_1_segs = i_remember_segmented['data'][i_remember_clusters==1]
label_0_segs = i_remember_segmented['data'][i_remember_clusters==0]
Almost all of these have some of the vocals in them.
ipd.Audio(np.concatenate(label_2_segs[:50]), rate = sr)
These are the lighter segments, mostly just synths
ipd.Audio(np.concatenate(label_1_segs[:50]), rate = sr)
The label-0 segments seem to be the heavy bass and percussive parts.
ipd.Audio(np.concatenate(label_0_segs[:50]), rate = sr)
rnr_clusters = model.predict(rnr_scaled)
rock_label_2_segs = rock_n_roll['data'][rnr_clusters==2]
rock_label_1_segs = rock_n_roll['data'][rnr_clusters==1]
rock_label_0_segs = rock_n_roll['data'][rnr_clusters==0]
Again, higher frequencies, vocals included. A lot of consecutive segments included in this class.
ipd.Audio(np.concatenate(rock_label_2_segs[10:40]), rate = sr)
Lighter sounds, minor key vocals, part of final drum solo
ipd.Audio(np.concatenate(rock_label_1_segs), rate = sr)
All John Bonham here (only drums). Similar to how label 0 in I Remember corresponded to the heavy bass.
ipd.Audio(np.concatenate(rock_label_0_segs), rate = sr)
We've used two very different songs to build vectors of spectral and rhythmic features. We then examined how those features relate to each other with boxplots, scatterplots and descriptive statistics. Using those features, we clustered different segments of the songs together and listened to how the resulting clusters sound.
More exploration is needed to figure out how to assign a representative feature value to an entire song. With that, it would be possible to get summary features for entire songs and compare songs directly. More on this later!
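One simple approach, sketched below as an assumption rather than something implemented here, is to pool each song's segment features with summary statistics such as the per-feature mean and standard deviation:
def summarize_song(segment_feature_matrix):
    # collapse a (# segments, # features) matrix into one song-level vector by
    # concatenating the per-feature mean and standard deviation across segments
    means = np.mean(segment_feature_matrix, axis=0)
    stds = np.std(segment_feature_matrix, axis=0)
    return np.concatenate([means, stds])

# e.g. summarize_song(feature_vec_rock) gives a vector of length 2 * (# features)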
print(rnr_scaled[rnr_clusters==0]['spectral_centroid'].describe())
print(rnr_scaled[rnr_clusters==1]['spectral_centroid'].describe())
print(rnr_scaled[rnr_clusters==2]['spectral_centroid'].describe())
i_remember_df['mfcc_2'].plot.hist(bins=20, figsize=(14, 5))
sns.boxplot(x=i_remember_scaled.zero_crossing_rate, palette='muted').set_title('I Remember Zero Crossing Rate');
plt.show();
print(i_remember_scaled.zero_crossing_rate.describe())
sns.boxplot(x=rnr_scaled.zero_crossing_rate).set_title('Rock n Roll Zero Crossing Rate');
plt.show();
print(rnr_scaled.zero_crossing_rate.describe())
print(rnr.zero_crossing_rate.describe())
print(i_remember_df.zero_crossing_rate.describe())
print(rnr.bpm.describe())
print(i_remember_df.bpm.describe())
# rnr_scaled['energy'].plot();
i_remember_scaled['energy'].plot();
# rnr_scaled['zero_crossing_rate'].plot();
# rnr_scaled['spectral_centroid'].plot();
# rnr_scaled['mfcc_4'].plot();
plt.show()
rnr_scaled[['mfcc_' + str(i) for i in range(4)]].plot();
plt.show();
sns.scatterplot(x=rnr_scaled['energy'], y = rnr_scaled['bpm'])
sns.pairplot(rnr_scaled[['mfcc_' + str(i) for i in range(5)]], palette='pastel')