Audio surveillance in unstructured environments
Sharan, Roneel Vikash
Abstract
This research examines an audio surveillance application, one of the many
applications of sound event recognition (SER), and aims to improve the sound
recognition rate in the presence of environmental noise using time-frequency image
analysis of the sound signal and deep learning methods. The sound database
contains ten sound classes, each sound class having multiple subclasses with
interclass similarity and intraclass diversity. Noise from three different
environments is added to the sound signals, and the proposed and baseline methods
are tested under clean conditions and at four different signal-to-noise ratios
(SNRs) in the range of 0–20 dB.
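As an illustration of testing at a fixed SNR, the following is a minimal sketch of how a noise recording might be scaled and added to a clean signal. The function name `mix_at_snr`, the sine-wave stand-in for a sound event, and the white-noise stand-in for an environmental recording are assumptions made for this example and are not taken from this work.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal + noise has the requested SNR in dB.

    Assumes `noise` is at least as long as `signal` (hypothetical helper).
    """
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(signal)]
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(p_signal / (gain**2 * p_noise)), solved for gain.
    gain = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + gain * noise

# Toy data: a 440 Hz tone as the "sound event", white noise as the "environment".
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=10)
```

Measuring the power of `noisy - clean` against the power of `clean` recovers the requested SNR, which is a quick way to check the scaling.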
The baseline features considered in this work are mel-frequency cepstral
coefficients (MFCCs), gammatone cepstral coefficients (GTCCs), and the
spectrogram image feature (SIF). For the SIF, the spectrogram image of the sound
signal is divided into blocks, central moments are computed in each block, and the
values are concatenated to form the final feature vector. Next, several methods are
proposed to improve the
classification performance in the presence of noise.
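The block-based SIF computation described above can be sketched as follows. The 9 x 9 block grid, the choice of second and third central moments, and the function names are assumptions made for illustration; the abstract does not state these details.

```python
import numpy as np

def block_central_moments(image, n_rows=9, n_cols=9, orders=(2, 3)):
    """Central moments of each block of a time-frequency image.

    Returns an array of shape (n_rows, n_cols, len(orders)).
    Grid size and moment orders are assumed values, not taken from the source.
    """
    image = np.asarray(image, dtype=float)
    row_blocks = np.array_split(np.arange(image.shape[0]), n_rows)
    col_blocks = np.array_split(np.arange(image.shape[1]), n_cols)
    moments = np.empty((n_rows, n_cols, len(orders)))
    for i, r in enumerate(row_blocks):
        for j, c in enumerate(col_blocks):
            block = image[np.ix_(r, c)]
            centered = block - block.mean()
            for k, order in enumerate(orders):
                moments[i, j, k] = np.mean(centered ** order)
    return moments

def sif(image, **kwargs):
    """Concatenate all block moments into a single feature vector."""
    return block_central_moments(image, **kwargs).ravel()

# Toy "spectrogram image" with random intensities.
rng = np.random.default_rng(1)
feature = sif(rng.random((90, 90)))
```

With these assumed settings the feature vector has 9 x 9 x 2 = 162 elements.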
Firstly, a variation of the SIF with reduced feature dimensions is proposed, referred
to as the reduced spectrogram image feature (RSIF). The RSIF utilizes the mean and
standard deviation of the central moment values along the rows and columns of the
blocks, resulting in a feature dimension 2.25 times lower than that of the SIF.
Despite the reduction in feature dimension, the RSIF was seen to outperform the SIF
in classification performance due to its higher immunity to inconsistencies in sound
signal segmentation.
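One way to read the dimension reduction is sketched below. It assumes a square 9 x 9 grid of blocks, which is not stated in the abstract but is consistent with the 2.25 ratio: a grid of N x N blocks gives N² values per moment for the SIF and 4N values per moment for the RSIF (mean and standard deviation for each of N rows and N columns), so the ratio is N/4 = 2.25 for N = 9. The function name and the two moments per block are likewise assumptions.

```python
import numpy as np

def rsif_from_block_moments(moments):
    """Mean and standard deviation of block moments along rows and columns.

    `moments` has shape (n_rows, n_cols, n_orders): one set of central
    moments per block (hypothetical layout for this sketch).
    """
    moments = np.asarray(moments, dtype=float)
    stats = [
        moments.mean(axis=1),  # mean over each row of blocks    -> (n_rows, n_orders)
        moments.std(axis=1),   # std over each row of blocks     -> (n_rows, n_orders)
        moments.mean(axis=0),  # mean over each column of blocks -> (n_cols, n_orders)
        moments.std(axis=0),   # std over each column of blocks  -> (n_cols, n_orders)
    ]
    return np.concatenate([s.ravel() for s in stats])

# Toy block moments for an assumed 9 x 9 grid with two moments per block.
rng = np.random.default_rng(0)
moments = rng.random((9, 9, 2))
sif_dim = moments.size                              # 9 * 9 * 2 = 162
rsif_dim = rsif_from_block_moments(moments).size    # (9 + 9) * 2 * 2 = 72
ratio = sif_dim / rsif_dim                          # 162 / 72 = 2.25
```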
Secondly, a feature based on the image texture analysis technique of gray-level
co-occurrence matrix (GLCM) is proposed, which captures the spatial relationship of
pixels in an image. The GLCM texture analysis technique is applied in subbands to
the spectrogram image, and the matrix values from each subband are concatenated to
form the final feature vector, which is referred to as the spectrogram image texture
feature (SITF). The SITF was seen to be significantly more noise robust than all the
baseline features and the RSIF, but with a higher feature dimension.
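The subband GLCM idea can be sketched in plain NumPy as shown below. The number of gray levels, the number of subbands, the single horizontal pixel offset, and all function names are assumptions for illustration; the abstract does not give these settings. The rows of the image are assumed to be the frequency axis.

```python
import numpy as np

def quantize(image, levels=8):
    """Map image intensities to integer gray levels 0 .. levels-1."""
    image = np.asarray(image, dtype=float)
    span = image.max() - image.min()
    if span == 0:
        return np.zeros(image.shape, dtype=int)
    scaled = (image - image.min()) / span
    return np.minimum((scaled * levels).astype(int), levels - 1)

def glcm(gray, levels=8, offset=(0, 1)):
    """Normalized co-occurrence matrix for one non-negative (d_row, d_col) offset."""
    gray = np.asarray(gray)
    dr, dc = offset
    rows, cols = gray.shape
    ref = gray[: rows - dr, : cols - dc]   # reference pixels
    nbr = gray[dr:, dc:]                   # neighbors at the given offset
    counts = np.zeros((levels, levels))
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1)
    total = counts.sum()
    return counts / total if total > 0 else counts

def sitf(image, n_subbands=4, levels=8, offset=(0, 1)):
    """Concatenate the GLCM values of each frequency subband."""
    gray = quantize(image, levels)
    bands = np.array_split(gray, n_subbands, axis=0)  # split along frequency rows
    return np.concatenate([glcm(b, levels, offset).ravel() for b in bands])

# Toy "spectrogram image" with random intensities.
rng = np.random.default_rng(0)
texture_feature = sitf(rng.random((64, 40)))
```

With these assumed settings the feature has 4 x 8 x 8 = 256 elements, which illustrates why a GLCM-based feature tends to have a higher dimension than block statistics.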
Thirdly, the time-frequency image representation called the cochleagram is proposed
in place of the conventional spectrogram image. The cochleagram image is a variation
of the spectrogram image utilizing a gammatone filter, as used for GTCCs. The
gammatone filter offers more frequency components in the lower frequency range
with narrow bandwidth and fewer frequency components in the higher frequency
range with wider bandwidth, which better reveals the spectral information for the
sound signals considered in this work. With cochleagram feature extraction, the
spectrogram features SIF, RSIF, and SITF are referred to as CIF, RCIF, and CITF,
respectively. The use of cochleagram feature extraction was seen to improve the
classification performance under all noise conditions, with the largest improvements
at low SNRs.
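The frequency resolution described above can be illustrated with the equivalent rectangular bandwidth (ERB) scale of Glasberg and Moore, which is commonly used to place gammatone filters. Whether this work uses exactly this spacing is an assumption, as are the 50 Hz to 8000 Hz range, the 64 filters, and the function names in this sketch.

```python
import numpy as np

def erb_bandwidth(fc_hz):
    """Equivalent rectangular bandwidth in Hz at center frequency fc_hz."""
    return 24.7 * (4.37 * np.asarray(fc_hz, dtype=float) / 1000.0 + 1.0)

def erb_space(f_low, f_high, n_filters):
    """Center frequencies equally spaced on the ERB-rate scale."""
    def hz_to_erb_rate(f):
        return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

    def erb_rate_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

    e = np.linspace(hz_to_erb_rate(f_low), hz_to_erb_rate(f_high), n_filters)
    return erb_rate_to_hz(e)

# Assumed filterbank layout: 64 filters between 50 Hz and 8 kHz.
centers = erb_space(50.0, 8000.0, 64)
bandwidths = erb_bandwidth(centers)
n_below_2k = int(np.sum(centers < 2000.0))
```

Under these assumptions more than half of the center frequencies fall below 2 kHz, and the bandwidth grows steadily with center frequency, matching the description of dense, narrow filters at low frequencies and sparse, wide filters at high frequencies.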
Fourthly, feature vector combination has been seen to improve the classification
performance in a number of studies, and this work proposes a combination of
linear GTCCs and cochleagram image features. This feature combination was seen
to improve the classification performance of CIF, RCIF, and CITF and, once again,
the largest improvements were at low SNRs.
Finally, while support vector machines (SVMs) seem to be the preferred classifier in
most SER applications, deep neural networks (DNNs) are proposed in this work.
SVMs are used as the baseline classifier, and in each case the results are compared
with DNNs. Because the SVM is a binary classifier, four common multiclass
classification methods are considered: one-against-all (OAA), one-against-one (OAO),
decision directed acyclic graph (DDAG), and adaptive directed acyclic graph (ADAG).
The classification performance of all the classification methods is compared with
individual and combined features, and the training and evaluation times are also
compared. Among the multiclass SVM classification methods, the OAA method was
generally seen to be the most noise robust and gave the best overall classification
performance. However, the DNN classifier was determined to have the best noise
robustness and the best overall classification performance with both individual and
combined features. DNNs also offered the fastest evaluation time, but their training
time was determined to be the slowest.
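To make the difference between two of the multiclass schemes concrete, the sketch below contrasts OAO voting with DDAG elimination. The callable `pairwise_winner(a, b)` is a hypothetical stand-in for a trained binary SVM that decides between classes a and b for one test sample; here it is replaced by a toy rule based on made-up scores. OAA and ADAG are not shown.

```python
from itertools import combinations

def oao_vote(pairwise_winner, classes):
    """One-against-one: every class pair votes; the most-voted class wins."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise_winner(a, b)] += 1
    return max(classes, key=lambda c: votes[c])

def ddag_predict(pairwise_winner, classes):
    """DDAG: compare the first and last remaining classes, drop the loser, repeat."""
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if pairwise_winner(first, last) == first:
            remaining.pop()     # the last class is eliminated
        else:
            remaining.pop(0)    # the first class is eliminated
    return remaining[0]

# Toy stand-in for ten trained pairwise classifiers (scores are made up).
scores = dict(zip(range(10), [0.1, 0.4, 0.3, 0.9, 0.2, 0.5, 0.0, 0.7, 0.6, 0.8]))
calls = []

def toy_winner(a, b):
    calls.append((a, b))
    return a if scores[a] >= scores[b] else b

classes = list(range(10))
oao_label = oao_vote(toy_winner, classes)
oao_evaluations = len(calls)     # N * (N - 1) / 2 pairwise decisions
calls.clear()
ddag_label = ddag_predict(toy_winner, classes)
ddag_evaluations = len(calls)    # N - 1 pairwise decisions
```

For ten classes, OAO evaluates all 45 pairwise classifiers per sample, while the DDAG evaluates only 9, which is one reason evaluation times differ between the multiclass schemes.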