TUWHERA Open Theses & Dissertations
AUT University

Audio surveillance in unstructured environments

Sharan, Roneel Vikash
Permanent link
http://hdl.handle.net/10292/9902
Abstract
This research examines an audio surveillance application, one of the many applications of sound event recognition (SER), and aims to improve the sound recognition rate in the presence of environmental noise using time-frequency image analysis of the sound signal and deep learning methods. The sound database contains ten sound classes, each sound class having multiple subclasses with interclass similarity and intraclass diversity. Three different noise environments are added to the sound signals, and the proposed and baseline methods are tested under clean conditions and at four different signal-to-noise ratios (SNRs) in the range of 0–20 dB.

Several baseline features are considered in this work: mel-frequency cepstral coefficients (MFCCs), gammatone cepstral coefficients (GTCCs), and the spectrogram image feature (SIF), in which the sound signal spectrogram images are divided into blocks, central moments are computed in each block, and these are concatenated to form the final feature vector. Next, several methods are proposed to improve the classification performance in the presence of noise.
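The block-based moment extraction behind the SIF can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the grid size and the moment order are assumed parameters chosen for the example.

```python
import numpy as np

def sif(spec, grid=(3, 3), n_moments=3):
    """Spectrogram image feature (sketch): divide the spectrogram into a
    grid of blocks, compute the mean and central moments in each block,
    and concatenate them into one feature vector. The 3x3 grid and
    third-order moments are illustrative choices."""
    rows, cols = grid
    r_edges = np.linspace(0, spec.shape[0], rows + 1, dtype=int)
    c_edges = np.linspace(0, spec.shape[1], cols + 1, dtype=int)
    feats = []
    for i in range(rows):
        for j in range(cols):
            block = spec[r_edges[i]:r_edges[i + 1], c_edges[j]:c_edges[j + 1]]
            mu = block.mean()
            feats.append(mu)  # first-order statistic
            # central moments of order 2..n_moments (order-1 central
            # moment is always zero, so it is skipped)
            for k in range(2, n_moments + 1):
                feats.append(((block - mu) ** k).mean())
    return np.asarray(feats)
```

With these settings the feature dimension is 3 × 3 blocks × 3 statistics = 27 values per spectrogram image.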

Firstly, a variation of the SIF with reduced feature dimensions is proposed, referred to as the reduced spectrogram image feature (RSIF). The RSIF utilizes the mean and standard deviation of the central moment values along the rows and columns of the blocks, resulting in a feature dimension 2.25 times lower than that of the SIF. Despite the reduction in feature dimension, the RSIF was seen to outperform the SIF in classification performance due to its higher immunity to inconsistencies in sound signal segmentation.
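The row/column reduction described above can be sketched as follows. The 9×9 block grid here is an assumption, chosen because it reproduces the quoted 2.25× reduction (81 values per moment collapse to 36); the thesis's actual grid may differ.

```python
import numpy as np

def rsif(block_moments):
    """Reduced SIF (sketch): block_moments is a (rows, cols) map holding
    one central-moment value per spectrogram block. Instead of keeping
    all rows*cols values, keep only the mean and standard deviation of
    the moment values along each row and each column."""
    row_stats = np.concatenate([block_moments.mean(axis=1),
                                block_moments.std(axis=1)])
    col_stats = np.concatenate([block_moments.mean(axis=0),
                                block_moments.std(axis=0)])
    return np.concatenate([row_stats, col_stats])
```

For a 9×9 grid this yields 2 × (9 + 9) = 36 values per moment instead of 81, i.e. 81 / 36 = 2.25.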

Secondly, a feature based on the image texture analysis technique of the gray-level co-occurrence matrix (GLCM), which captures the spatial relationship of pixels in an image, is proposed. The GLCM texture analysis technique is applied in subbands to the spectrogram image, and the matrix values from each subband are concatenated to form the final feature vector, referred to as the spectrogram image texture feature (SITF). The SITF was seen to be significantly more noise robust than all the baseline features and the RSIF, but with a higher feature dimension.
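A minimal sketch of the subband GLCM idea is given below. The number of gray levels, the pixel offset, and the number of subbands are all assumed parameters for illustration, not the thesis's settings.

```python
import numpy as np

def glcm(img, levels=8, offset=(0, 1)):
    """Gray-level co-occurrence matrix (sketch): count how often a pixel
    of gray level i occurs next to a pixel of gray level j at the given
    (row, col) offset. The image is first quantized to `levels` gray
    levels."""
    q = (img - img.min()) / (np.ptp(img) + 1e-12)
    q = np.minimum((q * levels).astype(int), levels - 1)
    dr, dc = offset
    mat = np.zeros((levels, levels))
    for i in range(q.shape[0] - dr):
        for j in range(q.shape[1] - dc):
            mat[q[i, j], q[i + dr, j + dc]] += 1
    return mat

def sitf(spec, n_subbands=4, levels=8):
    """SITF (sketch): split the spectrogram into frequency subbands,
    compute a GLCM in each, and concatenate the matrix values into the
    final feature vector."""
    bands = np.array_split(spec, n_subbands, axis=0)
    return np.concatenate([glcm(b, levels).ravel() for b in bands])
```

With 4 subbands and 8 gray levels the feature dimension is 4 × 8 × 8 = 256, illustrating why the SITF is larger than the RSIF.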

Thirdly, a time-frequency image representation called the cochleagram is proposed in place of the conventional spectrogram image. The cochleagram image is a variation of the spectrogram image utilizing a gammatone filter, as used for GTCCs. The gammatone filter offers more frequency components with narrow bandwidths in the lower frequency range and fewer frequency components with wider bandwidths in the higher frequency range, which better reveals the spectral information for the sound signals considered in this work. With cochleagram feature extraction, the spectrogram features SIF, RSIF, and SITF are referred to as CIF, RCIF, and CITF, respectively. The use of cochleagram feature extraction was seen to improve the classification performance under all noise conditions, with the most improved results at low SNRs.
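The uneven frequency resolution of a gammatone filterbank comes from spacing the channel centre frequencies on the ERB scale. The sketch below uses the standard Glasberg–Moore ERB formula; the frequency range and number of channels are illustrative, not the thesis's settings.

```python
import numpy as np

def erb_center_freqs(low=50.0, high=8000.0, n=32):
    """Gammatone filterbank centre frequencies (sketch), spaced evenly
    on the ERB-number scale (Glasberg & Moore). This places channels
    densely with narrow bandwidths at low frequencies and sparsely with
    wide bandwidths at high frequencies."""
    erb = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)     # Hz -> ERB number
    inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3   # ERB number -> Hz
    return inv(np.linspace(erb(low), erb(high), n))
```

The spacing between successive centre frequencies grows monotonically with frequency, which is exactly the "more components at low frequency, fewer at high frequency" behaviour described above.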

Fourthly, feature vector combination has been seen to improve classification performance in a number of studies, and this work proposes a combination of linear GTCCs and cochleagram image features. This feature combination was seen to improve the classification performance of CIF, RCIF, and CITF and, once again, the most improved results were at low SNRs.
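The combination step itself is simple concatenation of the two feature vectors into one. The function and dimensions below are illustrative only.

```python
import numpy as np

def combine_features(gtcc, cif):
    """Feature-vector combination (sketch): concatenate GTCC-derived
    statistics with a cochleagram image feature into a single vector
    before classification. Dimensions are whatever each extractor
    produces."""
    return np.concatenate([np.asarray(gtcc).ravel(),
                           np.asarray(cif).ravel()])
```

In practice the two feature sets are typically scale-normalized before concatenation so that neither dominates a distance- or margin-based classifier such as an SVM.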

Finally, while support vector machines (SVMs) appear to be the preferred classifier in most SER applications, deep neural networks (DNNs) are proposed in this work. SVMs are used as the baseline classifier, and in each case the results are compared with DNNs. Since the SVM is inherently a binary classifier, four common multiclass classification methods are considered: one-against-all (OAA), one-against-one (OAO), the decision directed acyclic graph (DDAG), and the adaptive directed acyclic graph (ADAG). The classification performance of all the classification methods is compared with individual and combined features, and the training and evaluation times are also compared. Among the multiclass SVM methods, OAA was generally seen to be the most noise robust and gave the best overall classification performance. However, the DNN classifier was determined to be the most noise robust of all, with the best overall classification performance for both individual and combined features. DNNs also offered the fastest evaluation time, but the slowest training time.
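As an illustration of why multiclass strategies are needed on top of binary SVMs, the one-against-one decision rule can be sketched as majority voting over pairwise classifiers. The data layout here is assumed for the example.

```python
import numpy as np

def oao_predict(pairwise, n_classes):
    """One-against-one decision (sketch): one binary classifier is
    trained per class pair, n*(n-1)/2 in total. For one test sample,
    `pairwise` maps each pair (i, j) with i < j to a decision value:
    positive votes for class i, negative for class j. The class with
    the most votes wins."""
    votes = np.zeros(n_classes, dtype=int)
    for (i, j), d in pairwise.items():
        votes[i if d > 0 else j] += 1
    return int(np.argmax(votes))
```

For the ten sound classes in this work, OAO needs 10 × 9 / 2 = 45 binary SVMs, whereas OAA needs only 10 (one class-vs-rest machine each); DDAG and ADAG reuse the 45 pairwise machines but evaluate only n − 1 of them per test sample.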
Keywords
Sound event recognition; Time-frequency image; Support vector machines; Deep neural networks
Date
2015
Item Type
Thesis
Supervisor(s)
Moir, Tom; Collins, John
Degree Name
Doctor of Philosophy
Publisher
Auckland University of Technology

Contact Us
  • Admin

Hosted by Tuwhera, an initiative of the Auckland University of Technology Library

 

 
