Application of a Novel Audio Compression Scheme in Automatic Music Recommendation, Digital Rights Management and Audio Fingerprinting

Rapid progress in audio compression technology has contributed to the explosive growth of music available in digital form today. In a reversal of ideas, this work makes use of a recently proposed efficient audio compression scheme to develop three important applications in the context of Music Information Retrieval (MIR) for the effective manipulation of large music databases, namely automatic music recommendation (AMR), digital rights management (DRM) and audio finger-printing for song identification. The performance of these three applications has been evaluated with respect to a database of songs collected from a diverse set of genres.

Speaker Identification by Joint Statistical Characterization in the Log Gabor Wavelet Domain

Real world Speaker Identification (SI) application differs from ideal or laboratory conditions causing perturbations that leads to a mismatch between the training and testing environment and degrade the performance drastically. Many strategies have been adopted to cope with acoustical degradation; wavelet based Bayesian marginal model is one of them. But Bayesian marginal models cannot model the inter-scale statistical dependencies of different wavelet scales. Simple nonlinear estimators for wavelet based denoising assume that the wavelet coefficients in different scales are independent in nature. However wavelet coefficients have significant inter-scale dependency. This paper enhances this inter-scale dependency property by a Circularly Symmetric Probability Density Function (CS-PDF) related to the family of Spherically Invariant Random Processes (SIRPs) in Log Gabor Wavelet (LGW) domain and corresponding joint shrinkage estimator is derived by Maximum a Posteriori (MAP) estimator. A framework is proposed based on these to denoise speech signal for automatic speaker identification problems. The robustness of the proposed framework is tested for Text Independent Speaker Identification application on 100 speakers of POLYCOST and 100 speakers of YOHO speech database in three different noise environments. Experimental results show that the proposed estimator yields a higher improvement in identification accuracy compared to other estimators on popular Gaussian Mixture Model (GMM) based speaker model and Mel-Frequency Cepstral Coefficient (MFCC) features.

Improved Text-Independent Speaker Identification using Fused MFCC and IMFCC Feature Sets based on Gaussian Filter

A state of the art Speaker Identification (SI) system requires a robust feature extraction unit followed by a speaker modeling scheme for generalized representation of these features. Over the years, Mel-Frequency Cepstral Coefficients (MFCC) modeled on the human auditory system has been used as a standard acoustic feature set for speech related applications. On a recent contribution by authors, it has been shown that the Inverted Mel- Frequency Cepstral Coefficients (IMFCC) is useful feature set for SI, which contains complementary information present in high frequency region. This paper introduces the Gaussian shaped filter (GF) while calculating MFCC and IMFCC in place of typical triangular shaped bins. The objective is to introduce a higher amount of correlation between subband outputs. The performances of both MFCC & IMFCC improve with GF over conventional triangular filter (TF) based implementation, individually as well as in combination. With GMM as speaker modeling paradigm, the performances of proposed GF based MFCC and IMFCC in individual and fused mode have been verified in two standard databases YOHO, (Microphone Speech) and POLYCOST (Telephone Speech) each of which has more than 130 speakers.

Improved Closed Set Text-Independent Speaker Identification by Combining MFCC with Evidence from Flipped Filter Banks

A state of the art Speaker Identification (SI) system requires a robust feature extraction unit followed by a speaker modeling scheme for generalized representation of these features. Over the years, Mel-Frequency Cepstral Coefficients (MFCC) modeled on the human auditory system has been used as a standard acoustic feature set for SI applications. However, due to the structure of its filter bank, it captures vocal tract characteristics more effectively in the lower frequency regions. This paper proposes a new set of features using a complementary filter bank structure which improves distinguishability of speaker specific cues present in the higher frequency zone. Unlike high level features that are difficult to extract, the proposed feature set involves little computational burden during the extraction process. When combined with MFCC via a parallel implementation of speaker models, the proposed feature set outperforms baseline MFCC significantly. This proposition is validated by experiments conducted on two different kinds of public databases namely YOHO (microphone speech) and POLYCOST (telephone speech) with Gaussian Mixture Models (GMM) as a Classifier for various model orders.

Speech Enhancement by Marginal Statistical Characterization in the Log Gabor Wavelet Domain

This work presents a fusion of Log Gabor Wavelet (LGW) and Maximum a Posteriori (MAP) estimator as a speech enhancement tool for acoustical background noise reduction. The probability density function (pdf) of the speech spectral amplitude is approximated by a Generalized Laplacian Distribution (GLD). Compared to earlier estimators the proposed method estimates the underlying statistical model more accurately by appropriately choosing the model parameters of GLD. Experimental results show that the proposed estimator yields a higher improvement in Segmental Signal-to-Noise Ratio (S-SNR) and lower Log-Spectral Distortion (LSD) in two different noisy environments compared to other estimators.

In Search of an SVD and QRcp Based Optimization Technique of ANN for Automatic Classification of Abnormal Heart Sounds

Artificial Neural Network (ANN) has been extensively used for classification of heart sounds for its discriminative training ability and easy implementation. However, it suffers from overparameterization if the number of nodes is not chosen properly. In such cases, when the dataset has redundancy within it, ANN is trained along with this redundant information that results in poor validation. Also a larger network means more computational expense resulting more hardware and time related cost. Therefore, an optimum design of neural network is needed towards real-time detection of pathological patterns, if any from heart sound signal. The aims of this work are to (i) select a set of input features that are effective for identification of heart sound signals and (ii) make certain optimum selection of nodes in the hidden layer for a more effective ANN structure. Here, we present an optimization technique that involves Singular Value Decomposition (SVD) and QR factorization with column pivoting (QRcp) methodology to optimize empirically chosen over-parameterized ANN structure. Input nodes present in ANN structure is optimized by SVD followed by QRcp while only SVD is required to prune undesirable hidden nodes. The result is presented for classifying 12 common pathological cases and normal heart sound.