Unsupervised feature selection and NMF de-noising for robust Speech Emotion Recognition

Bandela S. R., Kumar T. K.

APPLIED ACOUSTICS, vol.172, 2021 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 172
  • Publication Date: 2021
  • Doi Number: 10.1016/j.apacoust.2020.107645
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, PASCAL, Communication & Mass Media Index, Compendex, ICONDA Bibliographic, INSPEC, DIALNET
  • Keywords: Speech Emotion Recognition, Feature optimization, Unsupervised feature selection, Speech de-noising, NMF, ENHANCEMENT, SPARSE
  • Istanbul University Affiliated: No


Speech feature fusion is the most commonly used approach for improving accuracy in Speech Emotion Recognition (SER). Its drawback is that it increases the complexity of the SER system in terms of processing time. In addition, some of the fused features may be redundant: they do not contribute to SER and can lead to incorrect emotion predictions and reduced accuracy. To overcome this problem, this paper applies unsupervised feature selection to a feature set that combines the INTERSPEECH 2010 paralinguistic features, Gammatone Cepstral Coefficients (GTCC) and Power Normalized Cepstral Coefficients (PNCC). Feature Selection with Adaptive Structure Learning (FSASL), Unsupervised Feature Selection with Ordinal Locality (UFSOL) and a novel Subset Feature Selection (SuFS) algorithm are the feature dimension reduction techniques used to obtain better SER performance. The proposed SER system is analyzed in both clean and noisy environments and evaluated on the EMO-DB and IEMOCAP emotion databases. For the noise analysis, clean speech is corrupted with different noises from the Aurora noise database and with white Gaussian noise at Signal-to-Noise Ratio (SNR) levels from -5 dB to 20 dB. A Support Vector Machine (SVM) classifier with linear and Radial Basis Function (RBF) kernels is used, with 10-fold cross-validation and hold-out validation, and with classification accuracy and computation time as the performance metrics. The results show that the proposed SER system outperforms the baseline SER system, as well as many existing works in the literature, in both clean and noisy conditions. For SNR levels above 15 dB, the proposed SER system in the presence of different noises performs the same as in clean environments, whereas for SNRs below 15 dB the performance is likely to be reduced. Therefore, to overcome this drawback and improve SER performance in noisy conditions, a dense Non-Negative Matrix Factorization (denseNMF) method is adopted to de-noise the noisy speech signal prior to SER, achieving noise robustness. © 2020 Elsevier Ltd. All rights reserved.
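
The abstract describes corrupting clean speech with noise at SNR levels from -5 dB to 20 dB. The following is a minimal sketch of how such mixing is commonly done, not the authors' code. The function names `add_noise_at_snr` and `measure_snr_db` are hypothetical, the 220 Hz sine is only a stand-in for a speech signal, and the 5 dB step between levels is an assumption, since the abstract gives only the range.

```python
import numpy as np


def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Tile or trim the noise so it matches the clean signal length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Required noise power is p_clean / 10^(snr_db / 10); scale amplitude accordingly.
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise


def measure_snr_db(clean, noisy):
    """Measure the SNR in dB of `noisy` relative to the known `clean` signal."""
    residual = np.asarray(noisy, dtype=float) - np.asarray(clean, dtype=float)
    return 10.0 * np.log10(np.mean(np.square(clean)) / np.mean(residual ** 2))


rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2.0 * np.pi * 220.0 * t)  # stand-in for one second of speech (assumption)
white = rng.standard_normal(16000)       # white Gaussian noise

# SNR range from the abstract; the 5 dB step is an assumption.
for target in (-5, 0, 5, 10, 15, 20):
    noisy = add_noise_at_snr(clean, white, target)
    print(target, round(measure_snr_db(clean, noisy), 2))
```

Scaling by signal power rather than by peak amplitude keeps the achieved SNR equal to the target regardless of the noise recording's original level.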