Analysis of the Effect of Spectral Feature Combinations on Speech Emotion Recognition Accuracy

Authors

  • Mifta Nur Farid, Institut Teknologi Kalimantan
  • Arya Fatur Rahman, Institut Teknologi Kalimantan
  • Himawan Wicaksono, Institut Teknologi Kalimantan

DOI:

https://doi.org/10.37034/jsisfotek.v5i2.234

Keywords:

Speech Emotion Recognition, LSTM, MFCC

Abstract

Many studies on speech emotion recognition have reported widely varying accuracies, differences that stem from the datasets, features, and classification models used. Among these factors, the choice of features has the greatest influence on recognition accuracy. This study therefore explores combinations of spectral features and their effect on the accuracy of speech emotion recognition. The combinations are built from low-level descriptor (LLD) spectral features, namely mel-frequency cepstral coefficients (MFCC), chroma, mel-spectrogram, spectral contrast, spectral bandwidth, and tonnetz, together with high-level statistical function (HSF) features, namely the mean, standard deviation, interquartile range, skewness, and kurtosis of those LLDs. A long short-term memory (LSTM) network is used as the classifier. Across all tested LLD and HSF combinations, the pairing of MFCC with spectral contrast yields the highest accuracy and unweighted average recall (UAR), and removing MFCC causes both metrics to drop significantly. The results also show that using more features does not necessarily improve accuracy and UAR: which features are used matters more than how many.
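As an illustration of the feature pipeline described in the abstract, the sketch below shows one way to compute the listed LLD spectral features, collapse each coefficient into the five HSF statistics, and score predictions with UAR. It is a minimal sketch assuming librosa, scipy, and scikit-learn; the 16 kHz sample rate, 40 MFCC coefficients, and function names are illustrative assumptions, not the authors' exact configuration.

    # Minimal sketch of the LLD + HSF feature extraction described above (illustrative, not the authors' code).
    import numpy as np
    import librosa
    from scipy import stats
    from sklearn.metrics import recall_score

    def extract_lld(path, sr=16000, n_mfcc=40):
        """Return a (n_coefficients, n_frames) matrix of stacked LLD spectral features."""
        y, sr = librosa.load(path, sr=sr)
        feats = [
            librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc),   # MFCC
            librosa.feature.chroma_stft(y=y, sr=sr),            # chroma
            librosa.feature.melspectrogram(y=y, sr=sr),         # mel-spectrogram
            librosa.feature.spectral_contrast(y=y, sr=sr),      # spectral contrast
            librosa.feature.spectral_bandwidth(y=y, sr=sr),     # spectral bandwidth
            librosa.feature.tonnetz(y=y, sr=sr),                 # tonnetz
        ]
        # All features above share librosa's default hop length, so frame counts match.
        return np.vstack(feats)

    def extract_hsf(lld):
        """Collapse each LLD coefficient over time into the five HSF statistics."""
        return np.concatenate([
            lld.mean(axis=1),             # mean
            lld.std(axis=1),              # standard deviation
            stats.iqr(lld, axis=1),       # interquartile range
            stats.skew(lld, axis=1),      # skewness
            stats.kurtosis(lld, axis=1),  # kurtosis
        ])

    def uar(y_true, y_pred):
        """UAR (unweighted average recall) is macro-averaged recall over the emotion classes."""
        return recall_score(y_true, y_pred, average="macro")

The frame-level LLD matrix would feed a sequence model such as the LSTM used in the paper, while the fixed-length HSF vector suits a per-utterance classifier.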

Published

29-06-2023

How to Cite

Farid, M. N., Rahman, A. F., & Wicaksono, H. (2023). Analisis Pengaruh Kombinasi Fitur Spektral terhadap Tingkat Akurasi Speech Emotion Recognition. Jurnal Sistim Informasi Dan Teknologi, 5(2), 120–129. https://doi.org/10.37034/jsisfotek.v5i2.234
