
Evolving connectionist method for adaptive audiovisual speech recognition / M.N. Malcangi, G. P.. - In: EVOLVING SYSTEMS. - ISSN 1868-6478. - (2016 Jul). [Epub ahead of print] [10.1007/s12530-016-9156-6]

Evolving connectionist method for adaptive audiovisual speech recognition

M.N. Malcangi
First author;
2016

Abstract

Reliability is the primary requirement in noisy conditions and for highly variable utterances. Integrating the recognition of visual signals with the recognition of audio signals is indispensable for many applications that require automatic speech recognition (ASR) in harsh conditions. Several important experiments have shown that integrating and adapting to multiple behavioral and context information during the speech-recognition task significantly improves its success rate. By integrating audio and visual data from speech information, we can improve the performance of an ASR system by differentiating between the most critical cases of phonetic-unit mismatch that occur when processing audio or visual input alone. The evolving fuzzy neural-network (EFuNN) inference method is applied at the decision layer to accomplish this task. This is done through a paradigm that adapts to the environment by changing structure. The EFuNN’s capacity to learn quickly from incoming data and to adapt online lowers the ASR system’s complexity and enhances its performance in harsh conditions. Two independent feature extractors were developed, one for speech phonetics (listening to the speech) and the other for speech visemics (lip-reading the spoken input). The EFuNN network was trained to fuse decisions made disjointly by the audio unit and the visual unit. Our experiments have confirmed that the proposed method is reliable for developing a robust automatic speech-recognition system.
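The decision-level fusion the abstract describes can be pictured with a minimal sketch. Note this is not the paper's EFuNN: it is a hypothetical, hand-rolled illustration of the general idea that each modality emits per-class confidences and a fusion layer combines them, so that a class ambiguous in one modality (e.g. acoustically similar phonemes) can be resolved by the other (e.g. a distinctive viseme). The weighting scheme and class labels are assumptions for illustration only.

```python
# Illustrative decision-level fusion of audio and visual phonetic-unit scores.
# NOT the paper's EFuNN: a minimal sketch of fusing two recognizers' outputs.

def fuse_decisions(audio_scores, visual_scores, audio_weight=0.6):
    """Combine per-class confidence dicts from two independent recognizers.

    audio_weight in [0, 1] controls how much the acoustic modality is
    trusted relative to the visual one (lower it in noisy audio conditions).
    """
    classes = set(audio_scores) | set(visual_scores)
    fused = {
        c: audio_weight * audio_scores.get(c, 0.0)
           + (1.0 - audio_weight) * visual_scores.get(c, 0.0)
        for c in classes
    }
    # The fused decision is the class with the highest combined confidence.
    return max(fused, key=fused.get), fused

# Example: audio is ambiguous between /p/ and /b/, but lip-reading
# strongly favours the viseme consistent with /b/.
audio = {"p": 0.48, "b": 0.47, "m": 0.05}
visual = {"p": 0.20, "b": 0.70, "m": 0.10}
best, fused = fuse_decisions(audio, visual)
print(best)  # the visual evidence tips the fused decision to "b"
```

An adaptive fusion network such as an EFuNN goes further than this fixed weighting: it grows rule nodes from incoming data and adjusts the fusion mapping online, which is what the abstract credits for robustness in harsh conditions.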
Audiovisual speech recognition (AVSR); Evolving fuzzy neural network (EFuNN); Speech-to-text (STT); Decision fusion; Multimodal speech recognition
Settore INF/01 - Informatica (Computer Science)
Jul-2016
Article (author)
Files in this product:
File: art%3A10.1007%2Fs12530-016-9156-6.pdf
Access: restricted
Type: Publisher's version/PDF
Size: 2.05 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/425182
Citations
  • PMC: ND
  • Scopus: 5
  • Web of Science (ISI): 4