MoTT: A Speech Dataset for Modular Composition of Turn-Taking Conversations / G. Salada, D. Fantini, F. Avanzini, G. Presti. - In: 2025 Immersive and 3D Audio: from Architecture to Automotive (I3DA). - [s.l] : IEEE, 2025. - ISBN 979-8-3315-5828-4. - pp. 1-8. (International Conference on Immersive and 3D Audio, held in Bologna in 2025) [10.1109/i3da65421.2025.11202114].

MoTT: A Speech Dataset for Modular Composition of Turn-Taking Conversations

G. Salada; D. Fantini; F. Avanzini; G. Presti
2025

Abstract

Among the numerous speech datasets in the literature, only a minority concern conversational data, and even fewer isolate the elements that occur in turn-taking conversations. To address this gap, this paper presents MoTT, an English speech dataset composed of questions, answers, reciprocal questions, and backchannel responses recorded by eight participants. The questions and answers pertain to ten topics and were recorded in two takes. The voice directivity pattern was captured simultaneously at frontal and lateral positions by two microphones. The MoTT dataset was designed to provide interchangeable conversational elements that can be composed modularly into fictional but plausible and convincing conversations. As a result, multiple virtual speakers can engage in a turn-taking conversation that emulates real-world interactions, with spatial audio techniques used to enhance realism by arranging the speakers in the auditory scene. The dataset is a resource for studies in immersive spatial audio, human-computer interaction, and auditory scene analysis, and is well suited to experiments that require the simulation of ecologically valid conversations, such as the one described in the use case reported in this paper.
Dataset; speech; audio recording; turn-taking
Sector INFO-01/A - Computer Science
   Transforming auditory-based social interaction and communication in AR/VR (SONICOM)
   SONICOM
   EUROPEAN COMMISSION
   H2020
   101017743
2025
Book Part (author)
Files in this record:
MoTT_A_Speech_Dataset_for_Modular_Composition_of_Turn-Taking_Conversations.pdf
Access: restricted
Type: Publisher's version/PDF
License: none
Size: 8.76 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1190266