Influence of Virtual Reality on Voice Perception and Production
The specific aim of the clinical trial portion of the larger research project is to obtain preliminary data on the utility of voice training (resonant voice) in the VR environment compared to a traditional clinical environment, using a mixed model within- and between-subjects randomized experimental design. Independent Variables are (1) training and test condition (clinic room vs VR classroom for training); (2) visual speaker-to-listener distance (2m, 4m, and 6m for training); and (3) time point (baseline at 2 m, retention test at 4 m, and 9 m for transfer test). Dependent Variables are (a) vocal sound pressure level (SPL); and (b) spectral moments (spectral mean and standard deviation (in Hz and cents), skewness, and kurtosis). The hypothesis is that a two-way interaction will be shown between training condition and time point showing greater acquisition and transfer of voice skills following training in the VR environment than in the typical clinical environment. This series will utilize a high degree of innovation and sophisticated VR technology to identify parameters important for subsequent VR development in voice therapy, and to lay the empirical foundation for subsequent studies that build on the present work expanding both its basic science and translational value.
- Study Type: Interventional
- Study Design
- Allocation: Randomized
- Intervention Model: Sequential Assignment
- Primary Purpose: Basic Science
- Masking: None (Open Label)
- Study Primary Completion Date: August 31, 2024
This project addresses three Specific Aims (SAs). Specific Aims 1 and 2 set up many of the parameters for the clinical trial, which is addressed in Specific Aim 3. Details for the project as a whole, including the clinical trial, are as follows, copied and pasted from the grant proposal.
3.0 RESEARCH APPROACH
The overall purpose is to investigate the effects of auditory, visual, and audiovisual information on the perception and production of one's own voice, using VR as an investigation tool, and to provide preliminary data on the potential utility of VR in the voice training environment. Details regarding Specific Aims are provided in the relevant page.
3.1 Participants: For SA1 and SA2, 60 vocally healthy classroom teachers between the ages of 24 and 50 years will be recruited (see 3.2). The lower bound of this range represents the earliest age at which teachers might begin their professional teaching careers. The upper bound represents the average age of onset of menopause for women and was chosen to limit hormonal and other age-related influences in the data. All participants will take part in both SA1 and SA2, which use the same simultaneous data collection procedures with different analyses for perceptual measures (SA1) and production measures (SA2). For SA3, which is exploratory, 10 additional healthy teachers with the same characteristics will be recruited.
For all SAs, inclusion and exclusion criteria are as follows.
Inclusion: By self report: (1) K-12 classroom teacher with at least two years of teaching experience (SA1 and SA2) or elementary school classroom teacher (SA3), between 24 and 50 years of age; (2) No history of voice disorder lasting more than two weeks, and Voice Handicap Index-10 (VHI-10) [63] score < 10; (3) Lifetime non-smoker; (4) No hearing or uncorrected visual impairment; By written documentation: (5) Proof of full COVID-19 vaccination; By clinical evaluation: (6) Normal voice on days of participation, as assessed by a voice-specialized licensed SLP based on an overall severity score < 10 on the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) [65].
Exclusion: By self report: (7) History of vocal fold pathology or other pathology affecting voice; (8) Any acute condition that may affect voice production, such as coughing, nasal congestion, or temperature greater than 98.6 °F (37.0 °C).
Note that only vocally healthy teachers are assessed at this stage, before introducing the complexities associated with voice disorders. Those complexities will be addressed in later translational work that builds on the present series. The research program is ultimately relevant not only for teachers with voice problems but also for the working environment of currently healthy teachers.
3.2 Power analyses: Power analyses assumed a medium effect size of d = 0.4, two-sided, for tests of all dependent variables across SA1 and SA2. Results suggested that N = 51 will be sufficient to detect effects for all variables with a significance level of α = 0.05 and power of 0.8. Accounting for possible attrition, 60 participants will be recruited for Aims 1 and 2.
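The sample-size figure in 3.2 can be checked with a short calculation. The sketch below is illustrative only: the proposal does not state which test family or software was used, so it assumes a two-sided within-subject (one-sample or paired) comparison and uses the normal approximation with Guenther's small-sample correction. The function name `sample_size_within` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_within(d, alpha=0.05, power=0.80):
    """Approximate N for a two-sided one-sample/paired t-test.

    Assumption (not stated in the proposal): normal approximation
    plus Guenther's correction term z_alpha^2 / 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    n = ((z_alpha + z_beta) / d) ** 2 + z_alpha ** 2 / 2
    return ceil(n)


n_required = sample_size_within(0.4)
print(n_required)  # 51
```

Under these assumptions the result matches the N = 51 reported above, and recruiting 60 participants leaves room for 9 dropouts (15% of those recruited).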
This number has been shown by our Co-Investigator Bottalico and in our own more recent preliminary data to be ample to detect significant effects similar to those investigated in the present series (e.g., differences in perceptual ratings of vocal effort and comfort, as well as SPL and mean f0; Bottalico, 2017; Bottalico et al., 2016; Daşdöğen et al., unpublished data). For SA3, which is exploratory, a total of 10 participants will be recruited to obtain preliminary data for a later clinical series.
3.3 Procedures
3.3.1 SA1 and SA2: Sixty K-12 classroom teachers will be recruited through flyers posted in the community and on social media, and through direct contact with Delaware, New Jersey, Pennsylvania, and Maryland public schools, all of which may be in close proximity to the study site at the University of Delaware. Individuals who contact the PI with an interest in participating will receive an overview of the study by telephone or secure remote connection and, if they agree, will provide informed consent. Following consent, participants will be guided to an online REDCap screening questionnaire, hosted on a HIPAA-compliant server, that addresses all inclusion and exclusion criteria except for the clinical auditory-perceptual evaluation of voice (CAPE-V). Qualified participants will be scheduled for an in-lab appointment at the University of Delaware STAR campus voice lab. At the beginning of the appointment, the clinician will assess the participant's voice to confirm normal voice quality using the CAPE-V. Participants who pass this final screening step (overall CAPE-V severity score < 10) will proceed to experimental procedures. Others will be excused.
For experimental procedures, participants will first be trained in the speech tasks that will be used during the study: introducing themselves to a classroom for 15 seconds, delivering a two-minute tutorial related to their teaching expertise, sustaining the vowel /a/ for 3 seconds repeated three times, and producing the CAPE-V phrase "We were away a year ago" repeated three times. Participants will then receive instructions for the self-report questionnaire reflecting self-perceived vocal loudness, vocal effort, and vocal comfort for the speech stimuli as a set (see 6.0). Following these instructions, instrumentation will be positioned, including headset microphones, headphones, and VR glasses (see 4.0). Participants will then perform the experimental speaking tasks in each of 15 randomly ordered conditions, with and without background noise. The conditions virtually mimic the auditory and visual properties of real-world rooms ranging from a small room to a large lecture hall, with dry to highly reverberant acoustics and varying speaker-to-listener distances. Room acoustics will be prescribed to characterize and control acoustically varying conditions across VR environments (ISO 3382, see 5.0). Oral-binaural impulse response (OBRIR) measurements will be obtained in a classroom, a lecture hall, and a school auditorium on the University of Illinois Urbana-Champaign campus, where the dimensions are similar to those of the VR classrooms (see 5.0). Ovation software (Ovation VRSpeaking, LLC, NJ, US) will be used to deliver realistic VR rooms and 3D listeners (see 4.0). Experimental tasks will include a string of prompted speech utterances as for the training condition, (i) with no external audio or visual feedback or background noise (experimental baseline), (ii) in each auditory condition alone, (iii) in each visual condition alone, and (iv) in each combination of auditory and visual conditions. All conditions ii-iv will be produced with and without background noise, as described below.
For the audio-only conditions, participants will wear an eye mask to block visual information. In those conditions, to aid with communicative intent, participants will hear applause convolved with the matched room acoustic responses before they initiate each speech string. The applause will provide an audio-spatial cue about audience presence, approximately how crowded the environment is, and how far away listeners are in the environment. To enhance participants' engagement with the environment, before they speak, the examiner will ask them to estimate audience size and speaker-to-listener distance. For all visual conditions, participants will see the respective visual rooms and seated listeners who will react to the speaker in realistic ways (e.g., moving while sitting, scratching head, etc.). To further promote participant engagement, the examiner will ask each participant approximately how many people are in the environment and how far away they are. In all conditions, participants will be prompted to "speak so that everyone can understand you." As noted, all conditions will be carried out with and without background noise delivered during participants' speech. The noise level will correspond to the level representative of a typical classroom environment (average of 54 dB) [66]. After participants have produced speech utterances in each condition, they will remove the VR goggles or eye mask and will be asked to complete the questionnaire about self-reported loudness, vocal effort, and vocal comfort for the preceding utterances. The questionnaire will be displayed on a computer screen that allows for digital responses on visual analog scales (VAS; see 6.0). Participants will then proceed to the next VR condition, and so forth, until data collection for all conditions has been completed, thereby concluding the session. The total duration of the session is expected to be about 120 minutes.
3.3.2 SA3: Participants will be 10 classroom teachers.
Following satisfaction of inclusion and exclusion criteria, with the exception of clinical auditory evaluation of voice, qualifying participants will present to the STAR voice lab for voice evaluation as for SA1 and SA2. Participants who pass the voice screening will then be fitted with relevant instrumentation (3.3.1; 4.0) and will produce the same utterance strings as for SA1 and SA2, with a speaker-to-target distance of 4 m, specified by a physical mannequin. Then, participants will be randomly allocated to one of two training conditions: traditional clinical room or VR environment. In their respective conditions, participants will receive training in "resonant voice," a therapeutic voicing pattern that has value for healthy speakers as well [68-71]. Training will be provided by a speech-language pathologist who has at least two years' experience in voice disorders and who has completed standardized training in Lessac-Madsen Resonant Voice Therapy (LMRVT) [26]. Throughout training in both environments, background noise will be presented as for SA1 and SA2, in free field for the traditional environment and over headphones in the VR environment. In both training conditions, materials from Session Two of LMRVT will be used; this is the first session in that program in which actual voice training begins. For the traditional clinic room condition, after 30 minutes of LMRVT training using Session Two materials, participants will be guided to repeat the same exercises, in order, from Session Two for 5 minutes, with instructions to produce voice as if speaking to a person positioned 2 meters from the speaker, represented by a physical mannequin. Then, participants will repeat the same exercises again, as if speaking to the mannequin positioned at 4 meters for 5 minutes, and finally, at 6 meters for 5 minutes. Following training at each of these distances, participants will repeat baseline utterances, which will be recorded for audio data collection.
For the VR environment condition, participants will receive the same LMRVT training as for the traditional room, only in a VR Classroom (Room 1; Table 1) as representative of real-life classroom conditions, created for SA1 and SA2 but not dependent on results for those Aims. Participants will receive training in LMRVT Session Two exercises for 30 minutes, followed by repetition of the same exercises from Session Two with instructions to produce relevant utterances speaking to the listeners in the VR environment positioned at 2 meters, 4 meters, and 6 meters from the speaker for 5 minutes at each distance. As for the traditional environment, audio recordings will be made using baseline utterances following training at each distance. After all training has been completed, participants in both conditions will be guided to a standard classroom in the STAR setting (Room 513; volume ~2440 m³, floor area ~69 m²). In that setting, participants will be asked to repeat baseline speech tasks speaking to live seated listeners positioned at 4 meters for a retention test and at a novel distance of 9 meters for a transfer test, and recordings will be made as previously. Participants will then be excused. Total duration is about 90 minutes. Brief training sessions on the order of about 30 minutes have been shown to produce shifts in voice production [94], cohering with the PI's extensive clinical experience. We thus expect to find such shifts in the present series, which prepares the foundations for more extended longitudinal studies appropriate for a planned R01 growing from the present work.
4.0 Equipment: A digital audio workstation (Reaper Version 6.36, Rosendale, NY, US) and a head-mounted microphone (AKG C 520, Harman) will be used to capture voice signals for all SAs. Audio recordings will be sampled at 44.1 kHz.
The mic-to-mouth distance will be 5 cm, with the microphone positioned at a 45° angle from the participant's mouth [72]. The microphone will be connected to an audio interface (Babyface Pro FS 24-Channel USB 2.0, Heimhausen, Germany), and the combined input/output latency will be less than 5 ms, which is below the range of a noticeable echo (16 to 26 ms) [73]. The interface will be connected to a computer running the Reaper audio workstation to create the audio rendering. Virtual reality glasses (Oculus Rift S) will be used to present visual information (rooms and 3D avatar listeners). Room volume images and listeners will be provided using Ovation software (https://www.ovationvr.com/). The software allows for the selection of multiple classroom environments that replicate real-world examples, and for speaking to hundreds of digitally generated 3D audiences (real people) who respond to the speaker by smiling, clapping, or moving. A recent study has reported on the effectiveness of this software in creating real-world visual scenarios [92]. The microphone will be calibrated following published procedures [93]. VR glasses will be optimally positioned for each participant individually. All utterances will be saved as wav files in an encrypted folder.
5.0 Room acoustics measurements and audio rendering: Rooms similar in size to the ones selected in Ovation software (a classroom, a lecture hall, and a school auditorium) will be selected within the University of Illinois Urbana-Champaign campus (by the Co-I Bottalico). The rooms will be acoustically characterized following ISO 3382. At the position where a speaker is typically located in each room, oral-binaural impulse responses (OBRIRs) will be measured with a Head and Torso Simulator (HATS, GRAS 45BB KEMAR) using the convolution method, following published methods [91]. Specifically, an exponential sweep signal emitted via the mouth of the HATS will be recorded at the HATS' ears.
The convolution between the recorded sweeps (at the HATS' ears) and the inverse of the emitted sweep (by the HATS' mouth) will generate the OBRIRs. The OBRIRs will be used to acoustically recreate the rooms considering the ears-mouth path of the speaker. Real-time audio rendering of the subject's voice in the virtual room acoustics will be accomplished through the use of real-time convolution plug-ins, such as Anaglyph and RoomZ, developed by our consultant Katz at Sorbonne University [74]. The convolution engine will employ measured room impulse responses. The virtual acoustic rendering will be played back to the participant over open-back headphones (HD 660S, Sennheiser, Wedemark, Germany), limiting coloration from hearing one's own voice.
6.0 Measures: For SA1, self-report perception measures will be derived from three separate questions about vocal loudness, vocal effort, and vocal comfort, using a Visual Analog Scale (VAS; Table 2). After each study condition, VR glasses will be removed and participants will complete the VAS using HIPAA-compliant REDCap. The REDCap perception questionnaire will be displayed on a computer screen. Participants will respond to each perception question sequentially by moving a slider on a scale from 0 (not at all) to 100 (extreme[ly]). Each response will be captured numerically in the REDCap database. Questionnaire completion will take approximately two minutes, which will provide a rest period after the preceding study condition and help minimize potential vocal fatigue. For SA2 and SA3, instrumented measures of voice will include vocal sound pressure level (SPL) and spectral moments (see 7.0).
7.0 Data extraction and analysis: Extraction of speech parameters will be performed with Matlab R2021b (MathWorks, Natick, MA, United States) and Praat (version 6.2.14). For each of the recordings, a time history of SPL [90] and fundamental frequency, f0, will be obtained with a time step of 0.05 s.
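As a rough illustration of an SPL time history with a 0.05 s step, the sketch below computes one level value per non-overlapping frame. It is written in Python for illustration only (the study itself names Matlab and Praat) and rests on several assumptions not given in the source: the samples are already calibrated in pascals, the reference pressure is 20 µPa, and frames do not overlap. The function name `spl_time_history` is hypothetical.

```python
from math import log10, sqrt

P_REF = 20e-6  # assumed reference pressure in pascals (20 µPa)


def spl_time_history(samples_pa, fs, step_s=0.05):
    """Return one SPL value (dB) per non-overlapping frame of step_s seconds.

    Assumes samples_pa holds calibrated sound pressure in pascals.
    A trailing partial frame is discarded.
    """
    frame_len = int(round(fs * step_s))
    levels = []
    for start in range(0, len(samples_pa) - frame_len + 1, frame_len):
        frame = samples_pa[start:start + frame_len]
        rms = sqrt(sum(x * x for x in frame) / frame_len)
        # A silent frame has no finite level; report negative infinity.
        levels.append(20 * log10(rms / P_REF) if rms > 0 else float("-inf"))
    return levels


# Toy input: 0.1 s of constant 0.02 Pa pressure sampled at 1 kHz.
levels = spl_time_history([0.02] * 100, fs=1000)
print(len(levels), [round(v, 6) for v in levels])
```

At the 44.1 kHz sampling rate stated in 4.0, each 0.05 s frame would contain 2205 samples.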
The f0 will be estimated with an acoustic periodicity detection algorithm based on an accurate autocorrelation method. This method is more accurate, noise-resistant, and robust than other methods based on the cepstrum or combs, or the original autocorrelation methods. For the two time histories, statistical moments will be calculated (spectral mean, standard deviation, skewness, and kurtosis). The ability of spectral moments to distinguish between different degrees of vocal effort has been reported previously [76]. These measures will quantitatively assess key spectral contributions that may be associated with potential changes in vocal quality, in relation to vocal effort and comfort. However, perceptual measures of voice quality will not be made in this series.
8.0 Statistical analyses: For all SAs, Linear Mixed-Effects (LME) models (Matlab R2021b) will be fitted by restricted maximum likelihood (REML). The dependent and independent variables of these models are listed in the SA section. Participant ID will be used as a random effects term. Here, the random effect of "Participant ID" refers to the partial pooling of observations by participant ID, with the slope and intercept of each participant ID being random. Models will be selected based on the Akaike information criterion and the results of likelihood ratio tests. Tukey's post-hoc pair-wise comparisons will be performed to examine the differences between all levels of the fixed factors of interest when they have more than two levels (in this case, both the audio and the visual environments). These are pair-wise z tests, where the z statistic represents the difference between an observed statistic and its hypothesized population parameter in units of the standard deviation. The p-values for these tests will be adjusted using the default single-step method.
The LME output will include the estimates of the fixed effects coefficients, the standard error associated with the estimate, the degrees of freedom (df), the test statistic (t), and the p-value. The Satterthwaite method will be used to approximate degrees of freedom and calculate p-values.
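The four spectral moments named in 7.0 can be illustrated with a short calculation over a spectrum. This is a sketch under stated assumptions rather than the study's extraction pipeline: it weights each frequency bin by its power (the source does not say whether magnitude or power weighting is used), reports non-excess kurtosis, and converts Hz to cents relative to a caller-supplied reference frequency. The names `spectral_moments` and `hz_to_cents` are hypothetical.

```python
from math import log2, sqrt


def spectral_moments(freqs_hz, power):
    """Return (mean, sd, skewness, kurtosis) of a power-weighted spectrum.

    Assumption: bins are weighted by power; kurtosis is non-excess
    (a normal distribution would give 3). Undefined if sd is zero.
    """
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    mean = sum(f * p for f, p in zip(freqs_hz, power)) / total

    def central(k):
        # k-th central moment of frequency, weighted by power
        return sum((f - mean) ** k * p for f, p in zip(freqs_hz, power)) / total

    var = central(2)
    sd = sqrt(var)
    return mean, sd, central(3) / sd ** 3, central(4) / var ** 2


def hz_to_cents(f_hz, ref_hz):
    """Express a frequency in cents relative to ref_hz (an assumed reference)."""
    return 1200 * log2(f_hz / ref_hz)


# Toy symmetric spectrum with three bins.
mean, sd, skew, kurt = spectral_moments([100, 200, 300], [1, 2, 1])
print(mean, round(sd, 3), skew, kurt)
```

For the toy spectrum the mean is 200 Hz and the skewness is 0, as expected for a symmetric distribution; expressed relative to 100 Hz, that mean lies 1200 cents (one octave) above the reference.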
- Behavioral: Voice training
- Traditional vs. virtual reality
Arms, Groups and Cohorts
- Experimental: Traditional intervention
- Participants will receive traditional training in voice, in a standard clinical environment.
- Experimental: Virtual reality intervention
- Participants will receive voice training under virtual reality conditions.
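The random allocation of SA3 participants to the two arms could be implemented along the following lines. This is only a sketch: the record does not describe the randomization procedure, so the balanced 5/5 split, the seeded generator, and the function name `allocate` are all assumptions made for illustration.

```python
import random


def allocate(participant_ids, arms=("Traditional", "Virtual reality"), seed=None):
    """Shuffle participants, then deal them to arms in turn.

    Assumption: equal-sized arms are wanted; the source only says
    allocation is random.
    """
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: arms[i % len(arms)] for i, pid in enumerate(ids)}


allocation = allocate(range(1, 11), seed=2024)
for pid in sorted(allocation):
    print(pid, allocation[pid])
```

Dealing shuffled IDs in turn keeps the arms the same size, which matters with only 10 participants; simple coin-flip assignment could easily produce a lopsided split.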
Clinical Trial Outcome Measures
- Self-reported loudness
- Time Frame: 2 weeks
- Self-reported loudness
- Sound pressure level
- Time Frame: 2 weeks
- Sound pressure level
- Self-reported vocal effort and comfort
- Time Frame: 2 weeks
- Self-reported vocal effort and comfort
- Spectral moments
- Time Frame: 2 weeks
- Spectral mean, standard deviation, skewness, and kurtosis
Participating in This Clinical Trial
Inclusion Criteria:
By self report: (1) K-12 classroom teacher with at least two years of teaching experience (SA1 and SA2) or elementary school classroom teacher (SA3), between 24 and 50 years of age; (2) No history of voice disorder lasting more than two weeks, and Voice Handicap Index-10 (VHI-10) [63] score < 10; (3) Lifetime non-smoker; (4) No hearing or uncorrected visual impairment; By written documentation: (5) Proof of full COVID-19 vaccination; By clinical evaluation: (6) Normal voice on days of participation, as assessed by a voice-specialized licensed SLP based on an overall severity score < 10 on the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) [65].
Exclusion Criteria:
By self report: (7) History of vocal fold pathology or other pathology affecting voice; (8) Any acute condition that may affect voice production, such as coughing, nasal congestion, or temperature greater than 98.6 °F (37.0 °C). Note that only vocally healthy teachers are assessed at this stage, before introducing the complexities associated with voice disorders. Those complexities will be addressed in later translational work that builds on the present series. The research program is ultimately relevant not only for teachers with voice problems but also for the working environment of currently healthy teachers.
Gender Eligibility: Female
Individual self representing as female.
Minimum Age: 24 Years
Maximum Age: 50 Years
Are Healthy Volunteers Accepted: Accepts Healthy Volunteers
- Lead Sponsor
- University of Delaware
- Umit Dasdogen
- Provider of Information About this Clinical Study
- Principal Investigator: Katherine Verdolini, Professor, Communication Sciences and Disorders, Linguistics and Cognitive Science – University of Delaware
- Overall Contact(s)
- Katherine Verdolini Abbott, PhD, 302-831-0956, email@example.com
Clinical trials entries are delivered from the US National Institutes of Health and are not reviewed separately by this site. Please see the identifier information above for retrieving further details from the government database.