Project 7: Integration of multiple units in computational models

One of the challenges of using fine phonetic detail (FPD) in automatic speech recognition (ASR) is how to incorporate long-term structures, which represent speech dynamics and suprasegmental processes, into existing recognizers that employ short-term representations. The frame-based nature of most current work in ASR is at odds with the richer, multiple tree-based representations implied by approaches such as Polysp. Indeed, in ASR, suprasegmental information is typically seen as a source of distracting variation rather than as valuable information. The purpose of this project is to attempt to develop an effective statistical framework for ASR that is capable of exploiting the information available at multiple time scales. The study has two components. In the first, researchers will build on existing work at Naples and Nijmegen on novel speech feature representations and ASR architectures. The second, equally adventurous, will examine multimodal FPD.

1. Multiple units in ASR

Naples studies multilevel stochastic architectures that allow parallel analysis of speech at different time scales, each making use of different feature sets. Parallel, hierarchical and factorial HMMs are being developed specifically to model speech dynamics. Nijmegen uses a different set of features, but has similar aims. Project researchers will spend time in each other’s labs to better understand the differences between the two approaches, and will construct hybrid architectures, which will be evaluated using a number of different features based on the modulation spectrogram and the temporal evolution of energy and fundamental frequency.
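To make the idea of parallel analysis at different time scales concrete, here is a minimal sketch that pairs a short-term energy feature with a smoothed, longer-term view of the same contour. It is an illustration only and not the Naples or Nijmegen method: the function names, the 25 ms frame, the 10 ms hop and the 21-frame smoothing window are all assumptions chosen for the example.

```python
import math


def frame_log_energy(samples, frame_len, hop):
    """Log energy in dB of successive frames (the short time scale)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # A small floor avoids log10(0) on silent frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def smooth(track, width):
    """Centred moving average over `width` frames (a longer time scale)."""
    half = width // 2
    out = []
    for i in range(len(track)):
        window = track[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out


def two_scale_features(samples, rate=16000):
    """Return one (short_term, long_term) energy pair per frame.

    All settings are assumed values for illustration:
    25 ms frames, 10 ms hop, smoothing over 21 frames (about 210 ms).
    """
    frame_len = rate * 25 // 1000
    hop = rate * 10 // 1000
    short = frame_log_energy(samples, frame_len, hop)
    long_term = smooth(short, width=21)
    return list(zip(short, long_term))
```

Each frame then carries two parallel streams, which a multilevel model could in principle treat as separate observation sequences. Features based on fundamental frequency or the modulation spectrogram would need a pitch tracker or a filterbank and are left out of this sketch.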

2. Audiovisual FPD

In noisy conditions, it is known that listeners become observers, with their eyes tracking the interlocutor’s lips. Indeed, the speech-reading benefit has been estimated as equivalent to a reduction in the noise level of 15 dB. However, the way that auditory and visual signals interact with respect to fine phonetic detail has not been studied to date. The purpose of this part of the project is twofold: (i) to examine whether the generation and use of FPD differs when the receiver has access to visual information, and (ii) to integrate visual information, primarily from the lips and jaw, into the recognition process. This subproject will draw on existing expertise in audiovisual speech recognition at Sheffield, in conjunction with the work on multilevel statistical architectures detailed above.
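To give a sense of scale for the 15 dB figure, the short sketch below converts a decibel difference into linear ratios using the standard definitions. The function names are hypothetical and chosen only for this example.

```python
def db_to_power_ratio(db):
    """Convert a level difference in dB to a ratio of powers."""
    return 10.0 ** (db / 10.0)


def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to a ratio of amplitudes."""
    return 10.0 ** (db / 20.0)


if __name__ == "__main__":
    print(round(db_to_power_ratio(15), 1))      # about 31.6
    print(round(db_to_amplitude_ratio(15), 2))  # about 5.62
```

On these definitions, a 15 dB reduction corresponds to cutting the noise power by a factor of roughly 32, or the noise amplitude by a factor of roughly 5.6.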

Young researchers

One early-stage researcher (ESR), Ludusan, is based at Naples. He will probably make extended visits to Nijmegen for (1) above and to Sheffield for (2). Other possible visits are to Aix, Trondheim or Cambridge for phonetics, and/or to Bristol for hierarchical processing, as interests and progress dictate.

Links

This project links to Project 5, since suprasegmental and multimodal cues are known to be especially important in adverse conditions. For instance, f0 variation helps the target speech to stand out against other speech sources in the background, while energy modulations at the rate of syllables and above help speech to resist the masking effects of both stationary and non-stationary noise. Visual cues, for their part, are unaffected by acoustic noise.
This project has the potential to integrate ‘prosodic’ and ‘segmental’ phoneticians, phonetic Conversation Analysts, intonational phonologists, computer scientists/engineers and computational and experimental psychologists.

Working on this project:
» Dr Francesco Cutugno
» Dr Jon Barker
» Dr Louis ten Bosch
» Dr Anna Corazza
» Bogdan Ludusan
» Dr Gianpaolo Coro
» Dr Jonas Beskow
» Prof Rolf Carlson
» Prof David House
» Prof Björn Granström
