Final: Finite State Methods with Modern Neural Architectures for Speech Applications and Beyond

Aug. 4, 2023
Duration: 02:47:22


Many advanced technologies such as Voice Search, assistant devices (e.g. Alexa, Cortana, Google Home, ...) or spoken machine translation systems use speech signals as input. These systems are built in one of two ways:

  • End-to-end: a single system (usually a deep neural network) is built with the speech signal as input and the target signal as final output (for example spoken English as input and French text as output). While this approach greatly simplifies the overall design of the system, it comes with two significant drawbacks:
    • lack of modularity: no sub-component can be modified independently or reused in another system
    • large data requirements: supervised task-specific data (input-output pairs) is hard to collect
  • Cascade: a separately built ASR system first converts the speech signal into text, and that text is then passed to a back-end system. This approach greatly improves the modularity of the individual components of the pipeline and drastically reduces the need for task-specific data. The main disadvantages are:
    • ASR output is noisy: the downstream network is usually fed the 1-best hypothesis of the ASR system, which is prone to errors (uncertainty is discarded)
    • Separate optimization: each module is optimized separately, and joint training of the whole pipeline is nearly impossible because we cannot differentiate through the ASR best path
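The last point can be illustrated with a toy sketch (the hypothesis strings and scores below are made up for illustration): picking the 1-best hypothesis is an argmax, which is piecewise constant and blocks gradients, whereas a softmax posterior over the same hypothesis scores is smooth and preserves uncertainty — this is one way a lattice-style representation can stay differentiable.

```python
import math

# Hypothetical ASR output: three competing hypotheses with log-scores.
hyps = ["a b", "a p", "o b"]
log_scores = [2.0, 1.0, 0.5]

# Cascade baseline: feed only the 1-best hypothesis downstream.
# argmax is piecewise constant in the scores, so no gradient flows through it.
best = hyps[max(range(len(hyps)), key=lambda i: log_scores[i])]

# Differentiable alternative: keep every hypothesis, weighted by its
# posterior probability (a numerically stable softmax over log-scores).
m = max(log_scores)
exps = [math.exp(s - m) for s in log_scores]
z = sum(exps)
posteriors = [e / z for e in exps]
```

A downstream model consuming the full `posteriors` distribution (or a weighted lattice) instead of `best` can receive gradients with respect to the ASR scores, which is what makes joint training of the pipeline conceivable.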


In this project we seek a speech representation interface that combines the advantages of both the end-to-end and cascade systems while suffering from the drawbacks of neither.

Tags: deep nets AI information retrieval computer science jsalt linear algebra nlp workshop