From speech technology to big data phonetics and phonology: a win-win paradigm


  • Martine Adda-Decker (CNRS LPP, Université Sorbonne Nouvelle, France)
  • Ioana Chitoran (Université de Paris, France)
  • Adèle Jatteau (Université de Lille, France)
  • Mathilde Hutin (CNRS LIMSI, Université Paris-Saclay, France)
  • Lori Lamel (CNRS LIMSI, Université Paris-Saclay, France)
  • Mark Liberman (University of Pennsylvania, USA)
  • Peggy Renwick (University of Georgia, Athens, USA)
  • Barbara Schuppler (Graz University of Technology, Austria)
  • Laura Spinu (Kingsborough Community College, CUNY, USA)
  • Ioana Vasilescu (CNRS LIMSI, Université Paris-Saclay, France)
  • Yaru Wu (CNRS LIMSI, Université Paris-Saclay, France; CNRS LPP – Sorbonne Nouvelle, France)


Summary description / Motivation
During the last decade, the term “big data” has become a major keyword in numerous areas of the social sciences and humanities, which are increasingly concerned with the need for digital processing of an ever-growing influx of data. Among these areas, phonetics and laboratory phonology are at the forefront, as substantial benefit can be expected from the study of larger and richer data collections, supported by faster, partially automated processing.

The current scientific and technological constellation holds promise for a virtuous circle of shared interests in large corpus-based and statistically supported modeling of phonetic variation, opening avenues for both linguists and technology stakeholders. Indeed, a new research field, “big data phonetics”, is emerging that relies on corpora and approaches borrowed from speech technologies. In return, speech technologies may take advantage of statistically grounded observations in order to better disentangle the sources and the patterns of speech variation.

We propose a workshop dedicated to this exciting research direction, which combines methods, approaches and corpora from speech technology with studies in phonetics and laboratory phonology.

Background and research questions
Traditionally, research in phonetics and phonology is driven by specific hypotheses, which may entail requirements on both the acoustic quality of the speech data and their linguistic content and structure. Raw large-scale corpora typically include all kinds of noise, which adds to the highly variable nature of speech, itself conditioned by many linguistic and extra-linguistic factors. When relying on such heterogeneous material, phonetics and laboratory phonology research needs to reconsider both the manner of addressing scientific hypotheses and the methods used to process such data. One of the purposes of the workshop is to discuss access to such data and the various challenges that phoneticians and phonologists face when processing large-scale corpora for speech analysis. A related question is which methods borrowed from speech technologies can most efficiently be “diverted” for the needs of phonetic analysis.

The symmetrical, speech technology-driven purpose of this workshop is to survey the state of the art in the challenges that speech variation poses for speech technologies and to suggest how these technologies could benefit from phonetics- and phonology-driven analyses. For example, Automatic Speech Recognition systems and related applications are known to degrade ungracefully when faced with unseen variation. Research aimed at improving lexical modeling for speech recognition and L2 pronunciation learning may benefit from large corpus-based phonetics and phonology research.

Several workshops and special sessions on similar topics have been dedicated to big data in phonetic research as part of phonetics and phonology scientific events (see VLSP, UPenn in 2011, and the special sessions at ICPhS 2015, ICPhS 2019 and LSRL 2019).

The workshop will not only promote the use of speech technologies as an aid to linguistic studies and provide insight into how to make use of recent developments, but also make research in phonetics and phonology visible to the speech technology community.

Topics and areas of interest
We encourage submissions on any topic related to the questions listed below:

● How to analyze variation phenomena in continuous speech using large corpora?
● How to take advantage of large corpora for segmental and supra-segmental studies? What are the caveats?
● How to investigate ongoing phonological processes using large corpora?
● How to capture sound change in large-scale corpora?
● How to clean raw speech data and structure their annotation?
● How could expertise and research in phonetics and phonology contribute to the advancement of speech technology (e.g., improving pronunciation dictionaries)?