MLCommons Releases Open Source Datasets For Speech Recognition

MLCommons, the nonprofit consortium dedicated to building open AI development tools and resources, today announced the release of the People’s Speech Dataset and Multilingual Spoken Words Corpus. The consortium claims that the People’s Speech Dataset is one of the world’s most comprehensive English-language speech datasets licensed for academic and commercial use, with tens of thousands of hours of recordings, and that the Multilingual Spoken Words Corpus (MSWC) is one of the largest datasets with keywords in 50 languages.

Free datasets such as TED-LIUM and LibriSpeech have long been available for developers to train, test, and benchmark speech recognition systems. But some, like Fisher and Switchboard, require a license or relatively high one-time payments. This puts even well-resourced organizations at a disadvantage compared to tech giants like Google, Apple, and Amazon, which can collect large amounts of training data through devices like smartphones and smart speakers. For example, four years ago, when researchers at Mozilla began developing the English-language speech recognition system DeepSpeech, the team had to reach out to television and radio stations and to language departments at universities to supplement the public speech data they were able to find.

With the release of the People’s Speech Dataset and MSWC, the hope is that more developers will be able to build their own speech recognition systems with fewer budgetary and logistical constraints than before, according to Keith Achorn. Achorn, a machine learning engineer at Intel, is one of the researchers who has overseen the curation of the People’s Speech Dataset and MSWC for the past several years.

“Modern machine learning models rely on large amounts of data to train. Both ‘The People’s Speech’ and ‘MSWC’ are among the largest datasets in their respective classes. MSWC is particularly attractive for its inclusion of 50 languages,” Achorn told VentureBeat via email. “In our research, most of these 50 languages had no keyword-spotting speech dataset until now, and even those that did had very limited vocabularies.”

Open source speech tools

Starting in 2018, a working group formed under the auspices of MLCommons to identify and map the 50 most widely used languages in the world into a single dataset, and to find a way to make that dataset useful. The team members came from Harvard and the University of Michigan, as well as Alibaba, Oracle, Google, Baidu, Intel, and others.

The researchers who put together the dataset were an international group from the United States, South America and China. They met weekly for several years by conference call, each bringing specific expertise to the project.

The project ultimately produced two datasets instead of just one, the People’s Speech Dataset and MSWC, which are individually detailed in white papers presented this week at the annual Conference on Neural Information Processing Systems (NeurIPS). The People’s Speech Dataset targets speech recognition tasks, while MSWC targets keyword spotting, which deals with identifying specific keywords (e.g., “OK, Google,” “Hey, Siri”) in recordings.

People’s Speech Dataset vs. MSWC

The People’s Speech Dataset includes over 30,000 hours of supervised conversational audio released under a Creative Commons license, which can be used to create the type of speech recognition models that power voice assistants and transcription software. MSWC, by contrast, has over 340,000 keywords with over 23.4 million examples, covers languages spoken by more than 5 billion people, and is designed for applications such as call centers and smart devices.

Previous speech datasets relied on manual efforts to collect and verify thousands of individual keyword examples, and were generally limited to a single language. Additionally, these datasets did not leverage “diverse speech,” meaning they poorly represented natural environments: they lacked accuracy-boosting variables like background noise, informal speech patterns, and a mix of recording equipment.

The People’s Speech Dataset and MSWC also have permissive license terms, including commercial use, which contrast with many speech training libraries. Datasets typically fail to formalize their licenses, relying on end users to take responsibility, or are restrictive in the sense that they prohibit use in products destined for the open market.

“The working group considered several use cases during the development process. However, we are also aware that these spoken word datasets may find further use by models and systems that we had not yet considered,” Achorn continued. “As both datasets continue to grow and develop under the leadership of MLCommons, we are looking for additional sources of diverse, high-quality voice data. Finding sources that comply with our open license terms makes this task more difficult, especially for languages other than English. On a more technical level, our pipeline uses forced alignment to match the audio of the speech with the text of the transcript. Although methods have been devised to compensate for mixed transcription quality, improving accuracy comes at a cost in terms of data quantity.”
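The trade-off Achorn describes, where stricter quality control leaves less data, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `AlignedClip` record, the `alignment_score` field, the thresholds, and the sample values are hypothetical and are not taken from the MLCommons pipeline.

```python
from dataclasses import dataclass


@dataclass
class AlignedClip:
    """One audio segment paired with transcript text by an aligner (hypothetical record)."""
    clip_id: str
    duration_s: float       # length of the audio segment in seconds
    alignment_score: float  # hypothetical confidence in [0, 1] that audio matches text


def filter_by_score(clips, min_score):
    """Keep only clips whose alignment score meets the threshold."""
    return [c for c in clips if c.alignment_score >= min_score]


def total_hours(clips):
    """Sum clip durations and convert seconds to hours."""
    return sum(c.duration_s for c in clips) / 3600.0


# Made-up sample data, for illustration only.
clips = [
    AlignedClip("a", 1800.0, 0.95),
    AlignedClip("b", 1800.0, 0.80),
    AlignedClip("c", 3600.0, 0.60),
    AlignedClip("d", 3600.0, 0.40),
]

# A stricter threshold keeps cleaner pairs but fewer hours of audio.
for threshold in (0.0, 0.5, 0.9):
    kept = filter_by_score(clips, threshold)
    print(f"min_score={threshold:.1f}: {len(kept)} clips, {total_hours(kept):.2f} hours")
```

With this toy data, raising the threshold from 0.0 to 0.9 shrinks the usable audio from 3 hours to half an hour, which is the kind of accuracy-versus-quantity cost the quote refers to.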

Open source trend

The People’s Speech Dataset complements the Mozilla Foundation’s Common Voice, another of the world’s largest speech datasets, with over 9,000 hours of speech data in 60 different languages. As a sign of growing interest in the field, Nvidia recently announced that it will invest $1.5 million in Common Voice to engage more communities and volunteers and support the hiring of new staff.

Voice technology adoption has grown recently, particularly among businesses: 68% of companies report having a voice technology strategy in place, an 18% increase from 2019, according to Speechmatics. And among the companies that don’t, 60% plan to adopt one within the next five years.

Creating datasets for speech recognition remains a labor-intensive activity, but one promising approach gaining wider use is unsupervised learning, which could reduce the need for bespoke training libraries. Traditional speech recognition systems require examples of speech labeled to indicate what is being said, but unsupervised systems can learn without labels by detecting subtle relationships within the training data.

Researchers at Guinea-based tech accelerator GNCode and Stanford have experimented with using radio archives to create unsupervised systems for “low-resource” languages, particularly Maninka, Pular, and Susu in the Niger-Congo family. An MLCommons team called 1000 Words in 1000 Languages is creating a pipeline that can take any recorded speech and automatically generate clips to train compact speech recognition models. In addition, Facebook has developed a system, called Wav2vec-U, which can learn to recognize speech from unlabeled data.

