We're excited to host a meetup at the Gong offices, presenting recent advances in self-supervised representation learning, such as wav2vec2, that power many new state-of-the-art speech applications.
We will hear about these advances from leading industry figures.
Lectures are in Hebrew.
Applying wav2vec2 in the industry - lessons learned at Gong - Eduard Golshtein & Zeev Rannon
Self-supervised pre-training, and especially Facebook's (now Meta's) wav2vec2 technology together with publicly available pretrained models, is making a major impact on the speech recognition world. In this presentation, we will outline the concept as well as the practical issues and considerations in applying the technology at scale at Gong. We will address the benefits, such as performance and robustness when using limited quantities of training data for various languages, alongside the computational considerations, and compare these to the well-established DNN-HMM approaches used in the industry for years. We will also report on various attempts to improve training and inference speeds.
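At its core, wav2vec2's self-supervised pre-training masks spans of latent speech representations and trains the model to identify the true latent at each masked position among a set of distractors, via a contrastive (InfoNCE-style) loss. Below is a minimal, illustrative sketch of that objective using random toy vectors in place of real encoder outputs; the array sizes, temperature, and noise level are arbitrary assumptions, not the actual wav2vec2 hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent speech representations: T frames x D dims (stand-ins for
# the CNN encoder outputs that wav2vec2 masks and predicts; real models
# use far larger T and D).
T, D = 20, 8
latents = rng.normal(size=(T, D))

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Simplified InfoNCE-style objective: the context vector at a masked
    position should be more similar to the true latent than to distractors
    sampled from other time steps."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = [target] + list(distractors)
    sims = np.array([cos(context, c) for c in candidates]) / temperature
    log_probs = sims - np.log(np.sum(np.exp(sims)))  # log-softmax
    return -log_probs[0]  # true target sits at index 0

# Mask one position and pretend the context network's output is a noisy
# version of the true latent; distractors come from other time steps.
masked_t = 10
context = latents[masked_t] + 0.1 * rng.normal(size=D)
distractors = [latents[t]
               for t in rng.choice(T, size=5, replace=False)
               if t != masked_t]

loss = contrastive_loss(context, latents[masked_t], distractors)
print(f"contrastive loss: {loss:.3f}")
```

Because the toy context is a lightly-perturbed copy of the true latent, the loss comes out near zero; in real pretraining it is this loss, backpropagated through the transformer context network, that shapes the representations.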
Eduard Golshtein, Ph.D, is a speech team lead at Gong, with 25 years of experience developing speech processing technologies and products.
Zeev Rannon, MSc EE., is a senior researcher at Gong with more than 30 years in speech processing and ASR.
---
Towards textless NLP: modeling spoken language from raw audio - Yossi Adi
An open question for AI research is how to create systems that learn from natural interactions the way infants learn their first language(s): spontaneously and without access to text or expert labels. Current NLP systems require large amounts of text, which excludes many of the world's languages that have few textual resources or no widely used written form. In addition, textual features do not encode speaker-specific speech properties beyond content (e.g., identity, style, emotion), nor the structured signals that are part of natural human interaction (intonation, hesitation, laughter, etc.) and are important in the oral form. In this talk, I'll present our recent studies in developing a textless approach to spoken language processing. The proposed framework comprises a pseudo-text encoder, a sequential model, and a speech generation component, all trained in an unsupervised fashion. Lastly, I will present various applications that can benefit from such modeling, together with future research directions.
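One common way to obtain the "pseudo-text" in such a pipeline is to discretize continuous self-supervised speech features with k-means, so that each frame becomes a token from a small unit vocabulary; a standard language model can then be trained on these unit sequences with no text labels at all. The following is a minimal sketch of that quantization step, using random toy features in place of real wav2vec2/HuBERT outputs; the feature dimensions and cluster count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy continuous speech features: 100 frames x 4 dims (stand-ins for
# self-supervised frame representations).
frames = rng.normal(size=(100, 4))

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: the clustering step used in textless NLP to turn
    continuous speech features into discrete pseudo-text units."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest center.
        dists = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned frames.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

centers, units = kmeans(frames, k=8)
# 'units' is now a discrete pseudo-text sequence; a language model can
# be trained over such sequences without any written text.
print(units[:10])
```

In practice a production system would use a library clustering implementation and deduplicate consecutive repeated units, but the idea is the same: speech in, token sequence out, text never involved.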
Yossi Adi is a research scientist at Meta AI Research. Before joining Meta, Yossi completed his Ph.D. in computer science at Bar-Ilan University under the supervision of Professor Joseph Keshet.
---
Language and emotion recognition using self-supervised speech representations - Hagai Aronowitz
Recent innovations in self-supervised representation learning have led to remarkable advances in natural language processing. In my talk, I will describe how to leverage recent advances in self-supervised speech processing to create a common speech analysis engine. Such an engine can handle multiple speech processing tasks with a single architecture and obtain accuracy beyond the state of the art. I will focus on language identification, where we obtain an error reduction of more than 50% compared to the state of the art, and on emotion recognition, where we show how speaker normalization can be applied to reach and surpass state-of-the-art accuracy.
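As a rough illustration of speaker normalization, one simple form is per-speaker mean/variance normalization of utterance embeddings, which removes speaker-specific offsets before a downstream emotion classifier sees the features. The sketch below uses random toy embeddings for two hypothetical speakers; it shows one plausible normalization scheme, not necessarily the method presented in the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy utterance embeddings for two speakers (stand-ins for
# self-supervised speech representations); each speaker has a
# different baseline offset that is irrelevant to emotion.
spk_a = rng.normal(loc=2.0, size=(50, 4))
spk_b = rng.normal(loc=-1.0, size=(50, 4))

def speaker_normalize(embs):
    """Per-speaker mean/variance normalization: subtract the speaker's
    mean embedding and divide by its per-dimension standard deviation,
    so downstream features are speaker-independent."""
    mu = embs.mean(axis=0)
    sigma = embs.std(axis=0) + 1e-8  # guard against zero variance
    return (embs - mu) / sigma

norm_a = speaker_normalize(spk_a)
norm_b = speaker_normalize(spk_b)
```

After normalization, both speakers' embeddings live on a comparable scale, so an emotion classifier trained on one set of speakers transfers more readily to unseen ones.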
Hagai Aronowitz, Ph.D., is a speech analysis tech lead at IBM Research-AI, and has been doing speech processing research for more than 25 years.
![](https://i.ytimg.com/vi/MSrfi-4zJSo/maxresdefault.jpg)