Tech

The AI models that power PinyinTube’s voice and subtitle extraction.

May 23, 2023 SwapBrain

Introduction

PinyinTube is a Chrome extension that turns movies into an immersive language learning experience. Dual subtitles make the content accessible to learners at every level, and the Romanized Chinese pronunciation (Pinyin) helps users learn to pronounce Chinese words the way native speakers do. The interactive design lets users pause and replay the content as often as they like, which makes it well suited to practice sessions alongside the actors. If you want to take your language learning up a notch, consider upgrading to the PRO version, which lets you record your voice, compare your tone and pronunciation with those of the native actors, and track your progress. Learning a new language has never been this fun!

[Figure: screenshot of the app showing misaligned audio]

While creating this extension, we ran into several hurdles that seemed daunting at first. We plan to overcome them by combining our technical expertise with the capabilities of recent AI models. One of the most significant issues was that the subtitles were often misaligned with the audio actually being played, which made it very hard to replay individual sentences precisely. In addition, the actors' voices were often lost in background noise and music, and actors sometimes spoke at the same time or mumbled, which made it even harder to extract a voice and match it against the user's recording. Rather than accept a lower-quality result, we chose to apply a series of AI models in a carefully planned sequence:

[Figure: AI roadmap]

– First, Quantum Clustering is used to group the different types of content in the subtitles, such as the dialogue of different characters, background noise, and general descriptions. This clustering allows the later filtering to be applied only to the speech of the main characters. In particular, we will apply the method introduced by Ding Liu et al. in 2016 [1].

– Second, the voices of the main characters are aligned with the corresponding subtitles using the phoneme alignment method developed by Schulze-Forster et al. in 2020 [2].

– Third, using the labelled clusters and subtitles, the voices of the main actors are separated from the background noise with the text-informed sound separation method developed by Kevin Kilgour and others from Google Research in 2022 [3].

– Finally, the separated audio may still be corrupted or unclear. To enhance it, we suggest the generative adversarial approach to speech enhancement (SEGAN) developed by Pascual et al. in 2017 [4].
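To make the first step more concrete, here is a minimal sketch of quantum clustering in its standard formulation: a Parzen-window "wavefunction" is built from the data, the Schrödinger potential is derived from it, and every point slides downhill on that potential until it settles in a minimum. This is not necessarily the exact variant used in [1], and it is not PinyinTube's production code. The 2-D feature vectors, the function names, and the parameter values (`sigma`, `step`, `n_steps`, `merge_tol`) are all assumptions made for illustration; in practice the features would have to be derived from the subtitle text.

```python
import numpy as np


def potential_gradient(x, data, sigma):
    """Gradient of the quantum-clustering potential V at position x.

    psi(x) = sum_i exp(-|x - x_i|^2 / (2 sigma^2))
    V(x)   = sum_i |x - x_i|^2 exp(-|x - x_i|^2 / (2 sigma^2))
             / (2 sigma^2 psi(x))        (additive constants dropped)
    """
    diff = x - data                               # shape (n, d)
    r2 = np.sum(diff ** 2, axis=1)                # squared distances, shape (n,)
    w = np.exp(-r2 / (2 * sigma ** 2))            # Gaussian weights
    psi = w.sum()
    numer = np.sum(r2 * w)
    grad_numer = np.sum(diff * (w * (2 - r2 / sigma ** 2))[:, None], axis=0)
    grad_psi = -np.sum(diff * w[:, None], axis=0) / sigma ** 2
    return (grad_numer / psi - numer * grad_psi / psi ** 2) / (2 * sigma ** 2)


def quantum_cluster(data, sigma=1.0, step=0.1, n_steps=300, merge_tol=0.5):
    """Move each point downhill on V, then group points that end up together."""
    data = np.asarray(data, dtype=float)
    points = data.copy()
    for _ in range(n_steps):
        grads = np.array([potential_gradient(p, data, sigma) for p in points])
        points -= step * grads

    # Points that converged to (almost) the same minimum share a label.
    centers, labels = [], []
    for p in points:
        for k, c in enumerate(centers):
            if np.linalg.norm(p - c) < merge_tol * sigma:
                labels.append(k)
                break
        else:
            centers.append(p)
            labels.append(len(centers) - 1)
    return np.array(labels), np.array(centers)


# Hypothetical 2-D features for eight subtitle lines: the first four stand in
# for one kind of content (say, dialogue), the last four for another.
features = np.array([
    [0.0, 0.0], [0.2, 0.1], [-0.1, 0.3], [0.1, -0.2],
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1], [5.1, 5.3],
])
labels, centers = quantum_cluster(features)
print(labels)
```

With these toy features the two groups are far apart relative to `sigma`, so the first four lines end up with one label and the last four with another. The choice of `sigma` controls how fine the clustering is: a smaller value lets the potential develop more minima and therefore more clusters.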

Because these topics are quite technical, we will cover the in-depth details in separate blog posts. Please follow the hyperlink on each topic to go to the corresponding page.

References

[1] Ding Liu et al., “Analyzing documents with Quantum Clustering: A novel pattern recognition algorithm based on quantum mechanics”, Pattern Recognition Letters, 2016

[2] Kilian Schulze-Forster et al., “Joint Phoneme Alignment and Text-Informed Speech Separation on Highly Corrupted Speech”, conference proceedings, ICASSP 2020

[3] Kevin Kilgour et al., “Text-driven separation of arbitrary sounds”, conference proceedings, Interspeech 2022

[4] Santiago Pascual et al., “SEGAN: Speech Enhancement Generative Adversarial Network”, conference proceedings, Interspeech 2017


Comment <01>

Hoa Nguyen

May 28, 2023

This article is fascinating! I did not know that there are so many algorithms under this simple application. It is great to use this application for learning Chinese with Pinyin.
