VTeam AI

Whisper and Bark Models Unveiled: Demonstrating Text-Audio and Audio-Text Transformations

Shifting our focus from the familiar terrain of Langchain and LLMs, we're diving into the fascinating world of speech processing, spotlighting two key models: Whisper (Audio-to-Text) and Bark (Text-to-Audio). Join us as we unravel their intricate architectures and give a hands-on demonstration. Kicking off with OpenAI's Whisper, we'll delve into its advanced transcription abilities, multilingual support, translation features, and its open-source ethos.

After covering Langchain and LLMs at length, we will switch gears and discuss two very important models in the field of speech processing:

  • Whisper (Audio-to-Text)
  • Bark (Text-to-Audio)

In this blog, we will be talking about both these models, their architecture, and eventually a basic demonstration. So let's start with OpenAI's Whisper.


Whisper (Audio-to-Text)

    1. ASR System: Whisper is an automatic speech recognition (ASR) system developed by OpenAI; its primary function is to transcribe spoken words into written text.

    2. Robustness: Due to its large and diverse training dataset, Whisper exhibits improved robustness in recognizing various accents, handling background noise, and understanding technical language.

    3. Multilingual Support: Whisper is not limited to English; it is capable of transcribing speech in several other languages. This multilingual support makes it versatile for global applications.

    4. Translation Capability: In addition to transcription, Whisper can translate speech in non-English languages into English. It was trained on 125,000 hours of data for this purpose, enhancing its language translation capabilities (a short example follows this list).

    5. Open Source: Whisper is an open-source project, making it free to use, distribute, and modify. Users can access its resources and contribute to its development.
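
As a quick illustration of the multilingual and translation points above, the same transcribe call can be pointed at a non-English recording and asked to produce English text via the task argument. Here is a minimal sketch; french_sample.mp3 is just a placeholder name for whatever non-English audio you have on hand.

import whisper

model = whisper.load_model("base")

# transcribe the audio in its original language
result = model.transcribe("french_sample.mp3")
print(result["text"])

# task="translate" asks Whisper to output English text instead
result = model.transcribe("french_sample.mp3", task="translate")
print(result["text"])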

Approach

The Architecture

  • Transformer Model: Whisper utilizes the Transformer architecture, a deep learning model known for its effectiveness in various natural language processing tasks. This architecture consists of encoder and decoder components.

  • Encoder-Decoder Structure: Whisper employs an encoder-decoder structure, where the encoder processes the input audio data, and the decoder generates the corresponding text output. This setup allows it to convert spoken language into written text effectively.

  • Input Processing: Input audio is divided into 30-second chunks. Each chunk is then converted into a log-Mel spectrogram. This spectrogram representation helps capture the essential features of the audio signal.

  • Encoder: The log-Mel spectrogram data is passed through the encoder component. The encoder's role is to understand and extract meaningful information from the audio input. It propagates this information to the decoder for text generation.

  • Decoder: The decoder is responsible for generating the corresponding text captions. It takes the information from the encoder and uses it to predict the text that corresponds to the audio input. This prediction is intermixed with special tokens that guide the model during the generation process.
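
The bullets above map directly onto the openai-whisper package's lower-level Python API. Below is a minimal sketch that makes the 30-second padding, log-Mel spectrogram, and decoding steps explicit, using the same sample.wav clip we transcribe later.

import whisper

model = whisper.load_model("base")

# load the audio and pad/trim it to fit a 30-second window
audio = whisper.load_audio("sample.wav")
audio = whisper.pad_or_trim(audio)

# compute the log-Mel spectrogram and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language from the spectrogram
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# run the encoder-decoder to generate the text for this window
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)

Note that decode handles a single 30-second window; the higher-level transcribe helper used below stitches longer audio together for us.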

Enough theory; it's time to get our hands dirty with some basic code. We will try out two audio files:

  • A basic 8-second audio clip (sample.wav)
  • A full song, Counting Stars by OneRepublic (counting_stars.mp3)

#!pip install -U openai-whisper
import whisper
model = whisper.load_model("base")
result = model.transcribe("sample.wav")
print(result["text"])

Not just short clips; longer audio can be transcribed just as easily.

model = whisper.load_model("base")
result = model.transcribe("counting_stars.mp3")
print(result["text"])
The output below is a sample of the whole song's transcription.
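
For longer recordings like this, it can also be handy to look at the timestamped segments Whisper returns alongside the full text. A small sketch, reusing the result object from the cell above:

# each entry in result["segments"] carries start/end timestamps (in seconds) and its text
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s -> {segment['end']:.1f}s] {segment['text']}")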

Moving on to the text-to-audio model, Bark.

Bark (Text-to-Audio)

Bark is a transformer-based text-to-audio model created by Suno.ai and available on HuggingFace, designed for various audio generation tasks. Its architecture draws inspiration from GPT, resembling models like AudioLM and VALL-E, and it uses a quantized audio representation based on EnCodec. Unlike traditional text-to-speech (TTS) models, Bark is not confined to a linear script: it is a fully generative text-to-audio model that can deviate from the input text in unexpected ways. This unique feature sets Bark apart from conventional TTS models.

  1. Text-to-Audio Conversion: Bark specializes in converting text input into audio output. It uses advanced machine learning techniques, specifically a transformer architecture, to achieve this.

  2. Multilingual Support: Bark has the ability to generate highly realistic speech and other audio content in multiple languages. This multilingual support makes it versatile for catering to diverse linguistic needs (a short sketch after this list demonstrates it).

  3. Audio Variety: Beyond speech, Bark can generate a wide range of audio content, including music, background noise, and simple sound effects. This versatility makes it suitable for applications like audio production and content creation.

  4. Nonverbal Communications: Bark goes beyond speech generation and can produce nonverbal communication cues like laughter, sighing, and crying. This capability adds emotional depth to audio content.

  5. Creative Possibilities: With its lifelike speech, multilingual capabilities, music generation, sound effects, and nonverbal communication generation, Bark offers endless creative possibilities for audio content creation.
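
To illustrate the multilingual support mentioned in point 2, here is a minimal sketch that pairs a French prompt with a French speaker preset. The prompt text is our own example, and it assumes the bark package is already installed as shown in the setup cell below.

from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

# download and load all models (skip if already done)
preload_models()

# a French prompt paired with a matching French speaker preset
text_prompt = "Bonjour, je suis data scientist et j'adore travailler avec l'audio."
audio_array = generate_audio(text_prompt, history_prompt="v2/fr_speaker_1")
Audio(audio_array, rate=SAMPLE_RATE)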

Now we will walk through a few short tutorials showing how Bark can be used for a variety of text-to-audio tasks.

First, we will install Bark, import the functions we need, and preload the models.

!pip install git+https://github.com/suno-ai/bark.git
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio
# download and load all models
preload_models()
Next, let's generate a basic audio clip that includes a few nonverbal cues as well.
text_prompt = """
     [MAN] This is Mehul [clears throat]. I am a Data Scientist [chuckles]. What about you?
"""
audio_array = generate_audio(text_prompt, history_prompt='v2/de_speaker_2')
 
# save audio to disk
write_wav("bark_generation.wav", SAMPLE_RATE, audio_array)
 
# play the generated audio in the notebook
Audio(audio_array, rate=SAMPLE_RATE)
The generated audio plays directly in the notebook.
Next, let's generate a song
text_prompt = """
    ♪ I look for light, for a brighter side ♪
"""
audio_array = generate_audio(text_prompt, history_prompt='v2/fr_speaker_1')
# play the generated audio in the notebook
Audio(audio_array, rate=SAMPLE_RATE)
The code above works when the text prompt is short, but Bark generates only a short clip (roughly 13 seconds) per call, so it won't work for an entire speech. We need to tweak a few things, as done in the code below.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import numpy as np

from bark.generation import (
    generate_text_semantic,
    preload_models,
)
from bark.api import semantic_to_waveform
from bark import generate_audio, SAMPLE_RATE
from IPython.display import Audio
import nltk

nltk.download('punkt')

text_prompt = """
As a professional athlete, I've always believed in giving my all to the game of cricket.
It has been an incredible journey, one that I've cherished with all my heart.
However, I've come to a point in my career where I feel it's time to consider the future.
Retirement is a natural part of every athlete's life, and I want to make this transition thoughtfully.
While I'll always hold cricket close to my heart, I also understand that there's more to life than just the sport.
It's about striking a balance between my passion for cricket and my desire to explore new horizons.
So, as I contemplate my retirement, I do so with gratitude for the wonderful moments on the field and excitement for the adventures that lie ahead.
"""
sentences = nltk.sent_tokenize(text_prompt)
SPEAKER = "v2/en_speaker_6"
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence

pieces = []
for sentence in sentences:
    audio_array = generate_audio(sentence, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]

Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
In the code above, we took a whole speech, split it into individual sentences, generated audio for each sentence, and then appended the pieces together with a quarter second of silence between them. Listen to the generated audio below.
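
If you want to keep the result rather than only play it in the notebook, the concatenated array can be written to disk the same way as the shorter clips; bark_long_form.wav below is just an example file name.

from scipy.io.wavfile import write as write_wav

# stitch the per-sentence pieces into a single waveform and save it to disk
write_wav("bark_long_form.wav", SAMPLE_RATE, np.concatenate(pieces))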

In conclusion, the Whisper and Bark models represent two remarkable advancements in the field of natural language processing and audio generation. Whisper, developed by OpenAI, is a powerful open-source speech recognition model that excels in converting spoken language into text. Its accuracy and versatility make it a valuable tool for a wide range of applications, from transcription services to voice assistants.

On the other hand, Bark, created by Suno, takes a different approach by being a fully generative text-to-audio model. This means it can not only convert text to speech but also generate other audio content, including music, background noise, and even non-verbal expressions like laughter and sighing. Bark's unique capabilities open up possibilities for creative audio generation beyond traditional speech.

In our demonstration, we explored how to use these models for text-audio and audio-text transitions. Whisper's accuracy in transcribing spoken words into text was evident, making it a valuable tool for tasks like transcription and voice command recognition. Meanwhile, Bark's ability to generate audio from text prompts was showcased, highlighting its potential for creating diverse audio content.

These models, Whisper and Bark, exemplify the fusion of cutting-edge AI technology and audio processing. They provide valuable tools for developers, researchers, and creators to harness the power of AI in the world of sound. As technology continues to advance, we can expect even more exciting developments in the realm of text-audio and audio-text transitions.

Disclaimer: The views and opinions expressed in this blog post are solely those of the authors and do not reflect the official policy or position of any of the mentioned tools. This blog post is not a form of advertising and no remuneration was received for the creation and publication of this post. The intention is to share our findings and experiences using these tools and is intended purely for informational purposes.