Exploring OpenAI Whisper for Speech to Text Generation
Hi, this is Charu from Classmethod. In this hands-on blog, we'll explore how to implement OpenAI Whisper and integrate it into your projects. We will be focusing on the Speech-to-Text (STT) feature of Whisper.
Basic Installation
To begin, we will install 5 different packages. Don't worry, we will go through this step by step and it will not take much time. Also, you might have a few of these installed already ;)
1. Install Python:
You need to install Python on your system. You can do that by following the steps in this link.
To confirm the installation, type the following command to check the installed version:
python -V
2. Install PyTorch:
Next, you need to install PyTorch through this link. It is an ML library that Whisper runs on. To get the right build for your computer, scroll down on the linked page and select your configuration (OS, package manager, language, and compute platform). Once you have made all your selections, copy the generated command and run it in your terminal.
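For example, on a Mac with the default pip selection, the page generated a command like the one below at the time of writing. This is only an illustration; copy the exact command the page gives you for your configuration:
pip3 install torch torchvision torchaudio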
3. Install a Package Manager:
Let's download our package manager. If you are using Windows, then download Chocolatey, and if you are using Mac, then download Homebrew.
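For example, Homebrew is installed by running the one-line command from brew.sh, which at the time of writing looks like this (check the site for the current version):
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Chocolatey provides a similar one-line install command on its website.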
4. Install FFmpeg:
Now, use your package manager to install a package called ffmpeg.
If you are using Windows, then download it using the following command:
choco install ffmpeg
And if you are using Mac, then use the following command:
brew install ffmpeg
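As with Python, you can confirm the installation by checking the version:
ffmpeg -version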
5. Install Whisper:
Now, coming to the last installation, we will finally install Whisper. To do so, run the following command:
pip install openai-whisper
Note: If it does not work for you, try running it in a virtual environment. To do this, run the following commands:
For Mac:
python -m venv path/to/venv
source path/to/venv/bin/activate
python -m pip install openai-whisper
For Windows:
python -m venv path\to\venv
path\to\venv\Scripts\activate.bat
python -m pip install openai-whisper
Explore Whisper
Congratulations! We have now finished the installation of all the prerequisites.
It's time to run your code now. Open your favorite code editor and type the following code:
import whisper

# Load the model; "large" is the most accurate but also the slowest
model = whisper.load_model("large")

# Transcribe the audio file and print the recognized text
result = model.transcribe("Audio_File.wav")
print(result["text"])
Make sure to enter the path of your audio file in place of Audio_File.wav. It need not be in WAV format; it can also be an MP3 file.
Now, when you run the code, you can view the generated text.
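If you run this on a CPU, you may see a warning that FP16 is not supported and that FP32 will be used instead. It is harmless, but you can silence it by passing fp16=False to transcribe (one of Whisper's decoding options):
result = model.transcribe("Audio_File.wav", fp16=False)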
By default, the Whisper command-line tool uses the 'small' model, but I used the 'large' model for higher accuracy. The key point is that the larger the model, the greater the processing time and the higher the accuracy of the results. You have 5 different models to choose from: tiny, base, small, medium, and large. To know more about them, go to this link. You can also run Whisper directly from the command line:
whisper Audio_File.wav
It will automatically detect the language and generate the text.
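If you are curious how detection works, the Python API exposes that step directly. Here is a minimal sketch adapted from the example in the Whisper README (the file name is a placeholder):

import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("Audio_File.wav")
audio = whisper.pad_or_trim(audio)

# Make a log-Mel spectrogram and move it to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language from the spectrogram
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

You can also specify the language and the model explicitly: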
whisper Audio_File.wav --language Japanese --model large
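Whisper can also translate speech from other languages into English using the translate task: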
whisper Audio_File.wav --language Japanese --task translate
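The same options are available in Python through the language and task parameters of transcribe. A minimal sketch, reusing the model loaded earlier (the file name is a placeholder):

result = model.transcribe("Audio_File.wav", language="ja", task="translate")
print(result["text"])

To see the full list of available options, run: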
whisper --help
Conclusion:
With the basic implementation complete, feel free to experiment with different Whisper models and audio inputs. Explore Whisper's capabilities and iterate on your implementation to suit your specific use case.
Thank you for reading!
Happy Learning :)