Match the volume of Amazon Polly’s voice to the volume of Alexa’s voice using the CLI #Alexa #AmazonPolly

How to adjust the volume of speech synthesized by Amazon Polly so that it plays at the right level on Alexa.
2018.06.24


Introduction

When an Alexa skill I was developing played speech synthesized by Amazon Polly (hereinafter "Polly") instead of Alexa's ordinary voice, Polly's voice was much quieter than Alexa's and was very hard to hear.

In this article, I will show you how to adjust the volume of speech synthesized by Amazon Polly so that it plays at the right level on Alexa. There are various tools for adjusting the volume of audio data, but here I will introduce a procedure that uses only the CLI, with the OSS audio processing tool SoX. With SoX, I was able to adjust that audio to an appropriate volume.

Preparation

In addition to SoX, you need aws-cli, because the audio data is also retrieved from Polly on the command line this time.
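
Before starting, it may help to confirm that both tools are on the PATH. The following is a minimal sketch; the function name `require_cmd` is my own and is not part of SoX or aws-cli.

```shell
# Hypothetical helper: report any command that is not found on PATH.
# Returns 0 when every command is available, 1 otherwise.
require_cmd() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: require_cmd sox aws || echo "install the missing tools first"
```

Note that this only checks that the commands exist. Writing MP3 with SoX also requires a SoX build with MP3 encoding support, which this sketch does not verify.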

Steps

First, use aws-cli to synthesize voice with Polly.

$ aws polly synthesize-speech --output-format pcm --voice-id Salli --text-type ssml --text "<speak>and this is the volume of Polly's voice.</speak>" --sample-rate 16000 test.raw
{
    "ContentType": "audio/pcm",
    "RequestCharacters": "31"
}

Set the sampling rate to 16 kHz, the value defined in the specification of Alexa's SSML audio tag, and set the output format to PCM. Polly can also output MP3, which is the only format Alexa accepts directly, but Polly's MP3 output is encoded at 40 kbps, which Alexa does not accept (Alexa's SSML only accepts 16 kHz / 48 kbps MP3). Therefore, I decided to start the processing from uncompressed PCM.

The remaining steps use only SoX. First, convert Polly's PCM output to WAV format. Polly's PCM output is headerless raw data, so SoX cannot process it as is; the sample rate, bit depth, channel count, byte order, and encoding have to be given explicitly.

$ sox -r 16000 -b 16 -c 1 -L -e signed-integer -t raw test.raw -t wav test.wav
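
Because the raw file has no header, the options in the sox command above are all SoX knows about it. As a quick sanity check, the expected duration follows from the file size: 16,000 samples per second × 2 bytes per sample × 1 channel = 32,000 bytes per second. Here is a small sketch of that arithmetic; `pcm_duration_ms` is a hypothetical helper name.

```shell
# Hypothetical helper: duration in milliseconds of headerless PCM,
# assuming 16-bit (2-byte) mono samples as in the sox command above.
# $1 = size in bytes, $2 = sample rate in Hz (defaults to 16000)
pcm_duration_ms() {
    bytes=$1
    rate=${2:-16000}
    echo $(( bytes * 1000 / (rate * 2) ))
}

# Usage: pcm_duration_ms "$(wc -c < test.raw)"
```

If the result is far from the length of the spoken sentence, one of the format options probably does not match what Polly produced.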

Then adjust the volume of the WAV file using the gain effect with the -n option.

$ sox test.wav test-norm.wav gain -n
sox WARN dither: dither clipped 1 samples; decrease volume?

The -n option automatically adjusts the gain so that the peak reaches the maximum level (0 dB). You can also specify the gain manually, but the -n option worked well here.
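
To make the idea concrete: peak normalization finds the largest absolute sample value and applies the gain that brings it to full scale, that is, 20·log10(1/peak) dB. The sketch below only illustrates this arithmetic with awk. It is not SoX's actual implementation, and `peak_gain_db` is a hypothetical name.

```shell
# Hypothetical helper: gain in dB needed to raise a peak amplitude
# (given as a fraction of full scale, 0 < peak <= 1) to 0 dB full scale.
peak_gain_db() {
    awk -v peak="$1" 'BEGIN { printf "%.2f\n", 20 * log(1 / peak) / log(10) }'
}

# Usage: peak_gain_db 0.25   # a peak at a quarter of full scale
```

The peak itself can be read from the "Maximum amplitude" line that `sox test.wav -n stat` prints.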

Finally, convert the normalized WAV file to MP3 format. Be sure to set the bit rate to 48 kbps.

$ sox test-norm.wav -C 48 test.mp3

That's it. Polly's speech is now an MP3 file at just the right volume, ready for playback on Alexa.
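
Since SoX takes format options, then the files, then the effects, the SoX steps can probably be collapsed into a single invocation. The following is a sketch under that assumption. The wrapper name `polly_raw_to_mp3` and the `DRY_RUN` switch are my own inventions, and I have not verified that the result is identical to the step-by-step version.

```shell
# Hypothetical wrapper: normalize Polly's raw PCM and encode it as
# 48 kbps MP3 in one sox invocation (input options, input file,
# output options, output file, then the gain effect).
# Set DRY_RUN=1 to print the command instead of running it.
polly_raw_to_mp3() {
    src=$1
    dst=$2
    set -- sox -r 16000 -b 16 -c 1 -L -e signed-integer -t raw "$src" \
        -C 48 "$dst" gain -n
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: polly_raw_to_mp3 test.raw test.mp3
```

Keeping the intermediate WAV files, as in the steps above, is still handy when you want to listen to the audio before and after normalization.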

Conclusion

When you want to improve the UX of an Alexa skill, it can sometimes be effective to use speech synthesized by another service such as Polly in addition to Alexa's own synthesized speech. In those cases, tools like SoX are very helpful.
