How I improved Amazon Polly pronunciation in Twilio ConversationRelay using SSML tags
Introduction
Twilio ConversationRelay provides text-to-speech (TTS) using Amazon Polly as a standard feature.
While Polly's Japanese Neural voice generally produces natural pronunciation, some phrases still have unnatural intonation. I was particularly concerned about long katakana compound words and phrases that mix kanji and katakana.
For example, "プレミアティアサービスパートナー" (Premier Tier Service Partner) is not read smoothly as a single compound word; its intonation sounds fragmented. In "全サービス" (all services), the pitch accent on "ぜん" is unnatural, which makes the phrase sound like a proper noun.
Because these subtle issues stand out in telephone response audio, I tested whether SSML tags could improve them. This article shares the trial-and-error process and what I learned.
Test Environment
The test environment was as follows:
| Item | Details |
|---|---|
| WebSocket server | API Gateway WebSocket API + Lambda (Node.js) |
| LLM | Amazon Bedrock (Claude Haiku 4.5) |
| TTS | Amazon Polly (Takumi-Neural) via Twilio ConversationRelay |
In ConversationRelay, SSML tags can be included directly in the token field of the message sent from the WebSocket server. In this environment, the tags worked even without wrapping the text in <speak> tags.
{
  "type": "text",
  "token": "ご説明します。<break time=\"300ms\"/>AWS全サービスが対象です。"
}
Using this mechanism, I attempted to improve pronunciation by including SSML tags in the text generated by Bedrock's LLM.
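As a minimal sketch of the message shown above, the following TypeScript builds the JSON payload for a text token. The helper name buildTextMessage and the TextMessage interface are hypothetical, and only the two fields that appear in this article are modeled. Actually sending the payload over the WebSocket connection is left out.

```typescript
// Shape of the text message shown above.
// Only the two fields that appear in this article are modeled.
interface TextMessage {
  type: "text";
  token: string;
}

// Hypothetical helper: wraps SSML-bearing text in a text message and serializes it.
function buildTextMessage(ssmlText: string): string {
  const message: TextMessage = { type: "text", token: ssmlText };
  // JSON.stringify escapes the double quotes inside SSML attributes.
  return JSON.stringify(message);
}

const payload = buildTextMessage('ご説明します。<break time="300ms"/>AWS全サービスが対象です。');
console.log(payload);
```

Serializing with JSON.stringify instead of concatenating strings avoids hand-escaping the quotes in attributes such as time="300ms".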
SSML Tag Experimentation
<break> tag: Effective after punctuation, counterproductive for compound word separation
I first tried the <break> tag. Inserting <break time="300ms"/> right after periods created appropriate pauses at sentence boundaries, making them easier to understand.
AWS全サービスが7%割引になります。<break time="300ms"/>
24時間365日の日本語サポートが付いてきます。<break time="300ms"/>
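In this article the LLM emits the tags itself, but the same rule could also be applied in code as a post-processing step. The sketch below assumes a hypothetical helper, addBreakAfterPeriods, that appends a <break> after each Japanese full stop and skips full stops that already have one.

```typescript
// Hypothetical helper: appends a <break> after each Japanese full stop (。).
// Full stops that are already followed by a <break> tag are left untouched,
// so running the function twice does not duplicate tags.
function addBreakAfterPeriods(text: string, ms: number = 300): string {
  return text.replace(/。(?!\s*<break\b)/g, `。<break time="${ms}ms"/>`);
}

const withPauses = addBreakAfterPeriods(
  "AWS全サービスが7%割引になります。24時間365日の日本語サポートが付いてきます。"
);
console.log(withPauses);
```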
However, attempts to insert short <break> tags at meaning boundaries in long katakana compound words failed.
「プレミアティア<break time="100ms"/>サービス<break time="100ms"/>パートナー」
Even with a short 100ms <break>, Polly appears to treat the tag as a sentence boundary. As a result, the phrase is read like "プレミアティア。サービス。パートナー。", with each part given its own sentence-final intonation, which sounded worse than the version without SSML.
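Given this result, one possible safeguard is to strip any <break> that the LLM places somewhere other than directly after punctuation. The following is only a sketch of that idea. The function name removeMidWordBreaks is hypothetical, and it assumes that a break is acceptable only immediately after 。 or 、.

```typescript
// Hypothetical guard: keeps a <break> only when it directly follows 。 or 、,
// and removes any other <break>, such as one placed inside a compound word.
function removeMidWordBreaks(text: string): string {
  // The capture group holds the character just before the tag (empty if none).
  return text.replace(/(.?)<break\b[^>]*\/>/g, (match: string, prev: string) =>
    prev === "。" || prev === "、" ? match : prev
  );
}

const cleaned = removeMidWordBreaks(
  'プレミアティア<break time="100ms"/>サービス<break time="100ms"/>パートナー'
);
console.log(cleaned);
```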
<phoneme> tag: Effective for controlling pitch accent
With Amazon Polly's Japanese Neural voice, you can specify pitch accent using <phoneme> with alphabet="x-amazon-pron-kana".
<phoneme alphabet="x-amazon-pron-kana" ph="プレミアティアサービスパー'トナー">プレミアティアサービスパートナー</phoneme>
Write the reading in katakana in the ph attribute and place an apostrophe (') immediately before the mora where the pitch drops. Polly then reads the word with the specified accent pattern. With this tag, "プレミアティアサービスパートナー" sounds like a single compound word with natural intonation.
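If the accent readings are kept in a small dictionary, the tag can be generated mechanically. The sketch below is an assumption about how that might look, not the approach used in this article, where the LLM writes the tags. The names accentDictionary, phonemeTag, and applyAccentDictionary are hypothetical, and the input is assumed to be plain text without existing SSML and with no dictionary entry contained in another.

```typescript
// Hypothetical accent dictionary: surface form -> x-amazon-pron-kana reading.
// Both readings are the ones used in this article.
const accentDictionary: Record<string, string> = {
  "プレミアティアサービスパートナー": "プレミアティアサービスパー'トナー",
  "全サービス": "ゼ'ンサービス",
};

// Wraps one word in a <phoneme> tag with the given kana reading.
function phonemeTag(surface: string, kana: string): string {
  return `<phoneme alphabet="x-amazon-pron-kana" ph="${kana}">${surface}</phoneme>`;
}

// Replaces every dictionary word found in plain text with its tagged form.
function applyAccentDictionary(text: string, dictionary: Record<string, string>): string {
  let result = text;
  for (const surface of Object.keys(dictionary)) {
    result = result.split(surface).join(phonemeTag(surface, dictionary[surface]));
  }
  return result;
}

console.log(applyAccentDictionary("全サービスが対象です。", accentDictionary));
```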
Final Configuration and LLM Output Example
I eventually settled on the following configuration:
| Tag | Usage |
|---|---|
| <break time="300ms"/> | Inserted after periods |
| <break time="200ms"/> | Inserted after commas |
| <phoneme alphabet="x-amazon-pron-kana"> | Accent specification for katakana compound words or mixed kanji+katakana words |
For English abbreviations (AWS, API, etc.), I use no tags and rely on Polly's default processing. I avoid using <sub> and <lang> because they have side effects with Japanese Neural voices.
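Because the tags come from an LLM, it may be worth checking the output against this configuration before sending it. The sketch below reports any tag that is not on an allow-list. The names allowedTags and findDisallowedTags are hypothetical, and the regular expression is a simple scan for tag names, not a full SSML parser.

```typescript
// Tags permitted by the configuration described above.
const allowedTags: string[] = ["break", "phoneme"];

// Hypothetical check: returns the names of tags in the LLM output
// that are not on the allow-list (for example sub or lang).
function findDisallowedTags(ssml: string): string[] {
  const found: string[] = [];
  // Matches opening, closing, and self-closing tags and captures the tag name.
  const tagPattern = /<\/?([A-Za-z][A-Za-z0-9-]*)\b[^>]*>/g;
  let match: RegExpExecArray | null;
  while ((match = tagPattern.exec(ssml)) !== null) {
    const tagName = match[1].toLowerCase();
    if (allowedTags.indexOf(tagName) === -1 && found.indexOf(tagName) === -1) {
      found.push(tagName);
    }
  }
  return found;
}

console.log(findDisallowedTags('全サービスが対象です。<break time="300ms"/>'));
```

A non-empty result could be used to fall back to the plain text or to ask the LLM to regenerate the response.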
LLM Output Example
Here is an actual output example from the LLM:
クラスメソッドメンバーズは、AWS総合支援サービスです。<break time="300ms"/>
2014年からAWS最上位の<phoneme alphabet="x-amazon-pron-kana"
ph="プレミアティアサービスパー'トナー">プレミアティアサービスパートナー
</phoneme>に認定されており、10年以上で累計5,000社以上にAWSの構築・運用支援を
行っています。<break time="300ms"/>
<phoneme alphabet="x-amazon-pron-kana" ph="ゼ'ンサービス">全サービス
</phoneme>が7%OFF、24時間365日の日本語サポート、
そして<phoneme alphabet="x-amazon-pron-kana" ph="クラウドホ'ケン">クラウド保険
</phoneme>が無料で付いてきます。<break time="300ms"/>
他にご不明な点はございますか?
Compared with the output without SSML, the katakana compound words have more natural intonation, and the pauses from the <break> tags make the response easier to follow over the phone.
Conclusion
I tested pronunciation improvements using SSML tags for Japanese voice with Twilio ConversationRelay + Amazon Polly (Takumi-Neural).
The combination of inserting pauses after punctuation with <break> tags and specifying pitch accent for katakana compound words with <phoneme> tags proved effective. On the other hand, I had to abandon tags such as <sub> and <lang> because of their significant side effects with the Japanese Neural voice.
This test confirmed improvements for specific phrases ("プレミアティアサービスパートナー", "全サービス", etc.). As the amount of domain-specific terminology grows, however, deciding how to present such examples in the prompt becomes a challenge.