How I improved Amazon Polly pronunciation in Twilio ConversationRelay using SSML tags

I built a Japanese voice AI using Amazon Polly (Takumi-Neural) with Twilio ConversationRelay, but noticed unnatural intonation in katakana compound words. After trying to improve it with SSML tags, I found that combining break and phoneme tags was effective.

越井琢巳 (Koshii Takumi)

2026.02.18

Introduction

Twilio ConversationRelay provides text-to-speech (TTS) using Amazon Polly as a standard feature.

While Polly's Japanese Neural voice generally produces natural pronunciation, some phrases still have unnatural intonation. I was particularly concerned about long katakana compound words and phrases that mix kanji and katakana.

For example, "プレミアティアサービスパートナー" (Premier Tier Service Partner) is read with fragmented intonation rather than smoothly as a single compound word. In "全サービス" (all services), the pitch accent on "ぜん" sounds unnatural, as if the word were a proper noun.

Because even subtle issues like these are noticeable in telephone response audio, I tested improvements using SSML tags. This article shares my trial-and-error process and what I learned from it.

Test Environment

The test environment was as follows:

Item	Details
WebSocket server	API Gateway WebSocket API + Lambda (Node.js)
LLM	Amazon Bedrock (Claude Haiku 4.5)
TTS	Amazon Polly (Takumi-Neural) via Twilio ConversationRelay

In ConversationRelay, SSML tags can be included directly in the token field of the text message sent from the WebSocket server. In this environment, the tags worked even without wrapping the text in <speak> tags.

{
  "type": "text",
  "token": "ご説明します。<break time=\"300ms\"/>AWS全サービスが対象です。"
}

Using this mechanism, I attempted to improve pronunciation by including SSML tags in the text generated by Bedrock's LLM.
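As a minimal sketch of how such a message could be assembled in Node.js, the helper below builds the JSON payload from a token string. The function name buildTextMessage is hypothetical, only the two fields shown in the example above are used, and the actual sending over the WebSocket connection is omitted.

```javascript
// Build a ConversationRelay "text" message whose token carries inline SSML.
// buildTextMessage is a hypothetical helper name used for illustration only.
function buildTextMessage(token) {
  // JSON.stringify escapes the double quotes inside SSML attributes,
  // so the tags do not have to be escaped by hand.
  return JSON.stringify({ type: "text", token });
}

const message = buildTextMessage(
  'ご説明します。<break time="300ms"/>AWS全サービスが対象です。'
);
console.log(message);
```

Letting JSON.stringify handle the escaping avoids hand-written `\"` sequences, which are easy to get wrong when the SSML comes from LLM output.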

SSML Tag Experimentation

`<break>` tag: Effective after punctuation, counterproductive for compound word separation

I first tried the <break> tag. Inserting <break time="300ms"/> right after periods created appropriate pauses at sentence boundaries, making them easier to understand.

AWS全サービスが7%割引になります。<break time="300ms"/>
24時間365日の日本語サポートが付いてきます。<break time="300ms"/>

However, attempts to insert short <break> tags at meaning boundaries in long katakana compound words failed.

「プレミアティア<break time="100ms"/>サービス<break time="100ms"/>パートナー」

Even a short 100 ms <break> is interpreted by Polly as a sentence boundary. As a result, the phrase is read like "プレミアティア。サービス。パートナー。", with each part given its own sentence intonation, which sounded worse than using no SSML at all.
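Because a <break> inside a compound word made things worse, one possible safeguard is to remove any <break> tag that does not directly follow punctuation before the text is sent. The following is only a sketch under that assumption: the function name stripMidPhraseBreaks and the set of punctuation marks are illustrative choices, not something ConversationRelay or Polly requires.

```javascript
// Matches a <break .../> tag that is NOT immediately preceded by punctuation.
// The punctuation set (。、？！) is an assumption chosen for illustration.
const BREAK_NOT_AFTER_PUNCTUATION = /(?<![。、？！])<break\b[^>]*\/>/g;

// Remove breaks placed in the middle of a phrase; keep breaks after punctuation.
function stripMidPhraseBreaks(text) {
  return text.replace(BREAK_NOT_AFTER_PUNCTUATION, "");
}

const compound =
  '「プレミアティア<break time="100ms"/>サービス<break time="100ms"/>パートナー」';
console.log(stripMidPhraseBreaks(compound));

const sentence = 'AWS全サービスが7%割引になります。<break time="300ms"/>';
console.log(stripMidPhraseBreaks(sentence));
```

The first call drops both mid-word breaks, while the second leaves the break after the period untouched.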

`<phoneme>` tag: Effective for controlling pitch accent

With Amazon Polly's Japanese Neural voice, you can specify pitch accent using <phoneme> with alphabet="x-amazon-pron-kana".

<phoneme alphabet="x-amazon-pron-kana" ph="プレミアティアサービスパー'トナー">プレミアティアサービスパートナー</phoneme>

You write the reading in katakana in the ph attribute and place an apostrophe (') right before the mora where the pitch drops. With this specification, Polly reads with the specified accent pattern. "プレミアティアサービスパートナー" now sounds like a single compound word with natural intonation.
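As an alternative to having the LLM emit these tags itself, the same markup could be applied from a small pronunciation dictionary after generation. The sketch below assumes the input text does not already contain <phoneme> tags. The names PRONUNCIATIONS, wrapWithPhoneme, and applyPhonemes are hypothetical, and the readings are the ones used in this article.

```javascript
// Hypothetical dictionary: surface form -> x-amazon-pron-kana reading.
// The entries are the phrases and readings that appear in this article.
const PRONUNCIATIONS = new Map([
  ["プレミアティアサービスパートナー", "プレミアティアサービスパー'トナー"],
  ["全サービス", "ゼ'ンサービス"],
  ["クラウド保険", "クラウドホ'ケン"],
]);

// Wrap one word in a <phoneme> tag with the given reading.
function wrapWithPhoneme(word, reading) {
  return `<phoneme alphabet="x-amazon-pron-kana" ph="${reading}">${word}</phoneme>`;
}

// Replace every dictionary word in plain text with its <phoneme> markup.
// Assumes the text does not already contain <phoneme> tags.
function applyPhonemes(text, dictionary = PRONUNCIATIONS) {
  // Longest surface forms first, so a longer term wins over a shorter one it contains.
  const words = [...dictionary.keys()].sort((a, b) => b.length - a.length);
  if (words.length === 0) return text;
  // Escape regex metacharacters so each word is matched literally.
  const escaped = words.map((w) => w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  const pattern = new RegExp(escaped.join("|"), "g");
  // A single pass over the text means inserted markup is never re-scanned.
  return text.replace(pattern, (match) =>
    wrapWithPhoneme(match, dictionary.get(match))
  );
}

console.log(applyPhonemes("全サービスが対象です。"));
```

Keeping the readings in one place means a new term only needs a new dictionary entry, at the cost of maintaining the accent positions by hand.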

Final Configuration and LLM Output Example

I eventually settled on the following configuration:

Tag	Usage
`<break time="300ms"/>`	Inserted after periods
`<break time="200ms"/>`	Inserted after commas
`<phoneme alphabet="x-amazon-pron-kana">`	Accent specification for katakana compound words or mixed kanji+katakana words

For English abbreviations (AWS, API, etc.), I use no tags and rely on Polly's default processing. I avoid using <sub> and <lang> because they have side effects with Japanese Neural voices.
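The two <break> rules in the table could also be applied mechanically rather than by the LLM. The following is a rough sketch of that idea: the names BREAK_AFTER and addPunctuationBreaks are hypothetical, and it assumes 。 and 、 never appear inside tag attributes.

```javascript
// Pause lengths from the table above: 300ms after periods, 200ms after commas.
const BREAK_AFTER = { "。": "300ms", "、": "200ms" };

// Insert a <break> after each 。 or 、 unless one is already there.
// Assumes 。 and 、 never occur inside tag attributes.
function addPunctuationBreaks(text) {
  return text.replace(
    /([。、])(?!<break\b)/g,
    (_, mark) => `${mark}<break time="${BREAK_AFTER[mark]}"/>`
  );
}

const plain = "ご説明します。AWS全サービスが、対象です。";
console.log(addPunctuationBreaks(plain));
```

Because punctuation that is already followed by a <break> is skipped, running the function on text that already contains breaks does not duplicate them.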

LLM Output Example

Here is an actual output example from the LLM:

クラスメソッドメンバーズは、AWS総合支援サービスです。<break time="300ms"/>
2014年からAWS最上位の<phoneme alphabet="x-amazon-pron-kana"
ph="プレミアティアサービスパー'トナー">プレミアティアサービスパートナー
</phoneme>に認定されており、10年以上で累計5,000社以上にAWSの構築・運用支援を
行っています。<break time="300ms"/>
<phoneme alphabet="x-amazon-pron-kana" ph="ゼ'ンサービス">全サービス
</phoneme>が7%OFF、24時間365日の日本語サポート、
そして<phoneme alphabet="x-amazon-pron-kana" ph="クラウドホ'ケン">クラウド保険
</phoneme>が無料で付いてきます。<break time="300ms"/>
他にご不明な点はございますか？

Compared with the output without SSML, the intonation of katakana compound words improved, and the pauses added by the <break> tags made the telephone responses clearer overall.

Conclusion

I tested pronunciation improvements using SSML tags for Japanese voice with Twilio ConversationRelay + Amazon Polly (Takumi-Neural).

The combination of inserting pauses after punctuation using <break> tags and specifying pitch accent for katakana compound words with <phoneme> tags proved effective. On the other hand, I had to abandon tags like <sub> and <lang> due to their significant side effects with Japanese Neural voice.

While this test confirmed improvements for specific phrases ("プレミアティアサービスパートナー", "全サービス", etc.), a remaining challenge is how to present such examples in the prompt as the number of domain-specific terms grows.
