A story about asking Claude Code to create meeting minutes using Amazon Transcribe

I passed Claude Code a recording of an approximately one-hour meeting and reference material URLs, and asked it to handle everything from transcription using Amazon Transcribe to generating meeting minutes in Markdown. I will introduce the process of obtaining a practical summary with seven sections for approximately 200 yen, as well as the issues that arose along the way.

越井琢巳 (Koshii Takumi)

2026.05.18


Introduction
I had a recording of a roughly one-hour meeting on hand and decided to ask Claude Code to create minutes from it. I also had related documents such as notes, so I passed those along as well and asked it to "read the recording and the reference material URLs and create meeting minutes."
To state the conclusion upfront, the 56-minute recording was turned into a practical summary at a cost of approximately 200 yen. This article introduces the process and the challenges that arose along the way.
Note: This article does not include the meeting content, participant names, reference material URLs, or the generated minutes text. Only the processing procedures and insights gained are introduced in a generalized form. When adopting a similar configuration for actual business use, please be aware that recording data, audio data, reference materials, and custom vocabularies may contain confidential or personal information, and confirm your company's security policies, AI usage rules, and AWS service data handling conditions in advance.
What is Amazon Transcribe
Amazon Transcribe is an automatic speech recognition service provided by AWS. It converts speech in audio and video files to text, and offers features such as speaker diarization (separating speakers without identifying them), custom vocabularies, real-time streaming, and redaction.
Verification Environment
macOS
ffmpeg 8.1.1
aws CLI 2.34.33
Python 3.14.4
boto3 1.43.8
Amazon Transcribe Standard Batch (ja-JP, ap-northeast-1)
Claude Code Opus 4.7
Target Audience
Those who want to reduce the burden of manually writing up minutes for the roughly one-hour meetings they attend
Those who want to try Amazon Transcribe in an actual business setting
Those wondering how far an abstract request can be delegated to Claude Code
Those looking for a way to generate meeting minutes semi-automatically
References
Amazon Transcribe Service Page
Amazon Transcribe Pricing
Custom vocabularies - Amazon Transcribe
Supported character sets for custom vocabularies
Speaker partitioning (diarization)
What Was Done
The input consisted of the following two items.
1 recording file (.mov, approximately 350 MB, 56 minutes, 6 participants, Japanese only)
Several reference material URLs
Based on these, Claude Code output a Markdown document with the following 7-section structure. It was approximately 350 lines.
Meeting metadata (date/time, participants, facilitator, agenda)
Discussion summary by agenda item
Decisions made
ToDos and action items (with assigned persons)
Statement summary by speaker
Timestamped statement index
Full statement log by speaker
Construction Flow
I will explain the processing flow step by step.
Extracting Audio from the Recording
Amazon Transcribe batch transcription accepts some video formats such as MP4 and WebM. However, the input file this time was .mov, so I instructed Claude Code to extract only the audio with ffmpeg, convert it to FLAC, and send that instead. The aims were to ensure format compatibility, make the audio conditions explicit (mono, 16 kHz), and reduce the transfer volume.
ffmpeg -y -i "$INPUT" -vn -ac 1 -ar 16000 -c:a flac "$OUTPUT"
An 87 MiB FLAC file was generated from the 350 MB .mov.
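As a rough illustration, the ffmpeg command above could be wrapped in Python so that each flag is documented and the step can be called from a larger script. The function names `build_ffmpeg_command` and `extract_audio` are hypothetical and not taken from the code Claude Code actually wrote; the flags are the ones shown in the command above.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(input_path: str, output_path: str) -> list[str]:
    """Build the audio-extraction command shown in the article."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", input_path,
        "-vn",           # drop the video stream
        "-ac", "1",      # downmix to a single (mono) channel
        "-ar", "16000",  # resample to 16 kHz
        "-c:a", "flac",  # encode the audio losslessly as FLAC
        output_path,
    ]


def extract_audio(input_path: str, output_path: str) -> Path:
    """Run ffmpeg (assumed to be on PATH) and return the FLAC path."""
    subprocess.run(build_ffmpeg_command(input_path, output_path), check=True)
    return Path(output_path)
```

Keeping the command construction separate from the execution makes it possible to inspect or log the exact command before any file is touched.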
Running Amazon Transcribe
Transcribe batch transcription is asynchronous and operates on a job basis. The process proceeds in the following order.
Register custom vocabulary (to improve recognition accuracy for proper nouns)
Upload FLAC to S3
Start the job
Poll for completion
Download the result JSON
Custom vocabularies are included in the standard pricing and can be used at no additional charge.
Note: The table format is recommended for new custom vocabularies, and the list format is scheduled to be deprecated. This time I registered the vocabulary in list format for simplicity, but the table format is preferable when you want to control how multi-word phrases are handled and how terms are displayed.
Claude Code wrote the job startup in boto3 as follows.
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    LanguageCode="ja-JP",
    MediaFormat="flac",
    Media={"MediaFileUri": f"s3://{bucket}/{S3_INPUT_KEY}"},
    Settings={
        "ShowSpeakerLabels": True,
        "MaxSpeakerLabels": 6,
        "VocabularyName": VOCABULARY_NAME,
    },
    OutputBucketName=bucket,
    OutputKey=S3_OUTPUT_KEY,
)
By specifying ShowSpeakerLabels=True and MaxSpeakerLabels=6, labels such as spk_0, spk_1 ... are assigned to each speaker.
Polling was done at 30-second intervals until the job completed. The measured time from job start to completion was 3 minutes and 4 seconds, so the 56-minute audio file was transcribed in about 3 minutes.
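The polling step might look roughly like the sketch below. It assumes `transcribe` is a client created with `boto3.client("transcribe")` and relies on the `TranscriptionJobStatus` field returned by `get_transcription_job`. The function name `wait_for_job`, the `timeout` value, and the injectable `sleep` parameter are illustrative assumptions, not the article's actual script.

```python
import time


def wait_for_job(transcribe, job_name, interval=30, timeout=3600, sleep=time.sleep):
    """Poll GetTranscriptionJob until the job is COMPLETED or FAILED."""
    waited = 0
    while True:
        response = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        job = response["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription job failed"))
        if waited >= timeout:
            raise TimeoutError(f"{job_name} still {status} after {waited} seconds")
        sleep(interval)  # 30 seconds by default, as in the article
        waited += interval
```

Passing the client and the sleep function in as arguments keeps the loop easy to exercise with a stub before running it against a real job.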
Matching Speaker Identification with Reference Materials
The downloaded JSON contains timestamps and speaker labels for each word. Claude Code used Python to bundle these into utterance units and organized them into an easy-to-handle intermediate representation.
After this, it was necessary to map labels like spk_0 to real names. Since Transcribe does not determine who is who, this is a step that requires contextual judgment by Claude Code. Real names were estimated from handwritten meeting notes and the volume and content of each speaker's statements. Finally, all organized statements and reference materials were used as input to generate the 7-section Markdown.
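The bundling and renaming described above could be sketched as follows. This is a minimal illustration, not the script Claude Code produced. It assumes each word item in `results.items` of the Transcribe output carries `type`, `start_time`, `end_time`, `speaker_label`, and `alternatives[0].content`; these field names should be verified against your own result JSON. The display names in the usage note are placeholders.

```python
def group_utterances(items):
    """Merge consecutive word items from the same speaker into utterances."""
    utterances = []
    for item in items:
        text = item["alternatives"][0]["content"]
        if item["type"] == "punctuation":
            # Punctuation has no timestamps; attach it to the current utterance.
            if utterances:
                utterances[-1]["text"] += text
            continue
        speaker = item["speaker_label"]
        start, end = float(item["start_time"]), float(item["end_time"])
        if utterances and utterances[-1]["speaker"] == speaker:
            utterances[-1]["text"] += text  # Japanese text: no space between words
            utterances[-1]["end"] = end
        else:
            utterances.append({"speaker": speaker, "start": start, "end": end, "text": text})
    return utterances


def apply_names(utterances, names):
    """Replace labels such as spk_0 with names decided from context."""
    return [{**u, "speaker": names.get(u["speaker"], u["speaker"])} for u in utterances]
```

A call such as `apply_names(group_utterances(items), {"spk_0": "Participant A"})` would then yield utterances keyed by the estimated names. The mapping itself still has to come from contextual judgment, since Transcribe only separates speakers.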
Challenges That Arose
The process did not go entirely smoothly; three challenges arose.
An Error Occurred Due to Insufficient IAM Permissions
The first attempt to register the custom vocabulary for Transcribe was rejected with an AccessDeniedException.
User: arn:aws:iam::***:user/*** is not authorized to perform: transcribe:CreateVocabulary
The IAM user I normally use had no Transcribe-related actions assigned at all.
I added the necessary permissions and re-executed. The main permissions that were needed are as follows.
transcribe:CreateVocabulary, transcribe:UpdateVocabulary, transcribe:GetVocabulary, transcribe:DeleteVocabulary
transcribe:StartTranscriptionJob, transcribe:GetTranscriptionJob, transcribe:ListTranscriptionJobs, transcribe:DeleteTranscriptionJob
s3:CreateBucket, s3:PutObject, s3:GetObject, s3:ListBucket, s3:DeleteObject, s3:DeleteBucket
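The permissions listed above could be expressed as an IAM policy document along the following lines. This is only a sketch: the `Sid` values, the bucket name, and the choice of `"Resource": "*"` for the Transcribe actions are assumptions made for illustration, and the article does not show the policy that was actually attached. Scope the resources to match your own security policy.

```python
import json

TRANSCRIBE_ACTIONS = [
    "transcribe:CreateVocabulary",
    "transcribe:UpdateVocabulary",
    "transcribe:GetVocabulary",
    "transcribe:DeleteVocabulary",
    "transcribe:StartTranscriptionJob",
    "transcribe:GetTranscriptionJob",
    "transcribe:ListTranscriptionJobs",
    "transcribe:DeleteTranscriptionJob",
]

S3_ACTIONS = [
    "s3:CreateBucket",
    "s3:PutObject",
    "s3:GetObject",
    "s3:ListBucket",
    "s3:DeleteObject",
    "s3:DeleteBucket",
]


def build_policy(bucket_name: str) -> dict:
    """Return a policy document limited to one working bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TranscribeJobsAndVocabularies",
                "Effect": "Allow",
                "Action": TRANSCRIBE_ACTIONS,
                "Resource": "*",  # assumption: not narrowed to specific jobs
            },
            {
                "Sid": "WorkingBucket",
                "Effect": "Allow",
                "Action": S3_ACTIONS,
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",    # bucket-level actions
                    f"arn:aws:s3:::{bucket_name}/*",  # object-level actions
                ],
            },
        ],
    }


# "example-minutes-bucket" is a placeholder bucket name.
print(json.dumps(build_policy("example-minutes-bucket"), indent=2))
```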
Hyphens Were Rejected in the ja-JP Custom Vocabulary
The next challenge arose when creating the custom vocabulary. For multi-word phrases such as "Object Storage," Transcribe's convention for English is to join the words with a hyphen, as in Object-Storage. When I registered the phrase that way for ja-JP, the vocabulary ended up in the FAILED state.
Validation error: Your custom vocabulary file contains one or more unsupported characters ("-") on line 4.
The characters allowed in ja-JP custom vocabularies are restricted, and hyphens are not permitted. Details are described in the Amazon Transcribe documentation "Supported character sets for custom vocabularies," listed in the References.
As a solution, I re-registered the terms in katakana, such as "オブジェクトストレージ" (Object Storage) and "オブジェクトストア" (Object Store). In the actual meeting these terms were almost never pronounced as English and were usually read in katakana, so this approach also improved recognition accuracy.
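A small pre-check before registration could catch this kind of failure locally. The sketch below only looks for the ASCII hyphen reported in the error message above; it does not implement the full ja-JP character set, which is defined in the AWS documentation. The function names and the vocabulary name are hypothetical, and `transcribe` is assumed to be a `boto3.client("transcribe")` client, whose `create_vocabulary` call accepts a `Phrases` list.

```python
def find_hyphenated(phrases):
    """Return (position, phrase) pairs that contain an ASCII hyphen."""
    return [(n, p) for n, p in enumerate(phrases, start=1) if "-" in p]


def register_vocabulary(transcribe, name, phrases, language_code="ja-JP"):
    """Register a list-format vocabulary after a local hyphen check."""
    bad = find_hyphenated(phrases)
    if bad:
        raise ValueError(f"hyphens are not accepted for {language_code}: {bad}")
    return transcribe.create_vocabulary(
        VocabularyName=name,
        LanguageCode=language_code,
        Phrases=list(phrases),
    )


# Example input: the katakana forms pass the check, the hyphenated form does not.
print(find_hyphenated(["オブジェクトストレージ", "オブジェクトストア", "Object-Storage"]))
```

Vocabulary creation is itself asynchronous, so after this call the `VocabularyState` returned by `get_vocabulary` still needs to reach READY before the vocabulary is used in a job.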
The Number of Speakers Was Less Than MaxSpeakerLabels
Although I set MaxSpeakerLabels to 6 to match the number of participants, the result contained labels for only 5 speakers, one fewer than expected. The amount of speech per label was as follows.


Speaker Label	Number of Statements	Cumulative Statement Time (seconds)
spk_0	20	663.8
spk_1	4	98.1
spk_2	15	473.7
spk_3	34	1,151.6
spk_4	19	481.2

The total comes to about 2,868 seconds (just under 48 minutes), which is shorter than the 56-minute recording. Presumably, short back-channel responses and replies from other speakers were merged into the long utterances of the speaker who talked the most.
My own statements were also short, and I believe they were merged into another speaker's segment. Since MaxSpeakerLabels is a parameter that specifies the upper limit of speakers and does not guarantee that the specified number of labels will be generated, I decided to correct this by cross-referencing with reference materials at the final summary generation stage. Since I had kept notes during the meeting, I was able to use them as supplementary information for speaker identification in the subsequent step.
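The totals discussed above can be checked with a few lines of arithmetic. The per-label seconds are taken from the table, and 3,377 seconds is the audio duration quoted in the cost calculation later in the article; the variable names are only for illustration.

```python
# Cumulative statement time per label, in seconds (from the table above).
speech_seconds = {
    "spk_0": 663.8,
    "spk_1": 98.1,
    "spk_2": 473.7,
    "spk_3": 1151.6,
    "spk_4": 481.2,
}
recording_seconds = 3377  # audio duration used in the cost calculation

total = sum(speech_seconds.values())
busiest = max(speech_seconds, key=speech_seconds.get)

print(f"labelled speech: {total:.1f} s ({total / 60:.1f} min)")
print(f"share of recording: {total / recording_seconds:.0%}")
print(f"busiest label: {busiest} ({speech_seconds[busiest] / total:.0%} of labelled speech)")
```

A label with very few statements, or one whose share is much larger than the meeting notes suggest, is a useful hint about where merged speakers should be looked for.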
Results and Actual Costs
The transcription job took 3 minutes and 4 seconds. Including the surrounding work (ffmpeg, S3 upload, download, Markdown generation, and human review), everything was completed within 60 minutes.
The generated Markdown was practical enough, and when I compared it against the actual meeting content, it read as natural and accurate.
The final costs were as follows. In Japanese yen, this comes to approximately 203 yen.


Item	Amount (USD)
Amazon Transcribe	1.35080
Amazon S3	0.00025
Custom Vocabulary	0
Total	approx. 1.35

The unit price for the Tokyo region was obtained from the AWS Price List API and multiplied by the duration, in seconds, of the audio processed by the Transcribe job. Standard transcription in ap-northeast-1 costs USD 0.0004 per second (Tier 1), so 3,377 seconds of audio comes to USD 1.3508, or about USD 1.35.
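The cost arithmetic can be reproduced as below. The unit price, audio duration, and S3 amount are the figures given in this article. The exchange rate of 150 yen per US dollar is an assumption chosen purely for illustration, since the article does not state the rate it used.

```python
from decimal import Decimal

RATE_PER_SECOND = Decimal("0.0004")  # USD, Standard batch, ap-northeast-1, Tier 1
AUDIO_SECONDS = 3377                 # duration of the transcribed audio
S3_COST = Decimal("0.00025")         # USD, from the cost table

transcribe_cost = RATE_PER_SECOND * AUDIO_SECONDS
total_usd = transcribe_cost + S3_COST


def to_yen(usd: Decimal, jpy_per_usd: Decimal) -> Decimal:
    """Convert USD to whole yen at the given exchange rate."""
    return (usd * jpy_per_usd).quantize(Decimal("1"))


ASSUMED_JPY_PER_USD = Decimal("150")  # assumption, not from the article

print(f"Transcribe: USD {transcribe_cost}")
print(f"Total:      USD {total_usd}")
print(f"In yen:     about {to_yen(total_usd, ASSUMED_JPY_PER_USD)} yen")
```

Using `Decimal` rather than floats avoids rounding noise in amounts this small.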
Summary
Simply by passing the recording and reference material URLs to Claude Code, a practical Markdown meeting minutes document was generated for approximately 200 yen. If you confirm the IAM policy, ja-JP character set constraints, and data handling rules in advance, this is a configuration that can easily be applied to business use. I hope this will be helpful for those considering reducing the effort involved in creating meeting minutes.

