
A story about asking Claude Code to create meeting minutes using Amazon Transcribe
Introduction
I had a recording of a roughly one-hour meeting on hand and decided to ask Claude Code to create minutes from it. I also had related documents such as notes, so I passed those along as well and asked it to "read the recording and reference material URLs and create meeting minutes."
To state the conclusion upfront, the 56-minute recording was turned into a practical summary at a cost of approximately 200 yen. This article introduces the process and the challenges that arose along the way.
What is Amazon Transcribe
Amazon Transcribe is an automatic speech recognition service provided by AWS. It converts speech in audio and video files to text, and offers features such as speaker diarization, custom vocabularies, real-time streaming, and redaction.
Verification Environment
- macOS
- ffmpeg 8.1.1
- aws CLI 2.34.33
- Python 3.14.4
- boto3 1.43.8
- Amazon Transcribe Standard Batch (ja-JP, ap-northeast-1)
- Claude Code Opus 4.7
Target Audience
- Those who want to reduce the burden of manually writing up minutes for the roughly one-hour meetings they attend
- Those who want to actually try Amazon Transcribe in a business setting
- Those wondering how far loosely specified requests can be delegated to Claude Code
- Those looking for a way to semi-automatically generate meeting minutes
References
- Amazon Transcribe Service Page
- Amazon Transcribe Pricing
- Custom vocabularies - Amazon Transcribe
- Supported character sets for custom vocabularies
- Speaker partitioning (diarization)
What Was Done
The input consisted of the following 2 types.
- 1 recording file (.mov, approximately 350 MB, 56 minutes, 6 participants, Japanese only)
- Several reference material URLs
Based on these, Claude Code output a Markdown document with the following 7-section structure. It was approximately 350 lines.
- Meeting metadata (date/time, participants, facilitator, agenda)
- Discussion summary by agenda item
- Decisions made
- ToDos and action items (with assigned persons)
- Statement summary by speaker
- Timestamped statement index
- Full statement log by speaker
Construction Flow
I will explain the processing flow step by step.
Extracting Audio from the Recording
Amazon Transcribe's batch transcription supports some video formats such as MP4 and WebM. The input file this time was .mov, however, so I instructed Claude Code to extract only the audio with ffmpeg, convert it to FLAC, and send that instead. The aims were to ensure format compatibility, make the audio conditions explicit (mono, 16 kHz), and reduce the amount of data transferred.
ffmpeg -y -i "$INPUT" -vn -ac 1 -ar 16000 -c:a flac "$OUTPUT"
An 87 MiB FLAC file was generated from the 350 MB .mov.
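If the conversion needs to run from a Python script rather than the shell, the same command can be wrapped as below. This is only a sketch: the function names `build_ffmpeg_command` and `extract_audio` are hypothetical, and the flags are copied unchanged from the command above.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Return the ffmpeg argv that extracts mono 16 kHz FLAC audio from src."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", str(src),  # input recording (.mov in this article)
        "-vn",           # drop the video stream
        "-ac", "1",      # downmix to one channel (mono)
        "-ar", "16000",  # resample to 16 kHz
        "-c:a", "flac",  # encode the audio as FLAC (lossless)
        str(dst),
    ]


def extract_audio(src: Path, dst: Path) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with a non-zero status."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.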
Running Amazon Transcribe
Transcribe is an asynchronous service that operates on a job basis. The process proceeds in the following order.
- Register custom vocabulary (to improve recognition accuracy for proper nouns)
- Upload FLAC to S3
- Start the job
- Poll for completion
- Download the result JSON
Custom vocabularies are included in the standard pricing and can be used at no additional charge.
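The vocabulary registration step is not shown in this article, so the following is a minimal sketch of how it could look with boto3's `create_vocabulary` and `get_vocabulary`. It assumes the phrases are passed inline via `Phrases` (a vocabulary file on S3 is the other option), and the helper names are hypothetical. `client` is expected to be something like `boto3.client("transcribe", region_name="ap-northeast-1")`.

```python
def create_vocabulary(client, name, phrases, language_code="ja-JP"):
    """Request creation of a custom vocabulary from an inline phrase list."""
    client.create_vocabulary(
        VocabularyName=name,
        LanguageCode=language_code,
        Phrases=list(phrases),
    )


def vocabulary_is_ready(client, name):
    """Return True when READY, False while PENDING; raise if creation FAILED."""
    info = client.get_vocabulary(VocabularyName=name)
    state = info["VocabularyState"]  # PENDING, READY, or FAILED
    if state == "FAILED":
        # FailureReason carries the validation message from Transcribe
        raise RuntimeError(f"vocabulary {name} failed: {info.get('FailureReason')}")
    return state == "READY"
```

Vocabulary creation is asynchronous, so the job should only be started once `vocabulary_is_ready` returns True.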
Claude Code wrote the job startup in boto3 as follows.
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    LanguageCode="ja-JP",
    MediaFormat="flac",
    Media={"MediaFileUri": f"s3://{bucket}/{S3_INPUT_KEY}"},
    Settings={
        "ShowSpeakerLabels": True,
        "MaxSpeakerLabels": 6,
        "VocabularyName": VOCABULARY_NAME,
    },
    OutputBucketName=bucket,
    OutputKey=S3_OUTPUT_KEY,
)
By specifying ShowSpeakerLabels=True and MaxSpeakerLabels=6, labels such as spk_0, spk_1 ... are assigned to each speaker.
Polling was done at 30-second intervals until the job completed. The measured time from job start to completion was 3 minutes and 4 seconds, meaning the 56-minute audio file was transcribed in about 3 minutes.
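The polling code itself is not reproduced in this article. A minimal sketch of such a loop, assuming a boto3 Transcribe client and a hypothetical helper name `wait_for_job`, could look like this. The 30-second interval comes from the description above; the one-hour timeout is an arbitrary assumption.

```python
import time


def wait_for_job(client, job_name, interval=30.0, timeout=3600.0, sleep=time.sleep):
    """Poll get_transcription_job until the job is COMPLETED or FAILED."""
    waited = 0.0
    while True:
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        job = response["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]  # QUEUED, IN_PROGRESS, COMPLETED, FAILED
        if status == "COMPLETED":
            return job
        if status == "FAILED":
            raise RuntimeError(f"job {job_name} failed: {job.get('FailureReason')}")
        if waited >= timeout:
            raise TimeoutError(f"job {job_name} still {status} after {waited:.0f} s")
        sleep(interval)  # injectable so the loop can be tested without waiting
        waited += interval
```

With a 30-second interval, a job that finishes in about 3 minutes is picked up after roughly six or seven polls.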
Matching Speaker Identification with Reference Materials
The downloaded JSON contains timestamps and speaker labels for each word. Claude Code used Python to bundle these into utterance units and organized them into an easy-to-handle intermediate representation.
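The grouping step could be sketched as follows. This is not the code Claude Code actually wrote. It assumes the result JSON has the shape shown in AWS's sample output, where each entry in `results.items` carries `type`, `alternatives`, and, for spoken words, `start_time`, `end_time`, and `speaker_label`. The function name and the output dictionary keys are hypothetical.

```python
def group_utterances(transcript: dict) -> list[dict]:
    """Merge consecutive word items from the same speaker into utterances."""
    utterances = []
    current = None
    for item in transcript["results"]["items"]:
        text = item["alternatives"][0]["content"]
        if item["type"] == "punctuation":
            # Punctuation items have no timestamps; attach them to the open utterance.
            if current is not None:
                current["text"] += text
            continue
        speaker = item.get("speaker_label", "unknown")
        start, end = float(item["start_time"]), float(item["end_time"])
        if current is None or current["speaker"] != speaker:
            # Speaker changed: open a new utterance.
            current = {"speaker": speaker, "start": start, "end": end, "text": text}
            utterances.append(current)
        else:
            current["end"] = end
            current["text"] += text  # Japanese tokens are joined without spaces
    return utterances
```

Each utterance then holds one speaker label, a start and end time in seconds, and the concatenated text, which is a convenient unit for both the timestamped index and the per-speaker log.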
After this, it was necessary to map labels like spk_0 to real names. Since Transcribe does not determine who is who, this is a step that requires contextual judgment by Claude Code. Real names were estimated from handwritten meeting notes and the volume and content of each speaker's statements. Finally, all organized statements and reference materials were used as input to generate the 7-section Markdown.
Challenges That Arose
The process did not go entirely smoothly; three challenges arose.
An Error Occurred Due to Insufficient IAM Permissions
When I first tried to register the custom vocabulary for Transcribe, the call was rejected with an AccessDeniedException.
User: arn:aws:iam::***:user/*** is not authorized to perform: transcribe:CreateVocabulary
The IAM user I normally use had no Transcribe-related actions assigned at all.
I added the necessary permissions and re-executed. The main permissions that were needed are as follows.
- transcribe:CreateVocabulary, transcribe:UpdateVocabulary, transcribe:GetVocabulary, transcribe:DeleteVocabulary
- transcribe:StartTranscriptionJob, transcribe:GetTranscriptionJob, transcribe:ListTranscriptionJobs, transcribe:DeleteTranscriptionJob
- S3-related: CreateBucket, PutObject, GetObject, ListBucket, DeleteObject, DeleteBucket
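For reference, the permissions above could be assembled into a policy document along these lines. This is only a sketch, not the policy I actually attached. The bucket name is a placeholder, `"Resource": "*"` on the Transcribe statement is an assumption made for brevity, and the `Sid` values are arbitrary.

```python
import json

TRANSCRIBE_ACTIONS = [
    "transcribe:CreateVocabulary",
    "transcribe:UpdateVocabulary",
    "transcribe:GetVocabulary",
    "transcribe:DeleteVocabulary",
    "transcribe:StartTranscriptionJob",
    "transcribe:GetTranscriptionJob",
    "transcribe:ListTranscriptionJobs",
    "transcribe:DeleteTranscriptionJob",
]
S3_ACTIONS = [
    "s3:CreateBucket",
    "s3:PutObject",
    "s3:GetObject",
    "s3:ListBucket",
    "s3:DeleteObject",
    "s3:DeleteBucket",
]


def build_policy(bucket: str) -> dict:
    """Build an IAM policy document scoped to one working bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TranscribeJobsAndVocabularies",
                "Effect": "Allow",
                "Action": TRANSCRIBE_ACTIONS,
                "Resource": "*",  # assumption: not narrowed to specific jobs or vocabularies
            },
            {
                "Sid": "WorkBucket",
                "Effect": "Allow",
                "Action": S3_ACTIONS,
                # bucket-level and object-level ARNs for the single working bucket
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# "example-minutes-bucket" is a placeholder name
print(json.dumps(build_policy("example-minutes-bucket"), indent=2))
```

Limiting the S3 statement to a single working bucket keeps the delete permissions from applying to unrelated buckets.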
Japanese Hyphens Were Rejected in Custom Vocabulary
The next challenge came when creating the custom vocabulary. For multi-word phrases like "Object Storage," Transcribe's convention in English is to join the words with a hyphen, as in Object-Storage. When I registered the phrase in that form for ja-JP, the vocabulary ended up in the FAILED state.
Validation error: Your custom vocabulary file contains one or more unsupported characters ("-") on line 4.
The character types for ja-JP custom vocabularies are restricted, and hyphens are not permitted. Details are described in the Amazon Transcribe character set documentation.
As a solution, I re-registered them in katakana, such as "オブジェクトストレージ" (Object Storage) and "オブジェクトストア" (Object Store). In the actual meeting, there were almost no instances of English pronunciation, and they were often read in katakana, so this approach actually improved recognition accuracy as well.
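A small pre-check before registration would have caught this earlier. The sketch below only looks for the ASCII hyphen reported in the error above and does not attempt to reproduce the full ja-JP character set from the documentation. The function name is hypothetical.

```python
def find_hyphenated(phrases):
    """Return (line_number, phrase) pairs that contain an ASCII hyphen."""
    return [
        (number, phrase)
        for number, phrase in enumerate(phrases, start=1)
        if "-" in phrase  # U+002D only; the katakana long-vowel mark "ー" is a different character
    ]


candidates = ["Object-Storage", "オブジェクトストレージ", "オブジェクトストア"]
for number, phrase in find_hyphenated(candidates):
    print(f"line {number}: {phrase!r} contains a hyphen")
```

The katakana long-vowel mark in "オブジェクトストレージ" looks similar to a hyphen but is a separate character, so the katakana entries pass this check.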
The Number of Speakers Was Less Than MaxSpeakerLabels
Although I specified MaxSpeakerLabels to match the number of participants, the result yielded labels for only 5 people, one fewer than expected. The statement volume aggregated as follows.
| Speaker Label | Number of Statements | Cumulative Statement Time (seconds) |
|---|---|---|
| spk_0 | 20 | 663.8 |
| spk_1 | 4 | 98.1 |
| spk_2 | 15 | 473.7 |
| spk_3 | 34 | 1,151.6 |
| spk_4 | 19 | 481.2 |
The statement times total 2,868.4 seconds, just over 47 minutes, which is shorter than the 56-minute recording. It is presumed that short back-channel responses and replies from other speakers were merged into the long utterances of the speaker with the most speech volume.
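The total can be checked directly from the table. The snippet below simply re-adds the published figures. The 3,377-second audio length is the value used in the cost calculation later in this article.

```python
# (number of statements, cumulative statement time in seconds), copied from the table
SPEAKER_STATS = {
    "spk_0": (20, 663.8),
    "spk_1": (4, 98.1),
    "spk_2": (15, 473.7),
    "spk_3": (34, 1151.6),
    "spk_4": (19, 481.2),
}
AUDIO_SECONDS = 3377  # audio length used in the cost calculation

total_statements = sum(count for count, _ in SPEAKER_STATS.values())
total_seconds = sum(seconds for _, seconds in SPEAKER_STATS.values())

print(f"{total_statements} statements")
print(f"{total_seconds:.1f} s = {total_seconds / 60:.1f} min")
print(f"share of the recording: {total_seconds / AUDIO_SECONDS:.0%}")
```

The labeled speech adds up to about 47.8 minutes, roughly 85% of the recording.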
My own statements were also short, and I believe they were merged into another speaker's segment. Since MaxSpeakerLabels is a parameter that specifies the upper limit of speakers and does not guarantee that the specified number of labels will be generated, I decided to correct this by cross-referencing with reference materials at the final summary generation stage. Since I had kept notes during the meeting, I was able to use them as supplementary information for speaker identification in the subsequent step.
Results and Actual Costs
The job took 3 minutes and 4 seconds, and including surrounding work (ffmpeg, S3 upload, download, Markdown generation, and human review), it was completed within 60 minutes.
The generated Markdown was sufficiently practical, and when compared against the actual meeting content, it was at a level that felt natural and accurate.
The final costs were as follows. In Japanese yen, this comes to approximately 203 yen.
| Item | Amount (USD) |
|---|---|
| Amazon Transcribe | 1.35080 |
| Amazon S3 | 0.00025 |
| Custom Vocabulary | 0 |
| Total | approx. 1.35 |
The unit price for the Tokyo region was obtained from the AWS Price List API and multiplied by the length of the transcribed audio in seconds (not the 3 minutes and 4 seconds the job took to run). Standard transcription in ap-northeast-1 is USD 0.0004 per second (Tier 1), so 3,377 seconds of audio comes to USD 1.3508, or approximately USD 1.35.
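The arithmetic can be reproduced as follows. The exchange rate is not stated in this article, so the snippet back-computes the rate implied by the 203 yen figure rather than assuming one.

```python
from decimal import Decimal

RATE_PER_SECOND = Decimal("0.0004")  # USD, Standard batch, ap-northeast-1, Tier 1
AUDIO_SECONDS = 3377                 # length of the transcribed audio
S3_COST = Decimal("0.00025")         # USD, from the cost table
TOTAL_YEN = Decimal("203")           # yen total quoted in this article

transcribe_cost = RATE_PER_SECOND * AUDIO_SECONDS
total_usd = transcribe_cost + S3_COST
implied_rate = TOTAL_YEN / total_usd  # yen per USD implied by the quoted total

print(f"Transcribe: USD {transcribe_cost}")
print(f"Total:      USD {total_usd}")
print(f"Implied exchange rate: {implied_rate:.2f} yen/USD")
```

Using `Decimal` avoids floating-point rounding noise in amounts this small.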
Summary
Simply by passing the recording and reference material URLs to Claude Code, a practical Markdown meeting minutes document was generated for approximately 200 yen. If you confirm the IAM policy, ja-JP character set constraints, and data handling rules in advance, this is a configuration that can easily be applied to business use. I hope this will be helpful for those considering reducing the effort involved in creating meeting minutes.
