A story about asking Claude Code to create meeting minutes using Amazon Transcribe

I gave Claude Code a recording of a roughly one-hour meeting along with reference material URLs, and asked it to handle everything from transcription with Amazon Transcribe to generating the minutes in Markdown. This article describes how I obtained a practical seven-section summary for approximately 200 yen, and the issues that came up along the way.
2026.05.18

Introduction

I had a recording of an approximately one-hour meeting on hand, and I decided to ask Claude Code to create minutes from it. Since I also had related documents such as notes, I passed these along to Claude Code as well, requesting that it "read the recording and reference material URLs and create meeting minutes."

To state the conclusion upfront, the 56-minute recording was turned into a practical summary at a cost of approximately 200 yen. This article introduces the process and the challenges that arose along the way.

What is Amazon Transcribe

Amazon Transcribe is an automatic speech recognition service provided by AWS. It transcribes the speech in audio and video files into text, and offers features such as speaker identification, custom vocabularies, real-time streaming, and redaction.

Verification Environment

  • macOS
  • ffmpeg 8.1.1
  • aws CLI 2.34.33
  • Python 3.14.4
  • boto3 1.43.8
  • Amazon Transcribe Standard Batch (ja-JP, ap-northeast-1)
  • Claude Code Opus 4.7

Target Audience

  • Those who want to reduce the burden of writing up minutes by hand after every meeting of about an hour
  • Those who want to actually try Amazon Transcribe in a business setting
  • Those wondering how far loosely specified requests can be delegated to Claude Code
  • Those looking for a way to semi-automatically generate meeting minutes

What Was Done

There were two kinds of input.

  • 1 recording file (.mov, approximately 350 MB, 56 minutes, 6 participants, Japanese only)
  • Several reference material URLs

From these, Claude Code produced a Markdown document of approximately 350 lines with the following 7-section structure.

  1. Meeting metadata (date/time, participants, facilitator, agenda)
  2. Discussion summary by agenda item
  3. Decisions made
  4. ToDos and action items (with assigned persons)
  5. Statement summary by speaker
  6. Timestamped statement index
  7. Full statement log by speaker

Construction Flow

I will explain the processing flow step by step.

Extracting Audio from the Recording

Amazon Transcribe's batch transcription supports some video formats such as MP4 and WebM. The input file this time was a .mov, however, so I instructed Claude Code to extract only the audio with ffmpeg, convert it to FLAC, and send that instead. This ensures format compatibility, makes the audio conditions explicit (mono, 16 kHz), and reduces the amount of data to transfer.

ffmpeg -y -i "$INPUT" -vn -ac 1 -ar 16000 -c:a flac "$OUTPUT"

An 87 MiB FLAC file was generated from the 350 MB .mov.
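If the extraction step is driven from Python rather than typed into a shell, it could look roughly like the sketch below. The function names and the file names in the usage comment are hypothetical; only the ffmpeg options come from the command above.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Return the ffmpeg argument list that extracts mono 16 kHz FLAC audio."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", str(src),  # input recording (.mov in this article)
        "-vn",           # drop the video stream
        "-ac", "1",      # downmix to a single (mono) channel
        "-ar", "16000",  # resample to 16 kHz
        "-c:a", "flac",  # encode the audio as FLAC
        str(dst),
    ]


def extract_audio(src: Path, dst: Path) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)


# Usage (hypothetical file names):
# extract_audio(Path("meeting.mov"), Path("meeting.flac"))
```

Passing the arguments as a list avoids shell quoting problems when the recording's file name contains spaces.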

Running Amazon Transcribe

Transcribe is an asynchronous service that operates on a job basis. The process proceeds in the following order.

  1. Register custom vocabulary (to improve recognition accuracy for proper nouns)
  2. Upload FLAC to S3
  3. Start the job
  4. Poll for completion
  5. Download the result JSON

Custom vocabularies are included in the standard pricing and can be used at no additional charge.

Claude Code wrote the job startup in boto3 as follows.

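import boto3

# Assumed setup: a Transcribe client for the Tokyo region (ap-northeast-1).
transcribe = boto3.client("transcribe", region_name="ap-northeast-1")

# job_name, bucket, S3_INPUT_KEY, S3_OUTPUT_KEY, and VOCABULARY_NAME
# are defined elsewhere in the script.
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    LanguageCode="ja-JP",
    MediaFormat="flac",
    Media={"MediaFileUri": f"s3://{bucket}/{S3_INPUT_KEY}"},
    Settings={
        "ShowSpeakerLabels": True,  # label each speaker as spk_0, spk_1, ...
        "MaxSpeakerLabels": 6,      # upper limit, matched to the 6 participants
        "VocabularyName": VOCABULARY_NAME,
    },
    OutputBucketName=bucket,  # write the result JSON to the same bucket
    OutputKey=S3_OUTPUT_KEY,
)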

By specifying ShowSpeakerLabels=True and MaxSpeakerLabels=6, labels such as spk_0, spk_1 ... are assigned to each speaker.

Claude Code polled at 30-second intervals until the job completed. The measured time from job start to completion was 3 minutes and 4 seconds, so the 56-minute recording was transcribed in about 3 minutes.
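The polling step can be sketched as follows. This is a minimal illustration rather than the script Claude Code actually wrote: the function name wait_for_job and the injectable sleep parameter are assumptions made so the loop can be exercised without waiting. The get_transcription_job call and the TranscriptionJobStatus values are from the boto3 Transcribe client.

```python
import time


def wait_for_job(client, job_name, interval=30.0, sleep=time.sleep):
    """Poll get_transcription_job until the job is COMPLETED or FAILED.

    `client` is a boto3 Transcribe client (or any object exposing the same
    get_transcription_job method). Returns the final TranscriptionJob dict.
    """
    while True:
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        job = response["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job
        if status == "FAILED":
            # FailureReason is only present when the job has failed.
            raise RuntimeError(job.get("FailureReason", "transcription job failed"))
        # QUEUED or IN_PROGRESS: wait before asking again.
        sleep(interval)
```

With a real client this would be called as wait_for_job(boto3.client("transcribe", region_name="ap-northeast-1"), job_name). A production version would probably also enforce an overall timeout.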

Matching Speaker Identification with Reference Materials

The downloaded JSON contains timestamps and speaker labels for each word. Claude Code used Python to bundle these into utterance units and organized them into an easy-to-handle intermediate representation.
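As a rough idea of what this bundling involves, the sketch below merges consecutive words from the same speaker into one utterance. It assumes the standard batch output layout (results.items for words and punctuation, results.speaker_labels.segments for per-word speaker labels). The function name, the shape of the returned dictionaries, and the tiny sample are hypothetical, and the intermediate representation Claude Code actually produced may differ.

```python
def group_utterances(transcript: dict) -> list[dict]:
    """Merge consecutive words by the same speaker into utterances."""
    results = transcript["results"]

    # Map each word's start_time (a string in the JSON) to its speaker label.
    speaker_at = {}
    for segment in results["speaker_labels"]["segments"]:
        for word in segment["items"]:
            speaker_at[word["start_time"]] = word["speaker_label"]

    utterances = []
    for item in results["items"]:
        content = item["alternatives"][0]["content"]
        if item["type"] == "punctuation":
            # Punctuation carries no timestamps; attach it to the current utterance.
            if utterances:
                utterances[-1]["text"] += content
            continue
        speaker = speaker_at.get(item["start_time"], "unknown")
        start, end = float(item["start_time"]), float(item["end_time"])
        if utterances and utterances[-1]["speaker"] == speaker:
            # Japanese tokens are joined without spaces.
            utterances[-1]["text"] += content
            utterances[-1]["end"] = end
        else:
            utterances.append(
                {"speaker": speaker, "start": start, "end": end, "text": content}
            )
    return utterances


# Tiny hand-made sample in the same layout (not real meeting data).
sample = {"results": {
    "speaker_labels": {"segments": [
        {"speaker_label": "spk_0", "start_time": "0.0", "end_time": "1.0", "items": [
            {"speaker_label": "spk_0", "start_time": "0.0", "end_time": "0.5"},
            {"speaker_label": "spk_0", "start_time": "0.5", "end_time": "1.0"}]},
        {"speaker_label": "spk_1", "start_time": "1.2", "end_time": "1.8", "items": [
            {"speaker_label": "spk_1", "start_time": "1.2", "end_time": "1.8"}]},
    ]},
    "items": [
        {"type": "pronunciation", "start_time": "0.0", "end_time": "0.5",
         "alternatives": [{"content": "今日"}]},
        {"type": "pronunciation", "start_time": "0.5", "end_time": "1.0",
         "alternatives": [{"content": "は"}]},
        {"type": "punctuation", "alternatives": [{"content": "、"}]},
        {"type": "pronunciation", "start_time": "1.2", "end_time": "1.8",
         "alternatives": [{"content": "はい"}]},
    ],
}}

for utterance in group_utterances(sample):
    print(utterance["speaker"], utterance["start"], utterance["end"], utterance["text"])
```

Each utterance keeps its start and end time, which is what makes a timestamped statement index possible later.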

After this, labels like spk_0 had to be mapped to real names. Transcribe does not determine who is who, so this step relies on contextual judgment by Claude Code, which inferred the names from handwritten meeting notes and from the volume and content of each speaker's statements. Finally, all the organized statements and the reference materials were used as input to generate the 7-section Markdown.

Challenges That Arose

The process was not entirely smooth; three issues came up.

An Error Occurred Due to Insufficient IAM Permissions

The first attempt to register the custom vocabulary for Transcribe was rejected with an AccessDeniedException.

User: arn:aws:iam::***:user/*** is not authorized to perform: transcribe:CreateVocabulary

The IAM user I normally use had not been granted any Transcribe-related actions.

I added the necessary permissions and ran the registration again. The main permissions needed were as follows.

  • transcribe:CreateVocabulary, transcribe:UpdateVocabulary, transcribe:GetVocabulary, transcribe:DeleteVocabulary
  • transcribe:StartTranscriptionJob, transcribe:GetTranscriptionJob, transcribe:ListTranscriptionJobs, transcribe:DeleteTranscriptionJob
  • s3:CreateBucket, s3:PutObject, s3:GetObject, s3:ListBucket, s3:DeleteObject, s3:DeleteBucket
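For reference, the actions listed above could be assembled into an IAM policy document along these lines. This is a sketch, not the policy I actually attached: the Sid values are made up, and "Resource": "*" is a placeholder assumption that should be narrowed (for example, to the working bucket) before real use.

```python
import json

TRANSCRIBE_ACTIONS = [
    "transcribe:CreateVocabulary",
    "transcribe:UpdateVocabulary",
    "transcribe:GetVocabulary",
    "transcribe:DeleteVocabulary",
    "transcribe:StartTranscriptionJob",
    "transcribe:GetTranscriptionJob",
    "transcribe:ListTranscriptionJobs",
    "transcribe:DeleteTranscriptionJob",
]

S3_ACTIONS = [
    "s3:CreateBucket",
    "s3:PutObject",
    "s3:GetObject",
    "s3:ListBucket",
    "s3:DeleteObject",
    "s3:DeleteBucket",
]

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TranscribeJobsAndVocabularies",  # hypothetical identifier
            "Effect": "Allow",
            "Action": TRANSCRIBE_ACTIONS,
            "Resource": "*",  # assumption: narrow this in practice
        },
        {
            "Sid": "S3WorkingBucket",  # hypothetical identifier
            "Effect": "Allow",
            "Action": S3_ACTIONS,
            "Resource": "*",  # assumption: restrict to the working bucket
        },
    ],
}

# Print JSON that can be pasted into the IAM console or saved to a file.
print(json.dumps(policy, indent=2))
```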

Japanese Hyphens Were Rejected in Custom Vocabulary

The next issue came up while creating the custom vocabulary. For English, Transcribe's convention is to write multi-word phrases such as "Object Storage" with a hyphen, as in Object-Storage. When the terms were registered in that form for ja-JP, the vocabulary ended up in the FAILED state.

Validation error: Your custom vocabulary file contains one or more unsupported characters ("-") on line 4.

The characters allowed in ja-JP custom vocabularies are restricted, and hyphens are not permitted. Details are described in the Amazon Transcribe character set documentation.

As a workaround, I re-registered the terms in katakana, such as "オブジェクトストレージ" (Object Storage) and "オブジェクトストア" (Object Store). In the actual meeting these terms were almost never pronounced the English way and were usually read as katakana, so this change also improved recognition accuracy.
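A small pre-check like the one below would have caught the problem before the vocabulary was submitted. It is only a sketch: the function name is hypothetical, and it tests for just the one character reported in the error message above, not the full ja-JP character set defined in the Transcribe documentation.

```python
def find_terms_with_hyphen(terms):
    """Return (line_number, term) pairs for terms containing an ASCII hyphen.

    Note that the katakana long-vowel mark "ー" (U+30FC) is a different
    character from the ASCII hyphen "-" (U+002D) and is not flagged.
    """
    return [(number, term) for number, term in enumerate(terms, start=1) if "-" in term]


terms = [
    "オブジェクトストレージ",  # katakana form that was accepted
    "オブジェクトストア",      # katakana form that was accepted
    "Object-Storage",          # hyphenated form that was rejected for ja-JP
]

for number, term in find_terms_with_hyphen(terms):
    print(f"line {number}: {term!r} contains a hyphen")
```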

The Number of Speakers Was Less Than MaxSpeakerLabels

Although I set MaxSpeakerLabels to 6 to match the number of participants, the result contained labels for only 5 speakers, one fewer than expected. The statements aggregated as follows.

| Speaker Label | Number of Statements | Cumulative Statement Time (seconds) |
| --- | --- | --- |
| spk_0 | 20 | 663.8 |
| spk_1 | 4 | 98.1 |
| spk_2 | 15 | 473.7 |
| spk_3 | 34 | 1,151.6 |
| spk_4 | 19 | 481.2 |

Even the total comes to only about 47.8 minutes (2,868.4 seconds), which is shorter than the 56-minute recording. Presumably, short back-channel responses and replies from other speakers were merged into the long utterances of the speaker who talked the most.
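The totals can be checked with a few lines of Python. The per-speaker figures are the ones in the table above, and 3,377 seconds is the audio duration used in the cost calculation later in this article; the variable names are only for illustration.

```python
# (number of statements, cumulative statement time in seconds) per speaker label
speakers = {
    "spk_0": (20, 663.8),
    "spk_1": (4, 98.1),
    "spk_2": (15, 473.7),
    "spk_3": (34, 1151.6),
    "spk_4": (19, 481.2),
}

AUDIO_SECONDS = 3377  # duration of the recording in seconds

total_statements = sum(count for count, _ in speakers.values())
total_seconds = sum(seconds for _, seconds in speakers.values())
unattributed_seconds = AUDIO_SECONDS - total_seconds

print("statements:", total_statements)
print("speech seconds:", round(total_seconds, 1))
print("speech minutes:", round(total_seconds / 60, 1))
print("unattributed minutes:", round(unattributed_seconds / 60, 1))
```

This gives 92 statements and 2,868.4 seconds (about 47.8 minutes) of labeled speech, leaving roughly 8.5 minutes of the recording not attributed to any speaker.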

My own statements were also short, and I believe they were merged into another speaker's segments. MaxSpeakerLabels only specifies the upper limit on the number of speakers; it does not guarantee that that many labels will be generated. I therefore decided to correct this at the final summary generation stage by cross-referencing the reference materials. Since I had kept notes during the meeting, they could serve as supplementary information for speaker identification in that step.

Results and Actual Costs

The Transcribe job itself took 3 minutes and 4 seconds. Including the surrounding work (ffmpeg, S3 upload, download, Markdown generation, and human review), the whole process finished within 60 minutes.

The generated Markdown was practical enough to use: when I compared it against what was actually said in the meeting, it read naturally and matched the content.

The final costs were as follows. In Japanese yen, this comes to approximately 203 yen.

| Item | Amount (USD) |
| --- | --- |
| Amazon Transcribe | 1.35080 |
| Amazon S3 | 0.00025 |
| Custom Vocabulary | 0 |
| Total | approx. 1.35 |

The unit price for the Tokyo region was obtained from the AWS Price List API and multiplied by the duration of the audio in seconds. Standard transcription in ap-northeast-1 costs USD 0.0004 per second (Tier 1), so 3,377 seconds of audio (about 56 minutes 17 seconds) comes to 0.0004 × 3,377 = USD 1.3508, or approximately USD 1.35.
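The arithmetic can be reproduced as follows. The unit price, audio duration, and S3 cost are the figures given above. The exchange rate is not stated in this article, so the 150 yen per USD used below is purely an illustrative assumption.

```python
from decimal import Decimal

PRICE_PER_SECOND_USD = Decimal("0.0004")  # Standard batch, ap-northeast-1, Tier 1
AUDIO_SECONDS = 3377                      # duration of the audio in seconds
S3_COST_USD = Decimal("0.00025")          # S3 cost from the table above

transcribe_cost_usd = PRICE_PER_SECOND_USD * AUDIO_SECONDS
total_usd = transcribe_cost_usd + S3_COST_USD


def to_yen(usd: Decimal, yen_per_usd: Decimal) -> Decimal:
    """Convert USD to yen at the given rate."""
    return usd * yen_per_usd


# Assumption: 150 yen/USD is a placeholder rate, not a figure from the article.
ASSUMED_YEN_PER_USD = Decimal("150")

print("Transcribe (USD):", transcribe_cost_usd)
print("Total (USD):", total_usd)
print("Total (yen, assumed rate):", to_yen(total_usd, ASSUMED_YEN_PER_USD))
```

Decimal is used instead of float so that the small per-second price multiplies out exactly.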

Summary

Simply by passing the recording and reference material URLs to Claude Code, I obtained practical meeting minutes in Markdown for approximately 200 yen. As long as you check the IAM policy, the ja-JP character set constraints, and your data handling rules in advance, this setup can readily be applied to real work. I hope this is helpful for anyone looking to reduce the effort of creating meeting minutes.

