Transcribing audio with Amazon Transcribe from the AWS CLI

quiver

2018.03.13

I tried out Amazon Transcribe, the speech-to-text service announced at re:Invent 2017, from the AWS CLI.

As the material to transcribe, I use the video of Andy Jassy introducing Amazon Transcribe in his re:Invent 2017 keynote.

Amazon Transcribe is still in preview

As of March 12, 2018, Amazon Transcribe is in preview and is available only in the N. Virginia region. If you have not yet applied for the preview, please apply at the following URL.

https://pages.awscloud.com/amazon-transcribe-preview.html

Usage is free of charge during the preview period.

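Processing flow

The following figure summarizes the processing flow of Amazon Transcribe.

The key point is that the input (the media to transcribe) and the output (the transcript text) are stored in S3 buckets that belong to different AWS accounts.

You upload the media you want to transcribe to an S3 bucket that you manage, and change the bucket policy so that Amazon Transcribe can read that media.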

The transcribed text, however, is stored in an S3 bucket managed by Amazon Transcribe. As a result:

No bucket policy allowing Amazon Transcribe to write to your own S3 bucket is needed
The transcribed text in the S3 bucket managed by Amazon Transcribe is retrieved through a presigned URL

Trying the Getting Started guide

I try Amazon Transcribe from the AWS CLI, based on the Getting Started section of the following official documentation.

AWS Documentation » Transcribe » Developer Guide » Getting Started with Amazon Transcribe » Step 4: Getting Started Using the API » Getting Started (AWS Command Line Interface)

The following services are used:

Amazon Transcribe
Amazon S3
AWS IAM

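1. Setting the region

Amazon Transcribe is offered only in N. Virginia, and as a constraint of the service, the S3 bucket that holds the audio to transcribe must be in the same region in which you use Transcribe. I therefore pin the CLI region to N. Virginia.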

$ export AWS_DEFAULT_REGION=us-east-1

2. Creating the S3 bucket

2.1 Creating the bucket

Create the S3 bucket that the audio will be uploaded to.

$ BUCKET=DUMMY
$ aws s3 mb s3://$BUCKET
make_bucket: DUMMY

2.2 Changing the bucket policy

Change the bucket policy so that Amazon Transcribe can read from this S3 bucket.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "transcribe.amazonaws.com"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::DUMMY/*"
        }
    ]
}

Replace DUMMY in "Resource": "arn:aws:s3:::DUMMY/*" with your actual bucket name.

Save this policy as policy.json and set it on the bucket.

$ aws s3api put-bucket-policy --bucket $BUCKET --policy file://policy.json

$ aws s3api get-bucket-policy --bucket $BUCKET
{
    "Policy": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"transcribe.amazonaws.com\"},\"Action\":\"s3:GetObject\",\"Resource\":\"arn:aws:s3:::DUMMY/*\"}]}"
}

For details, see "Permissions Required for Audio Transcription" at the following URL.

https://docs.aws.amazon.com/transcribe/latest/dg/access-control-managing-permissions.html#auth-role-permissions
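Since the only part of the policy that changes per bucket is the Resource ARN, the document can also be generated rather than edited by hand. The following is a minimal sketch in Python using only the standard library; the function name transcribe_read_policy and the bucket name in the usage note are my own placeholders, not part of the AWS documentation.

```python
import json


def transcribe_read_policy(bucket: str) -> dict:
    """Build the read-only bucket policy from this section for `bucket`.

    The structure mirrors the policy.json shown above; only the bucket
    name inside the Resource ARN varies.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "transcribe.amazonaws.com"},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


def policy_json(bucket: str) -> str:
    """Serialize the policy with the same 4-space indentation as above."""
    return json.dumps(transcribe_read_policy(bucket), indent=4)
```

Writing the result of policy_json("my-audio-bucket") to policy.json gives a file that can be passed to put-bucket-policy as in the command above.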

3. Uploading the audio to transcribe to S3

PUT the audio into S3.

$ aws s3 cp test.mp4 s3://$BUCKET/input/
upload: ./test.mp4 to s3://DUMMY/input/test.mp4

As of 2018/03/12, the supported formats are:

WAV
MP3
MP4
FLAC

4. Starting the transcription

With the preparation done, it is finally time to use the Amazon Transcribe API and call start-transcription-job, the API that starts a transcription job.

$ aws transcribe start-transcription-job \
  --transcription-job-name 123 \
  --media-format mp4 \
  --language-code en-US \
  --media MediaFileUri=https://s3.amazonaws.com/$BUCKET/input/test.mp4
{
    "TranscriptionJob": {
        "TranscriptionJobName": "123",
        "LanguageCode": "en-US",
        "TranscriptionJobStatus": "IN_PROGRESS",
        "Media": {
            "MediaFileUri": "https://s3.amazonaws.com/DUMMY/input/test.mp4"
        },
        "CreationTime": 1520808476.722,
        "MediaFormat": "mp4"
    }
}

The --transcription-job-name argument takes a job name that is unique within the AWS account. When the name was a duplicate, I got the error "An error occurred (BadRequestException) when calling the StartTranscriptionJob operation: The requested job name already exists. Use a different job name."
The --language-code argument is a choice between two values: English (en-US) and Spanish (es-US)
The --media-format argument is one of four values: mp3 | mp4 | wav | flac
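The argument constraints above are easy to check before the job is submitted. Below is a rough sketch that derives --media-format from the file extension and assembles the same command line as in this step. The helper names (media_format_for, start_job_command) are hypothetical, and the allowed values are simply the ones listed above as of the preview.

```python
import os
from typing import List

# Values accepted during the preview, as listed in this article.
SUPPORTED_FORMATS = {"mp3", "mp4", "wav", "flac"}
SUPPORTED_LANGUAGES = {"en-US", "es-US"}


def media_format_for(key: str) -> str:
    """Derive --media-format from the object key's file extension."""
    ext = os.path.splitext(key)[1].lstrip(".").lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported media format: {ext!r}")
    return ext


def start_job_command(job_name: str, bucket: str, key: str,
                      language: str = "en-US") -> List[str]:
    """Assemble the argv for `aws transcribe start-transcription-job`."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return [
        "aws", "transcribe", "start-transcription-job",
        "--transcription-job-name", job_name,
        "--media-format", media_format_for(key),
        "--language-code", language,
        "--media", f"MediaFileUri=https://s3.amazonaws.com/{bucket}/{key}",
    ]
```

The returned list can be handed to subprocess.run(cmd, check=True) to actually start the job. Building it as a list rather than a single string avoids shell quoting problems with job names or keys.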

5. Checking the Transcribe job status

Check the job status with the get-transcription-job API.

$ aws transcribe get-transcription-job --transcription-job-name 123

The status moves from IN_PROGRESS, right after the job starts, to COMPLETED or FAILED.
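Because the job runs asynchronously, in practice you end up calling get-transcription-job repeatedly until the status leaves IN_PROGRESS. A minimal polling sketch follows. It shells out to the same CLI command as above; the function names, the 30-second interval, and the limit of 60 polls are arbitrary assumptions of mine, not recommended values.

```python
import json
import subprocess
import time


def get_job(job_name: str) -> dict:
    """Run `aws transcribe get-transcription-job` and return TranscriptionJob."""
    result = subprocess.run(
        ["aws", "transcribe", "get-transcription-job",
         "--transcription-job-name", job_name],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["TranscriptionJob"]


def wait_for_job(job_name: str, fetch=get_job, interval: float = 30.0,
                 max_polls: int = 60, sleep=time.sleep) -> dict:
    """Poll until the job is no longer IN_PROGRESS, then return the job.

    `fetch` and `sleep` are injectable so the loop can be exercised
    without calling AWS.
    """
    for _ in range(max_polls):
        job = fetch(job_name)
        if job["TranscriptionJobStatus"] != "IN_PROGRESS":
            return job  # COMPLETED or FAILED
        sleep(interval)
    raise TimeoutError(f"job {job_name!r} still IN_PROGRESS after {max_polls} polls")
```

The caller still has to look at TranscriptionJobStatus on the returned job to tell COMPLETED from FAILED.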

5.1 Right after the job starts

{
    "TranscriptionJob": {
        "TranscriptionJobName": "123",
        "LanguageCode": "en-US",
        "TranscriptionJobStatus": "IN_PROGRESS",
        "Media": {
            "MediaFileUri": "https://s3.amazonaws.com/DUMMY/input/test.mp4"
        },
        "CreationTime": 1520808476.722,
        "MediaFormat": "mp4",
        "Transcript": {}
    }
}

5.2 After the job completes successfully

{
    "TranscriptionJob": {
        "TranscriptionJobName": "123",
        "LanguageCode": "en-US",
        "MediaSampleRateHertz": 44100,
        "TranscriptionJobStatus": "COMPLETED",
        "Media": {
            "MediaFileUri": "https://s3.amazonaws.com/DUMMY/input/test.mp4"
        },
        "CreationTime": 1520808476.722,
        "CompletionTime": 1520808816.455,
        "MediaFormat": "mp4",
        "Transcript": {
            "TranscriptFileUri": "https://s3.amazonaws.com/aws-transcribe-us-east-1-prod/123456789012/123/asrOutput.json?DUMMY"
        }
    }
}

The transcribed text is stored in an S3 bucket managed by Amazon Transcribe, at s3://aws-transcribe-us-east-1-prod/AWS-ACCOUNT-ID/JOB-ID/asrOutput.json. A presigned URL for accessing this object is returned in TranscriptionJob.Transcript.TranscriptFileUri of the job result.

The "asr" in "asrOutput.json" presumably stands for "automatic speech recognition".

6. Retrieving the transcribed text

Fetch this asrOutput.json over HTTPS.

$ curl -o 123.asrOutput.json \
  "https://s3.amazonaws.com/aws-transcribe-us-east-1-prod/123456789012/123/asrOutput.json?X-Amz-Security-Token=DUMMY"

This presigned URL is valid for only about 15 minutes (899 seconds). The transcript itself remains retrievable for 90 days.

If the presigned URL has expired, a message like the following is returned.

<Error>
  <Code>AccessDenied</Code>
  <Message>Request has expired</Message>
  <X-Amz-Expires>899</X-Amz-Expires>
  <Expires>2018-03-11T07:11:38Z</Expires>
  <ServerTime>2018-03-11T23:01:30Z</ServerTime>
  <RequestId>DUMMY</RequestId>
  <HostId>DUMMY</HostId>
</Error>
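When fetching the transcript from a script, it helps to tell this expiry error apart from other failures, because the fix is simply to call get-transcription-job again for a fresh URL. The sketch below parses an error body shaped like the one above with the standard library. The function name is hypothetical, and it assumes the element names and timestamp format are exactly as in this sample.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import Optional

# Timestamp format used by <Expires> and <ServerTime> in the sample above.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def seconds_since_expiry(error_xml: str) -> Optional[int]:
    """Return how long ago the presigned URL expired, in seconds.

    Returns None when the body is not a "Request has expired" error.
    """
    root = ET.fromstring(error_xml)
    if root.tag != "Error" or root.findtext("Message") != "Request has expired":
        return None
    expires = datetime.strptime(root.findtext("Expires"), TIME_FORMAT)
    server_time = datetime.strptime(root.findtext("ServerTime"), TIME_FORMAT)
    return int((server_time - expires).total_seconds())
```

For the sample above this gives 56992 seconds, i.e. the URL had expired almost 16 hours before the request.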

7. Checking the text

The output is JSON and contains:

The speech-to-text result (results.transcripts[].transcript)
Timing information for each word, i.e. when it was spoken (results.items)

If you only want the text, parse just the former.

Here is an excerpt of the overall output.

$ cat 123.asrOutput.json | jq .
{
  "jobName": "123",
  "accountId": "123456789012",
  "results": {
    "transcripts": [
      {
        "transcript": "How about language ? You know, i talked earlier about the fact that last year ... you want the service to understand so could be used quickly the way you mean it."
      }
    ],
    "items": [
      {
        "start_time": "0.260",
        "end_time": "0.480",
        "alternatives": [
          {
            "confidence": "0.8941",
            "content": "How"
          }
        ],
        "type": "pronunciation"
      },
      ...
      {
        "alternatives": [
          {
            "content": "."
          }
        ],
        "type": "punctuation"
      }
    ]
  },
  "status": "COMPLETED"
}
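The same structure can be read from a script. Here is a small sketch in Python that pulls out the full text and the per-word timings from a parsed asrOutput.json. The function names are my own, and taking the first entry of "alternatives" as the word is an assumption based on the excerpt above, where each item has only one.

```python
import json
from typing import List, Tuple


def load_output(path: str) -> dict:
    """Load a downloaded asrOutput.json file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def full_text(output: dict) -> str:
    """Join every entry under results.transcripts into one string."""
    return " ".join(t["transcript"] for t in output["results"]["transcripts"])


def word_timings(output: dict) -> List[Tuple[str, float, float, float]]:
    """Return (word, start, end, confidence) for each spoken word.

    Punctuation items carry no start_time/end_time, so they are skipped.
    Numeric fields are strings in the JSON and are converted to float.
    """
    rows = []
    for item in output["results"]["items"]:
        if item["type"] != "pronunciation":
            continue
        best = item["alternatives"][0]  # assumption: first alternative is the one to use
        rows.append((best["content"], float(item["start_time"]),
                     float(item["end_time"]), float(best["confidence"])))
    return rows
```

With the file from step 6, word_timings(load_output("123.asrOutput.json")) yields one tuple per word, which is the shape you would need for aligning subtitles.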

Let's look at the whole transcription. ($ cat 123.asrOutput.json | jq ".results.transcripts[].transcript")

How about language ? You know, i talked earlier about the fact that last year we launched both polly and relax, but there are so many other things the builders want to do with language and one of the things that's been interesting is that there's so much data now it's a beast that's being locked up in audio and video files and the problem is practically it's kind of impossible to really search audio well and so the best way it's it's turned out to be able to capture this and to do something with it is to convert from audio to text traditionally how people have done this because that they've hired manual transcription return scripture agencies and they're expensive and they're time consuming and so what people typically do is there really only pick out the very most important things they want to transcribe and they leave all the rest on the table all this fate data all this value is sitting out there not being taken advantage of and leverage so we'd like to change that for people. So i'm excited to introduce a new service called amazon transcribe which just automatic speech recognition so transcribed does long form automatic speech recognition. It can analyze any wave mp three audio file and return text it's super useful for all kinds of things like call logs and subtitles for videos or capturing what said in a presentation or a meeting we'll start with english in spanish, but we'll have many more languages coming in the coming weeks and months, and then one of the things that we do with this service, which is different from other transcription services, is it won't show up to you is just one long, uninterrupted string of text like you'll find another transcription service has been instead we used machine learning to adding punctuation and grammatical formatting, so the text you get back is immediately usable, and then we'll time stamp every words you can align. Subtitles to the right video and it's much easier to deal with. 
Well, not only support very high end audio, but because so much the audio today is locked up in phones, you have to be able to deal with lower quality little bit rate audio and will support that as well here in the future. What you'll see in the coming months, you'll also be able to distinguish between multiple speakers, and then you'll also be able to add your own custom libraries in vocabulary is because there are certain words that you may use in a different way than others that you want the service to understand so could be used quickly the way you mean it.

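The transcription is good enough to fully understand the content of the audio. "Lex" being rendered as the common word "relax" will probably improve once the vocabulary feature is added. The punctuation that is supposed to be inserted automatically using machine learning would make the text easier to read if it were placed more accurately.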

8. Deleting the data

The transcription results are stored in an S3 bucket outside your control. At this time there is no API for deleting this data. Request deletion through AWS Support.

Quoting from the FAQ:

Q. Can I delete voice inputs stored by Amazon Transcribe?

Yes. You can request deletion of voice inputs associated with your account by contacting AWS Support. Deleting voice inputs may degrade your Amazon Transcribe experience.

https://aws.amazon.com/transcribe/faqs/

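Summary

I used Amazon Transcribe from the AWS CLI.

The most important thing to be aware of when using this service is that the transcription is written to an S3 bucket managed by AWS.

There is no way to obtain the transcribed text other than an HTTPS GET of the presigned URL.

Note that you cannot:

Control access to the transcription file
Control the deletion policy for the transcription file

Also, depending on your company or region, you may be concerned about how the audio data and the processed data are handled.

For this, please read the Data Privacy section of the FAQ.

References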

https://docs.aws.amazon.com/transcribe/latest/dg/getting-started.html
https://aws.amazon.com/transcribe/faqs/