Building a web app that streams browser microphone audio through a backend to Amazon Transcribe for real-time transcription
As a learning exercise, I built a web application that performs real-time transcription of browser microphone audio using Amazon Transcribe's streaming API.
In this article, rather than walking through every implementation detail, I'll concentrate on one design point: how to pass browser audio to the Transcribe SDK. While building this, I ran into the problem of connecting push data arriving over WebSocket with the pull interface the SDK expects, and the solution turned out to be quite interesting, so that's what this post is about.
The complete code is available on GitHub.
Please also check our company blog for an explanation of Amazon Transcribe's real-time transcription.
What I Built
A web application that transcribes in real-time as you speak into your browser's microphone.

When you press "Start Recording" and allow microphone access, transcription begins. While you speak, partial results appear in gray; when a segment of speech ends, the text is finalized and displayed in black.
The tech stack consists of React + Vite for the frontend, Hono + Node.js for the backend, and I'm using @aws-sdk/client-transcribe-streaming as the AWS SDK.
Why a Different Design from Regular APIs is Needed
Consider an ordinary API call with fetch. It follows the pattern "send a request → receive a response, done." You fetch data on your own schedule, and once the response arrives, the interaction is complete.
Real-time audio processing is fundamentally different. Audio data flows continuously from the microphone, arriving as small pieces (chunks) on the producer's schedule. This model, where the producer sets the pace, can be called a push model. The Transcribe SDK, on the other hand, requests data one item at a time with for await...of. This model, where data flows only when the SDK (the consumer) calls next(), can be called a pull model.
- Reference: AWS SDKs - Amazon Transcribe
- Reference: for await...of - JavaScript | MDN
Problems arise when these two models—push and pull—coexist. Push uses a callback pattern that "gets called when data arrives," while pull uses an iterator pattern that "returns when requested," so the timing doesn't match.
If you try to connect them directly, data might arrive when the SDK hasn't yet asked for "next," or conversely, the SDK might be waiting but no data has arrived yet.
Therefore, a mechanism like a queue is necessary.
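Before looking at the app's queues, the two models can be contrasted with a minimal sketch (the names here are hypothetical, for illustration only, not code from the app):

```typescript
// Push: a callback the producer invokes whenever it has data.
// (handlePcm is a hypothetical name for illustration.)
const received: number[] = [];
function handlePcm(chunk: Uint8Array) {
  received.push(chunk.length); // runs on the producer's schedule
}

// Pull: the consumer asks for the next item when it is ready.
async function consume(source: AsyncIterable<Uint8Array>): Promise<number[]> {
  const sizes: number[] = [];
  for await (const chunk of source) {
    sizes.push(chunk.length); // nothing flows until we request "next"
  }
  return sizes;
}

// An async generator is the simplest way to satisfy the pull side:
async function* demoSource(): AsyncGenerator<Uint8Array> {
  yield new Uint8Array(4);
  yield new Uint8Array(8);
}
```

With push, `handlePcm` fires whenever the producer decides; with pull, `consume` controls the pace by awaiting each item. The queues in this app exist precisely to sit between these two.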
This app has three push-pull boundaries, which appear as cylinder shapes (PcmQueue / TranscribeAudioQueue / TranscribeEventQueue) in the architecture diagram in the next section. Arrows flowing into a cylinder represent push; arrows flowing out represent pull.
System Architecture and Data Flow
Let's first understand the overall system architecture and data flow.
The three queues in the diagram (PcmQueue / TranscribeAudioQueue / TranscribeEventQueue) are the main focus of this article. All three act as bridges converting push to pull using the same pattern. Let's look at the design of each bridge in turn.
AudioWorklet and PcmQueue
What is AudioWorklet
To receive and process audio data from a microphone in the browser, we use the Web Audio API's AudioWorklet.
AudioWorklet matters here because it runs on a dedicated thread independent of the main thread. Real-time audio processing needs deadline-sensitive handling that can't be guaranteed on the main thread, which is busy with DOM operations and JavaScript execution. The AudioWorklet thread is driven directly by the browser's audio engine, so it isn't affected by the state of the main thread.
- Reference: AudioWorklet - Web APIs | MDN
In this app, I define a PCMProcessor class that extends AudioWorkletProcessor in public/worklets/pcm-processor.js, and connect it as an AudioWorkletNode to the audio graph.
// Frontend side
await audioCtx.audioWorklet.addModule("/worklets/pcm-processor.js");
const workletNode = new AudioWorkletNode(audioCtx, "pcm-processor", {
processorOptions: { bufferSize: 4096 },
});
const source = audioCtx.createMediaStreamSource(stream);
source.connect(workletNode);
How process() gets called
The core of AudioWorkletProcessor is the process() method.
process(inputs) {
const input = inputs[0]?.[0] // Mono ch0, Float32Array with 128 samples
// ...
return true // returning false stops the processor
}
The browser's audio engine calls this automatically every 128 samples. At 48kHz, that's about once every 2.7 milliseconds.
Float32 → Int16 PCM conversion
Audio samples from the microphone are Float32 (floating-point values from -1.0 to 1.0).
Meanwhile, the format to send to Amazon Transcribe is signed 16-bit PCM (Little Endian).
Since Int16 halves the data size compared to Float32, we perform this conversion in browser-side JavaScript (in _flush()) before sending.
_flush() {
const pcm = new ArrayBuffer(this._offset * 2)
const view = new DataView(pcm)
for (let i = 0; i < this._offset; i++) {
const s = Math.max(-1, Math.min(1, this._buffer[i]))
// Scale Float32 (-1.0 to 1.0) to Int16 (-32768 to 32767)
view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true)
}
this.port.postMessage({ type: "pcm", audioData: new Uint8Array(pcm) })
this._offset = 0
}
Sending all 128 samples each time would call postMessage too frequently, so we accumulate them in an internal buffer and send them together when 4096 samples (about 85 milliseconds) have accumulated.
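As a sanity check, the clamping and scaling in _flush() can be reproduced as a standalone function and run outside the worklet (this re-implementation is mine, not code from the repository):

```typescript
// Standalone re-implementation of the conversion loop in _flush(),
// runnable outside the worklet to verify the scaling.
function float32ToInt16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale: negatives map toward -32768, positives toward 32767
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

Assigning a non-integer to an Int16Array element truncates toward zero, the same result DataView.setInt16 produces for in-range values, so -1.0 lands exactly on -32768 and 1.0 on 32767.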
public/worklets/pcm-processor.js full code
class PCMProcessor extends AudioWorkletProcessor {
constructor(options) {
super()
this._bufferSize = options.processorOptions?.bufferSize || 4096
this._buffer = new Float32Array(this._bufferSize)
this._offset = 0
this._ended = false
this.port.onmessage = (e) => {
if (e.data.type === "end") this._ended = true
}
}
process(inputs) {
const input = inputs[0]?.[0]
if (input) {
let i = 0
while (i < input.length) {
const remaining = this._bufferSize - this._offset
const toCopy = Math.min(remaining, input.length - i)
this._buffer.set(input.subarray(i, i + toCopy), this._offset)
this._offset += toCopy
i += toCopy
if (this._offset >= this._bufferSize) this._flush()
}
}
if (this._ended) {
if (this._offset > 0) this._flush()
this.port.postMessage({ type: "ended" })
return false
}
return true
}
_flush() {
const pcm = new ArrayBuffer(this._offset * 2)
const view = new DataView(pcm)
for (let i = 0; i < this._offset; i++) {
const s = Math.max(-1, Math.min(1, this._buffer[i]))
view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true)
}
this.port.postMessage({ type: "pcm", audioData: new Uint8Array(pcm) })
this._offset = 0
}
}
registerProcessor("pcm-processor", PCMProcessor)
To the main thread via MessagePort
To transfer data from the AudioWorklet thread to the main thread, we use MessagePort. We send with postMessage() and receive with an onmessage handler on the main thread side.
- Reference: MessagePort - Web APIs | MDN
// Main thread side
workletNode.port.onmessage = (e) => {
if (e.data.type === "pcm")
audioQueue.push(e.data.audioData); // push!
else if (e.data.type === "ended") audioQueue.end();
};
This is a push model where "a callback is called when data arrives."
From frontend PcmQueue to WebSocket transmission
PCM data accumulated in PcmQueue is retrieved one by one with for await...of (pull) and sent via WebSocket.
// transcribeClient.ts
for await (const chunk of audioStream) {
ws.send(chunk); // chunk is Uint8Array
}
for await...of is a pull model where "I receive data one by one each time I request it." This is where the first push→pull conversion happens. With PcmQueue acting as a buffer, audio data sent by the AudioWorklet (push) and WebSocket sending (pull) can operate at different paces without getting stuck.
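The frontend's PcmQueue itself isn't reproduced in this article, but it follows the same shape as the backend queues covered later. A minimal sketch of what such a queue looks like (an assumption about its shape, not the actual repository code):

```typescript
// Sketch of a push-to-pull bridge for raw PCM chunks.
class PcmQueue {
  private queue: Uint8Array[] = [];
  private notify: (() => void) | null = null;
  private ended = false;

  // Push side: called from the MessagePort onmessage handler.
  push(chunk: Uint8Array) {
    this.queue.push(chunk);
    this.notify?.(); // wake a waiting consumer, if any
    this.notify = null;
  }

  end() {
    this.ended = true;
    this.notify?.();
    this.notify = null;
  }

  // Pull side: consumed with for await...of.
  async *[Symbol.asyncIterator](): AsyncGenerator<Uint8Array> {
    while (true) {
      while (this.queue.length > 0) yield this.queue.shift()!;
      if (this.ended) break;
      await new Promise<void>((r) => (this.notify = r));
    }
  }
}
```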
TranscribeAudioQueue: Converting push to pull
The same push/pull conversion that exists in the frontend also exists in the backend. Just as PcmQueue in the frontend connects AudioWorklet and WebSocket sending, TranscribeAudioQueue in the backend connects WebSocket receiving and the Transcribe SDK.
Transcribe SDK pulls audio with requests
The AudioStream parameter of StartStreamTranscriptionCommand accepts an AsyncIterable<AudioEvent>.
new StartStreamTranscriptionCommand({
LanguageCode: "ja-JP",
MediaEncoding: "pcm",
MediaSampleRateHertz: opts.sampleRate,
AudioStream: opts.audioStream, // Pass AsyncIterable
});
Internally, the SDK consumes this with for await...of. for await...of is a pull model that receives one item at a time whenever it requests "next".
- Reference: Transcribe Streaming::StartStreamTranscriptionCommand - AWS SDK for JavaScript v3
- Reference: Iteration protocols - JavaScript | MDN
Mismatch between push and pull
In summary, there's this mismatch:
WebSocket.onmessage → Called when data arrives (push)
↕ mismatch
Transcribe SDK → Requests next data with for await...of (pull)
If you try to connect these two directly, the timing won't align.
- Even when data arrives from WebSocket, the SDK might not have said "next please" yet
- Even when the SDK says "next please," data might not have arrived from WebSocket yet
Solving with TranscribeAudioQueue
TranscribeAudioQueue uses a queue as a buffer to absorb this timing mismatch.
async *[Symbol.asyncIterator](): AsyncGenerator<AudioEvent> {
while (true) {
// If there's data in the queue, yield it in order (pass to pull side)
while (this.queue.length > 0) {
yield { AudioEvent: { AudioChunk: this.queue.shift()! } }
}
if (this.ended) break
// If queue is empty, wait with Promise until next push arrives
await new Promise<void>((r) => (this.notify = r))
}
}
The operation flow works like this:
[WebSocket] chunk1 arrives → audioQueue.push(chunk1)
[SDK] "next please" → chunk1 is in queue → yield chunk1
[SDK] "next please" → queue is empty → wait with Promise
[WebSocket] chunk2 arrives → push() → notify() → wait resolved → yield chunk2
[WebSocket] chunk3, chunk4 arrive consecutively → accumulate in queue
[SDK] "next please" → yield chunk3
[SDK] "next please" → yield chunk4
Even if push and pull paces don't match, the queue absorbs the difference and both sides continue running without getting stuck.
backend/src/lib/transcribeAudioQueue.ts full code
export class TranscribeAudioQueue {
private queue: Uint8Array[] = []
private notify: (() => void) | null = null
private ended = false
push(chunk: Uint8Array) {
this.queue.push(chunk)
this.notify?.()
this.notify = null
}
end() {
this.ended = true
this.notify?.()
this.notify = null
}
async *[Symbol.asyncIterator](): AsyncGenerator<AudioEvent> {
while (true) {
while (this.queue.length > 0) {
yield { AudioEvent: { AudioChunk: this.queue.shift()! } }
}
if (this.ended) break
await new Promise<void>((r) => (this.notify = r))
}
}
}
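The walkthrough above can be exercised with a small driver. To keep it self-contained, the queue logic is repeated here with the SDK's AudioEvent type replaced by a local stand-in:

```typescript
type AudioEvent = { AudioEvent: { AudioChunk: Uint8Array } }; // stand-in for the SDK type

class AudioQueue { // same logic as TranscribeAudioQueue above
  private queue: Uint8Array[] = [];
  private notify: (() => void) | null = null;
  private ended = false;
  push(chunk: Uint8Array) {
    this.queue.push(chunk);
    this.notify?.();
    this.notify = null;
  }
  end() {
    this.ended = true;
    this.notify?.();
    this.notify = null;
  }
  async *[Symbol.asyncIterator](): AsyncGenerator<AudioEvent> {
    while (true) {
      while (this.queue.length > 0) {
        yield { AudioEvent: { AudioChunk: this.queue.shift()! } };
      }
      if (this.ended) break;
      await new Promise<void>((r) => (this.notify = r));
    }
  }
}

// Producer (push) and consumer (pull) running at different paces:
async function demo(): Promise<number[]> {
  const q = new AudioQueue();
  const consumed: number[] = [];
  const consumer = (async () => {
    for await (const ev of q) consumed.push(ev.AudioEvent.AudioChunk.length);
  })();
  q.push(new Uint8Array(1));                   // chunk1 arrives before the consumer drains
  await new Promise((r) => setTimeout(r, 10)); // consumer drains chunk1, then waits
  q.push(new Uint8Array(2));                   // resolves the waiting Promise
  q.push(new Uint8Array(3));                   // accumulates in the queue
  q.end();
  await consumer;
  return consumed;
}
```

Running demo() yields the chunks in order even though pushes and pulls interleave arbitrarily, which is exactly the property the queue exists to provide.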
The same conversion appears in the result receiving direction
So far we've discussed TranscribeAudioQueue (sending audio data to the Transcribe SDK), but the exact same push/pull conversion was needed in the result receiving direction as well.
Transcription results return to the browser via WebSocket, and this data also arrives with ws.onmessage (push). To receive it with for await...of (pull), the same bridge is needed.
I implemented the same structure as TranscribeEventQueue.
export class TranscribeEventQueue {
private queue: TranscribeEvent[] = [];
private resolve: (() => void) | null = null;
private closed = false;
send(event: TranscribeEvent) {
this.queue.push(event);
this.resolve?.();
this.resolve = null;
}
close() {
this.closed = true;
this.resolve?.();
this.resolve = null;
}
async *[Symbol.asyncIterator](): AsyncGenerator<TranscribeEvent> {
while (true) {
while (this.queue.length > 0) {
yield this.queue.shift()!;
}
if (this.closed) break;
await new Promise<void>((r) => (this.resolve = r));
}
}
}
The core of the pattern is identical to TranscribeAudioQueue. In transcribeClient.ts, we instantiate a TranscribeEventQueue and delegate to it with yield*.
// Note: This is simplified code to show the concept
export async function* startTranscription(...): AsyncGenerator<TranscribeEvent> {
const channel = new TranscribeEventQueue();
ws.onmessage = (event) => {
channel.send({ type: "result", data: ... }); // push
};
ws.onclose = () => {
channel.close(); // termination signal
};
yield* channel; // Delegate yields from TranscribeEventQueue as is
}
When channel.close() is called, TranscribeEventQueue terminates, and startTranscription itself also terminates. For details on yield*, please refer to MDN.
- Reference: yield* - JavaScript | MDN
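The delegation can be seen end to end with a self-contained toy version (local stand-ins for TranscribeEvent and the queue; the real code wires ws.onmessage and ws.onclose where this sketch uses timers):

```typescript
type TranscribeEvent = { type: string; data?: string }; // local stand-in

class EventChannel { // same shape as TranscribeEventQueue
  private queue: TranscribeEvent[] = [];
  private resolve: (() => void) | null = null;
  private closed = false;
  send(event: TranscribeEvent) {
    this.queue.push(event);
    this.resolve?.();
    this.resolve = null;
  }
  close() {
    this.closed = true;
    this.resolve?.();
    this.resolve = null;
  }
  async *[Symbol.asyncIterator](): AsyncGenerator<TranscribeEvent> {
    while (true) {
      while (this.queue.length > 0) yield this.queue.shift()!;
      if (this.closed) break;
      await new Promise<void>((r) => (this.resolve = r));
    }
  }
}

async function* startFakeTranscription(): AsyncGenerator<TranscribeEvent> {
  const channel = new EventChannel();
  // Simulated ws.onmessage / ws.onclose firing later:
  setTimeout(() => channel.send({ type: "result", data: "hello" }), 5);
  setTimeout(() => channel.close(), 10);
  yield* channel; // the outer generator ends when the channel closes
}
```

Consuming startFakeTranscription() with for await...of receives the pushed event and then terminates once close() is called, mirroring the lifecycle described above.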
Managing partial results with resultId
So far we've covered the sending side; the results coming back from Transcribe have some characteristics of their own worth noting.
Streaming transcription results come in two types:
| Type | isPartial | Description |
|---|---|---|
| Partial result | true | Preliminary text still being recognized; may change later |
| Final result | false | Confirmed text; will not change anymore |
- Reference: Result - Amazon Transcribe
Partial and final results for the same utterance segment share the same resultId. On the frontend, by managing "overwrite if same resultId, add if new," we can display partial results updating in real-time and ultimately being replaced by final results.
for await (const event of startTranscription(
handle.audioStream,
handle.sampleRate,
)) {
if (event.type === "result") {
setSegments((prev) => {
const idx = prev.findIndex((s) => s.resultId === event.data.resultId);
return idx >= 0 ? prev.with(idx, event.data) : [...prev, event.data];
});
}
}
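The overwrite-or-append rule can be isolated as a pure function and checked on its own (Segment here is a hypothetical shape for illustration; the map call is behaviorally equivalent to the prev.with(idx, next) used above, written this way for broader runtime support):

```typescript
type Segment = { resultId: string; text: string; isPartial: boolean };

// Same rule as the setSegments updater: overwrite the entry with a
// matching resultId if one exists, otherwise append a new entry.
function upsert(prev: Segment[], next: Segment): Segment[] {
  const idx = prev.findIndex((s) => s.resultId === next.resultId);
  return idx >= 0 ? prev.map((s, i) => (i === idx ? next : s)) : [...prev, next];
}
```

Feeding a partial and then a final result with the same resultId leaves a single, finalized segment, while a new resultId starts a fresh one.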
Conclusion
This was my first time working with AudioWorklet, but I found it interesting to learn how real-time audio processing works with its separate thread communicating via MessagePort. Low-level processing like converting from Float32 to Int16 PCM became clearer through hands-on practice.
I hope this blog is helpful to someone.