Implementing Delay Guard to Solve the Problem of Increasing Processing Delays in Twilio Media Streams

When downstream processing cannot keep up, Twilio Media Streams audio accumulates unprocessed in a queue and the backlog can grow to several seconds or more. In this article, we implement a delay guard that thins out audio exceeding the latency budget, and confirm through actual measurements that the backlog can be kept to a few hundred milliseconds.
2025.12.28

Introduction

With Twilio Media Streams, you can receive call audio in real time over WebSocket. However, when downstream processing (e.g., voice analysis or AI inference) is heavy, unprocessed audio piles up in the queue, the server ends up processing increasingly stale audio, and the delay snowballs. As a result, audio a user just spoke may only be handled by the server several seconds, or even tens of seconds, later.

(Figure: processing delay increasing over time)

In this article, we implement a "delay guard" as one solution to this problem and verify its effectiveness. A delay guard here means a mechanism that thins out old audio once a latency budget (e.g., 200 ms) is exceeded. Conceptually it is a form of load shedding: discarding work in order to stay within a latency target. The aim of this article is to compare how latency behaves with and without this mechanism.

Target Audience

  • Those who want to try Media Streams but are concerned about handling latency
  • Those who want to understand how latency increases in real-time voice processing and potential countermeasures

Prerequisites

  • One-way only (Twilio → server)
    • Processing to return received audio to the Twilio side is out of scope
  • "Delay guard" suppresses latency by discarding audio
    • There is a trade-off in "reducing latency at the cost of reduced information"

Terminology

  • Backlog
    How much unprocessed audio has accumulated, measured in ms of audio time
    backlog = latest received media timestamp − timestamp of the most recently processed chunk
  • Latency Budget
    The upper bound on acceptable delay; beyond it, user experience degrades
  • Delay Guard
    A mechanism that, when the backlog is about to exceed the latency budget, thins out old audio so that processing keeps up with the latest

Architecture

  1. Receive audio events from Twilio via WebSocket
  2. Queue the received media
  3. Workers take items from the queue for processing (fixed sleep in this case)
  4. If delay guard is enabled, discard old items when backlog exceeds the threshold

Implementation

Operations in Twilio Console

Create a Function for the Incoming Webhook in Twilio Functions.

/media-stream-incoming Function
exports.handler = function (context, event, callback) {
  const WSS_URL = context.WSS_URL; // This is an environment variable on Twilio side

  const twiml = new Twilio.twiml.VoiceResponse();

  // Start stream from Twilio → WebSocket server
  twiml.start().stream({
    url: WSS_URL,
    track: 'inbound_track',
  });

  // Minimal response to keep the call from ending immediately (for testing)
  twiml.say('Media stream started. Please speak.');
  twiml.pause({ length: 60 });

  callback(null, twiml);
};

Set the environment variables as follows:

Variable Name | Example | Meaning
WSS_URL | wss://****.com/twilio | Connection destination obtained in the procedure described later

Set the URL of the Function above as the incoming-call webhook in the Voice Configuration of the purchased Twilio number, with POST as the request method.

(Screenshot: setting the Function URL on the phone number)
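
If you prefer to script this step instead of using the console, the same webhook can be set with the Twilio Node helper library. The snippet below is a hypothetical sketch, not part of the procedure actually used; the PN SID placeholder and the Function URL are assumptions you would replace with your own values.

// Hypothetical alternative to the console step above: set the incoming-call webhook via the Twilio Node helper
const twilio = require('twilio');

// Assumes TWILIO_ACCOUNT_SID / TWILIO_AUTH_TOKEN are set in the environment
const client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

client
  .incomingPhoneNumbers('PNXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') // SID of the purchased number (placeholder)
  .update({
    voiceUrl: 'https://<your-functions-domain>.twil.io/media-stream-incoming', // URL of the Function above
    voiceMethod: 'POST',
  })
  .then(() => console.log('voice webhook updated'))
  .catch((e) => console.error(e.message));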

Building a WebSocket Server

For the WebSocket server, any host that is publicly reachable and can accept WSS will do; here we used Render.com.

The environment variables on the Render side are as follows:

Variable Name | Example | Meaning
SLEEP_MS | 60 | Simulated processing time per chunk (ms)
L1_MS | 200 | Latency budget (ms)
ENABLE_GOVERNOR | 0 / 1 | Disable / enable the delay guard

The server.js for testing is as follows:

server.js
const http = require('http');
const express = require('express');
const { WebSocketServer } = require('ws');

const app = express();
const port = process.env.PORT || 3000;

app.get('/', (req, res) => {
  res.status(200).send('ok');
});

const server = http.createServer(app);
const wss = new WebSocketServer({ server, path: '/twilio' });

// Environment variables controlled on Render side
const SLEEP_MS = Number(process.env.SLEEP_MS || 60);
const ENABLE_GOVERNOR = String(process.env.ENABLE_GOVERNOR || '0') === '1';
const L1_MS = Number(process.env.L1_MS || 200);

// monotonic elapsed time (ms)
const t0 = process.hrtime.bigint();
function nowMs() {
  return Number((process.hrtime.bigint() - t0) / 1000000n);
}
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

wss.on('connection', (ws, req) => {
  console.log('[ws] connected', { path: req.url });

  const state = {
    streamSid: null,
    mediaFormat: null,

    tsLatest: null,
    tsDone: null,

    lastSeq: null,
    droppedChunks: 0,

    queue: [], // { ts, seq }[]
    closed: false,
  };

  function backlogMs() {
    if (state.tsLatest == null || state.tsDone == null) return null;
    return state.tsLatest - state.tsDone;
  }

  // Delay guard: Discard items "older than L1_MS from latest" (drop-head)
  function applyGovernor() {
    if (!ENABLE_GOVERNOR) return;
    if (state.tsLatest == null) return;

    const cutoff = state.tsLatest - L1_MS;

    while (state.queue.length > 0) {
      const head = state.queue[0];
      if (head.ts >= cutoff) break;
      state.queue.shift();
      state.droppedChunks += 1;
    }
  }

  async function workerLoop() {
    for (;;) {
      if (state.closed) return;

      if (state.queue.length === 0) {
        await sleep(5);
        continue;
      }

      const item = state.queue.shift();

      // Simulated downstream (fixed sleep)
      await sleep(SLEEP_MS);

      state.tsDone = item.ts;

      // Check if we're keeping up with each processing
      applyGovernor();
    }
  }

  // Output only one line of CSV per second (to prevent log inflation)
  const metricTimer = setInterval(() => {
    const b = backlogMs();
    if (b == null) return;

    // metric,t_wall_ms,ts_latest,ts_done,backlog_ms,queue_len,dropped_chunks,governor,sleep_ms,l1_ms
    console.log(
      [
        'metric',
        nowMs(),
        state.tsLatest,
        state.tsDone,
        b,
        state.queue.length,
        state.droppedChunks,
        ENABLE_GOVERNOR ? 1 : 0,
        SLEEP_MS,
        L1_MS,
      ].join(',')
    );
  }, 1000);

  workerLoop().catch((e) => {
    console.log('[worker] error', { message: e?.message });
  });

  ws.on('message', (message) => {
    const text = Buffer.isBuffer(message) ? message.toString('utf8') : String(message);

    let data;
    try {
      data = JSON.parse(text);
    } catch (e) {
      return;
    }

    const ev = data.event;

    if (ev === 'connected') {
      console.log('[twilio] connected');
      return;
    }

    if (ev === 'start') {
      state.streamSid = data.start?.streamSid ?? null;
      state.mediaFormat = data.start?.mediaFormat ?? null;

      state.tsLatest = null;
      state.tsDone = null;
      state.lastSeq = null;
      state.droppedChunks = 0;
      state.queue.length = 0;

      console.log('[twilio] start', {
        streamSid: state.streamSid,
        mediaFormat: state.mediaFormat,
        governor: ENABLE_GOVERNOR ? 1 : 0,
        sleepMs: SLEEP_MS,
        l1Ms: L1_MS,
      });
      return;
    }

    if (ev === 'media') {
      const ts = Number(data.media?.timestamp);
      const seq = Number(data.sequenceNumber);
      if (!Number.isFinite(ts) || !Number.isFinite(seq)) return;

      if (state.lastSeq != null && seq !== state.lastSeq + 1) {
        console.log('[warn] seq gap', { prev: state.lastSeq, current: seq, delta: seq - state.lastSeq });
      }
      state.lastSeq = seq;

      state.tsLatest = ts;
      state.queue.push({ ts, seq });

      // Check catch-up at receipt time as well
      applyGovernor();
      return;
    }

    if (ev === 'stop') {
      console.log('[twilio] stop', { streamSid: state.streamSid ?? data.streamSid });
      return;
    }
  });

  ws.on('close', () => {
    state.closed = true;
    clearInterval(metricTimer);
    console.log('[ws] closed');
  });

  ws.on('error', (err) => {
    console.log('[ws] error', { message: err?.message });
  });
});

server.listen(port, '0.0.0.0', () => {
  console.log('[http] listening', { port: String(port) });
});

Deploy it with the following steps:

  • Push to GitHub
  • Create a Web Service on Render.com and connect the repository
  • Set Start Command to node server.js
  • Set Environment Variables for SLEEP_MS, L1_MS, ENABLE_GOVERNOR
  • After deployment, confirm that https://<service>.onrender.com/ returns ok

(Screenshot: Render.com Web Service settings)
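
Before wiring up a real call, a small client that imitates Twilio's start/media events can be used to sanity-check the server. This is a rough local test helper, not something Twilio provides; the message shapes follow only the fields that server.js above actually reads (media.timestamp in ms and sequenceNumber), and the 20 ms send interval approximates Twilio's chunk cadence.

fake-twilio-client.js
// Sends Twilio-like start/media events to the local server (rough test helper, illustrative only)
const WebSocket = require('ws');

const ws = new WebSocket('ws://localhost:3000/twilio');

ws.on('open', () => {
  ws.send(JSON.stringify({ event: 'connected' }));
  ws.send(JSON.stringify({
    event: 'start',
    start: { streamSid: 'MZtest', mediaFormat: { encoding: 'audio/x-mulaw', sampleRate: 8000, channels: 1 } },
  }));

  let seq = 0;
  let ts = 0;
  const timer = setInterval(() => {
    seq += 1;
    ts += 20; // each chunk represents ~20 ms of audio
    ws.send(JSON.stringify({
      event: 'media',
      sequenceNumber: String(seq),
      media: { timestamp: String(ts), payload: '' }, // payload omitted; server.js only uses timestamp/seq
    }));
    if (ts >= 60000) { // stop after ~60 s of simulated audio
      ws.send(JSON.stringify({ event: 'stop', streamSid: 'MZtest' }));
      clearInterval(timer);
      ws.close();
    }
  }, 20);
});

ws.on('error', (e) => console.error('ws error:', e.message));

Run it with node fake-twilio-client.js while the server is running locally; the per-second metric lines should show the backlog growing, or being capped when ENABLE_GOVERNOR=1.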

Testing

Under the same conditions, make approximately 60-second voice calls to the Twilio number and compare these two patterns:

  • Pattern A: No delay guard (ENABLE_GOVERNOR=0)
  • Pattern B: With delay guard (ENABLE_GOVERNOR=1, L1_MS=200)

The simulated processing time is fixed at SLEEP_MS=60. Twilio delivers a media message roughly every 20 ms, so spending 60 ms per chunk guarantees that processing cannot keep up with arrival. A back-of-envelope check of the resulting backlog growth is sketched below.
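
The following snippet just works through that arithmetic; the 20 ms chunk size is Twilio's default cadence for 8 kHz μ-law audio.

// Why SLEEP_MS=60 guarantees falling behind (back-of-envelope, illustrative)
const chunkAudioMs = 20;                  // each media message carries ~20 ms of audio
const processMsPerChunk = 60;             // SLEEP_MS: simulated processing time per chunk
const audioProcessedPerSec = (1000 / processMsPerChunk) * chunkAudioMs; // ≈ 333 ms of audio per wall-clock second
const backlogGrowthPerSec = 1000 - audioProcessedPerSec;                // ≈ 667 ms of extra backlog per second
console.log({ audioProcessedPerSec, backlogGrowthPerSec });

This matches the Pattern A measurements below, where the backlog grows by roughly 660 to 680 ms each second.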

On Render.com, we set the log mode to Live tail and collected logs.

(Screenshot: Render.com Live tail logs)

Results

Aggregating the logs from this test (63 samples each, about 62 seconds) resulted in the following:

Condition | Max Backlog | Median Backlog | 95th Percentile Backlog | Max Queue | Total Dropped
No delay guard | 42020 ms | 21380 ms | 39954 ms | 2100 | 0
With delay guard | 320 ms | 300 ms | 320 ms | 11 | 2089

  • Pattern A: No delay guard
    The backlog grew to about 42 seconds at its peak. In this state, the server is handling audio that was spoken tens of seconds earlier.

  • Pattern B: With delay guard
    The backlog stayed at around 320 ms at most. In exchange, the audio that could not be processed in time was thinned out.

Backlog Progression

(Figure: backlog progression, with vs. without delay guard)

Queue Length Progression

(Figure: queue length progression, with vs. without delay guard)

Cumulative Dropping with Delay Guard Enabled

(Figure: cumulative dropped chunks with delay guard enabled)

Pattern A Detailed Data
metric,t_wall_ms,ts_latest,ts_done,backlog_ms,queue_len,dropped_chunks,governor,sleep_ms,l1_ms
metric,127053,1227,467,760,37,0,0,60,200
metric,128053,2227,807,1420,70,0,0,60,200
metric,129053,3207,1127,2080,103,0,0,60,200
metric,130053,4207,1467,2740,136,0,0,60,200
metric,131053,5207,1807,3400,169,0,0,60,200
metric,132053,6207,2127,4080,203,0,0,60,200
metric,133053,7207,2467,4740,236,0,0,60,200
metric,134053,8207,2787,5420,270,0,0,60,200
metric,135053,9207,3127,6080,303,0,0,60,200
metric,136053,10207,3467,6740,336,0,0,60,200
metric,137053,11207,3787,7420,370,0,0,60,200
metric,138053,12207,4127,8080,403,0,0,60,200
metric,139053,13207,4467,8740,436,0,0,60,200
metric,140053,14207,4787,9420,470,0,0,60,200
metric,141053,15207,5127,10080,503,0,0,60,200
metric,142053,16207,5467,10740,536,0,0,60,200
metric,143053,17207,5787,11420,570,0,0,60,200
metric,144053,18207,6127,12080,603,0,0,60,200
metric,145053,19207,6467,12740,636,0,0,60,200
metric,146053,20207,6787,13420,670,0,0,60,200
metric,147053,21207,7127,14080,703,0,0,60,200
metric,148053,22207,7467,14740,736,0,0,60,200
metric,149053,23207,7787,15420,770,0,0,60,200
metric,150053,24207,8127,16080,803,0,0,60,200
metric,151053,25207,8447,16760,837,0,0,60,200
metric,152053,26207,8787,17420,870,0,0,60,200
metric,153054,27207,9127,18080,903,0,0,60,200
metric,154054,28207,9447,18760,937,0,0,60,200
metric,155054,29207,9787,19420,970,0,0,60,200
metric,156054,30207,10127,20080,1003,0,0,60,200
metric,157054,31207,10447,20760,1037,0,0,60,200
metric,158054,32167,10787,21380,1068,0,0,60,200
metric,159054,33167,11127,22040,1101,0,0,60,200
metric,160054,34167,11447,22720,1135,0,0,60,200
metric,161054,35167,11787,23380,1168,0,0,60,200
metric,162054,36167,12127,24040,1201,0,0,60,200
metric,163054,37167,12447,24720,1235,0,0,60,200
metric,164054,38167,12787,25380,1268,0,0,60,200
metric,165054,39167,13107,26060,1302,0,0,60,200
metric,166054,40167,13447,26720,1335,0,0,60,200
metric,167054,41167,13787,27380,1368,0,0,60,200
metric,168055,42187,14107,28080,1403,0,0,60,200
metric,169055,43187,14447,28740,1436,0,0,60,200
metric,170056,44167,14787,29380,1468,0,0,60,200
metric,171057,45187,15107,30080,1503,0,0,60,200
metric,172057,46147,15447,30700,1534,0,0,60,200
metric,173057,47147,15787,31360,1567,0,0,60,200
metric,174057,48147,16107,32040,1601,0,0,60,200
metric,175058,49147,16447,32700,1634,0,0,60,200
metric,176058,50147,16787,33360,1667,0,0,60,200
metric,177058,51127,17107,34020,1700,0,0,60,200
metric,178058,52127,17447,34680,1733,0,0,60,200
metric,179062,53127,17787,35340,1766,0,0,60,200
metric,180063,54127,18107,36020,1800,0,0,60,200
metric,181063,55127,18447,36680,1833,0,0,60,200
metric,182063,56127,18767,37360,1867,0,0,60,200
metric,183064,57127,19107,38020,1900,0,0,60,200
metric,184064,58127,19447,38680,1933,0,0,60,200
metric,185065,59127,19767,39360,1967,0,0,60,200
metric,186065,60127,20107,40020,2000,0,0,60,200
metric,187065,61127,20447,40680,2033,0,0,60,200
metric,188065,62127,20767,41360,2067,0,0,60,200
metric,189065,63127,21107,42020,2100,0,0,60,200
Pattern B Detailed Data
metric,t_wall_ms,ts_latest,ts_done,backlog_ms,queue_len,dropped_chunks,governor,sleep_ms,l1_ms
metric,24066,1187,887,300,11,29,1,60,200
metric,25065,2187,1907,280,11,62,1,60,200
metric,26065,3187,2867,320,11,96,1,60,200
metric,27065,4167,3867,300,11,128,1,60,200
metric,28066,5167,4887,280,11,161,1,60,200
metric,29066,6167,5847,320,11,195,1,60,200
metric,30066,7167,6867,300,11,228,1,60,200
metric,31066,8167,7887,280,11,261,1,60,200
metric,32067,9167,8847,320,11,295,1,60,200
metric,33066,10167,9867,300,11,328,1,60,200
metric,34066,11167,10887,280,11,361,1,60,200
metric,35066,12167,11847,320,11,395,1,60,200
metric,36066,13167,12867,300,11,428,1,60,200
metric,37066,14147,13867,280,11,460,1,60,200
metric,38066,15147,14827,320,11,494,1,60,200
metric,39066,16147,15867,280,11,527,1,60,200
metric,40067,17147,16887,260,10,561,1,60,200
metric,41066,18147,17847,300,11,594,1,60,200
metric,42066,19147,18867,280,11,627,1,60,200
metric,43066,20147,19887,260,10,661,1,60,200
metric,44066,21147,20847,300,11,694,1,60,200
metric,45067,22147,21847,300,11,727,1,60,200
metric,46066,23147,22887,260,10,761,1,60,200
metric,47066,24147,23827,320,11,794,1,60,200
metric,48066,25107,24827,280,11,825,1,60,200
metric,49067,26087,25827,260,10,858,1,60,200
metric,50066,27087,26787,300,11,891,1,60,200
metric,51066,28087,27807,280,11,924,1,60,200
metric,52066,29087,28767,320,11,958,1,60,200
metric,53066,30087,29787,300,11,991,1,60,200
metric,54067,31087,30807,280,11,1024,1,60,200
metric,55067,32087,31767,320,11,1058,1,60,200
metric,56067,33087,32787,300,11,1091,1,60,200
metric,57067,34067,33787,280,11,1123,1,60,200
metric,58067,35067,34747,320,11,1157,1,60,200
metric,59067,36067,35767,300,11,1190,1,60,200
metric,60067,37067,36787,280,11,1223,1,60,200
metric,61067,38067,37747,320,11,1257,1,60,200
metric,62068,39067,38767,300,11,1290,1,60,200
metric,63067,40067,39787,280,11,1323,1,60,200
metric,64067,41067,40747,320,11,1357,1,60,200
metric,65067,42067,41787,280,11,1390,1,60,200
metric,66067,43067,42807,260,10,1424,1,60,200
metric,67067,44067,43767,300,11,1457,1,60,200
metric,68067,45067,44787,280,11,1490,1,60,200
metric,69068,46067,45807,260,10,1524,1,60,200
metric,70067,47067,46767,300,11,1557,1,60,200
metric,71067,48067,47787,280,11,1590,1,60,200
metric,72067,49067,48807,260,10,1624,1,60,200
metric,73067,50067,49767,300,11,1657,1,60,200
metric,74067,51067,50787,280,11,1690,1,60,200
metric,75067,52067,51747,320,11,1724,1,60,200
metric,76068,53067,52767,300,11,1757,1,60,200
metric,77067,54027,53747,280,11,1788,1,60,200
metric,78067,55027,54707,320,11,1822,1,60,200
metric,79067,56027,55727,300,11,1855,1,60,200
metric,80067,57027,56747,280,11,1888,1,60,200
metric,81066,58027,57707,320,11,1922,1,60,200
metric,82068,59027,58727,300,11,1955,1,60,200
metric,83068,60027,59747,280,11,1988,1,60,200
metric,84067,61027,60727,300,11,2022,1,60,200
metric,85067,62027,61747,280,11,2055,1,60,200
metric,86068,63027,62767,260,10,2089,1,60,200

Discussion

It's a Trade-off with Information Loss

With the delay guard enabled, dropped_chunks=2089; at 20 ms of audio per chunk, that is roughly 41.78 seconds of discarded audio. In this experiment we intentionally made processing slower than arrival with SLEEP_MS=60, so substantial thinning is expected.

What matters is that introducing the delay guard changes the system's characteristics. Without it, the backlog keeps growing for as long as processing cannot keep up. With it, the backlog stays near the latency budget, and the fact that not everything can be processed surfaces as data loss instead. In other words, the delay guard turns an overloaded system from "unboundedly delayed" into "bounded delay with data loss."

This design suits use cases that prioritize immediacy over completeness, such as keyword detection or live caption tracking. It does not suit use cases where data loss is unacceptable, such as audit storage or transcription that must be complete and precise. When data loss is unacceptable, increasing processing capacity, making the processing itself cheaper, moving work to an asynchronous pipeline, or separating storage from the real-time path should take priority over a delay guard.

Refining Discard Logic

The implementation in this article is a simple drop-head approach that discards the oldest items first when backlog exceeds the latency budget. While this implementation is straightforward and reliably suppresses latency, there are ways to discard more intelligently that might preserve more information value while still meeting the latency budget.

For example, preferentially discarding silence or low-energy segments could be one strategy. In human conversation or caption generation, missing silent periods often has relatively little impact on understanding meaning. This would tend to preserve speech-dense sections better than simple drop-head.
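
As a rough sketch of what that could look like (illustrative only, not the implementation measured above): Twilio's default media payload is base64-encoded 8 kHz μ-law, so each chunk can be scored by its mean amplitude after decoding, and the quietest chunks discarded first whenever the queue would take longer than the budget to drain. The { ts, seq, payload, energy } item shape and the applyEnergyAwareGovernor name are assumptions for illustration.

// Sketch: energy-aware thinning instead of pure drop-head (illustrative)

// Decode one μ-law byte to a linear PCM sample (standard G.711 μ-law expansion)
function muLawToPcm(u) {
  u = ~u & 0xff;
  let t = ((u & 0x0f) << 3) + 0x84;
  t <<= (u & 0x70) >> 4;
  return (u & 0x80) ? (0x84 - t) : (t - 0x84);
}

// Mean absolute amplitude of a base64-encoded μ-law chunk (higher ≈ more speech energy)
function chunkEnergy(payloadBase64) {
  const bytes = Buffer.from(payloadBase64, 'base64');
  if (bytes.length === 0) return 0;
  let sum = 0;
  for (const b of bytes) sum += Math.abs(muLawToPcm(b));
  return sum / bytes.length;
}

// While the estimated time to drain the queue exceeds the budget, drop the quietest chunk.
// Unlike drop-head, this tends to sacrifice silence and keep speech-dense chunks.
function applyEnergyAwareGovernor(queue, budgetMs, processMsPerChunk) {
  while (queue.length * processMsPerChunk > budgetMs) {
    let quietest = 0;
    for (let i = 1; i < queue.length; i++) {
      if (queue[i].energy < queue[quietest].energy) quietest = i;
    }
    queue.splice(quietest, 1); // discard the chunk with the least energy
  }
}

// On receipt (replacing the drop-head guard):
// state.queue.push({ ts, seq, payload: data.media.payload, energy: chunkEnergy(data.media.payload) });
// applyEnergyAwareGovernor(state.queue, L1_MS, SLEEP_MS);

The loop bounds how long any queued chunk can wait (queue length times per-chunk processing time stays within the budget) while preferring to sacrifice silence over speech.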

Another approach is to shift from discarding continuous runs to temporal sampling: during overload, pass through only one chunk at regular intervals, so the downstream still tracks the stream at a coarse temporal resolution. Completeness is reduced, but for use cases focused on following topic changes or content outlines, this can beat dropping long continuous stretches. Under the conditions of this test, where processing keeps up with only about 1/3 of the incoming data, a natural design would be to process 1 out of every 3 chunks during overload; see the sketch below. Drop-head produces a similar overall loss rate, but its drops come in long continuous runs, so sampling can give more stable quality for some applications.
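
A minimal sketch of that idea (illustrative; makeSamplingGovernor is a hypothetical helper, and the keep-one-in-N ratio would be tuned to the actual overload factor, here roughly 1 in 3):

// Sketch: sampling-based thinning, keep only every Nth chunk while overloaded (illustrative)
function makeSamplingGovernor(keepEveryN) {
  let counter = 0;
  return function shouldKeep(backlogMs, budgetMs) {
    if (backlogMs == null || backlogMs <= budgetMs) return true; // not overloaded: keep everything
    counter = (counter + 1) % keepEveryN;
    return counter === 0; // overloaded: pass only one chunk out of every keepEveryN
  };
}

// Usage on receipt (replacing the drop-head guard):
// const shouldKeep = makeSamplingGovernor(3);
// if (shouldKeep(backlogMs(), L1_MS)) state.queue.push({ ts, seq });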

(Figure: smarter discard strategies)

Thus, a delay guard is not just about discarding old items; it is also a quality-design problem of how data loss is structured during overload. By picking a concrete use case and measuring how downstream quality metrics (e.g., keyword detection rate, caption tracking delay, transcription error rate) change as the discard pattern changes, the decision of whether and how to apply a delay guard can be made quantitatively.

Conclusion

In Twilio Media Streams, when downstream processing of received audio cannot keep up with the arrival rate, the backlog can continue to grow, potentially resulting in audio processing delays of tens of seconds. Implementing a delay guard can keep latency near the latency budget (e.g., 200 ms), but unprocessed data is lost through load shedding, creating a trade-off with completeness. In production, selecting discard logic tailored to use cases, such as silence-prioritized discarding or sampling-based thinning, and evaluating with downstream quality metrics, helps make more informed application decisions.
