Thinking about modern real-time audio processing: Edge AI and haptics

In today's smartphone-centric world, there is growing demand for experience design that assumes the sound is muted. This article looks at how real-time audio processing use cases are changing from two angles: richer haptic (vibration) expression and audio metadata added through edge AI.
2026.01.17


Introduction

I have written several articles about real-time audio processing for enhancing experiences through SFX (sound effects) and music.

https://dev.classmethod.jp/articles/cpp-simd-audio-gain-benchmark/

https://dev.classmethod.jp/articles/vst3-again-streaming-audio-block-sampleoffset/

On the other hand, with the proliferation of smartphones, content is increasingly consumed with the volume turned off, which makes it difficult to build experiences on SFX and music alone. In this article, we look at how real-time audio processing use cases are changing, focusing on the evolution of haptics (vibration) and on audio metadata augmentation through edge AI.

Background

With the spread of smartphones, we increasingly use apps and games in situations where playing sound is awkward, such as during commutes or while waiting, out of consideration for others. As a result, using videos and apps with the volume turned off has become common. For example, a Verizon Media survey reports that 69 percent of consumers watch video with the sound off in public places.

https://www.forbes.com/sites/tjmccue/2019/07/31/verizon-media-says-69-percent-of-consumers-watching-video-with-sound-off

In app and game development, this assumption strongly influences experience design. Even if you include SFX and music, a certain number of users won't hear them. Therefore, the more you build experiences around sound, the harder it becomes to deliver the intended experience.

In this context, non-audio feedback channels have become increasingly important. For example, improved haptic (vibration) expressiveness has expanded the range of tactile experience design. In addition, advances in edge AI have made it easier to generate audio metadata such as subtitles, translations, summaries, and emotion analysis on the device and present it in real time. Both can be seen as extensions of real-time audio processing, in the sense that both process signals with minimal latency.

Improved Expressiveness in Haptics

Haptics refers to tactile feedback; in this article it mainly means vibration. Device-side expressiveness used to be limited, making it hard to implement the experiences designers had in mind. In recent years the hardware and APIs have caught up, allowing intensity and waveforms to be shaped, so it is now easier to build the intended tactile feedback into an experience.

For example, iOS provides "Core Haptics," which allows apps to incorporate custom haptic and sound feedback.

https://developer.apple.com/documentation/corehaptics/

Android also allows specifying waveforms and intensity with "VibrationEffect."

https://developer.android.com/reference/android/os/VibrationEffect/
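
As a rough sketch of what this looks like in app code (assuming API level 26 or later and a vibrator that supports amplitude control; the function name is made up for illustration), a stepped-amplitude waveform like the one used in the adb experiment later in this article could be expressed with VibrationEffect.createWaveform:

import android.content.Context
import android.os.VibrationEffect
import android.os.Vibrator

// Sketch: play a stepped-amplitude waveform with VibrationEffect.
// Assumes API level 26+ and a vibrator that supports amplitude control.
fun playSteppedVibration(context: Context) {
    val vibrator = context.getSystemService(Vibrator::class.java) ?: return
    if (!vibrator.hasAmplitudeControl()) return

    // One-second segments; amplitude is 0-255 (0 means the motor is off).
    val timings = longArrayOf(1000, 1000, 1000, 1000, 1000)
    val amplitudes = intArrayOf(30, 80, 140, 200, 255)

    // -1 = play the waveform once, without repeating.
    vibrator.vibrate(VibrationEffect.createWaveform(timings, amplitudes, -1))
}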

The Joy-Con, the Nintendo Switch's controller, features "HD Rumble," which uses a built-in linear vibration motor to reproduce subtle differences in tactile sensation.

https://www.nintendo.com/jp/topics/article/73d9531a-6bbe-11e7-8cda-063b7ac45a6d?srsltid=AfmBOoqxzo-_RXEcoHkssbQA_gz2iVDQhdV7OxnXeHVxBCEDlhKC95bR

Beyond simply controlling vibration intensity, research is also progressing on experiences that exploit human perceptual characteristics. NTT's "Buru-Navi" can create the sensation of being pulled or pushed by using asymmetric vibration.

https://www.rd.ntt/cs/team_project/human/burunavi/

While haptics may seem like a separate technology from audio processing, both essentially deal with time-varying signals. Just as sound effects and music are waveforms, vibration can be designed as a waveform whose intensity and temporal shape are controlled. The work of designing tactile waveforms, managing latency and synchronization, and absorbing device differences is known as haptic engineering, and the concerns emphasized in real-time audio processing remain important there.
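
As a toy illustration of this parallel (not a production technique), the following Kotlin sketch derives a coarse amplitude envelope from a buffer of 16-bit PCM samples and maps it onto vibrator amplitudes; the function name and frame length are arbitrary choices for the example:

import android.os.VibrationEffect
import android.os.Vibrator
import kotlin.math.abs

// Toy sketch: turn an audio amplitude envelope into a haptic waveform.
// pcm holds 16-bit mono samples; frameMs is the length of each haptic segment.
fun audioEnvelopeToHaptics(vibrator: Vibrator, pcm: ShortArray, sampleRate: Int, frameMs: Long = 50L) {
    if (!vibrator.hasAmplitudeControl()) return
    val samplesPerFrame = (sampleRate * frameMs / 1000).toInt().coerceAtLeast(1)

    val timings = mutableListOf<Long>()
    val amplitudes = mutableListOf<Int>()
    var i = 0
    while (i < pcm.size) {
        val end = minOf(i + samplesPerFrame, pcm.size)
        // Peak level of this frame.
        var peak = 0
        for (j in i until end) peak = maxOf(peak, abs(pcm[j].toInt()))
        timings += frameMs
        // Map to the 0-255 amplitude range (0 switches the motor off).
        amplitudes += (peak * 255 / 32767).coerceIn(0, 255)
        i = end
    }

    if (timings.isNotEmpty()) {
        vibrator.vibrate(
            VibrationEffect.createWaveform(timings.toLongArray(), amplitudes.toIntArray(), -1)
        )
    }
}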

Testing Continuous Value Vibration with ADB Commands

I experimented with continuous value vibration control (not just 0/1) on an Android device.

Test environment:

  • Windows 11
  • ADB version 1.0.41
  • Pixel 8 Pro

For testing, I first prepared two things:

  • Enabled USB debugging on the phone
  • Enabled vibration in phone settings

waveform -a describes a vibration as a list of <duration(ms)> <amplitude> pairs, where amplitude is on a scale of 1-255. In the following example, the amplitude increases in steps every second: 30 → 80 → 140 → 200 → 255.

adb shell cmd vibrator_manager synced -f waveform -a \
  1000 30 \
  1000 80 \
  1000 140 \
  1000 200 \
  1000 255

The tactile feeling clearly changes at the boundary of each section, with distinct steps. I found that such step changes are easier to work with when trying to create click-like sensations similar to SFX.

Adding -c allows continuous changes between values. In the next example, the amplitude ramps smoothly through 30 → 80 → 140 → 200 → 255, one segment per second.

adb shell cmd vibrator_manager synced -f waveform -a -c \
  1000 30 \
  1000 80 \
  1000 140 \
  1000 200 \
  1000 255

Compared with the stepped version, the change in tactile feedback is smoother and the segment boundaries are less noticeable. Much like a fade in audio processing, this makes it possible to control the impression the vibration gives.

Metadata Augmentation through Edge AI

Edge AI refers to on-device inference. By having the device handle advanced machine learning processing, latency due to communication with external computing resources (cloud or on-premises environments) can be reduced. Also, if processing is completed entirely on the device, it can be used in offline environments.

A familiar example is iPhone's voice input, which Apple states is processed on-device for many languages and doesn't require an internet connection.

https://support.apple.com/en-tj/guide/iphone/iph2c0651d2/ios

As the range of audio processing that devices can handle expands, it becomes easier to add metadata to audio on the device side. Metadata augmentation here means attaching supplementary information to the audio, such as:

  • Subtitles: Converting audio to text
  • Translation: Converting to another language
  • Summarization: Shortening long content
  • Emotion analysis: Estimating emotional tendencies

These provide ways to deliver the experience even with the volume off, for example by giving users information they can read instead of hear.
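
As one concrete direction, Android 12 and later expose an on-device mode for SpeechRecognizer, which could in principle drive live subtitles without a network connection. The sketch below is a minimal, unverified illustration of that idea (RECORD_AUDIO permission, error handling, and lifecycle management are omitted; the function name is made up):

import android.content.Context
import android.content.Intent
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer

// Sketch: on-device speech-to-text as a source of "subtitle" metadata.
// Assumes API level 31+ and that an on-device recognizer is installed.
fun startOnDeviceCaptions(context: Context, onCaption: (String) -> Unit): SpeechRecognizer? {
    if (!SpeechRecognizer.isOnDeviceRecognitionAvailable(context)) return null

    val recognizer = SpeechRecognizer.createOnDeviceSpeechRecognizer(context)
    recognizer.setRecognitionListener(object : RecognitionListener {
        override fun onPartialResults(partialResults: Bundle?) {
            // Partial hypotheses can be shown as live captions.
            partialResults?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                ?.firstOrNull()?.let(onCaption)
        }
        override fun onResults(results: Bundle?) {
            results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                ?.firstOrNull()?.let(onCaption)
        }
        // The remaining callbacks are not needed for this sketch.
        override fun onReadyForSpeech(params: Bundle?) {}
        override fun onBeginningOfSpeech() {}
        override fun onRmsChanged(rmsdB: Float) {}
        override fun onBufferReceived(buffer: ByteArray?) {}
        override fun onEndOfSpeech() {}
        override fun onError(error: Int) {}
        override fun onEvent(eventType: Int, params: Bundle?) {}
    })

    recognizer.startListening(
        Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)
            .putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS, true)
    )
    return recognizer
}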

Emotion Analysis in SaaS

We are seeing more examples of emotion analysis being integrated as a SaaS feature, though many current examples are limited to text-based processing.

Zendesk's "Intelligent Triage" estimates intent and emotion based on the first comment of a ticket.

https://support.zendesk.com/hc/en-us/articles/4550640560538

Twilio's "Conversational Intelligence" also adds emotion and summary information to call transcripts.

https://www.twilio.com/en-us/products/conversational-ai/conversational-intelligence

In the future, emotion analysis that also uses characteristics of the voice itself (such as intonation and tone) may become more common. Particularly in CX improvement use cases, where the urgency often needs to be determined based on the intonation and tone of the customer's voice, developments in this direction could be highly valuable.

Conclusion

In app and game development, with the increase in volume-off usage, it has become difficult to create experiences centered around SFX and music. On the other hand, with improved haptic expressiveness and advances in edge AI, there are now more options for incorporating new experiences such as tactile feedback and audio metadata augmentation. Real-time audio processing is entering a phase where it can be reconsidered not just as a technology for modifying sound, but as a technological foundation for enhancing experiences.
