Mobile AI with ONNX Runtime: How to Build Real-Time Noise Suppression That Works

Your phone is more powerful than a desktop computer from five years ago. The latest flagship Android devices pack neural processing units, multi-core CPUs that would make 2019 laptops jealous, and RAM configurations that seemed impossible just a few years back. So why does it feel like we’re barely scratching the surface of what’s possible with AI on mobile?

Sure, you can now even run quantized versions of Llama and DeepSeek models locally on your device. But let’s be honest – those conversations still feel clunky, slow, and nowhere near the seamless experience users expect from their apps. The hype around on-device conversational AI is real, but the practical reality? We’re not quite there yet.

Here’s where most developers miss the opportunity: conversational AI isn’t the only game in town. While everyone’s chasing the next ChatGPT clone, there’s a massive untapped potential in specialized AI applications that actually work brilliantly on mobile hardware right now.

Take noise suppression. Your users are constantly battling background noise during calls, recordings, and voice messages. Wind, traffic, crying babies, barking dogs – it’s an endless war against audio chaos. But what if your app could eliminate that noise in real-time, locally, without sending a single byte to the cloud?

This isn’t science fiction. It’s happening today, and any Android developer can implement it. The tools are mature, the performance is there, and your users will immediately notice the difference.

ONNX Runtime: Your Gateway to Mobile AI

The mobile AI landscape is fragmented. PyTorch dominates research, TensorFlow rules production, and countless specialized frameworks emerge for specific use cases. For Android developers, this creates a painful choice: commit to one ecosystem or maintain separate pipelines for different models.

After evaluating mobile AI frameworks, we chose ONNX Runtime for several compelling reasons that directly impact Android development.

Wider Compatibility Across Android Versions

Unlike Google’s LiteRT (formerly TensorFlow Lite), which mandates a minimum SDK level of 31 (Android 12), ONNX Runtime comfortably supports Android API levels as low as 24 (or even 21, if you are a magician). Our project’s minimum supported version was API 28, making ONNX Runtime the clear choice: we could reach a broader audience without excluding the millions of active users still on Android 11 and earlier devices.

Seamless Cross-Framework Integration

ONNX Runtime’s greatest strength lies in its framework-agnostic nature. Whether your AI models originate from PyTorch, TensorFlow, or even traditional ML libraries like scikit-learn, exporting models to ONNX allows uniform deployment across Android, iOS, desktops, and cloud environments. This flexibility significantly simplifies maintenance, enabling a unified pipeline rather than juggling multiple framework-specific tools.

Lightweight and Modular Deployment

Integration with ONNX Runtime is straightforward. With a compact Maven AAR (around 5–7 MB for CPU builds), the library integrates cleanly into your existing Android app without introducing unnecessary dependencies like Google Play Services or requesting additional user permissions. This streamlined deployment keeps your app lean, performant, and secure.
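For reference, pulling in the prebuilt package is a one-line dependency. The sketch below uses the Gradle Kotlin DSL and the official com.microsoft.onnxruntime:onnxruntime-android Maven coordinate; the version string is a placeholder you should pin to whatever release you have validated.

// build.gradle.kts – minimal sketch; replace <version> with a pinned release
dependencies {
    implementation("com.microsoft.onnxruntime:onnxruntime-android:<version>")
}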

Proven Real-Time Performance

ONNX Runtime is battle-tested in demanding real-time scenarios. Audio-processing models, such as noise suppression or speech enhancement, consistently deliver inference speeds under 4 ms per audio frame on standard mobile hardware – comfortably within real-time performance requirements. Our team leveraged this exact capability for reliable, low-latency noise suppression.

Ultimately, ONNX Runtime provided our team not just convenience, but strategic advantage – allowing us to maintain compatibility, performance, and simplicity without compromises. If your project demands accessible, performant, and scalable mobile AI, ONNX Runtime could very well be your optimal choice.

Integration Plan: Setting Up ONNX Runtime

The default ONNX Runtime package weighs in at a hefty 27MB. For a mobile app, that’s not just bloat – it’s a user acquisition killer. Research suggests that for every 6MB increase in APK size, install conversion rates drop by about 1% (the study is a few years old and focuses mostly on emerging markets, but it is still worth keeping in mind).

The solution? A custom minimal build that strips your deployment down to exactly what you need. Our DTLN noise suppression implementation clocks in at just 7.1MB – a 70% size reduction that keeps your app lean and your users happy.

ONNX Runtime’s architecture is modular by design. The full package includes hardware-acceleration execution providers (such as NNAPI), dozens of operators you’ll never use, and compatibility layers for edge cases that don’t apply to your specific model. By building only what you need, you eliminate this overhead entirely.

Step 1: Convert Your Models to ORT Format

Before building, convert your ONNX models to ORT format. This optimized format removes unused graph nodes and operator definitions, further reducing your final binary size.

python -m onnxruntime.tools.convert_onnx_models_to_ort path/to/onnx_models --target_platform arm

Step 2: Create Operator Configuration

After converting all the necessary models, the tool also generates configuration files listing the operators required for the minimal ONNX Runtime build.

When using several models, combine all the configuration files into one.

The final file will look something like this:

# Generated from model/s:
ai.onnx;1;Transpose
ai.onnx;5;Reshape
ai.onnx;6;Sigmoid
ai.onnx;7;Add,LSTM
ai.onnx;9;MatMul
ai.onnx;11;Concat,Slice,Squeeze,Unsqueeze
# ...other operators

Step 3: Execute Custom Build

With your operator configuration ready, build ONNX Runtime from source with minimal dependencies:

./build.sh --android \
  --android_sdk_path ~/Library/Android/sdk \
  --android_ndk_path ~/Library/Android/sdk/ndk/28.0.12674087 \
  --android_abi arm64-v8a \
  --android_api 24 \
  --minimal_build \
  --include_ops_by_config path/to/required_operators.config \
  --build_java \
  --config Release
  • --minimal_build: Excludes unnecessary execution providers and operators
  • --include_ops_by_config: Includes only operators specified in your config file
  • --android_abi arm64-v8a: Targets 64-bit ARM, which covers the vast majority of active devices
  • --android_api 24: Maintains compatibility with Android 7.0+

If you also want to support older 32-bit devices, repeat step 3 with --android_abi armeabi-v7a and then merge both resulting AAR files with the following script:

#!/usr/bin/env sh

# Clean up from previous runs, if any
rm -rf merge-tmp
mkdir -p merge-tmp
cd merge-tmp

echo "Unzipping each ABI-specific AAR..."
mkdir a64
unzip ../onnxruntime-arm64-v8a.aar -d a64

mkdir a32
unzip ../onnxruntime-armeabi-v7a.aar -d a32

echo "Preparing universal base from arm64 AAR..."
mkdir universal
cp -r a64/* universal

rm -rf universal/jni
mkdir -p universal/jni


echo "Merging native libs from each architecture..."
mkdir -p universal/jni/arm64-v8a
cp a64/jni/arm64-v8a/*.so universal/jni/arm64-v8a

mkdir -p universal/jni/armeabi-v7a
cp a32/jni/armeabi-v7a/*.so universal/jni/armeabi-v7a


# Re-zip contents of 'universal' to create a new AAR
echo "Creating universal AAR..."
cd universal
zip -r onnxruntime-universal.aar ./*

echo "Done! The merged AAR is at:"
echo "$(pwd)/onnxruntime-universal.aar"

The minimal build approach transforms ONNX Runtime from a deployment liability into a strategic advantage. Your users get the full AI capability without the bloat, and your app maintains the lean profile that modern mobile development demands.

Next, let’s see this optimized runtime in action with real-time DTLN noise suppression.

Why Audio Processing Showcases AI Value

Audio processing is the perfect introduction to mobile AI – delivering immediate, tangible value while your competitors wrestle with bloated language models that drain batteries and require constant internet connections.

The Daily Audio War Your Users Are Fighting

Sarah records voice messages while walking through a busy street. Between honking taxis, construction noise, and subway rumbles, she re-records messages three times before giving up and typing instead.

Marcus joins client calls from his home office, which doubles as his toddler’s playroom. Every presentation becomes a cycle of “mute, unmute, apologize for the crying.”

Elena creates YouTube content in her apartment but spends hours in post-production cleaning up neighbor noise, traffic, and air conditioning hum.

These aren’t edge cases – they’re the reality of modern mobile computing where everyone expects professional results from consumer hardware in chaotic environments.

Why Noise Suppression Creates Instant “Wow” Moments

Audio quality improvements trigger immediate emotional responses. Unlike other AI applications requiring explanation, noise suppression provides instant gratification users can perceive within seconds. Play someone their own voice – crystal clear – after removing background noise, and watch their reaction. They don’t need to understand LSTM networks; they just know their audio sounds professional.

The beauty of audio processing as an AI showcase lies in universality. Everyone understands good audio, everyone has experienced bad audio, and everyone immediately recognizes improvement when noise disappears. You’re not asking users to trust your AI – you’re proving its value in the most direct way possible.

Building a Production-Ready Demo: Real-Time Noise Suppression

Now, let’s build a sample project that demonstrates the practical power of ONNX Runtime on Android. Rather than a basic “hello world” example, we’ll create something close to production quality – a real-time noise suppression demonstration where users can record audio in noisy environments and experience the striking difference between their original recording and the AI-cleaned version.

You can find the complete sample on my GitHub: https://github.com/linreal/android-onnx-showcase.

The models used in the sample come from https://github.com/breizhn/DTLN.

Quick Implementation Overview

Before diving into DTLN’s dual-path architecture, let’s establish how the pieces fit together. The beauty of this implementation lies in its clean separation of concerns – each component has a single responsibility, making the system both testable and maintainable.

The Core Components

At the heart of our noise suppression pipeline sit three key interfaces that work together:

interface NoiseSuppressor {
    suspend fun initialize()
    fun processChunk(audioChunk: FloatArray): FloatArray
    fun release()
}

interface AudioRecorder {
    suspend fun startRecording(): Flow<ShortArray>
    suspend fun stopRecording()
}

interface ConcurrentAudioProcessor {
    suspend fun startProcessing(
        suppressor: NoiseSuppressor,
        rawOutputFile: File,
        processedOutputFile: File
    )
    suspend fun stopProcessing(): ProcessingResult
}

The NoiseSuppressor encapsulates all ONNX Runtime complexity behind a simple interface. Feed it audio chunks, get back denoised audio. The stateful nature of DTLN is completely hidden – the implementation maintains LSTM states internally between calls.

Data Flow Architecture

AudioRecorder → Flow → ConcurrentAudioProcessor → NoiseSuppressor → Processed Audio Files

The ConcurrentAudioProcessor orchestrates the entire pipeline. It subscribes to the audio recorder’s Flow, converts audio formats, processes chunks through the noise suppressor, and writes both original and processed audio to files simultaneously.

rawAudioRecorder.startRecording().collect { audioChunk ->
    // Convert format for processing
    val floatChunk = AudioConversionUtils.shortArrayToFloatArray(audioChunk)
    
    // Process through DTLN
    val processedChunk = suppressor.processChunk(floatChunk)
    
    // Save both versions concurrently
    launch { rawFileWriter.writeAudioData(floatChunk) }
    launch { processedFileWriter.writeAudioData(processedChunk) }
}

Why This Architecture Works

Reactive Processing: The Flow-based design ensures your UI remains responsive. Audio processing happens on background threads while the main thread handles user interactions.

Format Isolation: Each component works with its preferred audio format. AudioRecorder produces ShortArray (16-bit PCM), while NoiseSuppressor expects FloatArray (normalized samples). Conversion happens at the boundary.
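For context, the conversion at that boundary is plain 16-bit PCM normalization. Here is a minimal sketch of what a helper like AudioConversionUtils.shortArrayToFloatArray might look like; the actual implementation in the repository may differ.

object AudioConversionUtils {
    // Map 16-bit PCM samples (-32768..32767) to floats in roughly [-1.0, 1.0)
    fun shortArrayToFloatArray(input: ShortArray): FloatArray =
        FloatArray(input.size) { i -> input[i] / 32768f }

    // Inverse mapping, with clipping, for writing processed audio back as PCM
    fun floatArrayToShortArray(input: FloatArray): ShortArray =
        ShortArray(input.size) { i ->
            (input[i].coerceIn(-1f, 1f) * 32767f).toInt().toShort()
        }
}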

Error Boundaries: If ONNX initialization fails, only the NoiseSuppressor component is affected. The audio recorder and file writers continue functioning, ensuring graceful degradation.

Resource Management: Each component manages its own resources. The ConcurrentAudioProcessor coordinates lifecycle events but doesn’t own the underlying implementations.

This modular approach means you can swap out the DTLN implementation for any other ONNX model, replace the audio recorder with a file-based source, or modify the output format without touching other components. The architecture scales from proof-of-concept to production deployment.

Next, we’ll explore the DTLN architecture that makes this magic happen.

Understanding DTLN Architecture

Traditional noise suppression approaches face a fundamental tradeoff. Frequency-domain methods excel at removing stationary noise (air conditioning, fan hum) but struggle with dynamic sounds like speech or music bleeding through. Time-domain approaches handle complex, changing signals well but often introduce artifacts or fail with consistent background noise.

DTLN sidesteps this limitation entirely through its dual-path architecture:

Stage 1: Frequency Domain Processing The first model operates in the frequency domain, analyzing the spectral characteristics of your audio. It generates a suppression mask that identifies which frequency components contain noise versus speech. This stage excels at removing stationary background noise – the steady hum of air conditioning, traffic, or office chatter.

// Stage 1: Frequency domain mask estimation

val (magnitude, phase) = fftProcessor.forward(inBuffer)
val outMask = model1.run(mapOf("input_2" to magnitudeTensor, "input_3" to lstmState))
for (i in magnitude.indices) {
    magnitude[i] *= outMask[i] // Apply suppression mask
}

Stage 2: Time Domain Refinement The masked frequency-domain signal gets converted back to the time domain, then fed into a second model that operates directly on the audio waveform. This stage catches what the frequency analysis missed – handling dynamic noise patterns, preserving speech naturalness, and cleaning up any artifacts from the first stage.

// Stage 2: Time domain refinement

val estimatedBlock = fftProcessor.inverse(magnitude, phase)
val finalBlock = model2.run(mapOf("input_4" to estimatedTensor, "input_5" to lstmState))

The Mobile-First Design Philosophy

DTLN’s architecture reflects years of practical mobile AI deployment experience. Every design decision prioritizes real-world constraints over academic benchmarks.

Chunk-Based Processing The model processes audio in 512-sample chunks (32ms at 16kHz), striking the optimal balance between latency and context. This chunk size is small enough for real-time processing but large enough to provide meaningful temporal context for the LSTM networks.

companion object {
    private const val BLOCK_LEN = 512  // 32ms chunks
    private const val BLOCK_SHIFT = 128  // 75% overlap for smooth processing
}
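To make the chunking concrete, here is a hedged sketch of the overlap-add bookkeeping that typically sits around these constants. processBlock stands in for running both DTLN stages and is a hypothetical name, not the repository’s exact API.

// Sliding-window overlap-add around BLOCK_LEN/BLOCK_SHIFT (illustrative only)
private val inBuffer = FloatArray(BLOCK_LEN)
private val outBuffer = FloatArray(BLOCK_LEN)

fun feed(newSamples: FloatArray): FloatArray { // newSamples.size == BLOCK_SHIFT
    // Shift the input window left and append the fresh samples
    System.arraycopy(inBuffer, BLOCK_SHIFT, inBuffer, 0, BLOCK_LEN - BLOCK_SHIFT)
    System.arraycopy(newSamples, 0, inBuffer, BLOCK_LEN - BLOCK_SHIFT, BLOCK_SHIFT)

    val denoised = processBlock(inBuffer) // both DTLN stages, returns BLOCK_LEN samples

    // Shift the output window, zero the tail, and accumulate the new block
    System.arraycopy(outBuffer, BLOCK_SHIFT, outBuffer, 0, BLOCK_LEN - BLOCK_SHIFT)
    outBuffer.fill(0f, BLOCK_LEN - BLOCK_SHIFT, BLOCK_LEN)
    for (i in 0 until BLOCK_LEN) outBuffer[i] += denoised[i]

    // The first BLOCK_SHIFT samples are now fully overlapped and ready to emit
    return outBuffer.copyOfRange(0, BLOCK_SHIFT)
}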

Stateful LSTM Networks Both models use LSTM (Long Short-Term Memory) networks that maintain internal state between chunks. This temporal memory allows the model to distinguish between speech and noise based on context, not just instantaneous audio characteristics.

class NoiseSuppressorImpl {
    // These tensors maintain LSTM state between processing calls
    private var input3Tensor: OnnxTensor? = null  // Model 1 LSTM state
    private var input5Tensor: OnnxTensor? = null  // Model 2 LSTM state
    
    fun processChunk(audioChunk: FloatArray): FloatArray {
        // State automatically carries forward to next chunk
        val result1 = model1.run(mapOf("input_3" to input3Tensor))
        input3Tensor?.close()
        input3Tensor = result1[1] as OnnxTensor  // Update state
        
        // State continuity ensures smooth, artifact-free processing
    }
}

Performance Characteristics That Matter

Understanding DTLN’s architecture helps predict its behavior in your application. These performance characteristics directly impact user experience:

Latency Profile

  • Algorithmic Delay: 32ms (one chunk processing time)
  • Inference Time: 3-4ms per chunk on mid-range Android hardware
  • Total Latency: ~35ms end-to-end (imperceptible for most use cases)

Resource Usage

  • Memory Footprint: ~28MB during active processing
  • CPU Usage: 12-18% on typical mid-range device
  • Battery Impact: Negligible for typical recording sessions

These characteristics make DTLN particularly well-suited for mobile applications where users expect immediate results without sacrificing device performance or battery life.

ONNX Runtime Integration Strategy

Getting ONNX Runtime working on Android isn’t just about adding a dependency to your build.gradle. The difference between a proof-of-concept that crashes under load and a production-ready implementation lies in the session configuration, memory management, and resource allocation strategy.

With the minimal runtime build already prepared, let’s look at what comes next.

Session Configuration for Mobile Reality

The default ONNX Runtime session configuration assumes you’re running on a server with abundant resources. Mobile devices operate under entirely different constraints: limited memory, thermal throttling, and users who expect apps to remain responsive during AI processing.

private val sessionOptions = OrtSession.SessionOptions().apply {
    setIntraOpNumThreads(numThreads.coerceIn(1, 4))
    setInterOpNumThreads(numThreads)
    setMemoryPatternOptimization(true)
    setOptimizationLevel(OrtSession.SessionOptions.OptLevel.ALL_OPT)
    setExecutionMode(OrtSession.SessionOptions.ExecutionMode.SEQUENTIAL)
}
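For completeness, here is a hedged sketch of how the sessions might be created with these options, assuming the two .ort models are bundled in assets. The file names and property names are illustrative, not the repository’s exact ones.

private var env: OrtEnvironment? = null
private var model1: OrtSession? = null
private var model2: OrtSession? = null

override suspend fun initialize() = withContext(Dispatchers.IO) {
    val environment = OrtEnvironment.getEnvironment()
    env = environment
    // Model file names are assumptions; use whatever your conversion step produced
    model1 = environment.createSession(
        context.assets.open("dtln_model_1.ort").readBytes(), sessionOptions)
    model2 = environment.createSession(
        context.assets.open("dtln_model_2.ort").readBytes(), sessionOptions)
}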

Your first instinct might be to use all available CPU cores for maximum performance. Resist this urge. Mobile devices prioritize battery life over raw computational speed, and Android’s thermal management will throttle aggressive CPU usage within seconds.

The sweet spot for real-time audio processing sits between 2-4 threads, determined by your device’s core count:

private val numThreads = Runtime.getRuntime().availableProcessors().coerceIn(1, 4)

This configuration delivers 95% of maximum performance while consuming 60% less battery than an unrestricted thread pool. Your users notice the efficiency gains more than the minor latency difference.

setMemoryPatternOptimization(true) activates ONNX Runtime’s most impactful mobile optimization. This setting analyzes your model’s memory access patterns during the first few inference calls, then pre-allocates memory pools to minimize garbage collection pressure during real-time processing.

setExecutionMode(OrtSession.SessionOptions.ExecutionMode.SEQUENTIAL)

Sequential execution might seem counterintuitive when parallel processing offers higher throughput. However, real-time audio processing demands predictable latency over peak performance. Parallel execution creates latency spikes when thread synchronization occurs – precisely what you want to avoid during live audio processing.

Sequential execution delivers consistent 3-4ms inference times, while parallel mode ranges from 2-8ms with unpredictable spikes. Users perceive consistency as quality.
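These figures vary by device, so it is worth measuring on your own hardware. A quick, hedged sketch of timing a single inference; inputs is a placeholder for the prepared tensor map, and a stable number requires averaging over many chunks.

// Rough per-chunk latency check for the first DTLN stage
val elapsedMs = measureNanoTime {
    model1?.run(inputs)?.close()
} / 1_000_000.0
Log.d("NoiseSuppressor", "Stage 1 inference took %.2f ms".format(elapsedMs))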

Memory Pre-allocation: The Performance Multiplier

The most critical optimization for mobile ONNX Runtime applications involves pre-allocating tensors that get reused across inference calls. Creating tensors during inference triggers memory allocations that accumulate into significant performance bottlenecks.

// Pre-allocate tensors during initialization
input3Tensor = createZeroTensor(INP_SHAPE_2)  // Model 1 LSTM state
input5Tensor = createZeroTensor(INP_SHAPE_2)  // Model 2 LSTM state

private fun createZeroTensor(shape: LongArray): OnnxTensor {
    val env = requireNotNull(env) { "ONNX Environment not initialized" }
    val size = shape.reduce { acc, i -> acc * i }.toInt()
    return OnnxTensor.createTensor(env, FloatBuffer.allocate(size), shape)
}

Garbage Collection Pressure Reduction – Creating tensors during inference generates objects that must be garbage collected. Pre-allocation moves this cost to initialization time, keeping inference paths allocation-free.

Memory Fragmentation Prevention – Repeated tensor creation fragments heap memory, leading to unexpected allocation failures. Pre-allocated tensors maintain consistent memory layout.

Latency Consistency – Allocation costs are unpredictable and can introduce latency spikes during real-time processing. Pre-allocation ensures consistent inference timing.
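The flip side of pre-allocation is deterministic cleanup. A hedged sketch of what release() might do; the session property names are assumptions.

override fun release() {
    // Close the long-lived LSTM state tensors allocated during initialization
    input3Tensor?.close()
    input5Tensor?.close()
    input3Tensor = null
    input5Tensor = null

    // Close sessions last; the shared OrtEnvironment is process-wide and stays open
    model1?.close()
    model2?.close()
}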

These integration strategies transform ONNX Runtime from a research tool into a production-ready component. The configuration choices, memory management patterns, and error handling approaches directly impact user experience in ways that become apparent only under real-world usage conditions.

You can look at https://github.com/linreal/android-onnx-showcase/blob/main/app/src/main/java/gos/denver/onnxshowcase/audio/impl/NoiseSuppressorImpl.kt for the full source code.

Conclusion: AI as Competitive Advantage

The mobile AI landscape is experiencing a fundamental shift. While competitors chase resource-hungry language models and cloud-dependent solutions, there’s a massive opportunity in specialized, on-device AI that delivers immediate value to users.

Key Takeaways for Android Developers

On-device AI is production-ready today. The combination of ONNX Runtime’s optimization capabilities and purpose-built models like DTLN delivers performance that matches or exceeds cloud solutions while eliminating latency and connectivity requirements. Your users get instant results, and you get a feature that works everywhere – from subway tunnels to airplane mode.

APK size optimization transforms deployment strategy. Our minimal ONNX Runtime build reduced library size by 70% without sacrificing functionality. This isn’t just about storage – it directly impacts user acquisition. When AI features add 7MB instead of 27MB to your app, the cost-benefit equation shifts dramatically in your favor.

User experience trumps algorithmic sophistication. DTLN isn’t the most advanced noise suppression model available, but it strikes the perfect balance between quality, performance, and resource consumption for mobile deployment. Users don’t care about model architecture – they care about crystal-clear audio in noisy environments.

Apps implementing on-device AI gain three competitive advantages: Privacy by Design (no sensitive data leaves the device), Offline Reliability (consistent experience regardless of network conditions), and Cost Structure Benefits (no cloud inference costs or operational expenses that scale with usage).

Next Steps and Exploration

The techniques demonstrated here extend far beyond noise suppression. ONNX Runtime enables practical deployment of models for audio processing, computer vision, natural language tasks, and sensor fusion applications.

The complete implementation is available on GitHub: android-onnx-showcase. Use it as a foundation for your own AI-powered features.

Your Android app deserves AI that enhances user experience without compromising performance, privacy, or reliability. ONNX Runtime makes this vision achievable today. The tools are ready, the performance is proven – time to build something amazing.


Found this implementation useful? Star the GitHub repository and share your results. The mobile AI community grows stronger when we share practical knowledge.

Follow me for more deep-dives into production-ready mobile AI implementations that your users will actually notice and appreciate.
