On-device Inference Articles

Android Multimodal On-device AI: Gemini Nano, Image Tokens, Streaming, and Compose

A practical Android multimodal AI walkthrough covering Gemini Nano multimodality, AICore model loading, image preprocessing, ViT tokenization, streaming inference, Compose rendering, memory, and thermals.

Designing an On-Device LLM Inference Scheduler: Priority Queues and Backpressure in Practice

This article shows how to build a scheduling layer above an on-device inference engine, using priority queues, preemption, and backpressure to avoid OOMs, unpredictable latency, and out-of-order results.

Android Local LLM Inference: LiteRT, MediaPipe, Quantization, and Production Trade-offs

A practical guide to Android local LLM inference across LiteRT, ONNX Runtime Mobile, MediaPipe LLM Inference, INT4 quantization, GPU delegates, KV cache memory, and device fallback.

Android On-device LLM Context Window Engineering

December 17, 2025

A practical Android on-device LLM context management strategy covering layered prompt compression, summary caching, dialog state machines, and token budget allocation.

Inside Android TTS: From TextToSpeech API to On-Device Vocoders

September 23, 2025

A full-stack look at Android TTS, including engine binding, synthesis callbacks, on-device HiFi-GAN vocoders, streaming playback, and time-to-first-audio (TTFA) tuning.

Android ML Kit in Practice: Vision Pipelines and CameraX Integration

A practical guide to ML Kit for on-device vision, covering detection pipelines, CameraX integration, multi-model orchestration, and inference optimization.

Android AICore and Gemini Nano: System Services, Safety Filters, and LoRA Adaptation

A deep dive into Google AICore for Gemini Nano on Android, covering APEX delivery, permission isolation, model distribution, safety filtering, session management, and LoRA adapters.