Gemini Nano on Android
This topic focuses on Gemini Nano and AICore on Android.
Android AI engineering is moving from “what is Gemini Nano?” to “how do we ship on-device generative AI inside a real app?” This page organizes notes around Gemini Nano, AICore, ML Kit GenAI APIs, Android on-device AI, local LLM inference, RAG, and multimodal interaction.
First Decide Whether On-device AI Fits
On-device AI is strongest when latency, offline use, privacy, and predictable inference cost matter. Good candidates include summarization, rewriting, image description, speech recognition, smart input, local content retrieval, and small RAG workflows.
It is a poor fit when the goal is to copy every cloud LLM capability onto a phone. Long-context reasoning, complex multi-step planning, and large-scale knowledge retrieval often still need a cloud model or a hybrid route that splits work between the device and the cloud.
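The fit criteria above can be sketched as a small routing function. This is only an illustration under stated assumptions: `AiRequest`, `Route`, `chooseRoute`, and the 1024-token budget are hypothetical names and values made up for this sketch, not part of AICore, Gemini Nano, or ML Kit.

```kotlin
// Hypothetical request descriptor; these fields are illustrative only.
data class AiRequest(
    val promptTokens: Int,
    val needsOffline: Boolean,
    val containsSensitiveData: Boolean,
    val needsMultiStepPlanning: Boolean,
)

enum class Route { ON_DEVICE, CLOUD, HYBRID, UNAVAILABLE }

// Assumed context budget for the local model; tune per device and model.
const val ASSUMED_LOCAL_TOKEN_BUDGET = 1024

fun chooseRoute(request: AiRequest, localModelAvailable: Boolean, online: Boolean): Route {
    // Offline use and privacy constraints force the request to stay on the device.
    val mustStayLocal = request.needsOffline || request.containsSensitiveData || !online
    // Short prompts without multi-step planning are assumed to suit the local model.
    val fitsLocally = request.promptTokens <= ASSUMED_LOCAL_TOKEN_BUDGET &&
        !request.needsMultiStepPlanning
    return when {
        mustStayLocal -> if (localModelAvailable) Route.ON_DEVICE else Route.UNAVAILABLE
        !localModelAvailable -> Route.CLOUD
        fitsLocally -> Route.ON_DEVICE
        // The local model exists but the task is too heavy: split the work.
        else -> Route.HYBRID
    }
}
```

For example, a 200-token rewrite request on a device with a local model resolves to `Route.ON_DEVICE`, while an 8000-token planning request on the same device resolves to `Route.HYBRID`. Real thresholds would come from benchmarking on target devices rather than a fixed constant.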
Technical Entry Points
- AICore: a system-level service for model access, updates, security, and hardware acceleration.
- Gemini Nano: the on-device member of the Gemini model family, designed for local, low-latency, privacy-first tasks.
- ML Kit GenAI APIs: higher-level capability APIs that abstract part of the model-version complexity.
- AI Edge, LiteRT, and MediaPipe LLM Inference: better suited to custom local inference pipelines.
- Compose UI: useful for streaming output, multi-turn conversations, multimodal input, and state feedback.
Core Reading
- Android on-device AI engineering notes
- Android AICore and Gemini Nano: the full on-device inference path
- Android local LLM inference: from LiteRT to MediaPipe LLM Inference
- Streaming local LLM output: from token generation to incremental Compose rendering
- Local RAG on Android: retrieval-augmented generation with a local vector database
- Multimodal local AI: Gemini Nano multimodality and real-time Compose interaction
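As a rough illustration of the retrieval step behind the local RAG topic in the list above, the sketch below ranks in-memory chunks by cosine similarity and assembles a prompt from the best matches. It assumes embeddings are already computed; `Chunk`, `topK`, and `buildPrompt` are hypothetical helpers, and a real app would query a local vector database instead of sorting a list.

```kotlin
import kotlin.math.sqrt

// Hypothetical in-memory chunk; a real app would load embeddings from a local vector store.
class Chunk(val text: String, val embedding: FloatArray)

fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Embedding dimensions must match" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    val denominator = sqrt(normA) * sqrt(normB)
    // Treat zero-length vectors as unrelated instead of dividing by zero.
    return if (denominator == 0f) 0f else dot / denominator
}

// Brute-force ranking: fine for a few hundred chunks, too slow for large corpora.
fun topK(query: FloatArray, chunks: List<Chunk>, k: Int): List<Chunk> =
    chunks.sortedByDescending { cosineSimilarity(query, it.embedding) }.take(k)

// Assumed prompt layout; the wording is illustrative, not a recommended template.
fun buildPrompt(question: String, context: List<Chunk>): String = buildString {
    appendLine("Answer using only the context below.")
    context.forEach { appendLine("- ${it.text}") }
    append("Question: $question")
}
```

The brute-force sort keeps the sketch short. An indexed vector store becomes worthwhile once the corpus no longer fits a linear scan within the latency budget.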
Performance and Production Concerns
- On-device AI benchmark design: latency, throughput, power, and thermal degradation
- Using Perfetto to trace NPU scheduling and memory-bandwidth bottlenecks
- Memory management for local AI: model-load peaks and KV cache recycling
- Concurrent inference scheduling: priority queues and backpressure control
- Model security: encrypted storage, TEE inference, and IP protection
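The priority-queue and backpressure item in the list above can be sketched as a bounded queue that serves the most urgent job first and rejects new work once it is full. This is a minimal sketch, not a production scheduler: `InferenceJob` and `BoundedInferenceQueue` are hypothetical names, and the rule that a larger `priority` value means more urgent is an assumption of this example.

```kotlin
import java.util.PriorityQueue

// Hypothetical job type; a higher priority value means more urgent.
data class InferenceJob(val id: String, val priority: Int)

class BoundedInferenceQueue(private val capacity: Int) {
    // Order by priority descending so poll() returns the most urgent job.
    private val queue =
        PriorityQueue<InferenceJob>(compareByDescending<InferenceJob> { it.priority })

    // Backpressure: returns false instead of growing past capacity,
    // so the caller can drop, delay, or downgrade the request.
    @Synchronized
    fun offer(job: InferenceJob): Boolean {
        if (queue.size >= capacity) return false
        return queue.add(job)
    }

    // Returns the most urgent queued job, or null when the queue is empty.
    @Synchronized
    fun poll(): InferenceJob? = queue.poll()
}
```

With a capacity of 2, offering a background summary and then a user chat request succeeds, a third offer returns `false`, and `poll()` hands back the chat request first. Rejecting at the queue boundary keeps memory use predictable, at the cost of making callers handle the rejected case explicitly.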
Related Topics
- Compose-first Migration: local AI chat, streaming output, and multimodal interaction usually need a solid Compose UI architecture.
- Android Performance: local models expose memory, temperature, power, and frame-rate problems quickly.