Android Cold Start Optimization: From Zygote Fork to First Frame
One project had an app that felt fast in local testing, but production P90 cold start was stuck at 3.2 seconds. Instrumentation said the problem happened after Application.onCreate, but the code did not reveal an obvious bottleneck. A complete Perfetto trace finally showed the real cost hidden in a Binder call stack: an unrelated third-party SDK was doing synchronous IPC on the main thread.
That kind of “invisible by intuition, obvious in a trace” case is common in startup optimization. This article does not repeat generic advice such as “lazy-load and initialize asynchronously.” Instead, it starts from trace signals and walks through each cold-start phase.
The four cold-start phases and their trace regions
Cold start begins when the user taps the Launcher icon and ends when the first app frame is drawn. The chain looks like this:
tap event -> Zygote forks the process -> Application initialization ->
Activity creation / layout / drawing -> SurfaceFlinger composes the first frame
In Perfetto, these phases map to different trace markers:
- Zygote fork: the ZygoteForkChild slice in the zygote64 system process
- Application initialization: bindApplication -> ActivityThread.handleBindApplication
- Activity creation: activityStart -> performCreate -> performResume
- First-frame composition: Choreographer#doFrame plus SurfaceFlinger’s commit slice
When capturing with adb shell perfetto, enable sched, binder_driver, gfx, and view. Otherwise Binder calls and rendering-pipeline slices may be missing.
adb shell perfetto \
  -c - --txt \
  -o /data/misc/perfetto-traces/trace.pftrace \
<<EOF
buffers: { size_kb: 63488 fill_policy: RING_BUFFER }
data_sources: {
  config {
    name: "linux.ftrace"
    ftrace_config {
      ftrace_events: "sched/sched_switch"
      ftrace_events: "power/suspend_resume"
      atrace_categories: "gfx"
      atrace_categories: "view"
      atrace_categories: "binder_driver"
      atrace_categories: "am"
    }
  }
}
duration_ms: 10000
EOF
Load the trace into ui.perfetto.dev, search for the process name, and start with the main-thread slices.
Phase 1: Zygote fork, where app code has limited control
Many startup articles skip this phase. It is worth calling out for the opposite reason: most of this cost is not controlled by business code, so do not spend too much optimization effort here.
Zygote preloads the ART runtime and system classes during system boot. fork() itself uses Copy-on-Write and should be fast in theory. In real traces, ZygoteForkChild often takes 10 to 30 ms, usually because of two factors:
- GC under memory pressure: when system memory is tight, GC_FOR_ALLOC can happen around fork time, and the sched tracks show heavy CPU preemption.
- Binder thread initialization: thread creation in ProcessState::startThreadPool() can be delayed on some devices.
If there is a clear gap between fork and bindApplication, inspect CPU usage in system processes during the same time window. It is often caused by overall device load. On the app side, the only realistic levers are indirect: reduce resident processes and lower memory footprint.
Phase 2: bindApplication, the main battleground
The window from bindApplication to activityStart is where app-side optimization has the most room. In the main-thread track, this region is covered by the ActivityThread.handleBindApplication slice.
ContentProvider is the first trap
Many SDKs use ContentProvider.onCreate() for automatic initialization. Firebase and LeakCanary have both used this pattern. ContentProviders initialize after Application.attachBaseContext and before Application.onCreate, and they all run serially on the main thread.
In a trace, this appears as many child slices inside installContentProviders, each representing some SDK’s initialization logic. One real issue I hit: a map SDK provider read a local config file in onCreate, costing 200 ms on low-end devices.
The investigation is direct: expand installContentProviders in Perfetto, list every provider taking more than 10 ms, and decide whether it is needed. Remove what can be removed, and push the rest toward asynchronous initialization where possible.
Layered initialization in Application.onCreate
The usual recommendation is to “move nonessential SDKs to background threads,” but there is a hidden problem: if a background-initialized SDK is first used on the main thread before it finishes, it may block the main thread on CountDownLatch.await(). That only moves the cost from Application to Activity.
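To make that hazard measurable, here is a minimal JVM-only sketch of a latch-gated initializer that records how long a caller was blocked. The class and member names (BackgroundInit, getOrNull, awaitValue, lastWaitNanos) are hypothetical and not taken from any SDK. The point is only that a blocking wait on the main thread shows up as real time, wherever it happens.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

/** Hypothetical helper: runs [init] on a background thread and gates readers on a latch. */
class BackgroundInit<T : Any>(private val init: () -> T) {
    private val ready = CountDownLatch(1)

    @Volatile
    private var value: T? = null

    /** Nanoseconds the most recent awaitValue() caller spent blocked. */
    @Volatile
    var lastWaitNanos: Long = 0L
        private set

    /** Starts initialization on a dedicated daemon thread. */
    fun start() {
        val executor = Executors.newSingleThreadExecutor { r ->
            Thread(r, "bg-init").apply { isDaemon = true }
        }
        executor.execute {
            try {
                value = init()
            } finally {
                // Always release waiters, even if init() throws.
                ready.countDown()
            }
        }
        executor.shutdown()
    }

    /** Non-blocking read: returns null while initialization is still running. */
    fun getOrNull(): T? = if (ready.count == 0L) value else null

    /** Blocking read: this is the call that moves the cost to whoever uses the SDK first. */
    fun awaitValue(timeoutMs: Long): T? {
        val begin = System.nanoTime()
        val finished = ready.await(timeoutMs, TimeUnit.MILLISECONDS)
        lastWaitNanos = System.nanoTime() - begin
        return if (finished) value else null
    }
}
```

If the first main-thread use goes through awaitValue(), lastWaitNanos is roughly the remaining initialization time. A caller that can tolerate a missing result should prefer getOrNull() and fall back gracefully.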
A more stable approach is to split initialization by startup phase:
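import android.app.Application
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch

// Assumption: AppScope is not defined in the original snippet; any application-wide scope works.
val AppScope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

// The init* functions below are placeholders for the app's own SDK setup.
class App : Application() {
    override fun onCreate() {
        super.onCreate()
        // Layer 1: must finish on the main thread because it affects the first frame.
        initCrashReporter() // Crash capture must be registered early.
        initRouterSync()    // Router table loaded synchronously.
        // Layer 2: run off the main thread; not required before first frame.
        AppScope.launch(Dispatchers.IO) {
            initAnalyticsSDK()
            initPushSDK()
        }
        // Layer 3: runs on the main thread once its queue is idle, so it stays out of the first frame.
        // Looper.getQueue() requires API 23.
        mainLooper.queue.addIdleHandler {
            initLocationSDK()
            false // Returning false removes the handler after it runs once.
        }
    }
}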
The decision rule is simple: does the first frame depend on this SDK’s return value? If not, it belongs in layer 2. If it is needed only after first-frame interaction, layer 3 through IdleHandler is usually better.
When validating in a trace, compare wall time and CPU time for Application.onCreate. A large gap means the main thread is waiting on Binder or I/O. A small gap means pure CPU work. Those two cases require different fixes.
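A minimal sketch of that comparison, assuming the wall and CPU durations have already been read from the trace. The type names and the 30% cutoff are assumptions chosen for illustration, not a Perfetto convention.

```kotlin
/** Hypothetical labels for the two cases described above. */
enum class StartupBottleneck { WAITING, CPU_BOUND }

/** One slice with its wall-clock duration and the CPU time the thread actually ran. */
data class SliceCost(val name: String, val wallMs: Double, val cpuMs: Double)

/**
 * Classifies a slice by the share of wall time spent off-CPU.
 * waitingShare is an assumed cutoff; tune it against your own traces.
 */
fun classifySlice(slice: SliceCost, waitingShare: Double = 0.3): StartupBottleneck {
    require(slice.wallMs > 0.0) { "wallMs must be positive" }
    val offCpuShare = ((slice.wallMs - slice.cpuMs) / slice.wallMs).coerceIn(0.0, 1.0)
    return if (offCpuShare >= waitingShare) StartupBottleneck.WAITING else StartupBottleneck.CPU_BOUND
}
```

A slice with 300 ms of wall time and 90 ms of CPU time is classified as WAITING, which points at Binder or I/O. A slice with 300 ms of wall time and 280 ms of CPU time is CPU_BOUND, so the fix is to do less work or move it.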
Phase 3: Activity creation to first measure/layout
The window from performCreate to the first Choreographer#doFrame usually bottlenecks in three places.
Layout hierarchy is too deep
Inflation time grows roughly with the number of Views and the depth of the View tree. In traces, LayoutInflater.inflate duration is a direct signal of layout complexity. Any inflate over 50 ms deserves attention. Common fixes:
- ViewStub: delay inflation for Views not visible on the first screen
- AsyncLayoutInflater: inflate on a background thread, then switch to the main thread to addView
AsyncLayoutInflater has limitations: Views must not create a Handler or call Looper.myLooper() while they are being inflated, and the inflater does not support a custom LayoutInflater.Factory or Factory2, or layouts that contain fragments. When background inflation fails, it retries on the main thread, so the cost comes back. In practice, I usually prefer ViewStub plus explicit control because it is more predictable.
Synchronous SharedPreferences reads
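Reading SharedPreferences in Activity.onCreate is common, but it is not free. The first getSharedPreferences() call for a file starts loading it on a background thread, and the first read (getString() and similar) blocks the caller until that load finishes. On the main thread, that is effectively a synchronous wait on disk I/O. In a trace, it appears as the main thread blocking while the load started by SharedPreferencesImpl.startLoadFromDisk is still running.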
Alternatives: move to Jetpack DataStore’s Flow API, or warm up SharedPreferences during the asynchronous phase of Application.onCreate so Activity reads hit memory.
Binder call backlog
This is the easiest startup cost to miss. Activity launch already involves many Binder calls for window tokens, window registration, permission checks, and other framework work. Those calls are unavoidable. If business code adds extra Binder calls in onCreate, the cost can become substantial.
In Perfetto, switch to the binder_driver tracks to inspect each Binder transaction’s duration and caller. A typical case I saw: an SDK called PackageManager.getInstalledPackages() in onCreate. That API enumerates every installed package visible to the caller (on Android 11 and later the result is filtered by package visibility rules), so the reply can be large, and the Binder return time reached 80 ms.
Phase 4: first-frame composition, from VSYNC to pixels
The final mile of the rendering pipeline runs from Choreographer#doFrame to SurfaceFlinger commit.
SurfaceFlinger has its own process track in Perfetto. Find the app’s Layer name, usually something like SurfaceView[package] or com.xxx.MainActivity#0, and inspect which vsync cycle first completes latchBuffer.
If the window from doFrame to latchBuffer crosses more than one vsync cycle, you have jank. Common causes:
- Expensive work in onDraw, such as Bitmap decoding or many Path calculations
- Repeated measure/layout, often from nested requestLayout
- GPU composition timeout, such as a hardware layer missing the GPU cache
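The more-than-one-vsync check can be approximated from two timestamps read off the trace. This is a rough sketch: the function names are hypothetical, and it compares the duration against the vsync period without accounting for where inside a vsync interval doFrame started.

```kotlin
import kotlin.math.ceil

/**
 * Approximate number of vsync periods covered between the start of
 * Choreographer#doFrame and the end of SurfaceFlinger's latchBuffer.
 * Timestamps are in nanoseconds, as shown in the trace.
 */
fun vsyncPeriodsSpanned(
    doFrameStartNs: Long,
    latchBufferEndNs: Long,
    refreshRateHz: Double = 60.0
): Int {
    require(latchBufferEndNs >= doFrameStartNs) { "latchBuffer must end after doFrame starts" }
    require(refreshRateHz > 0.0) { "refresh rate must be positive" }
    val periodNs = 1_000_000_000.0 / refreshRateHz
    return ceil((latchBufferEndNs - doFrameStartNs) / periodNs).toInt()
}

/** True when the first frame took longer than one vsync period to reach latchBuffer. */
fun spansMultipleVsyncs(
    doFrameStartNs: Long,
    latchBufferEndNs: Long,
    refreshRateHz: Double = 60.0
): Boolean = vsyncPeriodsSpanned(doFrameStartNs, latchBufferEndNs, refreshRateHz) > 1
```

The refresh rate matters: a 10 ms window fits inside one period at 60 Hz (about 16.7 ms) but covers two periods at 120 Hz (about 8.3 ms), so pass the device's actual rate rather than relying on the default.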
First-frame Bitmap loading is a frequent issue. If the first screen contains images, predecode and cache them on a background thread from Application, then let Activity read from memory cache. That skips disk I/O and decode time on the first frame.
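The shape of that predecode cache can be sketched without Android types. PrewarmCache and its methods are hypothetical names. In an app, the decode lambda would wrap something like BitmapFactory.decodeResource, and a bounded cache such as LruCache would replace the plain map used here.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.Future

/** Hypothetical background-filled cache: decode off the main thread, read without blocking. */
class PrewarmCache<K : Any, V : Any>(
    private val executor: ExecutorService = Executors.newSingleThreadExecutor { r ->
        Thread(r, "prewarm").apply { isDaemon = true }
    }
) {
    private val entries = ConcurrentHashMap<K, V>()

    /** Decodes every key on the background executor; keys whose decode returns null are skipped. */
    fun prewarm(keys: List<K>, decode: (K) -> V?): Future<*> =
        executor.submit(Runnable {
            for (key in keys) {
                val value = decode(key) ?: continue
                entries[key] = value
            }
        })

    /** Non-blocking read for the first frame: null means "not decoded yet, use a placeholder". */
    fun getIfReady(key: K): V? = entries[key]
}
```

The read path never waits. If the Activity asks before decoding finishes, it gets null and can draw a placeholder, which keeps the cost from reappearing on the main thread as a blocking wait.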
Closing the optimization loop
Startup optimization is not a one-off project. A sustainable workflow records cold-start traces in CI before each release on fixed low-end, mid-range, and high-end devices. A script extracts key slice durations into monitoring.
I usually gate on two regression metrics:
- bindApplication duration, which reflects SDK initialization quality
- First-frame doFrame wall time, which reflects layout and rendering quality
Looking only at “total cold-start time” hides phase-level regressions. A release can make Application faster and Activity slower, with the total hiding the responsible phase. Segment-level metrics point directly to the owner.
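A small sketch of a segment-level gate, assuming the two durations have already been extracted from the trace. The type names, phase labels, and the 10% tolerance are illustrative assumptions, not part of any Perfetto tooling.

```kotlin
/** Durations of the two gated phases, in milliseconds, as extracted from one trace. */
data class PhaseDurations(val bindApplicationMs: Double, val firstDoFrameMs: Double) {
    /** Sum of the two gated phases only, not the full cold-start time. */
    val totalMs: Double get() = bindApplicationMs + firstDoFrameMs
}

/** One phase that got slower than the baseline allows. */
data class PhaseRegression(val phase: String, val baselineMs: Double, val currentMs: Double)

/** Returns every phase whose duration exceeds its baseline by more than the tolerance. */
fun findPhaseRegressions(
    baseline: PhaseDurations,
    current: PhaseDurations,
    tolerance: Double = 0.10
): List<PhaseRegression> {
    val phases = listOf(
        Triple("bindApplication", baseline.bindApplicationMs, current.bindApplicationMs),
        Triple("firstDoFrame", baseline.firstDoFrameMs, current.firstDoFrameMs)
    )
    return phases
        .filter { (_, base, now) -> now > base * (1.0 + tolerance) }
        .map { (phase, base, now) -> PhaseRegression(phase, base, now) }
}
```

With a baseline of 400 ms and 30 ms and a new build at 350 ms and 45 ms, the summed time improves from 430 ms to 395 ms, yet the gate still reports firstDoFrame as a regression. That is exactly the case a total-only metric hides.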
For tooling, Perfetto’s command-line traceconv can convert a trace to JSON, and scripts can parse the slice tree automatically. That is far more efficient than manual UI inspection. If you still rely mainly on Android Studio Profiler for startup work, moving to Perfetto UI gives you much denser signal.