Android Memory Leak Governance: LeakCanary, HPROF, and Production Monitoring

Memory leaks are among the hardest performance problems to diagnose. They rarely crash the app immediately. Instead, they erode heap space slowly until a user who has been scrolling a feed for tens of minutes hits an OutOfMemoryError (OOM), and the crash stack points somewhere unrelated.

This article follows one line of reasoning: first understand LeakCanary’s WeakReference sentinel mechanism, then locate the real leak path in an HPROF file, and finally move the capability into production through a continuous anti-regression loop.


LeakCanary’s detection model: WeakReference sentinels

Many developers know LeakCanary can report leaks automatically, but not how it “discovers” that an object is still alive.

The core mechanism is simple. After an Activity or Fragment reaches onDestroy, LeakCanary wraps that object in a KeyedWeakReference and registers it with a shared ReferenceQueue:

// Simplified LeakCanary internals.
val reference = KeyedWeakReference(watchedObject, key, description, retainedClock.uptimeMillis(), queue)
watchedObjects[key] = reference

A WeakReference has an important property: once its referent is no longer strongly reachable and the GC clears it, the runtime (ART on Android) enqueues the weak reference itself into the registered ReferenceQueue. LeakCanary triggers a GC and checks that queue. If the target object was collected, the weak reference appears in the queue, so there is no leak. If the object is still alive, the weak reference is not in the queue, which means some strong reference is still retaining it.
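The queue behavior can be reproduced on a plain JVM without LeakCanary. The sketch below is only an illustration: Screen, wasCollected, and the 200 ms wait are made-up names and values, and a GC request is a hint, so only the "still strongly held" result is guaranteed.

```kotlin
import java.lang.ref.ReferenceQueue
import java.lang.ref.WeakReference

// Stand-in for an Activity or Fragment; hypothetical class for this demo.
class Screen(val name: String)

/** Requests a GC, then waits briefly for any weak reference to be enqueued. */
fun wasCollected(queue: ReferenceQueue<Any>): Boolean {
    Runtime.getRuntime().gc()          // A hint, not a guarantee.
    return queue.remove(200L) != null  // Blocks for up to 200 ms.
}

fun main() {
    val queue = ReferenceQueue<Any>()
    // The list plays the role of a leaking strong reference.
    val holder = mutableListOf<Any>(Screen("DetailActivity"))
    val ref = WeakReference(holder[0], queue)

    // The referent is still strongly reachable, so nothing can be enqueued.
    println("enqueued while strongly held: ${wasCollected(queue)}") // false

    holder.clear() // Drop the last strong reference.

    // Usually true on a desktop JVM, but this depends on the collector.
    println("enqueued after release: ${wasCollected(queue)}")
    println("referent after release: ${ref.get()}")
}
```

The first check is the leak case: as long as anything holds the object strongly, the reference never reaches the queue, no matter how many times GC runs.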

The 5-second delayed check plus an explicit GC gives transient references time to clear, for example a message still pending on the main thread or an animation that is finishing, so short-lived retention is not reported as a leak. This is practical, but GC timing is also a major source of LeakCanary false positives. On some devices, GC timing is unpredictable, and occasional “pseudo-leak” reports can appear.

ObjectWatcher and heap dump timing

LeakCanary 2.x encapsulates this mechanism in ObjectWatcher. By default, it triggers a heap dump when the retained object count reaches a threshold, which is 5 while the app is visible:

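// Simplified sketch of the retained-object check (not LeakCanary's exact source).
fun checkRetainedObjects() {
    // Mark watched references that never reached the queue as retained, and count them.
    val retainedCount = moveToRetained()
    // Below the threshold nothing happens; at or above it, the dump path is scheduled.
    if (retainedCount >= retainedVisibleThreshold) {
        scheduleRetainedObjectCheck()
    }
}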

moveToRetained() traverses watchedObjects and marks weak references that did not enter the queue as retained. One easy-to-miss detail concerns scheduling: LeakCanary 1.x waited for the main thread to become idle through MessageQueue.addIdleHandler, while 2.x runs the delayed check and the heap dump work on a background handler thread. In both cases the goal is the same: avoid competing with busy user flows as much as possible.


HPROF analysis: finding the real GC Root

HPROF is a heap snapshot format. LeakCanary parses it with Shark, and Android Studio Memory Profiler can also open it. Opening the file is not the hard part. The key is finding the shortest strong-reference path from a GC Root to the leaked object.

A GC Root is an object the collector treats as always reachable and uses as a starting point for tracing, so it is never reclaimed itself. Common roots include local variables on thread stacks, static fields, JNI global references, and live threads.
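To make "shortest path from a GC Root" concrete, here is a minimal breadth-first search over a toy reference graph. It is a sketch of the idea only, not how Shark is implemented, and the node names in sampleEdges are invented for illustration.

```kotlin
// Toy heap graph (hypothetical, not parsed from a real HPROF):
// each key strongly references the names in its list.
val sampleEdges: Map<String, List<String>> = mapOf(
    "static AppCache.instance" to listOf("AppCache"),
    "AppCache" to listOf("Listener"),
    "Listener" to listOf("MainActivity"),
    "Thread main" to listOf("Looper"),
    "Looper" to listOf("MessageQueue"),
    "MessageQueue" to listOf("Message"),
    "Message" to listOf("Runnable"),
    "Runnable" to listOf("MainActivity")
)
val sampleRoots: Set<String> = setOf("static AppCache.instance", "Thread main")

/** Returns the shortest root-to-target chain, or null if the target is unreachable. */
fun shortestPathFromRoots(
    edges: Map<String, List<String>>,
    roots: Set<String>,
    target: String
): List<String>? {
    // parent[n] is the node we reached n from; roots map to null.
    val parent = HashMap<String, String?>()
    val queue = ArrayDeque<String>()
    for (root in roots) {
        parent[root] = null
        queue.addLast(root)
    }
    while (queue.isNotEmpty()) {
        val node = queue.removeFirst()
        if (node == target) {
            // Walk the parent links back to the root, then reverse.
            val path = ArrayList<String>()
            var current: String? = node
            while (current != null) {
                path.add(current)
                current = parent[current]
            }
            return path.reversed()
        }
        for (next in edges[node].orEmpty()) {
            if (next !in parent) { // Visit each node once.
                parent[next] = node
                queue.addLast(next)
            }
        }
    }
    return null
}
```

Calling shortestPathFromRoots(sampleEdges, sampleRoots, "MainActivity") returns the four-node chain through the static field rather than the longer chain through the main thread’s message queue. That shorter chain is the one worth reading first.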

LeakCanary reports the reference chain directly. When analyzing manually in Android Studio, the workflow is:

  1. Open Memory Profiler and load the HPROF file.
  2. Filter by Class and find the suspected leaked class, such as MainActivity.
  3. Select an instance and use Jump to Source to open the class definition.
  4. Inspect the References panel and switch to “Show nearest GC root only”.

One real false lead I have seen: the end of the reference chain was FinalizerReference, which looked like a leak. In reality, the object was waiting for finalizer execution, and FinalizerThread would eventually clean it up. Those objects are not true leaks, so do not spend time fixing them.

Common leak patterns

Repeated HPROF reports tend to converge on a small set of leak paths.

Anonymous inner classes and capturing lambdas retaining the outer instance: the most common pattern. A Handler, Runnable, or Listener written as an anonymous class or lambda implicitly holds Activity.this when it touches the Activity’s members:

// Leaking version.
handler.postDelayed({
    updateUI() // Implicitly holds Activity.this.
}, 3000)

// Fix: wrap with a weak reference.
val weakActivity = WeakReference(this)
handler.postDelayed({
    weakActivity.get()?.updateUI()
}, 3000)

Static fields retaining Context: a singleton or utility class stores a Context that is not the application Context (context.applicationContext), preventing the Activity from being reclaimed.

ViewModel retaining View: a ViewModel outlives a Fragment’s View. If the ViewModel stores the Fragment’s View reference, the old View leaks after the Fragment view is recreated. This becomes easier to hit in heavily Jetpack-based code.

LiveData observer not removed: in contexts without a LifecycleOwner, such as a Service or custom View, observeForever calls need matching removeObserver calls at the right time.


From HPROF to production: three engineering defenses

LeakCanary is excellent during development, but it cannot run as-is in production. The reason is straightforward: a heap dump freezes the app with a stop-the-world (STW) pause, and files can be tens or hundreds of MB. Neither the user-experience cost nor the upload cost is acceptable.

Production memory monitoring needs a different model: lightweight metric collection -> threshold-triggered dump -> offline analysis. Each layer has a separate job.

Layer 1: continuous lightweight metrics

Do not take full heap dumps. Collect only key memory metrics:

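import android.app.ActivityManager
import android.content.Context
import android.os.Debug

// All sizes are in MB.
data class MemoryMetrics(
    val heapUsed: Long,
    val heapMax: Long,
    val nativeHeap: Long,
    val isLowMemory: Boolean
)

fun collectMemoryMetrics(context: Context): MemoryMetrics {
    val runtime = Runtime.getRuntime()
    val activityManager = context.getSystemService(ActivityManager::class.java)
    val memInfo = ActivityManager.MemoryInfo()
    activityManager.getMemoryInfo(memInfo)

    val bytesPerMb = 1024L * 1024L
    return MemoryMetrics(
        heapUsed = (runtime.totalMemory() - runtime.freeMemory()) / bytesPerMb,
        heapMax = runtime.maxMemory() / bytesPerMb,
        nativeHeap = Debug.getNativeHeapAllocatedSize() / bytesPerMb,
        isLowMemory = memInfo.lowMemory // System-wide low-memory flag.
    )
}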

Collect every 30 seconds or on key page onResume events, then report to monitoring. Heap usage that keeps growing and does not drop after GC is the clearest leak signal.
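One way to turn "keeps growing" into a number is a least-squares slope over recent samples. The sketch below assumes the samples come from the collector above; HeapSample, the 6-sample minimum, and the 1 MB/minute threshold are illustrative choices, not recommended values.

```kotlin
/** One heap reading; hypothetical shape for illustration. */
data class HeapSample(val uptimeSeconds: Long, val heapUsedMb: Double)

/** Least-squares slope of heap usage in MB per minute; 0.0 if it cannot be computed. */
fun heapGrowthMbPerMinute(samples: List<HeapSample>): Double {
    if (samples.size < 2) return 0.0
    val meanX = samples.map { it.uptimeSeconds / 60.0 }.average()
    val meanY = samples.map { it.heapUsedMb }.average()
    var numerator = 0.0
    var denominator = 0.0
    for (sample in samples) {
        val dx = sample.uptimeSeconds / 60.0 - meanX
        numerator += dx * (sample.heapUsedMb - meanY)
        denominator += dx * dx
    }
    // All samples at the same timestamp: no trend can be derived.
    return if (denominator == 0.0) 0.0 else numerator / denominator
}

/** Flags a sustained upward trend; thresholds are placeholders to tune per app. */
fun looksLikeLeak(
    samples: List<HeapSample>,
    minSamples: Int = 6,
    thresholdMbPerMinute: Double = 1.0
): Boolean =
    samples.size >= minSamples && heapGrowthMbPerMinute(samples) >= thresholdMbPerMinute
```

A slope is noisy over short windows, because normal allocation and collection produce a sawtooth. In practice the window should span several GC cycles, and the result is better treated as a hint for the next layers than as proof of a leak.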

Layer 2: production port of the weak-reference sentinel

LeakCanary’s sentinel mechanism can be reused cheaply in production if you only track retained counts and do not trigger heap dumps:

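import java.lang.ref.ReferenceQueue
import java.lang.ref.WeakReference
import java.util.UUID

object LeakSentinel {
    // Plain WeakReference with a unique key and a tag for reporting; no LeakCanary dependency.
    private class TaggedRef(obj: Any, val key: String, val tag: String, queue: ReferenceQueue<Any>) :
        WeakReference<Any>(obj, queue)

    private val watchedRefs = mutableMapOf<String, TaggedRef>()
    private val refQueue = ReferenceQueue<Any>()

    @Synchronized
    fun watch(obj: Any, tag: String) {
        drainQueue() // Remove references whose objects have already been collected.
        // A unique key keeps two instances with the same tag from overwriting each other.
        val key = UUID.randomUUID().toString()
        watchedRefs[key] = TaggedRef(obj, key, tag, refQueue)
    }

    @Synchronized
    fun retainedCount(forceGc: Boolean = true): Int {
        if (forceGc) {
            Runtime.getRuntime().gc() // Only a hint; enqueueing may lag slightly behind it.
        }
        drainQueue()
        return watchedRefs.size
    }

    private fun drainQueue() {
        var ref = refQueue.poll() as TaggedRef?
        while (ref != null) {
            watchedRefs.remove(ref.key)
            ref = refQueue.poll() as TaggedRef?
        }
    }
}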

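Watch every Activity in onDestroy, then periodically check retainedCount(), ideally a few seconds after the last onDestroy so that objects still being torn down are not counted. If the count exceeds the threshold, report an alert and do not dump anything. Runtime overhead is very low.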
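The periodic check itself can stay independent of how counting is done. A minimal sketch, assuming the count and the reporting channel are injected as lambdas; RetainedCountAlerter and its parameters are hypothetical names, not part of any library.

```kotlin
import java.util.concurrent.ScheduledExecutorService
import java.util.concurrent.ScheduledFuture
import java.util.concurrent.TimeUnit

class RetainedCountAlerter(
    private val threshold: Int,
    private val retainedCount: () -> Int, // For example, a sentinel's retained count.
    private val report: (Int) -> Unit     // For example, a monitoring SDK call.
) {
    /** Runs one check and returns true when an alert was reported. */
    fun checkOnce(): Boolean {
        val count = retainedCount()
        if (count > threshold) {
            report(count) // Alert only; no heap dump happens at this layer.
            return true
        }
        return false
    }

    /** Schedules checkOnce() on the given executor; cancel the returned future to stop. */
    fun start(scheduler: ScheduledExecutorService, periodSeconds: Long): ScheduledFuture<*> =
        scheduler.scheduleWithFixedDelay({ checkOnce() }, periodSeconds, periodSeconds, TimeUnit.SECONDS)
}
```

Passing the counter in as a lambda keeps the alerting policy testable without triggering real GC, and lets the forced-GC decision live in one place.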

Several engineering blogs from teams such as ByteDance and Meituan describe similar approaches, though implementations differ. One pitfall: calling System.gc() on low-end devices can trigger a full GC and visible jank. In practice, decide whether to force GC based on device class, or rely entirely on the runtime’s natural GC. The latter increases false positives, so you need to choose the trade-off deliberately.

Layer 3: conditionally triggered production HPROF

This is the heaviest layer and should trigger only when conditions are right: device idle, Wi-Fi connected, enough battery, and a rollout cohort that permits it.
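These conditions are easiest to audit when they live in a single predicate. The following is a sketch under assumptions: DeviceState is a hypothetical snapshot that the app would fill from its own idle, connectivity, battery, and rollout checks, and the 50 percent battery floor is a placeholder.

```kotlin
/** Hypothetical snapshot of the gating inputs; the app populates it from platform APIs. */
data class DeviceState(
    val isIdle: Boolean,
    val onWifi: Boolean,
    val batteryPercent: Int,
    val inRolloutCohort: Boolean
)

/** Returns true only when every production-dump condition holds. */
fun shouldDumpHprof(state: DeviceState, minBatteryPercent: Int = 50): Boolean =
    state.inRolloutCohort &&
        state.isIdle &&
        state.onWifi &&
        state.batteryPercent >= minBatteryPercent
```

Keeping the gate as a pure function makes it cheap to unit test and easy to tighten later, for example by lowering the cohort size without touching the dump code.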

Debug.dumpHprofData(filePath) can be called from a background thread, but the dump still suspends every thread in the process, so it causes an STW pause, often 1 to 3 seconds. A better option is LeakMonitor from matrix-android, which forks a child process to dump so the main process barely feels it:

Matrix HPROF analysis flow:

  1. Fork a child process.
  2. The child process calls Debug.dumpHprofData().
  3. Compress and upload the file to the server.
  4. The server uses Shark for offline analysis and generates a leak report.
InfoQ’s stability articles describe this approach in more detail. My take: the fork-based approach is reasonably compatible below Android 12, but can fail on Android 12+ because of sandbox restrictions. Prepare a fallback before launch; do not make this path the only exit.


Anti-regression loop: making leak governance sustainable

The hardest part of leak governance is not fixing a leak once. It is preventing the same class of problem from returning. CI/CD gates are the practical way to keep the result.

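Running LeakCanary in automated CI tests is the cheapest gate. Disable UI notifications through LeakCanary.config, then let tests assert that the retained object count is 0. LeakCanary also publishes a leakcanary-android-instrumentation artifact aimed at this use case, which is worth evaluating before maintaining a hand-rolled check like this one: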

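import androidx.test.core.app.ActivityScenario
import com.google.common.truth.Truth.assertThat // Assumes Truth as the assertion library.
import leakcanary.AppWatcher
import org.junit.Test

@Test
fun testActivityNotLeaked() {
    val scenario = ActivityScenario.launch(MainActivity::class.java)
    scenario.close() // Drives the Activity to DESTROYED, so onDestroy runs.

    // Sleep past LeakCanary's default 5-second retained delay, with some margin.
    Thread.sleep(6000)
    // Request a GC so collectable objects are cleared before counting; this is only a hint.
    Runtime.getRuntime().gc()
    Thread.sleep(100)

    assertThat(AppWatcher.objectWatcher.retainedObjectCount).isEqualTo(0)
}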

Combined with merge-request gates, a newly introduced leak fails the pipeline before it reaches the main branch.

In production, compare retained-count P95 by app version. If the new version’s P95 is clearly higher than the baseline, alert and pull matching HPROF samples for analysis. This signal is more sensitive than OOM rate. Leaks often show abnormal retained-count trends long before OOM starts to move.
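The version comparison can be sketched with a nearest-rank percentile. The inputs are assumed to be per-session retained counts already grouped by app version; the ratio and minimum-delta thresholds are placeholders for whatever "clearly higher" means for a given app.

```kotlin
/** Nearest-rank percentile for percent in 1..100; returns null for an empty list. */
fun percentile(values: List<Int>, percent: Int): Int? {
    require(percent in 1..100) { "percent must be in 1..100" }
    if (values.isEmpty()) return null
    val sorted = values.sorted()
    // Integer form of ceil(percent / 100 * n), giving a 1-based rank.
    val rank = (percent * sorted.size + 99) / 100
    return sorted[rank - 1]
}

/** True when the candidate version's P95 is clearly above the baseline's. */
fun isRetainedRegression(
    baseline: List<Int>,
    candidate: List<Int>,
    ratio: Double = 1.5, // Placeholder: candidate must be at least 1.5x the baseline.
    minDelta: Int = 2    // Placeholder: ignore differences smaller than 2 objects.
): Boolean {
    val basePercentile = percentile(baseline, 95) ?: return false
    val candidatePercentile = percentile(candidate, 95) ?: return false
    return candidatePercentile - basePercentile >= minDelta &&
        candidatePercentile >= basePercentile * ratio
}
```

The absolute minimum delta matters because a healthy baseline often has a P95 of 0 or 1, where a ratio alone would either never fire or fire on noise.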


Practical recommendations

Fix the shortest reference chains first. In HPROF, shorter chains are closer to GC Roots, easier to fix, and often higher-frequency.

Distinguish leaks from caches. Some retained objects are intentionally held by the business, such as image caches. Add known cases to LeakCanary’s reference matchers (LeakCanary.config.referenceMatchers, for example through AndroidReferenceMatchers.ignoredInstanceField); otherwise noise will bury real leaks.

Treat native memory leaks as a separate track. LeakCanary only covers the Java heap. If native heap keeps growing, use malloc_debug or Perfetto’s heap profiler separately. Mixing the two toolchains without first deciding whether the issue is Java heap or native heap can confuse the investigation.

Among the three layers, the production weak-reference sentinel has the best return on investment. It takes less than 100 lines of code, but it can detect newly introduced leaks soon after release. Any app with serious performance requirements should treat it as a baseline capability.

Further reading