Android Compose Screenshot Testing: Paparazzi and Roborazzi

Last year, while building a Compose component library, the design team raised a requirement that made the engineering cost immediately obvious: before every release, we needed to run visual regression checks on more than 40 core components to make sure the UI had not degraded unexpectedly. The traditional approach was to connect a real device, run Espresso screenshots, and compare them manually. One full pass took at least two hours.

That was not sustainable. I started looking into device-free screenshot testing: rendering Compose UI directly on the JVM and generating screenshots without an emulator or physical device. Two mainstream options came up: Paparazzi and Roborazzi. Both can run in CI, but their rendering mechanisms are fundamentally different. If you choose the wrong one, the test result may not match the real device output.

This article summarizes my experience with both approaches, from rendering pitfalls to building a PR-stage gate that catches visual regressions automatically.

Paparazzi JVM rendering: the benefits and limits of LayoutLib

Paparazzi’s core dependency is LayoutLib from the Android SDK. It is the rendering engine used by Android Studio’s layout editor, and it can run View measure/layout/draw directly on the JVM without a full Android system.

During tests, Paparazzi boots LayoutLib, hosts the composable in a View tree, and draws the result into a BufferedImage. The API is straightforward:

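import androidx.compose.material3.Button
import androidx.compose.material3.Text
import app.cash.paparazzi.DeviceConfig
import app.cash.paparazzi.Paparazzi
import org.junit.Rule
import org.junit.Test

// Plain JUnit 4 test: Paparazzi needs no Android test runner.
class ButtonTest {
  @get:Rule
  val paparazzi = Paparazzi(
    deviceConfig = DeviceConfig.PIXEL_6,
    theme = "android:Theme.Material.Light.NoActionBar"
  )

  @Test
  fun `button should render correctly`() {
    paparazzi.snapshot {
      Button(onClick = {}) { Text("Submit") }
    }
    // recordPaparazziDebug stores goldens under src/test/snapshots;
    // verifyPaparazziDebug writes diff images to build/paparazzi/failures.
  }
}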

The biggest advantage is speed. It is pure JVM rendering, so there is no emulator startup. One test case usually finishes in 1 to 2 seconds. Screenshot tests for 40 components can finish in about a minute.

But in real use, I hit three issues:

First, LayoutLib is not the real Android system. It is a simplified implementation in which many system behaviors are mocked or left empty. For example, the anti-aliasing of Canvas.drawRoundRect in LayoutLib differs from real devices: rounded cards look harsher in Paparazzi screenshots, while the transition on a real device is smoother.

Second, Material Ripple effects are missing. LayoutLib does not include a complete RippleDrawable implementation, so screenshots of all clickable components miss the ripple. Paparazzi cannot verify pressed-state button screenshots.

Third, Compose compatibility can lag. LayoutLib is released with the Android SDK, while the Compose UI Toolkit changes faster. I once hit this with Compose 1.6: it used a new Modifier.Node implementation, but LayoutLib 8.3 parsed the old API path and threw NoSuchMethodError. The only choices were waiting for an SDK update or locking the Compose version in build.gradle, which conflicted with the project’s upgrade schedule.

A concrete example: when validating TopAppBar scrollBehavior, the text elevation shadow in Paparazzi was one shade darker than on a real device. The root cause was that LayoutLib used a fixed-direction light source for shadow calculation, while the real Material3 theme went through the SpotShadow algorithm. I eventually moved that component’s screenshot tests to Roborazzi.

Roborazzi: golden tests with the real rendering engine

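Roborazzi takes a different approach. It builds on Robolectric rather than LayoutLib: tests still run as JVM unit tests, and with Robolectric Native Graphics enabled, drawing goes through Android's native graphics code. captureRoboImage() then saves the rendered pixels of whatever it is called on, such as a Compose semantics node or a View. The setup in this article additionally provisions an emulator through Gradle Managed Devices.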

Dependency setup:

// build.gradle.kts
plugins {
  id("io.github.takahirom.roborazzi") version "1.24.0"
}

android {
  testOptions {
    managedDevices {
      localDevices {
        // Kotlin DSL form; the Groovy-style `name(Type) { }` call does not compile in .kts.
        create("pixel6Api34") {
          device = "Pixel 6"
          apiLevel = 34
        }
      }
    }
  }
}
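The plugin block alone does not pull in the libraries that the test below compiles against. The following is a minimal sketch of the test dependencies, assuming the Compose BOM is already applied to the module. The artifact versions are assumptions chosen to match the plugin version above and should be aligned with your project.

```kotlin
// build.gradle.kts (module level): a sketch, versions are assumptions.
android {
  testOptions {
    unitTests {
      // Robolectric needs merged Android resources to resolve themes and strings.
      isIncludeAndroidResources = true
    }
  }
}

dependencies {
  testImplementation("io.github.takahirom.roborazzi:roborazzi:1.24.0")
  testImplementation("io.github.takahirom.roborazzi:roborazzi-compose:1.24.0")
  testImplementation("org.robolectric:robolectric:4.13")
  testImplementation("androidx.test.ext:junit:1.2.1")          // AndroidJUnit4 runner
  testImplementation("androidx.compose.ui:ui-test-junit4")     // createComposeRule()
  debugImplementation("androidx.compose.ui:ui-test-manifest")  // host activity for the rule
  testImplementation("junit:junit:4.13.2")
}
```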

Test code:

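import androidx.compose.material3.Button
import androidx.compose.material3.MaterialTheme
import androidx.compose.material3.Text
import androidx.compose.ui.test.junit4.createComposeRule
import androidx.compose.ui.test.onRoot
import androidx.test.ext.junit.runners.AndroidJUnit4
import com.github.takahirom.roborazzi.captureRoboImage
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith
import org.robolectric.annotation.GraphicsMode

// Roborazzi captures run under Robolectric; NATIVE graphics mode produces real pixels.
@RunWith(AndroidJUnit4::class)
@GraphicsMode(GraphicsMode.Mode.NATIVE)
class ButtonVisualTest {
  @get:Rule
  val composeTestRule = createComposeRule()

  @Test
  fun `button matches golden screenshot`() {
    composeTestRule.setContent {
      MaterialTheme {
        Button(onClick = {}) { Text("Submit") }
      }
    }
    // Records or verifies the screenshot, depending on the Roborazzi Gradle task.
    composeTestRule
      .onRoot()
      .captureRoboImage()
  }
}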

captureRoboImage() is what enables pixel-level comparison (a golden test), and its behavior depends on the Gradle task that runs it. recordRoborazziDebug saves reference screenshots, which this project keeps under src/test/snapshots. verifyRoborazziDebug compares each new screenshot against its reference pixel by pixel. If the difference exceeds the change threshold configured through RoborazziOptions, the test fails. Roborazzi writes diff images under build/outputs/roborazzi and highlights mismatched areas in red.
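To make the threshold idea concrete, here is a minimal sketch of a pixel-by-pixel comparison in plain Kotlin. It is not Roborazzi's implementation; the function names changedPixelRatio and matchesGolden are made up for illustration, and the comparison simply counts pixels whose ARGB values differ.

```kotlin
import java.awt.image.BufferedImage

/** Fraction of pixels (0.0..1.0) whose ARGB values differ between two same-sized images. */
fun changedPixelRatio(expected: BufferedImage, actual: BufferedImage): Double {
    require(expected.width == actual.width && expected.height == actual.height) {
        "Images must have identical dimensions"
    }
    var changed = 0L
    for (y in 0 until expected.height) {
        for (x in 0 until expected.width) {
            if (expected.getRGB(x, y) != actual.getRGB(x, y)) changed++
        }
    }
    return changed.toDouble() / (expected.width.toLong() * expected.height)
}

/** Passes when the changed ratio does not exceed the threshold (0.0 means exact match only). */
fun matchesGolden(expected: BufferedImage, actual: BufferedImage, threshold: Double = 0.0): Boolean =
    changedPixelRatio(expected, actual) <= threshold

fun main() {
    val golden = BufferedImage(10, 10, BufferedImage.TYPE_INT_ARGB)
    val current = BufferedImage(10, 10, BufferedImage.TYPE_INT_ARGB)
    current.setRGB(0, 0, 0xFFFF0000.toInt()) // one red pixel out of 100

    println(changedPixelRatio(golden, current))               // 0.01
    println(matchesGolden(golden, current))                   // false: zero tolerance
    println(matchesGolden(golden, current, threshold = 0.01)) // true: 1% tolerated
}
```

A zero threshold is the strictest gate but also the most sensitive to rendering noise, so the tolerance is worth choosing deliberately per project.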

The cost is speed. In my setup, which boots an emulator through Gradle Managed Devices (GMD), a test takes about 15 to 20 seconds from AVD boot to screenshot completion, and a full run across 40 components takes 8 to 10 minutes. The time is worth it because the screenshot is much closer to what users actually see.

Why the same component produces different screenshots

To validate the behavioral differences between the two approaches, I captured the same Card component with Paparazzi and Roborazzi, then overlaid the results with a compare tool.

// Component under test.
@Composable
fun ProductCard(name: String, price: String) {
  Card(
    modifier = Modifier.padding(16.dp),
    elevation = CardDefaults.cardElevation(defaultElevation = 4.dp),
    shape = RoundedCornerShape(12.dp)
  ) {
    Column(modifier = Modifier.padding(16.dp)) {
      Text(name, style = MaterialTheme.typography.titleMedium)
      Text(price, style = MaterialTheme.typography.bodyLarge,
           color = MaterialTheme.colorScheme.primary)
    }
  }
}

The differences were concentrated in four areas:

Dimension	Paparazzi	Roborazzi
Shadow rendering	LayoutLib fixed light source, harder edges	Real-device GPU calculation, natural transition
Font rendering	Host JVM fonts on macOS/Linux, different fallback behavior	Android system fonts with the Noto fallback chain
Color space	sRGB, no wide-gamut mapping	Adapts to device color configuration such as P3
Ripple animation	Not supported	Fully supported

Font rendering differences are especially painful. Chinese characters had different line heights in LayoutLib than on real devices, which changed where multiline text was truncated. A screenshot that passes in Paparazzi may still be unacceptable on a device.

Building an automated regression gate

The selection strategy became clear: use Paparazzi for fast daily component screenshots, and use Roborazzi Golden Tests for core interaction paths and high-risk components. The two test suites run in different Gradle tasks:

# Paparazzi: fast baseline rendering checks, triggered on every PR.
./gradlew :library:verifyPaparazziDebug

# Roborazzi: Golden Tests, triggered when merging to the main branch.
./gradlew :library:recordRoborazziDebug  # First run creates the baseline.
./gradlew :library:verifyRoborazziDebug  # Later runs compare against it.

In GitHub Actions, I split them into separate stages:

# .github/workflows/visual-regression.yml
jobs:
  paparazzi-fast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Paparazzi screenshots
        run: ./gradlew :library:verifyPaparazziDebug

  roborazzi-golden:
    needs: paparazzi-fast
    runs-on: ubuntu-latest
    strategy:
      matrix:
        api-level: [34]
    steps:
      - uses: actions/checkout@v4
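      - uses: actions/setup-java@v4
        with:
          distribution: temurin  # setup-java requires an explicit distribution.
          java-version: "17"     # Assumed JDK version; match your AGP requirement.
      - name: Enable KVM  # GMD requires hardware acceleration.
        run: |
          echo 'KERNEL=="kvm", GROUP="kvm", MODE="0666"' | sudo tee /etc/udev/rules.d/99-kvm4all.rules
          sudo udevadm control --reload-rules
          sudo udevadm trigger --name-match=kvm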
      - name: Run Golden Tests
        run: ./gradlew :library:verifyRoborazziDebug

One pitfall is when to update the baseline. Intentional UI changes, such as a primary color adjustment, can make every Golden Test fail. In that case the baseline needs to be updated. I set one rule: baseline updates go through a separate PR, and they must be approved by the designer or UI owner before merge. GitHub CODEOWNERS automates the review requirement:

# .github/CODEOWNERS
**/src/test/snapshots/  @design-reviewers

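Any PR that changes the snapshot directory automatically requests a review from the design reviewers. To make that review mandatory, enable "Require review from Code Owners" in the branch protection rule.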

Debugging pixel differences

When Roborazzi reports a difference, it generates three images under build/outputs/roborazzi: _compare (diff image), _actual (actual screenshot), and _expected (baseline). Red regions show mismatched pixels.
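For CI logs it can help to list the diff images after a failed run. The sketch below assumes the _compare suffix and the build/outputs/roborazzi path mentioned above; findCompareImages is a hypothetical helper, not part of Roborazzi.

```kotlin
import java.io.File

/** Returns diff images (files ending in "_compare.png") under the given report directory. */
fun findCompareImages(reportDir: File): List<File> =
    reportDir.walkTopDown()
        .filter { it.isFile && it.name.endsWith("_compare.png") }
        .sortedBy { it.path }
        .toList()

fun main() {
    // Path taken from the article; adjust if the project overrides Roborazzi's output directory.
    val diffs = findCompareImages(File("build/outputs/roborazzi"))
    if (diffs.isEmpty()) {
        println("No visual differences reported")
    } else {
        diffs.forEach { println("Changed: ${it.name}") }
    }
}
```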

In practice, most false positives came from three sources:

Time-dependent components: clocks and countdown text differ on every run, so mock the time source.
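Screen density differences: the density configured for the GMD emulator may differ from what the CI host renders with. Use Robolectric's @Config(qualifiers = "w360dp-h640dp-xhdpi") to pin the screen size and density.
Animation frame capture: screenshots taken mid-animation are unstable. Use composeTestRule.waitForIdle() to make sure rendering has finished. For animations that never settle, set composeTestRule.mainClock.autoAdvance = false and advance the clock to a fixed time before capturing.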

A practical trick for time dependencies is to inject a fixed TimeProvider through CompositionLocalProvider:

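// LocalTimeProvider is assumed to be a CompositionLocal<TimeProvider> declared by the app.
CompositionLocalProvider(
  LocalTimeProvider provides TimeProvider.Fixed(LocalDateTime.of(2026, 6, 1, 10, 0))
) {
  CountdownText()  // hypothetical component under test
}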
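The article names TimeProvider.Fixed without showing the type, so here is one possible shape for it in plain Kotlin. Everything in this sketch is an assumption: the sealed interface, the SystemClock object, and the clockLabel consumer are illustrative names, not the project's actual code.

```kotlin
import java.time.LocalDateTime

/** Hypothetical time source that UI code reads instead of calling LocalDateTime.now(). */
sealed interface TimeProvider {
    fun now(): LocalDateTime

    /** Production implementation backed by the system clock. */
    object SystemClock : TimeProvider {
        override fun now(): LocalDateTime = LocalDateTime.now()
    }

    /** Test implementation that always returns the same instant. */
    data class Fixed(private val value: LocalDateTime) : TimeProvider {
        override fun now(): LocalDateTime = value
    }
}

/** Example consumer: formats the current time as an HH:mm label. */
fun clockLabel(time: TimeProvider): String {
    val now = time.now()
    return "%02d:%02d".format(now.hour, now.minute)
}

fun main() {
    val fixed = TimeProvider.Fixed(LocalDateTime.of(2026, 6, 1, 10, 0))
    println(clockLabel(fixed)) // 10:00
}
```

Because the fixed provider returns the same value on every call, any text derived from it renders identically across runs, which is what a golden comparison needs.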

How the two approaches work together

After one quarter of iteration, my division of responsibility is:

Paparazzi handles structural validation: whether components have broken layout, text overflow, or alignment issues. These account for more than 70% of UI defects. LayoutLib’s precision is good enough for them, and it is fast enough to run on every push without blocking CI.

Roborazzi handles pixel-level validation: primary colors, shadows, corner radii, font weight, and other details designers can spot by eye. It only runs when merging to the main branch. A 10-minute cost is acceptable there.

One more benefit is easy to overlook: screenshots become documentation. Every component’s Golden Image lives under src/test/snapshots, so designers can browse the PNG files directly to understand the current UI state of the component library without pulling code and building the app.

This system ran for three months and caught four cases where theme-variable changes caused global visual regressions. One case happened when someone adjusted the secondary color in ColorScheme, making every OutlinedTextField border lighter. Paparazzi did not catch it because LayoutLib simplified border rendering, but Roborazzi Golden Tests failed immediately during merge. That case convinced me that the two approaches are complementary, not interchangeable.
