Android Vulkan Migration: OpenGL ES, Command Buffers, and GPU Tuning | Kai

Last year, while optimizing performance for a video effects SDK, we ran into a difficult problem: under high-frame-rate rendering, the OpenGL ES pipeline used only about 60% of the GPU, while implicit driver-side state validation consumed a large amount of CPU time. After moving to Vulkan, the same shader logic delivered a 35% frame-rate improvement. This article summarizes the migration path and tuning strategy.

Why OpenGL ES stops scaling

The problem with OpenGL ES is not rendering capability. It is implicit driver behavior.

On every draw call (glDrawArrays, glDrawElements, and so on), the driver has to confirm that the bound program is linked and usable, validate that vertex attribute bindings match shader inputs, check texture format and sampler compatibility, and verify framebuffer completeness. These checks run synchronously at the call site, hidden from the developer.

If you have profiled this path, you will have seen that calls such as glUniform* look cheap on their own, while the draw call that follows takes far longer than expected. The driver defers state validation, and sometimes shader compilation or recompilation, until the draw call, so the cost surfaces there. All of this is hidden OpenGL ES overhead.

Single-threaded submission is another hard limit. A GL context can be current on only one thread at a time, so command submission is effectively serialized and the parallelism of multi-core CPUs goes unused. In scenarios that frequently update uniform data or rebuild vertex buffers, the bottleneck is often not the GPU but the CPU threading model.

Vulkan returns that control to the developer: state validation is completed during pipeline creation, command recording can run in parallel across threads, and synchronization points are explicitly marked. The cost is that code volume grows by 3-5x.

Command buffers and render passes

Vulkan’s core abstraction is the command buffer. Commands are not executed when they are called. They are recorded first and then submitted to a queue as a batch. This removes the call-site overhead found in OpenGL ES.

A typical recording flow looks like this:

// Begin recording
VkCommandBufferBeginInfo beginInfo = {};
beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
vkBeginCommandBuffer(cmd, &beginInfo);

// Begin a Render Pass, bind the framebuffer, and clear color
VkRenderPassBeginInfo rpInfo = {};
rpInfo.sType = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO;
rpInfo.renderPass = renderPass;
rpInfo.framebuffer = framebuffer;
rpInfo.renderArea.extent = {width, height};
VkClearValue clearColor = {{{0.0f, 0.0f, 0.0f, 1.0f}}};
rpInfo.clearValueCount = 1;
rpInfo.pClearValues = &clearColor;

vkCmdBeginRenderPass(cmd, &rpInfo, VK_SUBPASS_CONTENTS_INLINE);

// Bind pipeline and descriptor set, then issue the draw
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline);
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS,
    pipelineLayout, 0, 1, &descriptorSet, 0, nullptr);
vkCmdDraw(cmd, 3, 1, 0, 0);

vkCmdEndRenderPass(cmd);
vkEndCommandBuffer(cmd);

A render pass is not just syntactic sugar. It tells the driver how framebuffer attachments are loaded and stored, which lets the driver decide what has to move between on-chip tile memory and main memory on tile-based GPUs. Almost all mobile GPUs are tile-based, including ARM Mali and Qualcomm Adreno. Configuring a render pass’s loadOp and storeOp correctly directly affects bandwidth consumption:

VK_ATTACHMENT_LOAD_OP_CLEAR: avoids reading previous-frame data back from memory
VK_ATTACHMENT_STORE_OP_DONT_CARE: tells the driver not to write the attachment back to memory if it will not be used later
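
To put rough numbers on these two flags, here is a back-of-the-envelope model in C++. It is a sketch, not a measurement: the function name attachmentTrafficBytes is made up for illustration, and it assumes each load or store moves exactly width x height x bytesPerPixel bytes. Real drivers may apply framebuffer compression, which lowers the actual traffic.

```cpp
#include <cassert>
#include <cstdint>

// Idealized bytes moved between tile memory and main memory for one
// attachment in one render pass.
// loadFromMemory models VK_ATTACHMENT_LOAD_OP_LOAD (false for CLEAR or DONT_CARE).
// storeToMemory models VK_ATTACHMENT_STORE_OP_STORE (false for DONT_CARE).
std::uint64_t attachmentTrafficBytes(std::uint32_t width, std::uint32_t height,
                                     std::uint32_t bytesPerPixel,
                                     bool loadFromMemory, bool storeToMemory) {
    assert(bytesPerPixel > 0);
    const std::uint64_t size = std::uint64_t{width} * height * bytesPerPixel;
    return (loadFromMemory ? size : 0) + (storeToMemory ? size : 0);
}
```

Under this model, a 1920x1080 attachment at 4 bytes per pixel costs about 16.6 MB per pass when it is both loaded and stored, and nothing when it is cleared on chip and never written back, which is the usual setup for a depth attachment that is only needed within the pass.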

Multithreaded command-buffer construction

One of the pain points of OpenGL ES is single-threaded submission. Vulkan supports multithreaded recording through command pools and secondary command buffers.

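In real projects, each worker thread owns its own VkCommandPool, and command buffers are allocated and recorded thread-locally. Secondary command buffers cannot be submitted to a queue directly, so the main thread collects them, executes them from a primary command buffer with vkCmdExecuteCommands (inside a render pass begun with VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS), and submits that primary buffer with vkQueueSubmit.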

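// Per-thread command-buffer allocation. VkCommandPool is externally
// synchronized, so each worker thread allocates only from its own pool.
VkCommandBufferAllocateInfo allocInfo = {};
allocInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
allocInfo.commandPool = threadPool;  // One independent pool per thread
allocInfo.level = VK_COMMAND_BUFFER_LEVEL_SECONDARY;
allocInfo.commandBufferCount = 1;

// `device` is assumed to be the VkDevice that created threadPool
VkCommandBuffer secondaryCmd = VK_NULL_HANDLE;
VkResult result = vkAllocateCommandBuffers(device, &allocInfo, &secondaryCmd);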
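
How the draw list gets divided among worker threads is left open above. One simple option, sketched here as an assumption rather than the SDK's actual scheme, is to give each worker a contiguous range of draws so that each range maps to one secondary command buffer. DrawRange and partitionDraws are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A contiguous slice of the frame's draw list, recorded by one worker
// thread into its own secondary command buffer.
struct DrawRange {
    std::uint32_t first;
    std::uint32_t count;
};

// Split drawCount draws into workerCount contiguous ranges whose sizes
// differ by at most one. Ranges stay in submission order.
std::vector<DrawRange> partitionDraws(std::uint32_t drawCount,
                                      std::uint32_t workerCount) {
    assert(workerCount > 0);
    std::vector<DrawRange> ranges;
    ranges.reserve(workerCount);
    const std::uint32_t base = drawCount / workerCount;
    const std::uint32_t extra = drawCount % workerCount;  // first `extra` workers take one more draw
    std::uint32_t first = 0;
    for (std::uint32_t i = 0; i < workerCount; ++i) {
        const std::uint32_t count = base + (i < extra ? 1u : 0u);
        ranges.push_back({first, count});
        first += count;
    }
    return ranges;
}
```

Because the ranges are contiguous and ordered, the main thread can execute the resulting secondary command buffers in index order and preserve the original draw order.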

One pitfall: on Adreno GPUs, sharing a VkDescriptorPool across threads can cause lock contention inside the driver. The spec also requires descriptor pool access to be externally synchronized, so a shared pool needs an application-side lock in any case. In my measurements, with four threads recording in parallel, sharing a descriptor pool made frame time 18% longer than using one pool per thread. So I also split descriptor pools by thread. The extra memory cost is small, a few KB per pool, and it avoids the synchronization entirely.

Draw-call ordering also matters. Group by render pass, pipeline, and descriptor set to reduce state changes. Mali GPU documentation explicitly notes that frequent vkCmdBindPipeline calls can cause internal GPU state flushes. Sorting can reduce CPU time by 10-15%.
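
A common way to implement this grouping is to pack the three identifiers into one integer key and sort on it. The sketch below assumes small integer IDs that the application assigns itself. DrawItem, sortKey, sortDraws, and countPipelineBinds are illustrative names, and renderPassId is assumed to already reflect the order in which passes must run.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawItem {
    std::uint16_t renderPassId;     // assumed to encode required pass order
    std::uint16_t pipelineId;
    std::uint32_t descriptorSetId;
    std::uint32_t payload;          // hypothetical index into mesh data
};

// Most significant bits change least often: pass, then pipeline, then set.
std::uint64_t sortKey(const DrawItem& d) {
    return (std::uint64_t{d.renderPassId} << 48) |
           (std::uint64_t{d.pipelineId} << 32) |
           d.descriptorSetId;
}

// Stable sort keeps the original order among draws with identical state.
void sortDraws(std::vector<DrawItem>& draws) {
    auto byKey = [](const DrawItem& a, const DrawItem& b) {
        return sortKey(a) < sortKey(b);
    };
    std::stable_sort(draws.begin(), draws.end(), byKey);
    assert(std::is_sorted(draws.begin(), draws.end(), byKey));
}

// Number of pipeline binds needed if a bind is issued only when the
// pipeline differs from the previous draw's pipeline.
std::size_t countPipelineBinds(const std::vector<DrawItem>& draws) {
    std::size_t binds = 0;
    for (std::size_t i = 0; i < draws.size(); ++i) {
        if (i == 0 || draws[i].pipelineId != draws[i - 1].pipelineId) {
            ++binds;
        }
    }
    return binds;
}
```

With four draws that alternate between two pipelines, the unsorted list needs four binds and the sorted list needs two. Sorting only makes sense for draws whose relative order does not affect the result, for example opaque geometry. Blended draws usually need their own ordering.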

Key areas for GPU driver tuning

Migrating to Vulkan is only the starting point. Driver-level tuning is what releases the performance.

Descriptor set caching. Do not allocate descriptor sets every frame. Preallocate a pool of descriptor sets and reuse them by index. For uniform buffers, use VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC with dynamic offsets so that one descriptor set can be reused while only the offset changes. Each offset must be a multiple of the device's minUniformBufferOffsetAlignment:

uint32_t dynamicOffset = frameIndex * alignedUniformSize;
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS,
    pipelineLayout, 0, 1, &descriptorSet, 1, &dynamicOffset);
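
The snippet above relies on alignedUniformSize being a multiple of the device limit minUniformBufferOffsetAlignment from VkPhysicalDeviceLimits. A minimal sketch of that computation follows. The helper names alignUp and dynamicOffsetForFrame are illustrative, and the alignment is passed in as a plain integer instead of being queried from the device.

```cpp
#include <cassert>
#include <cstdint>

// Round size up to the next multiple of alignment.
// alignment must be a non-zero power of two, which the Vulkan spec
// guarantees for minUniformBufferOffsetAlignment.
std::uint32_t alignUp(std::uint32_t size, std::uint32_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

// Byte offset of frame frameIndex's uniform block inside one shared buffer.
std::uint32_t dynamicOffsetForFrame(std::uint32_t frameIndex,
                                    std::uint32_t uniformSize,
                                    std::uint32_t minAlignment) {
    return frameIndex * alignUp(uniformSize, minAlignment);
}
```

For a 200-byte uniform struct on a device that reports an alignment of 256, each frame's block occupies 256 bytes, so frame 2 starts at offset 512. The backing buffer then needs to hold at least the aligned size multiplied by the number of frames in flight.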

Pipeline cache. A VkPipelineCache lets you persist pipeline compilation results: read the blob with vkGetPipelineCacheData, write it to disk, and pass it back as initial data when creating the cache on the next launch. Shader compilation time can differ dramatically between first launch and later launches. In our video effects SDK, first-launch shader compilation took 420 ms; with a pipeline cache it dropped to 90 ms.

On Android, store the file under context.getCacheDir(), and handle version compatibility. After a driver update the cache may become invalid, so you need logic to detect that and rebuild it.
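
One way to detect a stale cache file is to inspect the header that the Vulkan spec defines for pipeline cache data: a 32-byte, little-endian block holding the header length, header version, vendorID, deviceID, and a 16-byte pipelineCacheUUID. The sketch below compares those fields against values the application would read from VkPhysicalDeviceProperties. The function name isCacheBlobCompatible is hypothetical, and this check is an optional pre-filter, since drivers validate the blob themselves.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kUuidSize = 16;    // matches VK_UUID_SIZE
constexpr std::size_t kHeaderSize = 32;  // version-one header size in bytes

// Read a 32-bit little-endian value, as used by the cache header.
static std::uint32_t readLE32(const std::uint8_t* p) {
    assert(p != nullptr);
    return std::uint32_t{p[0]} | (std::uint32_t{p[1]} << 8) |
           (std::uint32_t{p[2]} << 16) | (std::uint32_t{p[3]} << 24);
}

// Returns true if the blob's header matches the current device and driver.
bool isCacheBlobCompatible(const std::vector<std::uint8_t>& blob,
                           std::uint32_t vendorID, std::uint32_t deviceID,
                           const std::array<std::uint8_t, kUuidSize>& cacheUUID) {
    if (blob.size() < kHeaderSize) return false;
    const std::uint8_t* p = blob.data();
    const std::uint32_t headerLength = readLE32(p);
    if (headerLength < kHeaderSize || headerLength > blob.size()) return false;
    if (readLE32(p + 4) != 1) return false;           // VK_PIPELINE_CACHE_HEADER_VERSION_ONE
    if (readLE32(p + 8) != vendorID) return false;
    if (readLE32(p + 12) != deviceID) return false;
    return std::memcmp(p + 16, cacheUUID.data(), kUuidSize) == 0;
}
```

If the check fails, the simplest recovery is to delete the file and create the pipeline cache with no initial data, accepting one slow launch.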

VMA memory allocation. vkAllocateMemory has platform limits: maxMemoryAllocationCount is only required to be at least 4096, and some drivers report exactly that minimum. If you frequently allocate small chunks directly with the native API, you can exhaust this quota quickly. Integrate the Vulkan Memory Allocator (VMA) library. It performs suballocation internally and offers defragmentation, merging thousands of small allocations into a few large ones.
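
The principle behind suballocation can be shown with a deliberately simple bump allocator. This is not VMA's algorithm, which also handles frees, memory types, and more. LinearSuballocator is a hypothetical class that only illustrates how many small requests collapse into a few large blocks, each of which would cost a single vkAllocateMemory call.

```cpp
#include <cassert>
#include <cstdint>

struct Suballocation {
    std::uint32_t block;   // index of the large block this range lives in
    std::uint64_t offset;  // byte offset inside that block
};

class LinearSuballocator {
public:
    explicit LinearSuballocator(std::uint64_t blockSize) : blockSize_(blockSize) {
        assert(blockSize > 0);
    }

    // Reserve size bytes at the given power-of-two alignment.
    Suballocation allocate(std::uint64_t size, std::uint64_t alignment) {
        assert(size > 0 && size <= blockSize_);
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        std::uint64_t offset = (cursor_ + alignment - 1) & ~(alignment - 1);
        if (blockCount_ == 0 || offset + size > blockSize_) {
            ++blockCount_;  // a real allocator would call vkAllocateMemory here
            offset = 0;
        }
        cursor_ = offset + size;
        return {blockCount_ - 1, offset};
    }

    // Number of large blocks, i.e. device allocations, used so far.
    std::uint32_t blockCount() const { return blockCount_; }

private:
    std::uint64_t blockSize_;
    std::uint64_t cursor_ = 0;
    std::uint32_t blockCount_ = 0;
};
```

With 64 MiB blocks, 5000 requests of 4 KiB each fit in a single block, so the device allocation count stays at 1 instead of climbing past a 4096 limit.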

Android hardware buffer interop. If the render output needs to be encoded by MediaCodec or consumed by another process, use AHardwareBuffer for cross-API sharing. Create it with AHARDWAREBUFFER_USAGE_GPU_COLOR_OUTPUT, import it through the Vulkan extension VK_ANDROID_external_memory_android_hardware_buffer, and avoid GPU-to-CPU-to-GPU copies.

Migration path and tradeoffs

A full OpenGL ES to Vulkan migration does not happen in a day. A practical route looks like this.

Start with ANGLE, the Almost Native Graphics Layer Engine, as a transition layer. ANGLE translates OpenGL ES calls to Vulkan. Changing the link path is enough to start using it, without rewriting rendering code. Performance is below native Vulkan, but it exposes which GL calls actually trigger Vulkan-layer overhead, giving data for the later rewrite.

Then migrate by effect scenario. Move draw-call-heavy areas first, such as post-processing effects and particle systems, where the benefit is largest. UI rendering, including text and icons, can be migrated last or left on OpenGL ES, because draw-call count is low and the cost-benefit ratio is weaker.

My preference is to use Vulkan 1.1 as the primary rendering path on Android 10 and later, while keeping OpenGL ES as fallback. Device coverage data shows Vulkan 1.1 support above 85% on Android 10 and later. That is enough coverage for a real-time video effects SDK, and there is no need to sacrifice main-path performance for devices from several years ago.
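
The selection rule itself is small. The sketch below assumes the application has already obtained the Android API level (Android 10 is API level 29) and the apiVersion reported by the physical device, with 0 meaning no usable Vulkan device. Backend, makeApiVersion, and chooseBackend are illustrative names. The bit layout mirrors the standard Vulkan version encoding of major, minor, and patch.

```cpp
#include <cassert>
#include <cstdint>

enum class Backend { Vulkan, OpenGLES };

// Same packing as Vulkan's version macros: major << 22 | minor << 12 | patch.
std::uint32_t makeApiVersion(std::uint32_t major, std::uint32_t minor,
                             std::uint32_t patch) {
    assert(minor < 1024 && patch < 4096);
    return (major << 22) | (minor << 12) | patch;
}

// Vulkan is the primary path only on Android 10+ with Vulkan 1.1 or newer.
// Everything else falls back to OpenGL ES.
Backend chooseBackend(int androidApiLevel, std::uint32_t deviceApiVersion) {
    const int kAndroid10 = 29;
    const std::uint32_t kVulkan11 = makeApiVersion(1, 1, 0);
    if (androidApiLevel >= kAndroid10 && deviceApiVersion >= kVulkan11) {
        return Backend::Vulkan;
    }
    return Backend::OpenGLES;
}
```

A production check would likely also consult a denylist of known-bad drivers, which is outside the scope of this sketch.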

One final recommendation: run Vulkan validation layers in CI. Errors that are invisible during development, such as unbound descriptor sets, image layout mismatches, and missing pipeline barriers, are reported precisely by validation layers. Configure VK_LAYER_KHRONOS_validation with VK_DEBUG_UTILS_MESSAGE_SEVERITY_WARNING_BIT_EXT, and do not ignore warnings. Code that “works fine” on a development machine may show a black screen on Mali or PowerVR drivers.
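
To make "do not ignore warnings" enforceable in CI, the debug messenger callback can feed a small gate that turns any warning or error into a failing exit code. The sketch below is self-contained, so it redefines the severity bit values locally; they are intended to match the VK_DEBUG_UTILS_MESSAGE_SEVERITY_*_BIT_EXT constants. ValidationGate is a hypothetical helper, and in a real test binary onMessage would be called from the PFN_vkDebugUtilsMessengerCallbackEXT callback.

```cpp
#include <cassert>
#include <cstdint>

// Local copies of the debug-utils severity bits (verbose < info < warning < error).
constexpr std::uint32_t kSeverityVerbose = 0x0001;
constexpr std::uint32_t kSeverityInfo    = 0x0010;
constexpr std::uint32_t kSeverityWarning = 0x0100;
constexpr std::uint32_t kSeverityError   = 0x1000;

class ValidationGate {
public:
    // Record one validation message. Warnings and errors count as failures.
    void onMessage(std::uint32_t severity) {
        assert(severity != 0);
        if (severity >= kSeverityWarning) {
            ++failures_;
        }
    }

    std::uint32_t failures() const { return failures_; }

    // Process exit code for the CI job: non-zero if anything was reported.
    int exitCode() const { return failures_ == 0 ? 0 : 1; }

private:
    std::uint32_t failures_ = 0;
};
```

Returning a non-zero exit code from the test binary is usually enough for a CI system to mark the job as failed.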
