WebNN: API Proposal To Reduce Memory Usage During Graph Building

Hey everyone! Today, we're diving deep into a proposal to optimize WebNN's memory usage, specifically during the graph building phase. If you're working with larger models, especially on systems with limited memory, this is something you'll definitely want to hear about. So, let's get started!

The Problem: High Memory Consumption During Graph Building

When running machine learning models on the web with WebNN, one issue that has surfaced is high memory consumption during the graph building process. Profiling the Stable Diffusion demo on a discrete GPU (dGPU) system showed that CPU memory usage during graph construction can exceed three times the actual model size. The same pattern, peak usage at a multiple of the model size, shows up on integrated GPU (iGPU) systems as well, though the exact multiplier varies.

While this high memory footprint may not matter much for smaller models, it becomes a critical issue for larger ones. To illustrate, a 4GB model might consume over 12GB of CPU memory during its load phase on a dGPU. That is especially problematic on systems with 16GB of total memory, where such demands can cause severe paging or even crashes. The situation is more acute still on unified memory architecture (UMA) systems, where the CPU and GPU/NPU share the same physical memory: every byte of avoidable peak usage during loading is a byte that cannot be used for model weights and intermediate buffers on the device. The goal, then, is to reduce the memory overhead of the graph building phase so that WebNN can handle larger, more complex models on a wider range of devices, improving performance and sparing users unexpected memory-related failures.

The Proposal: Decoupling Graph Creation from Weight Loading

To address this, the core of the proposal is an API that separates graph creation from weight passing, that is, from loading the constant data. The aim is to let weights stream directly into the graph during initialization, avoiding the multiple copies that currently accumulate in JavaScript and in the backend's staging memory before the graph is fully built. Ideally, once the data is handed to the graph after construction, each weight would be copied exactly once, straight from its source buffer to its final destination, such as device memory.
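To see where the copies come from today, here is a minimal sketch of the current pattern; the model file and layer shapes are made up for illustration:

```js
const context = await navigator.ml.createContext();
const builder = new MLGraphBuilder(context);

// Copy 1: the raw weight bytes land in a JavaScript ArrayBuffer.
const weightBytes = await (await fetch('model/fc1_weight.bin')).arrayBuffer();

// Copy 2: constant() requires the data up front, so the implementation
// copies it again; backends may add yet another staging copy on the way
// to device memory.
const wDesc = { dataType: 'float32', shape: [4096, 4096] };
const w = builder.constant(wDesc, new Float32Array(weightBytes));

const x = builder.input('x', { dataType: 'float32', shape: [1, 4096] });
const graph = await builder.build({ y: builder.matmul(x, w) });
```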

This approach raises a crucial question: can all current WebNN backends support a (mostly) "weightless" graph creation, where every tensor shape and data type is known up front but the actual weight data is not provided until a later step? Answering it is fundamental to the feasibility and adoption of this proposal.

Weightless graph creation could significantly reduce the initial memory footprint: instead of loading all the weights up front, the implementation only has to handle the graph structure itself, deferring the actual data until it is needed. That is especially valuable where memory is constrained, and it opens the door to more flexible memory management strategies, for example loading different parts of the model on demand. Ultimately, the success of this proposal hinges on whether WebNN backends can efficiently handle graph creation independently of weight loading.
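To make the idea concrete, here is a rough sketch of what a decoupled flow might look like. The descriptor-only constant() call is the proposal detailed in the next section, and graph.writeConstant() is purely an invented placeholder name, not part of any spec draft:

```js
const context = await navigator.ml.createContext();
const builder = new MLGraphBuilder(context);

// Phase 1: build a "weightless" graph. Every shape and data type is
// known, but no weight data has been copied anywhere yet.
const wDesc = { dataType: 'float32', shape: [4096, 4096] };
const w = builder.constant(wDesc);          // proposed: descriptor only
const x = builder.input('x', { dataType: 'float32', shape: [1, 4096] });
const graph = await builder.build({ y: builder.matmul(x, w) });

// Phase 2: hand the weights to the built graph, ideally copied exactly
// once from the fetch buffer to their final (e.g., device) location.
// writeConstant() is hypothetical; a real design might accept chunks
// for true streaming.
const weightBytes = await (await fetch('model/fc1_weight.bin')).arrayBuffer();
graph.writeConstant(w, new Uint8Array(weightBytes));
```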

Potential Implementation: A Step-by-Step Approach

Let's break down a potential implementation strategy for this proposal, step by step:

  1. Allow builder.constant() with Just MLOperandDescriptor:

    The first step is to modify the builder.constant() function. Today it requires both an MLOperandDescriptor (which defines the tensor's shape and data type) and an ArrayBufferView containing the actual numerical data. We propose allowing builder.constant() to be called with just the MLOperandDescriptor, so you can define a tensor's structure without loading any data into it at this stage. This creates a "hollow" constant, a placeholder for the actual data to be filled in later, and it drastically reduces the initial memory footprint because large weight buffers are no longer copied up front. Think of it as preparing a container without filling it right away; the two call forms are sketched below.
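    Concretely, the two call forms might look like this (the first is today's API; the second is the proposal, not currently valid WebNN):

    ```js
    const desc = { dataType: 'float32', shape: [768, 768] };

    // Today: descriptor plus an ArrayBufferView; the bytes are copied now.
    const wEager = builder.constant(desc, new Float32Array(768 * 768));

    // Proposed: descriptor only. Returns a placeholder ("hollow") MLOperand
    // whose bytes are supplied in a later step.
    const wHollow = builder.constant(desc);
    ```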

  2. Return an MLOperand Object for